One pipeline → organized notes, a worked tutorial, and a calibrated practice test — from lectures, books, photos of class notes, and study-group chat. Built on NVIDIA Nemotron Omni in a weekend.
The Problem with AI Study Tools
Most AI tools for students do one thing well and nothing else. ChatGPT can summarize a chapter but won’t reliably make a tutorial with worked examples. Quizlet can produce flashcards but can’t read a lecture video. Auto-generated practice tests are everywhere, but the questions are often ambiguous, hallucinated, or trivial. So students stitch together three or four tools, each unaware of the others, and the resulting study session is a paper-thin patchwork of disconnected outputs.
What students actually need are three things from the same set of source materials:
- Clean notes — organized by topic, with citations back to the textbook page or lecture timestamp so claims can be verified.
- A walk-through tutorial — concepts explained with worked examples, framed in the same language the textbook uses.
- A practice test — questions that are actually unambiguous, with answers the student can trust.
https://youtu.be/6RDvQknPw8A?embedable=true
What We’re Solving
The core abstraction
The pipeline takes one set of source materials and produces three outputs:
                                  ┌──► Notes (organized + cited)
(multimodal study corpus) ──► ────┼──► Tutorial (worked examples)
                                  └──► Practice test ──► self-eval filter ──► Calibrated test
For our purposes, a calibrated practice question has three properties:
- The model can answer it confidently. If the model itself can’t decide between two answers, neither can the student.
- The answer is grounded in the source. Every claim traces back to a textbook page, a lecture timestamp, or a notes line.
- The question is non-trivial. Asking “what color is chlorophyll” doesn’t help if the source says “chlorophyll is green” twice. We want questions that require synthesis.
Property 1 is what makes this calibration. Property 2 is what makes it trustworthy. Property 3 is what makes it useful. We’ll get all three from the same self-evaluation pass — the marquee technical idea of this build.
The high-level architecture
┌──────────────────────────────────────────────────────────┐
│                        INGESTION                         │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────────┐    │
│ │ Textbook │ │ Lecture  │ │ Notes    │ │ Study chat │    │
│ │ PDF      │ │ Video    │ │ photos   │ │ (text)     │    │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬───────┘    │
│      └───────────┬┴────────────┴────────────┘            │
│                  ▼                                       │
│         ┌─────────────────┐                              │
│         │ Unified corpus  │                              │
│         └─────────────────┘                              │
└──────────────────────────────────────────────────────────┘
                   │
       ┌───────────┼─────────────┐
       ▼           ▼             ▼
┌───────────┐ ┌──────────┐ ┌──────────────────────┐
│   NOTES   │ │ TUTORIAL │ │    PRACTICE TEST     │
│           │ │          │ │ (over-generate N=20) │
│ direct    │ │ direct   │ │          │           │
│ generate  │ │ generate │ │          ▼           │
│ + cite    │ │ + worked │ │  ┌────────────────┐  │
│           │ │ examples │ │  │ SELF-EVAL      │  │
│           │ │          │ │  │ FILTER         │  │
│           │ │          │ │  │ • can model    │  │
│           │ │          │ │  │   answer it?   │  │
│           │ │          │ │  │ • grounded?    │  │
│           │ │          │ │  │ • non-trivial? │  │
│           │ │          │ │  └───────┬────────┘  │
│           │ │          │ │          ▼           │
│           │ │          │ │   Calibrated test    │
└───────────┘ └──────────┘ └──────────────────────┘
Three things to notice:
- Multimodal at the input boundary, structured everywhere else. PDFs, video, images, and text all collapse into a single corpus that downstream code treats uniformly.
- Three outputs, one corpus, one model. Notes and tutorial are short prompts away. The test gets the heavy treatment because that’s where students get burned by bad AI output.
- The same model does generation and validation. Nemotron Omni generates questions in pass A, then evaluates them in pass B. No second model, no fine-tuning. Just careful prompting.
The stack
- NVIDIA Nemotron 3 Nano Omni — multimodal model, reads video + audio + PDFs + images natively
- vLLM — local inference server, OpenAI-compatible API
- DGX Spark — single-machine deployment for the whole pipeline; another GPU machine or a cloud instance can stand in
- Python — orchestration glue (~150 lines)
Implementation
Step 1: Build the unified corpus
Multimodal inputs sound complicated until you realize Nemotron Omni reads them all natively. Our corpus is a list of typed chunks; the model handles the rest.
import base64
import requests
from pdf2image import convert_from_path
from io import BytesIO
from typing import List, Dict
from pathlib import Path

VLLM_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "nemotron-omni"


def file_to_b64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def build_corpus(
    textbook_pdf: str,
    lecture_video: str,
    notes_photos: List[str],
    study_group_text: str,
) -> List[Dict]:
    """Build a typed list of corpus chunks Nemotron can ingest."""
    corpus = []

    # Textbook pages — each page is one image chunk with a source label.
    pages = convert_from_path(textbook_pdf, dpi=150)
    for i, page in enumerate(pages):
        buf = BytesIO()
        page.save(buf, format="PNG")
        b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
        corpus.append({
            "type": "image",
            "source": f"{Path(textbook_pdf).name}#page={i+1}",
            "data": b64,
        })

    # Lecture video — single chunk. Nemotron's hard cap is 2 minutes;
    # for longer lectures, chunk by scene or fixed window upstream.
    corpus.append({
        "type": "video",
        "source": Path(lecture_video).name,
        "data": file_to_b64(lecture_video),
    })

    # Notes photos — each photo is one image chunk.
    for photo_path in notes_photos:
        corpus.append({
            "type": "image",
            "source": Path(photo_path).name,
            "data": file_to_b64(photo_path),
        })

    # Study group chat — text chunk. Cleaned offline (names redacted,
    # off-topic banter removed). What's left is professor-emphasis
    # signal: "she said the Calvin cycle is heavy this semester."
    corpus.append({
        "type": "text",
        "source": "study_group_chat.txt",
        "data": study_group_text,
    })
    return corpus
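The comment in `build_corpus` leaves chunking of long lectures to an upstream step. A minimal sketch of the fixed-window option follows, assuming a 120-second window to match the 2-minute cap mentioned in that comment. `fixed_windows` is a hypothetical helper, not part of the original build, and it only computes boundaries; cutting the video file itself is out of scope here.

```python
from typing import List, Tuple


def fixed_windows(duration_s: float, window_s: float = 120.0) -> List[Tuple[float, float]]:
    """Split [0, duration_s) into consecutive (start, end) windows.

    Hypothetical helper: window_s defaults to 120 s to match the
    2-minute cap noted in build_corpus. The last window may be shorter.
    """
    if duration_s <= 0 or window_s <= 0:
        return []
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        windows.append((start, end))
        start = end
    return windows


# A 5-minute lecture becomes two full windows and one 60-second remainder.
print(fixed_windows(300.0))
```

Each window would then become its own video chunk in the corpus, with the time range recorded in its `source` label so citations can still point at a timestamp.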
Step 2: Build the corpus payload
vLLM’s OpenAI-compatible API takes a list of content blocks per message. We turn our typed corpus into that shape:
def corpus_to_content(corpus: List[Dict]) -> List[Dict]:
    """Convert the corpus into OpenAI-style content blocks."""
    content = []
    for chunk in corpus:
        if chunk["type"] == "image":
            # Assumes PNG data; photos saved as JPEG would need image/jpeg here.
            content.append({
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{chunk['data']}"
                },
            })
        elif chunk["type"] == "video":
            content.append({
                "type": "video_url",
                "video_url": {
                    "url": f"data:video/mp4;base64,{chunk['data']}"
                },
            })
        elif chunk["type"] == "text":
            # Prefix the text with its source label on its own line.
            content.append({
                "type": "text",
                "text": f"[Source: {chunk['source']}]\n{chunk['data']}",
            })
    return content
Step 3: Generate a raw question pool
We ask Nemotron to produce N questions in structured JSON. The system prompt is the lever — it controls quality more than any other knob.
def generate_question_pool(
    corpus: List[Dict],
    topic: str,
    n_questions: int = 20,
) -> List[Dict]:
    """Generate a raw pool of N questions. Many will be dropped later."""
    system_prompt = f"""You are a careful tutor preparing practice questions
for a student studying {topic}. Generate exactly {n_questions} multiple-choice
questions based ONLY on the provided study materials.
Rules:
- Each question must have exactly 4 options labeled A-D
- Exactly one option must be unambiguously correct given the materials
- Cite the source for the correct answer (filename or page number)
- Vary difficulty: ~30% recall, ~50% application, ~20% synthesis
- Do not invent facts not present in the materials
- Output strict JSON: a list of objects with keys
  question, options, correct, source, rationale, difficulty"""
    user_content = corpus_to_content(corpus) + [{
        "type": "text",
        "text": f"Generate the {n_questions} questions now. Output JSON only.",
    }]
    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
        "temperature": 0.7,  # variety in the raw pool
        "max_tokens": 4096,
        "response_format": {"type": "json_object"},
    }
    r = requests.post(VLLM_URL, json=payload, timeout=600)
    r.raise_for_status()
    raw = r.json()["choices"][0]["message"]["content"]
    import json
    parsed = json.loads(raw)
    # The reply may be a bare list or an object wrapping the list under
    # "questions"; calling .get() on a list would raise, so branch on type.
    if isinstance(parsed, list):
        questions = parsed
    else:
        questions = parsed.get("questions", [])
    return questions
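The prompt asks for a JSON list while `response_format` asks for a JSON object, so the reply can plausibly arrive in either shape, and individual questions may still break the rules in the prompt. The sketch below shows one defensive way to parse and validate the pool, assuming the keys named in the system prompt and a letter A-D in `correct` (which is what Step 4 compares against). `parse_question_pool` and `is_well_formed` are hypothetical names, not part of the original code.

```python
import json
from typing import Any, Dict, List

REQUIRED_KEYS = {"question", "options", "correct", "source"}


def is_well_formed(q: Any) -> bool:
    """Check one question against the rules stated in the system prompt."""
    if not isinstance(q, dict) or not REQUIRED_KEYS.issubset(q):
        return False
    options = q["options"]
    if not isinstance(options, list) or len(options) != 4:
        return False
    # Assumption: "correct" holds the option letter, not the option text.
    return str(q["correct"]).strip().upper() in {"A", "B", "C", "D"}


def parse_question_pool(raw: str) -> List[Dict]:
    """Accept a bare list or an object wrapping one under 'questions'."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if isinstance(parsed, dict):
        parsed = parsed.get("questions", [])
    if not isinstance(parsed, list):
        return []
    # Silently discard malformed entries; the pool is over-generated anyway.
    return [q for q in parsed if is_well_formed(q)]
```

Dropping malformed questions at this stage costs nothing, since the pool is over-generated, and it keeps the self-evaluation pass from failing on a missing key or a three-option question.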
Step 4: The self-evaluation pass — the calibration step
This is the heart of the system. For each generated question, we ask the model to answer it from scratch using only the corpus, then compare the model’s answer to the expected answer.
def self_evaluate_question(
    corpus: List[Dict],
    question: Dict,
) -> Dict:
    """Run the model against its own question. Score the question on
    three axes: clarity, groundedness, non-triviality."""
    system_prompt = """You are a careful student answering a multiple-choice
question using ONLY the provided study materials. Reply in strict JSON with
keys: chosen_option (A/B/C/D), confidence (0.0-1.0), supporting_evidence
(1-2 sentences from the materials), reasoning (1 sentence)."""
    user_content = corpus_to_content(corpus) + [{
        "type": "text",
        "text": f"""Question: {question['question']}
Options:
A) {question['options'][0]}
B) {question['options'][1]}
C) {question['options'][2]}
D) {question['options'][3]}
Answer using ONLY the materials. Output JSON only."""
    }]
    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
        "temperature": 0.0,  # deterministic when answering
        "max_tokens": 512,
        "response_format": {"type": "json_object"},
    }
    r = requests.post(VLLM_URL, json=payload, timeout=600)
    r.raise_for_status()
    import json
    eval_result = json.loads(r.json()["choices"][0]["message"]["content"])
    # The stated answer is never shown to the model; it is only compared here.
    is_correct = eval_result["chosen_option"] == question["correct"]
    confidence = float(eval_result["confidence"])
    # Crude groundedness proxy: the quoted evidence must be longer than 20 characters.
    has_evidence = len(eval_result.get("supporting_evidence", "")) > 20
    return {
        "is_correct": is_correct,
        "confidence": confidence,
        "has_evidence": has_evidence,
        "model_chose": eval_result["chosen_option"],
        "evidence": eval_result.get("supporting_evidence", ""),
        "reasoning": eval_result.get("reasoning", ""),
    }
Step 5: Filter
def filter_calibrated(
    corpus: List[Dict],
    questions: List[Dict],
    target_count: int = 10,
    confidence_floor: float = 0.7,
) -> List[Dict]:
    """Run self-eval on each question. Keep questions that:
    - Get the right answer
    - With confidence >= confidence_floor
    - Have non-trivial supporting evidence"""
    keepers = []
    for q in questions:
        result = self_evaluate_question(corpus, q)
        passed = (
            result["is_correct"]
            and result["confidence"] >= confidence_floor
            and result["has_evidence"]
        )
        q["_eval"] = result
        q["_passed"] = passed
        if passed:
            keepers.append(q)
    keepers.sort(key=lambda q: q["_eval"]["confidence"], reverse=True)
    return keepers[:target_count]
That’s the full filter. Three lines of decision logic:
- is_correct — model agrees with the question’s stated answer
- confidence >= 0.7 — model isn’t waffling
- has_evidence — model could cite something specific from the corpus
If all three pass, the question is a keeper. If not, drop it. We sort by confidence descending so the best questions land at the top of the test.
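One fragile spot is that `is_correct` relies on an exact string match, so a reply of "b" or "B)" would count as a wrong answer, and a confidence outside 0.0-1.0 would pass through unchecked. The sketch below normalises both fields before scoring, on the assumption that the model occasionally drifts from the requested format. `normalize_option` and `clamp_confidence` are hypothetical helpers, not part of the original code.

```python
import re
from typing import Any, Optional


def normalize_option(value: Any) -> Optional[str]:
    """Map replies such as 'b', 'B)', or ' (c) ' to 'B' / 'C'; None if unparseable."""
    match = re.fullmatch(r"\s*\(?([A-Da-d])\)?[.:]?\s*", str(value))
    return match.group(1).upper() if match else None


def clamp_confidence(value: Any) -> float:
    """Coerce to float and clamp into [0.0, 1.0]; unparseable values become 0.0."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return 0.0
    if number != number:  # NaN never equals itself
        return 0.0
    return max(0.0, min(1.0, number))
```

Treating an unparseable option as `None` and an unparseable confidence as 0.0 means format drift leads to a dropped question rather than a crash or a false keeper, which is the safer failure mode for a filter.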
Step 6: Assemble the test
def build_test(
    textbook_pdf: str,
    lecture_video: str,
    notes_photos: List[str],
    study_group_text: str,
    topic: str,
    target_count: int = 10,
) -> Dict:
    corpus = build_corpus(textbook_pdf, lecture_video, notes_photos, study_group_text)
    pool = generate_question_pool(corpus, topic, n_questions=target_count * 2)
    keepers = filter_calibrated(corpus, pool, target_count=target_count)
    return {
        "topic": topic,
        "questions_generated": len(pool),
        "questions_kept": len(keepers),
        "questions_dropped": len(pool) - len(keepers),
        "test": [
            {
                "id": i + 1,
                "question": q["question"],
                "options": q["options"],
                "difficulty": q.get("difficulty", "?"),
            }
            for i, q in enumerate(keepers)
        ],
        "answer_key": [
            {
                "id": i + 1,
                "correct": q["correct"],
                "rationale": q.get("rationale", ""),
                "source": q.get("source", ""),
                "model_confidence": round(q["_eval"]["confidence"], 2),
            }
            for i, q in enumerate(keepers)
        ],
    }
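The returned dict keeps questions and answers in separate lists, so the answer key can be withheld from the student. The sketch below renders the student-facing sheet as plain text, assuming `options` is a list of four strings as in Step 4. `render_test` is a hypothetical helper, not part of the original build.

```python
from typing import Dict, List


def render_test(result: Dict) -> str:
    """Format the 'test' part of build_test's output; the answer key is left out."""
    lines: List[str] = [f"Practice test: {result['topic']}", ""]
    for item in result["test"]:
        lines.append(f"{item['id']}. {item['question']}")
        # Pair each option with its letter label in order.
        for label, option in zip("ABCD", item["options"]):
            lines.append(f"   {label}) {option}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

The same dict can drive a UI directly; the text rendering is only the simplest consumer.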
Step 7: Notes and tutorial — same corpus, different prompts
Notes and the tutorial reuse the corpus payload from Step 2 directly. No retrieval changes, no extra infrastructure. Just two more chat completions with prompts shaped for each output type:
def generate_notes(corpus: List[Dict], topic: str) -> str:
    """
    Organized study notes with citations to source material.
    Output is markdown, ready to render in a tab.
    """
    return chat_complete(
        system=(
            "You are a meticulous study-notes writer. Produce concise, "
            "topic-grouped notes in markdown. EVERY claim must end with "
            "a bracketed citation like [Textbook §9.3] or [Lecture 4 · 12:40]. "
            "If a claim has no source in the corpus, do not include it."
        ),
        user=corpus_to_content(corpus) + [
            {"type": "text", "text":
                f"Write study notes on: {topic}. "
                f"Group by sub-topic. Use bullet points. Include citations."}
        ],
        temperature=0.2,
    )


def generate_tutorial(corpus: List[Dict], topic: str) -> str:
    """
    Walk-through tutorial with worked examples, framed in the
    textbook's vocabulary. Different shape than notes — narrative, not bullets.
    """
    return chat_complete(
        system=(
            "You are a patient tutor explaining a concept to a student "
            "who has the source material in front of them. Walk through "
            "ONE worked example end-to-end, then explain the underlying "
            "principle, then give two short practice variations. Match "
            "the vocabulary of the source. Cite as you go."
        ),
        user=corpus_to_content(corpus) + [
            {"type": "text", "text":
                f"Tutorial topic: {topic}. "
                f"Pick a representative worked example from the corpus."}
        ],
        temperature=0.4,  # slightly higher than notes — narrative needs some flow
    )
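`generate_notes` and `generate_tutorial` call `chat_complete`, which the article does not define. What follows is a minimal sketch of what it might look like, assuming the same endpoint and model name as Step 1 and a `max_tokens` default chosen here purely for illustration. It uses the standard library's `urllib` so the sketch runs without dependencies; the rest of the article uses `requests`, which would work equally well. `build_chat_payload` is a hypothetical helper split out so the request body can be inspected without a running server.

```python
import json
import urllib.request
from typing import Dict, List

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # same value as Step 1
MODEL = "nemotron-omni"  # same value as Step 1


def build_chat_payload(system: str, user: List[Dict], temperature: float,
                       max_tokens: int = 4096) -> Dict:
    """Assemble the request body. The max_tokens default is an assumption."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def chat_complete(system: str, user: List[Dict], temperature: float = 0.2) -> str:
    """POST one chat completion and return the assistant's text."""
    body = json.dumps(build_chat_payload(system, user, temperature)).encode("utf-8")
    request = urllib.request.Request(
        VLLM_URL, data=body, headers={"Content-Type": "application/json"}
    )
    # urlopen raises on HTTP error statuses, similar to raise_for_status().
    with urllib.request.urlopen(request, timeout=600) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

No `response_format` is set here because notes and tutorial are returned as markdown text rather than JSON.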
What it looks like running
For a sample run on a college calculus chapter on integration by substitution — 1 textbook chapter (24 pages), 1 recorded lecture (chunked to a 2-minute summary clip), 6 phone photos of handwritten notes, and ~80 lines of cleaned study group chat:
Generating 20 questions... [done in 38s]
Self-evaluating each... [done in 1m 12s]
Generated: 20
Kept: 12
Dropped: 8
Drop reasons:
- Model picked wrong option: 3
- Confidence below 0.7: 4
- Evidence too thin: 1
Top keeper (confidence 0.97):
Q: When applying u-substitution to ∫ 2x · cos(x²) dx,
what is the most direct choice of u?
Source: lecture_video.mp4 @ 12:43
Top dropped (confidence 0.42, model chose C, expected B):
Q: Which of the following describes when u-substitution
"always" works?
Reason: Question is poorly framed — there's no "always"
case in the source material. Model couldn't commit.
This output is exactly what you’d push to a tutoring app, a flashcard system, or a printout for the student. It’s structured enough to drive a UI and calibrated enough to trust. The student studies 12 questions on which the model, answering independently from the corpus, matched the stated answer with a self-reported confidence of at least 0.7, instead of 20 questions of which 8 are wrong, ambiguous, or thinly supported.
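The run summary lists drop reasons, but `filter_calibrated` as written only records pass or fail. The sketch below derives that tally from the result dicts returned by `self_evaluate_question`. `drop_reason` and `tally_drops` are hypothetical helpers, the reason strings are illustrative, and the mock values are made up for the example rather than taken from a real run.

```python
from collections import Counter
from typing import Dict, List, Optional


def drop_reason(result: Dict, confidence_floor: float = 0.7) -> Optional[str]:
    """Return None for a keeper, otherwise the first failed check.

    Checks run in the same order as the filter: correctness,
    confidence, then evidence.
    """
    if not result["is_correct"]:
        return "model picked wrong option"
    if result["confidence"] < confidence_floor:
        return "confidence below floor"
    if not result["has_evidence"]:
        return "evidence too thin"
    return None


def tally_drops(results: List[Dict], confidence_floor: float = 0.7) -> Counter:
    """Count drop reasons across a pool of self-eval results."""
    reasons = (drop_reason(r, confidence_floor) for r in results)
    return Counter(r for r in reasons if r is not None)


# Mock self-eval results (made-up values, no model call involved).
mock = [
    {"is_correct": True, "confidence": 0.97, "has_evidence": True},
    {"is_correct": False, "confidence": 0.42, "has_evidence": True},
    {"is_correct": True, "confidence": 0.55, "has_evidence": True},
    {"is_correct": True, "confidence": 0.90, "has_evidence": False},
]
print(tally_drops(mock))
```

Because each question is attributed to the first check it fails, a question that is both wrong and low-confidence is counted once, under the wrong-option reason.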