How I Built an AI Study Buddy That Generates Notes, Tutorials, and Self-Validated Tests

One pipeline → organized notes, a worked tutorial, and a calibrated practice test — from lectures, books, photos of class notes, and study-group chat. Built on NVIDIA Nemotron Omni in a weekend.

The Problem with AI Study Tools

Most AI tools for students do one thing well and nothing else. ChatGPT can summarize a chapter but won’t reliably make a tutorial with worked examples. Quizlet can produce flashcards but can’t read a lecture video. Auto-generated practice tests are everywhere, but the questions are often ambiguous, hallucinated, or trivial. So students stitch together three or four tools, each unaware of the others, and the resulting study session is a paper-thin patchwork of disconnected outputs.

What students actually need is three things, all built from the same set of source materials:

Clean notes — organized by topic, with citations back to the textbook page or lecture timestamp so claims can be verified.
A walk-through tutorial — concepts explained with worked examples, framed in the same language the textbook uses.
A practice test — questions that are actually unambiguous, with answers the student can trust.

https://youtu.be/6RDvQknPw8A?embedable=true

What We’re Solving

The core abstraction

The pipeline takes one set of source materials and produces three outputs:

                                  ┌──► Notes (organized + cited)
(multimodal study corpus) ──► ────┼──► Tutorial (worked examples)
                                  └──► Practice test ──► self-eval filter ──► Calibrated test

For our purposes, a calibrated question has three properties:

The model can answer it confidently. If the model itself can’t decide between two answers, neither can the student.
The answer is grounded in the source. Every claim traces back to a textbook page, a lecture timestamp, or a notes line.
The question is non-trivial. Asking “what color is chlorophyll” doesn’t help if the source says “chlorophyll is green” twice. We want questions that require synthesis.

Property 1 is what makes this calibration. Property 2 is what makes it trustworthy. Property 3 is what makes it useful. We’ll get all three from the same self-evaluation pass — the marquee technical idea of this build.

The high-level architecture

┌──────────────────────────────────────────────────────────┐
│                   INGESTION                              │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────────┐   │
│  │ Textbook │ │ Lecture  │ │ Notes    │ │ Study chat │   │
│  │   PDF    │ │  Video   │ │ photos   │ │  (text)    │   │
│  └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬───────┘   │
│       └───────────┬┴────────────┴────────────┘           │
│                   ▼                                      │
│          ┌─────────────────┐                             │
│          │ Unified corpus  │                             │
│          └─────────────────┘                             │
└──────────────────────────────────────────────────────────┘
                   │
       ┌───────────┼─────────────┐
       ▼           ▼             ▼
┌───────────┐  ┌──────────┐  ┌──────────────────────┐
│   NOTES   │  │ TUTORIAL │  │  PRACTICE TEST       │
│           │  │          │  │  (over-generate N=20)│
│ direct    │  │ direct   │  │          │           │
│ generate  │  │ generate │  │          ▼           │
│ + cite    │  │ + worked │  │  ┌────────────────┐  │
│           │  │ examples │  │  │ SELF-EVAL      │  │
│           │  │          │  │  │ FILTER         │  │
│           │  │          │  │  │ • can model    │  │
│           │  │          │  │  │   answer it?   │  │
│           │  │          │  │  │ • grounded?    │  │
│           │  │          │  │  │ • non-trivial? │  │
│           │  │          │  │  └───────┬────────┘  │
│           │  │          │  │          ▼           │
│           │  │          │  │   Calibrated test    │
└───────────┘  └──────────┘  └──────────────────────┘

Three things to notice:

Multimodal at the input boundary, structured everywhere else. PDFs, video, images, and text all collapse into a single corpus that downstream code treats uniformly.
Three outputs, one corpus, one model. The notes and the tutorial are each a short prompt away. The test gets the heavy treatment because that’s where students get burned by bad AI output.
The same model does generation and validation. Nemotron Omni generates questions in pass A, then evaluates them in pass B. No second model, no fine-tuning. Just careful prompting.

The stack

NVIDIA Nemotron 3 Nano Omni — multimodal model, reads video + audio + PDFs + images natively
vLLM — local inference server, OpenAI-compatible API
DGX Spark — single-machine deployment for the whole pipeline; another local GPU machine or a cloud instance should also work
Python — orchestration glue (~150 lines)

Implementation

Step 1: Build the unified corpus

Multimodal inputs sound complicated until you realize Nemotron Omni reads them all natively. Our corpus is a list of typed chunks; the model handles the rest.

import base64
import requests
from pdf2image import convert_from_path
from io import BytesIO
from typing import List, Dict
from pathlib import Path

VLLM_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "nemotron-omni"

def file_to_b64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def build_corpus(
    textbook_pdf: str,
    lecture_video: str,
    notes_photos: List[str],
    study_group_text: str,
) -> List[Dict]:
    """Build a typed list of corpus chunks Nemotron can ingest."""
    corpus = []

    # Textbook pages — each page is one image chunk with a source label.
    pages = convert_from_path(textbook_pdf, dpi=150)
    for i, page in enumerate(pages):
        buf = BytesIO()
        page.save(buf, format="PNG")
        b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
        corpus.append({
            "type": "image",
            "source": f"{Path(textbook_pdf).name}#page={i+1}",
            "data": b64,
        })

    # Lecture video — single chunk. Nemotron's hard cap is 2 minutes;
    # for longer lectures, chunk by scene or fixed window upstream.
    corpus.append({
        "type": "video",
        "source": Path(lecture_video).name,
        "data": file_to_b64(lecture_video),
    })

    # Notes photos — each photo is one image chunk.
    for photo_path in notes_photos:
        corpus.append({
            "type": "image",
            "source": Path(photo_path).name,
            "data": file_to_b64(photo_path),
        })

    # Study group chat — text chunk. Cleaned offline (names redacted,
    # off-topic banter removed). What's left is professor-emphasis
    # signal: "she said the Calvin cycle is heavy this semester."
    corpus.append({
        "type": "text",
        "source": "study_group_chat.txt",
        "data": study_group_text,
    })

    return corpus
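The comment in build_corpus leaves the "fixed window upstream" chunking to the reader. As a minimal sketch, the window boundaries for a long lecture could be computed like this. The helper name fixed_windows is hypothetical, the 120-second default simply mirrors the 2-minute cap mentioned above, and actually cutting the video at these boundaries would need a separate tool that is not shown here.

```python
from typing import List, Tuple


def fixed_windows(duration_s: float, window_s: float = 120.0) -> List[Tuple[float, float]]:
    """Split [0, duration_s) into consecutive (start, end) windows.

    Each window is at most window_s seconds long; the last one holds
    whatever remainder is left. Hypothetical helper, not part of the
    original pipeline.
    """
    duration_s = float(duration_s)
    if duration_s <= 0 or window_s <= 0:
        return []
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        windows.append((start, end))
        start = end
    return windows


# A 5-minute lecture yields two full windows and a 60-second remainder.
print(fixed_windows(300))  # [(0.0, 120.0), (120.0, 240.0), (240.0, 300.0)]
```

Each window would then become its own video chunk in the corpus, ideally with the start time folded into its source label so citations can point at a timestamp.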

Step 2: Build the corpus payload

vLLM’s OpenAI-compatible API takes a list of content blocks per message. We turn our typed corpus into that shape:

def corpus_to_content(corpus: List[Dict]) -> List[Dict]:
    """Convert the corpus into OpenAI-style content blocks."""
    content = []
    for chunk in corpus:
        if chunk["type"] == "image":
            content.append({
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{chunk['data']}"
                },
            })
        elif chunk["type"] == "video":
            content.append({
                "type": "video_url",
                "video_url": {
                    "url": f"data:video/mp4;base64,{chunk['data']}"
                },
            })
        elif chunk["type"] == "text":
            content.append({
                "type": "text",
                "text": f"[Source: {chunk['source']}]\n{chunk['data']}",
            })
    return content
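One detail worth noticing in corpus_to_content: only text chunks carry their source label into the payload, so the model never sees which page or file an image or video came from. Since grounded citations depend on those labels, one option is to put a short text block in front of every media block. The variant below is a sketch under that assumption. The function name is hypothetical, and whether the model reliably ties a label to the media block that follows it is something to verify on your own data.

```python
from typing import Dict, List

# Block key and data-URL prefix per media type. The MIME types are the same
# ones hard-coded in corpus_to_content; photos saved as JPEG would need
# "image/jpeg" instead (an adjustment not made here).
_MEDIA = {
    "image": ("image_url", "data:image/png;base64,"),
    "video": ("video_url", "data:video/mp4;base64,"),
}


def corpus_to_labeled_content(corpus: List[Dict]) -> List[Dict]:
    """Like corpus_to_content, but every chunk is preceded by its source label."""
    content = []
    for chunk in corpus:
        label = f"[Source: {chunk['source']}]"
        if chunk["type"] == "text":
            content.append({"type": "text", "text": f"{label}\n{chunk['data']}"})
        elif chunk["type"] in _MEDIA:
            key, prefix = _MEDIA[chunk["type"]]
            # The label goes in its own text block right before the media block.
            content.append({"type": "text", "text": label})
            content.append({"type": key, key: {"url": prefix + chunk["data"]}})
    return content
```

The cost is one extra small text block per page or photo, which is negligible next to the image tokens themselves.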

Step 3: Generate a raw question pool

We ask Nemotron to produce N questions in structured JSON. The system prompt is the lever — it controls quality more than any other knob.

def generate_question_pool(
    corpus: List[Dict],
    topic: str,
    n_questions: int = 20,
) -> List[Dict]:
    """Generate a raw pool of N questions. Many will be dropped later."""
    system_prompt = f"""You are a careful tutor preparing practice questions
for a student studying {topic}. Generate exactly {n_questions} multiple-choice
questions based ONLY on the provided study materials.

Rules:
- Each question must have exactly 4 options labeled A-D
- Exactly one option must be unambiguously correct given the materials
- Cite the source for the correct answer (filename or page number)
- Vary difficulty: ~30% recall, ~50% application, ~20% synthesis
- Do not invent facts not present in the materials
- Output strict JSON: a list of objects with keys
  question, options, correct, source, rationale, difficulty"""

    user_content = corpus_to_content(corpus) + [{
        "type": "text",
        "text": f"Generate the {n_questions} questions now. Output JSON only.",
    }]

    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
        "temperature": 0.7,  # variety in the raw pool
        "max_tokens": 4096,
        "response_format": {"type": "json_object"},
    }

    r = requests.post(VLLM_URL, json=payload, timeout=600)
    r.raise_for_status()
    raw = r.json()["choices"][0]["message"]["content"]

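    import json
    parsed = json.loads(raw)
    # JSON-object mode returns a top-level object, so the list usually arrives
    # wrapped under a key; accept a bare list as well. Calling .get() on a
    # list would raise AttributeError, so branch on the type first.
    if isinstance(parsed, dict):
        questions = parsed.get("questions", [])
    else:
        questions = parsed if isinstance(parsed, list) else []
    return questions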
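Because the pool comes straight from model output, it may help to discard malformed entries before spending an evaluation call on each one. The check below is a sketch, not part of the original pipeline. The helper names are hypothetical, and the check assumes options is a four-element list (Step 4 indexes it by position) and correct is a single letter from A to D.

```python
from typing import Dict, List

# Keys the later steps read without a fallback.
REQUIRED_KEYS = {"question", "options", "correct"}


def is_well_formed(q: object) -> bool:
    """True if q has the shape the self-evaluation step expects."""
    if not isinstance(q, dict) or not REQUIRED_KEYS.issubset(q):
        return False
    options = q["options"]
    if not isinstance(options, list) or len(options) != 4:
        return False
    return str(q["correct"]).strip().upper() in {"A", "B", "C", "D"}


def drop_malformed(questions: List[Dict]) -> List[Dict]:
    """Keep only the questions that pass is_well_formed."""
    return [q for q in questions if is_well_formed(q)]
```

Running drop_malformed on the pool before filtering keeps a single bad entry from raising a KeyError or IndexError halfway through the evaluation loop.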

Step 4: The self-evaluation pass — the calibration step

This is the heart of the system. For each generated question, we ask the model to answer it from scratch using only the corpus, then compare the model’s answer to the expected answer.

def self_evaluate_question(
    corpus: List[Dict],
    question: Dict,
) -> Dict:
    """Run the model against its own question. Score the question on
    three axes: clarity, groundedness, non-triviality."""

    system_prompt = """You are a careful student answering a multiple-choice
question using ONLY the provided study materials. Reply in strict JSON with
keys: chosen_option (A/B/C/D), confidence (0.0-1.0), supporting_evidence
(1-2 sentences from the materials), reasoning (1 sentence)."""

    user_content = corpus_to_content(corpus) + [{
        "type": "text",
        "text": f"""Question: {question['question']}
Options:
A) {question['options'][0]}
B) {question['options'][1]}
C) {question['options'][2]}
D) {question['options'][3]}

Answer using ONLY the materials. Output JSON only."""
    }]

    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
        "temperature": 0.0,  # deterministic when answering
        "max_tokens": 512,
        "response_format": {"type": "json_object"},
    }

    r = requests.post(VLLM_URL, json=payload, timeout=600)
    r.raise_for_status()

    import json
    eval_result = json.loads(r.json()["choices"][0]["message"]["content"])

    is_correct = eval_result["chosen_option"] == question["correct"]
    confidence = float(eval_result["confidence"])
    has_evidence = len(eval_result.get("supporting_evidence", "")) > 20

    return {
        "is_correct": is_correct,
        "confidence": confidence,
        "has_evidence": has_evidence,
        "model_chose": eval_result["chosen_option"],
        "evidence": eval_result.get("supporting_evidence", ""),
        "reasoning": eval_result.get("reasoning", ""),
    }

Step 5: Filter

def filter_calibrated(
    corpus: List[Dict],
    questions: List[Dict],
    target_count: int = 10,
    confidence_floor: float = 0.7,
) -> List[Dict]:
    """Run self-eval on each question. Keep questions that:
       - Get the right answer
       - With confidence >= confidence_floor
       - Have non-trivial supporting evidence"""
    keepers = []
    for q in questions:
        result = self_evaluate_question(corpus, q)
        passed = (
            result["is_correct"]
            and result["confidence"] >= confidence_floor
            and result["has_evidence"]
        )
        q["_eval"] = result
        q["_passed"] = passed
        if passed:
            keepers.append(q)

    keepers.sort(key=lambda q: q["_eval"]["confidence"], reverse=True)
    return keepers[:target_count]

That’s the full filter. Three lines of decision logic:

is_correct — model agrees with the question’s stated answer
confidence >= 0.7 — model isn’t waffling
has_evidence — model could cite something specific from the corpus

If all three pass, the question is a keeper. If not, drop it. We sort by confidence descending so the best questions land at the top of the test.
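Since filter_calibrated stores each result under the _eval key, a per-reason breakdown of the dropped questions can be derived afterwards. The sketch below attributes each drop to the first check that fails, in the same order as the filter. That attribution rule is an assumption (a question can fail more than one check), and both helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def drop_reason(result: Dict, confidence_floor: float = 0.7) -> Optional[str]:
    """Return the first failing check for one _eval result, or None if it passed."""
    if not result["is_correct"]:
        return "Model picked wrong option"
    if result["confidence"] < confidence_floor:
        return "Confidence below floor"
    if not result["has_evidence"]:
        return "Evidence too thin"
    return None


def tally_drop_reasons(questions: List[Dict], confidence_floor: float = 0.7) -> Counter:
    """Count drop reasons across questions already annotated with _eval."""
    reasons = (drop_reason(q["_eval"], confidence_floor) for q in questions)
    return Counter(r for r in reasons if r is not None)
```

Note that this only works on the full pool after filter_calibrated has run, because the function returns just the keepers while annotating every question in place.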

Step 6: Assemble the test

def build_test(
    textbook_pdf: str,
    lecture_video: str,
    notes_photos: List[str],
    study_group_text: str,
    topic: str,
    target_count: int = 10,
) -> Dict:
    corpus = build_corpus(textbook_pdf, lecture_video, notes_photos, study_group_text)
    pool = generate_question_pool(corpus, topic, n_questions=target_count * 2)
    keepers = filter_calibrated(corpus, pool, target_count=target_count)

    return {
        "topic": topic,
        "questions_generated": len(pool),
        "questions_kept": len(keepers),
        "questions_dropped": len(pool) - len(keepers),
        "test": [
            {
                "id": i + 1,
                "question": q["question"],
                "options": q["options"],
                "difficulty": q.get("difficulty", "?"),
            }
            for i, q in enumerate(keepers)
        ],
        "answer_key": [
            {
                "id": i + 1,
                "correct": q["correct"],
                "rationale": q.get("rationale", ""),
                "source": q.get("source", ""),
                "model_confidence": round(q["_eval"]["confidence"], 2),
            }
            for i, q in enumerate(keepers)
        ],
    }
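To turn that dictionary into something a student can read, the test half can be flattened to plain text while the answer key stays separate. This is a minimal sketch. render_test is a hypothetical helper, and it assumes the options arrive in A to D order, as the generation prompt requests.

```python
from typing import Dict


def render_test(result: Dict) -> str:
    """Render the 'test' part of a build_test result as plain text.

    The answer key is deliberately left out so the output can be handed
    to the student as-is.
    """
    lines = [f"Practice test: {result['topic']}", ""]
    for item in result["test"]:
        lines.append(f"{item['id']}. {item['question']}")
        for letter, option in zip("ABCD", item["options"]):
            lines.append(f"   {letter}) {option}")
        lines.append("")  # blank line between questions
    return "\n".join(lines)
```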

Step 7: Notes and tutorial — same corpus, different prompts

Notes and the tutorial reuse the corpus payload from Step 2 directly. No retrieval changes, no extra infrastructure. Just two more chat completions with prompts shaped for each output type:

def generate_notes(corpus: List[Dict], topic: str) -> str:
    """
    Organized study notes with citations to source material.
    Output is markdown, ready to render in a tab.
    """
    return chat_complete(
        system=(
            "You are a meticulous study-notes writer. Produce concise, "
            "topic-grouped notes in markdown. EVERY claim must end with "
            "a bracketed citation like [Textbook §9.3] or [Lecture 4 · 12:40]. "
            "If a claim has no source in the corpus, do not include it."
        ),
        user=corpus_to_content(corpus) + [
            {"type": "text", "text":
             f"Write study notes on: {topic}. "
             f"Group by sub-topic. Use bullet points. Include citations."}
        ],
        temperature=0.2,
    )

def generate_tutorial(corpus: List[Dict], topic: str) -> str:
    """
    Walk-through tutorial with worked examples, framed in the
    textbook's vocabulary. Different shape than notes — narrative, not bullets.
    """
    return chat_complete(
        system=(
            "You are a patient tutor explaining a concept to a student "
            "who has the source material in front of them. Walk through "
            "ONE worked example end-to-end, then explain the underlying "
            "principle, then give two short practice variations. Match "
            "the vocabulary of the source. Cite as you go."
        ),
        user=corpus_to_content(corpus) + [
            {"type": "text", "text":
             f"Tutorial topic: {topic}. "
             f"Pick a representative worked example from the corpus."}
        ],
        temperature=0.4,  # slightly higher than notes — narrative needs some flow
    )
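Both functions call chat_complete, which the listings above never define. A plausible shape, modeled on the request code in Steps 3 and 4, is sketched below. The max_tokens default, the split into a payload builder, and the injectable post parameter are all assumptions made here so the helper can be exercised without a running server.

```python
from typing import Callable, Dict, List, Optional

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # same endpoint as Step 1
MODEL = "nemotron-omni"                                  # same model name as Step 1


def build_chat_payload(system: str, user: List[Dict], temperature: float,
                       max_tokens: int = 4096) -> Dict:
    """Assemble the request body in the same shape as Steps 3 and 4."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,  # assumed default, borrowed from Step 3
    }


def chat_complete(system: str, user: List[Dict], temperature: float = 0.2,
                  max_tokens: int = 4096,
                  post: Optional[Callable] = None) -> str:
    """Send one chat completion and return the message text.

    post defaults to requests.post; passing a stub makes the function
    testable offline.
    """
    if post is None:
        import requests  # deferred so a stub can stand in during tests
        post = requests.post
    payload = build_chat_payload(system, user, temperature, max_tokens)
    r = post(VLLM_URL, json=payload, timeout=600)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]
```

Unlike the test pipeline, this helper sets no response_format, since notes and tutorials come back as markdown rather than JSON.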

What it looks like running

For a sample run on a college calculus chapter on integration by substitution — 1 textbook chapter (24 pages), 1 recorded lecture (chunked to a 2-minute summary clip), 6 phone photos of handwritten notes, and ~80 lines of cleaned study group chat:

Generating 20 questions...    [done in 38s]
Self-evaluating each...       [done in 1m 12s]

Generated:  20
Kept:       12
Dropped:     8

Drop reasons:
  - Model picked wrong option:   3
  - Confidence below 0.7:        4
  - Evidence too thin:           1

Top keeper (confidence 0.97):
  Q: When applying u-substitution to ∫ 2x · cos(x²) dx,
     what is the most direct choice of u?
  Source: lecture_video.mp4 @ 12:43

Top dropped (confidence 0.42, model chose C, expected B):
  Q: Which of the following describes when u-substitution
     "always" works?
  Reason: Question is poorly framed — there's no "always"
  case in the source material. Model couldn't commit.

This output is exactly what you’d push to a tutoring app, a flashcard system, or a printout for the student. It’s structured enough to drive a UI and calibrated enough to trust. The student studies 12 questions on which the model itself has gone on record agreeing with the answer at 70%+ confidence, instead of 20 questions of which 8 are wrong, ambiguous, or thinly supported.