Module 01 · Lesson 02

Mental Models — What AI Is, What It Isn't, and Why That Distinction Saves Careers

Reading time: 22 minutes
Track: Universal Foundations · Required for all learners
Prerequisites: Module 01 · Lesson 01

What this lesson does

You don't need to be an engineer to use AI well. You do need a few accurate mental models. Without them, you'll make a specific class of mistakes that's particularly dangerous in biotech — trusting confident outputs that are confidently wrong.

By the end of this lesson, you'll be able to:

Explain what an LLM actually is, in language a colleague would understand
Predict the kinds of mistakes AI is most likely to make in your work
Recognize the difference between AI being "wrong" and AI being "miscalibrated"
Apply a simple verification framework to high-stakes outputs

This lesson is technical enough to be useful and non-technical enough to be accessible. If you've never heard the words "token" or "context window," you're in the right place. If you have, you'll get a tighter, more practical mental model than most engineers have.

01 · The core mental model

Here is the single most important sentence in this lesson:

A large language model does not know facts. It predicts plausible-sounding text.

This is not a metaphor. It is a literal description of what the system does. The implications are enormous, and most AI users in biotech don't understand them — which is why so many AI-assisted biotech outputs contain confident, plausible, completely fabricated content.

Let's unpack that sentence in stages.

Stage 1: "Text" not "knowledge"

When you ask Claude "What were the inclusion criteria for the KEYNOTE-189 trial?", you might picture it as Claude looking up the trial in some internal database, finding the criteria, and reading them back to you.

That is not what happens.

What happens is closer to this: Claude has been trained on enormous amounts of text, including scientific publications, regulatory documents, news articles, websites, and much more. From this training, the model learned statistical patterns about which words tend to follow which other words in which contexts. When you ask about KEYNOTE-189, the model generates text by predicting, word by word (strictly, token by token; tokens are defined in Section 02), what would plausibly come next in a high-quality response to that question.

If the trial was widely covered in the training data, the prediction is usually accurate — because the actual inclusion criteria were the words most commonly written after similar prompts. But the model is not "looking up" anything. It is generating text that statistically resembles correct answers.
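The word-by-word prediction described above can be made concrete with a toy model. The sketch below is a deliberately tiny bigram counter, not how Claude or any production LLM is built (real models use neural networks over tokens). The corpus and the names `train_bigram` and `complete` are hypothetical, invented for illustration. It shows only the core idea: the output is whatever most often followed the previous word in the training text, and no fact is looked up anywhere.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each word, which words follow it in the training text."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows

def complete(follows, prompt, max_words=5):
    """Greedily append the most frequent next word; no fact lookup happens."""
    words = prompt.lower().split()
    for _ in range(max_words):
        candidates = follows.get(words[-1])
        if not candidates:
            break  # the last word never appeared mid-sentence in training
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

# Hypothetical three-sentence "training set", invented for illustration.
TOY_CORPUS = [
    "the trial enrolled adults with untreated disease",
    "the trial enrolled adults with measurable disease",
    "the study enrolled adults with untreated disease",
]

model = train_bigram(TOY_CORPUS)
print(complete(model, "the study"))
# prints: the study enrolled adults with untreated disease
```

The completion reads like a fact about "the study," but it is only the most common continuation in the toy corpus. If the real study had enrolled patients with measurable disease, the output would be fluent and wrong.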

Stage 2: Why this distinction matters

This explains nearly every weird thing AI does. Specifically:

Hallucinated citations. Ask an LLM for a citation and it will produce one: author names, journal name, year, volume, page numbers, all formatted correctly. In scientific contexts, roughly 20-40% of such citations are wrong in some way, and some are entirely fabricated. The formatting was learned from real citations, but no real citation matches what was produced.

Confidence regardless of correctness. The model's tone gives you no reliable signal for "I know this" versus "I'm guessing." It produces text with much the same tone of authority either way. Confidence and accuracy are decoupled.

Plausible-sounding but subtly wrong technical content. A senior pharmacologist asks about drug-drug interactions for a specific molecule. The response sounds right, uses appropriate terminology, and identifies plausible interactions — but mixes up CYP2D6 and CYP3A4 substrates because both appeared in the training data near similar prompts.

Better performance on common topics, worse on rare ones. The model's accuracy correlates with how much training data exists on a topic. Common indications and well-published trials get good output. Rare diseases, novel mechanisms, and recent regulatory guidance get worse output. The confidence stays the same; the accuracy drops.

Internalize this: AI is a pattern-completion system that has been trained to sound helpful and authoritative. Helpfulness and authority are properties of the style of output, not the correctness of output. You verify; the AI presents.

02 · Tokens, training, and context — the minimum technical vocabulary

You'll see four terms throughout this curriculum. Five minutes of definition now saves confusion across all 180 lessons.

Tokens

An LLM doesn't process text as individual letters or whole words. It processes "tokens," which are roughly word fragments. Common short words like "the" or "and" are one token. Longer or rarer words get split into pieces: "pharmacokinetics" might be three tokens.

This matters for two reasons:

Pricing for AI tools is often per-token (input + output)
Context limits (how much text the model can "see" at once) are measured in tokens, not words

A useful rule of thumb: 1,000 tokens ≈ 750 English words ≈ 1.5 single-spaced pages.
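The rule of thumb above can be turned into a quick estimator. This is a rough sketch only: real tokenizers differ by model and by text, and the function names and the 200,000-token window used in the example are assumptions made for illustration, not any vendor's published figure.

```python
import math

# Rule of thumb from this lesson: 1,000 tokens ~ 750 words ~ 1.5 pages.
# Real tokenizers vary by model and by text, so these are rough estimates.
WORDS_PER_PAGE = 500  # implied by 750 words ~ 1.5 single-spaced pages

def estimate_tokens(text):
    """Estimate token count from a whitespace word count (4 tokens per 3 words)."""
    word_count = len(text.split())
    return math.ceil(word_count * 4 / 3)

def estimate_pages(token_count):
    """Convert a token estimate back to approximate single-spaced pages."""
    words = token_count * 3 / 4
    return words / WORDS_PER_PAGE

def fits_in_window(text, window_tokens):
    """Rough check that a document fits a context limit of the given size."""
    return estimate_tokens(text) <= window_tokens

sample = "word " * 750  # stand-in for a 750-word document
print(estimate_tokens(sample))   # prints: 1000
print(estimate_pages(1000))      # prints: 1.5
print(fits_in_window(sample, window_tokens=200_000))  # hypothetical limit
```

Technical text full of long or rare terms usually splits into more tokens than plain English, so treat the estimate as a floor for scientific documents.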

Training data and the knowledge cutoff

The model was trained on text up to a specific date — the "knowledge cutoff." After that date, the model has no information unless it's given to it in the prompt or via a search tool.

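Cutoffs differ by model and version, and vendors list them in their model documentation. For the Claude models available in late 2025, the cutoff falls in 2024 or 2025 depending on the model. For biotech, this is critical. The model may not know about: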

Recent FDA approvals (anything in the last 6-12 months)
Recent guidance documents
Recent label updates
Recent safety signals
Recent clinical trial results announced after the cutoff

If accuracy on recent events matters — and in biotech it usually does — you need to either: (a) provide the relevant document in the prompt, or (b) use an AI tool with search capability and verify the citations independently.
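One way to make this decision routine is a simple date check. The sketch below is an assumption-laden illustration: the cutoff date is hypothetical (look up the real one for the model you use), and the one-year buffer is a conservative choice made here because events shortly before a cutoff tend to be thinly covered in training text.

```python
from datetime import date, timedelta

def needs_current_source(event_date, knowledge_cutoff, buffer_days=365):
    """True when an event is after the cutoff, or close enough to it
    that training coverage is likely to be thin (buffer is an assumption)."""
    return event_date > knowledge_cutoff - timedelta(days=buffer_days)

# Hypothetical cutoff, for illustration only.
HYPOTHETICAL_CUTOFF = date(2024, 12, 31)

print(needs_current_source(date(2025, 3, 1), HYPOTHETICAL_CUTOFF))  # after cutoff
print(needs_current_source(date(2024, 6, 1), HYPOTHETICAL_CUTOFF))  # inside buffer
print(needs_current_source(date(2020, 1, 1), HYPOTHETICAL_CUTOFF))  # well before
```

When the check returns True, paste the source document into the prompt or use a search-enabled tool, as described above.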

Context window

The "context window" is how much text the model can hold in active attention at one time. For modern models, this is hundreds of thousands of tokens. By the rule of thumb above, 200,000 tokens is about 150,000 words, roughly a full-length book's worth of text.

Implications for your work:

You can paste long documents (protocols, papers, SOPs) into a single conversation and reference them throughout
BUT: if a conversation gets too long, the model starts losing track of earlier content
AND: when you paste a long document, the model treats it as input, not as authoritative knowledge — you still need to verify what it produces about that document

Parameters and "the model"

You'll hear references to model parameters (billions of them) and to specific models (Claude Sonnet, Claude Opus, GPT-5, etc.). For your purposes, the practical distinction is:

Smaller, faster models (like Claude Sonnet or GPT-5-mini) — cheaper, faster, fine for routine tasks
Larger, slower models (like Claude Opus or GPT-5) — more expensive, slower, noticeably better on complex reasoning, novel domains, and high-stakes outputs

For biotech work, use the larger models for anything regulator-facing, anything requiring scientific reasoning, and anything where accuracy matters more than speed. Use smaller models for first drafts, brainstorming, and tasks where you'll heavily review the output anyway.

Tool note: Claude (Anthropic) and ChatGPT (OpenAI) currently have the strongest models for complex biotech work. Gemini (Google) is competitive. Open-source models (Llama, Mistral) are useful for on-premises deployment where data sovereignty matters but lag the frontier models on quality. Match the model to the task.

03 · The five failure modes you'll actually encounter

Knowing how AI fails is more useful than knowing how it succeeds. Here are the five failure modes you'll see repeatedly in biotech work, in rough order of how often they bite people.

Failure 1 · Fabricated citations and references

By far the most common biotech-specific failure. You ask the AI to support a claim with references; it produces references that look right but don't exist or don't say what's claimed.

Frequency: ~20-40% of generated citations in scientific contexts are wrong in some way (wrong year, wrong authors, wrong paper, or entirely fabricated).

Why it happens: The model has seen the pattern of scientific citations enough to generate plausible-looking ones, but it doesn't have access to a citation database it can query.

Mitigation: Never use AI-generated citations without independent verification. Treat any citation as a hypothesis to check, not a fact.

Real example: A medical writer at a Phase 3 biotech submitted a draft section to a senior reviewer with five citations generated by AI. Three didn't exist. Two existed but said the opposite of what was claimed. It was a career-limiting moment for the writer, salvaged only because the senior reviewer caught it.

Failure 2 · Confidently wrong technical content

The model produces a technically detailed answer that is wrong in subtle but important ways. The vocabulary is correct, the structure is correct, the surface tone is appropriate — but a specific fact is off.

Why it happens: The model is generating plausible-sounding completions. "Plausible" and "correct" diverge most on technical specifics.

Examples in biotech:

Mixing up CYP enzyme substrates and inhibitors
Inverting the direction of effect for a drug interaction
Citing the wrong half-life or dose for a comparator drug
Misattributing a study finding to the wrong subgroup

Mitigation: For any technical fact that matters, verify against a primary source. AI is a drafting tool; primary sources are the authority.

Failure 3 · Out-of-date information

The model has a knowledge cutoff. It doesn't know that the FDA issued new guidance last month, that a competitor announced Phase 3 results last week, or that the label was updated yesterday.

Why it matters: In biotech, "out of date" is often "wrong." A drug interaction profile that was accurate two years ago may be wrong today because of label updates. Trial results that were "ongoing" in the training data may have been published with different conclusions.

Mitigation: For anything time-sensitive, supply the current source document or use an AI tool with search capability. Always note in your work product when AI was used and what version/knowledge-cutoff it had.

Failure 4 · Sycophancy and confirmation bias

LLMs are tuned on human feedback to be helpful and agreeable, and that tuning can outweigh accuracy. When you ask "is this approach correct?", the model is biased toward saying yes. When you push back on its initial answer, it's biased toward agreeing with your pushback, even when its initial answer was right.

Why it matters: If you're using AI as a sanity check, you'll get the answer you were hoping for, not the answer you needed.

Mitigation: Ask adversarial questions. Instead of "is this right?", ask "what's wrong with this?" or "give me three reasons this is wrong." Force the model to argue the other side.

Failure 5 · Drift over long conversations

In a very long conversation (hundreds of messages, or a document of tens of thousands of tokens), the model starts to lose track of earlier instructions, contradict itself, or merge concepts that shouldn't be merged.

Why it happens: Even with large context windows, attention degrades across very long inputs. The model "pays less attention" to content far from the current generation point.

Mitigation: For long-form work, restart fresh conversations periodically. Re-paste critical context if the conversation has gone on for hours. Don't assume the model remembers what you told it 50 messages ago.

04 · The verification framework

Knowing AI fails doesn't help unless you have a habit for catching the failures. Here's the framework — four checks for any AI-assisted output that matters.

Check 1 · Source attribution

For any factual claim in AI output, ask: what's the source?

If a citation is provided: verify it independently. Does the paper exist? Does it say what's claimed?
If no citation is provided: is the claim something the model could plausibly have learned from training data, or is it specific enough that it requires a source?
If you can't verify a source: either find one yourself or remove the claim.
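The three rules above amount to a small decision procedure. The sketch below records the result of a human check and returns the action the rules imply. The `CitationCheck` fields and the `triage` function are hypothetical names for illustration. The code does not verify anything itself: a person still has to find the paper and read it.

```python
from dataclasses import dataclass

@dataclass
class CitationCheck:
    """Result of a manual check on one AI-supplied citation."""
    reference: str
    found: bool            # a person located the paper in a bibliographic index
    supports_claim: bool   # a person read it and confirmed it says what is claimed

def triage(check):
    """Map a manual check result to the action Check 1 calls for."""
    if not check.found:
        return "remove or replace: source could not be found"
    if not check.supports_claim:
        return "remove or replace: source does not support the claim"
    return "keep"

# Hypothetical check results, invented for illustration.
checks = [
    CitationCheck("Reference A", found=True, supports_claim=True),
    CitationCheck("Reference B", found=True, supports_claim=False),
    CitationCheck("Reference C", found=False, supports_claim=False),
]
for check in checks:
    print(check.reference, "->", triage(check))
```

Recording both fields separately matters because a citation that exists but says something different is as unusable as one that was fabricated.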

Check 2 · Calibration on the topic

How well-represented is this topic in publicly available text? Common topics (top-10 oncology trials, top-20 marketed drugs, standard regulatory pathways) are reasonably well-represented and AI does okay. Rare topics (your specific orphan drug, your specific competitor's pre-IND program, a recent FDA guidance document) are less well-represented and AI accuracy drops accordingly.

Calibrate your trust to the topic. A high-confidence AI answer about a common topic deserves more trust than a high-confidence answer about a rare one.

Check 3 · Adversarial review

After the AI gives you an answer, do not ask "is this right?" Ask "what's wrong with this?" Or: "argue the other side." Or: "what would a skeptical FDA reviewer find wrong with this?"

Adversarial prompts surface problems that confirmatory prompts hide. Build this into your workflow as a standard step, not an optional one.
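To make adversarial review a standard step, the three questions above can be kept as reusable templates. This is a minimal sketch: `adversarial_prompts` is a hypothetical helper that only builds prompt text, and you would paste the results into whichever AI tool you use.

```python
# The three adversarial questions from this lesson, kept as templates.
ADVERSARIAL_TEMPLATES = [
    "What's wrong with this?",
    "Argue the other side.",
    "What would a skeptical FDA reviewer find wrong with this?",
]

def adversarial_prompts(draft):
    """Return one follow-up prompt per adversarial question, each with the draft attached."""
    return [f"{question}\n\nDRAFT:\n{draft}" for question in ADVERSARIAL_TEMPLATES]

# Hypothetical draft sentence, invented for illustration.
draft = "Our dosing rationale is consistent with the comparator label."
for prompt in adversarial_prompts(draft):
    print(prompt)
    print()
```

Running each prompt in a fresh conversation is a reasonable design choice, since it keeps the model from simply defending its earlier answer.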

Check 4 · Stake-based verification

Match the depth of verification to the stakes of the output.

Output stakes	Verification depth
Internal brainstorm, low-risk draft	Spot-check key claims
Internal document going to leadership	Verify all factual claims, all citations
External document (publication, conference)	Independent verification of every factual claim and citation
Regulatory submission	Independent verification + QC review by a second qualified person who knows AI was used
Patient-facing or label content	Full medical/legal/regulatory review, document AI usage in version history

The most common biotech AI accident is mismatched verification — treating a regulatory output like an internal brainstorm. Don't.
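The table above can be encoded as a lookup so that a mismatch is caught mechanically. This is a sketch under one stated assumption: it treats the five rows as a strict low-to-high scale in the order the table lists them. The enum and function names are hypothetical.

```python
from enum import IntEnum

class Stakes(IntEnum):
    """Output tiers in the order the table lists them (ordering is an assumption)."""
    INTERNAL_BRAINSTORM = 1
    LEADERSHIP_DOCUMENT = 2
    EXTERNAL_DOCUMENT = 3
    REGULATORY_SUBMISSION = 4
    PATIENT_FACING = 5

VERIFICATION_DEPTH = {
    Stakes.INTERNAL_BRAINSTORM: "Spot-check key claims",
    Stakes.LEADERSHIP_DOCUMENT: "Verify all factual claims, all citations",
    Stakes.EXTERNAL_DOCUMENT: "Independent verification of every factual claim and citation",
    Stakes.REGULATORY_SUBMISSION: "Independent verification + QC review by a second "
                                  "qualified person who knows AI was used",
    Stakes.PATIENT_FACING: "Full medical/legal/regulatory review, "
                           "document AI usage in version history",
}

def required_verification(stakes):
    """Look up the verification depth the table assigns to an output tier."""
    return VERIFICATION_DEPTH[stakes]

def is_under_verified(output_stakes, verification_applied):
    """True when the verification applied was designed for a lower tier."""
    return verification_applied < output_stakes

print(required_verification(Stakes.LEADERSHIP_DOCUMENT))
# The accident described above: a regulatory output checked like a brainstorm.
print(is_under_verified(Stakes.REGULATORY_SUBMISSION, Stakes.INTERNAL_BRAINSTORM))
```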

05 · A reframe — AI as a confident intern with no memory

If you take nothing else from this lesson, take this analogy. It's the most accurate single mental model for working with AI in a high-stakes environment.

Imagine you have an extremely capable intern, with three properties:

They've read everything ever published, but they don't always remember it accurately. They can synthesize, draft, summarize, and analyze across virtually any topic. But they may have misremembered specifics. You'd never ship their work without checking.
They never say "I don't know." No matter what you ask, they produce a confident, well-written answer. The confidence is constant; the accuracy varies. You have to develop your own calibration for when to trust them.
They have no memory between conversations. Every time you start fresh, they don't remember what you taught them yesterday. You have to re-establish context. (This is changing slowly with memory features, but it's still the default assumption.)

How would you work with that intern?

You'd:

Give them clear, specific assignments rather than open-ended ones
Provide context they'll need (since they don't remember)
Review their work carefully before sending it anywhere consequential
Use them for tasks where their speed and breadth are valuable, not for tasks where accuracy of recall is critical
Build templates and checklists so you don't reinvent the wheel each time

That's also how you should work with AI.

The intern analogy fails in one important way: AI is faster, cheaper, and more available than any actual intern. The ratio of "value created per hour of supervision" is significantly higher. But the supervisory mindset is the right one.

06 · What AI is good at vs. what it isn't

A practical decision rule, for your daily work:

AI is genuinely good at:

Drafting structured documents when you provide a clear specification (memos, summaries, narratives, sections of larger documents)
Summarizing and synthesizing documents you provide directly (papers, protocols, reports — when you paste the content)
Translating between formats (bullet points to prose, narrative to table, technical to plain-language)
Generating variations (3 versions of an email, 5 alternative phrasings)
Reviewing for specific issues when you give it specific things to look for
Explaining concepts in domain-appropriate language when you specify the audience
Code generation for analysis pipelines, data manipulation, statistical work — though always test the code
Brainstorming and ideation where the goal is breadth of possibility, not correctness of fact

AI is genuinely bad at:

Reciting specific facts from memory that aren't in the prompt (citations, dates, numbers, names)
Knowing what it doesn't know — it will confidently produce content on topics it's uninformed about
Maintaining consistency across very long outputs without explicit structural support
Reasoning about novel situations that don't resemble its training data
Mathematical or statistical computation beyond simple arithmetic (use it to write code that does the math, not to do the math)
Anything time-sensitive or post-cutoff without external retrieval
Detecting its own errors — adversarial prompting helps, but the model can't fully audit itself

The pattern: AI is good at transformation, bad at recall. Good at structure, bad at facts. Good at language, bad at certainty.

Build workflows that lean into the strengths and verify the weaknesses.
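The advice to have AI write code that does the math, rather than doing the math itself, looks like this in practice. The readings below are hypothetical values invented for illustration. The script uses only the Python standard library, and its output comes from actual computation rather than predicted text.

```python
import math
import statistics

# Hypothetical assay readings, invented for illustration only.
readings = [4.1, 3.9, 4.4, 4.0, 4.6]

mean = statistics.mean(readings)
sd = statistics.stdev(readings)        # sample standard deviation (n - 1)
sem = sd / math.sqrt(len(readings))    # standard error of the mean

print(f"n={len(readings)} mean={mean:.3f} sd={sd:.3f} sem={sem:.3f}")
```

You still need to review the code (for example, confirming that a sample rather than population standard deviation is what you want), but once reviewed, the same script gives the same answer every time.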

07 · Knowledge check

Three questions to lock in this lesson.

Q1. Which statement most accurately describes how an LLM produces its output?

a) It searches a database of facts and returns the most relevant matches
b) It predicts plausible-sounding text token by token, based on statistical patterns learned during training
c) It reasons step-by-step using formal logic to derive correct answers
d) It retrieves authoritative sources and quotes from them

Q2. A medical writer asks Claude for five citations supporting a claim about an oncology mechanism. What is the appropriate next action?

a) Use the citations directly: AI is generally reliable for scientific references
b) Use the citations if they're formatted correctly
c) Verify each citation independently before using any of it, treating AI citations as hypotheses rather than facts
d) Ask Claude to verify its own citations

Q3. Why does the "confident intern" analogy emphasize verifying work?

a) Because AI is occasionally wrong and you should check periodically
b) Because confidence and accuracy are decoupled: the model produces equally confident output whether it's right or wrong, so verification has to be a routine habit, not an exception
c) Because the AI has bugs that should be reported
d) Because biotech compliance requires it but it isn't actually needed for accuracy

Answers: Q1: b · Q2: c · Q3: b

08 · One exercise before you move on

Take a task you do regularly — drafting something, summarizing something, analyzing something. Anything from your real work.

Ask yourself three questions about it:

Where is this task on the "transformation vs. recall" spectrum? Pure transformation tasks (rewrite this paragraph in plain English; condense this 10-page report into a 1-page summary) are AI-strong. Pure recall tasks (what was the dose in study X; what does 21 CFR 314.50(d) require) are AI-weak.
What's the failure cost? A casual internal email: low. A regulatory submission: high. Match your verification depth to this.
What context would the AI need to do this task well? This is what you'll learn to provide systematically in Module 02. For now, just notice: the AI does not know your team, your conventions, your previous work, your audience, or your standards unless you tell it.

This is a five-minute exercise. Do it now before continuing. The pattern of "spotting AI-suitable tasks" is the most valuable habit this curriculum will build in you.

09 · What's next

You now have an accurate mental model for what AI is, how it fails, and when to verify. The remaining two lessons in Module 01:

Lesson 03 · The Five Universal Capabilities: Specific skills every biotech professional should develop, regardless of role
Lesson 04 · Your 90-Day Learning Path: Planning your journey through the rest of the curriculum

Lesson 03 takes the abstract framework you just built and turns it into concrete capabilities. By the end of it, you'll know exactly what skills you're developing and why.

End of Lesson 02.