Module 03 · Lesson 02
Data Classification — The Framework That Makes Compliant AI Use Possible
Reading time: 22 minutes
Track: Governance & Compliance · REQUIRED before role specialization
Prerequisites: Module 02 (complete) · Module 03 · Lesson 01
What this lesson does
Lesson 01 established the bright lines — data and content that never enter an AI tool, ever. This lesson develops the framework that handles everything else.
Most biotech data isn't bright-line prohibited. It's also not unconstrained. It sits in a middle zone where the right answer depends on what the data is, what tool you're using, and what you're trying to do.
By the end of this lesson, you'll be able to:
- Classify any piece of data into one of five risk categories
- Match data classes to approved tool environments
- Recognize the most common classification mistakes
- Build a personal habit of classifying before prompting
This is the most operationally useful lesson in Module 03. The bright lines (Lesson 01) prevent catastrophes. The classification framework (this lesson) makes everyday work compliant without slowing you down.
01 · Why classification matters before anything else
A common mistake: thinking the question is "is this AI tool secure?" The right question is "is this data appropriate for the tool I'm using?"
Tool security is a binary the IT team owns. Data classification is a judgment call that you own, every time you paste something into a prompt.
The same Claude Enterprise environment may be appropriate for an internal strategy memo and inappropriate for unredacted patient data. The same public ChatGPT may be fine for a literature review brainstorm and a fireable offense for pre-IND draft content.
The data's classification, not the tool's marketing, determines what's allowed.
Before you can match tools to data, you need a clean classification system. Here's the one that works for biotech.
02 · The five-tier classification
Every piece of information you might use with AI fits into one of five tiers. Memorize these. They become reflexive after a week of conscious practice.
Tier 1 — PUBLIC
Information that is already public or intended to be public.
Examples:
- Published papers, public clinical trial registry entries
- FDA-approved labels, public guidance documents
- Company press releases, public investor presentations
- Public conference abstracts and posters
- Job postings, public company website content
AI handling: Any AI tool is acceptable. You're not adding risk by exposing it; it was already exposed.
Caveat: "Public to the world" is different from "public to AI training corpora." Some content is public but not yet indexed. You're not violating anything by using it, but the AI may not know about it.
Tier 2 — INTERNAL
Information that isn't secret but isn't intended for public distribution.
Examples:
- Internal SOPs and process documents
- Internal training materials, organizational charts
- Internal meeting agendas, project plans (without confidential content)
- Internal newsletters, all-hands materials
- Job descriptions in draft
AI handling: Use approved enterprise AI environments (Claude Enterprise, ChatGPT Enterprise, Gemini Workspace, or your company's internal deployment). Don't paste into consumer-tier or free-tier tools.
The exposure risk is low: most of this would be embarrassing if leaked, not catastrophic. The reason to use enterprise tools isn't extreme sensitivity; it's that the data-handling guarantees you need scale with the data tier, and internal data deserves enterprise handling as a default.
Tier 3 — CONFIDENTIAL
Information that carries meaningful business or scientific value if disclosed, but isn't legally restricted.
Examples:
- Pre-publication research (manuscripts in progress, draft posters)
- Internal strategic plans, competitive analyses
- Pricing strategies, commercial forecasts
- Negotiating positions, BD targets
- Internal financial projections
- Vendor proposals, pre-decisional procurement discussions
AI handling: Approved enterprise AI environments only, with the additional requirement that the environment has explicit data-use guarantees (your prompts and outputs are not used for model training, are not retained beyond your session unless you choose, and are not visible to the vendor's staff except for explicit support requests).
Confidential data is where most well-intentioned mistakes happen. People think "this isn't patient data, so consumer ChatGPT is fine." It's not. Confidential strategic content needs enterprise handling.
Tier 4 — RESTRICTED
Information that is legally, contractually, or regulatorily protected, with specific access controls.
Examples:
- Protected health information (PHI): names, identifiers, identifiable health details
- Personally identifiable information (PII) of employees, investigators, contractors
- Subject data from clinical trials (even if "de-identified"; see Section 05 of this lesson)
- CDA/NDA-covered partner data
- Pre-submission regulatory content
- Material non-public information (MNPI) — financial, M&A
- Information covered by attorney-client privilege
AI handling: Highly constrained.
- On-premises or zero-retention enterprise environments only. Standard enterprise SaaS may not meet the bar.
- Many specific uses still prohibited even with the right tool. Look at Lesson 01's bright lines.
- Verify with InfoSec / Privacy / Compliance before any new use case.
Restricted data is where governance most actively constrains practice. Even sophisticated enterprise environments may not be approved for restricted classes; you need to verify, not assume.
Tier 5 — PROHIBITED
Information that never enters any AI tool, regardless of environment.
Examples (from Lesson 01):
- Live patient data without explicit, documented, controlled-environment authorization
- Trade secrets defined as such by your company
- Source code that constitutes core IP
- Adverse event narratives with patient identifiers
- Pre-submission regulatory content unless using a specifically approved environment
- MNPI in any external-facing AI tool
- Anything you'd be sued for disclosing
AI handling: No AI. Other tools, manual processes, or specifically architected internal systems.
The prohibited tier is the bright-line tier from Lesson 01. It's restated here because the classification framework needs the full five-tier picture.
03 · The classification reflex
Fluency on this framework means classifying without thinking about it. Here's the reflex:
Before you paste anything into an AI prompt, ask three questions:
- If this leaked to a competitor tomorrow, what's the worst-case outcome?
- If this leaked publicly tomorrow, what's the worst-case outcome?
- Does this contain any data class that triggers a specific regulation or contract?
The answers map cleanly to tiers:
- "Nothing happens / it's already public" → Tier 1
- "Mild embarrassment, no real harm" → Tier 2
- "Meaningful competitive or strategic harm" → Tier 3
- "Legal, regulatory, contractual exposure" → Tier 4
- "Catastrophic — career, company, patient impact" → Tier 5
Practice this three-question check on the next ten things you'd plausibly paste into AI. By the tenth one, the classification becomes automatic. By the thirtieth, you do it without conscious thought.
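The three-question check and its tier mapping can be sketched in a few lines of Python. This is an illustrative sketch only: the `Harm` scale, the `classify` function, and the rule that the most severe answer wins are assumptions made for the example, not part of any company policy.

```python
from enum import IntEnum


class Tier(IntEnum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4
    PROHIBITED = 5


class Harm(IntEnum):
    """Worst-case outcome if the content leaked (hypothetical scale)."""
    NONE = 0          # nothing happens / it's already public
    EMBARRASSING = 1  # mild embarrassment, no real harm
    STRATEGIC = 2     # meaningful competitive or strategic harm
    CATASTROPHIC = 3  # career, company, or patient impact


def classify(competitor_leak: Harm, public_leak: Harm, regulated: bool) -> Tier:
    """Map the three answers to a tier. Assumption: the most severe answer wins."""
    worst = max(competitor_leak, public_leak)
    if worst == Harm.CATASTROPHIC:
        return Tier.PROHIBITED
    if regulated:  # question 3: a regulation or contract is triggered
        return Tier.RESTRICTED
    if worst == Harm.STRATEGIC:
        return Tier.CONFIDENTIAL
    if worst == Harm.EMBARRASSING:
        return Tier.INTERNAL
    return Tier.PUBLIC


# A draft pricing memo: strategic harm if a competitor saw it, not regulated.
print(classify(Harm.STRATEGIC, Harm.EMBARRASSING, regulated=False).name)
```

The point of the sketch is the ordering: a "yes" on the regulation question lifts content to Tier 4 even when the leak scenarios look mild.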
04 · Tool-environment matrix
Once you can classify, you need to know which environments handle which tiers. Here's the general matrix. Your specific company will have a more detailed version — find it and learn it.
| Data Tier | Consumer/Free AI | Enterprise SaaS | Zero-Retention Enterprise | On-Premises |
|---|---|---|---|---|
| Tier 1 (Public) | ✅ | ✅ | ✅ | ✅ |
| Tier 2 (Internal) | ❌ | ✅ | ✅ | ✅ |
| Tier 3 (Confidential) | ❌ | ⚠️ Check policy | ✅ | ✅ |
| Tier 4 (Restricted) | ❌ | ❌ | ⚠️ Specific approval | ✅ Where approved |
| Tier 5 (Prohibited) | ❌ | ❌ | ❌ | ❌ |
Reading the matrix:
- ✅ — generally acceptable; verify against company policy
- ⚠️ — situational; requires explicit policy check or use-case approval
- ❌ — not acceptable; do not use
Critical: This is a general framework. Your company's specific policies may be stricter. They are rarely looser. Default to the stricter interpretation when in doubt.
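For readers who think in lookups, the general matrix can be written as a small table in code. This is a sketch of the general framework only; the environment keys and the `verdict` function are hypothetical names, and a real company matrix may be stricter.

```python
from enum import Enum


class Verdict(Enum):
    OK = "generally acceptable; verify against company policy"
    CHECK = "situational; requires explicit policy check or use-case approval"
    NO = "not acceptable; do not use"


OK, CHECK, NO = Verdict.OK, Verdict.CHECK, Verdict.NO

# Column order mirrors the matrix above. The string keys are hypothetical names.
ENVIRONMENTS = ("consumer", "enterprise_saas", "zero_retention", "on_premises")

MATRIX = {
    1: (OK, OK, OK, OK),
    2: (NO, OK, OK, OK),
    3: (NO, CHECK, OK, OK),
    4: (NO, NO, CHECK, OK),  # on-premises only where approved
    5: (NO, NO, NO, NO),
}


def verdict(tier: int, environment: str) -> Verdict:
    """Look up the general matrix; anything unrecognized fails closed."""
    if tier not in MATRIX or environment not in ENVIRONMENTS:
        return Verdict.NO
    return MATRIX[tier][ENVIRONMENTS.index(environment)]


print(verdict(3, "enterprise_saas").value)
```

Returning `Verdict.NO` for an unknown tier or environment reflects the "default to the stricter interpretation" rule: an environment you cannot name is not one you can approve.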
What each environment type means
Consumer/Free AI: Public ChatGPT, free Claude.ai, Gemini consumer, etc. Your prompts may be used for training. There's no data-use contract. Sessions may persist indefinitely. This is fine for personal use and Tier 1 data; not for anything else.
Enterprise SaaS: Claude for Work / Claude Enterprise, ChatGPT Enterprise, Gemini Enterprise. Your prompts are typically not used for training, retention is contractually controlled, and there are administrative protections. Good for Tier 2 and most Tier 3. Verify the specific terms, because they vary by plan within each vendor's enterprise offerings.
Zero-Retention Enterprise: A subset of enterprise offerings that guarantee no retention of prompts/outputs, no use for training, and specific data-handling protocols. Required for sensitive Tier 3 and a precondition for many Tier 4 use cases. The vendor must explicitly attest to these.
On-Premises: The AI model runs on infrastructure you control. Data never leaves your environment. The trade-off: models you can deploy on-premises (open-weight models such as Llama or Mistral) typically trail frontier models in capability, by a rough estimate of 12-18 months. For highly regulated work where data sovereignty trumps capability, this is the right answer.
05 · The de-identification trap
A specific gray-zone case that catches people: "I removed the patient name, so the data is now safe to use."
This is wrong in important ways.
HIPAA defines two paths to de-identification:
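- Safe Harbor: removal of 18 specified identifier types (names, geographic units smaller than a state, all date elements other than year, phone numbers, etc.), with no actual knowledge that the remaining information could identify the individual. Only when both conditions hold is the data considered de-identified.
- Expert Determination: a person with appropriate statistical and scientific expertise determines, and documents, that the re-identification risk is "very small."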
Common biotech mistakes:
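- Removing the name but keeping the visit date and ZIP code. (Date plus ZIP often re-identifies. Under Safe Harbor, dates must be reduced to the year and ZIP codes truncated to their first three digits or removed.)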
- Removing identifiers but keeping a study ID that can be cross-referenced. (Study IDs can re-identify in many contexts.)
- Sharing "small-N" data where the population is identifiable by demographics alone. (Rare disease populations may be re-identifiable from age + diagnosis + region.)
- Removing direct identifiers but keeping verbatim narrative text. (Free-text narratives often contain identifiers in flowing prose.)
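The mistakes above can be turned into a rough pre-flight check. The sketch below is illustrative only: the field names, the ISO-date pattern, and the `small_n` threshold are all assumptions, and it covers only a few of the ways a record can remain identifiable. An empty result does not mean the record is de-identified.

```python
import re

# Hypothetical field names for illustration; a real record schema will differ.
LINKABLE_FIELDS = {"name", "date_of_birth", "visit_date", "zip_code", "study_id"}
FREE_TEXT_FIELDS = {"narrative"}

# Matches ISO-style full dates (YYYY-MM-DD) in free text. Assumption: other date
# formats and other identifiers written in prose would need their own checks.
FULL_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def residual_risks(record: dict, cohort_size: int, small_n: int) -> list:
    """Return reasons a 'de-identified' record may still be re-identifiable."""
    risks = []
    for field in sorted(LINKABLE_FIELDS & set(record)):
        risks.append(f"{field} still present")
    for field in sorted(FREE_TEXT_FIELDS & set(record)):
        risks.append(f"{field} is verbatim free text")
        if FULL_DATE.search(record[field]):
            risks.append(f"{field} contains a full date")
    if cohort_size < small_n:  # small_n is a placeholder, not a regulatory value
        risks.append("small-N population")
    return risks


record = {
    "age_range": "40-49",
    "zip_code": "02139",
    "study_id": "S-001",
    "narrative": "Seen on 2024-03-05 for follow-up.",
}
for risk in residual_risks(record, cohort_size=12, small_n=20):
    print(risk)
```

A check like this can only raise flags; it cannot clear data for use.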
The Module 03 default for clinical and patient data:
"De-identified data" is treated as Tier 4 (Restricted), not Tier 1 (Public), unless an Expert Determination has been performed for the specific use case.
This is more conservative than what some teams practice. It's the appropriate default given how often "de-identified" data turns out to be re-identifiable.
If your team has a different policy, follow it. But understand that the conservative default exists for a reason — the consequences of getting de-identification wrong are severe.
06 · The "what tool am I in?" check
A specific failure mode: people use AI through a context (a connected app, a browser extension, an automation tool) and forget what's actually running underneath.
Examples:
- You use a Chrome extension that "summarizes documents." Where is that summary being generated? By what model? With what data-use policy?
- You use a meeting transcription tool that "enhances notes with AI." Is the AI running locally? Through a vendor? With what classification of the transcript?
- You use a workflow automation tool (Zapier, n8n, Make) that "AI-generates a response." Which AI? Which environment? With what data passing through?
In each case, the AI usage is real but invisible. You may be putting Tier 3 data through a system that's only approved for Tier 2. Or worse.
The check:
For any tool you use that has AI features, know:
- Which AI model is running underneath
- Where the model is hosted (vendor, cloud, on-prem)
- Whether your data is used for training
- Whether the data is retained, and for how long
- Whether this matches the tier of data you're processing
If you don't know the answers, treat the tool as consumer-tier until you do. Ask your IT/AI team. The check takes 5 minutes once. The exposure from skipping the check could be career-ending.
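The five-item check, together with the "treat it as consumer-tier until you know" rule, can be sketched as a small record with a fail-closed default. The `ToolProfile` fields and function names are hypothetical; `approved_max_tier` stands in for whatever your IT/AI team tells you the tool is approved for.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ToolProfile:
    """Answers to the five questions; None means 'I don't know yet'."""
    model: Optional[str] = None              # which AI model runs underneath
    hosting: Optional[str] = None            # vendor, cloud, or on-prem
    trains_on_data: Optional[bool] = None    # is your data used for training
    retention_days: Optional[int] = None     # how long data is retained
    approved_max_tier: Optional[int] = None  # highest tier approved for the tool


def max_tier_allowed(tool: ToolProfile) -> int:
    """Any unanswered question drops the tool to consumer-tier (Tier 1 only)."""
    if any(getattr(tool, f.name) is None for f in fields(tool)):
        return 1
    return tool.approved_max_tier


def may_process(tool: ToolProfile, data_tier: int) -> bool:
    return data_tier <= max_tier_allowed(tool)


summarizer_extension = ToolProfile(hosting="vendor")  # four answers still unknown
print(may_process(summarizer_extension, data_tier=3))
```

The design choice worth noting is that a partially known tool is treated exactly like an unknown one.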
07 · The handoff problem
A second specific failure mode: data classification changes when content moves between contexts, and people don't re-classify.
Example sequence:
- You have an internal strategy memo (Tier 3).
- You ask AI to convert it into a public-facing summary (Tier 1).
- The conversion is in an enterprise environment — fine.
- You then ask AI to revise the original strategy memo based on what you learned writing the summary.
- You're back to Tier 3 work. Still in enterprise? Good. In consumer because you switched tabs? Not good.
The handoff problem is that classification follows the data, not the task. When you move between artifacts, re-check the classification of what you're actually working with.
A useful habit: at the start of any AI session, identify the data tier you're working in. If you switch tasks during the session, re-identify.
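The habit can be pictured as a re-check that runs every time the artifact changes, not once per session. The sketch below is a toy model: the `Session` class is hypothetical, and the `max_tier` values assume a company policy that approves the enterprise environment up to Tier 3 and consumer tools for Tier 1 only.

```python
class Session:
    """Tracks the tier of the artifact in hand, not the task being done."""

    def __init__(self, environment: str, max_tier: int):
        self.environment = environment
        self.max_tier = max_tier  # highest tier this environment is approved for
        self.log = []             # (artifact, tier, allowed) for each switch

    def switch_to(self, artifact: str, tier: int) -> bool:
        """Re-classify on every switch; True if the environment still fits."""
        allowed = tier <= self.max_tier
        self.log.append((artifact, tier, allowed))
        return allowed


enterprise = Session("enterprise", max_tier=3)
consumer_tab = Session("consumer", max_tier=1)

print(enterprise.switch_to("strategy memo", tier=3))    # memo in enterprise
print(consumer_tab.switch_to("strategy memo", tier=3))  # same memo, wrong tab
```

The same memo gives a different answer in each session because the check is keyed on the data and the environment, never on the task.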
08 · Edge cases and gray zones
A few situations where classification is genuinely ambiguous:
Edge case 1 — Synthetic or hypothetical data
You generate a synthetic dataset that mimics real patient data structure. Is it Tier 1 or Tier 4?
Generally: Tier 2 if it's truly synthetic (generated from nothing). Tier 4 if it's "synthetic" in the sense of being derived from real data with noise added.
Practical guidance: If you generated the synthetic data yourself with no real-data seed, treat it as Tier 2. If a tool generated it from a real-data seed, treat it as the source tier until you have an Expert Determination.
Edge case 2 — Aggregated data
You have aggregated statistics from a clinical trial — counts of patients with various AE classes, not individual records. Is this Tier 1 or Tier 4?
Generally: Tier 3 (Confidential) by default, even when aggregated. Aggregates can be re-identifying in small populations (e.g., rare disease), and aggregates often contain implicit information about populations that's protected.
Practical guidance: If the aggregates are from a published or filed source, they're whatever tier the source is (typically Tier 1). If they're internal pre-disclosure, treat as Tier 3 minimum.
Edge case 3 — Public data about a competitor
You compile information from a competitor's public materials (press releases, conference presentations, filings) into an internal analysis. Is the analysis Tier 1 or Tier 3?
Generally: The inputs are Tier 1 but the analysis is Tier 3. Your strategic interpretation of public data is itself confidential.
Practical guidance: You can use any AI tool to research public information about competitors. Once you start writing internal analyses, summaries, or strategic responses, move to enterprise environments.
Edge case 4 — Email content
You're drafting an email. The email content itself is something you'll send externally. Is the drafting work Tier 1 or higher?
Generally: The classification of an outgoing email is the classification of its content. An email confirming a meeting is Tier 1. An email discussing pricing strategy is Tier 3. An email containing patient names is Tier 4.
Practical guidance: Classify based on what's in the email, not based on the fact that you're going to send it.
Edge case 5 — Conversations about restricted topics
You want to ask a general question about a restricted topic. Example: "What are the typical safety profile concerns for [drug class]?" You're not pasting your data, you're asking a question.
Generally: A general question about a public topic is Tier 1. A specific question about your specific program — even without pasting data — can leak more than you think. "What are typical concerns for our oral PCSK9 inhibitor with [specific PK profile]?" reveals more than the asker realizes.
Practical guidance: When in doubt, make the question more abstract. "What are typical concerns for oral PCSK9 inhibitors generally?" is Tier 1. The specific version is Tier 3 in practice even though it's phrased as a general question.
09 · Building the classification habit
The cognitive load of classifying consciously every time is high. The cognitive load of classifying automatically is near-zero. Here's how to make the transition.
Week 1 — Conscious classification
For one week, before you paste anything into an AI tool, write down (literally — sticky note, scratch pad, notes app) what tier the data is. Don't just think it; write it. The act of writing it forces clarity.
Week 2 — Verbal classification
Drop the writing but say the tier to yourself, out loud or in your head. "This is Tier 2. Enterprise environment is fine."
Week 3 — Pattern recognition
By now you'll likely notice that most of your AI usage (roughly 70%, as a rule of thumb) involves the same 3-4 types of data, always in the same tiers. Mental classification becomes pre-conscious for those types. Conscious effort goes only to novel cases.
Week 4+ — Fluency
Classification has become a reflex. You only consciously think about it when something genuinely unfamiliar comes up.
This is how serious habits get installed in serious professionals. It's not magic — it's structured practice. Twenty minutes of intentional habit-building this week pays for itself the rest of your career.
10 · Knowledge check
Three questions to lock in this lesson.
Q1. You've removed the patient's name and date of birth from a clinical trial subject record. The remaining data includes age range, gender, ZIP code, primary diagnosis, and a study ID. How should this be classified for AI use?
a) Tier 1: the identifiers are removed, it's de-identified
b) Tier 2: it's no longer obviously identifiable
c) Tier 4: "de-identification" by removing only some fields is incomplete; this data is potentially re-identifiable and should be treated as Restricted unless an Expert Determination has been performed
d) Tier 5: all patient data is permanently prohibited
Q2. A workflow automation tool you use has an "AI-enhance" feature that processes meeting transcripts. You haven't verified which AI model or environment runs underneath. What's the right approach?
a) Use it freely: it's an enterprise tool
b) Treat the AI usage as consumer-tier until you've verified what's running underneath and confirmed it matches your data classification
c) Stop using the tool entirely
d) Only use it for short meetings
Q3. You're asking a general question about competitor strategy: "What's typical for biotech companies launching gene therapies in rare diseases?" Same question, more specific: "What's typical for our gene therapy launching in [specific rare disease] given our [specific clinical profile]?" How do these classify?
a) Both Tier 1: both are general questions
b) First is Tier 1; second is effectively Tier 3 because it leaks specific information about your program even though no data is pasted
c) Both Tier 3: any business question is confidential
d) First is Tier 1; second is Tier 5: never ask AI about your own program
Answers: Q1: c · Q2: b · Q3: b
11 · What's next
You now have a working framework for classifying any data you might use with AI. Remaining lessons in Module 03:
- Lesson 03 · Tool and Environment Selection — operationalizing the matrix with specific tool recommendations
- Lesson 04 · Decision Frameworks for Gray Zones — the harder judgment calls
- Lesson 05 · Audit Trails and Documentation — what to record and why
After Module 03 you'll be cleared to begin your role-specific path module.
End of Lesson 02.