Module 04B · Lesson 01
Computational Biology — Where AI Fits in Your Workflow
Reading time: 20 minutes
Track: Role Path — Computational Biology / Bioinformatics
Prerequisites: Modules 01, 02, 03 complete
Audience: Computational biologists, bioinformaticians, data scientists in biology, scientists doing -omics analysis, ML scientists in biotech R&D
What this lesson does
You're a computational scientist in biotech. You spend more time in code than at a bench. Your typical day involves data pipelines, statistical analyses, ML models, and figures — interrupted by meetings where you explain results to non-computational colleagues.
AI's value to your work is enormous. It's also different from how AI helps a bench scientist, a clinical operations lead, or a regulatory writer. This module is specific to your work.
By the end of this lesson, you'll be able to:
- Identify the eight core workflows where AI most reliably saves computational biologists time
- Recognize the AI failure modes specific to computational work
- Build a personal prompt library for your most common artifacts
- Set realistic expectations for what AI does and doesn't do in computational biology
Subsequent lessons go deep on specific workflows. This lesson maps the territory.
01 · The computational biologist's AI reality
The honest assessment of AI in computational biology as of late 2025 / early 2026.
What AI is reliably useful for:
- Writing and debugging analysis code
- Code review and refactoring
- Bioinformatics pipeline construction
- Statistical methodology consultation
- Translating between programming languages (R↔Python↔shell)
- Documentation of code, methods, and analyses
- Writing assistance for papers, talks, and reports
- Code generation for visualization
- Explaining your work to non-computational colleagues
- Literature review for methods (which approach is best for X?)
- API/library discovery (does X library do Y?)
What AI is not reliably useful for, today:
- Independent interpretation of biological data (it pattern-matches to plausible interpretations)
- Domain-specific reasoning about your particular system
- Producing reliable code without testing
- Replacing statistical training (it can run tests; it can't reliably tell you which is right)
- Knowing about very recent papers, models, or libraries (knowledge cutoff matters)
- Handling truly novel analysis where no analogous code exists in its training
The split is more favorable than for bench scientists — computational work is more directly amenable to AI. But the failure modes are no less serious. Bad analyses get published, bad code gets deployed, and the downstream consequences propagate.
02 · The eight core workflows
Eight workflows account for ~80% of high-value AI use in computational biology. Master these and you've covered most of what matters.
Workflow 1 — Code drafting and debugging
The bread and butter. Given a task description, AI drafts code. Given an error, AI debugs.
Where this is gold:
- Standard data manipulation (pandas, dplyr, etc.)
- Visualization code (matplotlib, ggplot2, plotly)
- Bioinformatics utilities (pyfasta, Biopython, samtools wrappers, etc.)
- Pipeline glue (Snakemake, Nextflow rules)
Where this is risky:
- Statistical analyses where the choice of test matters (AI may write code that runs but uses the wrong test)
- Performance-critical code (AI tends to write functional but slow implementations)
- Code with subtle correctness requirements (e.g., genomic coordinate systems, off-by-one errors)
Lesson 02 covers this workflow in depth.
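The coordinate-system risk above is easy to make concrete. The following is a minimal sketch, assuming you are converting between BED-style intervals (0-based, half-open) and GFF/GTF-style intervals (1-based, closed). The function names are illustrative and do not come from any library, and the sketch deliberately rejects zero-length intervals to keep the round trip simple.

```python
def bed_to_one_based(start: int, end: int) -> tuple[int, int]:
    """Convert a 0-based, half-open interval (BED) to 1-based, closed (GFF/GTF)."""
    if start < 0 or end <= start:
        raise ValueError(f"invalid BED interval: {start}-{end}")
    # Only the start shifts; the half-open end equals the closed 1-based end.
    return start + 1, end


def one_based_to_bed(start: int, end: int) -> tuple[int, int]:
    """Convert a 1-based, closed interval back to 0-based, half-open (BED)."""
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-based interval: {start}-{end}")
    return start - 1, end


def bed_length(start: int, end: int) -> int:
    """Length of a BED interval is end - start, with no +1."""
    return end - start


# The first 100 bases of a chromosome in both conventions.
assert bed_to_one_based(0, 100) == (1, 100)
assert bed_length(0, 100) == 100
```

When reviewing AI-drafted code that touches coordinates, a round-trip check like `one_based_to_bed(*bed_to_one_based(s, e)) == (s, e)` is a cheap way to catch a stray `+1`.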
Workflow 2 — Pipeline design and workflow orchestration
Designing multi-step analyses that run reliably and reproducibly. Includes:
- Snakemake / Nextflow pipeline structure
- Containerization (Docker, Singularity)
- Configuration management
- Logging and error handling
Where this is gold:
- Adapting existing pipeline patterns to your data
- Boilerplate for common steps
- Documentation of pipeline behavior
Where this is risky:
- Performance tuning (AI doesn't know your cluster)
- Edge case handling (AI's default is happy-path; your data has edge cases)
Workflow 3 — Statistical methodology consultation
Asking AI which statistical approach to use, and what it would mean.
Where this is gold:
- "What's the appropriate test for [setup]?" — useful for orienting, then verify with a statistician or methods paper
- Explaining unfamiliar statistical methods (e.g., what is mixed-effects modeling actually doing?)
- Identifying assumptions of methods you're using
Where this is risky:
- Multiple comparisons handling (AI may underweight this)
- Specific test choice in your context (AI's generic advice may not apply)
- Causal inference (AI can confuse association and causation)
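To make the multiple-comparisons risk tangible, here is a small sketch of the Benjamini-Hochberg adjustment written with the standard library only. It is meant as a reference for checking AI-drafted code on a toy input, not as a replacement for a maintained implementation; the function name is illustrative.

```python
def benjamini_hochberg(pvalues: list[float]) -> list[float]:
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    if m == 0:
        return []
    # Indices of the p-values, sorted from smallest to largest p.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        candidate = pvalues[i] * m / rank
        running_min = min(running_min, candidate)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

If AI-generated analysis code reports raw p-values where you expected adjusted ones, comparing its output to a small reference like this on four or five values usually exposes the gap quickly.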
Workflow 4 — Bioinformatics methodology
Specifically: how to do common -omics tasks (RNA-seq, ChIP-seq, single-cell, variant calling, etc.).
Where this is gold:
- Standard pipeline construction for well-established methods
- Choice between popular tools (e.g., DESeq2 vs. edgeR, STAR vs. HISAT2)
- Interpretation of standard outputs
Where this is risky:
- Cutting-edge methods (AI's knowledge cutoff matters; new tools/methods may be missing or misunderstood)
- Best practices that have shifted recently
- Edge cases in your specific data type
Workflow 5 — Machine learning
Using AI to write ML code, choose models, or interpret results.
Where this is gold:
- Standard sklearn / PyTorch / TensorFlow boilerplate
- Choice between well-known model families for standard tasks
- Hyperparameter tuning patterns
Where this is risky:
- Bespoke biology applications (AI may suggest models that work generally but not for your specific data structure)
- Interpretation of feature importance and biological meaning (technical correctness ≠ biological meaning)
- Reproducibility (ML pipelines have many random-seed and version dependencies that AI may not fully capture)
Workflow 6 — Writing assistance
Drafting papers, methods sections, vignettes, documentation.
Where this is gold:
- Methods sections that describe what your code does
- Documentation of pipelines and analyses
- Translating between technical and non-technical audiences
- Boilerplate sections (data availability, code availability)
Where this is risky:
- Discussion sections where biological interpretation matters
- Claims about your specific findings (AI may extrapolate beyond what your data supports)
Workflow 7 — Cross-domain translation
Communicating between computational and experimental colleagues.
Where this is gold:
- Explaining computational results to bench scientists
- Translating biological questions into computational specifications
- Producing slide decks and summaries for non-computational audiences
Where this is risky:
- Oversimplifying in ways that lose important nuance
- Producing explanations that sound right but are subtly wrong
Workflow 8 — Literature review for methods
Specifically: finding the right method for a problem.
Where this is gold:
- "What approaches have been used for [problem]?" — useful starting point
- Comparing approaches across papers you provide
- Identifying methodological lineage
Where this is risky:
- Recent methods (knowledge cutoff)
- Fabricated method citations (same issue as Module 04A Lesson 02)
03 · The data classification challenge for computational biology
Your data is often more obviously sensitive than a bench scientist's data:
Tier 1 — Public: Published datasets (GEO, SRA, TCGA, etc.), public reference genomes, your own published work.
Tier 2 — Internal: Internal SOPs, methods documentation, training materials.
Tier 3 — Confidential: Pre-publication analyses, internal pipelines and code, target/program-relevant analyses, competitive intelligence interpretations.
Tier 4 — Restricted: Patient-derived sequencing data (always — even when "de-identified"), pre-IND target data, sponsor-confidential data, anything covered by DUAs (Data Use Agreements).
Tier 5 — Prohibited: Trade-secret algorithms, patient identifying information mixed with sequencing data, anything covered by NDAs with external processing prohibitions.
The computational-biology-specific risks:
- Genomic data is identifying. Even "de-identified" sequencing data can re-identify individuals through unique variant combinations. Treat genomic data as Tier 4 by default.
- Code is IP. Your analysis code may be considered confidential business information. Pasting large blocks of internal code into consumer AI may violate your company's IP policies.
- Models trained on internal data are sensitive. If you've trained an ML model on internal data, the model itself may "encode" that data. Sharing the model is similar to sharing the data.
- Sample IDs and metadata can be identifying. Sample 12345 may not look identifying, but combined with publicly known cohort information, it may be.
The computational biologist's default: Enterprise AI environments for everything beyond clearly public work. Consider on-premises or dedicated environments for patient-derived data.
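One way to keep the tiers from staying abstract is to encode them as a pre-flight check in your own tooling. The sketch below is hypothetical: the tier names follow this lesson, but the environment strings, and the choice to block Tier 5 entirely, are assumptions. Your company's policy is the authority, not this code.

```python
from enum import IntEnum


class Tier(IntEnum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4
    PROHIBITED = 5


def effective_tier(declared: Tier, patient_derived_genomic: bool) -> Tier:
    """Apply the lesson's default: patient-derived genomic data is at least Tier 4."""
    if patient_derived_genomic:
        return max(declared, Tier.RESTRICTED)
    return declared


def allowed_environment(tier: Tier) -> str:
    """Map a tier to an environment (assumed mapping, for illustration only)."""
    if tier == Tier.PUBLIC:
        return "any approved AI tool"
    if tier in (Tier.INTERNAL, Tier.CONFIDENTIAL):
        return "enterprise AI environment"
    if tier == Tier.RESTRICTED:
        return "on-premises or dedicated environment"
    return "no AI processing"


# "De-identified" patient sequencing data declared as Internal is still Tier 4.
tier = effective_tier(Tier.INTERNAL, patient_derived_genomic=True)
print(tier.name, "->", allowed_environment(tier))
```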
04 · The verification habits for computational work
Different from bench R&D because the verification opportunities are different.
Code verification
Every piece of AI-generated code should be:
- Read carefully before running. Don't paste and execute. Read for obvious issues.
- Tested on known inputs. Run on a small case where you know the expected output.
- Sanity-checked against alternative implementations. If you've done similar work before, does this match?
- Tested on edge cases. Empty input, single-row input, missing values, large inputs.
- Reviewed for subtle bugs. Off-by-one errors, coordinate system mismatches, type coercion issues.
The discipline: AI-generated code that runs cleanly is not yet verified. Running and verifying are separate steps.
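As an illustration of "tested on known inputs" and "tested on edge cases", the sketch below treats a small GC-content helper as if it were AI-drafted and then checks it against hand-computed answers. Both the helper and the specific checks are hypothetical examples; the point is the shape of the checks, not this particular function.

```python
def gc_content(seq: str) -> float:
    """Fraction of G/C among unambiguous bases (stand-in for AI-drafted code)."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence has no unambiguous bases")
    return sum(base in "GC" for base in counted) / len(counted)


def run_checks() -> None:
    # Known inputs with answers you can compute by hand.
    assert gc_content("GGCC") == 1.0
    assert gc_content("ATGC") == 0.5
    # Lowercase input and ambiguous bases: N is skipped, not counted as A/T.
    assert gc_content("atgcNN") == 0.5
    # Edge case: empty input should fail loudly, not return 0 or NaN.
    try:
        gc_content("")
    except ValueError:
        pass
    else:
        raise AssertionError("empty input should raise ValueError")


run_checks()
```

Writing the checks before reading the AI's implementation in detail helps, because it forces you to decide what the correct behavior is for ambiguous bases and empty input rather than accepting whatever the draft happens to do.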
Statistical verification
For any statistical analysis AI helps with:
- Verify the chosen test is appropriate. Ask the AI to justify; check the justification.
- Verify assumptions are met. Normality, independence, equal variances — whatever the test requires.
- Verify the implementation is correct. Compare AI-generated code's output to a known-good implementation.
- Verify the interpretation. "Significant" doesn't mean "biologically meaningful"; the AI may conflate them.
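For "verify the implementation is correct", a tiny hand-computable case is often enough to catch a wrong variance denominator or a pooled-variance formula where an unequal-variance one was intended. The sketch below uses Welch's t statistic as the example; `welch_t` stands in for AI-drafted code and is not from any library. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` is one known-good implementation to compare against.

```python
import math
import statistics


def welch_t(a: list[float], b: list[float]) -> float:
    """Welch's t statistic for two samples with possibly unequal variances."""
    var_a = statistics.variance(a)  # sample variance, n - 1 denominator
    var_b = statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


# Hand-computed reference for a tiny case:
#   mean(a) = 2, var(a) = 1; mean(b) = 4, var(b) = 4
#   t = (2 - 4) / sqrt(1/3 + 4/3)
expected = -2 / math.sqrt(5 / 3)
assert math.isclose(welch_t([1, 2, 3], [2, 4, 6]), expected)
```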
Methodology verification
For bioinformatics methodology suggestions:
- Check recency. AI's knowledge cutoff may mean newer best practices are missing.
- Check tool availability and version. AI may reference tool versions that don't exist or that have known issues.
- Cross-reference with method papers. AI should cite the method's original paper; verify it.
Reproducibility verification
A specific category: AI-generated code that works once but isn't reproducible.
- Pin all dependencies. AI may write `pip install numpy` without a version. Specify the version.
- Set all random seeds. AI may forget; you need to remember.
- Capture compute environment. OS, hardware (for GPU work), library versions.
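The seed and environment habits above can be folded into a small helper that runs at the top of an analysis. This is a sketch using only the Python standard library; the function name and the package list are illustrative. Note that it seeds only Python's built-in `random` module: libraries such as NumPy or PyTorch keep their own random state and need to be seeded separately.

```python
import json
import platform
import random
import sys
from importlib import metadata


def capture_environment(packages: list[str], seed: int) -> dict:
    """Seed the stdlib RNG and return a record of the compute environment."""
    random.seed(seed)
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "seed": seed,
        "python": sys.version.split()[0],
        "os": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


record = capture_environment(["numpy", "pandas"], seed=20240101)
print(json.dumps(record, indent=2))
```

Saving this record next to your results gives you something concrete to compare when a rerun produces different numbers.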
05 · The "AI wrote the code, but did it understand the problem?" check
A specific failure mode for computational biology: AI writes code that correctly does something, but not the thing you actually needed.
Example: you ask for differential expression analysis. AI writes DESeq2 code. The code runs. Results come out. But:
- The model design didn't account for a batch variable in your experimental design
- The filtering threshold for low-expression genes was inappropriate for your sample type
- The contrast wasn't set up the way you wanted
The code is correct. The analysis is wrong.
Defense
After AI writes analysis code, before relying on the results, walk through:
- Does the code solve the actual question? Not "does it run" but "does it answer what you needed answered?"
- Did the code account for your experimental design? Batch variables, covariates, paired samples, etc.
- Are the parameters appropriate for your data? Filtering thresholds, normalization choices, statistical thresholds.
- Are the outputs interpreted correctly? Effect direction, magnitude, biological meaning.
This is where computational biology training is irreplaceable. AI can write the code; you still need to know what code should be written. The "AI replaces computational biologists" narrative is wrong precisely because of this gap.
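Part of the experimental-design check can be automated. The sketch below flags metadata columns that vary across samples but are missing from the model design, which is how an unmodeled batch variable would show up. The function, column names, and sample sheet are all hypothetical, and a flagged column is a prompt to think, not proof that the design is wrong.

```python
def unmodeled_covariates(
    samples: list[dict[str, str]],
    design_terms: set[str],
    ignore: set[str],
) -> list[str]:
    """Return metadata columns that vary across samples but are not in the design."""
    columns = set().union(*(s.keys() for s in samples)) - ignore
    flagged = []
    for column in sorted(columns):
        levels = {s.get(column) for s in samples}
        if len(levels) > 1 and column not in design_terms:
            flagged.append(column)
    return flagged


# Toy sample sheet: dose is modeled, batch is not.
sample_sheet = [
    {"sample_id": "s1", "dose": "0", "batch": "A"},
    {"sample_id": "s2", "dose": "0", "batch": "B"},
    {"sample_id": "s3", "dose": "10", "batch": "A"},
    {"sample_id": "s4", "dose": "10", "batch": "B"},
]
print(unmodeled_covariates(sample_sheet, design_terms={"dose"}, ignore={"sample_id"}))
# prints ['batch']
```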
06 · A worked example
A realistic scenario walked through end-to-end.
Setting: You're a computational biologist analyzing bulk RNA-seq from a small dose-response study (3 conditions × 4 biological replicates × 1 time point). You need to identify dose-responsive genes, characterize their biology, and produce a report for the program team.
Step 1 — Pipeline planning
Prompt Claude Enterprise:
"Senior computational biologist with experience in pharmacology RNA-seq analyses. I have a dose-response RNA-seq experiment: 3 doses × 4 biological replicates × 1 time point (12 samples total) of [cell line] treated with [compound]. I want to identify dose-responsive genes and characterize them. Walk me through:
- The appropriate analysis pipeline
- The key choices I need to make (and your recommendations with reasoning)
- The tools and packages I should use
- The downstream characterization analyses worth running"
AI returns a structured plan: STAR/Salmon for quantification, DESeq2 for differential expression with a dose-as-continuous model, separate analysis with dose-as-factor for follow-up, GSEA for pathway analysis, optional clustering. Reasoning for each.
You verify the plan against current best practices and your prior experience. Approve.
Step 2 — Code drafting
Prompt for each pipeline step. For DESeq2 specifically:
"Generate R code for DESeq2 analysis of the following design: 12 samples, 3 dose levels (continuous variable: 0, 1, 10 µM), 4 replicates per dose. Goals: (1) identify dose-responsive genes using dose as a continuous variable, (2) also test each dose vs. vehicle separately, (3) output a clean results table for each comparison, (4) include diagnostic plots (PCA, dispersion estimate, MA plots).
Use latest DESeq2 conventions. Include comments explaining each step. Pin the DESeq2 version."
You get ~80 lines of R. Read carefully. Two issues:
- AI forgot to set the reference level for the dose factor in the dose-as-factor model
- AI's PCA code uses non-variance-stabilized counts (should use vst)
Fix both in the code. Run.
Step 3 — Verify outputs
- Sample sizes correct? Yes, 12.
- PCA shows expected dose-related structure? Yes.
- Dispersion estimate looks reasonable? Yes.
- DE gene counts plausible? Yes, ~2,000 dose-responsive genes at FDR < 0.05 — in line with what we'd expect.
- Top genes biologically reasonable? Spot-check 10 top genes against known biology of the compound class. 8 of 10 make sense; 2 are unexpected (interesting, worth follow-up).
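Some of these output checks can be scripted so they run every time the analysis is regenerated. The sketch below counts total, untested, and significant genes in a results table exported as CSV. It assumes a `padj` column in which untested genes appear as `NA` or blank, as in a typical DESeq2 export; the table shown is toy data with made-up gene names, not results from this study.

```python
import csv
import io


def count_significant(results_csv: str, alpha: float = 0.05) -> dict[str, int]:
    """Count total, untested (NA padj), and significant genes in a results table."""
    counts = {"total": 0, "untested": 0, "significant": 0}
    for row in csv.DictReader(io.StringIO(results_csv)):
        counts["total"] += 1
        padj = row["padj"].strip()
        if padj in ("", "NA"):
            counts["untested"] += 1  # e.g. filtered out before testing
        elif float(padj) < alpha:
            counts["significant"] += 1
    return counts


toy_results = """gene,log2FoldChange,padj
geneA,2.1,0.001
geneB,-0.3,0.40
geneC,1.2,NA
geneD,-1.8,0.049
"""
print(count_significant(toy_results))
```

Tracking the untested count alongside the significant count matters because a silent change in filtering can shift the number of DE genes without any change in the biology.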
Step 4 — Downstream characterization
Prompt for GSEA code, clustering code, and visualization code. Run each. Verify outputs.
Step 5 — Report drafting
Prompt for a report structure:
"Senior computational biologist drafting a report for the program team (mixed audience: chemists, biologists, project leads). Report should cover:
- Study design and quality
- Differential expression overview
- Pathway findings
- Gene-level highlights
- Limitations
- Recommended follow-up
Format: 5-7 pages with figures. Tone: precise but accessible. Each section opens with the bottom line; supporting detail follows."
AI drafts. You revise voice and add specific interpretations the AI couldn't have known (e.g., this finding aligns with what we saw in the in vivo data; this gene is the same one X et al. flagged).
Step 6 — Documentation
Save the analysis as a reproducible R Markdown document. Note AI use in the document header. Save the prompts to your prompt library.
Total time: ~3 days. Without AI: ~7-10 days. Quality is comparable or better. The major time savings: code drafting (saves ~50% of code time) and report writing (saves ~70% of writing time).
07 · What you'll build through this module
The remaining lessons in Module 04B:
- Lesson 02 · Code, debugging, and pipeline construction — the highest-volume workflow
- Lesson 03 · Methodology and statistical reasoning — where domain judgment matters most
- Lesson 04 · Communication and writing assistance — translating computational work for the rest of the org
- Lesson 05 · Capstone — your computational biologist's AI playbook
Each lesson is structured like Module 04A's: specific role definitions, prompt templates, worked examples, verification protocols, exercises.
08 · Self-assessment
Quick assessment of where you are now on the eight workflows.
| Workflow | Current state (N/B/R) |
|---|---|
| Code drafting and debugging | |
| Pipeline design and orchestration | |
| Statistical methodology consultation | |
| Bioinformatics methodology | |
| Machine learning | |
| Writing assistance | |
| Cross-domain translation | |
| Literature review for methods | |
(N = don't use AI; B = use occasionally; R = use routinely)
Workflows marked N or B are your highest-leverage targets for this module.
09 · Knowledge check
Three questions.
Q1. What's the most accurate characterization of AI's role in computational biology?
a) AI replaces computational biologists for most tasks
b) AI accelerates the workflow significantly (writing code, drafting pipelines, helping with writing), but the domain judgment about what to do and how to interpret remains the human's responsibility
c) AI is not useful for computational biology
d) AI is only useful for ML, not for traditional bioinformatics
Q2. Why is genomic data treated as Tier 4 (Restricted) even when "de-identified"?
a) HIPAA requires it
b) Genomic data can re-identify individuals through unique variant combinations, even when conventional identifiers are removed
c) Genomic files are too large for most AI tools
d) Genomic data is always Tier 5
Q3. What's the verification step most often missed when relying on AI-generated analysis code?
a) Checking that the code runs
b) Verifying that the code solves the actual scientific question (accounting for experimental design, appropriate parameters, and biological meaning), not just that it runs and produces output
c) Updating the documentation
d) Testing on production data
Answers: Q1: b · Q2: b · Q3: b
10 · What's next
Lesson 02 of Module 04B: Code, debugging, and pipeline construction. The workflow you'll use multiple times per day.
End of Module 04B · Lesson 01.