Module 04B · Lesson 01

Computational Biology — Where AI Fits in Your Workflow

Reading time: 20 minutes
Track: Role Path (Computational Biology / Bioinformatics)
Prerequisites: Modules 01, 02, 03 complete
Audience: Computational biologists, bioinformaticians, data scientists in biology, scientists doing -omics analysis, ML scientists in biotech R&D

What this lesson does

You're a computational scientist in biotech. You spend more time in code than at a bench. Your typical day involves data pipelines, statistical analyses, ML models, and figures — interrupted by meetings where you explain results to non-computational colleagues.

AI's value to your work is enormous. It's also different from how AI helps a bench scientist, a clinical operations lead, or a regulatory writer. This module is specific to your work.

By the end of this lesson, you'll be able to:

Identify the eight workflows where AI most reliably saves computational biologists time
Recognize the AI failure modes specific to computational work
Build a personal prompt library for your most common artifacts
Set realistic expectations for what AI does and doesn't do in computational biology

Subsequent lessons go deep on specific workflows. This lesson maps the territory.

01 · The computational biologist's AI reality

The honest assessment of AI in computational biology as of late 2025 / early 2026.

What AI is reliably useful for:

Writing and debugging analysis code
Code review and refactoring
Bioinformatics pipeline construction
Statistical methodology consultation
Translating between programming languages (R↔Python↔shell)
Documentation of code, methods, and analyses
Writing assistance for papers, talks, and reports
Code generation for visualization
Explaining your work to non-computational colleagues
Literature review for methods (which approach is best for X?)
API/library discovery (does X library do Y?)

What AI is not reliably useful for, today:

Independent interpretation of biological data (it pattern-matches to plausible interpretations)
Domain-specific reasoning about your particular system
Producing reliable code without testing
Replacing statistical training (it can write code for any test; it can't reliably tell you which test is right for your data)
Knowing about very recent papers, models, or libraries (knowledge cutoff matters)
Handling truly novel analysis where no analogous code exists in its training

The split is more favorable than for bench scientists — computational work is more directly amenable to AI. But the failure modes are no less serious. Bad analyses get published, bad code gets deployed, and the downstream consequences propagate.

02 · The eight core workflows

Eight workflows account for ~80% of high-value AI use in computational biology. Master these and you've covered most of what matters.

Workflow 1 — Code drafting and debugging

The bread and butter. Given a task description, AI drafts code. Given an error, AI debugs.

Where this is gold:

Standard data manipulation (pandas, dplyr, etc.)
Visualization code (matplotlib, ggplot2, plotly)
Bioinformatics utilities (pyfasta, Biopython, samtools wrappers, etc.)
Pipeline glue (snakemake, nextflow rules)

Where this is risky:

Statistical analyses where the choice of test matters (AI may write code that runs but uses the wrong test)
Performance-critical code (AI tends to write functional but slow implementations)
Code with subtle correctness requirements (e.g., genomic coordinate systems, off-by-one errors)

Lesson 02 covers this workflow in depth.
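
The coordinate-system risk above is concrete enough to sketch. BED intervals are 0-based and half-open, while GFF/GTF and VCF positions are 1-based and inclusive, so conversions between them are a common source of off-by-one errors in generated code. The function names below are hypothetical and exist only for illustration; zero-length features are deliberately not handled.

```python
def bed_to_one_based(start: int, end: int) -> tuple[int, int]:
    """Convert a 0-based, half-open BED interval to 1-based, inclusive."""
    if start < 0 or end <= start:
        raise ValueError(f"invalid BED interval: {start}-{end}")
    # Only the start shifts; the half-open end equals the inclusive end.
    return start + 1, end


def one_based_to_bed(start: int, end: int) -> tuple[int, int]:
    """Convert a 1-based, inclusive interval to 0-based, half-open BED."""
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-based interval: {start}-{end}")
    return start - 1, end


def length_bed(start: int, end: int) -> int:
    """Length of a 0-based, half-open interval."""
    return end - start


def length_one_based(start: int, end: int) -> int:
    """Length of a 1-based, inclusive interval."""
    return end - start + 1


# The first 100 bases of a chromosome in both conventions.
print(bed_to_one_based(0, 100))   # (1, 100)
print(one_based_to_bed(1, 100))   # (0, 100)
print(length_bed(0, 100), length_one_based(1, 100))  # 100 100
```

When reviewing AI-drafted code, a round-trip check like this (convert, convert back, compare lengths) is a cheap way to catch a silent shift of one base.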

Workflow 2 — Pipeline design and workflow orchestration

Designing multi-step analyses that run reliably and reproducibly. Includes:

Snakemake / Nextflow pipeline structure
Containerization (Docker, Singularity)
Configuration management
Logging and error handling

Where this is gold:

Adapting existing pipeline patterns to your data
Boilerplate for common steps
Documentation of pipeline behavior

Where this is risky:

Performance tuning (AI doesn't know your cluster)
Edge case handling (AI's default is happy-path; your data has edge cases)

Workflow 3 — Statistical methodology consultation

Asking AI which statistical approach to use and how to interpret its results.

Where this is gold:

"What's the appropriate test for [setup]?" — useful for orienting, then verify with a statistician or methods paper
Explaining unfamiliar statistical methods (e.g., what is mixed-effects modeling actually doing?)
Identifying assumptions of methods you're using

Where this is risky:

Multiple comparisons handling (AI may underweight this)
Specific test choice in your context (AI's generic advice may not apply)
Causal inference (AI can confuse association and causation)
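
To make the multiple-comparisons risk concrete, here is a minimal sketch of the Benjamini-Hochberg adjustment in plain Python. It is an illustration of the technique, not a replacement for the implementation in your statistics package, and the p-values are made-up toy numbers.

```python
def benjamini_hochberg(pvalues: list[float]) -> list[float]:
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    if m == 0:
        return []
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.001, 0.03, 0.04, 0.045, 0.6]  # toy p-values, not real data
adj = benjamini_hochberg(raw)
print(sum(p < 0.05 for p in raw))  # 4
print(sum(q < 0.05 for q in adj))  # 1
```

Four of five tests look significant before adjustment and one survives it. If AI-drafted code reports raw p-values against a 0.05 threshold across thousands of genes, this is the gap you are looking for.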

Workflow 4 — Bioinformatics methodology

Specifically: how to do common -omics tasks (RNA-seq, ChIP-seq, single-cell, variant calling, etc.).

Where this is gold:

Standard pipeline construction for well-established methods
Choice between popular tools (e.g., DESeq2 vs. edgeR, STAR vs. HISAT2)
Interpretation of standard outputs

Where this is risky:

Cutting-edge methods (AI's knowledge cutoff matters; new tools/methods may be missing or misunderstood)
Best practices that have shifted recently
Edge cases in your specific data type

Workflow 5 — Machine learning

Using AI to write ML code, choose models, or interpret results.

Where this is gold:

Standard sklearn / PyTorch / TensorFlow boilerplate
Choice between well-known model families for standard tasks
Hyperparameter tuning patterns

Where this is risky:

Bespoke biology applications (AI may suggest models that work generally but not for your specific data structure)
Interpretation of feature importance and biological meaning (technical correctness ≠ biological meaning)
Reproducibility (ML pipelines have many random-seed and version dependencies that AI may not fully capture)

Workflow 6 — Writing assistance

Drafting papers, methods sections, vignettes, documentation.

Where this is gold:

Methods sections that describe what your code does
Documentation of pipelines and analyses
Translating between technical and non-technical audiences
Boilerplate sections (data availability, code availability)

Where this is risky:

Discussion sections where biological interpretation matters
Claims about your specific findings (AI may extrapolate beyond what your data supports)

Workflow 7 — Cross-domain translation

Communicating between computational and experimental colleagues.

Where this is gold:

Explaining computational results to bench scientists
Translating biological questions into computational specifications
Producing slide decks and summaries for non-computational audiences

Where this is risky:

Oversimplifying in ways that lose important nuance
Producing explanations that sound right but are subtly wrong

Workflow 8 — Literature review for methods

Specifically: finding the right method for a problem.

Where this is gold:

"What approaches have been used for [problem]?" — useful starting point
Comparing approaches across papers you provide
Identifying methodological lineage

Where this is risky:

Recent methods (knowledge cutoff)
Fabricated method citations (same issue as Module 04A Lesson 02)

03 · The data classification challenge for computational biology

Your data is often more obviously sensitive than a bench scientist's data:

Tier 1 — Public: Published datasets (GEO, SRA, TCGA, etc.), public reference genomes, your own published work.

Tier 2 — Internal: Internal SOPs, methods documentation, training materials.

Tier 3 — Confidential: Pre-publication analyses, internal pipelines and code, target/program-relevant analyses, competitive intelligence interpretations.

Tier 4 — Restricted: Patient-derived sequencing data (always — even when "de-identified"), pre-IND target data, sponsor-confidential data, anything covered by DUAs (Data Use Agreements).

Tier 5 — Prohibited: Trade-secret algorithms, patient identifying information mixed with sequencing data, anything covered by NDAs with external processing prohibitions.

The computational-biology-specific risks:

Genomic data is identifying. Even "de-identified" sequencing data can re-identify individuals through unique variant combinations. Treat genomic data as Tier 4 by default.
Code is IP. Your analysis code may be considered confidential business information. Pasting large blocks of internal code into consumer AI may violate your company's IP policies.
Models trained on internal data are sensitive. If you've trained an ML model on internal data, the model itself may "encode" that data. Sharing the model is similar to sharing the data.
Sample IDs and metadata can be identifying. An ID like "Sample 12345" may not look identifying, but combined with publicly known cohort information, it may be.

The computational biologist's default: Enterprise AI environments for everything beyond clearly public work. Consider on-premises or dedicated environments for patient-derived data.

04 · The verification habits for computational work

These habits differ from the ones for bench R&D because the verification opportunities are different: code can be read, rerun, and tested.

Code verification

Every piece of AI-generated code should be:

Read carefully before running. Don't paste and execute. Read for obvious issues.
Tested on known inputs. Run on a small case where you know the expected output.
Sanity-checked against alternative implementations. If you've done similar work before, does this match?
Tested on edge cases. Empty input, single-row input, missing values, large inputs.
Reviewed for subtle bugs. Off-by-one errors, coordinate system mismatches, type coercion issues.

The discipline: AI-generated code that runs cleanly is not yet verified. Running and verifying are separate steps.
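
A minimal sketch of what "tested on known inputs" and "tested on edge cases" can look like in practice. The reverse_complement function stands in for any AI-drafted utility; both it and the check function are hypothetical examples, not part of any library.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA sequence; reject characters outside ACGTN."""
    invalid = set(seq) - set("ACGTNacgtn")
    if invalid:
        raise ValueError(f"unexpected characters: {sorted(invalid)}")
    return seq.translate(_COMPLEMENT)[::-1]


def check_reverse_complement() -> None:
    # Known input with a hand-derived answer.
    assert reverse_complement("ATGC") == "GCAT"
    # Edge cases: empty input, single base, ambiguous base, mixed case.
    assert reverse_complement("") == ""
    assert reverse_complement("A") == "T"
    assert reverse_complement("ANT") == "ANT"
    assert reverse_complement("acGT") == "ACgt"
    # Large input plus a property check: applying it twice is the identity.
    seq = "GATTACA" * 1000
    assert reverse_complement(reverse_complement(seq)) == seq
    # Invalid input should fail loudly, not pass through silently.
    try:
        reverse_complement("ATGX")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for an invalid base")


check_reverse_complement()
print("all checks passed")
```

The checks take a few minutes to write and separate "runs" from "verified" for this one function. The same pattern applies to parsers, filters, and coordinate arithmetic.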

Statistical verification

For any statistical analysis AI helps with:

Verify the chosen test is appropriate. Ask the AI to justify; check the justification.
Verify assumptions are met. Normality, independence, equal variances — whatever the test requires.
Verify the implementation is correct. Compare AI-generated code's output to a known-good implementation.
Verify the interpretation. "Significant" doesn't mean "biologically meaningful"; the AI may conflate them.
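
One way to compare an implementation against a known-good one, sketched with the standard library only. The sd_candidate function is a hypothetical AI-drafted standard deviation that divides by n; the reference is Python's statistics module. This particular mismatch comes up when translating between languages, because R's sd() and pandas' .std() divide by n - 1 while numpy.std divides by n by default.

```python
import math
import statistics


def sd_candidate(values: list[float]) -> float:
    """Hypothetical AI-drafted SD: divides by n (population SD)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def matches_reference(candidate, reference, data, rel_tol=1e-9) -> bool:
    """True if candidate and reference agree on this data within tolerance."""
    return math.isclose(candidate(data), reference(data), rel_tol=rel_tol)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # toy values
# Agrees with the population SD ...
print(matches_reference(sd_candidate, statistics.pstdev, data))  # True
# ... but not with the sample SD most analyses expect.
print(matches_reference(sd_candidate, statistics.stdev, data))   # False
```

The candidate is not buggy in isolation; it computes a different quantity than the one you probably wanted. Only a comparison against a trusted reference on the same data makes that visible.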

Methodology verification

For bioinformatics methodology suggestions:

Check recency. AI's knowledge cutoff may mean newer best practices are missing.
Check tool availability and version. AI may reference tool versions that don't exist or that have known issues.
Cross-reference with method papers. Ask the AI to cite the method's original paper, then confirm that the paper exists and describes the method as claimed.

Reproducibility verification

A specific category: AI-generated code that works once but isn't reproducible.

Pin all dependencies. AI may write pip install numpy without a version. Specify the version.
Set all random seeds. AI may forget; you need to remember.
Capture compute environment. OS, hardware (for GPU work), library versions.
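
The seed and environment habits above can be sketched in a few lines of standard-library Python. The function names are hypothetical, and the package list is whatever your analysis actually imports. Only Python's own random module is seeded here; NumPy, PyTorch, and similar libraries keep separate random state and need their own seeding calls.

```python
import json
import platform
import random
import sys
from importlib import metadata


def set_seeds(seed: int) -> None:
    """Seed Python's built-in RNG. Other libraries need their own calls."""
    random.seed(seed)


def capture_environment(packages: list[str]) -> dict:
    """Record interpreter, OS, and installed versions of the named packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


set_seeds(20240101)  # arbitrary example seed
env = capture_environment(["numpy", "pandas"])  # example package names
print(json.dumps(env, indent=2))
```

Writing this record next to every results file gives you the versions to pin the next time the analysis has to be rerun.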

05 · The "AI wrote the code, but did it understand the problem?" check

A specific failure mode for computational biology: AI writes code that correctly does something, but not the thing you actually needed.

Example: you ask for differential expression analysis. AI writes DESeq2 code. The code runs. Results come out. But:

The model design didn't account for a batch variable in your experimental design
The filtering threshold for low-expression genes was inappropriate for your sample type
The contrast wasn't set up the way you wanted

The code is correct. The analysis is wrong.
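
A toy illustration of the first bullet, in plain Python rather than DESeq2. The numbers are constructed, not real data: the true treatment effect is +1, batch B adds +5 to every sample, and treated samples are over-represented in batch B. The per-batch comparison is a simple stand-in for including batch in the model design (in DESeq2, a design along the lines of ~ batch + condition).

```python
from statistics import mean

# (batch, condition, expression): constructed toy values.
samples = [
    ("A", "control", 0.0), ("A", "control", 0.0), ("A", "control", 0.0),
    ("A", "treated", 1.0),
    ("B", "control", 5.0),
    ("B", "treated", 6.0), ("B", "treated", 6.0), ("B", "treated", 6.0),
]


def naive_effect(rows) -> float:
    """Treated mean minus control mean, ignoring batch."""
    treated = [x for _, cond, x in rows if cond == "treated"]
    control = [x for _, cond, x in rows if cond == "control"]
    return mean(treated) - mean(control)


def batch_adjusted_effect(rows) -> float:
    """Average of the within-batch treated-minus-control differences."""
    batches = sorted({batch for batch, _, _ in rows})
    return mean(naive_effect([r for r in rows if r[0] == b]) for b in batches)


print(naive_effect(samples))           # 3.5
print(batch_adjusted_effect(samples))  # 1.0
```

Both functions run cleanly and both return a number. Only one of them answers the question that was asked, and nothing in the output tells you which.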

Defense

After AI writes analysis code, before relying on the results, walk through:

Does the code solve the actual question? Not "does it run" but "does it answer what you needed answered?"
Did the code account for your experimental design? Batch variables, covariates, paired samples, etc.
Are the parameters appropriate for your data? Filtering thresholds, normalization choices, statistical thresholds.
Are the outputs interpreted correctly? Effect direction, magnitude, biological meaning.

This is where computational biology training is irreplaceable. AI can write the code; you still need to know what code should be written. The "AI replaces computational biologists" narrative is wrong precisely because of this gap.

06 · A worked example

A realistic scenario walked through end-to-end.

Setting: You're a computational biologist analyzing bulk RNA-seq from a small dose-response study (3 conditions × 4 biological replicates × 1 time point). You need to identify dose-responsive genes, characterize their biology, and produce a report for the program team.

Step 1 — Pipeline planning

Prompt Claude Enterprise:

"Senior computational biologist with experience in pharmacology RNA-seq analyses. I have a dose-response RNA-seq experiment: 3 doses × 4 biological replicates × 1 time point (12 samples total) of [cell line] treated with [compound]. I want to identify dose-responsive genes and characterize them. Walk me through:

The appropriate analysis pipeline

The key choices I need to make (and your recommendations with reasoning)

The tools and packages I should use

The downstream characterization analyses worth running"

AI returns a structured plan: STAR or Salmon for alignment and quantification, DESeq2 for differential expression with a dose-as-continuous model, a separate dose-as-factor analysis for follow-up, GSEA for pathway analysis, and optional clustering. Each choice comes with reasoning.

You verify the plan against current best practices and your prior experience. Approve.

Step 2 — Code drafting

Prompt for each pipeline step. For DESeq2 specifically:

"Generate R code for DESeq2 analysis of the following design: 12 samples, 3 dose levels (continuous variable: 0, 1, 10 µM), 4 replicates per dose. Goals: (1) identify dose-responsive genes using dose as a continuous variable, (2) also test each dose vs. vehicle separately, (3) output a clean results table for each comparison, (4) include diagnostic plots (PCA, dispersion estimate, MA plots).

Use latest DESeq2 conventions. Include comments explaining each step. Pin the DESeq2 version."

You get ~80 lines of R. Read carefully. Two issues:

AI forgot to set the reference level for the dose factor in the dose-as-factor model
AI's PCA code uses non-variance-stabilized counts (should use vst)

Fix both in the code. Run.
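
Why the missing reference level matters, sketched in Python as a stand-in for R's behavior. By default R orders factor levels alphabetically and DESeq2 compares against the first level, so dose labels can silently put the wrong group in the reference slot. The labels below are hypothetical, and Python's string sort is only an approximation of R's locale-dependent ordering.

```python
def default_levels(labels: list[str]) -> list[str]:
    """Mimic alphabetical level ordering: unique labels sorted as strings."""
    return sorted(set(labels))


def explicit_levels(labels: list[str], order: list[str]) -> list[str]:
    """Order levels as declared, failing if a label has no declared level."""
    present = set(labels)
    missing = present - set(order)
    if missing:
        raise ValueError(f"labels without a declared level: {sorted(missing)}")
    return [level for level in order if level in present]


doses = ["vehicle", "1uM", "10uM"] * 4  # 12 samples, hypothetical labels

# Alphabetical ordering makes the highest dose the reference.
print(default_levels(doses)[0])                                # 10uM
# Declaring the order puts vehicle first, as intended.
print(explicit_levels(doses, ["vehicle", "1uM", "10uM"])[0])   # vehicle
```

With the default ordering, every "dose vs. reference" contrast would be computed against the top dose, and the signs of the fold changes would point the wrong way for the question being asked.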

Step 3 — Verify outputs

Sample sizes correct? Yes, 12.
PCA shows expected dose-related structure? Yes.
Dispersion estimate looks reasonable? Yes.
DE gene counts plausible? Yes, ~2,000 dose-responsive genes at FDR < 0.05 — in line with what we'd expect.
Top genes biologically reasonable? Spot-check 10 top genes against known biology of the compound class. 8 of 10 make sense; 2 are unexpected (interesting, worth follow-up).
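
Some of these checks can be scripted so they run the same way every time. The sketch below assumes the DESeq2 results were exported to CSV with a gene column plus DESeq2's log2FoldChange and padj columns; the gene names and values are placeholders, and the function name is hypothetical.

```python
import csv
import io


def summarize_results(handle, alpha: float = 0.05) -> dict:
    """Count tested and significant genes; padj of NA means not tested."""
    counts = {"total": 0, "tested": 0, "significant": 0, "up": 0, "down": 0}
    for row in csv.DictReader(handle):
        counts["total"] += 1
        if row["padj"] in ("", "NA"):
            continue  # excluded from testing, so not counted as tested
        counts["tested"] += 1
        if float(row["padj"]) < alpha:
            counts["significant"] += 1
            direction = "up" if float(row["log2FoldChange"]) > 0 else "down"
            counts[direction] += 1
    return counts


example = io.StringIO(
    "gene,log2FoldChange,padj\n"
    "geneA,2.1,0.001\n"
    "geneB,-1.4,0.03\n"
    "geneC,0.2,0.6\n"
    "geneD,0.9,NA\n"
)
print(summarize_results(example))
# {'total': 4, 'tested': 3, 'significant': 2, 'up': 1, 'down': 1}
```

A summary like this does not tell you whether ~2,000 genes is the right answer, but it does surface surprises such as a large share of NA values or an implausibly one-sided up/down split.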

Step 4 — Downstream characterization

Prompt for GSEA code, clustering code, and visualization code. Run each. Verify outputs.

Step 5 — Report drafting

Prompt for a report structure:

"Senior computational biologist drafting a report for the program team (mixed audience: chemists, biologists, project leads). Report should cover:

Study design and quality

Differential expression overview

Pathway findings

Gene-level highlights

Limitations

Recommended follow-up

Format: 5-7 pages with figures. Tone: precise but accessible. Each section opens with the bottom line; supporting detail follows."

AI drafts. You revise voice and add specific interpretations the AI couldn't have known (e.g., this finding aligns with what we saw in the in vivo data; this gene is the same one X et al. flagged).

Step 6 — Documentation

Save the analysis as a reproducible R Markdown document. Note AI use in the document header. Save the prompts to your prompt library.
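
A prompt library can be as simple as a file of named templates. The sketch below is one possible shape, using only the standard library; the class, the field names, and the library key are all hypothetical, and the template text is adapted from the prompts in this example.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class PromptTemplate:
    """One reusable prompt with $placeholders for the parts that change."""
    name: str
    role: str
    task: str

    def render(self, **fields: str) -> str:
        # substitute() raises KeyError if a placeholder is left unfilled,
        # which catches half-completed prompts before they are sent.
        return f"{self.role} {Template(self.task).substitute(**fields)}"


LIBRARY = {
    "deseq2_dose_response": PromptTemplate(
        name="deseq2_dose_response",
        role="Senior computational biologist with experience in "
             "pharmacology RNA-seq analyses.",
        task="Generate R code for DESeq2 analysis of the following design: "
             "$n_samples samples, $n_doses dose levels, $n_reps replicates "
             "per dose. Include comments explaining each step. "
             "Pin the DESeq2 version.",
    ),
}

prompt = LIBRARY["deseq2_dose_response"].render(
    n_samples="12", n_doses="3", n_reps="4"
)
print(prompt)
```

Keeping the templates in version control alongside the analysis code also gives you a record of which prompt produced which draft.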

Total time: ~3 days. Without AI: ~7-10 days. Quality is comparable or better. The major time savings: code drafting (saves ~50% of code time) and report writing (saves ~70% of writing time).

07 · What you'll build through this module

The remaining lessons in Module 04B:

Lesson 02 · Code, debugging, and pipeline construction — the highest-volume workflow
Lesson 03 · Methodology and statistical reasoning — where domain judgment matters most
Lesson 04 · Communication and writing assistance — translating computational work for the rest of the org
Lesson 05 · Capstone — your computational biologist's AI playbook

Each lesson is structured like Module 04A's: specific role definitions, prompt templates, worked examples, verification protocols, exercises.

08 · Self-assessment

Quick assessment of where you are now on the eight workflows.

Workflow	Current state (N/B/R)
Code drafting and debugging	___
Pipeline design and orchestration	___
Statistical methodology consultation	___
Bioinformatics methodology	___
Machine learning	___
Writing assistance	___
Cross-domain translation	___
Literature review for methods	___

(N = don't use AI; B = use occasionally; R = use routinely)

Workflows marked N or B are your highest-leverage targets for this module.

09 · Knowledge check

Three questions.

Q1. What's the most accurate characterization of AI's role in computational biology?

a) AI replaces computational biologists for most tasks
b) AI accelerates the workflow significantly (writing code, drafting pipelines, helping with writing), but the domain judgment about what to do and how to interpret remains the human's responsibility
c) AI is not useful for computational biology
d) AI is only useful for ML, not for traditional bioinformatics

Q2. Why is genomic data treated as Tier 4 (Restricted) even when "de-identified"?

a) HIPAA requires it
b) Genomic data can re-identify individuals through unique variant combinations, even when conventional identifiers are removed
c) Genomic files are too large for most AI tools
d) Genomic data is always Tier 5

Q3. What's the verification step most often missed when relying on AI-generated analysis code?

a) Checking that the code runs
b) Verifying that the code solves the actual scientific question (accounting for experimental design, appropriate parameters, and biological meaning), not just that it runs and produces output
c) Updating the documentation
d) Testing on production data

Answers: Q1: b · Q2: b · Q3: b

10 · What's next

Lesson 02 of Module 04B: Code, debugging, and pipeline construction. The workflow you'll use multiple times per day.

End of Module 04B · Lesson 01.