HaiPhai.AI Fluency for Biotech

Code, Debugging, and Pipeline Construction — Your Highest-Volume Workflow


Module 04B · Lesson 02


Reading time: 24 minutes
Track: Role Path - Computational Biology
Prerequisites: Module 04B · Lesson 01


What this lesson does

If you're a computational biologist, code is your medium. You write it, debug it, refactor it, glue it into pipelines, hand it off to colleagues, and maintain it for years.

AI changes this workflow more than almost any other in biotech. Coding productivity gains of 30-60% are realistic and sustainable. But AI-generated code that runs is not the same as code that's correct, and the difference matters in biology where downstream conclusions ride on the analysis.

By the end of this lesson, you'll be able to:

  1. Use AI to draft analysis code that you can verify confidently
  2. Debug efficiently using AI as a partner rather than an oracle
  3. Build reproducible pipelines that survive your departure and your colleagues' edits
  4. Recognize the seven specific code failure modes that ruin downstream analyses

This lesson is dense. Take notes.


01 · The five modes of AI-assisted coding

There are five distinct modes of using AI for coding work. Each has its own discipline.

Mode 1 — Draft from a specification

You know what you want. You describe it. AI writes it.

Best for: Clear, bounded tasks. Data manipulation. Visualization. Standard statistical tests. Pipeline boilerplate.

Discipline: Specify clearly (Module 02). Read the code carefully before running. Test on known inputs.
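
Testing on known inputs can start with cases small enough to compute by hand. A minimal sketch, assuming a hypothetical AI-drafted helper named log2_fold_change with a pseudocount of 1 (both the name and the pseudocount are illustrative, not part of this lesson's materials):

```python
import math


def log2_fold_change(treated_mean, control_mean, pseudocount=1.0):
    """Hypothetical AI-drafted helper: log2 ratio of two mean counts."""
    return math.log2((treated_mean + pseudocount) / (control_mean + pseudocount))


# Hand-computed cases: (treated mean, control mean, expected log2FC).
known_cases = [
    (7.0, 3.0, 1.0),   # (7 + 1) / (3 + 1) = 2, and log2(2) = 1
    (3.0, 7.0, -1.0),  # the reverse comparison flips the sign
    (0.0, 0.0, 0.0),   # the pseudocount keeps all-zero input finite
]

for treated, control, expected in known_cases:
    result = log2_fold_change(treated, control)
    assert math.isclose(result, expected, abs_tol=1e-12), (treated, control, result)
```

If any assertion fails, you have found a disagreement between the code and your own arithmetic before it reached real data.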

Mode 2 — Pair programming

You're writing code; AI is in the loop. You write a function signature; AI completes the body. You write a comment describing intent; AI writes the implementation.

Best for: Augmenting your own work in real time. Tools like Cursor, Claude Code, and GitHub Copilot enable this.

Discipline: Don't accept suggestions you don't fully understand. The autocomplete cadence makes this easy to violate.

Mode 3 — Debug a specific error

You have an error. You give AI the code, the error message, and the context. AI diagnoses and proposes fixes.

Best for: Error messages where you have all the context. Stack traces. Unexpected output.

Discipline: Verify the proposed fix actually addresses the root cause, not just the symptom. AI sometimes proposes fixes that mask errors rather than fix them.

Mode 4 — Refactor existing code

You have working code. You want it cleaner, faster, more readable, or restructured. AI proposes refactored versions.

Best for: Cleanup of code that grew organically. Performance optimization (with caveats). Conversion between styles (e.g., procedural to functional).

Discipline: Refactored code needs the same test coverage as the original. AI may inadvertently change behavior during refactoring.
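
One way to hold refactored code to the original's behavior is to run both versions on many randomized inputs and compare the results. A sketch under stated assumptions: the two functions below are hypothetical stand-ins for your original code and the AI's refactor.

```python
import math
import random
from collections import defaultdict


def mean_by_group_original(values, groups):
    """Hypothetical original: loop-based group means."""
    totals, counts = {}, {}
    for value, group in zip(values, groups):
        totals[group] = totals.get(group, 0.0) + value
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}


def mean_by_group_refactored(values, groups):
    """Hypothetical AI refactor: should return the same means."""
    buckets = defaultdict(list)
    for value, group in zip(values, groups):
        buckets[group].append(value)
    return {group: sum(vals) / len(vals) for group, vals in buckets.items()}


def check_equivalent(n_trials=200, seed=0):
    """Compare both versions on seeded random inputs, including empty input."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        n = rng.randint(0, 20)
        values = [rng.uniform(0, 1000) for _ in range(n)]
        groups = [rng.choice("ABC") for _ in range(n)]
        old = mean_by_group_original(values, groups)
        new = mean_by_group_refactored(values, groups)
        assert old.keys() == new.keys(), (old, new)
        for group in old:
            assert math.isclose(old[group], new[group], rel_tol=1e-9), group
    return True


check_equivalent()
```

Randomized comparison does not replace the original's test suite; it is a cheap extra check that the refactor did not change behavior on inputs nobody thought to write a test for.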

Mode 5 — Explain unfamiliar code

You're reading someone else's code (a colleague's pipeline, a Bioconductor package, an inherited project) and need to understand it.

Best for: Code archaeology. Documentation generation. Onboarding to a codebase.

Discipline: AI may explain confidently but incorrectly. Cross-check explanations against actual code behavior.

Most computational biology work happens in Modes 1, 2, and 3. The other two come up regularly but less often.


02 · The specification-first discipline

The single biggest determinant of AI coding output quality is the quality of your specification. Sloppy specs produce sloppy code; precise specs produce code that needs minimal revision.

A weak vs. strong specification

Weak:

"Write code to analyze my RNA-seq data."

Strong:

"Write a Python script that performs differential expression analysis on the attached count matrix.

Input: counts.csv with 25,000 rows (genes) × 19 columns: a leading gene_id column (Ensembl IDs) followed by 18 sample columns. Sample names follow the pattern [treatment]_[replicate] where treatment is one of {vehicle, drugA_low, drugA_high, drugB_low, drugB_high, combo} and replicate is 1-3.

Required analysis:

  1. Load counts and parse sample metadata from column names
  2. Remove genes with mean count < 10 across all samples
  3. Use PyDESeq2 (a Python implementation of the DESeq2 method) to fit a model with treatment as a factor
  4. Generate contrast tables for: each treatment vs. vehicle; combo vs. drugA_high; combo vs. drugB_high
  5. Apply FDR correction (Benjamini-Hochberg) within each contrast
  6. Output: one CSV per contrast with columns [gene_id, log2FC, lfcSE, stat, pvalue, padj]

Code requirements:

  • Reproducible: set random seed; pin the PyDESeq2 version in a requirements comment
  • Modular: separate functions for loading, filtering, fitting, contrast extraction
  • Documented: docstrings on each function
  • Test-friendly: include a main() that runs the full pipeline; load input from a path argument
  • Error handling: graceful failure if input has unexpected columns or empty groups

Output the code with a brief explanation of design choices at the top."

The strong version is ~200 words. It produces code that's nearly publishable. The weak version produces code you'll spend 90 minutes rewriting.

The principle: the time you spend on the spec is reclaimed multiple times over in the code quality.
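
A precise spec also makes the output checkable. As an illustration, step 1 of the strong spec (parsing sample metadata from column names) could be sketched as below. The function names are hypothetical; the treatment labels and replicate numbers are the ones listed in the spec. Note the detail a careless draft often gets wrong: treatment labels such as drugA_low contain an underscore themselves, so the name has to be split on the last underscore only.

```python
# Assumption: labels and replicate numbers are exactly those given in the spec.
EXPECTED_TREATMENTS = {
    "vehicle", "drugA_low", "drugA_high", "drugB_low", "drugB_high", "combo",
}
EXPECTED_REPLICATES = {1, 2, 3}


def parse_sample_name(name):
    """Split '[treatment]_[replicate]' on the LAST underscore."""
    treatment, sep, replicate = name.rpartition("_")
    if not sep or not replicate.isdigit():
        raise ValueError(f"Sample name does not match [treatment]_[replicate]: {name!r}")
    if treatment not in EXPECTED_TREATMENTS:
        raise ValueError(f"Unknown treatment {treatment!r} in sample {name!r}")
    if int(replicate) not in EXPECTED_REPLICATES:
        raise ValueError(f"Unexpected replicate {replicate!r} in sample {name!r}")
    return treatment, int(replicate)


def parse_design(sample_names):
    """Map each sample name to (treatment, replicate) and check the design is complete."""
    design = {name: parse_sample_name(name) for name in sample_names}
    expected = {(t, r) for t in EXPECTED_TREATMENTS for r in EXPECTED_REPLICATES}
    missing = expected - set(design.values())
    if missing or len(sample_names) != len(expected):
        raise ValueError(f"Design incomplete or duplicated; missing: {sorted(missing)}")
    return design
```

Calling parse_design on the 18 sample column names either returns the full design or fails loudly, which is the "graceful failure" the spec asks for.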


03 · Reading AI-generated code

A specific discipline that separates productive from dangerous AI coding: read the code before running it.

This sounds obvious. It isn't, because the natural workflow is paste-and-execute. The AI's output is right there; running it is one click. Reading first feels like friction.

Build the friction in.

What to look for

Structure:

  • Does the overall structure match what you asked for?
  • Are the functions named and organized sensibly?
  • Are there obvious missing pieces (no error handling, no logging)?

Correctness:

  • Are file paths, column names, parameter values matching your data?
  • Are statistical methods correct for the data?
  • Are biological conventions respected (e.g., correct coordinate systems for genomics)?

Subtleties:

  • Are random seeds set?
  • Are dependencies pinned?
  • Are edge cases handled (empty inputs, missing values, single-element groups)?

Suspicious patterns:

  • Hardcoded values that should be parameters
  • Magic numbers without explanation
  • Functions that do too much
  • Comments that don't match the code

Most AI-generated code passes this read on first inspection. ~10-20% has issues you'll catch only by reading carefully. Build the habit of always reading.

A specific tip — explain it back

If you're not sure a piece of AI-generated code is right, ask the AI to explain it back to you in detail:

Explain this code line by line. For each non-trivial decision, justify why this approach was chosen versus alternatives. Highlight any assumptions about the input data or environment.

This catches mistakes because:

  1. The AI may notice issues when forced to explain
  2. You'll catch mistakes when you read the explanation
  3. The explanation surfaces assumptions you may have missed

04 · The seven specific code failure modes (revisited and expanded)

Lesson 01 introduced these briefly. Here's the deeper treatment with defenses.

Failure 1 — Silent type coercion

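What it looks like: Code that processes a column expecting it to be numeric, but the data has it as string. A single non-numeric entry (for example "<1" or "ND") is enough for pandas to load the whole column as text without any warning. Some operations then raise an error, but others run silently, and you get string concatenation where you wanted addition.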

Defense: After loading any data, log column dtypes:

print(df.dtypes)

For critical columns, assert types:

assert pd.api.types.is_numeric_dtype(df['expression']), f"Expected numeric, got {df['expression'].dtype}"
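
To see the failure and the defense side by side, here is a small sketch on a made-up three-row CSV (the data and the "<1" entry are invented for illustration). It uses pd.to_numeric with errors="coerce" to find the offending entries rather than to hide them.

```python
import io

import pandas as pd

# Hypothetical input: one cell holds "<1", so the column loads as text.
raw = io.StringIO("gene_id,expression\nG1,12.5\nG2,<1\nG3,7.0\n")
df = pd.read_csv(raw)

# The dtype check from above reports the problem immediately.
is_numeric = pd.api.types.is_numeric_dtype(df["expression"])  # False

# Without the check, "+" silently concatenates strings instead of adding.
doubled = df["expression"] + df["expression"]  # first value is "12.512.5"

# Coerce to numeric, then list exactly which entries could not be parsed.
numeric = pd.to_numeric(df["expression"], errors="coerce")
bad_values = df.loc[numeric.isna() & df["expression"].notna(), "expression"].tolist()
print(f"Non-numeric entries in 'expression': {bad_values}")
```

What to do with the listed entries (treat as missing, as below a detection limit, or as a data error) is an analysis decision; the code only makes sure you get to make it.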

Failure 2 — Wrong statistical test

What it looks like: AI suggests t-test where Wilcoxon would be more appropriate. Or two-way ANOVA where mixed-effects modeling would be correct.

Defense: Ask AI to justify the test choice. Cross-check with a statistician or methods paper. For important analyses, the test choice should be a deliberate decision documented in the analysis, not a default.

Failure 3 — Group misalignment

What it looks like: df.groupby('sample_id') when you meant to group by treatment. Subtle in the code; catastrophic in the results.

Defense: Print the groups before aggregating:

print(df.groupby(group_col).size())

Verify the group counts and labels match your experimental design.
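
The visual check can be turned into a hard stop. A minimal sketch, assuming a hypothetical helper check_groups and a toy table with two treatments of three replicates each:

```python
import pandas as pd


def check_groups(df, group_col, expected_sizes):
    """Raise if observed group labels or sizes differ from the experimental design."""
    observed = {key: int(n) for key, n in df.groupby(group_col).size().items()}
    if observed != expected_sizes:
        raise ValueError(
            f"Group mismatch for '{group_col}': expected {expected_sizes}, observed {observed}"
        )
    return observed


# Toy data (invented for illustration).
df = pd.DataFrame({
    "sample_id": ["vehicle_1", "vehicle_2", "vehicle_3", "combo_1", "combo_2", "combo_3"],
    "treatment": ["vehicle"] * 3 + ["combo"] * 3,
    "expression": [10.0, 11.0, 9.5, 20.0, 22.0, 19.0],
})

design = {"vehicle": 3, "combo": 3}
check_groups(df, "treatment", design)     # passes
# check_groups(df, "sample_id", design)   # would raise: six groups of size 1
```

Grouping by sample_id instead of treatment runs without error in pandas; the explicit comparison against the design is what turns the mistake into a failure.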

Failure 4 — Missing data handling

What it looks like: Implicit row drops (df.dropna()), mean imputation, zero imputation. Sample sizes change between steps; the analysis is no longer what you intended.

Defense: Be explicit about missing data:

n_before = len(df)
df = df.dropna(subset=['expression'])
n_after = len(df)
print(f"Dropped {n_before - n_after} rows with missing expression")
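
Because the real risk is a change in per-group sample size, the same pattern can be extended to check each group after the drop. A sketch, assuming a hypothetical helper drop_missing_checked and a minimum group size that you choose for your own design:

```python
import pandas as pd


def drop_missing_checked(df, value_col, group_col, min_per_group=3):
    """Drop rows with a missing value, report how many, and fail if a group gets too small."""
    n_before = len(df)
    kept = df.dropna(subset=[value_col])
    print(f"Dropped {n_before - len(kept)} rows with missing {value_col}")

    # Count surviving rows per group; groups that vanished entirely count as 0.
    all_groups = df[group_col].unique()
    sizes = kept.groupby(group_col).size().reindex(all_groups, fill_value=0)
    too_small = sizes[sizes < min_per_group]
    if not too_small.empty:
        raise ValueError(
            f"Groups below {min_per_group} rows after dropping: {too_small.to_dict()}"
        )
    return kept


# Toy data (invented for illustration): one combo replicate is missing.
df = pd.DataFrame({
    "treatment": ["vehicle"] * 3 + ["combo"] * 3,
    "expression": [10.0, 11.0, 9.5, 20.0, float("nan"), 19.0],
})
clean = drop_missing_checked(df, "expression", "treatment", min_per_group=2)
```

With min_per_group=3 the same call raises, because combo is left with two replicates.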

Failure 5 — Visualization mismatches

What it looks like: Plot looks beautiful. Axis is wrong. Error bars represent SEM when you needed CI. Y-axis is on a different scale than you'd expect (for example, linear when you assumed log).

Defense: For every plot, before relying on it:

  • Check axis ranges
  • Check what error bars represent
  • Check that the plot tells the same story as a manual spot-check of the underlying data
  • Check aspect ratios — distorted aspect ratios mislead
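
Why the error-bar check matters is easiest to see with numbers. A small sketch comparing SEM with a 95% confidence interval half-width for three replicates; the replicate values are invented, and the t critical values are hardcoded for small samples to keep the example free of dependencies.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom (n - 1).
T_CRIT_95 = {2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571}


def sem(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def ci95_half_width(values):
    """Half-width of the 95% CI for the mean, using the t distribution."""
    return T_CRIT_95[len(values) - 1] * sem(values)


replicates = [10.0, 12.0, 14.0]  # hypothetical triplicate measurements
print(f"mean = {statistics.mean(replicates):.2f}")
print(f"SEM  = {sem(replicates):.2f}")
print(f"95% CI half-width = {ci95_half_width(replicates):.2f}")
```

With n = 3, the 95% CI bar is about 4.3 times longer than the SEM bar, so a plot drawn with one and read as the other gives a very different impression of how well groups separate.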

Failure 6 — Hardcoded values that should be variables

What it looks like: AI used your example data's values in the code. When you run on different data, things break or produce wrong results.

Defense: Read for hardcoded paths, numbers, column names. Refactor anything specific to your test case into parameters.
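
One common way to do that refactoring is to lift the values into command-line arguments. A minimal sketch using argparse; the argument names and defaults are illustrative assumptions, not a required interface.

```python
import argparse


def build_parser():
    """Expose values that AI-drafted code tends to hardcode as explicit parameters."""
    parser = argparse.ArgumentParser(description="Filter a count matrix (illustrative).")
    parser.add_argument("counts_path", help="Path to the input count matrix (CSV)")
    parser.add_argument("--gene-col", default="gene_id",
                        help="Name of the gene identifier column")
    parser.add_argument("--min-mean-count", type=float, default=10.0,
                        help="Remove genes whose mean count is below this value")
    parser.add_argument("--output-dir", default="results",
                        help="Directory for output files")
    return parser


# Parsing an explicit list here for demonstration; a script would call parse_args().
args = build_parser().parse_args(["counts.csv", "--min-mean-count", "5"])
print(args.counts_path, args.gene_col, args.min_mean_count, args.output_dir)
```

Every value that used to be buried in the script body now appears in one place, with a default you can see and override.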

Failure 7 — Multiple comparisons mishandling

What it looks like: AI runs 1,000 tests without correction. Or applies correction at the wrong level. Or applies Bonferroni where FDR would be more appropriate.

Defense: Be explicit about multiple comparisons in the spec. Verify the correction is applied correctly (and at the right scope) before relying on results.
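
To make "the right scope" concrete, here is a plain-Python sketch of the Benjamini-Hochberg adjustment applied two ways: within each contrast, and pooled across contrasts. The p-values are invented; in practice you would normally rely on a tested library implementation rather than this hand-rolled version.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for the same two genes in two contrasts.
contrast_a = [0.01, 0.04]
contrast_b = [0.03, 0.005]

within_a = benjamini_hochberg(contrast_a)           # approx [0.02, 0.04]
within_b = benjamini_hochberg(contrast_b)           # approx [0.03, 0.01]
pooled = benjamini_hochberg(contrast_a + contrast_b)  # approx [0.02, 0.04, 0.04, 0.02]
```

The same raw p-value gets a different adjusted value depending on scope (0.005 becomes about 0.01 within its contrast and about 0.02 when pooled), which is why the spec should state the scope and the code should be checked against it.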


05 · Pipeline construction

Beyond individual scripts, computational biologists build pipelines that process many samples through many steps reproducibly. Common tools:

  • Snakemake (Python-based, widely used in academic and biotech bioinformatics)
  • Nextflow (Groovy-based, popular in pharma)
  • Common Workflow Language (CWL) (vendor-neutral, less common in practice)
  • Custom shell scripts (less reproducible, common in less mature environments)

AI is genuinely useful for pipeline construction, with caveats.

What AI does well

  • Standard pipeline patterns (read alignment, quantification, QC steps)
  • Snakemake/Nextflow boilerplate (rule definitions, input/output specifications)
  • Containerization (Dockerfile generation, Singularity recipes)
  • Configuration files (YAML/JSON for pipeline parameters)
  • Documentation of pipeline steps

What AI struggles with

  • Performance tuning specific to your cluster (it doesn't know your hardware)
  • Edge cases in real data (it pattern-matches happy-path scenarios)
  • Integration with proprietary data systems (it doesn't know your S3 layout, your LIMS, your internal tools)
  • Pipeline versioning and migration strategies

A practical pattern for AI-assisted pipeline construction

Step 1 — Architecture conversation:

You are a senior bioinformatician designing a bulk RNA-seq pipeline. Walk me through the recommended pipeline architecture for [setup: sample size, sequencing depth, deliverables, compute environment]. Address:
- Tool choices for each step (with justification)
- Pipeline orchestration approach
- Containerization strategy
- Reproducibility requirements
- Failure handling

Use this for thinking. Verify recommendations against current best practices.

Step 2 — Module-by-module implementation:

Rather than asking AI to generate an entire pipeline, ask for one module at a time. Each module gets verified before moving on.

Step 3 — Integration:

Wire the modules together. AI helps with the wiring (Snakemake rules, Nextflow channels) but the integration logic is yours.

Step 4 — Testing:

Run on a small test dataset before committing to a full production run. Verify every step's outputs.

Step 5 — Documentation:

Have AI help document what each step does, what its inputs/outputs are, and what assumptions it makes. Good pipeline docs are essential and AI accelerates writing them.

The pipeline reproducibility checklist

Any pipeline you build should have:

  • All software versions pinned (Conda environment file, Docker image with pinned versions, renv lockfile)
  • All parameters in configuration files, not hardcoded in scripts
  • Random seeds set everywhere randomness is involved
  • Input data manifest with checksums
  • Logging at each step
  • Documentation of expected outputs at each step
  • Test data and expected results for regression testing
  • README explaining how to run the pipeline

This list takes time to satisfy. AI helps with each item — but you're responsible for ensuring they're all done.
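
As one example from the checklist, an input manifest with checksums takes only a few lines with the standard library. A sketch, with hypothetical function names; where you store the manifest (alongside the config, in version control) is up to your pipeline.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 of a file, reading in chunks so large inputs fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Map each input file path to its checksum."""
    return {str(p): sha256_of(p) for p in sorted(map(Path, paths))}


def verify_manifest(manifest):
    """Return the paths whose current checksum no longer matches the manifest."""
    return [path for path, expected in manifest.items() if sha256_of(path) != expected]
```

Build the manifest once when inputs are registered, then call verify_manifest at the start of each run; a non-empty result means an input changed since the manifest was written.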


06 · A worked debugging example

A realistic debug session, end-to-end.

Setting: You've written a script that processes a bulk RNA-seq dataset. The script runs without errors. The output looks plausible. But your colleague who runs a parallel analysis with a different tool gets noticeably different results for several key genes.

Initial diagnostic:

You suspect a bug somewhere. Where?

Step 1 — Reproduce both analyses on a small subset:

Take 10 genes where the results differ. Run your script on just those 10 genes. Run the alternative on just those 10 genes. Compare outputs in detail.

Step 2 — Bring it to AI:

You are a senior bioinformatics scientist. I'm seeing systematic differences between two analyses of the same bulk RNA-seq data. The two methods are [yours, e.g., DESeq2] and [colleague's, e.g., edgeR].

I'm attaching:
- Outputs from both methods for 10 genes where they disagree
- Both scripts
- The input count data for those 10 genes

Help me diagnose why the results differ. Consider:
- Normalization differences
- Filtering differences
- Statistical model differences
- Implementation choices
- Bugs in either script

Be specific.

AI's response (paraphrased):

"The differences appear to stem from three sources:

  1. Filtering — your script filters by mean count > 10 across all samples; the edgeR script filters by counts-per-million (CPM) > 1 in at least 3 samples. These are different filters. Genes filtered by one but not the other will have different downstream behavior.

  2. Normalization — DESeq2 uses median-of-ratios normalization; edgeR uses TMM. Both are valid but they produce different size factors, especially when expression is skewed.

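  3. Dispersion estimation: both tools shrink gene-wise dispersion estimates with empirical Bayes, but toward different targets. DESeq2 shrinks toward a fitted parametric mean-dispersion trend; edgeR moderates toward a common or trended dispersion using weighted likelihood. For low-count genes, this matters.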

Likely diagnostic next step: check whether the genes where you disagree are systematically in the low-count regime or have skewed expression. If so, the differences are method-driven, not bug-driven."
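
The filtering point in the response is easy to confirm for yourself. A simplified sketch of the two filters as described (mean count > 10 versus CPM > 1 in at least 3 samples); the gene counts and library sizes are invented, and the real tools' filtering functions may do more than this.

```python
def passes_mean_filter(counts, min_mean=10):
    """Keep a gene if its mean raw count across samples exceeds min_mean."""
    return sum(counts) / len(counts) > min_mean


def passes_cpm_filter(counts, library_sizes, min_cpm=1.0, min_samples=3):
    """Keep a gene if its counts-per-million exceeds min_cpm in at least min_samples samples."""
    cpm = [count / size * 1_000_000 for count, size in zip(counts, library_sizes)]
    return sum(value > min_cpm for value in cpm) >= min_samples


library_sizes = [5_000_000] * 6  # hypothetical: six samples, 5 million reads each

gene_a = [100, 0, 0, 0, 0, 0]  # high in one sample only
gene_b = [9, 9, 9, 9, 9, 9]    # low but consistent

for name, counts in [("gene_a", gene_a), ("gene_b", gene_b)]:
    print(name, passes_mean_filter(counts), passes_cpm_filter(counts, library_sizes))
```

Here gene_a passes the mean filter but fails the CPM filter, and gene_b does the opposite, so the two scripts would test different gene sets before any statistics are run.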

Step 3 — Verify the hypothesis:

You check. The disputed genes are indeed mostly low-count. The disagreement is method-driven, not a bug.
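
"You check" can be as simple as comparing how many disputed genes are low-count against the background rate. A sketch with invented mean counts and a hypothetical low-count threshold of 20; choose a threshold that makes sense for your data.

```python
def fraction_low_count(mean_counts, gene_ids, threshold):
    """Fraction of the given genes whose mean count falls below the threshold."""
    low = [gene for gene in gene_ids if mean_counts[gene] < threshold]
    return len(low) / len(gene_ids)


# Hypothetical mean counts per gene and the genes where the two methods disagree.
mean_counts = {
    "G1": 4, "G2": 8, "G3": 15, "G4": 250,
    "G5": 900, "G6": 12, "G7": 1800, "G8": 6,
}
disputed = ["G1", "G2", "G6", "G8"]

threshold = 20  # assumption for illustration
disputed_fraction = fraction_low_count(mean_counts, disputed, threshold)
background_fraction = fraction_low_count(mean_counts, list(mean_counts), threshold)
print(f"Low-count among disputed genes: {disputed_fraction:.0%}")
print(f"Low-count among all genes:      {background_fraction:.0%}")
```

If the disputed genes are enriched for low counts relative to the background, that supports the method-driven explanation; if they are not, keep looking for a bug.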

Step 4 — Document:

In your analysis notes:

"Note: differences between DESeq2 (our method) and edgeR (alternative) for low-count genes are expected due to differences in normalization, filtering, and dispersion estimation. Spot-check confirms our results are internally consistent."

Total debug time: ~45 minutes. Without AI: probably 2-3 hours of digging through method papers.

A note on what AI doesn't do here

AI's diagnosis was useful but not definitive. It said "likely method-driven, not bug-driven," and you still had to verify. The AI didn't settle the question; it gave you a structured hypothesis space.

That's the right use of AI in debugging. Hypothesis generation, not certainty.


07 · The "explain this code" workflow

A specific use case worth its own treatment: code archaeology.

You inherit a colleague's pipeline. Or you're reviewing a Bioconductor package's source. Or you're auditing your own code from two years ago. You need to understand what it does.

The prompt pattern

You are a senior bioinformatician helping me understand the following code. Provide:

1. A high-level summary of what this code does (3-5 sentences)
2. A step-by-step walkthrough (each major block explained)
3. Identification of any non-obvious choices and their likely rationale
4. Identification of potential issues (bugs, performance, edge cases)
5. Questions I should ask the original author

[Code attached]

This produces an annotated explanation that accelerates comprehension significantly. Caveat: cross-check critical interpretations by running the code on test cases.
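
Cross-checking by running the code can be a short behavioral test of what the explanation claims. A sketch, assuming the AI told you an inherited function "scales each sample to counts per million"; legacy_normalize below is a stand-in for that inherited code, not a real library function.

```python
import math


def legacy_normalize(sample_counts):
    """Stand-in for inherited code whose behavior you want to confirm."""
    total = sum(sample_counts)
    return [count / total * 1_000_000 for count in sample_counts]


def behaves_like_cpm(func, sample_counts):
    """Check two properties any CPM scaling must have on this input."""
    out = func(sample_counts)
    total = sum(sample_counts)
    sums_to_million = math.isclose(sum(out), 1_000_000, rel_tol=1e-9)
    preserves_proportions = all(
        math.isclose(o * total, c * 1_000_000, rel_tol=1e-9)
        for o, c in zip(out, sample_counts)
    )
    return sums_to_million and preserves_proportions


test_counts = [10, 30, 60, 0]  # small input with a hand-checkable answer
print(behaves_like_cpm(legacy_normalize, test_counts))
```

If the check fails, either the explanation or your reading of it is wrong, and you have a concrete input to bring to the original author.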

When this is most valuable

  • Onboarding to a new codebase
  • Reviewing legacy pipelines
  • Auditing inherited analyses
  • Understanding a published method's reference implementation

When this is dangerous

  • For code you'll modify and trust without testing
  • For code that drives regulatory or clinical decisions
  • When AI's explanation contradicts your understanding (the AI might be wrong, or you might be wrong; either way, investigate)

08 · Cursor, Claude Code, Copilot — the pair programming tools

A note on the in-editor AI tools that have become standard for computational work.

The current landscape (early 2026):

  • Cursor — VS Code fork with deep AI integration; preferred by many data scientists
  • Claude Code — Anthropic's terminal-based agentic coding tool
  • GitHub Copilot — long-running pair-programming tool, increasingly capable
  • Windsurf — newer entrant with strong agentic capabilities

For computational biology work, these tools provide:

  • Autocomplete suggestions as you type
  • "Chat with codebase" features
  • Agentic features (multi-file refactors, automated debugging)
  • Integration with your terminal and version control

When these are clearly worth it

If you write code daily and your work is iterative, the productivity gains are real and material. Most computational biologists report 30-60% productivity gains within weeks of adoption.

When to be cautious

These tools encourage acceptance of AI suggestions in a low-friction way. The "read before running" discipline gets harder. The temptation to accept-and-iterate is strong.

The defenses:

  • Don't accept suggestions you don't understand
  • Periodically pause and review what's been accepted
  • Use formal code review (yours or a colleague's) for important work
  • Test more, not less

Data class considerations

Pair programming tools see your code as you write it. For confidential or restricted code:

  • Use enterprise versions of these tools where available
  • Verify data-handling terms
  • Consider on-premises options for the most sensitive code

The convenience of these tools shouldn't override the data classification discipline from Module 03.


09 · A short exercise

Take a coding task you'd do this week. Write the specification using the Section 02 template. Save the spec.

Run the spec through AI. Read the output carefully. Note:

  • What needed revision
  • What you would have spent more time on without AI
  • What AI got subtly wrong

Save the spec to your prompt library with notes on what worked. This is the seed of a personal computational biology AI playbook.


10 · Knowledge check

Three questions.


Q1. Which is the most important discipline that separates productive from dangerous AI coding?

a) Writing code in Python rather than R
b) Reading AI-generated code carefully before running it; never paste-and-execute, especially for analysis code
c) Using only enterprise AI tools
d) Always asking for explanations


Q2. When two methods for the same analysis produce different results for some genes, what's the appropriate diagnostic approach with AI?

a) Trust the first method's results
b) Average the two methods
c) Use AI to generate structured hypotheses about why methods differ (normalization, filtering, statistical models), then verify which apply to your specific situation; AI does hypothesis generation, not certainty
d) Switch to a third method


Q3. Why are pair-programming tools (Cursor, Claude Code, Copilot) potentially riskier than chat-based AI for computational work?

a) They're more expensive
b) They produce worse code
c) They encourage accept-and-iterate patterns that erode the "read before running" discipline; productivity gains are real but require active vigilance to maintain code quality
d) They don't work on Mac


Answers: Q1: b · Q2: c · Q3: c


11 · What's next

Lesson 03 of Module 04B: Methodology and statistical reasoning — where the work demands the most domain judgment and AI's role is most circumscribed.


End of Lesson 02.
