17 Prompt Engineering for Bioinformatics

A Practical Guide

Author: Dr Randy Johnson

Affiliation: Hood College

Published: April 2, 2026

Acknowledgements

Gemini was used to draft an outline for these notes and Gemini code assist was active while writing.

Review

Today’s material overlaps heavily with other discussions we have had in this class
At the risk of beating a dead horse, this background is important to understand and provides context for today’s discussion

LLMs are not magic or alive

Large Language Models (LLMs) are highly advanced prediction engines
They do not think or know anything
They calculate the statistical probability of the next word (token) based on the vast amount of text they were trained on
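
As a toy illustration of next-token prediction, the sketch below turns a handful of raw scores into probabilities with a softmax. The candidate words and their scores are invented for this example; a real model scores tens of thousands of tokens using learned weights.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


# Hypothetical scores for the word following "The gene is highly ..."
toy_logits = {"expressed": 4.0, "conserved": 3.0, "purple": -2.0}

probs = softmax(toy_logits)
most_likely = max(probs, key=probs.get)

# Print candidates from most to least probable
for token, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{token:>10}: {p:.3f}")
```

The model has no notion of which continuation is true, only which one scores highest given its training text.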

Context Windows

Context windows are the model’s short-term memory
They dictate how much text (papers, code, data schemas) you can feed it at once
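
A rough way to check whether a set of inputs is likely to fit in a context window is sketched below. The ratio of 4 characters per token is only a rule of thumb assumed here, and the limit of 8000 tokens is a placeholder; real tokenizers and model limits differ, so treat the result as an estimate.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; real tokenizers will give different counts."""
    return -(-len(text) // chars_per_token)  # ceiling division


def fits_in_context(documents, context_limit, reserved_for_answer=1000):
    """Return (fits, total_tokens) for a list of text chunks.

    reserved_for_answer leaves room for the model's reply (assumed value).
    """
    total = sum(estimate_tokens(doc) for doc in documents)
    return total + reserved_for_answer <= context_limit, total


# Hypothetical inputs: a sequence and a small metadata table
docs = ["ACGT" * 1000, "sample_id,condition,batch\n" * 200]
fits, total = fits_in_context(docs, context_limit=8000)
print(f"Estimated tokens: {total}, fits: {fits}")
```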

Bioinformatics Shift

We are moving from syntax-heavy searching (e.g. Stack Overflow) to natural language task delegation
We are no longer just coding; we are managing an AI assistant

The Stochastic Parrot

LLMs are prone to hallucinations (generating highly plausible but entirely false information)
This is especially dangerous with
- Highly specialized biological literature
- Obscure gene aliases
- Complex statistical assumptions

Verification is mandatory
- An LLM might invent a paper that doesn’t exist or confidently recommend the wrong normalization method for bulk RNA-seq data
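
One small piece of verification can be automated: checking gene symbols suggested by a model against a reference you trust. The sketch below uses a tiny hand-written reference set and a made-up model output, both hypothetical. The case-insensitive comparison is an assumption that may not suit work spanning species with different capitalization conventions.

```python
def split_verified(suggested, trusted):
    """Separate symbols found in a trusted reference from those that are not."""
    trusted_upper = {symbol.upper() for symbol in trusted}
    verified = [s for s in suggested if s.upper() in trusted_upper]
    unverified = [s for s in suggested if s.upper() not in trusted_upper]
    return verified, unverified


# Hypothetical reference set; in practice load this from a curated source
trusted_symbols = {"TP53", "BRCA1", "EGFR"}
# Hypothetical model output, including one symbol that is not in the reference
llm_suggested = ["TP53", "brca1", "GENE9X"]

verified, unverified = split_verified(llm_suggested, trusted_symbols)
print("Verified:", verified)
print("Needs manual checking:", unverified)
```

A symbol landing in the unverified list is not proof of a hallucination, only a flag that a human needs to look it up.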

Prompt Engineering Principles

Engineering
- Build something to solve a problem
- Test and evaluate the solution
- Learn and iterate

The Anatomy of a Prompt

A robust prompt provides boundaries. Use the RTCF framework:

Role: Who is the AI?

“Act as an expert biostatistician and Python developer…”

Task: What exactly needs to be done?

“…write a script to normalize this dataset…”

Context: What is the background?

“…the data is from a bulk RNA-seq experiment with heavy batch effects. I am using an Ubuntu environment…”

Format: How should the output look?

“…output only the code with inline comments, and provide a brief explanation of the statistical assumptions below.”
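
The four RTCF parts can be assembled programmatically so that none is forgotten. This is a minimal sketch: the function name `build_rtcf_prompt` and the labeled one-line-per-part layout are choices made for this example, not a required format.

```python
def build_rtcf_prompt(role, task, context, output_format):
    """Assemble the four RTCF parts into one labeled prompt string."""
    parts = [
        ("Role", role),
        ("Task", task),
        ("Context", context),
        ("Format", output_format),
    ]
    # Refuse to build a prompt with an empty part
    missing = [label for label, text in parts if not text.strip()]
    if missing:
        raise ValueError(f"Missing RTCF parts: {', '.join(missing)}")
    return "\n".join(f"{label}: {text.strip()}" for label, text in parts)


prompt = build_rtcf_prompt(
    role="Act as an expert biostatistician and Python developer.",
    task="Write a script to normalize this dataset.",
    context=(
        "The data is from a bulk RNA-seq experiment with heavy batch effects. "
        "I am using an Ubuntu environment."
    ),
    output_format=(
        "Output only the code with inline comments, and provide a brief "
        "explanation of the statistical assumptions below."
    ),
)
print(prompt)
```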

Clarity and Constraints

Positive vs. Negative Constraints
Tell the model what to do, but also explicitly what not to do.

“Do not use base R plotting; strictly use ggplot2.”
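
A negative constraint is only useful if you check that the model honored it. The sketch below scans returned R code for a few base plotting calls. The list of function names and the regular expression are assumptions chosen for illustration, and the check is a rough heuristic (it would also flag matches inside comments or strings).

```python
import re

# Matches plot(, hist(, barplot(, boxplot( unless preceded by a word
# character or a dot, so ggplot( and geom_boxplot( are not flagged.
BASE_R_PLOT_CALL = re.compile(r"(?<![\w.])(plot|hist|barplot|boxplot)\s*\(")


def find_base_plot_calls(r_code):
    """Return the base R plotting function names found in a block of R code."""
    return [match.group(1) for match in BASE_R_PLOT_CALL.finditer(r_code)]


# Hypothetical model outputs
good = "ggplot(df, aes(x, y)) + geom_boxplot()"
bad = "hist(df$age)\nplot(df$x, df$y)"

print(find_base_plot_calls(good))
print(find_base_plot_calls(bad))
```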

Advanced Prompting Techniques

Zero-shot prompting, few-shot prompting, and chain-of-thought prompting can all help produce better results

Zero-Shot Prompting

Asking the model to perform a task without giving it prior examples
This is often where we start
Good for general coding or translation

Few-Shot Prompting

Providing the model with 1–3 examples of the desired input and output
This is especially helpful when formatting complex biological metadata or ensuring specific naming conventions

# Example
Format this raw clinical data.
Input: Patient 1, Male, 55, High BP
Output: `{"id": "P01", "sex": "M", "age": 55, "hypertension": true}`
Now do the same for the following data...
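
A few-shot prompt like the one above can be built from a list of example pairs. In this sketch the helper name `build_few_shot_prompt` and the second patient record are made up for illustration; the first example pair is the one from these notes.

```python
import json


def build_few_shot_prompt(instruction, examples, new_input):
    """Build a prompt with worked input/output pairs followed by the new input."""
    lines = [instruction, ""]
    for raw, structured in examples:
        lines.append(f"Input: {raw}")
        lines.append(f"Output: {json.dumps(structured)}")
        lines.append("")
    # End with an open "Output:" so the model continues the pattern
    lines.append(f"Input: {new_input}")
    lines.append("Output:")
    return "\n".join(lines)


examples = [
    (
        "Patient 1, Male, 55, High BP",
        {"id": "P01", "sex": "M", "age": 55, "hypertension": True},
    ),
]

prompt = build_few_shot_prompt(
    "Format this raw clinical data.",
    examples,
    "Patient 2, Female, 61, Normal BP",  # simulated record, not real data
)
print(prompt)
```

Serializing the examples with `json.dumps` keeps the demonstrated format valid JSON, which makes the model's reply easier to parse and validate afterwards.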

Chain-of-Thought (CoT)

Instruct the model to explain its reasoning step-by-step before outputting the final answer or code
This often reduces logic errors
This is especially useful for complex tasks

“I am planning an R script for a differential expression analysis. Outline the statistical steps required and provide justification for each step”
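
One way to apply chain-of-thought prompting in a script is to ask for the reasoning first and a clearly marked final answer second, then split the two. The marker text `FINAL ANSWER:` and the sample reply below are assumptions for this sketch; no model is called here.

```python
FINAL_MARKER = "FINAL ANSWER:"


def add_cot_instruction(task):
    """Append a step-by-step reasoning request to a task description."""
    return (
        f"{task}\n\n"
        "Explain your reasoning step by step before giving the final answer. "
        f"Then write the final answer on a new line starting with '{FINAL_MARKER}'."
    )


def split_reasoning(response):
    """Return (reasoning, answer); answer is None if the marker is absent."""
    reasoning, marker, answer = response.partition(FINAL_MARKER)
    if not marker:
        return response.strip(), None
    return reasoning.strip(), answer.strip()


# Hypothetical model reply, written by hand for this example
reply = "1. Filter low-count genes.\n2. Normalize.\nFINAL ANSWER: Use the steps above."
reasoning, answer = split_reasoning(reply)
print(add_cot_instruction("Outline the steps for a differential expression analysis."))
print("Reasoning:", reasoning)
print("Answer:", answer)
```

Keeping the reasoning separate makes it easier to review each step before trusting the answer.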

Bioinformatics-Specific Applications

Code Generation & Translation

LLMs excel at translating between languages
We can easily convert a data wrangling script from R (dplyr) to Python (pandas), or translate complex bioinformatics algorithms

Tip

When requesting code, specify the exact packages and versions you prefer to avoid deprecated functions
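
To follow this tip without typing versions by hand, you can read them from your own environment and paste the result into the prompt. The sketch below uses the standard library's `importlib.metadata`; the package names passed in at the end are only examples, and the wording of the generated constraint is an assumption.

```python
from importlib import metadata


def describe_environment(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def version_constraint_text(versions):
    """Turn a name-to-version mapping into text suitable for a prompt."""
    lines = ["Use only these package versions:"]
    for name, version in sorted(versions.items()):
        if version:
            lines.append(f"- {name} {version}")
        else:
            lines.append(f"- {name} (not installed; ask before using)")
    return "\n".join(lines)


# Example package names; the output depends on what is installed locally
print(version_constraint_text(describe_environment(["pandas", "biopython"])))
```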

Reproducibility & Infrastructure

Use LLMs to generate robust Dockerfiles for containerizing your environments, saving hours of dependency troubleshooting

“Write a Dockerfile that pulls a lightweight Ubuntu base image, installs Python 3.10, R 4.3, and sets up the latest versions of Biopython and DESeq2.”

IDE and Terminal Integration

Bring the AI to where you work
Use GitHub Copilot or Gemini Code Assist within your IDE for inline code completion
Use the GitHub Copilot or Gemini CLI directly in your terminal

Literature & Summarization

Use LLMs to find research papers
- This shouldn’t be your only way to search the literature, but it should be a part of your workflow
- Perplexity for search
- Connected Papers for finding papers related to a specific publication

Use LLMs to summarize dense genetics papers or extract specific methodologies
- NotebookLM:
  - Answer questions about the paper
  - Point you to specific sections of the paper
  - Generate an audio discussion of the paper between two people
  - Generate visual and video summaries

Ethics, Privacy, and Limitations

Data Security

Never paste PHI, PII, or other sensitive clinical data into public web interfaces
Assume anything pasted into a standard, non-enterprise LLM chat can be used as future training data
Use simulated or fully de-identified data for prompt testing
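
Simulated records for prompt testing can be generated with the standard library. This is a minimal sketch: the field names mirror the few-shot example earlier in these notes, while the age range and the 30% hypertension rate are arbitrary assumptions, not realistic clinical distributions.

```python
import random


def simulate_patients(n, seed=0):
    """Generate n fake patient records; a fixed seed makes runs repeatable."""
    rng = random.Random(seed)
    records = []
    for i in range(1, n + 1):
        records.append(
            {
                "id": f"P{i:02d}",
                "sex": rng.choice(["M", "F"]),
                "age": rng.randint(18, 90),  # arbitrary adult age range
                "hypertension": rng.random() < 0.3,  # arbitrary rate
            }
        )
    return records


patients = simulate_patients(5)
for record in patients:
    print(record)
```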

Accountability

The LLM writes the code; you own the science
- Always run the code
- Check edge cases
- Validate scientific claims against primary literature
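
As an illustration of checking edge cases, suppose an LLM returned the small GC-content function below (written by hand here as a stand-in). A few assertions cover the empty sequence, lowercase input, and ambiguous bases. Counting `N` in the denominator is a design choice assumed for this sketch; your analysis may need a different rule.

```python
def gc_content(sequence):
    """Fraction of G and C bases in a DNA sequence; 0.0 for an empty sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0  # avoid division by zero
    return sum(base in "GC" for base in sequence) / len(sequence)


# Inputs chosen to probe edge cases, with the results we expect
edge_cases = {"": 0.0, "GGCC": 1.0, "atat": 0.0, "ACGT": 0.5, "NNGC": 0.5}

for seq, expected in edge_cases.items():
    result = gc_content(seq)
    assert abs(result - expected) < 1e-9, f"unexpected result for {seq!r}"

print("All edge cases passed")
```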
