17 Prompt Engineering for Bioinformatics
A Practical Guide
Acknowledgements
Gemini was used to draft an outline for these notes, and Gemini Code Assist was active while writing.
Review
- There is a lot of overlap with today’s material and other discussions we have had in this class
- At the risk of beating a dead horse, this is important to understand and provides context for today’s discussion
LLMs are not magic or alive
- Large Language Models (LLMs) are highly advanced prediction engines
- They do not think or know anything
- They calculate the statistical probability of the next word (token) based on the vast amount of text they were trained on
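A toy sketch can make "statistical probability of the next token" concrete. The candidate tokens and scores below are made up for illustration; a real model scores a vocabulary of many thousands of tokens.

```python
import math
import random

def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for the token that follows "The TP53 gene encodes a tumor ..."
candidates = ["suppressor", "marker", "banana"]
logits = [5.0, 2.0, -3.0]

probs = softmax(logits)
for token, p in zip(candidates, probs):
    print(f"{token}: {p:.4f}")

# The model samples from this distribution; it does not look anything up.
random.seed(0)
next_token = random.choices(candidates, weights=probs, k=1)[0]
print("sampled:", next_token)
```

The likely token usually wins, but nothing in this process checks whether the resulting sentence is true.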
Context Windows
- Context windows are the model’s short-term memory
- Dictates how much text (papers, code, data schemas) you can feed it at once
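As a rough sketch of what a context limit means in practice, the snippet below estimates token counts and splits a long text into pieces that fit. The 4-characters-per-token ratio is an assumed rule of thumb, not a real tokenizer, and the 1000-token budget is hypothetical.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate; chars_per_token is an assumed rule of thumb."""
    return int(len(text) / chars_per_token) + 1

def split_to_fit(text, max_tokens, chars_per_token=4.0):
    """Split text into chunks whose estimated size stays within max_tokens."""
    max_chars = int((max_tokens - 1) * chars_per_token)
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

# Stand-in for a long methods section
methods_section = "Reads were aligned to GRCh38 and counted per gene. " * 200
limit = 1000  # hypothetical context budget in tokens

print("estimated tokens:", estimate_tokens(methods_section))
chunks = split_to_fit(methods_section, limit)
print(len(chunks), "chunks")
```

For real work, use the tokenizer that ships with the model you are calling; counts differ between models.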
Bioinformatics Shift
We are moving from syntax-heavy searching (e.g., Stack Overflow) to natural-language task delegation
We are no longer just coding; we are managing an AI assistant
The Stochastic Parrot
LLMs are prone to hallucinations (generating highly plausible but entirely false information)
This is especially dangerous with:
- Highly specialized biological literature
- Obscure gene aliases
- Complex statistical assumptions
Verification is mandatory
- An LLM might invent a paper that doesn’t exist or confidently recommend the wrong normalization method for bulk RNA-seq data
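One way to make verification routine is to check model output against a trusted source before using it. The sketch below sorts gene symbols into approved, alias, and unverified. The two small lookup tables are illustrative stand-ins for an authoritative resource (for example, a downloaded HGNC table), and `check_gene_symbols` is a hypothetical helper.

```python
# Stand-ins for an authoritative resource; not a real database.
APPROVED_SYMBOLS = {"TP53", "BRCA1", "EGFR"}
KNOWN_ALIASES = {"P53": "TP53", "ERBB1": "EGFR"}

def check_gene_symbols(symbols):
    """Sort LLM-suggested symbols into approved, alias, and unverified."""
    report = {"approved": [], "alias": [], "unverified": []}
    for raw in symbols:
        # Upper-casing is a simplification; some real symbols are case-sensitive.
        symbol = raw.strip().upper()
        if symbol in APPROVED_SYMBOLS:
            report["approved"].append(symbol)
        elif symbol in KNOWN_ALIASES:
            report["alias"].append((symbol, KNOWN_ALIASES[symbol]))
        else:
            report["unverified"].append(symbol)
    return report

llm_output = ["TP53", "p53", "BRCA9"]  # hypothetical model output
report = check_gene_symbols(llm_output)
print(report)
```

Anything in the unverified list needs a manual look before it goes into an analysis.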
Prompt Engineering Principles
- Engineering
- Build something to solve a problem
- Test and evaluate the solution
- Learn and iterate
The Anatomy of a Prompt
A robust prompt provides boundaries. Use the RTCF framework:
Role: Who is the AI?
“Act as an expert biostatistician and Python developer…”
Task: What exactly needs to be done?
“…write a script to normalize this dataset…”
Context: What is the background?
“…the data is from a bulk RNA-seq experiment with heavy batch effects. I am using an Ubuntu environment…”
Format: How should the output look?
“…output only the code with inline comments, and provide a brief explanation of the statistical assumptions below.”
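If you reuse the same structure often, the four parts can be kept in a small template. This is a minimal sketch; `RTCFPrompt` and its field names are hypothetical and not part of any library.

```python
from dataclasses import dataclass

@dataclass
class RTCFPrompt:
    role: str
    task: str
    context: str
    output_format: str

    def render(self):
        """Join the four parts into one prompt, in RTCF order."""
        parts = [
            ("Role", self.role),
            ("Task", self.task),
            ("Context", self.context),
            ("Format", self.output_format),
        ]
        return "\n".join(f"{label}: {text}" for label, text in parts)

prompt = RTCFPrompt(
    role="Act as an expert biostatistician and Python developer.",
    task="Write a script to normalize this dataset.",
    context=(
        "The data is from a bulk RNA-seq experiment with heavy batch effects. "
        "I am using an Ubuntu environment."
    ),
    output_format=(
        "Output only the code with inline comments, and provide a brief "
        "explanation of the statistical assumptions below."
    ),
)
print(prompt.render())
```

Keeping the parts separate makes it easy to change only the context or format between runs.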
Clarity and Constraints
Positive vs. Negative Constraints
Tell the model what to do, but also explicitly what not to do.
“Do not use base R plotting; strictly use ggplot2.”
Iterative Refinement
- Treat prompting as a conversation
- If the first output is close but flawed you have two choices
- Reply with the error message or specific adjustments
- Edit and fix the output yourself
- (you almost never want to start over)
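The "reply with the error message" option can be sketched as a small loop step: run the draft, capture the exact error, and build the follow-up message. Both helper functions are hypothetical, and the draft code is a made-up example with a deliberate bug.

```python
def run_and_capture(code):
    """Execute a code string and return the error text, or None on success.

    Only run generated code like this inside a sandbox or throwaway environment.
    """
    try:
        exec(code, {})
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None

def follow_up_prompt(code, error):
    """Build a reply that gives the model the failing code and the exact error."""
    return (
        "The script below failed. Fix only the bug and keep everything else unchanged.\n\n"
        f"Code:\n{code}\n\nError:\n{error}"
    )

# Hypothetical first draft from the model; BRCA1 is missing from the dict.
draft = "counts = {'TP53': 10}\nprint(counts['BRCA1'])"

error = run_and_capture(draft)
if error is not None:
    print(follow_up_prompt(draft, error))
```

Sending the exact error text usually works better than describing the problem in your own words.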
Advanced Prompting Techniques
Zero-shot prompting, few-shot prompting, and chain-of-thought prompting can all help you get better results
Zero-Shot Prompting
- Asking the model to perform a task without giving it prior examples
- This is often where we start
- Good for general coding or translation
Few-Shot Prompting
Providing the model with 1–3 examples of the desired input and output
This is especially helpful when formatting complex biological metadata or ensuring specific naming conventions
# Example
Format this raw clinical data.
Input: Patient 1, Male, 55, High BP
Output: `{"id": "P01", "sex": "M", "age": 55, "hypertension": true}`
Now do the same for the following data...
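When there are many records to format, the few-shot prompt can be assembled programmatically. This is a minimal sketch; `build_few_shot_prompt` is a hypothetical helper, and the second example record is invented to show the negative case.

```python
import json

def build_few_shot_prompt(instruction, examples, new_input):
    """Assemble an instruction, worked examples, and the new input into one prompt."""
    lines = [instruction, ""]
    for raw, structured in examples:
        lines.append(f"Input: {raw}")
        lines.append(f"Output: {json.dumps(structured)}")
        lines.append("")
    lines.append(f"Input: {new_input}")
    lines.append("Output:")
    return "\n".join(lines)

examples = [
    ("Patient 1, Male, 55, High BP",
     {"id": "P01", "sex": "M", "age": 55, "hypertension": True}),
    # Hypothetical second example, added to show the negative case.
    ("Patient 2, Female, 43, Normal BP",
     {"id": "P02", "sex": "F", "age": 43, "hypertension": False}),
]

prompt = build_few_shot_prompt(
    "Format this raw clinical data.",
    examples,
    "Patient 3, Female, 61, High BP",  # simulated record, not real patient data
)
print(prompt)
```

Ending the prompt with an empty "Output:" line signals the model to continue in the same pattern as the examples.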
Chain-of-Thought (CoT)
Instruct the model to explain its reasoning step-by-step before outputting the final answer or code
Often reduces logic errors, because mistakes in the intermediate steps are visible and can be corrected
This is especially useful for complex tasks
“I am planning an R script for a differential expression analysis. Outline the statistical steps required and provide justification for each step”
Bioinformatics-Specific Applications
Code Generation & Translation
- LLMs excel at translating between languages
- We can easily convert a data wrangling script from R (dplyr) to Python (pandas), or translate complex bioinformatics algorithms
When requesting code, specify the exact packages and versions you prefer to avoid deprecated functions
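Rather than typing versions from memory, you can read them from your environment and paste them into the prompt. This sketch uses the standard-library `importlib.metadata`; the package list is only an example, and `describe_environment` is a hypothetical helper.

```python
from importlib import metadata

def describe_environment(packages):
    """Return one line per package with its installed version, or a note if missing."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            lines.append(f"{name} (not installed)")
    return "\n".join(lines)

# Example package list; replace with whatever your project uses.
env = describe_environment(["pandas", "biopython", "numpy"])
prompt = "Use only functions available in these package versions:\n" + env
print(prompt)
```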
Reproducibility & Infrastructure
Use LLMs to generate robust Dockerfiles for containerizing your environments, saving hours of dependency troubleshooting
“Write a Dockerfile that pulls a lightweight Ubuntu base image, installs Python 3.10, R 4.3, and sets up the latest versions of Biopython and DESeq2.”
IDE and Terminal Integration
- Bring the AI to where you work
- Use GitHub Copilot or Gemini Code Assist within your IDE for inline code completion
- Use the GitHub Copilot CLI or Gemini CLI directly in your terminal
Literature & Summarization
- Use LLMs to find research papers
- This shouldn’t be your only way to search the literature, but it should be a part of your workflow
- Perplexity for search
- Connected Papers for finding papers related to a specific publication
- Use LLMs to summarize dense genetics papers or extract specific methodologies
- NotebookLM:
- Answer questions about the paper
- Point you to specific sections of the paper
- Generate an audio discussion of the paper between two people
- Generate visual and video summaries
Ethics, Privacy, and Limitations
Data Security
- Never paste PHI, PII, or other sensitive clinical data into public web interfaces
- Assume anything pasted into a standard, non-enterprise LLM chat can be used as future training data
- Use simulated or fully de-identified data for prompt testing
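Simulated records are easy to generate and carry no privacy risk. The sketch below is one way to do it; the field names mirror the few-shot example earlier in these notes, and the 30% hypertension rate is an arbitrary value chosen for illustration.

```python
import random

def simulate_patients(n, seed=0):
    """Generate fake patient records that contain no real PHI or PII."""
    rng = random.Random(seed)  # fixed seed keeps the test data reproducible
    records = []
    for i in range(1, n + 1):
        records.append({
            "id": f"SIM{i:03d}",
            "sex": rng.choice(["M", "F"]),
            "age": rng.randint(18, 90),
            "hypertension": rng.random() < 0.3,  # arbitrary rate, illustration only
        })
    return records

for record in simulate_patients(3):
    print(record)
```

Once a prompt works on simulated data, run the resulting code on the real data locally instead of pasting the data into a chat.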
Accountability
- The LLM writes the code; you own the science
- Always run the code
- Check edge cases
- Validate scientific claims against primary literature
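A short example of "run the code and check edge cases": suppose the model wrote a counts-per-million (CPM) function. The function below is a hypothetical stand-in for generated code, followed by the kind of checks worth writing before trusting it.

```python
def counts_per_million(counts):
    """Scale raw counts so they sum to one million (CPM)."""
    total = sum(counts)
    if total == 0:
        # Edge case: an empty library would otherwise divide by zero.
        raise ValueError("library size is zero; cannot compute CPM")
    return [c / total * 1_000_000 for c in counts]

# Typical input: values should sum to one million
cpm = counts_per_million([10, 30, 60])
assert abs(sum(cpm) - 1_000_000) < 1e-6
assert abs(cpm[0] - 100_000) < 1e-6

# Edge case: a gene with zero reads stays at zero
assert counts_per_million([0, 5])[0] == 0.0

# Edge case: an all-zero sample must fail loudly instead of returning bad values
try:
    counts_per_million([0, 0])
except ValueError as err:
    print("caught:", err)
```

Passing checks like these show the code runs as intended; whether CPM is the right normalization for your question is still your call.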