17  Prompt Engineering for Bioinformatics

A Practical Guide

Author
Affiliation

Dr Randy Johnson

Hood College

Published

April 2, 2026

Acknowledgements

Gemini was used to draft an outline for these notes and Gemini code assist was active while writing.

Review

  • There is a lot of overlap with today’s material and other discussions we have had in this class
  • At the risk of beating a dead horse, this is important to understand and provides context for today’s discussion

LLMs are not or magic or alive

  • Large Language Models (LLMs) are highly advanced prediction engines
  • They do not think or know anything
  • They calculate the statistical probability of the next word (token) based on the vast amount of text they were trained on

Context Windows

  • Context windows are the model’s short-term memory
  • Dictates how much text (papers, code, data schemas) you can feed it at once

Bioinformatics Shift

  • We are moving from syntax-heavy searching (e.g. stackoverflow) to natural language task delegation

  • We are no longer just coding; we are managing an AI assistant

The Stochastic Parrot

  • LLMs are prone to hallucinations (generating highly plausible but entirely false information)

  • This is especially dangerous with

    • Highly specialized biological literature
    • Obscure gene aliases
    • Complex statistical assumptions
  • Verification is mandatory
    • An LLM might invent a paper that doesn’t exist or confidently recommend the wrong normalization method for bulk RNA-seq data

Prompt Engineering Principles

  • Engineering
    • Build something to solve a problem
    • Test and evaluate the solution
    • Learn and itterate

The Anatomy of a Prompt

A robust prompt provides boundaries. Use the RTCF framework:

  • Role: Who is the AI?

    “Act as an expert biostatistician and Python developer…”

  • Task: What exactly needs to be done?

    “…write a script to normalize this dataset…”

  • Context: What is the background?

    “…the data is from a bulk RNA-seq experiment with heavy batch effects. I am using an Ubuntu environment…”

  • Format: How should the output look?

    “…output only the code with inline comments, and provide a brief explanation of the statistical assumptions below.”

Clarity and Constraints

  • Positive vs. Negative Constraints

  • Tell the model what to do, but also explicitly what not to do.

    “Do not use base R plotting; strictly use ggplot2.”

Iterative Refinement

  • Treat prompting as a conversation
  • If the first output is close but flawed you have two choices
    • Reply with the error message or specific adjustments
    • Fine tune and fix the output yourself
    • (you almost never want to start over)

Advanced Prompting Techniques

Zero-shot prompting, few-shot prompting and chain of thought analysis can all be helpful for getting better results

Zero-Shot Prompting

  • Asking the model to perform a task without giving it prior examples
  • This is often where we start
  • Good for general coding or translation

Few-Shot Prompting

  • Providing the model with 1–3 examples of the desired input and output

  • This is especially helpful when formatting complex biological metadata or ensuring specific naming conventions

# Example
Format this raw clinical data.
Input: Patient 1, Male, 55, High BP
Output: `{"id": "P01", "sex": "M", "age": 55, "hypertension": true}`
Now do the same for the following data...

Chain-of-Thought (CoT)

  • Instruct the model to explain its reasoning step-by-step before outputting the final answer or code

  • Drastically reduces logic errors

  • This is especially useful for complex tasks

    “I am planning an R script for a differential expression analysis. Outline the statistical steps required and provide justification for each step”

Bioinformatics-Specific Applications

Code Generation & Translation

  • LLMs excel at translating between languages
  • We can easily convert a data wrangling script from R (dplyr) to Python (pandas), or translate complex bioinformatics algorithms
Tip

When requesting code, specify the exact packages and versions you prefer to avoid deprecated functions

Reproducibility & Infrastructure

  • Use LLMs to generate robust Dockerfiles for containerizing your environments, saving hours of dependency troubleshooting

    “Write a Dockerfile that pulls a lightweight Ubuntu base image, installs Python 3.10, R 4.3, and sets up the latest versions of Biopython and DESeq2.”

IDE and Terminal Integration

  • Bring the AI to where you work
  • Use GitHub Copilot or Gemini Code Assist within your IDE for inline code completion
  • Use the GitHub Copiloty or Gemini CLI directly in your terminal

Literature & Summarization

  • Use LLMs to find research papers
    • This shouldn’t be your only way to search the literature, but it should be a part of your workflow
    • Perplexity for search
    • Connected Papers for finding papers related to a specific publication
  • Use LLMs to summarize dense genetics papers or extract specific methodologies
    • NotebookLM:
      • Answer questions about the paper
      • Point you to specific sections of the paper
      • Generate an audio discussion of the paper between two people
      • Generate visual and video summaries

Ethics, Privacy, and Limitations

Data Security

  • Never paste PHI, PII, or other sensitive clinical data into public web interfaces
  • Assume anything pasted into a standard, non-enterprise LLM chat can be used as future training data
  • Use simulated or fully de-identified data for prompt testing

Accountability

  • The LLM writes the code; you own the science
    • Always run the code
    • Check edge cases
    • Validate scientific claims against primary literature