I am not proficient at interacting with APIs from Python, so Gemini was used heavily in preparing these notes and examples.
Introduction to Web APIs
Application Programming Interfaces (APIs) allow two software programs to talk to each other
RESTful web APIs allow our local Python scripts to ask remote servers (like NCBI) for data
The Anatomy of a Request
Base URL: The root address of the API (e.g. https://eutils.ncbi.nlm.nih.gov/entrez/eutils)
Endpoint: The specific tool or database we are accessing (e.g. /esearch.fcgi).
Parameters: The specific questions we are asking, passed in the URL after a ? and separated by & (e.g. ?db=nucleotide&term=BRCA1).
Return value: APIs typically return data as JSON (maps directly onto Python dictionaries and lists; the common modern standard) or XML (an older, tag-based format that NCBI uses heavily).
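To make the pieces above concrete, here is a minimal sketch that assembles a full request URL from the base URL, endpoint, and parameters used as examples in this section. It only builds the string and does not contact NCBI.

```python
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"
ENDPOINT = "/esearch.fcgi"
params = {"db": "nucleotide", "term": "BRCA1"}

# urlencode joins key=value pairs with & and escapes special characters
query_string = urlencode(params)

# The ? separates the endpoint from the parameters
full_url = f"{BASE_URL}{ENDPOINT}?{query_string}"
print(full_url)
```

In practice you rarely build the string by hand: requests.get(url, params=params) performs the same encoding for you.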
Python Tools for APIs
The requests library is the standard for making HTTP calls in Python.
Code
import requests

# A simple, non-bioinformatics example to show the mechanics (GitHub API)
# This returns information about a public repository in clean JSON
url = "https://api.github.com/repos/Bioconductor/biocViews"
response = requests.get(url)

# Best Practice: Fail loud and early if the server returns an error (e.g., 404, 403)
# This will stop the script immediately instead of trying to parse empty data
response.raise_for_status()

# If the script makes it to this line, the request was successful
data = response.json()
print(f"Repository Name: {data['name']}")
print(f"Stars: {data['stargazers_count']}")
Repository Name: biocViews
Stars: 4
Try it for yourself
NCBI e-utils
NCBI doesn’t have a single API; it has e-utilities, a suite of server-side programs
ESearch: Finds the unique IDs (UIDs) for your query.
EFetch: Takes those UIDs and downloads the actual data (FASTA, GenBank, etc.).
Step 0: Load relevant libraries
Step 1: eSearch - Find the ID for human BRCA1 in the nucleotide database
Step 2: eFetch - Retrieve the FASTA sequence for that ID
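The three steps above can be sketched roughly as follows. This is one possible layout, not the only way to do it: the function names (parse_id_list, esearch_ids, efetch_fasta), the retmax value, and the 30-second timeout are illustrative choices, and the parser assumes the ESearch XML response lists UIDs as Id elements inside an IdList element.

```python
# Step 0: load the libraries
import xml.etree.ElementTree as ET

import requests

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def parse_id_list(xml_data):
    """Pull the UIDs out of an ESearch XML response (str or bytes)."""
    root = ET.fromstring(xml_data)
    return [id_node.text for id_node in root.findall("./IdList/Id")]


def esearch_ids(term, db="nucleotide", retmax=1):
    """Step 1: ask ESearch for the UIDs that match a query."""
    params = {"db": db, "term": term, "retmax": retmax}
    response = requests.get(f"{BASE_URL}/esearch.fcgi", params=params, timeout=30)
    response.raise_for_status()  # fail early on an HTTP error
    return parse_id_list(response.content)


def efetch_fasta(uid, db="nucleotide"):
    """Step 2: ask EFetch for the FASTA record behind one UID."""
    params = {"db": db, "id": uid, "rettype": "fasta", "retmode": "text"}
    response = requests.get(f"{BASE_URL}/efetch.fcgi", params=params, timeout=30)
    response.raise_for_status()
    return response.text
```

To chain the two steps, call esearch_ids("BRCA1[Gene] AND Homo sapiens[Organism]") and pass the first returned ID to efetch_fasta. Which record comes back first depends on the current state of the database, so inspect the FASTA header before relying on it.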
Expanding the Toolkit & Best Practices
NCBI relies heavily on XML and E-utilities
Other databases have highly structured JSON REST APIs
The RCSB PDB API is excellent for programmatically querying 3D macromolecular structural data
This allows us to fetch metadata about binding sites or resolution without downloading the whole .pdb file
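As a rough illustration of fetching metadata without the structure file, the sketch below requests one entry from the RCSB Data API and picks out a few fields. The endpoint path and the field names (rcsb_id, struct.title, rcsb_entry_info.resolution_combined) are assumptions based on the public entry schema, so check them against the current RCSB documentation; summarize_entry and fetch_entry are hypothetical helper names.

```python
import requests

# Assumed endpoint for per-entry metadata in the RCSB Data API
RCSB_ENTRY_URL = "https://data.rcsb.org/rest/v1/core/entry"


def summarize_entry(entry):
    """Pick a few metadata fields out of an entry record (assumed field names)."""
    info = entry.get("rcsb_entry_info", {})
    return {
        "id": entry.get("rcsb_id"),
        "title": entry.get("struct", {}).get("title"),
        "resolution": info.get("resolution_combined"),
    }


def fetch_entry(pdb_id):
    """Download the JSON metadata for one PDB entry, e.g. '4HHB'."""
    response = requests.get(f"{RCSB_ENTRY_URL}/{pdb_id}", timeout=30)
    response.raise_for_status()
    return response.json()
```

Calling summarize_entry(fetch_entry("4HHB")) would return a small dictionary. Missing fields come back as None rather than raising an error, which is convenient when looping over many entries.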
Workflow integration
We don’t typically use APIs for one-off scripts
API pull-scripts can be modularized and integrated into larger, reproducible computational pipelines (e.g. Snakemake)
The API script serves as the first rule to gather the raw data
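Version-controlling the scripts with Git makes the data-gathering code reproducible; the remote database can still change between runs, so record the query and the retrieval date alongside the data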
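One way to make a pull-script pipeline-friendly is to give it explicit inputs and outputs on the command line, so a workflow rule can call it and track the file it produces. The sketch below assumes that layout; the --uid and --output flags and the function names are illustrative choices, not a Snakemake requirement.

```python
import argparse
from pathlib import Path

import requests

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def parse_args(argv=None):
    """Read the UID to download and the output path from the command line."""
    parser = argparse.ArgumentParser(description="Fetch one record and save it.")
    parser.add_argument("--uid", required=True, help="NCBI UID to download")
    parser.add_argument("--output", required=True, type=Path, help="file to write")
    return parser.parse_args(argv)


def fetch_fasta(uid):
    """Download one FASTA record from the nucleotide database."""
    params = {"db": "nucleotide", "id": uid, "rettype": "fasta", "retmode": "text"}
    response = requests.get(EFETCH_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.text


def save_record(text, output_path):
    """Write the record, creating the output directory if needed."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(text)
    return output_path


def main(argv=None):
    args = parse_args(argv)
    save_record(fetch_fasta(args.uid), args.output)

# In a real script, finish with the usual guard:
#   if __name__ == "__main__":
#       main()
```

A workflow rule would then declare the --output path as its output file and run the script in its shell or script section, so downstream rules only start once the raw data exists.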
API Keys & Rate Limits
NCBI allows 3 requests per second without an API key, and 10 with one
Hardcoding time.sleep(0.35) is a safe baseline for unauthenticated requests
Many other APIs require an API key for any access; NCBI's key is optional and is passed as the api_key parameter
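A minimal sketch of a rate-limited loop, assuming the 0.35-second baseline mentioned above. The helper name fetch_all is hypothetical, and fetch_one stands for whatever function performs a single request (for example, an EFetch call).

```python
import time

UNAUTHENTICATED_DELAY = 0.35  # seconds; keeps us under 3 requests per second


def fetch_all(uids, fetch_one, delay=UNAUTHENTICATED_DELAY):
    """Call fetch_one(uid) for each UID, pausing between consecutive calls."""
    results = {}
    for index, uid in enumerate(uids):
        if index > 0:
            time.sleep(delay)  # wait before every call except the first
        results[uid] = fetch_one(uid)
    return results
```

Passing the request function in as an argument keeps the pacing logic separate from the network code, so the loop can be tested with a fake function and delay=0. With an API key (10 requests per second), a delay slightly above 0.1 seconds would be the equivalent baseline.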
Review Questions
Why is it important to check the HTTP status code (for example, by using response.raise_for_status()) before attempting to parse the data returned by an API?
What could happen if you skip this step?
If your goal is to download the FASTA sequence for the human BRCA1 gene, why can’t you simply send the term “BRCA1” directly to the eFetch utility?
Why is eSearch a necessary first step?
What is rate limiting? Explain why omitting a command like time.sleep() in a loop of API calls is considered bad practice and how it might impact your script or the server.
Construct a request to the NCBI ESearch endpoint. Pass parameters to search the nucleotide database for the term “TP53[Gene] AND mouse[Organism]”.