15 Database Design

Author

Affiliation

Dr Randy Johnson

Hood College

Published

March 19, 2026

Acknowledgements

We’ll be using the gapminder data set, curated by Jennifer Bryan in this module. Some of you may be familiar with the gapminder package in R - this is the same data set.

I am not proficient at interacting with SQL databases from python, and Gemini was used heavily in preparing these notes and examples.

Introduction to Databases

What are some different ways we can store data?
Flat files are great for small, static datasets but they have issues when
- Data get massive
- Multiple users need to edit data simultaneously
- We need strict access rules (data integrity)

Relational databases

Data isn’t dumped into one massive spreadsheet
Neatly organized into specialized tables (like tabs in Excel)
Tables are related to one another

Example relational databases

SQLite (we’ll use this today): serverless, runs locally in a file
PostgreSQL/MySQL: runs on a server, handles heavy web traffic

What is SQL?

Structured Query Language

The universal language we use to “talk” to relational databases
Reads almost like plain English

Basics of Database Design

Table: A collection of related data (e.g. countries).
- Row (Record): One single entry (e.g. “Canada”).
- Column (Field): A specific attribute (e.g. population).

Data Types

Databases are strict (if a column is an Integer, you cannot put the word “Five” in it)

INTEGER
TEXT
REAL (float/double)
BOOLEAN (logical)

Keys and Relationships

Relationships between tables are defined by key fields

Primary Keys

A unique ID for every row (e.g. student ID number)
Guarantees we can find a specific record even if two people have the exact same name

Foreign Keys

A column in one table that points to a primary key in another table
This is how relationships are formed

Normalization

Avoid data duplication and inconsistencies
Rule of thumb: If you find yourself typing the exact same text over and over in a column (like “Asia” or “Europe”), it probably deserves its own table

Exercise: Normalizing Gapminder

We will be working with the Gapminder dataset today. It has columns: country, continent, year, lifeExp, pop, gdpPercap.

Duplicated data

We have 12 observations spread over 60 years for Afghanistan
We have to type (and store) ‘Afghanistan’ 12 times
There are and 33 countries in Asia, meaning we need to repeat ‘Asia’ 396 times

Normalization

How can we break this single flat file into multiple relational tables?

Table 1: continents
- id (Primary Key)
- name (e.g. Asia, Europe)
Table 2: countries
- id (Primary Key)
- name (e.g. Afghanistan)
- continent_id (Foreign Key pointing to continents.id)
Table 3: observations
- id (Primary Key)
- country_id (Foreign Key pointing to countries.id)
- year
- life_exp
- population
- gdp_per_cap

gapminder.db

Today we’ll be working with a normalized database containing Gapminder data (DBeaver is a nice GUI for working with SQL databases)

Basic SQL Queries (CRUD)

You’ll often see “CRUD” operations mentioned in the context of databases

Create
Read
Update
Delete

Creating & Adding Data

Add coalitions

Scenario: we want to include groups of nations in our database (e.g. European Union, NAFTA, NATO)

We will need a new table for coalitions
Since it is possible for a single country to be part of more than one coalition, we also need a new table to relate countries to coallitions
This is called a many-to-many relationship

CREATE TABLE coalitions (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE coalitions_contries (id INTEGER PRIMARY KEY,
                                  FOREIGN KEY (coalition_id) REFERENCES coalitions (id),
                                  FOREIGN KEY (country_id) REFERENCES countries (id));

Add a country

Scenario: a brand new country has formed, and we want to add it to our database

INSERT INTO countries (name, continent_id)
  VALUES ('Atlantis', 2); -- 2 is Europe's ID

Retrieving Data (Read)

This is the most common use of SQL for data science.

Read an entire table

Scenario: read the entire continents table from the database

Read a slice of a table

Scenario: read year and life_exp for Canada in 2007

Sorting and Limiting

Scenario: Display the top 5 highest GDPs per capita in the database

Fetching data from multiple tables

Scenario: We know country_id = 72 had a high GDP per capita, but who is country 72?

Practical Exercises

Find all observations where life expectancy is greater than 80

Find the population of country ID 24 in the year 1992

Updating and Deleting Data (Update & Delete)

We typically won’t want to update source data in the database, but if we do…

Fix typo in database

Scenario: we used the wrong name when we created our new country

Deleting Rows

Scenario: After giving it some thought, let’s remove New Atlantis from the database

Always use WHERE when updating and deleting

If you don’t use a WHERE clause when updating or deleting data, you could end up changing more than you want. For example, this code will change the name of every country to “New Atlantis”:

UPDATE countries
  SET name = "New Atlantis"

Connecting to SQL from Python

We won’t typically work with databases interactively like this
SQL queries will typically be scripted / automated
We can use the built-in sqlite3 library to execute our queries directly from python

Live Coding Exercise

Open up a new pytyon script in Positron and create a script to list the 3 countries with the highest life expectancy in 2007 using the starter code provided.

SQL injection

What happens when we hook up our database to a website and allow users to display information by year and someone enters this for the year?

2007; DROP TABLE countries;

Always use the ? syntax modeled in the starte code to make your code more secure.

# set up environment and connect to the database
import urllib.request
import sqlite3

# download the gapminder database
url = "https://github.com/BIFX547-26/gapminder/raw/refs/heads/main/inst/extdata/gapminder.db"
urllib.request.urlretrieve(url, "gapminder.db")

# Execute a query

# --- Best Practice: Parameterized Queries ---
# We NEVER use f-strings for SQL
# We use the '?' placeholder.
user_input = 72
cursor.execute("""
  SELECT countries.name, observations.year, observations.gdp_per_cap
    FROM observations
    INNER JOIN countries ON observations.country_id = countries.id
    WHERE countries.id = ?
    ORDER BY observations.gdp_per_cap DESC
    LIMIT 5;""",
  (user_input,))
# Notice the tuple here!

print("\nTop 3 Life Expectancies in 2007:")
for row in cursor.fetchall():
    print(f"{row[0]}: {row[1]:.1f} years")

# 5. Close connection when done
conn.close()

Assignment

Complete the code above and submit your script on Blackboard.