32  Working with local LLMs

Author
Affiliation

Dr Randy Johnson

Hood College

Published

April 23, 2026

This exercise will help you practice working with a local LLM.

System requirements

We’ll be comparing what we can do on the following systems:

  • The Hood cluster
  • Google Cloud Shell
  • TACC

Command line arguments

The following Linux command line arguments are useful for checking and monitoring your computer’s resources:

  • free -h will summarize installed memory and how much memory is currently being used
  • top will give you a live summary of processes running on your computer and how many resources each is using
  • lscpu will give you a summary of installed processor(s)
  • lspci | grep -i vga will identify the GPU(s) you have installed

Try running these commands on the Hood cluster and the Google Cloud Shell to compare what we have running on each of these services. See the Hood cluster instructions if you need a refresher on how to log in.

# Hood cluster memory
free -h

Indicates that we have 256Gb of memory on cn003 and 128Gb of memory on cn004.

# Hood cluster processors
lscpu
Architecture:                x86_64
  CPU op-mode(s):            32-bit, 64-bit
  Address sizes:             40 bits physical, 48 bits virtual
  Byte Order:                Little Endian
CPU(s):                      16
  On-line CPU(s) list:       0-15
Vendor ID:                   GenuineIntel
  Model name:                Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
    CPU family:              6
    Model:                   44
    Thread(s) per core:      2
    Core(s) per socket:      4
    Socket(s):               2
.
.
.

If you look up the model information, you’ll see that these chips are quite old. We have:

  • 2 Sockets
  • 4 Cores per socket
  • 2 Threads per core

giving us a total of 16 CPUs

# Hood cluster gpu
lspci | grep -i vga
06:03.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 (rev 0a)

This indicates we have only enough graphics processing power to connect a monitor to the server. We won’t be doing any GPU computations on this machine, but we have sufficient compute power to run some decent models on the CPUs.

Ollama

Ollama is the runner that we’ll be using for our models. You can install it at /usr/local/bin/ollama on Linux or MacOS with the following command:

# do not run this on the server, as it is already installed
curl -fsSL https://ollama.com/install.sh | sh

We’ll try starting up llama3.2 with the following command:

# again, don't run this on the Hood cluster
# if we all try this at once, it'll cause problems
ollama run llama3.2:3b

When we are done, we’ll exit the interactive session with the /bye command.

Model breakdown

There are a bunch of models available on Ollama.com. The model we are going to try first is llama3.2:3b.

  • llama is the name of a model developed by Meta
  • 3.2 is the model version, which is the latest and greatest small model (as of Apr ’26)
  • 3b specifies that we want to use the model with 3 billion parameters. Models compressed to use fewer parameters require fewer system resources to run (see the list of models linked above for options).

The 3 billion parameter model works fine on the Hood cluster, but it is a little on the slow side. Next, we’ll try running a smaller model with

ollama run llma3.2:1b

It still isn’t lighning fast, but it is much faster.

Running an API service

Our server can only support one running instance of these large models at a time. What if we all want to access the model concurrently?

Using a system service will allow us all to use it together.

# start the service
sudo systemctl start ollama

# stop the service
sudo systemctl stop ollama

Once this has been started up, try using the python API to connect to it (we will be using the cn003 node):

import ollama

response = ollama.chat(model='llama3.2:1b', messages=[{'role': 'user', 'content': 'Hi'}])
print(response.message.content)

You can build your own user interface for others to use or even create a custom AI agent that will use this model to make decisions and do work for you.

Other models

Take some time to browse through the available models on Ollama.com. Some that I’m familiar with include:

  • llama3.1, is a new state-of-the-art model from meta available in various parameter sizes
  • gemma3n, by Google is designed for efficient execution on everyday devices such as laptops, tablets or phones
  • deepseq-r1 is a family of open reasoning models with performance approaching that of leading models, such as O3 and Gemini 2.5 Pro

Comparison with TACC

The Texas Advanced Computing Center (TACC) has much more up-to-date compute nodes available, and we can schedule time on those compute nodes as needed. You can learn a little more about TACC here.

Setup

I’ll be using the PVC queue on TACC, which has Intel Max 1550 GPUs. Ollama only works with Nvidia GPUs out of the box, but Intel Analytics has a nice Docker container that bridges this gap and makes it fairly easy to utilize their GPUs.

To start with, we’ll log into something other than the head node and download / process the Docker container. Docker is not supported on most HPC systems, but Apptainer is and will import Docker images.

# start up an interactive job on the pvc partition for an hour
idev -p pvc -t 01:00:00

# the $WORK directory has much higher storage quota, so we'll store our big files there
cd $WORK
module load tacc-apptainer
apptainer pull docker://intelanalytics/ipex-llm-inference-cpp-xpu:latest # update to latest

Ollama really tries hard to store model information in our home directory. Since this can be quite large, we’ll excede our quota quickly if we don’t tell it to store them somewhere else. This is only necessary on TACC - you shouldn’t need to do this on other systems.

# make sure it knows where to put/look for models
export OLLAMA_MODELS=$WORK/ollama_models
mkdir -p $OLLAMA_MODELS # only need to do this once
export HOME=$WORK # force it to use $WORK even if it tries to put it in ~

Now we are ready to download and process any models we want to use.

# We will run Ollama in the container we just set up
apptainer shell $WORK/ipex-llm-inference-cpp-xpu_latest.sif

### inside apptainer ###
# Initialize the Intel GPU environment
source ipex-llm-init --gpu --device Max # you may need to pick a different device on other systems

# Start Ollama (Intel provides the specialized start script)
bash /llm/scripts/start-ollama.sh

# hit return - it should be running in the background

# download and start up the model!! (running in $WORK)
./ollama run llama3.1:70b
TipVerify Ollama is running

If you want to be sure Ollama is running, you can check with

ps -ef | grep ollama

Also, if you want to clean things up before logging out, you can kill the Ollama process with

pkill ollama # use this to kill when done - not usually necessary, though

Next time

Next time you log in and want to pick up where you left off, we can skip many of the setup steps.

# start up an interactive job on the pvc partition for an hour
idev -p pvc -t 01:00:00

# the $WORK directory has much higher storage quota, so we'll store our big files there
cd $WORK
module load tacc-apptainer

# make sure it knows where to put/look for models
export OLLAMA_MODELS=$WORK/ollama_models
export HOME=$WORK # force it to use $WORK even if it tries to put it in ~

# We will run Ollama in the container we previously set up
apptainer shell $WORK/ipex-llm-inference-cpp-xpu_latest.sif

### inside apptainer ###
# Initialize the Intel GPU environment
source ipex-llm-init --gpu --device Max # you may need to pick a different device on other systems

# Start Ollama (Intel provides the specialized start script)
bash /llm/scripts/start-ollama.sh

# hit return - it should be running in the background

# Start up the model!! (running in $WORK)
./ollama run llama3.1:70b

Assignment

Take a screenshot of your python code running on the Hood cluster with some output from Ollama and submit it on Blackboard.