32 Working with local LLMs
This exercise will help you practice working with a local LLM.
System requirements
We’ll be comparing what we can do on the following systems:
- The Hood cluster
- Google Cloud Shell
- TACC
Command line arguments
The following Linux command line arguments are useful for checking and monitoring your computer’s resources:
free -hwill summarize installed memory and how much memory is currently being usedtopwill give you a live summary of processes running on your computer and how many resources each is usinglscpuwill give you a summary of installed processor(s)lspci | grep -i vgawill identify the GPU(s) you have installed
Try running these commands on the Hood cluster and the Google Cloud Shell to compare what we have running on each of these services. See the Hood cluster instructions if you need a refresher on how to log in.
# Hood cluster memory
free -hIndicates that we have 256Gb of memory on cn003 and 128Gb of memory on cn004.
# Hood cluster processors
lscpuArchitecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 40 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
CPU family: 6
Model: 44
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 2
.
.
.
If you look up the model information, you’ll see that these chips are quite old. We have:
- 2 Sockets
- 4 Cores per socket
- 2 Threads per core
giving us a total of 16 CPUs
# Hood cluster gpu
lspci | grep -i vga06:03.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 (rev 0a)
This indicates we have only enough graphics processing power to connect a monitor to the server. We won’t be doing any GPU computations on this machine, but we have sufficient compute power to run some decent models on the CPUs.
Ollama
Ollama is the runner that we’ll be using for our models. You can install it at /usr/local/bin/ollama on Linux or MacOS with the following command:
# do not run this on the server, as it is already installed
curl -fsSL https://ollama.com/install.sh | shWe’ll try starting up llama3.2 with the following command:
# again, don't run this on the Hood cluster
# if we all try this at once, it'll cause problems
ollama run llama3.2:3bWhen we are done, we’ll exit the interactive session with the /bye command.
Model breakdown
There are a bunch of models available on Ollama.com. The model we are going to try first is llama3.2:3b.
llamais the name of a model developed by Meta3.2is the model version, which is the latest and greatest small model (as of Apr ’26)3bspecifies that we want to use the model with 3 billion parameters. Models compressed to use fewer parameters require fewer system resources to run (see the list of models linked above for options).
The 3 billion parameter model works fine on the Hood cluster, but it is a little on the slow side. Next, we’ll try running a smaller model with
ollama run llma3.2:1bIt still isn’t lighning fast, but it is much faster.
Running an API service
Our server can only support one running instance of these large models at a time. What if we all want to access the model concurrently?
Using a system service will allow us all to use it together.
# start the service
sudo systemctl start ollama
# stop the service
sudo systemctl stop ollamaOnce this has been started up, try using the python API to connect to it (we will be using the cn003 node):
import ollama
response = ollama.chat(model='llama3.2:1b', messages=[{'role': 'user', 'content': 'Hi'}])
print(response.message.content)You can build your own user interface for others to use or even create a custom AI agent that will use this model to make decisions and do work for you.
Other models
Take some time to browse through the available models on Ollama.com. Some that I’m familiar with include:
- llama3.1, is a new state-of-the-art model from meta available in various parameter sizes
- gemma3n, by Google is designed for efficient execution on everyday devices such as laptops, tablets or phones
- deepseq-r1 is a family of open reasoning models with performance approaching that of leading models, such as O3 and Gemini 2.5 Pro
Comparison with TACC
The Texas Advanced Computing Center (TACC) has much more up-to-date compute nodes available, and we can schedule time on those compute nodes as needed. You can learn a little more about TACC here.
Setup
I’ll be using the PVC queue on TACC, which has Intel Max 1550 GPUs. Ollama only works with Nvidia GPUs out of the box, but Intel Analytics has a nice Docker container that bridges this gap and makes it fairly easy to utilize their GPUs.
To start with, we’ll log into something other than the head node and download / process the Docker container. Docker is not supported on most HPC systems, but Apptainer is and will import Docker images.
# start up an interactive job on the pvc partition for an hour
idev -p pvc -t 01:00:00
# the $WORK directory has much higher storage quota, so we'll store our big files there
cd $WORK
module load tacc-apptainer
apptainer pull docker://intelanalytics/ipex-llm-inference-cpp-xpu:latest # update to latestOllama really tries hard to store model information in our home directory. Since this can be quite large, we’ll excede our quota quickly if we don’t tell it to store them somewhere else. This is only necessary on TACC - you shouldn’t need to do this on other systems.
# make sure it knows where to put/look for models
export OLLAMA_MODELS=$WORK/ollama_models
mkdir -p $OLLAMA_MODELS # only need to do this once
export HOME=$WORK # force it to use $WORK even if it tries to put it in ~Now we are ready to download and process any models we want to use.
# We will run Ollama in the container we just set up
apptainer shell $WORK/ipex-llm-inference-cpp-xpu_latest.sif
### inside apptainer ###
# Initialize the Intel GPU environment
source ipex-llm-init --gpu --device Max # you may need to pick a different device on other systems
# Start Ollama (Intel provides the specialized start script)
bash /llm/scripts/start-ollama.sh
# hit return - it should be running in the background
# download and start up the model!! (running in $WORK)
./ollama run llama3.1:70bIf you want to be sure Ollama is running, you can check with
ps -ef | grep ollamaAlso, if you want to clean things up before logging out, you can kill the Ollama process with
pkill ollama # use this to kill when done - not usually necessary, thoughNext time
Next time you log in and want to pick up where you left off, we can skip many of the setup steps.
# start up an interactive job on the pvc partition for an hour
idev -p pvc -t 01:00:00
# the $WORK directory has much higher storage quota, so we'll store our big files there
cd $WORK
module load tacc-apptainer
# make sure it knows where to put/look for models
export OLLAMA_MODELS=$WORK/ollama_models
export HOME=$WORK # force it to use $WORK even if it tries to put it in ~
# We will run Ollama in the container we previously set up
apptainer shell $WORK/ipex-llm-inference-cpp-xpu_latest.sif
### inside apptainer ###
# Initialize the Intel GPU environment
source ipex-llm-init --gpu --device Max # you may need to pick a different device on other systems
# Start Ollama (Intel provides the specialized start script)
bash /llm/scripts/start-ollama.sh
# hit return - it should be running in the background
# Start up the model!! (running in $WORK)
./ollama run llama3.1:70bAssignment
Take a screenshot of your python code running on the Hood cluster with some output from Ollama and submit it on Blackboard.