Sunday, December 28, 2025

Testing the AI workstation I built running Phi-4 with both CPU and GPU acceleration


Note: this article was originally going to be posted in March, but because of issues with the setup and other things going on at the time, it was never published until now.

For the system I built and posted about earlier, I've finally purchased a pair of graphics cards so that I can run the AI models I want to try with GPU acceleration.  I bought two AMD Radeon Pro V340L cards for this purpose: they are supported by ROCm and ollama, and they are relatively cheap (I picked two up for $50 each).  Since they are intended for virtualization, providing accelerated 3D graphics in a VDI environment, they are probably not the best cards for numerical acceleration, but they were likely the most cost-effective option.

If these don't perform well, I will likely move on to a pair of AMD Instinct MI50 cards, which are about double the cost per card, though I would really like something with 32GB or more RAM per card and to run 2 or 3 of them if possible. I'm really itching to try out some larger but not impossible-to-run models like Llama 3.3 70B, but you're talking a minimum of $500 for an MI60 with 32GB per card ... not in the current budget unless I find something truly useful to do with it that could earn a little cash on the side.
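For a rough sense of why 32GB per card matters: at 4-bit quantization a model needs about half a byte per parameter, before accounting for the KV cache and other runtime overhead. A quick back-of-the-envelope check for a 70B model:

```shell
# Approximate weight memory for a 70B-parameter model quantized to 4 bits
# (0.5 bytes per parameter); KV cache and runtime overhead come on top.
awk 'BEGIN { printf "%.0f GB\n", 70e9 * 0.5 / 1e9 }'
```

So a single 32GB card won't hold the weights, but two or three of them would, which is why the 2-3 card plan is attractive.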

I'm going to post a few benchmarks using Microsoft's Phi-4, a 14B-parameter model intended to be runnable on consumer hardware.  I get not-horrible performance using the CPU alone to process the model.  With the prompt "write a short story about a girl, her dog and a journey" (to make sure it generates a reasonable amount of output for testing) I get the following.

total duration:       2m51.219695186s
load duration:        33.998953ms
prompt eval count:    23 token(s)
prompt eval duration: 1.735s
prompt eval rate:     13.26 tokens/s
eval count:           735 token(s)
eval duration:        2m49.449s
eval rate:            4.34 tokens/s
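As a quick sanity check on these numbers, the eval rate ollama reports is just the eval count divided by the eval duration (735 tokens over 169.449 seconds here):

```shell
# Recompute the reported eval rate: 735 tokens / 2m49.449s.
awk 'BEGIN { printf "%.2f tokens/s\n", 735 / 169.449 }'
```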

So 4.3 tokens/s is not exactly a speed demon; it generates output at a rate well under my reading speed, but depending on the application it might be usable, especially with a smaller model.  Running the much smaller Llama 3.2 3B model on the CPU with the same style of prompt gives the following.

total duration:       35.587415648s
load duration:        57.451041ms
prompt eval count:    38 token(s)
prompt eval duration: 679ms
prompt eval rate:     55.96 tokens/s
eval count:           658 token(s)
eval duration:        34.848s
eval rate:            18.88 tokens/s

So 18.9 tokens/s, which is a usable rate; if the task you're performing can be handled by such a small model, you could get away with running it on the CPU alone.

Installing AMD ROCm drivers to allow for GPU-accelerated AI under Linux

AMD provides the following page with instructions on installing the ROCm drivers for Linux; you can use the Ubuntu 24.04 instructions for Linux Mint 22 as that's what it's based on:

Install Radeon software for Linux with ROCm — Use ROCm on Radeon GPUs

I've installed the two Radeon Pro V340L cards I purchased; each features two AMD Vega 56 GPUs with 8GB of HBM2 apiece.  Each GPU is roughly on par with a GTX 1050, so they are not terribly powerful individually, but at just $100 total for both cards they are an inexpensive starting point and a good way to get an idea of how scaling works across multiple GPUs.

So, going back to our benchmark using Microsoft's Phi-4 model, with the same prompt "write a short story about a girl, her dog and a journey" (to make sure it generates a reasonable amount of output), running on the GPUs I get the following.

 

total duration:       43.088122883s
load duration:        39.01255ms
prompt eval count:    23 token(s)
prompt eval duration: 2.589s
prompt eval rate:     8.88 tokens/s
eval count:           656 token(s)
eval duration:        40.457s
eval rate:            16.21 tokens/s

So the cards are only about 3.7x faster than the CPU, and it looks like spreading the 14B-parameter Phi-4 model across all four GPUs costs some efficiency: I'm only seeing 50% or less utilization on each GPU, most likely because of communication overhead.
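The speed-up figure comes straight from the ratio of the two eval rates (the four-GPU run at 16.21 tokens/s versus the earlier CPU-only run at 4.34 tokens/s):

```shell
# Speed-up of the 4-GPU Phi-4 run over the CPU-only run.
awk 'BEGIN { printf "%.1fx\n", 16.21 / 4.34 }'
```

Against four GPUs you'd naively hope for something closer to linear scaling, which is what makes the low per-GPU utilization suspicious.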

Running a model like Llama 3.2, a 3B-parameter model that fits easily on one GPU, should offer better performance. Besides requiring fewer calculations thanks to the smaller parameter count, it's all on one GPU, so the communication overhead should be minimal.


We do get 100% GPU usage for a short period while it processes the request, and the throughput is higher than the CPU-only run, but not nearly as high as expected.

total duration:       12.223336862s
load duration:        58.862989ms
prompt eval count:    38 token(s)
prompt eval duration: 66ms
prompt eval rate:     575.76 tokens/s
eval count:           586 token(s)
eval duration:        12.093s
eval rate:            48.46 tokens/s

This is only about 2.5x as fast as the CPU version's 18.9 tokens/s; not quite the speed-up I was expecting.
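The same ratio check for the single-GPU Llama 3.2 run against its CPU-only counterpart:

```shell
# Speed-up of the single-GPU Llama 3.2 run (48.46 tokens/s)
# over the CPU-only run (18.88 tokens/s).
awk 'BEGIN { printf "%.2fx\n", 48.46 / 18.88 }'
```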

Trying the Gemma 2 27B-parameter model with our test prompt, we get the following results.

total duration:       1m2.508463899s
load duration:        103.767848ms
prompt eval count:    22 token(s)
prompt eval duration: 2.504s
prompt eval rate:     8.79 tokens/s
eval count:           621 token(s)
eval duration:        59.898s
eval rate:            10.37 tokens/s

Just for giggles I tried running Llama 3.3 70B, and surprisingly the model ran, but with 64% on the GPUs and 34% on the CPU.  The performance on my standard prompt was fairly glacial, which makes sense seeing that it's a much bigger model running at least partially on the CPU.

total duration:       10m7.537928089s
load duration:        59.888654ms
prompt eval count:    23 token(s)
prompt eval duration: 5.895s
prompt eval rate:     3.90 tokens/s
eval count:           693 token(s)
eval duration:        10m1.581s
eval rate:            1.15 tokens/s

After much investigation, it seems there are some issues with the motherboard design that will limit performance: any two graphics cards will always end up attached to different processors, limiting the bandwidth available to transfer data between the cards, and the slots are all PCIe 3.0 x16.  The best option seems to be limiting myself to one graphics card and to models of roughly 32B parameters or less.


 

Sunday, February 23, 2025

Installing the needed tools to run Large Language Models on my Workstation

This is going to be a quick guide to everything I did to get the tools installed on my workstation build so that I could use it to play with and test various AI models.

The system I'm using is the dual-Xeon system that I built earlier for just this purpose.  It has plenty of available CPU cores (32 threads with hyper-threading) and 256GB of ECC DDR4 RAM to run both virtual machines and Docker containers as needed. As it's more of a workstation than a server, I installed my favorite desktop Linux distribution, Linux Mint 22.1.  It's based on Ubuntu 24.04 LTS, so it's stable, and almost anything you can do on Ubuntu you can do on it with little or no modification of the process.  The advantage, as far as I'm concerned, is that it uses the more traditional Cinnamon desktop environment, which I prefer over what Ubuntu ships.

In summary, the following is what I'm working with:

The first thing needed is a way to run the models themselves and provide API access to the other applications that will be making use of them.  For this I installed and set up ollama, as it's an easy-to-use tool that exposes an OpenAI-compatible API to other tools and can use both the system's CPU and one or more graphics cards for AI acceleration.
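As a sketch of what that API access looks like once ollama is running: the request below targets ollama's OpenAI-compatible chat endpoint on its default port, and llama3.2:3b is just an example model name (use whatever you've pulled).

```shell
# Example request body for ollama's OpenAI-compatible endpoint.
BODY='{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Say hello"}]}'

# To actually send it (requires a running ollama server on the default port):
#   curl http://127.0.0.1:11434/v1/chat/completions \
#     -H "Content-Type: application/json" -d "$BODY"

# Sanity-check that the request body is well-formed JSON.
echo "$BODY" | python3 -m json.tool > /dev/null && echo "request body is valid JSON"
```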

Installing ollama on the system

For installing ollama you can follow the manual method in the following document.
How to install Ollama on Linux (2 easy methods)

1. Update the system

Make sure your system is up to date, as this will help things run more smoothly.

sudo apt update
sudo apt upgrade

2. Install required dependencies

The following will speed up the installation; the install script should install them if they aren't present, but it's best to make sure they are installed.

sudo apt install python3 python3-pip git

  3. Download the ollama installation package

The following command will download a script from the ollama website and install ollama on your system, taking care of most of the details.

curl -fsSL https://ollama.com/install.sh | sh

Verify that the install was successful:

ollama --version

 4. Run and configure Ollama

You should be able to launch ollama as a server with the following command.

ollama serve

You can test that it's running with the following command, which will download, install, and run the Llama 3.2 model and give you verbose statistics after each response at the command prompt.

ollama run --verbose llama3.2:3b

If you run the above command and let it finish downloading the model and then give it the prompt "Write a one-sentence summary of the plot of Cinderella." you should see something like the following.

You can exit out of the ollama shell by typing /bye and hitting enter.

5. Set up to start automatically

Now that we've verified it appears to be working, we should enable the service to start when the system starts up automatically.

sudo systemctl daemon-reload
sudo systemctl enable ollama

That should complete the installation of ollama server on the system and you should be ready to progress to the next step.
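One optional tweak: by default ollama only listens on localhost, so if you ever want another machine to reach its API you'll need to set OLLAMA_HOST. This is a configuration sketch based on ollama's FAQ; the service name and drop-in path are assumptions about what the install script set up, so verify them on your system.

```shell
# Create a systemd drop-in that makes ollama listen on all interfaces.
sudo mkdir -p /etc/systemd/system/ollama.service.d
printf '[Service]\nEnvironment="OLLAMA_HOST=0.0.0.0"\n' | \
  sudo tee /etc/systemd/system/ollama.service.d/override.conf
sudo systemctl daemon-reload
sudo systemctl restart ollama
```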

Installing Open WebUI

The first thing you need to do is install Docker on Linux Mint. I followed the instructions in How to Install Docker on Linux Mint 22; I won't copy them here as they're pretty long.

Create a Docker volume to persist Open WebUI data

docker volume create open-webui

Pull the Docker image from the GitHub Container Registry.

docker pull ghcr.io/open-webui/open-webui:main

Execute the docker run command to start the Open WebUI container

docker run -d --network=host -v open-webui:/app/backend/data -e OLLAMA_BASE_URL=http://127.0.0.1:11434 --name open-webui --restart always ghcr.io/open-webui/open-webui:main

Explanation of the command:

  • docker run -d: Runs the container in detached mode (in the background).
  • --network=host: Uses the host's network interface (allows external connections).
  • -v open-webui:/app/backend/data: Mounts the open-webui volume at /app/backend/data in the container.
  • -e OLLAMA_BASE_URL=http://127.0.0.1:11434: Sets the ollama base URL.
  • --name open-webui: Assigns the name "open-webui" to the container.
  • --restart always: Restarts the container automatically whenever Docker starts.
  • ghcr.io/open-webui/open-webui:main: Specifies the Open WebUI Docker image to use.

Assuming the command executed without error, you should be able to open a browser to http://localhost:8080/ (or http://<host ip>:8080/ from another system) and set up the first user, which will be the administrator.

Updating Open WebUI

You will need to update the Open WebUI app from time to time as feature updates and fixes are released. The easiest way I've found to update Open WebUI is to use watchtower with the following command.

docker run --rm --volume /var/run/docker.sock:/var/run/docker.sock containrrr/watchtower --run-once open-webui

Using Open WebUI

I'm not going to write instructions on using Open WebUI; the interface is similar in usage to the OpenAI/Gemini/Copilot interfaces. There are advanced features that can be enabled, such as search integration and RAG (Retrieval-Augmented Generation, for integrating your own data into the model's knowledge), that are better covered elsewhere.

The main point of this post is to document the configuration for myself, so that if I need to do it again I know how I did it without having to go through the documentation again.