Following up on the system I built and posted about earlier: I've finally purchased a pair of graphics cards so that I can run the AI models I want to try with GPU acceleration. I went with a pair of AMD Radeon Pro V340L cards because they are supported by ROCm and ollama and are relatively cheap (I picked two up for $50 each). Since they are intended to provide accelerated 3D graphics for virtualized desktops in a VDI environment, they are probably not the best cards for numerical work, but they were likely the most cost-effective option.
If these don't perform well, I will likely move on to a pair of AMD Instinct MI50 cards, which run about double the cost per card. Ideally, though, I'd like something with 32GB or more of RAM per card and run two or three of them if possible. I'm really itching to try out some larger but not-impossible-to-run models like Llama 3.3 70B, but then you're talking a minimum of $500 for an MI60 with 32GB per card ... not in the current budget unless I find something truly useful to do with it that could earn a little cash on the side.
I'm going to post a few benchmarks using Microsoft's Phi-4, a 14B parameter model intended to be runnable on consumer hardware. With this model I get not-horrible performance using the CPU alone. Using the prompt "write a short story about a girl, her dog and a journey" to make sure it generates a reasonable amount of output for testing, I get the following.
total duration: 2m51.219695186s
load duration: 33.998953ms
prompt eval count: 23 token(s)
prompt eval duration: 1.735s
prompt eval rate: 13.26 tokens/s
eval count: 735 token(s)
eval duration: 2m49.449s
eval rate: 4.34 tokens/s
So 4.3 tokens/s. Not exactly a speed demon; it generates output at a rate well under my reading speed, but depending on the application it might be usable, especially with a smaller model. Running the much smaller Llama 3.2 3B model on the CPU gives the following.
total duration: 35.587415648s
load duration: 57.451041ms
prompt eval count: 38 token(s)
prompt eval duration: 679ms
prompt eval rate: 55.96 tokens/s
eval count: 658 token(s)
eval duration: 34.848s
eval rate: 18.88 tokens/s
So 18.9 tokens/s, which is a usable rate; if the task you are performing can be handled by such a small model, you could get away with running it on the CPU.
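For reference, the stats in these benchmarks come straight from ollama's --verbose flag, so a run looks roughly like this:

```shell
# Pull the model once, then run with --verbose so ollama prints the
# timing stats (load duration, prompt eval rate, eval rate) after the
# response.
ollama pull phi4
ollama run --verbose phi4 "write a short story about a girl, her dog and a journey"
```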
Installing AMD ROCm drivers to allow for GPU accelerated AI under Linux.
AMD provides the following page with instructions on installing the ROCm drivers for Linux. You can use the Ubuntu 24.04 instructions for Linux Mint 22, since that's what it's based on:
Install Radeon software for Linux with ROCm — Use ROCm on Radeon GPUs
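The page boils down to a few commands. This is only a sketch of the general shape; the installer version in the URL below is an assumption, so check AMD's page for the current release before downloading:

```shell
# Download AMD's installer package (the version in this URL is an
# example -- use the current one from AMD's install page) and install it.
wget https://repo.radeon.com/amdgpu-install/6.2.3/ubuntu/noble/amdgpu-install_6.2.60203-1_all.deb
sudo apt install ./amdgpu-install_6.2.60203-1_all.deb

# Install the ROCm use case (kernel driver plus the ROCm runtime).
sudo amdgpu-install --usecase=graphics,rocm

# Your user needs to be in the render and video groups to access the GPUs.
sudo usermod -aG render,video $USER

# After a reboot, rocminfo should list the installed GPUs.
rocminfo
```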
I've installed the two Radeon Pro V340 cards I purchased. Each card features two AMD Vega 56 GPUs with 8GB of HBM apiece. Each GPU is roughly on par with a GTX 1050, so they are not terribly powerful individually, but at just $100 for both cards they are an inexpensive starting point and a good way to get an idea of how scaling works across multiple GPUs.
So, going back to our benchmark using Microsoft's Phi-4 (the 14B parameter model), now with GPU acceleration: with the same prompt, "write a short story about a girl, her dog and a journey", to make sure it generates a reasonable amount of output for testing, I get the following.
total duration: 43.088122883s
load duration: 39.01255ms
prompt eval count: 23 token(s)
prompt eval duration: 2.589s
prompt eval rate: 8.88 tokens/s
eval count: 656 token(s)
eval duration: 40.457s
eval rate: 16.21 tokens/s
So the cards are only about 4x faster than the CPU. It looks like spreading the 14B parameter Phi-4 model across all four GPUs costs some efficiency: I'm seeing 50% or less utilization on each GPU, most likely because of communication overhead.
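The utilization figures come from watching rocm-smi while the model generates. Assuming the ROCm tools are installed, something like this works:

```shell
# Refresh per-GPU usage every second while a prompt is running.
# --showuse prints the GPU use percentage for each device.
watch -n 1 rocm-smi --showuse
```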
Running a model like Llama 3.2, a 3B parameter model that fits easily on one GPU, should offer better performance. Besides requiring fewer calculations with fewer parameters, it all sits on one GPU, so the communication overhead should be minimal.
We do get 100% GPU usage for a short period while it processes the request, and the throughput is higher than the CPU-only version, but not nearly as high as expected.
total duration: 12.223336862s
load duration: 58.862989ms
prompt eval count: 38 token(s)
prompt eval duration: 66ms
prompt eval rate: 575.76 tokens/s
eval count: 586 token(s)
eval duration: 12.093s
eval rate: 48.46 tokens/s
This is only about 2.5x as fast as the CPU version's 18.9 tokens/s eval rate; not quite the speed-up I was expecting.
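The speed-ups quoted here are just ratios of the eval rates from the benchmarks above, which is where the roughly 4x and 2.5x figures come from:

```shell
# GPU eval rate divided by CPU eval rate for each model tested so far.
awk 'BEGIN {
    printf "phi4:     %.1fx\n", 16.21 / 4.34    # 14B, split across four GPUs
    printf "llama3.2: %.1fx\n", 48.46 / 18.88   # 3B, fits on a single GPU
}'
```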
Trying the Gemma 2 27B parameter model with our test prompt we get the following results.
total duration: 1m2.508463899s
load duration: 103.767848ms
prompt eval count: 22 token(s)
prompt eval duration: 2.504s
prompt eval rate: 8.79 tokens/s
eval count: 621 token(s)
eval duration: 59.898s
eval rate: 10.37 tokens/s
Just for giggles I tried running Llama 3.3 70B, and the model surprisingly ran, but with 64% of it on the GPUs and 34% on the CPU. The performance on my standard prompt was kind of glacial, seeing that it's a much bigger model running at least partially on the CPU.
total duration: 10m7.537928089s
load duration: 59.888654ms
prompt eval count: 23 token(s)
prompt eval duration: 5.895s
prompt eval rate: 3.90 tokens/s
eval count: 693 token(s)
eval duration: 10m1.581s
eval rate: 1.15 tokens/s
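The GPU/CPU split is what ollama reports for a loaded model; you can check it while the model is resident:

```shell
# While a model is loaded, ollama ps shows how much of it is resident
# on GPU versus CPU in the PROCESSOR column.
ollama ps
```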
After much investigation, it seems there are some issues with the motherboard design that will limit performance: any pair of two-slot graphics cards will always end up attached to different processors, limiting the bandwidth available to transfer data between the cards, and the slots are all PCIe 3.0 x16. Given that, the best option seems to be to limit myself to one graphics card, and to models of around 32B parameters or thereabouts.
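To restrict ollama to a single card, the server can be pointed at specific GPUs with the ROCR_VISIBLE_DEVICES environment variable. The device indices below are assumptions for a card whose two dies enumerate first; check rocm-smi for your actual numbering:

```shell
# Expose only the first two GPU dies (one physical V340 card) to the
# ollama server; device IDs come from rocm-smi / rocminfo.
ROCR_VISIBLE_DEVICES=0,1 ollama serve
```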