
#1 2026-03-13 11:10:32

zenlord
Member
From: Belgium
Registered: 2006-05-24
Posts: 1,229
Website

Hardware recommendations for AI box?

Hi,

Long-time (arch)linux user, but I never ventured into gaming rigs and have steered clear of dedicated graphics chips for the past 15 years.

I would like to start testing AI models in the realm of RAG and translation and was directed towards librechat. I was able to find installation guides and am now getting acquainted with the required software stack. I will probably spin up a VM first to test the software, but I probably should not have too high hopes without any discrete GPU and with rather limited RAM in that VM...

Online sources suggest I should budget 3-4k EUR for boxes like NVidia's DGX Spark or AMD's Strix Halo (the former being more performant but more expensive). I have three problems:
1. 3-4k EUR is a pretty hefty amount of money to play around with something new, even though I am rather confident that the end result will help me, even professionally.
2. I am not familiar with evaluating the performance of these systems, so I don't know if they will bring speed increases of 5x or 100x versus my (future) VM setup.
3. Recent news seems to confirm that the NVidia platform is best supported on Linux, but with "lemonade-server" AMD's XDNA2 now also seems to be supported, and even Intel's NPU driver is getting better (even though its list of supported applications seems shorter).
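On problem 2, a rough rule of thumb (not from this thread; the bandwidth and efficiency figures below are assumptions for illustration): single-stream LLM text generation is usually memory-bandwidth bound, so tokens/s is roughly usable bandwidth divided by the bytes of active weights read per token. A minimal sketch:

```python
# Rough rule of thumb: single-stream decode is memory-bandwidth bound, so
# tokens/s ~= usable memory bandwidth / bytes of active weights read per token.
# The 0.6 efficiency factor and the 256 GB/s figure are assumptions, not specs.

def est_tokens_per_s(bandwidth_gb_s, active_params_b, bits_per_weight, efficiency=0.6):
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return efficiency * bandwidth_gb_s * 1e9 / bytes_per_token

# e.g. a 20B dense model at 4-bit on a hypothetical ~256 GB/s unified-memory box
print(round(est_tokens_per_s(256, 20, 4), 1))  # ~15 tokens/s
```

Comparing that number against the same formula with your VM's RAM bandwidth gives a ballpark answer to the "5x or 100x?" question, at least for dense models.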

Does anyone have any meaningful experience or links to online resources to help me see the forest for the trees?

Thx!

Offline

#2 2026-03-13 11:28:03

gromit
Administrator
From: Germany
Registered: 2024-02-10
Posts: 1,519
Website

Re: Hardware recommendations for AI box?

What models would you like to use? If you want to load big models like GPT-OSS, you'll most likely need something with more VRAM or unified memory :)

Offline

#3 2026-03-13 12:32:11

zenlord
Member
From: Belgium
Registered: 2006-05-24
Posts: 1,229
Website

Re: Hardware recommendations for AI box?

Thank you @gromit for your reply.

I have no idea what model size I need for it to become useful - that is what I will have to play around with. Looking at what the general interwebs consider 'good hardware for AI tasks', a discrete GPU with 24GB of VRAM looks like almost a requirement. Maybe I have been reading the wrong fora and Reddit threads, though (which is why I'm hoping to find advice on the minimum required hardware). I did also notice that image generation requires more CPU power, and that is entirely out of scope for my purposes: I'm focusing on text-related tasks.
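The 24GB figure follows from simple arithmetic: quantized weight size is roughly parameters times bits per weight divided by 8, plus headroom for the KV cache and runtime. A back-of-envelope sketch (the headroom comment is an assumption, not a measurement):

```python
# Back-of-envelope VRAM check: quantized weight size ~= params * bits / 8.
# On top of that you need headroom for KV cache, activations and the runtime,
# so a card should have a few GB more than the raw weight size.

def weight_gb(params_billion, bits_per_weight):
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(weight_gb(20, 4))   # a 4-bit ~20B model: ~10 GB of weights, fits a 16 GB card
print(weight_gb(27, 4))   # a 4-bit ~27B model: ~13.5 GB, pushes you toward 24 GB
```

So 24GB is less a hard requirement than the point where 4-bit ~30B-class models fit with comfortable context headroom.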

Offline

#4 2026-03-13 19:53:32

Luciddream
Member
From: Greece
Registered: 2014-12-08
Posts: 72

Re: Hardware recommendations for AI box?

I started mine with a 16GB AMD 9070XT that I got for about 550 euro. For image generation and image editing it's actually fine, it's a fast GPU. For text generation, you can fit a 4-bit GGUF like gpt-oss-20b in it and have something to experiment with. For slightly larger models you will need a minimum of 24GB of VRAM (I haven't tried it, but I guess a 4-bit GGUF of the dense Qwen 3.5 27B would be possible on that GPU).

For even larger models, which will produce better results faster (for example Qwen 3.5 122B or MiniMax-M2.5), you will need something with 64-128GB unified memory. I now have a 128GB Strix Halo and I'm using it almost exclusively (with lemonade-server). I've been running Qwen 3.5 122B since it came out, and while it is slow on the Strix Halo, it produces very good results most of the time. You can also use this machine to fine-tune LLMs with your own datasets, but I haven't done that myself yet. You can also run models on the CPU, on the GPU (better performance), or on the NPU (better energy consumption).

There are many more things to consider, like choosing between a dense model and a MoE model, but I'm not really an expert on which is better or when you should use each. I keep experimenting with models and settings when I have time available.

Offline

#5 2026-03-13 21:20:31

ewaller
Administrator
From: Pasadena, CA
Registered: 2009-07-13
Posts: 20,613

Re: Hardware recommendations for AI box?


Nothing is too wonderful to be true, if it be consistent with the laws of nature -- Michael Faraday
The shortest way to ruin a country is to give power to demagogues.— Dionysius of Halicarnassus
---
How to Ask Questions the Smart Way

Offline

#6 2026-03-13 23:49:14

loqs
Member
Registered: 2014-03-06
Posts: 18,861

Re: Hardware recommendations for AI box?

Luciddream wrote:

(for example Qwen 3.5 122B or MiniMax-M2.5), you will need something with 64-128GB unified memory.

How do you fit MiniMax-M2.5's 229 billion parameters into 128GB and still have room for the KV cache, context, LLM engine and OS? REAP of around 50% plus 4-bit quantization, or 2-bit quantization?
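The question can be sanity-checked with the same params-times-bits arithmetic. Real K-quants mix several bit widths, so the averages below are assumptions, not exact GGUF file sizes:

```python
# Approximate in-memory size of a quantized model at an average bits-per-weight.
# K-quants mix bit widths, so these averages are rough assumptions.

def model_gb(params_billion, avg_bits_per_weight):
    return params_billion * 1e9 * avg_bits_per_weight / 8 / 1e9

for bits in (4.5, 3.5, 2.5):   # ballpark averages for ~Q4_K / ~Q3_K / ~Q2_K
    print(f"{bits} bpw -> {model_gb(229, bits):.0f} GB")
```

At ~3.5 bits per weight, 229B parameters come to roughly 100 GB of weights, which would leave some room in 128GB for KV cache and the OS without needing REAP pruning; ~4.5 bpw would not fit.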

Offline

#7 2026-03-14 00:12:29

cryptearth
Member
Registered: 2024-02-03
Posts: 2,055

Re: Hardware recommendations for AI box?

i can run 14b models quantized at 6-bit on my rx7700 12gb (@lucid: *jealous face* - paid about 500 for mine back when i bought it - lucky you) - but that's the upper limit, and depending on the model i have only a small context of 4k and only about 15 tokens/s with a bit of spill onto system ram and cpu
only tried text generation via llama.cpp - haven't dived into what's required for other tasks or media yet
here's my current list:

DeepSeek-Coder-V2-Lite-Instruct-Q5_K_M.gguf
deepseek-llm-7b-chat_Q5_k_m.gguf
fusechat-llama-3.1-8b-instruct-q6_k.gguf
gemma-2-9b-instruct-Q4_K_M.gguf
Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
Ministral-3-14B-Instruct-2512-Q4_K_M-bpw.gguf
Mistral-7B-Instruct-v0.3-Q4_K_M.gguf
Nous-Hermes-2-Mistral-7B-DPO.Q4_K_M.gguf
openchat-3.6-8b-20240522-Q4_K_M.gguf
Phi-3-medium-128k-instruct-Q4_K_M.gguf
phi-4-Q4_K.gguf
qwen2.5-14b-instruct-q6_k.gguf
qwen2.5-coder-14b-instruct-q6_k.gguf
Qwen3-14B-Q6_K.gguf

some work better than others - with some I get quite a big context of 20k - one even up to 45k - with high speeds (40+ tokens/sec) and good quality - but ask them too much and they quickly break down
one of the qwen models once started talking chinese all of a sudden during an active reply (ok, it was a stress benchmark - but it shows where the limits of both my hardware and the model are)
fun fact: it actually does matter who trained the model: ask chinese models about stuff like taiwan or tiananmen square - they nope out, while western models by meta, M$, google or others happily talk about it
I once tried to frame qwen into a shady communist persona for a round of the party game mafia - the results were ... interesting
but aside from such obvious political constraints they're actually quite good and fast when it comes to more general topics
as for coding - well, even the "code" models have their issues - it's best to have at least one other model check the output of its brothers - together they are actually quite good (in contrast to chatgpt, which is quite bad: even when you point it to man pages or other docs it still gets simple commands very wrong)
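The context-size differences above come largely from KV-cache growth, which is linear in context length. A sketch with made-up but plausible dimensions for a ~14B model with grouped-query attention (all four defaults below are assumptions, not the specs of any model listed):

```python
# KV-cache memory grows linearly with context:
# bytes ~= 2 (K and V) * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens.
# Model dims below are hypothetical but in the right range for a ~14B GQA model.

def kv_cache_gb(n_tokens, n_layers=48, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens / 1e9

for ctx in (4096, 20480, 45056):
    print(f"{ctx:>6} tokens -> {kv_cache_gb(ctx):.2f} GB")
```

With ~10.5 GB of 6-bit weights on a 12 GB card, even the ~0.8 GB cache of a 4k context forces some spill to system RAM, while 20k-45k contexts only work for models with smaller KV footprints.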

Last edited by cryptearth (2026-03-14 00:19:39)

Offline

#8 2026-03-14 07:12:09

Luciddream
Member
From: Greece
Registered: 2014-12-08
Posts: 72

Re: Hardware recommendations for AI box?

loqs wrote:
Luciddream wrote:

(for example Qwen 3.5 122B or MiniMax-M2.5), you will need something with 64-128GB unified memory.

How do you fit MiniMax-M2.5's 229 billion parameters into 128GB and still have room for the KV cache, context, LLM engine and OS? REAP of around 50% plus 4-bit quantization, or 2-bit quantization?

I've used the MiniMax-M2.5-GGUF:UD-Q3_K_XL - it fits in 128GB. To be honest, I only used it for a couple of days and haven't touched it since Qwen 3.5 came out; I think Qwen 3.5 122B works better for me.

Offline

#9 2026-03-14 07:19:16

Luciddream
Member
From: Greece
Registered: 2014-12-08
Posts: 72

Re: Hardware recommendations for AI box?

cryptearth wrote:

i can run 14b models quantized at 6-bit on my rx7700 12gb (@lucid: *jealous face* - paid about 500 for mine back when i bought it - lucky you) - but that's the upper limit, and depending on the model i have only a small context of 4k and only about 15 tokens/s with a bit of spill onto system ram and cpu - only tried text generation via llama.cpp

If you are in the US, check out the AMD Developer Challenge. They are giving away 128GB laptops to people who contribute something to the community. That's how I got mine!

Offline

Board footer

Powered by FluxBB