Phase 1: Just a Bunch of Compute
Plan the hardware for your SWE experimentation laboratory
Time to get down to business. Phase 1 of the plan is the foundational layer of your own home-brew “mad scientist” lab for software engineering experimentation. JBoC — Just a Bunch of Compute — is about deciding on the physical portion of the platform.
Prev: Building a SWE Experimentation Platform | Next: Phase 1: Just a Bunch of Storage
I’ll be describing this in terms of what’s most typical for backend and data engineers, but don’t view that as limiting your options. If you’re more interested in the front end, you might make slightly different choices. If you’ve been dreaming of building a thousand-node Raspberry Pi supercluster, have at it! Honestly, if some board manufacturer would underwrite the work, I’d do a supercluster in a heartbeat.
The Outline
Money only goes so far. There is benefit to planning purchases much the same way a cash-strapped company might. The heuristic I’m using is:
75% of purchases will go to lower-cost hardware.
20% will go to the mid-tier cost range.
5% will go to the very infrequent budget-busters you have decided you must say yes to.
The percentages are ballparks in number of units purchased, not in dollars. Those top-end 5% of units may very well end up being 75% of your spend if you’ve decided you absolutely must have that Nvidia H100 card, for example. More commonly it’ll be the monster laptop rig you talk yourself into once every 5 years.
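To make that concrete, here is a minimal sketch of how the unit split and the dollar split diverge. The unit counts and prices are made-up assumptions purely for illustration, not recommendations:

```python
# Hypothetical yearly purchase plan. Unit counts follow the 75/20/5 split,
# but the dollars tell a very different story. All prices are made up.
purchases = {
    "low-tier":  {"units": 15, "unit_cost": 250},     # used/open-box SFF nodes
    "mid-tier":  {"units": 4,  "unit_cost": 900},     # e.g. a node with a mid-range GPU
    "high-tier": {"units": 1,  "unit_cost": 25000},   # that H100-class splurge
}

total_units = sum(p["units"] for p in purchases.values())
total_spend = sum(p["units"] * p["unit_cost"] for p in purchases.values())

for tier, p in purchases.items():
    spend = p["units"] * p["unit_cost"]
    print(f"{tier:>9}: {p['units'] / total_units:4.0%} of units, "
          f"{spend / total_spend:4.0%} of spend")
```

With those assumed prices, the high tier is 5% of the units but roughly three-quarters of the spend, which is exactly the trap (or splurge) to plan for.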
The goal in experimentation with distributed computing is to have hardware to distribute work over. The per-node computational muscle may not be that important. It’s the distribution of function that you are exercising, not whether each and every node can supply a water-cooled 100 FPS while playing Cyberpunk 2077.
Lower-Tier Nodes
Used, open-box, and unused-but-discontinued hardware is a great way to build out a cluster. Product lines in the range of 3-6 years old can provide a lot of bang for the buck. We pay an insane premium to get the newest thing with a 50% performance bump when operated flat out. Most of the time we aren’t using server compute anywhere near full load, particularly in a home lab.
Think of your lab like you’re setting up a business. Fund the work you are going to do, not the work that somebody, somewhere, in a big wide world might conceive of doing. Someday.
Having headroom for maximum performance is great… if you’re using it. Save that for the high-end 5% of purchases. If it turns out to have been a vanity purchase, at least you’ll understand where the motivation came from.
Try to identify a single hardware model that you will stick to. There are two reasons for this:
When you have many different hardware variations, it becomes more challenging to reason about behavior like performance and fault rates. The differences become primary variables instead of your chosen computational activity.
Maintenance is harder because you’ll have different issues arise with every different kind of hardware. Firmware patch issues will differ, BIOS configuration will differ, supported memory chips or SSDs can differ. By sticking to a single model you gain leverage on the learning curve: solve a problem once, then copy-paste the fix to all the other nodes of the same model.
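One cheap way to keep yourself honest about that consistency is to read the standard Linux DMI identifiers on each node and diff the output against your reference box. A minimal sketch, assuming Linux nodes:

```python
# Read the standard Linux DMI identifiers so nodes can be compared for
# consistency. Run on each node and diff the output against your reference box.
from pathlib import Path

DMI_DIR = Path("/sys/class/dmi/id")

def dmi(field: str) -> str:
    path = DMI_DIR / field
    return path.read_text().strip() if path.exists() else "unknown"

for field in ("sys_vendor", "product_name", "board_name", "bios_version"):
    print(f"{field}: {dmi(field)}")
```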
Here are the characteristics I look for in lower-cost nodes.
The BIOS and motherboard features must support whatever operating system and O/S features I intend to use, and supply the connectivity I’ve decided I need for any hardware that will be mounted directly on or cabled to the motherboard. More than anything else, this draws a line for “too old to use”.
The model must be easily available. There’s no point in picking something you rarely see available on the major reseller web sites.
There should be a decent amount of internet history on experiences with hardware quality for that model. You want hardware that has been reliable, both in initial assembly and in use over time. The model might not house the fastest and hottest-running components as of when it was made, because heat means thermal stress. Find a reliable workhorse, not an older Ferrari.
The physical size of the model must suit the location where it is going to live. If you have an unused (and dry!) basement with ample space for racks and 1U “pizza boxes” then the width and depth of those nodes might be fine. If you have a half dozen square feet on a couple of home-office shelves, then small form-factor PCs may be what you’re looking for: I use SFFs myself for exactly that reason.
You need to be able to support the power requirements of that model. You don’t want the circuit breaker tripping all the time because you tried to run too many nodes with 1200 W power supplies.
The models must run cool enough for the environment where they are going to live. Obviously you may be able to improve on the situation with various cooling solutions, but the end result establishes your thermal constraints.
CPUs won’t be the best possible but should be acceptable in terms of their performance, their heat generation, and any history related to reliability. As an example, you may decide to take a pass on Intel Raptor Lake 13900K and 14900K CPUs because of their voltage-related stability problems, but on the other hand I wouldn’t rush to anything in the i3 product line either because it’s just too low-powered.
RAM capacity matters for almost everything. Unless you’re doing the Raspberry Pi supercluster, I wouldn’t bother with anything that doesn’t have at least 32 GB of RAM per node, and preferably at least 64 GB. Note that sometimes the manufacturer information will suggest 32 GB when 64 GB was possible with the right RAM sticks, so do your homework online by tracking down posts from modders and happy owners of a particular node model. Be aware that some motherboards with 4 sockets for RAM may not operate at maximum memory performance in that configuration; if speed is your goal you may be limited to a single pair of DIMMs. (A quick post-assembly sanity check appears after this list.)
I personally prefer to buy barebones boxes that come with the CPU, GPU, Wi-Fi, Bluetooth, and wired network components already mounted… but not the RAM or SSD. That lets me pick a consistent model for RAM and SSD hardware without feeling like I wasted any money. If you aren’t super-fussy about performance testing, then I wouldn’t worry as much about that. Keeping the Wi-Fi / Bluetooth / network cards consistent, though, is likely to save you a lot of headaches. There’s more than enough room for configuration-related ick around those as it is.
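As for the sanity check mentioned above: once a node is assembled, it’s worth confirming the kernel actually sees the RAM you paid for. A minimal sketch for a Linux node (the 60 GiB threshold is just an assumption matching a 64 GB build; adjust to your own floor):

```python
# Confirm the amount of RAM the kernel actually sees on a Linux node.
# /proc/meminfo reports MemTotal in kB, and it will be a little below the
# physical total because the kernel reserves some memory for itself.
def installed_gib() -> float:
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) / (1024 ** 2)
    raise RuntimeError("MemTotal not found in /proc/meminfo")

gib = installed_gib()
status = "looks right" if gib >= 60 else "short of a 64 GB build"
print(f"Detected {gib:.1f} GiB of usable RAM ({status})")
```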
Middle-Tier Nodes
The process here is much the same as that for the low-cost tier. The difference is you may know you want to perform work that requires physical capabilities your lower-end nodes will lack.
An obvious example would be if the lower-tier nodes don’t have any Nvidia GPU cards in them, and you need something that would let you work with CUDA. The node models will be a bit newer and components slightly higher-end, but you don’t have to break the bank. What you likely care about are which versions of CUDA are supported by a particular GPU card model. The Wikipedia page on Nvidia card features can be a handy resource for looking at API support.
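If you want a quick way to see what an installed card actually reports, here is a minimal sketch assuming PyTorch happens to be installed (any CUDA-aware framework, or nvidia-smi, works equally well):

```python
# Minimal check of what CUDA support a node's GPU exposes, assuming PyTorch
# is installed. torch.version.cuda is the CUDA version this PyTorch build
# targets; compute capability is the per-card hardware feature level.
import torch

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{name}: compute capability {major}.{minor}, "
          f"CUDA runtime {torch.version.cuda}")
else:
    print("No CUDA-capable GPU visible to this PyTorch build")
```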
Avoiding the RTX cards that end in “80” or “90” can provide you with some affordable options. Obviously lower-end GPUs wouldn’t cut it for somebody training a 100B-parameter base LLM, but we’re not trying to spec out hardware for an activity that runs with budgets in the tens of millions of USD. I’m not going to get further into AI details now; they’ll be “Cheapskate AI” material in future articles.
Another kind of node warrants consideration as a possible mid-tier expense, and that is a good-quality NAS (Network Attached Storage). That’ll be discussed more in the next article on JBoS (Just a Bunch of Storage).
Higher-Tier Nodes
Not much to explicitly plan here except to note that this can be a disproportionate hunk of your budget. Avoid it if you can, but make the money count if you do it.
My personal inclination is towards these being the periodic upgrade to a really good laptop, since even when you have a compute cluster you’ll behave like any developer and work on it remotely.
If you have big plans in the ML and GenAI space, you might instead opt for a desktop box with whatever you can afford in the way of GPU cards and the most VRAM you decide you can fund.
Power Planning
This is a big area and I’m not the person to do full justice to it, but as you put more money into hardware you’ll want to learn about:
Surge protectors
Surge arrestors/limiters
Power/line conditioners
Uninterruptible Power Supplies (UPS) and their common architectural variants: standby, line-interactive, and online
I have seen a rack taken out by a lightning strike, so learning about the relevant power-handling hardware and how to set it up properly is something to have on your radar for some stage of the evolution of your lab.
Until then, at least consider surge protection. To be extra safe, power down and unplug everything whenever you hear about an incoming electrical storm, or when notified that the utility company is going to be doing work in your area.
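Before any of the fancier gear, it’s also worth a back-of-the-envelope check that your shelf of nodes even fits comfortably on one household circuit; this is where those 1200 W power supplies mentioned earlier come back to bite. A minimal sketch with assumed numbers (120 V, 15 A breaker, and the common 80% continuous-load rule of thumb; swap in your own regional figures and measured draws):

```python
# Back-of-the-envelope load check for a single household circuit.
# Assumptions: 120 V / 15 A (North America) and the common 80% rule of thumb
# for continuous loads. Swap in your own regional numbers and measured draws.
volts = 120
breaker_amps = 15
usable_watts = volts * breaker_amps * 0.80   # ~1440 W of continuous headroom

nodes = 6
watts_per_node = 65       # rough SFF draw under load; a 1200 W tower PSU changes the math
switch_and_misc = 40      # network switch, fans, chargers, etc.

total = nodes * watts_per_node + switch_and_misc
print(f"Estimated draw: {total} W of {usable_watts:.0f} W usable "
      f"({total / usable_watts:.0%} of the circuit)")
```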
One point of this lab-building exercise is to treat it as though you are running a business, not playing computer games, so safeguard that investment.
Other Parts
Buying compute nodes will not be the only purchase. You’ll need to allow for:
RAM sticks of the appropriate size and part numbers.
Drives (whether HDD, SSD, or hybrid).
Network cables with shielding appropriate to the speed and the length of the run to your router or switch.
A network switch.
Some USB sticks for doing initial O/S installations or for firmware updates.
Anything you need for interacting with nodes, such as a monitor and keyboard.
If the motherboard or node case has vendor-specific parts, like fan housings or heat-shield armor, consider picking up a few spares while you can get them.
Add any cooling supplies or spare fan parts you want quickly available.
Add any tools you want for assembling hardware like Torx screwdrivers, anti-static mats and wrist straps, magnifier and lighting for viewing small parts in tight quarters, etc.
If you went the route of rack-mount hardware, then depending on your plan, either full-sized or table-top racks and all their supporting hardware pieces; alternatively, if you need any office furniture like shelving, allow for that too.
Storage bins for bits and pieces, because you’ll definitely collect bits and pieces.
Add or remove as appropriate. It’s your lab for your experiments.
The Process
I would suggest some habits around how you acquire and integrate hardware.
Learn which vendor sources you find reliable, and mostly stick to them. You’ll have enough going wrong as it is. Sometimes your compatibility homework on a RAM part number won’t turn out as you had hoped. Sometimes the wrong part will get shipped. The universe introduces enough chaos on its own, so knowing which vendors are more reliable can save you a lot of grief.
Don’t rush to buy multiple units of a compute node or expensive part until you’ve bought and tried the very first one. It’s not hard to trip over something that will be show-stopping for your plans, and then you have to send that item back and start your homework again on an alternative. Buy one, put it to some use, and then if your plan was to buy more you can proceed.
Don’t buy until you have time to install and test, particularly when it isn’t a brand-new shrink-wrapped product. Schedule the time for the work. Not only will that help you catch bad parts fast enough to return them, it will also help you spot when somebody sends you a part that is not what was advertised. Even shrink-wrapped products are better tested quickly, but for anything used or open-box you want to make sure you didn’t waste your money or get scammed.
Make notes and take pictures if those might be helpful later. Setting up a compute cluster has a lot of repeat work, and you will find yourself routinely wishing you remembered exactly what steps you took on a previous node. Capture the information while it is fresh. Even a day later you’ll forget something.
Budget about 10% above what you planned. You’ll discover you need something that you didn’t realize would be important until opening up a case, or after realizing half your network cables aren’t being recognized by the switch.
Expect to do operating system installations frequently at first. Not only are there many flavors to choose from — even just in the Unix/Linux world — but you’ll find yourself grappling with questions on specific O/S version numbers, desktop vs server variants, disk layout for the installation, and sometimes BIOS configuration. You will change your mind, and discover why keeping notes comes in handy. Don’t be emotionally wedded to any initial set-up. Learn to feel comfortable tearing down anything that isn’t exactly as you want it to be.
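To make the note-keeping habit concrete, here is a minimal sketch of the kind of per-node build record worth capturing right after assembly. The hostname and field values are hypothetical placeholders; keep whatever you will actually want six months from now:

```python
# A per-node build record, written out as JSON so it's easy to grep and diff
# later. Field names are only a suggestion -- capture what you'd otherwise forget.
import json
from datetime import date

node = {
    "hostname": "nuc-03",                # hypothetical example values throughout
    "model": "Intel NUC (fill in the exact model number)",
    "bios_version": "fill in from the BIOS screen",
    "ram": "2 x 32 GB SO-DIMM, part number here",
    "storage": "1 TB NVMe SSD + 2 TB HDD",
    "os": "distro, version, server vs desktop",
    "disk_layout": "partitioning / filesystem notes",
    "quirks": "anything that surprised you during assembly or install",
    "built_on": date.today().isoformat(),
}

with open(f"{node['hostname']}-build.json", "w") as f:
    json.dump(node, f, indent=2)
```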
A Living Example
Here are the low-tier nodes for my SFF cluster.
Intel NUCs; all but one are the same model. The exception is my Kubernetes control plane, which has a small keyboard and monitor attached in case I can’t get to the cluster remotely for some reason, e.g. due to a flubbed network configuration experiment.
Each node has 64 GB of RAM, an HDD, and an SSD.
Cooling isn’t too much of a concern, as Intel built these around CPUs intended for the mobile market, but I keep a fan at one end to circulate the air just to be safe.
The nodes connect to a network switch positioned on the shelf above.
All of these were bought after Intel had moved on to the next couple of series in the NUC product line. Fortunately I avoided the Raptor Lake debacle. At this point I think about three-quarters of the Intel NUC lines are aging out as an option for those starting their own cluster in 2025. There are many other SFF vendors that have taken up the baton, so I expect SFFs to remain a viable approach for those not wanting to go in the direction of rack-mount hardware.
The Experimentalist : Phase 1: Just a Bunch of Compute © 2025 by Reid M. Pinchback is licensed under CC BY-SA 4.0