When Chad and I started building Pepperdata’s product, we knew we needed a Hadoop cluster to test the software on. But we realized that since the whole point of the product is to provide more clarity, control, and capacity for Hadoop operators in the face of limited hardware resources, we didn’t need to buy expensive machines – in fact, what we really needed was a cluster we could max out easily. So rather than coughing up $7K-$14K per node, I called up Unix Surplus and found 20 used machines with a reasonable hardware configuration for $185 each.

When I drove over to pick up the machines, I got a closer look. They looked awfully familiar, so I asked the guy there, “By any chance, did you pull these out of a Yahoo data center?” Indeed, he had. They looked just like the machines I had ordered when running the web search group at Yahoo. I looked more closely and could see a little gummy residue in the corner where Rackable had put the sticker labeling each machine with the machine-type name I had specified when I ordered them – so I told the Unix Surplus guy, “This is the second time I’ve bought these machines!”

Sure enough, when we booted up the servers, their Roamer card displays said “Index Server” – they were the servers we used to serve web search results at Yahoo around 2006! We took the bootstrapped-startup approach to hosting the machines, too – I built a plywood cage under my porch with fans on each end.

For 18 months we did much of our development and testing on those ~8-year-old, $185 air-cooled servers in a wooden cage. The only real downside of this approach was that every month or two I had to go to the “data center” and debug the servers – by which I mean removing the moths that had made their way through the air filter. (Of course, for testing at large scale, we ran many hundreds of VMs in EC2, but that’s a topic for a different blog post…)
