I just finished reading a fascinating Wired article on the proliferation of hardware bugs. Here’s the gist:
When computers crash, buggy software usually gets the blame. But over the past few years, computer scientists have started taking a hard look at hardware failures, and they’re learning that another type of problem pops up more often than many people realize. That’s right: hardware bugs.
Chipmakers test their products thoroughly before they ship, but they're less eager to talk about how hard it is to keep those chips working accurately over time. Since the late 1970s, the industry has known that stray particles striking a chip can flip the state of bits stored in its transistors. As transistors have shrunk, it has become even easier for a stray particle to knock one into the wrong state. Industry insiders call this the "soft error" problem, and it's only going to become more pronounced as we move to smaller and smaller transistors, where even a single particle can do much more damage.
But these "soft errors" are only part of the problem. Over the past five years, a handful of researchers have taken a long, hard look at some very large computing systems and realized that, in many cases, the hardware we use is just plain broken. Heat or manufacturing defects can wear components out over time, causing electrons to leak from one transistor to another or breaking down the on-chip channels that are designed to carry current. These are the "hard errors."
This growing problem highlights the need to test hardware alongside software, which is of course the basis of in-the-wild testing.
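As a rough illustration of what testing for hard errors can look like at the lowest level, here's a memtest-style sketch (again my own, not from the article; the buffer size and helper name are assumptions) that writes complementary bit patterns to a buffer and reads them back, exposing cells stuck at 0 or 1.

```c
/* Sketch: a pattern test that catches "hard errors" such as stuck bits.
 * Writing complementary patterns and reading them back exposes cells
 * that can no longer hold both values. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Fill the buffer with a pattern, then verify every word reads back.
 * The volatile pointer keeps the compiler from optimizing away the reads. */
static int pattern_test(volatile uint32_t *buf, size_t words, uint32_t pattern) {
    for (size_t i = 0; i < words; i++)
        buf[i] = pattern;
    for (size_t i = 0; i < words; i++)
        if (buf[i] != pattern) {
            fprintf(stderr, "hard error at word %zu: wrote 0x%08X, read 0x%08X\n",
                    i, (unsigned)pattern, (unsigned)buf[i]);
            return -1;
        }
    return 0;
}

int main(void) {
    size_t words = 1 << 20; /* test a 4 MB buffer */
    volatile uint32_t *buf = malloc(words * sizeof *buf);
    if (!buf)
        return 1;

    /* 0x55... and 0xAA... together toggle every bit in both directions,
     * so a cell that cannot hold one of the two values fails the test. */
    int bad = pattern_test(buf, words, 0x55555555u) ||
              pattern_test(buf, words, 0xAAAAAAAAu);

    puts(bad ? "memory failed pattern test" : "memory passed pattern test");
    free((void *)buf);
    return bad;
}
```

Running a check like this periodically on devices in the field, rather than only on the factory floor, is exactly the kind of thing in-the-wild testing makes possible.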