reproducible builds vs silent data corruption

Bernhard M. Wiedemann bernhardout at lsmod.de
Sun Sep 29 06:43:22 UTC 2024


One more reason for reproducible builds (and deterministic computation
in general): to catch silent data corruption.
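
As a rough illustration of the idea: if a build is reproducible, two
independent runs (ideally on different machines or cores) should yield
bit-identical artifacts, so any mismatch points at corruption rather
than at ordinary nondeterminism. The sketch below compares two output
trees by SHA-256. The function names and the directory layout are
hypothetical, not part of any existing tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_artifacts(build_a: Path, build_b: Path) -> list[str]:
    """List relative paths that differ or exist in only one build tree."""

    def index(root: Path) -> dict[str, str]:
        # Map each file's path (relative to the tree root) to its digest.
        return {
            p.relative_to(root).as_posix(): sha256_of(p)
            for p in root.rglob("*")
            if p.is_file()
        }

    a, b = index(build_a), index(build_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

An empty result means the two builds agree. A non-empty result on a
build that is known to be reproducible is a hint to rerun the build and
to look at the hardware that produced the odd one out.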

https://x.com/petereliaskraft/status/1840011158347972765?t=8lNqKAsaFS1-GMqnKxVmPg&s=19

> What happens if your CPU gets something wrong? If it wakes up one day and decides 2+2=5?
> 
> Well, most of us will never have to worry about that. But if you work at a company the size of Google, you do, which is why this paper on "mercurial cores" is so fascinating.
> 
> What the authors report--and supposedly this is common knowledge at the hyperscalers--is that a couple cores per several thousand machines are "mercurial." Due to subtle manufacturing defects or old age, they give wrong answers for certain instructions. These can cause all sorts of impossible-to-diagnose issues. Some rare problems at Google that were traced back to bad CPUs include:
> 
> - Mutexes not working, causing application crashes
> - Silent data corruption
> - Garbage collectors targeting live memory, causing application crashes
> - Kernel state corruption causing kernel panics
> 
> What makes CPUs go bad? It's very hard to tell. The authors posit that issues are becoming more frequent as CPUs get more complex, but there aren't solid numbers behind that. There are certainly strong relationships between frequency, temperature, voltage, and bad CPU behavior--most mercurial CPUs only cause problems under very specific conditions, but those conditions vary from CPU to CPU. Age is another source of problems, as older CPUs are more likely to exhibit problems.
> 
> Bad CPUs are an especially serious problem because they're very hard to detect. If cosmic rays flip bits in storage or on the network, that can be detected through error coding. But there's no analogy for a CPU that allows cheap online verification of its correctness. Instead, the best detection techniques involve monitoring for symptoms. If a core exhibits exceptionally high rates of process crashes or kernel panics relative to its fellows, that's a strong indication something is wrong with it. For the most critical applications, the authors propose triple modular redundancy--redoing each of its computations on three cores and majority-voting a reliable result.
> 
> More than anything, this paper is a call to action--letting everyone know that CPUs can fail. So now, if you ever find a bug you can't diagnose, you can blame the CPU!
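
The triple modular redundancy mentioned in the quote can be sketched as
follows. This is only a minimal illustration under assumptions: the
names are hypothetical, the computation must be deterministic and
return a hashable value, and the replicas here run sequentially in one
process. A real setup would place each replica on a different core or
machine, which this sketch does not do.

```python
from collections import Counter
from typing import Callable, Hashable, TypeVar

T = TypeVar("T", bound=Hashable)


def majority_vote(results: list[T]) -> T:
    """Return the value produced by a strict majority of replicas."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        # No value has more than half the votes, so nothing can be trusted.
        raise RuntimeError(f"no majority among results: {results!r}")
    return value


def run_redundantly(compute: Callable[[], T], replicas: int = 3) -> T:
    """Run a deterministic computation several times and vote on the result."""
    return majority_vote([compute() for _ in range(replicas)])
```

With three replicas, one wrong answer is outvoted, e.g.
majority_vote([4, 4, 5]) returns 4. Voting only makes sense when the
computation is deterministic, which is the same property that
reproducible builds rely on.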


https://dl.acm.org/doi/abs/10.1145/3458336.3465297
