Making reproducible builds & GitBOM work together in spite of low-level component variation

Wed Jun 22 16:19:31 UTC 2022

All:

I would like for reproducible builds & GitBOMs to be able to work simultaneously.

Unfortunately, I strongly suspect there's a subtle incompatibility when you try
to combine them, due to low-level component variation.
The good news is that I think there are ways to address this.

This message summarizes what I suspect is going to be a problem, along with a
few ways to address it. If someone sees a better way, PLEASE post.
It's also possible that this isn't a real problem - if that's so, please explain!
I'm trying to foresee & prevent problems before they become problems.
I'm posting to reproducible-builds, but also plan to post this to the GitBOM
community, so that both communities can look at the issue.

Details below.

--- David A. Wheeler

==========================

BACKGROUND

GitBOM is explained at <https://gitbom.dev/>. As they explain it, its purpose is to:
	• Build a compact Artifact Dependency Graph (ADG), tracking every source code file incorporated into each built artifact.
	• Embed a unique, content-addressable reference for that Artifact Dependency Graph (ADG), the GitBOM identifier, into the artifact at build time.
For example, if you invoked a compiler generating ELF, it would store in an ELF section
a sorted list of cryptographic hashes of the file inputs (each terminated with a newline).
Such lists are required to be *sorted*, so the result is deterministic given the same inputs.
Note that this recording is transitive.
This mechanism would typically be built into compilers & linkers (though there are alternatives).
The idea is to keep a complete record of the contents used, so later users can find out
which file contents went into an artifact. It's not exactly a
software bill of materials (SBOM), e.g., given an empty file you won't
know what software package it came from... but it's clearly a related idea.
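
As a rough illustration of the sorted-hash-list idea, here is a minimal Python
sketch. It assumes SHA-256 over the raw file contents and one hex digest per
line; the real GitBOM hash function, encoding, and document layout may differ,
and the names content_hash, gitbom_document, and gitbom_id are made up for this
example.

```python
import hashlib


def content_hash(data: bytes) -> str:
    # Assumption: SHA-256 of the raw bytes; real GitBOM may hash differently.
    return hashlib.sha256(data).hexdigest()


def gitbom_document(inputs: list[bytes]) -> bytes:
    # One hash per input, sorted, each terminated with a newline.
    hashes = sorted(content_hash(data) for data in inputs)
    return "".join(h + "\n" for h in hashes).encode("ascii")


def gitbom_id(inputs: list[bytes]) -> str:
    # The identifier is the hash of the document itself.
    return content_hash(gitbom_document(inputs))


main_c = b"int main(void) { return 0; }\n"
util_h = b"#define ANSWER 42\n"

# Sorting makes the result independent of the order the inputs were read in.
assert gitbom_id([main_c, util_h]) == gitbom_id([util_h, main_c])
```

The point of the sort is only determinism: the same set of input contents
always yields the same identifier, and any change to any input yields a
different one.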

Reproducible builds are explained at <https://reproducible-builds.org/>.
"Reproducible builds are a set of software development practices that create an
independently-verifiable path from source to binary code."; more specifically,
"A build is reproducible if given the same source code, build environment and build instructions,
any party can recreate bit-by-bit identical copies of all specified artifacts."
This is the best known countermeasure against subverted builds (as happened with SolarWinds' Orion).
The other main countermeasure against subverted builds is protecting the build environment,
but that assumes you can always protect an environment without failure EVER;
we need a better alternative.

THE POTENTIAL PROBLEM

Reproducible builds have made remarkable progress. However, reproducible builds are extremely
sensitive to changes, and my concern is that GitBOMs may introduce an additional sensitivity that
could be hard to address.

The challenge is that I believe there will be subtle variations in inputs caused by
very low-level components, particularly kernels, but potentially also low-level
runtimes like the C runtime. This could result in irreproducibility of anything with GitBOMs
if the whole process is applied without some corrective factor.

I'm going to use the Linux kernel as an example here. That said,
I suspect the issue is broader (it would at least apply to any kernel).

Programs running on a Linux kernel eventually must call the kernel.
To support this, the Linux kernel provides a mechanism to export its API. See
"exporting kernel headers" here:
https://docs.kernel.org/kbuild/headers_install.html#:~:text=The%20linux%20kernel's%20exported%20header,used%20with%20these%20system%20calls.

These header files are either used directly by programs to call the kernel,
or are processed & converted into other files that end up getting embedded in
intermediate runtimes (typically the C runtime).

But here's the thing: kernel header files change on basically every release,
e.g., to add new system calls or new flags. In practically all cases these changes
don't change the result of executing a build, and thus don't currently interfere
with reproducible builds. If GitBOM data is added, however, this variance will
cause different hashes to be included, causing all build results (transitively) to be
different when you use an even *slightly* different kernel version.
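
To make the transitive effect concrete, here is a toy Python model. It is only
a sketch under assumptions: the "compiler" is a made-up build() function that
appends an identifier to fixed machine code, the hash is SHA-256, the
"GITBOM:" marker is invented, and the added system call is fictional. It is
not how any real toolchain embeds GitBOM data.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build(machine_code: bytes, inputs: list[bytes]) -> bytes:
    # Toy "compiler": the output is the machine code plus an embedded
    # identifier derived from the sorted hashes of all inputs.
    document = "".join(h + "\n" for h in sorted(sha256_hex(i) for i in inputs))
    identifier = sha256_hex(document.encode("ascii"))
    return machine_code + b"\nGITBOM:" + identifier.encode("ascii")


def build_app(kernel_header: bytes) -> bytes:
    # The C runtime reads the kernel header; the app links the C runtime.
    libc = build(b"<libc machine code>", [kernel_header, b"libc source"])
    return build(b"<app machine code>", [libc, b"app source"])


old_header = b"#define __NR_read 0\n"
new_header = old_header + b"#define __NR_some_new_call 999\n"  # made-up syscall

app_old = build_app(old_header)
app_new = build_app(new_header)

# Same machine code, but the embedded identifiers (and so the files) differ.
assert app_old.split(b"\nGITBOM:")[0] == app_new.split(b"\nGITBOM:")[0]
assert app_old != app_new
```

The application never reads the kernel header itself, yet its output changes,
because the C runtime's embedded identifier changed and the C runtime is one
of the application's inputs.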

I've raised this issue with Aeva Black & Ed Warnicke, who work on GitBOM, and
discussed this at length in the hopes of finding solutions.

POTENTIAL SOLUTIONS

Here are some potential solutions I can see:

1. For reproducible builds, rebuild on *EXACTLY* the same kernel version, C library, etc.
  This means that you can't just use containers to control rebuilds, since typically containers are
  designed to be able to run on arbitrary kernels & people normally upgrade containers.
  You'll need to build on whole new VMs with specifically-configured kernels,
  *NOT* just embed this in containers. You also need to record exactly which
  kernel was used to compile it.

2. For reproducible builds, redirect header file content requests so they use the same
   header files, etc., as the original build. GitBOM doesn't care what the underlying kernel
   version is really, it's just recording the inputs *used*. This means containers can
   once again be used, even when the kernel changes, but it does complicate
   performing reproducible builds.

3. Have compiler flags/configurations to *omit* certain files from the GitBOM results.
  After all, you're not actually *including* the kernel in the generated results, so it makes
  sense to omit those files from the point of view of "what is being included in this application".
  Ed Warnicke hates this idea, because it creates a "blind spot" in GitBOM.

4. Tweak the definition of reproducible builds so that it's a bit-by-bit identical
copy of a specified artifact, but the artifact can be *part* of a file.
Basically, the checked artifacts are the files NOT including GitBOM.
Since the *executed* parts would be identical, just not certain metadata,
the risk of subverted code seems small. Sure, someone might slip secrets
into the unchecked parts, but that's not really why most people are interested
in reproducible builds.

5. Rearchitect low-level components so that header files are only read where
necessary, and then make the header files unlikely to change. That seems hard
and unlikely to be successful.

6. Include in the build a "preprocessor" that extracts just the "interesting parts" of
headers, and then only use the extracted versions (under the assumption that these
parts won't change). Again, seems like a lot of work.

Options #1 and #4 are probably the easiest to implement.
Option #1 is probably the "purest" but it does impose a higher cost on
performing reproducible builds.
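
For what it's worth, option #4 could look something like the following Python
sketch. It models an artifact as a simple mapping from section name to bytes
rather than parsing real ELF, and the section name ".gitbom" is hypothetical;
a real checker would need to use whatever section name and file format the
toolchain actually emits.

```python
import hashlib

GITBOM_SECTION = ".gitbom"  # hypothetical section name


def executable_digest(sections: dict[str, bytes]) -> str:
    # Hash every section except the GitBOM one, in a fixed (sorted) order.
    hasher = hashlib.sha256()
    for name in sorted(sections):
        if name == GITBOM_SECTION:
            continue
        content = sections[name]
        # Length-prefix name and content so section boundaries are unambiguous.
        for field in (name.encode("utf-8"), content):
            hasher.update(len(field).to_bytes(8, "big"))
            hasher.update(field)
    return hasher.hexdigest()


original = {".text": b"\x55\x48\x89\xe5", ".data": b"hello", ".gitbom": b"aaaa\n"}
rebuilt = {".text": b"\x55\x48\x89\xe5", ".data": b"hello", ".gitbom": b"bbbb\n"}
tampered = {".text": b"\x90\x48\x89\xe5", ".data": b"hello", ".gitbom": b"aaaa\n"}

# A rebuild that differs only in GitBOM metadata still verifies...
assert executable_digest(original) == executable_digest(rebuilt)
# ...while any change to the executed parts is still caught.
assert executable_digest(original) != executable_digest(tampered)
```

The unchecked section remains a place where arbitrary bytes could hide, which
is the trade-off noted under option #4.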

I'd love to hear other ideas.