Making reproducible builds & GitBOM work together in spite of low-level component variation

Wed Jun 22 18:43:49 UTC 2022

On 2022-06-22, David A. Wheeler wrote:
> The challenge is that I believe that there will be subtle variations in inputs caused by
> very low-level components, particularly kernels but potentially also low-level
> runtimes like the C runtime. This could result in irreproducibility of anything with GitBOMs
> if the whole process is applied without some corrective factor.
>
> I'm going to use the Linux kernel as an example here. That said,
> I suspect the issue is broader (it would at least apply to any kernel).
>
> Programs running on a Linux kernel eventually must call the kernel.
> To support this, the Linux kernel provides a mechanism to export its API. See
> "exporting kernel headers" here:
> https://docs.kernel.org/kbuild/headers_install.html#:~:text=The%20linux%20kernel's%20exported%20header,used%20with%20these%20system%20calls.
>
> These header files are either used directly by programs to call to the kernel,
> or are processed & converted into other files that end up getting embedded in
> intermediate runtimes (typically the C runtime).
>
> But here's the thing: kernel header files change on basically every release,
> e.g., to add new system calls or new flags. In practically all cases these changes
> don't change the result of executing a build, and thus don't currently interfere
> with reproducible builds. If GitBOM data is added, however, this variance will
> cause different hashes to be included, causing all build results (transitively) to be
> different when you use an even *slightly* different kernel version.
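
To illustrate the transitive effect described above, here is a minimal
sketch. It assumes artifact identifiers are computed the way git hashes
a blob (SHA-1 over a "blob <size>\0" header plus the content), and that
a build record is just the sorted list of input identifiers. The names
record_id, header_v1 and header_v2, and the record layout, are
hypothetical and are not GitBOM's actual format.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Hash content the way git hashes a blob object."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

def record_id(input_ids) -> str:
    """Hypothetical build record: sorted input ids, one per line, hashed as a blob."""
    body = "".join(f"{i}\n" for i in sorted(input_ids)).encode()
    return git_blob_id(body)

source = b"int main(void) { return 0; }\n"
header_v1 = b"#define __NR_foo 1\n"               # made-up header content
header_v2 = header_v1 + b"#define __NR_bar 2\n"   # "newer kernel" adds a define

# Same source, slightly different header: the records differ.
rec_v1 = record_id([git_blob_id(source), git_blob_id(header_v1)])
rec_v2 = record_id([git_blob_id(source), git_blob_id(header_v2)])

# Anything that records rec_v1 or rec_v2 as an input differs as well.
app_v1 = record_id([rec_v1])
app_v2 = record_id([rec_v2])
print(rec_v1 != rec_v2, app_v1 != app_v2)
```

The source file is untouched in both cases; only the unused extra
define in the header differs, yet every identifier downstream changes.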

> POTENTIAL SOLUTIONS
>
> Here are some potential solutions I can see:
>
> 1. For reproducible builds, rebuild on *EXACTLY* the same kernel version, C library, etc.
>   This means that you can't just use containers to control rebuilds, since typically containers are
>   designed to be able to run on arbitrary kernels & people normally upgrade containers.
>   You'll need to build on whole new VMs with specifically-configured kernels,
>   *NOT* just embed this in containers. You also need to record exactly which
>   kernel was used to compile it.

This seems more relevant for the way GitBOM records provenance
information than it does for achieving a reproducible build.

Kernel version differences are tested on Debian's 31k+ packages:

  https://tests.reproducible-builds.org/debian/index_variations.html

Most of the reproducibility issues I've encountered seem to come from
embedding the kernel version, not header data.

I don't recall off the top of my head how many packages have been
manually fixed, but the remaining packages in Debian that are affected
by kernel version differences amount to about 30 packages out of 31k+
total:

  https://tests.reproducible-builds.org/debian/issues/bookworm/captures_kernel_version_issue.html
  https://tests.reproducible-builds.org/debian/issues/bookworm/captures_kernel_version_via_CMAKE_SYSTEM_issue.html

So from a Reproducible Builds perspective, the running kernel should not
really matter... and that is a good thing!
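
As a rough illustration of the "captures kernel version" class of
issue, a check along these lines could flag artifacts that embed the
release string of the build machine. This is only a sketch: the
function name is hypothetical, and it assumes the version is embedded
as a plain, uncompressed string, which is not always the case.

```python
import platform

def captures_kernel_version(artifact: bytes, release: str) -> bool:
    """Return True if the kernel release string appears verbatim in the artifact."""
    return bool(release) and release.encode() in artifact

# Release of the kernel this script is running on, as reported by uname.
running = platform.release()

# Made-up artifact contents for demonstration.
clean = b"\x7fELF...program text..."
tainted = clean + b"built on " + running.encode()

print(captures_kernel_version(clean, "5.10.0-15-amd64"))
print(captures_kernel_version(tainted, running) or not running)
```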

> 2. For reproducible builds, redirect header file content requests so they use the same
>    header files, etc., as the original build. GitBOM doesn't care what the underlying kernel
>    version is really, it's just recording the inputs *used*. This means containers can
>    once again be used, even when the kernel changes, but it does complicate
>    performing reproducible builds.

Feels a bit unclean... either you record what you care about honestly,
or you decide you don't care about it and don't record it. I think the
key is the transparency about the process.

> 3. Have compiler flags/configurations to *omit* certain files from the GitBOM results.
>   After all, you're not actually *including* the kernel in the generated results, so it makes
>   sense to omit those files from the point of view of "what is being included in this application"?
>   Ed Warnicke hates this idea, because it creates a "blind spot" in GitBOM.

Yeah, I can see why someone would not like this approach.

> 4. Tweak the definition of reproducible builds so that it's a bit-by-bit identical
> copy of a specified artifact, but the artifact can be *part* of a file.

In practice, there are a few cases where this is done, e.g. .apk and
.rpm files embed signatures which need to be stripped out for
reproducibility comparison.

Excluding some bits and verifying the rest adds complication to the
verification process, and thus opportunities for errors, and I believe
at least once resulted in incorrect results due to bugs in the
verification process...
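
A minimal sketch of what "exclude some bits, verify the rest" looks
like, assuming the excluded regions are known as half-open byte ranges.
The function name and the range representation are hypothetical; real
.apk and .rpm verification works on the package structure, not on raw
offsets.

```python
def equal_outside(a: bytes, b: bytes, excluded) -> bool:
    """Compare two byte strings, ignoring the given half-open (start, end) ranges."""
    if len(a) != len(b):
        return False
    skipped = set()
    for start, end in excluded:
        skipped.update(range(start, end))
    return all(x == y for i, (x, y) in enumerate(zip(a, b)) if i not in skipped)

first = b"PAYLOAD-SIG:aaaa"
second = b"PAYLOAD-SIG:bbbb"

print(equal_outside(first, second, [(12, 16)]))  # signature bytes excluded
print(equal_outside(first, second, []))          # nothing excluded
```

Note how much rides on the ranges being right: a range that is one byte
too wide silently hides a real difference, which is exactly the kind of
verification bug mentioned above.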

> Basically, the checked artifacts are the files NOT including GitBOM.
> Since the *executed* parts would be identical, just not certain metadata,
> the risk of subverted code seems small. Sure, someone might slip secrets
> into the unchecked parts, but that's not really why most people are interested
> in reproducible builds.

Presuming I am understanding GitBOM correctly, I would consider the
GitBOM to be metadata about the build, and not the build artifact itself.

This is the same for .buildinfo files used in Debian; we don't expect
the .buildinfo files to be identical, just the checksums of the
artifacts that people actually use (e.g. the .deb files).
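
In other words, the comparison only looks at the artifact checksums and
ignores the accompanying metadata. A minimal sketch, assuming each
build is represented as a dictionary with "artifacts" (file name to
bytes) and "metadata" keys; that layout and the function names are
hypothetical and are not the .buildinfo format.

```python
import hashlib

def artifact_checksums(build) -> dict:
    """Map each artifact name to the SHA-256 of its contents."""
    return {
        name: hashlib.sha256(data).hexdigest()
        for name, data in build["artifacts"].items()
    }

def reproduced(build_a, build_b) -> bool:
    """Builds match if their artifacts match; metadata is deliberately not compared."""
    return artifact_checksums(build_a) == artifact_checksums(build_b)

# Made-up builds: identical artifact, different recorded environment.
original = {
    "artifacts": {"hello_1.0_amd64.deb": b"package bytes"},
    "metadata": {"kernel": "5.10.0-15-amd64"},
}
rebuild = {
    "artifacts": {"hello_1.0_amd64.deb": b"package bytes"},
    "metadata": {"kernel": "5.18.0-2-amd64"},
}

print(reproduced(original, rebuild))
```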

> 5. Rearchitect low-level components so that header files are only read where
> necessary, and then make the header files unlikely to change. That seems hard
> and unlikely to be successful.

Very easy to get it wrong sometimes, too.

> 6. Include in the build a "preprocessor" that extracts just the "interesting parts" of
> headers, and then only use the extracted versions (under the assumption that these
> parts won't change). Again, seems like a lot of work.

Indeed a lot of work, and perhaps more importantly, error-prone.

> Options #1 and #4 are probably the easiest to implement.
> Option #1 is probably the "purest" but it does impose a higher cost on
> performing reproducible builds.

If you treat GitBOM as metadata about a build, then I don't see any
reason why you need to do anything, at least from a Reproducible Builds
perspective.

Though, in order to use GitBOM data to verify the reproducibility of a
given build, well ... then you need to decide what parts of the build
environment are actually relevant, and provide a way to recreate a
sufficiently similar build environment from the GitBOM. The Reproducible
Builds documentation has a "What's in a build environment?" page:

  https://reproducible-builds.org/docs/perimeter/

That may mean using GitBOM to verify reproducible builds for a given
software project requires the exact same kernel version, or implementing
one of the other strategies you mentioned above, despite their
downsides. That may be for GitBOM to decide, or maybe even for
individual projects using GitBOM.

If GitBOM errs on the side of including checksums of everything in the
build environment, then to expect to verify a reproducible build when
handed a particular GitBOM, I would expect you would need to reproduce
the build environment exactly.

Reproducible Builds promises that given a sufficiently similar build
environment (not necessarily an identical one), you *should* be able to
build bit-for-bit identical artifacts from the same source code.

You are of course well aware that you can in some cases produce
bit-for-bit identical build artifacts with intentionally different build
environments, such as the work done on Diverse Double-Compiling GNU Mes
and producing bit-for-bit identical artifacts on Guix, Debian and NixOS:

  https://reproducible-builds.org/news/2019/12/21/reproducible-bootstrap-of-mes-c-compiler/

But this typically needs a lot of carefully crafted code intentionally
designed to pull this off, and is not a general expectation...

I suspect the same would hold true for GitBOM; you *might* get a
bit-for-bit identical build artifact with a build environment produced
from a slightly different GitBOM, but it is unreasonable to expect it in
most cases.

At the end of the day, we can only strive for perfection. A stray
particle here or there can trigger a surprise bit-flip; you have to
figure out your goals, and an appropriate level of accuracy that
sufficiently meets those goals.

Thanks for reading so far!

live well,
  vagrant