Introducing: Semantically reproducible builds

Fri May 26 20:06:44 UTC 2023

Reproducible builds are great for showing that a package really was built
from some given source, but sometimes they're hard to do.

If your primary goal is to determine where the major risks are from subverted builds,
I think a useful backoff is something called a "semantically reproducible build".
(This term was decided on in a discussion with some other people & now I can't remember
who came up with the term.)

Below is a definition of the term & some rationale for it.
Note that this is expressly *not* the same as a fully reproducible build, though
any reproducible build is *also* a semantically reproducible build.

My hope is that if someone wants a reproducible build, they'll use that term.
However, if they want to talk about this backoff approach, they'll have a clearly
*different* but *related* term they can use, eliminating a source of confusion.

---- David A. Wheeler

==== Details ====

As explained in the documentation for the oss-reproducible tool
<https://github.com/microsoft/OSSGadget/tree/main/src/oss-reproducible/README.md>,
which is part of OSSGadget <https://github.com/microsoft/OSSGadget/>:

"A project build is *semantically reproducible*
if its build results can be either recreated exactly (a bit for bit reproducible build
<https://en.wikipedia.org/wiki/Reproducible_builds>),
or if the differences between the release package and a rebuilt package are not expected
to produce functional differences in normal cases.
For example, the rebuilt package might have different date/time stamps,
or one might include files like .gitignore that are not in the other and would not change
the execution of a program under normal circumstances."
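
To make the definition concrete, here is a minimal sketch in Python of what
such a comparison might look like. This is *not* how oss-reproducible works
internally. The ignore list, the function names, and the choice of tar archives
as the package format are all assumptions made for illustration. The sketch
compares only file contents, so date/time stamps and other header metadata are
ignored, and it skips a few files that normally don't affect execution.

```python
import hashlib
import io
import tarfile

# Hypothetical ignore list: files assumed not to change program execution.
# A real tool would need a more carefully justified list.
IGNORABLE_NAMES = {".gitignore", ".gitattributes", ".npmignore"}


def content_digests(tar_bytes):
    """Map each regular file's path to the SHA-256 of its contents.

    Timestamps, owners, and modes live in the tar headers, so hashing only
    the contents leaves them out of the comparison.
    """
    digests = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, and so on
            base_name = member.name.rsplit("/", 1)[-1]
            if base_name in IGNORABLE_NAMES:
                continue  # "extra" file assumed to be harmless
            data = tar.extractfile(member).read()
            digests[member.name] = hashlib.sha256(data).hexdigest()
    return digests


def semantically_equivalent(release_tar, rebuilt_tar):
    """True if both archives hold the same non-ignorable file contents."""
    return content_digests(release_tar) == content_digests(rebuilt_tar)
```

Two archives that differ only in timestamps or in the presence of a .gitignore
would compare as equivalent here, while any change to a file that matters
(or any added or missing file outside the ignore list) would not.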

A semantically reproducible build has very low risk of being a subverted build
as long as it's *verified* to be semantically reproducible.
Put another way, verifying that a package has a semantically reproducible build
counters the risk where the putative source code isn't malicious, but
where someone has tampered with the build or distribution process,
resulting in a built package that *is* malicious.
It's quite common for builds to produce different date/time stamps, or
to add or remove "extra" files that would have no impact if the original
source code was not malicious.

It's much easier (and lower cost) for software
developers to create a semantically reproducible build instead of always
creating a fully reproducible build.
Fully reproducible builds are still a gold standard for verifying
that a build has not been tampered with.
However, creating fully reproducible builds often requires that package
creators change their build process, sometimes in substantive ways.
In many cases a semantically reproducible build requires no changes,
and even when changes are required, there are typically fewer of them.

OSSGadget <https://github.com/microsoft/OSSGadget/>
includes a tool that can determine if a given package is
semantically reproducible.
It's still helpful to work to make a package a fully reproducible build.
A fully reproducible build is a somewhat stronger claim, and
you don't need a complex tool to determine if the package is fully
reproducible.
Even given that, it's easier to first create a package that's
semantically reproducible, and then work on the issues remaining
to make it a fully reproducible build.
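
As a rough illustration of why the fully reproducible case needs no complex
tool: a bit-for-bit check is little more than a hash comparison of the two
package files. The Python sketch below assumes you already have the released
package and your own rebuilt package on disk; the function names are
hypothetical.

```python
import hashlib


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fully_reproducible(release_path, rebuilt_path):
    """True only if the two package files are bit-for-bit identical."""
    return sha256_of(release_path) == sha256_of(rebuilt_path)
```

Any difference at all, even a single changed timestamp inside the package,
makes this check fail, which is exactly what makes it both strong and
harder to satisfy.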

In short, making packages at least semantically reproducible
(and verifying this) is a great countermeasure against subverted builds.

I had earlier talked with some people about this idea, but they noted that
there would be a lot of problems if the term "reproducible build" was changed to
be something like this. After consideration, I've decided they're right.
Bit-for-bit equality is a *powerful* countermeasure against even very clever attacks.
However, there's value in taking steps to get closer to bit-for-bit equality,
and if your goal is to measure risk, it's useful to have a term for this intermediate stage.
So, let's create a new (but obviously similar) term for it.