Introducing: Semantically reproducible builds

Sat May 27 08:03:37 UTC 2023

I could see myself supporting this.

It seems appropriate for the weaker term to require more words (thereby 
teeing up the opportunity to point out the distinction, which will 
remain important to do as part of urging further progress).  And this 
proposal does fit that criterion!

Cheers!

On 26.05.2023 22:06, David A. Wheeler wrote:
> Reproducible builds are great for showing that a package really was built
> from some given source, but sometimes they're hard to do.
>
> If your primary goal is to determine where the major risks are from subverted builds,
> I think a useful backoff is something called a "semantically reproducible build".
> (This term was decided on in a discussion with some other people & now I can't remember
> who came up with the term.)
>
> Below is a definition of the term & some rationale for it.
> Note that this is expressly *not* the same as a fully reproducible build, though
> any reproducible build is *also* a semantically reproducible build.
>
> My hope is that if someone wants a reproducible build, they'll use that term.
> However, if they want to talk about this backoff approach, they'll have a clearly
> *different* but *related* term they can use, eliminating a source of confusion.
>
> ---- David A. Wheeler
>
> ==== Details ======
>
> As explained in the documentation for the oss-reproducible tool
> <https://github.com/microsoft/OSSGadget/tree/main/src/oss-reproducible/README.md>,
> which is part of OSSGadget <https://github.com/microsoft/OSSGadget/>:
>
> "A project build is *semantically reproducible*
> if its build results can be either recreated exactly (a bit for bit reproducible build
> <https://en.wikipedia.org/wiki/Reproducible_builds>),
> or if the differences between the release package and a rebuilt package are not expected
> to produce functional differences in normal cases.
> For example, the rebuilt package might have different date/time stamps,
> or one might include files like .gitignore that are not in the other and would not change
> the execution of a program under normal circumstances."
>
> A semantically reproducible build has very low risk of being a subverted build
> as long as it's *verified* to be semantically reproducible.
> Put another way, verifying that a package has a semantically reproducible build
> counters the risk where the putative source code isn't malicious, but
> where someone has tampered with the build or distribution process,
> resulting in a built package that *is* malicious.
> It's quite common for builds to produce different date/time stamps, or
> to add or remove "extra" files that would have no impact if the original
> source code was not malicious.
>
> It's much easier (and lower cost) for software
> developers to create a semantically reproducible build instead of always
> creating a fully reproducible build.
> Fully reproducible builds are still a gold standard for verifying
> that a build has not been tampered with.
> However, creating fully reproducible builds often requires that package
> creators change their build process, sometimes in substantive ways.
> In many cases a semantically reproducible build requires no changes,
> and even when changes are required, there are typically fewer of them.
>
> OSSGadget <https://github.com/microsoft/OSSGadget/>
> includes a tool that can determine if a given package is
> semantically reproducible.
> It's still helpful to work to make a package a fully reproducible build.
> A fully reproducible build is a somewhat stronger claim, and
> you don't need a complex tool to determine if the package is fully
> reproducible.
> Even given that, it's easier to first create a package that's
> semantically reproducible, and then work on the issues remaining
> to make it a fully reproducible build.
>
> In short, making packages at least semantically reproducible
> (and verifying this) is a great countermeasure against subverted builds.
>
> I had earlier talked with some people about this idea, but they noted that
> there would be a lot of problems if the term "reproducible build" was changed to
> be something like this. After consideration, I've decided they're right.
> Bit-for-bit equality is a *powerful* countermeasure against even very clever attacks.
> However, there's value in taking steps to get closer to bit-for-bit equality,
> and if your goal is to measure risk, it's useful to have a term for this intermediate stage.
> So, let's create a new (but obviously similar) term for it.
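
To make the distinction between the two terms concrete, here is a minimal Python sketch of the two checks. It is not how oss-reproducible works; it only illustrates the definition quoted above. The function names and the IGNORABLE set are hypothetical, and treating .gitignore as the only ignorable file is an assumption taken from the example in the quoted definition.

```python
import hashlib
import os

# Hypothetical ignore list: files assumed not to change program execution.
# Only .gitignore is named in the quoted definition; a real tool would need
# a more carefully justified policy.
IGNORABLE = {".gitignore"}


def sha256_of(path):
    """Return the SHA-256 hex digest of a single file's content."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def is_bit_for_bit(released_archive, rebuilt_archive):
    """Fully reproducible check: the two package files are byte-identical."""
    return sha256_of(released_archive) == sha256_of(rebuilt_archive)


def file_digests(root):
    """Map each relative file path under root to the digest of its content."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            digests[os.path.relpath(full, root)] = sha256_of(full)
    return digests


def is_semantically_equivalent(released_dir, rebuilt_dir, ignorable=IGNORABLE):
    """Weaker check on two unpacked packages.

    Timestamps never enter the comparison because only file contents are
    hashed, and files whose names are in `ignorable` are dropped first.
    """
    def relevant(digests):
        return {path: digest for path, digest in digests.items()
                if os.path.basename(path) not in ignorable}

    return relevant(file_digests(released_dir)) == relevant(file_digests(rebuilt_dir))
```

With two unpacked trees that differ only in modification times or in a stray .gitignore, is_semantically_equivalent returns True, while any difference in the content of a remaining file makes it return False. The bit-for-bit check needs nothing beyond a hash of each package file, which matches the point above that no complex tool is required to verify a fully reproducible build.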