Introducing: Semantically reproducible builds

Sat May 27 13:24:25 UTC 2023

 > It's much easier (and lower cost) for software
 > developers to create a semantically reproducible build instead of always
 > creating a fully reproducible build.
 > Fully reproducible builds are still a gold standard for verifying
 > that a build has not been tampered with.
 > However, creating fully reproducible builds often require that package
 > creators change their build process, sometimes in substantive ways.
 > In many cases a semantically reproducible build requires no changes,
 > and even if changes are required, there are typically fewer changes
 > required.

I think semantically reproducible builds are going to be more expensive
in the long run.

diffoscope is only reliable if it reports both files as bit-for-bit
identical (exit code 0). If there are _any_ differences (exit code 1) it
generates a semantic diff to help debug the root cause, but it does not
guarantee a complete diff of every byte (and sometimes quite a lot of
the differing bytes are missing from the report).
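
For contrast, the bit-for-bit check itself needs no complex tool, it
reduces to comparing hashes. A minimal sketch in Python (the function
names are my own for illustration, they are not part of diffoscope or
any other tool):

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_bit_for_bit_identical(path_a, path_b):
    """True only if both files have the same SHA-256 digest."""
    return sha256_of(path_a) == sha256_of(path_b)
```

There is no opinion involved in the result: either the digests match or
they don't.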

I found that adding "benign" differences can sometimes help to prevent
diffoscope from revealing my malicious changes. If the semantic diff is
identical, it falls back to a binary diff (that would reveal my
backdoor). If I intentionally introduce some benign difference in the
semantic diff, it picks that up as the reason for the mismatch and
moves on (leaving my non-benign changes unreported).

https://twitter.com/kpcyrd/status/1575080558572449792

On top of the development cost of a *reliable* semantic diff program,
you would also still continuously depend on humans for their opinions
about each diff.

 > OSSGadget <https://github.com/microsoft/OSSGadget/>
 > includes a tool that can determine if a given package is
 > semantically reproducible.
 > It's still helpful to work to make a package a fully reproducible build.
 > A fully reproducible build is a somewhat stronger claim, and
 > you don't need a complex tool to determine if the package is fully
 > reproducible.
 > Even given that, it's easier to first create a package that's
 > semantically reproducible, and then work on the issues remaining
 > to make it a fully reproducible build.

oss-reproducible only seems to repack source code into different 
container formats like zip/tar but doesn't deal with any compilation steps.

As soon as you're in a position to manage the compiler infrastructure
too (so your binaries are even remotely close to each other) you're
usually in a good enough position to just go for fully reproducible
builds.

This is why it's mostly Linux distributions that are in the reproducible
builds space.

---

I think a better investment would be tooling to mimic the environment of
a given GitHub Actions worker run. The SBOMs GitHub currently generates
are based on *the source code*, but not *the CI run* that generated my
binaries.

For example, this GitHub Actions run generated a binary:

https://github.com/spytrap-org/spytrap-adb/actions/runs/5043916251

GitHub tells me it was built from this commit:

https://github.com/spytrap-org/spytrap-adb/commit/b8f667bf54f47a8c358f01aad6d027a70a6fb61b

But there is no tooling (that I'm aware of) that I can use to set up a
build environment on my own computer that matches the GitHub Actions
worker of the specific job that generated this binary.

It would also need to know which versions these commands resolved to:

- sudo apt-get install musl-tools
- rustup target add x86_64-unknown-linux-musl
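
To illustrate the kind of information that's missing, here is a rough
sketch in Python of a step a job could run to record what those commands
resolved to. This is not existing tooling: the probe list and function
names are hypothetical, and it assumes a Debian/Ubuntu-based worker with
dpkg-query, rustc and rustup on the PATH. It only records versions, it
does not recreate the environment.

```python
import subprocess

# Hypothetical probes: each label maps to a command whose output pins
# down what an install step actually resolved to on this worker.
PROBES = {
    "musl-tools": ["dpkg-query", "-W", "-f=${Version}", "musl-tools"],
    "rustc": ["rustc", "--version", "--verbose"],
    "rustup-targets": ["rustup", "target", "list", "--installed"],
}


def run_command(argv):
    """Return the command's stdout, or None if it is missing or fails."""
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def record_versions(probes=PROBES, run=run_command):
    """Run every probe and return a dict of label -> captured output."""
    return {label: run(argv) for label, argv in probes.items()}
```

The result of record_versions() could be dumped as JSON and stored next
to the binary, so that a rebuilder at least knows which package and
toolchain versions to look for.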

cheers,
kpcyrd