Reproducibility terminology/definitions

Thu Nov 9 21:12:50 UTC 2023

On 11/9/23 11:13, Pol Dellaiera wrote:
> The document includes a dedicated section that attempts to formalize the 
> concepts of 'computation' and 'reproducibility'. I've taken the liberty 
> of synthesizing our formalism discussions into an online document, which 
> is now available at: https://typst.app/project/rhUl4XwrXToXvxjoaWB6DI
> 
> Several members have already provided good feedback, which I am 
> currently reviewing to refine the document further. I would be grateful 
> for your insights as well. Any feedback you might offer will be highly 
> appreciated and will undoubtedly contribute to the document's evolution.

The official slogan of reproducible-builds.org is "create an 
independently-verifiable path from source to binary code". I think 
that's already fairly clear (in the context of reproducible builds) and 
should always be the explicit primary focus.

The list of projects even attempting claims like this is fairly short, 
and the list of projects actively and publicly being "put to the test" 
is even shorter. Unless you are involved in at least one of them, it's 
going to be difficult to tell issues and non-issues apart.

 > - $t$, an impure function returning a date and time representation

https://reproducible-builds.org/docs/ has a list for "Achieve 
deterministic builds" that is much longer than "avoid looking at the 
current time". Most notably there's also "functions with unstable output 
order", yet this doesn't matter much if there's no impact on build 
outputs (e.g. because they are sorted at some point, or used in a way 
where order doesn't matter, like a "sum" operation).
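
To illustrate that distinction, here is a minimal Python sketch. The 
file names and sizes are made up for the example. Unstable input order 
only matters when it leaks into the output.

```python
# Hypothetical build inputs: file name -> size in bytes (illustrative values).
SIZES = {"b.o": 20, "a.o": 10, "c.o": 30}


def manifest_unsorted(names):
    # The output depends on whatever order the caller happened to receive.
    return "\n".join(names)


def manifest_sorted(names):
    # Sorting before writing removes the dependency on input order.
    return "\n".join(sorted(names))


def total_size(names):
    # A sum is order-independent, so unstable order is harmless here.
    return sum(SIZES[n] for n in names)


# Two orderings of the same inputs, as a directory listing might return
# them on two different machines.
order_1 = ["b.o", "a.o", "c.o"]
order_2 = ["c.o", "b.o", "a.o"]
```

Only manifest_unsorted() gives different results for order_1 and 
order_2. The sorted manifest and the total size are identical for both.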

Formalizing it as "impure function returning a date and time" would also 
imply that "freezing time" to a hardcoded value solves this problem. 
That has already been tried, but it introduces bugs and failing builds 
in too many cases.

 > - $H$, the set of all possible hardware environment

While rebuilders are naturally going to run on a diverse set of 
hardware, there is little to no value in "all possible hardware 
environments". If 99 people (that I trust) have confirmed they could 
reproduce the given binary from source code, and 1 person discovers a 
silly hardware configuration that produces a different binary (or fails 
to build at all, for example because their computer only has 128 MB of 
RAM), there's very little value in this finding and the software would 
likely still be considered reproducible in practice.
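
A rough Python sketch of that 99-out-of-100 scenario follows. The 
rebuilder names and artifact bytes are invented for illustration, and 
this is not how any particular rebuilder project reports results.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def agreement(reported: dict[str, str], expected: str) -> tuple[int, int]:
    """Return (matching, total) for a mapping of rebuilder -> digest."""
    matching = sum(1 for digest in reported.values() if digest == expected)
    return matching, len(reported)


# Hypothetical data: 99 rebuilders reproduce the published artifact,
# one rebuilder on unusual hardware gets something else.
published = sha256_hex(b"published artifact")
reports = {f"rebuilder-{i}": published for i in range(99)}
reports["odd-hardware"] = sha256_hex(b"different artifact")

matching, total = agreement(reports, published)
```

How many matching reports are enough is a trust decision for whoever 
consumes them. The sketch only counts the matches.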

 > - $E$, the set of all possible software environment

I'm not aware of any project having this in scope. It's crucial for 
projects to document their build environment (see buildinfo files) and 
to match it when reproducing the build. If you use a different compiler 
version, or a different linker version/configuration, you're almost 
guaranteed to get mismatching binaries. Knowing that two different 
compiler versions produce the same binary is also not inherently useful.

https://reproducible-builds.org/docs/recording/
https://reproducible-builds.org/docs/perimeter/
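
As a sketch of "record the environment, then match it": the Python 
snippet below compares a recorded tool list against the current one 
before a rebuild is attempted. The plain dict and the version numbers 
are assumptions for the example. This is not the buildinfo file format.

```python
def environment_diff(
    recorded: dict[str, str], current: dict[str, str]
) -> dict[str, tuple]:
    """Return tool -> (recorded version, current version) for mismatches.

    A tool missing on one side shows up with None for that side.
    """
    diff = {}
    for name in sorted(recorded.keys() | current.keys()):
        want = recorded.get(name)
        have = current.get(name)
        if want != have:
            diff[name] = (want, have)
    return diff


# Hypothetical versions, for illustration only.
recorded_env = {"gcc": "13.2.0", "binutils": "2.41"}
current_env = {"gcc": "13.2.0", "binutils": "2.40"}

mismatches = environment_diff(recorded_env, current_env)
```

An empty result means the rebuild environment matches what was 
recorded. Anything else should be fixed before comparing binaries.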

 > - $D$, the set of all possible input data

I'm not good at reading these, but "change the input data and expect the 
same output data" does not sound right. Reproducible builds assume the 
source code is canonically known and identified. There is no value in "I 
changed the source code and got a different binary".

---

The draft is also missing problems that are well known to practitioners, 
like "we know the exact compiler used, but can't acquire a copy anymore" 
(this is a common issue with snapshot.debian.org). The remaining 
uncertainty in this space is about questions like "do we expect old 
releases to continue to be reproducible, and if so, for how long". This 
is a controversial topic because it would require a public archive of 
all old build dependencies (which not every project is willing/able to 
commit to).

I hope somebody considers this email useful.

cheers,
kpcyrd