Reproducibility terminology/definitions
kpcyrd
kpcyrd at archlinux.org
Thu Nov 9 21:12:50 UTC 2023
On 11/9/23 11:13, Pol Dellaiera wrote:
> The document includes a dedicated section that attempts to formalize the
> concepts of 'computation' and 'reproducibility'. I've taken the liberty
> of synthesizing our formalism discussions into an online document, which
> is now available at: https://typst.app/project/rhUl4XwrXToXvxjoaWB6DI
>
> Several members have already provided good feedback, which I am
> currently reviewing to refine the document further. I would be grateful
> for your insights as well. Any feedback you might offer will be highly
> appreciated and will undoubtedly contribute to the document's evolution.
The official slogan of reproducible-builds.org is "create an
independently-verifiable path from source to binary code". I think
that's already fairly clear (in the context of reproducible builds) and
should always be the explicit primary focus.
The list of projects even attempting claims like this is fairly short,
and the list of projects actively and publicly being "put to the test"
is even shorter. Unless you are involved in at least one of them, it's
going to be difficult to tell issues and non-issues apart.
> - $t$, an impure function returning a date and time representation
https://reproducible-builds.org/docs/ has a list for "Achieve
deterministic builds" that is much longer than "avoid looking at the
current time". Most notably there's also "functions with unstable output
order", yet this doesn't matter much if there's no impact on build
outputs (e.g. because they are sorted at some point, or used in a way
where order doesn't matter, like a "sum" operation).
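A minimal sketch of that point, assuming a made-up build step that hashes
a list of object files (the file names, sizes and the manifest_digest
helper are hypothetical): unstable input order only changes the result
when the order actually reaches the output.

```python
import hashlib

def manifest_digest(names, sort_entries):
    """Hash a newline-separated file list, optionally sorting it first."""
    entries = sorted(names) if sort_entries else list(names)
    return hashlib.sha256("\n".join(entries).encode()).hexdigest()

# Two hypothetical directory listings: same files, different order,
# as a directory listing may return them on different filesystems.
listing_a = ["b.o", "a.o", "c.o"]
listing_b = ["c.o", "b.o", "a.o"]

# Order leaks into the output: the digests differ.
unsorted_equal = manifest_digest(listing_a, False) == manifest_digest(listing_b, False)

# Sorting at some point before writing removes the difference.
sorted_equal = manifest_digest(listing_a, True) == manifest_digest(listing_b, True)

# An order-insensitive use: the sum is the same for both listings.
sizes = {"a.o": 10, "b.o": 20, "c.o": 30}
total_a = sum(sizes[name] for name in listing_a)
total_b = sum(sizes[name] for name in listing_b)
```

Only the first case would need fixing for the build to be reproducible.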
Formalizing it as "impure function returning a date and time" would also
imply that "freezing time" to a hardcoded value solves this problem. That
has already been tried, but it introduces bugs and failing builds in too
many cases.
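For contrast with freezing the clock globally, here is a rough sketch of
the opt-in approach, where the build tool itself reads a timestamp from
the environment. SOURCE_DATE_EPOCH is the variable name specified by
reproducible-builds.org; the build_timestamp helper and the example value
are assumptions made for illustration.

```python
import os
import time

def build_timestamp(environ=os.environ):
    """Return the timestamp a build tool should embed in its output.

    Uses SOURCE_DATE_EPOCH (seconds since the Unix epoch) when it is set,
    and falls back to the current time otherwise.
    """
    value = environ.get("SOURCE_DATE_EPOCH")
    if value is not None:
        return int(value)
    return int(time.time())

# Hypothetical value, passed explicitly instead of touching the real environment.
pinned = build_timestamp({"SOURCE_DATE_EPOCH": "1700000000"})
```

Only the values written into build outputs are pinned here; the system
clock seen by the rest of the build is left alone.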
> - $H$, the set of all possible hardware environment
While rebuilders are naturally going to be a diverse set of hardware,
there is little to no value in "all possible hardware environments". If
99 people (that I trust) have confirmed they could reproduce the given
binary from source code, and 1 person discovers a silly hardware
configuration that produces a different binary (or fails to build at
all, for example because their computer only has 128 MB of RAM), there's
very little value in this finding and the software would likely still be
considered reproducible in practice.
> - $E$, the set of all possible software environment
I'm not aware of any project having this in scope. It's crucial for
projects to document their build environment (see buildinfo files) and
to match it when reproducing the build. If you use a different compiler
version, or a different linker version/configuration, you're almost
guaranteed to get mismatching binaries. Knowing that two different
compiler versions produce the same binary is also not inherently useful.
https://reproducible-builds.org/docs/recording/
https://reproducible-builds.org/docs/perimeter/
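A minimal sketch of "matching the build environment", assuming the
recorded environment has already been parsed into a simple package to
version mapping. This is a simplification, not the actual buildinfo file
format, and the package versions below are invented.

```python
def environment_mismatches(recorded, actual):
    """Return {package: (recorded_version, actual_version)} for every difference.

    A package missing on one side is reported with None for that side.
    """
    mismatches = {}
    for name in sorted(set(recorded) | set(actual)):
        want = recorded.get(name)
        have = actual.get(name)
        if want != have:
            mismatches[name] = (want, have)
    return mismatches

# Hypothetical environment recorded at build time vs. the rebuilder's system.
recorded = {"gcc": "13.2.1-3", "binutils": "2.41-3", "glibc": "2.38-7"}
actual = {"gcc": "13.2.1-3", "binutils": "2.42-1", "glibc": "2.38-7"}

diff = environment_mismatches(recorded, actual)
# Here only the linker package differs, which is enough to expect a
# mismatching binary.
```

A rebuilder would want this to come back empty before comparing build
outputs at all.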
> - $D$, the set of all possible input data
I'm not good at reading these, but "change the input data and expect the
same output data" does not sound right. Reproducible builds assume the
source code is canonically known and identified. There is no value in "I
changed the source code and got a different binary".
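Roughly sketched, assuming the source is identified by a SHA-256 digest
(the verify_rebuild helper, the return labels and the byte strings are
made up for illustration): a rebuild result only means something if the
source matches the pinned one first.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify_rebuild(source, pinned_source_digest, rebuilt, published_digest):
    """Classify a rebuild attempt against a pinned source and a published binary."""
    if sha256_hex(source) != pinned_source_digest:
        # Different input: a different output says nothing about reproducibility.
        return "wrong-source"
    if sha256_hex(rebuilt) == published_digest:
        return "reproduced"
    return "mismatch"

# Hypothetical stand-ins for a source tarball and a published binary.
source = b"int main(void) { return 0; }\n"
binary = b"\x7fELF-example"
pinned = sha256_hex(source)
published = sha256_hex(binary)

ok = verify_rebuild(source, pinned, binary, published)
bad_binary = verify_rebuild(source, pinned, b"\x7fELF-other", published)
bad_source = verify_rebuild(source + b"// edit\n", pinned, binary, published)
```

Only the "mismatch" case is a reproducibility finding; "wrong-source" is
just a different build.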
---
The draft is also missing problems that are well-known by practitioners,
like "we know the exact compiler used, but can't acquire a copy anymore"
(this is a common issue with snapshot.debian.org). The remaining
uncertainty in this space includes questions like "do we expect old
releases to continue to be reproducible, and if so, for how long". This
is a controversial topic because it would require a public archive of
all old build dependencies (which not every project is willing/able to
commit to).
I hope somebody considers this email useful.
cheers,
kpcyrd