Fw: Build Reproducibility in Debian - Opinion Needed

Wed Aug 24 17:37:37 UTC 2022

Hi Muhammad,

Thanks for re-sending this to our public list. More than happy to get
the conversation started for you, and hopefully others can chip in and
expand my answers, if not outright correct them. :)

> Do you feel that Build Unreproducibility issues can be captured
> using Just-In-Time Defect Prediction metrics (such as the amount of
> code change introduced or the dispersion of changes across different
> modules)? Basically, do you find that certain types of commits (such as
> widespread ones) are more problematic for build reproducibility?

I am not overly familiar with Just-In-Time Defect Prediction, but my
instinct is that the amount of code altered in a changeset may be
loosely related to the introduction of reproducibility issues. The
correlation is unlikely to be very strong, however, as reproducibility
problems often seem to appear or reappear in quite an arbitrary
fashion. This is purely an intuition of mine informed in part by
submitting (or, alas, resubmitting) reproducibility fixes in Debian
packages.

Just to give a quick example, large-scale rewrites of build systems
(e.g. replacing GNU Autotools with CMake, a common trend in the past
five years or so) will naturally be a source of reproducibility
issues, as well as exhibiting a large amount of changed code dispersed
across a significant number of modules. But, on the other hand, simply
introducing a worked example in the documentation can render
the package unreproducible too.

> Do you feel there is potential for detecting build unreproducibility
> statically (without executing adversarial rebuilds)?

Yes, very much so. And in fact, we've been doing this for a little
while: if you look at the issues.yml file within our notes.git
repository, you will find some examples that use
codesearch.debian.net to statically locate potential reproducibility
issues. The links there are not intended to be remotely exhaustive
of the potential for static analysis though; I'm sure we've been
doing other stuff, and there is scope for significantly more.
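
To give a flavour of what a static check of this kind might look like,
here is a minimal Python sketch. The pattern list and the find_suspects
name are purely illustrative assumptions on my part; they are not what
issues.yml or codesearch.debian.net actually use, and real checks would
need to be far more careful about false positives:

```python
import re

# Hypothetical patterns that often hint at embedded build-time state.
# Illustrative only; not taken from issues.yml.
SUSPECT_PATTERNS = {
    "build date macro": re.compile(r"__(DATE|TIME|TIMESTAMP)__"),
    "shell date call": re.compile(r"\$\(\s*date\b|`date\b"),
    "hostname capture": re.compile(r"\$\(\s*hostname\b|`hostname\b"),
}


def find_suspects(text):
    """Return sorted (line_number, label) pairs for suspicious lines."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SUSPECT_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label))
    return sorted(hits)


# A tiny made-up source fragment: a C line and a shell-style line.
sample = 'printf("built on %s\\n", __DATE__);\nBUILD=$(date +%s)\nint x = 1;\n'
print(find_suspects(sample))
```

A match is only a hint that a human should look closer, of course, not
proof that the package is unreproducible.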

Separate to that, whilst it does require a single build (and not an
adversarial rebuild), Lintian has some support for statically finding
reproducibility issues in Debian packages. For example, checking for
timestamped gzip files was added all the way back in 2014 (!):

  https://salsa.debian.org/lintian/lintian/commit/5ff108539deb1596f37a3f8d853e7e716d623c1e

... but it also supports checking for error tracebacks in manual pages
and a bunch of other low-hanging fruit. Searching the Lintian Git
history for "reproduc" should find everything.
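
As a rough illustration of the idea behind the gzip check, the sketch
below reads the MTIME field that the gzip format (RFC 1952) stores in
bytes 4-7 of the header. This is not Lintian's implementation (Lintian
is written in Perl); the gzip_mtime helper is a hypothetical name of
mine, and it only inspects the first member of the stream:

```python
import gzip
import struct


def gzip_mtime(data):
    """Return the MTIME header field of a gzip stream (0 means unset)."""
    if len(data) < 10 or data[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    # RFC 1952: bytes 4-7 hold MTIME as a little-endian 32-bit integer.
    (mtime,) = struct.unpack("<I", data[4:8])
    return mtime


# Two compressions of the same payload: one with an arbitrary
# timestamp embedded in the header, one with the field zeroed.
stamped = gzip.compress(b"changelog", mtime=1661362657)
clean = gzip.compress(b"changelog", mtime=0)
print(gzip_mtime(stamped), gzip_mtime(clean))
```

A non-zero value means the file embeds a timestamp and so may differ
between two otherwise identical builds; "gzip -n" or an mtime of 0
avoids this.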

Other avenues requiring a single build would include the instrumentation
approaches (e.g. strace, SystemTap, etc.) taken by a few projects. I think
Bernhard might be able to speak better on this, and there are some
academic projects in this area as well.

> What advice/training are developers given to ensure that they don't 
> induce a build unreproducibility issue?

Speaking only of Debian, there is no structured training given to
developers to ensure they do not introduce reproducibility issues.
This could potentially be added to the New Maintainer process,
though. In lieu of that, however, we endeavour to provide
documentation, tooling, automated testing and support where necessary.

Again, hope this gets the ball rolling.

Best wishes,

-- 
      o
    ⬋   ⬊      Chris Lamb
   o     o     reproducible-builds.org 💠
    ⬊   ⬋
      o