Research on Reproducible Builds

Omar Navarro Leija omarsa at seas.upenn.edu
Mon Feb 17 16:19:40 UTC 2020


> Guix also controls environment variables (which I don't think you talk
> about in your paper) to ensure the same initial state.

Yeah, we didn't really come up with a principled approach to handling env
vars. Inside the container, we start with the default container-set env
vars, but this itself may leak more information about the system than we
would like. On the other hand, unsetting too many env vars that programs
expect to be set often results in weird, hard-to-track-down errors.
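
To make the trade-off concrete, here is a minimal Python sketch of starting a
build step from a small, fixed set of env vars instead of the inherited ones.
This is not what DetTrace or Guix actually do: the names and values in
FIXED_ENV, and the helpers build_env and run_with_fixed_env, are hypothetical
and chosen only for illustration.

```python
import subprocess

# Hypothetical baseline: fixed values for variables many build tools expect.
# Neither DetTrace nor Guix is claimed to use this exact set.
FIXED_ENV = {
    "PATH": "/usr/bin:/bin",
    "HOME": "/nonexistent",
    "LANG": "C",
    "TZ": "UTC",
}


def build_env(extra=None):
    """Return a fresh environment: the fixed baseline plus explicit extras.

    Nothing is copied from os.environ, so host-specific variables cannot
    leak into the build.
    """
    env = dict(FIXED_ENV)
    env.update(extra or {})
    return env


def run_with_fixed_env(argv, extra=None):
    """Run argv with only the normalized environment visible to it."""
    return subprocess.run(argv, env=build_env(extra), check=True)
```

Something like run_with_fixed_env(["make"], extra={"CC": "gcc"}) would then
see the same variables on every machine. Deciding which variables belong in
the baseline is the unprincipled part described above.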

> We avoid a lot of the non-determinism issues other package managers are
> struggling with, but we still have some and we try to fix them one by one
> when we encounter them

I see. Does Guix maintain a list somewhere of all the sources of
nondeterminism that the project has come across? DRB has something like this at
https://tests.reproducible-builds.org/debian/reproducible.html

> so we don't prevent reading a file timestamp, current time or issues due
> to filesystem ordering. You seem to have a solution for these things that
> are probably our greatest cause of non determinism now, so I'm really
> looking forward to seeing your implementation and try and port it to guix
> if it is practical

Yeah, the main approach for this would be to intercept some subset of
filesystem system calls and ensure the results returned for various values
(e.g. timestamps) are deterministic. This itself requires some finesse, as
clever programs like autotools `./configure` create an empty file and
compare its mtime or ctime against the current time of the system clock to
check for clock skew (and crash if they don't like what they see).
Filesystem ordering is handled by sorting the results of the readdir system
call beforehand (this means reading the entire directory ahead of time). A
FUSE layer could also handle file ordering, but IMO it is not worth the
slowdown of all IO operations on the filesystem.
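
As a rough user-level illustration of those two ideas (not DetTrace's
implementation, which intercepts system calls), here is a Python sketch. The
names deterministic_listdir and VirtualTimestamps are hypothetical. The
logical clock is one assumed way to keep reported mtimes and the reported
current time consistent, so that a configure-style clock skew check still
passes.

```python
import os


def deterministic_listdir(path):
    """Read the whole directory up front, then return entries sorted.

    The order no longer depends on how the filesystem stores entries.
    """
    return sorted(os.listdir(path))


class VirtualTimestamps:
    """Hypothetical logical clock replacing wall-clock time.

    Every query or write advances a counter, so a file written during the
    build always has an mtime earlier than any later reading of "now".
    """

    def __init__(self):
        self._now = 0
        self._mtimes = {}

    def now(self):
        """Return the next logical time value."""
        self._now += 1
        return self._now

    def on_write(self, path):
        """Record a logical mtime for a file modified during the build."""
        self._mtimes[path] = self.now()

    def mtime(self, path):
        """Files that existed before the build all report mtime 0."""
        return self._mtimes.get(path, 0)
```

With this, a freshly written file gets mtime 1 and the next clock reading is
2 on every run, so the file never appears to come from the future.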

> I know, I'm also in academia :) (I'm a post-doc at yale now, I should
> update my webpage). I tend to prefer perfect and complete solutions, but if
> we can already improve things, it's great!

That's super cool!

On Wed, Feb 12, 2020 at 1:27 PM Julien Lepiller <julien at lepiller.eu> wrote:

> On 12 February 2020 at 11:50:11 GMT-05:00, Omar Navarro Leija <
> omarsa at seas.upenn.edu> wrote:
> >Hello Julien,
> >
> >I'm glad you enjoyed the work!
> >
> >Your understanding of DetTrace is correct. Our container abstraction is
> >very lightweight in the sense that we just piggyback off Linux namespaces +
> >chroot to provide isolation. Currently, to provide reproducibility, someone
> >using DetTrace should download a chroot image (e.g. via debootstrap) and
> >use this as the canonical filesystem image to use for the build. This is a
> >bit clunky, and I believe it is not a 100% satisfactory solution. I don't
> >know of any other ways to "normalize" the filesystem environment though.
> >
> >This may seem a little heavy handed, so I'm curious how Guix handles a
> >build process that tries to read arbitrary filesystem data? I'm reading
> >more about Guix now, so I'll have smarter things to say about it later
> >(hopefully).
>
> Guix builds packages in an isolated environment, with user namespaces: the
> build always happens in $TMPDIR/guix-build-package-n (normalized to
> /tmp/guix-build-package-0 in the environment), with access to declared
> inputs in the store (built with this process or downloaded after checking
> the hash of their content). The user is normalized to guix-build and uid
> and gid are set to 0 iirc. Guix also controls environment variables (which
> I don't think you talk about in your paper) to ensure the same initial
> state.
>
> The initial filesystem is therefore composed of the inputs, sources and
> build script, in /gnu/store that were built reproducibly (hopefully) and an
> empty directory in /tmp. Any machine can reproduce this environment
> bit-to-bit if: they use the same architecture, they use the same guix git
> commit, they build the same package, they could download or build every
> input (there's guix time-machine to re-create something from a specific
> commit). The store is more or less an append-only structure where you can't
> do any overriding, which ensures that same inputs=same outputs.
>
> I'm not sure we actually isolate anything in the store, but store items
> have a hash that's computed from inputs, sources and other stuff. It's not
> possible to guess that unless you have a direct reference (although an
> adversarial process could do a nasty ls I suppose).
>
> However, there is no other mechanism, so we don't prevent reading a file
> timestamp, current time or issues due to filesystem ordering. You seem to
> have a solution for these things that are probably our greatest cause of
> non determinism now, so I'm really looking forward to seeing your
> implementation and try and port it to guix if it is practical. We avoid a
> lot of the non-determinism issues other package managers are struggling
> with, but we still have some and we try to fix them one by one when we
> encounter them.
>
> >For DetTrace we set out to see if it was feasible to create a 100%
> >(foolproof) dynamic determinism enforcement system. I believe we succeeded
> >at this goal (modulo some CPU instructions). However, I don't believe the
> >foolproof solution is necessary or practical (tangent: our solution
> >attempts to be foolproof for mostly academic reasons, not practical
> >concerns about solving real problems; this is part of the fun of being in
> >academia).
>
> I know, I'm also in academia :) (I'm a post-doc at yale now, I should
> update my webpage). I tend to prefer perfect and complete solutions, but if
> we can already improve things, it's great!
>
> >
> >The point being: we attempt to sequentialize execution of threads, this is
> >extremely difficult, and we can't do it properly. The biggest sources of
> >unsupported packages (more details in the paper!) are Java, sockets, and
> >intra-process signals. Java always ends up deadlocking due to our attempts
> >to sequentialize thread execution in the JVM. With our current methods I
> >don't think this can ever work properly.
> >
> >Not all is lost though: I don't expect package builds to be
> >nondeterministic from thread scheduling, sockets, or signals. So the simple
> >solution is just to allow these things to happen in DetTrace and call it
> >good enough. We still get all the other benefits of DetTrace but relax the
> >paranoia and thus allow a wider set of packages to build. DetTrace could
> >certainly be modified to support this.
> >
> >I don't have any immediate plans to improve this, but would certainly not
> >be against it either. I like to think the biggest contribution of DetTrace
> >toward the reproducible builds effort is the ideas and methods, rather than
> >the implementation.
> >
> >I'll definitely let you know when it is available, the implementation is
> >not as robust as it could be. So I want to set expectations accordingly!
>
> Thank you!
>
> >
> >Omar
> >
>