Research on Reproducible Builds
Julien Lepiller
julien at lepiller.eu
Wed Feb 12 18:27:06 UTC 2020
On 12 February 2020 11:50:11 GMT-05:00, Omar Navarro Leija <omarsa at seas.upenn.edu> wrote:
>Hello Julien,
>
>I'm glad you enjoyed the work!
>
>Your understanding of DetTrace is correct. Our container abstraction is
>very lightweight in the sense that we just piggyback off Linux
>namespaces + chroot to provide isolation. Currently, to provide
>reproducibility, someone using DetTrace should download a chroot image
>(e.g. via debootstrap) and use this as the canonical filesystem image
>to use for the build. This is a bit clunky, and I believe it is not a
>100% satisfactory solution. I don't know of any other ways to
>"normalize" the filesystem environment though.
>
>This may seem a little heavy-handed, so I'm curious how Guix handles a
>build process that tries to read arbitrary filesystem data? I'm reading
>more about Guix now, so I'll have smarter things to say about it later
>(hopefully).
Guix builds packages in an isolated environment that uses user namespaces: the build always happens in $TMPDIR/guix-build-package-n (normalized to /tmp/guix-build-package-0 inside the environment), with access to the declared inputs in the store (either built by this same process or downloaded after the hash of their content has been checked). The user is normalized to guix-build, and uid and gid are set to 0, IIRC. Guix also controls environment variables (which I don't think you talk about in your paper) to ensure the same initial state.
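To make the environment-variable point concrete, here is a minimal Python sketch of the idea: start the build command from a fixed, minimal environment instead of inheriting the caller's. This is not how the Guix daemon is implemented; the helper name `run_in_clean_env` and the specific variable values are assumptions chosen for illustration.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical fixed environment; the values are illustrative assumptions,
# not the ones Guix actually sets.
FIXED_ENV = {
    "PATH": "/usr/bin:/bin",
    "HOME": "/nonexistent",
    "TZ": "UTC",
    "LC_ALL": "C",
}


def run_in_clean_env(argv, build_dir):
    """Run argv in build_dir with only the fixed variables defined.

    Passing env= replaces the inherited environment entirely, so nothing
    from the calling shell leaks into the child process.
    """
    env = dict(FIXED_ENV, TMPDIR=build_dir)
    return subprocess.run(
        argv, env=env, cwd=build_dir,
        capture_output=True, text=True, check=True,
    )
```

A caller would pass the build command and a fresh build directory, for example `run_in_clean_env(["make"], build_dir)`. Variables exported in the calling shell are then invisible to the build.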
The initial filesystem is therefore composed of the inputs, sources and build script in /gnu/store, which were (hopefully) built reproducibly, and an empty directory in /tmp. Any machine can reproduce this environment bit for bit if it uses the same architecture, uses the same Guix git commit, builds the same package, and can download or build every input (there's guix time-machine to re-create something from a specific commit). The store is more or less an append-only structure in which nothing can be overwritten, which ensures that the same inputs give the same outputs.
I'm not sure we actually isolate anything in the store, but store items have a hash that's computed from inputs, sources and other stuff. It's not possible to guess that hash unless you have a direct reference (although an adversarial process could do a nasty ls, I suppose).
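As a rough illustration of a name derived from inputs, the Python sketch below hashes a package name, a source hash, a build script and the list of input paths into a store-like path. It is a simplified assumption, not Guix's actual derivation hashing: the function `store_name`, the field order and the 32-character truncation are all made up for the example.

```python
import hashlib


def store_name(name, source_hash, input_paths, builder_script):
    """Return a store-like path whose hash depends on every build input.

    Hypothetical scheme for illustration only. Inputs are sorted so that
    the order in which they were declared does not change the result, and
    each field is length-prefixed so that field boundaries are unambiguous.
    """
    h = hashlib.sha256()
    for field in [name, source_hash, builder_script, *sorted(input_paths)]:
        data = field.encode("utf-8")
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return f"/gnu/store/{h.hexdigest()[:32]}-{name}"
```

With a scheme like this, changing any input changes the path, so an existing item never has to be overwritten, and a process that does not already hold a reference cannot easily guess the name.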
However, there is no other mechanism, so we don't prevent a build from reading a file timestamp or the current time, and we don't prevent issues due to filesystem ordering. You seem to have a solution for these things, which are probably our greatest cause of non-determinism now, so I'm really looking forward to seeing your implementation and trying to port it to Guix if that is practical. We avoid a lot of the non-determinism issues other package managers are struggling with, but we still have some, and we try to fix them one by one as we encounter them.
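For readers unfamiliar with these two sources of non-determinism, here is a small Python sketch of how a single build step can neutralize them by hand: it archives a directory with entries in sorted order and with timestamps, ownership and modes normalized. It is only an example of the per-package fixes mentioned above, not DetTrace's approach (which intercepts system calls) and not Guix code; `deterministic_tar` is a hypothetical helper, and it only handles regular files.

```python
import io
import os
import tarfile


def deterministic_tar(root, mtime=0):
    """Return tar bytes for the regular files under root.

    Output does not depend on directory listing order, file timestamps
    or file ownership.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # sorting in place fixes the traversal order
            for fname in sorted(filenames):
                full = os.path.join(dirpath, fname)
                info = tar.gettarinfo(full, arcname=os.path.relpath(full, root))
                if not info.isreg():
                    continue  # this sketch skips symlinks and special files
                info.mtime = mtime           # drop the real timestamp
                info.uid = info.gid = 0      # drop the real owner
                info.uname = info.gname = ""
                info.mode = 0o755 if info.mode & 0o100 else 0o644
                with open(full, "rb") as f:
                    tar.addfile(info, f)
    return buf.getvalue()
```

Two trees with the same file contents then produce identical archives, even if the files were created in a different order or at different times.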
>For DetTrace we set out to see if it was feasible to create a 100%
>(foolproof) dynamic determinism enforcement system. I believe we
>succeeded at this goal (modulo some CPU instructions). However, I don't
>believe the foolproof solution is necessary or practical (tangent: our
>solution attempts to be foolproof for mostly academic reasons, not
>practical concerns about solving real problems; this is part of the fun
>of being in academia).
I know, I'm also in academia :) (I'm a post-doc at Yale now, I should update my webpage). I tend to prefer perfect and complete solutions, but if we can already improve things, it's great!
>
>The point being: we attempt to sequentialize execution of threads. This
>is extremely difficult, and we can't do it properly. The biggest sources
>of unsupported packages (more details in the paper!) are Java, sockets,
>and intra-process signals. Java always ends up deadlocking due to our
>attempts to sequentialize thread execution in the JVM. With our current
>methods I don't think this can ever work properly.
>
>Not all is lost though: I don't expect package builds to be
>nondeterministic from thread scheduling, sockets, or signals. So the
>simple solution is just to allow these things to happen in DetTrace and
>call it good enough. We still get all the other benefits of DetTrace but
>relax the paranoia and thus allow a wider set of packages to build.
>DetTrace could certainly be modified to support this.
>
>I don't have any immediate plans to improve this, but would certainly
>not be against it either. I like to think the biggest contribution of
>DetTrace toward the reproducible builds effort is the ideas and methods,
>rather than the implementation.
>
>I'll definitely let you know when it is available. The implementation
>is not as robust as it could be, so I want to set expectations
>accordingly!
Thank you!
>
>Omar
>