Arch Linux minimal container userland 100% reproducible - now what?

Vagrant Cascadian vagrant at reproducible-builds.org
Fri Mar 29 21:40:45 UTC 2024


On 2024-03-29, John Gilmore wrote:
> kpcyrd <kpcyrd at archlinux.org> wrote:
>> 1) There's currently no way to tell if a package can be built offline 
>> (without trying yourself).
>
> Packages that can't be built offline are not reproducible, by
> definition.  They depend on outside events and circumstances
> in order for a third party to reproduce them successfully.
>
> So, fixing that in each package would be a prerequisite to making a
> reproducible Arch distro (in my opinion).
>
> I don't understand why a "source tree" would store a checksum of a
> source tarball or source file, rather than storing the actual source
> tarball or source file.  You can't compile a checksum.

There are design tradeoffs, obviously.

Coming primarily from a Debian background, where the project keeps
source tarballs on its own infrastructure, that model makes sense to me.
It certainly makes things like GPL compliance straightforward.

I have also worked on Guix in more recent years, which has a model
closer to what I understand Arch Linux to be doing: keeping checksums of
the source code in the package definitions.

This has the obvious drawback of adding complexity to actually fetching
the source code, but it also makes it easier to look up where to get the
code; at least in Guix, a single git repository holds all the source
code references.

I know that with Guix, if the source is not already present in the local
cache, there is integration to fall back to other archival sources, such
as https://softwareheritage.org and https://disarchive.guix.gnu.org/ as
well as Guix's own build farms, which cache the sources (which are, for
the most part, just objects like any other) for a "reasonable" amount of
time.

The actual build environment for Guix has no network access; the build
tooling downloads the source (if not already present) and places it in
the isolated build environment before the build is performed.
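
On Linux you can approximate that isolation without any special build
daemon; here is a minimal sketch using an unprivileged network namespace
(assuming unprivileged user namespaces are enabled, and a hypothetical
source tarball):

  #!/bin/sh -e
  # Fetch everything up front, while the network is still reachable...
  wget https://example.org/releases/hello-2.12.tar.gz
  tar xf hello-2.12.tar.gz
  cd hello-2.12

  # ...then build in a namespace with no usable network interfaces, so
  # any sneaky download attempt during the build fails loudly.
  unshare --map-root-user --net sh -c './configure && make'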

That said, the entire history of Guix and every package it has ever
supported can be represented in a single git repository of only some
hundreds of megabytes. Mirroring that repository has modest
requirements.

Mirroring the entire history of Debian, by contrast, involves many
terabytes of binary data, with all the challenges of reliably storing
large amounts of data; unfortunately, there happen to be some gaps in
that history due to those very challenges. And that does not even begin
to include the VCS history of each individual Debian package...

It is a tradeoff, shunting the complexity from one step in the build
process to another.

Even most builds in Debian are typically performed by downloading source
code and build dependencies from ... a network... Debian just hosts more
of the infrastructure itself.

I'll admit, too much dependence on external networks does make me
nervous from a Reproducible Builds perspective.  Whether a project hosts
complete copies of source code or not is a design decision, like any
other, with various challenges, advantages and disadvantages.


> kpcyrd <kpcyrd at archlinux.org> wrote:
>> Specifically Gentoo and OpenBSD Ports have solutions for this that I 
>> really like, they store a generated list of URLs along with a 
>> cryptographic checksum in a separate file, which includes crates 
>> referenced in e.g. a project's Cargo.lock.
>
> I don't know what a crate or a Cargo.lock is,

It is all part of that newfangled Rust, which has its own package
manager, Cargo, which for the most part assumes it can download things
off the internet. The Cargo.lock file pins the exact version of each
dependency to use, along with a sha256 checksum of the corresponding
crate.

Debian maintainers of Rust packages tend to download all the relevant
bits somehow and upload the source code to Debian's archive. Guix is
interestingly in the middle: you have to declare the source you're
downloading and the relevant checksums, and the Guix tooling downloads
and verifies the source, then fires off a build in an isolated
environment with the dependencies and sources available.
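
For what it's worth, cargo itself can be driven in that same
fetch-first, build-offline style; a rough sketch, with the flags as I
understand them:

  # Resolve and download exactly what Cargo.lock pins, verifying the
  # recorded checksums, while the network is still available.
  cargo fetch --locked

  # Later, build without touching the network; this fails if anything
  # is missing, rather than silently downloading it.
  cargo build --release --offline --locked

There is also "cargo vendor", which copies all of the pinned
dependencies into a local directory; I would guess the Debian workflow
amounts to something along those lines.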


> but rather than fix the problem at its source (include the source
> files), you propose to add
> another complex circumvention alongside the existing package building
> infrastructure?  What is the advantage of that over merely doing the
> "cargo fetch" early rather than late and putting all the resulting
> source files into the Arch source package?

It seems like you are effectively asking "Why don't you merely change
the entire way in which Arch Linux does packaging?" rather than
extending the existing model to handle another case...

A quick search turned up this:

  https://wiki.archlinux.org/title/PKGBUILD

It covers the obvious things like package name, version, dependencies
and other package relationships, etc. ... plus links to where to get the
sources, and checksums for verifying those sources. There is nowhere
obvious to me to embed the sources themselves.
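
For illustration, a minimal PKGBUILD might look something like this; the
package name, URL and checksum here are made up:

  # A hypothetical, minimal PKGBUILD.
  pkgname=hello
  pkgver=2.12
  pkgrel=1
  pkgdesc="An example package"
  arch=('x86_64')
  url="https://example.org/hello"
  license=('GPL3')
  depends=('glibc')
  source=("https://example.org/releases/$pkgname-$pkgver.tar.gz")
  sha256sums=('0000000000000000000000000000000000000000000000000000000000000000')

  build() {
      cd "$pkgname-$pkgver"
      ./configure --prefix=/usr
      make
  }

  package() {
      cd "$pkgname-$pkgver"
      make DESTDIR="$pkgdir" install
  }

makepkg downloads whatever is listed in source=, verifies it against
sha256sums=, and only then runs build(); the sources themselves never
live in the packaging repository.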


>> 3) All of this doesn't take BUILDINFO files into account
>
> The BUILDINFO files are part of the source distribution needed
> to reproduce the binary distribution.  So they would go on the
> source ISO image.

In Arch Linux, I believe the .buildinfo files are actually embedded
inside each binary package. This has proven to be a really excellent
design choice; there is no need for a separate distribution mechanism,
or for hunting down the right metadata to correlate the correct
.buildinfo file.
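
If I have the details right, the metadata lives in a .BUILDINFO file at
the root of the package tarball, next to .PKGINFO and .MTREE, so
inspecting it is a one-liner (package filename hypothetical):

  # Arch packages are zstd-compressed tarballs; print the embedded
  # build metadata to stdout.
  bsdtar -xOf hello-2.12-1-x86_64.pkg.tar.zst .BUILDINFO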


>>                           Using plenty of different gcc versions looks 
>> annoying, but is only an issue for bootstrapping, not for reproducible 
>> builds (as long as everything is fully documented).
>
> I agree that it's annoying.  It compounds the complexity of reproducing
> the build.  Does Arch get some benefit from doing so?
>
> Ideally, a binary release ISO would be built with a single set of
> compiler tools.  Why is Arch using a dozen compiler versions? Just to
> avoid rebuilding binary packages once the binary release's engineers
> decide what compiler is going to be this release's gold-standard
> compiler?

Arch Linux is also famously a rolling release, so there is no "release",
per se...

I don't know of any binary distribution that uses exactly one compiler
to compile all of its binaries; at the very least, there are numerous
interdependent bootstrapping loops, and several iterations of the
compiler may be needed: minor version bumps, etc.

Part of that is just the mess of the current state of things, where to
build a C compiler you need a C++ compiler and python, and to build a
C++ compiler you need C and python, and to build python... ugh.

bootstrappable.org is trying to provide at least a reasonably
documentable path from a reasonably clean slate to our current mess.

Most distros I am aware of, since the mid-90s, just use whatever
compiler version happens to be there today to build today's packages,
heavily leveraging/trusting/hoping for ABI compatibility...


> (E.g. The one that gets installed when the user runs pacman
> to install gcc.)  Or do the release-engineers never actually standardize
> on a compiler -- perhaps new ones get thrown onto some server whenever
> someone likes, and suddenly all the users who install a compiler just
> start using that one?

Sounds like nearly every binary distribution I know about, more-or-less.

There may be *some* coordination, e.g. compiler version X.Y is the
default C compiler, but in order to build $compiler version X.Y you
needed compiler X.Y-N, and then a minor version update comes out and you
get compiler version X.Y.Z, or maybe a configure argument changes for
X.Y.Z+1. So that's typically at least three compilers and, depending on
the length of the development cycle, many more incremental iterations.

FWIW, there have been seven uploads of gcc-13 to Debian just this month
alone:

  https://tracker.debian.org/pkg/gcc-13

Some distros, such as Nix and Guix, do rebuild anything that has had a
change somewhere in its dependency chain, but very few distros do this,
instead relying on ABI compatibility.


live well,
  vagrant