[rb-general] Comparison of the Debian and Arch .buildinfo approaches (was: Re: buildinfo filename convention)

Levente Polyak levente at leventepolyak.net
Fri Aug 10 15:30:37 CEST 2018


Hi all,

First of all: I can't claim your points are fundamentally wrong, and you
were asked for your personal opinion anyway. I just don't see it as
dramatically as you do, so I would like to share my thoughts as well.
Thanks for your valuable response and input!


On 8/9/18 8:32 PM, Arnout Engelen wrote:
> 
> The main difference is Arch includes the .BUILDINFO inside the package
> and signs the whole package, where in Debian the .buildinfo is outside
> of the package (but contains its hash) and is signed separately.
> 
> This means when a rebuilder successfully rebuilds in the same
> environment, there is no big difference: on Arch he can share a second
> signature of the same package, and on Debian he can share a second
> signed .buildinfo containing the same hash.
> 
> When a rebuilder successfully rebuilds in a slightly different
> environment, however, things are a little more tricky on Arch: since
> the .BUILDINFO is different and contained in the package, the package
> is different. This means to share a successful rebuild it is not
> enough to publish a second signature, he must also share the package
> he built. To check the reproduction, a checker would have to fetch
> both packages and signatures, check the signatures, and check that the
> packages are identical except for the .BUILDINFO file.
>
> On Debian, it is sufficient for the rebuilder to share his signed
> .buildinfo file. The checker only needs to fetch package and both
> .buildinfo files, and check the signatures and hashes.

I believe checking "everything except the .BUILDINFO file" should not be
done under any circumstances, and it is not really required in this case
anyway.

The rebuilder needs some reproduction logic anyway: if you are
rebuilding a published package (rather than just continuously building
something twice against the same repo snapshot), you have to reinstall
the exact dependencies used for the first, "original" package (more
explanation follows).
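To make the dependency-restore step concrete, here is a minimal sketch (not the actual makepkg/rebuilder code) of extracting the recorded dependency set from an Arch-style .BUILDINFO. It assumes the simple "key = value" line format with the `installed` key repeated once per dependency; the example contents are made up for illustration.

```python
def parse_installed(buildinfo_text: str) -> list[str]:
    """Return the exact dependency set recorded at build time."""
    installed = []
    for line in buildinfo_text.splitlines():
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "installed":
            installed.append(value.strip())
    return installed

# Hypothetical .BUILDINFO excerpt:
example = """\
format = 1
pkgname = hello
builddate = 1533900000
installed = glibc-2.28-1-x86_64
installed = gcc-libs-8.2.0-1-x86_64
"""

print(parse_installed(example))
# → ['glibc-2.28-1-x86_64', 'gcc-libs-8.2.0-1-x86_64']
```

A rebuilder would then fetch exactly these package versions from an archive/snapshot before building.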

While in theory it is obviously an advantage to potentially reproduce a
package across minor dependency bumps that don't affect the produced
artifact, I doubt that in practice this works well enough to let a user
check reproducibility themselves. The best and most obvious examples are
a different gcc version (which will ultimately produce different binary
code) or any kind of soname bump.
It's more practical for anyone who wants to prove reproducibility of a
published package to restore the versions used in the first build. If
the rebuilder tool does that anyway, you just restore some basic things
like the packager string as well and simply get an identical artifact
(including the .BUILDINFO).
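The final check in that workflow is then trivial: with the full environment (packager string included) restored, the rebuilt package must be byte-for-byte identical to the original, .BUILDINFO and all. A sketch, with placeholder bytes standing in for the real package files:

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Hash an artifact's raw bytes."""
    return hashlib.sha256(data).hexdigest()

# Placeholders for the published package and our reproduction attempt:
original = b"...package bytes..."
rebuilt = b"...package bytes..."

if sha256_of(original) == sha256_of(rebuilt):
    print("reproducible: artifacts are identical")
else:
    print("NOT reproducible: investigate (e.g. with diffoscope)")
```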

Just to be sure: I totally see your points and I'm fully aware of the
advantages of a detached .BUILDINFO file; I just want to share my view
on this and why I'm not convinced that it gives enough advantages or any
meaningful security improvement to live with the potential noise and
false negatives such an approach creates.

Technically you don't even need the binary package with the Arch
approach; the only additional data that is really necessary is the
.BUILDINFO file itself and a trusted hash to compare against. For the
latter you can simply take the package signature itself: it is basically
an authenticated hash that can be used to verify whether the reproduced
artifact is reproducible or not. Of course, when the .BUILDINFO file is
retrieved without the package it would not be signed/authenticated, but
if the goal is to have a platform/storage that holds just the .BUILDINFO
file, that could be signed individually as well. The "second" .BUILDINFO
file would not even need another signature from the checker, as in this
approach the .BUILDINFO file is identical and "reproducible" itself.

Whenever something is _not_ reproducible and you want to investigate the
cause (e.g. via diffoscope), you will need the original and the
reproduced artifacts/packages anyway -- so there is no difference
between Arch and Debian here, but I agree it could be handy to have lots
of non-required environment data for doing so.
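diffoscope is the right tool for that deep comparison; as a toy stand-in for its first step, the sketch below compares two tar archives member-by-member and reports which files differ -- e.g. whether only .BUILDINFO varies or actual binary content does. All file names and contents are invented for the example.

```python
import hashlib
import io
import tarfile

def make_tar(files: dict[str, bytes]) -> bytes:
    """Build a small uncompressed tar archive in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def member_hashes(tar_bytes: bytes) -> dict[str, str]:
    """Map each regular file in the archive to its content hash."""
    hashes = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
        for member in tar.getmembers():
            if member.isfile():
                data = tar.extractfile(member).read()
                hashes[member.name] = hashlib.sha256(data).hexdigest()
    return hashes

def differing_members(a: bytes, b: bytes) -> list[str]:
    """Names of members that are missing or differ between two archives."""
    ha, hb = member_hashes(a), member_hashes(b)
    return sorted(name for name in ha.keys() | hb.keys()
                  if ha.get(name) != hb.get(name))

# Two hypothetical packages that differ only in their .BUILDINFO:
a = make_tar({".BUILDINFO": b"builddate = 1", "usr/bin/hello": b"\x7fELF"})
b = make_tar({".BUILDINFO": b"builddate = 2", "usr/bin/hello": b"\x7fELF"})
print(differing_members(a, b))
# → ['.BUILDINFO']
```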


> 
> While the Arch approach has the advantage that the .BUILDINFO is more
> 'tied' to the package, I think I like the Debian approach more for 2
> (closely related) reasons:
> 
> 1) as demonstrated above, sharing and checking that the package was
> successfully reproduced across 'slightly different' environments is
> much easier with the Debian approach.

Yes, this is true. I still don't believe it will be very practical for
varying package versions, as explained above.
If someone wants to do reproducibility tests, e.g. to track down
problems on slightly different environments, the tool of choice could be
'reprotest'.
Also, the concept of sharing such reports is more complex than just
providing a platform to upload varying .BUILDINFO files: you need to
have something in place (trust points, web of trust or whatever, abuse
reporting) that eliminates "malicious wrong results" and/or trolling,
but that's quite off-topic here.

> 
> 2) with the Arch approach, it is relatively 'expensive' to add new
> fields to the .BUILDINFO, as also 'irrelevant' differences in the
> .BUILDINFO lead to different packages. There is no such cost in the
> Debian approach: as long as the package hash in the signed .buildinfo
> is OK, all is OK. Adding (possibly-irrelevant) fields to .buildinfo
> can be useful for tracking down sources of accidental
> non-reproducibility, so it is nice if this is cheap.

I fully agree here! However, as pointed out above, the .BUILDINFO is
strictly tied to a specific makepkg version anyway, so whenever the
rebuilder reconstructs the package versions from a repository snapshot,
the version of the .BUILDINFO file will be identical and newly added
fields will be ignored (sure, the newly added fields don't provide any
value then).

It is truly a disadvantage that this approach can't produce/distribute
new versions of the .BUILDINFO spec with additional data, and it's also
true that there is no additional varying data about the environment that
could be shared by a mismatching checker to help track down the
source... but you would want to have the non-reproducible artifact for
investigation anyway, and it hopefully gives a reasonably good
understanding of where the problem could be when looked at via
diffoscope. With the Debian approach you will have more, and
standardized, info about the environment that produced the
non-reproducible artifact; that's beyond question and definitively an
advantage of that approach, but even there you would want a copy of the
non-reproducible artifact as well as the original artifact.
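To illustrate what that extra environment data buys you: two .buildinfo/.BUILDINFO files from a mismatching rebuild could be diffed field-by-field to narrow down the cause. This is a sketch, again assuming the simple "key = value" line format (repeated keys are collected into lists) and invented field values.

```python
def parse_fields(text: str) -> dict[str, list[str]]:
    """Collect key = value lines; repeated keys accumulate into lists."""
    fields = {}
    for line in text.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            fields.setdefault(key.strip(), []).append(value.strip())
    return fields

def field_diff(a: str, b: str) -> dict:
    """Return only the fields whose values differ between two files."""
    fa, fb = parse_fields(a), parse_fields(b)
    return {k: (fa.get(k), fb.get(k))
            for k in fa.keys() | fb.keys()
            if fa.get(k) != fb.get(k)}

# Hypothetical buildinfo contents from original build and rebuild:
first = "builddate = 1533900000\npackager = Alice\nformat = 1"
second = "builddate = 1533990000\npackager = Bob\nformat = 1"

print(sorted(field_diff(first, second)))
# → ['builddate', 'packager']
```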

> 
> I think there is a lot of value in checking reproducibility across
> 'slightly different' environments, as one of the reasons for doing
> this in the first place is to find malicious sources of
> unreproducibility. It would be a shame if we missed a backdoor because
> we so carefully made sure all rebuilders used the same environment
> (containing the same trojan).
> 

I partially agree on this as well, but I'm not at all convinced that
this is a practical way to find a trojan, in the sense that it really
improves/tightens anything and can't be circumvented easily. It's not
like Arch needs a crazy amount of things to be the same; basically we
need: dependency versions, builddate, packager string and the builddir
(the latter nowadays has gcc support and such).

Even though I see some advantages in finding varying dependency versions
that still produce the same final artifact, especially from a
developer/maintainer point of view for testing and fiddling, I don't
think we really lose much in terms of security by just requiring the
very same versions. For now I think it's just "fun" to vary them, but in
reality all kinds of things influence the result (gcc, glibc and tons of
other things like sonames/ABI) to a degree that it will create lots of
noise through false negatives.

In my opinion, the practical scope of reproducible builds is primarily
to find a trojan/backdoor between a (binary) upload and the
corresponding published "source", a.k.a. build-blueprint. Let me
explain:

- If the upstream release is already backdoored, you won't be able to
detect much. An inserted backdoor will mostly just deterministically
backdoor the build and/or depend on runtime input rather than build-time
input (or just introduce a subtle bug that can be used to exploit the
target).

- If the build-blueprint is backdoored, pretty much all of the above
applies as well. The protection offered by reproducible builds won't
solve/remove the need to still review/audit the build-blueprints.
Malicious changes can still easily be made in a way that reproducibly
backdoors the package even in 'slightly different' environments. From an
attacker's point of view it even makes sense to backdoor it in a smart
way while staying fully reproducible (to circumvent any kind of
automated detection), so we would not really remove much attack surface.

- If a malicious artifact plus a non-malicious "corresponding"
build-blueprint is uploaded, all the varying approaches discussed here
will detect it, no matter whether it was done by blackmailing or by
deeply backdooring the build system.

I'm convinced we still need all the following to provide a meaningful
amount of security:
- authenticated upstream sources via signatures
- reproducible builds to detect malicious uploads
- review/audit build-blueprints
- and ultimately: review/audit the upstream sources as well

sincerely,
Levente
