[rb-general] Codehash.db

Ximin Luo infinity0 at debian.org
Tue Mar 14 21:00:00 CET 2017

> [..] During the last Reproducible Builds summit in Berlin, such a database
> was also discussed. (I'm not aware of any public design doc or
> implementation though.)
> I think it would be a good idea to actually coordinate this effort
> together with the Reproducible Builds folks (who I am Cc:ing), in order
> not to duplicate the work, as well as to create and/or use open formats
> - usable by all kind of software projects.
> Furthermore I'd like to suggest that a design document for such a
> database could be created (collectively). [..]

Hi Ulrike and everyone else,

I think first we should be clear on exactly what problems we're solving. There are a few problems:

1. Assuring that ${source code} == ${binary code}  (in practice we'd use the hashes, not the actual content)

This is what buildinfo files are for. The Debian ones don't actually record source code hashes yet, but this is a bug that I've been meaning to file and propose fixes for. (The fixes won't be trivial, unfortunately.)

2. Assuring that ${software name} == ${source code}.
3. Assuring that ${software name} == ${binary code}.

This is what Joanna's GitHub repo [1] seems to be trying to solve, and also what the binary transparency project (only half-alive, I think) is trying to solve.

This is a harder problem because names are subjective: you'd need to figure out some acceptable way of binding software names to the actual pieces of software that they represent. As you noted, "Who can upload hashes there and how can we build trust?" This is also why I want source hashes for (1), because atm the meaning of a buildinfo file implicitly requires you to trust the Debian FTP archive to give everyone the same and correct code for a given package name.

I haven't thought about this problem in the context of reproducible builds at all, because IMO it's outside of the scope of reproducible builds. But I'd be interested in hearing what thoughts other people have on solving this problem. I'm not sure the reproducible builds project *should* spend effort on it, but could change my mind based on other suggestions that are made.

Also, I think that (3) is really a combination of (2) and (1). These could be solved separately, and the solutions could then be combined by a higher layer to give users a friendlier notice that the software they have satisfies (3).
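
A minimal sketch of that layering, under stated assumptions: the two lookup tables, the package name "hello-1.0" and the function names are all hypothetical, made up for illustration. They stand in for whatever eventually provides (2) (a name-to-source binding) and (1) (buildinfo-like records). Neither is a real codehash.db or buildinfo format.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of some content."""
    return hashlib.sha256(data).hexdigest()


# Illustrative stand-ins for real source and binary content.
SOURCE = b"hello source tree"
BINARY = b"hello binary bytes"

# (2) hypothetical binding: software name -> source hash.
# Who is allowed to populate this table is the hard trust problem.
NAME_TO_SOURCE = {"hello-1.0": sha256_hex(SOURCE)}

# (1) hypothetical buildinfo-like records: source hash -> hashes of
# binaries that were reproducibly built from that source.
SOURCE_TO_BINARIES = {sha256_hex(SOURCE): {sha256_hex(BINARY)}}


def name_matches_binary(name: str, binary: bytes) -> bool:
    """(3) = (2) then (1): resolve name to source hash, then check the binary."""
    source_hash = NAME_TO_SOURCE.get(name)
    if source_hash is None:
        return False  # no name binding known, so nothing can be claimed
    return sha256_hex(binary) in SOURCE_TO_BINARIES.get(source_hash, set())


if __name__ == "__main__":
    print(name_matches_binary("hello-1.0", BINARY))
    print(name_matches_binary("hello-1.0", b"tampered binary"))
```

The point of the split is that the two tables can have different maintainers and different trust models; the higher layer only composes their answers.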

So, in a way we are already helping with part of the problem. I'd be happy to talk more about how what we're doing for (1) can be integrated with something-that-does-(2), to form a something-that-does-(3). But I don't have any good suggestions for something-that-does-(2) right now.


[1] https://github.com/rootkovska/codehash.db
For now we can ignore the fact that git's use of SHA-1 makes this exercise pointless, since GPG-signed commits/tags only sign a SHA-1 reference to the tree objects. In a real system, to work around this, people must release GPG-signed tarballs of the repo, or else wait for git to be fixed.

GPG: ed25519/56034877E1F87C35
GPG: rsa4096/1318EFAC5FBBDBCE
