Reproducible Builds Verification Format

Mon May 18 03:31:01 UTC 2020

Hi Eric,

On Tue, 2020-05-12 at 23:44 +0200, Eric Myhre wrote:
> Some of these dreams and the outlines of these concepts have been around 
> quite a bit longer than this year, even.  I think some differential 
> diagnosis about what makes this draft different, and why it makes the 
> choices it does, would be useful.
> 
> Some things I'd like to see identified and explicitly discussed more 
> frequently in this concept space:
> 
> - What's the "primary key"?  In other words, how can I meaningfully 
> expect to identify this one attestation record, or this one build 
> instruction document?

If I understand the "primary key" correctly, we are looking at something like:

	origin-suite-component-target-name-version

this would look something like:

	debian-testing-packages-amd64-tmux-1.0-1

There is one file provided by Debian that fits that key.
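
As a rough sketch of that key in Python: the class name PrimaryKey is made up
for illustration and is not part of the draft; only the six field names and
the example values come from the lines above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrimaryKey:
    """Hypothetical container for the six key parts named above."""
    origin: str
    suite: str
    component: str
    target: str
    name: str
    version: str

    def __str__(self) -> str:
        # Join the parts in the order origin-suite-component-target-name-version.
        return "-".join(
            (self.origin, self.suite, self.component,
             self.target, self.name, self.version)
        )


key = PrimaryKey("debian", "testing", "packages", "amd64", "tmux", "1.0-1")
print(key)
```

Note that the version "1.0-1" itself contains a hyphen, so the joined string
cannot be split back into its parts unambiguously. Keeping the parts as
separate fields, and only joining them for display, avoids that problem.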
> 
> - What are the "secondary keys" I could plausibly expect to select on if 
> I have a zillion of these, and want to find those that should or should 
> not align in results?

I'm unsure what you mean by "secondary keys"; I'm guessing these would be the
results of multiple rebuilders. Each rebuilder offers a *status* for a
"primary key", currently one of "reproducible", "unreproducible" and a few
more[1].

[1]: 
https://github.com/aparcar/reproducible-builds-verification-format#status-enum
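
To illustrate how results from several rebuilders could be lined up per
"primary key", here is a minimal Python sketch. The rebuilder names and the
tuple layout are assumptions made for the example, not part of the draft;
only the two status strings are taken from the text above.

```python
from collections import defaultdict

# (rebuilder, primary key, status); rebuilder names are invented.
results = [
    ("rebuilder-a", "debian-testing-packages-amd64-tmux-1.0-1", "reproducible"),
    ("rebuilder-b", "debian-testing-packages-amd64-tmux-1.0-1", "reproducible"),
    ("rebuilder-c", "debian-testing-packages-amd64-tmux-1.0-1", "unreproducible"),
]


def statuses_by_key(records):
    """Group the reported statuses by primary key, then by rebuilder."""
    grouped = defaultdict(dict)
    for rebuilder, key, status in records:
        grouped[key][rebuilder] = status
    return dict(grouped)


def rebuilders_agree(statuses):
    """True if every rebuilder reported the same status for one key."""
    return len(set(statuses.values())) == 1


for key, statuses in statuses_by_key(results).items():
    print(key, "agree" if rebuilders_agree(statuses) else "disagree", statuses)
```

With data shaped like this, finding the keys whose results do not align is
a single pass over the grouped statuses.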

> - What parts of this info do we expect to be useful, and why?  (What 
> user story caused a certain piece of info to seem relevant and 
> actionable enough to include?)

Ideally the format is kept simple and allows package managers, automated result
collectors and developers to make sense of it. For example, information like
"status" is valuable for everyone, while diffoscope outputs (or the URLs to such
outputs) are valuable for developers.
> 
> - What things we *could* imagine someone proposing putting in this info 
> which we might reject because we don't believe it would be useful, and why?

Ideally the format is not so bloated that it becomes infeasible for package
managers to update it frequently (e.g. on every run of apt update). Therefore
links to developer-specific information are good, build logs not so much.
Overall there should be a field for storing build-system-specific information.
The concrete values will, however, differ widely between Arch Linux, Debian,
OpenWrt or a Java build environment.
> 
> The motivations of "a generic way to compare results" are good.  But 
> good intentions can only carry us so far.  These four things are some of 
> the first considerations I have when looking at a format proposal.  
> Without some thought about the "keys", I don't know how it will deliver 
> on "comparability" at scale.  Without some meta-documentation of not 
> just the data that goes _in_, but also the kind of data that _doesn't_, 
> I worry that the spec will become a kitchen sink, sopping up more data 
> with time regardless of its relevance, and correspondingly becoming less 
> and less useful over time.

True, once a format has been agreed on, its content shouldn't be easy to extend.

> 
> I don't know if these are the only four questions to ask, nor will I 
> claim they are perfect, but they're some of the first things that come 
> to my mind as heuristics, and I share them in the hope that they can be 
> a useful whetstone for someone else's thoughts.
> 
> 
> 
> As an incidental aside, I think what's currently listed in that github 
> link as "origin_uri" may be mistaken in its conception of "URI".  The 
> examples are such things as "http://ftp.us.debian.org/" and 
> "https://download.docker.com/", and I'm sure these are _locations_, not 
> _identifiers_ -- URLs, not URIs.

Correct, this is a mistake. What is meant is a publicly available source of
binaries.
> 
> And I would question (begging forgiveness from anyone who knows my 
> refrain already) if "locations" as any sort of primary key are a sturdy 
> idea to try to build upon.  They're terribly centralized. And provide 
> very little insurance against mutability events which can make all other 
> documents that refer to them become instantly useless.  
> Content-addressing may have some potential to address this, git (at 
> least in concept) has shown us the way...

-- 
Paul Spooren <mail at aparcar.org>