Introducing: Semantically reproducible builds

Mon May 29 18:22:31 UTC 2023

> On May 29, 2023, at 12:41 PM, kpcyrd <kpcyrd at archlinux.org> wrote:
> 
> I think the pypi example and missing .gitignore file is more about "git and pypi are both a VCS, did the author commit the same source code". It's about "what's the canonical source code release" instead of a real build.

Huh?  PyPI is not a VCS; it's a package repository. PyPI stores the bits sent to it for later retrieval, and that's mostly all it does. Those bits are claimed to have been generated by a build from source, for a given package name and version number, but PyPI has no way to verify this. It has almost none of the capabilities of a VCS (like git, mercurial, subversion, CVS, rcs, and sccs). The same is true of the other package repositories.

Saying "instead of a real build" is missing the point. There was a time long ago when a built package on npm, PyPI, RubyGems, etc., was just an archived copy of the source code. In many cases that's no longer true. Today the built result may be the result of a complex build process involving multiple tiers of compilers, tree-shaking, minification, and so on. The builds are typically created by the individual project (the repository simply *stores* whatever it's sent), so there's no central organization (like a Linux distro) that can enforce any build rules at all. As a result, there are no rules. I'm not delighted with this state of affairs; I'm just trying to *deal* with reality as it currently exists.

> I don't think it's a worthwhile activity to try to build security controls on top of it, it sounds more like a code-review problem. Source code inputs are commonly pinned by their sha256sum, so it's very clear what should be reviewed, with no ambiguity of some .gitignore being present or absent.

We seem to be talking past each other, as the threat model I'm considering is not the same.
The threat in view here isn't that the source is malicious, so source code review is irrelevant.
Indeed, many packages get *lots* of source review.

The threat is that someone has posted a built package on a repository (typically an unmanaged one),
and the user is trying to determine the likelihood that the given package is malicious when (1) the
source code is not malicious *and* (2) the build is not reproducible, either because there's no will
to make the changes necessary to make it reproducible OR because the builders/developers won't make those changes.

For example, when the login credentials for the repository have been stolen, the attacker
can easily upload a "new" package to the repository that adds malicious code.
The developer(s) might not notice this for a long time (if ever, since they may have died).
Saying "the builders should create reproducible builds" is true but irrelevant, because many
builds aren't reproducible, and I cannot change what the builders choose to do.

Please don't view the text above as opposing reproducible builds.
I think reproducible builds are the gold standard for countering subverted builds, and I will continue to encourage them.
But when you can't get them (e.g., because you don't have time to patch every program
in the universe or the builders won't make changes to their build process),
it's useful to look for some *workable* backoff alternatives. The backoffs may not give
you all you wanted, but they can at least help users focus on their biggest risks first.
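
As a rough illustration only (a sketch, not the actual tooling), one such backoff
is to compare the published package against your own rebuild while ignoring
differences assumed not to affect behavior. The IGNORABLE set and both function
names below are hypothetical, and treating a missing .gitignore as harmless is
an assumption you would have to justify for your own case:

```python
import hashlib

# Hypothetical list of file names whose presence or absence is assumed
# not to change what the package does when it runs.
IGNORABLE = {".gitignore"}


def digest_map(files):
    """Map each non-ignorable path to the SHA-256 hex digest of its content.

    `files` is a dict of relative path -> file content as bytes.
    """
    return {
        path: hashlib.sha256(data).hexdigest()
        for path, data in files.items()
        if path.rsplit("/", 1)[-1] not in IGNORABLE
    }


def semantic_diff(published, rebuilt):
    """Return the sorted paths that still differ once IGNORABLE files are dropped."""
    a, b = digest_map(published), digest_map(rebuilt)
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))
```

An empty result says the two trees match apart from the ignored files; anything
left in the list is what deserves a closer look first.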

In any case, I thought it'd be important for this group to learn about this approach,
in case it's useful to you.

--- David A. Wheeler