Reproducible tarballs on Github?

Martin Monperrus martin.monperrus at gnieh.org
Tue Oct 26 06:58:05 UTC 2021


Dear all,

Thanks for your input. Many thanks, Eli, for your well-substantiated explanation.

Since, as far as I know, Github documents none of this publicly, I suggest using Eli's email 
<https://lists.reproducible-builds.org/pipermail/rb-general/2021-October/002422.html> as the reference 
URL on this topic.

Best regards,

--Martin


On 10/24/21 03:42, Eli Schwartz wrote:
> On 10/23/21 5:51 AM, Martin Monperrus wrote:
>> Dear all,
>>
>> FYI, Github's autogenerated release tarballs are not deterministic (see
>> discussion on keybase<https://github.com/keybase/client/issues/10800>,
>> and Bitcoin-core release warning
>> <https://github.com/bitcoin/bitcoin/releases/tag/v22.0>).
>>
>> Does anybody have good connections at Github to get this fixed?
>>
>> Best regards,
>
> I see a bunch of assertions that were counter-asserted. As several
> people in this thread have stated, the output of the `git archive`
> command is specifically designed to be reproducible these days.
>
> Github specifically uses `git archive`, and that has meaningful
> ramifications people depend on, such as respecting .gitattributes (e.g.
> export-ignore), and means that
>
> gzip -dc <archive-file>.tar.gz | git get-tar-commit-id
>
> will print the tar commit id, per the git-archive documentation.
>
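
A minimal sketch of the two behaviours described above (export-ignore and the embedded commit id), assuming git, gzip and tar are on the PATH. The repository name demo, the file names and the demo-1.0 prefix are made up for illustration; nothing here is taken from Github's actual setup.

```shell
#!/bin/sh
# Sketch: check that `git archive` honours export-ignore and embeds the commit id.
# All names (demo, notes.txt, demo-1.0) are hypothetical.
set -eu

work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo

echo 'hello' > README
echo 'internal notes' > notes.txt
# Paths marked export-ignore are left out of the generated archive.
echo 'notes.txt export-ignore' > .gitattributes

git add README notes.txt .gitattributes
git -c user.name=Demo -c user.email=demo@example.invalid commit -q -m 'initial commit'

git archive --format=tar.gz --prefix=demo-1.0/ -o ../demo-1.0.tar.gz HEAD

# The commit id travels in a pax global header at the start of the tar stream.
embedded=$(gzip -dc ../demo-1.0.tar.gz | git get-tar-commit-id)
head=$(git rev-parse HEAD)
listing=$(tar -tzf ../demo-1.0.tar.gz)

echo "embedded: $embedded"
echo "HEAD:     $head"
echo "$listing"
```

The id is only embedded when a commit or tag is archived; archiving a bare tree id writes no such header.
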
> A long time ago, Github's archives failed to be deterministic because
> git-archive was not deterministic. Since then, people periodically panic
> that it still isn't, but no one has provided credible proof that I can
> recall.
>
> I do not count cases that were successfully proven to have been upstream
> developers force pushing and overwriting a public tag because they
> wanted to revise history.
>
> ...
>
> Note also that GNU gzip's output algorithm is stable and deterministic.
> Other compression algorithms are not necessarily so... zstd for example
> documents that zstd compression is always deterministic *iff* you use
> the same version of zstd, which does no good for code hosting sites
> relying on git-archive... but then again, zstd is also a format under
> active development.
>
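
A small sketch of what "deterministic" means for gzip here, assuming GNU gzip and a POSIX touch. The -n flag keeps the original file name and timestamp out of the gzip header, so the same input bytes should compress to the same output regardless of the file's modification time. The file name and dates are arbitrary.

```shell
#!/bin/sh
# Sketch: same input bytes, different mtimes, identical `gzip -n` output.
set -eu

work=$(mktemp -d)
printf 'reproducible builds\n' > "$work/input.txt"

# Compress the same content under two different modification times.
touch -t 200001010000 "$work/input.txt"
first=$(gzip -cn "$work/input.txt" | cksum)

touch -t 201001010000 "$work/input.txt"
second=$(gzip -cn "$work/input.txt" | cksum)

echo "first:  $first"
echo "second: $second"
```

This only checks one gzip build against itself; it says nothing about whether two different implementations or versions agree, which is the harder question raised above.
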
> Note that the output does depend on which *implementation* of gzip you
> use. For example, most people use GNU gzip (I believe github always has),
> but busybox gzip ***used to*** produce different output.
>
> This was actually fixed by Daniel Edgecumbe (after I encouraged him to
> do so) and mentioned in the rb-general status update at:
>
> https://lists.reproducible-builds.org/pipermail/rb-general/2019-September/001647.html
>
> The patches were applied in
> https://git.busybox.net/busybox/commit/?id=c660cc1b7714fffbac95c9378ff4b73de650a6de
> and busybox v1.32.0 produces the same output as GNU gzip (and is
> expected to remain stable). This is relevant to anyone who actually runs
> a software forge using busybox and/or Alpine Linux, which I suspect lies
> drastically out of the Github interest zone.
>
>
> It specifically was relevant to sourcehut (https://git.sr.ht) and was in
> fact a cause of non-reproducibility on that software forge... which is
> now fixed.
>
>
> And, again, this literally just means you're dependent on the stability
> of the third-party compressor as your single point of failure (other
> than possibly using absolutely ancient, decrepit versions of git from
> before git-archive was specifically designed to be reproducible).
>
>
> If anyone can actually point to a real life case in the last 4 years
> where github autogenerated tarballs have actually changed, without
> changing the contents (that is to say, the union of unpacked files,
> leading directory prefix, and git-get-tar-commit-id header) I would
> actually love to hear about that. I've asked people before, and I
> believe I've always gotten one of 4 responses:
>
> - "I heard that it can happen"
> - "it happened before famous-date which is specifically acknowledged as
>     the cutoff point for git-archive itself"
> - "it happened right here" followed by verification that that specific
>     case was a force pushed tag change
> - "it happened right here" followed by verification that that specific
>     case was a repository rename, which resulted in the extracted
>     directory itself changing (so tar -xf ... && cd foo-1.0 became
>     instead tar -xf ... && cd bar-1.0)
>
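
To illustrate the last case in the list, here is a rough sketch, assuming git and tar are installed, of how changing only the leading directory prefix changes the archive bytes while the unpacked files stay identical. The names foo-1.0 and bar-1.0 follow the example above; the throwaway repository is hypothetical.

```shell
#!/bin/sh
# Sketch: same commit, two prefixes, different archive checksums.
set -eu

work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo
echo 'hello' > README
git add README
git -c user.name=Demo -c user.email=demo@example.invalid commit -q -m 'initial commit'

# Archive the same commit under the "old" and "new" repository names.
git archive --format=tar --prefix=foo-1.0/ -o ../foo.tar HEAD
git archive --format=tar --prefix=bar-1.0/ -o ../bar.tar HEAD

sum_foo=$(cksum < ../foo.tar)
sum_bar=$(cksum < ../bar.tar)

# Unpack both and compare the file trees themselves.
cd ..
tar -xf foo.tar
tar -xf bar.tar
if diff -r foo-1.0 bar-1.0 > /dev/null; then same_files=yes; else same_files=no; fi

echo "foo.tar: $sum_foo"
echo "bar.tar: $sum_bar"
echo "unpacked trees identical: $same_files"
```
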
>
> Given the relative popularity of people *saying* it can happen, surely
> there must be some evidence... somewhere... that this is still a problem?
>
> Meanwhile, it has remained true for years that I can reproduce the exact
> checksums of a github release using the following command line:
>
> $ git config --get alias.github-release
> !f() { local repo=$(basename "$(pwd)") tag=$1; git archive --prefix=${repo}-${tag#v}/ -o ${repo}-${tag#v}.tar.gz ${tag}; }; f
>
>
> $ git github-release <tag>
> $ git github-release <sha1>
>
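
A sketch of the same idea as a plain shell function, assuming git and gzip are installed. It archives one tag twice and compares the checksums, which checks local determinism only; comparing against a tarball downloaded from Github is left out. The function name github_release, the repository name demo and the tag v1.0 are hypothetical.

```shell
#!/bin/sh
# Sketch: archive the same tag twice and compare the resulting checksums.
set -eu

# Function form of the alias above; the repository name is passed explicitly.
github_release() {
    repo=$1 tag=$2 out=$3
    # Strip a leading "v" from the tag for the directory prefix (v1.0 -> 1.0).
    git archive --prefix="${repo}-${tag#v}/" -o "$out" "$tag"
}

work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo
echo 'hello' > README
git add README
git -c user.name=Demo -c user.email=demo@example.invalid commit -q -m 'initial commit'
git tag v1.0

# The .tar.gz suffix makes git archive pick the tar.gz format.
github_release demo v1.0 ../first.tar.gz
github_release demo v1.0 ../second.tar.gz

sum_first=$(cksum < ../first.tar.gz)
sum_second=$(cksum < ../second.tar.gz)

echo "first:  $sum_first"
echo "second: $sum_second"
```
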
>
> Maybe someone can point at git-archive itself modifying the format of
> its output stream? Since that really is what it all boils down to, I
> would imagine.
>
>