git 2.38.0: Change in `git archive` output

brian m. carlson sandals at crustytoothpaste.net
Mon Oct 17 00:51:25 UTC 2022


On 2022-10-17 at 00:02:19, Jeff King wrote:
> Interesting. For a small input, they seem to produce the same file for
> me:
> 
>   git init repo
>   cd repo
>   seq 1000 >file
>   git add file
>   git commit -m foo
> 
>   git -c tar.tar.gz.command='git archive gzip' \
>     archive --format=tar.gz HEAD >internal.tar.gz
>   git -c tar.tar.gz.command='gzip -cn' \
>     archive --format=tar.gz HEAD >external.tar.gz
>   cmp internal.tar.gz external.tar.gz && echo ok
> 
> but if I instead do "seq 10000", then the files differ. I didn't dig
> into the actual binary to see the source of the change. It might be
> something we can tweak (e.g., if it's how a header is represented, or if
> we can change the zlib parameters to find the same compressions).

I will say that trying to make two compression implementations produce
identical output is likely futile because there are almost always
multiple equivalent ways to encode the same data.  Most
implementations are going to prefer improving size over consistency, so
there's little incentive to copy the same algorithm across
implementations. I believe even GNU gzip has changed its output in the
past as better optimizations were implemented.

I mean, don't let me stop you from trying to tweak things to see if you
can make it work, but in general I think it's likely that some
divergence is going to occur between implementations no matter what.
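
As a rough illustration of the point about multiple valid encodings, the
sketch below compresses the same input twice with one gzip binary at two
different levels. This is only an analogy for the two-implementation case
(it assumes a gzip that supports -1 and -9, and the temporary file names
are made up for the example): the compressed bytes differ, yet both files
decompress to the original data.

```shell
# Work in a scratch directory so nothing in the current tree is touched.
tmp=$(mktemp -d)
seq 10000 >"$tmp/input"

# Same input, same tool, two compression levels; -n omits name and timestamp.
gzip -cn -1 "$tmp/input" >"$tmp/fast.gz"
gzip -cn -9 "$tmp/input" >"$tmp/best.gz"

# The compressed streams are not byte-identical...
cmp -s "$tmp/fast.gz" "$tmp/best.gz" || echo "compressed bytes differ"

# ...but each one round-trips to exactly the same input.
gzip -dc "$tmp/fast.gz" | cmp -s - "$tmp/input" && echo "fast.gz round-trips"
gzip -dc "$tmp/best.gz" | cmp -s - "$tmp/input" && echo "best.gz round-trips"
```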

> I don't think we make promises about stable output from "git archive".
> We've fixed bugs in the tar-generating side before that led to changes.
> But if we can easily make them the same, that might be worth doing.

Since this is on the reproducible builds list, I would be interested in
working with tar implementations to specify a profile of the pax format
that _is_ standardized, stable, and consistent and that Git and other
implementations could use to produce bit-for-bit identical tar archives
across versions, since this is a thing lots of people seem to want.  (If
that's of interest, please contact me off list.)

However, I don't think that trying to do that with compression formats
is likely to be productive, so users who care about reproducibility
would need to compare the uncompressed output.
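
Comparing the uncompressed output could look something like the sketch
below. The helper name same_payload is hypothetical, not part of Git or
gzip; it decompresses two gzip files into a scratch directory and compares
the resulting streams with cmp, ignoring how each one was compressed.

```shell
# same_payload A.gz B.gz
# Exit status 0 if both files decompress to identical bytes; non-zero if
# the payloads differ or either file fails to decompress.
same_payload() {
    dir=$(mktemp -d) || return 2
    gzip -dc "$1" >"$dir/a" &&
        gzip -dc "$2" >"$dir/b" &&
        cmp -s "$dir/a" "$dir/b"
    status=$?
    rm -rf "$dir"
    return $status
}
```

With the two archives from the quoted example, this would be run as
"same_payload internal.tar.gz external.tar.gz && echo ok".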
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA