git quirk: core.autocrlf
James Addison
jay at jp-hosting.net
Mon Apr 22 00:00:35 UTC 2024
Hi folks,
This message isn't _directly_ related to reproducible builds, but it
does relate to unexpected differences in text (including, potentially,
source code) checked out from git repositories, and I think that that
could be relevant to the audience here.
Some of the code within the Sphinx documentation generator removes
carriage-return ('\r' in Python string literal escape code notation)
characters from input documents before checksumming them, and that
part of the code puzzled me - generally any kind of content
modification before checksumming seems like a code smell to me.
The relevant code removes those carriage-returns so that the checksums
produced are in a sense cross-platform compatible; that is, the 'same
content' produces the same checksum whether the platform uses CRLF or
LF line-endings.
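To illustrate the idea (this is a minimal sketch and not the actual Sphinx code; the function names and the choice of SHA-256 are my own assumptions):

```python
import hashlib


def checksum(data: bytes) -> str:
    """Checksum the bytes exactly as given."""
    return hashlib.sha256(data).hexdigest()


def checksum_ignoring_cr(data: bytes) -> str:
    """Strip every carriage-return before checksumming."""
    return hashlib.sha256(data.replace(b"\r", b"")).hexdigest()


# The 'same content', once with LF and once with CRLF line-endings.
lf_text = b"Title\n=====\n\nBody text.\n"
crlf_text = lf_text.replace(b"\n", b"\r\n")

print(checksum(lf_text) == checksum(crlf_text))
print(checksum_ignoring_cr(lf_text) == checksum_ignoring_cr(crlf_text))
```

The first comparison is between different byte sequences, so the digests differ; the second compares identical bytes after the carriage-returns are dropped.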
Now, Python itself does include some functionality[1] to handle what
it refers to as 'universal newlines'; newlines in strings are
generally represented using a single '\n' character, which is
serialized and deserialized to CRLF or LF as platform-appropriate.
This is stable, mature and well-established behaviour at this point.
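A small sketch of the reading side of that behaviour, using only the standard library (the file name and contents are made up for the example):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")

    # Write raw bytes so that we control the line-endings exactly.
    with open(path, "wb") as f:
        f.write(b"first\r\nsecond\r\n")

    # newline=None (the default) enables universal newlines on read:
    # CRLF, CR and LF are all translated to '\n'.
    with open(path, "r", encoding="ascii", newline=None) as f:
        translated = f.read()

    # newline="" disables the translation and returns the text as stored.
    with open(path, "r", encoding="ascii", newline="") as f:
        untranslated = f.read()

print(repr(translated))
print(repr(untranslated))
```

When writing in text mode with the default newline=None, the reverse happens: each '\n' is written out as os.linesep for the platform.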
That universal newline handling may cause problems in some cases if
not handled carefully, but surprisingly -- at least to me -- 'git'
itself can also automatically convert the line-endings of text files to
the local platform's standard, depending on how it is configured.
I suppose this makes sense so that developer tooling designed for each
platform works as expected with text stored in git repositories
(which, internally, typically store the newlines using LF).
However it does mean that the checksums of files checked out from the
same origin git repository can differ on different OS platforms.
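When two checkouts disagree on a checksum, a quick byte-level count of the line-ending styles in the file can confirm whether this is the cause. A rough sketch (the function name is hypothetical; it only inspects bytes and knows nothing about git):

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, bare LF and bare CR line-endings in raw bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # LF not preceded by CR
    cr = data.count(b"\r") - crlf  # CR not followed by LF
    return {"crlf": crlf, "lf": lf, "cr": cr}


# Example input; for a real file, read it in binary mode, e.g.
# open(path, "rb").read(), so that Python does not translate newlines.
sample = b"one\r\ntwo\nthree\r"
print(count_line_endings(sample))
```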
Overriding this behaviour on a per-file basis is possible using
.gitattributes config[2] file(s) within the repository, or
alternatively a git client system can use the 'core.autocrlf'
configuration setting[3] to specify the desired line-ending-conversion
method.
Again: this is probably slightly off-topic and perhaps not of direct
relevance to anyone on the list today. However, it seems like the
kind of issue that is useful to be aware of if-and-when puzzling over
unexpected git content / checksum issues (situations that I _do_
expect people on this list to encounter from time to time).
Regards,
James
[1] - https://docs.python.org/3.12/glossary.html#term-universal-newlines
[2] - https://git-scm.com/docs/gitattributes
[3] - https://git-scm.com/docs/git-config#Documentation/git-config.txt-coreautocrlf
PS: For anyone concerned that this might inadvertently expose some
kind of checksumming vulnerability; I briefly worried about that after
determining the line-ending behaviour to be the cause. Padding of
source files with carriage-returns could be a way for bad actors to
attempt to find checksum collisions, yes; but equally, newlines -- or
spaces -- are available to achieve the same. Are there any languages
that attempt to prevent arbitrary source code padding so that
checksum-space-exploration from a known code plaintext is constrained?
Golang and other languages that require or support autoformatting may
be the safest bets.