[rb-general] SOURCE_PREFIX_MAP and Occam's Razor

John Gilmore gnu at toad.com
Sun Jan 22 04:31:47 CET 2017


> AIUI, the prefix map is for the case where you need to debug (or do
> something else to) two files with the same basename but in different
> directories, for example "utils.c". If you just strip out everything
> in the path, the debugger or other tool that needs this information
> can get confused.

I agree that that situation can be confusing to a debugger.

However, I suspect that there aren't two source files with the same
basename in the vast majority of the few thousand packages that are
not reproducible because they include the build-path by default.
(I saw that statistic somewhere but now I can't re-find it.)

If the build-path was only put into the binaries UPON REQUEST, then
the bulk of those packages would become reproducible.

The request for a build-path could be made by a command-line option in
CFLAGS, only in the small number of programs that actually need it.
Oh, you could use an environment variable for that, set it in a
makefile and read it in a compiler, like you're proposing.  But
passing such a thing "by the back door" will make it hard for people 5
or 10 years from now to figure out what's making the tools behave that
way, what this odd thing is in the Makefile, or why gcc has that
bizarre dependency on an obscurely named environment variable.

It would be easy to do a static pass through the packages and
determine how many and which packages DO have duplicated file names.
=> Would you like me to do this analysis, or are you-all firmly set on
going your own way no matter what the input from the rest of us? <=  If
we found that fewer than 5% of packages have duplicate source file
names, would you be willing to agree to remove build-paths in gcc by
default?
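
A minimal sketch of what such a static pass could look like for one
unpacked source tree. The suffix list and the function name
duplicate_basenames are illustrative assumptions, not part of any
existing tool:

```python
import os
from collections import defaultdict

# Assumption: which suffixes count as "source files" for this check.
SOURCE_SUFFIXES = (".c", ".cc", ".cpp", ".cxx", ".h", ".hpp")


def duplicate_basenames(root):
    """Return {basename: [relative paths]} for every source basename
    that appears in more than one directory under root."""
    seen = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(SOURCE_SUFFIXES):
                full = os.path.join(dirpath, name)
                seen[name].append(os.path.relpath(full, root))
    # Keep only the basenames that occur more than once.
    return {name: sorted(paths) for name, paths in seen.items() if len(paths) > 1}
```

A package whose result is an empty dict has no duplicated source file
names, so a debugger could identify its files by basename alone.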

> I also don't fancy trying to convince all build tools in existence
> to adopt the information-stripping approach.

First, isn't the SOURCE_PREFIX_MAP an information-stripping approach
itself?  It strips out PART of the path in the object file.  If you
can't convince tool maintainers to strip information, you'll have a
hard time getting uptake on SOURCE_PREFIX_MAP.  Whereas if they are
amenable to stripping SOME information, I am merely suggesting a
simpler and more straightforward change: stripping out the ENTIRE path
(by default), and using a standard command line option when that path
is actually needed.

Second, aren't the vast bulk of packages built by the GNU compiler
tools?  A small patch to gcc that would remove the build-path by
default would reduce the "build-path" issue to a much smaller, more
tractable number of packages.

I was initially thinking that the compiler command-line option would
have no argument, and if present would lead to the current behavior
(object files containing the build path from the root).  This would
avoid any issue about command line options being stored in object
files.  But upon reflection, suppose it took an argument that was a
relative path to the top of the source tree?  E.g.

  gcc --record-build-path-from=../../

This option would be set to the same value in every build on every
host (so would have no dependency on where the builds happen) and
would tell the compiler to insert into the object file only the build
path between ../../ and the current file.

It could also be passed as a number of directories, e.g.:

  gcc --record-path-components=2

but makefiles and build tools are much better equipped to deal with
filepaths than with counts of path components.  And when building in
subdirs, etc, it would require a clear spec about what the number
means.  If in directory /tmp/mypackage/blah/grumble you ran:

  gcc -o buildsubdir/foo.o --record-build-path-from=../../ -c ../lib/foo.c

then you know the recorded path will be: "blah/lib/foo.c".  If you
ran:

  gcc -o buildsubdir/foo.o --record-path-components=2      -c ../lib/foo.c

then would ".." or "buildsubdir" count as a path component?  It's
unclear.
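
The path arithmetic for the --record-build-path-from example can be
checked with a short sketch. The option is only the proposal made in
this message, not an existing gcc flag, and recorded_path is a
hypothetical helper that models the intended behavior:

```python
import posixpath


def recorded_path(cwd, record_from, source):
    """Model of the proposed option: return the source file's path
    relative to the directory that record_from points at."""
    # Resolve both arguments against the directory the compiler runs in.
    top = posixpath.normpath(posixpath.join(cwd, record_from))
    src = posixpath.normpath(posixpath.join(cwd, source))
    return posixpath.relpath(src, top)


# Same package unpacked in two different places (second path is made up).
a = recorded_path("/tmp/mypackage/blah/grumble", "../../", "../lib/foo.c")
b = recorded_path("/home/user/build/mypackage/blah/grumble", "../../", "../lib/foo.c")
```

Both calls give "blah/lib/foo.c", since nothing above the top of the
source tree enters the result.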

This approach would work less well with an environment variable, since
it would be silently passed into subdir builds without modification,
probably silently producing incorrect results that would only be
noticed when some user tried debugging a released binary.  A
command line option that includes a relative path OBVIOUSLY needs to
be adjusted when descending into a subdir, and also appears in the
build log output where the programmer can notice it's wrong.
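
A small sketch of that failure mode, again treating the option as
hypothetical. The subdirectory name "sub" and the helper record_top
are assumptions made for illustration:

```python
import posixpath


def record_top(cwd, record_from):
    """Directory that a relative record-from value resolves to
    when the compiler runs in cwd."""
    return posixpath.normpath(posixpath.join(cwd, record_from))


parent = "/tmp/mypackage/blah/grumble"
child = posixpath.join(parent, "sub")  # hypothetical subdir build

inherited = "../../"    # value passed down unchanged, as an env var would be
adjusted = "../../../"  # value a makefile would pass after descending

top_in_parent = record_top(parent, inherited)
top_in_child_inherited = record_top(child, inherited)
top_in_child_adjusted = record_top(child, adjusted)
```

With the inherited value, the subdir build treats /tmp/mypackage/blah
as the top of the tree instead of /tmp/mypackage, so every path it
records is missing the leading "blah/" component.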

> Once this
> SOURCE_PREFIX_MAP thing is done, it's done and everyone is happy.

Well, that's true for any solution.  Once it's solved, it's solved.
But that raises the question of HOW to solve it.

There doesn't seem to be a consensus on how to get it done.  Some
people want one path, some want many paths, some want a version number
so that an even more complex proposal can be implemented later, some
want colons, some want equals, everybody has a favorite character that
they think won't occur in a pathname except that they can all occur in
pathnames, some want quoting, some think it's way too complicated
already.  It's a mess, and it's a mess that can be resolved by the use
of Occam's Razor.  Do the simple thing first, resolve 90+% of the
problem, then try to handle the remaining cases by more complex and
more ad-hoc mechanisms -- like the stuff you're trying to get
consensus on, or perhaps by post-compile tools that strip the
offending info out of the object files directly, or during linking.

> It was a similar situation with SOURCE_DATE_EPOCH, the "hardcode"
> "keep-it-simple" people didn't see the point but now it's all fine
> and people with opinions across the whole range of the spectrum can
> mostly agree to it.

The "hardcode" people didn't need to even look at SOURCE_DATE_EPOCH,
since their tools already didn't encode a timestamp in the object
files.

At Cygnus in the 1990s we made the GNU tools fully reproducible, even
for cross-compilation.  This required dealing with many byte-order
issues and floating-point representations and such -- a set of issues
that you don't have.  We built the byte-order-independent BFD tools,
and the new GNU linker, pretty much from scratch, for just that purpose.

If over the years since then, people who weren't testing for
reproducibility have put in changes that mess up reproducibility,
usually the simplest fix is to revert those changes, rather than to
try to patch them up by adding ad-hoc environmental dependencies.

(If you do start cross-compiling reproducible packages, you will
probably find some dependencies on the host system, which don't solely
relate to pathnames.  I hope you will try to convince the tool makers
to actually fix those dependencies on byte orders or whatever, by
actually fixing the bugs, instead of trying to patch them up with even
more environment variables.  Not every problem is a nail, once you
have invented a hammer.)

	John

