[rb-general] Source code timestamps
Eric Myhre
hash at exultant.us
Tue Dec 6 00:11:11 CET 2016
Is it possible there are two concepts here? What if we started factoring
things apart and naming our goals (plural!) distinctly? ISTM that "bit
for bit reproducible" and "easily reproducible on a wide range of
environments" are both useful, but very different.
> A build process is basically a pure function f from Set of Things
> to Bytes.
> Reproducible builds are the art of minimizing the amount of factors
> that influence the output of build functions.
To me, "reproducible builds" is the art of getting that pure function in
the first place (and running it repeatedly to verify that it is, in fact,
pure). Halting the definition there makes it simple, actionable, and
leaves little room for errors in interpretation.
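
As a rough illustration of "running it repeatedly to verify", here is a minimal Python sketch. The names pure_build, leaky_build, and looks_pure are hypothetical, and the counter merely stands in for hidden state such as a clock. Repeated runs can only give evidence of purity, not proof of it.

```python
import hashlib
import itertools
from typing import Callable, Mapping

Inputs = Mapping[str, bytes]


def pure_build(inputs: Inputs) -> bytes:
    """Toy build: the output depends only on the declared inputs."""
    return b"".join(
        name.encode() + b"\0" + inputs[name] + b"\n" for name in sorted(inputs)
    )


# Hypothetical hidden state, standing in for a clock, a PID, and so on.
_hidden_state = itertools.count()


def leaky_build(inputs: Inputs) -> bytes:
    """Toy build that also embeds something not listed in its inputs."""
    stamp = str(next(_hidden_state)).encode()
    return pure_build(inputs) + b"built-at:" + stamp + b"\n"


def looks_pure(build: Callable[[Inputs], bytes], inputs: Inputs, runs: int = 5) -> bool:
    """Run the build several times and check that every output digest matches."""
    digests = {hashlib.sha256(build(inputs)).hexdigest() for _ in range(runs)}
    return len(digests) == 1


sources = {"main.c": b"int main(void) { return 0; }", "util.c": b""}
print(looks_pure(pure_build, sources))   # True
print(looks_pure(leaky_build, sources))  # False
```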
Using the formal concept of a pure function to describe builds resonates
strongly with me, and I like that description: Builds *should be* pure
functions.
On the other hand, I'm less convinced we can take that pure function as
a given. Many builds are not pure functions, even with total input
capture of the complete environment. We have unstable sorts, random map
iterations, etc. in many compilers. With the tools we currently have at
hand, it's very important to talk about this pure function as a goal,
not a given. And thus, we need a name for this goal!
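
To make the iteration-order problem concrete, here is a small Python sketch. It is a toy model, not any real compiler or linker: the function names and object files are made up, and the unspecified iteration order is modeled explicitly as "any permutation of the inputs".

```python
import hashlib
from itertools import permutations


def link_in_given_order(objects):
    """Concatenate object blobs in whatever order the caller hands them over."""
    return b"".join(name.encode() + b"=" + blob + b";" for name, blob in objects)


def link_sorted(objects):
    """Same toy linker, but with the iteration order pinned by sorting on name."""
    return link_in_given_order(sorted(objects))


# Hypothetical object files; the contents are placeholders.
objects = [("main.o", b"\x01"), ("util.o", b"\x02"), ("io.o", b"\x03")]

unordered = {
    hashlib.sha256(link_in_given_order(p)).hexdigest() for p in permutations(objects)
}
ordered = {
    hashlib.sha256(link_sorted(p)).hexdigest() for p in permutations(objects)
}

# Six possible orders give six different outputs; sorting collapses them to one.
print(len(unordered), len(ordered))  # 6 1
```

The inputs are identical in every run here. Only the order in which they are visited changes, which is exactly the kind of variation that total input capture does not rule out.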
Reducing the number of factors that can change the function's output is
also good, but perhaps we could call it "robust builds" or "portable
reproducibility", or some other such name.
Chasing this robustness, this reduction of variables, is productive.
But it's not a *requirement* for bit-for-bit reproducibility. It's just
making the reproduction process easier to do with a wider range of
resources.
> This can be achieved by changing the function to not use inputs. That
> is a lot of work, since there are a lot of functions (packages).
> A more efficient way is to reduce the amount of items in the Set of
> Things or to reduce the amount of information these items carry.
> It’s a question of interface design, really. Should the interface
> of builds contain the time stamps of files?
My 2c: yes. Flatten it all you want (and perhaps our tools should do
that by default!). But write it down.
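
A sketch of what "flatten it, but write it down" might look like in Python. The function name flatten_and_record and the constant FLATTENED_SECONDS are assumptions made for illustration, not an existing tool; the flattened value is arbitrary.

```python
import os
import tempfile
from pathlib import Path

# Arbitrary fixed timestamp (seconds since the epoch) chosen for this sketch.
FLATTENED_SECONDS = 1_000_000_000


def flatten_and_record(root: Path, flattened_seconds: int = FLATTENED_SECONDS) -> dict:
    """Record each file's original mtime, then overwrite it with a fixed value.

    Returns a mapping of relative path -> original mtime in nanoseconds, so the
    discarded information is still written down somewhere.
    """
    recorded = {}
    flat_ns = flattened_seconds * 10**9
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        recorded[rel] = path.stat().st_mtime_ns
        os.utime(path, ns=(flat_ns, flat_ns))  # sets atime and mtime
    return recorded


# Usage on a throwaway directory with one file and a known original mtime.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    src = root / "main.c"
    src.write_text("int main(void) { return 0; }\n")
    original_ns = 1_480_000_000 * 10**9
    os.utime(src, ns=(original_ns, original_ns))

    manifest = flatten_and_record(root)
    flattened_ns = src.stat().st_mtime_ns
```

After this runs, manifest still holds the original timestamp while the file itself carries the flattened one, so the build sees less information but nothing has been silently thrown away.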
This reasoning is why I would like to propose the separation of concerns
between "reproducible" and "robust": the pursuit of robustness makes a
siren song that tempts us into doing premature information hiding.
Failing to correctly whitelist all inputs, or discarding some
information because we hope it shouldn't matter, may lead us to a world
of "reproducible on my machine" problems. It's terribly hard to debug
why something isn't reproducible if we've insisted half the variables
are unimportant, only to discover we've made a mistake, and simply not
tested it because we no longer have the descriptive framework to do so!
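
For example, if every build wrote down its inputs, debugging a mismatch could start with a plain diff of two such records. A minimal Python sketch follows; the record format, the key names, and all the values are invented for illustration.

```python
ABSENT = "<absent>"  # marker for an input that one side never recorded


def diff_inputs(mine: dict, theirs: dict) -> dict:
    """Return {key: (my_value, their_value)} for every recorded input that differs."""
    differences = {}
    for key in sorted(set(mine) | set(theirs)):
        a = mine.get(key, ABSENT)
        b = theirs.get(key, ABSENT)
        if a != b:
            differences[key] = (a, b)
    return differences


# Hypothetical records from two machines.
my_machine = {
    "compiler": "gcc 6.2.0",
    "locale": "C",
    "mtime:src/main.c": 1480000000,
}
their_machine = {
    "compiler": "gcc 6.2.0",
    "locale": "en_US.UTF-8",
    # The timestamp was discarded here because "it shouldn't matter".
}

for key, (a, b) in diff_inputs(my_machine, their_machine).items():
    print(f"{key}: {a!r} != {b!r}")
```

The second difference is the interesting one: a variable that was dropped on one side shows up only as "absent", and can no longer be ruled in or out as the cause.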