[rb-general] Source code timestamps

Eric Myhre hash at exultant.us
Tue Dec 6 00:11:11 CET 2016


Is it possible there are two concepts here?  What if we started factoring 
things apart and naming our goals (plural!) distinctly? ISTM that "bit 
for bit reproducible" and "easily reproducible on a wide range of 
environments" are both useful, but very different.


> A build process is basically a pure function f from Set of Things
> to Bytes.
> Reproducible builds are the art of minimizing the amount of factors
> that influence the output of build functions.

To me, "reproducible builds" is the art of getting that pure function in 
the first place (and running it repeatedly to verify that it is, in fact, 
pure).  Halting the definition there makes it simple, actionable, and 
leaves little room for errors in interpretation.

Using the formal concept of a pure function to describe builds resonates 
strongly with me, and I like that description: Builds *should be* pure 
functions.

On the other hand, I'm less convinced we can take that pure function as 
a given.  Many builds are not pure functions, even with total input 
capture of the complete environment.  We have unstable sorts, randomized 
map iteration order, etc. in many compilers.  With the tools we currently 
have at hand, it's very important to talk about this pure function as a 
goal, not a given.  And thus, we need a name for this goal!
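
To make "run it repeatedly and check" concrete, here is a minimal 
sketch in Python.  Everything in it is hypothetical: `stable_build` and 
`leaky_build` are toy stand-ins for a real build, and the call counter 
stands in for whatever hidden state (a clock, a random seed, hash 
ordering) a real compiler might consult.

```python
import hashlib
import itertools


def digest(data: bytes) -> str:
    """Hex SHA-256 of the build output."""
    return hashlib.sha256(data).hexdigest()


def is_reproducible(build, inputs, runs=3):
    """Run build(inputs) several times; True if every run gave identical bytes."""
    digests = {digest(build(inputs)) for _ in range(runs)}
    return len(digests) == 1


def stable_build(inputs):
    # Sorting fixes the order, so the output depends only on the inputs.
    return "\n".join(sorted(inputs)).encode()


_calls = itertools.count()  # hidden state that is not part of the declared inputs


def leaky_build(inputs):
    # The hidden counter leaks into the output, so no two runs agree.
    return stable_build(inputs) + str(next(_calls)).encode()
```

Note the asymmetry: a mismatch proves the build is not pure, but any 
number of matching runs only fails to disprove it.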

Reducing the number of factors that cause the function to generate a 
wider range of results is also good, but perhaps we could call it 
"robust builds" or "portable reproducibility", or some other such name.

Chasing this robustness, this reduction of variables, is productive.  
But it's not a *requirement* for bit-for-bit reproducibility.  It's just 
making the reproduction process easier to do with a wider range of 
resources.


> This can be achieved by changing the function to not use inputs. That
> is a lot of work, since there are a lot of functions (packages).
> A more efficient way is to reduce the amount of items in the Set of
> Things or to reduce the amount of information these items carry.
> It’s a question of interface design, really. Should the interface
> of builds contain the time stamps of files?

My 2c: yes.  Flatten it all you want (and perhaps our tools should do 
that by default!).  But write it down.
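
A rough sketch of what "flatten it, but write it down" could look like, 
again in Python.  The manifest layout, the `describe_inputs` name, and 
the `flatten_mtime_to` parameter are all made up for illustration; the 
only point is that the timestamp stays a named field of the build's 
interface even when its value is forced to a constant.

```python
import hashlib
import json


def describe_inputs(files, flatten_mtime_to=None):
    """Return a canonical JSON manifest of the build inputs.

    files maps path -> (content bytes, mtime as integer seconds).
    If flatten_mtime_to is given, every mtime is recorded as that value.
    """
    entries = []
    for path in sorted(files):  # fixed order, so the manifest itself is stable
        content, mtime = files[path]
        entries.append({
            "path": path,
            "sha256": hashlib.sha256(content).hexdigest(),
            # The field is always present: flattened, but still written down.
            "mtime": mtime if flatten_mtime_to is None else flatten_mtime_to,
        })
    return json.dumps({"inputs": entries}, sort_keys=True, indent=2)
```

With `flatten_mtime_to=0`, two checkouts with identical contents but 
different timestamps produce the same manifest; without it they differ, 
and the manifest says exactly where.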

This reasoning is why I would like to propose the separation of concerns 
between "reproducible" and "robust": the pursuit of robustness is a 
siren song that tempts us into premature information hiding.

Failing to correctly whitelist all inputs, or discarding some 
information because we hope it shouldn't matter, may lead us to a world 
of "reproducible on my machine" problems.  It's terribly hard to debug 
why something isn't reproducible if we've insisted half the variables 
are unimportant, only to discover we've made a mistake, and simply not 
tested it because we no longer have the descriptive framework to do so!
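
As an illustration of why the descriptive framework matters for 
debugging: if each build records its inputs as a flat mapping, finding 
the variable that differs between two machines becomes a simple 
comparison.  The key names and the `diff_inputs` helper below are 
hypothetical, not taken from any existing tool.

```python
ABSENT = "<absent>"  # marker for an input that one side never recorded


def diff_inputs(recorded_a, recorded_b):
    """Return {key: (value_a, value_b)} for every input that differs."""
    differences = {}
    for key in sorted(set(recorded_a) | set(recorded_b)):
        a = recorded_a.get(key, ABSENT)
        b = recorded_b.get(key, ABSENT)
        if a != b:
            differences[key] = (a, b)
    return differences
```

An input that was discarded rather than recorded can never show up in 
such a comparison, which is the "reproducible on my machine" trap.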


