Sphinx: copyright substitution and reproducibility

James Addison jay at jp-hosting.net
Tue Jul 9 12:14:07 UTC 2024


Hi folks,

This thread is an attempt to gather feedback for an issue I've reported in the
Sphinx[1] documentation generator; the requirements and implementation details
for that span both the Sphinx and Reproducible Builds projects.

The very-abbreviated problem statement is that when SOURCE_DATE_EPOCH is
configured, Sphinx projects that have configured their copyright notices using
dynamic elements can produce nonsensical output under some circumstances.

Although I think there are reasons to prefer and encourage static (constant)
declaration of copyright notices, many projects today do choose to use
dynamically-evaluated values.

I doubt there's a way to solve every edge case, but I think that it should be
possible to vastly reduce the number of clearly-incorrect copyright notice
outputs that could occur, while retaining both backwards compatibility and
build reproducibility.


Context
-------

Sphinx is written in Python, and documentation projects that use Sphinx provide
their site-specific configuration (theming, language, code formatting, ...) in
a file named 'conf.py' that is dynamically-evaluated as Python code each time
the documentation is rebuilt.

One of the frequently-used configuration settings is 'copyright'; Sphinx
output formats (HTML, LaTeX, ...) typically embed its value in the footer of
each document.

Many projects configure their copyright notice(s) as a static string. However,
it's also a fairly established practice for projects to lean on the dynamic
evaluation of the 'conf.py' file to insert the current year into the copyright
notice at build-time.
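
As an illustration of that practice, a 'conf.py' might contain something like
the following minimal sketch.  The project and author names are made up for
the example; only the 'copyright' setting itself is taken from the discussion
above.

```python
# conf.py (illustrative sketch; project and author names are hypothetical)
import datetime

project = "example-project"
author = "Example Author"

# Evaluated each time the documentation is built, so the end year follows
# the system clock of the machine performing the build.
copyright = f"2000-{datetime.date.today().year}, {author}"
```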

So far, that's all fine - except that timestamps are a well-known source of
build non-reproducibility.  Sphinx is a relatively popular documentation
generator used by software projects, and the ability to accurately rebuild not
only the software but also its documentation in a reproducible way is
important.

This copyright build reproducibility problem was identified as early as 2016,
and Sphinx was patched[2] to substitute the standardised SOURCE_DATE_EPOCH[3]
year into copyright notices that match certain patterns, achieving much greater
build reproducibility.


Perceived limitations
---------------------

The SOURCE_DATE_EPOCH-substitution patch does significantly improve the
reproducibility of copyright notices, and therefore of builds of Sphinx-based
documentation.

However, it has also caused some confusion[4][5], particularly for NixOS
packages, where Sphinx documentation is often built using a SOURCE_DATE_EPOCH
value of one (1), or, more recently[6], 315532800.
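
For reference, those two values correspond to the years 1970 and 1980, which
may well be earlier than a project's first release.  A quick way to check the
year implied by any SOURCE_DATE_EPOCH value (the helper name 'epoch_year' is
just for this example):

```python
from datetime import datetime, timezone


def epoch_year(source_date_epoch: int) -> int:
    """Return the UTC calendar year for a SOURCE_DATE_EPOCH value."""
    return datetime.fromtimestamp(source_date_epoch, tz=timezone.utc).year


for value in (1, 315532800):
    print(value, "->", epoch_year(value))
# 1 -> 1970
# 315532800 -> 1980
```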

There is also at least one scenario[7] in which a project has intentionally
adjusted the formatting of their copyright notice so that the substitution
logic does not apply to it.

The behaviour that raised my concern and that I've reported[8] relates to
multiline copyright notices.  I don't believe that any incorrect output
produced affects the actual effective rights of copyright holders -- but even
so, I think that it may be beneficial to reduce the possibility for confusion
by adding further safeguards (preconditions) to the substitution code.


Proposed improvements
---------------------

Because the 'conf.py' file is evaluated in a Turing-complete programming
language, it could produce nearly anything, and isn't guaranteed to evaluate at
all.

I won't claim that it'll be possible to catch every edge-case, but I think that
it's possible to offer some improvements that maintain the following properties
under the vast majority of circumstances:

  * No change to potentially-affected documentation source projects or their
    configuration files should be required.

  * The output of the copyright notice when SOURCE_DATE_EPOCH is configured
    should match the value that would have been emitted by a build of that same
    documentation source project at the corresponding point-in-time.

  * The output of copyright notices when SOURCE_DATE_EPOCH is configured should
    be deterministic.


And the adjustments that I'd suggest, with pseudocode examples, to reduce
nonsensical output while achieving the above properties are:

  * When possible, detect statically-declared/constant copyright notice
    configuration and do not perform substitution on it.

       - conf: '2000-{system.date.year}, author'
      => search-and-replace: '{system.date.year}' by '{SOURCE_DATE_EPOCH}.year'

       - conf: '2000-2020, author'
      => search-and-replace: <none>

  * When detecting year values to replace, only substitute years in the
    evaluated copyright notice that match the current system-clock year.  The
    rationale for this is that dynamic copyright configurations almost always
    insert the current year -- so by including that in the find-and-replace
    pattern matching, we should be substituting only the dynamically-evaluated
    part of the string.

       - conf: '2000-{system.date.year} author, 2020-2021 contrib'
      => search-and-replace: '{system.date.year}' by '{SOURCE_DATE_EPOCH}.year'

  * Only replace the current year when the year derived from
    SOURCE_DATE_EPOCH is strictly earlier; or to rephrase that:
    cap/upper-bound the replacement year to the current system clock year.
    This prevents output that would otherwise appear to present a publication
    date, derived from SOURCE_DATE_EPOCH, more recent than the build-time.

       -  env: SOURCE_DATE_EPOCH = <two-years-in-the-future>
       - conf: '2000-{system.date.year}'
      => search-and-replace: <none>

These are up for debate and discussion, and I've opened a pair of pull
requests in Sphinx that work towards these.
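
To make the second and third suggestions concrete, here is a rough sketch of
how they might combine.  This is not the code in Sphinx or in the pull
requests; 'substitute_year' and its parameters are hypothetical names chosen
for illustration, and it operates on the already-evaluated notice string.

```python
import re


def substitute_year(notice: str, build_year: int, epoch_year: int) -> str:
    """Replace build_year with epoch_year in an evaluated copyright notice.

    Hypothetical helper for illustration; not Sphinx's implementation.
    """
    # Safeguard: never substitute a year that is not strictly earlier than
    # the build-time (system clock) year.
    if epoch_year >= build_year:
        return notice
    # Safeguard: only touch standalone numbers equal to the build-time year,
    # leaving every other year in the notice untouched.
    pattern = re.compile(rf"(?<!\d){build_year}(?!\d)")
    return pattern.sub(str(epoch_year), notice)


# Build performed in 2024 with SOURCE_DATE_EPOCH falling in 2022:
print(substitute_year("2000-2024 author, 2020-2021 contrib", 2024, 2022))
# 2000-2022 author, 2020-2021 contrib

# SOURCE_DATE_EPOCH two years in the future: no substitution.
print(substitute_year("2000-2024", 2024, 2026))
# 2000-2024
```

One limitation of working on the evaluated string: a statically-declared
notice that happens to contain the build-time year can't be told apart from a
dynamically-evaluated one, so the first suggestion would need a different
mechanism.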

Thanks for reading and for any feedback and suggestions.  I'd particularly
welcome any additional example cases, possible limitations and edge cases.

Regards,
James


[1] - https://www.sphinx-doc.org

[2] - https://github.com/sphinx-doc/sphinx/pull/2503

[3] - https://reproducible-builds.org/docs/source-date-epoch/

[4] - https://github.com/sphinx-doc/sphinx/issues/3451
[5] - https://github.com/coq/coq/issues/7378

[6] - https://github.com/NixOS/nixpkgs/pull/89794

[7] - https://github.com/matplotlib/matplotlib/issues/28418#issuecomment-2181728365

[8] - https://github.com/sphinx-doc/sphinx/issues/12451

