Sphinx: localisation changes / reproducibility

James Addison jay at jp-hosting.net
Wed Apr 26 19:05:26 UTC 2023

On Wed, 26 Apr 2023 at 18:48, Vagrant Cascadian
<vagrant at reproducible-builds.org> wrote:
> On 2023-04-26, James Addison wrote:
> > On Tue, 18 Apr 2023 at 18:51, Vagrant Cascadian
> > <vagrant at reproducible-builds.org> wrote:
> >> > James Addison <jay at jp-hosting.net> wrote:
> >> This is why in the reproducible builds documentation on timestamps,
> >> there is a paragraph "Timestamps are best avoided":
> >>
> >>   https://reproducible-builds.org/docs/timestamps/
> >>
> >> Or as I like to say "There are no timestamps quite like NO timestamps!"
> >
> > I see a parallel between the use of timestamps as a key for
> > data-lookup (as in Holger's developers-reference package), and the use
> > of locale as a similar data-lookup key (as in the case of localised
> > documentation builds).
> > I'm not sure what the equivalent approach is for localisation, though.
> > Command-line software, for example, requires at least one written
> > natural-language to be usable, and as a second use case, providing
> > natural-language documentation with software is highly recommended (is
> > it part of the software?  maybe not.  but a sufficiently-confusing
> > poorly-translated error message could be as serious as a code-related
> > bug, I think?).
> >
> > Linking back to my recent experience with Sphinx, and from the
> > perspective of allowing-users-to-verify-their-software, I'd tend to
> > think that an ideally-produced, reproducible, localised software would
> > include _all_ available translations in the build artifact.  Some of
> > that could be retrieved at runtime (gettext, for example), and some
> > could be static (file-backed HTML documentation, where runtime lookups
> > might not be so straightforward).
> I struggle to see the parallel. A timestamp is an arbitrary value based
> on when you built it, whereas the locale-rendered document should be
> reproducibly translated based on the translations you have available at
> the time you run whatever process generates the translated version of
> the document/binary, and regardless of the locale of the build
> environment.

Ok, I think I understand.  Please check my understanding, though: I
interpret your perspective as matching the ideal-world scenario that
John outlined, where the SOURCE_DATE_EPOCH value has no effect at all
on the output of the build.

Until then, I see both the build-time (SOURCE_DATE_EPOCH) and
build-locale as inputs that do affect the output of software build
systems, and believe that relevant guidance could help projects
migrate towards reproducibility.
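
As a rough illustration of treating both values as explicit inputs, here
is a minimal Python sketch.  The function names and the footer labels are
hypothetical and not taken from Sphinx or any other project; the only real
convention used is the SOURCE_DATE_EPOCH environment variable.

```python
import os
import time


def build_date(environ=None):
    """Return the date to embed in build output, as a numeric UTC string."""
    environ = os.environ if environ is None else environ
    # Prefer the externally supplied SOURCE_DATE_EPOCH; fall back to the
    # current time only when it is unset (which makes the build vary).
    epoch = int(environ.get("SOURCE_DATE_EPOCH", time.time()))
    # Numeric format in UTC, so neither the locale nor the timezone of the
    # build environment can change the result.
    return time.strftime("%Y-%m-%d", time.gmtime(epoch))


def render_footer(language, environ=None):
    """Render a (hypothetical) documentation footer for an explicit language."""
    labels = {"en": "Built on", "fr": "Construit le"}  # illustrative only
    # The language is an argument, not something read from the build locale.
    return f"{labels[language]} {build_date(environ)}"
```

With both inputs passed in explicitly, two builders with different system
locales and clocks should produce the same footer for the same arguments.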

> With runtime translation, you would be desiring translation from the
> source language to the operating locale of the environment you've called
> it in... but that should still be systematic, no?

Runtime translation should be systematic, yes.  So recommending that
projects use runtime translation (instead of compiling-in separate
source files for each language) is good advice.
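
A minimal sketch of runtime lookup, using Python's standard gettext module.
The domain name "myapp", the "locale" directory and the greet() helper are
assumptions made for illustration only.

```python
import gettext


def translator(language, localedir="locale", domain="myapp"):
    # gettext.translation() searches for
    # <localedir>/<language>/LC_MESSAGES/<domain>.mo; with fallback=True it
    # returns a NullTranslations object, which passes the source strings
    # through unchanged, when no catalog is found.
    return gettext.translation(
        domain, localedir=localedir, languages=[language], fallback=True
    )


def greet(language, localedir="locale"):
    # The language is chosen when the program runs, not when it is built,
    # so the same built artifact can serve every locale.
    _ = translator(language, localedir).gettext
    return _("Hello, world")
```

The point of the design is that the catalogs ship alongside a single build,
and only the lookup key (the language) differs between users.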

> While there almost certainly might be more than one legitimate
> translation for a given work, your process for rendering it should
> really only have one particular output given a particular input
> (e.g. the source language input and the descriptions of how to translate
> it to the desired language)... barring, of course, bugs in the system
> ... or am i missing something entirely?

No, I don't think you missed anything, and I think we have the same
understanding of the components.  We're likely arriving from different
perspectives on the problem space.

My question is approximately this: suppose some software is developed
in a natural language that I don't read or understand, and includes
statically-built documentation (HTML files, say).  If a distributed
build of it (an installer file downloaded from the web, for example)
is recommended to me because it supports a natural language that I
_do_ understand, could I determine that it is identical to the build
in the developers' own natural language?

(and I think that yes, it's possible: build the source to include the
content from all available languages, and distribute that single copy;
the translations may be better or worse in some areas, but we can all
agree that it is not only the same source, but the same build of that
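
One way to picture the "single copy with all languages" idea is the
following Python sketch.  The bundle format (sorted JSON) and the function
names are hypothetical; a real project would package its own catalog or
documentation formats.

```python
import hashlib
import json


def build_bundle(catalogs):
    """Serialise every available translation into one byte string."""
    # sort_keys=True fixes the order of languages and message ids, so the
    # output does not depend on the order in which catalogs were loaded.
    text = json.dumps(catalogs, sort_keys=True, ensure_ascii=False, indent=1)
    return text.encode("utf-8")


def bundle_digest(catalogs):
    """Return a SHA-256 digest that any user can compare against their copy."""
    return hashlib.sha256(build_bundle(catalogs)).hexdigest()
```

Readers of different languages could then compare one digest for the same
artifact, rather than each verifying a separate per-language build.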

> Unless, I guess, you're using some Machine Learning model to produce
> your translations?

... well, in honesty I think that Machine Learning could -- and in
many cases, perhaps should -- be encouraged towards
deterministic/repeatable behaviour.  But that's probably a
conversation for another thread.

More information about the rb-general mailing list