[rb-general] BUILD_PATH_PREFIX_MAP format spec, draft #1

Daniel Shahaf danielsh at apache.org
Sat Jan 21 20:42:15 CET 2017


Ximin Luo wrote on Sat, Jan 21, 2017 at 13:50:00 +0000:
> Name change explanation
> =======================
> 
> This was brought up at RWS2 as well, then recently:
> 
> <infinity0> this is another slight difference with source-date-epoch as well,
>  we expect everyone's values for this to be different for source_prefix_map but
>  the same for source_date_epoch
> <infinity0> i sort of wonder if we should rename this build_prefix_map
> <infinity0> because it is actually a property of a single build, not of the whole source
> <h01ger> just from the last 2 lines i think this renaming would make sense
> 
> And then I chose BUILD_PATH_PREFIX_MAP because it seems some people are using
> "buildPrefixMap" in source code to mean a "build a new trie" function.
> 
> Proposal draft
> ==============
> 
> TL;DR: similar to url-encoding except we only hexencode {'%', '=', ':'}.
> 

Looks good to me.
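
For readers following along, here is a minimal sketch of what the TL;DR
describes. It assumes T is bytes and that each escape is '%' followed by two
lowercase hex digits; the name enquote_sketch is made up for illustration and
is not taken from the draft.

```python
def enquote_sketch(part: bytes) -> bytes:
    """Percent-encode only '%', '=' and ':' in one path (hypothetical helper)."""
    # '%' is replaced first so that the escapes inserted below are not
    # themselves re-escaped.
    return (part.replace(b"%", b"%25")
                .replace(b"=", b"%3d")
                .replace(b":", b"%3a"))


print(enquote_sketch(b"/tmp/a=b:c%d"))  # b'/tmp/a%3db%3ac%25d'
```

Every other byte, printable or not, passes through unchanged.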

> Implementation notes
> ====================
> 
> This encoding is an encoding between T-sequences and T-sequences, where T is
> the type of both environment variable values and filesystem paths. These types
> are dependent on the platform; we do not support platforms where the two types
> are different and from here on we'll talk about a single type T (per platform).
> 
> Transmitting these values
> =========================
> 
> This encoding explicitly does *not* hide non-printable characters or other
> binary data. If they appear in the input, they will appear in the output.
> 
> Therefore, if you expect that your paths may contain such characters, you
> SHOULD *additionally* postprocess the value of BUILD_PATH_PREFIX_MAP (and any
> other envvars) using some other generic encoding scheme such as base64 that is
> designed to map arbitrary binary data to text, before transmitting it.
> Recipients must then reverse this process to restore the original value, before
> they apply the decode() process described above.

"Postprocess the value" sounds like you mean:

    BUILD_PATH_PREFIX_MAP=$(printf %s "$BUILD_PATH_PREFIX_MAP" | base64 -e)

That'd be guaranteed to cause insanity down the road, since the envvar's
value will be differently typed, not only by platform but also by
producer.  It would be better for data that is not encodable in the
envvar's value to be transmitted out-of-band and the envvar
reconstructed to a conforming value by the recipient.  (That might be
what you meant, but the phrasing was ambiguous.)

Alternatively, permitting arbitrary characters, not just %=:, to be encoded
would solve this problem without changes to the consumer/decoder side.

The invariant to keep here is "if the envvar is set, then its value
can be decoded in the standard manner".

> You SHOULD NOT hexencode any additional characters in the _enquote() step, or
> anything equivalent to this. Although decoders can process this correctly, this
> is only meant to simplify implementations of the decode algorithm and not as a
> general data-encoding mechanism. In particular, this only works if T is "bytes"
> on the transmitter side. For other T types, you would need some extra encoding
> step similar to the previous paragraph *anyways* and augmenting _enquote would
> split your logic across two places - not clean.

I'm just lost here.  It sounds like it'd be a lot easier to specify that
the platform must specify an encoding of T to bytes — Windows, for
example, could specify UTF-16BE (since big endian is the network byte
order) — and then the decoding process is two-stepped: first decode
%-encoding to get a sequence of bytes, then decode that sequence into
a sequence of T.  (The "outer" encoding/decoding would be a no-op when
T == bytes.)  This would allow any platform to consume envvar values
generated for use on any other platform, and would remove the need for
implementation-specific base64 hacks.
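
A rough sketch of that two-step decode, under the assumption that escapes are
always exactly two hex digits and that any byte may be escaped. The function
names and the platform_encoding parameter are invented for illustration.

```python
import re


def percent_decode(value: bytes) -> bytes:
    """Step 1: turn every %XX escape into the byte it names."""
    # Reject a '%' that is not followed by exactly two hex digits.
    if re.search(rb"%(?![0-9A-Fa-f]{2})", value):
        raise ValueError("malformed %-escape")
    return re.sub(rb"%([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]), value)


def decode_path(value: bytes, platform_encoding=None):
    """Step 2: decode the bytes into T; a no-op when T is bytes."""
    raw = percent_decode(value)
    return raw if platform_encoding is None else raw.decode(platform_encoding)


# T == bytes: the outer step does nothing.
print(decode_path(b"/tmp/a%3db"))
# T == UTF-16 text: NUL bytes and ':' travel as escapes.
print(decode_path(b"%00C%00%3a%00\\%00x", "utf-16-be"))
```

The second call shows why the encoder would have to be allowed to escape more
than %=: in this scheme: the UTF-16BE form of a path contains NUL bytes, which
cannot appear literally in an environment variable.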

If we do this, the spec will need to spell out whether leading zeroes
may be omitted.  (I.e., whether %0 and %00 are both valid.)
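
To illustrate why the %0 versus %00 question matters: if one-digit escapes
were accepted, a NUL written as %0 directly before a literal hex digit would
be indistinguishable from a two-digit escape. The decoder below is
hypothetical, and greedy matching is only one possible reading.

```python
import re


def decode_loose(value: bytes) -> bytes:
    """Hypothetical decoder that accepts %X as well as %XX (greedy)."""
    return re.sub(rb"%([0-9A-Fa-f]{1,2})",
                  lambda m: bytes([int(m.group(1), 16)]), value)


# A producer that means "NUL, then the letter a" and omits the leading zero
# emits exactly the same text as one that means "newline".
nul_then_a = b"%0" + b"a"
newline = b"%0a"
print(nul_then_a == newline)      # True
print(decode_loose(nul_then_a))   # b'\n'
```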

> On the other hand, if you expect that your paths do *not* contain such
> characters, e.g. if they only contain printable ASCII characters, then you
> could transmit the value of BUILD_PATH_PREFIX_MAP as-is.
> 
> Rejected options
> ================
> 
> - Any variant of backslash-escape, because it is annoying to implement in
>   higher-level languages. Backslash-escape is an encoding that is optimised for
>   being typed manually by humans, but I don't expect that will be a major
>   use-case for this encoding.

Both of these are your subjective opinions, not objective properties of
backslash escaping.

Regarding interactive use, I'd say that backslash encoding is superior to
URL-encoding since it doesn't involve looking up byte values, but that
neither of them is the holy grail of UX.

Regarding ease of implementation...

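    import re

    def decode(s):
        "Decode a $PATH-with-backslash-escaping-encoded value into a list"
        # Each token is one character, optionally preceded by a backslash.
        # A bare ':' becomes the separator; otherwise keep the last character.
        # re.DOTALL keeps newlines in paths from being silently dropped.
        return \
            "".join(
                '\0' if x == ':' else x[-1]
                for x in re.compile(r'[\\]?.', re.DOTALL).findall(s)
            ).split('\0')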
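
For completeness, a sketch of the matching encoder, assuming the same
convention that decode() relies on: a backslash quotes the next character and
an unquoted ':' separates entries. The name encode is an illustrative choice.

```python
def encode(paths):
    "Encode a list into a $PATH-with-backslash-escaping value (sketch)"
    # Escape backslashes first, then colons, so that the backslashes added
    # for the colons are not doubled.
    return ":".join(
        p.replace("\\", "\\\\").replace(":", "\\:")
        for p in paths
    )


print(encode(["/a:b", "c\\d"]))   # /a\:b:c\\d
```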

> C version of _dequote
> =====================
> 
> /* optimised for (hopefully) clarity rather than efficiency */
> 
> int
> _dequote (char *src)
> {
>   char *dest = src;
>   char x[] = {0, 0, 0};
>   char c;
>   while (*src)
>     {
>       switch (*src)
>         {
>         case ':':
>         case '=':
>           return 0; // invalid, should have been escaped
>         case '%':
>           if (!(x[0] = *++src) || !(x[1] = *++src))
>             return 0; // invalid, past end of string
>           sscanf(x, "%2hhx", &c); // could be more efficient but my C-foo is low

Example:

https://github.com/apache/subversion/blob/e55689361cfc67f38442283128b4a09e5fd7b63b/subversion/libsvn_subr/checksum.c#L427-L436

>           if (errno != 0)
>             return 0; // invalid, not valid hex
>           *dest = c;
>           break;
>         default:
>           *dest = *src;
>         }
>       ++dest, ++src;
>     }
>   *dest = '\0';
>   return 1;
> }

