[rb-general] BUILD_PATH_PREFIX_MAP format spec, draft #1

Sat Jan 21 21:49:00 CET 2017

Daniel Shahaf:
> [..]
> 
> "Postprocess the value" sounds like you mean:
> 
>     BUILD_PATH_PREFIX_MAP=$(printf %s "$BUILD_PATH_PREFIX_MAP" | base64 -e)
> 
> [..]

Envvars never get transmitted directly between two systems like this. What I meant was that they should invent their own way of communicating arbitrary envvars, one that can handle arbitrary data inside the names or values.

> [..] It would be better for data that is not encodeable in the
> envvar's value to be transmitted out-of-band and the envvar
> reconstructed to a conforming value by the recipient.

I don't understand what you mean by "data that is not encodeable in the envvar's value", nor what you mean by "transmitted out-of-band". Environment variable values don't get "transmitted" anywhere; you have to expressly read the value and turn it into a string, in which case you are not transmitting an envvar but a string. You should then "postprocess" this string (to reuse my earlier terminology) if you expect it to contain characters that aren't suited to your transmission protocol or your recipient.
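
As a rough illustration of what "postprocess" and "reverse" could look like, here is a minimal Python sketch. The function names and the choice of base64 are assumptions made for the example only; any encoding that the transmission protocol can carry would do just as well.

```python
import base64


def postprocess(value: bytes) -> str:
    """Sender side: turn the raw string read from the envvar into
    text that a text-only channel can carry (base64 is an assumption)."""
    return base64.b64encode(value).decode("ascii")


def reverse(wire: str) -> bytes:
    """Recipient side: undo postprocess() and recover the original bytes."""
    return base64.b64decode(wire, validate=True)


# Round trip with a value containing bytes that many protocols dislike.
value = b"/tmp/b\xfcild\x00dir"
assert reverse(postprocess(value)) == value
```

The envvar value itself is untouched by this; only the copy that goes over the wire is encoded, and the recipient sets the envvar from the reversed result.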

>> You SHOULD NOT hexencode any additional characters in the _enquote() step, or
>> anything equivalent to this. Although decoders can process this correctly, this
>> is only meant to simplify implementations of the decode algorithm and not as a
>> general data-encoding mechanism. In particular, this only works if T is "bytes"
>> on the transmitter side. For other T types, you would need some extra encoding
>> step similar to the previous paragraph *anyways* and augmenting _enquote would
>> split your logic across two places - not clean.
> 
> I'm just lost here.  It sounds like it'd be a lot easier to specify that
> the platform must specify an encoding of T to bytes — Windows, for
> example, could specify UTF-16BE (since big endian is the network byte
> order) — and then the decoding process is two-stepped: first decode
> %-encoding to get a sequence of bytes, then decode that sequence into
> a sequence of T.  (The "outer" encoding/decoding would be a no-op when
> T == bytes.) [..]

This is basically what I said in the paragraph above ("postprocess" and "reverse") except that I'm leaving the exact method open for the future because it really is a separate concern and we don't need to finalise that at the moment.
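
Purely to show the shape of that two-step decoding, and not as a proposal for the exact method, here is a small Python sketch. The name decode_component and the platform_encoding parameter are hypothetical, and urllib.parse.unquote_to_bytes stands in for the %-decoding step.

```python
from typing import Optional, Union
from urllib.parse import unquote_to_bytes


def decode_component(encoded: str,
                     platform_encoding: Optional[str] = None) -> Union[bytes, str]:
    """Step 1: undo the %-encoding to get a sequence of bytes.
    Step 2: decode those bytes into the platform's T.
    With platform_encoding=None, T is bytes and step 2 is a no-op."""
    raw = unquote_to_bytes(encoded)
    if platform_encoding is None:
        return raw
    return raw.decode(platform_encoding)


# T == bytes: the outer step does nothing.
assert decode_component("a%3Ab") == b"a:b"
# A platform that chose UTF-16BE, as in the example quoted above.
assert decode_component("%00a%00b", "utf-16-be") == "ab"
```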

> 
>> On the other hand, if you expect that your paths do *not* contain such
>> characters, e.g. if they only contain printable ASCII characters, then you
>> could transmit the value of BUILD_PATH_PREFIX_MAP as-is.
>>
>> Rejected options
>> ================
>>
>> - Any variant of backslash-escape, because it is annoying to implement in
>>   higher-level languages. Backslash-escape is an encoding that is optimised for
>>   being typed manually by humans, but I don't expect that will be a major
>>   use-case for this encoding.
> 
> Both of these are your subjective opinions, not objective properties of
> backslash escaping.
> 
> Regarding interactive use, I'd say that backslash encoding is superior to
> URL-encoding since it doesn't involve looking up byte values, but that
> neither of them is the holy grail of UX.
> 
> Regarding ease of implementation...
> 
>     import re
> 
>     def decode(s):
>         "Decode a $PATH-with-backslash-escaping-encoded value into a list"
>         return \
>             "".join(
>                 '\0' if x == ':' else x[-1]
>                 for x in re.compile(r'[\\]?.').findall(s)
>             ).split('\0')
> 

Sure, but I didn't want to make this dependent on regex either, since every language implements regexes very slightly differently. That makes it more time-consuming to verify that an implementation follows the spec exactly, including handling all the error conditions, and that it behaves the same way as another implementation in a different language.

split(":"), without having to worry about a backslash before the colon, is an operation that exists in practically every language and is obviously implemented in exactly the same way everywhere. It probably takes a small fraction of the time to glance over and understand, compared with a regexp or a for loop.
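
To make the comparison concrete, here is a minimal Python sketch of the splitting step. It assumes a ":"-separated value whose elements are %-encoded so that no literal ":" is left inside an element. decode_list is a hypothetical name, and urllib.parse.unquote_to_bytes stands in for the %-decoding; it passes malformed escapes through unchanged, which an implementation following a spec would presumably have to reject instead.

```python
from typing import List
from urllib.parse import unquote_to_bytes


def decode_list(value: str) -> List[bytes]:
    """Split on ":" first, then %-decode each element.
    No escape character can precede the separator, so a plain
    split is enough to find the element boundaries."""
    return [unquote_to_bytes(part) for part in value.split(":")]


# An encoded ":" (%3A) stays inside its element; a literal ":" separates.
assert decode_list("/a%3Ab:/c") == [b"/a:b", b"/c"]
```

Note that an empty input yields a single empty element here, since "".split(":") returns [""]; whether that should count as an empty list is a decision for the spec, not for this sketch.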

All of these factors are objective differences between urlencode and backslash-escape. I shortened them all to "annoying" because I'd really like to close this topic ASAP and bury the bikeshed.

X

-- 
GPG: ed25519/56034877E1F87C35
GPG: rsa4096/1318EFAC5FBBDBCE
https://github.com/infinity0/pubkeys.git