[rb-general] BUILD_PATH_PREFIX_MAP format spec, draft #1

Sat Jan 21 14:50:00 CET 2017

Name change explanation
=======================

This was brought up at RWS2 as well, then recently:

<infinity0> this is another slight difference with source-date-epoch as well,
 we expect everyone's values for this to be different for source_prefix_map but
 the same for source_date_epoch
<infinity0> i sort of wonder if we should rename this build_prefix_map
<infinity0> because it is actually a property of a single build, not of the whole source
<h01ger> just from the last 2 lines i think this renaming would make sense

And then I chose BUILD_PATH_PREFIX_MAP because it seems some people are using
"buildPrefixMap" in source code to mean a "build a new trie" function.

Proposal draft
==============

TL;DR: similar to url-encoding except we only hexencode {'%', '=', ':'}.

# Internally, prefix-maps must be represented as an ordered list of 2-tuples
# (pairs) or equivalent, and *not* an unordered dictionary.

def _enquote(part):
    return part.replace("%", "%2a").replace("=", "%3d").replace(":", "%3a")

def encode(prefix_map):
    return ":".join("=".join(map(_enquote, pair)) for pair in prefix_map)

def _dequote(part):
    subs = part.split("%")
    # Will raise if there are <2 chars after % or if these aren't valid hex
    return subs[0] + "".join(chr(int(sub[0:2], 16)) + sub[2:] for sub in subs[1:])
    # In a lower-level language one would manually iterate through the string,
    # see bottom of this email for an example.

def decode(prefix_str):
    tuples = (part.split("=") for part in prefix_str.split(":"))
    # Will raise if any tuple can't be destructured into a pair
    return [(_dequote(src), _dequote(dst)) for src, dst in tuples]

def map_prefix(path, prefix_map):
    for src, dst in reversed(prefix_map):
        if path.startswith(src): ## optionally could be restricted to match full path components instead
            return dst + path[len(src):]
    return path

(Yes, I will translate this into natural language in the final spec.)

Rationale
=========

The operations are simple, involving only string split, join and replace, which
is available in all languages. Furthermore, the separator characters do not
appear in the enquoted forms of the paths, which makes the split operation much
simpler. (Backslash-escaping code is pretty annoying to write in higher-level
languages because one does not normally iterate over characters of a string;
even if "it's possible" it disrupts the style of the rest of the code.)

Implementation notes
====================

This encoding is an encoding between T-sequences and T-sequences, where T is
the type of both environment variable values and filesystem paths. These types
are dependent on the platform; we do not support platforms where the two types
are different and from here on we'll talk about a single type T (per platform).

On POSIX systems T is an octet or byte, and on Windows systems T is a UTF-16
16-bit wide word. Due to UTF-16's use of surrogate pairs to encode characters
not in the BMP, the "naive" way of implementing our algorithm (by looking at
each 16-bit word as a distinct unit) will fortunately work correctly.

Therefore, it should in theory be possible to write code that employs the same
algorithm on a sequence of any type T that works cross-platform, as long as the
character codes are compatible with ASCII as both examples above are. However,
different languages have different ways of presenting filepaths and environment
variable values, so one should be careful and actually test this.

For example, on Python 3.3+ on all platforms, paths and environment variables
are presented as (unencoded) Unicode strings. On POSIX where the underlying OS
values are bytes, values which cannot be UTF-8 decoded to valid Unicode are
instead decoded (by default) into a lone "low surrogate" character (Python
calls this the "surrogateescope" encoding) which is not present in "normal"
Unicode. The resulting string, when UTF-8 encoded back into bytes, preserves
the original byte value - which is invalid UTF-8 but that doesn't matter to a
POSIX OS. Therefore, it is still correct to use our "naive" algorithm on Python
unicode strings (i.e. T = unencoded Unicode character) even when the OS type is
bytes, and the benefit is that the same code will also work on Windows.

This type of "accidentally correct" situation may not be true for all languages
however, so you should understand these issues carefully and check it.

TODO: Yes I will provide some test vectors. For example, 'ñ LATIN SMALL LETTER
N WITH TILDE' is 0xF1 in Latin-1 but 0xF1 is not valid UTF-8 (ñ is 0xC3 0xB1 in
UTF-8); you can create a file called this and use it in your own tests.

$ touch $'\xf1'
$ LOL=$'\xf1=a' python3 -i defs.py
>>> import os
>>> os.listdir(), os.listdir(b'.')
(['defs.py', '\udcf1'], [b'defs.py', b'\xf1'])
>>> os.path.exists('\udcf1') # not the actual filename, don't transmit this outside of Python
True
>>> os.getenv("LOL"), os.getenvb(b"LOL")
('\udcf1=a', b'\xf1=a')
>>> pm = decode(os.getenv("LOL"))
>>> pm
[('\udcf1', 'a')]
>>> map_prefix(os.listdir()[1], pm)
'a'

Transmitting these values
=========================

This encoding explicitly does *not* hide non-printable characters or other
binary data. If they appear in the input, they will appear in the output.

Therefore, if you expect that your paths may contain such characters, you
SHOULD *additionally* postprocess the value of BUILD_PATH_PREFIX_MAP (and any
other envvars) using some other generic encoding scheme such as base64 that is
designed to map arbitrary binary data to text, before transmitting it.
Recipients must then reverse this process to restore the original value, before
they apply the decode() process described above.

You SHOULD NOT hexencode any additional characters in the _enquote() step, or
anything equivalent to this. Although decoders can process this correctly, this
is only meant to simplify implementations of the decode algorithm and not as a
general data-encoding mechanism. In particular, this only works if T is "bytes"
on the transmitter side. For other T types, you would need some extra encoding
step similar to the previous paragraph *anyways* and augmenting _enquote would
split your logic across two places - not clean.

On the other hand, if you expect that your paths do *not* contain such
characters, e.g. if they only contain printable ASCII characters, then you
could transmit the value of BUILD_PATH_PREFIX_MAP as-is.

Rejected options
================

- Simple-split using semi-common characters like ':' which loses the ability to
  map paths containing those characters.

- Simple-split using never-in-path characters like '\t' or '0x1E RECORD
  SEPARATOR' because they make the output unconditionally non-printable.

- Any variant of backslash-escape, because it is annoying to implement in
  higher-level languages. Backslash-escape is an encoding that is optimised for
  being typed manually by humans, but I don't expect that will be a major
  use-case for this encoding.

- Any variant of url-encoding, because the original perceived gain (being able
  to use existing decoders) did not work out in the end:

  - Encoders are supposed to encode anything (roughly) non-alphanumeric, but
    for simplicity I thought we could only encode '=&%'. This turned out to be
    false, we must at a minimum also encode:

    - '+' to '%2b' to make sure it is decoded correctly, because a raw '+' is
      decoded as ' ' in all url-encoding variants

    - ';' to '%3b' to make sure it is decoded correctly, because most decoders
      treat ';' in the same way as '&', i.e. separates pairs

  - Decoders in most programming languages decode to a { key -> value list }
    rather than a list of pairs as we require, and there is no simple way to
    massage this into a list of pairs that still preserves ordering.

    Suitable:
    - Python: urllib.parse.parse_qsl
    - Perl: URL::Encode::url_params_flat
    - Rust: url::Url::query_pairs

    Not suitable:
    - PHP: parse_str
    - Go: net/url::ParseQuery
    - Ruby: CGI::parse

    So we have to manually split it anways, in which case the current proposal
    is simpler (involves less characters) and less likely to be misinterpreted.

- Using another character instead of '%' to further distinguish it from
  url-encoding. But I eventually decided that being "similar" is actually good
  - someone with no idea of this spec could script up a decoder if they needed
  to recover data or had some other single-use need, and the ':' separators
  should already be a hint it's not *exactly* a URI query-string. Of course any
  tool that is used repeatedly, should follow the final spec.

  - '*' was my next-favoured candidate because it looks like 'x', but gave it
    up because it could be mistaken for a shell character and also it's used
    nowhere else for this purpose.

  - '#' was another candidate, as in hex color codes, but is also used in
    '&#nn;' decimal SGML character entities.

C version of _dequote
=====================

/* optimised for (hopefully) clarity rather than efficiency */

int
_dequote (char *src)
{
  char *dest = src;
  char x[] = {0, 0, 0};
  char c;
  while (*src)
    {
      switch (*src)
        {
        case ':':
        case '=':
          return 0; // invalid, should have been escaped
        case '%':
          if (!(x[0] = *++src) || !(x[1] = *++src))
            return 0; // invalid, past end of string
          sscanf(x, "%2hhx", &c); // could be more efficient but my C-foo is low
          if (errno != 0)
            return 0; // invalid, not valid hex
          *dest = c;
          break;
        default:
          *dest = *src;
        }
      ++dest, ++src;
    }
  *dest = '\0';
  return 1;
}

-- 
GPG: ed25519/56034877E1F87C35
GPG: rsa4096/1318EFAC5FBBDBCE
https://github.com/infinity0/pubkeys.git