[rb-general] SOURCE_PREFIX_MAP format specification proposals

Thu Jan 12 06:36:02 CET 2017

My opinion:

a) Avoid the characters 0x00 through 0x1F inclusive (note: this includes tab)

b) Avoid the character 0x7F (DEL)

This is to make things as portable as possible, even through
screenshots, printouts, and contexts that don't permit control
characters (like email subject lines).

Some concrete suggestions:

1a) "key1:value1:key2:value2"

    def unparse(mapping):
        return ":".join(key + ":" + value
                        for key, value in mapping.items())

Easy to parse, using split().

Doesn't allow colons in directory names.

On Windows this could use ";" instead of ":", as $PATH does.
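
A minimal sketch of the matching parser, assuming no key or value
contains the separator.  The name parse_1a and its sep parameter are
made up for illustration; pass sep=";" for the Windows variant.

```python
def parse_1a(text, sep=":"):
    """Parse "key1:value1:key2:value2" into a dict."""
    if not text:
        return {}
    fields = text.split(sep)
    if len(fields) % 2 != 0:
        raise ValueError("expected an even number of fields")
    # Even-indexed fields are keys, odd-indexed fields are values.
    return dict(zip(fields[0::2], fields[1::2]))
```

A stray colon inside a directory name either yields an odd field count
(an error) or silently shifts keys and values, which is the limitation
noted above.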

1b) "key1:value1::key2:value2::key3:value3"

Basically the same as (1a).
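
One possible reading of (1b), sketched under the assumption that keys
and values are non-empty and contain no colons: pairs are separated by
"::" and each key is separated from its value by ":".  The names
unparse_1b and parse_1b are made up for illustration.

```python
def unparse_1b(mapping):
    return "::".join(key + ":" + value for key, value in mapping.items())

def parse_1b(text):
    if not text:
        return {}
    result = {}
    for pair in text.split("::"):
        key, sep, value = pair.partition(":")
        # Reject pairs with a missing part or a stray colon.
        if not sep or not key or not value or ":" in value:
            raise ValueError("malformed pair: %r" % pair)
        result[key] = value
    return result
```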

2) Like (1a) but escape any backslash and colon in the keys and values
with a backslash; that is:

    def unparse(mapping):
        escape = lambda string: "".join(('\\'+char if char in '\\:' else char)
                                        for char in string)
        # Join the pairs with ':' too, as in (1a).
        ret = ":".join(escape(k) + ':' + escape(v)
                       for k,v in mapping.items())
        return ret

To parse: split on unescaped colons, then remove every escaping backslash:

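    #include <assert.h>
    #include <stdlib.h>
    #include <string.h>

    /* Forward declaration. */
    char *remove_backslashes(char *s);

    /* Parse the envvar into some kind of array of interleaved keys and values.
     * array_t and its append() are pseudocode for a growable array of strings. */
    array_t parse_the_env_var() {
        array_t ret;

        /* Init.  The string returned by getenv() must not be modified,
         * so work on a private copy (strdup() is POSIX). */
        const char *const env = getenv("…");
        if (!env || !*env) return ret;  /* nothing to parse: empty array */
        char *const s = strdup(env);
        if (!s) abort();
        char *const END = &s[strlen(s)];
        assert(END > s);
        assert(*END == '\0');

        /* Convert syntactical colons to NULs. */
        for (char *p = s; p < END; ) {
            switch (*p) {
                case '\\': p += 2; continue;
                case ':': *p = '\0'; p++; continue;
                default: p++; continue;
            }
        }

        /* Undo the backslash-escaping of each segment in-place. */
        char *p = s;
        do {
            /* Find the next segment first: unescaping shortens this one. */
            char *const next = p + strlen(p) + 1;
            ret.append(remove_backslashes(p));
            /* On the last iteration, 'next' is a one-past-the-end
             * pointer, which is well-defined. */
            p = next;
        } while (p <= END);

        return ret;
    }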

    /* Like strcpy() but undo backslash-escaping as you go. */
    char *remove_backslashes(char *s) {
        char *dest = s, *src = s;
        while ((*dest = *src)) {
            if (*src == '\\') {
                *dest = *++src;
                if (!*dest) {
                    /* 
                     * If we get here, the next evaluation of the loop
                     * condition will dereference 'src' when it is
                     * one-past-the-end, which is undefined behaviour.
                     */
                    abort();
                }
            }
            dest++, src++;
        }
        return s;
    }
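
For comparison, a Python sketch of the same "split on unescaped colons,
then unescape" idea, done in a single pass.  The names escape_colons,
unparse_escaped and parse_escaped are made up for illustration; the
producer side is restated (joining pairs with ":") only so that the
sketch is self-contained.

```python
def escape_colons(string):
    return "".join("\\" + ch if ch in "\\:" else ch for ch in string)

def unparse_escaped(mapping):
    return ":".join(escape_colons(k) + ":" + escape_colons(v)
                    for k, v in mapping.items())

def parse_escaped(text):
    if not text:
        return {}
    fields, current = [], []
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            # A backslash makes the next character literal.
            nxt = next(chars, None)
            if nxt is None:
                raise ValueError("trailing backslash")
            current.append(nxt)
        elif ch == ":":
            # An unescaped colon ends the current field.
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    if len(fields) % 2 != 0:
        raise ValueError("expected an even number of fields")
    return dict(zip(fields[0::2], fields[1::2]))
```

Like the C version, this rejects a value that ends in a lone backslash
instead of guessing what was meant.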

This format is fully general, printable non-whitespace ASCII only,
round-trips through everything including dead trees and screenshots, and
so on.  Yes, parsing this requires more code than just split(), which
means the consumers' code is a little longer; but in return the
producers' code is dead simple.  I think that'd be a good trade-off to
make.

(Not to mention that producers who don't have colons in their pathnames
don't need to worry about escaping them.)

Feel free to use some other character instead of colon, e.g., with
little change this would allow a list of "foo=bar;baz=qux;" with the
three characters «=» «;» «\» being backslash-escaped.
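
A sketch of the producer for that variant, assuming each pair is
terminated (rather than separated) by ";" as in the example string.
The names SPECIAL, escape_kv and unparse_kv are made up for
illustration.

```python
SPECIAL = "=;\\"

def escape_kv(string):
    # Prefix each of the three special characters with a backslash.
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in string)

def unparse_kv(mapping):
    return "".join(escape_kv(k) + "=" + escape_kv(v) + ";"
                   for k, v in mapping.items())
```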

3) Counted-length strings

    def unparse(mapping):
        # "4:key1 6:value1 4:key2 6:value2"
        #
        # Assume strlen() is a function that returns the length *in
        # bytes* (as opposed to characters)
        escape = lambda string: str(strlen(string)) + ":" + string
        return " ".join(escape(key) + " " + escape(value)
                        for key,value in mapping.items())
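
A sketch of a matching parser, assuming the value is handled as bytes
and the paths are UTF-8 (both are assumptions made here so that "length
in bytes" is unambiguous).  The names unparse_counted and parse_counted
are made up for illustration.

```python
def unparse_counted(mapping):
    def encode(string):
        data = string.encode("utf-8")
        return str(len(data)).encode("ascii") + b":" + data
    return b" ".join(encode(k) + b" " + encode(v) for k, v in mapping.items())

def parse_counted(data):
    fields = []
    pos = 0
    while pos < len(data):
        colon = data.index(b":", pos)  # ValueError if no length prefix
        digits = data[pos:colon]
        if not digits.isdigit():
            raise ValueError("bad length prefix")
        start = colon + 1
        end = start + int(digits)
        if end > len(data):
            raise ValueError("field is truncated")
        fields.append(data[start:end].decode("utf-8"))
        pos = end
        if pos < len(data):
            # Fields are separated by exactly one space.
            if data[pos:pos + 1] != b" ":
                raise ValueError("expected a space between fields")
            pos += 1
    if len(fields) % 2 != 0:
        raise ValueError("expected an even number of fields")
    return dict(zip(fields[0::2], fields[1::2]))
```

Because each field carries its own length, keys and values may contain
colons, spaces, or backslashes without any escaping.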

Ximin Luo wrote on Wed, Jan 11, 2017 at 17:42:00 +0000:
> Warning: extreme bikeshedding ahead

In the interest of avoiding bikeshedding, I won't post to this thread
again unless I have something to say that I haven't already said.

Apologies for the somewhat dense style of the code samples; they are
just examples in an ephemeral email, so I wrote them more succinctly
than usual.

Cheers,

Daniel

> Not considered
> ==============
> 
> because not easy to parse (will hinder adoption):
> - custom escaping with \ and \\
> - urlencode
> - json
> 
> because appears in file paths:
> - PATH-separator (: on POSIX, ; on windows)
>   because we'd like to work with as many user paths as possible; by contrast PATH can afford to be more restrictive to "system-like" paths