[rb-general] SOURCE_PREFIX_MAP format specification proposals

Wed Jan 11 18:42:00 CET 2017

Warning: extreme bikeshedding ahead

For background, see https://gcc.gnu.org/ml/gcc-patches/2016-11/msg00182.html

In that thread, the possibility of supporting multiple mappings was brought up. Matthias Klose, the Debian GCC maintainer, has also told me that they'd prefer us to have a draft spec for this variable before they accept the patch. It doesn't have to be the final spec, but it should at least be a document that describes the exact format. At RWS2 we also discussed multiple mappings, and decided that this is probably a good thing to support.

Rust also seems keen on supporting multiple mappings, similar to GCC:
- https://github.com/rust-lang/rust/issues/38322
- https://github.com/rust-lang/rust/pull/38348

Also, this variable is a little different from SOURCE_DATE_EPOCH in that we'd encourage upstream buildsystems to set this themselves. (For SOURCE_DATE_EPOCH that is optional and not explicitly encouraged.) Plus, we explicitly allow appending extra mappings onto existing maps, see below.

Aims
====

Setting the variable would be done by buildsystems and higher-level tools such as distribution buildsystems, virtual-environment builders, CI systems, etc. Therefore, it should be easy to do this in the languages that these tools are likely to be written in, for example Makefile, POSIX shell, Perl.

Parsing and applying the variable would be done by lower-level tools such as compilers and documentation generators. Therefore, it should be easy to parse the format in the languages that these tools are likely to be written in, for example C, Python. Applying the variable is a separate concern from the *format* in which it's represented, so we don't need to think about it when choosing the format.

Currently-favoured proposals
============================

Parsing the variable: splitn+rsplit
-----------------------------------

1. split SOURCE_PREFIX_MAP into many mappings, on each "record-separator" single-byte character
2. then, right-split each mapping into a key-value pair, on a "pair-separator" single-byte character

This is easy to implement in all languages, both to set and to parse the variable. Note that the map is *ordered*, so e.g. in Python this must not be stored in a plain `dict` but perhaps a `collections.OrderedDict` or even list-of-pairs.
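
As a rough illustration, the two steps above might look like this in Python. This is only a sketch: the name `parse_prefix_map` is made up, the default separators assume variant (c) below (nothing is decided yet), and both treating the key as the mapping source and rejecting a record without a pair-separator are my own assumptions rather than part of any spec.

```python
# Separator bytes are assumptions: variant (c) of this proposal, not a final spec.
RECORD_SEP = b"\x1e"  # RECORD SEPARATOR
PAIR_SEP = b"\x1f"    # UNIT SEPARATOR


def parse_prefix_map(value, record_sep=RECORD_SEP, pair_sep=PAIR_SEP):
    """Parse the raw variable value (bytes) into an ordered list of
    (source, target) pairs. Hypothetical helper, for illustration only."""
    mappings = []
    if not value:
        # Assumption: an unset or empty variable means "no mappings".
        return mappings
    # Step 1: split into records on the record-separator.
    for record in value.split(record_sep):
        # Step 2: right-split once on the pair-separator, so that the
        # separator may still occur inside the key (assumed: the source).
        parts = record.rsplit(pair_sep, 1)
        if len(parts) != 2:
            # Assumption: a record without a pair-separator is an error.
            raise ValueError("mapping without pair separator: %r" % (record,))
        mappings.append((parts[0], parts[1]))
    # A list of pairs keeps the order in which the mappings were set.
    return mappings
```

On POSIX the raw bytes could be read with `os.environb.get(b"SOURCE_PREFIX_MAP", b"")`. Passing `record_sep=b"\t", pair_sep=b"="` gives the behaviour of variant (a) instead.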

I have in mind the following variants:

a. Emphasis on displayability / typeability
Record-separator = 0x09 HORIZONTAL TAB
Pair-separator   = 0x3D EQUALS SIGN

b. Emphasis on representability in some markup languages
Record-separator = 0x0C FORM FEED
Pair-separator   = 0x3D EQUALS SIGN

c. Emphasis on supporting arbitrary file paths and fitting the intended purpose of the chosen characters
Record-separator = 0x1E RECORD SEPARATOR
Pair-separator   = 0x1F UNIT SEPARATOR

I'd like help with choosing one of these. For more details see the "Details on characters" section below. My personal preference is (c), and my reasoning is as follows:

In terms of writing these characters in a programming language (either to set the variable or to parse it), this should not be a problem. You can either use the characters directly, or use whatever escape code mechanism your language provides if you want to stick to "printable characters".

Generally, plain text editors seem to display the raw characters either using replacement codes (e.g. ^^ in emacs, vim, less) or use a stand-in font glyph that contains the character code (0x1E) or short name (RS) so they can be recognised visually.

The main downside is that these will not be displayed properly on a web page, though they can still be copy+pasted into another program if they are selected (tested in Chrome and Firefox.) Also, most of them are not valid in a markup document. But I think the other concerns outweigh this; web pages are generally not expected to preserve arbitrary data, and even if this is needed then it could offer the data in a raw form outside of a markup document.

Setting the variable
--------------------

Subject to "parsing" being finalised, setting the variable would be done by appending new mappings to any existing map, which must not be overwritten.
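
For example, a build tool written in Python might append a mapping roughly like this. It is a sketch under assumptions: `append_mapping` is a hypothetical name, the default separators are those of variant (c) above, and refusing paths that contain a separator is my own choice, since the format has no escaping.

```python
def append_mapping(existing, source, target,
                   record_sep=b"\x1e", pair_sep=b"\x1f"):
    """Return a new variable value (bytes) with source -> target appended
    to the existing value. Hypothetical helper, for illustration only."""
    # With right-split parsing, the pair-separator is harmless in the source
    # but would make the target ambiguous; the record-separator is never safe.
    if record_sep in source or record_sep in target or pair_sep in target:
        raise ValueError("path contains a separator byte")
    record = source + pair_sep + target
    if not existing:
        return record
    # Append after the existing mappings; never overwrite them.
    return existing + record_sep + record
```

The existing value is kept as a prefix of the result, so mappings set by an outer tool are preserved.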

Applying the variable
---------------------

As mentioned, this is independent of "parsing".

Mappings must be applied in the reverse of the order in which they were set. Each mapping is applied by checking whether the mapping source is a substring-prefix of the subject path; if so, that prefix is replaced with the mapping target and the result is returned immediately, without applying any earlier mappings.

As a corollary to the above, implementations may optionally preprocess paths *before* performing the above application, e.g. to deal with situations like [2]. It is up to the implementation to ensure that preprocessing preserves correctness, e.g. that if two fields are semantically related (such as `DW_AT_name` and `DW_AT_comp_dir`), this relationship is preserved in a satisfactory way after preprocessing and mapping.

[2] https://github.com/rust-lang/rust/pull/38348#issuecomment-267394032
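
The application rule in this section (reverse order, first match wins, plain substring-prefix) could be sketched in Python as follows. The name `apply_prefix_map` is hypothetical, the input is assumed to be an already-parsed ordered list of (source, target) byte pairs, and no preprocessing is attempted.

```python
def apply_prefix_map(path, mappings):
    """Map a path (bytes) using an ordered list of (source, target) pairs.
    Hypothetical helper, for illustration only."""
    # Mappings set later take priority, so walk the list in reverse.
    for source, target in reversed(mappings):
        # Substring-prefix check: no awareness of path components.
        if path.startswith(source):
            # Replace the prefix and stop; earlier mappings are not applied.
            return target + path[len(source):]
    # No mapping matched: the path is returned unchanged.
    return path
```

With the mappings `[(b"/build", b"/src"), (b"/build/lib", b"/lib")]`, the path `/build/lib/x.c` is handled by the second (later) mapping. Because the check is a plain substring-prefix, `/buildx/y.c` also matches the `/build` mapping.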

Not considered
==============

because not easy to parse (will hinder adoption):
- custom escaping with \ and \\
- urlencode
- json

because appears in file paths:
- PATH-separator (: on POSIX, ; on Windows)
  because we'd like to work with as many user paths as possible; by contrast PATH can afford to be more restrictive to "system-like" paths

Details on characters
=====================

See https://gist.github.com/infinity0/1b9acca742aa09e032841fa2a9ef9fa8
in particular test-all.sh

EV: easy to insert into (set) variables, in various languages
PT: not used in file paths anywhere
1B: single-byte <128, so encoding-independent
4P: fits the "intended purpose" of the character
DA: displayable
TA: typeable
MA: representable in markup (HTML, XML, etc)

[+] yes  [~] sort-of  [-] no  [*] see details below the table

chr | EV | PT | 1B | 4P*| DA | TA | MA |
----+----+----+----+----+----+----+----+
LF  | -* | +  | +  | ~  | ~* | +  | +  |
HT  | +  | ~  | +  | -  | +  | +  | +  |
RS  | +  | +  | +  | +  | ~* | -  | -  |
FF  | +  | +  | +  | -  | ~* | ~* | ~* |
VT  | +  | +  | +  | ~  | ~* | ~* | -  |
NEL | +  | +  | -  | ~  | ~* | -  | ~* |

4P:
 - This column is based on my subjective judgement, but my reasoning is roughly
   that the "record separator" concept intuitively is close to a "small
   vertical separator" concept.

LF/EV:
 - not easy in Makefile, see newline.mk

LF/DA:
 - many interfaces like to split envvars with this character, so having it
   inside a value might make things very confusing

RS/DA:
 - in emacs: ^^
 - other editors / terminals: a stand-in font glyph that says "001E" or "RS"
 - invisible in chrome and firefox but can be copy+pasted

FF/DA:
 - similar to RS/DA
 - but in some editors, shows up as a page break

FF/MA:
 - invalid in: HTML 4, XHTML 1.0, XML 1.0
 - valid in: HTML5, XML 1.1

FF/TA:
 - \f in many escape-code schemes

VT/DA:
 - similar to RS/DA

VT/TA:
 - \v in many escape-code schemes

NEL/DA:
 - similar to RS/DA
 - but in emacs: \205 (instead of ^[etc])

NEL/MA:
 - invalid in: HTML 4, XHTML 1.0, HTML5
 - valid in: XML 1.0, XML 1.1

-- 
GPG: ed25519/56034877E1F87C35
GPG: rsa4096/1318EFAC5FBBDBCE
https://github.com/infinity0/pubkeys.git