[rb-general] BUILD_PATH_PREFIX_MAP code examples and test cases

Ian Jackson ijackson at chiark.greenend.org.uk
Thu Feb 9 14:48:17 CET 2017


Ximin Luo writes ("Re: [rb-general] BUILD_PATH_PREFIX_MAP code examples and test cases"):
> Ian Jackson:
> > Ximin Luo writes ("[rb-general] BUILD_PATH_PREFIX_MAP code examples and test cases"):
> >> % -> %p
> >> = -> %e
> >> : -> %c
> > 
> > Please don't use letters (or underscore, or ideally, hyphens) for
> > this.  There are all sorts of informal and semi-formal string handling
> > algorithms that depend on finding the boundaries of `words' consisting
> > of alphanumerics.  They will not work right, generating an endless
> > stream of low-level lossage for human expert users writing ad-hoc
> > regexps, doing cut and paste, and so on.
> > 
> > These approaches can also mislead wetware and sometimes cause false
> > matches in software.  (These problems are evident in QP- and
> > URL-encodings.)
> 
> Do you have a specific proposal to make along these lines?

Sorry about the delay replying.

I think the original proposal's %+ %; %% were OK.  I think the
problems with %% are overblown.  But to satisfy those who think %% is
a problem, and avoid encodings which expose Unix shell metacharacters
(and avoid adding /s), I suggest
   =  =>  %+     (mnemonic: same key on many keyboards)
   :  =>  %.     (mnemonic: visually similar)
   %  =>  %#     (weak mnemonic: both are quite full character cells)
(The other characters which meet all the nice-to-haves are ^ @ , ~
and are, I think, less memorable.  @ is a metacharacter in Perl
""-strings, too, and "," seems a poor choice.)

These have the following good properties:

 * If filenames do not contain = : % then no encoding is needed.
 * If a filename can be written unquoted in Unix shell, so can its
   encoding.
 * Decoding does not involve resolving the semi-ambiguity of `%%'.
 * Word-breaking algorithms based on [A-Za-z0-9]+ [-A-Za-z0-9]+
   [_A-Za-z0-9]+ will treat encoding of punctuation as punctuation.
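
To make the mapping concrete, here is a minimal sketch in Python.  The
function names encode/decode are hypothetical and are not taken from
any existing implementation; it assumes the three substitutions listed
above and treats any other character following % as an error.

```python
# Sketch of the suggested punctuation-only encoding.
# Helper names are hypothetical; only the mapping comes from the proposal.
ENCODE = {"%": "%#", "=": "%+", ":": "%."}
# Map the character following '%' back to the original character.
DECODE = {enc[1]: raw for raw, enc in ENCODE.items()}


def encode(path):
    """Replace each of % = : with its two-character escape."""
    return "".join(ENCODE.get(ch, ch) for ch in path)


def decode(text):
    """Invert encode(); raise ValueError on an unknown or truncated escape."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "%":
            if i + 1 >= len(text) or text[i + 1] not in DECODE:
                raise ValueError("bad escape at offset %d" % i)
            out.append(DECODE[text[i + 1]])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(encode("blork=wombat.d"))   # blork%+wombat.d
print(encode("/usr/src/pkg"))     # unchanged: /usr/src/pkg
print(decode(encode("a%b=c:d")))  # a%b=c:d
```

Because every escape is % followed by exactly one character from a
fixed set, the decoder never has to look further ahead than one
character, which is the point of avoiding `%%'.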

> This isn't meant to be a generic communications encoding, I don't know why anyone would do what you're suggesting, and the % character (or any other reasonable character we could use) would already mess with "word boundary" algorithms.

I think you have misunderstood my objection to %p %e %c.

Suppose I have a PREFIX_MAP value which mentions a directory
"blork=wombat.d", which is encoded as "blork%ewombat.d".  Such
filenames are not likely to occur other than as formulaic compositions
by a build system, or similar, so it is likely that "wombat" is an
interesting token.

The encoding "blork%ewombat.d" is suboptimal because it "looks" like
it was made out of "ewombat", rather than "wombat".  Examples where
this might be annoying:

 $ less +/'\bwombat\b' build.log       # misses the mention in PREFIX_MAP
 $ printenv | grep '\bwombat\b'        # misses the mention in PREFIX_MAP
 double-click on wombat in an xterm    selects "ewombat", not "wombat"

Of course more formal setups would probably not make the assumption
which "ewombat" violates.  But I think we would prefer to avoid
misleading users who type informal and ad-hoc shell runes, and to
avoid breaking their finger macros, etc.

Encoding punctuation characters as somewhat different punctuation
characters avoids this problem.  In my suggestion you end up with
"blork%+wombat.d", which matches \bwombat\b
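
The difference can be checked mechanically.  The following is a small
sketch using Python's re module; the variable names are illustrative
only, and the [A-Za-z0-9]+ split is merely an approximation of what a
word-selecting tool might do, not a description of any particular
terminal.

```python
import re

# The same directory name under the two candidate encodings.
letters = "blork%ewombat.d"   # '=' encoded as %e
punct = "blork%+wombat.d"     # '=' encoded as %+

word = re.compile(r"\bwombat\b")

# With %e, the 'e' is a word character, so there is no word boundary
# before "wombat" and the search fails.
print(bool(word.search(letters)))   # False
# With %+, the '+' is punctuation, so the boundary is where we expect.
print(bool(word.search(punct)))     # True

# A naive word-breaking pass over alphanumerics shows the same effect.
print(re.findall(r"[A-Za-z0-9]+", letters))   # ['blork', 'ewombat', 'd']
print(re.findall(r"[A-Za-z0-9]+", punct))     # ['blork', 'wombat', 'd']
```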

Of course it doesn't match \bblork=wombat\b but someone who is typing
that will hopefully suspect that the = will cause trouble and
search for \bblork.*wombat\b or something - perhaps even \bwombat\b

Sorry for perpetuating this bike shed conversation.

Ian.

