[rb-general] BUILD_PATH_PREFIX_MAP code examples and test cases
Ian Jackson
ijackson at chiark.greenend.org.uk
Thu Feb 9 14:48:17 CET 2017
Ximin Luo writes ("Re: [rb-general] BUILD_PATH_PREFIX_MAP code examples and test cases"):
> Ian Jackson:
> > Ximin Luo writes ("[rb-general] BUILD_PATH_PREFIX_MAP code examples and test cases"):
> >> % -> %p
> >> = -> %e
> >> : -> %c
> >
> > Please don't use letters (or underscore, or ideally, hyphens) for
> > this. There are all sorts of informal and semi-formal string handling
> > algorithms that depend on finding the boundaries of `words' consisting
> > of alphanumerics. They will not work right, generating an endless
> > stream of low-level lossage for human expert users writing ad-hoc
> > regexps, doing cut and paste, and so on.
> >
> > These approaches can also mislead wetware and sometimes cause false
> > matches in software. (These problems are evident in QP- and
> > URL-encodings.)
>
> Do you have a specific proposal to make along these lines?
Sorry about the delay replying.
I think the original proposal's %+ %; %% were OK. I think the
problems with %% are overblown. But to satisfy those who think %% is
a problem, and avoid encodings which expose Unix shell metacharacters
(and avoid adding /s), I suggest
= => %+ (mnemonic: same key on many keyboards)
: => %. (mnemonic: visually similar)
% => %# (weak mnemonic: both are quite full character cells)
(The other characters which meet all the nice-to-haves are ^ @ , ~
and are, I think, less memorable. @ is a metacharacter in Perl
""-strings, too, and "," seems a poor choice.)
These have the following good properties:
* If filenames do not contain = : % then no encoding is needed.
* If a filename can be written unquoted in Unix shell, so can its
encoding.
* Decoding does not involve resolving the semi-ambiguity of `%%'.
* Word-breaking algorithms based on [A-Za-z0-9]+ [-A-Za-z0-9]+
[_A-Za-z0-9]+ will treat encoding of punctuation as punctuation.
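
A minimal sketch of the suggested mapping (= => %+, : => %., % => %#),
to make the properties above concrete. The names encode and decode are
hypothetical, and rejecting unknown or truncated escapes is an assumption
of this sketch rather than anything settled in the thread.

```python
import re

# Mapping suggested in this message; '%' itself becomes '%#'.
_ENCODE = {"%": "%#", "=": "%+", ":": "%."}
_DECODE = {"#": "%", "+": "=", ".": ":"}


def encode(path):
    """Encode one path so it contains no bare '=' or ':'."""
    # A single pass over the input, so nothing is encoded twice.
    return re.sub(r"[%=:]", lambda m: _ENCODE[m.group(0)], path)


def decode(text):
    """Invert encode(); every '%' must start a known two-character escape."""
    def replace(match):
        code = match.group(1)
        if code not in _DECODE:
            # Assumption: treat unknown or truncated escapes as errors.
            raise ValueError("bad escape: %r" % match.group(0))
        return _DECODE[code]

    # '.?' also matches nothing, so a trailing lone '%' is caught above.
    return re.sub(r"%(.?)", replace, text, flags=re.DOTALL)
```

Decoding is a single left-to-right scan in which each % consumes exactly
one following character, so there is no `%%'-style ambiguity to resolve.
A filename with none of = : % passes through encode unchanged.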
> This isn't meant to be a generic communications encoding, I don't know why anyone would do what you're suggesting, and the % character (or any other reasonable character we could use) would already mess with "word boundary" algorithms.
I think you have misunderstood my objection to %p %e %c.
Suppose I have a PREFIX_MAP value which mentions a directory
"blork=wombat.d", which is encoded as "blork%ewombat.d". Such
filenames are not likely to occur other than as formulaic compositions
by a build system, or similar, so it is likely that "wombat" is an
interesting token.
The encoding "blork%ewombat.d" is suboptimal because it "looks" like
it was made out of "ewombat", rather than "wombat". Examples where
this might be annoying:
$ less +/'\bwombat\b' build.log # misses the mention in PREFIX_MAP
$ printenv | grep '\bwombat\b' # misses the mention in PREFIX_MAP
double-click on wombat in an xterm selects "ewombat", not "wombat"
Of course more formal setups would probably not make the assumption
which "ewombat" violates. But I think we would prefer to avoid
misleading users who type informal and ad-hoc shell runes, and to
avoid breaking their finger macros, etc.
Encoding punctuation characters as somewhat different punctuation
characters avoids this problem. In my suggestion you end up with
"blork%+wombat.d", which matches \bwombat\b
Of course it doesn't match \bblork=wombat\b but someone who is typing
that will hopefully suspect that the = will cause trouble and
search for \bblork.*wombat\b or something - perhaps even \bwombat\b.
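
The word-boundary behaviour described above can be checked with a short
sketch. It uses Python's re module, whose \b treats letters, digits and
underscore as word characters, much like the grep and less examples. The
helper name mentions is hypothetical.

```python
import re


def mentions(pattern, text):
    """Return True if the regular expression matches anywhere in text."""
    return re.search(pattern, text) is not None


letter_encoded = "blork%ewombat.d"  # '=' encoded as '%e'
punct_encoded = "blork%+wombat.d"   # '=' encoded as '%+'

# With '%e', the 'e' fuses onto 'wombat', so no word boundary precedes it.
found_in_letter_form = mentions(r"\bwombat\b", letter_encoded)

# With '%+', 'wombat' is delimited by punctuation on both sides.
found_in_punct_form = mentions(r"\bwombat\b", punct_encoded)

# The looser search suggested above also finds the punctuation form.
found_loose = mentions(r"\bblork.*wombat\b", punct_encoded)
```

Here found_in_letter_form is False, while found_in_punct_form and
found_loose are both True.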
Sorry for perpetuating this bike shed conversation.
Ian.