[rb-general] BUILD_PATH_PREFIX_MAP code examples and test cases

Mon Feb 13 18:39:00 CET 2017

Ian Jackson:
> [..]
> 
> Sorry about the delay replying.
> 
> I think the original proposal's %+ %; %% were OK.  I think the
> problems with %% are overblown.  But to satisfy those who think %% is
> a problem, and avoid encodings which expose Unix shell metacharacters
> (and avoid adding /s), I suggest
>    =  =>  %+     (mnemonic: same key on many keyboards)
>    :  =>  %.     (mnemonic: visually similar)
>    %  =>  %#     (weak mnemonic: both are quite full character cells)
> (The other characters which meet all the nice-to-haves are ^ @ , ~
> and are, I think, less memorable.  @ is a metacharacter in Perl
> ""-strings, too, and "," seems a poor choice.)
> 
> These have the following good properties:
> 
>  * If filenames do not contain = : % then no encoding is needed.
>  * If a filename can be written unquoted in Unix shell, so can its
>    encoding.
>  * Decoding does not involve resolving the semi-ambiguity of `%%'.
>  * Word-breaking algorithms based on [A-Za-z0-9]+ [-A-Za-z0-9]+
>    [_A-Za-z0-9]+ will treat encoding of punctuation as punctuation.
> 
>> This isn't meant to be a generic communications encoding, I don't know why anyone would do what you're suggesting, and the % character (or any other reasonable character we could use) would already mess with "word boundary" algorithms.
> 
> I think you have misunderstood my objection to %p %e %c.
> 
> Suppose I have a PREFIX_MAP value that mentions a directory
> "blork=wombat.d", which is encoded as "blork%ewombat.d".  Such
> filenames are not likely to occur other than as formulaic compositions
> by a build system, or similar, so it is likely that "wombat" is an
> interesting token.
> 
> The encoding "blork%ewombat.d" is suboptimal because it "looks" like
> it was made out of "ewombat", rather than "wombat".  Examples where
> this might be annoying:
> 
>  $ less +/'\bwombat\b' build.log       # misses the mention in PREFIX_MAP
>  $ printenv | grep '\bwombat\b'        # misses the mention in PREFIX_MAP
>  double-click on wombat in an xterm    selects "ewombat", not "wombat"
> 
> Of course more formal setups would probably not make the assumption
> which "ewombat" violates.  But I think we would prefer to avoid
> misleading users who type informal and ad-hoc shell runes, and to
> avoid breaking their finger macros, etc.
> 
> Encoding punctuation characters as somewhat different punctuation
> characters avoids this problem.  In my suggestion you end up with
> "blork%+wombat.d", which matches \bwombat\b
> 
> Of course it doesn't match \bblork=wombat\b but someone who is typing
> that will hopefully suspect that the = will cause trouble and
> search for \bblork.*wombat\b or something - perhaps even \bwombat\b
> 
> Sorry for perpetuating this bike shed conversation.
> 
> [..]

Thanks for the very detailed reply and explanations! I actually realised my "already mess with" comment was wrong right after I sent that email, but didn't have anything else to say at the time so I just left it.
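
For anyone following along, the word-boundary point can be checked with a small Python sketch. It only uses the two example strings from the quoted mail; the helper name mentions_wombat is made up for illustration, and Python's \b is assumed to behave like the \b in the less/grep examples for these ASCII inputs.

```python
import re

# Ad-hoc whole-word search, like the less/grep examples quoted above.
pattern = re.compile(r"\bwombat\b")

# Two encodings of the example directory name "blork=wombat.d".
letter_encoded = "blork%ewombat.d"   # '=' written as %e
punct_encoded = "blork%+wombat.d"    # '=' written as %+

def mentions_wombat(text):
    """Return True if text contains 'wombat' as a whole word (hypothetical helper)."""
    return pattern.search(text) is not None

# %e glues a letter onto the token, so there is no boundary before 'w'.
print(mentions_wombat(letter_encoded))
# %+ keeps punctuation next to the token, so the boundary survives.
print(mentions_wombat(punct_encoded))
```

The first call reports no match because "e" is a word character directly in front of "wombat"; the second matches because "+" is not.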

I think I generally agree with this, and I wasn't *too* pleased with the %p %e %c stuff myself, but I thought picking another symbol would be "too random". I think % -> %# is fine though. I also had another idea:

% -> %@
= -> %-
: -> %.

The % sign could be thought of as "doubling" the next character, with @ used instead of 0/o to avoid "word" characters. Anyway, this is very easy to change in the code so I'll wait for a few more days in case anyone else wants to comment, after which I'll pick one of these schemes at random.
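
To show how small the change is, here is a rough Python sketch of both candidate schemes. It is not the actual code under discussion: the names SCHEME_A, SCHEME_B, encode and decode are invented for this example, and rejecting unknown escapes with ValueError is an assumption rather than anything decided in this thread.

```python
# Candidate escape tables from this thread (raw character -> escaped form).
SCHEME_A = {"%": "%#", "=": "%+", ":": "%."}   # Ian's suggestion
SCHEME_B = {"%": "%@", "=": "%-", ":": "%."}   # the alternative above

def encode(path, scheme):
    """Escape %, = and : in a single path; all other characters pass through."""
    return "".join(scheme.get(ch, ch) for ch in path)

def decode(text, scheme):
    """Invert encode(). Unknown or truncated escapes raise ValueError (an assumption)."""
    reverse = {esc: raw for raw, esc in scheme.items()}
    out = []
    i = 0
    while i < len(text):
        if text[i] == "%":
            pair = text[i:i + 2]          # '%' plus the character after it
            if pair not in reverse:
                raise ValueError("bad escape %r at offset %d" % (pair, i))
            out.append(reverse[pair])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

print(encode("blork=wombat.d", SCHEME_A))
print(encode("blork=wombat.d", SCHEME_B))
print(decode(encode("50%:off=yes", SCHEME_B), SCHEME_B))
```

Switching schemes only means swapping the table, and in both tables every escape is exactly two characters, so decoding never has to look further ahead than one character after a %.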

X

-- 
GPG: ed25519/56034877E1F87C35
GPG: rsa4096/1318EFAC5FBBDBCE
https://github.com/infinity0/pubkeys.git