Unicode identifiers

Graydon Hoare
Hi,

I came across some 3rd party discussion of my choice of ASCII-range
identifiers (and limitation of non-ASCII-range unicode to strings, chars
and comments) that cited this as a major problem in the language. This
prompted a little more research and reading on my part, and talking with
people who had differing experiences with non-English identifier use in
programming languages. I now believe that my earlier impression of
"almost universal" adoption of ASCII-range identifiers in non-English
programming shops was mistaken, and that there is actually substantial
value to such programmers in having non-ASCII range available.

Moreover, looking at the approach taken by PEP 3131 (delegating to the
NFKC-normalization-closed sets defined in UAX 31,
XID_Start/XID_Continue), I see the "proper solution" has a
better-established consensus than I had previously understood to exist.
So I've updated the Rust manual to delegate to these specifications as
well, and filed a bug (issue 242, if anyone wants to jump on it) to get
the lexer patched up to handle this change.
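As a rough illustration of what delegating to these specifications means in practice, here is a minimal sketch in Python, whose identifier rule follows PEP 3131. It is an assumption-laden stand-in, not the Rust lexer: `normalize_identifier` is a hypothetical helper, and `str.isidentifier()` is Python's own rule (which also admits a leading underscore), used here only to approximate the XID_Start/XID_Continue check.

```python
import unicodedata


def normalize_identifier(name: str) -> str:
    """NFKC-normalize `name` and check it against Python's identifier rule.

    Assumption: str.isidentifier() stands in for the UAX 31
    XID_Start/XID_Continue check; this is not Rust's lexer.
    """
    normalized = unicodedata.normalize("NFKC", name)
    if not normalized.isidentifier():
        raise ValueError(f"not a valid identifier: {name!r}")
    return normalized


# The "fi" ligature (U+FB01) folds to plain "fi" under NFKC.
print(normalize_identifier("\ufb01le"))     # file
# A Cyrillic identifier is accepted and left unchanged.
print(normalize_identifier("переменная"))
```

Normalizing before comparison means two spellings that differ only by compatibility forms name the same identifier.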

Practical implications of this change are few for people (a) already
comfortable with ASCII-range identifiers or (b) working outside the
lexer. Hopefully it'll make things more welcoming for people who don't fit
into case (a) though.

Apologies for the thrashing about on this issue; I misunderstood the
current state of play (possibly due to a little too much time spent in
despair while trying to upgrade ECMAScript 4 to "any Unicode spec after
1995", but that's a whole other story...)

-Graydon

Igor Bukanov
On 25 February 2011 20:38, Graydon Hoare <graydon at mozilla.com> wrote:
> So I've updated the Rust manual to delegate
> to these specifications as well

With most fonts it is not possible to see that the following ES
fragment should alert 1, not 2. I guess such problems are not
considered high on the list of language designs.

javascript:var a = 1; а = 2; alert(a);

Graydon Hoare
On 11-02-25 02:35 PM, Igor Bukanov wrote:

> With most fonts it is not possible to see that the following ES
> fragment should alert 1, not 2. I guess such problems are not
> considered high on the list of language designs.
>
> javascript:var a = 1; а = 2; alert(a);

It's certainly relevant! The NFKC pass is supposed to defend against ..
*some* of this sort of ambiguity; but there is a limit, and as you say,
it's font-dependent. FWIW, the emacs buffer I pasted it in to inspect
showed the difference plenty well (0x0061 LATIN SMALL LETTER A vs.
0x0430 CYRILLIC SMALL LETTER A). But I think that's because it had to do
a multi-font patchwork job. A well done full-range font would probably
have collided, visually.
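For anyone without a patchwork font handy, the two code points can also be told apart programmatically. A small Python sketch (the `describe` helper is just an illustrative name) prints both and checks that NFKC does not fold one into the other:

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


LATIN_A = "\u0061"
CYRILLIC_A = "\u0430"

print(describe(LATIN_A))     # U+0061 LATIN SMALL LETTER A
print(describe(CYRILLIC_A))  # U+0430 CYRILLIC SMALL LETTER A

# NFKC leaves the Cyrillic letter alone, so the two remain distinct
# identifiers even after normalization.
print(unicodedata.normalize("NFKC", CYRILLIC_A) == LATIN_A)  # False
```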

I pointed out the risk of homographic "attacks" like this to the fellow
who raised the complaint initially, and .. I still agree it's a risk.
It's even a risk in IDNA, where they don't (I believe) squeeze this
ambiguity out even in their nameprep profile.

It's just one that I've changed my feeling on the risk/reward ratio of,
I guess. It's substantially *less* of a risk here than in URLs, too; it
seems really unlikely that anyone's going to use systems-language source
code for phishing attacks (goodness, I hope not).

In any case, if it worries you in production code I suspect it'll be a
small matter of adding a pragma (once the compiler plug-in interface is
sufficiently non-vapourware) to clamp your project's source to a
particular unicode range.

-Graydon

David Herman
> In any case, if it worries you in production code I suspect it'll be a small matter of adding a pragma (once the compiler plug-in interface is sufficiently non-vapourware) to clamp your project's source to a particular unicode range.

+1

This sounds like the right approach to me: this kind of restriction is very easy to write in a lint tool, and our planned compiler plug-in approach is a perfect way to encourage people to write these plug-ins. But I'd rather be inclusive and give people the option to exclude, especially if it's not hard to express the exclusions.
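To give a sense of how small such a lint could be, here is a sketch in Python rather than as a compiler plug-in (the plug-in interface does not exist yet, so nothing here reflects it). The function name `out_of_range_chars` and the default ASCII bounds are assumptions made for illustration.

```python
def out_of_range_chars(source: str, lo: int = 0x00, hi: int = 0x7F):
    """Report every character whose code point falls outside [lo, hi].

    Hypothetical lint: it clamps the whole source text, including
    strings and comments, to one contiguous code point range.
    Returns a list of (line, column, "U+XXXX") tuples, 1-indexed.
    """
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if not lo <= ord(ch) <= hi:
                hits.append((lineno, col, f"U+{ord(ch):04X}"))
    return hits


# The second assignment target is U+0430, not an ASCII "a".
print(out_of_range_chars("var a = 1; \u0430 = 2; alert(a);"))
# [(1, 12, 'U+0430')]
```

A project that wants, say, Latin plus Cyrillic would need a set of ranges rather than a single pair of bounds, but the shape of the check stays the same.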

I also agree that the risks are different for an AOT-compiled systems language, where the code lives in a single repository and is statically inspectable, than they are for web standards.

Dave


Igor Bukanov
In reply to this post by Graydon Hoare
On 25 February 2011 23:57, Graydon Hoare <graydon at mozilla.com> wrote:
> it seems
> really unlikely that anyone's going to use systems-language source code for
> phishing attacks (goodness, I hope not).

It is not about phishing attacks; it is about deliberately subverting
software to embed, for example, a back door.

> In any case, if it worries you in production code I suspect it'll be a small
> matter of adding a pragma (once the compiler plug-in interface is
> sufficiently non-vapourware) to clamp your project's source to a particular
> unicode range.

That is a good idea. This could even be implemented as a syntax extension.

Graydon Hoare
In reply to this post by Igor Bukanov
On 11-02-25 02:35 PM, Igor Bukanov wrote:

> With most fonts it is not possible to see that the following ES
> fragment should alert 1, not 2. I guess such problems are not
> considered high on the list of language designs.
>
> javascript:var a = 1; а = 2; alert(a);

Also note that UTR 36 and UTS 39 cover these issues in more detail,
including defining various additional restriction subsets. Some of those
subsets, with or without being combined with locale information, could
provide some defense here. Particularly, as I say, as a pragma.
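One restriction in this spirit is rejecting identifiers that mix scripts. The Python sketch below is only a crude approximation and not the algorithm from the Unicode security documents: the standard library does not expose the Script property, so it guesses the script from the first word of each character's Unicode name. All three function names are hypothetical.

```python
import unicodedata


def script_guess(ch: str) -> str:
    """Guess a script from the first word of the character's Unicode name.

    Assumption: this is a stand-in for the real Script property, which
    Python's stdlib does not provide; it is wrong for some characters.
    """
    name = unicodedata.name(ch, "")
    return name.split(" ", 1)[0] if name else "UNKNOWN"


def scripts_used(identifier: str) -> set:
    """Collect the guessed scripts of the alphabetic characters only."""
    return {script_guess(ch) for ch in identifier if ch.isalpha()}


def is_mixed_script(identifier: str) -> bool:
    """True when the identifier's letters come from more than one script."""
    return len(scripts_used(identifier)) > 1


print(is_mixed_script("paypal"))            # False
print(is_mixed_script("p\u0430yp\u0430l"))  # True
```

An all-Cyrillic or all-Latin identifier passes; only the mixture is flagged, which is what catches the "a" versus "а" substitution.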

-Graydon

Florian Weimer
In reply to this post by Igor Bukanov
* Igor Bukanov:

> With most fonts it is not possible to see that the following ES
> fragment should alert 1, not 2. I guess such problems are not
> considered high on the list of language designs.
>
> javascript:var a = 1; а = 2; alert(a);

Same problem with ASCII and some fonts:

  javascript:var l = 1; I = 2; alert(l);

(Gill Sans comes to mind, but probably no one uses that for
programming.)

AI05-0227-1 is relevant in this context because it shows the
difficulties that come with non-ASCII identifiers in some contexts:

<http://www.ada-auth.org/cgi-bin/cvsweb.cgi/ai05s/ai05-0227-1.txt?rev=HEAD>

To my knowledge, no Ada implementation makes a decent attempt at
getting this right.  There does not seem to be much commercial demand,
so compiler vendors have different priorities.