Discussion:
Would Lua support variable names with non-ASCII characters?
奥斯陆君王
2018-10-15 15:55:52 UTC
Permalink
LuaJIT does support this feature.
And for UTF-8, all we need is to change the following in llex.c (lines 558-571):

if (lislalpha(ls->current)) {  /* identifier or reserved word? */
  TString *ts;
  do {
    save_and_next(ls);
  } while (lislalnum(ls->current));
  ts = luaX_newstring(ls, luaZ_buffer(ls->buff),
                      luaZ_bufflen(ls->buff));
  seminfo->ts = ts;
  if (isreserved(ts))  /* reserved word? */
    return ts->extra - 1 + FIRST_RESERVED;
  else {
    return TK_NAME;
  }
}

to

if (lislalpha(ls->current) || ls->current & 0x80) {  /* identifier or reserved word? */
  TString *ts;
  do {
    save_and_next(ls);
  } while (lislalnum(ls->current) || ls->current & 0x80);
  ts = luaX_newstring(ls, luaZ_buffer(ls->buff),
                      luaZ_bufflen(ls->buff));
  seminfo->ts = ts;
  if (isreserved(ts))  /* reserved word? */
    return ts->extra - 1 + FIRST_RESERVED;
  else {
    return TK_NAME;
  }
}
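The patched rule can be sketched in Python to see exactly what it admits (a rough model; `is_ident_byte` and `scan_identifier` are illustrative names, not Lua internals). Note that because it only tests the high bit of each byte, it accepts any multi-byte UTF-8 sequence -- including invalid ones:

```python
# Rough Python model of the patched C rule: any byte with the high bit
# set (>= 0x80) counts as an identifier character, in addition to ASCII
# letters, digits, and underscore.  Invalid UTF-8 passes just as easily.

def is_ident_byte(b, first=False):
    if b >= 0x80:                      # the `ls->current & 0x80` test
        return True
    c = chr(b)
    return (c.isalpha() or c == "_") if first else (c.isalnum() or c == "_")

def scan_identifier(src):
    """Return the longest identifier prefix of the bytes object `src`."""
    if not src or not is_ident_byte(src[0], first=True):
        return b""
    i = 1
    while i < len(src) and is_ident_byte(src[i]):
        i += 1
    return src[:i]

print(scan_identifier("变量1 = 2".encode("utf-8")))  # UTF-8 name accepted
print(scan_identifier(b"\xff\xfe = 2"))              # invalid UTF-8 accepted too
```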

It's very easy. Will Lua 5.4 support it?
Luiz Henrique de Figueiredo
2018-10-15 16:29:58 UTC
Permalink
And for UTF-8, all we need is to change the following in llex.c (lines 558-571)
Or change lctype.c. See http://lua-users.org/lists/lua-l/2009-10/msg00104.html
It's very easy. Will Lua 5.4 support it?
Probably not. See the thread above.
Dirk Laurie
2018-10-17 06:19:23 UTC
Permalink
On Mon, 15 Oct 2018 at 18:30, Luiz Henrique de Figueiredo wrote:
Post by Luiz Henrique de Figueiredo
And for UTF-8, all we need is to change the following in llex.c (lines 558-571)
Or change lctype.c. See http://lua-users.org/lists/lua-l/2009-10/msg00104.html
It's very easy. Will Lua 5.4 support it?
Probably not. See the thread above.
In a later post,
http://lua-users.org/lists/lua-l/2011-05/msg00543.html, Luiz spelt it out:
Post by Luiz Henrique de Figueiredo
Note that you can also provide your own lctype.c without patching
the one in the Lua core. The linker will use yours instead.
We also had some fun last year in this thread:

http://lua-users.org/lists/lua-l/2017-04/msg00395.html
Lorenzo Donati
2018-10-16 12:57:50 UTC
Permalink
On 15/10/2018 17:55, 奥斯陆君王 wrote:

[...]
Post by 奥斯陆君王
It's very easy.Will lua 5.4 support it?
I hope it never will!

Sorry, it is not about any cultural prejudice (I know many people,
especially Asian people, could feel discriminated by such a stance, but
it is not my intention).

It is just a matter of convenience and "safety". It is not worth opening
such a big can of worms, IMO.

I started programming by learning through trial and error what it means
to use "0", "O", and "o" carelessly as characters in identifiers. The
same goes for "l" and "1".

That is, any subset of characters likely to have similar glyphs in some
font is going to cause grief in some cases without proper programming
practices.

Allow the whole UNICODE mess into identifiers and the chances of
mistaking one symbol for another skyrocket! I'm not a UNICODE guru, but
I bet my bottom dollar that there are more than a dozen symbols that, in
some font, look like an uppercase Latin "O" (that is, a symbol looking
more or less like a circle). The same goes for other simple-looking
symbols like an uppercase "I" (a vertical "stick" of some sort).

Now, imagine an identifier like B10010100, where each individual
"character" is in fact a different "version" of a "0" or a "1". Nightmare!

These problems are somewhat small annoyances to cope with when you are
dealing with ASCII, where the "problematic" chars are well known,
because every programmer more or less knows what's in *the whole ASCII
set*.

But what the frigging heck is in UNICODE?!? There are gazillions of code
points! There are even not-yet-defined code points!!! WHO knows UNICODE
in its entirety?

How can I be sure that whoever has to use my code, where I inserted a
"unicodishy" identifier, can tell unambiguously which "characters" make
up that identifier?

Is this worth all the hassle? What advantages would this bring to the
programming effort? How much will it cost to track down bugs generated
by the possible mistake?

I doubt there are tangible *net* advantages in *standardizing* UNICODE,
even in its remarkable UTF-8 encoding, as an alphabet for programming.

UNICODE was meant for linguistics and typesetting, not for programming.

And anyway, as LHF pointed out, you can change lctype.c if you have
special needs (which I definitely won't argue against, that's for sure).


BTW, since this is not the first time this "I'd like unicode in my
names" thing comes up, I'd like to see some of the UNICODE gurus on this
list entering a contest of creating the most bedazzling set of
seemingly-identical identifiers using their "utf-8 powers". :-D

I think this will have great educational value for those thinking that
having *generally standardized* UNICODE identifiers is a good idea. ;-)


Cheers!

--Lorenzo
Viacheslav Usov
2018-10-16 13:18:35 UTC
Permalink
Post by Lorenzo Donati
Now, imagine an identifier like B10010100, where each individual
"character" is in fact a different "version" of a "0" or a "1". Nightmare!

Especially in Lua, which happily treats any unknown identifier as a valid
global variable.

oПο𝖔

Cheers,
V.
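Viacheslav's lookalike string is easy to dissect with nothing but Python's standard `unicodedata` module. The sketch below (an editorial aside, not part of the thread's Lua code) shows that the characters are distinct code points, and how Python's own Unicode-identifier rule (PEP 3131, which NFKC-normalizes names) treats them:

```python
import unicodedata

# Four characters that can render identically, yet are distinct code points:
# Latin, Cyrillic, Greek, and a mathematical-alphanumeric letter.
lookalikes = ["o", "о", "ο", "𝖔"]
for ch in lookalikes:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Python admits Unicode identifiers but NFKC-normalizes them (PEP 3131):
# the math-alphanumeric form folds to plain 'o', while the Cyrillic and
# Greek letters do not -- so three visually identical identifiers would
# still name three different globals.
print(unicodedata.normalize("NFKC", "𝖔"))          # 'o'
print(unicodedata.normalize("NFKC", "о") == "o")   # False
```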
Lorenzo Donati
2018-10-16 17:11:37 UTC
Permalink
Post by Lorenzo Donati
Post by Lorenzo Donati
Now, imagine an identifier like B10010100, where each individual
"character" is in fact a different "version" of a "0" or a "1". Nightmare!
Especially in Lua, which happily treats any unknown identifier as a valid
global variable.
oоοo𝖔
Cheers,
V.
Yepp!!

And since there are few fonts (AFAIK) that cover the entire UNICODE set,
we will need editors capable of automatically rendering source code by
mixing and matching different glyphs from different fonts.

Word processors for source code, anyone? (*Ouch!*)

Imagine: use a ~1GB application to write a ~100-line script (~1kB of
source code) in a language whose implementation is ~1MB. That's
minimalism! :-)

Moreover, the same editors will also have to be hex editors (*Urgh!*),
because we will need the ability to look at the actual encoding of
glyphs to discriminate those visually ambiguous identifiers (something
that is so easy in ASCII, e.g. by switching to a monospaced font).

Then the same editor would need the ability to map the encoding to its
standard UNICODE representation, just because otherwise we would also
need to remember all the possible UTF-8 sequences and their meanings.
(*Arghh!!*)

Then....

The more I think about it, the more it seems a possible representation
of Hell (in the biblical sense) for a programmer: spend eternity
learning every UTF-8 sequence, its mapping to code points, and their
possible visual representations with glyphs in an infinite number of
fonts, none of which can ever cover the whole UNICODE plane-set.

In comparison, solving the halting problem is just purgatory! (*<grin>*)

There are up sides, though. Imagine how many nice and mind-boggling
pranks you could do to your colleague programmers! :-]]

Cheers! :-D

-- Lorenzo
Sean Conner
2018-10-16 17:47:12 UTC
Permalink
Post by Lorenzo Donati
Imagine: use a ~1GB application to write a ~100 lines script (~1kB
source code) of a language whose implementation is ~1MB. That's
minimalism! :-)
There are people who already do that (I'm looking at you, Electron! [1])
Post by Lorenzo Donati
Moreover, the same editors will have to be also hex-editors (*Urgh!*),
because we will need the ability to look at the actual encoding of
glyphs to discriminate those visually-ambiguous identifiers (something
that is so easy in ASCII, e.g. by switching to a monospaced font).
I think it would be easy enough to have an editor that just highlights
Unicode characters outside the range of 0x20 - 0x7E.

-spc (Hmmm ... I may have to look into modifying my editor to do just that
... )

[1] https://electronjs.org/
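Sean's highlighting idea is simple to prototype. Here is a minimal, editor-agnostic sketch in Python (the function name `flag_non_ascii` is illustrative) that reports every character outside printable ASCII (0x20-0x7E) with its line, column, and code point:

```python
# Minimal sketch of the highlighting idea: flag every character outside
# the printable ASCII range (0x20-0x7E), reporting line, column, the
# character itself, and its code point.
def flag_non_ascii(source):
    hits = []
    for lineno, line in enumerate(source.splitlines(), 1):
        for col, ch in enumerate(line, 1):
            if not 0x20 <= ord(ch) <= 0x7E:
                hits.append((lineno, col, ch, f"U+{ord(ch):04X}"))
    return hits

sample = "local oПο = 42\nprint(oПο)"
for hit in flag_non_ascii(sample):
    print(hit)
```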
Albert Chan
2018-10-16 18:08:44 UTC
Permalink
There are up sides, though. Imagine how many nice and mind-boggling pranks you could do to your colleague programmers! :-]]
Cheers! :-D
-- Lorenzo
Since you might read your own code, the prank might backfire :-D

I think coding is hard enough without worrying about non-ASCII issues ...
Andrew Starks
2018-10-16 20:03:01 UTC
Permalink
Post by Lorenzo Donati
Post by Lorenzo Donati
There are up sides, though. Imagine how many nice and mind-boggling
pranks you could do to your colleague programmers! :-]]
Post by Lorenzo Donati
Cheers! :-D
-- Lorenzo
Since you might read your own code, the prank might backfire :-D
I think coding is hard enough without worrying about non-ASCII issue ...
I’d rather have LPEG matching for variable assignments.

lpeg.r"ah" = 42

assert(b == g and g == "42")
Tim Hill
2018-10-16 22:51:31 UTC
Permalink
Post by Lorenzo Donati
Post by Lorenzo Donati
Post by Lorenzo Donati
Now, imagine an identifier like B10010100, where each individual
"character" is in fact a different "version" of a "0" or a "1". Nightmare!
Especially in Lua, which happily treats any unknown identifier as a valid
global variable.
oоοo𝖔
Cheers,
V.
Yepp!!
And since there are few fonts (AFAIK) that cover the entire UNICODE set, we will need editors capable of automatically rendering source code by mixing and matching different glyphs from different fonts.
Word processors for source code, anyone? (*Ouch!*)
Imagine: use a ~1GB application to write a ~100 lines script (~1kB source code) of a language whose implementation is ~1MB. That's minimalism! :-)
Moreover, the same editors will have to be also hex-editors (*Urgh!*), because we will need the ability to look at the actual encoding of glyphs to discriminate those visually-ambiguous identifiers (something that is so easy in ASCII, e.g. by switching to a monospaced font).
Then the same editor would need the ability to map the encoding to its standard UNICODE representation, just because otherwise we would also need to remember all the possible UTF-8 sequences and their meanings. (*Arghh!!*)
Then....
The more I think about it, the more it seems a possible representation of Hell (in the biblical sense) for a programmer. Spend the eternity learning every UTF-8 sequence, its mapping to code-points and their possible visual representation with glyphs in an infinite number of fonts which never can represent the whole UNICODE plane-set.
In comparison solving the halting problem is just purgatory! (*<grin>*)
There are up sides, though. Imagine how many nice and mind-boggling pranks you could do to your colleague programmers! :-]]
Cheers! :-D
-- Lorenzo
Unicode (noun): A character encoding system designed to make code pages look sensible.

—Tim
Sergey Zakharchenko
2018-10-17 05:31:09 UTC
Permalink
Hello,

I'll just throw in my take on how Lua may look 'different' in a text
editor with Unicode font support (what you type and what's stored in
the file is still the same plain ASCII, just the presentation is
different). Not that I'm using it though, I just created it for fun
once.

https://pasteboard.co/HIP1375.png

https://gist.github.com/szakharchenko/a479fb90af72ef0243710278f1a7eac2

You could create some sort of convention, like '__Uxxx_ in a symbol
name should be displayed as the corresponding Unicode char', and have
the editor do the transformation for you. The elisp to do that is left
as an exercise.
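The display transformation is a one-liner in most languages; here is a Python sketch (my reading of the `__Uxxx_` convention described above, with a variable-length hex field -- not an established standard):

```python
import re

# Sketch of the convention above: render __Uxxx_ escapes inside symbol
# names as the corresponding Unicode character.  The exact escape syntax
# (double underscore, 'U', hex digits, trailing underscore) is one
# possible reading of the proposal.
def display_form(name):
    return re.sub(r"__U([0-9A-Fa-f]+)_",
                  lambda m: chr(int(m.group(1), 16)), name)

print(display_form("local __U3C0_ = 3.14159"))  # local π = 3.14159
```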

Best regards,
--
DoubleF
Dirk Laurie
2018-10-17 06:08:07 UTC
Permalink
On Wed, 17 Oct 2018 at 07:31, Sergey Zakharchenko wrote:
Post by Sergey Zakharchenko
Hello,
I'll just throw in my take on how Lua may look 'different' in a text
editor with Unicode font support (what you type and what's stored in
the file is still the same plain ASCII, just the presentation is
different). Not that I'm using it though, I just created it for fun
once.
https://pasteboard.co/HIP1375.png
We've been here before.

http://lua-users.org/lists/lua-l/2016-11/threads.html#00213
Tim Hill
2018-10-22 17:30:48 UTC
Permalink
Now, imagine an identifier like B10010100, where each individual "character" is in fact a different "version" of a "0" or a "1". Nightmare!
Especially in Lua, which happily treats any unknown identifier as a valid global variable.
oПο𝖔
Cheers,
V.
+1 .. I realize I’m a native English speaker and so biased, but it seems to me the benefits of Unicode identifiers are far outweighed by the problems created.

—Tim
Thomas Jericke
2018-10-23 07:10:43 UTC
Permalink
Post by Tim Hill
On Tue, Oct 16, 2018 at 2:58 PM Lorenzo Donati
Post by Lorenzo Donati
Now, imagine an identifier like B10010100, where each individual
"character" is in fact a different "version" of a "0" or a "1". Nightmare!
Especially in Lua, which happily treats any unknown identifier as a
valid global variable.
oПο𝖔
Cheers,
V.
+1 .. I realize I’m a native English speaker and so biased, but it
seems to me the benefits of Unicode identifiers are far outweighed by
the problems created.
—Tim
What problem can be so bad that it outweighs the benefit of using 🤮
(U+1F92E) as an identifier?

--

Thomas
Enrico Colombini
2018-10-23 07:17:47 UTC
Permalink
What problem can be so bad that it outweighs the benefit of using 🤮
(U+1F92E) as an identifier?
Would it represent a function returning all arguments?
--
Enrico
Ivan Krylov
2018-10-23 09:23:12 UTC
Permalink
On Tue, 23 Oct 2018 09:10:43 +0200
What problem can be so bad that it outweighs the benefit of using 🤮
(U+1F92E) as an identifier?
Ah, the famous -vomit-frame-pointer optimization?
--
Best regards,
Ivan
Lorenzo Donati
2018-10-25 06:05:40 UTC
Permalink
On Tue, Oct 16, 2018 at 2:58 PM Lorenzo Donati
Post by Lorenzo Donati
Now, imagine an identifier like B10010100, where each individual
"character" is in fact a different "version" of a "0" or a "1". Nightmare!
Especially in Lua, which happily treats any unknown identifier as a
valid global variable.
oоοo𝖔
Cheers,
V.
+1 .. I realize I’m a native English speaker and so biased, but it
seems to me the benefits of Unicode identifiers are far outweighed by
the problems created.
—Tim
What problem can be so bad that it outweighs the benefit of using 🤮
(U+1F92E) as an identifier?
It could be useful in Lua error messages, though. ;-)

That would clearly be a nice substitute for the "assertion error"
message. :-D
--
Thomas