Discussion:
Would Lua support variable names with non-ASCII characters?
奥斯陆君王
2018-10-15 15:55:52 UTC
Permalink
LuaJIT does support this feature.
And for UTF-8, all we need is to change the following in llex.c (lines 558-571):

if (lislalpha(ls->current)) {  /* identifier or reserved word? */
  TString *ts;
  do {
    save_and_next(ls);
  } while (lislalnum(ls->current));
  ts = luaX_newstring(ls, luaZ_buffer(ls->buff),
                      luaZ_bufflen(ls->buff));
  seminfo->ts = ts;
  if (isreserved(ts))  /* reserved word? */
    return ts->extra - 1 + FIRST_RESERVED;
  else {
    return TK_NAME;
  }
}

to

if (lislalpha(ls->current) || ls->current & 0x80) {  /* identifier or reserved word? */
  TString *ts;
  do {
    save_and_next(ls);
  } while (lislalnum(ls->current) || ls->current & 0x80);
  ts = luaX_newstring(ls, luaZ_buffer(ls->buff),
                      luaZ_bufflen(ls->buff));
  seminfo->ts = ts;
  if (isreserved(ts))  /* reserved word? */
    return ts->extra - 1 + FIRST_RESERVED;
  else {
    return TK_NAME;
  }
}
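The patched rule can be sketched in Python to see exactly what it admits (a rough model; `is_ident_byte` and `scan_identifier` are illustrative names, not Lua internals). Note that because it only tests the high bit of each byte, it accepts any multi-byte UTF-8 sequence -- including invalid ones:

```python
# Rough Python model of the patched C rule: any byte with the high bit
# set (>= 0x80) counts as an identifier character, in addition to ASCII
# letters, digits, and underscore.  Invalid UTF-8 passes just as easily.

def is_ident_byte(b, first=False):
    if b >= 0x80:                      # the `ls->current & 0x80` test
        return True
    c = chr(b)
    return (c.isalpha() or c == "_") if first else (c.isalnum() or c == "_")

def scan_identifier(src):
    """Return the longest identifier prefix of the bytes object `src`."""
    if not src or not is_ident_byte(src[0], first=True):
        return b""
    i = 1
    while i < len(src) and is_ident_byte(src[i]):
        i += 1
    return src[:i]

print(scan_identifier("变量1 = 2".encode("utf-8")))  # UTF-8 name accepted
print(scan_identifier(b"\xff\xfe = 2"))              # invalid UTF-8 accepted too
```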

It's very easy. Will Lua 5.4 support it?
Luiz Henrique de Figueiredo
2018-10-15 16:29:58 UTC
Permalink
And for UTF-8, all we need is to change the following in llex.c (lines 558-571)
Or change lctype.c. See http://lua-users.org/lists/lua-l/2009-10/msg00104.html
It's very easy. Will Lua 5.4 support it?
Probably not. See the thread above.
Dirk Laurie
2018-10-17 06:19:23 UTC
Permalink
On Mon, 15 Oct 2018 at 18:30, Luiz Henrique de Figueiredo wrote:
Post by Luiz Henrique de Figueiredo
And for UTF-8, all we need is to change the following in llex.c (lines 558-571)
Or change lctype.c. See http://lua-users.org/lists/lua-l/2009-10/msg00104.html
It's very easy. Will Lua 5.4 support it?
Probably not. See the thread above.
In a later post,
http://lua-users.org/lists/lua-l/2011-05/msg00543.html, Luiz spelt it out:
Post by Luiz Henrique de Figueiredo
Note that you can also provide your own lctype.c without patching
the one in the Lua core. The linker will use yours instead.
We also had some fun last year in this thread:

http://lua-users.org/lists/lua-l/2017-04/msg00395.html
Lorenzo Donati
2018-10-16 12:57:50 UTC
Permalink
On 15/10/2018 17:55, 奥斯陆君王 wrote:

[...]
Post by 奥斯陆君王
It's very easy.Will lua 5.4 support it?
I hope it never will!

Sorry, it is not about any cultural prejudice (I know many people,
especially Asian people, could feel discriminated by such a stance, but
it is not my intention).

It is just a matter of convenience and "safety". It is not worth opening
such a big can of worms, IMO.

I started programming by learning through trial and error what it means
to use "0", "O", and "o" carelessly as characters in identifiers. The
same goes for "l" and "1".

That is, any subset of characters likely to have similar glyphs in some
font is going to cause grief in some cases without proper programming
practices.

Allow the whole UNICODE mess into identifiers and the chances of
mistaking one symbol for another skyrocket! I'm not a UNICODE guru, but
I bet my bottom dollar that there are more than a dozen symbols that, in
some font, look like an uppercase Latin "O" (that is, a symbol looking
more or less like a circle). The same goes for other simple-looking
symbols like an uppercase "I" (a vertical "stick" of some sort).

Now, imagine an identifier like B10010100, where each individual
"character" is in fact a different "version" of a "0" or a "1". Nightmare!

These problems are somewhat small annoyances to cope with when you are
dealing with ASCII, where the "problematic" chars are well known,
because every programmer more or less knows what's in *the whole ASCII
set*.

But what the frigging heck is in UNICODE?!? There are gazillions of code
points! There are even not-yet-defined code points!!! WHO knows UNICODE
in its entirety?

How can I be sure that whoever has to use my code, where I inserted a
"unicodishy" identifier, can tell unambiguously which "characters" make
up that identifier?

Is this worth all the hassle? What advantages would this bring to the
programming effort? How much will it cost to track down bugs generated
by the possible mistake?

I doubt there are tangible *net* advantages in *standardizing* UNICODE,
even in its remarkable UTF-8 encoding, as an alphabet for programming.

UNICODE was meant for linguistics and typesetting, not for programming.

And anyway, as LHF pointed out, you can change lctype.c if you have
special needs (which I definitely won't argue against, that's for sure).


BTW, since this is not the first time this "I'd like unicode in my
names" thing comes up, I'd like to see some of the UNICODE gurus on this
list entering a contest of creating the most bedazzling set of
seemingly-identical identifiers using their "utf-8 powers". :-D

I think this will have great educational value for those thinking that
having *generally standardized* UNICODE identifiers is a good idea. ;-)


Cheers!

--Lorenzo
Viacheslav Usov
2018-10-16 13:18:35 UTC
Permalink
Post by Lorenzo Donati
Now, imagine an identifier like B10010100, where each individual
"character" is in fact a different "version" of a "0" or a "1". Nightmare!

Especially in Lua, which happily treats any unknown identifier as a valid
global variable.

oПο𝖔

Cheers,
V.
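Viacheslav's lookalike string is easy to dissect with nothing but Python's standard `unicodedata` module. The sketch below (an editorial aside, not part of the thread's Lua code) shows that the characters are distinct code points, and how Python's own Unicode-identifier rule (PEP 3131, which NFKC-normalizes names) treats them:

```python
import unicodedata

# Four characters that can render identically, yet are distinct code points:
# Latin, Cyrillic, Greek, and a mathematical-alphanumeric letter.
lookalikes = ["o", "о", "ο", "𝖔"]
for ch in lookalikes:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Python admits Unicode identifiers but NFKC-normalizes them (PEP 3131):
# the math-alphanumeric form folds to plain 'o', while the Cyrillic and
# Greek letters do not -- so three visually identical identifiers would
# still name three different globals.
print(unicodedata.normalize("NFKC", "𝖔"))          # 'o'
print(unicodedata.normalize("NFKC", "о") == "o")   # False
```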
Lorenzo Donati
2018-10-16 17:11:37 UTC
Permalink
Post by Lorenzo Donati
Post by Lorenzo Donati
Now, imagine an identifier like B10010100, where each individual
"character" is in fact a different "version" of a "0" or a "1". Nightmare!
Especially in Lua, which happily treats any unknown identifier as a valid
global variable.
oоοo𝖔
Cheers,
V.
Yepp!!

And since there are few fonts (AFAIK) that cover the entire UNICODE set,
we will need editors capable of automatically rendering source code by
mixing and matching different glyphs from different fonts.

Word processors for source code, anyone? (*Ouch!*)

Imagine: use a ~1GB application to write a ~100-line script (~1kB of
source code) in a language whose implementation is ~1MB. That's
minimalism! :-)

Moreover, the same editors will also have to be hex editors (*Urgh!*),
because we will need the ability to look at the actual encoding of
glyphs to discriminate those visually ambiguous identifiers (something
that is so easy in ASCII, e.g. by switching to a monospaced font).

Then the same editor would need the ability to map the encoding to its
standard UNICODE representation, just because otherwise we would also
need to remember all the possible UTF-8 sequences and their meanings.
(*Arghh!!*)

Then....

The more I think about it, the more it seems a possible representation
of Hell (in the biblical sense) for a programmer: spend eternity
learning every UTF-8 sequence, its mapping to code points, and their
possible visual representations with glyphs in an infinite number of
fonts, none of which can ever cover the whole UNICODE plane-set.

In comparison, solving the halting problem is just purgatory! (*<grin>*)

There are up sides, though. Imagine how many nice and mind-boggling
pranks you could do to your colleague programmers! :-]]

Cheers! :-D

-- Lorenzo
Sean Conner
2018-10-16 17:47:12 UTC
Permalink
Post by Lorenzo Donati
Imagine: use a ~1GB application to write a ~100 lines script (~1kB
source code) of a language whose implementation is ~1MB. That's
minimalism! :-)
There are people who already do that (I'm looking at you, Electron! [1])
Post by Lorenzo Donati
Moreover, the same editors will have to be also hex-editors (*Urgh!*),
because we will need the ability to look at the actual encoding of
glyphs to discriminate those visually-ambiguous identifiers (something
that is so easy in ASCII, e.g. by switching to a monospaced font).
I think it would be easy enough to have an editor that just highlights
Unicode characters outside the range of 0x20 - 0x7E.

-spc (Hmmm ... I may have to look into modifying my editor to do just that
... )

[1] https://electronjs.org/
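Sean's highlighting idea is simple to prototype. Here is a minimal, editor-agnostic sketch in Python (the function name `flag_non_ascii` is illustrative) that reports every character outside printable ASCII (0x20-0x7E) with its line, column, and code point:

```python
# Minimal sketch of the highlighting idea: flag every character outside
# the printable ASCII range (0x20-0x7E), reporting line, column, the
# character itself, and its code point.
def flag_non_ascii(source):
    hits = []
    for lineno, line in enumerate(source.splitlines(), 1):
        for col, ch in enumerate(line, 1):
            if not 0x20 <= ord(ch) <= 0x7E:
                hits.append((lineno, col, ch, f"U+{ord(ch):04X}"))
    return hits

sample = "local oПο = 42\nprint(oПο)"
for hit in flag_non_ascii(sample):
    print(hit)
```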
Albert Chan
2018-10-16 18:08:44 UTC
Permalink
There are up sides, though. Imagine how many nice and mind-boggling pranks you could do to your colleague programmers! :-]]
Cheers! :-D
-- Lorenzo
Since you might read your own code, the prank might backfire :-D

I think coding is hard enough without worrying about non-ASCII issues ...
Andrew Starks
2018-10-16 20:03:01 UTC
Permalink
Post by Lorenzo Donati
Post by Lorenzo Donati
There are up sides, though. Imagine how many nice and mind-boggling
pranks you could do to your colleague programmers! :-]]
Post by Lorenzo Donati
Cheers! :-D
-- Lorenzo
Since you might read your own code, the prank might backfire :-D
I think coding is hard enough without worrying about non-ASCII issue ...
I’d rather have LPEG matching for variable assignments.

lpeg.r"ah" = 42

assert(b == g and g == "42")
Tim Hill
2018-10-16 22:51:31 UTC
Permalink
Post by Lorenzo Donati
Post by Lorenzo Donati
Post by Lorenzo Donati
Now, imagine an identifier like B10010100, where each individual
"character" is in fact a different "version" of a "0" or a "1". Nightmare!
Especially in Lua, which happily treats any unknown identifier as a valid
global variable.
oоοo𝖔
Cheers,
V.
Yepp!!
And since there are few fonts (AFAIK) that cover the entire UNICODE set, we will need editors capable of automatically rendering source code by mixing and matching different glyphs from different fonts.
Word processors for source code, anyone? (*Ouch!*)
Imagine: use a ~1GB application to write a ~100 lines script (~1kB source code) of a language whose implementation is ~1MB. That's minimalism! :-)
Moreover, the same editors will have to be also hex-editors (*Urgh!*), because we will need the ability to look at the actual encoding of glyphs to discriminate those visually-ambiguous identifiers (something that is so easy in ASCII, e.g. by switching to a monospaced font).
Then the same editor would need the ability to map the encoding to its standard UNICODE representation, just because otherwise we would also need to remember all the possible UTF-8 sequences and their meanings. (*Arghh!!*)
Then....
The more I think about it, the more it seems a possible representation of Hell (in the biblical sense) for a programmer. Spend the eternity learning every UTF-8 sequence, its mapping to code-points and their possible visual representation with glyphs in an infinite number of fonts which never can represent the whole UNICODE plane-set.
In comparison solving the halting problem is just purgatory! (*<grin>*)
There are up sides, though. Imagine how many nice and mind-boggling pranks you could do to your colleague programmers! :-]]
Cheers! :-D
-- Lorenzo
Unicode (noun): A character encoding system designed to make code pages look sensible.

—Tim
Sergey Zakharchenko
2018-10-17 05:31:09 UTC
Permalink
Hello,

I'll just throw in my take on how Lua may look 'different' in a text
editor with Unicode font support (what you type and what's stored in
the file is still the same plain ASCII, just the presentation is
different). Not that I'm using it though, I just created it for fun
once.

https://pasteboard.co/HIP1375.png

https://gist.github.com/szakharchenko/a479fb90af72ef0243710278f1a7eac2

You could create some sort of convention, like '__Uxxx_ in a symbol
name should be displayed as the corresponding Unicode char', and have
the editor do the transformation for you. The elisp to do that is left
as an exercise.
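The display transformation is a one-liner in most languages; here is a Python sketch (my reading of the `__Uxxx_` convention described above, with a variable-length hex field -- not an established standard):

```python
import re

# Sketch of the convention above: render __Uxxx_ escapes inside symbol
# names as the corresponding Unicode character.  The exact escape syntax
# (double underscore, 'U', hex digits, trailing underscore) is one
# possible reading of the proposal.
def display_form(name):
    return re.sub(r"__U([0-9A-Fa-f]+)_",
                  lambda m: chr(int(m.group(1), 16)), name)

print(display_form("local __U3C0_ = 3.14159"))  # local π = 3.14159
```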

Best regards,
--
DoubleF
Dirk Laurie
2018-10-17 06:08:07 UTC
Permalink
On Wed, 17 Oct 2018 at 07:31, Sergey Zakharchenko wrote:
Post by Sergey Zakharchenko
Hello,
I'll just throw in my take on how Lua may look 'different' in a text
editor with Unicode font support (what you type and what's stored in
the file is still the same plain ASCII, just the presentation is
different). Not that I'm using it though, I just created it for fun
once.
https://pasteboard.co/HIP1375.png
We've been here before.

http://lua-users.org/lists/lua-l/2016-11/threads.html#00213
Tim Hill
2018-10-22 17:30:48 UTC
Permalink
Now, imagine an identifier like B10010100, where each individual "character" is in fact a different "version" of a "0" or a "1". Nightmare!
Especially in Lua, which happily treats any unknown identifier as a valid global variable.
oПο𝖔
Cheers,
V.
+1 .. I realize I’m a native English speaker and so biased, but it seems to me the benefits of Unicode identifiers are far outweighed by the problems created.

—Tim
Thomas Jericke
2018-10-23 07:10:43 UTC
Permalink
Post by Tim Hill
On Tue, Oct 16, 2018 at 2:58 PM Lorenzo Donati
Post by Lorenzo Donati
Now, imagine an identifier like B10010100, where each individual
"character" is in fact a different "version" of a "0" or a "1". Nightmare!
Especially in Lua, which happily treats any unknown identifier as a
valid global variable.
oПο𝖔
Cheers,
V.
+1 .. I realize I’m a native English speaker and so biased, but it
seems to me the benefits of Unicode identifiers are far outweighed by
the problems created.
—Tim
What problem can be so bad that it outweighs the benefit of using 🤮
(U+1F92E) as an identifier?

--

Thomas
Enrico Colombini
2018-10-23 07:17:47 UTC
Permalink
What problem can be so bad that it outweighs the benefit of using 🤮
(U+1F92E) as an identifier?
Would it represent a function returning all arguments?
--
Enrico
Ivan Krylov
2018-10-23 09:23:12 UTC
Permalink
On Tue, 23 Oct 2018 09:10:43 +0200
What problem can be so bad that it outweighs the benefit of using 🤮
(U+1F92E) as an identifier?
Ah, the famous -vomit-frame-pointer optimization?
--
Best regards,
Ivan
Lorenzo Donati
2018-10-25 06:05:40 UTC
Permalink
On Tue, Oct 16, 2018 at 2:58 PM Lorenzo Donati
Post by Lorenzo Donati
Now, imagine an identifier like B10010100, where each individual
"character" is in fact a different "version" of a "0" or a "1". Nightmare!
Especially in Lua, which happily treats any unknown identifier as a
valid global variable.
oоοo𝖔
Cheers,
V.
+1 .. I realize I’m a native English speaker and so biased, but it
seems to me the benefits of Unicode identifiers are far outweighed by
the problems created.
—Tim
What problem can be so bad that it outweighs the benefit of using 🤮
(U+1F92E) as an identifier?
It could be useful in Lua error messages, though. ;-)

That would clearly be a nice substitute for the "assertion error"
message. :-D
--
Thomas