Lua 5.1.4 with Traditional Chinese

Discussion:

tjhack

2015-01-12 03:50:10 UTC

Hi all,

I converted some Lua scripts (they contain Chinese characters) from Simplified Chinese to Traditional Chinese, and the Chinese characters are now encoded in cp950.

I then switched my Win7 machine's locale to zh_TW and restarted. Everything seemed okay: the scripts with Traditional Chinese characters displayed correctly.

But when I compiled these scripts, I got an error: invalid escape string.

for example:
msg="外功系普攻攻擊" print(msg)
the result is:

外巨t普攻攻擊

Looking at the hex of the string, it is
\xa5~\xa5\\\xa8t\xb4\xb6\xa7\xf0\xa7\xf0\xc0\xbb
so it seems Lua does not escape the string correctly.

Now the question is: can I solve this? How can I get the script to compile successfully? My source scripts cannot be converted to UTF-8; if they could, it would be easy.

Thanks.

Lorenzo Donati

2015-01-12 08:16:57 UTC

Post by tjhack
Hi all,
I converted some Lua scripts (they contain Chinese characters) from Simplified Chinese to Traditional Chinese, and the Chinese characters are now encoded in cp950.
I then switched my Win7 machine's locale to zh_TW and restarted. Everything seemed okay: the scripts with Traditional Chinese characters displayed correctly.
But when I compiled these scripts, I got an error: invalid escape string.
msg="外功系普攻攻擊" print(msg)
外巨t普攻攻擊
Looking at the hex of the string, it is
\xa5~\xa5\\\xa8t\xb4\xb6\xa7\xf0\xa7\xf0\xc0\xbb
so it seems Lua does not escape the string correctly.
Now the question is: can I solve this? How can I get the script to compile successfully? My source scripts cannot be converted to UTF-8; if they could, it would be easy.

Probably you should encode your UTF-8 characters with escape sequences
inside strings, because Lua opens files for parsing in text mode, and
thus relies on the underlying C text mode file handling routines. UTF-8
files may not reach the parser unaltered, depending on the
transformations done by the C routines (which are NOT UTF-8 aware).

Lua 5.3 may help you a bit, since it introduces new UTF-8 escape
sequences: in Lua 5.3 you may enter the encoding of a Unicode code-point
directly as \u{NNN} instead of typing the corresponding encoded sequence
as possibly several \xNN escapes.
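
A minimal sketch of the escape-sequence idea, applied to the cp950 bytes from the hex dump in the original post. It assumes Lua 5.1, where only decimal \ddd escapes are available (\xNN needs Lua 5.2 or later, and \u{NNN} needs 5.3 and always yields UTF-8). The name `msg` is only illustrative.

```lua
-- Bytes taken from the hex dump in the original post:
-- a5 7e | a5 5c | a8 74 | b4 b6 | a7 f0 | a7 f0 | c0 bb
-- Every byte is written as a decimal escape, so the lexer never meets a
-- raw 0x5C (backslash) inside the string literal.
local msg = "\165\126\165\092\168\116\180\182\167\240\167\240\192\187"

print(#msg)         --> 14
print(msg:byte(4))  --> 92, the 0x5C trail byte of the second character
```

The obvious cost is that the literal is no longer readable in an editor, so this probably only suits strings produced by a conversion tool.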

Post by tjhack
Thanks.

-- Lorenzo

Dirk Laurie

2015-01-12 08:52:01 UTC

Post by tjhack
I converted some Lua scripts (they contain Chinese characters) from
Simplified Chinese to Traditional Chinese, and the Chinese characters are
now encoded in cp950.
I then switched my Win7 machine's locale to zh_TW and restarted. Everything
seemed okay: the scripts with Traditional Chinese characters displayed
correctly.
But when I compiled these scripts, I got an error: invalid escape string.
msg="外功系普攻攻擊"
print(msg)
外巨t普攻攻擊

If I run your code in the interactive interpreter under Ubuntu,
I get this:

$ LANG="zh_TW" lua5.1
Lua 5.1.5 Copyright (C) 1994-2012 Lua.org, PUC-Rio

> g3;ff.f;;f

stdin:1: '=' expected near ';'

I.e. the input is already mangled by the reader.

However, if I save the code to a file, and then run it, it is fine:

$ LANG="zh_TW" lua5.1 < /tmp/chinese.lua
外功系普攻攻擊

Interestingly, for LuaJIT the interactive interpreter also works.
Can the culprit be the readline library (not a default option for
LuaJIT)? Let's try. Remove the line
"#define LUA_USE_READLINE"
from luaconf.h and recompile.

lua-5.1-noreadline$ LANG="zh_TW" src/lua
Lua 5.1.5 Copyright (C) 1994-2012 Lua.org, PUC-Rio

> msg="外功系普攻攻擊"
> print(msg)

外功系普攻攻擊

Yes!

But under Windows LUA_USE_READLINE is not defined anyway.
We must look for something else.

What happens under Windows if you run the program
as a script file instead of interactively?

Johnson Lin

2015-01-12 09:30:20 UTC

It is cool to meet another Traditional Chinese Lua user here!

That's a very very famous cp950 (Big5) encoding problem called the "許功蓋" issue
(please kindly refer to Wikipedia:
http://zh.wikipedia.org/wiki/%E5%A4%A7%E4%BA%94%E7%A2%BC)

I would say UTF-8 is definitely the way to go.

For our game, we store our Trad. Chinese and Japanese Lua scripts in
UTF-8-without-BOM format (notepad++ is a very good helper to do that, even
if you are not sure what encoding the current file is in), and usually if
you want to feed the strings to the underlying C/C++ libraries, for
instance FreeType for font rendering, that's not a problem at all if you
make sure you pass the original unmangled byte stream into it.

If you happen to need to manipulate it at Lua level, utf8_simple is good
enough to handle most cases:
https://github.com/Pogs/lua-utf8-simple

Hope it helps!

best,

tjhack

2015-01-12 09:29:02 UTC

Post by Dirk Laurie
But under Windows LUA_USE_READLINE is anyway not defined.
We must look for something else.
What happens under Windows if you run the program
as a script file instead of interactively?

Note that my script is encoded in cp950 (zh_TW.BIG5). I tried this script on my Windows machine with the locale set to zh_TW and the file encoded in cp950:
msg="外功系普攻攻擊"
print(msg)
msg="å®¶æéç©"
print(msg)

The result is:
C:\Users\tjhack\Desktop>lua test.lua
lua: test.lua:3: unfinished string near '"å®¶æé?'

Then I set my CentOS server's locale to LANG="zh_TW.BIG5" and tried this:
[***@plsyserver ~]$ lua test.lua
lua: test.lua:3: unfinished string near '"å®¶æé'

The result is the same. With
msg="外功系普攻攻擊"
print(msg)

the result is also the same:
外巨t普攻攻擊

--tjhack

Johnson Lin

2015-01-12 09:44:01 UTC

Hello tjhack,

Did you read my message first though? Do you absolutely need to keep the
cp950 encoding in your script files? For what reason?

best,

tjhack

2015-01-12 09:53:15 UTC

Post by Johnson Lin
Hello tjhack,
Did you read my message first though? Do you absolutely need to keep the cp950 encoding in your script files? For what reason?
best,

Thanks, I have read it, but our game's source scripts are encoded in cp936, and converting them to UTF-8 is a lot of work; it is not my decision to make.

best wishes

Johnson Lin

2015-01-12 10:21:18 UTC

Hey there,

Even if that's the case, I would still urge you to investigate a way to
automate this encoding conversion and convert your scripts to UTF-8.

GB and Big5 are both obsolete, and all workarounds have their limits. For
Big5's 許功蓋 issue, usually the lowest-cost fix is to add an additional
escape ('\') wherever a mangled character resides. But then you can no
longer store the string in a human-readable form. Alternatively, wrap all
related string usages in a function call that parses the raw byte data,
seeks to where the additional escapes are needed, and puts them in.
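
The second workaround could look roughly like the sketch below. `escape_big5_trail` is a hypothetical helper, not part of any library, and it assumes cp950 lead bytes lie in 0x81-0xFE and are always followed by exactly one trail byte.

```lua
-- Hypothetical helper: walk a Big5/cp950 byte string and add an extra
-- backslash only where 0x5C appears as the trail byte of a double-byte
-- character. Assumes lead bytes are in the range 0x81-0xFE.
local function escape_big5_trail(s)
  local out, i, n = {}, 1, #s
  while i <= n do
    local b = s:byte(i)
    if b >= 0x81 and b <= 0xFE and i < n then
      -- Double-byte character: copy both bytes, doubling a 0x5C trail byte.
      local trail = s:sub(i + 1, i + 1)
      out[#out + 1] = s:sub(i, i) .. (trail == "\\" and "\\\\" or trail)
      i = i + 2
    else
      -- Single-byte character: copied unchanged, so escapes written on
      -- purpose (such as \n) are left alone.
      out[#out + 1] = s:sub(i, i)
      i = i + 1
    end
  end
  return table.concat(out)
end

-- First three characters of the string from the original post
-- (a5 7e, a5 5c, a8 74), written with decimal escapes.
local raw = "\165\126\165\092\168\116"
local fixed = escape_big5_trail(raw)
print(#raw, #fixed)  -- the fixed text is one byte longer
```

The output is meant to be pasted between the quotes of a string literal in a source file; running it over text that has already been fixed would double the backslashes again.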

But for some other usages, that additional escape may surprisingly (or
rather not surprisingly) cause other problems.

best,

Tom N Harris

2015-01-12 11:19:20 UTC

Post by Johnson Lin
GB and Big5 are both obsolete, and all workarounds have their limits. For
Big5's 許功蓋 issue, usually the lowest-cost fix is to add an additional
escape ('\') wherever a mangled character resides. But then you can no
longer store the string in a human-readable form. Alternatively, wrap all
related string usages in a function call that parses the raw byte data,
seeks to where the additional escapes are needed, and puts them in.

My suggestion would be to move the Big5 strings out of the source code and
into a file that is read when the program starts. A gettext library will do
this though it may be overkill. A simple list of keys and strings that is
parsed into a table will do.

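local MESSAGES = {}
-- "messages.txt" is a placeholder name for the external string file
local msgfile = assert(io.open("messages.txt", "r"))
for line in msgfile:lines() do
  local key, str = line:match "(%w[%w%d_-]*):(.*)"
  if key then MESSAGES[key] = str end -- skip lines that do not match
end
msgfile:close()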
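
A self-contained variant for experimenting without a file on disk: the `data` string stands in for the file contents, and the key names are made up for illustration. It suggests why this route sidesteps the problem: the bytes are read as data and never pass through the Lua lexer, so a 0x5C trail byte is kept as-is.

```lua
-- Stand-in for the contents of the message file. The value of "attack" is
-- the cp950 byte sequence from the original post, written with decimal
-- escapes only so that this example itself stays pure ASCII.
local data = "attack:\165\126\165\092\168\116\180\182\167\240\167\240\192\187\n"
          .. "greeting:hello\n"

local MESSAGES = {}
for line in data:gmatch("[^\n]+") do
  local key, str = line:match("^(%w[%w_-]*):(.*)$")
  if key then MESSAGES[key] = str end  -- ignore lines without "key:"
end

print(#MESSAGES.attack)         --> 14
print(MESSAGES.attack:byte(4))  --> 92, the 0x5C byte survived
```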

--
tom <***@whoopdedo.org>

Paul Merrell

2015-01-12 09:17:45 UTC

Post by tjhack
Now the question is: can I solve this? How can I get the script to compile
successfully? My source scripts cannot be converted to UTF-8; if they
could, it would be easy.

If working with utf-8 would cure your problem, check
<https://github.com/starwing/luautf8>.

We've been working with that library embedded in NoteCase Pro along
with Lua 5.2.x for many months now. I haven't hit any problems so far
other than utf8.title returning all capital characters, the same
output as utf8.upper.

The library has been tested with Lua 5.2.3 and LuaJIT.

On the other hand, I haven't done any testing using CJK characters.
We have some CJK users, but I don't know if any of them have written
scripts.

Best regards,

Paul

--
[Notice not included in the above original message: The U.S. National
Security Agency neither confirms nor denies that it intercepted this
message.]

Paul Merrell

2015-01-12 09:28:31 UTC

Post by Paul Merrell
We've been working with that library embedded in NoteCase Pro along
with Lua 5.2.x for many months now. I haven't hit any problems so far
other than utf8.title returning all capital characters, the same
output as utf8.upper.

I should have added that NoteCase Pro runs on Windows (all variants
since Windows 9x), OS X (starting from version 10.4), Linux (many
distributions, also for mobile devices), Solaris, and FreeBSD. No
utf8 issues reported by any users with any of the cross platform
scripts we ship that use the lua-utf8 library (about 30 of them).

Best regards,

Paul


Hao Wu

2015-01-12 16:54:40 UTC

Post by tjhack
Hi all,
I converted some Lua scripts (they contain Chinese characters) from
Simplified Chinese to Traditional Chinese, and the Chinese characters are
now encoded in cp950.
I then switched my Win7 machine's locale to zh_TW and restarted. Everything
seemed okay: the scripts with Traditional Chinese characters displayed
correctly.
But when I compiled these scripts, I got an error: invalid escape string.
msg="外功系普攻攻擊"
print(msg)
外巨t普攻攻擊
Looking at the hex of the string, it is
\xa5~\xa5\\\xa8t\xb4\xb6\xa7\xf0\xa7\xf0\xc0\xbb
so it seems Lua does not escape the string correctly.

Baking literal strings into the source file is not a good idea anyway,
and apparently the Lua parser is not compatible with encodings that are
not ASCII-compatible.

I would suggest putting the strings into a separate lookup file, which can
be encoded and preprocessed so that Lua can read them.
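
One possible shape for such a preprocessing step, sketched under the assumption that the raw cp950 strings are already in a table (for example, read from a lookup file). `to_lua_source` is a hypothetical name. It relies on `string.format("%q", ...)`, which escapes backslashes and quotes, so a 0x5C trail byte is written out as `\\`.

```lua
-- Hypothetical preprocessing step: turn a table of raw cp950 strings into
-- Lua source text that the lexer reads back byte for byte.
local function to_lua_source(messages)
  local keys = {}
  for key in pairs(messages) do keys[#keys + 1] = key end
  table.sort(keys)  -- deterministic output order
  local lines = { "return {" }
  for _, key in ipairs(keys) do
    lines[#lines + 1] = string.format("  [%q] = %q,", key, messages[key])
  end
  lines[#lines + 1] = "}"
  return table.concat(lines, "\n")
end

-- The string from the original post, as cp950 bytes.
local raw = { attack = "\165\126\165\092\168\116\180\182\167\240\167\240\192\187" }
local source = to_lua_source(raw)

-- Round trip: compile the generated text and compare with the input.
local compile = loadstring or load  -- loadstring in Lua 5.1, load in 5.2+
local loaded = assert(compile(source))()
print(loaded.attack == raw.attack)  --> true
```

In a real build the generated text would be written to a .lua file and loaded with `dofile` or `require`; the in-memory round trip here only checks that the bytes survive.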

Post by tjhack
Now the question is: can I solve this? How can I get the script to compile
successfully? My source scripts cannot be converted to UTF-8; if they
could, it would be easy.
Thanks.