Discussion:
is string.gmatch(), string.upper() 7-bit ascii only?
Marc Balmer
2016-04-07 13:00:32 UTC
I am trying to manipulate text with umlauts. string.upper() does not produce upper case version of umlauts like ä,ö,ü etc.

Also the %g pattern, when used in string.gmatch() does not match these umlauts.

Is there anything that can be done about it? Or, am I making a stupid mistake?
k***@cioccolatai.it
2016-04-07 13:19:29 UTC
Post by Marc Balmer
I am trying to manipulate text with umlauts. string.upper() does not produce upper case version of umlauts like ä,ö,ü etc.
Also the %g pattern, when used in string.gmatch() does not match these umlauts.
Is there anything that can be done about it? Or, am I making a stupid mistake?
***@katom:~$ lua
Lua 5.2.3 Copyright (C) 1994-2013 Lua.org, PUC-Rio
> =os.setlocale()
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=C;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C
> s="abòè"
> =s:match("%g+")
ab
> =s:upper()
ABòè
> =os.setlocale("it_IT")
it_IT
> return s:upper()
ABòè
> return s:match("%g+")
abòè

ciao,
I.
Roberto Ierusalimschy
2016-04-07 13:27:57 UTC
Post by Marc Balmer
I am trying to manipulate text with umlauts. string.upper() does not produce upper case version of umlauts like ä,ö,ü etc.
Also the %g pattern, when used in string.gmatch() does not match these umlauts.
Is there anything that can be done about it? Or, am I making a stupid mistake?
http://www.lua.org/manual/5.3/manual.html#6.4
[...]
The string library assumes one-byte character encodings.

If you are using an 8-bit encoding (e.g., LATIN 1), then these should
work, given a proper locale. Otherwise (e.g., UTF-8), you will need an
external library.

-- Roberto
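
As a hedged illustration of the manual's statement (a sketch, not part of Roberto's message): in the portable "C" locale, string.upper only changes the ASCII letters, so the two bytes that UTF-8 uses for 'ä' pass through untouched. The bytes are written as decimal escapes so the sketch does not depend on the encoding of the source file or terminal; the variable names are made up for this example.

```lua
-- Force the portable "C" locale so the result does not depend on the system.
os.setlocale("C")

-- "ä" in UTF-8 is the two bytes 0xC3 0xA4 (decimal 195, 164).
local a_umlaut_utf8 = "\195\164"
local s = "ab" .. a_umlaut_utf8

-- The string library counts bytes, not characters.
print(#s)                            -- 4

-- upper() maps each byte on its own; 195 and 164 are not ASCII letters.
local up = s:upper()
print(up == "AB" .. a_umlaut_utf8)   -- true
print(up:byte(1, -1))                -- 65 66 195 164 (tab-separated)
```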
Marc Balmer
2016-04-07 14:40:08 UTC
Post by Roberto Ierusalimschy
Post by Marc Balmer
I am trying to manipulate text with umlauts. string.upper() does not produce upper case version of umlauts like ä,ö,ü etc.
Also the %g pattern, when used in string.gmatch() does not match these umlauts.
Is there anything that can be done about it? Or, am I making a stupid mistake?
http://www.lua.org/manual/5.3/manual.html#6.4
[...]
The string library assumes one-byte character encodings.
If you are using an 8-bit encoding (e.g., LATIN 1), then these should
work, given a proper locale. Otherwise (e.g., UTF-8), you will need an
external library.
Well, at least on an Ubuntu 14.04 system it does not work. But of course I don't blame Lua if the toupper() C function supplied by the underlying OS doesn't do the job:

$ sudo locale-gen de_CH
Generating locales...
de_CH.ISO-8859-1... done
Generation complete.
$ lua
Lua 5.3.2 Copyright (C) 1994-2015 Lua.org, PUC-Rio
> = os.setlocale('de_CH.ISO-8859-1')
de_CH.ISO-8859-1
> = string.upper('äöü')
äöü
(Expected is 'ÄÖÜ')

- MARC
Roberto Ierusalimschy
2016-04-07 15:20:43 UTC
Post by Marc Balmer
$ sudo locale-gen de_CH
Generating locales...
de_CH.ISO-8859-1... done
Generation complete.
$ lua
Lua 5.3.2 Copyright (C) 1994-2015 Lua.org, PUC-Rio
> = os.setlocale('de_CH.ISO-8859-1')
de_CH.ISO-8859-1
> = string.upper('äöü')
äöü
(Expected is 'ÄÖÜ')
As far as I know, Ubuntu terminals (and pretty much everything else)
work with UTF-8. Try this:

$ lua
> #'äöü'
-- Roberto
Michal Kottman
2016-04-08 10:54:17 UTC
(Expected is 'ÄÖÜ')
You need an external library which understands all of Unicode. Not
advocating anything in particular, just picking a first 'Lua utf8' library
that a simple web search returned:

$ sudo luarocks install luautf8
$ lua
Lua 5.1.5 Copyright (C) 1994-2012 Lua.org, PUC-Rio
> utf8 = require 'lua-utf8'
> = utf8.upper('äöü')
ÄÖÜ
> for m in utf8.gmatch('äöü', '%g+') do print(m) end
äöü
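
For readers who cannot install a library, here is a minimal pure-Lua sketch of a much narrower fallback. It is not how lua-utf8 works and it does not understand Unicode in general: it assumes valid UTF-8 input and only uppercases ASCII a-z plus the letters U+00E0..U+00FE (à..þ), which covers ä, ö and ü. The function name upper_basic_utf8 is hypothetical.

```lua
-- Hypothetical helper: uppercases ASCII letters and the UTF-8 encoded
-- letters U+00E0..U+00FE; every other byte sequence is left untouched
-- (so ß, ÿ and anything outside this range are not handled).
local function upper_basic_utf8(s)
  -- ASCII a-z: subtract 32 from the byte value (independent of the locale).
  s = s:gsub("[a-z]", function(c) return string.char(c:byte() - 32) end)
  -- U+00E0..U+00FE are encoded as byte 0xC3 followed by 0xA0..0xBE;
  -- their uppercase forms are 0xC3 followed by 0x80..0x9E.
  s = s:gsub("\195([\160-\190])", function(c)
    local b = c:byte()
    if b == 183 then return nil end  -- U+00F7 (division sign): keep as is
    return "\195" .. string.char(b - 32)
  end)
  return s
end

-- "grün" -> "GRÜN", written as explicit bytes
print(upper_basic_utf8("gr\195\188n") == "GR\195\156N")  -- true
```

Anything beyond this small range (for example Greek, Cyrillic, or locale-specific rules) is the kind of work an external library is meant for.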
Marc Balmer
2016-04-08 12:07:12 UTC
(Expected is 'ÄÖÜ')
$ sudo luarocks install luautf8
$ lua
Lua 5.1.5 Copyright (C) 1994-2012 Lua.org, PUC-Rio
> utf8 = require 'lua-utf8'
> = utf8.upper('äöü')
ÄÖÜ
> for m in utf8.gmatch('äöü', '%g+') do print(m) end
äöü
Note that I used ISO-8859-1 in my example, not UTF-8.
Javier Guerra Giraldez
2016-04-08 13:06:34 UTC
Post by Marc Balmer
Note that I used ISO-8859-1 in my example, not UTF-8.
But Roberto's point is that since it was typed in your terminal, and
actually readable, it probably was UTF-8. Unless you changed the
terminal's setting too.
--
Javier
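
A small sketch of the check Roberto suggested, assuming the three letters are spelled out as explicit bytes instead of being typed at a terminal: the length operator counts bytes, so the same text gives 6 in UTF-8 and 3 in ISO-8859-1. The variable names are illustrative only.

```lua
-- "äöü" as UTF-8: each letter takes two bytes.
local utf8_text   = "\195\164\195\182\195\188"
-- "äöü" as ISO-8859-1: each letter takes one byte.
local latin1_text = "\228\246\252"

print(#utf8_text)    -- 6
print(#latin1_text)  -- 3
```

If #'äöü' typed at the prompt prints 6, the terminal sent UTF-8, and a Latin-1 locale inside Lua cannot make string.upper handle it.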
Marc Balmer
2016-04-08 23:02:29 UTC
Post by Javier Guerra Giraldez
Post by Marc Balmer
Note that I used ISO-8859-1 in my example, not UTF-8.
But Roberto's point is that since it was typed in your terminal, and
actually readable, it probably was UTF-8. Unless you changed the
terminal's setting too.
Dang… That's a good point indeed…
Coda Highland
2016-04-08 20:44:50 UTC
Post by Michal Kottman
Post by Marc Balmer
(Expected is 'ÄÖÜ')
You need an external library which understands all of Unicode. Not
advocating anything in particular, just picking a first 'Lua utf8' library
$ sudo luarocks install luautf8
$ lua
Lua 5.1.5 Copyright (C) 1994-2012 Lua.org, PUC-Rio
> utf8 = require 'lua-utf8'
> = utf8.upper('äöü')
ÄÖÜ
> for m in utf8.gmatch('äöü', '%g+') do print(m) end
äöü
It should be noted that "uppercase" isn't always a trivial matter anyway.

In Turkish and some other languages borrowing from its script, the
uppercase form of i is İ, and the lowercase form of I is ı. These are
distinct vowels, but Unicode only encodes the glyphs in this case, not
the semantic distinction between the two. (Unicode is an ugly mess in
that way -- sometimes it encodes identical but semantically-distinct
characters separately; sometimes it combines them.) I and i are still
U+0049 and U+0069, respectively, but İ is U+0130 and ı is U+0131.

/s/ Adam
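
To make the Turkish case concrete, here is a minimal sketch, assuming UTF-8 strings and a caller-supplied flag; it handles only the letters i and ı and is not a general case-mapping routine. The helper name upper_i and the flag are hypothetical.

```lua
local CAPITAL_I_WITH_DOT = "\196\176"  -- U+0130 as UTF-8
local SMALL_DOTLESS_I    = "\196\177"  -- U+0131 as UTF-8

-- Hypothetical helper: uppercases only "i" and dotless "ı", choosing the
-- mapping for "i" from a flag because the code point alone cannot tell.
local function upper_i(s, turkish)
  s = s:gsub(SMALL_DOTLESS_I, "I")  -- ı -> I under both conventions
  s = s:gsub("i", turkish and CAPITAL_I_WITH_DOT or "I")
  return s
end

print(upper_i("i", false) == "I")                 -- true
print(upper_i("i", true) == CAPITAL_I_WITH_DOT)   -- true
```

The same input byte needs two different results, which is why a single fixed lookup table cannot be correct for every language.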
sur-behoffski
2016-04-09 22:52:52 UTC
The wider topic of the C locale, ASCII and 7 bits is currently being
POSIX ... says that LC_ALL=C is _required_ to treat all 256 byte values as
valid characters
Although that was the intent of POSIX, it's not what the current standard
says, and it's not what many popular platforms do. Problematic platforms
include Fedora 23, where mbrtowc reports an encoding error in the C locale
when given a byte outside the range 0-127. This affects many programs other
than 'grep'.

This bug in the standard is intended to be fixed in a future version of
POSIX (see <http://austingroupbugs.net/view.php?id=663#c2738>). I suppose
glibc and eventually Fedora will be fixed to conform to the new standard in
due course.

Perhaps grep should work around this problem on systems like Fedora 23 where
the underlying C library does not conform to the next version of POSIX. It
sounds like a new gnulib module or two might do the trick. This should fix
the problems that Björn mentions.

In the meantime grep -a is the way to go. Yes, it's not portable to non-GNU
grep, but there is no portable solution given the abovementioned POSIX
problems, so a GNU-grep-only workaround is all one can reasonably ask for.

Also, there are a number of hairy cases in GNU grep regarding Unicode and
upper/lower case handling, as grep tries to provide case-insensitive matching:

/* The set of wchar_t values C such that there's a useful locale
somewhere where C != towupper (C) && C != towlower (towupper (C)).

For example, 0x00B5 (U+00B5 MICRO SIGN) is in this table, because:

towupper (0x00B5) == 0x039C (U+039C GREEK CAPITAL LETTER MU), and
towlower (0x039C) == 0x03BC (U+03BC GREEK SMALL LETTER MU).
*/

Grep's definition of case-insensitive matching is effectively (pseudocode):

#define CI_MATCHES(a, b) (towupper(a) == towupper(b))
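
The pseudocode can be tried out in Lua with a toy mapping. This is only a sketch: the two tables below contain just the code points mentioned in the comment above (plus the standard mapping of U+03BC to U+039C) and stand in for the C library's towupper and towlower, which Lua does not expose.

```lua
-- Toy stand-ins for towupper/towlower, limited to three code points.
local upper_map = { [0x00B5] = 0x039C, [0x03BC] = 0x039C }
local lower_map = { [0x039C] = 0x03BC }

local function towupper(c) return upper_map[c] or c end
local function towlower(c) return lower_map[c] or c end

-- Same definition as the CI_MATCHES pseudocode.
local function ci_matches(a, b)
  return towupper(a) == towupper(b)
end

-- MICRO SIGN and GREEK SMALL LETTER MU match case-insensitively ...
print(ci_matches(0x00B5, 0x03BC))            -- true
-- ... although upper-then-lower does not return to MICRO SIGN.
print(towlower(towupper(0x00B5)) == 0x00B5)  -- false
```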

Even within the Basic Multilingual Plane, holes have deliberately been left
by various encoding sets at various points. For example, from the Wikipedia
article on ISO/IEC 10646:

The system deliberately leaves many code points not assigned to characters,
even in the BMP. It does this to allow for future expansion or to minimize
conflicts with other encoding forms.

---------

So, bringing the focus back to Lua, which uses the system's libc in order to
be portable, the ISO-8859-1 locale, rather than the C or POSIX locale, might
be a useful locale for simple cases. However, my experience in using anything
other than C or English locales is very limited (both at the terminal emulation
level, and the within-Lua string handling level), so I'll stop here, and let
more experienced people speak.

cheers,

sur-behoffski (Brenton Hoff)
Programmer, Grouse Software.
