Discussion:
New to LUA, trying to read from a file
Brian Sanders
2008-10-08 02:26:34 UTC
Hello, I am new to LUA and was trying to do some basic scripts to get a
feel for it. I have tried to make a basic script to parse through a log
file and display simple output. I have the following code which I thought I
understood just fine.

print ("opening file for reading")
logfile = io.open("system.log","r")
logstring = logfile:read("*all")
print(logstring)

What is confusing me is, I can create a new text document, and just type a
few words in it. Then I can use this script (unless in my many variations I
posted the wrong one) and get the text back out. If I could get past this,
I would then do some small matching off the log file. Unfortunately it only
works when I make my text document. When I grab the actual log file I am
trying to parse, it only returns a space, a square, and then the first
character in the file. When opening this log file in word pad it does look
normal with line returns after each log entry. In notepad however the line
returns are shown as squares. I am therefore led to believe this must have
something to do with formatting of this file, but I really don't know. Can
anyone point me in the right direction here? I just don't see how these
line returns could be the problem when it does not even parse that far, it
just gets to the first character.

Thanks for helping a new guy out,
Brian
--
"Faithless is he, who says 'farewell', when the path darkens
"you just keep on trying till you run out of cake"
Mike Crowe
2008-10-08 02:45:22 UTC
Hi Brian,

Couple of things:

1) How big is your log file? Are you trying to read too much? You may
want to read line-by-line with "*line" instead of "*all", though it is
less efficient.

2) Here's a recent snippet I used:

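local file = assert(io.open(feed, "r"))  -- feed holds the file name
local text = ""  -- must start as a string, or the concatenation below fails
while true do
  local line = file:read("*l")
  if line == nil then break end
  -- prepends each line, so text ends up in reverse line order
  text = line .. "\n" .. text
end
file:close()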
James Dennett
2008-10-08 02:46:33 UTC
Post by Brian Sanders
Hello, I am new to LUA and was trying to do some basic scripts to get a
feel for it. I have tried to make a basic script to parse through a log
file and display simple output. I have the following code which I thought I
understood just fine.
print ("opening file for reading")
logfile = io.open("system.log","r")
logstring = logfile:read("*all")
print(logstring)
What is confusing me is, I can create a new text document, and just type a
few words in it. Then I can use this script (unless in my many variations I
posted the wrong one) and get the text back out. If I could get past this,
I would then do some small matching off the log file. Unfortunately it only
works when I make my text document. When I grab the actual log file I am
trying to parse, it only returns a space, a square, and then the first
character in the file. When opening this log file in word pad it does look
normal with line returns after each log entry. In notepad however the line
returns are shown as squares. I am therefore led to believe this must have
something to do with formatting of this file, but I really don't know. Can
anyone point me in the right direction here? I just don't see how these
line returns could be the problem when it does not even parse that far, it
just gets to the first character.
Thanks for helping a new guy out,
Brian
What's the format of the log file, exactly? In particular, what charset is
it using (range and encoding), and what newline format? A hex dump of (the
start of) the file can be very helpful in guessing that, if it's not
documented somewhere. You mention "notepad", so I might guess that you're
on Windows, which increases the probability that the file is some variant of
UTF-16, possibly UTF-16LE with a BOM (byte order mark). But that's a wild
guess, and not likely to be right.

Unfortunately the simple term "text file" covers a whole family of formats.

-- James
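
For anyone without a hex editor to hand, the hex dump James asks for can be produced from Lua itself. This is only a sketch: the helper name hexdump is made up for illustration, and the sample bytes are an assumed UTF-16LE file start, not the contents of the real log.

```lua
-- Hypothetical helper: return the first n bytes of a string as spaced hex.
local function hexdump(data, n)
  local out = {}
  for i = 1, math.min(n, #data) do
    out[#out + 1] = string.format("%02X", data:byte(i))
  end
  return table.concat(out, " ")
end

-- With a real file, open it in binary mode so no bytes get translated:
--   local f = assert(io.open("system.log", "rb"))
--   print(hexdump(f:read(16) or "", 16))
--   f:close()

-- Assumed sample: a little-endian BOM followed by "X0" as 16-bit units.
print(hexdump("\255\254X\0000\000", 6))  -- FF FE 58 00 30 00
```

Reading only the first 16 bytes is usually enough to spot a byte order mark or a pattern of alternating zero bytes.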
Brian Sanders
2008-10-08 10:24:49 UTC
wow that was fast... let me see if I got all this correct.

First, as far as the format of the log file, I may have to look at it in a
hex editor. I don't know the format they used while writing the file, I
just know it is written as .log and it seems generally expected that word
pad can view it just fine. So perhaps more on that to come, sounds like a
place to start.

shouldn't it be "*a"? Check
http://www.lua.org/manual/5.1/manual.html#pdf-file:read ...
Also, you can try to print the length of the string with
print(#logstring).

Using *a produced the same results, I remember trying multiple ways due to
my google searching. If one is correct and the other is not, that is one
thing I was hoping to learn from this little experiment :)

I put the print statement in to see the length of the string. It returns
2345630, even though it only prints those first few characters. I found
that interesting. Am I getting too large a size for a single string? I
could process each line individually...

Couple of things:

1) How big is your log file? Are you trying to read too much? You may
want to read line-by-line with "*line" instead of "*all", though it is more
inefficient.

2) Here's a recent snippet I used:

local file = io.open(feed,"r")
while true do

local line = file:read("*l")

if line == nil then break end

text = line .. "\n" .. text

end


Well the log file is 2,291KB but I tried a very short one, of only about 4KB
with the exact same results. I tried implementing this line by line code to
see how it would turn out. If I print text outside the end statement, I get
the exact same output as when I read everything at once. I then print the
length of the string as suggested earlier and I end up with 4. I even tried
adding a print of line inside the while loop just to see if I could simply
print each line as it reads it. It prints the exact same text one time. It
appears that reading a line at a time is stopping after the first attempt,
which still has output I don't understand in it.

So I am probably back to, how is this file formatted exactly. I believe the
logs were written with the idea of opening them in word pad for reading, but
I will see about getting a hex editor and comparing this to a standard text
file, which I have seen work.
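
One possible explanation for a large length alongside almost no printed output is zero bytes in the data. Lua strings can hold any byte, so # counts everything that was read, while print in Lua 5.1 hands the string to C's fputs, which stops at the first zero byte (later versions write the full length). The sketch below uses a made-up helper name and sample bytes that are only assumed to resemble the start of such a file.

```lua
-- Hypothetical helper: count the zero bytes in a string.
local function count_zero_bytes(s)
  local n = 0
  for i = 1, #s do
    if s:byte(i) == 0 then n = n + 1 end
  end
  return n
end

-- Assumed sample: two marker bytes, then "X0" with a zero after each character.
local sample = "\255\254X\0000\000"

-- The length includes the zero bytes even if print would cut the text short.
print(#sample, count_zero_bytes(sample))  -- prints 6 and 2
```

If roughly half of the bytes in the real file turn out to be zero, that points at a 16-bit encoding rather than a size limit on strings.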
Tim Channon
2008-10-08 10:29:23 UTC
Post by Brian Sanders
logfile = io.open("system.log","r")
Maybe "rb" would help
Brian Sanders
2008-10-08 10:37:24 UTC
Yeah, I considered the binary as well... but it didn't make a difference.
Got the exact same output.

Looking at this file in a hex editor, I see some interesting stuff. The
file starts with FF EE, then begins with the standard characters. Between
every character is 00, which in the other window translates to just a
square. It appears that these 00's are every other character in the file.
I can't control how these files are written, but does knowing this tell
someone what might be going on?

for example

FF EE 58 00 30 00
?? ?? [ ?? 0 ??

So every other character from then on is expected and is what I see in the
file. I just don't know the starting, or the 00 every other character.
Javier Guerra Giraldez
2008-10-08 10:46:57 UTC
Post by Brian Sanders
Looking at this file in a hex editor, I see some interesting stuff. The
file starts with FF EE, then begins with the standard characters. Between
every character is 00, which in the other window translates to just a
square. It appears that these 00's are every other character in the file.
I can't control how these files are written, but does knowing this tell
someone what might be going on?
yep, that's UCS-2 (or UTF-16)

I wouldn't consider that a 'text file'
--
Javier
Klaus Ripke
2008-10-08 10:53:54 UTC
Post by Brian Sanders
Looking at this file in a hex editor, I see some interesting stuff. The
file starts with FF EE, then begins with the standard characters. Between
Hmm, this should be FF EE, making it UTF-16 or UCS-2 with BOM.
http://en.wikipedia.org/wiki/UTF-16/UCS-2

FF FE is the byte order mark, telling you it's little endian
(lower byte first, the Unicode value of the BOM is U+FEFF),
which comes as no surprise on a Wintel box.
Post by Brian Sanders
every character is 00, which in the other window translates to just a
square. It appears that these 00's are every other character in the file.
FF EE 58 00 30 00
?? ?? [ ?? 0 ??
So every other character from then on is expected and is what I see in the
file. I just don't know the starting, or the 00 every other character.
As long as the actual character values are in Latin 1,
the high byte (every other) is always 0 and you can simply ignore it,
and discard the two-byte BOM (sure you have FF EE?).


HTH
Klaus
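
The byte order mark check described above can be sketched in a few lines of Lua. The function name detect_bom and its return values are invented for this illustration; it only looks at the first three bytes and does not try to tell UTF-16LE apart from UTF-32LE, whose mark also begins with FF FE.

```lua
-- Hypothetical helper: guess the encoding from a leading byte order mark.
-- Returns an encoding name and the BOM length in bytes, or nil if none is found.
local function detect_bom(data)
  local b1, b2, b3 = data:byte(1, 3)
  if b1 == 0xFF and b2 == 0xFE then return "UTF-16LE", 2 end
  if b1 == 0xFE and b2 == 0xFF then return "UTF-16BE", 2 end
  if b1 == 0xEF and b2 == 0xBB and b3 == 0xBF then return "UTF-8", 3 end
  return nil
end

-- Assumed sample: FF FE followed by "X" as a little-endian 16-bit unit.
print(detect_bom("\255\254X\000"))  -- UTF-16LE  2
```

A file without a mark returns nil here, which says nothing either way about its encoding.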
Klaus Ripke
2008-10-08 10:59:23 UTC
Post by Klaus Ripke
Post by Brian Sanders
file starts with FF EE, then begins with the standard characters. Between
Hmm, this should be FF EE, making it UTF-16 or UCS-2 with BOM.
... should be FF FE, making it UTF-16 or UCS-2 with BOM. sry
Brian Sanders
2008-10-08 11:30:24 UTC
Hmm... so if it is in either UTF-16 or UCS-2 with BOM... is there any way
for me to use these log files with a LUA script? I guess it is good to know
that I did understand the LUA tutorials, it was my input file I was not
looking closely enough at.
Matthew Wild
2008-10-08 11:37:39 UTC
Post by Brian Sanders
Hmm... so if it is in either UTF-16 or UCS-2 with BOM... is there any way
for me to use these log files with a LUA script? I guess it is good to know
that I did understand the LUA tutorials, it was my input file I was not
looking closely enough at.
It's a hack, and there is probably a nice(r) way of doing it, but try:

logstring = logstring:sub(3):gsub("%z", "")

It will at least remove the zeros that stop it from printing, but if
you have non-latin characters then they might get messed up.

Matthew.
李辉
2008-10-08 12:07:20 UTC
In liolib.c (version 2.1.3), the read_line function starts at line 291.

It uses "fgets(p, LUAL_BUFFERSIZE, f)" to read from the file (line 297),

but then uses "l = strlen(p);" to get the length (line 301).

So for your file, although fgets reads many bytes from the file (say n),
strlen will return 3 for your data "FF EE 58 00 30 00....",

and the length of this read will be set to 3, so you get only 3 bytes in the
Lua script.

But the file position has advanced by n bytes, so n-3 bytes are lost, and in
the end you get only a few characters.

Maybe it's a bug in Lua?
Luiz Henrique de Figueiredo
2008-10-08 12:11:35 UTC
Post by 李辉
maybe it`s a bug of lua?
No. The notion of lines only makes sense for text files. If you read
binary files (defined as those that contain unprintable bytes -- not
chars), you'll probably get weird results.
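
Since line-oriented reads are unreliable on data full of zero bytes, one alternative is to read everything in binary mode, convert it, and only then split it into lines in Lua. The sketch below covers just the splitting step and assumes the text has already been reduced to single-byte characters; split_lines is a made-up name.

```lua
-- Hypothetical helper: split text into lines, accepting both CRLF and LF endings.
local function split_lines(text)
  local lines = {}
  if text == "" then return lines end
  -- Make sure the last line is terminated so the pattern below picks it up.
  if text:sub(-1) ~= "\n" then text = text .. "\n" end
  for line in text:gmatch("(.-)\r?\n") do
    lines[#lines + 1] = line
  end
  return lines
end

local lines = split_lines("first entry\r\nsecond entry\r\n")
print(#lines, lines[2])  -- 2  second entry
```

Each element can then be matched with string.find or string.match without the line reader ever seeing the raw 16-bit data.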
Brian Sanders
2008-10-08 12:15:06 UTC
Thanks, I see what you did and I will give that a try! I may not get back
to this little project till tomorrow but I will try to give you a quick
update on how it goes.

Thanks again!
Klaus Ripke
2008-10-08 12:55:21 UTC
Post by Matthew Wild
logstring = logstring:sub(3):gsub("%z", "")
It will at least remove the zeros that stop it from printing, but if
you have non-latin characters then they might get messed up.
a bit cleaner (and more expensive) would be something like

s = s:gsub('(.)(.)', function (lo,hi) return hi=='\0' and lo or '?' end)

this transforms all characters in the Latin-1 subset to their
Latin-1 code and all others, including the nasty BOM, into a '?'


conversion of UCS-2 to UTF-8 can also easily be done in Lua
(although using iconv is probably considerably faster, if you have it):

local char, floor = string.char, math.floor
function utf8 (i) -- BMP only
  if i<128 then return char(i) end
  if i<2048 then return char(192+floor(i/64), 128+i%64) end
  local j=floor(i/4096)
  i = i-j*4096
  return char(224+j, 128+floor(i/64), 128+i%64)
end
-- lo and hi arrive as one-character strings, so convert them with :byte()
s = s:gsub('(.)(.)', function (lo,hi) return utf8(hi:byte()*256+lo:byte()) end)


make sure to read your file in binary chunks of even size.


cheers
Klaus
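
The advice about even-sized binary chunks can be combined with the Latin-1 substitution into one routine. This is a sketch only: both function names are invented, the file is assumed to be little-endian UTF-16 with an optional FF FE mark, and anything outside Latin-1 becomes '?'.

```lua
-- Hypothetical helper: turn UTF-16LE bytes into Latin-1, with '?' for the rest.
local function utf16le_to_latin1(data)
  -- The extra parentheses drop gsub's second result (the match count).
  return (data:gsub("(.)(.)", function(lo, hi)
    if hi == "\0" then return lo end
    return "?"
  end))
end

-- Hypothetical helper: read a UTF-16LE file in even-sized binary chunks.
local function read_utf16le_file(path, chunk_size)
  chunk_size = chunk_size or 4096
  -- An odd size would split a 16-bit unit across two chunks.
  assert(chunk_size % 2 == 0, "chunk size must be even")
  local f = assert(io.open(path, "rb"))
  local parts, first = {}, true
  while true do
    local chunk = f:read(chunk_size)
    if not chunk then break end
    -- Drop a little-endian byte order mark at the very start of the file.
    if first and chunk:sub(1, 2) == "\255\254" then
      chunk = chunk:sub(3)
    end
    first = false
    parts[#parts + 1] = utf16le_to_latin1(chunk)
  end
  f:close()
  return table.concat(parts)
end

-- Usage, with the file name taken from the original question:
--   local logstring = read_utf16le_file("system.log")
```

Collecting the converted chunks in a table and joining them once avoids rebuilding one ever-growing string on each pass through the loop. A stray final byte in a file of odd length is passed through unchanged.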

David Given
2008-10-08 11:58:06 UTC
Post by Brian Sanders
Hmm... so if it is in either UTF-16 or UCS-2 with BOM... is there any
way for me to use these log files with a LUA script? I guess it is good
to know that I did understand the LUA tutorials, it was my input file I
was not looking closely enough at.
You need to convert it somehow into UTF-8. There are a number of Lua
addons for doing this sort of transcoding, but if you've got Cygwin,
it's probably easier just to use the following command line:

iconv -f utf-16 -t utf-8 fnord.log > fnord.txt

...then open 'fnord.txt'.

You may even be able to do this:

local fp = io.popen("iconv -f utf-16 -t utf-8 fnord.log")
local text = fp:read("*all")

...but I forget whether that works on Windows.

(This whole problem is due to Windows having a rather different idea of
what 'plain text' means to the rest of the world; see
http://en.wikipedia.org/wiki/Bush_hid_the_facts for a rather amusing
consequence...)
--
David Given
***@cowlark.com