Discussion:
Matching sequences of identical characters with lpeg.
Magicks M
2018-11-20 22:16:58 UTC
Hi, I'm having some trouble with this.
I want to capture sequences of identical characters in a string: "aaabbbcccd"
-> "aaa" "bbb" "ccc" "d"
I attempted to use the standard Lua pattern "((.)%1*)" and found that
quantifiers don't work on the %1 back-reference. I then switched to lpeg
and attempted to use:
Cg(P(1), "char") * Cb'char'^0
This raises an error, so I cannot think of a good way to match this.
Andrew Gierth
2018-11-20 22:51:41 UTC
Magicks> Hi, I'm having some trouble with this.
Magicks> I want to capture sequences of identical characters in a
Magicks> string: "aaabbbcccd" -> "aaa" "bbb" "ccc" "d"

Magicks> I attempted to use the standard Lua pattern "((.)%1*)" and
Magicks> found that quantifiers don't work on the %1 back-reference. I
Magicks> then switched to lpeg and attempted to use:
Magicks> Cg(P(1), "char") * Cb'char'^0
Magicks> This raises an error, so I cannot think of a good way to match
Magicks> this.

Doing this in lpeg will, I believe, either require Cmt or pre-generating
a pattern for every possible character. (Your attempt above seems to
misunderstand what Cb does - it just fetches and returns the value from
Cg, it does not attempt to match it against the subject string;
backreference matches (as with =foo in the lpeg "re" module, or the "Lua
long strings" example in the lpeg docs) require Cmt.)
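
To illustrate the point about Cb, here is a minimal sketch (the variable names are made up for this example). It assumes the lpeg module is installed. The first pattern should show that Cb returns the stored character without consuming any further input; the second part should show that the original attempt is rejected while the pattern is being built, because the body of the ^0 loop can match the empty string.

```lua
local lpeg = require "lpeg"
local P, Cg, Cb, Cp = lpeg.P, lpeg.Cg, lpeg.Cb, lpeg.Cp

-- Cb re-emits the value stored by the named Cg; it consumes no input.
-- Cp() reports the position reached after the match.
local p = Cg(P(1), "char") * Cb("char") * Cp()
local value, pos = p:match("ab")
print(value, pos)  -- the stored character and the position right after it

-- Repeating Cb is rejected at construction time, because the loop
-- body can match the empty string.
local ok, err = pcall(function()
  return Cg(P(1), "char") * Cb("char")^0
end)
print(ok, err)
```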

It's not immediately clear that doing it with Cmt would be in any way
better than just open-coding the search in Lua, since you'd be calling
the capture function at every character position. With pre-generated
patterns it might look like this:

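local lpeg = require "lpeg"
local P, C = lpeg.P, lpeg.C

-- Build an ordered choice with one alternative per byte value;
-- each alternative matches two or more copies of that byte.
local subpat = P(false)
for i = 0, 255 do
  subpat = subpat + P(string.char(i))^2 -- 2 or more occurrences
end

-- Capture each run of length >= 2; skip any other single character.
local pat = (C(subpat) + P(1))^0

print(pat:match("abcaabbccaaabbbcccd"))
-- output: aa bb cc aaa bbb ccc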
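
For comparison, the open-coded search mentioned above could look roughly like this in plain Lua. This is only a sketch: the function name `runs` is invented for the example, it works on bytes rather than characters, and unlike the pattern above it returns every run, including runs of length 1 (which is what the original question asked for).

```lua
-- Split a string into runs of identical bytes, e.g. "aab" -> {"aa", "b"}.
local function runs(s)
  local out = {}
  local start = 1
  for i = 2, #s + 1 do
    -- Close the current run at end of string or when the byte changes.
    if i > #s or s:byte(i) ~= s:byte(start) then
      out[#out + 1] = s:sub(start, i - 1)
      start = i
    end
  end
  return out
end

print(table.concat(runs("aaabbbcccd"), " "))
-- output: aaa bbb ccc d
```

Filtering the result for `#run >= 2` would give the same runs as the lpeg pattern above.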
--
Andrew.
Gabriel Bertilson
2018-11-20 23:00:24 UTC
Here's a solution using Cmt. As Andrew pointed out, Cb just inserts a
capture into the list of captures returned by the current pattern, it
doesn't match anything.

Cg(1, "char") * Cmt(C(1) * Cb'char', function (_, _, char1, char2)
return char1 == char2 end)^0

It matches one character, labels it as "char", then keeps matching further
characters as long as they are equal to "char". To use it on "aaabbbcccd":
patt = Cg(1, "char") * Cmt(C(1) * Cb'char', function (_, _, cur, prev) return cur == prev end)^0
(C(patt)^1):match "aaabbbcccd"
aaa bbb ccc d
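
If the runs are wanted in a table rather than as multiple return values, one option is to wrap the repetition in lpeg.Ct. A minimal sketch, assuming lpeg is installed; the names `run` and `all_runs` are only for illustration:

```lua
local lpeg = require "lpeg"
local C, Cb, Cg, Cmt, Ct = lpeg.C, lpeg.Cb, lpeg.Cg, lpeg.Cmt, lpeg.Ct

-- One run: a character stored as "char", then any number of further
-- characters that compare equal to it.
local run = Cg(1, "char")
          * Cmt(C(1) * Cb("char"), function(_, _, cur, prev)
              return cur == prev
            end)^0

-- Collect every run into a single table.
local all_runs = Ct(C(run)^0)

local t = all_runs:match("aaabbbcccd")
print(#t, table.concat(t, " "))
```

Because each run consumes at least one character, the ^0 loop here is accepted, unlike a loop over Cb alone.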

— Gabriel
Sean Conner
2018-11-21 01:04:15 UTC
Post by Gabriel Bertilson
Here's a solution using Cmt. As Andrew pointed out, Cb just inserts a
capture into the list of captures returned by the current pattern, it
doesn't match anything.
Cg(1, "char") * Cmt(C(1) * Cb'char', function (_, _, char1, char2)
return char1 == char2 end)^0
It matches one character, labels it as "char", then keeps matching further
characters as long as they are equal to "char". To use it on "aaabbbcccd":
patt = Cg(1, "char") * Cmt(C(1) * Cb'char', function (_, _, cur, prev) return cur == prev end)^0
(C(patt)^1):match "aaabbbcccd"
aaa bbb ccc d
And it can be further extended to UTF-8:

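local lpeg = require "lpeg"
local C, Cb, Cg, Cmt, R = lpeg.C, lpeg.Cb, lpeg.Cg, lpeg.Cmt, lpeg.R

-- One UTF-8 character: an ASCII byte, or a lead byte followed by one
-- or more continuation bytes (sequences are not fully validated).
local char = R"\1\127"
           + R"\194\244" * R"\128\191"^1
local seq  = Cg(char,'char')
           * Cmt(
               C(char) * Cb'char',
               function(_,_,cur,prev)
                 return cur == prev
               end
             )^0
local patt = C(seq)^1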

print(patt:match "aaabbbcccd")
print(patt:match "###aaaa©©©©####bbbbb####")
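
The same idea can also be sketched without lpeg, using a plain Lua pattern that picks out one UTF-8 character at a time. The pattern is similar in spirit to utf8.charpattern from Lua 5.3 and later, but written out by hand here so that it does not depend on the utf8 library. The name `utf8_runs` is hypothetical, and malformed input is not validated: bytes that do not start a character are silently skipped.

```lua
-- One UTF-8 encoded character: an ASCII byte (NUL excluded) or a lead
-- byte, followed by any continuation bytes. No validation is done.
local CHAR = "[\1-\127\194-\244][\128-\191]*"

-- Split a string into runs of identical UTF-8 characters.
local function utf8_runs(s)
  local out, prev, count = {}, nil, 0
  for ch in s:gmatch(CHAR) do
    if ch == prev then
      count = count + 1
    else
      -- Character changed: flush the previous run, start a new one.
      if prev then out[#out + 1] = prev:rep(count) end
      prev, count = ch, 1
    end
  end
  if prev then out[#out + 1] = prev:rep(count) end
  return out
end

local copy = "\194\169"  -- UTF-8 bytes of the copyright sign
local t = utf8_runs("###aaaa" .. copy:rep(4) .. "####bbbbb####")
print(#t, table.concat(t, " "))
```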

-spc
