Writing a topic up for oneself is one of the best ways of learning it.
Once you have learnt it, you can never recapture the difficulty you had in understanding the original.
I promise not to rework these notes to elegance when I know the material well. I will add stuff and correct errors, though. Thanks to members of the Lua list for helping me here, particularly PierreYves Gérardy (alias Pygy).
They are mainly for my own use, especially when I need the information after not having worked with it for a long time. If anyone else finds them helpful, it is a bonus.
They are not intended to be comprehensive, only to flatten the learning curve until the reader knows enough not to be daunted by the official documentation.
LPEG is a library for operating on patterns. It is not an alternative to the Lua string library, but it can be used to implement libraries rather like the string library instead of doing it directly in C.
Some words mean very specific things to LPEG.
Not the same as string.match
. In fact, it may be a good idea to forget everything you know about the Lua string library while coming to grips with LPEG.
lpeg.match(p,"abcdef",2)
applies the pattern p
at position 2 of "abcdef".The central concept of LPEG.
Not just a portion of a match.
The most spectacular feature of LPEG is the way complicated patterns are built up from simpler ones: Patterns can be used instead of numbers as the values that the variables in an arithmetic expression may take. For example, x^2+3*x13
is a perfectly valid LPEG expression when x
is a pattern.
Roberto Ierusalimschy put a lot of thought into allocating useful meanings to the arithmetic operations and constants of Lua. In particular, the priority of operations is such that one often needs no parentheses. Some words of warning, though:
To make a pattern actually do anything, it must be passed to the function match
.
Constants like 3
or true
in a pattern expression are special shorthand notations. They are recognized as patterns only when a pattern is expected, e.g. as an argument to match
or another LPEG function, or when combined with existing patterns. To combine two constants, first convert at least one of them to a pattern by lpeg.P
(the lpeg.
is used only here, it will be just P
later). I.e. if Lua will think 3
is just a number, say P(3)
instead of just 3
.
The expressions look familiar but that must not mislead you into thinking that the usual commutative, associative and distributive laws of algebra hold. They don't.
true
stands for a pattern that always succeeds, false
for a pattern that always fails.
0
stands for a pattern that matches the empty string, positive n
for a pattern that matches exactly n
bytes, n
for the negation of n
(see next item).
p
succeeds when p
fails. It consumes no input. n
succeeds when there are fewer than n
bytes of input left.
fct
stands for a pattern that fails when fct
returns a Lua false value (i.e. false
, nil
or none) and succeeds when it returns true
or a new valid position. See Cmt
under Captures for a specification of fct
.
#p
matches what p
matches, but consumes no input.
p
and q
respectively match a
and b
, then p*q
matches a..b
. Note that multiplication is not commutative.
or
.p+q
matches what p
matches, except when p
fails; then it matches what q
matches. Note that p+q
succeeds if and only if q+p
succeeds, but if both p
and q
would succeed, the match is that of the first pattern. So addition is not quite commutative.
and not
.pq
fails if q
succeeds, otherwise matches what p
matches. Note that 0p
does the same as p
, but pp
does not do the same as 0
, it does the same as false
.
p/s
matches what p
does, and defines a capture for the whole expression by processing the captures of p
as specified by s
. It can also turn a noncapture pattern into a capture pattern. Here are the simplest cases, assuming that p
is a noncapture pattern that has matched the nonnil value named value
.
p/1
captures value
.p/tbl
captures tbl[value]
if it is not nil.p/str
captures str:gsub("%0",value)
.p/fct
captures fct(value)
, even if the return value is nil. Returning nothing means no capture, which is not the same thing.For the fallbacks when value
or tbl[value]
is nil or fct(value)
returns nothing, and for what happens when p
is already a capture pattern, see Captures.
Not quite exponentiation in the usual sense: p*p
means exactly two repetitions of p
, which is not the same as p^2
.
p^n
matches n
or more repetitions of p
.p^0
matches any number of repetitions of p
.p^n
matches not more than n
repetitions of p
. If there are more, p^n
still succeeds; the other repetitions are still there for followup patterns to examine.p^0
and p^n
match the empty string, but p
itself is restricted to patterns that do not match the empty string.
For example, suppose x=P"abc"
. Then x^2+3*x13
means "two or more copies of abc
, or any three bytes followed by abc
, but shorter than 13 bytes".
The introductory lpeg.
has been omitted here. I have set up my system so that typing lpeg
in a command shell does
lua l lpeg e "for k,v in pairs(lpeg) do _ENV[k]=v end" i
Then I treat these names as reserved:
B C Carg Cb Cc Cf Cg Cmt Cp Cs Ct P R S V
More items than those get copied, but I don't bother about the others. Actually I just remember "avoid oneletter uppercase and short names starting with C
".
P
(Pattern)P(v)
converts to a pattern a value v
of any Lua type except thread or nonpattern userdata, as described above. Existing patterns are returned unchanged. It is not advisable for v
to have a metatable.
P
is explicitly needed at least once when forming a new pattern from scratch, but seldom after that; the LPEG functions invoke P
automatically when expecting a pattern.
R
(Range)R(r)
, where r
is a twobyte string, matches any byte whose internal numerical code is in the range r:byte(1,2)
. You could use characters in r
, e.g. R"az"
on most systems matches the range of lowercase letters (but see locale
for a more portable alternative).
S
(Set)S(s)
, where s
is a string, matches any of the bytes in s
.
locale
locale()
returns a table of patterns that match character classes. Recommended method is to examine the keys of the returned table, e.g. locale().lower
matches all lowercase letters.
When called with p
as first argument, these functions will replace p
by P(p)
before proceeding. When p
is already a pattern, they can also be called in an objectoriented way, as occasionally shown below.
match(p,subject[,init]), p:match(subject,init, ...)
Tries to match p
to subject:sub(init)
. init
defaults to 1. Returns the captures of the match, or if none specified, the next position, i.e. the index of the first byte in the subject after the match. Returns nil
only when the capture fails.
As indicated, you may not omit the init
argument to match
when there are extra arguments. Those optional arguments can be accessed as described under Captures.
B(p), p:B()
(Backspace)p
does, but matches just before the next position instead of at it and consumes no input. p
is restricted to patterns of fixed length that make no captures.
C(p), p:C()
(Capture)p
does, and returns a capture of the match, plus all captures that were made.
The LPEG distribution contains re.lua
, an application demonstrating the feasibility of writing a regular expression handler using LPEG. The present author is not yet qualified to write about that. He is probably not qualified to write about the topics below either, but he pretends to be.
Captures made by a subexpression automatically become captures of the whole expression, and are eventually returned by match
, unless modified by the division operator and the capture methods.
The patterns made by P
acting on a nonpattern do not capture anything. They can be made to do so with the four options provided by the division operator. In principle one can do almost anything that way, since one of the options is to process the match by a function of one's own choosing.
In practice some tasks are so common that it is useful to have additional capture functions predefined. We are reaching the limits of this primer here, and there is really no substitute for reading the official LPEG documentation, or at least the tutorial on the LuaWiki, but at the risk of saying too much but not enough, here are some of them.
The pattern p1=P'a'/"A"+P'b'/"B"+P'c'/"C"
, which matches lowercase a
, b
or c
and captures the uppercase equivalent, is used in some of the examples.
All the capture constructors are pattern modifiers: they affect what happens after a given pattern succeeds. For all except Cmt
, success or failure is determined by the pattern, and the capture constructor merely determines which captures will be produced. As in table constructors, return lists etc, nil captures are significant as placeholders.
Carg(n)
n
th extra argument to match
. (p1*Carg(1)*p1):match("ab",1,2)
returns 'A',2,'B'
(the 1
is init
, which is nonoptional if you want to use extra arguments).
Cp()
(Position)n+1
when the input up to position n
has been consumed. (p1*p1*Cp()):match("ab")
returns 'A','B',3
.
Cc(...)
(Constant)(p1+Cc(math.pi)):match"xyz"
returns 3.1415926535898
.
Ct(p), p:Ct()
(Table)p
. Ct(p1*Carg(1)*p1):match("ab",1,2)
is {A,2,B}
.
Cs(p), p:Cs()
(Substitute in String)Cs(p1*Carg(1)*p1):match("ab",1,2)
returns A2B
. The captures must be strings. Note that the substitution applies to the match, not to the full subject: Cs(p1*Carg(1)*p1):match("abc",1,2)
also returns A2B
.
Cf(p,fct), p:Cf(fct)
(Fold by Function)fct
must be a function of two variables. Think of it as a leftassociative binary operator applied between the captures of p
. The result is the capture of Cf(p,fct)
. (C(1)^3):match"361"
returns '3','6','1'
but Cf(C(1)^3,math.max):match"361"
returns 6
.
Cmt(p,fct)
(MatchTime iMmediaTe)Makes a new match immediately after p
reports success. This does more than define captures for p
, it overrules everything that p
did: whether the match fails despite the success of p
, and in the case of success, what the new position after the match is, and what captures the modified pattern returns.
The call to fct
supplies arguments subj,pos,...
, where the pos
is the next position after the match of p
and the vararg list contains all the captures that p
made. The first return value must be one of:
false
, nil
or none if the match fails;true
, if the match succeeds but consumes no input;pos
to #subj+1
.In the case of success, further return values of fct
become captures of Cmt(p,fct)
.
This specification of function fct
is also applicable when P(fct)
is used to create a userdefined pattern.
Finally we are in a position to use LPEG to test for the property mentioned when we defined "pattern": "seven letters, starts with 'p', has a double letter in it somewhere".
dbl = function(s,p)
while p<#s do
if s:sub(p,p)==s:sub(p+1,p+1) then return p+1 end
p=p+1
end
end
pat = #P(dbl)*#P"p"*P(7)
These patterns are useless by themselves. They are always used as part of a bigger pattern.
The first version of this document said: "The remaining capture functions: Cg
and Cb
, are definitely too advanced for this primer." That is still basically true, but one application is within reach: the GroupBack combination allows values to be stored and retrieved.
Cg(p,name), p:Cg(name)
(Group)p
made into a single entity, and gives that entity a name for future reference. Does not actually add any captured values to the big pattern.
Cb(name)
(Back)Here is how it works.
name = C(R"AZ"*R"az"^1)  Capitalized word
entry = Cg(name,'surname')*','*P" "^0*name*Cb'surname'
 reverses the items in a twoitem commaseparated list like `data`
data = "Ierusalimschy, Roberto"
result = {entry:match(data)}
print (table.concat(result,' ')) > Roberto Ierusalimschy
There's also an anonymous version of Cg
, with the name omitted. For that, and much more, take a look at Pygy's recent addition to http://luausers.org/wiki/LpegTutorial.
When p
succeeds without producing a capture, p:match
returns the next position (see Cp()
, above). The division operator cannot negate success: if it discards all captures, the match still succeeds and p/whatever
still returns the next position.
If p
is a capture pattern but has not actually produced any captures, it acts like a noncapture pattern, as discussed when division was first decribed.
Assume that there is at least one capture, and call the captures c_1
, c_2
etc.
p/n
captures c_n
, p/0
means there is no capture.p/tbl
captures tbl[c_1]
except when this is nil, in which case there is no capture.p/str
means the capture is the result of replacing in str
all instances of "%n"
by c_n
, where n
runs from 0 to 9. c_0
means the whole match. This feature can only handle nine captures.p/fct
means the capture (or perhaps several captures) is fct(c_1,c_2,...)
. Any number of return values are allowed, any or all of which may be nil.A grammar is a pattern constructed out of a table tbl
of named patterns, called "rules", that may refer to each other recursively. One of the patterns, known as the "initial rule", becomes the pattern created by P(tbl)
; it must either be tbl[1]
or its key must be stored as tbl[1]
.
The rules look very much like BNF notation. For example, a simple arithmetic expression might be defined as:
expr ::= term  expr+term  exprterm
term ::= factor  term*factor  term/factor
One would like to translate this to LPEG as:
expr = term + expr * '+' * term + expr * '' * term
term = factor + term * '*' * factor + term * '/' * factor
but that would not work, because when the righthand sides are evaluated, the required values are not known yet. Instead, we put everything inside a table, and use the pattern constructor V
(for Variable) to refer to rules.
tbl = { "expr",
expr = V"term" + V"term" * "+" * V"expr" + V"term" * "" * V"expr";
term = factor + factor*"*"*V"term" + factor*"/"*V"term"
}
What then happens is quite similar to the distinction between code inside a Lua function definition and outside it. In the following Lua code, the expression x+a
is not equivalent to x+1
; instead, it generates code to be executed later for adding the values that x
and a
have when the function is executed.
local a = 1
function f(x)
return x+a
end
Similarly, when V(key)
is invoked, the current value of tbl[key]
is not relevant: code is generated that will use the future value of tbl[key]
at the stage when the table is turned into a pattern. This happens once only: the actual userdata in tbl[key]
mutates, after which it is fixed.
For recursive grammars, it is very important that at least one byte of input is consumed before the same rule is applied again, otherwise the pattern goes into an infinite loop. Actually, the first time round, the above definition of tbl
did not work because I put the operations in the wrong order: V"expr" * "+" * V"term"
instead of V"term" * "+" * V"expr"
, etc. (The BNF itself is already wrong, of course.) P(tbl)
will in most cases detect such loops and refuse to construct the grammar.
Certain subexpressions occur so often in pattern expressions that one gets to recognize them at a glance.
This...  matches...  

1 
the end of the subject.  
#p*q 
what q matches, provided that p would succeed. 

(1p)^0 
everything up to where p would succeed. 
(I can't tell whether they will be frequently asked.)
How do you match anywhere, as the Lua string library does?
The easiest way is to use an idiom to skip over the nonmatching part: (1p)^0*Cp()*p*Cp()
is a pattern that returns the first position where p
succeeds and the position after the match.
Can one do a sed
style gsub
?
The point of the question is that string.gsub
, if given a greedy pattern that can match the empty string, produces an empty match after every nonempty match except right at the end, and in some situations this is not what one wants.
For example, (string.gsub(";a;","a*","ITEM"))
returns "ITEM;ITEMITEM;ITEM"
whereas "ITEM;ITEM;ITEM"
as returned by the sed
command s/a*/ITEM/
is more convenient in applications like string splitting.
p^0
(where p=P"a"
) acts like the Lua pattern "a*"
, but one cannot use it by itself in a loop because it matches the empty string. The trick here is to do the possibly empty first match outside the loop, and let the loop include the nonmatch in between, which must be nonempty. Let q=p^0/"ITEM"
, then matching by Cs(q*((1p)*q)^0)
does the job: the outer Cs
substitutes the captures instead of returning them.
In general, suppose that the straightforward q=s/r
, where s
is a noncapture pattern and r
the replacement, is not good enough. We need a p
so that the pattern Cs(q*((1p)*q)^0)
works instead. The critical property that p
must have is to succeed only when the given s
produces a nonempty match. You can construct it this way:
nonempty=function(str,pos,cap)
if cap and #cap>0 then return true end
end
p=Cmt(s,nonempty)
It is not possible to construct p
given only q
: the information whether the match was empty has been lost.
APL is a language in which expressions look something like this:
((⍴X)⍴⍋⍋X⍳X,Y)⍳(⍴Y)⍴⍋⍋X⍳(Y←1 2 3),X←2 3 4
APL has the following properties, simplified slightly for the present purpose.
It is a substantial but straightforward task to write a Lua library in which the functions do what the APL functions do. For example, ⍳6
in APL means "the integers 1 to 6", and iota(6)
could do that; 2 3⍴A
means "form a 2×3 matrix out of the contents of A", and rho(A,{2,3})
could do that. Thus 2 3⍴⍳6
which means "form a 2×3 matrix out of the integers 1 to 6" would translate to the Lua expression rho(iota(6),{2,3})
.
The APL expression with which we started, would become
iota(rho(gradeup(gradeup(iota(comma(assign({2,3,4},'X'),
assign({1,2,3},'Y')),_V.X))),rho(_V.Y)),rho(gradeup(gradeup(
iota(comma(_V.Y,_V.X),_V.X))),rho(_V.X)))
This is getting to be quite daunting. It is quite obvious that such a library will never catch on if Lua users are expected to compose expressions like that for themselves. With the aid of LPEG, a function apl2lua
could be written that generates the above Lua code from the original APL code. From there to functions loadapl
and doapl
is easy, so that the Lua user could write
doapl"(((⍴X)⍴⍋⍋X⍳X,Y)⍳(⍴Y)⍴⍋⍋X⍳(Y←1 2 3),X←2 3 4)"
I'm not saying that is perfectly unobfuscated either, but at least it is no more so than the original APL code is. It's a listed idiom from an APL website, so it's safe to assume that regular APL users would not be fazed by it.
APL syntax is pretty straightforward to express in BNF and that in turn can be expressed directly as a Lua grammar. Add the desired Lua translations via the division operator, and there's our APL compiler.
local apl_expr = P{ "expr";
expr =
V'func'*V'expr'/"%1(%2)" +
V'leftarg'*V'func'*V'expr'/"%2(%3,%1)" +
V"variable"*'←'*V"expr"/"assign(%2,'%1')" +
V"leftarg";
func =
funcname*operator*funcname/"%2(%1,%3)" +
funcname*operator/"%2(%1)" +
funcname;
leftarg = V"value" + '('*V"expr"*')'/1;
value = vector/numbers + V"variable"/"_V.%1";
variable = varname*'['*V"indices"*']'/"%1[%2]" + varname;
index = V"expr"+space^0/"nil";
indices = V"index"*';'*V"index"/"{%1;%2}" + V"expr";
}
Wow. Was that easy or was that easy? The only subtlety is that parenthesized expressions in APL need not be parenthesized in Lua, because these expressions will be arguments of a function call.
Actually, that was just the easy part; the rest is harder. We have terminals funcname
, operator
, vector
and varname
to supply, plus the function numbers
that converts a vector
to Lua syntax.
numbers
is a list containing decimal numbers separated by whitespace, formatted as in Lua except that minus signs are replaced by APL's high minus, and optional plus signs are not allowed. This can be coded in LPEG as follows, with a little fancy footwork to disallow a lone decimal point.
local space = S" \n\t"  one character of whitespace
local dec = R"09"^1  positive decimal integer
local sign = P"¯"^1  optional high minus
local fixed = dec*P"."*dec^1 + (dec^1*P".")^1*dec  %f number
local number = sign*fixed*(S"eE"*sign*dec)^1  %e number
local vector = space^0*number*(space^1*number)^0
The numbers
function has very little to do, since LPEG has already verified the syntactic correctness of the individual numbers by the time numbers
is invoked. We can do it with the Lua string library (no need to be pedantic and do everything in LPEG).
local numbers = function(str)
str=str:gsub("¯","")
local v,n = str:gsub("%s+",',')
if n==0 then return str else return '{'..v..'}' end
end
The remaining terminals are varname
, funcname
and operator
. The legal function and operator names are not known at the stage when the grammar is constructed; the user might add to them at any time. Yet the grammar depends on whether a name refers to a function, an operator or a variable.
Suppose we keep tables for the functions and operators: when the user constructs a new function, it is just an entry added to the table. The keyvalue pairs will be the APL and Lua representations respectively. A toy version of these tables might be:
local funcs = {
['+']='plus', ['×']='times', ['/']='slash', ['⍋']="gradeup",
['⍴']="rho", ['⍳']="iota", [',']="comma" }
local ops = { ['/']='slash', ['.']='dot' }
What we now need is a pattern that captures the value if its key is in a table, and fails when the key is absent. This cannot be done by the division operator, since the pattern has already succeeded by the time /tbl
is invoked; only the Cmt
constructor can overrule that.
local lookup = function(tbl)
return function(subj,pos,key)
local v = tbl[key]
if v then return pos,v end
end
end
local funcname = Cmt(name,lookup(funcs))
local operator = Cmt(name,lookup(ops))
We have not yet said what qualifies as name
. As far as the syntax is concerned, it could be anything that does not match numbers, whitespace or the special characters appearing in the grammar. For reasons of legibility and error detection, that would be too permissive.
Historically, APL allows the usual alphabeticalphanumeric names plus the names ⍺
and ⍵
, the latter being reserved for argument names in function definitions. In olden days, APL symbols occupied one byte, internally encoded the same way as other symbols like "é": one needed special screen fonts to display them. Nowadays, the APL symbols not in the ASCII character set have their own Unicode characters, usually represented in Lua strings by UTF8 codepoints, and any decent font can display them.
One must also not lose sight of the fact that APL symbols denote functions, and will be implemented as Lua functions. It is therefore convenient to define any single character that could be an APL function to be an APL name. The following cases cover that:
Let's code that in LPEG.
local first = R"az"+R"AZ"  alphabetic
local later = first+R"09"  alphanumeric
local utc = R"\x80\xBF"  UTF8 continuation byte
local utf2 = R"\xC0\xDF"*utc  2byte codepoint
local utf3 = R"\xE0\xEF"*utc*utc  3byte codepoint
local utf = utf2 + utf3  P"←"P"¯"
local neutral = R"\x21\x7E"laterS"()[;]"
local name = first*later^0 + utf + neutral
For variable names, we must be more restrictive. The assign operator does not care — assign({2,3,4},'X')
— but when variables are referred to, they need to work as unquoted keys — comma(_V.Y,_V.X)
. Anything not already defined to be a function or an operator is allowed.
local varname = C(first*later^0funcnameoperator)
Finally, the function that calls match
and returns the Lua code. This function would be very short if we took a highhanded attitude to APL syntax errors on the part of the user, but it is quite easy to be helpful.
apl2lua = function(apl)
local lua,pos = (apl_expr*Cp()):match(apl)
pos = pos or 1
if pos>#apl then return lua
else error("APL syntax error\n"..apl.."\n"..
string.rep(' ',utflen(apl:sub(1,pos))1)..'↑')
end
end
The Cp()
pattern marks how far the APL compiler managed to get. If that is not the end of the APL code, an error message is printed. utflen
is a little utility that counts the actual number of codepoints: pos1
would put in too many spaces since typically lots is twobyte and threebyte codepoints are used in APL code.
Let's test it with the example at the top.
> =apl2lua"((⍴X)⍴⍋⍋X⍳X,Y)⍳(⍴Y)⍴⍋⍋X⍳(Y←1 2 3),X←2 3 4"
iota(rho(gradeup(gradeup(iota(comma(assign({2,3,4},'X'),
assign({1,2,3},'Y')),_V.X))),rho(_V.Y)),rho(gradeup(gradeup(
iota(comma(_V.Y,_V.X),_V.X))),rho(_V.X)))
Exactly what we had before. (Actually, I cheated. My attempt to write this by hand contained several mistakes, so I cutandpasted this version to up there.)
And an example with a syntax error (here merely a genuine APL function not included in our toy tables):
> =apl2lua"((⍴X)⍴⍋⍋X⍳X,Y)⍳(⍴Y)⍴?⍋X⍳(Y←1 2 3),X←2 3 4"
lpegaplcompiler.lua:68: APL syntax error
((⍴X)⍴⍋⍋X⍳X,Y)⍳(⍴Y)⍴?⍋X⍳(Y←1 2 3),X←2 3 4
↑
The arrow is not quite at the right spot, because the string ending at that point is not legal APL. If the error is inside long parentheses, it is more annoying:
> =apl2lua"((⍴X)⍴?⍋X⍳X,Y)⍳(⍴Y)⍴⍋⍋X⍳(Y←1 2 3),X←2 3 4"
lpegaplcompiler.lua:68: APL syntax error
((⍴X)⍴?⍋X⍳X,Y)⍳(⍴Y)⍴⍋⍋X⍳(Y←1 2 3),X←2 3 4
↑
I can explain that, and could maybe make the error detection more intelligent, but is it worth it? Especially since there is a serious error still present:
> return apl2lua"((⍴X)⍴⍋⍋X⍳X,Y)⍳(⍴Y) ⍴⍋⍋ X ⍳ (Y←1 2 3),X←2 3 4"
lpegaplcompiler.lua:68: APL syntax error
((⍴X)⍴⍋⍋X⍳X,Y)⍳(⍴Y) ⍴⍋⍋ X ⍳ (Y←1 2 3),X←2 3 4
↑
This is the original, correct expression, with some whitespace added to make it more legible. Our grammar makes no provision for inert whitespace — back to the drawing board!
Postscript: all it took was this:
local funcname = space^0*Cmt(name,lookup(funcs))*space^0
local operator = space^0*Cmt(name,lookup(ops))*space^0
local varname = space^0*C(first*later^0+utffuncnameoperator)*space^0
<! About this document: Dirk Laurie wrote it in order to teach himself LPEG. All errors in it can be blamed on his inexperience. The original is lpegbrief.txt
, which is written in Pandoc Markdown. The other versions were made using Pandoc http://www.johnmacfarlane.com/pandoc. />