Pegger User Guide
Overview
Parsing Expression Grammar (PEG) for when Regular Expressions just aren't enough!
Pegger is a Parsing Expression Grammar (PEG) implementation. It lets you create text parsers by building up a tree of simple matching rules.
Advanced parsing options let you look ahead with predicates, and throw errors to fail fast.
Pegger has been used (by Fantom Factory) to parse HTML, CSS, Markdown, and ABNF - to name but a few. The general strategy is usually:
- Create a structure of Fantom data classes
- Create a grammar to parse text documents into a node tree
- Trim the tree by removing labels and branches, so the tree resembles the structure of the Fantom classes
- Walk the tree, recursively creating corresponding data classes
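As a rough sketch of steps 1, 2, and 4 (tree trimming skipped), using the quick-start rule from the next section - the Greeting and GreetingParser classes below are invented purely for illustration:

using afPegger::Peg

** Step 1 - a simple data class to hold the parsed result
class Greeting {
    Str name
    new make(Str name) { this.name = name }
}

class GreetingParser {
    ** Steps 2 & 4 - parse the text, then build a data class from the match tree
    Greeting parse(Str input) {
        match := Peg(input, "'<'+ name:[a-zA-Z ]+ '>'+").match
        return Greeting(match["name"].toStr)
    }
}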
Pegger was inspired by Mouse, Parboiled for Java, and pegs for Nim.
Quick Start
using afPegger::Peg

class Example {
    Void main() {
        input := "<<<Hello Mum>>>"
        rule  := "'<'+ name:[a-zA-Z ]+ '>'+"
        match := Peg(input, rule).match
        name  := match["name"]  // --> "Hello Mum"
    }
}
Usage
The quick start example contained a lot of crazy symbols... so whoa, what just happened?
Rules
Pegger attempts to match a rule against a given string. In the example the string was <<<Hello Mum>>> and the rule was that mixed bag of crazy characters. Rules can be written in PEG notation (see mixed bag of crazy characters) or they can be created programmatically via Fantom code using the Rules mixin:
// PEG Notation:
// '<'+ ([a-zA-Z] / " ")+ '>'+

// Fantom code:
rule := sequence {
    oneOrMore(char('<')),
    oneOrMore(firstOf { alphaChar, spaceChar }),
    oneOrMore(char('>')),
}
The Fantom code can be a lot simpler to read and understand, but is also a lot more verbose.
Once you run a match, the result is a tree. Use Match.dump() to see it:
rule := sequence {
    oneOrMore(char('<')),
    oneOrMore(firstOf { alphaChar, spaceChar }),
    oneOrMore(char('>')),
}

input := "<<<Hello Mum>>>"
match := Peg(input, rule).match

match.dump
// --> root : "<<<Hello Mum>>>"
Okay, so that single line starting with root is not much of a tree. To create a tree, we need to give parts of our rule a label:
rule := sequence {
    oneOrMore(char('<'))                        .withLabel("start"),
    oneOrMore(firstOf { alphaChar, spaceChar }) .withLabel("name"),
    oneOrMore(char('>'))                        .withLabel("end"),
}
We can do the same in PEG notation by using a label: prefix:
// PEG Notation:
// start:'<'+ name:[a-zA-Z ]+ end:'>'+
match.dump() now gives:
root
 ├─ start : "<<<"
 ├─ name  : "Hello Mum"
 └─ end   : ">>>"
Each part of the match may be retrieved using the Match.get() operator:
match["start"].toStr  // -> "<<<"
match["name"].toStr   // -> "Hello Mum"
match["end"].toStr    // -> ">>>"
Grammar
The same could also be written as a PEG grammar. A grammar defines multiple PEG rules. Grammars may be coded programmatically, but are usually created from a string. There should always be one top-level or root rule that is responsible for matching everything:
root  = start name end
start = '<'+
name  = [a-zA-Z ]+
end   = '>'+
Definitions may span multiple lines, but continuation lines must contain leading whitespace to distinguish them from a new rule definition.
root  = start name
        end
start = '<'+
name  = [a-zA-Z ]+
end   = '>'+
When run, the same result is given:
grammarStr := "..."
grammar    := Peg.parseGrammar(grammarStr)
input      := "<<<Hello Mum>>>"
match      := grammar["root"].match(input)

match.dump
// root
//  ├─ start : "<<<"
//  ├─ name  : "Hello Mum"
//  └─ end   : ">>>"
Once a grammar (or rule) has been parsed and validated, it may be cached for future re-use.
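For example (a minimal sketch with a made-up grammar), a grammar may be parsed once up front and re-used for every input:

// parse the grammar once...
grammar := Peg.parseGrammar("root = [0-9]+")

// ...and re-use it for as many inputs as you like
["123", "42", "7"].each |input| {
    grammar["root"].match(input).dump
}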
Rules may be omitted from the result tree by prefixing the definitions with a - :
root   = start name end
-start = '<'+
name   = [a-zA-Z ]+
-end   = '>'+

// root
//  └─ name : "Hello Mum"
PEG Notation
Writing grammar files can be a lot easier to understand than the verbose programmatic method. Here's your guide.
Pegger is primarily concerned with parsing displayable 7-bit ASCII characters, but where mentioned it also provides support for 16-bit Unicode characters. Non-visible / non-printable characters are beyond the remit of Pegger, largely because Pegger uses strings as input! See Design notes for details.
Rule definitions
A rule is defined with a name followed by =. The more formal <- may be used in place of =.
ruleDef1 = rule
ruleDef2 <- rule
Advanced Note: Prefixing a rule with - will remove it from the result tree.
-ruleDef1 = rule
-ruleDef2 <- rule
Advanced Note: Suffixing a rule with - will remove it from debug output.
ruleDef1- = rule
ruleDef2- <- rule
Rules may have both a - prefix and suffix:

-ruleDef- = rule
Sequence
Ordered sequences of rules are expressed by separating them with a space.
ruleDef = rule1 rule2 rule3
Choice / First of
When given a choice, Pegger will match the first rule that passes. Beware: the order of choices can be important.
ruleDef = rule1 / rule2 / rule3
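For example, in the illustrative rule below the second alternative is unreachable, because the first alternative already matches its prefix - so place the longer literal first:

keyword = "for" / "forEach"   // "forEach" can never match - "for" always wins first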
Grouping
Sequence rules have a higher operator precedence than Choice rules, but it is better to group rules with brackets to avoid ambiguity.
ruleDef = (rule1 rule2) / (rule3 rule4)
Repetition
Rules may be matched a varying number of times.
rule1?      // optional
rule1+      // one or more
rule1*      // zero or more
rule1{4}    // exactly 4
rule1{2,6}  // between 2 and 6 (inclusive)
rule1{,5}   // at most 5 - same as {0,5}
rule1{3,}   // at least 3
Literal
Literal strings may be matched with either single or double quotes.
ruleDef1 = "literal 1"
ruleDef2 = 'literal 2'
Use backslash to escape the usual \b \f \n \r \t characters and to escape quotes. (Note that Fantom normalises all new line characters to just \n.)
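For example (illustrative rules), quotes are escaped with a backslash inside literals:

ruleDef1 = 'don\'t panic'
ruleDef2 = "say \"hello\""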
Use an i suffix to indicate the match should be case-insensitive.
ruleDef3 = "new\nline\n"i
Character class
Individual characters, and ranges thereof, may be matched with a regex-esque character class.
ruleDef1 = [a]         // matches 'a'
ruleDef2 = [abc]       // matches 'a', 'b', or 'c'
ruleDef3 = [a-z]       // matches any character in the range from 'a' to 'z' inclusive
ruleDef4 = [a-z]i      // as above, but case-insensitive
ruleDef5 = [0-9A-F]i   // matches any hex digit
ruleDef6 = [\n\] \t]   // backslash escapes
The hat ^ character will match any character BUT the chosen ones.
ruleDef6 = [^0-9] // match any char BUT not digits
Any character
Use . to match any character.
ruleDef = . . . // matches any 3 characters
Predicates
Use the ampersand & to look ahead for a match, but NOT consume any characters.
ruleDef = "something " &"good" .
Use exclamation ! to look ahead for a negative match, but again, NOT consume any characters.
ruleDef = "something " !"bad" .
As per the example above, predicates should always be used in conjunction with a character consuming rule.
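For instance, a rule that consumes text up to (but not including) the next tag can pair a negative look-ahead with the any-character rule - an illustrative snippet:

text = (!'<' .)+   // consume chars until the next '<', without consuming the '<' itself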
Unicode
Use \uXXXX (hexadecimal) notation to match a 16-bit Unicode character. May be used in string literals and character classes.
crlf = [\u000D] "\u000A"
Unicode escapes MUST contain exactly 4 hex digits; this eases parsing and matches Java string characters, which are 16 bits.
Macros
Pegger introduces macros for useful extensions. These may be used individually as rules.
name      | function
----------|---------
\sos      | Matches Start-Of-Stream (or Start-Of-String!?) - does not consume chars
\eos      | Matches End-Of-Stream (or End-Of-String!?) - does not consume chars
\sol      | Matches Start-Of-Line (also matches \sos) - does not consume chars
\eol      | Matches End-Of-Line (also matches \eos) - does not consume chars
\lower    | Matches a lowercase character in the current locale
\upper    | Matches an uppercase character in the current locale
\alpha    | Matches an alphabetic character in the current locale
\pass     | Always passes the Match
\fail     | Always fails the Match
\err(xxx) | Throws a PegParseErr with the given message
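As an illustrative sketch (the rule names are invented), macros are handy for anchoring a rule to the end of the input and for failing fast with a helpful message:

root   = number \eos                          // only pass if the entire input was consumed
number = [0-9]+ / \err(Expected some digits)  // fail fast with a helpful message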
Comments
Hash # comments are allowed in grammar, as are double slash // comments.
# Hash comments are allowed in grammar
// as are double slash comments

a = .     // end-of-line comments also allowed
b = [cd]  // comments may also appear in between rules
    [de]
PEG Grammar
Interestingly, PEG grammar may itself be expressed as PEG grammar. And indeed, Pegger does use itself to parse your PEG definitions!
PEG grammar:
grammar       <- (!\eos (commentLine / ruleDef / \err(FAIL)))+
ruleDef       <- exclude:"-"? ruleName debugOff:"-"? cwsp* ("=" / "<-") cwsp* rule commentLine?
ruleName      <- [a-zA-Z] (("-" [a-zA-Z0-9_]) / [a-zA-Z0-9_])*
rule          <- firstOf / \err(FAIL)
firstOf       <- sequence (cwsp* "/" cwsp* sequence)*
sequence      <- expression (cwsp* expression)*
expression    <- predicate? (label ":")? type multiplicity?
label         <- [a-zA-Z] [a-zA-Z0-9_\-]*
type          <- group / ruleName / literal / chars / macro / dot
-group        <- "(" cwsp* rule cwsp* ")"
predicate     <- "!" / "&"
multiplicity  <- "*" / "+" / "?" / ("{" min:[0-9]* (com:"," max:[0-9]*)? "}")
literal       <- singleQuote / doubleQuote
-singleQuote  <- "'" (unicode / ("\\" .) / [^'])+ "'" "i"?
-doubleQuote  <- "\"" (unicode / ("\\" .) / [^"])+ "\"" "i"?
chars         <- "[" (unicode / ("\\" .) / [^\]])+ "]" "i"?
macro         <- "\\" [a-zA-Z]+ ("(" [^)\n]* ")")?
unicode       <- "\\" "u" [0-9A-F]i [0-9A-F]i [0-9A-F]i [0-9A-F]i
dot           <- "."
-commentLine- <- sp* (nl / comment)
-comment-     <- ("#" / "//") (!\eos [^\n])* nl
-cwsp-        <- sp / (!\eos cnl (sp / &("#" / "//")))
-cnl-         <- nl / comment
-sp-          <- [ \t]
-nl-          <- \eos / "\n"
Recursive / HTML Parsing
A well known limitation of regular expressions is that they can not match nested patterns, such as HTML. (See StackOverflow for explanation.) Pegger to the rescue!
Because PEGs contain rules that may reference themselves in a circular fashion, it is possible to create recursive parsers.
Below is an example that parses nested HTML tags. You can see the recursion from the element definition, which references itself:
grammar := Peg.parseGrammar(
           "element  = startTag (element / text)* endTag
            startTag = '<' name:[a-z]i+ '>'
            endTag   = '</' name:[a-z]i+ '>'
            text     = [^<]+")

html := "<html><head><title>Pegger Example</title></head><body><p>Parsing is Easy!</p></body></html>"

grammar["element"].match(html).dump
Which outputs the following result tree:
element
 ├─ startTag
 │   └─ name : "html"
 ├─ element
 │   ├─ startTag
 │   │   └─ name : "head"
 │   ├─ element
 │   │   ├─ startTag
 │   │   │   └─ name : "title"
 │   │   ├─ text : "Pegger Example"
 │   │   └─ endTag
 │   │       └─ name : "title"
 │   └─ endTag
 │       └─ name : "head"
 ├─ element
 │   ├─ startTag
 │   │   └─ name : "body"
 │   ├─ element
 │   │   ├─ startTag
 │   │   │   └─ name : "p"
 │   │   ├─ text : "Parsing is Easy!"
 │   │   └─ endTag
 │   │       └─ name : "p"
 │   └─ endTag
 │       └─ name : "body"
 └─ endTag
     └─ name : "html"
Parsing has never been easier!
Note that only Chuck Norris can parse HTML with regular expressions.
Converting the match results into XML is left as an exercise for the user, but there are a couple of options open to you:
1. Looping
This is usually the easiest way to convert your match results, but not always the cleanest.
It involves looping over the match results the same as you would with any other list, and recursively calling yourself to convert inner data.
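A minimal looping sketch that converts an element match from the HTML example back into an XML string. It assumes, purely for illustration, that Match exposes its rule or label name via a name field, alongside the matches list used in the walking example below - adjust to the actual API:

Str toXml(Match element) {
    tag  := element["startTag"]["name"].toStr
    body := StrBuf()

    // loop over the child matches, recursing into nested elements
    element.matches.each |child| {
        if (child.name == "element")
            body.add(toXml(child))
        if (child.name == "text")
            body.add(child.toStr)
    }
    return "<${tag}>${body}</${tag}>"
}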
2. Walking
Create a walk() method that recursively calls a given function every time it steps into, or out of, a match:
Void walk(Match match, |Match, Str startOrEnd| fn) {
    fn(match, "start")
    match.matches.each { walk(it, fn) }
    fn(match, "end")
}
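A usage sketch, re-using the grammar and html variables from the HTML example above to print an indented outline of the result tree:

match := grammar["element"].match(html)
depth := 0

walk(match) |m, pos| {
    if (pos == "start") {
        // print the matched text, indented by tree depth
        echo(Str.spaces(depth * 2) + m.toStr)
        depth++
    } else {
        depth--
    }
}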
Custom Rules
Custom rules may be created by subclassing the Rule class.
There is (currently) no support for introducing custom rules in PEG grammar notation, so custom rules may only be used in programmatic Fantom-based grammars.
Debugging
By enabling debug logging, Pegger will spew out a lot of debug / trace information. (Possibly more than you can handle!) But note it will only emit debug information for rules with names or labels.
Enable debug logging with the line:
Peg.debugOn

// or

Log.get("afPegger").level = LogLevel.debug
Which, for the above html parsing example, will generate the following:
[afPegger] [ 1] --> element - Processing: startTag (element / text)* endTag with: <html><title>Pegger Ex...
[afPegger] [ 2] --> startTag - Processing: "<" [a-zA-Z]+ ">" with: <html><title>Pegger Ex...
[afPegger] [ 2]   > startTag - Passed!
[afPegger] [ 2]   > startTag - Matched: "<html>"
[afPegger] [ 4] --> element - Processing: startTag (element / text)* endTag with: <title>Pegger Example<...
[afPegger] [ 5] --> startTag - Processing: "<" [a-zA-Z]+ ">" with: <title>Pegger Example<...
[afPegger] [ 5]   > startTag - Passed!
[afPegger] [ 5]   > startTag - Matched: "<title>"
[afPegger] [ 7] --> element - Processing: startTag (element / text)* endTag with: Pegger Example</title>...
[afPegger] [ 8] --> startTag - Processing: "<" [a-zA-Z]+ ">" with: Pegger Example</title>...
[afPegger] [ 8]   > startTag - Did not match "<".
[afPegger] [ 8]   > startTag - Failed. Rolling back.
[afPegger] [ 7]   > element - Did not match startTag.
[afPegger] [ 7]   > element - Failed. Rolling back.
[afPegger] [ 7] --> text - Processing: (!"<" .)+ with: Pegger Example</title>...
[afPegger] [ 7]   > text - Rule was successfully processed 14 times
[afPegger] [ 7]   > text - Passed!
[afPegger] [ 7]   > text - Matched: "Pegger Example"

...
Design notes
Pegger was purposefully designed to match strings, or more specifically, Unicode character data (of variable byte length dependent on encoding) - not binary data. Hence it lacks notation to match bytes and byte ranges as seen in some RFC documents.
If required, hexadecimal Unicode escape sequences (\uXXXX) may be used to represent 8-bit binary data.