Introduction to Pyparsing: An Object-oriented Easy-to-Use Toolkit for Building Recursive Descent Parsers

Author: Paul McGuire <ptmcg@austin.rr.com>
Version: 2.1

Agenda

What is Pyparsing for?

Extracting data from

The Zen of Pyparsing

Don't clutter up the parser grammar with whitespace, just handle it! (likewise for comments)

Grammars must tolerate change, as the grammar evolves or the input text becomes more challenging.

Grammars do not have to be exhaustive to be useful.

Simple grammars can return lists; complex grammars need named results.

Class names are easier to read and understand than specialized typography.

Parsers can (and sometimes should) do more than tokenize.

The Zen of Pyparsing Development

Grammars should be:

No separate code-generation step.

No imposed naming conventions.

Stay Pure Python.

Lightweight packaging (single .py source file)

Liberally licensed. (MIT License - free for commercial or non-commercial use)

Two Kinds of Parsing Applications

Design-driven:

language -> BNF -> parser impl --+->
 concept     ^                   |
             |   refine/extend   |
             +---  language    --+

Data-driven:

gather    determine
sample ->   text    -> BNF -> parser ---+->
inputs     patterns                     |
             ^                          |
             |         gather new       |
             +----- (non-conforming) ---+
                         inputs

Pyparsing "Hello World"

Compose grammar:

# the parentheses let the expression span several lines
greeting = ( oneOf("Hello Ahoy Yo Hi") +
             Literal(",") +
             Word( string.ascii_uppercase, string.ascii_lowercase ) +
             Literal("!") )

Call parseString():

print greeting.parseString("Hello, World!")
print greeting.parseString("Ahoy, Matey !")
print greeting.parseString("Yo,Adrian!")

Prints:

['Hello', ',', 'World', '!']
['Ahoy', ',', 'Matey', '!']
['Yo', ',', 'Adrian', '!']

Basic Pyparsing

Words and Literals

Basic Pyparsing (2)

Combining

Basic Pyparsing (new)

Combining

Basic Pyparsing (3)

Repetition and Collection

Whitespace is not matched as part of the grammar:

Literal("name") + Literal("(") + Literal("x") + Literal(")")
    matches "name(x)", "name (x)", "name( x )", "name ( x )"
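
The whitespace behaviour described above can be checked with a minimal sketch. The names `call` and `variants` are illustrative, and the parenthesized `print()` calls are used only so the snippet also runs on Python 3:

```python
from pyparsing import Literal

# Same expression as above; no whitespace handling appears in the grammar.
call = Literal("name") + Literal("(") + Literal("x") + Literal(")")

variants = ["name(x)", "name (x)", "name( x )", "name ( x )"]

# Every spacing variant yields the same four tokens.
results = [call.parseString(text).asList() for text in variants]
for text, tokens in zip(variants, results):
    print(repr(text), "->", tokens)
```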

Helpful Pyparsing Built-ins

Helpful Pyparsing Built-ins (2)

Helpful Pyparsing Built-ins (new)

Helpful Pyparsing Built-ins (3)

Intermediate Pyparsing

SkipTo - "wildcard" advance to expression

Suppress - match, but suppress matching results; useful for punctuation, delimiters

NotAny - negative lookahead (can also use unary '~' operator)

FollowedBy - assertive lookahead
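
A small sketch of the four classes listed above, one short expression each. The grammars and input strings are made up for illustration and do not come from the original slides:

```python
from pyparsing import (FollowedBy, Keyword, ParseException, SkipTo,
                       Suppress, Word, alphas, nums)

# Suppress: "=" must be present in the input but is dropped from the results.
assignment = Word(alphas) + Suppress("=") + Word(nums)
assigned = assignment.parseString("x = 42").asList()

# SkipTo: advance to the next ";" and keep the skipped text as one token.
statement = SkipTo(";")("body") + Suppress(";")
body = statement.parseString("a, b, c;").body

# NotAny (unary ~): an identifier is any word that is not the keyword "end".
END = Keyword("end")
identifier = ~END + Word(alphas)
ident_ok = identifier.parseString("ending").asList()
try:
    identifier.parseString("end")
    keyword_rejected = False
except ParseException:
    keyword_rejected = True

# FollowedBy: require a ":" after the word without consuming it.
label = Word(alphas) + FollowedBy(":")
label_tokens = label.parseString("key: value").asList()

print(assigned, repr(body), ident_ok, keyword_rejected, label_tokens)
```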

Intermediate Pyparsing (2)

Other ParserElement methods

Advanced Pyparsing

Forward - to define recursive grammars - example in just a few slides...

Regex - define regular expression to match

For example:

Regex(r"(\+|-)?\d+")

is equivalent to:

Combine( Optional(oneOf("+ -")) + Word(nums) )

Named re groups in the Regex string are also preserved as ParseResults names

(Don't be too hasty to optimize, though...)
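
The equivalence claimed above, and the preservation of named groups, can be checked with a short sketch. The sample inputs and the `dated` expression are illustrative assumptions:

```python
from pyparsing import Combine, Optional, Regex, Word, nums, oneOf

regex_int = Regex(r"(\+|-)?\d+")
combined_int = Combine(Optional(oneOf("+ -")) + Word(nums))

samples = ["42", "-17", "+3"]
regex_tokens = [regex_int.parseString(s).asList() for s in samples]
combined_tokens = [combined_int.parseString(s).asList() for s in samples]
print(regex_tokens == combined_tokens)

# Named groups in the pattern become results names on the ParseResults.
dated = Regex(r"(?P<year>\d{4})-(?P<month>\d{2})")
stamp = dated.parseString("2006-02")
print(stamp.year, stamp.month)
```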

Advanced Pyparsing (2)

Dict - creates named fields using input data for field names (see pyparsing examples)

StringStart, StringEnd, LineStart, LineEnd - positional expressions for line-break dependent expressions

White - (sigh...) if you must parse whitespace, use this class
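
As a rough illustration of Dict, the sketch below uses the first token of each group as a field name. The key=value input format and the names `entry` and `settings` are assumptions made for the example:

```python
from pyparsing import Dict, Group, OneOrMore, Suppress, Word, alphas, nums

# Each entry is a group of [key, value]; Dict turns the key into a results name.
entry = Group(Word(alphas) + Suppress("=") + Word(nums))
settings = Dict(OneOrMore(entry))

result = settings.parseString("width=80 height=24")

print(result.asList())      # the grouped tokens are still available as a list
print(result.width)         # attribute-style access by field name
print(result["height"])     # dict-style access by field name
```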

Exceptions

Debugging Pyparsing Grammars

setName(exprName) - useful for exceptions and debugging, to name parser expression (instead of default repr-like string)

For example, compare:

integer = Word(nums)
(integer + integer).parseString("123 J56")
pyparsing.ParseException: Expected W:(0123...) (at char 4), (line:1, col:5)

vs:

integer = Word(nums).setName("integer")
(integer + integer).parseString("123 J56")
pyparsing.ParseException: Expected integer (at char 4), (line:1, col:5)
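
Besides the message text, the exception carries the failure position as attributes. A minimal sketch, assuming Python 3 style `except ... as` syntax and an illustrative variable name `error`:

```python
from pyparsing import ParseException, Word, nums

integer = Word(nums).setName("integer")

try:
    (integer + integer).parseString("123 J56")
    error = None
except ParseException as err:
    error = err

# The name given with setName() appears in the message; the location is
# also available as attributes for custom error reporting.
print(error)
print(error.lineno, error.col, error.loc)
```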

Debugging Pyparsing Grammars (2)

setDebug() - activates logging of all match attempts, failures, and successes; custom debugging callback methods can be passed as optional arguments:

integer.setDebug()
(integer + integer).parseString("123 A6")

prints:

Match integer at loc 0 (1,1)
Matched integer -> ['123']
Match integer at loc 3 (1,4)
Exception raised: Expected integer (at char 4), (line:1, col:5)

Pyparsing Documentation

Pyparsing distribution includes

Pyparsing in Print (new)

Online resources

Pyparsing Applications

Parsing Lists

How to parse the string "['a', 100, 3.14]" back to a Python list

First pass:

lbrack = Literal("[")
rbrack = Literal("]")
integer = Word(nums).setName("integer")
real = Combine(Optional(oneOf("+ -")) + Word(nums) + "." +
               Optional(Word(nums))).setName("real")

listItem = real | integer | quotedString

listExpr = lbrack + delimitedList(listItem) + rbrack

Parsing Lists (2)

test = "['a', 100, 3.14]"

print listExpr.parseString(test)

Returns:

['[', "'a'", '100', '3.14', ']']

The brackets are significant during parsing, but are not interesting as part of the results (just as delimitedList discarded the delimiting commas for us), so suppress them:

lbrack = Suppress("[")
rbrack = Suppress("]")

Parsing Lists (3)

We also want actual values, not strings, and we really don't want to carry the quotes around from our quoted strings, so use parse actions for conversions:

cvtInt = lambda s,l,toks: int(toks[0])
integer.setParseAction( cvtInt )

cvtReal = lambda s,l,toks: float(toks[0])
real.setParseAction( cvtReal )

listItem = real | integer | quotedString.setParseAction(removeQuotes)

Updated version now returns:

['a', 100, 3.1400000000000001]

Parsing Lists (4)

How would we handle nested lists?

We'd like to expand listItem to accept listExpr's, but this creates a chicken-and-egg problem - listExpr is defined in terms of listItem.

Solution: Forward declare listExpr before listItem, using Forward():

listExpr = Forward()

Add listExpr as another type of listItem - enclose as Group to preserve nesting:

listItem = real | integer | quotedString.setParseAction(removeQuotes) \
            | Group(listExpr)

Finally, define listExpr using '<<' operator instead of '=' assignment:

listExpr << ( lbrack + delimitedList(listItem) + rbrack )

Parsing Lists (5)

Using a nested list as input:

test = "['a', 100, 3.14, [ +2.718, 'xyzzy', -1.414] ]"

We now get returned:

['a', 100, 3.1400000000000001, [2.718, 'xyzzy', -1.4139999999999999]]
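
For readers who want to try the nested grammar directly, here is a self-contained sketch of the steps so far. It departs from the slides in two small ways, both assumptions of this sketch: it copies quotedString before attaching the parse action (so the shared built-in is not modified), and it converts the results with asList() so they can be compared with plain Python lists:

```python
from pyparsing import (Combine, Forward, Group, Optional, Suppress, Word,
                       delimitedList, nums, oneOf, quotedString, removeQuotes)

lbrack = Suppress("[")
rbrack = Suppress("]")
integer = Word(nums).setName("integer")
integer.setParseAction(lambda s, l, toks: int(toks[0]))
real = Combine(Optional(oneOf("+ -")) + Word(nums) + "." +
               Optional(Word(nums))).setName("real")
real.setParseAction(lambda s, l, toks: float(toks[0]))
quoted = quotedString.copy().setParseAction(removeQuotes)

# Forward-declare listExpr so that listItem can refer to it.
listExpr = Forward()
listItem = real | integer | quoted | Group(listExpr)
listExpr << (lbrack + delimitedList(listItem) + rbrack)

test = "['a', 100, 3.14, [ +2.718, 'xyzzy', -1.414] ]"
nested = listExpr.parseString(test).asList()
print(nested)
```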

Parsing Lists (6)

Entire program:

cvtInt = lambda s,l,toks: int(toks[0])
cvtReal = lambda s,l,toks: float(toks[0])

lbrack = Suppress("[")
rbrack = Suppress("]")
integer = Word(nums).setName("integer").setParseAction( cvtInt )
real = Combine(Optional(oneOf("+ -")) + Word(nums) + "." +
               Optional(Word(nums))).setName("real").setParseAction( cvtReal )

listExpr = Forward()
listItem = real | integer | quotedString.setParseAction(removeQuotes) \
            | Group(listExpr)
listExpr << ( lbrack + delimitedList(listItem) + rbrack )

test = "['a', 100, 3.14, [ +2.718, 'xyzzy', -1.414] ]"
print listExpr.parseString(test)

HTML Scraping

Want to extract all links from a web page to external URLs.

Basic form is:

<a href="www.blah.com/xyzzy.html">Zork magic words</a>

But anchor references are not always so clean looking:

HTML Scraping (2)

Use pyparsing makeHTMLTags helper method:

aStart,aEnd = makeHTMLTags("A")

Compose suitable expression:

link = aStart + SkipTo(aEnd)("link") + aEnd

Skip any links found in HTML comments:

link.ignore(htmlComment)

Use scanString to search through HTML page - avoids need to define complete HTML grammar:

for toks,start,end in link.scanString(htmlSource):
    print toks.link, "->", toks.startA.href
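
The same steps can be tried without a network connection by scanning an inline string. The HTML below is invented for illustration; it includes one commented-out link and one upper-case tag with an unquoted attribute value:

```python
from pyparsing import SkipTo, htmlComment, makeHTMLTags

htmlSource = """
<a href="www.blah.com/xyzzy.html">Zork magic words</a>
<!-- <a href="hidden.html">commented out</a> -->
<A HREF=r/m1 target="_top">Mail</A>
"""

aStart, aEnd = makeHTMLTags("A")
link = aStart + SkipTo(aEnd)("link") + aEnd
link.ignore(htmlComment)

# Collect (link text, href) pairs; the commented-out anchor is skipped.
found = []
for toks, start, end in link.scanString(htmlSource):
    found.append((toks.link, toks.startA.href))
    print(toks.link, "->", toks.startA.href)
```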

HTML Scraping (3)

Complete program:

from pyparsing import makeHTMLTags,SkipTo,htmlComment
import urllib

serverListPage = urllib.urlopen( "http://www.yahoo.com" )
htmlText = serverListPage.read()
serverListPage.close()

aStart,aEnd = makeHTMLTags("A")

link = aStart + SkipTo(aEnd)("link") + aEnd
link.ignore(htmlComment)

for toks,start,end in link.scanString(htmlText):
    print toks.link, "->", toks.startA.href

HTML Scraping (4)

Results:

<img src="http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/ml.gif" width=36 height=36 border=0 alt=""><br><nobr><font face=verdana size=-2 color=#000000>Mail</font></nobr> -> r/m1
<img src="http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/my.gif" width=36 height=36 border=0 alt=""><br><nobr><font face=verdana size=-2 color=#000000>My Yahoo!</font></nobr> -> r/i1
<img src="http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/msgn.gif" width=36 height=36 border=0 alt=""><br><nobr><font face=verdana size=-2 color=#000000>Messenger</font></nobr> -> r/p1
Advanced -> r/so
My Web -> r/mk
Answers <img src="http://us.i1.yimg.com/us.yimg.com/i/ww/beta.gif" border="0"> -> r/1l
<b>Yahoo! Health</b>: Make your resolutions a reality. -> s/266569
Lose weight -> s/266570
get fit -> s/266571
and be healthy -> s/266572
<b>Sign In</b> -> r/l6
<b>Sign Up</b> -> r/m7
360&#176; -> r/3t
Autos -> r/cr
Finance -> r/sq
Games -> r/pl

JSON Parser

JSON (JavaScript Object Notation) is a simple serialization format for JavaScript web applications:

object
    { members }
    {}
members
    string : value
    members , string : value
array
    [ elements ]
    []
elements
    value
    elements , value
value
    string | number | object | array | true | false | null

JSON Parser

Keyword constants and basic building blocks:

TRUE = Keyword("true")
FALSE = Keyword("false")
NULL = Keyword("null")

jsonString = dblQuotedString.setParseAction( removeQuotes )
jsonNumber = Combine( Optional('-') + ( '0' | Word('123456789',nums) ) +
                    Optional( '.' + Word(nums) ) +
                    Optional( Word('eE',exact=1) + Word(nums+'+-',nums) ) )

JSON Parser

Compound elements:

jsonObject = Forward()
jsonValue = Forward()
jsonElements = delimitedList( jsonValue )
# Optional(...) admits the empty array [] allowed by the BNF
jsonArray = Group( Suppress('[') + Optional(jsonElements) + Suppress(']') )
jsonValue << ( jsonString | jsonNumber | jsonObject  |
               jsonArray | TRUE | FALSE | NULL )
memberDef = Group( jsonString + Suppress(':') + jsonValue )
jsonMembers = delimitedList( memberDef )
# Optional(...) admits the empty object {} allowed by the BNF
jsonObject << Dict( Suppress('{') + Optional(jsonMembers) + Suppress('}') )

jsonComment = cppStyleComment
jsonObject.ignore( jsonComment )

JSON Parser

Conversion parse actions:

TRUE.setParseAction( replaceWith(True) )
FALSE.setParseAction( replaceWith(False) )
NULL.setParseAction( replaceWith(None) )

jsonString.setParseAction( removeQuotes )

def convertNumbers(s,l,toks):
    n = toks[0]
    try:
        return int(n)
    except ValueError:
        return float(n)

jsonNumber.setParseAction( convertNumbers )
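
A quick check of the number conversion, repeating the jsonNumber definition so the snippet stands alone. The sample inputs are illustrative:

```python
from pyparsing import Combine, Optional, Word, nums

jsonNumber = Combine(Optional("-") + ("0" | Word("123456789", nums)) +
                     Optional("." + Word(nums)) +
                     Optional(Word("eE", exact=1) + Word(nums + "+-", nums)))

def convertNumbers(s, l, toks):
    # Try int first; fall back to float for fractions and exponents.
    n = toks[0]
    try:
        return int(n)
    except ValueError:
        return float(n)

jsonNumber.setParseAction(convertNumbers)

values = [jsonNumber.parseString(text)[0]
          for text in ("-144", "6.02E23", "0.5", "10")]
print(values)
```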

JSON Parser - Test Data

{   "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList":
                [{
                "ID": "SGML",
                "SortAs": "SGML",
                "GlossTerm": "Standard Generalized Markup Language",
                "TrueValue": true,
                "FalseValue": false,
                "IntValue": -144,
                "FloatValue": 6.02E23,
                "NullValue": null,
                "Acronym": "SGML",
                "Abbrev": "ISO 8879:1986",
                "GlossDef": "A meta-markup language, used to create markup languages such as DocBook.",
                "GlossSeeAlso": ["GML", "XML", "markup"]
                }] } } }

JSON Parser - Results

[['glossary',
  ['title', 'example glossary'],
  ['GlossDiv',
   ['title', 'S'],
   ['GlossList',
    [['ID', 'SGML'],
     ['SortAs', 'SGML'],
     ['GlossTerm', 'Standard Generalized Markup Language'],
     ['TrueValue', True],
     ['FalseValue', False],
     ['IntValue', -144],
     ['FloatValue', 6.02e+023],
     ['NullValue', None],
     ['Acronym', 'SGML'],
     ['Abbrev', 'ISO 8879:1986'],
     ['GlossDef',
      'A meta-markup language, used to create markup languages such as DocBook.'],
     ['GlossSeeAlso', ['GML', 'XML', 'markup']]]]]]]

JSON Parser - Results

Accessing results by hierarchical path name:

print results.glossary.title
print results.glossary.GlossDiv.GlossList.keys()
print results.glossary.GlossDiv.GlossList.ID
print results.glossary.GlossDiv.GlossList.FalseValue
print results.glossary.GlossDiv.GlossList.Acronym

example glossary
['GlossSeeAlso', 'GlossDef', 'FalseValue', 'Acronym', 'GlossTerm',
 'TrueValue', 'FloatValue', 'IntValue', 'Abbrev', 'SortAs', 'ID',
 'NullValue']
SGML
False
SGML
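
The path-style access above depends on Dict assigning each member's key as a results name. A condensed sketch follows, with two simplifications that are assumptions of this example rather than part of the parser shown earlier: numbers are limited to integers, and the input is a shortened version of the test data:

```python
from pyparsing import (Combine, Dict, Forward, Group, Keyword, Optional,
                       Suppress, Word, dblQuotedString, delimitedList, nums,
                       removeQuotes, replaceWith)

TRUE = Keyword("true").setParseAction(replaceWith(True))
FALSE = Keyword("false").setParseAction(replaceWith(False))
NULL = Keyword("null").setParseAction(replaceWith(None))

jsonString = dblQuotedString.copy().setParseAction(removeQuotes)
# Simplified for this sketch: integers only.
jsonNumber = Combine(Optional("-") + Word(nums))
jsonNumber.setParseAction(lambda s, l, toks: int(toks[0]))

jsonObject = Forward()
jsonValue = Forward()
jsonArray = Group(Suppress("[") + Optional(delimitedList(jsonValue)) +
                  Suppress("]"))
jsonValue << (jsonString | jsonNumber | jsonObject | jsonArray |
              TRUE | FALSE | NULL)
memberDef = Group(jsonString + Suppress(":") + jsonValue)
jsonObject << Dict(Suppress("{") + Optional(delimitedList(memberDef)) +
                   Suppress("}"))

text = """
{ "glossary": { "title": "example glossary",
                "GlossDiv": { "title": "S", "IntValue": -144 } } }
"""
results = jsonObject.parseString(text)

print(results.glossary.title)
print(results.glossary.GlossDiv.title)
print(results.glossary.GlossDiv.IntValue)
```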

Extracting C Function Declarations

Implementer of a C API needed to parse the library header to extract the function and parameter declarations.

API followed a very predictable form:

int func1(float *arr, int len, double arg1);
int func2(float **arr, float *arr2, int len, double arg1, double arg2);

Was able to define a straightforward grammar using pyparsing:

ident = Word(alphas, alphanums + "_")
vartype = oneOf("float double int char") + Optional(Word("*"))
arglist = delimitedList( Group(vartype + ident) )

functionCall = Literal("int") + ident + "(" + arglist + ")" + ";"

Added results names to simplify access to individual function elements

Extracting C Function Declarations (2)

Complete program:

from pyparsing import *
testdata = """
  int func1(float *arr, int len, double arg1);
  int func2(float **arr, float *arr2, int len, double arg1, double arg2);
  """
ident = Word(alphas, alphanums + "_")
vartype = Combine( oneOf("float double int") + Optional(Word("*")), adjacent = False)
# old style
#arglist = delimitedList( Group(vartype.setResultsName("type") +
#                                ident.setResultsName("name")) )
arglist = delimitedList( Group(vartype("type") + ident("name")) )
functionCall = Literal("int") + ident("name") + "(" + arglist("args") + ")" + ";"

for fn,s,e in functionCall.scanString(testdata):
    print fn.name
    for a in fn.args:
        print " -",a.type, a.name

Extracting C Function Declarations (3)

Results:

func1
 - float* arr
 - int len
 - double arg1
func2
 - float** arr
 - float* arr2
 - int len
 - double arg1
 - double arg2

Who's Using Pyparsing?

Where to Get Pyparsing

Linux distributions:

CheeseShop:

SourceForge:

... or just use easy_install

Summary

Pyparsing is not the solution to every text processing problem, but it is a potential solution for many such problems.
