Author: | Paul McGuire <ptmcg@austin.rr.com> |
---|---|
Version: | 2.1 |
Extracting data from
Don't clutter up the parser grammar with whitespace, just handle it! (likewise for comments)
Grammars must tolerate change, as grammar evolves or input text becomes more challenging.
Grammars do not have to be exhaustive to be useful.
Simple grammars can return lists; complex grammars need named results.
Class names are easier to read and understand than specialized typography.
Parsers can (and sometimes should) do more than tokenize.
Grammars should be:
No separate code-generation step.
No imposed naming conventions.
Stay Pure Python.
Lightweight packaging (single .py source file)
Liberally licensed. (MIT License - free for commercial or non-commercial use)
Design-driven:
language -> BNF -> parser impl --+-> concept ^ | | refine/extend | +--- language --+
Data-driven:
gather determine sample -> text -> BNF -> parser ---+-> inputs patterns | ^ | | gather new | +----- (non-conforming) ---+ inputs
Compose grammar:
greeting = oneOf("Hello Ahoy Yo Hi") + Literal(",") + Word( string.uppercase, string.lowercase ) + Literal("!")
Call parseString():
print greeting.parseString("Hello, World!") print greeting.parseString("Ahoy, Matey !") print greeting.parseString("Yo,Adrian!") ['Hello', ',', 'World', '!'] ['Ahoy', ',', 'Matey', '!'] ['Yo', ',', 'Adrian', '!']
Words and Literals
Word:
Word("ABC","def") matches "C", "Added", "Beef", but not "BB", "ACE", "ADD"
Literal:
Literal("name(x)") matches "name(x)", but not "name (x)"
CaselessLiteral:
CaselessLiteral("ABC") matches "abc", "ABC", "AbC" returns "ABC" in all cases
Combining
And (operator +) - must match all (ignoring whitespace):
assignment = varName + Literal("=") + arithExpression
or:
assignment = varName + "=" + arithExpression
Or (operator ^) - match longest:
qty = integerNum ^ realNum
MatchFirst (operator |) - match first:
arithOp = Literal("+") | Literal("-") | Literal("X") | Literal("/") | Literal("^") | Literal("%")
Combining
Each (operator &) - like And, must match all, but in any order:
cmd = instruction + (Optional(qualA) & Optional(qualB) & Optional(qualC))
Repetition and Collection
Whitespace is not matched as part of the grammar:
Literal("name") + Literal("(") + Literal("x") + Literal(")") matches "name(x)", "name (x)", "name( x )", "name ( x )"
oneOf("string of space-separated literal strings"):
matches "strings", "string", "of", "space-separated", or "literal" arithOp = oneOf("+ - X / ^ %")
delimitedList(expression, delim=','):
matches expression or expression,expression,...
but not for items separated just by whitespace:
delimitedList(expression, ' ') # wrong OneOrMore(expression) # right
quotedString, dblQuotedString, sglQuotedString:
matches "abc", 'def', "string with escaped \" character"
makeHTMLTags("A") - returns beginning and ending tags:
beginning matches "<A>", "<a href="blah">", "<A/>" ending matches "</A>", "</a>"
makeXMLTags("BODY") - like makeHTMLTags, but with tighter case matching:
beginning matches "<BODY>", "<BODY align="top">" but not "<body>" or "<Body>" ending matches "</BODY>", but not "</body>"
Typical comment forms:
cStyleComment /* block comment, spans lines */ dblSlashComment // everything up to end of line cppStyleComment = cStyleComment | dblSlashComment htmlComment <!-- commented out HTML --> pythonComment # everything up to end of line
nestedExpr() - quick expression to match nested ()'s, {}'s, []'s, <>'s, etc. - returns parsed data in nested structure matching nesting of input string
countedArray(expr) - matches integer followed by 'n' expr:
countedArray(Word(alphas))
matches:
4 Bob Joe Fred Lloyd -> ['Bob', 'Joe', 'Fred', 'Lloyd']
matchPreviousLiteral / matchPreviousExpr:
integer = Word(nums) xypair = Combine(integer + ',' + integer) duplicateXYs = xypair + OneOrMore(',' + matchPreviousLiteral(xypair))
Strings of common character sets for Word definitions:
alphas - A-Za-z alphas8bit - accented and umlauted characters ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ *punc8bit - other 8-bit non-alpha characters* ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿×÷ nums - 0-9 alphanums = alphas + nums hexnums = 0-9A-Fa-f printables = string.printables, minus space character srange('A-Za-z') same as 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcde...'
SkipTo - "wildcard" advance to expression
Suppress - match, but suppress matching results; useful for punctuation, delimiters
NotAny - negative lookahead (can also use unary '~' operator)
FollowedBy - assertive lookahead
Other ParserElement methods
Forward - to define recursive grammars - example in just a few slides...
For example:
Regex(r"(\+|-)?\d+")
is equivalent to:
Combine( Optional(oneOf("+ -")) + Word(nums) )
Named re groups in the Regex string are also preserved as ParseResults names
(Don't be too hasty to optimize, though...)
operatorPrecedence - function to build up arithmetic expression parsers, including handling of precedence of operations:
operand = realNum | integer | varName expr = operatorPrecedence( operand, [(oneOf("+ -"), 1, opAssoc.RIGHT), ("!", 1, opAssoc.LEFT), (oneOf("* /"), 2, opAssoc.LEFT), (oneOf("+ -"), 2, opAssoc.LEFT),] )
simpler than:
expr = Forward() factor = Optional((oneOf("+ -")) + ( operand | Group('(' + expr + ')') ) fact1 = factor + Optional("!") term = fact1 + ZeroOrMore( (oneOf("* /") + fact1 ) expr << term + ZeroOrMore( (oneOf("+ -") + term )
Dict - creates named fields using input data for field names (see pyparsing examples)
StringStart, StringEnd, LineStart, LineEnd - positional expressions for line-break dependent expressions
White - (sigh...) if you must parse whitespace, use this class
Exceptions
setName(exprName) - useful for exceptions and debugging, to name parser expression (instead of default repr-like string)
For example, compare:
integer = Word(nums) (integer + integer).parseString("123 J56") pyparsing.ParseException: Expected W:(0123...) (at char 4), (line:1, col:5)
vs:
integer = Word(nums).setName("integer") (integer + integer).parseString("123 J56") pyparsing.ParseException: Expected integer (at char 4), (line:1, col:5)
setDebug() - activates logging of all match attempts, failures, and successes; custom debugging callback methods can be passed as optional arguments:
integer.setDebug() (integer + integer).parseString("123 A6")
prints:
Match integer at loc 0 (1,1) Matched integer -> ['123'] Match integer at loc 3 (1,4) Exception raised: Expected integer (at char 4), (line:1, col:5)
Pyparsing distribution includes
Getting Started with Pyparsing - O'Reilly e-book
Python Magazine, May, 2008 - Writing a Brainf*ck Interpreter/Compiler using Pyparsing
Python Magazine, August, 2008 - Advanced Pyparsing: Writing a JSON Interpreter using Dynamic Results Names
SourceForge pyparsing support forum (RSS feed via Gmane)
comp.lang.python (but not always best)
ONLamp article, Building Recursive Descent Parsers with Python
How to parse the string "['a', 100, 3.14]" back to a Python list
First pass:
lbrack = Literal("[") rbrack = Literal("]") integer = Word(nums).setName("integer") real = Combine(Optional(oneOf("+ -")) + Word(nums) + "." + Optional(Word(nums))).setName("real") listItem = real | integer | quotedString listExpr = lbrack + delimitedList(listItem) + rbrack
test = "['a', 100, 3.14]" print listExpr.parseString(test)
Returns:
['[', "'a'", '100', '3.14', ']']
Brackets []'s are significant during parsing, but are not interesting as part of the results (just as delimitedList discarded the delimiting commas for us):
lbrack = Suppress("[") rbrack = Suppress("]")
We also want actual values, not strings, and we really don't want to carry the quotes around from our quoted strings, so use parse actions for conversions:
cvtInt = lambda s,l,toks: int(toks[0]) integer.setParseAction( cvtInt ) cvtReal = lambda s,l,toks: float(toks[0]) real.setParseAction( cvtReal ) listItem = real | integer | quotedString.setParseAction(removeQuotes)
Updated version now returns:
['a', 100, 3.1400000000000001]
How would we handle nested lists?
We'd like to expand listItem to accept listExpr's, but this creates a chicken-and-egg problem - listExpr is defined in terms of listItem.
Solution: Forward declare listExpr before listItem, using Forward():
listExpr = Forward()
Add listExpr as another type of listItem - enclose as Group to preserve nesting:
listItem = real | integer | quotedString.setParseAction(removeQuotes) \ | Group(listExpr)
Finally, define listExpr using '<<' operator instead of '=' assignment:
listExpr << ( lbrack + delimitedList(listItem) + rbrack )
Using a nested list as input:
test = "['a', 100, 3.14, [ +2.718, 'xyzzy', -1.414] ]"
We now get returned:
['a', 100, 3.1400000000000001, [2.718, 'xyzzy', -1.4139999999999999]]
Entire program:
cvtInt = lambda s,l,toks: int(toks[0]) cvtReal = lambda s,l,toks: float(toks[0]) lbrack = Suppress("[") rbrack = Suppress("]") integer = Word(nums).setName("integer").setParseAction( cvtInt ) real = Combine(Optional(oneOf("+ -")) + Word(nums) + "." + Optional(Word(nums))).setName("real").setParseAction( cvtReal ) listExpr = Forward() listItem = real | integer | quotedString.setParseAction(removeQuotes) | Group(listExpr) listExpr << ( lbrack + delimitedList(listItem) + rbrack ) test = "['a', 100, 3.14, [ +2.718, 'xyzzy', -1.414] ]" print listExpr.parseString(test)
Want to extract all links from a web page to external URLs.
Basic form is:
<a href="www.blah.com/xyzzy.html">Zork magic words</a>
But anchor references are not always so clean looking:
Use pyparsing makeHTMLTags helper method:
aStart,aEnd = makeHTMLTags("A")
Compose suitable expression:
link = aStart + SkipTo(aEnd)("link") + aEnd
Skip any links found in HTML comments:
link.ignore(htmlComment)
Use scanString to search through HTML page - avoids need to define complete HTML grammar:
for toks,start,end in link.scanString(htmlSource): print toks.link, "->", toks.startA.href
Complete program:
from pyparsing import makeHTMLTags,SkipTo,htmlComment import urllib serverListPage = urllib.urlopen( "http://www.yahoo.com" ) htmlText = serverListPage.read() serverListPage.close() aStart,aEnd = makeHTMLTags("A") link = aStart + SkipTo(aEnd)("link") + aEnd link.ignore(htmlComment) for toks,start,end in link.scanString(htmlText): print toks.link, "->", toks.startA.href
Results:
<img src="http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/ml.gif" width=36 height=36 border=0 alt=""><br><nobr><font face=verdana size=-2 color=#000000>Mail</font></nobr> -> r/m1 <img src="http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/my.gif" width=36 height=36 border=0 alt=""><br><nobr><font face=verdana size=-2 color=#000000>My Yahoo!</font></nobr> -> r/i1 <img src="http://us.i1.yimg.com/us.yimg.com/i/ww/bt1/msgn.gif" width=36 height=36 border=0 alt=""><br><nobr><font face=verdana size=-2 color=#000000>Messenger</font></nobr> -> r/p1 Advanced -> r/so My Web -> r/mk Answers <img src="http://us.i1.yimg.com/us.yimg.com/i/ww/beta.gif" border="0"> -> r/1l <b>Yahoo! Health</b>: Make your resolutions a reality. -> s/266569 Lose weight -> s/266570 get fit -> s/266571 and be healthy -> s/266572 <b>Sign In</b> -> r/l6 <b>Sign Up</b> -> r/m7 360° -> r/3t Autos -> r/cr Finance -> r/sq Games -> r/pl
JSON (JavaScript Object Notation) is a simple serialization format for JavaScript web applications:
object { members } {} members string : value members , string : value array [ elements ] [] elements value elements , value value string | number | object | array| true| false | null
Keyword constants and basic building blocks:
TRUE = Keyword("true") FALSE = Keyword("false") NULL = Keyword("null") jsonString = dblQuotedString.setParseAction( removeQuotes ) jsonNumber = Combine( Optional('-') + ( '0' | Word('123456789',nums) ) + Optional( '.' + Word(nums) ) + Optional( Word('eE',exact=1) + Word(nums+'+-',nums) ) )
Compound elements:
jsonObject = Forward() jsonValue = Forward() jsonElements = delimitedList( jsonValue ) jsonArray = Group( Suppress('[') + jsonElements + Suppress(']') ) jsonValue << ( jsonString | jsonNumber | jsonObject | jsonArray | TRUE | FALSE | NULL ) memberDef = Group( jsonString + Suppress(':') + jsonValue ) jsonMembers = delimitedList( memberDef ) jsonObject << Dict( Suppress('{') + jsonMembers + Suppress('}') ) jsonComment = cppStyleComment jsonObject.ignore( jsonComment )
Conversion parse actions:
TRUE.setParseAction( replaceWith(True) ) FALSE.setParseAction( replaceWith(False) ) NULL.setParseAction( replaceWith(None) ) jsonString.setParseAction( removeQuotes ) def convertNumbers(s,l,toks): n = toks[0] try: return int(n) except ValueError, ve: return float(n) jsonNumber.setParseAction( convertNumbers )
{ "glossary": { "title": "example glossary", "GlossDiv": { "title": "S", "GlossList": [{ "ID": "SGML", "SortAs": "SGML", "GlossTerm": "Standard Generalized Markup Language", "TrueValue": true, "FalseValue": false, "IntValue": -144, "FloatValue": 6.02E23, "NullValue": null, "Acronym": "SGML", "Abbrev": "ISO 8879:1986", "GlossDef": "A meta-markup language, used to create markup languages such as DocBook.", "GlossSeeAlso": ["GML", "XML", "markup"] }] } } }
[['glossary', ['title', 'example glossary'], ['GlossDiv', ['title', 'S'], ['GlossList', [['ID', 'SGML'], ['SortAs', 'SGML'], ['GlossTerm', 'Standard Generalized Markup Language'], ['TrueValue', True], ['FalseValue', False], ['IntValue', -144], ['FloatValue', 6.02e+023], ['NullValue', None], ['Acronym', 'SGML'], ['Abbrev', 'ISO 8879:1986'], ['GlossDef', 'A meta-markup language, used to create markup languages such as DocBook.'], ['GlossSeeAlso', ['GML', 'XML', 'markup']]]]]]]
Accessing results by hierarchical path name:
print results.glossary.title print results.glossary.GlossDiv.GlossList.keys() print results.glossary.GlossDiv.GlossList.ID print results.glossary.GlossDiv.GlossList.FalseValue print results.glossary.GlossDiv.GlossList.Acronym example glossary ['GlossSeeAlso', 'GlossDef', 'FalseValue', 'Acronym', 'GlossTerm', 'TrueValue', 'FloatValue', 'IntValue', 'Abbrev', 'SortAs', 'ID', 'NullValue'] SGML False SGML
Implementer of a C API needed to parse the library header to extract the function and parameter declarations.
API followed a very predictable form:
int func1(float *arr, int len, double arg1); int func2(float **arr, float *arr2, int len, double arg1, double arg2);
Was able to define a straightforward grammar using pyparsing:
ident = Word(alphas, alphanums + "_") vartype = oneOf("float double int char") + Optional(Word("*")) arglist = delimitedList( Group(vartype + ident) ) functionCall = Literal("int") + ident + "(" + arglist+ ")" + ";"
Added results names to simplify access to individual function elements
Complete program:
from pyparsing import * testdata = """ int func1(float *arr, int len, double arg1); int func2(float **arr, float *arr2, int len, double arg1, double arg2); """ ident = Word(alphas, alphanums + "_") vartype = Combine( oneOf("float double int") + Optional(Word("*")), adjacent = False) # old style #arglist = delimitedList( Group(vartype.setResultsName("type") + # ident.setResultsName("name")) ) arglist = delimitedList( Group(vartype("type") + ident("name")) ) functionCall = Literal("int") + ident("name") + "(" + arglist("args") + ")" + ";" for fn,s,e in functionCall.scanString(testdata): print fn.name for a in fn.args: print " -",a.type, a.name
Results:
func1 - float* arr - int len - double arg1 func2 - float** arr - float* arr2 - int len - double arg1 - double arg2
Linux distributions:
CheeseShop:
SourceForge:
... or just use easy_install
Pyparsing is not the solution to every text processing problem, but it is a potential solution for many such problems.