Friday, July 16, 2010

An ANTLR grammar for parsing XPath 1.0 expressions

Ever been in a situation where you needed to parse XPath 1.0 expressions? I don't mean to evaluate them, but to actually parse them. Perhaps to script an automatic update to hundreds of BPEL processes. That's what I needed to do.

If you ever find yourself in such a situation, then perhaps the ANTLR grammar below will be of some use to you. Enjoy!

(This isn't the only one available on the web, but all the others I tried either had bugs or wouldn't compile with my version of ANTLR; they may have been written for an older version. No guarantees that mine is bug-free, though!)


grammar XPath;

/*
XPath 1.0 grammar. Should conform to the official spec at
http://www.w3.org/TR/1999/REC-xpath-19991116. The grammar
rules have been kept as close as possible to those in the
spec, but some adjustments were unavoidable. These were
mainly removing left recursion (spec seems to be based on
LR), and dealing with the double nature of the '*' token
(node wildcard and multiplication operator). See also
section 3.7 in the spec. These rule changes should make
no difference to the strings accepted by the grammar.

Written by Jan-Willem van den Broek
Version 1.0

Do with this code as you will.
*/

tokens {
  PATHSEP  =  '/';
  ABRPATH  =  '//';
  LPAR  =  '(';
  RPAR  =  ')';
  LBRAC  =  '[';
  RBRAC  =  ']';
  MINUS  =  '-';
  PLUS  =  '+';
  DOT  =  '.';
  MUL  =  '*';
  DOTDOT  =  '..';
  AT  =  '@';
  COMMA  =  ',';
  PIPE  =  '|';
  LESS  =  '<';
  MORE  =  '>';
  LE  =  '<=';
  GE  =  '>=';
  COLON  =  ':';
  CC  =  '::';
  APOS  =  '\'';
  QUOT  =  '\"';
}

main  :  expr
  ;

locationPath
  :  relativeLocationPath
  |  absoluteLocationPathNoroot
  ;

absoluteLocationPathNoroot
  :  '/' relativeLocationPath
  |  '//' relativeLocationPath
  ;

relativeLocationPath
  :  step (('/'|'//') step)*
  ;

step  :  axisSpecifier nodeTest predicate*
  |  abbreviatedStep
  ;

axisSpecifier
  :  AxisName '::'
  |  '@'?
  ;

nodeTest:  nameTest
  |  NodeType '(' ')'
  |  'processing-instruction' '(' Literal ')'
  ;

predicate
  :  '[' expr ']'
  ;

abbreviatedStep
  :  '.'
  |  '..'
  ;

expr  :  orExpr
  ;

primaryExpr
  :  variableReference
  |  '(' expr ')'
  |  Literal
  |  Number  
  |  functionCall
  ;

functionCall
  :  functionName '(' ( expr ( ',' expr )* )? ')'
  ;

unionExprNoRoot
  :  pathExprNoRoot ('|' unionExprNoRoot)?
  |  '/' '|' unionExprNoRoot
  ;

pathExprNoRoot
  :  locationPath
  |  filterExpr (('/'|'//') relativeLocationPath)?
  ;

filterExpr
  :  primaryExpr predicate*
  ;

orExpr  :  andExpr ('or' andExpr)*
  ;

andExpr  :  equalityExpr ('and' equalityExpr)*
  ;

equalityExpr
  :  relationalExpr (('='|'!=') relationalExpr)*
  ;

relationalExpr
  :  additiveExpr (('<'|'>'|'<='|'>=') additiveExpr)*
  ;

additiveExpr
  :  multiplicativeExpr (('+'|'-') multiplicativeExpr)*
  ;

multiplicativeExpr
  :  unaryExprNoRoot (('*'|'div'|'mod') multiplicativeExpr)?
  |  '/' (('div'|'mod') multiplicativeExpr)?
  ;

unaryExprNoRoot
  :  '-'* unionExprNoRoot
  ;

qName  :  nCName (':' nCName)?
  ;

functionName
  :  qName  // Does not match nodeType, as per spec.
  ;

variableReference
  :  '$' qName
  ;

nameTest:  '*'
  |  nCName ':' '*'
  |  qName
  ;

nCName  :  NCName
  |  AxisName
  ;

NodeType:  'comment'
  |  'text'
  |  'processing-instruction'
  |  'node'
  ;
  
Number  :  Digits ('.' Digits?)?
  |  '.' Digits
  ;

fragment
Digits  :  ('0'..'9')+
  ;

AxisName:  'ancestor'
  |  'ancestor-or-self'
  |  'attribute'
  |  'child'
  |  'descendant'
  |  'descendant-or-self'
  |  'following'
  |  'following-sibling'
  |  'namespace'
  |  'parent'
  |  'preceding'
  |  'preceding-sibling'
  |  'self'
  ;

Literal  :  '"' ~'"'* '"'
  |  '\'' ~'\''* '\''
  ;

Whitespace
  :  (' '|'\t'|'\n'|'\r')+ {$channel = HIDDEN;}
  ;

NCName  :  NCNameStartChar NCNameChar*
  ;

fragment
NCNameStartChar
  :  'A'..'Z'
  |   '_'
  |  'a'..'z'
  |  '\u00C0'..'\u00D6'
  |  '\u00D8'..'\u00F6'
  |  '\u00F8'..'\u02FF'
  |  '\u0370'..'\u037D'
  |  '\u037F'..'\u1FFF'
  |  '\u200C'..'\u200D'
  |  '\u2070'..'\u218F'
  |  '\u2C00'..'\u2FEF'
  |  '\u3001'..'\uD7FF'
  |  '\uF900'..'\uFDCF'
  |  '\uFDF0'..'\uFFFD'
// Unfortunately, Java escapes can't handle this conveniently,
// as they're limited to 4 hex digits. TODO.
//  |  '\U010000'..'\U0EFFFF'
  ;

fragment
NCNameChar
  :  NCNameStartChar | '-' | '.' | '0'..'9'
  |  '\u00B7' | '\u0300'..'\u036F'
  |  '\u203F'..'\u2040'
  ;
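
The double nature of '*' mentioned in the grammar's header comment is easiest to see in the rule from section 3.7 of the spec: a '*' is a multiplication operator only if there is a preceding token and that token is not one of '@', '::', '(', '[', ',' or an operator. The same rule decides whether 'and', 'or', 'mod' and 'div' are operators or plain names. Below is a minimal sketch of that rule in Java. It is not how the grammar above (or ANTLR) handles it; the class name, method name and labels are made up for illustration, and the input is assumed to be already split into tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, for illustration only: labels each token according to
// the disambiguation rule in section 3.7 of the XPath 1.0 spec.
class StarDisambiguation {
    // Punctuation after which '*' is NOT read as an operator.
    private static final Set<String> PUNCT =
        new HashSet<>(Arrays.asList("@", "::", "(", "[", ","));
    // Operators that are always operators.
    private static final Set<String> SYMBOL_OPS =
        new HashSet<>(Arrays.asList("/", "//", "|", "+", "-", "=", "!=", "<", "<=", ">", ">="));
    // Names that are operators only in operator position.
    private static final Set<String> NAME_OPS =
        new HashSet<>(Arrays.asList("and", "or", "mod", "div"));

    static List<String> classify(List<String> tokens) {
        List<String> labels = new ArrayList<>();
        // True when there is no preceding token, or the preceding token is
        // one of the PUNCT tokens or an operator.
        boolean operandExpected = true;
        for (String t : tokens) {
            String label;
            if (t.equals("*")) {
                label = operandExpected ? "WILDCARD" : "MULTIPLY";
            } else if (NAME_OPS.contains(t)) {
                label = operandExpected ? "NAME" : "OPERATOR";
            } else if (SYMBOL_OPS.contains(t)) {
                label = "OPERATOR";
            } else if (PUNCT.contains(t)) {
                label = "PUNCT";
            } else {
                label = "OTHER";
            }
            operandExpected = label.equals("MULTIPLY")
                || label.equals("OPERATOR")
                || label.equals("PUNCT");
            labels.add(label);
        }
        return labels;
    }

    public static void main(String[] args) {
        // Tokens of the expression child::* * 2
        System.out.println(classify(Arrays.asList("child", "::", "*", "*", "2")));
        // prints [OTHER, PUNCT, WILDCARD, MULTIPLY, OTHER]
    }
}
```

Tracking a single boolean is enough here because the rule only ever looks at the one preceding token.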

24 comments:

tinne said...

Supporting supplementary characters is as easy as this:

[code]
fragment
NCNameStartChar
: 'A'..'Z'
| '_'
| 'a'..'z'
| '\u00C0'..'\u00D6'
| '\u00D8'..'\u00F6'
| '\u00F8'..'\u02FF'
| '\u0370'..'\u037D'
| '\u037F'..'\u1FFF'
| '\u200C'..'\u200D'
| '\u2070'..'\u218F'
| '\u2C00'..'\u2FEF'
| '\u3001'..'\uD7FF'
| '\uF900'..'\uFDCF'
| '\uFDF0'..'\uFFFD'
| PermittedHighSurrogateChar LowSurrogateChar
;

// Unicode supplementary character support added
// 2011-06-18 Karsten Tinnefeld
// cf. http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp
// [#x10000-#xEFFFF] is [#xD800 #xDC00-#xDB7F #xDFFF] in UTF-16
// cf. http://en.wikipedia.org/wiki/UTF-16/UCS-2
// cf. http://www.w3.org/TR/xml11/#sec-common-syn
fragment
PermittedHighSurrogateChar
: '\uD800'..'\uDB7F'
;

fragment
LowSurrogateChar
: '\uDC00'..'\uDFFF'
;
[/code]
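
The surrogate bounds in the comment above can be double-checked with a few lines of Java. This is only a sanity check of the arithmetic, assuming the lexer sees the input as UTF-16 code units; the class and method names are made up for illustration.

```java
// Hypothetical helper, for illustration only: prints the UTF-16 code units
// of the first and last code point in the range [#x10000-#xEFFFF].
class SurrogateRange {
    // Returns the UTF-16 code units of a code point as uppercase hex,
    // separated by a space (one unit for BMP characters, two otherwise).
    static String utf16Units(int codePoint) {
        char[] units = Character.toChars(codePoint);
        StringBuilder sb = new StringBuilder();
        for (char u : units) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) u));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf16Units(0x10000)); // prints D800 DC00
        System.out.println(utf16Units(0xEFFFF)); // prints DB7F DFFF
    }
}
```

The two fragment ranges multiply out to 0x380 high surrogates times 0x400 low surrogates, which is 0xE0000 pairs: exactly the number of code points from #x10000 to #xEFFFF.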