Friday, July 16, 2010

An ANTLR grammar for parsing XPath 1.0 expressions

Ever been in a situation where you needed to parse XPath 1.0 expressions? I don't mean to evaluate them, but to actually parse them. Perhaps to script an automatic update to hundreds of BPEL processes. That's what I needed to do.

If you ever find yourself in such a situation, then perhaps the ANTLR grammar below will be of some use to you. Enjoy!

(This isn't the only XPath grammar available on the web, but all the others I tried either had bugs or wouldn't even compile with my version of ANTLR; they may have been written for an older version. No guarantees that my version is bug-free, though!)

grammar XPath;

/*
XPath 1.0 grammar. Should conform to the official spec. The grammar
rules have been kept as close as possible to those in the
spec, but some adjustments were unavoidable. These were
mainly removing left recursion (the spec seems to be based on
LR) and dealing with the double nature of the '*' token
(node wildcard and multiplication operator). See also
section 3.7 in the spec. These rule changes should make
no difference to the strings accepted by the grammar.

Written by Jan-Willem van den Broek
Version 1.0

Do with this code as you will.
*/

tokens {
  PATHSEP  =  '/';
  ABRPATH  =  '//';
  LPAR  =  '(';
  RPAR  =  ')';
  LBRAC  =  '[';
  RBRAC  =  ']';
  MINUS  =  '-';
  PLUS  =  '+';
  DOT  =  '.';
  MUL  =  '*';
  DOTDOT  =  '..';
  AT  =  '@';
  COMMA  =  ',';
  PIPE  =  '|';
  LESS  =  '<';
  MORE  =  '>';
  LE  =  '<=';
  GE  =  '>=';
  COLON  =  ':';
  CC  =  '::';
  APOS  =  '\'';
  QUOT  =  '\"';
}

main  :  expr
  ;

locationPath
  :  relativeLocationPath
  |  absoluteLocationPathNoroot
  ;

absoluteLocationPathNoroot
  :  '/' relativeLocationPath
  |  '//' relativeLocationPath
  ;

relativeLocationPath
  :  step (('/'|'//') step)*
  ;

step  :  axisSpecifier nodeTest predicate*
  |  abbreviatedStep
  ;

axisSpecifier
  :  AxisName '::'
  |  '@'?
  ;

nodeTest:  nameTest
  |  NodeType '(' ')'
  |  'processing-instruction' '(' Literal ')'
  ;

predicate
  :  '[' expr ']'
  ;

abbreviatedStep
  :  '.'
  |  '..'
  ;

expr  :  orExpr
  ;

primaryExpr
  :  variableReference
  |  '(' expr ')'
  |  Literal
  |  Number
  |  functionCall
  ;

functionCall
  :  functionName '(' ( expr ( ',' expr )* )? ')'
  ;

unionExprNoRoot
  :  pathExprNoRoot ('|' unionExprNoRoot)?
  |  '/' '|' unionExprNoRoot
  ;

pathExprNoRoot
  :  locationPath
  |  filterExpr (('/'|'//') relativeLocationPath)?
  ;

filterExpr
  :  primaryExpr predicate*
  ;

orExpr  :  andExpr ('or' andExpr)*
  ;

andExpr  :  equalityExpr ('and' equalityExpr)*
  ;

equalityExpr
  :  relationalExpr (('='|'!=') relationalExpr)*
  ;

relationalExpr
  :  additiveExpr (('<'|'>'|'<='|'>=') additiveExpr)*
  ;

additiveExpr
  :  multiplicativeExpr (('+'|'-') multiplicativeExpr)*
  ;

multiplicativeExpr
  :  unaryExprNoRoot (('*'|'div'|'mod') multiplicativeExpr)?
  |  '/' (('div'|'mod') multiplicativeExpr)?
  ;

unaryExprNoRoot
  :  '-'* unionExprNoRoot
  ;

qName  :  nCName (':' nCName)?
  ;

functionName
  :  qName  // Does not match nodeType, as per spec.
  ;

variableReference
  :  '$' qName
  ;

nameTest:  '*'
  |  nCName ':' '*'
  |  qName
  ;

nCName  :  NCName
  |  AxisName
  ;

NodeType:  'comment'
  |  'text'
  |  'processing-instruction'
  |  'node'
  ;

Number  :  Digits ('.' Digits?)?
  |  '.' Digits
  ;

fragment
Digits  :  ('0'..'9')+
  ;

AxisName:  'ancestor'
  |  'ancestor-or-self'
  |  'attribute'
  |  'child'
  |  'descendant'
  |  'descendant-or-self'
  |  'following'
  |  'following-sibling'
  |  'namespace'
  |  'parent'
  |  'preceding'
  |  'preceding-sibling'
  |  'self'
  ;

Literal  :  '"' ~'"'* '"'
  |  '\'' ~'\''* '\''
  ;

Whitespace
  :  (' '|'\t'|'\n'|'\r')+ {$channel = HIDDEN;}
  ;

NCName  :  NCNameStartChar NCNameChar*
  ;

fragment
NCNameStartChar
  :  'A'..'Z'
  |  '_'
  |  'a'..'z'
  |  '\u00C0'..'\u00D6'
  |  '\u00D8'..'\u00F6'
  |  '\u00F8'..'\u02FF'
  |  '\u0370'..'\u037D'
  |  '\u037F'..'\u1FFF'
  |  '\u200C'..'\u200D'
  |  '\u2070'..'\u218F'
  |  '\u2C00'..'\u2FEF'
  |  '\u3001'..'\uD7FF'
  |  '\uF900'..'\uFDCF'
  |  '\uFDF0'..'\uFFFD'
// Unfortunately, Java escapes can't handle this conveniently,
// as they're limited to 4 hex digits. TODO.
//  |  '\U010000'..'\U0EFFFF'
  ;

fragment
NCNameChar
  :  NCNameStartChar | '-' | '.' | '0'..'9'
  |  '\u00B7' | '\u0300'..'\u036F'
  |  '\u203F'..'\u2040'
  ;


tinne said...
Supplementary character support is as easy as this:

fragment
NCNameStartChar
: 'A'..'Z'
| '_'
| 'a'..'z'
| '\u00C0'..'\u00D6'
| '\u00D8'..'\u00F6'
| '\u00F8'..'\u02FF'
| '\u0370'..'\u037D'
| '\u037F'..'\u1FFF'
| '\u200C'..'\u200D'
| '\u2070'..'\u218F'
| '\u2C00'..'\u2FEF'
| '\u3001'..'\uD7FF'
| '\uF900'..'\uFDCF'
| '\uFDF0'..'\uFFFD'
| PermittedHighSurrogateChar LowSurrogateChar
;

// Unicode supplementary character support added
// 2011-06-18 Karsten Tinnefeld
// [#x10000-#xEFFFF] is [#xD800 #xDC00 - #xDB7F #xDFFF] in UTF-16
fragment
PermittedHighSurrogateChar
: '\uD800'..'\uDB7F'
;

fragment
LowSurrogateChar
: '\uDC00'..'\uDFFF'
;
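
The surrogate ranges above can be double-checked with a few lines of plain Java. This is only a sanity-check sketch, not part of the grammar: the class name SurrogateRangeCheck and the helper utf16Units are made up for illustration, and the only library call assumed is the standard Character.toChars. It prints the UTF-16 code units for the first and last code points of the supplementary range [#x10000-#xEFFFF].

```java
class SurrogateRangeCheck {

    /** Returns the UTF-16 code units of a code point as hex, separated by spaces. */
    static String utf16Units(int codePoint) {
        // Character.toChars yields one char for BMP code points and a
        // high/low surrogate pair for supplementary code points.
        char[] units = Character.toChars(codePoint);
        StringBuilder sb = new StringBuilder();
        for (char unit : units) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First supplementary code point allowed at the start of an NCName.
        System.out.println(utf16Units(0x10000)); // prints "D800 DC00"
        // Last supplementary code point allowed at the start of an NCName.
        System.out.println(utf16Units(0xEFFFF)); // prints "DB7F DFFF"
    }
}
```

The two printed pairs match the bounds used in the rules above: high surrogates run from \uD800 to \uDB7F, and low surrogates cover the full \uDC00 to \uDFFF range.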

Vimal said...

You are right sir, the other XPath grammars available on the ANTLR site are not as good as yours.

Do you also have a grammar annotated to produce the tree grammar for XPath?

If you can post it, it will be very helpful for me.


Best Regards from Sandy Springs