thoughts.rnd(): July 2010

Ever been in a situation where you needed to parse XPath 1.0 expressions? I don't mean evaluating them, but actually parsing them, perhaps to script an automatic update to hundreds of BPEL processes. That's what I needed to do.

If you ever find yourself in such a situation, then perhaps the ANTLR grammar below will be of some use to you. Enjoy!

(This isn't the only XPath grammar available on the web, but all the others I tried either had bugs or wouldn't even compile with my version of ANTLR; they may have been written for an older version. No guarantees that my version is free of bugs, though!)
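One thing the grammar below has to cope with is that `*` means two things in XPath 1.0: a name-test wildcard or the multiplication operator. Section 3.7 of the spec settles this by looking at the preceding token: if there is one, and it is not `@`, `::`, `(`, `[`, `,` or an operator, then `*` is multiplication (and `and`, `or`, `mod`, `div` are operator names). The Java sketch below only illustrates that rule on a hand-tokenized expression. The class and method names are hypothetical, and this is not how the grammar itself handles the problem (it does so through its rule structure).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: a hypothetical helper, not part of the grammar. */
final class StarDisambiguation {
    enum StarKind { WILDCARD, MULTIPLY }

    // Tokens that are always operators, whatever precedes them.
    private static final Set<String> SYMBOL_OPERATORS = new HashSet<>(Arrays.asList(
            "/", "//", "|", "+", "-", "=", "!=", "<", "<=", ">", ">="));

    // Names that act as operators only in a position where an operator can occur.
    private static final Set<String> OPERATOR_NAMES = new HashSet<>(Arrays.asList(
            "and", "or", "mod", "div"));

    // Non-operator tokens after which an operand (not an operator) must follow.
    private static final Set<String> OPERAND_INTRODUCERS = new HashSet<>(Arrays.asList(
            "@", "::", "(", "[", ","));

    /** Classifies every '*' in an already-tokenized expression, left to right. */
    static List<StarKind> classifyStars(List<String> tokens) {
        List<StarKind> kinds = new ArrayList<>();
        // True at the start and after '@', '::', '(', '[', ',' or any operator.
        boolean operandExpected = true;
        for (String token : tokens) {
            boolean isOperator;
            if (token.equals("*")) {
                isOperator = !operandExpected;
                kinds.add(isOperator ? StarKind.MULTIPLY : StarKind.WILDCARD);
            } else if (OPERATOR_NAMES.contains(token)) {
                isOperator = !operandExpected; // otherwise it is an ordinary name
            } else {
                isOperator = SYMBOL_OPERATORS.contains(token);
            }
            operandExpected = isOperator || OPERAND_INTRODUCERS.contains(token);
        }
        return kinds;
    }
}
```

For `count(*) * 2`, passed in as the tokens `count`, `(`, `*`, `)`, `*`, `2`, the first `*` is classified as a wildcard (it follows `(`) and the second as multiplication (it follows `)`).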



grammar XPath;



/*

XPath 1.0 grammar. Should conform to the official spec at

http://www.w3.org/TR/1999/REC-xpath-19991116. The grammar

rules have been kept as close as possible to those in the

spec, but some adjustments were unavoidable. These were

mainly to remove left recursion (the spec seems to be based

on LR), and to deal with the dual nature of the '*' token

(node wildcard and multiplication operator). See also

section 3.7 in the spec. These rule changes should make

no difference to the strings accepted by the grammar.



Written by Jan-Willem van den Broek

Version 1.0



Do with this code as you will.

*/



tokens {

  PATHSEP  =  '/';

  ABRPATH  =  '//';

  LPAR  =  '(';

  RPAR  =  ')';

  LBRAC  =  '[';

  RBRAC  =  ']';

  MINUS  =  '-';

  PLUS  =  '+';

  DOT  =  '.';

  MUL  =  '*';

  DOTDOT  =  '..';

  AT  =  '@';

  COMMA  =  ',';

  PIPE  =  '|';

  LESS  =  '<';

  MORE  =  '>';

  LE  =  '<=';

  GE  =  '>=';

  COLON  =  ':';

  CC  =  '::';

  APOS  =  '\'';

  QUOT  =  '\"';

}



main  :  expr

  ;



locationPath 

  :  relativeLocationPath

  |  absoluteLocationPathNoroot

  ;



absoluteLocationPathNoroot

  :  '/' relativeLocationPath

  |  '//' relativeLocationPath

  ;



relativeLocationPath

  :  step (('/'|'//') step)*

  ;



step  :  axisSpecifier nodeTest predicate*

  |  abbreviatedStep

  ;



axisSpecifier

  :  AxisName '::'

  |  '@'?

  ;



nodeTest:  nameTest

  |  NodeType '(' ')'

  |  'processing-instruction' '(' Literal ')'

  ;



predicate

  :  '[' expr ']'

  ;



abbreviatedStep

  :  '.'

  |  '..'

  ;



expr  :  orExpr

  ;



primaryExpr

  :  variableReference

  |  '(' expr ')'

  |  Literal

  |  Number  

  |  functionCall

  ;



functionCall

  :  functionName '(' ( expr ( ',' expr )* )? ')'

  ;



unionExprNoRoot

  :  pathExprNoRoot ('|' unionExprNoRoot)?

  |  '/' '|' unionExprNoRoot

  ;



pathExprNoRoot

  :  locationPath

  |  filterExpr (('/'|'//') relativeLocationPath)?

  ;



filterExpr

  :  primaryExpr predicate*

  ;



orExpr  :  andExpr ('or' andExpr)*

  ;



andExpr  :  equalityExpr ('and' equalityExpr)*

  ;



equalityExpr

  :  relationalExpr (('='|'!=') relationalExpr)*

  ;



relationalExpr

  :  additiveExpr (('<'|'>'|'<='|'>=') additiveExpr)*

  ;



additiveExpr

  :  multiplicativeExpr (('+'|'-') multiplicativeExpr)*

  ;



multiplicativeExpr

  :  unaryExprNoRoot (('*'|'div'|'mod') multiplicativeExpr)?

  |  '/' (('div'|'mod') multiplicativeExpr)?

  ;



unaryExprNoRoot

  :  '-'* unionExprNoRoot

  ;



qName  :  nCName (':' nCName)?

  ;



functionName

  :  qName  // Does not match nodeType, as per spec.

  ;



variableReference

  :  '$' qName

  ;



nameTest:  '*'

  |  nCName ':' '*'

  |  qName

  ;



nCName  :  NCName

  |  AxisName

  ;



NodeType:  'comment'

  |  'text'

  |  'processing-instruction'

  |  'node'

  ;

  

Number  :  Digits ('.' Digits?)?

  |  '.' Digits

  ;



fragment

Digits  :  ('0'..'9')+

  ;



AxisName:  'ancestor'

  |  'ancestor-or-self'

  |  'attribute'

  |  'child'

  |  'descendant'

  |  'descendant-or-self'

  |  'following'

  |  'following-sibling'

  |  'namespace'

  |  'parent'

  |  'preceding'

  |  'preceding-sibling'

  |  'self'

  ;



Literal  :  '"' ~'"'* '"'

  |  '\'' ~'\''* '\''

  ;



Whitespace

  :  (' '|'\t'|'\n'|'\r')+ {$channel = HIDDEN;}

  ;



NCName  :  NCNameStartChar NCNameChar*

  ;



fragment

NCNameStartChar

  :  'A'..'Z'

  |   '_'

  |  'a'..'z'

  |  '\u00C0'..'\u00D6'

  |  '\u00D8'..'\u00F6'

  |  '\u00F8'..'\u02FF'

  |  '\u0370'..'\u037D'

  |  '\u037F'..'\u1FFF'

  |  '\u200C'..'\u200D'

  |  '\u2070'..'\u218F'

  |  '\u2C00'..'\u2FEF'

  |  '\u3001'..'\uD7FF'

  |  '\uF900'..'\uFDCF'

  |  '\uFDF0'..'\uFFFD'

// Unfortunately, Java escapes can't handle this conveniently,

// as they're limited to 4 hex digits. TODO.

//  |  '\U010000'..'\U0EFFFF'

  ;



fragment

NCNameChar

  :  NCNameStartChar | '-' | '.' | '0'..'9'

  |  '\u00B7' | '\u0300'..'\u036F'

  |  '\u203F'..'\u2040'

  ;
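
The TODO in NCNameStartChar concerns the range U+10000 to U+EFFFF, which a four-digit Java escape cannot express directly. In Java's UTF-16 strings such code points are stored as surrogate pairs, so one possible direction is to match a high surrogate followed by a low surrogate. The sketch below (hypothetical class and method names) only computes the surrogate bounds of that range and checks whether a pair falls inside it. Whether a lexer fragment built on those bounds behaves well in ANTLR is an assumption that has not been verified here.

```java
/** Illustrative only: a hypothetical helper, not part of the grammar. */
final class SupplementaryNameStart {
    // Supplementary range allowed as a name start character by the XML spec.
    static final int FIRST = 0x10000;
    static final int LAST = 0xEFFFF;

    /** Renders a supplementary code point as two backslash-u escapes (UTF-16 pair). */
    static String asEscapes(int codePoint) {
        return String.format("\\u%04X\\u%04X",
                (int) Character.highSurrogate(codePoint),
                (int) Character.lowSurrogate(codePoint));
    }

    /** True if the two UTF-16 units form a code point within FIRST..LAST. */
    static boolean isNameStart(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            return false;
        }
        int codePoint = Character.toCodePoint(high, low);
        return codePoint >= FIRST && codePoint <= LAST;
    }
}
```

`asEscapes(0x10000)` gives `\uD800\uDC00` and `asEscapes(0xEFFFF)` gives `\uDB7F\uDFFF`, so the range corresponds to a high surrogate in `\uD800`..`\uDB7F` followed by a low surrogate in `\uDC00`..`\uDFFF`.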

Friday, July 16, 2010

An ANTLR grammar for parsing XPath 1.0 expressions
