Showing posts with label xpath.

Friday, July 16, 2010

An ANTLR grammar for parsing XPath 1.0 expressions

Ever been in a situation where you needed to parse XPath 1.0 expressions? I don't mean to evaluate them, but to actually parse them. Perhaps to script an automatic update to hundreds of BPEL processes. That's what I needed to do.

If you ever find yourself in such a situation, then perhaps the ANTLR grammar below will be of some use to you. Enjoy!

(This isn't the only one available on the web, but all the others I tried either had bugs or wouldn't even compile with my version of ANTLR; they may have been written for an older version. No guarantees that my version is bug-free, though!)


grammar XPath;

/*
XPath 1.0 grammar. Should conform to the official spec at
http://www.w3.org/TR/1999/REC-xpath-19991116. The grammar
rules have been kept as close as possible to those in the
spec, but some adjustments were unavoidable. These were
mainly removing left recursion (spec seems to be based on
LR), and dealing with the double nature of the '*' token
(node wildcard and multiplication operator). See also
section 3.7 in the spec. These rule changes should make
no difference to the strings accepted by the grammar.

Written by Jan-Willem van den Broek
Version 1.0

Do with this code as you will.
*/

tokens {
  PATHSEP  =  '/';
  ABRPATH  =  '//';
  LPAR  =  '(';
  RPAR  =  ')';
  LBRAC  =  '[';
  RBRAC  =  ']';
  MINUS  =  '-';
  PLUS  =  '+';
  DOT  =  '.';
  MUL  =  '*';
  DOTDOT  =  '..';
  AT  =  '@';
  COMMA  =  ',';
  PIPE  =  '|';
  LESS  =  '<';
  MORE  =  '>';
  LE  =  '<=';
  GE  =  '>=';
  COLON  =  ':';
  CC  =  '::';
  APOS  =  '\'';
  QUOT  =  '\"';
}

main  :  expr
  ;

locationPath
  :  relativeLocationPath
  |  absoluteLocationPathNoroot
  ;

absoluteLocationPathNoroot
  :  '/' relativeLocationPath
  |  '//' relativeLocationPath
  ;

relativeLocationPath
  :  step (('/'|'//') step)*
  ;

step  :  axisSpecifier nodeTest predicate*
  |  abbreviatedStep
  ;

axisSpecifier
  :  AxisName '::'
  |  '@'?
  ;

nodeTest:  nameTest
  |  NodeType '(' ')'
  |  'processing-instruction' '(' Literal ')'
  ;

predicate
  :  '[' expr ']'
  ;

abbreviatedStep
  :  '.'
  |  '..'
  ;

expr  :  orExpr
  ;

primaryExpr
  :  variableReference
  |  '(' expr ')'
  |  Literal
  |  Number  
  |  functionCall
  ;

functionCall
  :  functionName '(' ( expr ( ',' expr )* )? ')'
  ;

unionExprNoRoot
  :  pathExprNoRoot ('|' unionExprNoRoot)?
  |  '/' '|' unionExprNoRoot
  ;

pathExprNoRoot
  :  locationPath
  |  filterExpr (('/'|'//') relativeLocationPath)?
  ;

filterExpr
  :  primaryExpr predicate*
  ;

orExpr  :  andExpr ('or' andExpr)*
  ;

andExpr  :  equalityExpr ('and' equalityExpr)*
  ;

equalityExpr
  :  relationalExpr (('='|'!=') relationalExpr)*
  ;

relationalExpr
  :  additiveExpr (('<'|'>'|'<='|'>=') additiveExpr)*
  ;

additiveExpr
  :  multiplicativeExpr (('+'|'-') multiplicativeExpr)*
  ;

multiplicativeExpr
  :  unaryExprNoRoot (('*'|'div'|'mod') multiplicativeExpr)?
  |  '/' (('div'|'mod') multiplicativeExpr)?
  ;

unaryExprNoRoot
  :  '-'* unionExprNoRoot
  ;

qName  :  nCName (':' nCName)?
  ;

functionName
  :  qName  // Does not match nodeType, as per spec.
  ;

variableReference
  :  '$' qName
  ;

nameTest:  '*'
  |  nCName ':' '*'
  |  qName
  ;

nCName  :  NCName
  |  AxisName
  ;

NodeType:  'comment'
  |  'text'
  |  'processing-instruction'
  |  'node'
  ;
  
Number  :  Digits ('.' Digits?)?
  |  '.' Digits
  ;

fragment
Digits  :  ('0'..'9')+
  ;

AxisName:  'ancestor'
  |  'ancestor-or-self'
  |  'attribute'
  |  'child'
  |  'descendant'
  |  'descendant-or-self'
  |  'following'
  |  'following-sibling'
  |  'namespace'
  |  'parent'
  |  'preceding'
  |  'preceding-sibling'
  |  'self'
  ;

Literal  :  '"' ~'"'* '"'
  |  '\'' ~'\''* '\''
  ;

Whitespace
  :  (' '|'\t'|'\n'|'\r')+ {$channel = HIDDEN;}
  ;

NCName  :  NCNameStartChar NCNameChar*
  ;

fragment
NCNameStartChar
  :  'A'..'Z'
  |   '_'
  |  'a'..'z'
  |  '\u00C0'..'\u00D6'
  |  '\u00D8'..'\u00F6'
  |  '\u00F8'..'\u02FF'
  |  '\u0370'..'\u037D'
  |  '\u037F'..'\u1FFF'
  |  '\u200C'..'\u200D'
  |  '\u2070'..'\u218F'
  |  '\u2C00'..'\u2FEF'
  |  '\u3001'..'\uD7FF'
  |  '\uF900'..'\uFDCF'
  |  '\uFDF0'..'\uFFFD'
// Unfortunately, Java escapes can't handle this conveniently,
// as they're limited to 4 hex digits. TODO.
//  |  '\U010000'..'\U0EFFFF'
  ;

fragment
NCNameChar
  :  NCNameStartChar | '-' | '.' | '0'..'9'
  |  '\u00B7' | '\u0300'..'\u036F'
  |  '\u203F'..'\u2040'
  ;
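
The grammar above handles the double nature of '*' by restructuring the parser rules. Section 3.7 of the spec describes an alternative, purely lexical rule: if there is a preceding token and it is not one of @, ::, (, [, "," or an operator, then a '*' is the multiplication operator (and a name is an operator name). Below is a rough Python sketch of that rule, not a translation of the grammar. The function classify_stars and the simplified tokenizer (ASCII names only, no '.5'-style numbers) are my own illustration.

```python
import re

# Tokens after which '*' (or a name) can never be an operator (spec section 3.7).
NEVER_BEFORE_OPERATOR = {"@", "::", "(", "[", ","}
SYMBOL_OPERATORS = {"/", "//", "|", "+", "-", "=", "!=", "<", "<=", ">", ">="}
OPERATOR_NAMES = {"and", "or", "div", "mod"}

# Simplified tokenizer: ASCII names only, no '.5'-style numbers.
TOKEN_RE = re.compile(
    r"""\s*(//|::|!=|<=|>=|\.\.|[/()\[\]@,|+\-=<>*.$]"""
    r"""|\d+(?:\.\d*)?|"[^"]*"|'[^']*'|[A-Za-z_][A-Za-z0-9_.-]*)"""
)


def classify_stars(xpath):
    """Return 'MUL' or 'WILDCARD' for every '*' in xpath, left to right."""
    kinds = []
    prev_text, prev_is_operator = None, False
    pos, text = 0, xpath.strip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"cannot tokenize at offset {pos}: {text[pos:]!r}")
        token = match.group(1)
        pos = match.end()
        # An operator reading is only possible when there is a preceding
        # token that is neither an operator nor one of @ :: ( [ ,
        operator_allowed = (
            prev_text is not None
            and prev_text not in NEVER_BEFORE_OPERATOR
            and not prev_is_operator
        )
        if token == "*":
            is_operator = operator_allowed
            kinds.append("MUL" if is_operator else "WILDCARD")
        elif token in OPERATOR_NAMES:
            # 'div', 'mod', 'and', 'or' are plain names unless an operator fits here.
            is_operator = operator_allowed
        else:
            is_operator = token in SYMBOL_OPERATORS
        prev_text, prev_is_operator = token, is_operator
    return kinds


print(classify_stars("//* * 2"))   # ['WILDCARD', 'MUL']
print(classify_stars("child::*"))  # ['WILDCARD']
```

Note that the same rule also explains why an element can legally be called "div": in "div * mod" the first and last tokens are names and only the '*' is an operator.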

Tuesday, May 11, 2010

XSLT, XQuery, XPath tooling

I'm always on the lookout for nice tooling. Preferably small and free (as in beer, though I like the other free too). Not because I'm cheap (though I am), but because those are the sort of tools that I can use without having to go through the bureaucratic hell associated with getting the boss or customer to pay for software licenses.

Here are two of my favorite free tools for working with XSLT, XQuery and XPath.

Architag XRay XML Editor. Technically a general purpose XML editor, but it really shines when you're writing XSLTs. As you're writing your XSLT, you can set an input document, and it'll continually evaluate your XSLT against that document. That way, the second you make a change to the XSLT, you'll get to see the effect it has. Awesome!

Besides this, it continually informs you of invalidities in your XML in a very unobtrusive way (no popups or any of that nonsense), and will automatically perform Schema validations if you have the right schema open. (No need to explicitly associate documents with schemas.)

Sadly, it only supports XSLT 1.0. Still, that's the version I usually have to use at work, so XRay still comes in handy very often.

Kernow. Kernow's stated goal is "to make it faster and easier to repeatedly run transforms using Saxon." It does so admirably, but it also has very convenient sandboxes for performing XSLT 2.0 or XQuery transforms, and XML Schema or Schematron validations. Maybe not yet as convenient as XRay, but still quite nice. Definitely recommended!

Sunday, May 09, 2010

On WebLogic deployment plans

Last week I was struggling to get an MQ adapter configured using a WebLogic deployment plan. I must have been especially low on brain power that day, as they're really not all that complex. Still, they are tricky if you've never seen them before (as I hadn't), so here are some pointers to help you out.

These deployment plans turn out to be part of an implementation of JSR-88. Don't try to read that spec unless you feel significantly more masochistic than I do though. Suffice it to say that - under WebLogic, at least - you can use them to provide server-specific configuration information to an application.

You may want to use a plan if your config is included in a jar/war/ear/etc. file; for instance in an included web.xml. Having config in such a place can be a pain, as you typically don't want to change those *ar files after QA has signed off on them. A deployment plan lets you work around this problem by telling the server to pretend that the config was actually a bit (or entirely) different from what those files in the *ar said.

The way this actually works is somewhat unintuitive, but it does work. Under WebLogic, you use an XML document that conforms to this schema. (I can't comment on other platforms, as I have only tried this on WebLogic.)

The most interesting part of this schema is the module-override element. With this element, we can specify what descriptors to update, and in what modules (rar, jar, etc.). For instance:

<module-override>
  <module-name>MQSeriesAdapter.rar</module-name>
  <module-type>rar</module-type>
  <module-descriptor external="false">
    <root-element>weblogic-connector</root-element>
    <uri>META-INF/weblogic-ra.xml</uri>
    <variable-assignment>
      ...
    </variable-assignment>
  </module-descriptor>
</module-override>

This config will update the META-INF/weblogic-ra.xml descriptor in MQSeriesAdapter.rar. The actual fields being updated are specified in the variable-assignment block. (You can have multiple of those, by the way.)

Those variable-assignment blocks are where things get confusing. Before I explain, allow me to show you their basic form.

<variable-assignment>
  <name>name</name>
  <xpath>pseudoXPath</xpath>
  <operation>add</operation> <!-- Can also be remove or replace. -->
</variable-assignment>

The idea is that the data from the "name" element gets assigned to the location in the descriptor (i.e. META-INF/weblogic-ra.xml in the example) indicated by the "xpath" element. The "operation" element specifies whether this value must be added, removed or replaced. (The default is "add", and in this case, any XML elements that are specified in the xpath tag, but missing in the target document, will be added.)

This is where things start getting confusing. First of all, you can't actually specify the value you want to assign in the "name" element. There is an element of indirection here in that you must assign this value to a variable first, and then you can refer to this variable here. Sound confusing? I sure thought it was.

Here's an example of how it works:

<variable-definition>
  <variable>
    <name>myVar</name>
    <value>niftyValue</value>
  </variable>
</variable-definition>
<module-override>
  <module-name>MQSeriesAdapter.rar</module-name>
  <module-type>rar</module-type>
  <module-descriptor external="false">
    <root-element>weblogic-connector</root-element>
    <uri>META-INF/weblogic-ra.xml</uri>
    <variable-assignment>
      <name>myVar</name>
      <xpath>pseudoXPath</xpath>
    </variable-assignment>
  </module-descriptor>
</module-override>

In the example we assign the value "niftyValue" via the variable myVar. Ridiculously circumlocutious (word of the day), but it works.
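
To make the indirection a bit more tangible, here is a rough Python sketch that reads such a plan and resolves every variable-assignment to the value it ends up assigning. It only illustrates the lookup; it is not how WebLogic itself processes plans. The wrapping deployment-plan root without a namespace and the function name resolve_assignments are simplifications of mine.

```python
import xml.etree.ElementTree as ET

# The example plan from above, wrapped in a root element so that it parses.
# Assumption: no XML namespace, to keep the lookups short.
PLAN = """
<deployment-plan>
  <variable-definition>
    <variable>
      <name>myVar</name>
      <value>niftyValue</value>
    </variable>
  </variable-definition>
  <module-override>
    <module-name>MQSeriesAdapter.rar</module-name>
    <module-type>rar</module-type>
    <module-descriptor external="false">
      <root-element>weblogic-connector</root-element>
      <uri>META-INF/weblogic-ra.xml</uri>
      <variable-assignment>
        <name>myVar</name>
        <xpath>pseudoXPath</xpath>
      </variable-assignment>
    </module-descriptor>
  </module-override>
</deployment-plan>
"""


def resolve_assignments(plan_xml):
    """Return (module, descriptor, xpath, operation, value) per assignment."""
    root = ET.fromstring(plan_xml)
    # Step 1: collect the variable definitions (name -> value).
    values = {v.findtext("name"): v.findtext("value") for v in root.iter("variable")}
    # Step 2: follow each assignment's name back to its value.
    resolved = []
    for module in root.iter("module-override"):
        for descriptor in module.iter("module-descriptor"):
            for assignment in descriptor.iter("variable-assignment"):
                resolved.append((
                    module.findtext("module-name"),
                    descriptor.findtext("uri"),
                    assignment.findtext("xpath"),
                    assignment.findtext("operation", default="add"),
                    values[assignment.findtext("name")],
                ))
    return resolved


for row in resolve_assignments(PLAN):
    print(row)
```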

There's something even more confusing though. Have you noticed how I put the value "pseudoXPath" in every "xpath" element so far? This is because despite what the name suggests, this element doesn't actually accept proper XPath!

JSR-88 restricts the allowed XPaths to those that contain only ".", "..", "/", and tag names. I suppose it makes sense to restrict the allowed expressions a bit, considering that the expression can be used to create a new element. (In which case the plan processor must manipulate the XML until the XPath expression evaluates to a node. Could be tricky for complex XPaths.) Still, as it stands, it's probably a bit too restrictive.

Fortunately, WebLogic seems to accept an enhanced syntax, but this makes it deviate even more from the XPath standard. I haven't yet found proper documentation of the syntax, but I know that it accepts XPath predicates with equality comparisons. Unlike proper XPath though, they must be preceded by a slash ("/"). So you can have a construction like
book/[title="Lord of the Rings"]
to select only books with a title element with value "Lord of the Rings".
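
If you want to try such a path out in an ordinary XPath tool first, you could strip the extra slash before each predicate. The Python sketch below does just that and then evaluates the result with ElementTree's limited XPath support. The helper to_standard_xpath and the little library document are made up for illustration; since I haven't found documentation of WebLogic's syntax, treat the conversion as a guess that covers only the equality predicates shown here.

```python
import re
import xml.etree.ElementTree as ET


def to_standard_xpath(pseudo_xpath):
    """Drop the slash that WebLogic-style paths put before each predicate."""
    # Double-quoted strings are matched first and copied through untouched,
    # so slashes inside a value (e.g. "eisMQ/MQAdapter") are left alone.
    return re.sub(r'("[^"]*")|/\[', lambda m: m.group(1) or "[", pseudo_xpath)


# Hypothetical sample document.
LIBRARY = ET.fromstring(
    "<library>"
    "<book><title>Lord of the Rings</title></book>"
    "<book><title>The Hobbit</title></book>"
    "</library>"
)

path = to_standard_xpath('book/[title="Lord of the Rings"]')
matches = LIBRARY.findall(path)

print(path)          # book[title="Lord of the Rings"]
print(len(matches))  # 1
```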

Don't bother with namespaces in these XPaths, by the way. WebLogic seems to ignore namespaces when processing them, which I guess is usually pretty convenient. Yet another way in which it differs from true XPath, however.

Putting it all together, you could end up with something like this:

<variable-definition>
  <variable>
    <name>port</name>
    <value>1414</value>
  </variable>
</variable-definition>
<module-override>
  <module-name>MQSeriesAdapter.rar</module-name>
  <module-type>rar</module-type>
  <module-descriptor external="false">
    <root-element>weblogic-connector</root-element>
    <uri>META-INF/weblogic-ra.xml</uri>
    <variable-assignment>
      <name>port</name>
      <xpath>/weblogic-connector/outbound-resource-adapter/connection-definition-group/[connection-factory-interface="javax.resource.cci.ConnectionFactory"]/connection-instance/[jndi-name="eisMQ/MQAdapter"]/connection-properties/properties/property/[name="portNumber"]/value</xpath>
    </variable-assignment>
  </module-descriptor>
</module-override>

By now you should have a pretty good idea of how these plans work. If you want to know how to deploy them, then you could do worse than to check out this helpful blog.

Thursday, April 29, 2010

Set operators in XPath

What's this? A post? Fer realz? Best pretend it never happened.

Meanwhile, I'll just continue typing. About set operators in XPath, no less.

In the olden days of XPath 1.0, there was only one set operator, "|", which performed a union on node sets. So you could use an expression like "//foo | //bar" to select all foo and bar elements. If, however, you wanted the intersection or difference of two node sets, you were out of luck, as there were no dedicated operators to help you do that.

Fortunately, the much improved XPath 2.0 remedies that situation by introducing the "intersect" and "except" operators. (And "union", which is just another way of saying "|".) So now you can do "foo intersect bar" to get nodes that are both foos and bars, and "foo except bar" to get nodes that are foos but not bars.

Sounds great, doesn't it? Sadly, there's a big pitfall in the way that nodes are determined to be equal. You see, for purposes of these operators, nodes are only deemed equal if they are the exact same nodes, not if they merely have the same name and the same content. So, in the document below, the two foo elements are not considered equal!

<root>
<foo>This is a foo.</foo>
<foo>This is a foo.</foo>
</root>
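
The same distinction exists in most programming languages. As a loose analogy in Python (using ElementTree, which has nothing to do with how an XPath processor works internally), "is" plays the role of node identity, while comparing name and text plays the role of content equality:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<root><foo>This is a foo.</foo><foo>This is a foo.</foo></root>"
)
first, second = root.findall("foo")

# Identity: are these the exact same node?
same_node = first is second
# Content: do they have the same name and the same text?
same_content = (first.tag, first.text) == (second.tag, second.text)

print(same_node)     # False
print(same_content)  # True
```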


Now what if we want to perform set operations on nodes based on equality in terms of name and content, rather than XPath's strict definition?

Fortunately, XPath 2.0 does come with a function, deep-equal, that can determine whether two nodes have the same name and content, without taking into account whether they're the exact same node. We can use this to write queries that do what we want.

Let's assume we have a variable a containing the following nodes (under a nameless root, not as a sequence):

<e></e>
<f>fff</f>
<g></g>


As well as a variable b containing these nodes:

<e></e>
<f>fff</f>
<i></i>


We can now simulate "except" as follows:

$a/*[empty(for $b1 in $b/* return (if (deep-equal(.,$b1)) then (true()) else ()))]

You should read this as: for all nodes of a, return only those for which no corresponding node can be found in b.

If you understand how our "except" works, you'll have no problem understanding "intersect":

$a/*[exists(for $b1 in $b/* return (if (deep-equal(.,$b1)) then (true()) else ()))]

Read this as: for all nodes of a, return only those for which a corresponding node can be found in b.

Finally, we have "union":

$a/*[empty(for $b1 in $b/* return (if (deep-equal(.,$b1)) then (true()) else ()))],$b/*

Read this as: take a except b and add all of b. (The "except b" bit is necessary to make sure you don't get the nodes that occur in both variables twice. Without it, you'd end up with duplicate <e> and <f> elements.)
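
If the nested XPath is hard to follow, the three queries can be mirrored in Python. This is only a sketch of the logic: the deep_equal function below is a rough stand-in for XPath's deep-equal (it compares tag, attributes, text and children, and ignores everything else), and the names except_, intersect and union are mine.

```python
import xml.etree.ElementTree as ET


def deep_equal(x, y):
    """Rough stand-in for deep-equal(): same tag, attributes, text, children."""
    return (
        x.tag == y.tag
        and x.attrib == y.attrib
        and (x.text or "") == (y.text or "")
        and len(x) == len(y)
        and all(deep_equal(cx, cy) for cx, cy in zip(x, y))
    )


def in_nodes(node, nodes):
    """True if some node in nodes is deep-equal to node."""
    return any(deep_equal(node, other) for other in nodes)


def except_(a, b):
    # Nodes of a for which no corresponding node can be found in b.
    return [n for n in a if not in_nodes(n, b)]


def intersect(a, b):
    # Nodes of a for which a corresponding node can be found in b.
    return [n for n in a if in_nodes(n, b)]


def union(a, b):
    # a except b, followed by all of b.
    return except_(a, b) + list(b)


a = list(ET.fromstring("<a><e></e><f>fff</f><g></g></a>"))
b = list(ET.fromstring("<b><e></e><f>fff</f><i></i></b>"))

print([n.tag for n in except_(a, b)])    # ['g']
print([n.tag for n in intersect(a, b)])  # ['e', 'f']
print([n.tag for n in union(a, b)])      # ['g', 'e', 'f', 'i']
```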

There's one thing to be aware of in these queries though. Unlike the set operators, these queries don't remove duplicates that are already present in either of the inputs. So if a contains duplicate nodes, "$a except $b" will remove these, while our query won't. Depending on your use case this is either a bug or a feature. :-)

To help you out, here's a way to remove duplicates from a. If you combine this with the queries above, you'll have something that behaves just like the set operators do. (Apart from the difference in determining when nodes are equal, of course.)

$a/*[empty(for $a1 in subsequence($a/*,1,position()-1) return (if (deep-equal(.,$a1)) then (true()) else ()))]

Read this as: go through all nodes in a and return a node only if we haven't already passed an identical one.
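
The same idea in Python, again as a sketch with a simplified deep_equal of my own rather than the real XPath function: walk through the nodes and keep one only if none of the nodes before it is equal to it.

```python
import xml.etree.ElementTree as ET


def deep_equal(x, y):
    """Rough stand-in for deep-equal(): same tag, attributes, text, children."""
    return (
        x.tag == y.tag
        and x.attrib == y.attrib
        and (x.text or "") == (y.text or "")
        and len(x) == len(y)
        and all(deep_equal(cx, cy) for cx, cy in zip(x, y))
    )


def distinct(nodes):
    """Keep a node only if no earlier node in the list is deep-equal to it."""
    kept = []
    for position, node in enumerate(nodes):
        earlier = nodes[:position]  # plays the role of subsequence(..., 1, position()-1)
        if not any(deep_equal(node, other) for other in earlier):
            kept.append(node)
    return kept


# Hypothetical input with one duplicate <e> and two different <f> elements.
a = list(ET.fromstring("<a><e></e><f>fff</f><e></e><f>ggg</f></a>"))

print([(n.tag, n.text) for n in distinct(a)])
# [('e', None), ('f', 'fff'), ('f', 'ggg')]
```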

There ya go. Hope this'll be of some help.