Thursday, April 29, 2010

Set operators in XPath

What's this? A post? Fer realz? Best pretend it never happened.

Meanwhile, I'll just continue typing. About set operators in XPath, no less.

In the olden days of XPath 1.0, there was only one set operator, "|", which performed a union on node sets. So you could use an expression like "//foo | //bar" to select all foo and bar elements. If, however, you wanted the intersection or difference of two node sets, you were out of luck, as there were no dedicated operators to help you do that.

Fortunately, the much improved XPath 2.0 remedies that situation by introducing the "intersect", and "except" operators. (And "union", which is just another way of saying "|".) So now you can do "foo intersect bar" to get nodes that are both foo's and bars, and "foo except bar" to get nodes that are foo's but not bars.

Sounds great, doesn't it? Sadly, there's a big pitfall in the way that nodes are determined to be equal. You see, for purposes of these operators, nodes are only deemed equal if they are the exact same nodes, not if they merely have the same name and the same content. So, in the document below, the two foo elements are not considered equal!

<root>
<foo>This is a foo.</foo>
<foo>This is a foo.</foo>
</root>


Now what if we want to perform set operations on nodes based on equality in terms of name and content, rather than XPath's strict definition?

Fortunately, XPath does come with a function that can determine if two nodes are identical, without taking into account if they're the exact same node. We can use this to write queries that do what we want.

Let's assume we have a variable a containing the following nodes (under a nameless root, not as a sequence):

<e></e>
<f>fff</f>
<g></g>


As well as a variable b containing these nodes:

<e></e>
<f>fff</f>
<i></i>


We can now simulate "except" as follows:

$a/*[empty(for $b1 in $b/* return (if (deep-equal(.,$b1)) then (true()) else ()))]

You should read this as: for all nodes of a, return only those for which no corresponding node can be found in b.

If you understand how our "except" works, you'll have no problem understanding "intersect":

$a/*[exists(for $b1 in $b/* return (if (deep-equal(.,$b1)) then (true()) else ()))]

Read this as: for all nodes of a, return only those for which a corresponding node can be found in b.

Finally, we have "union":

$a/*[empty(for $b1 in $b/* return (if (deep-equal(.,$b1)) then (true()) else ()))],$b/*

Read this as: take a except b and add all of b. (The "except b" bit is necessary to make sure you don't get the nodes that occur in both variables twice. Without it, you'd end up with duplicate <e> and <f> elements.)

There's one thing to be aware of in these queries though. Unlike the set operators, these queries don't remove duplicates when these are already in either of the inputs. So if a contains duplicate nodes, "$a except $b" will remove these, while our query won't. Depending on your use case this is either a bug or a feature. :-)

To help you out, here's a way to remove duplicates from a. If you combine this with the queries above, you'll have something that behaves just like the set operators do. (Apart from the difference in determining when nodes are equal, of course.)

$a/*[empty(for $a1 in subsequence($a/*,1,position()-1) return (if (deep-equal(.,$a1)) then (true()) else ()))]

Read this as: go through all nodes in a and return a node only if we haven't already passed an identical one.

There ya go. Hope this'll be of some help.