thoughts.rnd(): 2010

Friday, September 17, 2010

Cleaning up namespace declarations

Ever been bothered by redundant namespace declarations? Some tools generate tons and tons of these, and they really clutter up your XML, making it hard to read and maintain, as well as increasing file size for no reason.

Here's a handy XSLT to move all declarations to the top element of your document, removing duplicates. (It even moves seemingly unused declarations. This is intended behavior, as the declarations may still be referenced in the text, e.g. in the XPath expressions of a BPEL file.)



<!-- Copyright J.W. v/d Broek 2010. Do with this code as you will. -->

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" exclude-result-prefixes="#all">



<xsl:template match="/*">

<xsl:copy copy-namespaces="no">

<xsl:for-each-group select="for $x in //* return for $y in in-scope-prefixes($x)[.!='' and .!='xml'] return concat($y, ':',namespace-uri-for-prefix($y, $x))" group-by=".">

<xsl:namespace name="{substring-before(., ':')}" select="substring-after(., ':')"/>

</xsl:for-each-group>

<xsl:apply-templates select="node()|@*"/>

</xsl:copy>

</xsl:template>



<xsl:template match="node()|@*">

<xsl:copy copy-namespaces="no">

<xsl:apply-templates select="node()|@*"/>

</xsl:copy>

</xsl:template>



</xsl:stylesheet>
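If you'd like to see which declarations a document actually contains before you run the stylesheet, a few lines of Python will do. This is only a rough sketch and not a port of the XSLT: it lists the distinct prefix/URI pairs and does not rewrite anything. The function name collect_namespaces is mine, and like the stylesheet it skips the default (empty) prefix.

```python
import io
import xml.etree.ElementTree as ET

def collect_namespaces(xml_text):
    """Return the distinct (prefix, uri) pairs declared anywhere in the document."""
    seen = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    # "start-ns" events fire once for every namespace declaration the parser meets.
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        if prefix and (prefix, uri) not in seen:
            seen.append((prefix, uri))
    return seen
```

If the same prefix shows up here with two different URIs, moving everything to the top element can't work for that prefix, so it's worth a look before transforming.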

Friday, July 16, 2010

An ANTLR grammar for parsing XPath 1.0 expressions

Ever been in a situation where you needed to parse XPath 1.0 expressions? I don't mean to evaluate them, but to actually parse them. Perhaps to script an automatic update to hundreds of BPEL processes. That's what I needed to do.

If you ever find yourself in such a situation, then perhaps the ANTLR grammar below will be of some use to you. Enjoy!

(This isn't the only one that's available on the web, but all the others that I tried had bugs or wouldn't even compile (with my version of ANTLR; they may have been written for an older version). No guarantees that my version hasn't got any bugs though!)



grammar XPath;



/*

XPath 1.0 grammar. Should conform to the official spec at

http://www.w3.org/TR/1999/REC-xpath-19991116. The grammar

rules have been kept as close as possible to those in the

spec, but some adjustments were unavoidable. These were

mainly removing left recursion (spec seems to be based on

LR), and to deal with the double nature of the '*' token

(node wildcard and multiplication operator). See also

section 3.7 in the spec. These rule changes should make

no difference to the strings accepted by the grammar.



Written by Jan-Willem van den Broek

Version 1.0



Do with this code as you will.

*/



tokens {

  PATHSEP  =  '/';

  ABRPATH  =  '//';

  LPAR  =  '(';

  RPAR  =  ')';

  LBRAC  =  '[';

  RBRAC  =  ']';

  MINUS  =  '-';

  PLUS  =  '+';

  DOT  =  '.';

  MUL  =  '*';

  DOTDOT  =  '..';

  AT  =  '@';

  COMMA  =  ',';

  PIPE  =  '|';

  LESS  =  '<';

  MORE  =  '>';

  LE  =  '<=';

  GE  =  '>=';

  COLON  =  ':';

  CC  =  '::';

  APOS  =  '\'';

  QUOT  =  '\"';

}



main  :  expr

  ;



locationPath 

  :  relativeLocationPath

  |  absoluteLocationPathNoroot

  ;



absoluteLocationPathNoroot

  :  '/' relativeLocationPath

  |  '//' relativeLocationPath

  ;



relativeLocationPath

  :  step (('/'|'//') step)*

  ;



step  :  axisSpecifier nodeTest predicate*

  |  abbreviatedStep

  ;



axisSpecifier

  :  AxisName '::'

  |  '@'?

  ;



nodeTest:  nameTest

  |  NodeType '(' ')'

  |  'processing-instruction' '(' Literal ')'

  ;



predicate

  :  '[' expr ']'

  ;



abbreviatedStep

  :  '.'

  |  '..'

  ;



expr  :  orExpr

  ;



primaryExpr

  :  variableReference

  |  '(' expr ')'

  |  Literal

  |  Number  

  |  functionCall

  ;



functionCall

  :  functionName '(' ( expr ( ',' expr )* )? ')'

  ;



unionExprNoRoot

  :  pathExprNoRoot ('|' unionExprNoRoot)?

  |  '/' '|' unionExprNoRoot

  ;



pathExprNoRoot

  :  locationPath

  |  filterExpr (('/'|'//') relativeLocationPath)?

  ;



filterExpr

  :  primaryExpr predicate*

  ;



orExpr  :  andExpr ('or' andExpr)*

  ;



andExpr  :  equalityExpr ('and' equalityExpr)*

  ;



equalityExpr

  :  relationalExpr (('='|'!=') relationalExpr)*

  ;



relationalExpr

  :  additiveExpr (('<'|'>'|'<='|'>=') additiveExpr)*

  ;



additiveExpr

  :  multiplicativeExpr (('+'|'-') multiplicativeExpr)*

  ;



multiplicativeExpr

  :  unaryExprNoRoot (('*'|'div'|'mod') multiplicativeExpr)?

  |  '/' (('div'|'mod') multiplicativeExpr)?

  ;



unaryExprNoRoot

  :  '-'* unionExprNoRoot

  ;



qName  :  nCName (':' nCName)?

  ;



functionName

  :  qName  // Does not match nodeType, as per spec.

  ;



variableReference

  :  '$' qName

  ;



nameTest:  '*'

  |  nCName ':' '*'

  |  qName

  ;



nCName  :  NCName

  |  AxisName

  ;



NodeType:  'comment'

  |  'text'

  |  'processing-instruction'

  |  'node'

  ;

  

Number  :  Digits ('.' Digits?)?

  |  '.' Digits

  ;



fragment

Digits  :  ('0'..'9')+

  ;



AxisName:  'ancestor'

  |  'ancestor-or-self'

  |  'attribute'

  |  'child'

  |  'descendant'

  |  'descendant-or-self'

  |  'following'

  |  'following-sibling'

  |  'namespace'

  |  'parent'

  |  'preceding'

  |  'preceding-sibling'

  |  'self'

  ;



Literal  :  '"' ~'"'* '"'

  |  '\'' ~'\''* '\''

  ;



Whitespace

  :  (' '|'\t'|'\n'|'\r')+ {$channel = HIDDEN;}

  ;



NCName  :  NCNameStartChar NCNameChar*

  ;



fragment

NCNameStartChar

  :  'A'..'Z'

  |   '_'

  |  'a'..'z'

  |  '\u00C0'..'\u00D6'

  |  '\u00D8'..'\u00F6'

  |  '\u00F8'..'\u02FF'

  |  '\u0370'..'\u037D'

  |  '\u037F'..'\u1FFF'

  |  '\u200C'..'\u200D'

  |  '\u2070'..'\u218F'

  |  '\u2C00'..'\u2FEF'

  |  '\u3001'..'\uD7FF'

  |  '\uF900'..'\uFDCF'

  |  '\uFDF0'..'\uFFFD'

// Unfortunately, java escapes can't handle this conveniently,

// as they're limited to 4 hex digits. TODO.

//  |  '\U010000'..'\U0EFFFF'

  ;



fragment

NCNameChar

  :  NCNameStartChar | '-' | '.' | '0'..'9'

  |  '\u00B7' | '\u0300'..'\u036F'

  |  '\u203F'..'\u2040'

  ;
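The header comment of the grammar mentions the double nature of the '*' token. Section 3.7 of the spec resolves it lexically: if there is a preceding token, and it is not one of "@", "::", "(", "[", "," or an operator, then "*" is the multiplication operator (and "and", "or", "div", "mod" are operator names). Here is a minimal Python sketch of just that rule. It is not how the ANTLR grammar does it (the grammar restructures the parser rules instead), the tokenizer only knows ASCII names, and the name classify is made up for this example.

```python
import re

# ASCII-only simplification: real NCNames allow many more characters.
_TOKEN = re.compile(r"\s*(" + "|".join([
    r'"[^"]*"', r"'[^']*'",            # string literals
    r"\d+(?:\.\d*)?", r"\.\d+",        # numbers
    r"//", r"::", r"\.\.", r"!=", r"<=", r">=",
    r"[/()\[\]@,|+\-=<>*.$:]",         # single-character tokens
    r"[A-Za-z_][\w.\-]*",              # names
]) + ")")

_OPERATOR_SYMBOLS = {"/", "//", "|", "+", "-", "=", "!=", "<", "<=", ">", ">="}
_OPERATOR_NAMES = {"and", "or", "div", "mod"}
_BEFORE_OPERAND = {"@", "::", "(", "[", ","}

def classify(expr):
    """Return (text, kind) pairs; kind is 'operator', 'wildcard' or 'other'."""
    tokens = []
    expr = expr.strip()
    pos = 0
    while pos < len(expr):
        match = _TOKEN.match(expr, pos)
        if match is None:
            raise ValueError("cannot tokenize at offset %d" % pos)
        text = match.group(1)
        pos = match.end()
        # With no preceding token we behave as if an operator came before.
        prev_text, prev_kind = tokens[-1] if tokens else (None, "operator")
        operand_expected = prev_kind == "operator" or prev_text in _BEFORE_OPERAND
        if text == "*":
            kind = "wildcard" if operand_expected else "operator"
        elif text in _OPERATOR_NAMES and not operand_expected:
            kind = "operator"
        elif text in _OPERATOR_SYMBOLS:
            kind = "operator"
        else:
            kind = "other"
        tokens.append((text, kind))
    return tokens
```

So in "child::* * 2" the first star comes out as a wildcard and the second as an operator, and in "div div div" only the middle one is an operator.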

Tuesday, June 01, 2010

Making XML Schema less of a pain by parsing text with XSLT

Allow me to get to the point immediately. XML Schema can be a royal pain.

Don't get me wrong; I'm glad it exists. It's powerful, serves a clear purpose, is well-supported, yadda, yadda, yadda. Unfortunately, it's also quite complex, has a lot of pitfalls (elementFormDefault!), and is terribly verbose.

For instance, would you rather have this:



http://blog.jwbroek.com/nifty-namespace

thingamabob        ; This is a comment.

  foo xsd:string

  bar xsd:boolean  ; Set to true to enable bar.

  baz

    alice

      count xsd:integer?  ; Count is optional.

      description  ; Type defaults to string.

    bobs           ; List of 0 or more bobs.

      bob xsd:boolean*

    charles +      ; At least one charles.

Or this:



<?xml version="1.0" encoding="UTF-8"?>

<xsd:schema xmlns:tns="http://blog.jwbroek.com/nifty-namespace"

            xmlns:xsd="http://www.w3.org/2001/XMLSchema"

            elementFormDefault="qualified"

            attributeFormDefault="unqualified"

            targetNamespace="http://blog.jwbroek.com/nifty-namespace">

   <xsd:element name="thingamabob">

      <xsd:annotation>

         <xsd:documentation>This is a comment.</xsd:documentation>

      </xsd:annotation>

      <xsd:complexType>

         <xsd:sequence>

            <xsd:element name="foo" type="xsd:string"/>

            <xsd:element name="bar" type="xsd:boolean">

               <xsd:annotation>

                  <xsd:documentation>Set to true to enable bar.</xsd:documentation>

               </xsd:annotation>

            </xsd:element>

            <xsd:element name="baz">

               <xsd:complexType>

                  <xsd:sequence>

                     <xsd:element name="alice">

                        <xsd:complexType>

                           <xsd:sequence>

                              <xsd:element name="count" type="xsd:integer" minOccurs="0">

                                 <xsd:annotation>

                                    <xsd:documentation>Count is optional.</xsd:documentation>

                                 </xsd:annotation>

                              </xsd:element>

                              <xsd:element name="description" type="xsd:string">

                                 <xsd:annotation>

                                    <xsd:documentation>Type defaults to string.</xsd:documentation>

                                 </xsd:annotation>

                              </xsd:element>

                           </xsd:sequence>

                        </xsd:complexType>

                     </xsd:element>

                     <xsd:element name="bobs">

                        <xsd:annotation>

                           <xsd:documentation>List of 0 or more bobs.</xsd:documentation>

                        </xsd:annotation>

                        <xsd:complexType>

                           <xsd:sequence>

                              <xsd:element name="bob" type="xsd:boolean" minOccurs="0" maxOccurs="unbounded"/>

                           </xsd:sequence>

                        </xsd:complexType>

                     </xsd:element>

                     <xsd:element name="charles" type="xsd:string" maxOccurs="unbounded">

                        <xsd:annotation>

                           <xsd:documentation>At least one charles.</xsd:documentation>

                        </xsd:annotation>

                     </xsd:element>

                  </xsd:sequence>

               </xsd:complexType>

            </xsd:element>

         </xsd:sequence>

      </xsd:complexType>

   </xsd:element>

</xsd:schema>

Both describe the same XML structure, but if you ask me, the first one is much clearer, and much quicker to write as well.

Granted, we're not using any of the fancy bells and whistles of XML Schema here. However, this would be quite sufficient for most of the things I see Schema being used for.

Wouldn't it be nice if you could actually write your Schemas using the first syntax?

Well, you're in luck: you can! The Schema above was entirely generated by applying the XSLT below to the simple syntax at the top. Hope you'll enjoy it as much as I do. :-)

(Tip: use Kernow to execute the XSLT. Put your input in C:\dev\projects\schemagen\test\input.txt, or override the parameter to use a file of your choice.)



<!--

Copyright 2010 J.W. van den Broek



Licensed under the Apache License, Version 2.0 (the "License");

you may not use this file except in compliance with the License.

You may obtain a copy of the License at



http://www.apache.org/licenses/LICENSE-2.0



Unless required by applicable law or agreed to in writing, software

distributed under the License is distributed on an "AS IS" BASIS,

WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

See the License for the specific language governing permissions and

limitations under the License.

-->

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"

   xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:jws="http://blog.jwbroek.com/xslt/xsd/functions"

   exclude-result-prefixes="#all">

  

   <xsl:output indent="yes"/>

  

   <!-- Override this to read your file. -->

   <xsl:param name="input-file" select="'file:///C:/dev/projects/schemagen/test/input.txt'"/>

  

   <xsl:template match="/">

      <!-- Sequence of all non-empty lines in the input. -->

      <xsl:variable name="lines" select="tokenize(unparsed-text($input-file),'\r\n|\r|\n')[not(matches(.,'^\s*$'))]"/>

     

      <!-- Create the schema. Make the root schema element here, taking the target namespace from the first line of input. -->

      <xsd:schema elementFormDefault="qualified" attributeFormDefault="unqualified" targetNamespace="{$lines[1]}">

         <xsl:namespace name="tns" select="$lines[1]"/>

         <!-- Pass all other lines on to the element-declarations function, which will create the element declarations. -->

         <xsl:sequence select="jws:element-declarations(subsequence($lines,2,count($lines)-1))"/>

      </xsd:schema>

   </xsl:template>

  

   <!-- Create element declarations based on lines of input. -->

   <xsl:function name="jws:element-declarations" as="element()*">

      <xsl:param name="rawLines" as="xsd:string*"/>

     

      <!-- Only continue if we have lines of input remaining. -->

      <xsl:if test="exists($rawLines)">

         <!-- Take the indentation from the first line. We'll create declarations for all elements with this level of indentation. -->

         <!-- We'll recursively create declarations for elements at higher indentation. -->

         <xsl:variable name="curIndent" select="replace($rawLines[1],'^(\s*).+$','$1')"/>

         <!-- Remove the base indentation from all lines. The elements we're going to make declarations for now have no indentation. -->

         <xsl:variable name="lines" select="for $l in $rawLines return substring-after($l, $curIndent)"/>

         <!-- Determine indices for all elements without indentation. We'll use this info to efficiently access the right lines of input. -->

         <xsl:variable name="indicesAtRoot" select="index-of((for $l in $lines return matches($l, '^\i+.*')), true())"/>

         <!-- Contains the root indices, but also the end of input. We'll use this to create subsequences for our recursive calls. -->

         <xsl:variable name="indicesAndBound" select="$indicesAtRoot, count($lines)+1"/>

        

         <!-- Create declarations for all root elements. (And recursively all child elements as well.) -->

         <xsl:for-each select="$indicesAtRoot">

            <!-- Current line of input. -->

            <xsl:variable name="curLine" select="$lines[current()]"/>

            <!-- Name of current element. -->

            <xsl:variable name="name" select="replace($curLine,'^(\i\c*).*$','$1')"/>

            <!-- Type of current element. May be empty, in which case we'll use xsd:string as default later on. -->

            <xsl:variable name="type" select="replace($curLine,'^\i\c*\s*([^?*+;\s]*)?.*$','$1')"/>

            <!-- Occurrence of current element. ?: optional, *: 0 or more, +: 1 or more. Empty is XSD default (1). -->

            <xsl:variable name="occurrence" select="replace($curLine,'^[^?*+;]*(\?|\*|\+).*$','$1')"/>

            <!-- Documentation. Will go into a documentation annotation. -->

            <xsl:variable name="doc" select="replace($curLine,'^[^;]+(;\s*(.*))?$','$2')"/>

            <!-- Current position in the $indicesAtRoot sequence. -->

            <xsl:variable name="pos" select="position()"/>

            <!-- Select the subsequence of all lines that contain children of the current element. -->

            <xsl:variable name="children" select="subsequence($lines, $indicesAndBound[$pos]+1, $indicesAndBound[$pos+1] - $indicesAndBound[$pos] - 1)"/>

           

            <!-- Create the element declaration. -->

            <xsd:element name="{$name}">

               <!-- No type declaration if there are children. Is an inline complex type declaration. -->

               <xsl:if test="empty($children)">

                  <xsl:choose>

                     <!-- On empty type, we default to string. -->

                     <xsl:when test="$type = ''">

                        <xsl:attribute name="type" select="'xsd:string'"/>

                     </xsl:when>

                     <xsl:otherwise>

                        <xsl:attribute name="type" select="$type"/>

                     </xsl:otherwise>

                  </xsl:choose>

               </xsl:if>

              

               <!-- Set minOccurs and maxOccurs. -->

               <xsl:choose>

                  <xsl:when test="$occurrence='?'">

                     <xsl:attribute name="minOccurs" select="'0'"/>

                  </xsl:when>

                  <xsl:when test="$occurrence='*'">

                     <xsl:attribute name="minOccurs" select="'0'"/>

                     <xsl:attribute name="maxOccurs" select="'unbounded'"/>

                  </xsl:when>

                  <xsl:when test="$occurrence='+'">

                     <xsl:attribute name="maxOccurs" select="'unbounded'"/>

                  </xsl:when>

               </xsl:choose>

              

               <!-- Set documentation annotation. -->

               <xsl:if test="$doc != ''">

                  <xsd:annotation>

                     <xsd:documentation>

                        <xsl:sequence select="$doc"/>

                     </xsd:documentation>

                  </xsd:annotation>

               </xsl:if>

              

               <!-- Recursively do child declarations. -->

               <xsl:if test="exists($children)">

                  <xsd:complexType>

                     <xsd:sequence>

                        <xsl:sequence select="jws:element-declarations($children)"/>

                     </xsd:sequence>

                  </xsd:complexType>

               </xsl:if>

            </xsd:element>

         </xsl:for-each>

      </xsl:if>

   </xsl:function>

  

</xsl:stylesheet>
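If the regular expressions in the stylesheet look intimidating, here is roughly the same idea as a Python sketch. It is not a port. The name pattern is an ASCII approximation of \i\c* (Python has no such escapes), nesting is tracked with a stack instead of the recursive function calls, and the type always defaults to xsd:string (the stylesheet only writes a type for elements without children). The names parse_line and build_tree are made up for this example.

```python
import re

_LINE = re.compile(
    r"^\s*(?P<name>[A-Za-z_][\w.\-]*)"   # element name (ASCII approximation)
    r"\s*(?P<type>[^?*+;\s]*)"           # optional type such as xsd:integer
    r"\s*(?P<occ>[?*+]?)"                # optional occurrence marker
    r"\s*(?:;\s*(?P<doc>.*))?$"          # optional comment after ';'
)

# Occurrence marker -> (minOccurs, maxOccurs), as in the stylesheet.
_OCCURS = {"": ("1", "1"), "?": ("0", "1"),
           "*": ("0", "unbounded"), "+": ("1", "unbounded")}

def parse_line(line):
    """Split one declaration line into name, type, occurrence and documentation."""
    match = _LINE.match(line)
    if match is None:
        raise ValueError("not a declaration: %r" % line)
    min_occurs, max_occurs = _OCCURS[match.group("occ")]
    return {
        "name": match.group("name"),
        "type": match.group("type") or "xsd:string",
        "minOccurs": min_occurs,
        "maxOccurs": max_occurs,
        "doc": (match.group("doc") or "").strip(),
        "children": [],
    }

def build_tree(lines):
    """Nest the parsed lines by indentation; deeper lines become children."""
    roots = []
    stack = [(-1, roots)]  # (indentation, list that receives children)
    for raw in lines:
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip())
        node = parse_line(raw)
        while stack[-1][0] >= indent:
            stack.pop()
        stack[-1][1].append(node)
        stack.append((indent, node["children"]))
    return roots
```

Feed build_tree the lines after the namespace line and you get a nested structure from which the element declarations could be written out.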

Tuesday, May 11, 2010

XSLT, XQuery, XPath tooling

I'm always on the lookout for nice tooling. Preferably small and free (as in beer, though I like the other free too). Not because I'm cheap (though I am), but because those are the sort of tools that I can use without having to go through the bureaucratic hell associated with getting the boss or customer to pay for software licenses.

Here are two of my favorite free tools for working with XSLT, XQuery and XPath.

Architag XRay XML Editor. Technically a general purpose XML editor, but it really shines when you're writing XSLTs. As you're writing your XSLT, you can set an input document, and it'll continually evaluate your XSLT against that document. That way, the second you make a change to the XSLT, you'll get to see the effect it has. Awesome!

Besides this, it continually informs you of invalidities in your XML in a very unobtrusive way (no popups or any of that nonsense), and will automatically perform Schema validations if you have the right schema open. (No need to explicitly associate documents with schemas.)

Sadly, it only supports XSLT 1.0. Still, that's the version I usually have to use at work, so XRay still comes in handy very often.

Kernow. Kernow's stated goal is "to make it faster and easier to repeatedly run transforms using Saxon." It does so admirably, but it also has very convenient sandboxes for performing XSLT 2.0 or XQuery transforms, and XML Schema or Schematron validations. Maybe not yet as convenient as XRay, but still quite nice. Definitely recommended!

Sunday, May 09, 2010

On WebLogic deployment plans

Last week I was struggling to get an MQ adapter configured using a WebLogic deployment plan. I must have been especially low on brain power that day, as they're really not all that complex. Still, they are tricky if you've never seen them before (as I hadn't), so here are some pointers to help you out.

These deployment plans turn out to be part of an implementation of JSR-88. Don't try to read that spec unless you feel significantly more masochistic than I do though. Suffice it to say that - under WebLogic, at least - you can use them to provide server-specific configuration information to an application.

You may want to use a plan if your config is included in a jar/war/ear/etc. file; for instance in an included web.xml. Having config in such a place can be a pain, as you typically don't want to change those *ar files after QA has signed off on them. A deployment plan lets you work around this problem by telling the server to pretend that the config was actually a bit (or entirely) different from what those files in the *ar said.

The way this actually works is somewhat unintuitive, but it does work. Under WebLogic, you use an XML document that conforms to this schema. (I can't comment on other platforms, as I have only tried this on WebLogic.)

The most interesting part of this schema is the module-override element. With this element, we can specify what descriptors to update, and in what modules (rar, jar, etc.). For instance:

<module-override>

  <module-name>MQSeriesAdapter.rar</module-name>

  <module-type>rar</module-type>

  <module-descriptor external="false">

    <root-element>weblogic-connector</root-element>

    <uri>META-INF/weblogic-ra.xml</uri>

    <variable-assignment>

      ...

    </variable-assignment>

  </module-descriptor>

</module-override>

This config will update the META-INF/weblogic-ra.xml descriptor in MQSeriesAdapter.rar. The actual fields being updated are specified in the variable-assignment block. (You can have multiple of those, by the way.)

Those variable-assignment blocks are where things get confusing. Before I explain, allow me to show you their basic form.

<variable-assignment>

  <name>name</name>

  <xpath>pseudoXPath</xpath>

  <operation>add</operation> <!-- Can also be remove or replace. -->

</variable-assignment>

The idea is that the data from the "name" element gets assigned to the location in the descriptor (i.e. META-INF/weblogic-ra.xml in the example) indicated with the "xpath" element. The "operation" element specifies if this value must be added, removed or replaced. (The default is "add", and in this case, any XML elements that are specified in the xpath tag, but missing in the target document, will be added.)

This is where things start getting confusing. First of all, you can't actually specify the value you want to assign in the "name" element. There is an element of indirection here in that you must assign this value to a variable first, and then you can refer to this variable here. Sound confusing? I sure thought it was.

Here's an example of how it works:

<variable-definition>

  <variable>

    <name>myVar</name>

    <value>niftyValue</value>

  </variable>

</variable-definition>

<module-override>

  <module-name>MQSeriesAdapter.rar</module-name>

  <module-type>rar</module-type>

  <module-descriptor external="false">

    <root-element>weblogic-connector</root-element>

    <uri>META-INF/weblogic-ra.xml</uri>

    <variable-assignment>

      <name>myVar</name>

      <xpath>pseudoXPath</xpath>

    </variable-assignment>

  </module-descriptor>

</module-override>

In the example we assign the value "niftyValue" via the variable myVar. Ridiculously circumlocutious (word of the day), but it works.
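To make the indirection concrete, here is a small Python sketch that reads a plan and resolves every variable-assignment to the value of the variable it names. Some assumptions on my part: the fragments are wrapped in a deployment-plan root element, the namespace a real plan carries is left out for brevity, and resolve_assignments is just a name I made up. It only reports what a plan says and has nothing to do with how WebLogic itself processes plans.

```python
import xml.etree.ElementTree as ET

PLAN = """
<deployment-plan>
  <variable-definition>
    <variable>
      <name>myVar</name>
      <value>niftyValue</value>
    </variable>
  </variable-definition>
  <module-override>
    <module-name>MQSeriesAdapter.rar</module-name>
    <module-type>rar</module-type>
    <module-descriptor external="false">
      <root-element>weblogic-connector</root-element>
      <uri>META-INF/weblogic-ra.xml</uri>
      <variable-assignment>
        <name>myVar</name>
        <xpath>pseudoXPath</xpath>
      </variable-assignment>
    </module-descriptor>
  </module-override>
</deployment-plan>
"""

def resolve_assignments(plan_xml):
    """List each assignment together with the value its variable holds."""
    root = ET.fromstring(plan_xml.strip())
    # Step 1: the variable-definition block maps names to values.
    variables = {v.findtext("name"): v.findtext("value")
                 for v in root.iter("variable")}
    # Step 2: each variable-assignment refers to a variable by name.
    resolved = []
    for module in root.iter("module-override"):
        for descriptor in module.iter("module-descriptor"):
            for assignment in descriptor.iter("variable-assignment"):
                resolved.append({
                    "module": module.findtext("module-name"),
                    "descriptor": descriptor.findtext("uri"),
                    "xpath": assignment.findtext("xpath"),
                    "operation": assignment.findtext("operation", default="add"),
                    "value": variables[assignment.findtext("name")],
                })
    return resolved
```

A missing variable raises a KeyError here, which is a handy way to catch a typo in a name before the server does.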

There's something even more confusing though. Have you noticed how I put the value "pseudoXPath" in every "xpath" element so far? This is because despite what the name suggests, this element doesn't actually accept proper XPath!

JSR-88 restricts the XPaths allowed to just those that contain only ".", "..", "/", and tag names. I suppose it makes sense to restrict the allowed expressions a bit, considering that the expression can be used to create a new element. (In which case the plan processor must manipulate the XML until the XPath expression evaluates to a node. Could be tricky for complex XPaths.) Still, as it stands, it's probably a bit too restrictive.

Fortunately, WebLogic seems to accept an enhanced syntax, but this makes it deviate even more from the XPath standard. I haven't yet found proper documentation of the syntax, but I know that it accepts XPath predicates with equality comparisons. Unlike proper XPath though, they must be preceded by a slash ("/"). So you can have a construction like

book/[title="Lord of the Rings"]

to select only books with a title element with value "Lord of the Rings".
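Since I haven't found documentation for this syntax, the following Python sketch only shows how I read it: a path is a series of tag names separated by slashes, and a bracketed equality test applies to the step right before it. The function name split_plan_path is invented, and the sketch makes no attempt to validate malformed input or to mimic what WebLogic really does.

```python
import re

# Either a predicate like [title="Lord of the Rings"] or a plain tag name.
_STEP = re.compile(
    r'\[(?P<key>[^=\]]+)="(?P<value>[^"]*)"\]|(?P<tag>[^/\[\]]+)'
)

def split_plan_path(path):
    """Split a plan path into (tag, {child: value}) steps."""
    steps = []
    for match in _STEP.finditer(path):
        if match.group("tag") is not None:
            steps.append((match.group("tag"), {}))
        elif steps:
            # A predicate belongs to the step that precedes it.
            steps[-1][1][match.group("key")] = match.group("value")
        else:
            raise ValueError("predicate without a preceding step")
    return steps
```

Matching the predicates with a regular expression instead of splitting on "/" matters, because the values themselves may contain slashes.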

Don't bother with namespaces in these XPaths, by the way. WebLogic seems to ignore namespaces when processing them, which I guess is usually pretty convenient. Yet another way in which it differs from true XPath, however.

Putting it all together, you could end up with something like this:

<variable-definition>

  <variable>

    <name>port</name>

    <value>1414</value>

  </variable>

</variable-definition>

<module-override>

  <module-name>MQSeriesAdapter.rar</module-name>

  <module-type>rar</module-type>

  <module-descriptor external="false">

    <root-element>weblogic-connector</root-element>

    <uri>META-INF/weblogic-ra.xml</uri>

    <variable-assignment>

      <name>port</name>

      <xpath>/weblogic-connector/outbound-resource-adapter/connection-definition-group/[connection-factory-interface="javax.resource.cci.ConnectionFactory"]/connection-instance/[jndi-name="eisMQ/MQAdapter"]/connection-properties/properties/property/[name="portNumber"]/value</xpath>

    </variable-assignment>

  </module-descriptor>

</module-override>

By now you should have a pretty good idea of how these plans work. If you want to know how to deploy them, then you could do worse than to check out this helpful blog.

Wednesday, May 05, 2010

Humble Indie Bundle

Wow. Five indie classics for literally whatever you want to pay for them. And you get to decide how much of whatever you want to pay goes to charity, and how much to the developers. Nothing to any middle-men except for the payment processor. (Your choice of Amazon, Google, or PayPal.) Also, no DRM, and all titles are available for Windows, Linux, and Mac.

If ever you said that you wouldn't pirate if only prices weren't so high, if only developers/distributors weren't so evil, if only your favorite platform was supported, or if only there wasn't any DRM, then here's your chance to show that you meant it!

Check out the Humble Indie Bundle.

P.S. Also check it out if you never said anything like those things above. :-)

Saturday, May 01, 2010

XSLT copies and sequences

And yet another post! Is it a trend or an aberration? Only time will tell.

In a previous post, I offered some solutions for problems that arise from having to deal with nodes that are functionally identical, yet still different. Sort of how two cars can be absolutely identical in terms of brand, model, year, color, etc. and yet still remain two distinct cars. (Just try to argue with the tax man that those two identical cars are actually one and the same.)

This problem can arise very easily when you deal with variables in XSLT. Consider the following:

<xsl:variable name="foo">
  <a/>
</xsl:variable>

<xsl:variable name="bar">
  <a/>
</xsl:variable>

If you compare the nodes in these variables with "$foo/a is $bar/a", the result will be "false", indicating that while these nodes may look awfully identical, XPath doesn't consider them to be the same node. And XPath does have a point, because these <a> elements will be distinct copies in memory.

In fact, you may not realize just how many copies your XSLTs are making. It's not just these hard coded bits of XML in variables, it's also any time you use an xsl:copy, xsl:copy-of, or xsl:element, as well as when you use included content in an xsl:variable, xsl:param, or xsl:with-param without a type declaration. (And some other, less common constructs as well.)

Not only can this be most inconvenient (for instance when you want to use set operators), but you may also be wasting machine resources in your performance critical application.

Fortunately, it's relatively easy to reduce the number of copies. You just have to know the tricks of the trade. And those are just what I'm going to tell you right now.

Don't duplicate hard coded XML content
Whenever you have hard coded XML content, this will result in nodes being created. If you create the same XML content in multiple places (such as we did in the foo and bar variables earlier), those will be duplicates. We could have avoided this by just copying foo to bar, like so:

<xsl:variable name="bar" select="$foo"/>

Now XSLT will create a new node for foo, but not for bar, as the latter will simply point to the same node that was already created for foo.

Replace xsl:copy-of with xsl:sequence
Unlike xsl:copy-of, xsl:sequence can return existing nodes. And since everything is a sequence anyway in XSLT 2.0 (including the result of xsl:copy-of), there's really no reason to not just use xsl:sequence instead of xsl:copy-of. The same goes for xsl:copy's without children, but those tend to be uncommon.

So rather than this:

<xsl:copy-of select="//baz"/>

Use this:

<xsl:sequence select="//baz"/>

Simple!
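If the difference still feels abstract, here is an analogy in Python rather than XSLT. The helper names copy_of and sequence are made up to mirror the two instructions; they are not part of any XSLT processor. One hands back fresh duplicates, the other hands back the nodes you selected.

```python
import copy
import xml.etree.ElementTree as ET

doc = ET.fromstring("<root><baz>1</baz><baz>2</baz></root>")

def copy_of(nodes):
    """Analogue of xsl:copy-of: every selected node is duplicated."""
    return [copy.deepcopy(node) for node in nodes]

def sequence(nodes):
    """Analogue of xsl:sequence: the selected nodes themselves are returned."""
    return list(nodes)

selected = doc.findall(".//baz")
copies = copy_of(selected)
same = sequence(selected)

# The copies look the same but are different objects; the sequence is not.
print([c is s for c, s in zip(copies, selected)])
print([x is s for x, s in zip(same, selected)])
```

The copies cost memory and fail an identity test against the originals, which is exactly what trips up the set operators.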

Make sure xsl:variable, xsl:param, xsl:with-param have either a "select" or an "as" attribute (or both)
The elements xsl:variable, xsl:param, xsl:with-param always make a copy, unless you specify either the "select" attribute, or the "as" attribute (or both).

Whenever possible, use the select attribute, as in those cases you'll never get a copy. If that's not possible, and you really do have to use the element content, you can specify the variable's type with the "as" attribute. In such a case, XSLT will not force the copy to be made. However, if you use hard coded XML, or an xsl:copy-of in the variable content, then those'll still result in copies!

So this is good:

<xsl:variable name="baz" select="//bazElem"/>

As is:

<xsl:variable name="baz" as="node()*">

  <xsl:sequence select="//bazElem"/>

</xsl:variable>

But this is going to create a new node in any case:

<xsl:variable name="baz" as="node()">

  <bazElem/>

</xsl:variable>

And any nodes here will also be copies:

<xsl:variable name="baz" as="node()*">

  <xsl:copy-of select="//bazElem"/>

</xsl:variable>

Pro-tip: if you're unsure about the type of your variable, just specify "item()*". That'll allow any sort of sequence.

And that's all you need to know to get rid of most of those unnecessary copies. :-)

Thursday, April 29, 2010

Set operators in XPath

What's this? A post? Fer realz? Best pretend it never happened.

Meanwhile, I'll just continue typing. About set operators in XPath, no less.

In the olden days of XPath 1.0, there was only one set operator, "|", which performed a union on node sets. So you could use an expression like "//foo | //bar" to select all foo and bar elements. If, however, you wanted the intersection or difference of two node sets, you were out of luck, as there were no dedicated operators to help you do that.

Fortunately, the much improved XPath 2.0 remedies that situation by introducing the "intersect" and "except" operators. (And "union", which is just another way of saying "|".) So now you can do "foo intersect bar" to get nodes that are both foos and bars, and "foo except bar" to get nodes that are foos but not bars.

Sounds great, doesn't it? Sadly, there's a big pitfall in the way that nodes are determined to be equal. You see, for purposes of these operators, nodes are only deemed equal if they are the exact same nodes, not if they merely have the same name and the same content. So, in the document below, the two foo elements are not considered equal!

<root>
<foo>This is a foo.</foo>
<foo>This is a foo.</foo>
</root>

Now what if we want to perform set operations on nodes based on equality in terms of name and content, rather than XPath's strict definition?

Fortunately, XPath 2.0 does come with a function, deep-equal, that can determine if two nodes are identical in name and content, without taking into account if they're the exact same node. We can use this to write queries that do what we want.

Let's assume we have a variable a containing the following nodes (under a nameless root, not as a sequence):

<e></e>
<f>fff</f>
<g></g>

As well as a variable b containing these nodes:

<e></e>
<f>fff</f>
<i></i>

We can now simulate "except" as follows:

$a/*[empty(for $b1 in $b/* return (if (deep-equal(.,$b1)) then (true()) else ()))]

You should read this as: for all nodes of a, return only those for which no corresponding node can be found in b.

If you understand how our "except" works, you'll have no problem understanding "intersect":

$a/*[exists(for $b1 in $b/* return (if (deep-equal(.,$b1)) then (true()) else ()))]

Read this as: for all nodes of a, return only those for which a corresponding node can be found in b.

Finally, we have "union":

$a/*[empty(for $b1 in $b/* return (if (deep-equal(.,$b1)) then (true()) else ()))],$b/*

Read this as: take a except b and add all of b. (The "except b" bit is necessary to make sure you don't get the nodes that occur in both variables twice. Without it, you'd end up with duplicate <e> and <f> elements.)

There's one thing to be aware of in these queries though. Unlike the set operators, these queries don't remove duplicates when these are already in either of the inputs. So if a contains duplicate nodes, "$a except $b" will remove these, while our query won't. Depending on your use case this is either a bug or a feature. :-)

To help you out, here's a way to remove duplicates from a. If you combine this with the queries above, you'll have something that behaves just like the set operators do. (Apart from the difference in determining when nodes are equal, of course.)

$a/*[empty(for $a1 in subsequence($a/*,1,position()-1) return (if (deep-equal(.,$a1)) then (true()) else ()))]

Read this as: go through all nodes in a and return a node only if we haven't already passed an identical one.
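For those who think more easily in a general purpose language, here are the same four queries as a Python sketch. The deep_equal function below is a rough stand-in that I wrote for XPath's deep-equal (it compares tag, attributes, text and children of ElementTree elements and ignores comments and processing instructions), and the function names are my own.

```python
import xml.etree.ElementTree as ET

def deep_equal(x, y):
    """Rough stand-in for deep-equal(): same tag, attributes, text and children."""
    if x.tag != y.tag or x.attrib != y.attrib:
        return False
    if (x.text or "") != (y.text or "") or len(x) != len(y):
        return False
    return all(deep_equal(c, d) and (c.tail or "") == (d.tail or "")
               for c, d in zip(x, y))

def except_(a, b):
    """Nodes of a for which no corresponding node can be found in b."""
    return [x for x in a if not any(deep_equal(x, y) for y in b)]

def intersect(a, b):
    """Nodes of a for which a corresponding node can be found in b."""
    return [x for x in a if any(deep_equal(x, y) for y in b)]

def union(a, b):
    """a except b, followed by all of b."""
    return except_(a, b) + list(b)

def distinct(a):
    """Keep a node only if we haven't already passed an identical one."""
    kept = []
    for x in a:
        if not any(deep_equal(x, seen) for seen in kept):
            kept.append(x)
    return kept

a = list(ET.fromstring("<a><e></e><f>fff</f><g></g></a>"))
b = list(ET.fromstring("<b><e></e><f>fff</f><i></i></b>"))
print([x.tag for x in except_(a, b)])
print([x.tag for x in intersect(a, b)])
print([x.tag for x in union(a, b)])
```

As with the XPath versions, each comparison walks all of b, so this is quadratic. Fine for small node sets, less so for big documents.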

There ya go. Hope this'll be of some help.
