Showing posts with label xslt. Show all posts

Friday, September 17, 2010

Cleaning up namespace declarations

Ever been bothered by redundant namespace declarations? Some tools generate tons and tons of these, and they really clutter up your XML, making it hard to read and maintain, as well as increasing file size for no reason.

Here's a handy XSLT to move all declarations to the top element of your document, removing duplicates. (It even moves seemingly unused declarations. This is intended behavior, as the declarations may still be referenced in the text, e.g. in the XPath expressions of a BPEL file.)


<!-- Copyright J.W. v/d Broek 2010. Do with this code as you will. -->
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" exclude-result-prefixes="#all">

   <xsl:template match="/*">
      <xsl:copy copy-namespaces="no">
         <xsl:for-each-group select="for $x in //* return for $y in in-scope-prefixes($x)[.!='' and .!='xml'] return concat($y, ':',namespace-uri-for-prefix($y, $x))" group-by=".">
            <xsl:namespace name="{substring-before(., ':')}" select="substring-after(., ':')"/>
         </xsl:for-each-group>
         <xsl:apply-templates select="node()|@*"/>
      </xsl:copy>
   </xsl:template>

   <xsl:template match="node()|@*">
      <xsl:copy copy-namespaces="no">
         <xsl:apply-templates select="node()|@*"/>
      </xsl:copy>
   </xsl:template>

</xsl:stylesheet>
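
To make the effect concrete, here is a small made-up document (the element names and the urn:example URIs are purely illustrative) together with the kind of result I'd expect the stylesheet above to give. Treat it as a sketch: the order of the declarations on the top element may differ between processors.

```xml
<!-- Hypothetical input: the same declaration is repeated on sibling elements. -->
<root>
   <a:item xmlns:a="urn:example:a"><b:part xmlns:b="urn:example:b"/></a:item>
   <a:item xmlns:a="urn:example:a"/>
</root>

<!-- Expected shape of the output: every declaration appears once, on the top element. -->
<root xmlns:a="urn:example:a" xmlns:b="urn:example:b">
   <a:item><b:part/></a:item>
   <a:item/>
</root>
```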

Tuesday, June 01, 2010

Making XML Schema less of a pain by parsing text with XSLT

Allow me to get to the point immediately. XML Schema can be a royal pain.

Don't get me wrong; I'm glad it exists. It's powerful, serves a clear purpose, is well-supported, yadda, yadda, yadda. Unfortunately, it's also quite complex, has a lot of pitfalls (elementFormDefault!), and is terribly verbose.

For instance, would you rather have this:


http://blog.jwbroek.com/nifty-namespace
thingamabob        ; This is a comment.
  foo xsd:string
  bar xsd:boolean  ; Set to true to enable bar.
  baz
    alice
      count xsd:integer?  ; Count is optional.
      description  ; Type defaults to string.
    bobs           ; List of 0 or more bobs.
      bob xsd:boolean*
    charles +      ; At least one charles.


Or this:


<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:tns="http://blog.jwbroek.com/nifty-namespace"
            xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            elementFormDefault="qualified"
            attributeFormDefault="unqualified"
            targetNamespace="http://blog.jwbroek.com/nifty-namespace">
   <xsd:element name="thingamabob">
      <xsd:annotation>
         <xsd:documentation>This is a comment.</xsd:documentation>
      </xsd:annotation>
      <xsd:complexType>
         <xsd:sequence>
            <xsd:element name="foo" type="xsd:string"/>
            <xsd:element name="bar" type="xsd:boolean">
               <xsd:annotation>
                  <xsd:documentation>Set to true to enable bar.</xsd:documentation>
               </xsd:annotation>
            </xsd:element>
            <xsd:element name="baz">
               <xsd:complexType>
                  <xsd:sequence>
                     <xsd:element name="alice">
                        <xsd:complexType>
                           <xsd:sequence>
                              <xsd:element name="count" type="xsd:integer" minOccurs="0">
                                 <xsd:annotation>
                                    <xsd:documentation>Count is optional.</xsd:documentation>
                                 </xsd:annotation>
                              </xsd:element>
                              <xsd:element name="description" type="xsd:string">
                                 <xsd:annotation>
                                    <xsd:documentation>Type defaults to string.</xsd:documentation>
                                 </xsd:annotation>
                              </xsd:element>
                           </xsd:sequence>
                        </xsd:complexType>
                     </xsd:element>
                     <xsd:element name="bobs">
                        <xsd:annotation>
                           <xsd:documentation>List of 0 or more bobs.</xsd:documentation>
                        </xsd:annotation>
                        <xsd:complexType>
                           <xsd:sequence>
                              <xsd:element name="bob" type="xsd:boolean" minOccurs="0" maxOccurs="unbounded"/>
                           </xsd:sequence>
                        </xsd:complexType>
                     </xsd:element>
                     <xsd:element name="charles" type="xsd:string" maxOccurs="unbounded">
                        <xsd:annotation>
                           <xsd:documentation>At least one charles.</xsd:documentation>
                        </xsd:annotation>
                     </xsd:element>
                  </xsd:sequence>
               </xsd:complexType>
            </xsd:element>
         </xsd:sequence>
      </xsd:complexType>
   </xsd:element>
</xsd:schema>


Both describe the same XML structure, but if you ask me, the first one is much clearer, and much quicker to write as well.

Granted, we're not using any of the fancy bells and whistles of XML Schema here. However, this would be quite sufficient for most of the things I see Schema being used for.

Wouldn't it be nice if you could actually write your Schemas using the first syntax?

Well, you're in luck: you can! The Schema above was entirely generated by applying the XSLT below to the simple syntax at the top. Hope you'll enjoy it as much as I do. :-)

(Tip: use Kernow to execute the XSLT. Put your input in C:\dev\projects\schemagen\test\input.txt, or override the parameter to use a file of your choice.)


<!--
Copyright 2010 J.W. van den Broek

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:jws="http://blog.jwbroek.com/xslt/xsd/functions"
   exclude-result-prefixes="#all">
  
   <xsl:output indent="yes"/>
  
   <!-- Override this to read your file. -->
   <xsl:param name="input-file" select="'file:///C:/dev/projects/schemagen/test/input.txt'"/>
  
   <xsl:template match="/">
      <!-- Sequence of all non-empty lines in the input. Splits on CRLF, CR, or LF, so any line-ending convention works. -->
      <xsl:variable name="lines" select="tokenize(unparsed-text($input-file),'\r\n|\r|\n')[not(matches(.,'^\s*$'))]"/>
     
      <!-- Create the schema. Make the root schema element here, taking the target namespace from the first line of input. -->
      <xsd:schema elementFormDefault="qualified" attributeFormDefault="unqualified" targetNamespace="{$lines[1]}">
         <xsl:namespace name="tns" select="$lines[1]"/>
         <!-- Pass all other lines on to the element-declarations function, which will create the element declarations. -->
         <xsl:sequence select="jws:element-declarations(subsequence($lines,2,count($lines)-1))"/>
      </xsd:schema>
   </xsl:template>
  
   <!-- Create element declarations based on lines of input. -->
   <xsl:function name="jws:element-declarations" as="element()*">
      <xsl:param name="rawLines" as="xsd:string*"/>
     
      <!-- Only continue if we have lines of input remaining. -->
      <xsl:if test="exists($rawLines)">
         <!-- Take the indentation from the first line. We'll create declarations for all elements with this level of indentation. -->
         <!-- We'll recursively create declarations for elements at higher indentation. -->
         <xsl:variable name="curIndent" select="replace($rawLines[1],'^(\s*).+$','$1')"/>
         <!-- Remove the base indentation from all lines. The elements we're going to make declarations for now have no indentation. -->
         <xsl:variable name="lines" select="for $l in $rawLines return substring-after($l, $curIndent)"/>
         <!-- Determine indices for all elements without indentation. We'll use this info to efficiently access the right lines of input. -->
         <xsl:variable name="indicesAtRoot" select="index-of((for $l in $lines return matches($l, '^\i+.*')), true())"/>
         <!-- Contains the root indices, but also the end of input. We'll use this to create subsequences for our recursive calls. -->
         <xsl:variable name="indicesAndBound" select="$indicesAtRoot, count($lines)+1"/>
        
         <!-- Create declarations for all root elements. (And recursively all child elements as well.) -->
         <xsl:for-each select="$indicesAtRoot">
            <!-- Current line of input. -->
            <xsl:variable name="curLine" select="$lines[current()]"/>
            <!-- Name of current element. -->
            <xsl:variable name="name" select="replace($curLine,'^(\i\c*).*$','$1')"/>
            <!-- Type of current element. May be empty, in which case we'll use xsd:string as default later on. -->
            <xsl:variable name="type" select="replace($curLine,'^\i\c*\s*([^?*+;\s]*)?.*$','$1')"/>
            <!-- Occurrence of current element. ?: optional, *: 0 or more, +: 1 or more. Empty is XSD default (1). -->
            <xsl:variable name="occurrence" select="replace($curLine,'^[^?*+;]*(\?|\*|\+).*$','$1')"/>
            <!-- Documentation. Will go into a documentation annotation. -->
            <xsl:variable name="doc" select="replace($curLine,'^[^;]+(;\s*(.*))?$','$2')"/>
            <!-- Current position in the $indicesAtRoot sequence. -->
            <xsl:variable name="pos" select="position()"/>
            <!-- Select the subsequence of all lines that contain children of the current element. -->
            <xsl:variable name="children" select="subsequence($lines, $indicesAndBound[$pos]+1, $indicesAndBound[$pos+1] - $indicesAndBound[$pos] - 1)"/>
           
            <!-- Create the element declaration. -->
            <xsd:element name="{$name}">
               <!-- No type attribute if there are children; those elements get an inline complex type instead. -->
               <xsl:if test="empty($children)">
                  <xsl:choose>
                     <!-- On empty type, we default to string. -->
                     <xsl:when test="$type = ''">
                        <xsl:attribute name="type" select="'xsd:string'"/>
                     </xsl:when>
                     <xsl:otherwise>
                        <xsl:attribute name="type" select="$type"/>
                     </xsl:otherwise>
                  </xsl:choose>
               </xsl:if>
              
               <!-- Set minOccurs and maxOccurs. -->
               <xsl:choose>
                  <xsl:when test="$occurrence='?'">
                     <xsl:attribute name="minOccurs" select="'0'"/>
                  </xsl:when>
                  <xsl:when test="$occurrence='*'">
                     <xsl:attribute name="minOccurs" select="'0'"/>
                     <xsl:attribute name="maxOccurs" select="'unbounded'"/>
                  </xsl:when>
                  <xsl:when test="$occurrence='+'">
                     <xsl:attribute name="maxOccurs" select="'unbounded'"/>
                  </xsl:when>
               </xsl:choose>
              
               <!-- Set documentation annotation. -->
               <xsl:if test="$doc != ''">
                  <xsd:annotation>
                     <xsd:documentation>
                        <xsl:sequence select="$doc"/>
                     </xsd:documentation>
                  </xsd:annotation>
               </xsl:if>
              
               <!-- Recursively do child declarations. -->
               <xsl:if test="exists($children)">
                  <xsd:complexType>
                     <xsd:sequence>
                        <xsl:sequence select="jws:element-declarations($children)"/>
                     </xsd:sequence>
                  </xsd:complexType>
               </xsl:if>
            </xsd:element>
         </xsl:for-each>
      </xsl:if>
   </xsl:function>
  
</xsl:stylesheet>
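
If the regular expressions in the function look a bit dense, it may help to run them on a single line of input. The stylesheet below is just an illustrative sketch: it applies the same four replace() patterns to one line from the example at the top. The template name "main" is my own choice; run it as the initial named template (with Saxon, for instance, via the -it:main option).

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <xsl:output method="text"/>

   <xsl:template name="main">
      <!-- One line of the simple syntax, taken from the example input. -->
      <xsl:variable name="line" select="'count xsd:integer?  ; Count is optional.'"/>

      <!-- Element name: an initial name character followed by name characters. -->
      <xsl:value-of select="concat('name: ', replace($line,'^(\i\c*).*$','$1'), '&#10;')"/>
      <!-- Type: whatever follows the name, up to an occurrence marker, comment or whitespace. -->
      <xsl:value-of select="concat('type: ', replace($line,'^\i\c*\s*([^?*+;\s]*)?.*$','$1'), '&#10;')"/>
      <!-- Occurrence marker: the first ?, * or + that appears before any comment. -->
      <xsl:value-of select="concat('occurrence: ', replace($line,'^[^?*+;]*(\?|\*|\+).*$','$1'), '&#10;')"/>
      <!-- Documentation: everything after the semicolon. -->
      <xsl:value-of select="concat('doc: ', replace($line,'^[^;]+(;\s*(.*))?$','$2'), '&#10;')"/>
   </xsl:template>

</xsl:stylesheet>
```

For this line, the output should be "name: count", "type: xsd:integer", "occurrence: ?" and "doc: Count is optional.", each on its own line. Note that replace() returns its input unchanged when the pattern doesn't match, which is why a line without an occurrence marker simply falls through the xsl:choose in the real stylesheet.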

Tuesday, May 11, 2010

XSLT, XQuery, XPath tooling

I'm always on the lookout for nice tooling. Preferably small and free (as in beer, though I like the other free too). Not because I'm cheap (though I am), but because those are the sort of tools that I can use without having to go through the bureaucratic hell associated with getting the boss or customer to pay for software licenses.

Here are two of my favorite free tools for working with XSLT, XQuery and XPath.

Architag XRay XML Editor. Technically a general purpose XML editor, but it really shines when you're writing XSLTs. As you're writing your XSLT, you can set an input document, and it'll continually evaluate your XSLT against that document. That way, the second you make a change to the XSLT, you'll get to see the effect it has. Awesome!

Besides this, it continually informs you of invalidities in your XML in a very unobtrusive way (no popups or any of that nonsense), and will automatically perform Schema validations if you have the right schema open. (No need to explicitly associate documents with schemas.)

Sadly, it only supports XSLT 1.0. Still, that's the version I usually have to use at work, so XRay still comes in handy very often.

Kernow. Kernow's stated goal is "to make it faster and easier to repeatedly run transforms using Saxon." It does so admirably, but it also has very convenient sandboxes for performing XSLT 2.0 or XQuery transforms, and XML Schema or Schematron validations. Maybe not yet as convenient as XRay, but still quite nice. Definitely recommended!

Saturday, May 01, 2010

XSLT copies and sequences

And yet another post! Is it a trend or an aberration? Only time will tell.

In a previous post, I offered some solutions for problems that arise from having to deal with nodes that are functionally identical, yet still different. Sort of how two cars can be absolutely identical in terms of brand, model, year, color, etc. and yet still remain two distinct cars. (Just try to argue with the tax man that those two identical cars are actually one and the same.)

This problem can arise very easily when you deal with variables in XSLT. Consider the following:

<xsl:variable name="foo">
  <a/>
</xsl:variable>

<xsl:variable name="bar">
  <a/>
</xsl:variable>

If you compare the nodes in these variables with "$foo/a is $bar/a", the result will be "false", indicating that while these nodes may look awfully identical, XPath doesn't consider them to be the same node. And XPath does have a point, because these <a> elements will be distinct copies in memory.
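
Here's a minimal sketch you can run to see this for yourself. It declares the foo and bar variables from above and compares them in three ways. The template name "main" and the use of deep-equal() are my own additions for illustration; deep-equal() is the standard function for comparing nodes by content rather than by identity.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <xsl:output method="text"/>

   <xsl:variable name="foo">
      <a/>
   </xsl:variable>

   <xsl:variable name="bar">
      <a/>
   </xsl:variable>

   <xsl:template name="main">
      <!-- Identity comparison of two separately constructed nodes: false. -->
      <xsl:value-of select="$foo/a is $bar/a"/>
      <xsl:text>&#10;</xsl:text>
      <!-- Content comparison of the same two nodes: true. -->
      <xsl:value-of select="deep-equal($foo/a, $bar/a)"/>
      <xsl:text>&#10;</xsl:text>
      <!-- A node is always identical to itself: true. -->
      <xsl:value-of select="$foo/a is $foo/a"/>
      <xsl:text>&#10;</xsl:text>
   </xsl:template>

</xsl:stylesheet>
```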

In fact, you may not realize just how many copies your XSLTs are making. It's not just these hard coded bits of XML in variables, it's also any time you use an xsl:copy, xsl:copy-of, or xsl:element, as well as when you give an xsl:variable, xsl:param, or xsl:with-param element content without an "as" type declaration. (And some other, less common constructs as well.)

Not only can this be most inconvenient (for instance when you want to use set operators), but you may also be wasting machine resources in your performance critical application.

Fortunately, it's relatively easy to reduce the number of copies. You just have to know the tricks of the trade. And those are just what I'm going to tell you right now.

Don't duplicate hard coded XML content
Whenever you have hard coded XML content, this will result in nodes being created. If you create the same XML content in multiple places (such as we did in the foo and bar variables earlier), those will be duplicates. We could have avoided this by simply pointing bar at foo, like so:

<xsl:variable name="bar" select="$foo"/>

Now XSLT will create a new node for foo, but not for bar, as the latter will simply point to the same node that was already created for foo.

Replace xsl:copy-of with xsl:sequence
Unlike xsl:copy-of, xsl:sequence can return existing nodes. And since everything is a sequence anyway in XSLT 2.0 (including the result of xsl:copy-of), there's really no reason to not just use xsl:sequence instead of xsl:copy-of. The same goes for xsl:copy's without children, but those tend to be uncommon.

So rather than this:
<xsl:copy-of select="//baz"/>

Use this:
<xsl:sequence select="//baz"/>

Simple!
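
If you'd like to check the difference yourself, the sketch below may help. It fills two typed variables from the same source node, once with xsl:copy-of and once with xsl:sequence, and then tests node identity. The variable names (src, copied, same) and the template name "main" are made up for this example.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <xsl:output method="text"/>

   <!-- A small temporary tree to select from. -->
   <xsl:variable name="src">
      <baz/>
   </xsl:variable>

   <!-- xsl:copy-of constructs a new node. -->
   <xsl:variable name="copied" as="element()">
      <xsl:copy-of select="$src/baz"/>
   </xsl:variable>

   <!-- xsl:sequence returns the existing node. -->
   <xsl:variable name="same" as="element()">
      <xsl:sequence select="$src/baz"/>
   </xsl:variable>

   <xsl:template name="main">
      <!-- The copy is a different node than the original: false. -->
      <xsl:value-of select="$copied is $src/baz"/>
      <xsl:text>&#10;</xsl:text>
      <!-- The sequence holds the original node itself: true. -->
      <xsl:value-of select="$same is $src/baz"/>
      <xsl:text>&#10;</xsl:text>
   </xsl:template>

</xsl:stylesheet>
```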

Make sure xsl:variable, xsl:param, xsl:with-param have either a "select" or an "as" attribute (or both)
The elements xsl:variable, xsl:param, xsl:with-param always make a copy, unless you specify either the "select" attribute, or the "as" attribute (or both).

Whenever possible, use the select attribute, as in those cases you'll never get a copy. If that's not possible, and you really do have to use the element content, you can specify the variable's type with the "as" attribute. In such a case, XSLT will not force a copy to be made. However, if you use hard coded XML, or an xsl:copy-of in the variable content, then those'll still result in copies!

So this is good:
<xsl:variable name="baz" select="//bazElem"/>

As is:
<xsl:variable name="baz" as="node()*">
  <xsl:sequence select="//bazElem"/>
</xsl:variable>

But this is going to create a new node in any case:
<xsl:variable name="baz" as="node()">
  <bazElem/>
</xsl:variable>

And any nodes here will also be copies:
<xsl:variable name="baz" as="node()*">
  <xsl:copy-of select="//bazElem"/>
</xsl:variable>

Pro-tip: if you're unsure about the type of your variable, just specify "item()*". That'll allow any sort of sequence.
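
To see what the "as" attribute buys you, here's a small sketch that fills two variables with the exact same xsl:sequence, one without and one with a type declaration. The names src, untyped and typed, and the template name "main", are just illustrative choices.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <xsl:output method="text"/>

   <!-- A small temporary tree to select from. -->
   <xsl:variable name="src">
      <bazElem/>
   </xsl:variable>

   <!-- No "select" and no "as": the content is copied into a new document node. -->
   <xsl:variable name="untyped">
      <xsl:sequence select="$src/bazElem"/>
   </xsl:variable>

   <!-- With "as": the variable holds the selected node itself. -->
   <xsl:variable name="typed" as="element()">
      <xsl:sequence select="$src/bazElem"/>
   </xsl:variable>

   <xsl:template name="main">
      <!-- The untyped variable is a document node wrapping a copy: true. -->
      <xsl:value-of select="$untyped instance of document-node()"/>
      <xsl:text>&#10;</xsl:text>
      <!-- The element inside it is not the original: false. -->
      <xsl:value-of select="$untyped/bazElem is $src/bazElem"/>
      <xsl:text>&#10;</xsl:text>
      <!-- The typed variable holds the original node: true. -->
      <xsl:value-of select="$typed is $src/bazElem"/>
      <xsl:text>&#10;</xsl:text>
   </xsl:template>

</xsl:stylesheet>
```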

And that's all you need to know to get rid of most of those unnecessary copies. :-)