Tuesday, June 01, 2010

Making XML Schema less of a pain by parsing text with XSLT

Allow me to get to the point immediately. XML Schema can be a royal pain.

Don't get me wrong; I'm glad it exists. It's powerful, serves a clear purpose, is well-supported, yadda, yadda, yadda. Unfortunately, it's also quite complex, has a lot of pitfalls (elementFormDefault!), and is terribly verbose.

For instance, would you rather have this:


http://blog.jwbroek.com/nifty-namespace
thingamabob        ; This is a comment.
  foo xsd:string
  bar xsd:boolean  ; Set to true to enable bar.
  baz
    alice
      count xsd:integer?  ; Count is optional.
      description  ; Type defaults to string.
    bobs           ; List of 0 or more bobs.
      bob xsd:boolean*
    charles +      ; At least one charles.


Or this:


<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:tns="http://blog.jwbroek.com/nifty-namespace"
            xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            elementFormDefault="qualified"
            attributeFormDefault="unqualified"
            targetNamespace="http://blog.jwbroek.com/nifty-namespace">
   <xsd:element name="thingamabob">
      <xsd:annotation>
         <xsd:documentation>This is a comment.</xsd:documentation>
      </xsd:annotation>
      <xsd:complexType>
         <xsd:sequence>
            <xsd:element name="foo" type="xsd:string"/>
            <xsd:element name="bar" type="xsd:boolean">
               <xsd:annotation>
                  <xsd:documentation>Set to true to enable bar.</xsd:documentation>
               </xsd:annotation>
            </xsd:element>
            <xsd:element name="baz">
               <xsd:complexType>
                  <xsd:sequence>
                     <xsd:element name="alice">
                        <xsd:complexType>
                           <xsd:sequence>
                              <xsd:element name="count" type="xsd:integer" minOccurs="0">
                                 <xsd:annotation>
                                    <xsd:documentation>Count is optional.</xsd:documentation>
                                 </xsd:annotation>
                              </xsd:element>
                              <xsd:element name="description" type="xsd:string">
                                 <xsd:annotation>
                                    <xsd:documentation>Type defaults to string.</xsd:documentation>
                                 </xsd:annotation>
                              </xsd:element>
                           </xsd:sequence>
                        </xsd:complexType>
                     </xsd:element>
                     <xsd:element name="bobs">
                        <xsd:annotation>
                           <xsd:documentation>List of 0 or more bobs.</xsd:documentation>
                        </xsd:annotation>
                        <xsd:complexType>
                           <xsd:sequence>
                              <xsd:element name="bob" type="xsd:boolean" minOccurs="0" maxOccurs="unbounded"/>
                           </xsd:sequence>
                        </xsd:complexType>
                     </xsd:element>
                     <xsd:element name="charles" type="xsd:string" maxOccurs="unbounded">
                        <xsd:annotation>
                           <xsd:documentation>At least one charles.</xsd:documentation>
                        </xsd:annotation>
                     </xsd:element>
                  </xsd:sequence>
               </xsd:complexType>
            </xsd:element>
         </xsd:sequence>
      </xsd:complexType>
   </xsd:element>
</xsd:schema>


Both describe the same XML structure, but if you ask me, the first one is much clearer, and much quicker to write as well.

Granted, we're not using any of the fancy bells and whistles of XML Schema here. However, this would be quite sufficient for most of the things I see Schema being used for.

Wouldn't it be nice if you could actually write your Schema's using the first syntax?

Well, you're in luck: you can! The Schema above was entirely generated by applying the XSLT below to the simple syntax at the top. Hope you'll enjoy it as much as I do. :-)

(Tip: use Kernow to execute the XSLT. Put your input in C:\dev\projects\schemagen\test\input.txt, or override the parameter to use a file of your choice.)


<!--
Copyright 2010 J.W. van den Broek

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:jws="http://blog.jwbroek.com/xslt/xsd/functions"
   exclude-result-prefixes="#all">
  
   <xsl:output indent="yes"/>
  
   <!-- Override this to read your file. -->
   <xsl:param name="input-file" select="'file:///C:/dev/projects/schemagen/test/input.txt'"/>
  
   <xsl:template match="/">
      <!-- Sequence of all non-empty lines in the input. -->
      <xsl:variable name="lines" select="tokenize(unparsed-text($input-file),'&#x0D;')[not(matches(.,'^\s*$'))]"/>
     
      <!-- Create the schema. Make the root schema element here, taking the target namespace from the first line of input. -->
      <xsd:schema elementFormDefault="qualified" attributeFormDefault="unqualified" targetNamespace="{$lines[1]}">
         <xsl:namespace name="tns" select="$lines[1]"/>
         <!-- Pass all other lines on the the element-declarations function, which will create the element declarations. -->
         <xsl:sequence select="jws:element-declarations(subsequence($lines,2,count($lines)-1))"/>
      </xsd:schema>
   </xsl:template>
  
   <!-- Create element declarations based on lines of input. -->
   <xsl:function name="jws:element-declarations" as="element()*">
      <xsl:param name="rawLines" as="xsd:string*"/>
     
      <!-- Only continue if we have lines of input remaining. -->
      <xsl:if test="exists($rawLines)">
         <!-- Take the indentation from the first line. We'll create declarations for all elements with this level of indentation. -->
         <!-- We'll recursively create declarations for elements at higher indentation. -->
         <xsl:variable name="curIndent" select="replace($rawLines[1],'^(\s*).+$','$1')"/>
         <!-- Remove the base indentation from all lines. The elements we're going to make declarations for now have no indentation. -->
         <xsl:variable name="lines" select="for $l in $rawLines return substring-after($l, $curIndent)"/>
         <!-- Determine indices for all elements without indentation. We'll use this info to efficiently access the right lines of input. -->
         <xsl:variable name="indicesAtRoot" select="index-of((for $l in $lines return matches($l, '^\i+.*')), true())"/>
         <!-- Contains the root indices, but also the end of input. We'll use this to create subsequences for our recursive calls. -->
         <xsl:variable name="indicesAndBound" select="$indicesAtRoot, count($lines)+1"/>
        
         <!-- Create declarations for all root elements. (And recursively all child elements as well.) -->
         <xsl:for-each select="$indicesAtRoot">
            <!-- Current line of input. -->
            <xsl:variable name="curLine" select="$lines[current()]"/>
            <!-- Name of current element. -->
            <xsl:variable name="name" select="replace($curLine,'^(\i\c*).*$','$1')"/>
            <!-- Type of current element. May be empty, in which case we'll use xsd:string as default later on. -->
            <xsl:variable name="type" select="replace($curLine,'^\i\c*\s*([^?*+;\s]*)?.*$','$1')"/>
            <!-- Occurrence of current element. ?: optional, *: 0 or more, +: 1 or more. Empty is XSD default (1). -->
            <xsl:variable name="occurrence" select="replace($curLine,'^[^?*+;]*(\?|\*|\+).*$','$1')"/>
            <!-- Documentation. Will go into a documentation annotation. -->
            <xsl:variable name="doc" select="replace($curLine,'^[^;]+(;\s*(.*))?$','$2')"/>
            <!-- Current position in the $indicesAtRoot sequence. -->
            <xsl:variable name="pos" select="position()"/>
            <!-- Select the subsequence of all lines that contain children of the current element. -->
            <xsl:variable name="children" select="subsequence($lines, $indicesAndBound[$pos]+1, $indicesAndBound[$pos+1] - $indicesAndBound[$pos] - 1)"/>
           
            <!-- Create the element declaration. -->
            <xsd:element name="{$name}">
               <!-- No type declaration if there are children. Is an inline complex type declaration. -->
               <xsl:if test="empty($children)">
                  <xsl:choose>
                     <!-- On empty type, we default to string. -->
                     <xsl:when test="$type = ''">
                        <xsl:attribute name="type" select="'xsd:string'"/>
                     </xsl:when>
                     <xsl:otherwise>
                        <xsl:attribute name="type" select="$type"/>
                     </xsl:otherwise>
                  </xsl:choose>
               </xsl:if>
              
               <!-- Set minOccurs and maxOccurs. -->
               <xsl:choose>
                  <xsl:when test="$occurrence='?'">
                     <xsl:attribute name="minOccurs" select="'0'"/>
                  </xsl:when>
                  <xsl:when test="$occurrence='*'">
                     <xsl:attribute name="minOccurs" select="'0'"/>
                     <xsl:attribute name="maxOccurs" select="'unbounded'"/>
                  </xsl:when>
                  <xsl:when test="$occurrence='+'">
                     <xsl:attribute name="maxOccurs" select="'unbounded'"/>
                  </xsl:when>
               </xsl:choose>
              
               <!-- Set documentation annotation. -->
               <xsl:if test="$doc != ''">
                  <xsd:annotation>
                     <xsd:documentation>
                        <xsl:sequence select="$doc"/>
                     </xsd:documentation>
                  </xsd:annotation>
               </xsl:if>
              
               <!-- Recursively do child declarations. -->
               <xsl:if test="exists($children)">
                  <xsd:complexType>
                     <xsd:sequence>
                        <xsl:sequence select="jws:element-declarations($children)"/>
                     </xsd:sequence>
                  </xsd:complexType>
               </xsl:if>
            </xsd:element>
         </xsl:for-each>
      </xsl:if>
   </xsl:function>
  
</xsl:stylesheet>