General Description

This page serves as a short description of the XStandoff format. See Stührenberg and Goecke 2008 for an article describing XStandoff's ancestor, the Sekimo Generic Format, and Stührenberg and Jettka 2009 for an article describing XStandoff together with its toolkit.
Note that some features are only present in XStandoff 2.0 (which in turn introduces some minor compatibility breaks); these will be highlighted in the following text.

Introductory words

For demonstration purposes we use a simple example, the single sentence shown in the listing below:

The sun shines brighter.

We then apply two inline annotation levels to this simple example: one for the morpheme structure and a second for the syllable structure.

The inline annotation of the morpheme structure:

<?xml version="1.0" encoding="UTF-8"?>
<morphemes xmlns="http://www.xstandoff.net/morphemes"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.xstandoff.net/morphemes ../xsd/morphemes.xsd">
  <m>The</m>
  <m>sun</m>
  <m>shine</m>
  <m>s</m>
  <m>bright</m>
  <m>er</m>.
</morphemes>

This annotation can be easily rendered as the following graphic:

Graphic representation of the morpheme structure

The inline annotation of the syllable structure:

<?xml version="1.0" encoding="UTF-8"?>
<syllables xmlns="http://www.xstandoff.net/syllables"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://www.xstandoff.net/syllables ../xsd/syllables.xsd">
  <s>The</s>
  <s>sun</s>
  <s>shines</s>
  <s>brigh</s>
  <s>ter</s>.
</syllables>

Again, this annotation can be rendered as a graphic:

Graphic representation of the syllable structure

When we try to combine these two annotation levels, we get an overlapping structure, which can be seen in the following figure:

Graphic representation of the overlapping structures

These overlaps cannot be annotated in a single XML instance, but it is possible to use standoff annotation methods.
XStandoff (like other standoff formats) uses the character positions of the primary data to depict the positions where annotation elements occur:

  T  h  e     s  u  n     s  h  i  n  e  s     b  r  i  g  h  t  e  r  .
00|01|02|03|04|05|06|07|08|09|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24

The character 'T' ranges from position 0 to 1, the character 'h' from 1 to 2, and so on. We will use this information when constructing an XSF instance. In addition, character positions make it easy to identify where the two annotations overlap: in this case between positions 20 and 21, where the letter 't' is stored.
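The position scheme can be tried out with a few lines of Python. This is only an illustrative sketch and not part of the XStandoff Toolkit; the helper names `spans_for` and `crossing` are made up for this example, and spans are treated as half-open ranges (start included, end excluded), as in the table above.

```python
# Illustrative sketch only; spans_for and crossing are hypothetical helpers.
text = "The sun shines brighter."

def spans_for(text, units):
    """Locate each unit in reading order and return (start, end, unit) triples."""
    spans, cursor = [], 0
    for unit in units:
        start = text.index(unit, cursor)  # search from the end of the previous unit
        end = start + len(unit)
        spans.append((start, end, unit))
        cursor = end
    return spans

def crossing(a, b):
    """True if two spans overlap without one containing the other."""
    (s1, e1), (s2, e2) = a[:2], b[:2]
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1

morphemes = spans_for(text, ["The", "sun", "shine", "s", "bright", "er"])
syllables = spans_for(text, ["The", "sun", "shines", "brigh", "ter"])

# Pairs of morpheme and syllable spans that genuinely overlap.
overlaps = [(m, s) for m in morphemes for s in syllables if crossing(m, s)]

for m, s in overlaps:
    shared = text[max(m[0], s[0]):min(m[1], s[1])]
    print(m, s, repr(shared))
```

The only crossing pair is the morpheme "bright" (15 to 21) and the syllable "ter" (20 to 23); the shared stretch is the single character at positions 20 to 21.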

Converting inline annotation

We will now construct two XStandoff instances corresponding to the two example annotations. To convert inline annotations to the respective XStandoff instances, we use the XStandoff Toolkit (see the download section), more specifically the inline2XSF XSLT stylesheet. The result can be seen in the next two listings.

The XStandoff representation of the morpheme structure:

<?xml version="1.0" encoding="UTF-8"?>
<xsf:corpusData xmlns="http://www.xstandoff.net/2009/xstandoff/1.1"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:xsf="http://www.xstandoff.net/2009/xstandoff/1.1"
  xsi:schemaLocation="http://www.xstandoff.net/2009/xstandoff/1.1 
  http://www.xstandoff.net/2009/xstandoff/1.1/xsf.xsd" xsfVersion="1.1"
  xml:id="sentence_morphemes">
  <xsf:primaryData start="0" end="24">
    <xsf:primaryDataRef uri="../pd/sentence.txt" mimeType="text/plain"/>
  </xsf:primaryData>
  <xsf:segmentation>
    <xsf:segment xml:id="seg1" type="char" start="0" end="24"/>
    <xsf:segment xml:id="seg2" type="char" start="0" end="3"/>
    <xsf:segment xml:id="seg3" type="char" start="4" end="7"/>
    <xsf:segment xml:id="seg4" type="char" start="8" end="13"/>
    <xsf:segment xml:id="seg5" type="char" start="13" end="14"/>
    <xsf:segment xml:id="seg6" type="char" start="15" end="21"/>
    <xsf:segment xml:id="seg7" type="char" start="21" end="23"/>
  </xsf:segmentation>
  <xsf:annotation>
    <xsf:level xml:id="sentence_morphemes_l">
      <xsf:layer xmlns:m="http://www.xstandoff.net/morphemes"
        xsi:schemaLocation="http://www.xstandoff.net/morphemes ../xsd/morphemes.xsd"
        priority="0">
        <m:morphemes xsf:segment="seg1">
          <m:m xsf:segment="seg2"/>
          <m:m xsf:segment="seg3"/>
          <m:m xsf:segment="seg4"/>
          <m:m xsf:segment="seg5"/>
          <m:m xsf:segment="seg6"/>
          <m:m xsf:segment="seg7"/>
        </m:morphemes>
      </xsf:layer>
    </xsf:level>
  </xsf:annotation>
</xsf:corpusData>

The XStandoff representation of the syllable structure:

<?xml version="1.0" encoding="UTF-8"?>
<xsf:corpusData xmlns="http://www.xstandoff.net/2009/xstandoff/1.1"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:xsf="http://www.xstandoff.net/2009/xstandoff/1.1"
  xsi:schemaLocation="http://www.xstandoff.net/2009/xstandoff/1.1 
  http://www.xstandoff.net/2009/xstandoff/1.1/xsf.xsd" xsfVersion="1.1"
  xml:id="sentence_syllables">
  <xsf:primaryData start="0" end="24">
    <xsf:primaryDataRef uri="../pd/sentence.txt" mimeType="text/plain"/>
  </xsf:primaryData>
  <xsf:segmentation>
    <xsf:segment xml:id="seg1" type="char" start="0" end="24"/>
    <xsf:segment xml:id="seg2" type="char" start="0" end="3"/>
    <xsf:segment xml:id="seg3" type="char" start="4" end="7"/>
    <xsf:segment xml:id="seg4" type="char" start="8" end="14"/>
    <xsf:segment xml:id="seg5" type="char" start="15" end="20"/>
    <xsf:segment xml:id="seg6" type="char" start="20" end="23"/>
  </xsf:segmentation>
  <xsf:annotation>
    <xsf:level xml:id="sentence_syllables_s">
      <xsf:layer xmlns:s="http://www.xstandoff.net/syllables"
        xsi:schemaLocation="http://www.xstandoff.net/syllables ../xsd/syllables.xsd"
        priority="0">
        <s:syllables xsf:segment="seg1">
          <s:s xsf:segment="seg2"/>
          <s:s xsf:segment="seg3"/>
          <s:s xsf:segment="seg4"/>
          <s:s xsf:segment="seg5"/>
          <s:s xsf:segment="seg6"/>
        </s:syllables>
      </xsf:layer>
    </xsf:level>
  </xsf:annotation>
</xsf:corpusData>
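To make the origin of the start and end values in the two listings above more tangible, the following Python sketch derives them from an inline annotation. It is not the inline2XSF stylesheet and makes no claim about how that stylesheet works internally; the function name `to_segments` is hypothetical. It also assumes that the inline document is not pretty-printed, so that its text content is exactly the primary data.

```python
import xml.etree.ElementTree as ET

# Assumption: no indentation between elements, so the concatenated text
# content equals the primary data "The sun shines brighter."
INLINE = (
    '<morphemes xmlns="http://www.xstandoff.net/morphemes">'
    "<m>The</m> <m>sun</m> <m>shine</m><m>s</m> <m>bright</m><m>er</m>."
    "</morphemes>"
)

def to_segments(root):
    """Return (primary_text, [(start, end, local_name), ...]) in document order."""
    segments, pieces = [], []
    pos = 0

    def walk(elem):
        nonlocal pos
        start = pos
        slot = len(segments)
        segments.append(None)  # reserve the slot so parents precede children
        if elem.text:
            pieces.append(elem.text)
            pos += len(elem.text)
        for child in elem:
            walk(child)
            if child.tail:  # text following the child belongs to the parent
                pieces.append(child.tail)
                pos += len(child.tail)
        segments[slot] = (start, pos, elem.tag.split("}")[-1])

    walk(root)
    return "".join(pieces), segments

primary, segments = to_segments(ET.fromstring(INLINE))
for number, (start, end, name) in enumerate(segments, start=1):
    print(f"seg{number}", name, start, end)
```

Run on the morpheme annotation, the sketch yields the same seven spans as seg1 to seg7 in the first listing of this section.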

Merging XStandoff instances

In the next step, we combine these two XStandoff instances by means of the mergeXSF XSLT stylesheet. This is possible since both instances use the same primary data.
The resulting XStandoff instance:

<?xml version="1.0" encoding="UTF-8"?>
<xsf:corpusData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.xstandoff.net/2009/xstandoff/1.1
  http://www.xstandoff.net/2009/xstandoff/1.1/xsf.xsd" 
  xmlns="http://www.xstandoff.net/2009/xstandoff/1.1"
  xmlns:xsf="http://www.xstandoff.net/2009/xstandoff/1.1" xml:id="c1" xsfVersion="1.1">
  <xsf:primaryData start="0" end="24" xml:lang="en">
    <textualContent>The sun shines brighter.</textualContent>
  </xsf:primaryData>
  <xsf:segmentation>
    <xsf:segment xml:id="seg1" type="char" start="0" end="24"/>
    <xsf:segment xml:id="seg2" type="char" start="0" end="3"/>
    <xsf:segment xml:id="seg3" type="char" start="4" end="7"/>
    <xsf:segment xml:id="seg4" type="char" start="8" end="14"/>
    <xsf:segment xml:id="seg5" type="char" start="8" end="13"/>
    <xsf:segment xml:id="seg6" type="char" start="13" end="14"/>
    <xsf:segment xml:id="seg7" type="char" start="15" end="21"/>
    <xsf:segment xml:id="seg8" type="char" start="15" end="20"/>
    <xsf:segment xml:id="seg9" type="char" start="20" end="23"/>
    <xsf:segment xml:id="seg10" type="char" start="21" end="23"/>
  </xsf:segmentation>
  <xsf:annotation>
    <xsf:level xml:id="l_morph">
      <xsf:layer xmlns:m="http://www.xstandoff.net/morphemes"
        xsi:schemaLocation="http://www.xstandoff.net/morphemes ../xsd/morphemes.xsd" priority="0">
        <m:morphemes xsf:segment="seg1">
          <m:m xsf:segment="seg2"/>
          <m:m xsf:segment="seg3"/>
          <m:m xsf:segment="seg5"/>
          <m:m xsf:segment="seg6"/>
          <m:m xsf:segment="seg7"/>
          <m:m xsf:segment="seg10"/>
        </m:morphemes>
      </xsf:layer>
    </xsf:level>
    <xsf:level xml:id="l_syll">
      <xsf:layer xmlns:s="http://www.xstandoff.net/syllables"
        xsi:schemaLocation="http://www.xstandoff.net/syllables ../xsd/syllables.xsd" priority="1">
        <s:syllables xsf:segment="seg1">
          <s:s xsf:segment="seg2"/>
          <s:s xsf:segment="seg3"/>
          <s:s xsf:segment="seg4"/>
          <s:s xsf:segment="seg8"/>
          <s:s xsf:segment="seg9"/>
        </s:syllables>
      </xsf:layer>
    </xsf:level>
  </xsf:annotation>
</xsf:corpusData>

As one can observe, we end up with 10 instead of 13 (7+6) segment elements, since segments with identical start and end positions are shared between both annotation levels. XStandoff makes heavy use of XML's inherent ID/IDREF(S) mechanism to connect segments of the primary data with single or multiple annotation layer(s).
Each XStandoff instance can store several annotation levels (each of which can in turn store several annotation layers, cf. the examples section).
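The sharing of segments can be illustrated with a small Python sketch. It is not the mergeXSF stylesheet; `merge_segments` is a hypothetical helper, and the sort order (by start position, longer span first) is merely an assumption that happens to reproduce the numbering in the merged listing above.

```python
text = "The sun shines brighter."

# (start, end) spans taken from the two single-level instances.
morph = [(0, 24), (0, 3), (4, 7), (8, 13), (13, 14), (15, 21), (21, 23)]
syll = [(0, 24), (0, 3), (4, 7), (8, 14), (15, 20), (20, 23)]

def merge_segments(*layers):
    """Deduplicate spans across layers and assign fresh ids.

    Assumed order: ascending start, then descending end.
    """
    unique = sorted(set().union(*layers), key=lambda span: (span[0], -span[1]))
    return {span: f"seg{number}" for number, span in enumerate(unique, start=1)}

ids = merge_segments(morph, syll)

# Each annotation element now points to a shared segment by id.
morph_refs = [ids[span] for span in morph]
syll_refs = [ids[span] for span in syll]

# Resolving an id back to the primary data works like following an IDREF.
by_id = {seg_id: span for span, seg_id in ids.items()}
syllable_texts = [text[slice(*by_id[ref])] for ref in syll_refs[1:]]

print(len(ids), morph_refs, syll_refs, syllable_texts)
```

Thirteen spans go in and ten distinct segments come out, because the whole-sentence span and the spans for "The" and "sun" occur in both levels.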

Adding metadata

Metadata can be applied at various locations throughout the document:

<?xml version="1.0" encoding="UTF-8"?>
<xsf:corpusData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.xstandoff.net/2009/xstandoff/1.1
  http://www.xstandoff.net/2009/xstandoff/1.1/xsf.xsd" 
  xmlns="http://www.xstandoff.net/2009/xstandoff/1.1"
  xmlns:xsf="http://www.xstandoff.net/2009/xstandoff/1.1" xml:id="c1" xsfVersion="1.1">
  <xsf:meta xmlns:olac="http://www.language-archives.org/OLAC/1.0/" xmlns="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xsi:schemaLocation="http://www.language-archives.org/OLAC/1.0/ ../xsd/meta/olac.xsd">
    <olac:olac>
      <creator>Maik Stührenberg</creator>
      <date>2009-02-19</date>
      <description>Example sentence "The sun shines brighter" annotated with morphemes and syllables.</description>
    </olac:olac>
  </xsf:meta>
  <xsf:primaryData start="0" end="24" xml:lang="en">
    <textualContent>The sun shines brighter.</textualContent>
  </xsf:primaryData>
  <xsf:segmentation>
    <xsf:segment xml:id="seg1" type="char" start="0" end="24"/>
    <xsf:segment xml:id="seg2" type="char" start="0" end="3"/>
    <xsf:segment xml:id="seg3" type="char" start="4" end="7"/>
    <xsf:segment xml:id="seg4" type="char" start="8" end="14"/>
    <xsf:segment xml:id="seg5" type="char" start="8" end="13"/>
    <xsf:segment xml:id="seg6" type="char" start="13" end="14"/>
    <xsf:segment xml:id="seg7" type="char" start="15" end="21"/>
    <xsf:segment xml:id="seg8" type="char" start="15" end="20"/>
    <xsf:segment xml:id="seg9" type="char" start="20" end="23"/>
    <xsf:segment xml:id="seg10" type="char" start="21" end="23"/>
  </xsf:segmentation>
  <xsf:annotation>
    <xsf:level xml:id="l_morph">
      <xsf:meta xmlns:olac="http://www.language-archives.org/OLAC/1.0/" xmlns="http://purl.org/dc/elements/1.1/"
        xmlns:dcterms="http://purl.org/dc/terms/"
        xsi:schemaLocation="http://www.language-archives.org/OLAC/1.0/ 
        ../xsd/meta/olac.xsd">
        <olac:olac>
          <creator>Maik Stührenberg</creator>
          <date>2009-02-19</date>
          <description>Morpheme annotation. Manually annotated.</description>
        </olac:olac>
      </xsf:meta>
      <xsf:layer xmlns:m="http://www.xstandoff.net/morphemes"
        xsi:schemaLocation="http://www.xstandoff.net/morphemes ../xsd/morphemes.xsd" priority="0">
        <m:morphemes xsf:segment="seg1">
          <m:m xsf:segment="seg2"/>
          <m:m xsf:segment="seg3"/>
          <m:m xsf:segment="seg5"/>
          <m:m xsf:segment="seg6"/>
          <m:m xsf:segment="seg7"/>
          <m:m xsf:segment="seg10"/>
        </m:morphemes>
      </xsf:layer>
    </xsf:level>
    <xsf:level xml:id="l_syll">
      <xsf:meta xmlns:olac="http://www.language-archives.org/OLAC/1.0/" xmlns="http://purl.org/dc/elements/1.1/"
        xmlns:dcterms="http://purl.org/dc/terms/"
        xsi:schemaLocation="http://www.language-archives.org/OLAC/1.0/ 
        ../xsd/meta/olac.xsd">
        <olac:olac>
          <creator>Maik Stührenberg</creator>
          <date>2009-02-19</date>
          <description>Syllables annotation. Manually annotated.</description>
        </olac:olac>
      </xsf:meta>
      <xsf:layer xmlns:s="http://www.xstandoff.net/syllables"
        xsi:schemaLocation="http://www.xstandoff.net/syllables ../xsd/syllables.xsd" priority="1">
        <s:syllables xsf:segment="seg1">
          <s:s xsf:segment="seg2"/>
          <s:s xsf:segment="seg3"/>
          <s:s xsf:segment="seg4"/>
          <s:s xsf:segment="seg8"/>
          <s:s xsf:segment="seg9"/>
        </s:syllables>
      </xsf:layer>
    </xsf:level>
  </xsf:annotation>
</xsf:corpusData>

XStandoff's meta element is a wrapper element for metadata annotation defined in an external schema file (or without any schema definition at all), similar to the content of the layer element.

Storing whole corpora

It is possible to use XStandoff in two different ways: for storing a whole corpus in a single instance, or for storing single corpus entries together with a corpus definition file. The following listing shows the first option:

<corpus xmlns="http://www.xstandoff.net/2009/xstandoff/1.1"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xmlns:xsf="http://www.xstandoff.net/2009/xstandoff/1.1">
  <corpusData xml:id="c1" xsfVersion="1.1">
    <!-- [...] -->
  </corpusData>
  <corpusData xml:id="c2" xsfVersion="1.1">
    <!-- [...] -->
  </corpusData>
</corpus>

In this case the root element of an XStandoff instance is not the corpusData element but the corpus element.

The most common way is to use a set of files: one for the corpus definition and at least one more for storing single corpus entries (instances with corpusData root elements). The following listing shows the corpus definition file:

<corpus xmlns="http://www.xstandoff.net/2009/xstandoff/1.1"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xmlns:xsf="http://www.xstandoff.net/2009/xstandoff/1.1">
  <corpusDataRef xml:id="c1" uri="c1.xml" mime-type="text/xml" 
   encoding="UTF-8" xsfVersion="1.1"/>
  <corpusDataRef xml:id="c2" uri="c2.xml" mime-type="text/xml" 
   encoding="UTF-8" xsfVersion="1.1"/>
</corpus>

Note that corpusDataRef elements are used instead of corpusData elements.
If you are using XStandoff 2.0 (in which the corpus element has been removed), you may use nested corpusData elements (or corpusDataRef elements, respectively).
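As a rough illustration of how such a corpus definition file might be consumed, the following Python sketch collects the referenced entry files. It is not part of the XStandoff Toolkit; `corpus_entries` is a hypothetical helper, and elements are matched by local name so that the sketch does not depend on the exact namespace URI.

```python
import xml.etree.ElementTree as ET

# A corpus definition like the one above, embedded as a string for the example.
CORPUS = """<corpus xmlns="http://www.xstandoff.net/2009/xstandoff/1.1">
  <corpusDataRef xml:id="c1" uri="c1.xml" mime-type="text/xml"
   encoding="UTF-8" xsfVersion="1.1"/>
  <corpusDataRef xml:id="c2" uri="c2.xml" mime-type="text/xml"
   encoding="UTF-8" xsfVersion="1.1"/>
</corpus>"""

# ElementTree expands the predefined xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def corpus_entries(xml_text):
    """Map each corpusDataRef's xml:id to the URI of its entry file."""
    root = ET.fromstring(xml_text)
    entries = {}
    for elem in root.iter():
        if elem.tag.rsplit("}", 1)[-1] == "corpusDataRef":
            entries[elem.get(XML_ID)] = elem.get("uri")
    return entries

entries = corpus_entries(CORPUS)
print(entries)
```

Each URI could then be opened and parsed as a separate instance with a corpusData root element.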

Annotating complex documents with XStandoff 2: multimodal documents

In digital humanities (or eHumanities), one does not only analyze texts. Often, information encoded in graphics such as diagrams or facsimiles has to be taken into account, too. These multimodal documents can be interesting objects of investigation, since they often not only include alternative encodings of the very same information, but sometimes encode information items in a single way only (providing clues about the importance of this item). In addition, relations between these information items (regardless of their encodings) may exist which may be of interest in analysis, too. Classic examples of such multimodal documents are instruction manuals or match analyses (similar to the ones that can be found at http://www.spielverlagerung.de).

Camera

In this case, one often ends up using different annotation schemas for textual and non-textual primary data, with no easy way to compare both annotation layers. Since version 2.0, XStandoff supports spatial segmentation for graphical primary data files.

XStandoff's segment element has been modified, and some further attributes have been added to identify regions in graphical primary data (still or moving images) by using a coordinate system similar to that of HTML's image map approach.
Several things have to be taken into account: the type attribute's value has to be set to spatial, and the shape and coords attributes have to be provided as well.

If we would like to identify the mode dial, which is present both in the graphical presentation and in the textual instructions, as a markable (and annotate it afterwards), we could use the following coordinates (denoted in x,y syntax) to create a polygon shape: 176,0 219,14 232,37 215,60 172,72 128,63 110,33 126,13. Note that we use a graphic representation containing only the camera and not the running text. These coordinates are stored in the coords attribute of the segment element as a list of coordinate pairs (you may use triples to define either a 3-dimensional polygon object or a circle).
The following listing shows the resulting XStandoff instance including a word annotation layer.

<?xml version="1.0" encoding="UTF-8"?>
<xsf:corpusData xml:id="c1" xmlns:xsf="http://www.xstandoff.net/2009/xstandoff/1.1"
  xmlns:olac="http://www.language-archives.org/OLAC/1.0/"
  xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.xstandoff.net/2009/xstandoff/1.1 ../xsd/xsf2_1.1.xsd">
  <xsf:primaryData xml:id="txt">
    <xsf:primaryDataRef uri="../pd/camera.txt" mimeType="text/plain" encoding="utf-8" start="0" end="238"/>
  </xsf:primaryData>
  <xsf:primaryData xml:id="img">
    <xsf:primaryDataRef uri="../pd/camera.png" mimeType="image/png" width="564" height="538"
      horizontalResolution="300" verticalResolution="300"/>
  </xsf:primaryData>
  <xsf:segmentation>
    <xsf:segment xml:id="img_seg1" primaryData="img" coords="61,137,24" shape="circle" name="shutter"/>
    <xsf:segment xml:id="img_seg2" primaryData="img" coords="285,130 396,194 363,217 354,272 241,202 250,149" shape="poly" name="flash"/>
    <xsf:segment xml:id="img_seg3" primaryData="img" coords="176,0 219,14 232,37 215,60 172,72 128,63 110,33 126,13" shape="poly" name="mode_dial" />
    <xsf:segment xml:id="img_seg4" primaryData="img" coords="371,96 448,138 388,176 310,134" shape="poly" name="hotshoe"/>
    <xsf:segment xml:id="txt_seg1" primaryData="txt" start="0" end="238"/>
    <xsf:segment xml:id="txt_seg2" primaryData="txt" start="0" end="47"/>
    <xsf:segment xml:id="txt_seg3" primaryData="txt" start="8" end="17"/>
    <xsf:segment xml:id="txt_seg4" primaryData="txt" start="49" end="124"/>
    <xsf:segment xml:id="txt_seg5" primaryData="txt" start="59" end="66"/>
    <xsf:segment xml:id="txt_seg6" primaryData="txt" start="110" end="124"/>
    <xsf:segment xml:id="txt_seg7" primaryData="txt" start="125" end="186"/>
    <xsf:segment xml:id="txt_seg8" primaryData="txt" start="179" end="186"/>
    <xsf:segment xml:id="txt_seg9" primaryData="txt" start="187" end="238"/>
    <xsf:segment xml:id="txt_seg10" primaryData="txt" start="197" end="204"/>
  </xsf:segmentation>
  <xsf:annotation>
    <xsf:level xml:id="camera_manual">
      <xsf:layer xmlns="http://www.xstandoff.net/example/manual" priority="0"
        xsi:schemaLocation="http://www.xstandoff.net/example/manual ../xsd/manual.xsd">
        <manual xsf:segment="txt_seg1">
          <step n="1" xsf:segment="txt_seg2">
            <part name="mode_dial" xsf:segment="txt_seg3 img_seg3"/>
          </step>
          <step n="2" xsf:segment="txt_seg4">
            <part name="shutter" xsf:segment="txt_seg5 img_seg1"/>
            <part name="flash" xsf:segment="txt_seg6 img_seg2"/>
          </step>
          <step n="2a" xsf:segment="txt_seg7">
            <part name="hotshoe" xsf:segment="txt_seg8 img_seg4"/>
          </step>
          <step n="3" xsf:segment="txt_seg9">
            <part name="shutter" xsf:segment="txt_seg10 img_seg1"/>
          </step>
        </manual>
      </xsf:layer>
    </xsf:level>
  </xsf:annotation>
</xsf:corpusData>

Since XStandoff supports multiple primary data files we could use both the graphic and the text to annotate specific parts of the camera. For this example we have used a very simple fictional annotation format.
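The coords values used in the listing above can be read with a few lines of Python. This is a sketch under stated assumptions, not a reference implementation: `parse_coords` and `bounding_box` are hypothetical helpers, and a circle triple is interpreted as x, y, radius, following the HTML image map convention that the format is said to resemble.

```python
def parse_coords(shape, coords):
    """Turn a coords attribute value into a small region description."""
    if shape == "circle":
        # Assumption: the triple means centre x, centre y, radius.
        x, y, radius = (int(value) for value in coords.split(","))
        return {"center": (x, y), "radius": radius}
    # Polygon: whitespace-separated "x,y" pairs.
    points = [tuple(int(value) for value in pair.split(","))
              for pair in coords.split()]
    return {"points": points}

def bounding_box(region):
    """Return (min_x, min_y, max_x, max_y) for a parsed region."""
    if "radius" in region:
        (x, y), radius = region["center"], region["radius"]
        return (x - radius, y - radius, x + radius, y + radius)
    xs = [point[0] for point in region["points"]]
    ys = [point[1] for point in region["points"]]
    return (min(xs), min(ys), max(xs), max(ys))

mode_dial = parse_coords(
    "poly", "176,0 219,14 232,37 215,60 172,72 128,63 110,33 126,13")
shutter = parse_coords("circle", "61,137,24")

print(bounding_box(mode_dial))
print(bounding_box(shutter))

# An xsf:segment value with several IDREFs splits on whitespace in the same way.
refs = "txt_seg3 img_seg3".split()
print(refs)
```

A bounding box is a convenient first check that a region lies within the 564 by 538 pixel image declared in the primaryDataRef element.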

Annotating complex documents with XStandoff 2: pre-annotated documents

In general, most NLP tools tend to use raw text as input data instead of already annotated documents (although there are exceptions to this rule). However, even if tools may accept documents of a given markup language, the addition of supplemental annotation layers may fail. XStandoff 2 adds the ability to use pre-annotated primary data by extending its segmentation mechanism.

Application scenarios: using XStandoff

XStandoff can be used in several ways: in Stührenberg and Goecke 2008 we describe the use of an XStandoff-annotated corpus for extracting possible antecedent candidates in the linguistic field of anaphora resolution. Stührenberg and Jettka 2009 describes the analysis of cross-level relations. Other use cases have been tested as well, such as calculating inter-annotator agreement or comparing the quality of different annotation software (e.g., linguistic parsers). Have a look at the examples section to get an impression.