While quite a few linguistic corpora with syntactic annotations are available today, resources are scarce on the level of discourse annotation. A flexible, extensible annotation format speeds up development. We therefore propose an XML format for annotating rhetorical structure trees. In both human and automatic analysis, rhetorical structure is often difficult to determine and is assigned incrementally. Thus, the format allows for underspecification. The paper discusses the various design decisions involved, illustrates the format with an example, and sketches some applications.
An XML grammar (DTD) for URML is available for download: urml-base.dtd. (You may need to right-click or Ctrl-click on the link to save the file to disk.)
Other DTDs may be derived from this base DTD to fit your specific annotation needs. Derived DTDs should nevertheless implement (i.e., allow for) the whole original URML language. We demonstrate this with another DTD, urml-pos.dtd, and an example XML file, urml-pos-sample.xml.
Reitter, D. & Stede, M., Step by step: underspecified markup in incremental rhetorical analysis, LINC-03, 2003.
These tools are available for free for non-commercial research purposes. They may be freely improved -- please send patches back to us.
This tool converts the ISI corpus of rhetorical texts (Carlson et al. 2001) from their LISP-based format to URML.
This tool tokenizes a URML corpus and tags it with part-of-speech information, using a <sign> tag for each token. It interfaces with the TnT tagger (Brants 2000), which is available for free and delivers state-of-the-art performance. We used language models acquired from the German NEGRA corpus and the English SUSANNE corpus.
Available upon request. You'll need the parser and a language model.
Simply extracts a document with a given ID from a URML corpus.
Simple search/replace script to replace rhetorical relations with their subsuming categories. Must be adapted to fit your needs.
Please ask us if you need it.
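For readers who want to adapt this step themselves, the following is a minimal Python sketch of the idea, not the script mentioned above. It assumes that relations are stored in elements named hypRelation and parRelation and that their label sits in a type attribute; both names are assumptions to be checked against urml-base.dtd. The mapping itself is purely illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from fine-grained relations to subsuming categories;
# replace it with the categories your own annotation scheme defines.
SUBSUMING = {
    "elaboration-additional": "elaboration",
    "elaboration-general-specific": "elaboration",
    "evidence": "explanation",
}


def generalize_relations(root, mapping,
                         tags=("hypRelation", "parRelation"),
                         attribute="type"):
    """Rewrite relation labels in place; return the number of labels changed.

    The element names in `tags` and the attribute name are assumptions
    about the URML schema, so they are exposed as parameters.
    """
    changed = 0
    for element in root.iter():
        if element.tag in tags:
            label = element.get(attribute)
            if label in mapping:
                element.set(attribute, mapping[label])
                changed += 1
    return changed
```

Labels that do not appear in the mapping are left untouched. If the corpus also declares relation names elsewhere (for example in a header), those declarations would need the same treatment.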
Creates RST diagrams for LaTeX, to be used with the rst package.
Usage: urml2latex.perl [-i] urml-file.xml [[document-id] analysis-id]
If no document-id is given, the program prints a list of document-ids contained in the file. Parameter -i instructs urml2latex to include the minimal discourse units directly in the tree.
Splits a URML corpus into a training set and a test set according to a given ratio. As parameters, give the ratio, the source file, the first target, and the second target. Example:
./separate-urml.perl 0.8 pdm-corpus.xml pdm-training.xml pdm-test.xml
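The same kind of split can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a port of separate-urml.perl: it assumes that every text is a document element directly under the corpus root, and it shuffles the documents with a fixed seed before cutting, which the original tool may or may not do.

```python
import copy
import random
import xml.etree.ElementTree as ET


def split_corpus(root, ratio, seed=0):
    """Return (first, second) corpus roots.

    `first` receives round(ratio * n) of the n documents, `second` the rest.
    Assumption: each text is a <document> child of the corpus root.
    """
    documents = root.findall("document")
    shuffled = documents[:]
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    cut = round(len(shuffled) * ratio)
    first_ids = {id(doc) for doc in shuffled[:cut]}

    first = ET.Element(root.tag, dict(root.attrib))
    second = ET.Element(root.tag, dict(root.attrib))
    for child in root:
        if child.tag != "document":
            # Shared material (e.g. a header) is copied into both parts.
            first.append(copy.deepcopy(child))
            second.append(copy.deepcopy(child))
        elif id(child) in first_ids:
            first.append(child)
        else:
            second.append(child)
    return first, second
```

To mirror the command line above, one could load the corpus with ET.parse("pdm-corpus.xml").getroot(), call split_corpus(root, 0.8), and save each part with ET.ElementTree(part).write(filename, encoding="utf-8", xml_declaration=True). Note that ElementTree does not preserve a DOCTYPE declaration, so it would have to be added back by hand if the output is to be validated against a DTD.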
URML data can be visualized in LaTeX.
We collected a corpus of newspaper texts and performed manual RST annotation. Two annotators worked through 173 texts. Data was converted from the annotation application format to URML.
Status: The corpus is complete and has undergone a non-blind cross-validation. It should be considered "beta" until a blind cross-validation has been performed and inter-annotator agreement measures have been calculated. Volunteers are welcome!
Availability: Please contact Manfred Stede.