We present the Embra system, a first-time entry to DUC 2005, which performed at or above the median in the manual assessment of responsiveness and on 4 out of 5 linguistic quality questions. The system takes a novel approach to relevance and redundancy, modeling sentence similarity using a latent semantic space constructed over a very large corpus. We present a simple approach to modeling specificity based on named entities, which shows a small improvement over the baseline. Finally, we discuss coherence and present a sentence reordering algorithm, with a component-level evaluation demonstrating a positive effect.
A key task in an extraction system for query-oriented multi-document summarisation, necessary for computing relevance and redundancy, is modelling text semantics. In the Embra system, we use a representation derived from the singular value decomposition of a term co-occurrence matrix. We present methods to show the reliability of performance improvements. We find that Embra performs better with dimensionality reduction. (With Ben Hachey and Gabriel Murray.)
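To make the interplay of relevance and redundancy concrete, here is a minimal sketch, not the Embra implementation. It assumes each sentence and the query have already been mapped to vectors in some latent semantic space (the SVD step itself is omitted), and it uses a greedy, MMR-style trade-off as a stand-in for whatever selection rule the system actually applies. The names `cosine`, `select_sentences` and `redundancy_weight` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_sentences(query_vec, sentence_vecs, k, redundancy_weight=0.5):
    """Greedily pick up to k sentence indices.

    Assumption: vectors live in a shared latent semantic space.
    Each step rewards similarity to the query (relevance) and penalises
    similarity to sentences already selected (redundancy).
    """
    selected = []
    candidates = list(range(len(sentence_vecs)))

    def score(i):
        relevance = cosine(query_vec, sentence_vecs[i])
        redundancy = max(
            (cosine(sentence_vecs[i], sentence_vecs[j]) for j in selected),
            default=0.0,
        )
        return relevance - redundancy_weight * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    # Toy 2-dimensional "latent" vectors; sentences 0 and 1 are near-duplicates.
    sentences = [[1.0, 0.2], [1.0, 0.25], [0.2, 1.0]]
    print(select_sentences([1.0, 1.0], sentences, k=2))
```

In the toy data, the second pick skips the near-duplicate sentence in favour of one that covers a different direction of the query, which is the behaviour a redundancy penalty is meant to produce.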
Figure: Example of a rhetorical relation
Most text displays an internal coherence structure, which can be analyzed as a tree of relations that hold between short segments of text. We present a machine-learning-driven approach to such an analysis in the framework of Rhetorical Structure Theory. Our rhetorical analyzer observes a variety of textual properties, such as cue phrases, part-of-speech information, rhetorical context and lexical chaining. A two-stage parsing algorithm uses local and global optimization to find an analysis. Decisions during parsing are driven by an ensemble of support vector classifiers. This training method allows for a non-linear separation of samples with many relevant features. We define a chain of annotation tools that profits from a new underspecified representation of rhetorical structure. Classifiers are trained on a newly introduced German-language corpus, as well as on a large English one. We present evaluation data for the recognition of rhetorical relations.
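As an illustration of what a tree of relations over text segments can look like, the following is a rough sketch only. It is not the two-stage algorithm described above: it merges adjacent spans greedily, and the hypothetical `cue_phrase_scorer` is a toy stand-in for the ensemble of support vector classifiers. The names `Node`, `greedy_parse` and the three relation labels used here are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    """A span of text; inner nodes carry the relation joining their children."""
    text: str
    relation: Optional[str] = None
    children: Tuple["Node", ...] = ()


# A scorer returns (confidence, relation label) for two adjacent spans.
Scorer = Callable[[Node, Node], Tuple[float, str]]


def greedy_parse(segments: List[str], score: Scorer) -> Node:
    """Repeatedly merge the best-scoring pair of adjacent spans into one node."""
    if not segments:
        raise ValueError("need at least one segment")
    nodes = [Node(text=s) for s in segments]
    while len(nodes) > 1:
        best_i, best_score, best_rel = 0, float("-inf"), ""
        for i in range(len(nodes) - 1):
            s, rel = score(nodes[i], nodes[i + 1])
            if s > best_score:
                best_i, best_score, best_rel = i, s, rel
        left, right = nodes[best_i], nodes[best_i + 1]
        merged = Node(
            text=left.text + " " + right.text,
            relation=best_rel,
            children=(left, right),
        )
        nodes[best_i:best_i + 2] = [merged]
    return nodes[0]


def cue_phrase_scorer(left: Node, right: Node) -> Tuple[float, str]:
    """Toy scorer (assumption): looks only at a cue phrase opening the right span."""
    start = right.text.lower()
    if start.startswith("because"):
        return 1.0, "cause"
    if start.startswith("but"):
        return 0.8, "contrast"
    return 0.1, "elaboration"


if __name__ == "__main__":
    tree = greedy_parse(
        ["The match was cancelled", "because it rained.", "Fans went home."],
        cue_phrase_scorer,
    )
    print(tree.relation, [child.relation for child in tree.children])
```

A real analyzer would replace the scorer with trained classifiers over richer features (part-of-speech information, rhetorical context, lexical chains) and would add a global optimization step rather than committing to each merge greedily.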
Please find the document type definition grammars and several tools to convert (LDC corpus, O'Donnell's RS3) and access URML data here.
More about the Potsdam Commentary Corpus can be found here.
CyMON-NLU can inform, chat and gather user information using an advanced natural language understanding engine. It combines statistical morphosyntactic disambiguation methods (trigram tagging), a stemming algorithm and a robust parser for a large semantic grammar implemented in an XML formalism. The scalable CyMON-NLU engine is implemented in C++ and provides interfaces to the agent-based CRM platform CyMON. Further features include automatic language detection and dialog tracking using a semantic network interface. A development kit enables language engineers to easily create semantic grammars for a specific domain.
I developed CyMON-NLU with the help of programmers and UI designers in 2000/01 at Agentscape AG, Berlin, and its subsidiary Agentscape Romania SRL.