Language and Cognition Research

Other projects

Automatic Summarization of Texts based on Latent Semantic Analysis

We present the Embra system, a first-time entry to DUC 2005, which performed at or above the median in the manual assessment of responsiveness and on 4 out of 5 linguistic quality questions. The system takes a novel approach to relevance and redundancy, modeling sentence similarity using a latent semantic space constructed over a very large corpus. We present a simple approach to modeling specificity based on named entities, which shows a small improvement over the baseline. Finally, we discuss coherence and present a sentence reordering algorithm with a component-level evaluation demonstrating a positive effect.

A key task in an extraction system for query-oriented multi-document summarization, necessary for computing relevance and redundancy, is modeling text semantics. In the Embra system, we use a representation derived from the singular value decomposition of a term co-occurrence matrix. We present methods to show the reliability of performance improvements, and we find that Embra performs better with dimensionality reduction. (With Ben Hachey and Gabriel Murray.)
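To make the interplay of relevance and redundancy concrete, here is a minimal sketch in Python. It assumes that sentences and the query have already been mapped to vectors in some latent space. The greedy selection rule, the `trade_off` parameter, and the toy vectors are all illustrative assumptions; the abstract does not state how Embra combines the two scores.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def select_sentences(query_vec, sentence_vecs, k, trade_off=0.7):
    """Greedily pick up to k sentence indices, trading relevance against redundancy.

    score = trade_off * sim(sentence, query)
            - (1 - trade_off) * max sim(sentence, any already selected sentence)
    This scoring rule is an assumption made for illustration.
    """
    selected = []
    candidates = list(range(len(sentence_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(sentence_vecs[i], query_vec)
            redundancy = max(
                (cosine(sentence_vecs[i], sentence_vecs[j]) for j in selected),
                default=0.0,
            )
            return trade_off * relevance - (1 - trade_off) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy vectors standing in for sentences projected into a latent space.
query = [1.0, 1.0, 0.0]        # the query touches two topics
sentences = [
    [1.0, 0.2, 0.0],           # mostly about the first topic
    [1.0, 0.25, 0.0],          # near-duplicate of the sentence above
    [0.2, 1.0, 0.0],           # mostly about the second topic
    [0.0, 0.0, 1.0],           # off-topic
]
print(select_sentences(query, sentences, k=2))
```

With these toy vectors, the second pick skips the near-duplicate in favor of the sentence covering the other topic, which is the behavior a redundancy penalty is meant to produce.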

 
Ben Hachey, Gabriel Murray, and David Reitter.
The Embra system at DUC 2005: Query-oriented multi-document summarization with a very large latent semantic space.
In Document Understanding Conference 2005, Vancouver, Canada, 2005.
 
Ben Hachey, Gabriel Murray, and David Reitter.
Dimensionality reduction aids term co-occurrence based multi-document summarization.
In Proc. COLING-ACL Workshop Task-Focused Summarization and Question Answering 2006, Sydney, Australia, 2006.

Rhetorical Analysis with Support Vector Machines


[Figure: example of a rhetorical relation]

Most text displays an internal coherence structure, which can be analyzed as a tree structure of relations that hold between short segments of text. We present a machine-learning governed approach to such an analysis in the framework of Rhetorical Structure Theory. Our rhetorical analyzer observes a variety of textual properties, such as cue phrases, part-of-speech information, rhetorical context and lexical chaining. A two-stage parsing algorithm uses local and global optimization to find an analysis. Decisions during parsing are driven by an ensemble of support vector classifiers. This training method allows for a non-linear separation of samples with many relevant features. We define a chain of annotation tools that profits from a new underspecified representation of rhetorical structure. Classifiers are trained on a newly introduced German language corpus, as well as on a large English one. We present evaluation data for the recognition of rhetorical relations.
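The idea of building a tree of relations over short text segments can be sketched as follows. This is a simplified illustration, not the analyzer described above: it performs only a greedy local merge, and `score_pair` is a hypothetical stand-in for the ensemble of support vector classifiers, using a single invented cue-phrase table as its only feature.

```python
# Hypothetical cue phrases and the relation each one signals (illustrative only).
CUE_RELATIONS = {
    "because": "cause",
    "but": "contrast",
    "for example": "elaboration",
}

def leftmost_text(node):
    """Return the lowercased text of the leftmost segment under a node."""
    while isinstance(node, tuple):
        node = node[1]
    return node.lower()

def score_pair(left, right):
    """Stand-in for a trained classifier: return (confidence, relation).

    A real system would consult classifiers over many features; here the
    only signal is a cue phrase at the start of the right-hand span.
    """
    text = leftmost_text(right)
    for cue, relation in CUE_RELATIONS.items():
        if text.startswith(cue):
            return 1.0, relation
    return 0.1, "joint"  # weak default when no cue phrase is present

def build_tree(segments):
    """Greedily merge the best-scoring adjacent pair until one tree remains.

    Trees are nested tuples: (relation, left_subtree, right_subtree).
    Ties are resolved in favor of the leftmost pair.
    """
    nodes = list(segments)
    while len(nodes) > 1:
        scored = [
            (score_pair(nodes[i], nodes[i + 1]), i) for i in range(len(nodes) - 1)
        ]
        (_confidence, relation), i = max(scored, key=lambda item: item[0][0])
        nodes[i:i + 2] = [(relation, nodes[i], nodes[i + 1])]
    return nodes[0]

segments = [
    "The match was cancelled",
    "because it rained all day,",
    "but the fans stayed.",
]
print(build_tree(segments))
```

A global optimization stage, as mentioned in the abstract, would compare whole candidate trees rather than committing to each local merge; that part is omitted here.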

David Reitter: Simple Signals for Complex Rhetorics: On Rhetorical Analysis with Rich-Feature Support Vector Models. In: Uta Seewald-Heeg (Ed.), Sprachtechnologie für die multilinguale Kommunikation. Sankt Augustin: Gardez!.

David Reitter: Rhetorical Analysis with Rich-Feature Support Vector Models. Diplomarbeit (Master's Thesis), University of Potsdam, Germany, 2003. [Best thesis award of the GLDV] PDF available from the publications page.

David Reitter and Manfred Stede. Step by step: underspecified markup in incremental rhetorical analysis. In Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora (LINC-03) (at EACL 2003), Budapest, 2003. [ abstract | .pdf ]

The document type definition grammars and several tools for converting (LDC corpus, O'Donnell's RS3) and accessing URML data are available here.

More about the Potsdam Commentary Corpus can be found here.

 

CyMON-NLU
A customer relationship management natural language dialog system

CyMON-NLU can inform, chat and gather user information using an advanced natural language understanding engine. It combines statistical morphosyntactic disambiguation methods (trigram tagging), a stemming algorithm and a robust parser for a large semantic grammar implemented in an XML formalism. The scalable CyMON-NLU engine is implemented in C++ and provides interfaces to the agent-based CRM platform CyMON. Further features include automatic language detection and dialog tracking using a semantic network interface. A development kit enables language engineers to easily create semantic grammars for the specific domain.
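As a rough illustration of how stemming and an XML-encoded semantic grammar can work together, consider the Python sketch below. Everything in it is hypothetical: the grammar schema (`<grammar>`, `<rule>`, `<any>`), the intent names, and the crude suffix-stripping stemmer are invented for this example and do not reflect the actual CyMON-NLU formalism, which is implemented in C++ and also includes trigram tagging and a robust parser not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical grammar format, invented for this sketch: a rule fires when
# every <any> group contributes at least one word to the utterance.
GRAMMAR_XML = """
<grammar>
  <rule intent="order_status">
    <any>order delivery</any>
    <any>where status track</any>
  </rule>
  <rule intent="cancel_contract">
    <any>cancel</any>
    <any>contract subscription</any>
  </rule>
</grammar>
"""

SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def load_rules(xml_text):
    """Parse the grammar into (intent, [set of stems, ...]) pairs."""
    rules = []
    for rule in ET.fromstring(xml_text).iter("rule"):
        groups = [{stem(w) for w in group.text.split()} for group in rule.iter("any")]
        rules.append((rule.get("intent"), groups))
    return rules

def understand(utterance, rules):
    """Return the first intent for which every group matches a stemmed token."""
    tokens = {stem(t.strip(".,!?").lower()) for t in utterance.split()}
    for intent, groups in rules:
        if all(tokens & group for group in groups):
            return intent
    return None  # no rule matched; a dialog system would fall back gracefully

rules = load_rules(GRAMMAR_XML)
print(understand("Where are my orders?", rules))
print(understand("I canceled my subscriptions.", rules))
print(understand("Hello there", rules))
```

Stemming both the grammar words and the user's tokens lets one rule cover inflected variants such as "orders" or "tracking" without listing each form.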

I developed CyMON-NLU with the help of programmers and UI designers in 2000/01 at Agentscape AG, Berlin, and its subsidiary Agentscape Romania SRL.

D. Reitter, Hybrid Natural Language Processing in a Customer Care Environment, presented as a workshop paper at TaCoS 2001, Heidelberg.

CyMON is a registered trademark of Agentscape.