LFG Proceedings
CSLI Publications

A Suite of Linguistic Tools for Use with the Penn-II Treebank

Aoife Cahill, Mairead McCarthy, Ruth O'Donovan, Josef van Genabith and Andy Way

Abstract

Treebanks (parsed, annotated text corpora) are becoming increasingly important resources in many areas of descriptive, theoretical and computational linguistic research. In LFG, too, there is a substantial body of work on the semi-automatic extraction of large-scale resources, including grammars (e.g. Cahill et al., 2002a; Zinsmeister et al., 2002; Frank et al., 2003), subcategorisation frames (van Genabith et al., 1999; Cahill et al., 2003), and <c,f> pairs of LFG representations (e.g. Cahill et al., 2002b).

The current paper describes a suite of tools for inspecting the Penn-II Treebank. Cahill et al. (2002a) describe an algorithm that automatically adds f-structure annotations to the treebank's 1 million words in 50,000 sentences. This annotation method scales up by an order of magnitude over the method of van Genabith et al. (1999). Given the size of the dataset, a number of tools have been built to facilitate the inspection and annotation of the treebank trees. The tools include:

Figure 1 illustrates the display of a <c,f> pair for the simple sentence "A man saw a woman". While some of the tools have been described in Cahill and van Genabith (2002), we shall demonstrate a number of new facilities, including the extraction of subcategorisation frames and quasi-logical forms, an automatic annotation algorithm and, should the user require it, full LFG parsing of unseen input into both c- and f-structures. Parsing is made possible by a PCFG chart parser (based on the CYK algorithm) which operates on CFG grammars extracted by the annotation algorithm presented in Cahill et al. (2002a). (Available at http://www.computing.dcu.ie/~acahill/get_lfg.html)
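To make the parsing component more concrete, the following is a minimal sketch of a probabilistic CYK chart parser. It is not the authors' parser: the toy grammar, its probabilities and the names `LEXICAL`, `BINARY`, `cyk_parse` and `build_tree` are assumptions made purely for illustration, and the grammar is taken to be in Chomsky normal form, whereas a treebank-extracted grammar would first need to be binarised.

```python
from collections import defaultdict

# Hypothetical toy PCFG in Chomsky normal form.
# The rules and probabilities are invented for illustration only.
LEXICAL = {  # word -> [(category, probability)]
    "a": [("Det", 1.0)],
    "man": [("N", 0.5)],
    "woman": [("N", 0.5)],
    "saw": [("V", 1.0)],
}
BINARY = [  # (parent, left child, right child, probability)
    ("S", "NP", "VP", 1.0),
    ("NP", "Det", "N", 1.0),
    ("VP", "V", "NP", 1.0),
]


def build_tree(chart, i, j, cat):
    """Follow backpointers to rebuild the best tree for `cat` over words[i:j]."""
    _, back = chart[(i, j)][cat]
    if isinstance(back, str):  # lexical entry: the backpointer is the word itself
        return (cat, back)
    k, left, right = back
    return (cat, build_tree(chart, i, k, left), build_tree(chart, k, j, right))


def cyk_parse(words, start="S"):
    """Return (probability, tree) of the most probable parse, or None."""
    n = len(words)
    # chart[(i, j)] maps a category to (best probability, backpointer)
    chart = defaultdict(dict)
    for i, word in enumerate(words):
        for cat, p in LEXICAL.get(word, []):
            chart[(i, i + 1)][cat] = (p, word)
    for span in range(2, n + 1):          # width of the span
        for i in range(n - span + 1):     # left edge
            j = i + span                  # right edge
            for k in range(i + 1, j):     # split point
                for parent, left, right, p in BINARY:
                    if left in chart[(i, k)] and right in chart[(k, j)]:
                        prob = p * chart[(i, k)][left][0] * chart[(k, j)][right][0]
                        if prob > chart[(i, j)].get(parent, (0.0, None))[0]:
                            chart[(i, j)][parent] = (prob, (k, left, right))
    if start not in chart[(0, n)]:
        return None
    return chart[(0, n)][start][0], build_tree(chart, 0, n, start)
```

Calling `cyk_parse("a man saw a woman".split())` on this toy grammar yields the c-structure as nested tuples together with its probability; in an LFG setting the f-structure would then be obtained by resolving the functional annotations carried by the rules, a step this sketch leaves out.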


Figure 1: A <c,f> pair for a simple sentence
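As a rough illustration of how subcategorisation frames might be read off an f-structure, the sketch below encodes a hand-written f-structure for "A man saw a woman" as a nested dictionary and collects, for each PRED, the governable grammatical functions found at the same level. The feature names and values, the list `GOVERNABLE` and the function `subcat_frames` are assumptions for illustration; they do not reproduce the tool's actual output or the authors' extraction algorithm.

```python
# Governable (subcategorisable) grammatical functions, in a fixed order.
GOVERNABLE = ("SUBJ", "OBJ", "OBJ2", "OBL", "COMP", "XCOMP")

# Hand-written f-structure for "A man saw a woman".
# Feature names and values are illustrative assumptions.
F_STRUCTURE = {
    "PRED": "see",
    "TENSE": "past",
    "SUBJ": {"PRED": "man", "NUM": "sg", "SPEC": {"DET": "a"}},
    "OBJ": {"PRED": "woman", "NUM": "sg", "SPEC": {"DET": "a"}},
}


def subcat_frames(fstr):
    """Collect (PRED, governable functions) at every level of an f-structure."""
    frames = []
    if "PRED" in fstr:
        args = [gf for gf in GOVERNABLE if gf in fstr]
        frames.append((fstr["PRED"], args))
    for value in fstr.values():
        if isinstance(value, dict):  # recurse into embedded f-structures
            frames.extend(subcat_frames(value))
    return frames


for pred, args in subcat_frames(F_STRUCTURE):
    print(f"{pred}<{','.join(args)}>")
```

Run over a whole annotated treebank, frames collected in this way could be counted per lemma to estimate how often each frame occurs.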

