A Suite of Linguistic Tools for Use with the Penn-II Treebank
Abstract
Proceedings of LFG03; CSLI Publications On-line
Treebanks of parsed, annotated text corpora are becoming more and more important resources in many areas of descriptive, theoretical and computational linguistic research. In LFG too, there exists quite a large body of work on semi-automatic extraction of large-scale resources, including grammars (e.g. Cahill et al., 2002a; Zinsemeister et al., 2002; Frank et al., 2003), subcategorisation frames (van Genabith et al., 1999; Cahill et al., 2003), and < c,f > pairs of LFG representations (e.g. Cahill et al., 2002b).
The current paper describes a suite of tools for inspection of the Penn-II Treebank. Cahill et al. (2002a) describes an algorithm for automatically annotating the 1 million words in 50,000 sentences in the treebank with f-structure annotations. This annotation method scales up by an order of magnitude on the method of van Genabith et al. (1999). Given the size of the dataset, a number of tools have been built in order to facilitate the inspection and annotation of the treebank trees. The tools include:
- treebank inspection and viewing options which enable searching for CFG-rule tokens extracted from the treebank;
- graphical display of trees and subtrees according to rule instances;
- display of the yield of the subtree (with and without context);
- tagging and PCFG-parsing of new input;
- an automatic annotation tool;
- an f-structure generator.
Figure 1 illustrates the display of a < c,f > pair for the simple sentence A man saw a woman. While some of the tools have been described in (Cahill and van Genabith, 2002), we shall demonstrate a number of new facilities, including the extraction of subcategorisation frames and quasi-logical forms, an automatic annotation algorithm, and full LFG parsing into both c- and f-structures of unseen input, should the user require. (Available at http://www.computing.dcu.ie/~acahill/get_lfg.html) This is made possible by a PCFG chart parser (based on the CYK algorithm) which operates on CFG grammars extracted by the annotation algorithm presented in Cahill et al. (2002a).
Figure 1: A $< c,f >$ pair for a simple sentence
References
- Cahill, A., M. McCarthy, J. van Genabith and A. Way (2002a): `Automatic Annotation of the Penn-Treebank with LFG F-Structure Information', in Proceedings of the LREC Workshop on Linguistic Knowledge Acquisition and Representation: Bootstrapping Annotated Language Data, Las Palmas, Spain, pp.8-15.
- Cahill, A., M. McCarthy, J. van Genabith and A. Way (2002b): `Parsing Text with a PCFG derived from Penn-II with an Automatic F-Structure Annotation Procedure', in M. Butt and T. Holloway-King (eds.) Proceedings of the Seventh International Conference on LFG, CSLI Publications, Stanford, CA., pp.76-95.
- Cahill, A., M. McCarthy, J. van Genabith and A. Way (2003): `Quasi-logical forms from f-structures for the Penn treebank', in Proceedings of the Fifth International Workshop on Computational Semantics, Tilburg, The Netherlands, pp.55-71.
- Cahill, A. and J. van Genabith (2002): `TTS: A Treebank Tool Suite', in Proceedings of LREC 2002, Third International Conference on Language Resources and Evaluation, Las Palmas, Spain, p.1712-1717.
- Frank, A., L. Sadler, J. van Genabith and A. Way (2003): `From Treebank Resources to LFG f-Structures', in A. Abeille (ed.) Building and using Parsed Corpora, Kluwer, Dordrecht, The Netherlands (in press).
- van Genabith, J., L.Sadler, and A.Way. (1999): `Data-driven Compilation of LFG Semantic Forms', in EACL-99 Workshop on Linguistically Interpreted Corpora, Bergen, Norway, pp.69-76.
- Zinsemeister, H., J. Kuhn and S. Dipper (2002): `Utilizing LFG Parses for Treebank Annotation', in M. Butt and T. Holloway-King (eds.) Proceedings of the Seventh International Conference on LFG, CSLI Publications, Stanford, CA., pp.427-447.