Lexicalization of Long-distance Dependencies in a Treebank-based, Wide-coverage Statistical LFG Grammar

Aoife Cahill, Mairead McCarthy, Ruth O'Donovan, Josef van Genabith and Andy Way

Abstract

The development of rich, unification-based, wide-coverage computational grammatical resources is time consuming and expensive. Cahill et al. (2002, 2003) present methods for automatically constructing robust, wide-coverage, statistical LFG grammars from an f-structure-annotated version of the Penn-II treebank. The f-structure annotations for the treebank trees are generated automatically by an f-structure annotation algorithm. The trees in the Penn-II treebank contain empty productions and a rich arsenal of traces that coindex "displaced" linguistic material with the tree positions where this material should be interpreted semantically. The automatic f-structure annotation algorithm is sensitive to these traces and captures long-distance dependencies (LDDs) in terms of corresponding reentrancies in the f-structure annotations. However, the wide-coverage statistical grammars automatically extracted from this resource in Cahill et al. (2002) do not capture LDDs; instead they parse new text into "proto-f-structures", which interpret linguistic material locally, where it occurs in the parse tree. The reason is that current statistical parsers do not produce trees with empty productions and coindexed traces (two exceptions are Collins' (1999) model 3 and Johnson's (2002) tree post-processing approach). Indeed, statistical parsers standardly remove empty productions and traces from the training set (Charniak, 1996).
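To illustrate the distinction (our own example, not one drawn from the paper), consider "Who did you say left?": a proto-f-structure interprets the fronted "who" locally as a FOCUS and leaves the SUBJ of "leave" unfilled, whereas the proper f-structure makes the two reentrant. A minimal sketch, with f-structures encoded as nested Python dicts (a hypothetical encoding, not the authors' format):

# Proto-f-structure: FOCUS is interpreted locally, not yet linked.
proto = {
    "pred": "say<SUBJ,COMP>",
    "subj": {"pred": "you"},
    "focus": {"pred": "who"},          # displaced material, unlinked
    "comp": {"pred": "leave<SUBJ>"},   # SUBJ of 'leave' is unfilled
}

# Proper f-structure: add the reentrancy FOCUS = COMP:SUBJ.
proto["comp"]["subj"] = proto["focus"]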

In this paper we present a method for resolving LDDs in an automatically constructed, wide-coverage, statistical LFG grammar, for parse trees that do not contain empty nodes or coindexed traces. Following standard LFG, we resolve such dependencies at the level of f-structure, using paths through f-structure (functional-uncertainty paths) and lexical information (semantic forms). In contrast to other approaches, however, we compute this information automatically from the (proper) f-structure-annotated Penn-II treebank resource. Given such a resource, it is possible to extract semantic forms automatically, following van Genabith et al. (1999); the precise results obtained are detailed in a companion paper. The semantic forms are associated with conditional probabilities P(s|l), derived from the corpus (a relative-frequency sketch is given below), where l is a lemma and s a semantic form. We extract more than 15,500 (non-empty) semantic forms with probabilities. In a similar manner, from the same resource it is possible to automatically extract the shortest paths linking LDD reentrancies in f-structure. These are classified according to LDD type (e.g. TOPIC, FOCUS) and associated with conditional probabilities P(p|d), where p is a path and d is either TOPIC or FOCUS. From the f-structure-annotated Penn-II treebank we extract 23 TOPIC and 54 FOCUS path types with associated probabilities (sample FOCUS paths are given in Table 1). Given a proto-f-structure F, the LDD resolution algorithm recursively traverses F and at each level attempts to resolve any TOPIC or FOCUS value it finds against the extracted paths and semantic forms; a schematic sketch of this step follows Table 1.
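A minimal sketch of the relative-frequency estimation of P(s|l) (our reconstruction; the function name and data layout are assumptions, not the authors' code); P(p|d) can be estimated from (LDD-type, path) pairs in the same way:

from collections import Counter

def estimate(pairs):
    # MLE estimate of P(s|l) from (lemma, semantic_form) pairs
    # harvested from the f-structure-annotated treebank.
    joint = Counter(pairs)
    marginal = Counter(lemma for lemma, _ in pairs)
    return {(l, s): n / marginal[l] for (l, s), n in joint.items()}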

Table 1: Sample FOCUS path types
Focus path             #
up-subj             7894
up-obj              1167
up-xcomp             956
up-xcomp:obj         793
up-xcomp:xcomp       161
up-xcomp:xcomp:obj   135
up-comp:subj         119
up-xcomp:subj         92
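The following is a schematic sketch of the resolution step, reconstructed from the description above (the f-structure encoding, names, and the licensing check are our assumptions, not the authors' implementation):

def follow(fstr, path):
    # Walk all but the last grammatical function in the path; return the
    # sub-f-structure that should host the final function, or None.
    for gf in path[:-1]:
        fstr = fstr.get(gf)
        if not isinstance(fstr, dict):
            return None
    return fstr

def resolve(fstr, p_path, licenses, seen=None):
    # Recursively resolve TOPIC/FOCUS reentrancies in a proto-f-structure.
    # p_path[d] maps candidate paths (tuples of grammatical functions) to
    # P(p|d); licenses(pred, gf) returns the P(s|l) mass of semantic forms
    # of the predicate's lemma that subcategorise for gf (0.0 if none).
    # Returns (fstr, score), where score is the product of the
    # probabilities of the chosen resolutions.
    seen = set() if seen is None else seen
    if id(fstr) in seen:                        # reentrant node: visit once
        return fstr, 1.0
    seen.add(id(fstr))
    score = 1.0
    for d in ("topic", "focus"):
        if d in fstr:
            best, best_p = None, 0.0
            for path, p in p_path.get(d, {}).items():
                host = follow(fstr, path)
                gf = path[-1]
                if host is None or gf in host:  # path broken or GF filled
                    continue
                q = p * licenses(host.get("pred", ""), gf)
                if q > best_p:
                    best, best_p = (host, gf), q
            if best is not None:
                host, gf = best
                host[gf] = fstr[d]              # introduce the reentrancy
                score *= best_p
    for value in fstr.values():
        if isinstance(value, dict):             # recurse into sub-f-structures
            _, s = resolve(value, p_path, licenses, seen)
            score *= s
    return fstr, score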

The algorithm supports multiple TOPIC/FOCUS LDDs and multiplies the probabilities associated with each resolution to rank the resolved f-structure. It also supports resolution of LDDs where no overt linguistic material introduces a source TOPIC/FOCUS function (e.g. in wh-less "reduced" relative clause constructions).

We have implemented the algorithm and carried out initial tests on grammars trained on sections 02-21 of the WSJ part of the Penn-II treebank and evaluated on section 23. Evaluation is carried out against a test set of manually constructed gold-standard f-structures for 105 sentences randomly extracted from section 23, and against the (proper) f-structures generated by the automatic annotation algorithm (Cahill et al., 2002) for the full set of 2400 sentences in section 23. The results in Table 2 show an increase of just over 3 points in f-score when the LDDs are resolved.

Table 2: Parsing results before and after resolution of LDDs
           Before resolution        After resolution
           P      R      F          P      R      F
A-PCFG     69.82  52.57  59.98      69.16  57.92  63.04
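For reference, F here is the standard harmonic mean of precision (P) and recall (R); the after-resolution figure, for instance, can be checked as:

F = 2PR / (P + R) = (2 × 69.16 × 57.92) / (69.16 + 57.92) ≈ 63.04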

In our view, the research reported here has raised a number of interesting issues. First, perhaps surprisingly, the wide-coverage, proto-f-structure grammars and parsers of Cahill et al. (2002) did not use lexical information. In order to extend these resources to proper f-structures (i.e. to account for LDDs), we naturally arrived at an architecture that involves lexical information in the form of subcategorisation frames (semantic forms). These, however, were not hand-coded but automatically extracted from the (proper) f-structure-annotated Penn-II treebank (Cahill et al., 2002, 2003). Perhaps the most important aspect of this work is that we have developed initial methodologies for the automatic construction of robust, wide-coverage, treebank-based, proper LFG grammars and parsers that can parse the Penn-II treebank at a much reduced development cost compared to the manual development of comparable resources. We believe that this constitutes an alternative to manual, wide-coverage, rich (deep-analysis) unification grammar development, and opens up the possibility of interesting research on combining manual and automatic grammar development.
