Open-Domain Question Answering from Large Text Collections

Vast amounts of information (from newspapers, journals, legal transcripts, conference proceedings, correspondence, web pages, and other sources) have become increasingly accessible online. Yet current keyword-based search technologies offer little support to users searching for a few relevant text fragments among thousands of documents.

Open-domain question answering has recently emerged as a new field aimed at extracting brief, relevant answers from large text collections in response to written questions submitted by users. Individual related fields, such as natural language processing or information retrieval, do not by themselves offer practicable solutions to open-domain question answering. For example, document retrieval alone is insufficient because relevant information is concentrated in fragments that are small compared to the size of the entire document. Advanced methods based on higher levels of text understanding cannot be applied directly to gigabyte-sized collections of unrestricted text. Similarly, the amount of knowledge that an open-domain question answering system would need in order to act as an intelligent conversational agent is beyond the boundaries of present technologies.

This book presents the design of novel and robust methods for capturing the semantics of natural language questions and for finding relevant text snippets. The theoretical contributions of this research are reflected in a fully implemented architecture whose performance was evaluated within the DARPA-sponsored Text REtrieval Conference (TREC). In addition, experimental results show significant qualitative improvements over the output of web search engines, revealing both the challenges and the desired features of next-generation web search technologies.

Marius Pasca is Director of Question Answering Research and Development at Language Computer Corporation.

Contents

Preface
1 Introduction
  1.1 Motivation
  1.2 Description of the Task
    1.2.1 Assumptions
  1.3 Previous Work in NLP for Question Answering
  1.4 Question Answering at the Text REtrieval Conference
    1.4.1 Test Collection
    1.4.2 Gold Standard
    1.4.3 Scoring Metrics
  1.5 Book Overview
2 An Approach to Open-Domain Question Answering
  2.1 Introduction
  2.2 Document Retrieval versus Answer Extraction
    2.2.1 A Generic Document Retrieval Architecture
    2.2.2 A Generic Question Answering Architecture
  2.3 A Complete Architecture for Open-Domain Question Answering
    2.3.1 Background on External Resources
    2.3.2 Architecture Description
  2.4 Answering Natural Language Questions: An Example
  2.5 Alternative Perspectives on Question Answering
    2.5.1 Review of Information Retrieval (IR) for QA
    2.5.2 Review of Information Extraction (IE) for QA
    2.5.3 Review of Text-Based Inference (TI) for QA
    2.5.4 Impact on QA Subproblems
  2.6 Summary
3 Question Processing
  3.1 Introduction
  3.2 Information Conveyed by Natural-Language Questions
    3.2.1 Layer 1: Lexical Terms
    3.2.2 Layer 2: Inter-Term Relations
    3.2.3 Layer 3: Question Stems and Expected Answer Types
    3.2.4 Layer 4: Semantic Constraints
  3.3 A Dependency Representation Model
    3.3.1 Model Description
    3.3.2 Semantic Operators
    3.3.3 Application to Answer Extraction
  3.4 Construction of Dependency Representations
    3.4.1 Preprocessing
    3.4.2 Derivation of Relations
  3.5 Summary
4 Answer Type Determination
  4.1 Introduction
  4.2 A Hierarchy of Answer Types
    4.2.1 Overview of the Hierarchy
    4.2.2 Connecting the Answer Types with WordNet Hierarchies
    4.2.3 Correlation between Answer Types and Named Entities
  4.3 Building the Answer Type Hierarchy
    4.3.1 Part of Speech Coverage
    4.3.2 Selection of Word Senses
    4.3.3 Refinement of the Hierarchy Nodes
  4.4 Derivation of the Expected Answer Type of a Question
    4.4.1 Derivation of the Question Stem and Answer Type Term
    4.4.2 Hierarchy Filtering Based on the Question Stem
    4.4.3 Hierarchy Search Guided by the Answer Type Term
    4.4.4 Extraction of the Expected Answer Type
  4.5 Limitations and Extensions
    4.5.1 Refining the Hierarchy of Answer Types
    4.5.2 Dynamic Answer Type Categories
    4.5.3 Pattern-Based Answer Type Recognition
  4.6 Evaluation
  4.7 Summary
5 Passage Retrieval
  5.1 Introduction
  5.2 Conversion of Questions into Ordered Sequences of Keywords
    5.2.1 Factors in the Selection of Question Terms as Keywords
    5.2.2 High-Relevance Terms
    5.2.3 Medium-Relevance Terms
    5.2.4 Low-Relevance Terms
    5.2.5 Assembling Ordered Sequences of Keywords
  5.3 Passage Retrieval Through Dynamic Query Adjustment
    5.3.1 Query Definition
    5.3.2 The Passage Retrieval Loop
    5.3.3 Control of Passage Granularity
  5.4 Summary
6 Answer Extraction
  6.1 Introduction
  6.2 Question-Driven Passage Ranking
    6.2.1 Matching the Question on a Passage
    6.2.2 Lexical-Matching Relevance Features for Passage Ranking
    6.2.3 Passage Ranking Scheme
  6.3 Identification of Candidate Answers
    6.3.1 Named-Entity Based Identification of Candidate Answers
    6.3.2 Pattern-Based Identification of Candidate Answers
  6.4 Extraction of Answer Strings
  6.5 Empirical Ranking of Candidate Answers
    6.5.1 Semantic-Matching Relevance Features for Answer Ranking
    6.5.2 An Empirical Answer Scoring Formula
    6.5.3 Evaluation
  6.6 A Machine Learning Approach to Answer Ranking
    6.6.1 Perceptron-Based Learning for Answer Ranking
    6.6.2 Evaluation
  6.7 Summary
7 Answer Extraction from Web Documents
  7.1 Introduction
  7.2 Finding Relevant Answers on the Web
    7.2.1 Architecture Overview
    7.2.2 Retrieval of Text Passages from Web Search Engines
  7.3 Evaluation
    7.3.1 Results in Terms of Precision/MRR
    7.3.2 Results in Terms of Time Saving
  7.4 Summary
8 Related Work
  8.1 Question Processing
  8.2 Passage Retrieval
  8.3 Answer Extraction
9 Conclusion
References
Name Index
Subject Index

4/15/2003

ISBN (Paperback): 1575864282 (9781575864280)
ISBN (Cloth): 1575864274 (9781575864273)
ISBN (Electronic): 1575869438 (9781575869438)
