Published online by Cambridge University Press: 01 January 2008
Information Extraction (IE) systems assist analysts to assimilate information from electronic documents. This paper focuses on IE tasks designed to support information discovery applications. Since information discovery implies examining large volumes of heterogeneous documents for situations that cannot be anticipated a priori, they require IE systems to have breadth as well as depth. This implies the need for a domain-independent IE system that can easily be customized for specific domains: end users must be given tools to customize the system on their own. It also implies the need for defining new intermediate level IE tasks that are richer than the subject-verb-object (SVO) triples produced by shallow systems, yet not as complex as the domain-specific scenarios defined by the Message Understanding Conference (MUC). This paper describes InfoXtract, a robust, scalable, intermediate-level IE engine that can be ported to various domains. It describes new IE tasks such as synthesis of entity profiles, and extraction of concept-based general events which represent realistic near-term goals focused on deriving useful, actionable information. Entity profiles consolidate information about a person/organization/location etc. within a document and across documents into a single template; this takes into account aliases and anaphoric references as well as key relationships and events pertaining to that entity. Concept-based events attempt to normalize information such as time expressions (e.g., yesterday) as well as ambiguous location references (e.g., Buffalo). These new tasks facilitate the correlation of output from an IE engine with structured data to enable text mining. InfoXtract's hybrid architecture comprised of grammatical processing and machine learning is described in detail. Benchmarking results for the core engine and applications utilizing the engine are presented.