InfoXtract: A customizable intermediate level information extraction engine

ROHINI K. SRIHARI; WEI LI; THOMAS CORNELL; CHENG NIU

doi:10.1017/S1351324906004116

Published online by Cambridge University Press: 01 January 2008

ROHINI K. SRIHARI: Affiliation:
Janya Inc., 1408 Sweet Home Road, Amherst, NY 14228, USA, State University of New York at Buffalo e-mail: rohini@janyainc.com
WEI LI: Affiliation:
Janya Inc., 1408 Sweet Home Road, Amherst, NY 14228, USA e-mail: wei@janyainc.com
THOMAS CORNELL: Affiliation:
Janya Inc., 1408 Sweet Home Road, Amherst, NY 14228, USA e-mail: cornell@janyainc.com
CHENG NIU: Affiliation:
Microsoft Research China, 5/F, Beijing Sigma Center, No. 49, Zhichun Road, Haidian District, Beijing 100080, P.R.C. e-mail: cniu@microsoft.com

Abstract

Information Extraction (IE) systems help analysts assimilate information from electronic documents. This paper focuses on IE tasks designed to support information discovery applications. Because information discovery means examining large volumes of heterogeneous documents for situations that cannot be anticipated a priori, it requires IE systems with breadth as well as depth. This implies the need for a domain-independent IE system that can easily be customized for specific domains: end users must be given tools to customize the system on their own. It also implies the need to define new intermediate-level IE tasks that are richer than the subject-verb-object (SVO) triples produced by shallow systems, yet not as complex as the domain-specific scenarios defined by the Message Understanding Conference (MUC). This paper describes InfoXtract, a robust, scalable, intermediate-level IE engine that can be ported to various domains. It describes new IE tasks, such as the synthesis of entity profiles and the extraction of concept-based general events, which represent realistic near-term goals focused on deriving useful, actionable information. Entity profiles consolidate information about an entity (a person, organization, location, etc.) within a document and across documents into a single template; this takes into account aliases and anaphoric references as well as key relationships and events pertaining to that entity. Concept-based events attempt to normalize information such as time expressions (e.g., yesterday) as well as ambiguous location references (e.g., Buffalo). These new tasks facilitate the correlation of output from an IE engine with structured data to enable text mining. InfoXtract's hybrid architecture, which combines grammatical processing and machine learning, is described in detail. Benchmarking results for the core engine and for applications that use the engine are presented.
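To make the two task descriptions in the abstract more concrete, the following is a minimal Python sketch of an entity profile template and of resolving a relative time expression against a document date. It is an illustration only, not InfoXtract's actual data model or API: the names `EntityProfile`, `merge_mention`, and `normalize_time`, the field layout, and the sample entity are all assumptions introduced here, and coreference resolution is assumed to have happened upstream.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class EntityProfile:
    """Hypothetical profile template: one record per real-world entity."""
    name: str
    entity_type: str
    aliases: set = field(default_factory=set)
    relationships: dict = field(default_factory=dict)  # relation name -> set of values
    events: list = field(default_factory=list)


def merge_mention(profiles, canonical, entity_type,
                  alias=None, relationship=None, event=None):
    """Fold one extracted mention into the profile keyed by its canonical name.

    Assumes an upstream step has already mapped the mention (alias or
    anaphor) to `canonical`.
    """
    profile = profiles.setdefault(canonical, EntityProfile(canonical, entity_type))
    if alias and alias != canonical:
        profile.aliases.add(alias)
    if relationship:
        key, value = relationship
        profile.relationships.setdefault(key, set()).add(value)
    if event:
        profile.events.append(event)
    return profile


def normalize_time(expression, document_date):
    """Resolve a few relative time expressions against the document date."""
    offsets = {"today": 0, "yesterday": -1, "tomorrow": 1}
    key = expression.strip().lower()
    if key not in offsets:
        return None  # unknown expression: leave unresolved rather than guess
    return document_date + timedelta(days=offsets[key])


# Two mentions, possibly from different documents, end up in one profile.
profiles = {}
merge_mention(profiles, "John Smith", "PERSON",
              alias="Mr. Smith", relationship=("affiliation", "Acme Corp."))
merge_mention(profiles, "John Smith", "PERSON", event="visited Buffalo")

print(normalize_time("yesterday", date(2008, 1, 2)))  # 2008-01-01
```

Keying profiles by a canonical name keeps the merge step trivial in this sketch; a real system would need a more robust identity key, since distinct entities can share a name.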

Information

Type: Papers
Information: Natural Language Engineering, Volume 14, Issue 1, January 2008, pp. 33–69

DOI: https://doi.org/10.1017/S1351324906004116
Copyright: Copyright © Cambridge University Press 2006
