InfoXtract: A customizable intermediate level information extraction engine

Published online by Cambridge University Press: 01 January 2008

ROHINI K. SRIHARI
Affiliation:
Janya Inc., 1408 Sweet Home Road, Amherst, NY 14228, USA, State University of New York at Buffalo e-mail: rohini@janyainc.com
WEI LI
Affiliation:
Janya Inc., 1408 Sweet Home Road, Amherst, NY 14228, USA e-mail: wei@janyainc.com
THOMAS CORNELL
Affiliation:
Janya Inc., 1408 Sweet Home Road, Amherst, NY 14228, USA e-mail: cornell@janyainc.com
CHENG NIU
Affiliation:
Microsoft Research China, 5/F, Beijing Sigma Center, No. 49, Zhichun Road, Haidian District, Beijing 100080, P.R.C. e-mail: cniu@microsoft.com

Abstract

Information Extraction (IE) systems help analysts assimilate information from electronic documents. This paper focuses on IE tasks designed to support information discovery applications. Since information discovery implies examining large volumes of heterogeneous documents for situations that cannot be anticipated a priori, it requires IE systems to have breadth as well as depth. This implies the need for a domain-independent IE system that can easily be customized for specific domains: end users must be given tools to customize the system on their own. It also implies the need to define new intermediate-level IE tasks that are richer than the subject-verb-object (SVO) triples produced by shallow systems, yet not as complex as the domain-specific scenarios defined by the Message Understanding Conference (MUC). This paper describes InfoXtract, a robust, scalable, intermediate-level IE engine that can be ported to various domains. It describes new IE tasks, such as synthesis of entity profiles and extraction of concept-based general events, which represent realistic near-term goals focused on deriving useful, actionable information. Entity profiles consolidate information about a person, organization, location, etc., within a document and across documents into a single template; this takes into account aliases and anaphoric references as well as key relationships and events pertaining to that entity. Concept-based events attempt to normalize information such as time expressions (e.g., yesterday) as well as ambiguous location references (e.g., Buffalo). These new tasks facilitate the correlation of output from an IE engine with structured data to enable text mining. InfoXtract's hybrid architecture, which combines grammatical processing and machine learning, is described in detail. Benchmarking results for the core engine and applications utilizing the engine are presented.

Type
Papers
Copyright
Copyright © Cambridge University Press 2006
