
Datasets for generic relation extraction*

Published online by Cambridge University Press:  09 March 2011

B. HACHEY
Affiliation:
Language Technology Group, Macquarie University, NSW 2109, Australia email: bhachey@cmcrc.com
C. GROVER
Affiliation:
Informatics Forum, 10 Crichton Street, Edinburgh, EH8 9AB, Scotland email: C.Grover@ed.ac.uk; R.Tobin@ed.ac.uk
R. TOBIN
Affiliation:
Informatics Forum, 10 Crichton Street, Edinburgh, EH8 9AB, Scotland email: C.Grover@ed.ac.uk; R.Tobin@ed.ac.uk

Abstract

A vast amount of usable electronic data is in the form of unstructured text. The relation extraction task aims to identify useful information in text (e.g. PersonW works for OrganisationX, GeneY encodes ProteinZ) and recode it in a format such as a relational database or RDF triplestore that can be more effectively used for querying and automated reasoning. A number of resources have been developed for training and evaluating automatic systems for relation extraction in different domains. However, comparative evaluation is impeded by the fact that these corpora use different markup formats and notions of what constitutes a relation. We describe the preparation of corpora for comparative evaluation of relation extraction across domains based on the publicly available ACE 2004, ACE 2005 and BioInfer data sets. We present a common document type using token standoff and including detailed linguistic markup, while maintaining all information in the original annotation. The subsequent reannotation process normalises the two data sets so that they comply with a notion of relation that is intuitive, simple and informed by the semantic web. For the ACE data, we describe an automatic process that converts many relations involving nested, nominal entity mentions to relations involving non-nested, named or pronominal entity mentions. For example, the first entity is mapped from ‘one’ to ‘Amidu Berry’ in the membership relation described in ‘Amidu Berry, one half of PBS’. Moreover, we describe a comparably reannotated version of the BioInfer corpus that flattens nested relations, maps part-whole to part-part relations and maps n-ary to binary relations. Finally, we summarise experiments that compare approaches to generic relation extraction, a knowledge discovery task that uses minimally supervised techniques to achieve maximally portable extractors. These experiments illustrate the utility of the corpora.
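As an illustration only, the remapping example in the abstract can be sketched in a few lines of Python. This is not the authors' conversion process or document type: the `Mention` and `Relation` classes, the `remap` helper, the token spans and the hand-supplied `equivalents` table are all hypothetical names and assumptions introduced here. Mentions point at token offsets rather than containing text, in the spirit of token standoff.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mention:
    start: int  # index of the first token (assumed convention)
    end: int    # index one past the last token
    kind: str   # "named", "nominal" or "pronominal"

@dataclass(frozen=True)
class Relation:
    label: str
    arg1: Mention
    arg2: Mention

def text_of(tokens, mention):
    """Recover the surface string a standoff mention points at."""
    return " ".join(tokens[mention.start:mention.end])

def remap(relation, equivalents):
    """Swap each nominal argument for a known equivalent mention, if any.

    `equivalents` is a hypothetical, hand-supplied lookup table; how such
    links are actually found is outside the scope of this sketch.
    """
    def pick(mention):
        if mention.kind != "nominal":
            return mention
        return equivalents.get(mention, mention)
    return Relation(relation.label, pick(relation.arg1), pick(relation.arg2))

tokens = ["Amidu", "Berry", ",", "one", "half", "of", "PBS"]
berry = Mention(0, 2, "named")
one = Mention(3, 4, "nominal")
pbs = Mention(6, 7, "named")

original = Relation("membership", one, pbs)
remapped = remap(original, {one: berry})
print(text_of(tokens, remapped.arg1), "->", text_of(tokens, remapped.arg2))
```

Because the mentions store only offsets, the original annotation (`original`) is left untouched and the remapped relation sits alongside it, which loosely mirrors the stated goal of keeping all information from the source annotation.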

Type
Articles
Copyright
Copyright © Cambridge University Press 2011

