Hostname: page-component-cd9895bd7-gvvz8 Total loading time: 0 Render date: 2024-12-26T08:48:16.122Z Has data issue: false hasContentIssue false

Extraction of templates from phrases using Sequence Binary Decision Diagrams

Published online by Cambridge University Press:  19 July 2018

D. HIRANO
Affiliation:
Graduate School and Faculty of Information Science and Electrical Engineering, Kyushu University, Fukuoka, Japan e-mail: hirano@limu.ait.kyushu-u.ac.jp
K. TANAKA-ISHII
Affiliation:
Research Center for Advanced Science and Technology, University of Tokyo, Tokyo, Japan e-mail: kumiko@cl.rcast.u-tokyo.ac.jp
A. FINCH
Affiliation:
National Institute of Information and Communications Technology Kyoto, Kyoto, Japan e-mail: andrew.finch@gmail.com

Abstract

The extraction of templates such as ‘regard X as Y’ from a set of related phrases requires the identification of their internal structures. This paper presents an unsupervised approach for extracting templates on-the-fly from only tagged text by using a novel relaxed variant of the Sequence Binary Decision Diagram (SeqBDD). A SeqBDD can compress a set of sequences into a graphical structure equivalent to a minimal deterministic finite state automata, but more compact and better suited to the task of template extraction. The main contribution of this paper is a relaxed form of the SeqBDD construction algorithm that enables it to form general representations from a small amount of data. The process of compression of shared structures in the text during Relaxed SeqBDD construction, naturally induces the templates we wish to extract. Experiments show that the method is capable of high-quality extraction on tasks based on verb+preposition templates from corpora and phrasal templates from short messages from social media.

Type
Article
Copyright
Copyright © Cambridge University Press 2018 

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

References

Abney, S., and Abney, S. P. 1991. Parsing by chunks. In Principle-Based Parsing, pp. 257278. Netherlands: Kluwer Academic Publishers.Google Scholar
Baldwin, T., Cook, P., Lui, M., Mackinlay, A., and Wang, L. 2013. How noisy social media text, how diffrnt social media sources. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP-2013). Mark: John Blitzer.Google Scholar
Baldwin, T. and Kim, S. N. 2010. Multiword expressions. In Indurkhya, N., and Damerau, F. J. (eds.), Handbook of Natural Language Processing, 3rd ed. Boca Raton, FL: CRC Press, Taylor and Francis Group.Google Scholar
Bryant, R. E., 1986. Graph-based algorithms for boolean function manipulation. IEEE Transactions on Computers 35 (8): 677691.Google Scholar
Cormen, T. H., Leiserson, C. E., and Rivest, R. L., 1990. Introduction to Algorithms. USA: MIT Press.Google Scholar
Cui, H., Kan, M.-Y., and Chua, T.-S. 2004. Unsupervised learning of soft patterns for generating definitions from online news. In Proceedings of the 13th International Conference on World Wide Web (WWW-2004), 90–99.Google Scholar
Daciuk, J., Mihov, S., Watson, B. W., and Watson, R. E., 2000. Incremental construction of minimal acyclic finite state automata. Computational Linguistics: Special Issue on Finite-State Methods in NLP 26 : 316.Google Scholar
de Medeiros Caseli, H., Ramisch, C., das Graças Volpe Nunes, M., and Villavicencio, A., 2010. Alignment-based extraction of multiword expressions. Language Resources and Evaluation: Special Issue on Multiword Expression: Hard Going or Plain Sailing 44 (1–2): 5977.Google Scholar
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R., 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41 (6): 391407.Google Scholar
Denzumi, S., Yoshinaka, R., Arimura, H., and Minato, S.-I. 2016. Sequence binary decision diagram: Minimization, relationship to acyclic automata, and complexities of boolean set operations. Discrete Applied Mathematics 212 (C): 6180. Elsevier Sicence Publishers B. V. Amsterdam, The Netherlands, The Netherlands.Google Scholar
Denzumi, S., Yoshinaka, R., Minato, S.-I., and Arimura, H. 2011. Efficient algorithms on sequence binary decision diagrams for manipulating sets of strings. Technical Report, DCS, Hokkaido U., TCS-TR-A-11-53.Google Scholar
Duan, J., Lu, R., Wu, W., Hu, Y., and Tian, Y. 2006. A bio-inspired approach for multi-word expression extraction. In Proceedings of the COLING/ACL on Main Conference Poster Sessions (COLING-ACL-2006). Association for Computational Linguistics, Stroudsburg, PA, USA, 176–82.Google Scholar
Fazly, A. 2006. Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL-2006), 337–344.Google Scholar
Fernau, H., 2009. Algorithms for learning regular expressions from positive data. Journal of Information and Computation 207 : 521541.Google Scholar
Francis, G., Hunston, S., and Manning, E., 1996. Collins COBUILD Grammar Patterns 1: Verbs. London: Harper Collins.Google Scholar
Francis, G., Hunston, S., and Manning, E., 1998. Collins COBUILD Grammar Patterns 2: Nouns and Adjectives. London: HarperCollins.Google Scholar
Gimpel, K. and Smith, N. A. 2012. Concavity and initialization for unsupervised dependency parsing. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 577–581.Google Scholar
Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD-2000). ACM, 1–12.Google Scholar
Headden, W. P. III, Johnson, M., and McClosky, D. 2009. Improving unsupervised dependency parsing with richer contexts and smoothing. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-2009). Association for Computational Linguistics, 101–109.Google Scholar
Huang, Z., Xu, W., and Yu, K. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR abs/1508.01991.Google Scholar
Hunston, S., and Francis, G. 2000. Pattern Grammar: A Corpus-driven Approach to the Lexical Grammar of English. Studies in Corpus Linguistics 4. John Benjamins Publishing Company.Google Scholar
Kim, C. 2010. Text: Automatic template extraction from heterogeneous web pages. IEEE Computer Society, 23 (4): 612626.Google Scholar
Kita, K., Kato, Y., Omoto, T., and Yano, Y. 1994. A comparative study of automatic extraction of collocations from corpora: mutual information vs. cost criteria. Journal of Natural Language Processing 1, 1: 2133.Google Scholar
Klein, D. and Manning, C. D. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics.Google Scholar
Knuth, D. 2009. The Art of Computer Programming, vol. 4, Fascicle 1. Addison-Wesley.Google Scholar
Loekito, E., Bailey, J., and Pei, J., 2010. A binary decision diagram based approach for mining frequent subsequences. Knowledge and Information Systems 24 (2): 235268.Google Scholar
Macqueen, J. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 281–297.Google Scholar
Manning, C. D., Raghavan, P., and Schütze, H., 2008. Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press.Google Scholar
Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., and McClosky, D. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the Association for Computational Linguistics System Demonstrations, 55–60.Google Scholar
Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B., 1993. Building a large annotated corpus of english: The penn treebank. Computational Linguistics 19 (2): 313330.Google Scholar
Martens, S. 2010. Varro: an algorithm and toolkit for regular structure discovery in treebanks. In Proceedings of the Coling 2010: Posters. Coling 2010 Organizing Committee, Beijing, China, 810–8.Google Scholar
István Nagy, T., and Vincze, V. 2014. Vpctagger: Detecting verb-particle constructions with syntax-based methods. In Proceedings of the 10th Workshop on Multiword Expressions (MWE). Association for Computational Linguistics, Gothenburg, Sweden, 17–25.Google Scholar
Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U., and Hsu, M. 2001. Prefixspan: mining sequential patterns by prefix-projected growth. In Proceedings of the 17th International Conference on Data Engineering. IEEE Computer Society, 215–224.Google Scholar
Sag, I. A., Baldwin, T., Bond, F., Copestake, A., and Flickinger, D. 2002. Multiword expressions: a pain in the neck for NLP. In Proceedings of the International Conference on Intelligent Text Processing and Computational Linguistics. Springer, 1–15.Google Scholar
Sangati, F., and van Cranenburgh, A. 2015. Multiword expression identification with recurring tree fragments and association measures. In Proceedings of the MWE@ NAACL-HLT, 10–18.Google Scholar
Smadja, F., 1993. Retrieving collocations from text: Xtract. Computational Linguistics 19 (1): 143177.Google Scholar
Tu, Y., and Roth, D. 2012. Sorting out the most confusing english phrasal verbs. In Proceedings of the 1st Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the 6th International Workshop on Semantic Evaluation (SemEval-2012). Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 65–9.Google Scholar
Zarrieß, S. and Kuhn, J. 2009. Exploiting translational correspondences for pattern-independent MWE identification. In Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (MWE-2009). Association for Computational Linguistics, Stroudsburg, PA, USA, 23–30.Google Scholar