Exploiting extra-textual and linguistic information in keyphrase extraction

Published online by Cambridge University Press: 30 September 2014

GÁBOR BEREND*
Affiliation:
University of Szeged, Department of Informatics, Árpád tér 2, Szeged, H6720, Hungary email: berendg@inf.u-szeged.hu

Abstract

Keyphrases are the most important phrases of a document, which makes them useful for improving natural language processing tasks, including information retrieval, document classification, document visualization, summarization and categorization. Here, we propose a supervised framework augmented by novel extra-textual information derived primarily from Wikipedia. Unlike most other methods relying on Wikipedia, our approach does not require a full textual index of all Wikipedia articles, as we only exploit the category hierarchy and a list of multiword expressions derived from Wikipedia. This approach is not only less resource intensive, but also produces results comparable or superior to those of previous similar works. Our thorough evaluations also suggest that the proposed framework performs consistently well on multiple datasets, being competitive with or even outperforming other state-of-the-art methods. Besides introducing features that incorporate extra-textual information, we also experimented with a novel way of representing features that are derived from the POS tagging of the keyphrase candidates.

Type
Articles
Copyright
Copyright © Cambridge University Press 2014 
