A topological embedding of the lexicon for semantic distance computation

Published online by Cambridge University Press: 15 June 2010

N. DAVIS
Affiliation:
Department of Computer Science, Brigham Young University, Provo, UT 84602, USA e-mail: cgc@cs.byu.edu
C. GIRAUD-CARRIER
Affiliation:
Department of Computer Science, Brigham Young University, Provo, UT 84602, USA e-mail: cgc@cs.byu.edu
D. JENSEN
Affiliation:
KJ Nova, Inc., Provo, UT 84601, USA

Abstract

We show how a quantitative context may be established for what is essentially qualitative in nature by topologically embedding a lexicon (here, WordNet) in a complete metric space. This novel transformation establishes a natural connection between the order relation in the lexicon (e.g., hyponymy) and the notion of distance in the metric space, giving rise to effective word-level and document-level lexical semantic distance measures. We provide a formal account of the topological transformation and demonstrate the value of our metrics on several experiments involving information retrieval and document clustering tasks.

Type: Papers
Copyright © Cambridge University Press 2010
