The semantic information of documents needs to be represented because it underpins many applications, such as document summarization, web search, and text analysis. Although many studies have addressed this problem by enriching document vectors with the relatedness of the words they contain, performance remains far from satisfactory because the physical boundaries of documents limit how well the relatedness between words can be estimated. To address this problem, we propose an effective approach that further infers the implicit relatedness between words via their common related words. To avoid overestimating this implicit relatedness, we constrain the inference by the marginal probabilities of the words, based on the law of total probability. We confirm both theoretically and experimentally that the proposed method accurately measures the relatedness between words. A thorough evaluation on real datasets shows that the proposed method achieves significant improvements in document clustering over state-of-the-art methods.
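The sketch below illustrates the general idea stated in the abstract, not the paper's exact formulation: implicit relatedness between two words is inferred through their common related words, and the inferred mass is limited by the marginal word probabilities so the law-of-total-probability bound is respected. The function name `infer_implicit_relatedness`, the reading of the direct relatedness matrix as conditional probabilities, and the capping scheme are all assumptions made for illustration.

```python
import numpy as np

def infer_implicit_relatedness(direct, marginal):
    """Illustrative sketch only (not the authors' exact method).

    direct   : (V, V) array of direct relatedness, read here as P(w_j | w_i);
               rows need not be complete probability distributions.
    marginal : (V,) array of marginal word probabilities P(w_j).

    Implicit relatedness between w_i and w_j is inferred through common
    related words w_k as  sum_k P(w_k | w_i) * P(w_j | w_k).  The inferred
    mass added to each word w_j is then limited by the headroom the law of
    total probability leaves:  sum_i P(w_j | w_i) P(w_i) <= P(w_j).
    """
    # One inference step via shared intermediate words.
    inferred = direct @ direct

    # Keep the inferred value only where there is no direct evidence.
    implicit = np.where(direct > 0.0, 0.0, inferred)

    # Probability mass each word w_j already receives from direct relatedness.
    direct_mass = direct.T @ marginal        # sum_i P(w_j | w_i) P(w_i)
    implicit_mass = implicit.T @ marginal

    # Headroom before the marginal P(w_j) would be exceeded; scale the
    # implicit part of each column down to fit inside that headroom.
    headroom = np.maximum(marginal - direct_mass, 0.0)
    scale = np.minimum(1.0, headroom / np.maximum(implicit_mass, 1e-12))

    return direct + implicit * scale[None, :]
```

The per-word cap here is one simple way to prevent the overestimation the abstract warns about: inferred relatedness can never push a word's total incoming probability mass beyond its own marginal probability.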