A hierarchical Dirichlet language model

Published online by Cambridge University Press:  12 September 2008

David J. C. MacKay
Affiliation:
Cavendish Laboratory, Cambridge CB3 0HE, UK. Email: mackay@mrao.cam.ac.uk
Linda C. Bauman Peto
Affiliation:
Department of Computer Science, University of Toronto, Canada. Email: peto@cs.toronto.edu

Abstract

We discuss a hierarchical probabilistic model whose predictions are similar to those of the popular language modelling procedure known as ‘smoothing’. A number of interesting differences from smoothing emerge. The insights gained from a probabilistic view of this problem point towards new directions for language modelling. The ideas of this paper are also applicable to other problems such as the modelling of triphones in speech, and DNA and protein sequences in molecular biology. The new algorithm is compared with smoothing on a two million word corpus. The methods prove to be about equally accurate, with the hierarchical model using fewer computational resources.
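To illustrate why a Dirichlet-based model can resemble smoothing, here is a minimal sketch of a bigram predictor with a Dirichlet prior. It uses the standard result that, under a Dirichlet prior with parameters `alpha * m`, the predictive probability of word i after context j is (F_{i|j} + alpha * m_i) / (F_j + alpha), which is algebraically the same as interpolating the bigram relative frequency with `m`. The function names, the toy corpus, the uniform `m`, and the value of `alpha` are assumptions made for illustration; the sketch does not infer the hyperparameters from data and is not the authors' implementation.

```python
from collections import Counter, defaultdict


def bigram_counts(tokens):
    """Count F[j][i]: how often word i follows context word j."""
    counts = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        counts[prev][word] += 1
    return counts


def dirichlet_predictive(counts, prev, word, alpha, m):
    """Predictive P(word | prev) under a Dirichlet prior alpha * m."""
    context = counts.get(prev, {})
    f_context = sum(context.values())
    return (context.get(word, 0) + alpha * m[word]) / (f_context + alpha)


def interpolated(counts, prev, word, alpha, m):
    """The same quantity written as a smoothing-style interpolation."""
    context = counts.get(prev, {})
    f_context = sum(context.values())
    # The weight on the bigram estimate grows with the context count.
    lam = f_context / (f_context + alpha)
    rel_freq = context.get(word, 0) / f_context if f_context else 0.0
    return lam * rel_freq + (1.0 - lam) * m[word]


# Toy data (hypothetical): a six-word corpus and a uniform m.
tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
m = {w: 1.0 / len(vocab) for w in vocab}
alpha = 2.0  # assumed value, chosen only for the example
counts = bigram_counts(tokens)

p_direct = dirichlet_predictive(counts, "the", "cat", alpha, m)
p_interp = interpolated(counts, "the", "cat", alpha, m)
total = sum(dirichlet_predictive(counts, "the", w, alpha, m) for w in vocab)
```

In this form the interpolation weight `lam` depends only on how often the context has been seen, so frequently observed contexts rely mostly on their own counts while rare contexts fall back on `m`.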

Type
Articles
Copyright
Copyright © Cambridge University Press 1995
