
A hierarchical Dirichlet language model

Published online by Cambridge University Press: 12 September 1995

David J. C. MacKay
Affiliation:
Cavendish Laboratory, Cambridge CB3 0HE, UK. Email: mackay@mrao.cam.ac.uk
Linda C. Bauman Peto
Affiliation:
Department of Computer Science, University of Toronto, Canada. Email: peto@cs.toronto.edu

Abstract

We discuss a hierarchical probabilistic model whose predictions are similar to those of the popular language modelling procedure known as 'smoothing'. A number of interesting differences from smoothing emerge. The insights gained from a probabilistic view of this problem point towards new directions for language modelling. The ideas of this paper are also applicable to other problems, such as the modelling of triphones in speech, and of DNA and protein sequences in molecular biology. The new algorithm is compared with smoothing on a two-million-word corpus. The methods prove to be about equally accurate, with the hierarchical model using fewer computational resources.
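To make the connection between Dirichlet priors and smoothing concrete, the sketch below shows the posterior predictive probability of a bigram model under a symmetric Dirichlet prior. This is a minimal illustration, not the paper's algorithm: the function name dirichlet_bigram_prob, the uniform prior mean, the single concentration parameter alpha and the toy corpus are all assumptions introduced here for demonstration, whereas the paper's hierarchical model infers the prior measure and its hyperparameters from the data.

```python
from collections import Counter

def dirichlet_bigram_prob(w, h, bigrams, unigrams, alpha, vocab_size):
    """Predictive probability P(w | h) under a symmetric Dirichlet prior.

    With a Dirichlet(alpha/V, ..., alpha/V) prior over the transition
    probabilities out of context h, the posterior predictive probability
    interpolates the observed bigram counts with the prior mean, which is
    the link to interpolated smoothing described in the abstract.
    """
    prior_mean = 1.0 / vocab_size          # uniform prior measure (a simplifying assumption)
    c_hw = bigrams[(h, w)]                 # count of the bigram (h, w)
    c_h = unigrams[h]                      # count of the context word h
    return (c_hw + alpha * prior_mean) / (c_h + alpha)

# Toy usage on a tiny corpus (purely illustrative).
corpus = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])
vocab = set(corpus)

p = dirichlet_bigram_prob("cat", "the", bigrams, unigrams,
                          alpha=1.0, vocab_size=len(vocab))
print(f"P(cat | the) = {p:.3f}")
```

Replacing the uniform prior_mean with a probability vector shared across all contexts, and inferring that vector and alpha from the corpus itself, is what makes a model of this kind hierarchical; the resulting predictions blend context-specific counts with a learned marginal distribution, much as interpolated smoothing blends bigram and unigram estimates.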

Type: Articles
Copyright: © Cambridge University Press 1995

