A new PPM variant for Chinese text compression

Published online by Cambridge University Press: 01 July 2008

PEILIANG WU and W. J. TEAHAN
Affiliation:
School of Informatics, University of Wales Bangor, Dean Street, Bangor, Gwynedd LL57 1UT, UK. Email: perry@informatics.bangor.ac.uk, wjt@informatics.bangor.ac.uk

Abstract

Large-alphabet languages such as Chinese are very different from English, and therefore present different problems for text compression. In this article, we first examine the characteristics of Chinese, and then introduce a new variant of the Prediction by Partial Match (PPM) model designed specifically for Chinese characters. Unlike traditional PPM coding schemes, which encode an escape probability whenever a novel character occurs in a context, the new coding scheme encodes the order first and then the symbol, without having to output an escape probability. This scheme achieves excellent compression rates in comparison with other schemes on a variety of Chinese text files.
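
The contrast drawn in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how the order itself is modelled, and the sketch computes no probabilities and performs no arithmetic coding. It only lists the events each style of coder would have to encode for one symbol. All function names (build_contexts, escape_events, order_first_events) are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def build_contexts(text, max_order):
    """Map every context of length 0..max_order to the set of symbols seen after it."""
    seen = defaultdict(set)
    for i, sym in enumerate(text):
        for k in range(max_order + 1):
            if i - k < 0:
                break
            seen[text[i - k:i]].add(sym)
    return seen


def escape_events(seen, history, sym, max_order):
    """Escape-style coding: back off one order at a time, emitting an escape
    from each known context that has not yet predicted sym."""
    events = []
    for k in range(min(max_order, len(history)), -1, -1):
        ctx = history[len(history) - k:]
        if ctx not in seen:
            continue  # context never seen before, so nothing is coded for it
        if sym in seen[ctx]:
            events.append(("symbol", k, sym))
            return events
        events.append(("escape", k))
    # Assumption: order -1 stands for a default model covering the whole alphabet.
    events.append(("symbol", -1, sym))
    return events


def order_first_events(seen, history, sym, max_order):
    """Order-first coding: emit the order in which sym is predicted, then sym itself."""
    for k in range(min(max_order, len(history)), -1, -1):
        ctx = history[len(history) - k:]
        if sym in seen.get(ctx, ()):
            return [("order", k), ("symbol", k, sym)]
    return [("order", -1), ("symbol", -1, sym)]


history = "abracadabra"
seen = build_contexts(history, max_order=2)
for next_symbol in "cdz":
    print(next_symbol,
          escape_events(seen, history, next_symbol, 2),
          order_first_events(seen, history, next_symbol, 2))
```

In this toy setting the escape-style coder emits one event per order it has to back off through, whereas the order-first coder always emits exactly two events (the order, then the symbol). Whether that yields better compression depends on how well the order can be predicted, which the sketch does not model.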

Type
Papers
Copyright
Copyright © Cambridge University Press 2007
