Book contents
- Designing and Evaluating Language Corpora
- Copyright page
- Contents
- Figures
- Tables
- Acknowledgments
- 1 Introduction
- 2 Approaches to Representativeness in Previous Corpus Linguistic Research
- 3 Corpus Representativeness
- 4 Domain Considerations
- 5 Distribution Considerations
- 6 The Influence of Domain and Distribution Considerations on Corpus Representativeness
- 7 Corpus Design and Representativeness in Practice – With Daniel Keller
- Glossary
- Book part
- References
- Index
6 - The Influence of Domain and Distribution Considerations on Corpus Representativeness
Bringing It All Together
Published online by Cambridge University Press: 07 April 2022
Summary
We emphasize that the ability of a corpus to provide accurate estimates of a linguistic parameter depends on the combined influence of domain considerations (coverage bias and selection bias) and distribution considerations (corpus size). Using a series of experimental corpora drawn from the domain of Wikipedia articles, we demonstrate the impact of corpus size, coverage bias, selection bias, and stratification on representativeness. Empirical results show that robust sampling methods and large sample sizes can only give a better representation of the operational domain (i.e. overcome selection bias). By themselves, these factors cannot produce accurate quantitative-linguistic analyses of the actual domain (i.e. overcome coverage bias). Uncontrolled domain considerations can lead to unpredictable results with respect to accuracy.
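The distinction between the operational domain and the actual domain can be illustrated with a small simulation. The sketch below is not the chapter's Wikipedia experiment: it uses synthetic data, and every name and number in it (the two groups of texts, their feature rates, the sample sizes) is an assumption chosen only for illustration. It draws simple random samples of increasing size from a sampling frame that omits part of the actual domain, then compares each sample mean with the mean of the frame (the operational domain) and with the mean of the full population (the actual domain).

```python
import random
import statistics

# Synthetic "actual domain": per-text rates of some linguistic feature.
# All values here are illustrative assumptions, not data from the chapter.
pop_rng = random.Random(0)
covered = [pop_rng.gauss(10.0, 2.0) for _ in range(8000)]    # texts in the sampling frame
uncovered = [pop_rng.gauss(20.0, 2.0) for _ in range(2000)]  # texts the frame omits

operational_domain = covered            # what we can actually sample from
actual_domain = covered + uncovered     # what we want to describe

operational_mean = statistics.fmean(operational_domain)
actual_mean = statistics.fmean(actual_domain)


def mean_abs_error(frame, target_mean, n, trials=100, seed=1):
    """Average absolute error of the sample mean over repeated random samples."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        sample = rng.sample(frame, n)  # simple random sample without replacement
        errors.append(abs(statistics.fmean(sample) - target_mean))
    return statistics.fmean(errors)


# For each corpus size, record the error against both target means.
results = {}
for n in (50, 500, 5000):
    results[n] = (
        mean_abs_error(operational_domain, operational_mean, n),
        mean_abs_error(operational_domain, actual_mean, n),
    )
    print(f"n={n:5d}  error vs operational domain: {results[n][0]:.3f}  "
          f"error vs actual domain: {results[n][1]:.3f}")
```

Under these assumptions, the error against the operational domain shrinks as the sample grows, while the error against the actual domain stays near the gap between the two population means, because no sample from the frame can include the omitted texts.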
- Type: Chapter
- Information: Designing and Evaluating Language Corpora: A Practical Framework for Corpus Representativeness, pp. 156-176
- Publisher: Cambridge University Press
- Print publication year: 2022