Book contents
- Frontmatter
- Contents
- Preface
- 1 Introduction: goals and methods of the corpus-based approach
- Part I Investigating the use of language features
- Part II Investigating the characteristics of varieties
- Part III Summing up and looking ahead
- Part IV Methodology boxes
- 1 Issues in corpus design
- 2 Issues in diachronic corpus design
- 3 Concordancing packages versus programming for corpus analysis
- 4 Characteristics of tagged corpora
- 5 The process of tagging
- 6 Norming frequency counts
- 7 Statistical measures of lexical associations
- 8 The unit of analysis in corpus-based studies
- 9 Significance tests and the reporting of statistics
- 10 Factor loadings and dimension scores
- Appendix: commercially available corpora and analytical tools
- References
- Index
1 - Issues in corpus design
Published online by Cambridge University Press: 05 June 2012
Summary
A corpus is not simply a collection of texts. Rather, a corpus seeks to represent a language or some part of a language. The appropriate design for a corpus therefore depends upon what it is meant to represent. The representativeness of the corpus, in turn, determines the kinds of research questions that can be addressed and the generalizability of the results of the research. For example, a corpus composed primarily of news reportage would not allow a general investigation of variation in English. Similarly, research based on a corpus containing a single type of conversation – such as conversations between teenagers – could not be generalized to conversation overall. Thus, whether you are designing a corpus of your own, choosing a corpus to use in a study, or reading others' corpus-based work, issues of representativeness in corpus design are crucial.
It is important to realize up front that representing a language – or even part of a language – is a problematic task. We do not know the full extent of variation in languages or all the contextual variables that need to be covered in order to capture all variation in texts. However, attention to certain issues will ensure that a corpus is as representative as possible, given our current knowledge of language. This methodology box introduces these issues, as well as some means for improving corpus design in the future.
- Type: Chapter
- Information: Corpus Linguistics: Investigating Language Structure and Use, pp. 246-250. Publisher: Cambridge University Press. Print publication year: 1998