Contrary to traditional thought in linguistics and editing, recent studies using corpus-based evidence suggest that historical English usage patterns influenced prescriptive usage manuals’ guidelines more than the other way around. To explore the modern relationship between English language prescriptions and usage, this study focuses on the wide-reaching genre of written online news and the topic of gender-fair language. It compares changes regarding gender-specific titles in the Associated Press's stylebooks to actual usage trends as documented by the News on the Web (NOW) corpus. Results from NOW show -man title variants as the dominant form in the early 2010s, consistent with AP style at that time. However, many gender-neutral (including -person) variants saw rapid uptake in usage in the mid-2010s to become the most frequent forms by 2021, in contrast to AP guidelines that only started listing -person and other neutral forms as ‘acceptable’ around 2017 and as the prescribed forms more recently. These results indicate both an increased cultural consciousness of changing gender equity standards and a willingness of many news writers, editors, and publishers to defer to culturally significant language trends even if authoritative guides do not yet endorse them.
This inductive examination of the topics in the public administration literature using computational social science and corpus linguistics (17 journals, N=12,760 articles, 1991–2019) reveals a new landscape of public administration topics, changes in topics over time and their distribution. Topic modelling of the whole corpus identifies 50 topics; the top ten, which include health care, federal government, performance management, environmental regulation, HRM and networks, account for just over a third of scholarship between 1991 and 2019. Focal topics in individual journals show similarities with popular topics in the whole corpus (networks, health care, HRM) and also include less frequently examined topics such as gender and diversity and partnerships. Analysis of topics over time shows a substantial shift from a country and practice focus in the early stages of the study period to concepts such as governance, networks and citizens in the late stages (2015–2019).
The English Preposing in PP construction (PiPP; e.g., Happy though/as we were) is extremely rare but displays an intricate set of stable syntactic properties. How do people become proficient with this construction despite such limited evidence? It is tempting to posit innate learning mechanisms, but present-day large language models seem to learn to represent PiPPs as well, even though such models employ only very general learning mechanisms and experience very few instances of the construction during training. This suggests an alternative hypothesis on which knowledge of more frequent constructions helps shape knowledge of PiPPs. I seek to make this idea precise using model-theoretic syntax (MTS). In MTS, a grammar is essentially a set of constraints on forms. In this context, PiPPs can be seen as arising from a mix of construction-specific and general-purpose constraints, all of which seem inferable from general linguistic experience.
We investigate a class of adjective phrases composed of a deadjectival adverb ending in -ly and an adjective head (e.g. staggeringly incompetent, absolutely terrific, fiscally responsible), a compact construction whereby two adjectives may jointly contribute to evaluative meaning. Using corpus methodologies on more than 1 million examples and relying on semantic analyses of about 1,000 instances, we propose that the construction can be divided into different semantic subtypes, including Degree (deeply disturbing), Focus (utterly ridiculous), Manner (delightfully performed), Reaction (strangely compelling), Topical (historically inaccurate) and Epistemic (intuitively obvious), among others. Using this typology, we investigate the relative distribution of each subtype across several registers of written English. We find a high frequency of the Reaction subtype in book, film and art reviews, and we suggest a discourse-functional explanation for this, linked to the perceived value of originality in expressive writing. This investigation reveals the power of semantically informed corpus methodologies to shed light on the distribution of specific constructions.
This Element offers intermediate or experienced programmers algorithms for Corpus Linguistic (CL) programming in the Python language using dataframes that provide a fast, efficient, intuitive set of methods for working with large, complex datasets such as corpora. This Element demonstrates principles of dataframe programming applied to CL analyses, as well as complete algorithms for creating concordances; producing lists of collocates, keywords, and lexical bundles; and performing key feature analysis. An additional algorithm for creating dataframe corpora is presented including methods for tokenizing, part-of-speech tagging, and lemmatizing using spaCy. This Element provides a set of core skills that can be applied to a range of CL research questions, as well as to original analyses not possible with existing corpus software.
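As a rough illustration of the kind of concordance algorithm this abstract mentions, the following is a minimal sketch and not the Element's own code. It assumes a corpus stored as one token per row, here approximated with plain Python lists and dictionaries as a stand-in for a dataframe; the function names `tokenize` and `concordance`, the sample sentence, and the window size are all hypothetical.

```python
def tokenize(text):
    """Split on whitespace, strip surrounding punctuation, and lowercase.

    This is a deliberate simplification; the abstract mentions spaCy for
    real tokenization, which is not used here.
    """
    tokens = []
    for raw in text.split():
        tok = raw.strip(".,;:!?\"'()").lower()
        if tok:
            tokens.append(tok)
    return tokens


def concordance(tokens, node, window=3):
    """Return one keyword-in-context row per occurrence of `node`.

    Each row is a dict with the token position, up to `window` tokens of
    left context, the node word, and up to `window` tokens of right context.
    """
    rows = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            rows.append({
                "position": i,
                "left": " ".join(left),
                "node": tok,
                "right": " ".join(right),
            })
    return rows


# Invented sample text, for illustration only.
sample = "The cat sat on the mat. The dog saw the cat on the mat."
rows = concordance(tokenize(sample), "cat", window=2)
for row in rows:
    print(f'{row["left"]:>12} [{row["node"]}] {row["right"]}')
```

A list of row dictionaries like this can be passed directly to a dataframe constructor in a library such as pandas, after which sorting and filtering on the context columns become column operations.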
This chapter discusses linguistic variation in Slavic languages by presenting an overview of the relationship between human communication in the society and the corresponding linguistic features. In this chapter we focus on the parameters of variation according to the language user, such as age or dialects, and according to the language use, such as communicative functions or communication styles, e.g. politeness. We cite both qualitative and quantitative methods for studying aspects of sociolinguistic variation. Examples are drawn from large corpora of two Slavic languages, Russian and Serbo-Croatian, with a particular focus on academic writing, news reporting, and reporting personal experience in social media, as well as from dictionaries and field studies.
This chapter surveys the history and main directions of natural language processing research in general, and for Slavic languages in particular. The field has grown enormously since its beginning. Especially since 2010, the amount of digital texts has been rapidly growing; furthermore, research has yielded an ever-greater number of highly usable applications. This is reflected in the increasing number and attendance of NLP conferences and workshops. Slavic countries are no exception; several have been organising international conferences for decades, and their proceedings are the best place to find publications on Slavic NLP research. The general trend of the evolution of NLP is difficult to predict. It is certain that deep learning, including various new types (e.g. contextual, multilingual) of word embeddings and similar ‘deep’ models will play an increasing role, while predictions also mention the increasing importance of the Universal Dependencies framework and treebanks and research into the theory, not only the practice, of deep learning, coupled with attempts at achieving better explainability of the resulting models.
The introduction to this volume describes its content. It also provides the rationale for including selected topics and provides comments on the manner of presentation adopted in this volume.
One of the questions that still surrounds the history of auxiliary do is what function it had during the Middle English period (c.1100–1500). Scholars have put forward different hypotheses, suggesting that it could serve, among others, as a perfective marker (Denison 1985), agentive marker (Ecay 2015) and habitual marker (Garrett 1998). The present article reports on a quantitative study that aims to shed further light on this issue. By means of a collexeme analysis, this article investigates the semantic features of the infinitives that occur with auxiliary do in several Middle English corpora. The results show that auxiliary do was not connected to verbs with specific semantic profiles, but it was employed in different contexts and had various functions. Specifically, the data suggest that auxiliary do was used (i) as an accommodation tool to facilitate the use of low-frequency verbs, particularly of French origin, and (ii) as an aspectual particle to mark both perfectivity and habituality. It is argued that the multifunctionality of auxiliary do in Middle English played a crucial role in the preservation of the construction before it spread to the NICE (i.e. negation, inversion, code and emphasis) environments.
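For readers unfamiliar with collexeme analysis, the sketch below shows one common way an association score between a verb and a construction can be computed. The abstract does not say which association measure the article uses; a one-tailed Fisher exact test on a 2×2 contingency table is assumed here because it is a widely used choice for this method. The function name and all counts are hypothetical and invented purely for illustration.

```python
from math import comb


def fisher_attraction_p(a, b, c, d):
    """One-tailed Fisher exact p-value for attraction.

    The 2x2 table is laid out as:
        a = target verb inside the construction
        b = target verb outside the construction
        c = other verbs inside the construction
        d = other verbs outside the construction
    The result is the hypergeometric probability of seeing `a` or more
    co-occurrences if verb and construction were independent.
    """
    verb_total = a + b            # all occurrences of the target verb
    construction_total = a + c    # all occurrences of the construction
    n = a + b + c + d             # all verb tokens in the corpus
    upper = min(verb_total, construction_total)
    numerator = sum(
        comb(verb_total, k) * comb(n - verb_total, construction_total - k)
        for k in range(a, upper + 1)
    )
    return numerator / comb(n, construction_total)


# Invented counts, not taken from any Middle English corpus.
p = fisher_attraction_p(a=3, b=1, c=1, d=3)
print(f"p = {p:.4f}")
```

Smaller p-values indicate stronger attraction between the verb and the construction; in practice the values are often log-transformed and used to rank verbs.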
The linguistic study of the Slavic language family, with its rich syntactic and phonological structures, complex writing systems, and diverse socio-historical context, is a rapidly growing research area. Bringing together contributions from an international team of authors, this Handbook provides a systematic review of cutting-edge research in Slavic linguistics. It covers phonetics and phonology, morphology and syntax, lexicology, and sociolinguistics, and presents multiple theoretical perspectives, including synchronic and diachronic. Each chapter addresses a particular linguistic feature pertinent to Slavic languages, and covers the development of the feature from Proto-Slavic to present-day Slavic languages, the main findings in historical and ongoing research devoted to the feature, and a summary of the current state of the art in the field and what the directions of future research will be. Comprehensive yet accessible, it is essential reading for academic researchers and students in theoretical linguistics, linguistic typology, sociolinguistics and Slavic/East European Studies.
Formularity, or the poet’s reliance on prefabricated linguistic features in the composition of his verses, has been the most debated feature of Oral-Formulaic Theory. This chapter reviews the history of Homeric formularity (Part 1), while introducing new key insights from the fields of linguistics (esp. usage-based linguistics, corpus linguistics, and language acquisition studies) and the cognitive sciences (Parts 2-5). Parts 2-3 argue that formularity is a general feature of human language and cognition. Homer’s formularity is quantitatively notable, however, in that it involves sequences that are particularly long when compared to repeated sequences in corpora of both contemporary written or spoken English and ancient prose and hexameter authors. This is interpreted as a sign of Homer’s extreme mastery of his medium, which was arguably necessitated by the oral-improvisational nature of the task. Part 4 develops a new theory of Homeric formularity, borrowing insights from connectionism, lexical priming, and construction grammar, and introduces fine-grained distinctions between conceptual associations, collocations, constructions, metrical constructions and structural formulas.
The chapter considers gesture studies in relation to corpus linguistic work. The focus is on the Multimedia Russian Corpus (MURCO), part of the Russian National Corpus. The chapter includes a brief biography of the creator of this corpus, Elena Grishina. The compilation of the corpus out of a set of Russian classic feature films and recorded lectures and the methods of annotating it are described in detail. The gesture coding is not limited to manual/hand gestures, but also includes head gestures and use of eye gaze. The chapter considers the findings from the corpus as reported in Grishina’s posthumously published volume on Russian gestures from a linguistic point of view. The categories include pointing gestures, representational gestures, auxiliary (discourse-structuring) gestures, and several cross-cutting categories, including gestures in relation to pragmatics and to grammatical categories, like verbal aspect. Additional consideration is given to other video corpora in English (and other languages) which are being used for gesture research, namely the UCLA NewsScape library being managed by the Red Hen Lab and the Television Archive.
The chapter introduces the material used for the study, that is, the Old Bailey Corpus (OBC) as well as the Old Bailey Online resource and the Proceedings that the OBC is based on. The analytical frameworks adopted are also discussed, comprising the corpus-linguistic approach, historical sociopragmatics, language variation and change, and the grammaticalization and pragmatic-semantic change frameworks. Attention is also paid to the late modern courtroom and to the issues of relevance to the study of past spoken interaction based on written records.
This is the main methodology and first-results chapter. It opens with an introduction to the lexeme-based approach used for the investigation, contrasting this to previous, variationist approaches. The chapter proceeds to explain the data retrieval and screening processes and presents an overview of the data, the nearly 65,000 intensifier tokens found in the corpus, across the three main categories (maximizers, boosters, downtoners), and the descriptive results across time for the most frequent items. The word counts of the different sociopragmatic groups of speakers (divided by speakers’ role in the courtroom, gender and social class) are introduced, as well as the diachronic distribution of intensifiers across the genders and social classes. Results are presented within the descriptive statistics framework, but the chapter also briefly introduces the regression model, or the inferential, multivariate statistical method to be used in Chapters 8–11 to disentangle the complex interplay of the sociopragmatic variables of speakers on the use of intensifiers.
This article sheds light on the significant yet nuanced roles of shame and guilt in influencing moral behaviour, a phenomenon that became particularly prominent during the COVID-19 pandemic with the community’s heightened desire to be seen as moral. These emotions are central to human interactions, and the question of how they are conveyed linguistically is a vast and important one. Our study contributes to this area by analysing the discourses around shame and guilt in English and Japanese online forums, focusing on the terms shame, guilt, haji (‘shame’) and zaiakukan (‘guilt’). We utilise a mix of corpus-based methods and natural language processing tools, including word embeddings, to examine the contexts of these emotion terms and identify semantically similar expressions. Our findings indicate both overlaps and distinct differences in the semantic landscapes of shame and guilt within and across the two languages, highlighting nuanced ways in which these emotions are expressed and distinguished. This investigation provides insights into the complex dynamics between emotion words and the internal states they denote, suggesting avenues for further research in this linguistically rich area.
Chapter 5 provides a statistical analysis of the legal rulings that comprise the book’s corpus. The analysis looks at the possible relationships between the variables that make up the corpus, including types of invocations for counsel, types of suspects (e.g., juveniles, L2 speakers), and the judges’ presidential appointments, among an array of other legal and linguistic factors, and the judges’ rulings on the legal standing of the suspects’ invocations for counsel. To frame the discussion and understand the seeming disconnect between suspects’ invocations for counsel and the application of the law, the chapter provides a description of the corpus, the entry and selection of variables, and the research questions posed. The findings of the analysis provide further evidence of the effect of judicial rulings on the suspects/defendants’ legal journey. Given the potential significant role of suspects’ statements in the conviction of a crime, this chapter also includes a discussion on whether a ruling that suppresses such statements is enough to reverse a lower court’s ruling on the use of such statements and/or its content in court.
Homophony avoidance has often been claimed to be a mechanism of language change. We investigate this mechanism in Dutch by applying two strands of research – corpus studies and experimental data – to find support for claims based on earlier historical observations. Throughout the history of Dutch, homophony avoidance has been named as the cause of language change or inhibition of change on several occasions. We build on these historical observations with an experimental study and a corpus study on a synchronic Dutch alternation, where avoidance of homophony between present and past tense can appear. Plurals of verbs with a stem ending in a dental show homophony with the present when they are used in the preterite (compare zetten ‘put’ pst-pl with zetten ‘put’ prs-pl). This homophony can be avoided by using the perfectum (hebben gezet ‘have put’). A wug-style experiment shows that verbs with dental stem are indeed used significantly more in the perfectum in the plural than in the singular, while verbs without dental stem do not show this difference. A corpus study on Dutch further corroborates these results. Combined, these studies make a strong case for homophony avoidance as a plausible mechanism of language change.
A comprehensive description of the combination of the finite auxiliary verbs wērden ‘become’/wēsen ‘be’ and a present participle in Middle Low German is still a strong desideratum. This study presents a corpus-based analysis of the aforementioned phenomenon with a special focus on its grammatical structure and its different meanings. In particular, it focuses on a wide range of temporal and aspectual meanings, depending on the auxiliary verb, its tense and mood. Moreover, the relationship between the semantics of the main verb and the meaning of the whole construction is investigated. Finally, the competition with alternative verbal complex constructions expressing the same meaning is also explored. The analysis is carried out on the basis of Middle Low German texts from different times, language areas, and genres.
The paper presents a detailed corpus-based analysis of the German prospective stehen vor NP light verb construction. The starting point of the analysis is the claim that the construction is restricted to change-of-state nouns in the NP-internal position (Fleischhauer & Gamerschlag 2019, Fleischhauer et al. 2019). Based on corpus data, I demonstrate that although the construction shows a strong preference for such nouns, other semantic types of nouns (such as state nouns or process nouns) occur in the construction as well. I argue that process nouns in particular require contextual support to be licensed within the construction. In the paper, I present an analysis of the prospective light verb construction in terms of current relevance. This analysis accounts for the observed preference for change-of-state NP-internal nouns as well as for the need to provide contextual support for process nouns. The notion current relevance is frequently employed in the analysis of the perfect aspect; the current paper represents the first attempt to extend this notion to the prospective aspect.
English fulfils important intra- and international functions in 21st century India. However, the country's size in terms of area, population, and linguistic diversity means that completely uniform developments in Indian English (IndE) are unlikely. Using sophisticated corpus-linguistic and statistical methods, this Element explores the unity and diversity of IndE by providing studies of selected lexical and morphosyntactic features that characterise Indian English(es) in the 21st century. The findings indicate a degree of incipient 'supralocalisation', i.e. a spread of features beyond their place of origin, cutting through the typological Indo-Aryan vs. Dravidian divide.