1 Introduction
The goal of this special issue is to make different theoretical proposals concrete enough to provide testable predictions. If those predictions are then borne out, the proposal is supported; if not, the proposal isn't. Generating precise, testable predictions for theories is something I deeply support, and computational cognitive modeling (a methodology I use most often in my own work) provides one way to do exactly this (e.g., see Pearl, in press for more detailed discussion on this point).
Here, I've been asked to represent the perspective of proposals that involve both Universal Grammar (UG) and statistics (so I'll refer to them as UG+stats proposals). In nearly every case study I'll present, the UG+stats proposals I'm aware of are implemented concretely in a computational cognitive model. Why the focus on computational cognitive models? It's often hard to pin down a specific prediction that a UG+stats proposal makes without a concrete model that uses the proposed UG knowledge and implements a specific learning strategy relying on the proposed statistics. When we have a computational cognitive model, we can generate predictions about children's behavior that are precise enough to evaluate with empirical data that either already exist or can be obtained in the future.
So, computational cognitive modeling offers a way to implement a UG+stats developmental theory, which is typically a theory of both the linguistic representations the child is learning (this is usually the UG part) and the acquisition process the child undergoes (this is usually the statistics part). The computational model then becomes a “proof of concept” for the developmental theory, as implemented by that model (see Pearl, 2014; Pearl, in press for more detailed discussion about this). This is in fact why an effective way to evaluate a UG+stats theory (or really, any developmental theory) is to implement it in a computational cognitive model; implementing the model involves (i) embedding the relevant prior knowledge and learning mechanisms proposed for the child in the model, (ii) giving the modeled child realistic input to learn from, and (iii) generating output predictions from that modeled child that connect in some interpretable way to children's behavior. This is the approach that the proposals reviewed here have generally taken for investigating how children learn morphology and syntax.
Also, it's likely I've been asked to represent the UG+stats perspective because I often work on learning problems where UG representations are combined with statistical learning in some form. This is because I think UG approaches to development can often greatly benefit from integrating statistical learning approaches (see Pearl, in press for more detailed discussion on this point). However, for some of the case studies in the development of morphology and syntax that will be discussed here, I don't necessarily agree that the UG+stats proposals I'm aware of are the best approaches. As relevant, I'll briefly note the caveats I have for the UG+stats proposals discussed.
In the remainder of this article, I'll first briefly review what UG is meant to be and why UG has traditionally been part of many developmental theories of language. I'll then discuss some common statistical learning approaches that are often part of UG+stats proposals. I'll then turn to specific morphology and syntax phenomena, including aspects of core syntax and morphology, as well as more complex phenomena, and how UG+stats proposals account for each (or don't yet). More specifically, for each phenomenon, I'll first discuss specific UG+stats proposals for learning it, including a brief overview of both the UG part and the statistics part. I'll then present the predictions that the UG+stats proposals make, aiming to specify at least one prediction that would support a specific proposal and one that would undermine it. I'll then discuss whether the proposal predictions hold up, don't hold up, or if we just don't know yet. In cases where we don't know yet, we have a clear path forward for fruitful future avenues of behavioral research (namely, studies that would test specific proposal predictions). I'll conclude with a brief summary of where we are when it comes to how well UG+stats proposals help us understand the development of morphology and syntax.
2 UG + statistics
2.1 The UG part
A key motivation for UG has always been developmental: UG could help children acquire the linguistic knowledge that they do as quickly as they do from the data that's available to them (Chomsky, 1981; Jackendoff, 1994; Laurence & Margolis, 2001; Crain & Pietroski, 2002). That is, UG would allow children to solve what's been called the Poverty of the Stimulus (see Pearl, 2020 for a recent review), where the available data often seem inadequate for pinpointing the right linguistic knowledge as efficiently as children seem to. So, without some internal bias, children wouldn't succeed at language acquisition. UG is then a proposal for what that internal bias could be that enables language acquisition to in fact succeed.
Typically, a UG proposal would provide a way to structure the child's hypothesis space with respect to a specific piece of linguistic knowledge – that is, UG can help define what explicit linguistic hypotheses are considered, and what building blocks allow children to construct those explicit hypotheses for consideration. For instance, traditional linguistic parameters (Chomsky, 1981, 1986) are building blocks that children can construct their linguistic system from. So, a language's system would be described by a specific collection of parameter values for these linguistic parameters. Having these parameter building blocks then allows a child to construct and consider explicit hypotheses about a language's system as she encounters her language's data. In some of the phenomena we'll discuss below (basic word order, lack of inflection, movement), linguistic parameters supplied by UG allow the child to construct a constrained set of possible hypotheses to navigate through, given her input.
More generally, a working definition of UG is that it's anything that's both innate and language-specific (Pearl, 2020; Pearl, in press). So, linguistic parameters fit this definition because they would be innate knowledge and they're only used for learning language. In the specific linguistic phenomena reviewed in this article, we'll see a variety of examples of UG knowledge, as relevant for morphology and syntax.
2.2 The statistics part
In UG+stats proposals, the statistics part refers to statistical learning. That is, on the basis of the statistics of her input, the child is learning something. One reason that statistical learning can work so well in combination with UG is that statistical learning is often used to navigate through a hypothesis space to identify the correct hypothesis for the language. Because UG can provide a hypothesis space to the child, statistical learning can then naturally complement UG proposals to language development.
How does this work exactly? At its core, statistical learning is about counting things (this is the “statistical” part), and updating hypotheses on the basis of those counts (this is the “learning” part, sometimes also called inference (Pearl, in press)). Counting things is a domain-general ability, because we can count lots of different things, both linguistic and non-linguistic (even as babies: Saffran, Aslin & Newport, 1996; Aslin, Saffran & Newport, 1998; Saffran, Johnson, Aslin & Newport, 1999; Fiser & Aslin, 2002; Kirkham, Slemmer & Johnson, 2002; Wu, Gopnik, Richardson & Kirkham, 2011; Stahl, Romberg, Roseberry, Golinkoff & Hirsh-Pasek, 2014; Ferry et al., 2016; Aslin, 2017; Fló et al., 2019). These counts can then be converted into probabilities – for example, seeing something 3 times out of 10 yields a probability of $\frac{3}{10} = 0.30$. Then, things with higher probabilities can be interpreted as more likely than things with lower probabilities.
So, to effectively use statistical learning, a child has to know what to count. UG can identify what to count, because UG defines the hypothesis space. This means that the relevant things to count are the relevant things for determining which hypothesis in the hypothesis space is the right one for the language. For language acquisition, the relevant things are typically linguistic things (though sometimes non-linguistic things might be relevant to count too, depending on what the child's trying to learn). Importantly, the statistical learning mechanism itself doesn't seem to change – once the child knows the units over which inference is operating, counts of the relevant units are collected and inference can operate. In the rest of this subsection, I'll briefly review some common approaches to doing inference over collected counts: Bayesian inference, reinforcement learning, and the Tolerance & Sufficiency Principles (for a more comprehensive overview of each, see Pearl, in press). Table 1 summarizes which inference mechanisms are used by particular UG+stats proposals for the different morphology and syntax phenomena discussed in the rest of this article.
2.2.1 Bayesian inference
Bayesian inference operates over probabilities (as mentioned above, probabilities can be derived from counts). This inference mechanism involves both prior assumptions about the probability of different hypotheses and an estimation of how well a given hypothesis fits the data. A Bayesian model assumes the learner (for our purposes, the modeled child) has some space of hypotheses H, each of which represents a possible explanation for how the data D in the relevant part of the child's input were generated. For example, a UG+stats modeled child relying on a linguistic parameter to determine if her language has wh-movement might consider both a +wh-movement option and a -wh-movement option as two hypotheses (+wh-movement, -wh-movement ∈ H); the data might be the collection of questions in the child's input involving wh-words ({What did Jack climb?, Jack climbed what?!, …} ∈ D).
Given D, the modeled child's goal is to determine the posterior probability of each possible hypothesis h ∈ H, written as P(h|D). This is calculated via Bayes’ Theorem as shown in (1).
(1) $P(h \vert D) = \frac{P(D \vert h) \ast P(h)}{P(D)} = \frac{P(D \vert h) \ast P(h)}{\sum_{h' \in \mathrm{H}} P(D \vert h') \ast P(h')} \propto P(D \vert h) \ast P(h)$
In the numerator, P(D|h) represents the likelihood of the data D given hypothesis h, and describes how compatible that hypothesis is with the data. Hypotheses with a poor fit to the data (e.g., the -wh-movement hypothesis for a dataset where 30% of the data are compatible only with +wh-movement) have a lower likelihood; hypotheses with a good fit to the data have a higher likelihood.
P(h) represents the prior probability of the hypothesis. Intuitively, this corresponds to how plausible the hypothesis is, irrespective of any data. This is often where considerations about the complexity of the hypothesis will be implemented (e.g., considerations of simplicity or economy, such as those included in the grammar evaluation metrics of Chomsky, 1965, and those explicitly implemented in Perfors, Tenenbaum & Regier, 2011 and Piantadosi, Tenenbaum & Goodman, 2012). So, for example, more complex hypotheses will typically have lower prior probabilities. A hypothesis's prior is something that could be specified by UG – but all that matters is that the prior is specified beforehand somehow, wherever it comes from.
The likelihood and prior make up the numerator of the posterior calculation, while the denominator consists of the normalizing factor P(D), which is the probability of the data under any hypothesis. Mathematically, this is the summation of the likelihood * prior for all possible hypotheses in H, and ensures that all the hypothesis posteriors sum to 1. Notably, because we often only care about how one hypothesis compares to another (e.g., is +wh-movement or -wh-movement more probable after seeing the data D?), calculating P(D) can be skipped over and the numerator alone used (hence, the ∝ in (1)).
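To make the calculation in (1) concrete, here's a minimal sketch in Python of a modeled child comparing the +wh-movement and -wh-movement hypotheses; the priors, likelihood values, and toy data counts are invented purely for illustration, not drawn from any actual input analysis.

```python
# A minimal sketch of Bayesian inference over two word-order hypotheses.
# All counts, priors, and likelihood values here are hypothetical.

def posterior(hypotheses, data):
    """Return the normalized posterior P(h|D) for each hypothesis h.

    hypotheses: dict mapping hypothesis name -> (prior, likelihood_fn)
    data: list of data points
    """
    unnormalized = {}
    for name, (prior, likelihood_fn) in hypotheses.items():
        likelihood = 1.0
        for d in data:
            likelihood *= likelihood_fn(d)   # P(D|h) as a product over data points
        unnormalized[name] = prior * likelihood
    total = sum(unnormalized.values())       # P(D), the normalizing factor
    return {name: p / total for name, p in unnormalized.items()}

# Hypothetical likelihoods: +wh-movement can generate both moved and in-situ
# questions (echo questions), while -wh-movement can only generate in-situ ones.
hypotheses = {
    "+wh-movement": (0.5, lambda d: 0.9 if d == "moved" else 0.1),
    "-wh-movement": (0.5, lambda d: 0.0 if d == "moved" else 1.0),
}

# A toy dataset: 3 moved wh-questions ("What did Jack climb?") and
# 7 in-situ questions ("Jack climbed what?!").
data = ["moved"] * 3 + ["in-situ"] * 7

print(posterior(hypotheses, data))  # +wh-movement gets all the posterior mass
```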
From a developmental perspective, there's a considerable body of evidence suggesting that young children are capable of Bayesian inference (3 years: Xu & Tenenbaum, 2007; 9 months: Gerken, 2006; Dewar & Xu, 2010; Gerken, 2010; 6 months: Denison, Reed & Xu, 2011, among many others). Given this, Bayesian inference seems a plausible statistical learning mechanism for language acquisition.
2.2.2 Reinforcement learning
Reinforcement learning also operates over probabilities and is a principled way to update the probability of a categorical option which is in competition with other categorical options (see Sutton & Barto, 2018 for a recent overview). For example, with a wh-movement linguistic parameter, a child might consider both a +wh-movement and a -wh-movement option. A common implementation used by UG+stats proposals is the linear reward-penalty scheme (Bush & Mosteller, 1951). As the name suggests, there are two choices when a data point is processed – either the categorical option under consideration is rewarded or it's penalized. This translates to the option's current probability being increased (rewarded) or decreased (penalized). For instance, if the +wh-movement option is under consideration, and it's compatible with the current data point (like What's Jack climbing _what?), the +wh-movement option is rewarded and its probability is increased. In contrast, if that same option is under consideration, but it's not compatible with the current data point (e.g., an echo question like Jack's climbing what?!), the +wh-movement option is penalized and its probability is decreased.
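To illustrate, here's a minimal sketch of the linear reward-penalty update for the +wh-movement option; the learning rate and the toy data stream are hypothetical choices, not values from any particular proposal.

```python
# A minimal sketch of the linear reward-penalty scheme (Bush & Mosteller, 1951)
# for one option (+wh-movement). The learning rate and data stream are hypothetical.

GAMMA = 0.02  # learning rate (illustrative value)

def reward(p):
    """Increase the probability of the option under consideration."""
    return p + GAMMA * (1 - p)

def penalize(p):
    """Decrease the probability of the option under consideration."""
    return (1 - GAMMA) * p

# A toy input stream: mostly moved wh-questions, a few echo questions.
stream = ["What did Jack climb?"] * 9 + ["Jack's climbing what?!"] * 1

p_plus_wh = 0.5                             # initial probability of +wh-movement
for datum in stream * 10:                   # 100 data points
    if datum.startswith("What"):
        p_plus_wh = reward(p_plus_wh)       # moved wh-question: reward +wh-movement
    else:
        p_plus_wh = penalize(p_plus_wh)     # echo question: penalize +wh-movement

print(round(p_plus_wh, 3))                  # well above 0.5: +wh-movement is winning
```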
While applying reinforcement learning in UG approaches to language acquisition is a fairly recent innovation, reinforcement learning itself is well-supported in the child development literature more generally (sometimes under the name “operant conditioning”). In particular, we have evidence that very young children are capable of it (under 18 months: Hulsebus, 1974; 12 months: Lipsitt, Pederson & Delucia, 1966; 10 months: de Sousa, Garcia & de Alcantara Gil, 2015; 3 months: Rovee-Collier & Capatides, 1979; 10 weeks: Rovee & Rovee, 1969; Watson, 1969; among many others). So, it seems plausible that young children could use reinforcement learning for language acquisition.
2.2.3 Tolerance and Sufficiency Principles
The Tolerance and Sufficiency Principles (Yang, 2005, 2016) together describe a particular inference mechanism, and this mechanism operates over specific kinds of counts that have already been collected. More specifically, these principles together provide a formal approach for when a child would choose to adopt a “rule”, generalization, or default pattern to account for a set of items. For example, these principles can be used to determine if there's a general rule for forming the past tense in English from a verb's root form (e.g., kiss → kissed).
Both principles are based on cognitive considerations of knowledge storage and retrieval in real time, incorporating how frequently individual items occur, the absolute ranking of items by frequency, and serial memory access. The learning innovation of these principles is that they're designed for situations where there are exceptions to a potential rule. In the English past tense example above, there are certainly exceptions in the child's input: past tense forms like drank (rather than drinked) and caught (rather than catched).
So, these two principles help the child infer whether the rule is robust enough to bother with, despite the exceptions. In particular, a rule should be bothered with if it speeds up average retrieval time for any item. For instance, it's faster on average to have a past tense rule to retrieve a regular past tense form (like -ed for English). However, if the past tense is too irregular, it's not useful to have the rule: retrieving the target information (i.e., the correct past tense form) takes too long on average.
The Tolerance Principle determines how many exceptions a rule can “tolerate” in the data before it's not worthwhile for the child to have that rule at all; the Sufficiency Principle uses that tolerance threshold to determine how many rule-abiding items are “sufficient” in the data to justify having the rule. This means, of course, that the child needs to have previously counted how many items obey the potential rule and how many don't. With these counts in hand, the child can then apply the Tolerance and Sufficiency Principles to infer whether the data justify adopting the rule under consideration (or not).
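To make these thresholds concrete, here's a minimal sketch of both principles, using the standard formulation where a rule over N items tolerates at most N/ln(N) exceptions; the English past-tense counts below are hypothetical and only for illustration.

```python
# A minimal sketch of the Tolerance & Sufficiency Principles (Yang, 2005, 2016).
# theta_N = N / ln(N) is the maximum number of exceptions a rule over N items
# can tolerate; the vocabulary counts below are hypothetical.

from math import log

def tolerance_threshold(n_items):
    """Maximum number of exceptions a rule over n_items can tolerate."""
    return n_items / log(n_items)

def rule_is_productive(n_items, n_exceptions):
    """Tolerance Principle: productive iff exceptions <= N / ln(N)."""
    return n_exceptions <= tolerance_threshold(n_items)

def rule_is_adopted(n_items, n_rule_following):
    """Sufficiency Principle: adopt iff rule-followers >= N - N / ln(N)."""
    return n_rule_following >= n_items - tolerance_threshold(n_items)

# Hypothetical English past-tense vocabulary: 120 verbs, 100 take -ed, 20 are
# irregular (drank, caught, ...). theta_120 is about 25, so -ed survives.
print(rule_is_productive(120, 20))    # True: 20 exceptions <= ~25
print(rule_is_adopted(120, 100))      # True: 100 rule-followers >= ~95
```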
Together, these two principles have been used for investigating a rule, generalization, or default pattern for a variety of linguistic knowledge types (Yang, 2005; Legate & Yang, 2013; Yang, 2015; Schuler, Yang & Newport, 2016; Yang, 2016; Pearl, Lu & Haghighi, 2017; Yang, 2017; Irani, 2019; Pearl & Sprouse, 2019a). However, there isn't yet much evidence that children are capable of using the Tolerance and Sufficiency Principles – the main support comes from the study by Schuler et al. (2016), which demonstrates that 5- to 8-year-olds' behavior is consistent with children using these principles. Still, these principles seem like a promising statistical learning mechanism for UG+stats proposals, given their current success at predicting child behavior (more on this in the subsection on learning morphology in highly-inflected languages).
3 The phenomena
3.1 Core syntax: Basic syntactic categories
Foundational knowledge in any language includes the syntactic categories of the language, and which words belong in each category. For instance, how does a child learn that her language has categories like noun, verb, and determiner? For English, how would a child learn that both kitty and idea are nouns, while kissed is a verb and the is a determiner? Ambridge (2017) notes that developmental researchers who are interested in UG have largely turned their attention away from investigating how children learn basic syntactic categories. I think why that is will become clearer when we look at the predictions that can currently be generated. Relatedly, it can be tricky to tell if a proposal is really a UG+stats proposal; this is because the UG part would need to be innate, language-specific knowledge about syntactic categories, and it's not always clear that the prior knowledge assumed by a proposal is necessarily UG-type knowledge (more on this below).
3.1.1 Specific UG+stats proposals (potentially): Semantic bootstrapping
The main proposal I'm aware of that could potentially be a UG+stats proposal is semantic bootstrapping (Pinker, 1984, 1987). This proposal suggests that children have innate links between abstract syntactic categories and semantic relations (e.g., Noun ↔ name of a concrete thing). These innate links allow children to initially break into the syntactic category system, as children would expect that similar semantic relations (e.g., concrete things like ball and kitty) map to the same syntactic category (which we refer to as noun). Children would then rely on statistical learning to fine-tune which words really belong to which categories, on the basis of their input. So, children start with abstract syntactic categories, and via their input, they identify the true implementation of that category in their language (Valian, 2009, 2014). Importantly, the true implementation typically will go far beyond a specific semantic relation (e.g., the noun idea isn't a concrete thing).
Specific UG+stats proposals (potentially): The UG part
If children have innate links from innate abstract syntactic categories to certain semantic relations, then that would be UG knowledge – the child has innate knowledge that's specifically about language. However, it could be that the links emerge from the child considering the words that seem to be clustered together in her language in a particular category. That is, the child notices that the semantic relations encoded by the members of category1 (which we as adults recognize as a type of noun) seem to include a lot of concrete things. So, on the basis of that observation, the child constructs the hypothesis about the link (category1 ↔ concrete things), and uses this hypothesized link to accomplish whatever the innate link would have accomplished. Moreover, if there are innate abstract categories that are language-specific (i.e., something like noun and verb), then these too would be UG knowledge. However, it's possible that the innate knowledge about categories may not necessarily be language-specific. For example, suppose a child innately knows that there are in fact categories of some kind, but doesn't have something as specific as noun and verb in mind. Could we tell the difference between innate knowledge that category1 and category2 exist, as opposed to innate knowledge that noun and verb exist? What would that difference be? If the difference is about the links between categories (e.g., noun ↔ concrete thing), then the innate knowledge is really about the links and not the categories themselves. That is, this link could just as easily be expressed as some_category ↔ concrete thing. As we saw above, it's not clear that this link is necessarily innate, rather than something that could be derived from the child's input. So, more generally, it's not obvious that the UG part for semantic bootstrapping is necessarily UG.
Specific UG+stats proposals (potentially): The statistics part
The learning mechanism for fine-tuning a language's syntactic category implementations is distributional learning. In distributional learning, items with the same distributions (that is, appearing in the same contexts, and so for instance preceded and followed by the same elements) are perceived as the same kind of thing. The way a child might tell that two items have the same distributions is by tracking which elements precede and/or follow those items. One common implementation of distributional learning for discovering a language's syntactic categories is called Frequent Frames (Mintz, 2003, 2006; Xiao, Cai & Lee, 2006; Wang & Mintz, 2008; Chemla, Mintz, Bernal & Christophe, 2009; Erkelens, 2009; Weisleder & Waxman, 2010; Wang, Höhle, Ketrez, Küntay & Mintz, 2011; Bar-Sever & Pearl, 2016). A child using Frequent Frames tracks which items appear between two elements (e.g., two words like the_is for a noun, or two morphemes like is_ing for a verb) – this is the “frames” part. The “frequent” part is that the child tracks how often frames appear and only really pays attention to those frames that are frequent (a simple way to do this is by counting how many instances of a frame have appeared). The frequent frames then form the foundation of the language-specific syntactic categories. Under a UG+stats approach, these language-specific categories can be matched against the innate, abstract categories, based on the semantic relations they encode. For instance, the the_is frame's items may map to noun, if these items correspond to concrete objects (Mintz, 2003). However, frequent frames are also compatible with a non-UG+stats approach; in that case, the child fine-tunes her language's categories by using the frequent-frame-based categories as a starting point and noticing what semantic relations these categories encode.
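As a concrete illustration, here's a minimal sketch of frame counting over a tiny made-up corpus; the corpus and the frequency threshold are hypothetical, and a real implementation would operate over large samples of child-directed speech.

```python
# A minimal sketch of Frequent Frames (Mintz, 2003): count the word that appears
# between each pair of framing words, keep only the frequent frames, and treat
# the words sharing a frame as a candidate category. Corpus and threshold are
# hypothetical, chosen only for illustration.

from collections import defaultdict

def frequent_frames(utterances, min_count=2):
    """Map each frame (left_word, right_word) to the words appearing inside it."""
    frame_counts = defaultdict(int)
    frame_words = defaultdict(set)
    for utt in utterances:
        words = utt.lower().split()
        for left, middle, right in zip(words, words[1:], words[2:]):
            frame = (left, right)
            frame_counts[frame] += 1
            frame_words[frame].add(middle)
    return {frame: members for frame, members in frame_words.items()
            if frame_counts[frame] >= min_count}

corpus = [
    "the kitty is sleeping",
    "the ball is rolling",
    "the penguin is swimming",
]
print(frequent_frames(corpus))
# {('the', 'is'): {'kitty', 'ball', 'penguin'}} -- a candidate noun-like category
```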
3.1.2 Predictions made
A lot of syntactic categorization research is about when children seem to demonstrate knowledge of different syntactic categories in their language (Valian, 1986; Capdevila i Batet & Llinàs i Grau, 1995; Pine & Martindale, 1996; Pine & Lieven, 1997; Tomasello, 2000; Fisher, 2002; Tomasello & Abbot-Smith, 2002; Booth & Waxman, 2003; Tomasello, 2004; Kemp, Lieven & Tomasello, 2005; Rowland & Theakston, 2009; Theakston & Rowland, 2009; Tomasello & Brandt, 2009; Valian, Solt & Stewart, 2009; Yang, 2011; Shin, 2012; Pine, Freudenthal, Krajewski & Gobet, 2013; Theakston, Ibbotson, Freudenthal, Lieven & Tomasello, 2015; Ambridge, 2017; Meylan, Frank, Roy & Levy, 2017; Bates, Pearl & Braunwald, 2018). In general, very early knowledge of language-specific syntactic categories has been tacitly taken as a signal that children rely on innate (UG) knowledge to achieve that level of linguistic development so early. That is, from a UG perspective, the assumption has been that innate knowledge of abstract syntactic categories and links from those categories to semantic relations should speed up the development of language-specific categories. So, when children seem to converge on language-specific syntactic categories very early (say, before age two), this has been interpreted as evidence for UG knowledge.
However, it's difficult to be sure about this interpretation without knowing what developmental trajectory we expect with vs. without the abstract category and linking knowledge. That is, how can we know that children's acquisition of syntactic categories is faster than it should have been if they didn't have this innate knowledge? For instance, it's not clear we have precise predictions about how long it should take children to identify their language-specific noun category if they did in fact have abstract knowledge of noun and linking rules like noun ↔ concrete object. (We could assume children used something like Frequent Frames to create language-specific clusters of words and then mapped those clusters to abstract categories on the basis of the number of concrete objects named by the words in any given cluster.) Similarly, it's not clear we have precise predictions for how long it should take children if they didn't in fact have that innate knowledge, but used Frequent Frames to create language-specific clusters and then identified that some clusters seemed to have a lot of words that named concrete objects.
One option for generating these kinds of precise predictions that map to specific ages of acquisition is to use an information-theoretic analysis, like the Minimum Description Length (MDL) approach leveraged by Chater and colleagues for syntactic rule acquisition (Hsu & Chater, 2010; Hsu, Chater & Vitányi, 2011, 2013; Chater, Clark, Goldsmith & Perfors, 2015). In essence, MDL quantifies how much space it takes to store information, with preference given to more compact storage options (see Pearl, 2020 for a more detailed discussion of the MDL approach). For language acquisition, the information that needs to be stored is both the child's internal representation of some knowledge (like syntactic categories) and the data the child encounters, as encoded by using that representation. So, more complex representations (e.g., involving abstract categories and linking rules) may not be very compact compared to simpler representations (e.g., not involving either abstract categories or linking rules). However, as the child encounters data from her input, she encodes the data using the representation she has available – and a more complex representation may offer some storage savings on the incoming data, compared to a simpler representation. Over time, as the child encounters more data, those storage savings add up and can yield a “breakeven” point, where the more complex representation and the input data encoded so far take up less space than the simpler representation and the input data encoded so far. That breakeven point can be mapped to a specific age of acquisition, based on how frequently the child hears the data that the representation is encoding. I should note that I don't have a firm idea of how exactly to implement this for the problem of syntactic category representations. However, this approach seems like a promising avenue to explore if we want to try to generate precise predictions about expected ages of acquisition with vs. without UG knowledge. These expected ages could then be matched against observed ages of acquisition for different language-specific syntactic categories.
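Since I don't have a firm implementation in mind for syntactic categories, the sketch below only illustrates the breakeven logic itself: the storage costs for a "simpler" and a "more complex" representation are made up, and the point is just that the complex representation can win once enough data have been encoded.

```python
# A purely illustrative sketch of the MDL "breakeven" logic described above.
# The per-representation costs and per-data-point encoding costs are made up.

def total_description_length(grammar_cost, per_datum_cost, n_data):
    """Space for the representation plus space for n_data points encoded with it."""
    return grammar_cost + per_datum_cost * n_data

# Hypothetical costs (in bits): the richer representation costs more to state,
# but encodes each incoming data point a little more compactly.
SIMPLE = dict(grammar_cost=50, per_datum_cost=4.0)
COMPLEX = dict(grammar_cost=500, per_datum_cost=3.5)

n = 0
while total_description_length(n_data=n, **COMPLEX) >= \
      total_description_length(n_data=n, **SIMPLE):
    n += 1
print(n)  # breakeven: data points needed before the complex representation wins (here, 901)
```

Mapping such a breakeven count to an age then just requires an estimate of how many relevant data points a child hears per unit time.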
3.1.3 Prediction evaluation
As mentioned above, the basic problem of what we're predicting hasn't yet been solved, at least with respect to the expected age of acquisition. So, it hasn't yet been possible to really evaluate UG+stats proposals against the available data on age of acquisition. This may be why UG-friendly researchers haven't spent as much energy on this area of linguistic development. I think it's still very worthwhile to understand the learning strategies that are capable of yielding language-specific adult syntactic category knowledge. But, this area is less interesting to researchers specifically interested in UG approaches to language development.
3.2 Core syntax: Basic word order
Another type of core syntactic knowledge is the basic canonical word order of languages that have (relatively) fixed word order. For example, English is canonically a Subject-Verb-Object (SVO) language, which is why the default way to express the idea that Lily likes penguins is Lily$_{Subject}$ likes$_{Verb}$ penguins$_{Object}$. In contrast, German has a canonical word order of Subject-Object-Verb (SOV); so, we might reasonably think that the way to express that same idea in German is Lily$_{Subject}$ Pinguine$_{Object}$ liebt$_{Verb}$. But, this isn't quite right, because in main clauses, another syntactic operation occurs called Verb-second (V2) movement, where the Verb moves to the second position in the clause and something else (like the Subject or Object) moves to the first position. This is why we're likely to hear either Lily$_{Subject}$ liebt$_{Verb}$ Pinguine$_{Object}$ or Pinguine$_{Object}$ liebt$_{Verb}$ Lily$_{Subject}$ to express the idea that Lily likes penguins, but not the canonical SOV order. More specifically, these two utterances have a structure something like what's in (2c-i) and (2c-ii), where _element represents the underlying position of the linguistic element:
(2) V2 movement with an underlying SOV canonical word order in German
These kinds of complications, where multiple syntactic operations may be active, can make uncovering the canonical word order for a language difficult. For instance, if a child encounters an SVO utterance, and she doesn't know whether she's learning English or German, the canonical word order for her language could either be SVO (English, no V2 movement) or SOV (German, with V2 movement). This kind of ambiguity (and far more) is what children face when trying to identify the basic word order of their language.
3.2.1 Specific UG+stats proposals: The variational learning approach
The variational learning (VarLearn) approach (Yang, 2002, 2004; Legate & Yang, 2007; Yang, 2012) combines the UG idea of linguistic parameters with reinforcement learning; this combination allows a VarLearner to probabilistically search a hypothesis space defined by the linguistic parameters. For instance, one parameter may be VO vs. OV word order (corresponding to the SVO order of English vs. the SOV order of German), while another is -V2 vs. +V2 movement. With these two parameters and potential values, the hypothesis space consists of four possible language word orders: VO and -V2 (English), VO and +V2, OV and -V2, and OV and +V2 (German). More generally, L linguistic parameters with opt options each will yield a hypothesis space of $opt^L$ language word orders. In this small example, that's only 4 ($2^2$), but if we had 10 parameters with 2 possible values each, now we have $2^{10} = 1024$. So, even with linguistic parameters, the word order hypothesis space can get very large very quickly. This is why UG-oriented researchers have long been interested in how a child could navigate a hypothesis space defined by linguistic parameters (Clark, 1992; Gibson & Wexler, 1994; Niyogi & Berwick, 1996; Fodor, 1998b, 1998a; Sakas & Fodor, 2001; Sakas & Nishimoto, 2002; Yang, 2002; Sakas, 2003; Yang, 2004; Fodor & Sakas, 2005; Fodor, Sakas & Hoskey, 2007; Sakas & Fodor, 2012; Boeckx & Leivada, 2014; Sakas, 2016; Fodor, 2017; Fodor & Sakas, 2017).
The VarLearn approach assigns probability to each parameter value for a given parameter, and typically these values are equal initially. For example, a VarLearner might start out with VO and OV each with probability 0.5, and -V2 and +V2 each with probability 0.5. When encountering a data point from the input, the VarLearner probabilistically samples a complete set of parameter values (which is equivalent to some language's word order), based on the probability of those values. So, in our example above, the VarLearner might select the VO and -V2 parameter values with probability 0.5*0.5 (prob(VO) * prob(-V2)) = 0.25. Whichever word order is sampled, the VarLearner then sees if that word order, as defined by the parameter values chosen, can account for the data point. In this example, the word order specified by VO and -V2 would be able to account for Lily likes penguins (Subject Verb Object), but not for Pinguine liebt Lily (Object Verb Subject). If the word order can account for the data point, all the participating parameter values are rewarded (and have their probability increased); if not, all parameter values are penalized (and have their probability decreased).
Over time (in particular, as the child encounters more input from her language), the idea is that the language's true parameter values will have their probabilities increased until they're near 1; the alternative parameter values will have their probabilities correspondingly decreased. Importantly, this means that unambiguous data for a parameter value are very impactful – these data will always reward the corresponding parameter value and always penalize the alternative parameter value(s). For example, data perceived by the child as unambiguous +V2 data will always reward the +V2 value and always penalize the -V2 value. This means that the parameter value perceived as having more unambiguous data (that is, an unambiguous data advantage) will be the one that has its probability increased to around 1 – it's the value the child will choose, given enough input. This is why VarLearn approaches typically do an analysis of the unambiguous data advantage a child might perceive from her input. The higher the unambiguous data advantage for a parameter value, the faster a child using the VarLearn strategy should converge on that parameter value. This means that age of acquisition predictions can be made from careful analysis of the child's input. Specifically, parameter values that have higher unambiguous data advantages are predicted to be learned earlier.
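To make these dynamics concrete, here's a minimal sketch of a VarLearner tracking a single -V2 vs. +V2 parameter with the linear reward-penalty update from above; treating SVO strings as ambiguous and OVS strings as unambiguous +V2 evidence follows the discussion so far, but the input proportions, learning rate, and number of learning steps are all hypothetical.

```python
# A minimal sketch of a variational learner (Yang, 2002) for one binary parameter,
# -V2 vs. +V2. SVO surface strings are treated as ambiguous (compatible with
# either value); OVS strings are treated as unambiguous +V2 evidence.

import random

GAMMA = 0.01  # learning rate (illustrative value)

def compatible(plus_v2, datum):
    """Can a grammar with this V2 setting generate this surface order?"""
    if datum == "OVS":
        return plus_v2        # only a +V2 grammar derives OVS in this toy setup
    return True               # SVO is ambiguous: either setting can derive it

def learn(stream, p_plus_v2=0.5, n_steps=20000):
    for _ in range(n_steps):
        datum = random.choice(stream)
        plus_v2 = random.random() < p_plus_v2          # sample a parameter value
        p_sampled = p_plus_v2 if plus_v2 else 1 - p_plus_v2
        if compatible(plus_v2, datum):
            p_sampled += GAMMA * (1 - p_sampled)       # reward the sampled value
        else:
            p_sampled *= (1 - GAMMA)                   # penalize the sampled value
        p_plus_v2 = p_sampled if plus_v2 else 1 - p_sampled
    return p_plus_v2

# A German-like stream with a small unambiguous +V2 advantage (2% OVS strings):
# P(+V2) drifts toward 1, but slowly, mirroring a later age of acquisition.
print(round(learn(["SVO"] * 98 + ["OVS"] * 2), 2))
```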
Specific UG+stats proposals: The UG part
Linguistic parameters are meant to be innate, language-specific knowledge. One reason linguistic parameters have been a core component of UG approaches to language development is that they're intended as extremely useful building blocks. More specifically, linguistic parameters allow a child to construct a (potentially very large) collection of explicit hypotheses about a language's word order, without having to spell all those hypotheses out beforehand. Moreover, linguistic parameters are meant to constrain the child's possible hypotheses to those that correspond to actual languages the child may be learning. So, linguistic parameters are helpful for acquisition because they're a compact way to represent the space of possible hypotheses a child might reasonably need to consider (in this case, about word order). See Pearl and Lidz (2013), Pearl (in press-a), and Pearl (in press-b) for additional discussion about why UG approaches to acquisition like to incorporate linguistic parameters.
Specific UG+stats proposals: The statistics part
Reinforcement learning is a type of statistical learning, and forms the basis for the VarLearn learning mechanism.
3.2.2 Predictions made
As mentioned above, a VarLearn approach will often be able to analyze the unambiguous data advantage for one linguistic parameter value over another that a child would perceive from her input. On the basis of this advantage, a VarLearner can generate predictions about relative order of acquisition for different word order aspects related to different parameters. For instance, on the basis of one VarLearn analysis from Yang (2012) (shown in Table 2), it appears that English has an unambiguous advantage of 25% for wh-movement in questions. That is, in an English child's perceived input, the proportion of English wh-questions with wh-movement (e.g., Who did you see?) is .25 more than the proportion of English wh-questions without wh-movement (e.g., You saw who?). In contrast, it appears that German has an unambiguous data advantage of 1.2% for allowing V2 movement. So, we would then expect that +wh-movement in English would be learned earlier than +V2 movement in German for a VarLearn child. Based on the observed ages of acquisition shown in Table 2, that does seem to be true (+wh-movement in English is learned by 1 year 8 months (1;8), while +V2-movement in German is learned around 3 years old).
Perhaps more interestingly, the VarLearn approach predicts that similar unambiguous data advantages ought to lead to similar ages of acquisition. This then allows more precise predictions about what ages we ought to observe children acquiring certain word order options. More generally, Table 2 shows existing VarLearn child input analyses for several word order phenomena (see Yang, 2012 and Pearl, in press for more discussion about these individual word order phenomena).
As a concrete example, consider “pro-drop”, which allows the optional omission of subjects. English isn't a language like this – while English speakers do sometimes leave out subjects in conversational speech (e.g., Speaker 1: “Are you going?” Speaker 2: “Headed out now.”), the basic usage is that English speakers have to include the subject. This is why (unlike languages like Spanish and Italian), English speakers use what are called expletive subjects, which are subjects that aren't contentful; some examples of expletive subjects are the it in It's raining and It seems that a penguin is on the ice. In both cases, the “it” isn't referring to anything, the way the pronoun “it” typically does (e.g., It's a penguin, Look what it's doing). Instead, the “it” appears because English requires the subject to be there as a default, whether the subject refers to anything or not. Hence, English uses expletive subjects. So, expletive subjects serve as an unambiguous signal that English is not a pro-drop language that can optionally drop its subjects. The VarLearn analysis by Yang (2012) suggested that expletive subjects (unambiguously signalling -pro-drop) had a 1.2% advantage in children's input over any +pro-drop signals (shown in Table 2). Notably, this is the same unambiguous data advantage as for +V2 movement in both German and Dutch (i.e., 1.2%). When we look at the observed age of acquisition, -pro-drop in English – just like +V2-movement in German and Dutch – appears to be acquired around age 3. So, the same unambiguous data advantage (1.2%) seems to correlate with the same observed age of acquisition for these two word order phenomena.
This means that the VarLearn approach has the potential to generate fairly specific predictions about age of acquisition, on the basis of the unambiguous data advantage a VarLearn child would perceive in her input. So, for any language and any word order linguistic parameter, we need to decide what the unambiguous data would be for the parameter value of the language (e.g., +V2 or -pro-drop) as well as the unambiguous data for any alternative parameter values (e.g., -V2 or +pro-drop). I should note that this is by no means trivial – what counts as unambiguous very much depends on what the competing options are for the parameter in question, as well as what other word order parameters in the language may obscure the target value's observable signature in the input. For instance, consider that the unambiguous signal for +V2 movement involved the order Object Verb Subject but not Subject Verb Object – this is because Subject Verb Object could be generated by -V2 combined with an SVO basic word order. Still, with a concrete idea of what unambiguous data are for each parameter value under consideration, we can calculate how much unambiguous data the child would perceive for the target value vs. the other values, and so calculate the unambiguous data advantage perceived by the child for the target value.
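The calculation itself is simple once the unambiguous data have been identified; the sketch below shows it with hypothetical counts standing in for a real corpus analysis.

```python
# A small helper for the unambiguous data advantage described above: the
# proportion of relevant input that unambiguously signals the target value,
# minus the proportion that unambiguously signals a competing value.
# The counts below are hypothetical, not from an actual input analysis.

def unambiguous_advantage(n_target, n_competitor, n_relevant):
    """(unambiguous data for target - unambiguous data for competitor) / relevant data."""
    return (n_target - n_competitor) / n_relevant

# e.g., 120 unambiguous +V2 utterances (like OVS) vs. 0 unambiguous -V2 utterances
# out of 10,000 relevant utterances yields a 1.2% advantage for +V2.
print(unambiguous_advantage(120, 0, 10_000))   # 0.012
```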
Once we know the unambiguous data advantage for the target word order parameter values (in either the same language or across several languages), we then know their predicted relative acquisition trajectory: those with a higher unambiguous data advantage should be acquired earlier. If we have enough of this kind of data, we may also be able to triangulate on a specific expected age of acquisition for any given parameter value. Parameter values with similar unambiguous data advantages are predicted to have similar observed ages of acquisition, like +V2 movement in German and -pro-drop in English. Based on this, here's an example specific prediction the VarLearn account makes.
Predictions made: Specific prediction
Identify a word order phenomenon WOrdPhen in a language, and the unambiguous data that correspond to it. Calculate the unambiguous data advantage for WOrdPhen in children's input. If the unambiguous advantage is less than 1.2%, the VarLearn account predicts children acquire knowledge of WOrdPhen after age 3.
3.2.3 Prediction evaluation
As mentioned above, from the available VarLearn analyses shown in Table 2, it seems that current predictions (both relative and absolute) are borne out. Of course, there are many more word order aspects that can be captured by linguistic parameters and many more languages where VarLearn analyses are yet to be done. The VarLearn approach would be supported any time the unambiguous data advantage aligns with the relative order of acquisition (e.g., learning WOrdPhen after age 3 if its advantage is < 1.2%); if the unambiguous data advantage also allows us to pinpoint a specific age of acquisition, then the VarLearn approach would be supported whenever that predicted age of acquisition is in fact observed (e.g., learning WOrdPhen at age 3 if its advantage = 1.2%).
In contrast, the VarLearn approach wouldn't be supported any time the unambiguous data advantage doesn't align with the relative order of acquisition (e.g., learning WOrdPhen before age 3 when its unambiguous advantage < 1.2%) or doesn't predict the observed age of acquisition (e.g., learning WOrdPhen at some age other than 3 when its unambiguous advantage = 1.2%). I do note that supporters of the VarLearn approach might then argue that the data considered unambiguous for the target parameter value might be the issue, rather than giving up on the VarLearn approach altogether (that is, the calculated unambiguous data advantage was incorrect). However, the burden of proof would be on those supporters to identify plausible unambiguous data that would lead to the appropriate unambiguous data advantage.
3.3 Core morphology: Inflectional morphology
For languages that have a rich inflectional morphology system, children need to learn how to indicate features like verb tense, person, and number, as well as noun case. Even for languages with sparser morphology (like English), children still need to learn to indicate some subset of features using morphology (e.g., past tense). Whether rich or sparse, morphology systems are harder to learn the more irregular they are – that is, the more exceptions there are to the default rule. This is because the default morphological rule(s) may well get obscured in children's input when there are many exceptions. So, a core aspect of morphological acquisition is how children figure out their morphology systems, particularly in the presence of exceptions. In the interest of space, I'll focus on one UG+stats approach that involves the Tolerance and Sufficiency Principles, but see studies by Gagliardi and colleagues (Gagliardi & Lidz, 2014; Gagliardi, Feldman & Lidz, 2017) for an approach that involves Bayesian inference.
3.3.1 Specific UG+stats proposals: The Tolerance and Sufficiency Principles
An approach to morphology acquisition proposed by Yang (2005, 2016) involves the Tolerance and Sufficiency Principles (TolP+SuffP), and has been used to account for the acquisition of a variety of semi-regular morphology in both English and German. More specifically, the TolP+SuffP learner identifies (i) whether a morphological affix is productive, and so is applied to new word forms, or (ii) whether the affix is restricted to a certain subclass of words in the language (i.e., an exception to the productive rule). In English, this approach has been used to identify productive morphology for the past tense (+ed default: kiss-kissed), noun plurals (+s default: penguin-penguins), and derivational morphology (e.g., productive = -ness, cute-cuteness; -ment, enjoy-enjoyment; -er, teach-teacher; -ity, stupid-stupidity; unproductive = -age, pack-package; -th, true-truth). In German, this approach has been used to identify productive noun plural morphology when the nouns have certain properties, such as a certain grammatical gender (e.g., being +feminine), a certain phonological property (e.g., a reduced final syllable), or a certain morphological property (e.g., being monosyllabic). When the nouns don't fit in any of these specified classes, the TolP+SuffP learner can also identify -s (Auto-Autos) as the productive plural, despite its infrequency.
The general approach a TolP+SuffP learner takes is to monitor the morphological forms in her input, and on the basis of that input, hypothesize potential rules that might be productive (e.g., for the English past tense, +ed and alternatives like “word rime becomes /ɔt/”, as in catch-caught and buy-bought). Then, the TolP+SuffP learner identifies the relevant domain where these potential rules could apply (e.g., all English verbs for the English past tense). The learner then uses the Tolerance and Sufficiency Principles to identify how many exceptions a productive rule can tolerate while still being productive; if there are sufficient rule-following words (i.e., the exceptions are fewer than the specified number that a productive rule can tolerate), the TolP+SuffP learner identifies that rule as the productive rule for that domain. This process is done for every potential rule. Importantly, only one potential rule could be the productive rule, because of the implementation of the Tolerance and Sufficiency Principles – a productive rule requires a majority of the words that could obey it to actually obey it (see Yang, 2016 and Pearl, in press for more detailed discussion on exactly why this is). So, after this evaluation process, a TolP+SuffP learner could either (i) identify one of the potential rules (i.e., morphological affixes) to be productive within the specified domain of words, or (ii) identify that none of the potential rules are productive (and so there is no productive morphological affix for that domain of words).
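To see how this plays out for a single domain, here's a small continuation of the Tolerance/Sufficiency sketch from section 2.2.3, applied to two hypothetical competing past-tense rules; the verb counts are invented for illustration, and the point is only that at most one rule can clear the threshold for the full domain.

```python
# Reusing the Tolerance/Sufficiency sketch from section 2.2.3 on two competing
# past-tense rules, with hypothetical verb counts.

from math import log

def productive(n_domain, n_rule_following):
    """Productive iff exceptions (n_domain - n_rule_following) <= N / ln(N)."""
    return (n_domain - n_rule_following) <= n_domain / log(n_domain)

N_VERBS = 300                      # hypothetical size of the child's verb vocabulary
print(productive(N_VERBS, 270))    # "+ed": 30 exceptions, threshold ~53 -> productive
print(productive(N_VERBS, 12))     # "rime -> /ɔt/": 288 exceptions -> not productive
```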
Specific UG+stats proposals: The UG part
For the TolP+SuffP learner, it might be argued that these innate principles (i.e., the Tolerance and Sufficiency Principles) are language-specific, as they're derived from considerations of linguistic item storage and retrieval in real time (see Yang, 2016 for discussion of this perspective).
Specific UG+stats proposals: The statistics part
The Tolerance and Sufficiency Principles operate over counts of relevant items (i.e., how many words obey a potential rule vs. how many are exceptions to that rule).
3.3.2 Predictions made: TolP+SuffP
As mentioned above, a TolP+SuffP approach is able to capture the correct qualitative result for several cases of semi-regular morphology in English and German – that is, a TolP+SuffP child can identify the correct generalization for productive morphology. More generally, if a child acquires a productive rule for some piece of morphology, we would expect to see application of that morphology to new words that fall within the relevant domain. For example, once the child acquires the -ed morphology rule for the English past tense, we would expect to see new words in the past tense with the -ed form (e.g., Jack wugs today. He wugged yesterday.). In fact, a productive rule might cause overregularizations in semi-regular systems where there are exceptions, but the child hasn't learned all the exceptions yet (e.g., drink-drinked, go-goed). We see both these kinds of child outputs in English and German, as discussed by Yang (2016).
Similarly, we can consider lexical gaps, where certain forms with inflectional morphology don't seem to exist for adults. Some examples are the past participle of stride in English (Jack has *stridden.), the first person singular in the present tense of abolish in Spanish (*abuelo = I abolish), and the first person singular of non-past verbs like win in Russian (*pobežu/*pobeždu = I win). When asked to create these forms, adults in these languages don't quite know what to do because the relevant morphology isn't productive for that domain of words. Yang (2016) demonstrates how a TolP+SuffP learner can fail to identify a productive morphological rule in these cases.
However, we have yet to see precise predictions about exactly what age TolP+SuffP children should identify that certain morphology is productive (or not). In cases where the morphology is in fact productive, we might expect that the recognition of productivity depends on how frequently the individual words in the relevant domain appear in the child's input. The more often they do, the more likely the child is to notice them and be able to make the correct generalization using the TolP+SuffP approach. Importantly, applying the TolP+SuffP approach means the child has to also identify the relevant domain where the morphology would be productive, and it's unclear that we have precise predictions about when this would happen (or really, what might trigger this to happen).
In cases of lexical gaps where morphology isn't productive, we face a similar problem of not knowing precisely what age a child ought to figure out that there isn't a productive morphological rule for some domain of words. However, given that the target state at the end of development is the lack of a productive rule, we can at least see if children's input over time would lead a TolP+SuffP learner to decide there isn't a productive morphological rule. What might be especially interesting is if a child's input could lead a TolP+SuffP learner to the temporary belief that there is in fact a productive rule, and we see evidence of that temporary state in children's behavior (either through application to novel words in the domain, or overregularization).
What's needed for generating more precise predictions about children's age of acquisition for morphology under the TolP+SuffP approach is a more incremental application of this approach to children's input. That is, we need to understand whether a TolP+SuffP child would predict a specific morphological affix to be productive when given realistic child input from specific ages (e.g., up to 12 months vs. 12–18 months vs. 18–24 months, and so on). With that kind of analysis, we would have specific predictions about whether a child of a particular age in a particular language should perceive a particular affix as productive or not (an example specific prediction of this kind is below). Then, we can assess whether these predictions are borne out in child linguistic behavior.
Predictions made: Specific prediction
Identify the age MorAge when a productive morphological affix ProdMor first becomes productive for children (e.g., the age when English-learning children overregularize past tense +ed may be around 30 months (Maslen, Theakston, Lieven & Tomasello, 2004)). A modeled TolP+SuffP child who learns from the data that children learn from just before MorAge (e.g., 24–30 months for English +ed) should identify ProdMor as productive. In contrast, a modeled TolP+SuffP child who learns from the data that children learn from long before MorAge (e.g., before 12 months, or 12–18 months for English +ed) should identify ProdMor as unproductive.
3.3.3 Prediction evaluation
As mentioned above, it seems like a TolP+SuffP learner can get the right adult morphological generalizations for certain cases of semi-regular morphology in English and German. However, we don't yet have precise predictions about the expected age of acquisition for these generalizations, given children's input. So, it seems that the way forward is to look for other morphology systems, especially semi-regular ones where there are exceptions and/or probabilistic associations of different types of information. Then, we can apply this UG+stats approach to the acquisition of those morphology systems to generate predictions about how acquisition ought to proceed, given realistic child input data.
3.4 A more complex thing: A temporary lack of inflection
In many languages that have relatively less inflectional morphology (e.g., those shown in Table 3), children go through a stage where they seem to systematically leave off obligatory inflection on verbs. So, the verb appears to be in the non-finite (infinitive) form, where tense is missing. This stage is sometimes called the optional infinitive (OI) stage, as children optionally use what seems to be the infinitive form of the verb, instead of the appropriate inflected form.
For example, in English, a child might want to express the idea that her father has something – the target form Papa has it is expressed as Papa have it, where the verb have is missing the 3rd person singular present morphology. In Hebrew, a child might express the target form involving the present tense of sit by using the infinitive equivalent (lashevet 'to sit'), which has overt infinitive morphology clearly indicating that the child used the infinitive form, rather than a bare root form with no morphology. This is also the case in Dutch, French, and German, where the form the child uses has clear infinitive morphology (e.g., drinken 'to drink' in Dutch, dormir 'to sleep' in French, and hinstellen 'to put' in German from Table 3). Moreover, in these languages, the use of the infinitive is often accompanied by a word order that's appropriate for the infinitive form of the verb but not for the inflected form.
Interestingly, children's frequency of OIs seems to vary by language, with some children using them very infrequently and tapering off OI use before age two (e.g., children learning Spanish), while other children still use OIs fairly frequently into age three and beyond (e.g., children learning English). So, from an acquisition perspective, we want to understand why children across the world's languages show the amount of OI use that they do, and how they break out of this stage to reach the adult usage (which doesn't involve OIs).
3.4.1 Specific UG+stats proposals: The variational learning approach
Legate and Yang (Reference Legate and Yang2007) propose a VarLearn approach to explain the different rates of OIs in child-produced speech, with the idea that children are relying on a linguistic parameter that determines whether their language is one that uses tense morphology (+Tense) or not (-Tense). +Tense languages like English, Hebrew, Dutch, French, and German express tense morphosyntactically (e.g., English has = have marked as present, 3rd person, singular); -Tense languages like Mandarin Chinese don't, relying on other linguistic mechanisms to communicate tense (e.g., Mandarin Chinese Zhangsan zai da qiu = Zhangsan ASPECT play ball = “Zhangsan is playing ball.”). The OI stage of a +Tense language happens because children think the correct parameter value for their language is -Tense. As children perceive more unambiguous +Tense data in their input, the +Tense grammar is rewarded and the -Tense grammar generating the OIs is penalized, until it's no longer active. How fast this happens depends on how many more unambiguous +Tense data are available than unambiguous -Tense data (i.e., the +Tense unambiguous data advantage).
Specific UG+stats proposals: The UG part
The Tense linguistic parameter is meant as UG knowledge: children need to know both that this parameter exists and that it has two values (+/-Tense).
Specific UG+stats proposals: The statistics part
As with the VarLearn approach for word order, reinforcement learning forms the basis of the learning mechanism.
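For concreteness, here is a minimal sketch of the kind of reinforcement learning a VarLearn child might use for the +/-Tense parameter, assuming a linear reward-penalty update over the two parameter values; the learning rate, input proportions, and success criterion below are illustrative assumptions, not Legate and Yang's actual settings.

```python
import random

def varlearn_tense(p_plus_tense=0.5, advantage=0.30, gamma=0.01, n_inputs=5000, seed=0):
    """Linear reward-penalty learner over a binary +/-Tense parameter.

    Each input is unambiguous +Tense evidence with probability `advantage`
    (the +Tense unambiguous data advantage), unambiguous -Tense-looking
    evidence with a small probability, and ambiguous otherwise (ambiguous
    data reward whichever grammar the child happened to choose).
    """
    rng = random.Random(seed)
    p = p_plus_tense             # probability of choosing the +Tense grammar
    p_minus_evidence = 0.01      # hypothetical rate of -Tense-looking data
    for _ in range(n_inputs):
        r = rng.random()
        if r < advantage:
            evidence = "+Tense"
        elif r < advantage + p_minus_evidence:
            evidence = "-Tense"
        else:
            evidence = "ambiguous"
        chose_plus = rng.random() < p
        succeeded = evidence == "ambiguous" or (evidence == "+Tense") == chose_plus
        if succeeded:
            # reward the grammar that was chosen
            p = p + gamma * (1 - p) if chose_plus else p * (1 - gamma)
        else:
            # penalize the grammar that was chosen
            p = p * (1 - gamma) if chose_plus else p + gamma * (1 - p)
    return p  # OI production roughly tracks 1 - p during learning

print(varlearn_tense(advantage=0.30))  # larger advantage: faster convergence
print(varlearn_tense(advantage=0.05))  # smaller advantage: longer OI stage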
3.4.2 Predictions made
As mentioned above, the VarLearn child is driven by the unambiguous data advantage she perceives in her input. So, for any given language, the perceived unambiguous data advantage for +Tense can be calculated. Then, unambiguous +Tense data advantages can be compared across languages to yield a predicted relative order of acquisition. In particular, a higher +Tense advantage predicts a shorter OI stage. Moreover, if the length of the OI stage is known for a specific language (i.e., the age at which children leave the OI stage), the +Tense advantage can be correlated with that age. Similar +Tense advantages predict similar ages at which children leave the OI stage.
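A small sketch of how this cross-linguistic prediction could be checked, given per-language estimates of the +Tense unambiguous data advantage and of when children leave the OI stage; the numbers below are placeholders, not the actual corpus or behavioral estimates.

```python
# Hypothetical per-language values: +Tense unambiguous data advantage and
# approximate age (months) at which children stop producing OIs.
languages = {
    "Spanish": {"advantage": 0.80, "oi_exit_months": 24},
    "French":  {"advantage": 0.60, "oi_exit_months": 30},
    "English": {"advantage": 0.50, "oi_exit_months": 42},
}

# Prediction: ranking languages by advantage (high to low) should match
# ranking them by OI exit age (early to late).
by_advantage = sorted(languages, key=lambda l: -languages[l]["advantage"])
by_exit_age = sorted(languages, key=lambda l: languages[l]["oi_exit_months"])
print("by advantage:", by_advantage)
print("by OI exit:  ", by_exit_age)
print("prediction holds:", by_advantage == by_exit_age)
```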
3.4.3 Prediction evaluation
Legate and Yang (Reference Legate and Yang2007) use the VarLearn approach to analyze the perceived unambiguous data advantage for +Tense in Spanish, French, and English children (who are all learning +Tense languages); they find a qualitative fit between the unambiguous data advantage and these children's production of OIs. More specifically, the unambiguous data advantage for +Tense in Spanish > French > English, while the Spanish rate of OI production < French OI production < English OI production. This in turn suggests that the OI stage for Spanish < French < English (i.e., the stage in English lasts the longest), and this seems to be true. So, the greater the unambiguous data advantage for +Tense in a language's child-directed speech, the faster children acquiring that language stop using OIs.
Still to do is to evaluate the VarLearn approach on other languages where children have OI stages, such as Hebrew, Dutch, and German. I should also note an important caveat – an alternative non-UG+stats account of OIs called MOSAIC (Model of Syntax Acquisition In Children) has already been applied to a large number of languages (Freudenthal, Pine, Aguado-Orea & Gobet, Reference Freudenthal, Pine, Aguado-Orea and Gobet2007; Freudenthal, Pine & Gobet, Reference Freudenthal, Pine and Gobet2009, Reference Freudenthal, Pine and Gobet2010; Freudenthal, Pine, Jones & Gobet, Reference Freudenthal, Pine, Jones and Gobet2015), including those that the VarLearn approach has been applied to. (See Pearl, in press for more discussion about the MOSAIC approach.) MOSAIC is also able to account for the different cross-linguistic rates of OIs in children, and additionally offers an explanation of why OI errors appear with certain specific verbs. Currently, the VarLearn approach doesn't offer the same ability to explain OI errors with specific verbs in these languages. So, for this reason, the non-UG+stats MOSAIC account may be preferable for now to the VarLearn approach when it comes to OIs.
3.5 A more complex thing: Movement
A more sophisticated type of syntactic knowledge involves “movement”, where linguistic elements are understood in certain positions of an utterance and yet don't appear to be in those positions. So, the idea is that the linguistic elements have moved from the positions where they're understood. Some examples of this are wh-movement in questions, passives, and raising vs. control structures. In the interest of space, I'll focus on raising vs. control structures, but see work by Yang (Yang, Reference Yang2002; Yang, Reference Yang2004; Legate & Yang, Reference Legate and Yang2007; Yang, Reference Yang2012) for the VarLearn account of wh-movement in questions, and Nguyen and Pearl (Reference Nguyen and Pearl2019) for a Bayesian learning approach to passives.
Raising vs. control structures
In subject-raising structures like Jack seemed to kiss Lily, the subject of the main clause Jack doesn't have an agent thematic role for the main clause verb seem – that is, Jack isn't a “seemer” (whatever that is). Instead, Jack is the agent of kiss, which is the embedded clause verb. That's why this utterance can be rephrased as It seemed that Jack kissed Lily, which has an expletive it as the main clause subject and Jack overtly as the embedded clause subject. So, the original sentence would have a structure more like Jack seemed _Jack to kiss Lily, where _Jack marks the position Jack moved (or “raised”) from.
Subject-raising structures contrast with subject-control structures like Jack wanted to kiss Lily – here, the main clause subject Jack connects to two thematic roles: the agent of main clause verb wanted and the agent of embedded clause verb kiss. (This is why we can't rephrase this utterance as *It wanted that Jack kissed Lily – expletive it can't be the agent of wanted.) Because traditional linguistic theory disliked linguistic elements having more than one thematic role, a solution was for this utterance to have a structure more like Jack wanted PRO to kiss Lily, where Jack is connected to the silent pronoun PRO; this allows Jack to be the agent of wanted while PRO is the agent of kiss. So, unlike raising structures, there's no movement associated with control structures. Instead, the child has to recognize the connection between the main clause subject and the silent pronoun PRO.
The same raising vs. control distinction also happens for objects – that is, there are object-raising verbs and object-control verbs. In object-raising structures like Jack wanted Lily to laugh, the main clause object Lily is only the agent of the embedded clause verb laugh, rather than also having a thematic role for the main clause verb wanted. So, the structure is something like Jack wanted Lily _Lily to laugh, with Lily raised from the embedded clause position. In contrast, in object-control structures like Jack asked Lily to laugh, the main clause object Lily connects to two thematic roles: the agent of embedded clause verb laugh and the goal of main clause verb asked. So, the structure is something like Jack asked Lily PRO to laugh, with Lily and PRO connected to each other.
For raising and control verbs, children therefore need to learn that these interpretations are possible (i.e., the main clause subject or object effectively gets associated with either one thematic role or two). This involves learning where the main clause subject or object moved from (raising) or that the main clause subject or object is connected to the silent PRO in the embedded clause (control). Moreover, children need to identify which verbs allow which types of structures (e.g., seem is a subject-raising verb, want is a subject-control verb and also an object-raising verb, and ask is a subject-control verb and also an object-control verb). Current behavioral evidence suggests that English four- and five-year-olds have these interpretation options available and have sorted some frequent raising and control verbs into relevant classes that allow adult-like interpretation of these verbs (Becker, Reference Becker2006; Becker, Reference Becker2007, Reference Becker2009; Kirby, Reference Kirby2009a, Reference Kirby2009b, Reference Kirby2010; Becker, Reference Becker2014).
3.5.1 Specific UG+stats proposals: Raising vs. control
The potential UG+stats approaches I'm aware of involve children attending to certain features of verbs and their arguments (e.g., whether the subject is animate, or what syntactic contexts a verb can appear in), and then using Bayesian inference to cluster together verbs that behave the same way with respect to these features (Mitchener & Becker, Reference Mitchener and Becker2010; Becker, Reference Becker2014; Pearl & Sprouse, Reference Pearl and Sprouse2019b). For instance, verbs that take inanimate subjects are more likely to be subject-raising verbs (e.g., The rock seemed to fall (seem is subject-raising) vs. *The rock wanted to fall (want is subject-control)). The approach of Becker and Mitchener (Mitchener & Becker, Reference Mitchener and Becker2010; Becker, Reference Becker2014) focuses primarily on the animacy of the subject, while the approach of Pearl and Sprouse (Reference Pearl and Sprouse2019b) considers the animacy of all verb arguments, the thematic roles the verb arguments take (e.g., whether the subject is an agent or a theme), and the syntactic contexts a verb can appear in (e.g., a transitive frame like Jack kissed Lily or a frame that involves a non-finite embedded clause like Jack wanted to kiss Lily).
Specific UG+stats proposals: The UG part
In these approaches, the main place where I see a role for UG is in determining which features children use to sort verbs into the relevant classes. In particular, it could be that innate, language-specific knowledge causes children to focus on animacy when clustering verbs together into classes, as opposed to other salient conceptual features of verb arguments. The Pearl and Sprouse approach considers a wider range of verb and verb argument features than the Becker and Mitchener approach, but still restricts the range of possibilities for the thematic role distinctions and the syntactic positions that children perceive; these restrictions are based on current theoretical proposals in the syntactic literature. If these thematic role and syntactic position distinctions are innate, language-specific knowledge, then they would come from UG.
Specific UG+stats proposals: The statistics part
The learning mechanism for these approaches is Bayesian inference.
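As an illustration of this kind of statistics, here is a minimal sketch of Bayesian clustering over verb features, using a simple Chinese-restaurant-process mixture with Beta-Bernoulli feature models. This is an illustrative stand-in rather than the actual models of Mitchener and Becker or Pearl and Sprouse, and the verbs, feature names, and counts are invented.

```python
import math, random

# Hypothetical per-verb feature counts from child-directed speech:
# "anim" = (uses with an animate subject, total uses),
# "embed" = (uses with a non-finite embedded clause, total uses).
verbs = {
    "seem": {"anim": (20, 100), "embed": (90, 100)},
    "want": {"anim": (95, 100), "embed": (70, 100)},
    "like": {"anim": (96, 100), "embed": (60, 100)},
    "need": {"anim": (90, 100), "embed": (65, 100)},
    "tend": {"anim": (25, 100), "embed": (85, 100)},
}
FEATURES = ["anim", "embed"]
ALPHA = 1.0        # CRP concentration: willingness to posit new verb classes
BETA = (1.0, 1.0)  # Beta prior on each Bernoulli feature

def log_marginal(successes, total, prior=BETA):
    """Log marginal likelihood of Bernoulli counts under a Beta prior."""
    a, b = prior
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + math.lgamma(a + successes) + math.lgamma(b + total - successes)
            - math.lgamma(a + b + total))

def cluster_loglik(members):
    """Log marginal likelihood of a cluster's pooled feature counts."""
    total = 0.0
    for f in FEATURES:
        s = sum(verbs[v][f][0] for v in members)
        n = sum(verbs[v][f][1] for v in members)
        total += log_marginal(s, n)
    return total

def gibbs(n_iters=200, seed=0):
    """Collapsed Gibbs sampling over verb-class assignments."""
    rng = random.Random(seed)
    assign = {v: 0 for v in verbs}  # start with all verbs in one class
    for _ in range(n_iters):
        for v in verbs:
            clusters = {}
            for w, c in assign.items():
                if w != v:
                    clusters.setdefault(c, []).append(w)
            options, logps = [], []
            for c, members in clusters.items():
                gain = cluster_loglik(members + [v]) - cluster_loglik(members)
                options.append(c)
                logps.append(math.log(len(members)) + gain)
            options.append(max(clusters, default=-1) + 1)  # a brand-new class
            logps.append(math.log(ALPHA) + cluster_loglik([v]))
            m = max(logps)
            weights = [math.exp(lp - m) for lp in logps]
            assign[v] = rng.choices(options, weights=weights)[0]
    return assign

print(gibbs())  # verbs sharing a class id are predicted to behave alike
```

With these invented counts, verbs that pattern together on animacy and frame use (e.g., want/like/need vs. seem/tend) tend to end up in the same class, which is the kind of class structure the prediction below relies on.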
3.5.2 Predictions made: Raising vs. control
The Bayesian approaches cluster verbs into classes, where the classes allow different raising and control constructions; these Bayesian approaches can then predict the classes that children of different ages ought to cluster their verbs into. These predicted verb classes can then be checked against behavioral data from children of different ages. For example, if children treat two verbs the same way (e.g., both verbs allowing subject-raising, but not subject-control, object-raising, or object-control), then the Bayesian approaches ought to have clustered those two verbs together into the same class. This prediction check can be done for all verbs where we have empirical data about how children treat the verbs (i.e., as belonging to the same class or not). An example specific prediction of this kind is below.
Predictions made: Specific prediction
One model variant from Pearl and Sprouse (Reference Pearl and Sprouse2019b) predicts that English five-year-olds treat want, like, and need as belonging to the same class, while another variant predicts that only want and like belong to the same class. We can check these predictions by seeing whether English five-year-olds treat want, need, and like the same (e.g., interpreting them as subject-control verbs that take two thematic roles in instances like Jack wants/needs/likes to go). If five-year-olds treat all three the same, the first model variant is supported; if they treat only want and like the same, the second model variant is supported; and if they don't treat any of these verbs the same, then no model variant is supported – and maybe different features need to be considered for verb classification.
3.5.3 Prediction evaluation: Raising vs. control
The Bayesian approaches to clustering verbs into classes that involve raising vs. control interpretations appear to match children's verb classifications fairly well (Pearl & Sprouse, Reference Pearl and Sprouse2019b). So, these approaches seem promising, particularly when we allow children to consider a range of features (conceptual, thematic, and syntactic). A useful aspect of a model predicting verb classes is that we have a variety of ways to evaluate if children in fact have similar verb classes. One way is what's been done already – derive children's verb classes from their aggregated behavioral data and compare those against the model's verb classes. However, another way is to use the model's predicted verb classes to predict child behavior in specific experiments. For instance, given a specific context (i.e., animacy of the verb arguments, thematic roles of the verb arguments, and syntactic context of the verb), what's the probability that a child will interpret a novel verb as raising vs. control? This quantitative prediction about interpretation rate can be compared against the rates at which children actually do interpret a verb a particular way in context. Becker and Kirby (Becker, Reference Becker2006; Becker, Reference Becker2007, Reference Becker2009; Kirby, Reference Kirby2009a, Reference Kirby2009b, Reference Kirby2010; Becker, Reference Becker2014) have already conducted several behavioral experiments like these that can provide precise testing grounds for these Bayesian approaches.
3.6 A more complex thing: Constraints
Another more sophisticated type of syntactic knowledge involves “constraints”; constraints disallow certain structures (and their accompanying interpretations), rather than specifying which structures are allowed. Two prominent examples of constraints investigated by UG+stats proposals are syntactic islands (sometimes called subjacency) and binding. In the interest of space, I'll focus on syntactic islands; see Orita, McKeown, Feldman, Lidz, and Boyd-Graber (Reference Orita, McKeown, Feldman, Lidz and Boyd-Graber2013) (and the discussion of that study in Pearl, in press) for a Bayesian learning approach to binding that involves UG knowledge of c-command.
Syntactic islands: Constraints on wh-dependencies
In English, a wh-word typically appears at the front of a question. The relationship between the overt position of the wh-word and where it's understood can be called a dependency, and so (3a) shows a wh-dependency between What and where it's understood at the position marked by _what. It turns out that there are constraints on the wh-dependencies that are allowed; one way to describe this is that there are certain structures called syntactic islands that wh-dependencies can't cross (Chomsky, Reference Chomsky1965; Ross, Reference Ross1967; Chomsky, Reference Chomsky, Anderson and Kiparsky1973). Four examples of syntactic islands in English are shown in (3b)–(3e), with the proposed syntactic island structure in square brackets ([…]). During acquisition, English children have to learn the constraints on wh-dependencies that allow them to recognize that the wh-dependencies in (3b)–(3e) aren't allowed, while the wh-dependency in (3a) is fine.
(3)
3.6.1 Specific UG+stats proposals: Syntactic islands
Pearl and Sprouse (Reference Pearl, Sprouse, Sprouse and Hornstein2013a) and Pearl and Sprouse (Reference Pearl and Sprouse2013b, Reference Pearl and Sprouse2015) investigated a probabilistic learning strategy that relies on trigrams (i.e., sequences of three elements) constructed from certain pieces of syntactic structure in wh-dependencies. So, we can think of this as a probabilistic syntactic trigrams approach (SynTrigrams). The SynTrigrams strategy relies on children viewing a wh-dependency as a path from the head of the dependency (e.g., Who in (4)) through the phrasal nodes that contain the tail of the dependency, as shown in (4a)–(4b). So, a SynTrigrams child just needs to learn which wh-dependencies have grammatical syntactic paths and which don't. The SynTrigrams child does this by tracking smaller building blocks of these syntactic paths – the syntactic trigrams. More specifically, a SynTrigrams learner breaks the syntactic path of a wh-dependency into a collection of syntactic trigrams that can be combined to reproduce the original syntactic path, as shown in (4c).
(4) Who did Jack think that the story about penguins amused _who?
The SynTrigrams child then tracks the frequencies of syntactic trigrams that she perceives in her input. Importantly, every instance of a wh-dependency is composed of some set of syntactic trigrams, so a child can potentially learn about a specific syntactic trigram (e.g., start-IP-VP) from a variety of wh-dependencies. That is, the building blocks of a particular wh-dependency syntactic path can come from other wh-dependencies, not just that particular wh-dependency. The SynTrigrams child can later use the syntactic trigram frequencies to calculate the probability of any wh-dependency she likes, whether she's encountered it before or not; this is because all wh-dependencies can be broken into syntactic trigram building blocks, and the child has a sense of how probable any particular syntactic trigram is, based on its frequency in her input. For example, the wh-dependency in What did the penguin eat _what? can be characterized as in (5), and its probability generated from some of the same syntactic trigrams observed in (4).
(5) What did the penguin eat _what?
The predicted probability of a wh-dependency's syntactic path corresponds to the grammaticality of the dependency, with higher probabilities indicating more grammatical dependencies. These predictions can then be compared to judgments of how allowable different wh-dependencies are.
Specific UG+stats proposals: The UG part
A key component of the SynTrigrams approach is what elements the trigrams are constructed from. In the implementation by Pearl and Sprouse (Reference Pearl and Sprouse2013b) and Pearl and Sprouse (Reference Pearl, Sprouse, Sprouse and Hornstein2013a, Reference Pearl and Sprouse2015), the elements are the phrasal nodes that contain the wh-dependency. How the child determines what these nodes are (e.g., the label CPthat for a CP headed by the complementizer that, or the label VP) is currently unknown. It could be that this kind of phrasal structure representation requires the child to rely on innate, language-specific knowledge; if so, this would be UG knowledge.
Specific UG+stats proposals: The statistics part
The SynTrigrams learner relies on tracking the frequencies of syntactic trigrams, converting these frequencies to probabilities, and combining these probabilities into a single probability for any wh-dependency's syntactic path.
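To make the mechanics concrete, here is a minimal sketch of a SynTrigrams-style calculation: count container-node trigrams from the syntactic paths of input wh-dependencies, convert those counts to smoothed probabilities, and score a new dependency's path as the product of its trigram probabilities. The container-node labels, example paths, and add-one smoothing below are illustrative assumptions, not the exact implementation of Pearl and Sprouse.

```python
from collections import Counter
from math import log, exp

def trigrams(path):
    """Break a container-node path (with start/end markers) into trigrams."""
    padded = ["start"] + list(path) + ["end"]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]

# Hypothetical container-node paths for wh-dependencies in the input
# (labels are placeholders standing in for the real syntactic annotations).
input_paths = [
    ["IP"],                              # What _ happened?
    ["IP", "VP"],                        # What did the penguin eat _?
    ["IP", "VP", "CPthat", "IP", "VP"],  # What did you think that she ate _?
] * 100  # pretend we saw many such dependencies

counts = Counter(t for path in input_paths for t in trigrams(path))
total = sum(counts.values())
vocab = len(counts) + 1  # +1 slot for unseen trigrams (add-one smoothing)

def logprob(path):
    """Smoothed log probability of a wh-dependency's syntactic path."""
    return sum(log((counts[t] + 1) / (total + vocab)) for t in trigrams(path))

licit = ["IP", "VP", "CPthat", "IP", "VP"]      # long-distance, no island
island = ["IP", "VP", "CPwhether", "IP", "VP"]  # crosses a whether island
print("licit:", exp(logprob(licit)))
print("island:", exp(logprob(island)))  # lower probability: less acceptable
```

The island-crossing path contains trigrams never seen in the input, so its smoothed probability comes out much lower, which is how lower probability gets linked to lower acceptability.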
3.6.2 Predictions made: Syntactic islands
The SynTrigrams learner of Pearl and Sprouse (Reference Pearl, Sprouse, Sprouse and Hornstein2013a) and Pearl and Sprouse (Reference Pearl and Sprouse2013b, Reference Pearl and Sprouse2015) learned from a realistic sample of English child-directed speech, estimated syntactic trigram probabilities from that sample, and then generated probabilities for a specific set of wh-dependencies that previous work (Sprouse, Wagers & Phillips, Reference Sprouse, Wagers and Phillips2012) had collected acceptability judgments for. More specifically, Sprouse et al. (Reference Sprouse, Wagers and Phillips2012) had judgments about the relative acceptability of the four syntactic island types in (3b)–(3e), as well as control wh-dependencies that varied with respect to their syntactic path. These judgments served as a target for the SynTrigrams learner and allowed for the following specific prediction.
Predictions made: Specific prediction
If the SynTrigrams learner can generate the same relative judgment pattern (based on the probability the learner calculated for each wh-dependency), then we can conclude that the modeled learner has internalized a representation similar to the one humans used to generate their judgments. If instead the SynTrigrams learner fails to generate the same relative judgment pattern for these wh-dependencies, then we conclude that the representation it internalized isn't similar enough to the one humans use.
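A sketch of how that comparison could be run in practice: rank the tested wh-dependency types by the learner's probabilities and check whether that ordering (or a rank correlation) matches the ordering of the human acceptability judgments. The model and judgment values below are placeholders, not the actual learner output or the Sprouse et al. data.

```python
# Hypothetical model log-probabilities and mean human acceptability
# (z-scored) for a set of tested wh-dependency types.
items = {
    "matrix question":          {"model": -2.1,  "human": 1.10},
    "long-distance, no island": {"model": -6.3,  "human": 0.65},
    "whether island":           {"model": -14.8, "human": -0.70},
    "complex NP island":        {"model": -15.2, "human": -0.85},
}

def ranks(values):
    order = sorted(values, key=values.get, reverse=True)
    return {k: i for i, k in enumerate(order)}

model_ranks = ranks({k: v["model"] for k, v in items.items()})
human_ranks = ranks({k: v["human"] for k, v in items.items()})

# Spearman rank correlation, computed by hand (no library needed here).
n = len(items)
d2 = sum((model_ranks[k] - human_ranks[k]) ** 2 for k in items)
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))
print("model ranking matches human ranking:", model_ranks == human_ranks)
print("Spearman rho:", rho)
```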
3.6.3 Prediction evaluation: Syntactic islands
The SynTrigrams learner of Pearl and Sprouse (Reference Pearl, Sprouse, Sprouse and Hornstein2013a) and Pearl and Sprouse (Reference Pearl and Sprouse2013b, Reference Pearl and Sprouse2015) was in fact able to replicate the observed judgment pattern that indicated knowledge of the four syntactic islands investigated by Sprouse et al. (Reference Sprouse, Wagers and Phillips2012). This suggests that the learning strategy of the SynTrigrams learner is a plausible way for English children to acquire knowledge of these islands. What remains to be investigated is how well this learning strategy fares cross-linguistically, as there's variation in the syntactic islands that languages seem to have (even among the four in (3b)–(3e)). For instance, Italian and Spanish seem to have complex NP islands but not wh-islands (Rizzi, Reference Rizzi and Rizzi1982; Torrego, Reference Torrego1984); can this SynTrigrams learner yield the appropriate adult judgment pattern after learning from Italian or Spanish child-directed speech? Moreover, there are other types of wh-dependency constraints (e.g., see the discussion in Pearl and Sprouse, Reference Pearl, Sprouse, Sprouse and Hornstein2013a about wh-dependencies with multiple gaps), and it's unknown whether a SynTrigrams strategy can handle these cases as well.
4 Conclusion
I've reviewed several UG+stats approaches to the acquisition of different specific morphology and syntax phenomena, with the idea that these approaches make the developmental theories they implement concrete enough to evaluate. What nearly all these approaches have in common is that the UG part helps determine what the child counts from the vast array of information available in the input, while the statistics part determines both how the counting is in fact done and how the counts are used to update the child's hypotheses about her language's morphology or syntax. Importantly, these UG+stats proposals have been specified in enough detail to make specific predictions about children's acquisition, which can then be evaluated against empirical data that are already available or that can be obtained in the future. In the cases I discussed, the predictions of the UG+stats proposals have generally held up – this suggests that these proposals are worth pursuing more fully, and I've also suggested possibilities for future exploration (often looking cross-linguistically or at related morphology or syntax phenomena). With this in hand, I hope we can continue making progress from the UG+stats perspective on understanding how children learn all the things they do about morphology and syntax.