1. Introduction
There have been notable efforts to develop different methods for studying language acquisition in children. These include naturalistic/spontaneous diaries, corpora, elicited production, and imitation for language production, and the preferential looking paradigm, act-out tasks, and truth value judgement tasks for language comprehension. Cross-sectional experiments in which several children are tested at specific ages are useful to see if they have acquired a particular grammatical structure at a given age. This type of data is extremely useful for providing norms of typical development. However, it does not address one of the field’s goals, which is to understand how a given child’s knowledge of language evolves over time. Longitudinal case studies and corpora can provide detailed, fine-grained data on the transition from one stage of generalisation to the next and can also reveal individual variation in the process of learning (Demuth, 2008). This paper focuses on this rich data source, the corpus, in which the natural interactions between interlocutors can be used to test different hypotheses and theoretical claims. Brown (1973) pioneered the use of child–adult interactions to assess various theories and hypotheses related to language acquisition. Since then, there has been a substantial increase in the construction and utilisation of corpora in numerous languages. Linguistic databases are particularly instrumental in investigating the acquisition of lexical, syntactic, and discourse knowledge.
Corpora derived from formal, written adult data are inadequate for drawing any conclusions about the input children draw on to acquire language, although they may provide a measure of adult grammar. Conversely, corpora of child-directed speech (CDS) play a crucial role in examining the characteristics of the input that children receive. Therefore, there is a pressing need for corpora containing child and CDS data for various languages and language varieties (Dash & Arulmozi, 2018; Demuth, 2008).
Despite their advantages, corpora are not without challenges. A language corpus cannot fully encompass the limitless variations in language usage by its speakers across diverse situations and contexts (Dash & Arulmozi, 2018). Many existing child corpora are limited by the inclusion of data from a small number of children, and they may be biased by the contexts in which they are collected (Soderstrom, 2007). Small sample sizes and lack of diversity in some CDS corpora can hinder the generalisability of findings (Lieven, 2010). CDS corpora recorded in laboratory settings can lack the naturalness of everyday interaction, which may lead to skewed representations of language use (Snow, 1995). Furthermore, corpora may misrepresent the grammatical knowledge of children or fail to capture grammatical well-formedness accurately (Demuth, 2008).
In spite of these limitations, corpora remain a necessary source of information on child language development. While no corpus can fully represent the vast variability of language in real-world contexts, CDS corpora provide invaluable insights into the linguistic input that shapes children’s language acquisition. A notable database in the field of child language acquisition is CHILDES (MacWhinney, 1996, 2000; MacWhinney & Snow, 1990), a comprehensive resource comprising naturalistic data from over 28 languages, totalling around 44 million spoken words. Developed over the years, CHILDES gathers spontaneous speech from diverse languages, from English to Chinese, from the Romance languages to the Germanic and Japanese. The database, built with the contribution of researchers conducting their own studies, predominantly features adult–child interactions (Corrigan, 2012).
Researchers interested in the acquisition of Arabic have so far had some resources at their disposal, although all of them focus on adult varieties. For example, there is one morphologically annotated corpus available for Palestinian Arabic, the Curras Notebook (Jarrar et al., 2017). It consists of 56,000 tokens from written sources. However, to the best of our knowledge, no child corpus is available for any Arabic variety except Egyptian Arabic (the Salama corpus, which can be found on the CHILDES platform).
The lack of a Palestinian Arabic child corpus motivated us to build one for child speech and CDS. The aims of this paper are twofold: to present the new corpus of child speech and CDS for Palestinian Arabic, and to conduct a first analysis of the spontaneous productions of children and compare them to those of adults. The hypothesis we consider is whether morphosyntactic features develop as early as Very Early Parameter Setting (Wexler, 1998) predicts (see also Hoekstra & Hyams, 1998). In particular, we investigate subject–verb agreement, the presence of null subjects, and word order alternations. This research is, therefore, part of the collective effort to outline what very young children have established about their target grammar by the time of their first syntactic productions.
2. Method
The corpus presented in this paper consists of the transcripts of the early acquisition of Palestinian Arabic based on the recordings of child–adult interactions collected at different sites (Hebron, Taybeh, Tulkarm, Jenin, Nablus, Qalqilya, Ramallah, Jaljuliya) so that the corpus constitutes a representative sample of Palestinian Arabic. The recordings include the spontaneous productions of 16 healthy monolingual Palestinian Arabic-speaking children aged between 19 and 58 months at the time of recording (mean age in months = 36; 50% females) and those of their adult interlocutors (mean age in years = 28; 84% males). None of the children recorded were premature, had a hearing impairment or had any other health issues.
To obtain the data, families were recruited within the personal and professional networks of the first author. Parents, having signed informed consent forms, agreed to record 30-minute face-to-face spontaneous interactions with their child every 2 weeks. Adults were encouraged to engage in a variety of play activities, ask open-ended questions, and discuss life events with their children to promote active conversation. These interactions were recorded between February and August 2021 using Apple iPhones. The smartphone was placed in the room where the recording was taking place, with children moving around it within a range of 3 m. Adult participants provided additional details for each recording, including which people were present in each recording and where the recording took place (e.g., the child’s home, the grandparents’ house). Audio files were sent to the authors by an adult family member via WeTransfer. Only recordings with clear and high-quality sound were considered, resulting in the exclusion of one recording.
Under the supervision of the authors and two speech-language pathologists based in Palestine, 59 recordings were obtained from the 16 families who agreed to participate. Each child was recorded 2–5 times, resulting in a total duration of 1,387 minutes of recordings. The mean recording duration was 23.52 minutes, ranging from 7 to 35 minutes. While the target recording duration was 30 minutes, some sessions ended up being as short as 7 minutes due to the children’s lack of cooperation. In those instances, the children lost interest and disengaged from the activity with their parents. However, since these shorter recordings still captured valuable natural interactions in Palestinian Arabic, we decided to include them in our analysis.
This resulted in 9,285 utterances of child speech and 10,496 utterances of CDS. Table 1 presents the characteristics of the children (identified by number), the number of their productions, and their mean length of utterance in words (MLUw). MLUw was calculated manually over the first 100 spontaneous utterances by each child. The characteristics of the adults interacting with the children are shown in Table 2.
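The MLUw measure described above can be illustrated with a minimal Python sketch. It is not the authors' procedure (they counted manually): the function name `mluw`, the whitespace tokenisation, and the decision to skip unintelligible material (`xxx`) are assumptions made for illustration, and the toy utterances are invented from romanised words cited in this paper.

```python
def mluw(utterances, sample_size=100):
    """Mean length of utterance in words over the first `sample_size` utterances.

    Assumptions (not taken from the paper): words are separated by whitespace,
    and unintelligible material transcribed as "xxx" is not counted.
    """
    lengths = []
    for utterance in utterances[:sample_size]:
        words = [w for w in utterance.split() if w != "xxx"]
        if words:  # skip utterances with no intelligible words
            lengths.append(len(words))
    if not lengths:
        raise ValueError("no intelligible utterances in the sample")
    return sum(lengths) / len(lengths)


# Toy input, invented for illustration only.
toy_sample = ["biddi bando:ra", "bando:ra", "xxx", "la:zim biddi bando:ra"]
print(mluw(toy_sample))  # (2 + 1 + 3) / 3 = 2.0
```

Restricting the count to the first 100 utterances keeps the measure comparable across children who produced very different amounts of speech.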
Table 1. Characteristics of children’s recordings and production

Table 2. Characteristics of the adults’ recordings

Notice that the MLU of two children, 3 and 16, was noticeably lower than that of their age peers (the MLU of 3 was between 1.67 and 1.77, and that of 16 between 1.88 and 1.92). Under closer scrutiny, the productions of 16 became similar to those of his peers when a larger sample of his productions was taken into account. The productions of 3, under the same method, remained low for his age, and therefore one might consider the possibility that he is affected by language delay or even disorder (Footnote 1).
2.1. Coding
Collected recordings were transcribed in Arabic script as well as in a romanised system. The transcription and coding of the data of both children and adults followed the same method. Unintelligible utterances were transcribed as xxx, while incomplete words were transcribed with the omitted part in parentheses, as in (ban)do:ra for bando:ra “tomato.” Furthermore, special markers were used to indicate special forms of speech, such as @d for dialect form, @f for family form, @s$n for second-language form, and @si for singing (MacWhinney, 2000).
The corpus created, named the [Nazzal] corpus, was manually transcribed following the CHILDES manual (MacWhinney, 2000) and checked using the CLAN software. The corpus can be found under the subdirectory of Arabic in the Other directory. Only the parents of 11 of the participants reported above gave permission for their transcripts to be part of the CHILDES platform, whereas the parents of the other five did not, because personal family issues were discussed in the recordings. The children and parents whose interactions are currently available online in CHILDES are detailed in Appendix Tables A1 and A2.
The totality of the recordings (that is, those detailed in Tables 1 and 2) was used for the purposes of the morphosyntactic analysis presented in the remainder of this paper. A total of 3,370 child utterances and 5,681 adult utterances were included in the analysis reported. Single-word utterances, including yes/no answers, and utterances such as recitations of the months of the year or of a number series were excluded, as were passive sentences. When utterances consisted of more than one clause, each clause was analysed separately for the purposes of the analysis of word order and subject production.
Furthermore, for the purposes of the analysis of the productions, and to establish if there was any change in the children’s productions over the course of development, the children’s productions were divided into three age groups: 19–26 months (mean age in months = 22.7; 50% females), 29–37 months (mean age in months = 33.5; 100% females), and 41–58 months (mean age in months = 48.6; 14% females). These groupings allowed for roughly equal intervals between age ranges and age groups of similar length. In addition, this division provided the most balanced distribution of participants within each group. This grouping meant that some age groups were not represented in the sample: there is a gap between 26 and 29 months and another between 37 and 41 months. Lacking information on these two periods does not seem to compromise the validity of the results, as argued in the next section.
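As a small illustration of the grouping just described, the following Python sketch assigns an age in months to one of the three age groups and returns `None` for ages that fall in the unsampled gaps. The labels `youngest`, `middle`, and `oldest` and the function name `age_group` are hypothetical; only the month ranges come from the text.

```python
# Age ranges in months, as reported in the text; the labels are illustrative.
AGE_GROUPS = {
    "youngest": (19, 26),
    "middle": (29, 37),
    "oldest": (41, 58),
}


def age_group(age_months):
    """Return the age-group label for an age in months, or None if unsampled."""
    for label, (low, high) in AGE_GROUPS.items():
        if low <= age_months <= high:
            return label
    return None  # e.g. 27-28 or 38-40 months, which no recording covers


print(age_group(22))  # youngest
print(age_group(27))  # None
```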
3. Results
In this section, we present a quantitative study of the productions of children and their adult interlocutors in the corpus (including all the children and adults reported, not only those whose transcripts appear in CHILDES). Our goal was to consider the macroparameters that characterise Palestinian Arabic; we focused on the presence of null subjects, subject–verb agreement, and word order. We included in our analysis declarative, imperative, and interrogative clauses that contain a verb, modal, or auxiliary verb, as well as verbless copular sentences in the present tense, where “be” is phonetically null and the predicate can be a noun phrase, an adjective phrase, or a prepositional phrase (Aoun et al., 2009; Benmamoun, 2000). No difference in word order was found when the subject or the object appeared as a noun or as a pronoun, and for that reason they were not coded differently. Therefore, “S” denotes a subject and “O” denotes an object (direct or indirect, including reflexives), and each may refer to full determiner phrases, proper names, and pronouns. Since the analysis focuses on word order, wh- elements are coded as “S” or “O” according to their function. Finally, the term auxiliary verb (“aux”) is used to refer both to auxiliary verbs, such as kana “was” and sʕa:ra “become,” and to modal verbs, such as biddi “I want,” la:zim “have to,” yimken “may, could,” and baqdar “can,” which are considered modal verbs in Palestinian Arabic (Alharbi, 2002; Aoun et al., 2009).
3.1. Null subject
We measured the incidence of null and overt subjects in the productions of children and adults; they are exemplified in (1). Since imperatives consistently present null subjects, only declarative and interrogative clauses, along with verbless copular sentences, were included in the analysis.

The results appear in Table 3.
Table 3. Distribution of overt and null subjects by children and adults

We calculated the proportion of null subjects over all (null and overt) subjects in the speech samples of the four age groups (three groups of children, 19–26, 29–37, and 41–58 months, and one group of adults). A Wilcoxon Signed Ranks Test was used to assess whether the performance of each of the three child groups differed significantly from that of adults. The results showed no significant difference between adults and the youngest age group (Z = .365, p = .715), the middle age group (Z = −.135, p = .893), or the oldest age group (Z = −.338, p = .735). To compare the three child age groups (youngest, middle, and oldest) with one another, a one-way analysis of variance (ANOVA) was conducted. The dependent variable was the performance score (here, the proportion of null subjects), and the independent variable was age group. The ANOVA revealed no statistically significant difference in performance across the three age groups, F(2,13) = .475, p = .632. Although the overall ANOVA was not significant, a post hoc test (Tukey’s HSD) was performed to further examine pairwise comparisons among the groups; it confirmed that none of the pairwise comparisons reached statistical significance (p > .05 for all comparisons). These findings indicate that age does not have a significant effect on performance in this dataset.
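For readers who want to see how the F statistic and its degrees of freedom arise, the following Python sketch computes a one-way ANOVA by hand. It is an illustration under stated assumptions, not the authors' analysis: the per-child proportions are invented, the group sizes (4, 5, and 7) were chosen only so that they sum to the 16 children and reproduce the degrees of freedom (2, 13), and the function name `one_way_anova` is hypothetical. Obtaining a p-value would additionally require the F distribution, for instance from a statistics library.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    scores = [x for group in groups for x in group]
    grand_mean = sum(scores) / len(scores)
    ss_between = 0.0  # variation of group means around the grand mean
    ss_within = 0.0   # variation of scores around their own group mean
    for group in groups:
        group_mean = sum(group) / len(group)
        ss_between += len(group) * (group_mean - grand_mean) ** 2
        ss_within += sum((x - group_mean) ** 2 for x in group)
    df_between = len(groups) - 1
    df_within = len(scores) - len(groups)
    if ss_within == 0:
        raise ValueError("no within-group variance; F is undefined")
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Invented per-child proportions of null subjects (NOT the corpus data).
youngest = [0.60, 0.70, 0.65, 0.75]
middle = [0.70, 0.62, 0.68, 0.72, 0.66]
oldest = [0.64, 0.71, 0.69, 0.73, 0.67, 0.70, 0.66]

f_stat, df_b, df_w = one_way_anova([youngest, middle, oldest])
print(f"F({df_b},{df_w}) = {f_stat:.3f}")  # the degrees of freedom are (2, 13)
```

With three groups and 16 children, the degrees of freedom are 3 − 1 = 2 between groups and 16 − 3 = 13 within groups, which matches the F(2,13) reported above.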
3.2. Subject–verb agreement
We then considered subject–verb agreement, which appears in the form of discontinuous morphology in Palestinian Arabic, as in the Semitic languages in general; agreement was considered for all inflected forms, both main verbs and auxiliaries. The results appear in Table 4. Children in the youngest age group (19–26 months) produced a noticeably higher rate of agreement errors (9.3%) than the two older groups (1.7% for 29–37 months and 0.8% for 41–58 months). Most of these errors involved incorrect marking of number (29 out of 30), with only one person error and no gender errors, for reasons that remain for future research.
Table 4. Subject–verb agreement errors in child production

Notice that copular sentences with ka:n “be” present the overt form of the verb only in some tenses, but not in the present tense (Aoun et al., 2009; Benmamoun, 2000). According to our counts, children produced copular sentences with kana “was” or raħ “will” in the past and future tenses but never in the present tense. A total of 376 copular sentences were produced by children: 260 (69%) in the present tense, all with null “be,” and 116 in the past and future tenses, all with overt “be.” No errors were found. Examples of child production are given in (2a) and (2b). For adults, 524 copular sentences were found, of which 399 (76%) presented null “be” in the present tense.

3.3. Word order
The frequency of different word orders was manually analysed; declarative, imperative, and interrogative clauses were coded (with a total of 5,634 sentences for adults and 3,298 for children). No variation in word order structure was observed whether the subject or object was expressed as a noun or a pronoun, as mentioned above. For sentences with an overt subject and an overt object, SVO (3a) was the predominant order in adult (71.1% of sentences) and child production (91.55% of sentences), while VSO accounted for 2.39% of sentences in adults and 4.21% of sentences in children. Non-canonical but grammatical word orders OVS, OSV, VOCliticS, VOS, and OSVOClitic (3b) represented 26.51% of adult production and 4.24% of child production.
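The tallying behind these percentages can be sketched in a few lines of Python. The authors coded and counted word orders manually; the sketch below only illustrates the arithmetic, assuming each clause has already been hand-coded with a label such as `SVO`. The function name `word_order_shares` and the toy list of labels are hypothetical.

```python
from collections import Counter


def word_order_shares(coded_clauses):
    """Map each word-order label to its percentage of all coded clauses."""
    counts = Counter(coded_clauses)
    total = sum(counts.values())
    # most_common() lists labels from the most to the least frequent.
    return {order: 100 * n / total for order, n in counts.most_common()}


# Toy labels, invented for illustration (NOT the corpus counts).
toy_codes = ["SVO"] * 7 + ["VSO"] * 1 + ["OVS"] * 2
print(word_order_shares(toy_codes))  # {'SVO': 70.0, 'OVS': 20.0, 'VSO': 10.0}
```

Keeping the labels as plain strings means that orders with clitics or auxiliaries (for example `VOCliticS`) can be tallied in the same way, without changing the function.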

Word order distribution for adults and children is presented in Table 5 (sentences with overt S and O) and Table 6 (sentences with null arguments included). (The order of presentation of different word orders is based on their incidence in adult production.)
Table 5. Frequency and percentage of sentences with S, V, and O in children and adults

Abbreviations: V = Verb, S = Subject, O = Direct or Indirect Object, PP = Prepositional Phrase, Aux.v = Auxiliary Verb.
Table 6. Frequency and percentage of different word orders, children and adults

Abbreviations: V = Verb, S = Subject, O = Direct or Indirect Object, PP = Prepositional Phrase, Aux.v = Auxiliary Verb, Pred = Predicate.
In total, adults exhibited 50 different word orders, while children produced 45.
We found consistent word order structures in children and adults, with no ill-formed sequences in either case (Footnote 2). Following a reviewer’s suggestion, we calculated the frequency of occurrence of SV versus VS structures for the three age groups separately to explore the developmental trajectory of children. The results confirmed that SV was the predominant structure across all three age groups of children, with the mean frequencies as follows: youngest group (19–26 months) 0.81, middle group (29–37 months) 0.87, and oldest group (41–58 months) 0.78. Similarly, the adult group also showed a mean of 0.78 for SV.
As may be observed, sentences with more than one verb (VV, VVO, etc.), known as serial verb constructions (Altakhaineh & Zibin, 2017; Hussein, 1990), were found in the corpus (see (5)). Serial verb constructions were found in 4.53% of adults’ sentences (featuring various word orders: VV, VVO, VV PP, SVV, SVVO, Aux VVO, VVS, and VVV) and 2.06% of child sentences (from 8 children, with an age range of 23 to 56 months); the first occurrence was found at 23 months.

4. Discussion
In this study, we examined several grammatical features in the speech of Palestinian Arabic-speaking children across different age groups. Overall, child production in the domains investigated shows a strong tendency towards adult-like patterns, though some variability remains, particularly in the youngest age group. The results show a clear developmental trajectory in subject–verb agreement and word order. The youngest group (19–26 months) had a higher error rate in agreement (9.3%), particularly with number, compared to the older groups (1.7% and 0.8%). This is the one domain where the youngest group is barely above 90% adult-like performance. SV word order was consistently used across age groups, with the youngest group (0.81) already aligning with the adult rate (0.78). Serial verb constructions, though more frequent in adults (4.53%), were also present in children (2.06%), and attested from 23 months of age.
Palestinian Arabic is a null subject language and, as such, it allows for phonetically null subjects when the discourse context allows speakers to retrieve the relevant information (Albirini et al., 2011; Kenstowicz, 1989; Rizzi, 1982). We have shown that children produce null subjects at the same rate as adults do. This is consistent with previous studies in other null subject languages. For Romance, the results of null subject production are 62%, 70%, and 67% for Catalan (Bel, 2003; Cabré-Sans & Gavarró, 2006), Italian (Lorusso et al., 2005), and Spanish (Bel, 2003), respectively; child null subject rates are not significantly different from them. A Yemeni Ibbi Arabic study by Qasem (2020) reported 86–87% null subjects in children’s production, although no results were given for adult production. These studies converge on the idea that the null subject parameter is set very early (Wexler, 1998).
Palestinian Arabic presents person, number, and gender agreement, as exemplified in (3a) above. The error rate in the production of subject–verb agreement in children was 1.92% for the age range of 19 to 58 months. The near-perfect subject–verb agreement observed further attests to the early mastery of agreement morphology in Palestinian Arabic. Given the non-concatenative nature of Arabic morphology, this challenges any claim that morphological complexity hinders early language development (Dromi et al., 1999): the non-concatenative character of Arabic morphology and its complexities (Taha et al., 2021) are no obstacle to early attainment. The absence of “be” in copular sentences in the present tense among Palestinian Arabic-speaking children aligns with Schütze’s (2004) claim that children have early command of the realisation of Tense, since these children do not insert a copula where it is not present in the adult language, and systematically insert it when it is required.
Regarding word order, the current study found SVO to be the predominant order in Palestinian Arabic speech production, whether adult or child. This observation is in line with Benmamoun (1997), Shlonsky (1997), Mohammad (2000), and Saiegh-Haddad (2003), who assert that SVO is the default word order in Palestinian Arabic, while VSO is the basic word order in Standard Arabic. Similar word order preferences, favouring SVO, were identified in various spoken Arabic varieties, such as Jordanian (El-Yasin, 1985), Egyptian (Albirini et al., 2011), and Moroccan (Announi, 2021).
In contrast to our findings, Friedmann and Costa (2011) found a preference for VS order as opposed to SV in their study of child Palestinian Arabic, using a repetition task with 20 children aged 1;9 to 3;0. Similarly, Khamis-Dakwar (2011) found a preference for VSO as opposed to SVO in another repetition task run with Palestinian Arabic-speaking children in the same age range. The source of this contrast may lie in the methods used in those two studies; in particular, the fact that children chose to change the word order in the repetition tasks may indicate that the discourse setting invited one word order over another. The discrepancy remains for future research. On the other hand, the current study’s findings are in line with Abboud et al.’s (2022) research on Lebanese Arabic-speaking children, indicating simultaneous emergence of SV and VS orders. Overall, children’s spontaneous production indicates knowledge of numerous word order alternations, with no deviant word orders attested. These word order alternations imply recourse to various syntactic operations (wh-movement in questions, object dislocations).
Serial verb constructions have received no attention in the literature on the acquisition of Arabic. These are sentences where multiple verb forms appear consecutively in a single clause, denoting a complex action or event (Altakhaineh & Zibin, 2017; Hussein, 1990). The absence of research on the acquisition of serial verb constructions in Arabic leaves an open avenue for future investigation.
The results of the analysis of some of the core properties of Palestinian Arabic in the children’s early productions align with the predictions of Very Early Parameter Setting (Wexler, 1998) or Early Morphosyntactic Convergence (Hoekstra & Hyams, 1998). The grammatical phenomena examined range from the production of null subjects and subject–verb agreement to word order distribution, absence/presence of copular “be,” and production of serial verbs. In this last case, the findings may be nearly anecdotal, whereas for null subjects and agreement, naturalistic data provide abundant evidence for grammatical acquisition. Moreover, when we group the children into three age subgroups, we find that early production does not differ from that of the older children (with the exception of subject–verb agreement errors, which are slightly higher for younger children and may be attributed to the acquisition of morphological exponents). Overall, in our interpretation, our results point to continuity in early development. Other domains in which child grammar is generally agreed to be delayed, such as the passive voice, have not been considered and are left for later work.
The observations so far were possible thanks to the collection of child and adult interactions in a naturalistic setting. The resulting corpus has been made available to the community through the CHILDES platform, serving as a valuable resource for researchers, educators, and practitioners alike. While corpora have their limitations, in the case of child language they provide abundant information on grammatical phenomena. These findings, and others drawn from the corpus, can be used in comparative work with other languages, can serve as a reference in language impairment studies, and can inform experimental design.
Acknowledgements
The authors wish to thank the Palestinian children and their parents for their participation in this research. The authors also thank An-Najah National University (www.najah.edu) and Universitat Autònoma de Barcelona for the technical support provided to publish the present manuscript.
Funding statement
This study received financial support through the project Development and acquisition of preverbal syntax and semantics (DAPSS), PID2022-138413NB-100, Ministerio de Ciencia e Innovación.
Competing interests
The authors declare none.
Appendix
Table A1. Characteristics of the children’s recordings in CHILDES

Table A2. Characteristics of the adults’ recordings in CHILDES
