Introduction
Attention-deficit hyperactivity disorder (ADHD) symptoms are defined by difficulties in the domains of attention and/or hyperactivity/impulsivity (American Psychiatric Association, 2013). The estimated prevalence of ADHD in children is around 5% globally (Sayal et al., 2018), though figures vary, with Thomas et al. (2015) suggesting a prevalence of 7% and Polanczyk et al. (2015) reporting a prevalence of 3.4%. In the UK, prevalence rates in children have been estimated to range from 1.4% (Russell et al., 2014) to 5% (NICE, 2008). However, symptoms are dimensional in nature, impacting individuals at both clinical and sub-clinical levels (Coghill & Sonuga-Barke, 2012; Salum et al., 2014).
While ADHD has been associated with certain strengths (Sedgwick et al., 2019), it has also been linked to poorer outcomes in a range of domains, such as peer, academic, occupational, and addiction issues (Cherkasova et al., 2022; Sedgwick, 2018; Strine et al., 2006). There is growing recognition that outcomes for young people with ADHD can be improved with earlier identification and intervention provision (Arnold et al., 2020; DuPaul et al., 2020; Halperin & Marks, 2019; Shephard et al., 2022; Sonuga-Barke & Halperin, 2010), benefitting young people, their families, and healthcare systems.
Unfortunately, despite advances in this area, it is estimated that approximately 50% of children and adolescents who meet validated diagnostic criteria for ADHD remain undiagnosed (Cuffe et al., 2005; Froehlich et al., 2007; Madsen et al., 2018; Okumura et al., 2019), prompting continued efforts to improve the early identification of ADHD. Accumulating evidence suggests that ADHD symptoms may be predictable based on information available very early in life, which could facilitate earlier identification and intervention. Studies have linked ADHD to a wide range of maternal, sociodemographic, and perinatal variables about which information can be known at or shortly after birth. For example, some evidence has suggested that younger maternal age (Chang et al., 2014), lower socioeconomic status (Russell et al., 2015), membership in some ethnic groups (Bax et al., 2019; Coker et al., 2016; Zilanawala et al., 2018), and birth parity (being first-born) (Marín et al., 2014; Reimelt et al., 2021) may be linked to a greater risk of ADHD.
Maternal health and health behaviors during pregnancy, such as prenatal infection (Hall et al., 2022; Walle et al., 2022), metabolic syndrome (Kwok et al., 2023), instrumental delivery (e.g., forceps or ventouse use) (Ben Amor et al., 2005; Romero et al., 2023), pre-eclampsia (Sun et al., 2020), anemia (Wiegersma et al., 2019), stress (Ronald et al., 2011), mental health (Clements et al., 2015; Speyer et al., 2022), and alcohol use and smoking (He et al., 2020; Langley et al., 2012), have also been linked to a greater risk of ADHD. Finally, birth and early infant outcomes such as prematurity, low birth weight (Franz et al., 2018; Pettersson et al., 2015), and small head circumference (Lahti et al., 2006) have additionally been linked to ADHD. Male infants are also at greater risk of developing ADHD than females (Ramtekkar et al., 2010; Willcutt, 2012).
Building a prediction model utilizing these and other candidate exploratory factors related to ADHD could help identify those who could be prioritized for monitoring and early intervention. These commonly recorded factors in healthcare datasets can provide a highly practical means to gain early information on later ADHD risk.
It is important to note that when used as a means to promote early identification, these factors need not be causal in ADHD (e.g., Sciberras et al., 2017; Thapar et al., 2013). Indeed, it is valuable to distinguish causal and predictive modeling. In fields such as ecology, healthcare, and machine learning, there is a growing discussion about the differences between “causal” and “predictive” modeling, even if these terms are not explicitly used (Arif & MacNeil, 2022; Prosperi et al., 2020; Young, 2019). Causal modeling endeavors to explain “why,” that is, the mechanism behind the relationship between the independent and outcome variables and the reason the latter changes when the former is altered (Arif & MacNeil, 2022; Prosperi et al., 2020; Young, 2019). This technique requires thorough control and analysis of all factors related to the variables of interest (Arif & MacNeil, 2022; Prosperi et al., 2020). In contrast, even “confounders” can be considered valuable predictors in predictive modeling contexts (Young, 2019). Rather than having an explanatory focus, this approach aims to describe correlations and forecast outcomes based on known inputs. Predictions that generalize well to new observations are generally considered robust.
Predictive modeling is significant in identifying potential risk factors in health research contexts (e.g., Ng et al., 2015), while causal modeling is critical for decision-making in clinical settings, such as interventions and validating medication effects (e.g., Almeda et al., 2019; Bica, 2022).
Nearly 90% of studies on ADHD prediction modeling included in a recent review focus on diagnosis (Salazar de Pablo et al., 2024). These studies often emphasize detecting whether an individual has an ADHD diagnosis using longitudinal data or early developmental information (Salazar de Pablo et al., 2024). However, increasing evidence suggests that ADHD symptoms are dimensional (e.g., Marcus & Barry, 2011; Panagiotidi et al., 2024). Additionally, diagnostic cut-offs can vary by context, affecting the identification of preclinical risks (e.g., Harrison & Edwards, 2023; Miyasaka et al., 2018). Research predicting continuous ADHD symptom scores helps address these issues but is limited.
In one recent exception, Dooley et al. (2024) analyzed secondary data from a cohort study to investigate whether 40 pre- or perinatal factors generally known at birth, including pregnancy complications and maternal demographic information, could predict continuous ADHD symptom scores in children aged 9–10. Elastic net regression identified 17 predictors, which collectively explained 8% of ADHD symptom variance. The study found that predictive accuracy varied by income and sex, but suggested that continuous ADHD symptom prediction is possible to an extent from birth. Nevertheless, the study was limited to the US, and the regression model applied was restricted to linear relationships.
Traditional regression assumes linearity, which may not be suitable for prediction problems that involve complex interactions. This is important in the context of ADHD symptom prediction because many studies have suggested that the development of ADHD is multifactorial, involving genetic and environmental factors and their interactions, and the interplay between them is difficult to define accurately (e.g., Faraone & Larsson, 2019; Thapar et al., 2013). Complexity, nonlinearity, and interactive effects are therefore likely to exist in the development of ADHD, and traditional linear regression (LR) has limitations in capturing these. To enhance a predictive model of ADHD, it is preferable to incorporate a wide range of predictors and apply a model that can automatically detect their intricate interactions without manual specification of interaction terms.
Given the potential complexity of relationships between predictive factors and ADHD, machine learning techniques could offer advantages, enhance predictive power relative to regression models, and bypass their restrictive assumptions. Tree-based methods, such as classification and regression trees (CART) and random forest (RF), do not assume additivity and can detect nonlinear relationships, the most salient interactions, and even highly diverse structures without the manual specification required in traditional LR (e.g., Banerjee et al., 2019; Uddin & Lu, 2024). Certain machine learning methods also provide high interpretability, such that a straightforward understanding of the model findings is not sacrificed (Dwyer et al., 2018). A recent study by Garcia-Argibay et al. (2023) utilized registry data based on Sweden’s population, supporting the application of machine learning techniques to large-scale data that provides early-life information. This approach can yield good predictions regarding the diagnosis of ADHD and identify particular early-life risk factors.
Aims
The aims of the current study were to use the UK-based Born in Bradford (BiB) cohort study to examine the overall “predictability” of ADHD from information typically available at birth, and to examine which predictors were the most important.
Methods
Participants
Participants are from the BiB study. BiB was established in 2007 as a longitudinal cohort study examining the multiple factors that impact pregnant individuals’ physical and mental well-being and their children. It is based in Bradford, a city in northern England with an ethnically and socioeconomically diverse population. Approximately half of the mothers in the region are of non-UK origin, primarily South Asian. The cohort study has been found to be approximately representative of the maternal population in Bradford.
The BiB project linked the pregnant individual’s records, obtained during their recruitment while receiving routine procedures at the Bradford Royal Infirmary, with their children’s educational and developmental outcomes through subsequent research. Hence, researchers can use the data to study the relationship between early factors and children’s developmental outcomes (Raynor, 2008; Wright et al., 2013). The current study uses ADHD symptom data from children recruited into the “Starting School” study, who were originally part of the BiB cohort and thus have linked perinatal and ADHD symptom data (Pettinger et al., 2020; Shire et al., 2020).
From 2007 to 2010, 12,453 women and 13,776 children were involved in the complete BiB cohort study. For the current analyses, 2063 cases with complete outcome variable data were derived. For pregnant individuals with multiple pregnancies, we used only the first child.
Mothers
Between March 2007 and November 2010, 12,453 pregnant individuals were recruited at the Bradford Royal Infirmary at 26 to 28 weeks of gestation while receiving routine care. Baseline measures were obtained through interviews and linked to their and their children’s primary and secondary care records. Information on biological, social, economic, educational and general health was collected. In addition to data obtained through interviews, further research was conducted to extract records from maternal paper notes, providing details on antenatal care, delivery notes, and the biological characteristics of newborns, such as gestational age, maternal blood pressure, delivery complications, and infant birth weight (Wright et al., 2013).
Children
A subset of BiB children aged 4–5 took part in the “Starting School” study, which included 94 out of 142 primary schools in Bradford during two consecutive academic years from 2012 to 2014. Overall, 3,444 BiB cohort children participated in “Starting School.” “Starting School” aims to predict children’s physical, mental, and educational development by examining their physical motor, cognitive language, and socio-emotional development via various in-school assessments. Assessments include the Strengths and Difficulties Questionnaire (SDQ). The SDQ was completed once by teachers during each child’s Reception year (the first year of primary school), when children were between 4 and 5 years old. Each child was assessed at a single time point within this age range (Pettinger et al., 2020; Shire et al., 2020). The initial analytic sample comprised 2063 children with complete scores on the Hyperactivity/Inattention (H/I) subscale of the SDQ, which served as the outcome variable in our study. This sample was created by linking the pregnant individual’s data with their first-born child’s biological characteristics at birth, restricted to children with completed H/I SDQ subscale data.
Of the initial analytic sample, 51% of child participants were female, and 48.7% were male (sample size = 2042; missing rate = 1.0%). The mean age of the pregnant individual was 27.3 years (SD = 5.67) (sample size = 1560; missing rate = 24.4%). Among pregnant individuals with ethnicity data (sample size = 1558; missing rate = 24.5%), 51.9% were Pakistani, 36.3% were White British, and 11.8% were from other ethnic groups. Missingness occurred due to incomplete data across a list of predictors. The overall missing rate was around 30%, and missingness in individual predictors ranged from 1 to 84%; the outcome variable had no missing data. Missingness was highest for maternal smoking (84%), alcohol exposure (78%), and children’s cord blood measures other than leptin (70.4%). Full descriptive statistics, including the proportions of missing data for all variables included in the analyses, are provided in Tables 1 and 2.
Table 1. Descriptive statistics of continuous variables

Note. The missing rate for each variable is based on the initial analytical sample of 2063. Details of how the sample of 2063 was derived are described in the main text.
Table 2. Descriptive statistics of categorical variables

Note. The distribution of each level of the categorical variable is provided. The missing rate for each variable is based on the initial analytical sample of 2063. Details of how the sample of 2063 was derived are described in the main text.
Measures
Predictor variables
A set of predictors was prioritized, guided by prior literature both theoretically and empirically linking these factors to an increased risk of ADHD and other neurodevelopmental difficulties (e.g., Chang et al., 2014; He et al., 2020; Speyer et al., 2022). Use in previous predictive modeling and machine learning studies (e.g., Dooley et al., 2024; Garcia-Argibay et al., 2023) was an additional selection criterion. Predictor inclusion was constrained by availability in routine hospital data, birth records, and sub-studies of the BiB study.
Our predictor selection was also guided by the aim of developing a predictive model using early-life risk factors that are commonly observable in routine perinatal data. For example, although male sex and maternal smoking differ in clinical modifiability, they are significant as well as prevalent predictors relevant to ADHD symptom development (e.g., Lawder et al., 2019; Pietersma et al., 2022; Willcutt, 2012). Additionally, biological and psychosocial variables, including socioeconomic status and maternal and infant health indicators, can reflect the multidimensional influences on ADHD and have proven valuable predictors in the machine learning study by Garcia-Argibay et al. (2023).
Based on the above considerations, predictor variables were the pregnant individual’s age, ethnicity, country of birth, marital status, cohabitation status, educational level, socioeconomic position, Index of Multiple Deprivation (IMD), exposure to alcohol and smoking, mental health (General Health Questionnaire; GHQ), metabolic markers and syndrome (pregnant individual’s BMI, HDL, triglycerides, systolic and diastolic blood pressure and fasting glucose levels, existing diabetes and hypertension), maternal infection, and conditions related to adverse pregnancy outcomes. Assistance required during birth (obstetric intervention at birth) was also considered a maternal predictor. Predictors related to the infant were their sex, cord blood biomarkers, birth weight, gestational age at birth, and abdominal and head circumference. Some categorical variables, notably IMD, were recoded into three levels because some levels of the original five-level structure showed zero variance. Definitions and coding methods of specific predictors are available in the Supplementary Materials Tables S1 to S3 (pp. 1–7).
Outcome variable
ADHD symptoms were measured using the Hyperactivity/Inattention (H/I) subscale of the Strengths and Difficulties Questionnaire (SDQ); descriptive information for the H/I SDQ score is available in Table 1. Scores ranged from 0 to 10, with an average of 2.67 (SD = 2.81). The H/I subscale of the SDQ is widely used internationally in a range of contexts, including epidemiological and clinical studies, for assessing children’s (3–16 years old) ADHD symptoms (e.g., Brandt et al., 2021; Carballo et al., 2018). In BiB, the questionnaires were completed by teachers, who were required to have known the child for at least half a year (Shire et al., 2020). The H/I subscale contains five items: “restless, overactive, cannot stay still for long,” “constantly fidgeting or squirming,” “easily distracted, concentration wanders,” “thinks things out before acting,” and “sees tasks through to the end, good attention span.” Each item was rated on a three-point Likert scale (0: Not True, 1: Somewhat True, 2: Certainly True; some items are reverse-scored). The H/I scale has shown good predictive validity, test–retest reliability and internal consistency in previous research (e.g., Algorta et al., 2016; Almeda et al., 2019; Brandt et al., 2021; Carballo et al., 2018; Hall et al., 2019). However, there is debate over the cut-off score for ADHD.
For example, based on a study in Spain, the suggested score is 8 (Carballo et al., 2018), while in the UK, the suggested score has varied. A lower score of ≥ 4 or ≥ 5 is suggested for youth or younger adults, and a score of ≥ 7 or ≥ 8 for children (Bryant et al., 2020; Riglin et al., 2021; Ullebø et al., 2011). Recommendations are also provided by the test developers: http://www.sdqinfo.org/. Given ongoing uncertainty regarding optimal cut points and the fact that ADHD symptoms are dimensional, we did not introduce a cut-off score and instead analyzed the scores on a continuous scale.
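The scoring rule described above can be sketched as follows. This is an illustration only, not the scoring code used in BiB: the item order follows the list in the text, and treating the last two (positively worded) items as the reverse-scored ones is an assumption based on standard SDQ scoring. The function name and example ratings are hypothetical.

```python
# Items in the order listed in the text. The last two are positively worded
# and are assumed to be the reverse-scored items (standard SDQ convention).
REVERSED = (False, False, False, True, True)


def hi_score(ratings):
    """Sum five teacher ratings (each 0, 1, or 2) into a 0-10 H/I score."""
    if len(ratings) != 5 or any(r not in (0, 1, 2) for r in ratings):
        raise ValueError("expected five ratings, each 0, 1, or 2")
    # A reverse-scored item contributes 2 - rating instead of rating.
    return sum(2 - r if rev else r for r, rev in zip(ratings, REVERSED))


# Hypothetical child: 2 + 1 + 2 + (2 - 0) + (2 - 1) = 8
example = hi_score([2, 1, 2, 0, 1])
```

Keeping the result as a 0 to 10 integer, rather than thresholding it, mirrors the continuous treatment of the outcome adopted in this study.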
Table 3. Multiple LR results

Note. Reference categories were as follows: ethnicity: other, for example, Asian; marital status: unmarried; cohabitation status: not living with a partner; education level: equal to or higher than A levels; socioeconomic status: higher dependency or financial difficulty; metabolic syndrome: no; assistance required during birth: no; infant sex: male; maternal smoking: no; maternal alcohol: no; maternal infection: no; conditions associated with adverse pregnancy outcomes: no.
The H/I SDQ total score used in the main analyses was complete for all 2063 children in the analytic sample, so no imputation of the outcome was required. Internal consistency of the H/I SDQ score was assessed using item-level data from the same analytic sample (using the “psych” R package; Revelle, 2023). Because some individual H/I SDQ items had missing data, multiple imputation was conducted on the item-level data before assessing internal consistency. Cronbach’s alpha was .89, indicating excellent internal reliability.
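For readers unfamiliar with the statistic, Cronbach’s alpha for k items is k/(k − 1) × (1 − Σ item variances / variance of the total score). A minimal sketch of that formula is given below; it assumes complete item-level data and uses made-up ratings, and it is not the “psych” package implementation used in the study.

```python
import statistics


def cronbach_alpha(rows):
    """rows: one list of item scores per respondent (complete data assumed)."""
    k = len(rows[0])
    # Sample variance of each item across respondents.
    item_vars = [statistics.variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of the summed (total) score.
    total_var = statistics.variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings on five items for five children (illustration only).
ratings = [
    [2, 2, 1, 2, 2],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 2],
    [2, 1, 2, 2, 2],
    [0, 0, 1, 0, 0],
]
alpha = cronbach_alpha(ratings)
```

Alpha approaches 1 when the items rise and fall together across respondents, which is why it is read as a measure of internal consistency.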
Analysis
Statistical analyses included correlation, unadjusted LR models, multiple LR models, CART, and RF models.
Correlation and unadjusted regressions were used for descriptive purposes to show the “raw” associations between each predictor and the outcome. Correlation analysis also allowed us to identify potentially problematic levels of multicollinearity. For predictive analyses, LR was included because of its interpretability. Further, because it is widely used for prediction, it provides a useful baseline against which machine learning methods can be compared. Given the advantages discussed earlier, machine learning methods were also employed. Both CART and RF were used because of their complementary strengths and weaknesses. CART provides higher interpretability because it involves fitting only a single tree; however, RF is an ensemble method and has the associated advantages of fitting and aggregating multiple trees.
Multiple imputation using chained equations (MICE) was used to deal with missing data, with a single imputed dataset analyzed due to the complexities (and computational intensity) of combining multiple imputation with RF. Missingness diagnosis and the application of MICE were performed in accordance with Newman’s (2014) guidelines. Little’s MCAR (Missing Completely At Random) test was conducted using the “misty” R package (Yanagida, 2024) and indicated that the data were not MCAR (χ² = 13,793.39, df = 10,431, p < .001). The use of multiple imputation is based on an assumption of “missing at random,” meaning that the missingness can be predicted based on modeled data. Given that we had relatively comprehensive baseline data and in the absence of any strong reason to assume that the data were subject to a missing not at random (MNAR) mechanism, we judged this assumption to be reasonable. The data distribution before and after imputation is provided in Supplementary Materials Tables S4 to S5 (pp. 8–9).
All continuous predictor variables, including the pregnant individual’s age, GHQ total score (mental health), the infant’s gestational age at birth, cord blood biomarkers, and other information, such as birth weight recorded after birth, were standardized by z-standardization using the scale() function in R prior to initial analysis (i.e., they were rescaled to have Mean = 0, SD = 1).
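As an illustration of this rescaling, the sketch below reproduces in Python what R’s scale() does by default for a single variable: subtract the mean and divide by the sample standard deviation (n − 1 denominator). The maternal ages are invented for the example.

```python
import statistics


def z_standardize(values):
    """Rescale values to mean 0 and SD 1 using the sample SD (n - 1)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD, matching R's sd()
    return [(v - mean) / sd for v in values]


# Hypothetical maternal ages in years (illustration only).
ages = [22.0, 27.0, 31.0, 35.0, 25.0]
z_ages = z_standardize(ages)
```

After this transformation a regression coefficient is interpreted per one standard deviation of the predictor, which makes coefficients for variables measured in different units comparable.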
Initial analysis: correlation and unadjusted regression model
Before analyzing the various predictors’ predictive capabilities, initial analyses including correlation and unadjusted LR models were conducted to evaluate the basic relationships between the variables and to inform the selection of predictors for later analysis by identifying highly collinear predictors that may present issues later.
The “hetcor” function in the “polycor” R package was used to calculate a mixed (Pearson, polyserial, and polychoric) correlation matrix for 136 pairs of variables, excluding nominal variables. An unadjusted LR model was run for each of the 25 predictor variables using the lm() function in R, which handles missing values by conducting a complete-case analysis.
Predictive analyses
The study employed three types of predictive analysis using a single imputed dataset: multiple LR, a CART, and RF.
There are a few advantages in applying multiple LR in our study. First, it is well recognized as an interpretable method in explanatory and predictive research. Second, the LR model can still be considered pragmatic even when the residuals are not normally distributed in large samples. Third, it can be used to handle various types of variables (Schmidt & Finan, 2018; Yang et al., 2019). Nevertheless, linearity assumed by LR means that the effect of a predictor on the outcome remains constant and additive without being modified by other factors unless the interaction relationships are specified explicitly (McClelland & Judd, 1993; West et al., 1991). Assuming linearity and equal variance in a model may limit its predictive power in complex real-world contexts (Ernst & Albers, 2017). Hence, it is beneficial to have multiple LR serve as a baseline for demonstrating the direction and strength of the associations, as well as for comparison with ML models.
The ML approaches, namely CART and RF, were applied to complement multiple LR to capture more complex patterns. They are flexible models that can accommodate potential nonlinear relationships and real-world, mixed-type data (which include various numeric, ordinal, and nominal variables) while examining the importance of specific predictors. For a more detailed and technical treatment of ML and the tree-based methods, see Banerjee et al. (2019) and Uddin and Lu (2024).
CART takes on the form of a tree with “branches” representing different paths. Each branch represents a decision based on features that split the data into distinct subsets. These decisions can be based on either categorical (e.g., male or female) or continuous variables (e.g., older or younger than 20). The tree continues to create splits based on maximizing similarity for cases within the splits until reaching the endpoints, known as “leaves,” where final predictions are made based on the path followed by the data (Breiman et al., 1984). CART identifies the optimal breakpoint in a continuous variable by examining all possible values of a predictor and selecting the one that best separates the outcome variable into groups with more similar values. In essence, it minimizes differences within groups by reducing the average error, which is measured as the mean squared error (MSE). A primary advantage of CART is that it can easily visualize the structure of the predictive relationships; however, it is prone to overfitting (Breiman et al., 1984).
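The search for a breakpoint in a continuous predictor can be sketched as below. This is a simplified illustration of a single regression split, not the rpart-based implementation used in the study; the function names and the toy data (maternal age against a symptom score) are hypothetical. Minimizing the summed squared error of the two groups is equivalent to minimizing their size-weighted MSE.

```python
def sse(values):
    """Sum of squared deviations from the group mean (n times the MSE)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, error) for the split 'x <= threshold' that
    minimizes the summed within-group squared error of y."""
    pairs = sorted(zip(x, y))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between identical predictor values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint candidate
        left = [yy for _, yy in pairs[:i]]
        right = [yy for _, yy in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Toy data (illustration only): scores are higher for the younger group.
age = [18, 19, 21, 30, 32, 35]
score = [5, 6, 5, 2, 1, 2]
threshold, error = best_split(age, score)
```

A full tree applies the same search recursively within each resulting group and across all predictors, which is also why an unpruned tree can overfit.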
RF uses an ensemble of trees to improve upon CART and overcome its limitations. Each tree is grown on a bootstrapped dataset, and its splits are based on a random subset of features. Hence, each tree in the RF model generates a different prediction, and these predictions are aggregated through averaging (for regression; continuous outcome) or majority voting (for classification; categorical outcome) to produce a final prediction (Breiman, 2001; Cutler et al., 2012). This approach results in a more stable and accurate estimation than a single tree in CART. Compared to LR, it relaxes the assumptions of linearity and equal variance and can allow more accurate prediction (Ali et al., 2012; Marchese Robinson et al., 2017; Prajwala, 2015; Schonlau & Zou, 2020). It is, however, difficult to visualize the structure of the predictive relationships from an RF model, which is an ensemble of many individual classification and regression trees.
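The two sources of randomness and the averaging step can be illustrated with a deliberately small sketch. It assumes one-split trees (“stumps”) instead of full trees, so drawing the feature subset once per tree coincides with drawing it at each split; all names, the toy data, and the default settings are hypothetical, and this is not the implementation used in the study.

```python
import random


def fit_stump(rows, y, features):
    """Fit a one-split regression tree restricted to the given feature indices."""
    best = None  # (error, feature, threshold, left_mean, right_mean)
    for f in features:
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [yy for r, yy in zip(rows, y) if r[f] <= t]
            right = [yy for r, yy in zip(rows, y) if r[f] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if best is None or err < best[0]:
                best = (err, f, t, lm, rm)
    if best is None:  # sampled rows do not vary on the chosen features
        mean = sum(y) / len(y)
        return lambda row: mean
    _, f, t, lm, rm = best
    return lambda row: lm if row[f] <= t else rm


def fit_forest(rows, y, n_trees=50, mtry=1, seed=0):
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: sample with replacement
        feats = rng.sample(range(p), mtry)          # random subset of features
        trees.append(fit_stump([rows[i] for i in idx], [y[i] for i in idx], feats))
    return trees


def predict(trees, row):
    """Regression forest prediction: the average over all trees."""
    return sum(tree(row) for tree in trees) / len(trees)


# Toy data (illustration only): columns are [maternal age, binary flag].
X = [[18, 0], [19, 1], [21, 0], [30, 1], [32, 0], [35, 1]]
y = [5, 6, 5, 2, 1, 2]
forest = fit_forest(X, y)
young, older = predict(forest, [20, 0]), predict(forest, [33, 0])
```

Because each stump sees a different resample and a different feature, individual predictions vary, and averaging them smooths out that variability.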
To facilitate a more direct comparison, LR, CART, and RF were all implemented using a common pipeline, the mikropml pipeline (Topçuoğlu et al., 2021), via the R package of the same name. The dataset was pre-processed by scaling the continuous variables and creating dummy variables for the categorical variables (Topçuoğlu et al., 2021). The pipeline split the dataset into training and testing sets with a typical proportion of 70:30. Additionally, 10-fold cross-validation with 100 partitions was conducted on the three models. The best tuning parameters for each model (alpha and lambda for LR; maxdepth for CART; and mtry for RF) were automatically selected based on the performance statistics calculated by the pipeline (see Table 4).
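The partitioning scheme can be sketched in Python as follows. This is not the mikropml code, which handles these steps internally in R; the function names and the random seed are hypothetical, and the sample size of 2063 is taken from the analyses reported in this article. The sketch shows a single 70:30 split followed by one round of 10-fold cross-validation on the training portion.

```python
import random


def train_test_split(indices, train_fraction, rng):
    """Shuffle the indices and cut them into training and test sets."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def k_fold(indices, k):
    """Yield (training, validation) index lists for each of k folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


rng = random.Random(0)  # hypothetical seed, for reproducibility only
train_idx, test_idx = train_test_split(range(2063), 0.70, rng)
print(len(train_idx), len(test_idx))

# Each training case serves as validation data in exactly one fold.
for training, validation in k_fold(train_idx, 10):
    assert not set(training) & set(validation)
```

Tuning parameters would be chosen by comparing the average validation performance across folds, with the held-out test set used only once for the final performance estimate.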
Table 4. Accuracy metrics of the multiple LR, CART, and RF models

Note. CV = Cross-validation; Train = Training dataset; Test = Test dataset; RMSE = Root Mean Squared Error; MAE = Mean Absolute Error.
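For readers less familiar with the metrics named in the table note, the following Python sketch shows how they are commonly computed. The function names and toy values are hypothetical and are not results from this study. R-squared is shown here as one minus the ratio of residual to total sum of squares; other definitions exist (for example, the squared correlation between observed and predicted values), and the pipeline used in the study may apply a different one.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error: penalizes large errors more heavily."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error: the average size of the prediction errors."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


def r_squared(observed, predicted):
    """Proportion of outcome variance accounted for by the predictions."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1 - ss_residual / ss_total


# Hypothetical toy values for illustration only.
observed = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 6.0]
print(round(rmse(observed, predicted), 3))
print(mae(observed, predicted))
print(round(r_squared(observed, predicted), 3))
```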
Results
Initial analyses
Imputation allowed the use of a sample size of 2063 for these analyses. Correlation and univariate regression analyses indicated weak correlations between most predictors and H/I SDQ scores (r < .24). An unadjusted LR using complete-case data explained minimal variance (R² = .50), and no individual predictors exhibited statistically significant effects. However, strong intercorrelations were noted between certain predictors, such as head circumference and birth weight, as well as between maternal education and socioeconomic status (r > .60). Additional multicollinearity checks using variance inflation factors (VIF) from a multiple LR revealed high collinearity (VIF > 5) between some predictors, which may bias parameter estimates. As a result, two predictors, country of birth and the IMD, were excluded due to redundancy. After these exclusions, the final model included 23 of the original 25 predictors, all of which had acceptable VIF values (VIF < 5). The results of the correlation and univariate regression analyses are provided in the Supplementary Materials (see Tables S6 and S7, pp. 12–14). Diagnostic information for the multiple LR using complete-case data is provided in Figures S1 to S2 (pp. 15–16), while that using imputed data is provided in Figures S3 to S5 (pp. 17–27).
Predictive models
The multiple LR model (F(25, 2037) = 9.26, p < .001) explained 10% of the variance in the H/I SDQ score (R² = .10, adj. R² = .09). Significant associations with the offspring’s H/I SDQ score were found for White British ethnicity (B = .28, 95% CI [.12, .43], p = .001; reference = “other ethnicity,” e.g., Asian), Pakistani ethnicity (B = .17, 95% CI [.02, .31], p = .022), being married (B = .15, 95% CI [< .001, .29], p = .045; reference = “unmarried”), infant’s cord blood triglycerides level (B = .10, 95% CI [.05, .15], p < .001), female infant sex (B = −.53, 95% CI [−.62, −.44], p < .001; reference = “male”), infant’s head circumference (B = −.08, 95% CI [−.15, −.02], p < .012), and maternal smoking (B = .23, 95% CI [.14, .33], p < .001; reference = “non-smoker”) (Table 3).
Table 4 provides the model adequacy metrics for the multiple LR, CART, and RF models. These metrics suggest that the performance of the three models was similar, but the LR model achieved the best prediction, with a slightly lower RMSE and the highest R-squared. Using 23 predictors, 6.97% of the variation in the offspring’s H/I SDQ score could be explained using a 10-fold cross-validation with multiple LR in the test dataset. This was higher than CART (5.26%) and RF (5.81%). While the multiple LR model had the lowest RMSE value of 2.67 from cross-validation, this was only slightly lower than the 2.70 obtained from the CART and RF models. These findings suggest that information available at birth can help predict later ADHD symptoms; however, the variation explained was modest.
The feature importance plot from the optimal model (the multiple LR) showed that infant male sex and maternal smoking were the two most important predictors, followed by ethnicity = “Others”, infant’s cord blood triglycerides, head circumference, and other marital status (Figure 1). The pattern of results, as well as the feature importance statistics (for the multiple LR, provided in Table S8; p. 28), was similar for CART and RF (Figures 2, 3), which also identified male sex and maternal smoking as the most important predictors. Notably, the LR and RF models produce importance scores for all predictors, whereas the CART model assigns nonzero importance only to those used in the final tree splits. As such, the CART feature importance plot highlights only the most discriminative predictors, namely male sex and maternal smoking, and typically produces a more concise set of variables than regression- or ensemble-based approaches.

Figure 1. Bar plot of the feature importance for the multiple linear regression (LR) model.

Figure 2. Bar plot of the feature importance for the classification and regression trees (CART) model.

Figure 3. Bar plot of the feature importance for the random forest (RF) model.
Discussion
Our study found that around 7% of the variation in ADHD symptoms, measured by children’s H/I SDQ subscale score at the age of five, could be predicted from perinatal and sociodemographic predictors that are typically easy to gather around the time of birth. Male sex, maternal smoking, and infant’s cord blood leptin emerged as the most influential predictors. The results suggest that prediction models based on data available around the time of birth could be used to help identify those at risk of later ADHD symptoms.
The modest variance explained is consistent with the complex nature of ADHD etiology, which involves multiple genetic and environmental factors as well as complex gene–environment interplay (Balogh et al., 2022; Leffa et al., 2023). Numerous studies suggest that ADHD is a highly heritable but complex condition influenced by multiple factors (Faraone & Larsson, 2019; Gizer et al., 2009; Thapar et al., 2013). Although genetics play a crucial role in its development, with approximately 74% heritability (Faraone & Larsson, 2019), common individual DNA risk variants each contribute only a tiny effect (Luo et al., 2019). Findings from twin studies have highlighted that even combining the impact of these DNA risk variants only explains around 22% of heritability. Additionally, by estimating and combining the effects of thousands of genetic variants, polygenic risk scores (PRS) revealed that only 5.5% of the variance in ADHD symptoms can be predicted (Faraone & Larsson, 2019), similar to the ~7% of variance explained here by factors feasible to collect around birth.
A similar amount of explained variance, around 7 to 8%, was found in a similar study by Dooley et al. (2024). Unlike our focus on ML with 23 predictors related to perinatal and sociodemographic characteristics, they used elastic net regression and included 40 detailed pre- and perinatal variables, such as maternal substance use, obstetric complications, child demographics, maternal drug use, and vitamin intake. Furthermore, Dooley et al. measured child ADHD with the parent-reported Child Behaviour Checklist (CBCL) score at ages 9–10, while our study analyzed the teacher-rated H/I subscale from the SDQ when children were 4 to 5 years old. Despite differences in the choice and definition of predictors, the similar variance explained arguably highlights the inherent difficulty in predicting ADHD symptoms from perinatal factors alone.
In terms of the predictors that emerged as highest in feature importance, we found similar results to Dooley et al. Their study identified infant male sex and the pregnant individual’s smoking during pregnancy as two of the three most significant predictors of ADHD among the 40 variables; these predictors also emerged with the highest feature importance in the current analysis. Garcia-Argibay et al. (2023), who applied ML models to registry-based data in Sweden, also found male sex to be one of the five top predictors (the others were criminal convictions of parents, history of ADHD in the family, communication and learning difficulties, and academic performance of the child). However, results varied in their sex-stratified models. The next most important features in the current analysis (based on the RF model selected as the optimal model) were the infant’s cord blood leptin, the pregnant individual’s age, the infant’s cord blood triglycerides, head circumference, and the infant’s cord blood adiponectin. These predictors are rarely investigated in prior ADHD prediction studies and represent a novel contribution of the present study. Taken together, the results suggest that these eight variables might be prioritized in future studies aiming to develop predictive models.
The superiority of the LR model over CART and RF in the present study is also consistent with some previous findings. Garcia-Argibay et al. analyzed the predictive performance of a range of ML models and found that the RF model, which yielded an area under the curve (AUC) of .68 and showed signs of overfitting, had poorer predictive accuracy than the logistic regression model (AUC = .74). This suggests that complex algorithms, particularly nonlinear models, may not outperform traditional regression models when analyzing perinatal predictors, which may indicate a lack of nonlinearity and/or complex interactions.
The use of predictive modeling in ADHD research has been a burgeoning area in recent years; however, the use cases have mostly related to the classification of ADHD presence rather than early prediction. A systematic review by Salazar de Pablo et al. (2024) highlighted that nearly 90% of predictive modeling studies in ADHD had high predictive accuracy, as defined by an AUC ranging from 0.50 to 0.97. Nevertheless, these studies focused on the binary classification of ADHD presence. Further development of the literature on early ADHD symptom predictors (especially treating symptoms as continuous, reflecting contemporary understandings of ADHD symptoms) holds potential for leveraging the advances in predictive modeling for clinically meaningful applications. Future investigations could also examine different subdimensions of ADHD symptoms, given that previous research suggests that inattention and hyperactivity/impulsivity may show different developmental trajectories and outcomes, and that profiles of symptoms may differ by gender (e.g., Stibbe et al., 2020; Vergunst et al., 2019). They could also address the prediction of different “developmental subtypes.” For example, Murray et al. (2022) suggest that distinct early-life risks relate to different ADHD trajectories (e.g., earlier vs. later onset and remitting vs. persistent). It will also be important to assess how predictive accuracy changes during development, given that those with later onsets of ADHD symptoms may not have been captured in the present sample.
Similarly, given that additional influences on ADHD come into play at different stages of development, it would be valuable to compare models at different developmental stages that include measures of emerging influences.
Limitations and future directions
A main limitation of the present study concerns the scope of predictors included. It lacks measures of predictors previously associated with children’s developmental outcomes, including drug and medication usage (Dooley et al., 2024), vitamin D deficiency (Tahir et al., 2023), as well as postnatal factors such as parenting styles (Hutchison et al., 2016), adverse childhood experiences (Brown et al., 2017), and the contribution of genetic factors quantified by the PRS (Green et al., 2022; Ronald et al., 2021) that may improve prediction. A key predictor that would also be likely to improve prediction is parental ADHD symptoms. These omissions were due to data availability in the cohort study and the complex multifactorial nature of ADHD. It is challenging to comprehensively measure all of these factors without imposing a significant burden on the participants. Importantly, this also explains why challenges exist in determining what to measure and include in predictive models, as the etiology of ADHD is still not fully understood. The measurement of predictor variables and ADHD symptoms could also be enhanced, since measurement errors might exist.
These considerations imply two potential future directions: gathering and analyzing more comprehensive data available around birth, and building dynamic prediction models that are updated as more information becomes available over the course of the child’s development. However, this must be balanced against feasibility, and any data collection used to predict later ADHD in clinical practice needs to minimize clinician and patient time and burden.
A second limitation concerns the high proportion of missingness across several perinatal predictors, particularly maternal smoking, alcohol use, and cord blood biomarkers. This reflects real-world challenges in birth cohort data collection. The missingness was addressed through multiple imputation with MICE, which adopts the missing at random (MAR) assumption. If, for example, parents declined their children’s participation in the “Starting School” project due to variables that may be related to ADHD symptoms, then this missing data could be considered non-random. MICE was applied after diagnostic tests for missingness, following Newman (2014) for assessing item- and construct-level missing data. In line with recommendations from general psychological research (e.g., Enders, 2022), the observed missingness patterns support the MAR assumption, justifying the use of MICE. Potential biases due to missingness could be mitigated using datasets with high population coverage; however, this is likely to involve a trade-off with the depth of information available for each participant.
Conclusions
LR models provided the best prediction and explained a modest amount (6.97%) of the variance in the H/I SDQ score, with 23 maternal and neonatal factors providing proof-of-principle for predicting ADHD from non-genetic information available at birth. Maternal smoking and the infant’s male sex were among the top three predictors in the RF models. Future research could build on the present study to develop improved prediction models by including additional variables omitted from the present study.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S0954579425100783.
Data availability statement
Due to data access restrictions associated with the Born in Bradford study, the data and materials used in the study are not publicly available. All analyses were conducted using publicly available R packages, as referenced in the manuscript. No custom code was developed for this project. However, statistical code and analysis scripts can be made available upon reasonable request.
Pre-registration statement
The analyses presented in this manuscript were not preregistered.
Funding statement
Bonnie Auyeung was supported by the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No. 813546, the Baily Thomas Charitable Fund TRUST/VC/AC/SG/469207686, the Data Driven Innovation, and the UK Economic and Social Research Council (ES/W001519/1) during the course of this work.
Competing interests
The author(s) declare none.
Ethical standards
Ethical approval for the current study was obtained from the School of Philosophy, Psychology and Language Sciences Ethics Committee at the University of Edinburgh.



