Impact Statement
This paper demonstrates that machine learning can predict concrete strength, even for small datasets, by utilizing new test results as they become available over the concrete lifetime. This has important implications for accelerating the decision-making process currently adopted in concrete practice and for easing the adoption of novel low-carbon materials with limited test data. Furthermore, the machine learning methodology developed in this paper, including a newly proposed accuracy-uncertainty metric to compare various machine learning models, has application to the broader field of emerging materials design, allowing bespoke materials to be designed rapidly for particular applications.
1. Introduction
Concrete usage causes around 8% of global greenhouse gas emissions, primarily due to the production of ordinary Portland cement (OPC) (Turner and Collins, 2013; Miller et al., 2021). Efforts are underway to reduce and replace OPC with alternative materials. Lower carbon OPC replacement materials include calcined clay, ground granulated blast furnace slag (GGBS), and other supplementary cementitious materials. Extensive historical testing of OPC-based mixes enabled researchers to establish general relationships between the constituent proportions and the properties of the resulting concrete. From this, prescriptive design principles are used in many aspects of concrete mix formulation, ensuring appropriate short-term and longer-term properties, including strength and durability (Teychenne et al., 1997).
Current guidelines for OPC mixes may not reflect the performance of newly developed concrete mixes that incorporate novel materials to reduce OPC content and embodied carbon. Experimental testing of lower carbon mixes, including such materials, can reduce uncertainty, but testing all possible mixes is prohibitively time-consuming and expensive. Therefore, accurate predictive methods are required to reduce experimental testing requirements. Machine learning is a powerful tool that can capture complex non-linear relationships between inputs (such as mix proportions) and outputs (desired physical properties) from historical data, even if the available data is sparse and noisy (Conduit et al., 2017, 2018, 2019). The approach has been successfully used to predict the physical properties of many different materials (Bhadeshia et al., 1995; Sourmail et al., 2002; Agrawal et al., 2014; Ward et al., 2016, 2017; Kim et al., 2018), including concrete (Taffese et al., 2015; Chaabene et al., 2020; Prayogo et al., 2020; Liu et al., 2021; Tran et al., 2022). Furthermore, machine learning has been exploited to design concrete with high confidence in meeting requirements (Forsdyke et al., 2023).
One of the most important properties of concrete is the compressive strength, which increases over time. In practical settings, strength is conventionally assessed at 28 days via compressive testing, but a target strength at 56 days can also be specified, particularly for materials with lower early-age strength development. Machine learning algorithms are increasingly being used to predict concrete compressive strength for mixes with a range of OPC replacement materials, including fly ash (Chou et al., 2014; Deng et al., 2018; Young et al., 2019; Feng et al., 2020; Khursheed et al., 2021; Salami et al., 2021; Wan et al., 2021; Nguyen et al., 2021a; Liu, 2022; Lee et al., 2023; Li et al., 2023; Pakzad et al., 2023; Hariri-Ardebili et al., 2024), blast furnace slag (Chou et al., 2014; Feng et al., 2020; Salami et al., 2021; Wan et al., 2021; Nguyen et al., 2021a; Lee et al., 2023; Li et al., 2023; Hariri-Ardebili et al., 2024) and other materials including silica fume, metakaolin and rubber (Chou et al., 2014; Hadzima-Nyarko et al., 2020; Mansouri et al., 2022). Some attention has also been given to predicting the properties of high-strength mixes (Nguyen et al., 2021b), binary and ternary blended concretes (Murthy et al., 2024) and self-compacting mixes (Chakravarthy et al., 2023) using machine learning algorithms.
Despite a wealth of academic literature in this area, the adoption of machine learning models in practical settings remains limited. Most existing predictive tools have been developed using large test banks (greater than 1000 data points) offering high prediction accuracy. Due to the rapid introduction of new materials and the evolving nature of concrete mixes, there will be limited data for each OPC replacement material. As such, prediction models must be robust for smaller datasets, with highly accurate strength predictions, but also realistic uncertainty estimates.
A review of the 20 articles highlighted above on machine learning strength prediction shows that existing models commonly rely on the mix aggregates, binder (ordinary Portland cement and other cementitious materials), and water contents to estimate strength at a specific age (Figure 1). Due to the increasing complexity of concrete mix behaviour and the limitations of traditional mix design principles, it may be necessary to include more than these three primary input variables to obtain accurate predictions for a variety of concrete mixes with varying materials, especially when test data is sparse. Therefore, this work proposes utilizing information obtained along the concrete life-cycle, such as fresh state measurements, to improve strength predictions without having to wait extended periods. This potentially reduces the risk of adopting mixes with sparse historical data, given that strength estimates can be updated in real time.

Figure 1. Prevalence of model input variables used to predict compressive strength in the literature (Chou et al., 2014; Deng et al., 2018; Young et al., 2019; Feng et al., 2020; Hadzima-Nyarko et al., 2020; Pham et al., 2020; Khursheed et al., 2021; Nunez et al., 2021; Salami et al., 2021; Wan et al., 2021; Nguyen et al., 2021a; Han et al., 2022; Khan et al., 2022; Liu, 2022; Mansouri et al., 2022; Chi et al., 2023; Lee et al., 2023; Li et al., 2023; Pakzad et al., 2023; Hariri-Ardebili et al., 2024).
Practitioners need good estimates of model uncertainty to be confident in the prediction outputs and to assess the risk profile of different concrete products. Most existing prediction models assess performance using a single accuracy metric, i.e., how close on average the predictions are to the actual strength. However, measures of how trustworthy any particular prediction is, i.e., its uncertainty, remain limited. Reviewing existing machine learning models revealed that most work considers a very limited range of accuracy metrics (Alkayem et al., 2024). Almost all of the papers relied solely on accuracy metrics such as the coefficient of determination $ {R}^2 $, mean absolute error, or root mean square error to compare model performance (Chou et al., 2014; Chopra et al., 2018; Deng et al., 2018; Young et al., 2019; Chaabene et al., 2020; Feng et al., 2020; Hadzima-Nyarko et al., 2020; Pham et al., 2020; Nunez et al., 2021; Salami et al., 2021; Wan et al., 2021; Nguyen et al., 2021a; Han et al., 2022; Khan et al., 2022; Liu, 2022; Mansouri et al., 2022; Chi et al., 2023; Lee et al., 2023). A very small number of papers directly considered the uncertainty in model predictions (Pakzad et al., 2023; Hariri-Ardebili et al., 2024). Where uncertainty was considered, the aim was to minimize prediction uncertainty rather than to assess the quality of the uncertainty estimates, and metrics such as the scatter index were adopted. From this review, it is clear that the practical considerations of applying machine learning for concrete strength prediction are often overlooked, resulting in predictive tools that are insufficiently robust for real-world use.
In this work, a sparse dataset (fewer than 50 data points) is adopted containing both OPC and GGBS materials to train a machine learning model to predict concrete compressive strength. The dataset used in this study is comparable to what a practitioner would have access to for an emerging lower-carbon material. This work aims to determine how machine learning models can be informed in real time over the life cycle of the concrete product as newly obtained information, such as fresh state measurements or earlier age strength data, becomes available. To do this, the following objectives are addressed:
• Which properties are most helpful to predict longer-term (56-day) strength?
• When predicting the strength of concrete mixes with OPC with or without GGBS binders, are better predictions achieved when using datasets separated by binder type or a single unified dataset?
• What is the role of uncertainty in the performance assessment of machine learning models for concrete strength prediction? Can a broader range of predictive performance measures, including accuracy and uncertainty, be effectively captured in a single metric?
This paper is organized into several sections to achieve its objectives. Section 2 highlights the key properties of concrete throughout its life cycle and reviews the correlations between various concrete properties within the adopted dataset. Initially, the correlations for fresh state properties are examined separately for Ordinary Portland Cement (OPC) only mixes and mixes containing Ground Granulated Blast Furnace Slag (GGBS). This is followed by an assessment of the correlations within a unified dataset that includes both types of binder materials. Guidance is presented for processing datasets for property prediction in practical settings. Section 3 discusses the random forest machine learning methodology employed in this study. It begins with an overview of the two-layer model and the decision tree process. Next, the accuracy and uncertainty metrics are evaluated, and a unified accuracy-uncertainty metric is proposed. Finally, leave-one-out cross-validation is performed to optimize model performance. In Section 4, two previously unseen concrete mixes are used for experimental validation of the proposed approach, demonstrating the advantages of the just-in-time prediction method. Section 5 addresses the limitations of the presented approach and discusses potential future applications of this work within the concrete industry and beyond.
2. Key concrete properties
This section discusses the key properties of concrete, which change throughout its lifecycle from the fresh to hardened state. Correlations between the properties are then reviewed. After the design and mixing of the constituent materials, concrete is initially in a fluid state for the first few hours (fresh state) and then solidifies into a hardened state (Figure 2). Because of the different states passed through, several performance measures are used throughout the life cycle of concrete. Research has shown that shorter-term fresh-state performance impacts longer-term performance due to compaction effects and microstructure development (Lydon, 1972; Neville, 1988). Fresh state testing, therefore, may aid understanding of the development of hardened state properties, leading to the early identification of issues in the later stages of the life cycle of a concrete product (Tattersall, 1991).

Figure 2. Life cycle of concrete, from mixing to the hardened state, with the step-wise prediction approach and decision tree highlighted at each concrete age.
Figure 2 outlines a flow chart of how longer-term compressive strength predictions could be updated when new information becomes available over the life cycle of the concrete product. This offers a mechanism to reduce risk via several decision stages and, therefore, increase the adoption of mixes whose behavior is relatively unknown. The most basic input for the strength predictions is the mix constituents. Inputting solely this information to the machine learning model is advantageous as predictions can be performed before mixing. However, for novel mixes, these inputs alone will likely cause the prediction output to have large uncertainty as the training data will be primarily formed of traditional concrete mixes whose behaviour may not be representative of emerging materials. If large prediction uncertainty is observed, the next step is to update the strength predictions with fresh state results (such as slump testing) measured at the placement stage. Updating the strength predictions at this stage is advantageous as a decision can be made whether to proceed with the pour based on the improved estimates of strength and uncertainty. If the pour proceeds based on the output of this prediction step, then the strength predictions can be updated with the hardened state strength results taken at early ages. Progressively, the strength predictions will be closer to the actual compressive strength, with reduced uncertainty in the estimates. This step-wise prediction approach is adopted in this paper, with the experimental validation demonstrating the improved prediction performance as the model is updated with new information in real time.
First, in Section 2.1, we describe the preparation of concrete mixes used in the dataset and experimental validation performed in this study. Second, in Section 2.2, we describe fresh state performance testing of the mixes. Third, in Section 2.3, we set out hardened testing following the setting process of the mixes. Finally, in Section 2.4, we describe the correlations between the concrete properties observed in the sparse dataset, including mix proportions, fresh state properties, and hardened state properties, to form a sound basis for developing the machine learning model.
2.1. Preparation of concrete mix dataset
A total of 29 concrete mixes, 20 of which had 56-day strength data, were produced to form the training dataset (see the Supplementary Material) adopted in this study. These concrete mixes were produced in the University of Cambridge Civil Engineering Building laboratories. The experimental series adopted CEM I 52.5 ordinary Portland cement and sand fine aggregates with a maximum aggregate size of 2 mm. A coarse crushed aggregate with a maximum aggregate size of 10 mm was also adopted. A 100-litre planetary concrete mixer with variable rotor and drum speed was utilized throughout the test series. To produce a suitable range of fresh and hardened state properties, the water–binder ratio (0.50–0.91), total binder content (260–418 kg/m³), sand–aggregate ratio (0.48–0.62), aggregate–binder ratio (3.9–7.0), and the GGBS cement replacement content of the mix (0–70%) were varied. These values are typical of those used in wider large-scale concrete activity. The full table of data is included in the Supplementary Material.
The mix proportions, including the total binder content (ordinary Portland cement plus other cementitious materials), water–binder ratio, sand–aggregate ratio, aggregate–binder ratio, and supplementary cementitious material (SCM) percentage, were designed to achieve adequate variation in the fresh and hardened properties and represent the diversity in mix design seen in practice, whilst maintaining the scale of a small dataset. Several mixes exhibit comparable mix proportions but were batched at different total volumes, replicating the real-life production processes. This also replicates the nature of data within a practical setting, where the potential of novel mixes is scoped before scaling up production. Where SCMs were adopted, ground granulated blast furnace slag (GGBS) was utilized, given it is one of the primary materials used in practice to replace OPC and reduce embodied carbon as it is a waste material from the steel industry. GGBS is used as an example, but the process is developed so that it is valid for other SCMs.
2.2. Fresh state properties
Immediately, and up to around two hours after water is added to the mix, concrete is in its fresh state, in which it can flow under a relatively small applied stress (Tattersall, 1991). This allows concrete to be poured and shaped within molds to form the desired geometry as the mix hardens. Tests are completed in the fresh state to ensure that the concrete is suitable for placement and that a stable mix has been produced. Fresh state testing often takes the form of the Abrams slump test, conducted in this study and in practice according to BS EN 12350-2 (BSI, 2019a). In this test, concrete is added to a conical mold in three layers, with each layer tamped sequentially. The slump cone is lifted steadily to allow the material to flow due to its weight, and the concrete is classified according to the difference between the height of the deformed material and the height of the slump cone. This classification is termed the total slump height. Other properties, including the rheological measures of yield stress and viscosity, can further describe the fresh state characteristics of concrete as a non-Newtonian fluid. A concrete rheometer was used in this study to measure yield stress and viscosity directly, although alternative easier-to-access mechanisms to derive the rheological properties have been developed (White and Lees, 2023, 2025). The fresh state properties of each mix within the adopted dataset are provided as the Supplementary Material.
2.3. Hardened state properties
Following the addition of water, a chemical hydration process begins, and over time, structural build-up results in a hardened state as the concrete develops strength. After the design concrete strength is specified, the actual concrete mix placed during construction must achieve at least this value. The strength development of the concrete is tracked by conducting hardened state testing at various ages on 100 mm × 100 mm × 100 mm cube specimens cast at the same time as the structure. In this study, compressive strength testing is conducted according to BS EN 12390-3 at 28 and 56 days via the application of axial force onto a concrete cube through parallel platens (BSI, 2019b). For each concrete mix, three samples are taken, and the maximum load reached by each sample is recorded. The average of these maximum load values is considered as the compressive strength of the specimen. The 28-day strength is taken as an intermediate value to be used as an input for machine learning; the 56-day strength is taken as representative of the longer-term strength performance. The actual compressive strength values attained via experimental testing, and the predicted strength for each mix, are provided as the Supplementary Material.
2.4. Correlations between properties
The previous sections demonstrated how the properties of concrete change throughout its lifecycle. Since these properties pertain to the same concrete mix, they can be related to one another, which is discussed in this section.
The correlations between properties within the dataset, calculated using the Pearson correlation coefficient, are shown in Figure 3. Color intensity indicates the correlation coefficient, with the lightest shade indicating the strongest positive correlation and the darkest shade indicating the strongest negative correlation. Three matrices are presented for the OPC mixes (Figure 3a), the GGBS mixes (Figure 3b), and the unified dataset comprising OPC and GGBS mixes (Figure 3c). The first seven rows of the matrix (eight rows for GGBS mixes and the unified dataset) display the proportions of the constituent materials present in the concrete. The following four rows display the fresh state testing completed, such as the slump test. The final three rows of the correlation matrix display the hardened state testing completed, including density and strength measurements.

Figure 3. (a) Correlation map for the OPC mixes within the dataset, (b) correlation map for the GGBS mixes within the dataset, and (c) correlation map for the unified dataset. The right-hand scale bar calibrates the correlation values.
In all three datasets, the fresh state properties of the mix are only weakly correlated with mix proportions. There is, as expected, a moderate positive correlation between water content and total slump height during fresh state testing, which is increased in the presence of GGBS (0.51 for OPC, 0.72 for GGBS and 0.52 for unified). There are similar physically justified negative correlations between water content and static yield stress (−0.55 for OPC, −0.65 for GGBS, and −0.52 for unified). Viscosity is more strongly correlated with the mix proportions, particularly with coarse aggregate content (0.80 for OPC, 0.56 for GGBS, and 0.75 for unified) and with aggregate–binder ratio (0.82 for OPC, 0.56 for GGBS, and 0.76 for unified). Furthermore, a strong negative correlation of −0.98 for OPC, −0.94 for GGBS, and −0.96 for the unified dataset was observed between total slump height and static yield stress. This effect has been observed elsewhere (Ferraris and Larrard, 1998; Laskar, 2009; White and Lees, 2023).
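For readers reproducing Figure 3, a correlation map of this kind can be assembled in a few lines of pandas. The file name and column labels below are hypothetical placeholders for the dataset provided in the Supplementary Material.

```python
import pandas as pd

# Hypothetical file and column names standing in for the Supplementary Material dataset.
mixes = pd.read_csv("concrete_mixes.csv")

# Pearson correlations across mix proportions, fresh state and hardened state properties.
corr_unified = mixes.corr(method="pearson", numeric_only=True)

# Separate maps for OPC-only and GGBS-containing mixes (assuming a 'ggbs_content' column).
corr_opc = mixes[mixes["ggbs_content"] == 0].corr(method="pearson", numeric_only=True)
corr_ggbs = mixes[mixes["ggbs_content"] > 0].corr(method="pearson", numeric_only=True)

# For example, the correlation of every property with 56-day strength:
print(corr_unified["strength_56d"].sort_values())
```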
In the GGBS dataset, a much stronger correlation between cement content and compressive strength is observed at both 28 days (0.43 for OPC, 0.95 for GGBS) and 56 days (−0.30 for OPC, 0.93 for GGBS). Combined with the strong negative correlations between GGBS content and compressive strength at both 28 days (−0.95) and 56 days (−0.94), this indicates that including GGBS within the mix has a significant negative effect on strength. Therefore, GGBS mixes should be included in the training data if this material is to be adopted in practice, to enable accurate predictions of compressive strength in the presence of a novel material.
A negative correlation between the water–binder ratio and compressive strength is expected in both OPC and GGBS mixes (Neville, 1988; Newman and Choo, 2003; Fan et al., 2021). Increasing the water–binder ratio means more water is available for the chemical reaction between the binder and other materials, but too much water results in weak concrete with pores and large matrix spacing. In the OPC dataset, a much stronger, negative correlation between the water–binder ratio and compressive strength is observed at both 28 days (−0.65 for OPC, 0.11 for GGBS) and 56 days (−0.83 for OPC, 0.05 for GGBS). The absence of this correlation in the GGBS dataset could spuriously arise from the small size of the dataset. Therefore, including OPC mixes and forming a unified dataset would allow machine learning to capture the expected fundamental relationships based on non-material-specific elements while still capturing any differences that appear due to the inclusion of alternative binder materials besides ordinary Portland cement.
As expected, in all three datasets, the 28-day strength is highly correlated with the 56-day strength, meaning this earlier strength measure could be used to accurately predict the 56-day strength, following the workflow in Figure 2. Overall, Figure 3 displays strong correlations between many of the variables within the dataset, which can be utilized by machine learning to make predictions. These correlations are overall most accurately reflected when both GGBS and OPC mixes are included in the dataset. Therefore, moving forward, we adopt the unified dataset to train machine learning models, and a unified dataset is suggested for practical adoption of machine learning for emerging materials with sparse empirical data. This section has provided an understanding of the key properties of concrete and their interrelationships. The following section will outline the machine learning methodology used to predict concrete strength. It will cover the training process, the performance metrics adopted for the model, and comparisons of the model's performance.
3. Machine learning methodology
Following the identification of the key relationships between concrete properties through its fresh and hardened states, the machine learning methodology used to predict concrete strength and uncertainty is discussed in this section. First, in Section 3.1, we describe the machine learning algorithm used in this work based on the property relationships discussed previously. Second, in Section 3.2, we describe the development of a single metric that incorporates both the accuracy of model predictions and the quality of uncertainty estimates. Finally, in Section 3.3, we demonstrate that optimization of the single metric leads to a model that gives overall better predictions of strength and its uncertainty. Based on the single metric, we demonstrate that using intermediate measures such as slump as an input improves predictions of longer-term strength (56 day strength) and its uncertainty. This informs the strategy of real-time strength prediction, as discussed in later sections.
3.1. Model training
This section first reviews the available tools for concrete strength prediction, a process guided by existing reviews of soft computing methods and the results of other prediction models for these purposes, e.g., Alkayem et al. (2024). Supervised learning processes are most common, and examples of widely used machine learning models are $ k $-means clustering (Cohn and Holm, 2021), neural networks (Hastie et al., 2001), and Gaussian processes (Tancret, 2013). This work adopts a two-layer random forest model (Zviazhynski and Conduit, 2023). This model is computationally cheap, robust against outliers, and can uncover non-linear relationships from sparse data. This method has been used previously to successfully design concrete mixes (Forsdyke et al., 2023) and, when compared to k-nearest neighbor and other models, has produced the best accuracy and more stability with less susceptibility to preprocessing of data (Ghunimat et al., 2023; Verma et al., 2023). The Scikit-Learn Python package (Pedregosa et al., 2011) is adopted for the random forest model.
The two-layer random forest model can be seen in Figure 4a. The first-layer random forest model is trained on the concrete mix proportions (Flow A2) to predict all of the intermediate and output variables, such as 28-day strength, alongside their uncertainties (Flow A3). The second-layer random forest model then takes the concrete mix proportions (Flow A1) and the newly predicted variables and uncertainties (Flow A4) to predict the final output, such as 56-day strength (Flow A5). Random forest predictions are affected by their hyperparameters. In this work, to achieve the best overall predictions of the output variable and its uncertainty, we vary the min_samples_leaf hyperparameter, which is the minimum number of samples in a leaf of each constituent decision tree within a random forest; increasing it allows the model to average out the noise.
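The two-layer structure of Figure 4a can be sketched with Scikit-Learn in a few lines. The array names (X_mix, y_28, y_56) and forest settings below are illustrative assumptions rather than the settings used in this work, and the intermediate predictions are made in-sample for brevity; in practice, out-of-fold predictions would be preferred to avoid leakage between the layers.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def predict_with_uncertainty(forest, X):
    """Mean and standard deviation of the per-tree predictions of a fitted random forest."""
    per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
    return per_tree.mean(axis=0), per_tree.std(axis=0)

# Layer 1: mix proportions (Flow A2) -> intermediate variables such as 28-day strength,
# together with their uncertainties (Flow A3).
layer1 = RandomForestRegressor(n_estimators=200, min_samples_leaf=2, random_state=0)
layer1.fit(X_mix, y_28)
y_28_pred, y_28_std = predict_with_uncertainty(layer1, X_mix)

# Layer 2: mix proportions (Flow A1) plus the layer-1 predictions and uncertainties (Flow A4)
# are used to predict the final output, here 56-day strength (Flow A5).
X_layer2 = np.column_stack([X_mix, y_28_pred, y_28_std])
layer2 = RandomForestRegressor(n_estimators=200, min_samples_leaf=2, random_state=0)
layer2.fit(X_layer2, y_56)
y_56_pred, y_56_std = predict_with_uncertainty(layer2, X_layer2)
```

The per-tree spread returned by predict_with_uncertainty is also what produces the shaded prediction band in Figure 4b: the individual trees scatter about their mean, and that scatter is reported as the model uncertainty.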

Figure 4. (a) The two-layer random forest model. (b) A graphical description of a tree with min_samples_leaf = 4. Green boxes are tree leaves, black points are training data, the pink stepped line is the prediction of a single tree with all data present, and the pink shaded area is the range of the predictions from the ensemble of trees.
An example of fitting a decision tree with min_samples_leaf = 4 is illustrated in Figure 4b. The tree leaves are represented by green boxes, each containing at least four training data points (black points). The resulting prediction is calculated as the average value of the output variable within each leaf and is represented by the pink stepped line. Our final model is an ensemble of such decision trees, called a random forest. Each decision tree within the random forest is trained on a bootstrap sample from the training set, obtained by repeatedly drawing an entry from the training set at random with replacement (Hastie et al., 2001). The differences among the bootstrap samples lead to differing decision trees that give a range of predictions, represented by the pink shaded area in Figure 4b. The predictions are averaged to give the overall prediction along the center of the pink shaded area (without overfitting, because the average is taken across different samples); their standard deviation is the uncertainty in the overall prediction, represented by the width of the pink shaded area. The performance of the adopted model is reviewed in the next section.
3.2. Model performance metrics
To evaluate the performance of the developed random forest model, three metrics are used. The first metric, $ {R}^2 $, captures how close the predictions are to the true values. The second metric, $ \Gamma $, captures how accurately the model estimates uncertainty. The third metric, $ \alpha $, is proposed in this work and combines the first two metrics. Each of these metrics is discussed in turn.
3.2.1. Coefficient of determination ($ {R}^2 $)
The coefficient of determination, $ {R}^2 $, assesses the quality of model predictions of strength. The formula for $ {R}^2 $ is:

$$ {R}^2 = 1 - \frac{\sum_{i=1}^{N}{\left({Y}_i-{\hat{Y}}_i\right)}^2}{\sum_{i=1}^{N}{\left({Y}_i-\overline{Y}\right)}^2}, \qquad (1) $$

where $ {Y}_i $ is the true value of strength for the $ {i}^{\mathrm{th}} $ mix of $ N $, $ {\hat{Y}}_i $ is the machine learning prediction of $ {Y}_i $, and $ \overline{Y} $ is the mean value of $ Y $. $ {R}^2 $ ranges from 1 for perfect predictions to $ -\infty $ for arbitrarily inaccurate predictions.
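For completeness, a minimal sketch of Equation (1) in Python is given below; it is equivalent to the standard Scikit-Learn implementation.

```python
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_pred):
    """Coefficient of determination, Equation (1)."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)   # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

# r_squared(y, y_hat) agrees with sklearn.metrics.r2_score(y, y_hat)
```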
The coefficient of determination ($ {R}^2 $) is chosen over other metrics, in particular the widely used Pearson correlation coefficient ($ {r}^2 $), because the coefficient of determination not only accounts for the slope of the predictions versus experimental values, but also accounts for any possible shift. The Pearson correlation coefficient, on the other hand, only reports whether there is a linear relationship between predictions and experimental values, so it would report a perfect fit even if the data were shifted or had a different slope (Neter et al., 1989; Whitehead et al., 2019).
3.2.2. Distribution of uncertainty quality ($ \Gamma $)
The quality of uncertainty estimates from the model is now considered. We first calculate the error in the machine learning prediction scaled by uncertainty, $ {\varepsilon}_i $, for each mix:

$$ {\varepsilon}_i = \frac{{Y}_i-{\hat{Y}}_i}{{\sigma}_{{\hat{Y}}_i}}, \qquad (2) $$

where $ {Y}_i $ is the true value of strength for the $ {i}^{\mathrm{th}} $ mix, $ {\hat{Y}}_i $ is the machine learning prediction, and $ {\sigma}_{{\hat{Y}}_i}=\sqrt{\sigma_{i,\mathrm{ml}}^2+{\sigma}_{i,\exp}^2} $ is the uncertainty, calculated as the quadrature sum of the machine learning uncertainty $ {\sigma}_{i,\mathrm{ml}} $ and the experimental uncertainty $ {\sigma}_{i,\exp} $.
The values of $ {\varepsilon}_i $ are accumulated across all the predictions and then binned into a histogram. This histogram is then compared to the histogram of a standard normal distribution, as shown schematically in Figure 5.

Figure 5. Binned distribution of $ \varepsilon $ (magenta rectangles) and standard normal distribution (white transparent rectangles).
When uncertainty estimates are accurate, the weight in each bin of the $ \varepsilon $ histogram and the normal distribution histogram is equal. Therefore, the similarity between the two histograms is represented by the distribution of uncertainty quality $ \Gamma $:
$$ \Gamma = \frac{1}{2}\sum_{m=1}^{M}\left|\frac{{n}_m}{N}-\frac{1}{M}\right|, \qquad (3) $$
where $ {n}_m $ is the number of entries in the $ {m}^{\mathrm{th}} $ bin of $ M $. Bins are chosen so that the normal distribution has equal weight of $ 1/M $ in each. The number of bins $ M $ is typically chosen to be close to $ \sqrt{N} $ (Lohaka, 2007). The quantity $ \Gamma $ is known as the distribution of uncertainty quality (Taylor and Conduit, 2022) and ranges from 0 for perfect uncertainty estimates to 1 for poor uncertainty estimates.
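The binning can be implemented compactly by mapping the scaled errors through the standard normal cumulative distribution function, so that a well-calibrated model places a weight of $ 1/M $ in each bin. A minimal sketch is given below; the normalization follows Equation (3) as reconstructed here and may differ in detail from Taylor and Conduit (2022).

```python
import numpy as np
from scipy.stats import norm

def gamma_metric(y_true, y_pred, sigma, n_bins=None):
    """Distribution of uncertainty quality: 0 for perfect uncertainty estimates, towards 1 for poor ones."""
    eps = (y_true - y_pred) / sigma                    # scaled errors, Equation (2)
    n = len(eps)
    m = n_bins if n_bins is not None else max(2, int(round(np.sqrt(n))))  # M close to sqrt(N)
    # Equal-probability bins of the standard normal: map each scaled error through the normal CDF
    # and floor into one of M bins; a well-calibrated model puts weight 1/M in each bin.
    idx = np.minimum((norm.cdf(eps) * m).astype(int), m - 1)
    counts = np.bincount(idx, minlength=m)
    return 0.5 * np.sum(np.abs(counts / n - 1.0 / m))  # Equation (3) as reconstructed above
```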
3.2.3. Single metric for model performance
A single metric $ \alpha $ that combines $ {R}^2 $ and $ \Gamma $ is proposed to identify the overall best model. The metric favors a high $ {R}^2 $ value and a low $ \Gamma $ value, and is given by:
$$ \alpha = {R}^2 - \frac{1}{2}\Gamma. \qquad (4) $$
The factor $ \frac{1}{2} $ is chosen so that the variation of each term contributes approximately equally to the variation of $ \alpha $ (Taylor and Conduit, 2022). The final metric achieves a peak value of 1. If all predictions are shifted by a constant, both $ {R}^2 $ and $ \Gamma $ get worse, so the metric rewards larger uncertainty estimates, which is the desired behavior in this scenario.
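Using the gamma_metric sketch above, the combined metric can be written as a one-line function, assuming the form of Equation (4) as reconstructed here:

```python
from sklearn.metrics import r2_score

def alpha_metric(y_true, y_pred, sigma):
    """Single accuracy-uncertainty metric with a peak value of 1 (Equation (4) as assumed above)."""
    return r2_score(y_true, y_pred) - 0.5 * gamma_metric(y_true, y_pred, sigma)
```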
3.3. Comparison of model performance
In the previous section, a combined performance metric integrating accuracy and uncertainty measurement was introduced. This section utilizes this metric to evaluate the performance of the proposed random forest model. The machine learning model has hyperparameters that must be selected to give optimal model performance. To select hyperparameters, leave-one-out cross-validation is performed (Hastie et al., 2001): we start with a putative set of hyperparameters, train a model on all but one row of the dataset, and then blind test against the held-out row. As shown in Figure 6, the procedure is then repeated $ N $ times, with a different held-out row each time, until we obtain predictions for the entire dataset of $ N $ rows, and the accuracy of those predictions is found using the metrics $ {R}^2 $, $ \Gamma $, and $ \alpha $. The blind testing in cross-validation is essential: it ensures that we are not simultaneously training and testing the model against the same piece of data, avoiding data leakage and overfitting of the training data, and thereby replicating the real-life usage of the machine learning model to design a concrete formulation that has never before been produced. The suggested hyperparameters are then adjusted and the metrics $ {R}^2 $, $ \Gamma $, and $ \alpha $ recomputed iteratively until the hyperparameters that give the best metric are found; here, we focus on the $ \alpha $ metric. Below we illustrate the procedure by focusing on just one hyperparameter, which also serves to highlight the benefits of optimizing the $ \alpha $ metric.
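A sketch of this selection loop is given below. The array names (X, y, sigma_exp), the candidate hyperparameter values, and the forest settings other than min_samples_leaf are illustrative assumptions; the metric function alpha_metric is the sketch from Section 3.2.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneOut

def loo_predict(X, y, min_samples_leaf):
    """Leave-one-out predictions and per-mix model uncertainties for one hyperparameter value."""
    y_pred, y_std = np.empty(len(y)), np.empty(len(y))
    for train, test in LeaveOneOut().split(X):
        rf = RandomForestRegressor(n_estimators=200, min_samples_leaf=min_samples_leaf,
                                   random_state=0).fit(X[train], y[train])
        per_tree = np.stack([tree.predict(X[test]) for tree in rf.estimators_])
        y_pred[test], y_std[test] = per_tree.mean(), per_tree.std()
    return y_pred, y_std

# Sweep the hyperparameter and keep the value with the best alpha; sigma_exp is the
# experimental uncertainty of each strength measurement.
scores = {}
for leaf in (1, 2, 3, 4, 5):
    y_pred, y_std = loo_predict(X, y, leaf)
    sigma = np.sqrt(y_std**2 + sigma_exp**2)   # quadrature sum of model and experimental uncertainty
    scores[leaf] = alpha_metric(y, y_pred, sigma)
best_min_samples_leaf = max(scores, key=scores.get)
```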

Figure 6. (a) Schematic of leave-one-out cross-validation. Gray squares are entries in the existing data and magenta squares are the test entries for each fold. (b) Model performance for different values of min_samples_leaf.
We vary the min_samples_leaf hyperparameter of the model, which is the minimum number of samples in a leaf of each tree within a random forest, and identify the values that lead to the optimum $ {R}^2 $, $ \Gamma $, and $ \alpha $ for the 56-day strength prediction. The metrics for each value of min_samples_leaf considered are shown in Figure 6.
The table in Figure 6 shows that min_samples_leaf = 1 offers the highest quality of predictions $ {R}^2 $ but a relatively poor quality of uncertainty $ \Gamma $. Increasing min_samples_leaf to 2 better averages the noise, so it gives an improvement in $ \Gamma $ but a relatively small decrease in $ {R}^2 $, resulting in the best value of $ \alpha $. Therefore, the model selected by optimizing $ {R}^2 $ is generally different from the model chosen by optimizing $ \alpha $. This conclusion indicates that the inclusion of uncertainty quality affects the appropriate selection of the machine learning model. This has significant implications for previous work that has omitted consideration of the quality of the uncertainty estimates. Therefore, going forward, we consider the performance metric $ \alpha $, as $ \alpha $ incorporates both $ {R}^2 $ and the quality of uncertainty as defined in Equation 4. This metric is calculated during the leave-one-out cross-validation procedure, and model selection determines the hyperparameters that maximize its value.
The input variables that yield the best overall results for strength prediction are now investigated, using the previously proposed just-in-time approach. Table 1 demonstrates model performance depending on whether slump and/or 28-day strength inputs are used. Mix proportions include the ordinary Portland cement, GGBS, fine aggregate, coarse aggregate, and water contents, used alongside the water–binder, sand–aggregate, and aggregate–binder ratios. The last row of the table shows the results for 56-day predictions when 28-day strength is included as an additional input. For each set of inputs and the output, the best value of min_samples_leaf is selected by optimizing $ \alpha $ on leave-one-out cross-validation. This value of min_samples_leaf was different from the value of min_samples_leaf that optimizes $ {R}^2 $ for 28-day predictions using slump and for 56-day predictions without using slump.
Table 1. Model performance for different inputs and outputs

Including slump improves $ {R}^2 $ from 0.820 to 0.848 and also improves $ \Gamma $, which reduces from 0.207 to 0.172, demonstrating the utility of fresh-state concrete properties for 28-day strength predictions. The inclusion of slump as an input for 56-day strength predictions improves $ \Gamma $ (0.200 to 0.133) at the cost of a marginally smaller $ {R}^2 $ (0.834 to 0.821). The noise observed in the slump measurements reflects the underlying variability of the concrete constituents mixed together and is captured by the model, resulting in better uncertainty estimates for strength. When 28-day strength is used as an input to predict 56-day strength, we see an improvement in $ {R}^2 $ from 0.821 to 0.897, driven by the strong correlation between 28-day strength and 56-day strength. We show the performance of this final model in Figure 7, where we plot predicted versus experimentally measured 56-day strength. We observe that predictions agree with experimental results within model uncertainty, confirming the strong correlations between properties and the utility of fresh-state measurements to predict strength using the random forest model. The effectiveness of this approach to predict the performance of unseen concrete mixes is evaluated in the following section.

Figure 7. The predicted versus experimental 56 day strength with model uncertainty (blue points). The magenta line shows the ideal trend.
4. Experimental validation
Given the optimization of the machine learning model in the previous section, this section considers the performance of the model in predicting the strength of previously unseen concrete mixes. In particular, the benefit of incorporating intermediate data (slump and earlier-age strength) to improve the longer-term strength predictions and confidence intervals of the two concrete mixes shown in Table 2 is considered. The two mixes are expected to have significantly different compressive strengths and, hence, represent a good opportunity to test the predictions of the developed model across a strength range. Furthermore, these mixes are interesting from a concrete technology perspective as they both contain high levels of OPC replacement material (50% GGBS content) and, therefore, have a lower embodied carbon compared to OPC-only mixes. Mix B contains a low aggregate–binder ratio of 3.0, so it would be of particular interest to practitioners: compressive strength is expected to increase with a decreased aggregate–binder ratio, offering an opportunity to partly offset the reduced strength associated with GGBS inclusion while maintaining sustainability credentials (Poon and Lam, 2008). Furthermore, this mix sits outside the range of binder content and aggregate–binder ratio within the training data, so it offers an opportunity to test the extrapolation capabilities of the model in the presence of sparse training data, replicating practical conditions. The mix proportions and test results attained during the experimental series are shown in Table 3, and further details are provided in the Supplementary Material.
Table 2. Comparison of Mix A, Mix B, and training data parameters

a The property target values.
Table 3. Mix composition and fresh and hardened testing results for the experimental series

Machine learning predictions of 28-day strength and the experimental value of compressive strength obtained for Mix A are presented in Figure 8a. Without using slump as an input, machine learning predicts a 90% confidence interval that does not overlap with the 90% confidence interval of the experimental result. When adding slump as an input, machine learning takes advantage of the additional information to not only give a predicted strength that is closer to the experimental value but also a more realistic uncertainty estimate, which results in a 90% confidence interval that includes the experimental value. This is advantageous given the predicted strength is above the experimental results in both models. The machine learning predictions of 28 day strength and its experimental value for Mix B are presented in Figure 8b. Both models give 90% confidence intervals that include the experimental value. However, error bars are large for predictions from both models. This is expected since Mix B has a formulation far from those within the training dataset. The model with slump gives a slightly more conservative strength prediction with a higher uncertainty, which is better for practical applications.

Figure 8. Predictions of 28-day strength for (a) Mix A using the model without slump (blue) and model with slump (orange) and (b) Mix B using the model without slump (blue) and model with slump (orange). Small error bars represent standard deviation of predictions, large error bars correspond to 90% confidence intervals of predictions. Dashed vertical line is the experimental value, and gray shaded area is 90% confidence interval of the experimental value.
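The 90% confidence intervals shown as the larger error bars can be formed from the predicted mean and the combined uncertainty, assuming normally distributed prediction errors; a minimal sketch with illustrative variable names:

```python
from scipy.stats import norm

z90 = norm.ppf(0.95)               # two-sided 90% interval: approximately 1.645 standard deviations
ci_lower = y_pred - z90 * sigma    # sigma is the combined model and experimental uncertainty
ci_upper = y_pred + z90 * sigma
```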
The machine learning predictions of 56-day strength and the experimental value of compressive strength obtained for Mix A are presented in Figure 9a. First, the prediction from the model that optimizes $ \alpha $ agrees with the experimental value within one standard deviation, whereas the prediction from the model that optimizes $ {R}^2 $ does not. This endorses the approach of optimizing $ \alpha $ rather than $ {R}^2 $ to select the model that gives both accurate predictions and good uncertainty estimates. Furthermore, the prediction improves as fresh state data is added to the model input. When 28-day strength is included as an input, the prediction becomes even closer to the experimental value, again supporting the live updating of strength predictions as information becomes available. The machine learning predictions of 56-day strength and the experimental value of compressive strength obtained for Mix B are presented in Figure 9b. Again, with 28-day strength as an input, the predicted strength is significantly improved, and moreover, the uncertainty is reduced owing to the strong correlation between 28- and 56-day strength, demonstrating the utility of the proposed machine learning approach to accurately predict strength and the associated uncertainty. In the following section, the practical limitations and suggested uses for the developed model and approach are discussed.

Figure 9. Predictions of 56-day strength for (a) Mix A using the model without slump that optimizes $ \alpha $ (blue), the model without slump that optimizes $ {R}^2 $ (purple), the model with slump that optimizes $ \alpha $ (orange), and the model with slump and 28-day strength that optimizes $ \alpha $ (green); (b) Mix B using the model without slump that optimizes $ \alpha $ (blue), the model without slump that optimizes $ {R}^2 $ (purple), the model with slump that optimizes $ \alpha $ (orange), and the model with slump and 28-day strength that optimizes $ \alpha $ (green). Small error bars represent standard deviation of predictions, large error bars correspond to 90% confidence intervals of predictions. Dashed vertical line is the experimental value, and gray shaded area is 90% confidence interval of the experimental value.
5. Discussion
Following the demonstration of the utility of fresh state properties and a combined accuracy-uncertainty metric in a newly developed machine learning approach for concrete strength prediction, this section discusses the limitations and future improvements of the work. In Section 5.1, the limitations of the work are reviewed. In Section 5.2, the ramifications for the concrete industry are set out, and for machine learning practitioners in Section 5.3.
5.1. Limitations and future improvements
In this section, the key limitations of the modeling are reviewed, and suggestions for future improvements are presented. Data is key for machine learning to train a detailed, accurate model and to produce a model with the breadth to apply to many concrete mixes. In this study, we adopted a dataset with only 29 mixes, which is comparable to the datasets that exist in practice for materials where extensive mix trials have not been conducted. Although the adopted dataset and training approach led to accurate strength predictions, the developed model may have limited extension capabilities to other lower-carbon materials. The full extension capabilities to alternative materials, e.g., calcined clays and fly ashes, have not been tested in the current work. Therefore, a key topic for future work is to gather more data to train the machine learning model on alternative emerging materials, and to conduct experimental validation on the proposed machine learning model for such materials.
The paper demonstrates the benefits of optimizing the model that gives excellent uncertainty estimates, and how uncertainty estimates can be effectively used to focus on robust predictions. However, the simultaneous optimization of quality and uncertainties in the metric $ \alpha $, with a relative weighting of $ 1/2 $ in Equation 4, should be explored further in a theoretical work. This future work should ascertain whether the optimum weighting factor should dynamically change. For example, when model quality is poor, greater weight should be assigned to the quality of uncertainty to deliver an improved understanding of robustness.
The paper focused solely on predicting strength at 28 and 56 days. However, many other properties are of interest in the commercial use of concrete, including carbon footprint, density, shrinkage, durability, workability, permeability, and cost. Follow-up work should extend the proposed machine learning approach to these properties to develop models that can be used practically in the real world.
5.2. Uses for concrete industry digitization
This work demonstrates that using fresh state variables as inputs for machine learning can improve the accuracy of strength predictions and the quality of uncertainty estimates in these predictions. The uncertainty estimates provided by the tool can be compared with the risk profile of the concrete mix, for example, the cost and development time associated with the rejection of the mix if deemed unsuitable. For practical applications, this means that the inclusion of recently obtained slump test results as a model input can improve the accuracy of strength predictions and estimates of uncertainty. Therefore, by updating strength predictions with the slump attained at delivery, a real-time decision can be made on whether to accept the concrete mix for use at the construction site, or to revisit the question after 28 days. This is particularly valuable in the context of growing uncertainty with smaller datasets containing concrete mixes with a wider range of constituent materials.
Just-in-time predictions support the growing interest in digitizing the concrete sector and the broader construction industry. Recently, methods have been developed to track concrete materials across the supply chain, linking constituents to test data, including slump and strength testing (White et al., 2024). This work's machine learning strength prediction methodology can be embedded in such modular digital frameworks. For the end user, automated strength predictions for each concrete element could be observed and updated as time progresses and new test data is added. Furthermore, in a similar fashion to strength prediction, the prediction of carbonation, permeability, or long-term durability may be aided by intermediate measures such as slump height or fresh density and is, therefore, the subject of further investigation.
5.3. Uses for machine learning applications
This work has demonstrated that it is crucial to have a metric that can assess both the quality of predictions and uncertainty estimates concurrently to determine the optimum machine learning model. The approach set out in this study allows for selecting the most suitable model for predicting concrete strength, and its inputs can be adjusted as new test data becomes available to improve strength predictions. This process can help decide whether the concrete should be approved at a construction site. This method represents a significant improvement over previous efforts, which solely focused on prediction accuracy and relied on large amounts of test data that may not be available for emerging low-carbon materials. The approach to assess both predictions and uncertainty estimates has potential applications in areas beyond concrete mix design, including financial markets (Sharpe, 1966), cancer diagnosis (Hunter et al., 2022), additive manufacturing (Rasiya et al., 2021), information engineering (Płaczek and Bernaś, 2014), and language processing (Zerva et al., 2017).
6. Conclusions
Limited empirical data exist for emerging lower-carbon materials, which presents challenges in adopting machine learning approaches to predict key properties, such as concrete strength. This work used a random forest machine learning model to predict concrete strength when trained on a sparse dataset, reflecting the datasets observed for emerging materials. A novel metric to assess prediction performance that combines accuracy and uncertainty estimates is proposed to reflect the pertinent concerns of sparse datasets. Model selection using this metric differs from the traditional method, which relies solely on accuracy metrics.
In addition to a model assessment method that combines accuracy and uncertainty, fundamental understanding of the concrete lifecycle has led to the development of a just-in-time machine learning prediction technique. This approach uses intermediate test results, such as slump and early-age strength, as inputs for a machine learning model, meaning long-term strength prediction accuracy and uncertainty improve through time. Cross-validation and experimental testing of unseen concrete mixes outside the sparse training data demonstrate that the just-in-time approach results in progressively more accurate confidence intervals for strength predictions. Predictions of 56-day strength for the two mixes were experimentally validated to within a 90% confidence interval when using slump as an input and further improved by using 28-day strength. Overall, these results are promising for the practical implementation of machine learning in the concrete industry, using real-time measurements throughout the life cycle.
Supplementary material
The supplementary material for this article can be found at http://doi.org/10.1017/dce.2025.10018.
Acknowledgements
The authors acknowledge the support of Heidelberg Materials UK for the provision of cementitious materials and Wisdom Asotah for useful discussion.
Author contribution
Conceptualization: B.Z., C.W., J.M.L., G.J.C.; Data curation: C.W.; Formal analysis: B.Z., C.W.; Funding acquisition: J.M.L., G.J.C.; Investigation: C.W.; Methodology: B.Z., C.W.; Project administration: J.M.L., G.J.C.; Software: B.Z.; Supervision: J.M.L., G.J.C.; Validation: B.Z., C.W.; Visualization: B.Z., C.W.; Writing—original draft: B.Z., C.W.; Writing— review and editing: J.M.L., G.J.C.
Competing interests
B.Z., C.W., and J.M.L. confirm there are no competing interests to declare. G.C. is a Director of the machine learning company Intellegens.
Data availability statement
Supplementary material containing the data utilized in this work has been provided.
Funding statement
This work was supported by the Engineering and Physical Sciences Research Council and Costain PLC [grant number EP/S02302X/1], the Engineering and Physical Sciences Research Council [grant number EP/N017668/1], and the Harding Distinguished Postgraduate Scholars Programme Leverage Scheme.