Impact statement
Existing research on data-driven infrastructure performance forecasting has focused predominantly on algorithm development, but practical implementation requires considerations beyond identifying the best-performing machine learning algorithm. This work systematically examines the effects of data preprocessing strategies, the volume of historical infrastructure data, and the forecast horizon on the accuracy and reliability of infrastructure performance forecasts. The key impact of this work lies in offering critical insights into the strengths and limitations of different implementation strategies related to these practical considerations. By addressing these factors, this work provides engineers with practical guidance and quantitative engineering knowledge to support the more effective and reliable application of data-driven techniques in infrastructure performance forecasting.
1. Introduction
Asset management encompasses a broad spectrum of activities. Mehairjan and Mehairjan (2017) defined asset management as “making financial investment decisions so that returns are maximized while satisfying risk tolerance and other investor requirements.” In civil and infrastructure engineering research, infrastructure asset management encompasses the strategies and processes of operating, maintaining, upgrading, and expanding physical assets effectively throughout their life cycle (MnDOT, 2016; Hosseini, 2020). A crucial aspect of infrastructure asset management involves conducting thorough assessments to understand current infrastructure conditions. The outcomes of these assessments then critically influence the estimation of future infrastructure performance, providing essential evidence to support decision-making and the development of infrastructure asset management plans.
Forecasting infrastructure performance is a complex task, and various techniques are available in engineering practice. For example, numerical simulation is widely used across different types of infrastructure (Minhoto et al., 2005; Zhang et al., 2015; Cao et al., 2019). However, this approach often faces criticism for requiring high-level expertise and subjective assumptions in building numerical models. Furthermore, it is not always straightforward to determine the values of model and material parameters. In view of the ever-increasing amount of data available for engineers to interpret, data-driven techniques have emerged as a prominent focus in civil and infrastructure engineering research. A common and intuitive strategy involves leveraging the trends and engineering information extracted from monitoring and historical performance data to aid infrastructure performance forecasting. Statistical tools such as Bayesian model updating (Hsein Juang et al., 2013; Tabatabaee and Ziyadi, 2013; Li and Jia, 2020; Wang et al., 2021; Huang et al., 2022) and regression techniques (Jibson, 2007; Luo, 2011; Leng et al., 2018; Ciulla and D’Amico, 2019; Lim and Chi, 2019; Glashier et al., 2024) are commonly employed to forecast the performance of a range of engineering systems based on historical data. However, as infrastructure systems become increasingly complex, statistical tools may exhibit limitations in interpreting the associated data.
In recent years, machine learning has become an increasingly popular data-driven technique for forecasting infrastructure performance. A growing list of machine learning algorithms, including artificial neural networks (ANNs), support vector machines (SVMs), random forest (RF), and extreme gradient boosting (XGBoost), among others, has been developed and applied to infrastructure asset management and performance forecasting (e.g., Melhem and Cheng, 2003; Koch and Brilakis, 2011; German et al., 2012; Marcelino et al., 2019; Plakandaras et al., 2019; Pei and Qiu, 2023; Wang et al., 2024). These machine learning algorithms have been shown to be effective in handling big and complex datasets, demonstrating promising accuracy and reliability in forecasting infrastructure performance.
In the literature, research has predominantly focused on algorithm development, with comparative studies (e.g., Yin et al., 2018; Mangalathu and Jeon, 2019; Bashar and Torres-Machi, 2021) attempting to advocate for the best-performing machine learning algorithms for forecasting tasks. However, while the choice of predictive model is important, the accuracy and reliability of forecasts also crucially depend on data preprocessing (Guo et al., 2021a). This aspect is particularly important because historical infrastructure data may contain various types of information. For example, environmental data (e.g., temperature and precipitation) and historical infrastructure condition data typically follow a time series format, while material properties, such as the strength of concrete and the stiffness of asphalt, are often independent of time. Incorporating information of incompatible formats may therefore require additional effort during the data preprocessing stage (Bukharin et al., 2021). However, existing studies tend to focus either on time-series analysis using historical performance data (e.g., Ahmed et al., 2006) or on predictive modelling based on explanatory variables such as environmental data (e.g., Barros et al., 2022). There is a lack of systematic studies that summarize and compare data preprocessing schemes to harmonize both historical performance data and explanatory variables. As a result, there is limited knowledge and guidance to help clarify and quantify the value of each data component in the forecasting process.
In this paper, a literature review is first conducted to identify common strategies for preprocessing infrastructure data for data-driven performance forecasting. Special attention is given to understanding how time series data and non-time series data are integrated into the framework of infrastructure performance forecasting. Based on the outcomes of the literature review, four categories are identified to classify data-driven performance forecasting models according to their data preprocessing schemes. Subsequently, a dataset containing real performance data is utilized to compare the four categories. Specifically, pavement performance is chosen as the context, and the Long-Term Pavement Performance (LTPP) dataset (Elkins et al., 2003) is employed for a detailed benchmarking comparison of the four categories of data-driven forecasting models.
In summary, the present study aims to comprehensively compare and evaluate the effects of varying preprocessing strategies on the reliability and accuracy of machine-learning-based forecasting models. The objective is to provide practical, quantitative information and guidelines for engineers to determine the appropriate volume of historical infrastructure data needed to achieve a reasonable level of forecasting accuracy and to understand the limits within which data-driven techniques can reasonably forecast. Ultimately, the quantitative information is expected to support the more effective and reliable application of data-driven techniques for short- to medium-term infrastructure operation management and decision-making.
2. Background
2.1. The long-term pavement performance (LTPP) dataset
The data used in the present study was collected from the Long-Term Pavement Performance (LTPP) database (Elkins et al., 2003; LTPP, 2018). This open-source database covers an extensive set of pavement test sections in the United States, along with a wide array of related information, including traffic and weather-related data, material properties, and maintenance and rehabilitation activities. In addition to the International Roughness Index (IRI), the database also reports defects such as the length of longitudinal cracks and the length of transverse cracks. Table 1 summarizes the inputs and outputs considered in the present study. After processing the dataset, 200 road sections are included in the present study (Wang and Sun, 2025). Each road section contains at least 5 years of historical performance data, resulting in 1899 pavement survey data points for analysis. The key considerations in processing the LTPP dataset are explained as follows:
• The inputs considered in the present study were identified through a literature review (Sadek et al., 1996; Anyala et al., 2012; Marcelino et al., 2019; Zeiada et al., 2020; Damirchilo et al., 2021; Guo et al., 2021a, 2021b; Perrotta et al., 2023) and consultations with pavement engineers.
• The inputs consist of both time series data and non-time series data. For example, variables related to climate, traffic, and performance are time series data, and each of these variables is therefore associated with values recorded annually. In contrast, design variables related to material properties are typically non-time series data, and a single value is assigned to each of these variables throughout the time frame considered in the present study.
• Panel data is utilized. Panel data refers to data based on cross-sections measured over time and reflects both cross-sectional and time-dependent characteristics. The use of panel data is a prerequisite to ensure that the forecasting model can learn and utilize the time-dependent characteristics (Justo-Silva et al., 2021).
• Major maintenance activities, such as pavement milling and grinding, are taken into account. When such major maintenance activities occur, the associated panel data of the road section is split into two parts: before and after the maintenance activity. Each part is treated as a distinct set of panel data. It is assumed that only these major maintenance activities can significantly impact pavement performance; other minor repairs are excluded from the analysis (see the sketch following this list).
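For illustration, the splitting rule described in the last bullet might be implemented as follows. This is a minimal sketch assuming a per-section table with a `year` column; the column name and table layout are illustrative placeholders, not the LTPP schema.

```python
import pandas as pd

def split_at_maintenance(section_df: pd.DataFrame, maintenance_years: list) -> list:
    """Split one road section's annual records into distinct panels at each
    major maintenance event (e.g., milling or grinding), so that data before
    and after maintenance are treated as separate panel datasets."""
    panels, start = [], section_df["year"].min()
    for m_year in sorted(maintenance_years):
        before = section_df[(section_df["year"] >= start) & (section_df["year"] < m_year)]
        if not before.empty:
            panels.append(before.reset_index(drop=True))
        start = m_year  # records from the maintenance year onward start a new panel
    after = section_df[section_df["year"] >= start]
    if not after.empty:
        panels.append(after.reset_index(drop=True))
    return panels
```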
Table 1. Inputs and outputs considered in the LTPP dataset

# These explanatory variables are non-time series data. From a practical standpoint, a single value is assigned to these variables throughout the time frame considered in the present study.
* Historical data is used as inputs.
2.2. Time factor in infrastructure performance forecasting
Data-driven techniques often rely on years of historical infrastructure data to forecast future performance. Conceptually, an ideal data-driven performance forecasting model can be expressed as follows:
$ {P}_{t+m}=f\left({P}_t,{P}_{t-1},\dots, {P}_{t-n},\dots, {P}_0,{X}_t,{X}_{t-1},\dots, {X}_{t-n}\right) $ (1)
where $ {P}_t $ is the performance at time t, often referred to as the present, and $ {P}_0 $ is the initial performance at the time of project completion. Depending on the frequency of inspection or monitoring, time t can be in days, months, or years. $ {P}_{t+m} $ refers to the performance in the future, with the value of m determined by the forecast horizon. Similarly, $ {P}_{t-n} $ represents historical performance data at time t – n, with the value of n determined by the volume of historical data. X denotes explanatory variables, such as temperature, load, precipitation, and material strength, among others.
In practice, X can consist of two types of explanatory variables: time series variables and non-time series variables. For time series explanatory variables, $ {X}_t,{X}_{t-1},{X}_{t-2},{X}_{t-3},\dots, {X}_{t-n} $ are included in performance forecasting. Conversely, for non-time series explanatory variables, static values are utilized. Evidently, both performance data and some explanatory variables are sequential data that depend on time (Durango-Cohen, 2007; Haas et al., 2015). It is therefore essential to consider the sequential characteristics of both groups of information. Trivial as this concept may seem, many existing performance forecasting models do not adhere to it.
2.3. Existing classification of performance forecasting models
Based on studies reported in the literature (e.g., Yang et al., 2003; Justo-Silva et al., 2021; Hu et al., 2022), infrastructure performance forecasting models can be reasonably classified according to the following criteria: (i) type of formulation, (ii) conceptual format, and (iii) time factor.
Regarding the type of formulation, forecasting models can be deterministic, probabilistic, or hybrid. Deterministic models forecast the exact value of performance without providing associated uncertainty estimates. In addition, deterministic models do not consider uncertainties associated with the inputs of the model. Consequently, Hu et al. (2022) argue that deterministic models may have high accuracy for specific infrastructure, but they tend to oversimplify the change in performance over time. Most importantly, deterministic models lack uncertainty information, which is crucial for decision-making. In contrast, probabilistic models address these limitations by considering uncertainties in both inputs and forecasted quantities, thereby better representing reality and improving forecasting accuracy and reliability.
Concerning the conceptual format, forecasting models can be mechanistic, empirical, mechanistic-empirical, or machine-learning models. Mechanistic models are formulated based on mechanistic principles and therefore closely adhere to physics. Empirical models, in contrast, rely mainly on observations and mathematical functions to establish relationships between performance and chosen explanatory variables; the observed relationships may not necessarily follow physics principles. Mechanistic-empirical models are therefore proposed to leverage the strengths of both mechanistic principles and empirical analysis. Machine learning models, compared with the other three classes, are advanced models capable of learning complex relationships that may not be easily captured through basic empirical analysis. However, it is often argued that machine learning models may sometimes contradict the laws of physics and therefore lack physics-based grounding and a self-explanatory nature (Pei et al., 2023; Chen et al., 2024).
The time factor in infrastructure performance forecasting has not been as rigorously examined as the two preceding criteria. Yang et al. (2003) highlighted that there are two types of forecasting models: static and dynamic. Conceptually, a static model can be described as follows:
$ {P}_t=f\left({X}_t\right) $ (2)
As evident, static models omit the time factor in their formulation, which may result in several limitations. First, historical performance data, that is, $ {P}_{t-1},{P}_{t-2},{P}_{t-3},\dots, {P}_{t-n} $, are not considered. It is commonly understood that the performance at a specific time depends on its state at preceding times, so this omission may limit forecasting capability. Second, historical explanatory variables, that is, $ {X}_{t-1},{X}_{t-2},{X}_{t-3},\dots, {X}_{t-n} $, are also omitted, further limiting forecasting capability. Third, the time lag between X and P is not considered (Marcelino et al., 2019). This omission has significant implications in practice. For example, to forecast performance 2 years ahead, that is, $ {P}_{t+2} $, according to Eq. (2), one must know $ {X}_{t+2} $. In other words, one must first estimate the explanatory variables before forecasting performance, which may introduce errors into the system. In some cases, it may not be practically realistic to extrapolate explanatory variables.
In contrast, a dynamic model can be conceptually described as follows:
$ {P}_{t+m}=f\left({P}_t,{P}_{t-1},\dots, {P}_{t-n},{X}_{t+m},\dots, {X}_{t+1},{X}_t,\dots, {X}_{t-n}\right) $ (3)
Dynamic models effectively consider both historical performance data and explanatory variables, and understanding the dynamics of infrastructure performance over time improves the accuracy and reliability of forecasts (Haas et al., 2015). However, there is still a subtle difference between Eqs. (1) and (3): the time lag between X and P is not considered in Eq. (3). Users are still required to provide $ {X}_{t+m},\dots, {X}_{t+2},{X}_{t+1} $ in forecasting $ {P}_{t+m} $, which may limit the accuracy and reliability of these models in practice.
3. Categorization of forecasting models
Through a literature review focused on pavement infrastructure, four categories are identified to classify performance forecasting models. The categorization centers on handling the time factor through data preprocessing schemes. Table 2 summarizes the four categories along with brief descriptions. Figures 1 and 2 provide visual representations of the formulation and demonstrate how the four categories of models can be applied to forecast future performance for 1-year and 2-year forecast horizons, respectively.
Table 2. Proposed four categories of infrastructure performance forecasting models


Figure 1. An illustration of the four categories of models (forecast horizon = 1 year). (a) Formulation of Category A model. (b) Formulation of Category B model. (c) Formulation of Category C model. (d) Formulation of Category D model.

Figure 2. An illustration of the four categories of models (forecast horizon = 2 years). (a) Formulation of Category A model. (b) Formulation of Category B model. (c) Formulation of Category C model. (d) Formulation of Category D model.
3.1. Category A
Referring to Eq. (3), the dynamic models commonly reported in the literature form Category A in the proposed classification. As shown in Table 2, this category of models is frequently reported in the literature for forecasting tasks, although the time lag between performance data and explanatory variables is not considered. Practical models, such as AASHTO1993, HDM-III, and HDM-4 (Paterson and Attoh-Okine, 1992; Attoh-Okine, 1994; Kerali et al., 2000), are believed to fall into this category. Figure 1(a) provides an illustration of the formulation. As depicted in the figure, to predict performance at time t + 1, explanatory variables from t – 2 to t + 1 and performance data from t – 2 to t are utilized. All this information is utilized as independent, individual inputs of Category A models for the forecasting task. Following the illustration shown in Figure 1(a), this model has seven inputs and one output.
The absence of a time lag between explanatory variables and infrastructure performance could lead to an issue. For example, when forecasting performance at time t + 1, explanatory variables at time t + 1 need to be known a priori. Similarly, as shown in Figure 2(a), when forecasting performance at time t + 2, explanatory variables at times t + 1 and t + 2 need to be known beforehand. In practice, it is not always straightforward to obtain these explanatory variables at time t + 2 or even at time t + 1. In many cases, explanatory variables at times t + 1 and t + 2 need to be forecasted first using historical explanatory variables. Therefore, one criticism of Category A models is that errors may be amplified when they are used to forecast performance: errors in the extrapolated explanatory variables could introduce additional errors into performance forecasts. However, it can also be argued that the formulation of Category A models is the most rigorous. In principle, explanatory variables can be interpreted as “loads” or “actions” exerted on infrastructure. Infrastructure performance at time t + 1 or t + 2 is the direct outcome of the “loads” or “actions” of that year, that is, the explanatory variables at times t + 1 and t + 2. As a result, explanatory variables at times t + 1 and t + 2 should theoretically be included in forecasting.
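A minimal sketch of the extrapolation step that precedes forecasting in Category A models is given below, assuming annual observations and a simple polynomial trend; the function name is illustrative.

```python
import numpy as np

def extrapolate_explanatory(series, horizon, order=1):
    """Category A preprocessing: fit a polynomial of the given order to an
    observed time series explanatory variable (e.g., annual traffic) and
    extrapolate it 'horizon' years ahead. Any error in this extrapolation
    propagates into the subsequent performance forecast."""
    t = np.arange(len(series))
    coeffs = np.polyfit(t, series, order)
    future_t = np.arange(len(series), len(series) + horizon)
    return np.polyval(coeffs, future_t)

# e.g., estimate the next two years of traffic from five observed years:
# traffic_future = extrapolate_explanatory(traffic_history, horizon=2, order=1)
```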
3.2. Category B
As seen in Figure 1(b), Category B models operate solely within the domain of performance data: performance at time t + 1 is forecasted based solely on performance data from t – 2 to t. A similar illustration for a forecast horizon of 2 years is shown in Figure 2(b). MicroPaver, for example, is a practical pavement engineering tool belonging to this category. Following the illustration in Figure 1(b), this model has three inputs and one output. In other words, Category B models can be referred to as time-series models, where a time series is a sequence of observations arranged by the index of time. In principle, performance data is time series data because performance at a particular time step is influenced by the state in the preceding time step. In practice, infrastructure in moderately good condition is likely to deteriorate faster than infrastructure in good condition, and once infrastructure reaches a state of poor performance, the rate of deterioration will likely decrease (Butt et al., 1987; Mers et al., 2022).
In practice, regression techniques are commonly adopted for Category B models. For example, a linear/non-linear regression model can be used to fit performance data (Kobayashi et al., 2012) from time t – 2 to t to forecast performance at time t + 1. While such regression-based models have been shown to effectively forecast performance at the project level (e.g., a single infrastructure asset such as a building or a bridge), they have limited forecasting capabilities at the network level (e.g., the performance of a cluster of infrastructure assets). Therefore, autoregression or Bayesian-based methods (Ahmed et al., 2006; Luo, 2011; Pantuso et al., 2019) have been proposed to forecast at both project and network levels. Markov models are also Category B models commonly used for performance forecasting (Butt et al., 1987; Kobayashi et al., 2012; Lethanh and Adey, 2012). However, Markov models are applied to discrete data rather than continuous variables; they therefore typically forecast the state of performance instead of the values of a performance index. Furthermore, deep-learning approaches, such as recurrent neural networks or long short-term memory networks, have also been applied in Category B models (Hosseini et al., 2020).
However, regardless of the techniques used in Category B models to process time-series data, a common limitation is that explanatory variables are absent from the models or are not explicitly considered. It is widely recognized that factors such as environmental conditions influence infrastructure performance. Moreover, Category B models implicitly disregard potential radical changes in external factors, rendering them unable to handle the impact of future changes in explanatory variables, such as the introduction of new-generation vehicles (e.g., autonomous and electric vehicles, which are much heavier), climate change leading to hotter weather conditions, and more frequent natural hazards. The omission of the valuable information embedded in these explanatory variables therefore greatly limits their forecasting capabilities.
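As an illustration of the time-series-only formulation, a polynomial-based Category B forecast can be sketched as follows; this is a generic implementation of the idea rather than the exact procedure of any cited study.

```python
import numpy as np

def category_b_forecast(perf_history, horizon, order=1):
    """Forecast future performance from the performance series alone,
    without explanatory variables, by extrapolating a polynomial trend.
    With short histories, higher-order polynomials can produce the kind
    of unrealistic extrapolations discussed in the text."""
    t = np.arange(len(perf_history))
    coeffs = np.polyfit(t, perf_history, order)
    return np.polyval(coeffs, len(perf_history) - 1 + horizon)
```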
3.3. Category C
Motivated by the argument related to the time lag between explanatory variables and performance data, Category C models (e.g., Sadek et al., 1996; Hong and Prozzi, 2006; Inkoom et al., 2019) explicitly consider the time lag. As shown in Figure 1(c), to predict performance at time t + 1, explanatory variables from t – 2 to t and performance data from t – 2 to t are utilized. Through this configuration, while both historical performance data and explanatory variables are incorporated into forecasting infrastructure performance, Category C models avoid the extra step in Category A models, that is, extrapolating explanatory variables to time t + 1. Category C models are therefore more practical because they utilize the data and information currently available without making any further assumptions, such as the extrapolation in Category A models. A similar illustration for a forecast horizon of 2 years is shown in Figure 2(c).
There is an additional configuration applied to Category C models. Compared to Category A models, which employ all information as individual inputs, Category C models aggregate this information, as shown in Figure 1(c). For example, Sadek et al. (1996) employed linear regression to forecast the future distress maintenance rating (DMR) by considering historical DMR data, pavement age, and traffic-related data. In their implementation, the traffic-related variable is represented by the average yearly equivalent single axle loads (YESAL), calculated by dividing the cumulative equivalent single axle loads (ESALs) accumulated from the time of construction to the DMR date by the corresponding pavement age. Similarly, Barros et al. (2022) employed the cumulative ESALs in their analysis. Through such an operation, the number of inputs required for Category C models is reduced. Referring to the illustration shown in Figure 1(c), in contrast to Category A models with seven inputs, this model has two inputs and one output.
The motivation behind this configuration is that as the number of inputs and the years of historical performance data and explanatory variables increase, the construction of performance forecasting models becomes more challenging. In many cases, simple regression techniques, such as multi-linear regression, may show limited capabilities, yet many practitioners prefer simple regression models for practical reasons. Aggregation techniques, such as averaging or summation, therefore offer a workaround. However, it is argued that the sequential characteristics of historical performance data and explanatory variables may be overlooked to a certain extent, potentially undermining the forecasting capability of Category C models in some scenarios.
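The aggregation workaround can be sketched as follows. The choice of which variables to sum and which to average is guided by their physical meaning (loads accumulate; condition indices do not), and the variable names are illustrative rather than the study's exact subcategories.

```python
import numpy as np

def aggregate_inputs(esal_history, precip_history, iri_history):
    """Category C preprocessing: collapse multi-year histories into a few
    aggregated inputs so that simple regression models remain tractable."""
    return np.array([
        np.sum(esal_history),    # cumulative traffic loading over the window
        np.sum(precip_history),  # cumulative precipitation over the window
        np.mean(iri_history),    # average historical condition (summing IRI
                                 # would have no physical meaning)
    ])
```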
3.4. Category D
As machine learning approaches have advanced over the past few decades, the concerns regarding Category C models described earlier can be addressed to a certain extent. As shown in Figure 1(d), Category D models, in contrast to Category C models, revise the formulation by explicitly considering historical performance data and explanatory variables as individual inputs while also considering the time lag between inputs and outputs. A similar illustration for a forecast horizon of 2 years is shown in Figure 2(d). For example, Marcelino et al. (2019) utilized 6 years of data to construct a forecasting model, with the explanatory variables corresponding to each of the 6 years explicitly and individually considered as independent inputs of the model. Similarly, Choi and Do (2020) employed a 10-year dataset to forecast pavement rutting depth, crack percentage, and IRI, with information such as traffic and temperature in each year explicitly considered as individual inputs. As a result, the number of inputs can become very large with a large dataset of historical infrastructure performance. For example, in one of the cases reported in Marcelino et al. (2019), there are in total 54 inputs based on 6 years of performance data. In this regard, basic regression models may no longer be adequate (Mers et al., 2022), and advanced machine learning algorithms are often necessary for Category D models.
Another criticism of Category D models is that explanatory variables at time t + 1 and/or t + 2, which represent direct “loads/actions,” are not considered. The information incorporated into Category D models is therefore technically incomplete. When comparing Category D and Category A models, a dilemma clearly arises. Category A models suggest that explanatory variables at time t + 1 and/or t + 2 need to be included even though the associated values are estimates and may not be accurate. Category D models, on the other hand, are premised on the notion that only available information that is accurate or has been measured should be included in the forecasting task. In essence, this dilemma further justifies the need to compare these categories of models and understand their respective strengths and limitations.
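A sketch of the Category D input construction is given below: every historical year's performance and explanatory variables enter as individual inputs, only data observed up to time t is used, and the target is performance m years ahead. Column names and windowing details are assumptions for illustration.

```python
import pandas as pd

def build_lagged_features(panel: pd.DataFrame, n_hist: int, horizon: int,
                          perf_col: str = "IRI"):
    """Build one training row per feasible year t: the inputs are all cells
    of the n_hist-year window ending at t (performance and explanatory
    variables, each as an individual feature), and the target is P_{t+m}."""
    rows, targets = [], []
    for t in range(n_hist - 1, len(panel) - horizon):
        window = panel.iloc[t - n_hist + 1 : t + 1]        # years t-n_hist+1 ... t
        rows.append(window.to_numpy().ravel())             # one feature per cell
        targets.append(panel[perf_col].iloc[t + horizon])  # target P_{t+m}
    return pd.DataFrame(rows), pd.Series(targets)
```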
4. Design of comparative analyses
This section provides detailed information on the implementations of the four model categories described in Section 3. It also presents two case studies to illustrate variations in their predictive performance, thereby reinforcing the rationale for comparing the different data preprocessing strategies adopted in each model category. This section serves as a foundation for the subsequent parametric analyses conducted using the large LTPP dataset.
4.1. Algorithm formulation
Tables 3 and 4 provide detailed information pertaining to the implementations of the four categories of models. Random forest, which has been reported in several studies as a promising machine learning algorithm, is selected as the main algorithm employed in the present study. Results of preliminary analyses also indicated that random forest outperforms SVM, ANNs, and XGBoost for the present predictive task. Polynomial regression is employed in both Category A and Category B models. However, in Category A models, polynomial regression is used to extrapolate time-dependent explanatory variables before building random forest-based forecasting models, whereas polynomial regression is used in Category B models to directly forecast performance. In addition, an S-curve function (e.g., Eq. [4]), which is commonly used to model infrastructure performance over time, is also implemented in Category B models.

where a and b are model parameters, P represents the performance metric, and t refers to time.
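As a sketch of the fitting step, the snippet below assumes a two-parameter logistic form for the S-curve; the exact functional form of Eq. (4) should follow the study's definition, and the fitting routine shown is standard SciPy nonlinear least squares.

```python
import numpy as np
from scipy.optimize import curve_fit

def s_curve(t, a, b):
    # Assumed logistic form with parameters a and b; substitute the exact
    # S-curve expression of Eq. (4) as appropriate.
    return a / (1.0 + np.exp(-b * t))

def fit_and_forecast(perf_history, horizon):
    """Fit the S-curve to historical performance observed at years 0..T-1,
    then evaluate the fitted curve 'horizon' years beyond the last year."""
    years = np.arange(len(perf_history))
    (a, b), _ = curve_fit(s_curve, years, perf_history,
                          p0=[max(perf_history), 0.1], maxfev=10000)
    return s_curve(len(perf_history) - 1 + horizon, a, b)
```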
Table 3. Details of the comparative study

* The comparative study is implemented based on a parametric analysis involving:
• Number of years of historical data considered in building forecasting models: 1–5 years.
• Number of years of performance to forecast: 1–5 years.
Table 4. Subcategories considered in Category C models

Furthermore, three subcategories are considered within Category C models to investigate the impact of different data aggregation philosophies. In addition, for all four categories, a parametric analysis involving two key factors relevant to engineering practice is conducted: the number of years of historical data considered in building forecasting models and the forecast horizon. In the comparative parametric analyses, both the number of years of historical data and the forecast horizon vary from 1 to 5 years, resulting in 25 cases for Category D models and the S-curve function in Category B models, and 75 cases for Category A, B, and C models (including 1st- to 3rd-order polynomial regression and the three subcategories shown in Table 4).
Last, for the random forest implementation, 80% of the total LTPP dataset is used for training, and the remaining 20% is reserved for testing. The four categories of models are then compared based on their performance on the test dataset. In addition, since the split of the LTPP dataset into training and test datasets is random, 20 repeated runs with different seeds are conducted. The results of the comparative study are reported as the average across these repeated runs, and the variations across runs are also reported to assess the reliability of model performance.
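The evaluation protocol can be sketched as follows, with scikit-learn defaults standing in for the study's hyperparameter settings; X and y denote the preprocessed inputs and targets produced by any of the four model categories.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def evaluate_with_repeats(X, y, n_repeats=20, test_size=0.2):
    """Repeat a random 80/20 split n_repeats times with different seeds,
    train a random forest on each training split, and report the mean and
    standard deviation of r^2 on the corresponding test splits."""
    scores = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        model = RandomForestRegressor(random_state=seed).fit(X_tr, y_tr)
        scores.append(r2_score(y_te, model.predict(X_te)))
    return float(np.mean(scores)), float(np.std(scores))
```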
4.2. Illustrative examples
As shown in Figures 3 and 4, two illustrative road sections are presented to highlight the differences across the four model categories. Subplots (a)–(c) represent, respectively, cases that utilize 3 years of historical data to forecast 1 to 3 years ahead. Similarly, subplots (d)–(f) show, respectively, cases that utilize 5 years of historical data to forecast 1 to 3 years ahead. Several key observations are summarized below.

Figure 3. Comparison based on a selected road section in the state of Ohio, United States. (a) 3 years of historical data with 1 year of forecast horizon. (b) 3 years of historical data with 2 years of forecast horizon. (c) 3 years of historical data with 3 years of forecast horizon. (d) 5 years of historical data with 1 year of forecast horizon. (e) 5 years of historical data with 2 years of forecast horizon. (f) 5 years of historical data with 3 years of forecast horizon.

Figure 4. Comparison based on a selected road section in the state of California, United States. (a) 3 years of historical data with 1 year of forecast horizon. (b) 3 years of historical data with 2 years of forecast horizon. (c) 3 years of historical data with 3 years of forecast horizon. (d) 5 years of historical data with 1 year of forecast horizon. (e) 5 years of historical data with 2 years of forecast horizon. (f) 5 years of historical data with 3 years of forecast horizon.
First, all models produce forecasts with varying accuracy even though the historical performance data and machine learning algorithm considered are consistent. Second, the incorporation of more historical data generally improves the performance of all models, as evident in the comparison between Figures 3(b) and (e), and Figures 4(c) and (f). Furthermore, forecasting accuracy tends to decrease across all models as the forecast horizon increases, for example, Figures 3(e) and (f).
Moreover, the choice of polynomial order in Category A models results in varying forecasting accuracy. This is in line with the earlier hypothesis that errors in the extrapolated explanatory variables may introduce additional errors in forecasted infrastructure performance. In addition, significant variations in forecasting accuracy are observed across the three orders of polynomial regressions used in Category B models. While the 1st-order polynomial generally performs well in both illustrative examples, the 2nd and 3rd-order polynomials may yield unrealistic forecasting results. Since Category B models forecast performance based solely on historical performance data without considering explanatory variables, the absence of engineering information embedded in explanatory variables may limit their performance. As a result, forecasting accuracy heavily relies on the interpretation of historical performance data and the underlying assumptions associated with the forecasting models. The S-curve function, in contrast, provides a reasonable level of forecasting accuracy compared to polynomial regression. Last, forecasting accuracy associated with Category C and D models appears to be similar and reasonable for the two illustrative examples.
In summary, although the raw data and the machine learning algorithm are consistent across the four model categories and the associated sub-categories, their predictive performance varies. This observation highlights the significant impact of data preprocessing strategies on forecasting outcomes and underscores the need for a comprehensive comparative study. In subsequent sections, the performance of all models will be statistically quantified using the large LTPP dataset and compared to understand their respective strengths and limitations.
5. Results of parametric analysis
In this section, the four model categories are implemented using the large LTPP dataset. Sections 5.1 to 5.4 present individual evaluations of the reliability and accuracy of all models to highlight their respective strengths and limitations. Section 5.5 provides a comparative analysis across all model categories, followed by a summary of key findings.
5.1. Performance evaluation of Category A models
Figure 5 provides a classic visualization strategy to assess the accuracy of Category A models based on the test dataset. The results are based on the employment of a 1st-order polynomial to extrapolate time-dependent explanatory variables. Subplots (a)–(d) show, respectively, cases utilizing 2 to 5 years of historical data to forecast 1 year ahead.

Figure 5. Examples of statistical evaluation of Category A models (forecast horizon = 1 year). (a) 2 years of historical data. (b) 3 years of historical data. (c) 4 years of historical data. (d) 5 years of historical data.
In general, the results cluster reasonably around the 1:1 line, demonstrating that forecasts agree reasonably well with the ground truth. The coefficient of determination ($ {r}^2 $) is utilized in the present study to measure predictive accuracy. The interpretation of $ {r}^2 $ can be subjective and is often dependent on the nature of the study. In the present study, which involves field performance data collected in uncontrolled environments and influenced by numerous external factors, a threshold $ {r}^2 $ value of 0.6 is chosen to represent an acceptable level of accuracy. As shown in the figure, even with only 2 years of historical data, it is possible to forecast the performance 1 year ahead with a good level of accuracy ($ {r}^2 $ = 0.77). Furthermore, as the volume of historical data increases, forecasting accuracy improves, as evidenced by the increase in $ {r}^2 $ values from 0.72 to 0.82.
Figure 6 shows the full results of the parametric analysis conducted for Category A models. The average results across the 20 repeated runs are shown in the figure. It is worth noting that the minimum volume of historical data required varies depending on the polynomial order chosen. For example, at least 3 years of historical data is needed to implement a Category A model with a 2nd-order polynomial. Comparing the results across the three subplots, the choice of polynomial order has a very significant impact on forecasting accuracy. As shown in Figure 6(a), Category A models employing a 1st-order polynomial yield reasonable forecasting accuracy across all the parametric cases considered. However, Category A models with a 2nd or 3rd-order polynomial cannot be confidently used for the forecasting tasks. This is likely attributed to the errors in the extrapolated explanatory variables of the model, potentially amplified by a 2nd or 3rd-order polynomial. These observations, which are in line with those shown in Figures 3 and 4, underscore a critical limitation of Category A models. Although explanatory variables associated with the year to be forecasted are included in the forecasting tasks, errors in extrapolations could undermine the reliability of the forecasts.

Figure 6. Results of parametric analysis of Category A models. (a) 1st-order polynomial. (b) 2nd-order polynomial. (c) 3rd-order polynomial.
Figure 7 shows the uncertainty associated with selected parametric cases. Error bars represent one standard deviation of the $ {r}^2 $ value calculated based on the 20 repeated runs. As shown in Figure 7(a), uncertainty in forecasts decreases as the volume of historical data increases, further illustrating the impact of data volume on forecasting tasks. In contrast, Figure 7(b) shows that although a Category A model with a 2nd-order polynomial yields reasonable accuracy ($ {r}^2 $ > 0.6) for a forecast horizon of 1 year, the uncertainty is comparatively large. The large uncertainty indicates that errors in the extrapolated explanatory variables may undermine the forecasting accuracy of Category A models.

Figure 7. Uncertainties in the results of Category A models (forecast horizon = 1 year). (a) 1st-order polynomial. (b) 2nd-order polynomial.
5.2. Performance evaluation of Category B models
Figure 8 summarizes the results of the parametric analysis of Category B models involving three choices of polynomial order and the S-curve function. The results of the S-curve function, represented as dashed lines, are repeated across the three subplots for comparison purposes.

Figure 8. Results of parametric analysis of Category B models. (a) 1st-order polynomial. (b) 2nd-order polynomial. (c) 3rd-order polynomial.
As shown in Figure 8, the choice of polynomial order has a significant impact on forecasting accuracy, with the 1st-order polynomial statistically performing much better than the 2nd- and 3rd-order polynomials. In particular, a 3rd-order polynomial cannot be employed as a reliable forecasting model regardless of the volume of historical data. As shown in Figure 8(c), even with 5 years of historical data, a 3rd-order polynomial struggles to accurately forecast performance even 1 year ahead. In addition, referring to the results shown in Figure 8(a), the 1st-order polynomial is statistically better than the S-curve function in most cases. The S-curve function, in turn, outperforms both the 2nd- and 3rd-order polynomials. However, the differences between the S-curve function and the 1st-order polynomial diminish as the volume of historical data increases, implying that the S-curve function requires more historical data than a 1st-order polynomial to yield reliable forecasts. These results demonstrate the limitations of Category B models discussed in Section 3.2. Since only historical performance data is considered, without examining the associated explanatory variables, Category B models are highly dependent on the quality of historical performance data and the choice of predictive algorithm. Moreover, the impact of the volume of historical data on forecasting accuracy is clearly shown by the results of the 1st-order polynomial: as the volume of historical data increases, forecasting accuracy improves. While this trend can also be observed in the results of the S-curve function, it is not as pronounced. Furthermore, the forecasting accuracy of all models decreases as the forecast horizon increases.
Figure 9 shows the uncertainties in selected results of Category B models based on 20 repeated runs. It is seen that uncertainty in both models decreases as the volume of historical data increases, further demonstrating the impact of the volume of historical data on forecasting performance. However, the reduction in uncertainty for the 1st-order polynomial is more significant than for the S-curve function. In addition, the uncertainty associated with the 1st-order polynomial is smaller than that of the S-curve function.

Figure 9. Uncertainties in the results of Category B models. (a) 1st-order polynomial. (b) S-curve function.
5.3. Performance evaluation of Category C models
The three subcategories shown in Table 4 are compared. Figures 10(a)–(e) show the comparison for forecast horizons ranging from 1 to 5 years. While the performance of C2 and C3 models is largely similar, their performance is lower than that of C1 models in most cases, albeit by a small margin. The subtle modifications made in C1 models compared to C2 and C3 embed a certain level of physical principle into the model. For instance, it is reasonable to sum traffic volume because pavement deterioration results from the cumulative effects of traffic; similarly, precipitation data should be summed over the time frame considered. In contrast, historical IRI values cannot simply be summed because the resulting value would lose its physical meaning; averaging is therefore a more appropriate aggregation strategy to maintain the physical meaning of the data. Similarly, as shown in Figure 11, the uncertainty associated with C1 models is observed to be slightly smaller than the uncertainties associated with C2 and C3 models. In a nutshell, this comparison clearly highlights the importance of data preprocessing and engineering knowledge in implementing data-driven performance forecasting tasks.

Figure 10. Results of parametric analysis of Category C models. (a) 1 year of forecast horizon. (b) 2 years of forecast horizon. (c) 3 years of forecast horizon. (d) 4 years of forecast horizon. (e) 5 years of forecast horizon.

Figure 11. Uncertainties in the results of Category C models. (a) 2 years of forecast horizon. (b) 2 years of historical data.
In general, the forecasting accuracy of Category C models is reasonable, with $ {r}^2 $ values exceeding 0.6 in most cases. It can be deduced that the inclusion of more historical data leads to improvement in forecasting accuracy, but the improvement is slight. It is also observed that Category C models can reasonably forecast performance with less than a proportionate volume of historical data. For example, Category C models can reasonably forecast performance 5 years ahead (i.e., $ {r}^2 $ > 0.6) even with only 2 years of historical data. Similarly, referring to the results shown in Figure 11, the uncertainty associated with forecasting accuracy is also moderate, even in cases with less than a proportionate volume of historical data.
5.4. Performance evaluation of Category D models
Figure 12 shows the results of the parametric analysis of Category D models. Similar to the results shown in Figure 10, Category D models demonstrate a good level of forecasting accuracy for all the parametric cases considered (i.e., $ {r}^2 $ > 0.6). It is also observed that Category D models can reasonably forecast performance with less than a proportionate volume of historical data. Similarly, referring to the results shown in Figures 12(b) and (c), the uncertainty associated with forecasting accuracy remains moderate even in cases with less than a proportionate volume of historical data. Furthermore, the impact of the volume of historical data on forecasting accuracy is not as pronounced in Figure 12 as it is in Figure 8(a). In general, it can be inferred that the inclusion of more historical data primarily aids in reducing uncertainty rather than significantly improving absolute forecasting accuracy in Category D models.

Figure 12. Results of parametric analysis of Category D models. (a) All models. (b) Uncertainties for a 2-year forecast horizon. (c) Uncertainties based on 3 years of historical data.
5.5. A benchmark comparison
Figure 13 offers a comprehensive comparison of the results obtained from all four categories of models. All models demonstrate reasonable performance in forecasting IRI values. As the forecast horizon extends, the performance of all models gradually decreases. For cases with a forecast horizon exceeding 4 years (i.e., Figures 13(d) and (e)), the performance of several models falls below the adopted threshold $ {r}^2 $ value of 0.6. These observations highlight not only the importance of selecting an appropriate data preprocessing strategy but also the need for caution when applying these models over extended forecast horizons.

Figure 13. Comparing all four categories of models for IRI forecasts. (a) 1 year of forecast horizon. (b) 2 years of forecast horizon. (c) 3 years of forecast horizon. (d) 4 years of forecast horizon. (e) 5 years of forecast horizon.
Category C and Category D models emerge as promising choices when historical data is limited (i.e., <2 years), especially for longer forecast horizons. When only 1 year of historical data is available, these categories are the sole viable options. However, Category D models consistently outperform Category C models across all parametric cases. Referring to the formulations illustrated in Figures 1 and 2, the key distinction between Category C and Category D models is that input information in Category C models is aggregated, while in Category D models, individual historical data points are explicitly considered. Category D models therefore offer a higher level of comprehensiveness in the information considered, contributing to their superior forecasting accuracy compared to Category C models.
Furthermore, Category A models may not always be a better choice than Category C and Category D models when dealing with a small volume of historical performance data and a long forecast horizon. This observation is intuitive: with limited historical data, the extrapolation of explanatory variables may be less reliable, particularly for longer forecast horizons. As a result, the potentially larger errors in extrapolated explanatory variables may significantly undermine forecasting accuracy. As the volume of historical data increases, however, Category A models become increasingly favourable: with a larger historical dataset, explanatory variables can likely be extrapolated more accurately, resulting in enhanced overall forecasting accuracy. Even so, the improvement in forecasting accuracy of Category A models over Category C and Category D models is marginal.
Based on the information presented in this section, some key findings and recommendations are given as follows:
(i) The four model categories demonstrate differences in forecast reliability and accuracy, confirming the effects of data preprocessing strategies on infrastructure performance forecasts. Category A and B models perform poorly when historical data is limited but improve with a larger volume of data. In contrast, Category C and D models offer higher accuracy under data-scarce conditions, making them more suitable for forecasting tasks with limited historical input.
(ii) Based on the results, it is recommended that, to achieve accurate forecasts, the volume of historical data should span a time duration comparable to the intended forecast horizon.
(iii) For IRI, forecasts up to 5 years are generally achievable with acceptable levels of accuracy and reliability.
6. Discussion
In this section, the broader applicability of the findings from earlier sections is evaluated and discussed in relation to other pavement performance indicators and machine learning algorithms. The objective is to assess the generalizability of the key findings.
6.1. Applications to other performance metrics
While IRI is a common performance metric used in pavement engineering, information related to surface defects is equally critical for effective decision-making. In addition, the relatively stable nature of IRI values (e.g., Figures 3 and 4) often allows techniques assuming a linear relationship between IRI and time to suffice. Therefore, it is necessary to compare the four categories of models across other performance metrics. This evaluation facilitates understanding of the general applicability of the recommendations given in the preceding section across other scenarios. In this section, all four categories of models are implemented to forecast the length of longitudinal and transverse cracks based on the same LTPP dataset. Figure 14 presents the comparison of all models in forecasting the length of longitudinal cracks. In this case, results of Category B models implemented using all three choices of polynomial order are shown. It is worth noting that some bars are absent from the figure because the corresponding $ {r}^2 $ values fall below zero.

Figure 14. Comparing all four categories of models for longitudinal crack forecasts. (a) 1 year of forecast horizon. (b) 2 years of forecast horizon. (c) 3 years of forecast horizon. (d) 4 years of forecast horizon. (e) 5 years of forecast horizon.
It is much more challenging to forecast the length of longitudinal cracks than to forecast IRI, as evidenced by the much lower $ {r}^2 $ values in Figure 14 compared to those in Figure 13. First, Category A models with a 1st-order polynomial show very poor performance for all considered parametric cases, indicating a significant impact of errors in extrapolated explanatory variables on forecasting accuracy even with a small forecast horizon. This contrasts with the findings shown in Figure 13. The variation of the length of longitudinal cracks with time is a much more complex process than that of IRI; therefore, any error in the extrapolated explanatory variables appears to be amplified. The comparison between Figures 13 and 14 demonstrates the critical limitation of Category A models when applied to complex phenomena.
Similarly, the choice of polynomial order in Category B models significantly influences the forecasting accuracy of longitudinal cracks, but the effects are more complicated. While all three polynomial orders yield reasonable forecasts for a forecast horizon of 1 year, forecasting accuracy quickly deteriorates as the forecast horizon increases. In addition, the S-curve function fails to yield reasonable forecasts in most cases. Because Category B models do not consider explanatory variables, they depend heavily on the choice of forecasting algorithm; in more complex problems, Category B models may show limited capability.
Category C models also show poor performance in most cases. Referring to Section 3.3, Category C models aggregate historical data; it is therefore possible that some information is lost through this operation, resulting in deteriorating forecasting accuracy as the forecasting task becomes more complex, as in forecasting longitudinal crack length. Category D models emerge as the best performer in Figure 14, although their absolute performance is not as strong as that shown in Figure 13. Referring to the formulation of Category D models, explanatory variables and performance data are individually and independently considered as inputs of the model without any additional assumptions or aggregation strategy. Category D models therefore provide a more robust strategy for forecasting based on historical data.
Figure 15 shows the comparison results for forecasts of the length of transverse cracks. As can be seen, forecasting the length of transverse cracks appears to be comparatively easier than forecasting the length of longitudinal cracks. In general, the observations are largely similar to those drawn from Figures 13 and 14. For example, Category D models are promising choices when historical data is limited (i.e., <2 years), especially for longer forecast horizons; when only 1 year of historical data is available, Category D models are the sole viable option. Category D models outperform Category C models in most cases because the comprehensiveness of the input information adopted by Category C models is slightly lower due to the utilization of aggregation strategies. Category B models are promising, but their performance varies across the different choices of predictive algorithm. In a nutshell, based on the results shown in Figures 14 and 15, the recommendations provided in Section 5 remain largely valid for forecasting the lengths of longitudinal and transverse cracks.

Figure 15. Comparing all four categories of models for transverse crack forecasts. (a) 1 year of forecast horizon. (b) 2 years of forecast horizon. (c) 3 years of forecast horizon. (d) 4 years of forecast horizon. (e) 5 years of forecast horizon.
6.2. Applications to other machine learning algorithms
Figure 16 presents the comparison results based on IRI forecasts using an artificial neural network (ANN) as the machine learning algorithm for Category A, C, and D models. The objective is to assess whether the observations and recommendations derived in earlier sections remain consistent when a different machine learning algorithm is employed.

Figure 16. Comparing all four categories of models for IRI forecasts using ANN. (a) 1 year of forecast horizon. (b) 2 years of forecast horizon. (c) 3 years of forecast horizon. (d) 4 years of forecast horizon. (e) 5 years of forecast horizon.
At first glance, the trends shown in the figure closely resemble those observed in Figures 13 to 15. Overall, all models demonstrate reasonable performance across all cases. Category D models outperform Category C models in most cases, and both categories are promising choices when historical data is limited, particularly for longer forecast horizons. In addition, Category A models may not always be a better choice than Category C and Category D models when dealing with a small volume of historical performance data and a long forecast horizon; however, as the volume of historical data increases, Category A models become increasingly favourable. Most importantly, the recommendations given in Section 5 remain largely applicable.
6.3. Limitations
The LTPP dataset adopted in the present study includes 5 years of historical performance data, with the forecast horizon limited to 5 years. As a result, the key findings and recommendations are practically constrained by the range of the available data and are most applicable to short- to medium-term infrastructure management and decision-making. Consequently, the present study is not intended to directly inform long-term planning and policymaking. In addition, maintenance activities are not quantitatively considered in the parametric analysis. As outlined in Section 2.1, when major maintenance activities occur, the associated panel data is split into two parts, before and after the maintenance activity, and each part is treated as a distinct set of panel data. Future studies are therefore recommended to explicitly quantify and incorporate the effects of maintenance activities into performance forecasting models to enhance the practical utility of data-driven approaches.
7. Conclusions
In this paper, a literature review was first conducted to understand how the time factor is handled in performance forecasting models. Based on its outcomes, this paper identified a scheme that classifies infrastructure performance forecasting models into four categories according to how performance data is preprocessed and used to train forecasting models, with a particular focus on the handling of the time factor. Subsequently, the Long-Term Pavement Performance (LTPP) dataset was employed to carry out a benchmark evaluation and comparison of the four proposed categories of forecasting models. Parametric analyses were conducted to understand the interactive effects of the forecast horizon and the volume of historical data on forecasting accuracy. This paper provides engineers with quantitative information and evidence-based guidelines to determine the appropriate volume of historical infrastructure data needed to achieve a reasonable level of forecasting accuracy and to understand the limits within which data-driven techniques can reasonably forecast. Ultimately, the quantitative information supports the more effective and reliable application of data-driven techniques in short- to medium-term infrastructure performance forecasting. Multiple infrastructure performance metrics were also considered to evaluate the general applicability of the findings. Some key conclusions and recommendations are as follows:
(i) The four model categories differ in forecast reliability and accuracy, underscoring the importance of understanding the strengths and limitations of various data preprocessing strategies.
(ii) In data-scarce scenarios, strategies that incorporate both explanatory variables and historical performance data (e.g., Category C and D models) provide better accuracy and reliability.
(iii) Overall, the strategy that incorporates data across multiple time steps as individual inputs is more robust in capturing the time-series nature of the data and yields improved forecast accuracy. As such, Category D represents the most robust data preprocessing strategy for performance forecasting.
(iv) Results for three performance indicators and two machine learning algorithms suggest that, to achieve accurate forecasts, the volume of historical data should span a time duration at least comparable to the intended forecast horizon.
(v) For IRI and transverse crack length, a forecast horizon of up to 5 years is generally achievable.
(vi) For longitudinal crack length, Category D is the only viable data preprocessing strategy, further highlighting the important role of data preprocessing. However, forecasts beyond a three-year horizon are not recommended due to reduced reliability.
Data availability statement
The data used in the present study can be obtained from the Long-Term Pavement Performance (LTPP) database at https://infopave.fhwa.dot.gov/. The datasets used to generate the findings of this study are available from Zenodo at https://doi.org/10.5281/zenodo.15275788.
Acknowledgements
This project is funded by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 101034337.
Author contribution
Z.Z.W: Conceptualization; Data curation; Formal analysis; Investigation; Methodology; Visualization; Writing-original draft. Z.S.: Formal analysis; Investigation; Visualization; Writing-original draft. A.T.: Investigation; Supervision; Writing, review and editing. B.H.: Methodology; Writing, review and editing; Supervision.
Competing interests
The authors declare none.