Impact Statement
This study, among the first of its kind in a low-income country such as Benin, underscores the importance of uncertainty quantification in drought forecasting for these specific localities. Its scenario-based forecasts can help decision-makers, agricultural researchers, and local populations take early precautions against drought occurrence.
1. Introduction
Over recent decades, human activities such as industrialization, deforestation driven by agricultural expansion, and rapid urbanization have intensified the pace of climate change (Mir et al., Reference Mir, Mir, Ganie, Bhat, Shah, Mushtaq, Irshad, Jatav, Raiput and Minkina2025). These anthropogenic pressures have resulted in shifting rainfall patterns, rising temperatures, and prolonged droughts—phenomena that significantly threaten agricultural productivity. Among the various climate-related hazards, drought stands out as one of the most pervasive and damaging in sub-Saharan Africa (SSA) (Aryal et al., Reference Aryal, Rahut and Marenya2023). Between 1900 and 2013, Africa recorded 291 drought events, affecting more than 362 million people and causing 847,000 deaths, with an estimated $3 billion in economic losses (Masih et al., Reference Masih, Maskey, Mussá and Trambauer2014). Droughts can be categorized into four main types: meteorological, hydrological, agricultural, and socioeconomic. They are typically characterized using indices such as the Standardized Precipitation Index (SPI), Crop Moisture Index, Palmer Drought Severity Index, and the Standardized Precipitation Evapotranspiration Index (Yihdego et al., Reference Yihdego, Vaheddoost and Al-Weshah2019).
In SSA, where agriculture is predominantly rainfed, droughts have devastating effects on crop yields, threatening food availability and nutritional security for the region’s 1.15 billion people (Pison et al., Reference Pison, Couppié and Caporali2022; United Nations Department of Economic and Social Affairs Population Division, 2022). This has led to alarming malnutrition rates of 20.4% (FAO et al., 2021). Staple crops, such as maize (Zea mays L.), often fall short of their yield potential, further exacerbating food insecurity (Scheiterle and Birner, Reference Scheiterle and Birner2018). This productivity gap poses a substantial challenge for over 60% of the SSA population, who are directly or indirectly involved in agriculture and heavily reliant on the income generated from their agricultural activities (Oluwatayo and Ojo, Reference Oluwatayo and Ojo2016). In Benin, agriculture plays a central economic and social role, employing ~70% of the active labor force and contributing 33% to the national Gross Domestic Product. The northern region—especially the “Pôle de Développement Agricole 2 Borgou-Alibori”—is critical to national food security, accounting for 36% of Benin’s agricultural output (MAEP, 2017). However, this region is particularly vulnerable to climate change, with drought expected to reduce agricultural production by 5–20% by 2025 (Paeth et al., Reference Paeth, Capo-Chichi and Endlicher2008). This underscores the urgent need for improved drought monitoring and adaptation strategies in rainfed agricultural systems. Machine learning (ML) has emerged as a powerful tool for modeling complex environmental processes, including climate variability. 
It has been successfully applied in diverse contexts, such as flood forecasting (Kazadi et al., Reference Kazadi, Doss-Gollin, Sebastian and Silva2024), precipitation pattern analysis (Mooers et al., Reference Mooers, Beucler, Pritchard and Mandt2024), and rainfall prediction (Aïzansi et al., Reference Aïzansi, Ogunjobi and Ogou2024).
In Benin, recent studies have applied ML and statistical approaches to assess drought risks. For instance, Vodounon et al. (Reference Vodounon, Soude and Mamadou2022), Halissou et al. (Reference Halissou, Eric, Eliézer, Ezéchiel, Bio and Abel2022), Oyerinde and Olowookere (Reference Oyerinde and Olowookere2018), Emmanuel et al. (Reference Emmanuel, Hounguè, Biaou and Badou2019), and Hounkpè et al. (Reference Hounkpè, Diekkrüger, Badou and Afouda2016) analyzed hydrometeorological trends in the Ouémé watershed and the Beninese Niger River basin using indices such as SPI and the Standardized Streamflow Drought Index. These studies revealed substantial shifts in rainfall cycles and documented moderate to severe drought episodes between 1970 and 2016. Halissou et al. (Reference Halissou, Eric, Eliézer, Ezéchiel, Bio and Abel2022) further identified Kompongou (Mékrou), Couberi (Sota), Yakin (Alibori), and Gbasse (Sota) as high-risk sites for hydrological droughts. Projections suggest these patterns will persist and intensify through 2050. Vodounon et al. (Reference Vodounon, Soude and Mamadou2022) demonstrated that ML models could effectively predict SPI at a 12-month scale, achieving a coefficient of determination (R²) of 0.99 and a root mean square error (RMSE) of 0.07. Despite these encouraging results, most existing studies, including those focused on the agriculturally critical northern region, lack quantification of the uncertainties associated with their forecasts.
This omission reduces the practical value of ML models for decision-makers, who need to understand how confident the models are in their predictions to formulate adapted policies (Jalaian et al., Reference Jalaian, Lee and Russell2019; Poggio et al., Reference Poggio, De Sousa, Batjes, Heuvelink, Kempen, Ribeiro and Rossiter2021; Gruber et al., Reference Gruber, Schenk, Schierholz, Kreuter and Kauermann2023; Lilburne et al., Reference Lilburne, Helfenstein, Heuvelink and Eger2024; Simperegui et al., Reference Simperegui, Kouame, Kwesie, Bindraban, Adzawla, Asamoah and El Gharous2025).
This study aims to develop a robust framework for predicting meteorological drought—specifically the 6-month SPI-6—using ML and deep learning (DL) techniques, while also quantifying the associated prediction uncertainties in the Alibori department of northern Benin. Using gridded meteorological datasets from six towns in the region, we intend to: (i) build multiple ML and DL models to predict SPI-6 values, (ii) quantify the uncertainty associated with each model’s predictions, and (iii) compare the performance of ML and DL approaches in terms of accuracy and reliability. The outcomes of this research will support climate risk management by providing stakeholders—particularly farmers and policy-makers—with actionable insights to enhance drought preparedness and resilience.
2. Materials and methods
In this section, we present the methodology adopted in this study, as illustrated in Figure 1. The process begins with a multivariate time series dataset and involves several preprocessing steps, including feature selection, creation of lagged variables, and normalization. The dataset is then split into training and testing subsets. Multiple ML and DL models are then trained, and their predictions are further processed using the Ensemble Batch Prediction Interval (EnbPI) method for uncertainty quantification. Finally, the performance of the models is evaluated using various statistical metrics.

Figure 1. Flowchart of the proposed methodology: From multivariate time series preprocessing to model training, uncertainty quantification with EnbPI, and final performance evaluation using various statistical metrics.
2.1. Study area
Geographically situated between 11°19′ north latitude and 2°55′ east longitude, Alibori is a department in the northeast of the Republic of Benin. It is bordered to the north by the Republic of Niger, to the northwest by the Republic of Burkina Faso, to the east by the Federal Republic of Nigeria, to the west by the Atacora department, and to the south by the department of Borgou. Covering an area of $ 26242\;{\mathrm{km}}^2 $, Alibori is subdivided into six municipalities: Malanville, Karimama, Segbana, Gogounou, Banikoara, and Kandi (Figure 2). For each municipality, one location was selected based on cropland cover as indicated by the MODIS (Moderate Resolution Imaging Spectroradiometer) land cover map of Benin.

Figure 2. Alibori department cartography and geographic distribution of the locations. The cropland cover is derived from the MODIS land cover.
2.2. Data collection
To model the SPI and conduct our analysis, we obtained monthly weather data spanning a range of climate variables. These included temperature, total precipitation, wind speed and direction, surface pressure, relative humidity, cloud cover, soil moisture, top-of-atmosphere shortwave downward irradiance, and all-sky surface shortwave downward irradiance (Table 1). Although the national meteorological agency provides some of these datasets, their spatial coverage is limited and does not extend to many of the locations included in our study. Moreover, the available records do not span the 30-year baseline period recommended by the World Meteorological Organization (WMO) for climate assessments, especially those concerning drought characterization (WMO, 2017). To address these limitations and ensure spatially consistent and temporally comprehensive data, we relied on gridded satellite-based datasets. Two main sources were considered: the NASA Prediction of Worldwide Energy Resources (POWER) dataset, with a spatial resolution of 0.5° latitude by 0.625° longitude ($ \sim 50 $ km), and the ERA5 reanalysis dataset, offering a finer resolution of about 11 km. Given that our analysis is conducted at the municipality level, we visualized both the 50- and 11-km grids using QGIS (Quantum Geographic Information System) to assess their spatial alignment with administrative boundaries (Figure 2). While ERA5 provided higher spatial resolution, it resulted in multiple grid cells per municipality, which would have required aggregation and potentially introduced uncertainty. In contrast, the coarser NASA POWER grid aligns more closely with municipality extents, thereby minimizing the need for spatial averaging. Consequently, we selected NASA POWER as the data source for this study. For each municipality, a single representative location was chosen based on the spatial distribution of cropland, ensuring that the extracted coordinates accurately reflect the areas of dominant agricultural activity (Figure 2). The dataset spans 41 years (1981–2021) and was extracted for the six municipalities under investigation (Malanville, Karimama, Segbana, Gogounou, Banikoara, and Kandi).
Table 1. Description of weather variables used in the study

2.2.1. Standardized Precipitation Index
The SPI is a functional and quantitative definition of drought. It can be calculated at different timescales (3, 6, 12, 24, and 48 months) using monthly rainfall data (Lawin et al., Reference Lawin, Houngue, Biaou and Badou2019; McKee et al., Reference McKee, Doesken and Kleist1993). The SPI is defined as:

$$ {SPI}_i=\frac{P_i-\overline{P}}{\sigma }, $$

where $ {SPI}_i $ is the SPI of month i, $ {P}_i $ is the rainfall of month i, and $ \overline{P} $ and $ \sigma $ are, respectively, the mean rainfall and the standard deviation on the timescale considered. In the present study, SPI6 considers a 6-month timescale and is used to assess drought occurrence on a semester basis. Table 2 provides a guideline for interpreting SPI values.
Table 2. SPI interpretation (Lloyd-Hughes and Saunders, Reference Lloyd-Hughes and Saunders2002)
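To make the computation concrete, the standardization above can be sketched in NumPy for the 6-month timescale (the rainfall values are hypothetical; note that the operational SPI of McKee et al. additionally fits a gamma distribution to the accumulated totals before transforming to a standard normal, so this is a simplified illustration):

```python
import numpy as np

def spi6(rain, window=6):
    """Standardize rolling 6-month rainfall sums per the SPI formula above."""
    rain = np.asarray(rain, dtype=float)
    sums = np.convolve(rain, np.ones(window), mode="valid")  # 6-month totals
    spi = (sums - sums.mean()) / sums.std()
    # The first window-1 months are undefined, matching the five missing
    # leading SPI6 observations noted in Section 2.2.2.
    return np.concatenate([np.full(window - 1, np.nan), spi])

rain = np.random.default_rng(0).gamma(2.0, 50.0, size=120)  # hypothetical monthly rainfall
s = spi6(rain)
```

By construction, the valid part of the series has zero mean and unit standard deviation, so values below −1 flag moderate-or-worse drought semesters per Table 2.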

2.2.2. Data preprocessing
Before modeling the SPI6, the data underwent a quality control process. Among the variables, CLOUD_AMT, TOA_SW_DWN, and ALLSKY_SFC_SW_DWN had 36 missing values ($ \sim 8\% $). Additionally, due to the 6-month timescale required to calculate SPI6, the first five observations of this variable were missing. To prevent these gaps from impacting the model performance, we excluded the CLOUD_AMT, TOA_SW_DWN, and ALLSKY_SFC_SW_DWN variables and removed the first five observations from the dataset, resulting in 487 observations across the datasets.
Moreover, using Pearson correlation for feature selection, variables with an absolute correlation below 0.1 with SPI6 were removed, leaving datasets of seven variables (Malanville, Karimama, and Banikoara) and eight variables (Segbana, Kandi, and Gogounou). Furthermore, time series data exhibit relationships over historical periods (Stübinger and Adler, Reference Stübinger and Adler2020). To capture these patterns, we generated lagged versions of each variable using a window size of 5 (using the SPI6 of the last 5 months to predict the sixth), creating five new lagged variables (t−5, t−4, t−3, t−2, and t−1) for each original covariate. Including the current-time (t) observations, this yields a total of 42 variables (for 7 covariates) or 48 (for 8 covariates).
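The lag-generation step can be sketched with pandas (a hypothetical two-column frame is used here; each covariate yields six columns, so seven covariates give 42 and eight give 48):

```python
import pandas as pd

# Hypothetical frame with two covariates; the real datasets hold 7-8.
df = pd.DataFrame({"T2M": range(20), "SPI6": range(20)})

# For each covariate, add the 5 lagged copies t-5 ... t-1 alongside time t.
lagged = {}
for col in df.columns:
    for k in range(5, 0, -1):
        lagged[f"{col}_t-{k}"] = df[col].shift(k)
    lagged[f"{col}_t"] = df[col]
lagged = pd.DataFrame(lagged).dropna()  # the first 5 rows lack full history
```

Dropping the incomplete leading rows mirrors the removal of the first five observations described above.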
Subsequently, using the Pearson correlation matrix, we performed feature selection by retaining, for each covariate, the lag (t−5, t−4, t−3, t−2, t−1, or t) with the highest absolute Pearson correlation with SPI6 (Table 3 and Figure 3). Finally, to normalize the scale of all covariates and eliminate the influence of variables with larger magnitudes, we applied a min-max scaler, bringing all covariates into the same range $ \left[-1,1\right] $.
Table 3. Lag retained per covariate and locality

Note. Blank cells indicate that none of the lags for that covariate had an absolute Pearson correlation coefficient >0.1 with SPI6, leading to their removal.

Figure 3. Barplots of the correlations between SPI6 and lagged weather variables.
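The per-covariate lag selection can be sketched as follows (the column names and correlations are hypothetical; lag t−3 is constructed to carry the strongest signal):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
base = pd.Series(rng.normal(size=200))  # stands in for the SPI6 target

# Hypothetical lagged candidates for one covariate: t-3 is built to be
# far more correlated with the target than the other lags.
cands = pd.DataFrame({
    f"T2M_t-{k}": base + rng.normal(scale=(0.1 if k == 3 else 2.0), size=200)
    for k in range(1, 6)
})

# Retain the single lag with the highest absolute Pearson correlation.
corrs = cands.corrwith(base).abs()
best_lag = corrs.idxmax()
```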
2.3. Modeling
We employed a range of models to forecast time series data, encompassing both traditional ML methods and DL architectures.
2.3.1. Presentation of models
• Linear regression

Linear regression by Huang (Reference Huang2020) is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship of the form:

$$ y={\beta}_0+{\beta}_1{x}_1+{\beta}_2{x}_2+\dots +{\beta}_n{x}_n+\varepsilon, $$

where $ y $ represents the predicted value of the target variable, $ {\beta}_0 $ is the intercept term (the value of $ y $ when all predictors are zero), $ {\beta}_i $ are the coefficients associated with each independent variable $ {x}_i $ (for $ i=1,\dots, n $), which quantify the effect of each predictor on the target, and $ \varepsilon $ is the error term that captures the unexplained variability or noise in the data. The model estimates coefficients that minimize the difference between predicted and actual values. It is widely used as a baseline due to its simplicity, interpretability, and low computational cost. However, in the context of time-series forecasting, its performance can be limited. It does not account for temporal dependencies or autocorrelation between observations, and it cannot model seasonality or long-term trends unless these are explicitly encoded as features. Consequently, its relevance is constrained when the data exhibit complex time-based dynamics.
• Support vector regression (SVR)

SVR by Basak et al. (Reference Basak, Pal, Ch and Patranabis2007) adapts the principles of support vector machines to regression problems. It aims to find a function of the form:

$$ f(x)=\left\langle w,x\right\rangle +b, $$

that approximates the data within a specified tolerance level $ \varepsilon $, while keeping the model as flat as possible. In this expression, $ \left\langle w,x\right\rangle $ denotes the dot product between the weight vector $ w $ and the input vector $ x $, which defines the orientation of the regression hyperplane. The scalar $ b $ is the bias term that shifts the hyperplane vertically. The parameter $ \varepsilon $ defines an $ \varepsilon $-insensitive margin around the predicted values, within which no penalty is assigned for prediction errors. The objective of SVR is to find the flattest possible function (i.e., minimizing $ \parallel w{\parallel}^2 $) that fits as many data points as possible within this margin. SVR is known for its robustness to outliers and its ability to model nonlinear relationships through kernel functions. Despite these strengths, SVR has several drawbacks in time-series applications. It is computationally expensive for large datasets and sensitive to the choice of kernel and hyperparameters. Moreover, SVR does not inherently capture sequential dependencies in data, making it less effective for long-range forecasting unless temporal structures are carefully embedded into the feature space.
• Random forest

Random forest by Breiman (Reference Breiman2001) is an ensemble learning method that constructs multiple decision trees using bootstrapped samples of the data and aggregates their outputs for prediction. For regression tasks, the final output is typically the average of all tree predictions:

$$ \hat{y}=\frac{1}{T}\sum \limits_{t=1}^T{f}_t(x), $$

where each $ {f}_t $ is the prediction from an individual decision tree and $ T $ is the number of trees. This approach reduces variance and overfitting while capturing nonlinear relationships between inputs and outputs. While random forest is powerful and robust, it lacks temporal awareness. It does not consider the order of data points, and it cannot extrapolate future trends beyond the training range. To be effective for time series, it relies heavily on the inclusion of lag features or engineered time-based inputs.
• eXtreme gradient boosting (XGBoost)
XGBoost by Chen and Guestrin (Reference Chen and Guestrin2016) is an optimized implementation of gradient boosting that combines decision trees in a sequential manner, where each new tree corrects the residuals of the previous ones. It is known for its speed, regularization capabilities, and strong performance on structured data. It handles missing values well and includes several techniques to prevent overfitting. However, XGBoost is not specifically designed for time-series data. Without careful feature engineering (e.g., time lags and trend indicators), it cannot model temporal dependencies directly. Additionally, it may overfit if the model complexity is not well-controlled, especially when dealing with small or noisy datasets.
• Light gradient boosting machine (LightGBM)
LightGBM by Ke et al. (Reference Ke, Meng, Finley, Wang, Chen, Ma, Ye and Liu2017) is a gradient-boosting framework that improves upon earlier methods like XGBoost through optimizations, such as Gradient-based One-Side Sampling and Exclusive Feature Bundling. These enhancements make it highly efficient in terms of training speed and memory usage, mainly for large datasets with many features. Like other tree-based models, LightGBM can capture nonlinear interactions and works well with heterogeneous data. However, it does not intrinsically understand time-based relationships and, like XGBoost, depends on handcrafted temporal features. Its aggressive sampling strategy may also bias the model toward frequent or recent events, which can be problematic for forecasting rare or long-term patterns.
• One-dimensional convolutional neural network (Conv1D)
Conv1D by Kiranyaz et al. (Reference Kiranyaz, Avci, Abdeljaber, Ince, Gabbouj and Inman2021) is a type of Convolutional Neural Network (CNN) model designed to process sequential data using one-dimensional convolutional filters. These filters slide over the time dimension to detect local patterns and short-term dependencies. The model can extract meaningful features from fixed-length windows and is typically composed of multiple layers, including convolutions, nonlinear activation functions, and pooling operations, followed by fully connected layers for prediction. While Conv1D is computationally efficient and effective at identifying localized temporal patterns, it is naturally limited in its ability to capture long-term dependencies. It also requires consistent input lengths and may struggle with nonstationary data unless combined with deeper or recurrent structures.
• Long short-term memory networks (LSTMs)
LSTM by Hochreiter and Schmidhuber (Reference Hochreiter and Schmidhuber1997) is a specialized type of recurrent neural network (RNN) designed to handle the limitations of traditional RNNs in learning long-term dependencies. It introduces memory cells and gating mechanisms that control the flow of information across time steps, allowing the model to retain and forget information as needed. LSTMs are well-suited for sequential data with complex temporal relationships, such as time series with trends and seasonality. However, they come with high computational costs, especially for long sequences or deep architectures. They also require large datasets to generalize well and are prone to overfitting if regularization is not carefully applied.
• Gated recurrent units (GRUs)
GRU by Chung et al. (Reference Chung, Gulcehre, Cho and Bengio2014) is a variant of LSTM that simplifies the network architecture by merging the input and forget gates into a single update gate. This leads to faster training and reduced memory usage while still allowing the model to capture temporal dependencies. GRUs have been shown to perform comparably to LSTMs on many time-series tasks, especially when computational efficiency is a concern. However, their simplified structure may limit their ability to model very long-term dependencies. Like LSTMs, they also require careful tuning and regularization to avoid overfitting, specifically when training data are scarce or highly variable.
• Conv1D-LSTM
Conv1D-LSTM by Shi et al. (Reference Shi, Chen, Wang, Yeung, W-K and Woo2015) is a hybrid model that combines one-dimensional convolutional layers with LSTM units. The convolutional layers serve to extract local features and reduce the dimensionality of the input sequence before passing it to the LSTM layers, which model the temporal evolution of these features. This architecture is particularly effective for spatiotemporal forecasting, where both short-term patterns and long-term dependencies are essential. It leverages the strengths of both CNNs and RNNs, but also inherits their complexities. The model requires more computational resources and careful tuning, and its interpretability is lower compared to simpler models. Overfitting can also be an issue if the dataset is not adequately large or diverse.
2.3.2. Experimental setup
For our experiments, across all localities, the dataset was split at the date January 31, 2014, resulting in a training dataset of length 387 and a test dataset of 96. Before introducing the ML models, all features and the target variable were normalized using min-max scaling, applied only on the training set to prevent data leakage. The fitted scaler was then applied to the test set for evaluation of the models. The transformation follows the formula implemented in scikit-learn’s MinMaxScaler (Pedregosa et al., Reference Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss, Dubourg, Vanderplas, Passos, Cournapeau, Brucher, Perrot and Duchesnay2011):

$$ {x}^{\prime }=\frac{x-{x}_{\mathrm{min}}}{{x}_{\mathrm{max}}-{x}_{\mathrm{min}}}\left(\max \_\mathrm{range}-\min \_\mathrm{range}\right)+\min \_\mathrm{range}. $$

In our case, we used $ \min \_\mathrm{range}=-1 $ and $ \max \_\mathrm{range}=1 $. The ML models built for these datasets are linear regression, ridge regression, random forest, XGBoost, and LightGBM, using the default parameters from the scikit-learn, XGBoost, and LightGBM libraries.
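A minimal sketch of this leakage-free scaling, assuming a single hypothetical feature:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.arange(20, dtype=float).reshape(-1, 1)  # hypothetical single feature
X_train, X_test = X[:15], X[15:]

# Fit on the training set only, then reuse the fitted scaler on the test
# set, so no test-set information leaks into the transformation.
scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```

Because the scaler is fitted on the training set alone, test values outside the training range map outside $ \left[-1,1\right] $, which is the intended behavior.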
Before implementing DL models, the dataset was reshaped into the format $ \left({n}_{\mathrm{elements}},\mathrm{sequence}\_\mathrm{length},{n}_{\mathrm{features}}\right) $, where $ {n}_{\mathrm{elements}} $ is the number of samples, $ \mathrm{sequence}\_\mathrm{length} $ is the fixed window size, and $ {n}_{\mathrm{features}} $ is the number of predictors. In this case, the window size was set to 5 and seven features were used, resulting in an input shape of $ \left(n,5,7\right) $.
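The windowing can be sketched in NumPy (the series here is hypothetical; the actual pipeline derives its inputs from the lagged features described in Section 2.2.2):

```python
import numpy as np

# Hypothetical multivariate series: 487 monthly steps, 7 predictors.
series = np.random.default_rng(1).normal(size=(487, 7))
window = 5

# Stack overlapping 5-step windows into (n_elements, sequence_length, n_features);
# the target is the value one step after each window (here, the first column).
X = np.stack([series[i:i + window] for i in range(len(series) - window)])
y = series[window:, 0]
```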
To ensure robust training, 20% of the training set was reserved for validation. Normalization was performed using a MinMaxScaler fitted on the training set, with the same scaling parameters applied to the validation and test sets. All DL models were developed using the Keras API within TensorFlow. A consistent training setup was applied across all models and localities: a batch size of 64, the Adam optimizer with a learning rate of 0.001, mean squared error as the validation loss function, and RMSE as the evaluation metric. Monitoring the validation loss during training showed that convergence was typically achieved just before the 100th epoch across all localities and models, so the number of epochs was set to 100.
The implemented architectures include a Conv1D model consisting of a 1D convolutional layer with 64 filters and a kernel size of 1, followed by dense layers (see Table 4); an LSTM model composed of two stacked LSTM layers with 64 and 32 units (Table 5); a GRU variant using the same configuration as the LSTM but with GRU layers (Table 6); and a hybrid Conv1D-LSTM model that starts with a Conv1D layer (32 filters and kernel size 3), followed by two LSTM layers with 64 and 32 units, and dense layers (Table 7).
Table 4. Architecture of the Conv1d model used in the study

Table 5. Architecture of the LSTM model used in the study

Table 6. Architecture of the GRU model used in the study

Table 7. Architecture of the Conv1D-LSTM hybrid model used in the study

All experiments were conducted on a machine running Ubuntu 24.04.2 LTS (64-bit), with Python 3.12.3 and TensorFlow 2.16.2. The system features a 13th Gen Intel® Core™ i7-13620H CPU (Central Processing Unit) with 16 GB RAM (Random Access Memory) and an NVIDIA GeForce RTX™ 4060 Laptop GPU (Graphics Processing Unit) with 8 GB of memory.
2.4. Uncertainty quantification and model ranking
After carefully selecting ML and DL algorithms for drought forecasting, we sought to quantify the uncertainty associated with their predictions rigorously. Reliable uncertainty quantification is crucial for informed decision-making in drought management, allowing stakeholders to assess forecast confidence and anticipate possible variations. To achieve this goal, we integrated the selected predictive algorithms into the conformal prediction (CP) framework initially introduced by Vovk et al. (Reference Vovk, Gammerman and Shafer2005).
However, traditional CP methods for regression tasks assume data points are exchangeable, an assumption inherently violated by time-series data due to their temporal dependencies (Barber et al., Reference Barber, Candès, Ramdas and Tibshirani2023; Xu and Xie, Reference Xu and Xie2023). Therefore, applying standard conformal methods would yield unreliable uncertainty estimates for our drought-forecasting problem.
To overcome this fundamental challenge, we adopted the EnbPI CP method proposed by Xu and Xie (Reference Xu and Xie2023), specifically tailored for time-series prediction tasks with nonexchangeable data points. This method leverages ensemble modeling and bootstrap resampling to construct robust prediction intervals without relying on the strict assumption of exchangeability.
2.4.1. Conformal prediction
CP (Vovk et al., Reference Vovk, Gammerman and Shafer2005) is a flexible statistical framework for constructing prediction intervals or sets that contain the true outcome with a guaranteed probability. Given a training dataset $ {\left\{\left({x}_i,{y}_i\right)\right\}}_{i=1}^n $ and a regression model $ \hat{f} $, CP assigns a nonconformity score to each training example, typically computed as the absolute residual:

$$ {\varepsilon}_i=\mid {y}_i-{\hat{f}}_{-i}\left({x}_i\right)\mid, $$

where $ {\hat{f}}_{-i} $ denotes the model trained with the ith data point excluded (leave-one-out). For a new input $ {x}_{n+1} $, the prediction interval is given by:

$$ {C}_{1-\alpha}\left({x}_{n+1}\right)=\left[\hat{f}\left({x}_{n+1}\right)-{q}_{1-\alpha },\hat{f}\left({x}_{n+1}\right)+{q}_{1-\alpha}\right], $$

where $ {q}_{1-\alpha } $ is the $ \left(1-\alpha \right) $ quantile of the empirical distribution of $ \left\{{\varepsilon}_i\right\} $. Under the assumption of exchangeability, this interval satisfies the marginal coverage guarantee:

$$ P\left({y}_{n+1}\in {C}_{1-\alpha}\left({x}_{n+1}\right)\right)\ge 1-\alpha . $$

However, in time-series applications, the assumption of exchangeability is typically violated, requiring more robust methods.
2.4.2. Ensemble batch prediction interval
The EnbPI method (Xu and Xie, Reference Xu and Xie2023) modifies traditional conformal inference by incorporating ensemble methods and bootstrap techniques, effectively addressing the sequential dependencies in time-series data. The algorithm operates as follows:
• Bootstrap resampling: The training dataset is bootstrapped multiple times (30 bootstrap samples in our study). Each bootstrap sample simulates a different potential outcome of the underlying data-generating process.
• Ensemble model construction: The selected ML and DL algorithms are independently trained on each bootstrap sample, creating an ensemble of diverse predictive models. This ensemble captures a broad spectrum of possible data behaviors and prediction outcomes.
• Batch-wise conformalization: Prediction intervals are computed batch-wise. Residuals (absolute errors between observed and predicted values) from previous batches are used to construct empirical quantiles, yielding robust and adaptive prediction intervals for future observations.
For this study, we implemented EnbPI using the Python library Puncc, with a significance level $ \alpha =0.1 $, corresponding to a 90% coverage level: theoretically, 90% of the prediction intervals should contain the true observed values. While EnbPI offers a practical and adaptive solution for generating prediction intervals in time-series forecasting, it is not without limitations. A key drawback is its computational cost: since the method requires training multiple models on bootstrapped datasets, it may face scalability issues when applied to very large datasets or high-dimensional input spaces.
2.4.3. Evaluation metrics
To evaluate model performance, we consider both classical regression metrics and uncertainty quantification-specific criteria. This dual evaluation allows us to assess not only the accuracy but also the trustworthiness and informativeness of predictions.
2.4.4. Regression metrics
• R² by Wright (Reference Wright1921): The R² metric measures the proportion of variance in the dependent variable that is predictable from the independent variables. It is calculated as:

(5)
$$ {R}^2=1-\frac{\sum_{t=1}^T{\left({y}_t-{\hat{y}}_t\right)}^2}{\sum_{t=1}^T{\left({y}_t-\overline{y}\right)}^2}, $$

where $ {y}_t $ are the actual values, $ {\hat{y}}_t $ are the predicted values, and $ \overline{y} $ is the mean of the actual values. $ {R}^2 $ is at most 1, with higher values indicating a better fit (it can be negative for models that perform worse than predicting the mean).
• RMSE by Adhikari and Agrawal (Reference Adhikari and Agrawal2013): RMSE is a widely used metric that measures the square root of the average squared differences between predicted and actual values. It is given by:

(6)
$$ \mathrm{RMSE}=\sqrt{\frac{1}{T}\sum \limits_{t=1}^T{\left({y}_t-{\hat{y}}_t\right)}^2}. $$

RMSE is sensitive to large errors and provides a measure of the magnitude of the prediction errors.
• Mean absolute error (MAE) by Adhikari and Agrawal (Reference Adhikari and Agrawal2013): MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It is calculated as:

(7)
$$ \mathrm{MAE}=\frac{1}{T}\sum \limits_{t=1}^T\mid {y}_t-{\hat{y}}_t\mid . $$

MAE gives an easily interpretable measure of the average prediction error in the same units as the target variable.
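Equations (5)–(7) translate directly into a few lines of NumPy (the observed and predicted values below are hypothetical):

```python
import numpy as np

y = np.array([0.5, -1.2, 0.3, 1.1, -0.4])      # hypothetical observed SPI6
y_hat = np.array([0.4, -1.0, 0.5, 0.9, -0.5])  # hypothetical predictions

ss_res = np.sum((y - y_hat) ** 2)              # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)           # total sum of squares
r2 = 1 - ss_res / ss_tot
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
mae = np.mean(np.abs(y - y_hat))
```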
Uncertainty quantification metrics.
• Coverage. The fraction of true targets that fall within the predicted intervals:

(8)
$$ \mathrm{Coverage}=\frac{1}{T}\sum \limits_{t=1}^T\mathbb{1}\left\{{y}_t\in {C}_{1-\alpha}\left({x}_t\right)\right\}, $$

where $ \mathbb{1}\left\{\cdot \right\} $ is the indicator function. A well-calibrated model should achieve coverage close to the nominal level $ 1-\alpha $.
• Prediction interval width (Width). This measures the average width of the prediction intervals. Narrower intervals are more informative:
(9)
$$ \mathrm{Width}=\frac{1}{T}\sum \limits_{t=1}^T\left({\mathrm{upper}}_t-{\mathrm{lower}}_t\right). $$
Coverage and interval width quantify the trade-off between reliability (i.e., how often intervals contain the true value) and informativeness (i.e., how narrow the intervals are) (Stankevičiūtė et al., Reference Stankevičiūtė, Alaa and van der Schaar2021; Auer et al., Reference Auer, Gauch, Klotz and Hochreiter2023; Xu and Xie, Reference Xu and Xie2023; Lee et al., Reference Lee, Xu and Xie2024). A trustworthy model should maintain nominal coverage while producing tight intervals.
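For concreteness, the five evaluation metrics above (Equations 5–9) can be sketched in plain Python; the function names are illustrative and are not taken from the study’s code:

```python
import math

def r2_score(y_true, y_pred):
    """Coefficient of determination (Eq. 5)."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root-mean-squared error (Eq. 6)."""
    return math.sqrt(sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error (Eq. 7)."""
    return sum(abs(y - yh) for y, yh in zip(y_true, y_pred)) / len(y_true)

def coverage(y_true, lower, upper):
    """Fraction of targets falling inside their prediction interval (Eq. 8)."""
    return sum(lo <= y <= up for y, lo, up in zip(y_true, lower, upper)) / len(y_true)

def mean_width(lower, upper):
    """Average prediction-interval width (Eq. 9)."""
    return sum(up - lo for lo, up in zip(lower, upper)) / len(lower)
```

Equivalent accuracy metrics are available in libraries such as scikit-learn; the explicit forms are shown only to make the definitions unambiguous.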
Computational sustainability metric.
• Carbon footprint (CO2e): We report the environmental cost of computation separately from predictive metrics. Emissions are estimated with CodeCarbon (Courty et al., Reference Courty, Schmidt, Luccioni, Goyal-Kamal, Marion Coutarel, Feld, Lecourt, LiamConnell, Saboni, Inimaz, supatomic, Léval, Blanche, Cruveiller, Ouminasara, Zhao, Joshi, Bogroff, de Lavoreille, Laskaris, Abati, Blank, Wang, Catovic, Alencon, Michał, Bauer and Araújo2024), developed by a consortium including MILA (Quebec AI Institute), Haverford College, Comet.ml, and BCG GAMMA. We use this library to monitor hardware power use (CPU/GPU/RAM) and record, for each model, the total emissions in kg CO2e.
2.4.5. Model ranking via Borda Count method
To systematically compare and rank predictive model performance, we used the Borda Count ranking system, as in many previous studies (Cherif and Idri, Reference Cherif and Idri2025; Hamdouchi and Idri, Reference Hamdouchi and Idri2025; Nakach et al., Reference Nakach, Idri and Tchokponhoue2025; Zerouaoui et al., Reference Zerouaoui, Idr and El Alaoui2025). The Borda Count is a consensus-based ranking method effective in combining multiple performance metrics. The method works as follows:
• Each performance metric ($ {R}^2 $, RMSE, MAE, Coverage, Width, and CO2 emissions) acts as a voter, ranking candidate models.
• For $ n $ candidate models, each model receives points based on its position in each metric: $ n $ points for first place, $ n-1 $ points for second, down to 1 point for last place.
• Points from all metrics are summed for each model; the model with the highest total score is considered the most effective and reliable across multiple evaluation dimensions.
This structured, transparent approach ensures comprehensive, balanced, and interpretable evaluations of model performance, explicitly accounting for accuracy, reliability, precision, and environmental considerations.
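As a minimal sketch of the Borda Count procedure described above (function name and example values are illustrative, not the study’s actual scores):

```python
def borda_count(scores, higher_is_better):
    """Rank models by Borda Count across multiple metrics.

    scores: dict mapping model name -> dict of metric name -> value.
    higher_is_better: dict mapping metric name -> bool (e.g. True for
    R2 and Coverage; False for RMSE, MAE, Width, and CO2 emissions).
    Returns (model, total points) pairs, best model first.
    """
    models = list(scores)
    n = len(models)
    totals = {m: 0 for m in models}
    for metric, higher in higher_is_better.items():
        # Each metric "votes": best model gets n points, worst gets 1.
        ranked = sorted(models, key=lambda m: scores[m][metric], reverse=higher)
        for position, model in enumerate(ranked):
            totals[model] += n - position
    # The highest total wins the consensus ranking.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

Note that ties are not handled specially in this sketch; tied models would receive points in an arbitrary order, which a production implementation might instead average.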
2.4.6. Taylor Diagram
This study uses the Taylor Diagram (Taylor, Reference Taylor2001) to provide a comprehensive statistical evaluation of the performance of the ML and DL models. Unlike single-metric evaluation, the Taylor Diagram enables simultaneous visualization of three complementary statistics (correlation, centered root-mean-squared difference, and standard deviation) on a single polar-coordinate plot, revealing the trade-offs among them for multiple models at once. Because the diagram focuses on centered pattern differences after removing overall bias, it isolates structural model differences from mean-field errors. Note, however, that the Borda Count ranking incorporates more metrics than the Taylor Diagram can display.
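The three statistics displayed on a Taylor Diagram can be computed directly; the sketch below (illustrative function name) also makes explicit the law-of-cosines identity, $ {E^{\prime}}^2={\sigma}_f^2+{\sigma}_o^2-2{\sigma}_f{\sigma}_o R $, that underlies the diagram’s geometry:

```python
import math

def taylor_stats(obs, pred):
    """Statistics summarized by a Taylor Diagram: the two standard
    deviations, the Pearson correlation R, and the centered
    root-mean-squared difference E' (bias removed)."""
    n = len(obs)
    mo = sum(obs) / n
    mp = sum(pred) / n
    sigma_o = math.sqrt(sum((o - mo) ** 2 for o in obs) / n)
    sigma_f = math.sqrt(sum((p - mp) ** 2 for p in pred) / n)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred)) / n
    r = cov / (sigma_o * sigma_f)
    # Centered RMSD: subtracting each series' mean isolates pattern error.
    crmsd = math.sqrt(sum(((p - mp) - (o - mo)) ** 2
                          for o, p in zip(obs, pred)) / n)
    return sigma_o, sigma_f, r, crmsd
```

A plotting routine then places each model at radius $ {\sigma}_f $ and angle $ \arccos (R) $, so its distance from the observation point equals $ E^{\prime } $.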
3. Results and discussion
Table 8 presents a comprehensive evaluation of various predictive models for drought forecasting in Malanville, integrating multiple performance dimensions, including predictive accuracy ($ {R}^2 $, RMSE, and MAE), uncertainty quantification (Coverage and Width), and environmental impact (CO2 emissions). The Borda Count ranking method synthesizes these metrics, providing a clear and balanced assessment framework.
Table 8. Performance of different models for Malanville

Note. The best model according to the Borda Count ranking is highlighted in bold.
The results clearly demonstrate the superior performance of the Conv1D model, which achieved the highest overall rank. With the best predictive accuracy metrics (highest $ {R}^2=97.76\% $, lowest RMSE = 0.140, and MAE = 0.097), the Conv1D model effectively captures complex temporal and nonlinear patterns characteristic of drought time series. Furthermore, the high coverage rate (92.8%) indicates that this model provides reliable uncertainty quantification, ensuring that decision-makers can confidently interpret its drought predictions. Although the Conv1D model exhibited moderate interval widths and slightly elevated carbon emissions relative to simpler statistical models, the significant gains in accuracy and reliability clearly outweigh these trade-offs.
Furthermore, the quality of the uncertainty estimates produced by Conv1D is visually confirmed in Figure 4, which presents the prediction intervals generated by the model using the EnBPI method. The vast majority of the test set observations fall within the predicted intervals, with only a few instances lying outside. This indicates that the prediction intervals are not only well-calibrated but also adaptive to temporal fluctuations in the data. The intervals effectively capture the seasonal dynamics of the SPI/6 drought index, adjusting their width according to local uncertainty. This reinforces the robustness of the Conv1D model in providing accurate point predictions and reliable uncertainty quantification.

Figure 4. Malanville’s prediction intervals and observations of the most effective algorithm (a) and the least effective algorithm (b) using the EnBPI conformal prediction method for the SPI6. The shaded blue area represents the 90% prediction interval, the black line indicates the model’s point predictions, green dots show observations within the interval, and red dots denote observations outside the interval.
Closely following, the Conv1D-LSTM hybrid model ranks second, showcasing a compelling blend of convolutional and sequential learning strengths. It achieves strong predictive performance ($ {R}^2=97.698\% $, RMSE = 0.142, and MAE = 0.098) and delivers slightly narrower prediction intervals compared to the Conv1D model, indicating higher precision. However, its marginally lower coverage (89.7%) suggests a small risk of missing some valid observations. The hybrid approach thus represents a robust alternative that effectively balances accuracy, uncertainty quantification, and moderate environmental impact.
Traditional statistical models, such as linear regression, ridge regression, and random forest, ranked among the lowest performers. Despite their lower computational demands and reduced CO2 emissions, these methods failed to achieve comparable predictive accuracy and produced wider prediction intervals. Consequently, their practical utility for precise drought forecasting in critical applications is limited.
Moreover, DL architectures, such as LSTM and GRU, displayed promising predictive accuracy but incurred substantially higher carbon footprints. This increased environmental cost poses sustainability challenges, particularly relevant in resource-limited or environmentally sensitive contexts. Therefore, although these models have the potential to deliver strong predictive capabilities, their deployment must be carefully evaluated against available computational resources and environmental considerations.
Overall, Figure 5 corroborates the results obtained with the Borda Count ranking method, particularly regarding the most and least effective models, with Conv1D showing the best performance across all three metrics. The DL architectures show little variability among themselves and align more closely with observations than the ML models. The analysis underscores the effectiveness of convolutional-based DL models—particularly Conv1D and Conv1D-LSTM—for accurate, reliable, and environmentally balanced drought forecasts in Malanville.

Figure 5. Malanville’s Taylor Diagram.
In Karimama, the evaluation results summarized in Table 9 reveal two models sharing the top Borda Count rank: Conv1D and Conv1D-LSTM.
Table 9. Performance of different models for Karimama

Note. The best model according to the Borda Count ranking is highlighted in bold.
The Conv1D model exhibited the strongest overall predictive accuracy, achieving the highest $ {R}^2 $ (97.072%), the lowest RMSE (0.153), and a notably low MAE (0.109). These results underscore the model’s capability to effectively model complex drought dynamics and provide reliable forecasts. Additionally, its high coverage rate (92.8%) indicates robust uncertainty quantification, ensuring the constructed prediction intervals reliably capture the actual observations (see Figure 6). However, the interval width (0.475) and the model’s relatively higher carbon footprint (1.17E−03) point to trade-offs between precision and computational sustainability.

Figure 6. Karimama’s prediction intervals and observations of the most effective algorithm (a) and the least effective algorithm (b) using the EnBPI conformal prediction method for the SPI6. The shaded blue area represents the 90% prediction interval, the black line indicates the model’s point predictions, green dots show observations within the interval, and red dots denote observations outside the interval.
Equally ranked first, the Conv1D-LSTM hybrid model provided similarly impressive predictive performance, slightly trailing in terms of $ {R}^2 $ (96.917%) and RMSE (0.157), but surpassing Conv1D with the lowest MAE (0.107). Importantly, this model generated narrower prediction intervals (width = 0.463), improving precision, although with a slight reduction in coverage (89.7%). A key advantage of Conv1D-LSTM is its lower environmental impact, with substantially reduced CO2 emissions (6.63E−04), making it a particularly attractive option for sustainable forecasting applications.
Traditional models, such as linear regression and ridge regression, underperformed considerably across all metrics. Their lower accuracy (~91% $ {R}^2 $), wider prediction intervals, and suboptimal coverage highlight their limitations in addressing the complexity of drought-forecasting tasks. While these models are computationally efficient, their relatively poor performance significantly constrains their practical use.
Intermediate approaches, such as XGBoost, LightGBM, and SVR, demonstrated moderate predictive accuracy and acceptable uncertainty quantification, but were consistently outperformed by the DL-based methods. Their intermediate rankings reflect a trade-off between computational simplicity and predictive robustness.
In conclusion, Figure 7 validates the results of the Borda Count ranking method: the DL models show little variability among themselves and align more closely with observations than the ML models. The Karimama case study reinforces the effectiveness of convolution-based DL models, particularly Conv1D and Conv1D-LSTM, as optimal methods for accurate, reliable, and sustainable drought forecasting. Coupled with the EnBPI uncertainty framework, these models offer significant promise for practical decision-making and drought management strategies in regions facing similar environmental challenges.

Figure 7. Karimama’s Taylor Diagram.
In Kandi, the evaluation results summarized in Table 10 provide critical insights into the performance and suitability of various predictive models for drought forecasting, based on predictive accuracy ($ {R}^2 $, RMSE, and MAE), uncertainty quantification (Coverage and Width), and environmental impact (CO2 emissions). Using the Borda Count ranking method, the Conv1D-LSTM hybrid model emerged as the highest-ranked algorithm, primarily due to superior regression performance metrics (highest $ {R}^2 $ = 97.417%, lowest RMSE = 0.173, and lowest MAE = 0.119). This highlights the strong capability of hybrid DL architectures in capturing complex drought patterns.
Table 10. Performance of different models for Kandi

Note. The best model according to the Borda Count ranking is highlighted in bold.
However, despite its high predictive accuracy, the Conv1D-LSTM model did not meet the targeted 90% coverage, achieving only 77.3%. This deficiency indicates a potential overconfidence in its predictions, as further suggested by its notably narrow prediction intervals (width = 0.391). Such narrow intervals, although indicative of high precision, may not accurately reflect the true predictive uncertainty and hence limit the model’s reliability for practical decision-making. This limitation is visually confirmed in Figure 8, which displays the prediction intervals generated by the Conv1D-LSTM model using the EnBPI method. Figure 8 reveals a significant number of test set observations lying outside the prediction intervals, confirming the model’s inability to encapsulate true values. Although the intervals appear adaptively narrow, they fail to provide adequate coverage, further underscoring the model’s overconfidence in this specific locality.

Figure 8. Kandi’s prediction intervals and observations of the most effective algorithm (a) and the least effective algorithm (b) using the EnBPI conformal prediction method for the SPI6. The shaded blue area represents the 90% prediction interval, the black line indicates the model’s point predictions, green dots show observations within the interval, and red dots denote observations outside the interval.
Conversely, traditional regression methods such as linear regression and ridge regression demonstrated relatively balanced performance, achieving coverage rates very close to the ideal 90% (~89.6%). Despite having moderately lower predictive accuracy metrics compared to DL models, these statistical methods exhibit desirable reliability in their uncertainty quantification. Their wider, more adaptive prediction intervals provide a realistic representation of prediction uncertainty, enhancing the confidence stakeholders can place in their forecasts.
Moreover, these traditional models offer significant computational advantages, reflected by substantially lower carbon footprints (~3.55E−06 and 9.89E−06 for linear regression and ridge regression, respectively), emphasizing their suitability for sustainable and efficient deployment in operational settings.
In summary, although the Conv1D-LSTM model ranked first on overall performance metrics, traditional regression approaches (linear and ridge regression) might represent more suitable alternatives for practical deployment: their reliable uncertainty quantification, near-ideal coverage rates, computational efficiency, and environmental sustainability collectively provide stronger assurance for informed and confident decision-making in drought management applications. Additionally, Figure 9 validates the results of the Borda Count ranking method. The Conv1D-LSTM model aligns more closely with the observations than the other models, as it performs better across all three metrics with only marginal variability relative to the other DL models.

Figure 9. Kandi’s Taylor Diagram.
In Gogounou, the evaluation results presented in Table 11 provide an in-depth comparison of model performance based on predictive accuracy ($ {R}^2 $, RMSE, and MAE), uncertainty quantification (Coverage and Width), and environmental impact (CO2 emissions). According to the Borda Count ranking, the Conv1D-LSTM model secured the top position overall.
Table 11. Performance of different models for Gogounou

Note. The best model according to the Borda Count ranking is highlighted in bold.
The Conv1D-LSTM model achieved the highest predictive accuracy among all models ($ {R}^2=97.842\% $), with the lowest RMSE (0.162) and MAE (0.109), demonstrating its superior capability in capturing complex patterns in drought time series. Despite these notable accuracy metrics, however, its uncertainty quantification was less satisfactory. The model recorded a coverage of only 79.4%, substantially below the 90% threshold typically expected from reliable prediction intervals. While it maintained a relatively narrow interval width (0.399), this may indicate overconfident predictions and a lack of adaptability in the uncertainty estimates—raising concerns about the robustness of its predictive intervals in operational settings (see Figure 10).

Figure 10. Gogounou’s prediction intervals and observations of the most effective algorithm (a) and the least effective algorithm (b) using the EnBPI conformal prediction method for the SPI6. The shaded blue area represents the 90% prediction interval, the black line indicates the model’s point predictions, green dots show observations within the interval, and red dots denote observations outside the interval.
By contrast, traditional models such as linear regression and ridge regression ranked second and third, respectively, and showed a more balanced trade-off between predictive performance and uncertainty reliability. Despite slightly lower $ {R}^2 $ values (97.605 and 97.598%) and higher MAEs (0.133 and 0.134), both models offered good coverage (87.5 and 89.6%) and acceptable width (0.540), more faithfully capturing the uncertainty around their forecasts. Additionally, their carbon emissions were minimal (on the order of 10−6), making them significantly more computationally efficient and environmentally sustainable than DL-based alternatives.
Other DL models, such as LSTM, GRU, and Conv1D, exhibited strong predictive accuracy, with $ {R}^2 $ scores all above 97.3%, but they also suffered from insufficient coverage (ranging from 76.3 to 81.4%), reinforcing the concern that these architectures may generate overly narrow, overconfident prediction intervals. In particular, GRU and LSTM demonstrated relatively high emissions and low coverage, making them less suitable for operational deployment despite their precision.
To conclude, Figure 11 confirms the results of the Borda Count ranking method, with the Conv1D-LSTM model aligning more closely with observations than the rest of the models. Although the Conv1D-LSTM model led in overall predictive performance and Borda ranking in Gogounou, its suboptimal coverage and overconfident uncertainty estimates call its trustworthiness into question in risk-sensitive applications. Alternatively, traditional models like linear regression and ridge regression offer a better balance of accuracy, reliable uncertainty quantification, and minimal environmental impact, making them more pragmatic choices for reliable and sustainable drought forecasting in Gogounou.

Figure 11. Gogounou’s Taylor Diagram.
In Banikoara, the results presented in Table 12 reveal the outstanding performance of DL models, particularly the hybrid Conv1D-LSTM, which ranked first according to the Borda Count method. This model achieved the highest $ {R}^2=98.285\% $, the lowest RMSE (0.130), and the lowest MAE (0.084), marking it as the most accurate model for drought forecasting in this locality. In addition to its strong accuracy, it maintained a coverage level of 93.8%, well above the 90% threshold, indicating reliable uncertainty estimation. Its prediction interval width (0.435) was also among the narrowest, suggesting both high precision and trustworthy predictive intervals (see Figure 12).
Table 12. Performance of different models for Banikoara


Figure 12. Banikoara’s prediction intervals and observations of the most effective algorithm (a) and the least effective algorithm (b) using the EnBPI conformal prediction method for the SPI6. The shaded blue area represents the 90% prediction interval, the black line indicates the model’s point predictions, green dots show observations within the interval, and red dots denote observations outside the interval.
This result is particularly noteworthy because it contrasts with observations in other regions where DL models tended to underperform in coverage. Here, the Conv1D-LSTM model demonstrated both high predictive power and reliable uncertainty quantification, making it a strong candidate for real-world deployment, despite its higher computational cost (CO2 emissions of 1.47E−03). Its ability to maintain a balanced trade-off between accuracy and reliability sets it apart as a robust and practical solution for forecasting in Banikoara.
Other DL models, such as Conv1D, LSTM, and GRU, also performed competitively. Conv1D, for instance, achieved a very high $ {R}^2=98.164\% $ and excellent uncertainty coverage (94.8%), reinforcing the strength of convolutional architectures in handling temporal data. However, it had a slightly wider interval (0.432) and marginally higher emissions than GRU and LSTM, which delivered similar levels of accuracy and competitive interval widths (both below 0.37), although with slightly lower coverage.
Traditional models like SVR, ridge, and linear regressions provided satisfactory results in terms of coverage (all around or above 90%) but trailed behind in predictive accuracy. Notably, SVR combined solid performance ($ {R}^2=98.235\% $) with good coverage (91.7%) and moderate width (0.451), making it a reasonable choice when considering computational efficiency (emissions of 1.11E−05).
In summary, Figure 13 corroborates the findings of the Borda Count ranking method. The Conv1D-LSTM model aligns more closely with the observations than the other models, and the DL models show little variability. Banikoara is the region in which the Conv1D-LSTM model achieves near-ideal performance across all evaluation criteria—predictive accuracy, interval reliability, precision, and consistency—thus validating its suitability for deployment in data-driven drought forecasting. While simpler models such as SVR offer good trade-offs for environments with fewer resources, DL models demonstrated their full potential in this context.

Figure 13. Banikoara’s Taylor Diagram.
In Segbana, Table 13 reports the performance of multiple predictive models across regression metrics ($ {R}^2 $, RMSE, and MAE), uncertainty quantification (Coverage and Width), and environmental impact (CO2 emissions). The Borda Count ranking identifies both Conv1D and Conv1D-LSTM as top-performing models, sharing the first position.
Table 13. Performance of different models for Segbana

Note. The best model according to the Borda Count ranking is highlighted in bold.
The Conv1D-LSTM model achieved the highest $ {R}^2=96.613\% $, the lowest RMSE (0.209), and a competitive MAE (0.145), confirming its superior predictive capability in this setting. However, its uncertainty quantification was less reliable, with a coverage rate of only 72.2%, significantly below the standard 90% expectation. Its relatively narrow intervals (width = 0.409) suggest the model is potentially overconfident, failing to consistently capture true outcomes within its predictive bounds (see Figure 14). Moreover, its CO2 emissions (1.42E−03) indicate a high computational cost, which may be limiting in real-world, resource-constrained applications.

Figure 14. Segbana’s prediction intervals and observations of the most effective algorithm (a) and the least effective algorithm (b) using the EnBPI conformal prediction method for the SPI6. The shaded blue area represents the 90% prediction interval, the black line indicates the model’s point predictions, green dots show observations within the interval, and red dots denote observations outside the interval.
Similarly, the Conv1D model performed very well in terms of predictive accuracy ($ {R}^2=96.466\% $, RMSE = 0.214, and MAE = 0.148) and had the narrowest prediction intervals (width = 0.369). However, its coverage rate (73.2%) was also unsatisfactory, reaffirming the issue of overconfidence. Although this model has a slightly lower carbon footprint (1.11E−03) than Conv1D-LSTM, it still ranks among the more computationally expensive options.
On the other hand, traditional models such as ridge regression and linear regression demonstrated more balanced performance. While their $ {R}^2 $ values were lower (95.7 and 95.5%, respectively), they produced higher coverage rates (82.3 and 80.2%, respectively), more consistent with trustworthy uncertainty quantification, although still below the desired 90% level. Their prediction intervals were wider but better reflected the true variation in the data. Additionally, these models remain computationally efficient, with significantly lower emissions (e.g., 4.41E−06 for linear regression).
Models such as XGBoost and LightGBM delivered reasonable regression metrics but also struggled with low coverage, while GRU, LSTM, and SVR exhibited both low uncertainty reliability and higher emissions—making them less practical for deployment.
In conclusion, while Conv1D and Conv1D-LSTM are ranked first in terms of overall performance, their low coverage values raise concerns about their robustness for uncertainty-aware forecasting. In contrast, simpler regression models like ridge or linear regression—although slightly less accurate—may offer more dependable and computationally efficient solutions for deployment in Segbana, especially when reliable uncertainty quantification is a priority. Figure 15 confirms the results of the Borda Count ranking method: the Conv1D and Conv1D-LSTM models align more closely with observations than the other models.

Figure 15. Segbana’s Taylor Diagram.
4. Discussion and limitations
The prediction of SPI6 across several regions achieved impressive $ {R}^2 $ values of 98.29, 97.84, 97.76, 97.42, 96.61, and 97.07% in Banikoara, Gogounou, Malanville, Kandi, Segbana, and Karimama, respectively. A similar study conducted in these regions by Vodounon et al. (Reference Vodounon, Soude and Mamadou2022), using the same data, reported a significantly lower $ {R}^2 $ of 80.17% for SPI6, highlighting the superior performance of our models. This discrepancy may be due to the spatio-temporal aspects of the data not being fully accounted for in their study, potentially leading to less reliable results (Botache et al., Reference Botache, Dingel, Huhnstock, Ehresmann and Sick2023). When assessed across all locations, the Conv1D-LSTM DL model emerged as the best overall performer. Its combination of convolutional layers and recurrent networks enables the capture of both local patterns and temporal dependencies, which is critical in weather forecasting, where complex dynamics are involved (Shi et al., Reference Shi, Chen, Wang, Yeung, W-K and Woo2015).
However, from a sustainability perspective, traditional ML models may be preferable in some cases. They often achieve performance comparable to DL models while generating far lower $ {CO}_2 $ emissions. For example, in Kandi, Conv1D-LSTM outperformed XGBoost by only 0.7% in $ {R}^2 $ but produced 126 times more $ {CO}_2 $ emissions. This raises the question of whether marginal accuracy gains justify significantly higher environmental costs, especially in the context of climate change mitigation. Beyond accuracy, this study emphasizes the importance of quantifying predictive uncertainty in agricultural applications. High accuracy does not imply high certainty; decision-makers should be informed of model uncertainty to fully understand the implications of using the forecasts.
Several limitations should be noted. First, the analysis does not cover all agricultural regions of the country, leaving some areas without information. Second, the study focuses on retrospective predictions, whereas stakeholders require forward-looking drought forecasts to enable proactive measures. Third, the analysis is limited to short-term drought assessment using the 6-month SPI, without addressing longer-term drought conditions. Finally, the use of gridded datasets introduces inherent uncertainties, which may increase the overall error of the predictive models.
5. Conclusion
Droughts pose significant challenges to food security, particularly when they occur in major agricultural regions. This study focused on quantifying the uncertainty of drought predictions using ML and DL techniques across the six localities in the Alibori department of Benin. The results demonstrated high predictive accuracy for SPI6, with $ {R}^2 $ values of 98.29, 97.84, 97.76, 97.42, 96.61, and 97.07% in Banikoara, Gogounou, Malanville, Kandi, Segbana, and Karimama, respectively. The Borda Count rankings highlighted the Conv1D-LSTM model as the top performer in two localities and second in another two, suggesting its strong suitability for spatiotemporal drought prediction. This analysis underscores the importance of considering model uncertainty when selecting the best predictive approach, as it allows for a balance between accuracy and robustness. Given the coarse resolution of the data used in this study (~50 km), future research should consider utilizing higher-resolution datasets to better capture the spatial variability within each locality, which may improve model performance and reliability.
Open peer review
To view the open peer review materials for this article, please visit http://doi.org/10.1017/eds.2025.10025.
Author contribution
Conceptualization: B.L., A.D.T., R.V., K.B.D.S., and A.I. Data curation: A.I., B.L., R.V., K.B.D.S., and A.D.T. Formal analysis: B.L., A.D.T., and A.I. Methodology: B.L., A.D.T., R.V., K.B.D.S., and A.I. Software: B.L., A.D.T., and A.I. Writing—original draft: A.D.T., K.B.D.S., B.L., and A.I. Writing—review and editing: K.B.D.S., B.L., A.I., and A.D.T. All authors approved the final submitted draft.
Competing interests
The authors declare none.
Data availability statement
The code for this study is archived on Zenodo at https://zenodo.org/records/16815747.
Funding statement
The authors did not receive any support from any organization for the submitted work.
Comments
Bernardin LIGAN
University Mohammed VI Polytechnic (UM6P)
Ben Guerir, Morocco, 43150
Bernardin.LIGAN@um6p.ma
September 11, 2024.
Environmental Data Science (EDS)
Dear Editors,
I am pleased to submit our manuscript titled Prediction and Uncertainty Quantification of Drought in North Benin for consideration for publication in the EDS Journal. Our study addresses a critical issue in climate science and agricultural planning by developing a novel approach for drought forecasting that incorporates uncertainty quantification in North Benin, a region where agriculture is the primary livelihood for the majority of the population.
In our research, we focused on six localities within the Alibori department of North Benin: Banikoara, Gogounou, Kandi, Karimama, Malanville, and Segbana. Each of these areas faces unique challenges due to frequent droughts, which have severe socio-economic impacts. To enhance the reliability of drought predictions, we employed an Ensemble Batch Prediction Interval (EnbPI) method, a conformal prediction approach tailored to time series data, to effectively quantify uncertainties in our forecasts.
Our study leveraged a comprehensive range of machine learning (ML) and deep learning (DL) models—linear regression, ridge regression, random forest, XGBoost, LightGBM, SVR, Conv1D, LSTM, GRU, and Conv1D-LSTM—to predict drought scenarios and assess their performance through a Borda count methodology. This comparative analysis was based on metrics such as R², RMSE, MAE, carbon footprint, and the coverage and width of prediction intervals. Our results demonstrated that the Conv1D-LSTM model provided the best balance between accuracy and uncertainty coverage, achieving a high R² value across all six localities.
We believe this research makes a significant contribution to the field by demonstrating the feasibility and importance of uncertainty quantification in drought forecasting, particularly in low-income countries like Benin. Our findings can inform decision-makers, agricultural researchers, and local communities to adopt more resilient strategies in response to drought scenarios.
Furthermore, in line with our commitment to transparency and reproducibility, we have made our full code and dataset publicly available on GitHub, enabling other researchers to build upon our work and further enhance the prediction of droughts in similar settings.
We believe that our manuscript is well-suited for publication in EDS, given its relevance to the journal’s readership and its potential to contribute valuable insights into the intersection of climate science, machine learning, and agricultural sustainability.
We appreciate your consideration of our manuscript and look forward to your feedback.
Thank you for your time and consideration.
Sincerely,
Bernardin LIGAN