
Prediction and uncertainty quantification of drought in North Benin

Published online by Cambridge University Press:  24 November 2025

Bernardin Marie Augustin Sèdjro Ligan*
Affiliation:
Université Mohammed VI Polytechnique (UM6P), Ben Guerir, Marrakesh-Safi, Morocco
Gbègninougbo Aurel Davy Tchokponhoue
Affiliation:
Université Mohammed VI Polytechnique (UM6P), Ben Guerir, Marrakesh-Safi, Morocco
Alain Thierry Iliho Manzi
Affiliation:
Université Mohammed VI Polytechnique (UM6P), Ben Guerir, Marrakesh-Safi, Morocco
Kora B. Désiré Simperegui
Affiliation:
International Fertilizer Development Center (IFDC), Accra, Greater Accra, Ghana
Rodrigue Bignon Wilfried Vodounon
Affiliation:
Université d’Abomey-Calavi, Institut de Mathématiques et des Sciences Physiques (IMSP), Dangbo, Ouémé, Bénin
Corresponding author: Bernardin Marie Augustin Sèdjro Ligan; Email: bernardin.ligan@um6p.ma

Abstract

Drought forecasting is a critical tool for mitigating the severe impacts of water scarcity, particularly in regions like North Benin, where agriculture is a cornerstone of livelihoods. Although accurate drought prediction is vital for resource management, the inability to quantify forecast uncertainties remains a significant obstacle to more informed and trustworthy decision-making. This study therefore aims to develop an uncertainty-aware prediction model for drought forecasting in six key localities within the Alibori department—Banikoara, Gogounou, Kandi, Karimama, Malanville, and Segbana—each facing unique challenges due to drought. To achieve this, we conducted a comprehensive experiment involving six machine learning models (linear regression, ridge regression, random forest, XGBoost, LightGBM, and SVM) and four deep learning models (Conv1D, LSTM, GRU, and Conv1D-LSTM) using the Standardized Precipitation Index at a 6-month scale. To address the uncertainty quantification challenge, we employed the Ensemble Batch Prediction Interval, a conformal prediction method specifically designed for time series data. Our comparative analysis, framed within the Borda count methodology, utilized performance metrics such as $ {R}^2 $, RMSE, MSE, and carbon footprint, as well as uncertainty quantification metrics, including empirical coverage and the width of prediction intervals. The top-performing models achieved $ {R}^2 $ scores of 98.29, 97.84, 97.76, 97.42, 96.61, and 97.07%, and prediction interval coverages of 0.94, 0.79, 0.93, 0.77, 0.73, and 0.93, respectively, for Banikoara, Gogounou, Malanville, Kandi, Segbana, and Karimama. The Conv1D-LSTM model stood out as the most effective, offering an optimal balance between predictive accuracy and uncertainty coverage.

Information

Type
Application Paper
Creative Commons
CC BY
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
Copyright
© The Author(s), 2025. Published by Cambridge University Press

Impact Statement

This study, one of the first of its kind in a low-income country like Benin, underscores the importance of uncertainty quantification in drought forecasting within these specific localities. It can help decision-makers, agricultural researchers, and local populations take early precautions against drought occurrence based on possible scenarios.

1. Introduction

Over recent decades, human activities such as industrialization, deforestation driven by agricultural expansion, and rapid urbanization have intensified the pace of climate change (Mir et al., Reference Mir, Mir, Ganie, Bhat, Shah, Mushtaq, Irshad, Jatav, Raiput and Minkina2025). These anthropogenic pressures have resulted in shifting rainfall patterns, rising temperatures, and prolonged droughts—phenomena that significantly threaten agricultural productivity. Among the various climate-related hazards, drought stands out as one of the most pervasive and damaging in sub-Saharan Africa (SSA) (Aryal et al., Reference Aryal, Rahut and Marenya2023). Between 1900 and 2013, Africa recorded 291 drought events, affecting more than 362 million people and causing 847,000 deaths, with an estimated $3 billion in economic losses (Masih et al., Reference Masih, Maskey, Mussá and Trambauer2014). Droughts can be categorized into four main types: meteorological, hydrological, agricultural, and socioeconomic. They are typically characterized using indices such as the Standardized Precipitation Index (SPI), Crop Moisture Index, Palmer Drought Severity Index, and the Standardized Precipitation Evapotranspiration Index (Yihdego et al., Reference Yihdego, Vaheddoost and Al-Weshah2019).

In SSA, where agriculture is predominantly rainfed, droughts have devastating effects on crop yields, threatening food availability and nutritional security for the region’s 1.15 billion people (Pison et al., Reference Pison, Couppié and Caporali2022; United Nations Department of Economic and Social Affairs Population Division, 2022). This has led to alarming malnutrition rates of 20.4% (FAO et al., 2021). Staple crops, such as maize (Zea mays L.), often fall short of their yield potential, further exacerbating food insecurity (Scheiterle and Birner, Reference Scheiterle and Birner2018). This productivity gap poses a substantial challenge for over 60% of the SSA population, who are directly or indirectly involved in agriculture and heavily reliant on the income generated from their agricultural activities (Oluwatayo and Ojo, Reference Oluwatayo and Ojo2016). In Benin, agriculture plays a central economic and social role, employing ~70% of the active labor force and contributing 33% to the national Gross Domestic Product. The northern region—especially the “Pôle de Développement Agricole 2 Borgou-Alibori”—is critical to national food security, accounting for 36% of Benin’s agricultural output (MAEP, 2017). However, this region is particularly vulnerable to climate change, with drought expected to reduce agricultural production by 5–20% by 2025 (Paeth et al., Reference Paeth, Capo-Chichi and Endlicher2008). This underscores the urgent need for improved drought monitoring and adaptation strategies in rainfed agricultural systems. Machine learning (ML) has emerged as a powerful tool for modeling complex environmental processes, including climate variability. 
It has been successfully applied in diverse contexts, such as flood forecasting (Kazadi et al., Reference Kazadi, Doss-Gollin, Sebastian and Silva2024), precipitation pattern analysis (Mooers et al., Reference Mooers, Beucler, Pritchard and Mandt2024), and rainfall prediction (Aïzansi et al., Reference Aïzansi, Ogunjobi and Ogou2024).

In Benin, recent studies have applied ML and statistical approaches to assess drought risks. For instance, Vodounon et al. (Reference Vodounon, Soude and Mamadou2022), Halissou et al. (Reference Halissou, Eric, Eliézer, Ezéchiel, Bio and Abel2022), Oyerinde and Olowookere (Reference Oyerinde and Olowookere2018), Emmanuel et al. (Reference Emmanuel, Hounguè, Biaou and Badou2019), and Hounkpè et al. (Reference Hounkpè, Diekkrüger, Badou and Afouda2016) analyzed hydrometeorological trends in the Ouémé watershed and the Beninese Niger River basin using indices such as SPI and the Standardized Streamflow Drought Index. These studies revealed substantial shifts in rainfall cycles and documented moderate to severe drought episodes between 1970 and 2016. Halissou et al. (Reference Halissou, Eric, Eliézer, Ezéchiel, Bio and Abel2022) further identified Kompongou (Mékrou), Couberi (Sota), Yakin (Alibori), and Gbasse (Sota) as high-risk sites for hydrological droughts. Projections suggest these patterns will persist and intensify through 2050. Vodounon et al. (Reference Vodounon, Soude and Mamadou2022) demonstrated that ML models could effectively predict SPI at a 12-month scale, achieving a coefficient of determination ($ {R}^2 $) of 0.99 and a root mean square error (RMSE) of 0.07. Despite these encouraging results, most existing studies, including those focused on the agriculturally critical northern region, lack quantification of the uncertainties associated with their forecasts.
This omission reduces the practical value of ML models for decision-makers, who need to understand how confident the models are in their predictions to formulate adapted policies (Jalaian et al., Reference Jalaian, Lee and Russell2019; Poggio et al., Reference Poggio, De Sousa, Batjes, Heuvelink, Kempen, Ribeiro and Rossiter2021; Gruber et al., Reference Gruber, Schenk, Schierholz, Kreuter and Kauermann2023; Lilburne et al., Reference Lilburne, Helfenstein, Heuvelink and Eger2024; Simperegui et al., Reference Simperegui, Kouame, Kwesie, Bindraban, Adzawla, Asamoah and El Gharous2025).

This study aims to develop a robust framework for predicting meteorological drought—specifically the 6-month SPI-6—using ML and deep learning (DL) techniques, while also quantifying the associated prediction uncertainties in the Alibori department of northern Benin. Using gridded meteorological datasets from six towns in the region, we intend to: (i) build multiple ML and DL models to predict SPI-6 values, (ii) quantify the uncertainty associated with each model’s predictions, and (iii) compare the performance of ML and DL approaches in terms of accuracy and reliability. The outcomes of this research will support climate risk management by providing stakeholders—particularly farmers and policy-makers—with actionable insights to enhance drought preparedness and resilience.

2. Materials and methods

In this section, we present the methodology adopted in this study, as illustrated in Figure 1. The process begins with a multivariate time series dataset and involves several preprocessing steps, including feature selection, creation of lagged variables, and normalization. The dataset is then split into training and testing subsets. Multiple ML and DL models are trained, and their predictions are further processed using the Ensemble Batch Prediction Interval (EnbPI) method for uncertainty quantification. Finally, the performance of the models is evaluated using various statistical metrics.

Figure 1. Flowchart of the proposed methodology: From multivariate time series preprocessing to model training, uncertainty quantification with EnbPI, and final performance evaluation using various statistical metrics.

2.1. Study area

Geographically situated at 11°19′ north latitude and 2°55′ east longitude, Alibori is a department in the northeast of the Republic of Benin. It is bordered to the north by the Republic of Niger, to the northwest by the Republic of Burkina Faso, to the east by the Federal Republic of Nigeria, to the west by the department of Atacora, and to the south by the department of Borgou. Covering an area of $ 26,242\;{km}^2 $, Alibori is subdivided into six municipalities: Malanville, Karimama, Segbana, Gogounou, Banikoara, and Kandi (Figure 2). For each municipality, one location was selected based on cropland cover as indicated by the MODIS (Moderate Resolution Imaging Spectroradiometer) land cover map of Benin.

Figure 2. Alibori department cartography and geographic distribution of the locations. The cropland cover is derived from the MODIS land cover.

2.2. Data collection

To model the SPI and conduct our analysis, we obtained monthly weather data spanning a range of climate variables. These included temperature, total precipitation, wind speed and direction, surface pressure, relative humidity, cloud cover, soil moisture, top-of-atmosphere shortwave downward irradiance, and all-sky surface shortwave downward irradiance (Table 1). Although the national meteorological agency provides some of these datasets, their spatial coverage is limited and does not extend to many of the locations included in our study. Moreover, the available records do not span the 30-year baseline period recommended by the World Meteorological Organization (WMO) for climate assessments, especially those concerning drought characterization (WMO, 2017). To address these limitations and ensure spatially consistent and temporally comprehensive data, we relied on gridded satellite-based datasets. Two main sources were considered: NASA Prediction of Worldwide Energy Resources (POWER) dataset, with a spatial resolution of 0.5° latitude by 0.625° longitude ( $ \sim 50 $ km), and the ERA5 reanalysis dataset, offering finer resolution at 11 km. Given that our analysis is conducted at the municipality level, we visualized both 50- and 11-km grids using QGIS (Quantum Geographic Information System) to assess their spatial alignment with administrative boundaries (Figure 2). While ERA5 provided higher spatial resolution, it resulted in multiple grid cells per municipality, which would have required aggregation and potentially introduced uncertainty. In contrast, the coarser NASA POWER grid is more closely aligned with municipality extents, thereby minimizing the need for spatial averaging. Consequently, we selected NASA POWER as the data source for this study. 
For each municipality, a single representative location was chosen based on the spatial distribution of cropland, ensuring that the extracted coordinates accurately reflect the areas of dominant agricultural activity (Figure 2). The dataset spans 41 years (1981–2021) and was extracted for the six municipalities under investigation (Malanville, Karimama, Segbana, Gogounou, Banikoara, and Kandi).

Table 1. Description of weather variables used in the study

2.2.1. Standardized Precipitation Index

The SPI is a functional and quantitative definition of drought. It can be calculated at different timescales (3, 6, 12, 24, and 48 months) using monthly rainfall data (Lawin et al., Reference Lawin, Houngue, Biaou and Badou2019; McKee et al., Reference McKee, Doesken and Kleist1993). The SPI is defined as:

(1) $$ {SPI}_i=\frac{P_i-\overline{P}}{\sigma } $$

where $ {SPI}_i $ is the SPI of the month i, $ {P}_i $ is the rainfall of the month i, and $ \overline{P} $ and $ \sigma $ are the mean and standard deviation of rainfall on the timescale considered. In the present study, the SPI6 considers a 6-month timescale and is used to assess drought occurrence on a semester basis. Table 2 provides a guideline for interpreting the SPI value.
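As a minimal sketch of Equation (1), the standardized-anomaly form of SPI6 can be computed from a monthly rainfall series as follows. (Note that the operational SPI of McKee et al. first fits a gamma distribution to the rainfall totals; the code below implements only the simplified z-score form given in the text, on synthetic data.)

```python
import numpy as np

def spi(monthly_rain, scale=6):
    """Standardize the rolling `scale`-month rainfall total by its
    mean and standard deviation (simplified z-score form of Eq. 1)."""
    rain = np.asarray(monthly_rain, dtype=float)
    # Rolling sum over the chosen timescale; the first scale-1 months
    # have no complete window, matching the missing leading SPI6 values.
    totals = np.convolve(rain, np.ones(scale), mode="valid")
    return (totals - totals.mean()) / totals.std()

# Synthetic example: 24 months of gamma-distributed rainfall
rng = np.random.default_rng(0)
rain = rng.gamma(2.0, 50.0, size=24)
spi6 = spi(rain, scale=6)   # 24 - 6 + 1 = 19 standardized values
```

This also illustrates why, at the 6-month scale, the first five observations of SPI6 are undefined and must be dropped, as described in Section 2.2.2.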

Table 2. SPI interpretation (Lloyd-Hughes and Saunders, Reference Lloyd-Hughes and Saunders2002)

2.2.2. Data preprocessing

Before modeling the SPI6, the data underwent a quality control process. Among the variables, CLOUD_AMT, TOA_SW_DWN, and ALLSKY_SFC_SW_DWN had 36 missing values ( $ \sim 8\% $ ). Additionally, due to the 6-month timescale required to calculate SPI6, the first five observations of this variable were missing. To prevent these gaps from impacting the model performance, we excluded the CLOUD_AMT, TOA_SW_DWN, and ALLSKY_SFC_SW_DWN variables and removed the first five observations from the dataset, resulting in 487 observations across the datasets.

Moreover, using the Pearson correlation for feature selection, variables with an absolute correlation coefficient below 0.1 were removed, leading to datasets of seven variables (Malanville, Karimama, and Banikoara) and eight variables (Segbana, Kandi, and Gogounou). Furthermore, time series data exhibit relationships over historical periods (Stübinger and Adler, Reference Stübinger and Adler2020). To capture these patterns, we generated lagged versions of each variable using a window size of 5 (considering the use of the SPI6 of the last 5 months to predict the sixth), creating five new lagged variables (t−5, t−4, t−3, t−2, and t−1) for each original covariate, leading to a total of 42 (for 7 variables) and 48 (for 8 variables) variables when including current time (t) observations.

Subsequently, using the Pearson correlation matrix, we performed feature selection by retaining the lag (t−5, t−4, t−3, t−2, t−1, or t) with the highest absolute Pearson correlation value with SPI6 (Table 3 and Figure 3). Finally, to normalize the scale of all covariates and eliminate the influence of variables with larger magnitudes, we applied a min-max scaler, bringing all covariates into the same range $ \left[-1,1\right] $ .
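The lag-creation and lag-selection steps described above can be sketched with pandas as follows. The helper function, its name, and the synthetic data are illustrative assumptions; only the window size of 5, the 0.1 correlation threshold, and the keep-one-lag-per-covariate rule come from the text.

```python
import numpy as np
import pandas as pd

def best_lags(df, target="SPI6", window=5, threshold=0.1):
    """For each covariate, build lags t, t-1, ..., t-window and keep only
    the lag with the highest absolute Pearson correlation with the target,
    dropping covariates whose best |r| is below the threshold."""
    kept = {}
    for col in df.columns:
        if col == target:
            continue
        lagged = pd.concat(
            {f"{col}_lag{k}": df[col].shift(k) for k in range(window + 1)},
            axis=1,
        )
        corr = lagged.corrwith(df[target]).abs().dropna()
        if not corr.empty and corr.max() > threshold:
            kept[corr.idxmax()] = lagged[corr.idxmax()]
    out = pd.DataFrame(kept)
    out[target] = df[target]
    return out.dropna()

# Toy example: x leads the target by 3 months, so lag t-3 is perfectly
# correlated with SPI6 and should be the lag retained.
rng = np.random.default_rng(1)
y = pd.Series(rng.normal(size=60), name="SPI6")
x = y.shift(-3)
selected = best_lags(pd.DataFrame({"x": x, "SPI6": y}))
```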

Table 3. Lag retained per covariate and locality

Note. Blank cells indicate that none of the lags for that covariate had a Pearson correlation coefficient >0.1, leading to their removal.

Figure 3. Barplots of the correlations between SPI6 and lagged weather variables.

2.3. Modeling

We employed a range of models to forecast time series data, encompassing both traditional ML methods and DL architectures.

2.3.1. Presentation of models

  • Linear regression

    Linear regression by Huang (Reference Huang2020) is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship of the form:

    $$ y={\beta}_0+{\beta}_1{x}_1+{\beta}_2{x}_2+\dots +{\beta}_n{x}_n+\varepsilon $$
    where $ y $ represents the predicted value of the target variable, $ {\beta}_0 $ is the intercept term (the value of $ y $ when all predictors are zero), $ {\beta}_i $ are the coefficients associated with each independent variable $ {x}_i $ (for $ i=1,\dots, n $ ), which quantify the effect of each predictor on the target, and $ \varepsilon $ is the error term that captures the unexplained variability or noise in the data. The model estimates coefficients that minimize the difference between predicted and actual values. It is widely used as a baseline due to its simplicity, interpretability, and low computational cost. However, in the context of time-series forecasting, its performance can be limited. It does not account for temporal dependencies or autocorrelation between observations, and it cannot model seasonality or long-term trends unless these are explicitly encoded as features. Consequently, its relevance is constrained when the data exhibit complex time-based dynamics.
  • Support vector regression (SVR)

    SVR by Basak et al. (Reference Basak, Pal, Ch and Patranabis2007) adapts the principles of support vector machines to regression problems. It aims to find a function of the form:

    $$ f(x)=\left\langle w,x\right\rangle +b $$
    that approximates the data within a specified tolerance level $ \varepsilon $ , while keeping the model as flat as possible. In this expression, $ \left\langle w,x\right\rangle $ denotes the dot product between the weight vector $ w $ and the input vector $ x $ , which defines the orientation of the regression hyperplane. The scalar $ b $ is the bias term that shifts the hyperplane vertically. The parameter $ \varepsilon $ defines an $ \varepsilon $ -insensitive margin around the predicted values, within which no penalty is assigned for prediction errors. The objective of SVR is to find the flattest possible function (i.e., minimizing $ \parallel w{\parallel}^2 $ ) that fits as many data points as possible within this margin. SVR is known for its robustness to outliers and its ability to model nonlinear relationships through kernel functions. Despite these strengths, SVR has several drawbacks in time-series applications. It is computationally expensive for large datasets and sensitive to the choice of kernel and hyperparameters. Moreover, SVR does not inherently capture sequential dependencies in data, making it less effective for long-range forecasting unless temporal structures are carefully embedded into the feature space.
  • Random forest

    Random forest by Breiman (Reference Breiman2001) is an ensemble learning method that constructs multiple decision trees using bootstrapped samples of the data and aggregates their outputs for prediction. For regression tasks, the final output is typically the average of all tree predictions:

    $$ \hat{y}=\frac{1}{T}\sum \limits_{t=1}^T\ {f}_t(x) $$
    where each $ {f}_t $ is the prediction from an individual decision tree. This approach reduces variance and overfitting while capturing nonlinear relationships between inputs and outputs. While random forest is powerful and robust, it lacks temporal awareness. It does not consider the order of data points, and it cannot extrapolate future trends beyond the training range. To be effective for time series, it relies heavily on the inclusion of lag features or engineered time-based inputs.
  • eXtreme gradient boosting (XGBoost)

    XGBoost by Chen and Guestrin (Reference Chen and Guestrin2016) is an optimized implementation of gradient boosting that combines decision trees in a sequential manner, where each new tree corrects the residuals of the previous ones. It is known for its speed, regularization capabilities, and strong performance on structured data. It handles missing values well and includes several techniques to prevent overfitting. However, XGBoost is not specifically designed for time-series data. Without careful feature engineering (e.g., time lags and trend indicators), it cannot model temporal dependencies directly. Additionally, it may overfit if the model complexity is not well-controlled, especially when dealing with small or noisy datasets.

  • Light gradient boosting machine (LightGBM)

    LightGBM by Ke et al. (Reference Ke, Meng, Finley, Wang, Chen, Ma, Ye and Liu2017) is a gradient-boosting framework that improves upon earlier methods like XGBoost through optimizations, such as Gradient-based One-Side Sampling and Exclusive Feature Bundling. These enhancements make it highly efficient in terms of training speed and memory usage, mainly for large datasets with many features. Like other tree-based models, LightGBM can capture nonlinear interactions and works well with heterogeneous data. However, it does not intrinsically understand time-based relationships and, like XGBoost, depends on handcrafted temporal features. Its aggressive sampling strategy may also bias the model toward frequent or recent events, which can be problematic for forecasting rare or long-term patterns.

  • One-dimensional convolutional neural network (Conv1D)

    Conv1D by Kiranyaz et al. (Reference Kiranyaz, Avci, Abdeljaber, Ince, Gabbouj and Inman2021) is a type of Convolutional Neural Network (CNN) model designed to process sequential data using one-dimensional convolutional filters. These filters slide over the time dimension to detect local patterns and short-term dependencies. The model can extract meaningful features from fixed-length windows and is typically composed of multiple layers, including convolutions, nonlinear activation functions, and pooling operations, followed by fully connected layers for prediction. While Conv1D is computationally efficient and effective at identifying localized temporal patterns, it is naturally limited in its ability to capture long-term dependencies. It also requires consistent input lengths and may struggle with nonstationary data unless combined with deeper or recurrent structures.

  • Long short-term memory networks (LSTMs)

    LSTM by Hochreiter and Schmidhuber (Reference Hochreiter and Schmidhuber1997) is a specialized type of recurrent neural network (RNN) designed to handle the limitations of traditional RNNs in learning long-term dependencies. It introduces memory cells and gating mechanisms that control the flow of information across time steps, allowing the model to retain and forget information as needed. LSTMs are well-suited for sequential data with complex temporal relationships, such as time series with trends and seasonality. However, they come with high computational costs, especially for long sequences or deep architectures. They also require large datasets to generalize well and are prone to overfitting if regularization is not carefully applied.

  • Gated recurrent units (GRUs)

    GRU by Chung et al. (Reference Chung, Gulcehre, Cho and Bengio2014) is a variant of LSTM that simplifies the network architecture by merging the input and forget gates into a single update gate. This leads to faster training and reduced memory usage while still allowing the model to capture temporal dependencies. GRUs have been shown to perform comparably to LSTMs on many time-series tasks, especially when computational efficiency is a concern. However, their simplified structure may limit their ability to model very long-term dependencies. Like LSTMs, they also require careful tuning and regularization to avoid overfitting, specifically when training data are scarce or highly variable.

  • Conv1D-LSTM

    Conv1D-LSTM by Shi et al. (Reference Shi, Chen, Wang, Yeung, W-K and Woo2015) is a hybrid model that combines one-dimensional convolutional layers with LSTM units. The convolutional layers serve to extract local features and reduce the dimensionality of the input sequence before passing it to the LSTM layers, which model the temporal evolution of these features. This architecture is particularly effective for spatiotemporal forecasting, where both short-term patterns and long-term dependencies are essential. It leverages the strengths of both CNNs and RNNs, but also inherits their complexities. The model requires more computational resources and careful tuning, and its interpretability is lower compared to simpler models. Overfitting can also be an issue if the dataset is not adequately large or diverse.

2.3.2. Experimental setup

For our experiments, across all localities, the dataset was split at the date January 31, 2014, resulting in a training dataset of length 387 and a test dataset of length 96. Before introducing the ML models, all features and the target variable were normalized using min-max scaling, with the scaler fitted only on the training set to prevent data leakage. The fitted scaler was then applied to the test set for evaluation of the models. The transformation follows the exact formula implemented in scikit-learn’s MinMaxScaler (Pedregosa et al., Reference Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss, Dubourg, Vanderplas, Passos, Cournapeau, Brucher, Perrot and Duchesnay2011):

$$ {X}_{\mathrm{std}}=\frac{X-{X}_{\mathrm{min}}}{X_{\mathrm{max}}-{X}_{\mathrm{min}}};\hskip1em {X}_{\mathrm{scaled}}={X}_{\mathrm{std}}\times \left(\max \_\mathrm{range}-\min \_\mathrm{range}\right)+\min \_\mathrm{range} $$

In our case, we used $ \min \_\mathrm{range}=-1 $ and $ \max \_\mathrm{range}=1 $ . The ML models built for these datasets are linear regression, ridge regression, random forest, XGBoost, and LightGBM, using the default parameters from the scikit-learn, XGBoost, and LightGBM libraries.
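A minimal sketch of this ML pipeline is shown below: a chronological train/test split, a MinMaxScaler with range [−1, 1] fitted on the training portion only, and models with default parameters. The synthetic data are illustrative; only scikit-learn models are shown here, but XGBoost and LightGBM expose the same fit/predict interface.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Illustrative data: 483 rows (387 train / 96 test, as in the text)
rng = np.random.default_rng(42)
X = rng.normal(size=(483, 7))
y = X @ rng.normal(size=7) + 0.1 * rng.normal(size=483)

# Chronological split: no shuffling, the test set is the most recent block
X_train, X_test = X[:387], X[387:]
y_train, y_test = y[:387], y[387:]

# Fit the scaler on the training set only to prevent data leakage,
# then apply the same transform to the test set.
scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(),          # default parameters, as in the study
    "rf": RandomForestRegressor(random_state=0),
}
scores = {name: r2_score(y_test, m.fit(X_train_s, y_train).predict(X_test_s))
          for name, m in models.items()}
```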

Before implementing DL models, the dataset was reshaped into the format $ \left({n}_{\mathrm{elements}},\mathrm{sequence}\_\mathrm{length},{n}_{\mathrm{features}}\right) $ , where $ {n}_{\mathrm{elements}} $ is the number of samples, $ \mathrm{sequence}\_\mathrm{length} $ is the fixed window size, and $ {n}_{\mathrm{features}} $ is the number of predictors. In this case, the window size was set to 5, and seven features were used, resulting in an input shape of $ \left(n,\mathrm{5,7}\right) $ .
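The reshaping step above amounts to building sliding windows over the feature matrix; a small sketch (the helper name is illustrative):

```python
import numpy as np

def make_windows(features, window=5):
    """Reshape a (time, n_features) array into the
    (n_elements, sequence_length, n_features) tensor expected by the
    DL models, using a sliding window over the time axis."""
    X = np.asarray(features)
    return np.stack([X[i:i + window] for i in range(len(X) - window + 1)])

X = np.arange(20 * 7).reshape(20, 7)   # 20 time steps, 7 features
windows = make_windows(X, window=5)    # shape (16, 5, 7)
```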

To ensure robust training, 20% of the training set was reserved for validation. Normalization was performed using a MinMaxScaler fitted on the training set, with the same scaling parameters applied to the validation and test sets. All DL models were developed using the Keras API within TensorFlow. A consistent training setup was applied across all models and localities: a batch size of 64, the Adam optimizer with a learning rate of 0.001, mean squared error as the validation loss function, and RMSE as the evaluation metric. Monitoring the validation loss during training showed that convergence was typically achieved just before the 100th epoch across all localities and models, so the number of epochs was set to 100.

The implemented architectures include a Conv1D model consisting of a 1D convolutional layer with 64 filters and a kernel size of 1, followed by dense layers (see Table 4); an LSTM model composed of two stacked LSTM layers with 64 and 32 units (Table 5); a GRU variant using the same configuration as the LSTM but with GRU layers (Table 6); and a hybrid Conv1D-LSTM model that starts with a Conv1D layer (32 filters and kernel size 3), followed by two LSTM layers with 64 and 32 units, and dense layers (Table 7).
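As a sketch, the hybrid Conv1D-LSTM model could be assembled with the Keras API as follows, using the layer sizes quoted above (Conv1D with 32 filters and kernel size 3, LSTMs with 64 and 32 units) and the training setup from this section. The dense-layer sizes are illustrative assumptions; Table 7 gives the exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_conv1d_lstm(window=5, n_features=7):
    """Conv1D front-end for local feature extraction, followed by
    stacked LSTMs for temporal dynamics and dense layers for output."""
    model = models.Sequential([
        layers.Input(shape=(window, n_features)),
        layers.Conv1D(32, kernel_size=3, activation="relu"),
        layers.LSTM(64, return_sequences=True),
        layers.LSTM(32),
        layers.Dense(16, activation="relu"),   # illustrative dense size
        layers.Dense(1),                       # one-step-ahead SPI6
    ])
    # Training setup from Section 2.3.2: Adam (lr=0.001), MSE loss, RMSE metric
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="mse",
                  metrics=[tf.keras.metrics.RootMeanSquaredError()])
    return model

model = build_conv1d_lstm()
```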

Table 4. Architecture of the Conv1d model used in the study

Table 5. Architecture of the LSTM model used in the study

Table 6. Architecture of the GRU model used in the study

Table 7. Architecture of the Conv1D-LSTM hybrid model used in the study

All experiments were conducted on a machine running Ubuntu 24.04.2 LTS (64-bit), with Python 3.12.3 and TensorFlow 2.16.2. The system features a 13th Gen Intel® Core™ i7-13620H CPU (Central Processing Unit) with 16 GB RAM (Random Access Memory) and an NVIDIA GeForce RTX™ 4060 Laptop GPU (Graphics Processing Unit) with 8 GB of memory.

2.4. Uncertainty quantification and model ranking

After carefully selecting ML and DL algorithms for drought forecasting, we sought to quantify the uncertainty associated with their predictions rigorously. Reliable uncertainty quantification is crucial for informed decision-making in drought management, allowing stakeholders to assess forecast confidence and anticipate possible variations. To achieve this goal, we integrated the selected predictive algorithms into the conformal prediction (CP) framework initially introduced by Vovk et al. (Reference Vovk, Gammerman and Shafer2005).

However, traditional CP methods for regression tasks assume data points are exchangeable, an assumption inherently violated by time-series data due to their temporal dependencies (Barber et al., Reference Barber, Candès, Ramdas and Tibshirani2023; Xu and Xie, Reference Xu and Xie2023). Therefore, applying standard conformal methods would yield unreliable uncertainty estimates for our drought-forecasting problem.

To overcome this fundamental challenge, we adopted the EnbPI CP method proposed by Xu and Xie (Reference Xu and Xie2023), specifically tailored for time-series prediction tasks with nonexchangeable data points. This method leverages ensemble modeling and bootstrap resampling to construct robust prediction intervals without relying on the strict assumption of exchangeability.

2.4.1. Conformal prediction

CP (Vovk et al., Reference Vovk, Gammerman and Shafer2005) is a flexible statistical framework for constructing prediction intervals or sets that contain the true outcome with a guaranteed probability. Given a training dataset $ {\left\{\left({x}_i,{y}_i\right)\right\}}_{i=1}^n $ and a regression model $ \hat{f} $ , CP assigns a nonconformity score to each training example, typically computed as the absolute residual:

(2) $$ {\unicode{x025B}}_i=\mid {y}_i-{\hat{f}}_{-i}\left({x}_i\right)\mid, $$

where $ {\hat{f}}_{-i} $ denotes the model trained with the ith data point excluded (leave-one-out). For a new input $ {x}_{n+1} $ , the prediction interval is given by:

(3) $$ {C}_{1-\alpha}\left({x}_{n+1}\right)=\left[\hat{f}\left({x}_{n+1}\right)\pm {q}_{1-\alpha}\right], $$

where $ {q}_{1-\alpha } $ is the $ \left(1-\alpha \right) $ quantile of the empirical distribution of $ \left\{{\unicode{x025B}}_i\right\} $ . Under the assumption of exchangeability, this interval satisfies the marginal coverage guarantee:

(4) $$ \mathrm{\mathbb{P}}\left({y}_{n+1}\in {C}_{1-\alpha}\left({x}_{n+1}\right)\right)\ge 1-\alpha . $$

However, in time-series applications, the assumption of exchangeability is typically violated, requiring more robust methods.
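A minimal sketch of the interval construction in Equations (2)–(3) is given below. For computational simplicity it uses the common split-conformal shortcut (residuals on a held-out calibration set) rather than the leave-one-out residuals of Equation (2); the function name and synthetic data are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def conformal_interval(model, X_cal, y_cal, X_new, alpha=0.1):
    """Split-conformal prediction intervals: nonconformity scores are
    absolute residuals on a calibration set, and q_{1-alpha} is their
    finite-sample-corrected empirical quantile (cf. Eqs. 2-3)."""
    scores = np.abs(y_cal - model.predict(X_cal))
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level)
    pred = model.predict(X_new)
    return pred - q, pred + q

# Usage: 90% intervals on synthetic i.i.d. data with unit Gaussian noise
rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 1))
y = 2.0 * X[:, 0] + rng.normal(size=1500)
model = LinearRegression().fit(X[:500], y[:500])
lo, hi = conformal_interval(model, X[500:1000], y[500:1000], X[1000:])
coverage = np.mean((y[1000:] >= lo) & (y[1000:] <= hi))  # close to 0.9
```

On i.i.d. (exchangeable) data like this, the empirical coverage lands near the nominal 90%; on time series the guarantee breaks down, which motivates EnbPI below.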

2.4.2. Ensemble batch prediction interval

The EnbPI method (Xu and Xie, Reference Xu and Xie2023) modifies traditional conformal inference by incorporating ensemble methods and bootstrap techniques, effectively addressing the sequential dependencies in time-series data. The algorithm operates as follows:

  • Bootstrap resampling: The training dataset is bootstrapped multiple times (30 bootstrap samples in our study). Each bootstrap sample simulates different potential outcomes of the underlying data-generating process.

  • Ensemble model construction: The selected ML and DL algorithms are independently trained on each bootstrap sample, creating an ensemble of diverse predictive models. This ensemble captures a broad spectrum of possible data behaviors and prediction outcomes.

  • Batch-wise conformalization: Prediction intervals are computed batch-wise, considering historical residuals from previous batches. Residuals (absolute errors between observed and predicted values) from previous batches are used to construct empirical quantiles, resulting in robust and adaptive prediction intervals for future observations.

For this study, we implemented EnBPI using the Python library Puncc, with a significance level $ \alpha =0.1 $ , corresponding to a 90% coverage level, meaning theoretically 90% of prediction intervals should contain true observed values. While EnBPI offers a practical and adaptive solution for generating prediction intervals in time-series forecasting, it is not without limitations. A key drawback is its computational cost: since the method requires training multiple models on bootstrapped datasets, it may face scalability issues when applied to very large datasets or high-dimensional input spaces.
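The three steps above can be sketched as follows. This is a simplified illustration of the EnBPI idea (a bootstrap ensemble with out-of-bag residual quantiles) using a plain least-squares base learner; it omits the batch-wise residual updates of the full algorithm and does not reproduce the Puncc implementation:

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares base learner with intercept (stand-in for the ML/DL models)."""
    A = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xn: np.hstack([Xn, np.ones((len(Xn), 1))]) @ coef

def enbpi_sketch(X_tr, y_tr, X_te, B=30, alpha=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X_tr)
    pred_tr = np.zeros((B, n))
    pred_te = np.zeros((B, len(X_te)))
    in_boot = np.zeros((B, n), dtype=bool)
    for b in range(B):                                   # 1) bootstrap resampling
        idx = rng.integers(0, n, size=n)
        in_boot[b, idx] = True
        predict = fit_linear(X_tr[idx], y_tr[idx])       # 2) ensemble member
        pred_tr[b] = predict(X_tr)
        pred_te[b] = predict(X_te)
    # 3) out-of-bag residuals: each training point is predicted only by
    # ensemble members whose bootstrap sample excluded it
    oob = ~in_boot
    loo = np.array([pred_tr[oob[:, i], i].mean() if oob[:, i].any()
                    else pred_tr[:, i].mean() for i in range(n)])
    q = np.quantile(np.abs(y_tr - loo), 1 - alpha)       # residual quantile
    center = pred_te.mean(axis=0)                        # ensemble point forecast
    return center - q, center + q

# Toy usage: linear trend with noise, last 40 points held out
rng = np.random.default_rng(0)
X = np.arange(200.0).reshape(-1, 1)
y = 0.1 * X[:, 0] + rng.normal(0.0, 0.5, size=200)
lo, hi = enbpi_sketch(X[:160], y[:160], X[160:])
cov = np.mean((y[160:] >= lo) & (y[160:] <= hi))
```

In the full method, the residual pool would be refreshed batch by batch as new observations arrive, letting the intervals adapt over time.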

2.4.3. Evaluation metrics

To evaluate model performance, we consider both classical regression metrics and uncertainty quantification-specific criteria. This dual evaluation allows us to assess not only the accuracy but also the trustworthiness and informativeness of predictions.

2.4.4. Regression metrics

  • $ {R}^2 $ by Wright (Reference Wright1921): The $ {R}^2 $ metric measures the proportion of variance in the dependent variable that is predictable from the independent variables. It is calculated as:

    (5) $$ {R}^2=1-\frac{\sum_{t=1}^T{\left({y}_t-{\hat{y}}_t\right)}^2}{\sum_{t=1}^T{\left({y}_t-\overline{y}\right)}^2}, $$
    where $ {y}_t $ are the actual values, $ {\hat{y}}_t $ are the predicted values, and $ \overline{y} $ is the mean of the actual values. $ {R}^2 $ is at most 1, with higher values indicating a better fit; it can be negative when a model performs worse than simply predicting the mean.
  • RMSE by Adhikari and Agrawal (Reference Adhikari and Agrawal2013): RMSE is a widely used metric that measures the square root of the average squared differences between predicted and actual values. It is given by:

    (6) $$ \mathrm{RMSE}=\sqrt{\frac{1}{T}\sum \limits_{t=1}^T{\left({y}_t-{\hat{y}}_t\right)}^2}. $$
    RMSE is sensitive to large errors and provides a measure of the magnitude of the prediction errors.
  • Mean absolute error (MAE) by Adhikari and Agrawal (Reference Adhikari and Agrawal2013): MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It is calculated as:

    (7) $$ \mathrm{MAE}=\frac{1}{T}\sum \limits_{t=1}^T\mid {y}_t-{\hat{y}}_t\mid . $$
    MAE gives an easily interpretable measure of the average prediction error in the same units as the target variable.
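Formulas (5)–(7) translate directly into NumPy; a minimal, self-contained version:

```python
import numpy as np

def r2(y, yhat):
    """Eq. (5): proportion of variance explained."""
    return 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - np.mean(y)) ** 2)

def rmse(y, yhat):
    """Eq. (6): root of the mean squared error."""
    return np.sqrt(np.mean((y - yhat) ** 2))

def mae(y, yhat):
    """Eq. (7): mean absolute error, in the units of the target."""
    return np.mean(np.abs(y - yhat))
```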

Uncertainty quantification metrics.

  • Coverage. The fraction of true targets that fall within the predicted intervals. A well-calibrated model should satisfy:

    (8) $$ \mathrm{Coverage}=\frac{1}{T}\sum \limits_{t=1}^T\mathbb{1}\left\{{y}_t\in {C}_{1-\alpha}\left({x}_t\right)\right\}. $$
  • Prediction interval width (Width). This measures the average width of the prediction intervals. Narrower intervals are more informative:

    (9) $$ \mathrm{Width}=\frac{1}{T}\sum \limits_{t=1}^T\left({\mathrm{upper}}_t-{\mathrm{lower}}_t\right). $$
    Together, coverage and interval width quantify the trade-off between reliability (i.e., how often intervals contain the true value) and informativeness (i.e., how narrow the intervals are) (Stankevičiūtė et al., Reference Stankevičiūtė, Alaa and van der Schaar2021; Auer et al., Reference Auer, Gauch, Klotz and Hochreiter2023; Xu and Xie, Reference Xu and Xie2023; Lee et al., Reference Lee, Xu and Xie2024). A trustworthy model should maintain nominal coverage while producing tight intervals.
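Both uncertainty metrics are equally direct to compute from the interval bounds; a minimal sketch:

```python
import numpy as np

def coverage(y, lower, upper):
    """Eq. (8): fraction of true values falling inside their prediction intervals."""
    return np.mean((y >= lower) & (y <= upper))

def mean_width(lower, upper):
    """Eq. (9): average width of the prediction intervals."""
    return np.mean(upper - lower)
```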

Computational sustainability metric.

  • Carbon footprint (CO2e): We report the environmental cost of computation separately from predictive metrics. Emissions are estimated with CodeCarbon (Courty et al., Reference Courty, Schmidt, Luccioni, Goyal-Kamal, Marion Coutarel, Feld, Lecourt, LiamConnell, Saboni, Inimaz, supatomic, Léval, Blanche, Cruveiller, Ouminasara, Zhao, Joshi, Bogroff, de Lavoreille, Laskaris, Abati, Blank, Wang, Catovic, Alencon, Michał, Bauer and Araújo2024), developed by a consortium including MILA (Quebec AI Institute), Haverford College, Comet.ml, and BCG GAMMA. We use this library to monitor hardware power use (CPU/GPU/RAM) and record, for each model, the total emissions in kg CO2e.

2.4.5. Model ranking via Borda Count method

To systematically compare and rank predictive model performance, we used the Borda Count ranking system, like many studies (Cherif and Idri, Reference Cherif and Idri2025; Hamdouchi and Idri, Reference Hamdouchi and Idri2025; Nakach et al., Reference Nakach, Idri and Tchokponhoue2025; Zerouaoui et al., Reference Zerouaoui, Idr and El Alaoui2025). The Borda Count is a consensus-based ranking method effective in combining multiple performance metrics. The method works as follows:

  • Each performance metric ($ {R}^2 $, RMSE, MAE, Coverage, Width, and CO2 emissions) acts as a voter, ranking candidate models.

  • For $ n $ candidate models, each model receives points based on its position in each metric: $ n $ points for first place, $ n-1 $ points for second, down to 1 point for last place.

  • Points from all metrics are summed for each model; the model with the highest total score is considered the most effective and reliable across multiple evaluation dimensions.

This structured, transparent approach ensures comprehensive, balanced, and interpretable evaluations of model performance, explicitly accounting for accuracy, reliability, precision, and environmental considerations.
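A minimal sketch of this scoring scheme (tie handling is not specified in the text, so ties here are broken arbitrarily by sort order):

```python
import numpy as np

def borda_count(metric_values, higher_is_better):
    """Borda Count: each metric ranks the models, and ranks convert to points.

    metric_values: dict of metric name -> sequence of one value per model.
    higher_is_better: dict of metric name -> True if larger values are better.
    Returns total points per model (n points for 1st place, 1 point for last)."""
    n = len(next(iter(metric_values.values())))
    totals = np.zeros(n)
    for name, vals in metric_values.items():
        order = np.argsort(np.asarray(vals))   # ascending sort of metric values
        if not higher_is_better[name]:
            order = order[::-1]                # now ordered worst ... best
        # worst model gets 1 point, best model gets n points
        for points, model in enumerate(order, start=1):
            totals[model] += points
    return totals
```

For example, with two metrics over three models, a model ranked first by both $ {R}^2 $ (higher is better) and RMSE (lower is better) accumulates $ n+n=6 $ points.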

2.4.6. Taylor Diagram

This study uses the Taylor Diagram (Taylor, Reference Taylor2001) to provide a comprehensive statistical evaluation of the ML and DL models. The Taylor Diagram enables simultaneous visualization of three complementary performance statistics (correlation, centered root-mean-square difference, and standard deviation) on a single polar plot, revealing trade-offs that single-metric evaluation conceals. Because it focuses on centered pattern differences after removing overall bias, the diagram isolates structural model differences from mean-field errors. The Borda Count ranking, by contrast, aggregates a broader set of metrics than the Taylor Diagram can display.
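The three statistics displayed on a Taylor Diagram are linked by a law-of-cosines identity (Taylor, Reference Taylor2001), which a few lines of NumPy can verify on synthetic series (the study's own series are not reproduced here):

```python
import numpy as np

def taylor_stats(obs, pred):
    """Correlation, standard deviations, and centered RMS difference (CRMSD)."""
    r = np.corrcoef(obs, pred)[0, 1]
    s_o, s_p = obs.std(), pred.std()
    # centered pattern difference: bias is removed before comparison
    crmsd = np.sqrt(np.mean(((pred - pred.mean()) - (obs - obs.mean())) ** 2))
    return r, s_o, s_p, crmsd

# Synthetic observation/prediction pair
rng = np.random.default_rng(1)
obs = rng.normal(size=500)
pred = 0.8 * obs + rng.normal(scale=0.3, size=500)
r, s_o, s_p, crmsd = taylor_stats(obs, pred)
# Taylor's identity: CRMSD^2 = s_o^2 + s_p^2 - 2 * s_o * s_p * r
```

This identity is what lets one polar plot encode all three statistics: the radial coordinate is the standard deviation, the angular coordinate the correlation, and the distance to the reference point the CRMSD.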

3. Results and discussion

Table 8 presents a comprehensive evaluation of various predictive models for drought forecasting in Malanville, integrating multiple performance dimensions including predictive accuracy ( $ {R}^2 $ , RMSE, and MAE), uncertainty quantification (Coverage and Width), and environmental impact (CO2 emissions). The Borda Count ranking method synthesizes these metrics, providing a clear and balanced assessment framework.

Table 8. Performance of different models for Malanville

Note. The best model according to the Borda Count ranking is highlighted in bold.

The results clearly demonstrate the superior performance of the Conv1D model, which achieved the highest overall rank. With the best predictive accuracy metrics (highest $ {R}^2=97.76\% $ , lowest RMSE = 0.140, and MAE = 0.097), the Conv1D model effectively captures complex temporal and nonlinear patterns characteristic of drought time series. Furthermore, the high coverage rate (92.8%) indicates that this model provides reliable uncertainty quantification, ensuring that decision-makers can confidently interpret its drought predictions. Although the Conv1D model exhibited moderate interval widths and slightly elevated carbon emissions relative to simpler statistical models, the significant gains in accuracy and reliability clearly outweigh these trade-offs.

Furthermore, the quality of the uncertainty estimates produced by Conv1D is visually confirmed in Figure 4, which presents the prediction intervals generated by the model using the EnBPI method. The vast majority of the test set observations fall within the predicted intervals, with only a few instances lying outside. This indicates that the prediction intervals are not only well-calibrated but also adaptive to temporal fluctuations in the data. The intervals effectively capture the seasonal dynamics of the SPI6 drought index, adjusting their width according to local uncertainty. This reinforces the robustness of the Conv1D model in providing accurate point predictions and reliable uncertainty quantification.

Figure 4. Malanville’s prediction intervals and observations of the most effective algorithm (a) and the least effective algorithm (b) using the EnBPI conformal prediction method for the SPI6. The shaded blue area represents the 90% prediction interval, the black line indicates the model’s point predictions, green dots show observations within the interval, and red dots denote observations outside the interval.

Closely following, the Conv1D-LSTM hybrid model ranks second, showcasing a compelling blend of convolutional and sequential learning strengths. It achieves strong predictive performance ( $ {R}^2=97.698\% $ , RMSE = 0.142, and MAE = 0.098) and delivers slightly narrower prediction intervals compared to the Conv1D model, indicating higher precision. However, its marginally lower coverage (89.7%) suggests a small risk of missing some valid observations. The hybrid approach thus represents a robust alternative that effectively balances accuracy, uncertainty quantification, and moderate environmental impact.

Traditional statistical models, such as linear regression, ridge regression, and random forest, ranked among the lowest performers. Despite their lower computational demands and reduced CO2 emissions, these methods failed to achieve comparable predictive accuracy and produced wider prediction intervals. Consequently, their practical utility for precise drought forecasting in critical applications is limited.

Moreover, DL architectures, such as LSTM and GRU, displayed promising predictive accuracy but incurred substantially higher carbon footprints. This increased environmental cost poses sustainability challenges, particularly relevant in resource-limited or environmentally sensitive contexts. Therefore, although these models have the potential to deliver strong predictive capabilities, their deployment must be carefully evaluated against available computational resources and environmental considerations.

Overall, Figure 5 corroborates the Borda Count ranking, identifying the same most and least effective models: Conv1D performs best across all three statistics, and the DL architectures show little variability among themselves while aligning more closely with the observations than the ML models. The analysis underscores the effectiveness of convolution-based DL models, particularly Conv1D and Conv1D-LSTM, for accurate, reliable, and environmentally balanced drought forecasts in Malanville.

Figure 5. Malanville’s Taylor Diagram.

In Karimama, the evaluation results summarized in Table 9 reveal two models tied for the top Borda Count rank: Conv1D and Conv1D-LSTM.

Table 9. Performance of different models for Karimama

Note. The best model according to the Borda Count ranking is highlighted in bold.

The Conv1D model exhibited the strongest overall predictive accuracy, achieving the highest $ {R}^2 $ (97.072%), the lowest RMSE (0.153), and a notably low MAE (0.109). These results underscore the model’s capability to capture complex drought dynamics and deliver reliable forecasts. Additionally, its high coverage rate (92.8%) indicates robust uncertainty quantification, ensuring the constructed prediction intervals reliably capture the actual observations (see Figure 6). However, the interval width (0.475) and the model’s relatively higher carbon footprint (1.17E−03) point to unavoidable trade-offs between precision and computational sustainability.

Figure 6. Karimama’s prediction intervals and observations of the most effective algorithm (a) and the least effective algorithm (b) using the EnBPI conformal prediction method for the SPI6. The shaded blue area represents the 90% prediction interval, the black line indicates the model’s point predictions, green dots show observations within the interval, and red dots denote observations outside the interval.

Equally ranked first, the Conv1D-LSTM hybrid model provided similarly impressive predictive performance, slightly trailing in terms of $ {R}^2 $ (96.917%) and RMSE (0.157), but surpassing Conv1D with the lowest MAE (0.107). Importantly, this model generated narrower prediction intervals (width = 0.463), improving precision, although with a slight reduction in coverage (89.7%). A key advantage of Conv1D-LSTM is its lower environmental impact, with substantially reduced CO2 emissions (6.63E−04), making it a particularly attractive option for sustainable forecasting applications.

Traditional models, such as linear regression and ridge regression, underperformed considerably across all metrics. Their lower accuracy (~91% $ {R}^2 $), wider prediction intervals, and suboptimal coverage highlight their limitations in addressing the complexity of drought-forecasting tasks. Although computationally efficient, their comparatively poor performance significantly constrains their practical use.

Intermediate approaches, such as XGBoost, LightGBM, and SVR, demonstrated moderate predictive accuracy and acceptable uncertainty quantification, but were consistently outperformed by the DL-based methods. Their intermediate rankings reflect a trade-off between computational simplicity and predictive robustness.

In conclusion, Figure 7 validates the Borda Count ranking: the DL architectures show little variability among themselves and align more closely with the observations than the ML models. The Karimama case study reinforces the effectiveness of convolution-based DL models, particularly Conv1D and Conv1D-LSTM, as optimal methods for accurate, reliable, and sustainable drought forecasting. Coupled with the EnBPI uncertainty framework, these models offer significant promise for practical decision-making and drought management strategies in regions facing similar environmental challenges.

Figure 7. Karimama’s Taylor Diagram.

In Kandi, the evaluation results summarized in Table 10 provide critical insights into the performance and suitability of various predictive models for drought forecasting, based on predictive accuracy ($ {R}^2 $, RMSE, and MAE), uncertainty quantification (Coverage and Width), and environmental impact (CO2 emissions). Using the Borda Count ranking method, the Conv1D-LSTM hybrid model emerged as the highest-ranked algorithm, primarily due to superior regression performance metrics (highest $ {R}^2 $ = 97.417%, lowest RMSE = 0.173, and lowest MAE = 0.119). This highlights the strong capability of hybrid DL architectures in capturing complex drought patterns.

Table 10. Performance of different models for Kandi

Note. The best model according to the Borda Count ranking is highlighted in bold.

However, despite its high predictive accuracy, the Conv1D-LSTM model did not meet the targeted 90% coverage, achieving only 77.3%. This shortfall indicates potential overconfidence in its predictions, as further suggested by its notably narrow prediction intervals (width = 0.391). Such narrow intervals, although indicative of high precision, may not accurately reflect the true predictive uncertainty and hence limit the model’s reliability for practical decision-making. This limitation is visually confirmed in Figure 8, which displays the prediction intervals generated by the Conv1D-LSTM model using the EnBPI method: a substantial number of test set observations lie outside the prediction intervals. Although the intervals appear adaptively narrow, they fail to provide adequate coverage, underscoring the model’s overconfidence in this locality.

Figure 8. Kandi’s prediction intervals and observations of the most effective algorithm (a) and the least effective algorithm (b) using the EnBPI conformal prediction method for the SPI6. The shaded blue area represents the 90% prediction interval, the black line indicates the model’s point predictions, green dots show observations within the interval, and red dots denote observations outside the interval.

Conversely, traditional regression methods such as linear regression and ridge regression demonstrated relatively balanced performance, achieving coverage rates very close to the ideal 90% (~89.6%). Despite having moderately lower predictive accuracy metrics compared to DL models, these statistical methods exhibit desirable reliability in their uncertainty quantification. Their wider, more adaptive prediction intervals provide a realistic representation of prediction uncertainty, enhancing the confidence stakeholders can place in their forecasts.

Moreover, these traditional models offer significant computational advantages, reflected by substantially lower carbon footprints (~3.55E−06 and 9.89E−06 for linear regression and ridge regression, respectively), emphasizing their suitability for sustainable and efficient deployment in operational settings.

In summary, although the Conv1D-LSTM model ranked first on overall performance metrics, traditional regression approaches (linear and ridge regression) might represent more suitable alternatives for practical deployment. Additionally, Figure 9 validates the Borda Count ranking: the Conv1D-LSTM model aligns best with the observations, performing better across all three statistics with only marginal variability relative to the other DL models. The traditional models’ reliable uncertainty quantification, near-ideal coverage rates, computational efficiency, and environmental sustainability nonetheless provide stronger assurance for informed and confident decision-making in drought management applications.

Figure 9. Kandi’s Taylor Diagram.

In Gogounou, the evaluation results presented in Table 11 provide an in-depth comparison of different models’ performances based on predictive accuracy ( $ {R}^2 $ , RMSE, and MAE), uncertainty quantification (Coverage and Width), and environmental impact (CO2 emissions). According to the Borda Count ranking, the Conv1D-LSTM model secured the top position overall.

Table 11. Performance of different models for Gogounou

Note. The best model according to the Borda Count ranking is highlighted in bold.

The Conv1D-LSTM model achieved the highest predictive accuracy among all models ( $ {R}^2=97.842\% $ ), with the lowest RMSE (0.162) and MAE (0.109), demonstrating its superior capability in capturing complex patterns in drought time series. These accuracy metrics are notable. However, its uncertainty quantification was less satisfactory. The model recorded a coverage of only 79.4%, substantially below the 90% threshold typically expected from reliable prediction intervals. While it maintained a relatively narrow interval width (0.399), this may indicate overconfident predictions and a lack of adaptability in the uncertainty estimates—raising concerns about the robustness of its predictive intervals in operational settings (see Figure 10).

Figure 10. Gogounou’s prediction intervals and observations of the most effective algorithm (a) and the least effective algorithm (b) using the EnBPI conformal prediction method for the SPI6. The shaded blue area represents the 90% prediction interval, the black line indicates the model’s point predictions, green dots show observations within the interval, and red dots denote observations outside the interval.

By contrast, traditional models such as linear regression and ridge regression ranked second and third, respectively, and showed a more balanced trade-off between predictive performance and uncertainty reliability. Despite slightly lower $ {R}^2 $ values (97.605 and 97.598%) and higher MAEs (0.133 and 0.134), both models offered good coverage (87.5 and 89.6%) and acceptable width (0.540), more faithfully capturing the uncertainty around their forecasts. Additionally, their carbon emissions were minimal (on the order of $ {10}^{-6} $ kg CO2e), making them significantly more computationally efficient and environmentally sustainable than DL-based alternatives.

Other DL models, such as LSTM, GRU, and Conv1D, exhibited strong predictive accuracy, with $ {R}^2 $ scores all above 97.3%, but they also suffered from insufficient coverage (ranging from 76.3 to 81.4%), reinforcing the concern that these architectures may generate overly narrow, overconfident prediction intervals. In particular, GRU and LSTM demonstrated relatively high emissions and low coverage, making them less suitable for operational deployment despite their precision.

To conclude, Figure 11 confirms the Borda Count ranking: the Conv1D-LSTM model aligns better with the observations than the rest of the models. Although Conv1D-LSTM led in overall predictive performance and Borda ranking in Gogounou, its suboptimal coverage and overconfident uncertainty estimates call its trustworthiness into question in risk-sensitive applications. Traditional models like linear regression and ridge regression instead offer a better balance of accuracy, reliable uncertainty quantification, and minimal environmental impact, making them more pragmatic choices for reliable and sustainable drought forecasting in Gogounou.

Figure 11. Gogounou’s Taylor Diagram.

In Banikoara, the results presented in Table 12 reveal the outstanding performance of DL models, particularly the hybrid Conv1D-LSTM, which ranked first according to the Borda Count method. This model achieved the highest $ {R}^2 $ (98.285%), the lowest RMSE (0.130), and the lowest MAE (0.084), marking it as the most accurate model for drought forecasting in this locality. In addition to its strong accuracy, it maintained a coverage level of 93.8%, well above the 90% threshold, indicating reliable uncertainty estimation. Its prediction interval width (0.435) was also among the narrowest, suggesting both high precision and trustworthy predictive intervals (see Figure 12).

Table 12. Performance of different models for Banikoara

Figure 12. Banikoara’s prediction intervals and observations of the most effective algorithm (a) and the least effective algorithm (b) using the EnBPI conformal prediction method for the SPI6. The shaded blue area represents the 90% prediction interval, the black line indicates the model’s point predictions, green dots show observations within the interval, and red dots denote observations outside the interval.

This result is particularly noteworthy because it contrasts with observations in other regions where DL models tended to underperform in coverage. Here, the Conv1D-LSTM model demonstrated both high predictive power and reliable uncertainty quantification, making it a strong candidate for real-world deployment, despite its higher computational cost (CO2 emissions of 1.47E−03). Its ability to maintain a balanced trade-off between accuracy and reliability sets it apart as a robust and practical solution for forecasting in Banikoara.

Other DL models, such as Conv1D, LSTM, and GRU, also performed competitively. Conv1D, for instance, achieved a very high $ {R}^2=98.164\% $ and excellent uncertainty coverage (94.8%), reinforcing the strength of convolutional architectures in handling temporal data. However, it had a slightly wider interval (0.432) and marginally higher emissions than GRU and LSTM, which delivered similar levels of accuracy and competitive interval widths (both below 0.37), although with slightly lower coverage.

Traditional models like SVR, ridge, and linear regressions provided satisfactory results in terms of coverage (all around or above 90%) but trailed behind in predictive accuracy. Notably, SVR combined solid performance ( $ {R}^2=98.235\% $ ) with good coverage (91.7%) and moderate width (0.451), making it a reasonable choice when considering computational efficiency (emissions of 1.11E−05).

In summary, Figure 13 corroborates the findings of the Borda Count ranking method. The Conv1D-LSTM model aligns more closely with the observations than the other models, and the DL models show little variability. Banikoara is the region in which the Conv1D-LSTM model achieves near-ideal performance across all evaluation criteria—predictive accuracy, interval reliability, precision, and consistency—thus validating its suitability for deployment in data-driven drought forecasting. While simpler models such as SVR offer good trade-offs for environments with fewer resources, DL models demonstrated their full potential in this context.

Figure 13. Banikoara’s Taylor Diagram.

In Segbana, Table 13 reports the performance of multiple predictive models across regression metrics ( $ {R}^2 $ , RMSE, and MAE), uncertainty quantification (Coverage and Width), and environmental impact (CO2 emissions). The Borda Count ranking identifies both Conv1D and Conv1D-LSTM as top-performing models, sharing the first position.

Table 13. Performance of different models for Segbana

Note. The best model according to the Borda Count ranking is highlighted in bold.

The Conv1D-LSTM model achieved the highest $ {R}^2=96.613\% $ , the lowest RMSE (0.209), and a competitive MAE (0.145), confirming its superior predictive capability in this setting. However, its uncertainty quantification was less reliable, with a coverage rate of only 72.2%, significantly below the standard 90% expectation. Despite producing relatively narrow intervals (width = 0.409), this suggests the model is potentially overconfident—failing to capture true outcomes consistently within its predictive bounds (see Figure 14). Moreover, its CO2 emissions (1.42E−03) indicate a high computational cost, which may be limiting in real-world, resource-constrained applications.

Figure 14. Segbana’s prediction intervals and observations of the most effective algorithm (a) and the least effective algorithm (b) using the EnBPI conformal prediction method for the SPI6. The shaded blue area represents the 90% prediction interval, the black line indicates the model’s point predictions, green dots show observations within the interval, and red dots denote observations outside the interval.

Similarly, the Conv1D model performed very well in terms of predictive accuracy ( $ {R}^2=96.466\% $ , RMSE = 0.214, and MAE = 0.148) and had the narrowest prediction intervals (width = 0.369). However, its coverage rate (73.2%) was also unsatisfactory, reaffirming the issue of overconfidence. Although this model has a slightly lower carbon footprint (1.11E−03) than Conv1D-LSTM, it still ranks among the more computationally expensive options.

On the other hand, traditional models such as ridge regression and linear regression demonstrated more balanced performances. While their $ {R}^2 $ values were lower (95.7 and 95.5%, respectively), they produced higher coverage rates (82.3 and 80.2%, respectively), more consistent with trustworthy uncertainty quantification, although still below the desired 90% level. Their predictive intervals were wider, but better reflected true variation in the data. Additionally, these models remain computationally efficient, with significantly lower emissions (e.g., 4.41E-06 for linear regression).

Models such as XGBoost and LightGBM delivered reasonable regression metrics but also struggled with low coverage, while GRU, LSTM, and SVR exhibited both low uncertainty reliability and higher emissions—making them less practical for deployment.

In conclusion, while Conv1D and Conv1D-LSTM are ranked first in terms of overall performance, their low coverage values raise concerns about their robustness for uncertainty-aware forecasting. In contrast, simpler regression models like ridge or linear regression—although slightly less accurate—may offer more dependable and computationally efficient solutions for deployment in Segbana, especially when reliable uncertainty quantification is a priority. Figure 15 confirms the results of the Borda Count ranking method. Conv1D and Conv1D-LSTM models align better with observations compared to the other models.

Figure 15. Segbana’s Taylor Diagram.

4. Discussion and limitations

The prediction of SPI6 across several regions achieved impressive $ {R}^2 $ values of 98.29, 97.84, 97.76, 97.42, 96.61, and 97.07% in Banikoara, Gogounou, Malanville, Kandi, Segbana, and Karimama, respectively. A similar study conducted in these regions by Vodounon et al. (Reference Vodounon, Soude and Mamadou2022), using the same data, reported a significantly lower $ {R}^2 $ of 80.17% for SPI6, highlighting the superior performance of our models. This discrepancy may be due to the spatio-temporal aspects of the data not being fully accounted for in their study, potentially leading to less reliable results (Botache et al., Reference Botache, Dingel, Huhnstock, Ehresmann and Sick2023).

When assessed across all locations, the DL Conv1D-LSTM model emerged as the best overall performer. Its combination of convolutional layers and recurrent networks enables the capture of both local patterns and temporal dependencies, which is critical in weather forecasting where complex dynamics are involved (Shi et al., Reference Shi, Chen, Wang, Yeung, W-K and Woo2015). However, from a sustainability perspective, traditional ML models may be preferable in some cases: they often achieve performance comparable to DL models while generating far lower $ {CO}_2 $ emissions. For example, in Kandi, Conv1D-LSTM outperformed XGBoost by only 0.7% in $ {R}^2 $ but produced 126 times more $ {CO}_2 $ emissions. This raises the question of whether marginal accuracy gains justify significantly higher environmental costs, especially in the context of climate change mitigation.

Beyond accuracy, this study emphasizes the importance of quantifying predictive uncertainty in agricultural applications. High accuracy does not imply high certainty; decision-makers should be informed of model uncertainty to fully understand the implications of using the forecasts.

Several limitations should be noted. First, the analysis does not cover all agricultural regions of the country, leaving some areas without information. Second, the study focuses on retrospective predictions, whereas stakeholders require forward-looking drought forecasts to enable proactive measures. Third, the analysis is limited to short-term drought assessment using the 6-month SPI, without addressing longer-term drought conditions. Finally, the use of gridded datasets introduces inherent uncertainties, which may increase the overall error of the predictive models.

5. Conclusion

Droughts pose significant challenges to food security, particularly when they occur in major agricultural regions. This study focused on quantifying the uncertainty of drought predictions using ML and DL techniques across the six localities in the Alibori department of Benin. The results demonstrated high predictive accuracy for SPI6, with $ {R}^2 $ values of 98.29, 97.84, 97.76, 97.42, 96.61, and 97.07% in Banikoara, Gogounou, Malanville, Kandi, Segbana, and Karimama, respectively. The Borda Count rankings highlighted the Conv1D-LSTM model as the top performer in two localities and second in another two, suggesting its strong suitability for spatiotemporal drought prediction. This analysis underscores the importance of considering model uncertainty when selecting the best predictive approach, as it allows for a balance between accuracy and robustness. Given the coarse resolution of the data used in this study (~50 km), future research should consider utilizing higher-resolution datasets to better capture the spatial variability within each locality, which may improve model performance and reliability.

Open peer review

To view the open peer review materials for this article, please visit http://doi.org/10.1017/eds.2025.10025.

Author contribution

Conceptualization: B.L., A.D.T., R.V., K.B.D.S., and A.I. Data curation: A.I., B.L., R.V., K.B.D.S., and A.D.T. Formal analysis: B.L., A.D.T., and A.I. Methodology: B.L., A.D.T., R.V., K.B.D.S., and A.I. Software: B.L., A.D.T., and A.I. Writing—original draft: A.D.T., K.B.D.S., B.L., and A.I. Writing—review and editing: K.B.D.S., B.L., A.I., and A.D.T. All authors approved the final submitted draft.

Competing interests

The authors declare none.

Data availability statement

The code for this study is archived on Zenodo at https://zenodo.org/records/16815747.

Funding statement

The authors did not receive any support from any organization for the submitted work.

Footnotes

This research article was awarded Open Data and Open Materials badges for transparent practices. See the Data Availability Statement for details.

References

Adhikari, R and Agrawal, RK (2013) An introductory study on time series modeling and forecasting. arXiv preprint arXiv:1302.6613.
Aïzansi, AN, Ogunjobi, KO and Ogou, FK (2024) Monthly rainfall prediction using artificial neural network (case study: Republic of Benin). Environmental Data Science 3, e11. https://doi.org/10.1017/eds.2024.10.
Aryal, JP, Rahut, DB and Marenya, P (2023) Do ex-ante and ex-post strategies to manage drought differ at the farm household level? A case of sub-Saharan Africa. Environment, Development and Sustainability 25(10), 10145–10169. https://doi.org/10.1007/s10668-023-03660-9.
Auer, A, Gauch, M, Klotz, D and Hochreiter, S (2023) Conformal prediction for time series with modern Hopfield networks. Advances in Neural Information Processing Systems.
Barber, RF, Candès, EJ, Ramdas, A and Tibshirani, RJ (2023) Conformal prediction beyond exchangeability. The Annals of Statistics. https://doi.org/10.1214/23-AOS2276.
Basak, D, Pal, S and Patranabis, DC (2007) Support vector regression. Neural Information Processing—Letters and Reviews 11, 203–224.
Botache, D, Dingel, K, Huhnstock, R, Ehresmann, A and Sick, B (2023) Unraveling the complexity of splitting sequential data: Tackling challenges in video and time series analysis. arXiv preprint arXiv:2307.14294.
Breiman, L (2001) Random forests. Machine Learning 45, 5–32. https://doi.org/10.1023/A:1010933404324.
Chen, T and Guestrin, C (2016) XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16). ACM, pp. 785–794. https://doi.org/10.1145/2939672.2939785.
Cherif, Y and Idri, A (2025) Single and ensemble based filters in environmental data. Expert Systems 42(7), e70076. https://doi.org/10.1111/exsy.70076.
Chung, J, Gulcehre, C, Cho, K and Bengio, Y (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. https://doi.org/10.48550/arXiv.1412.3555.
Courty, B, Schmidt, V, Luccioni, S, et al. (2024) mlco2/codecarbon: v2.4.1.
Emmanuel, LA, Hounguè, NR, Biaou, CA and Badou, DF (2019) Statistical analysis of recent and future rainfall and temperature variability in the Mono River watershed (Benin, Togo). Climate 7(1), 8. https://doi.org/10.3390/cli7010008.
FAO, IFAD, UNICEF, WFP and WHO (2021) The State of Food Security and Nutrition in the World 2021.
Gruber, C, Schenk, PO, Schierholz, M, Kreuter, F and Kauermann, G (2023) Sources of uncertainty in supervised machine learning—A statisticians’ view. arXiv preprint arXiv:2305.16703.
Halissou, Y, Eric, AA, Eliézer, BI, Ezéchiel, O, Bio, TD and Abel, A (2022) History and projection of hydrological droughts in the Benin basin. Journal of Atmospheric Science Research 5(2), 33–51. https://doi.org/10.30564/jasr.v5i2.4602.
Hamdouchi, A and Idri, A (2025) Empowering IoT security: Deploying TinyML ensemble techniques for cyberattack detection. Scientific African 29, e02809. https://doi.org/10.1016/j.sciaf.2025.e02809.
Hochreiter, S and Schmidhuber, J (1997) Long short-term memory. Neural Computation 9(8), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735.
Hounkpè, J, Diekkrüger, B, Badou, DF and Afouda, AA (2016) Change in heavy rainfall characteristics over the Ouémé River basin. Climate 4(1), 15. https://doi.org/10.3390/cli4010015.
Huang, M (2020) Theory and implementation of linear regression. In 2020 International Conference on Computer Vision, Image and Deep Learning (CVIDL), pp. 210–217. https://doi.org/10.1109/CVIDL51233.2020.00-99.
Jalaian, B, Lee, M and Russell, S (2019) Uncertain context: Uncertainty quantification in machine learning. AI Magazine 40(4), 40–49. https://doi.org/10.1609/aimag.v40i4.4812.
Kazadi, A, Doss-Gollin, J, Sebastian, A and Silva, A (2024) FloodGNN-GRU: A spatio-temporal graph neural network for flood prediction. Environmental Data Science 3, e21. https://doi.org/10.1017/eds.2024.19.
Ke, G, Meng, Q, Finley, T, Wang, T, Chen, W, Ma, W, Ye, Q and Liu, T-Y (2017) LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, December 2017, pp. 3149–3157.
Kiranyaz, S, Avci, O, Abdeljaber, O, Ince, T, Gabbouj, M and Inman, DJ (2021) 1D convolutional neural networks and applications: A survey. Mechanical Systems and Signal Processing 151, 107398. https://doi.org/10.1016/j.ymssp.2020.107398.
Lawin, EA, Houngue, NR, Biaou, CA and Badou, DF (2019) Statistical analysis of recent and future rainfall and temperature variability in the Mono River watershed (Benin, Togo). Climate 7(1), 8. https://doi.org/10.3390/cli7010008.
Lee, J, Xu, C and Xie, Y (2024) Transformer conformal prediction for time series. arXiv preprint arXiv:2406.05332.
Lilburne, L, Helfenstein, A, Heuvelink, GBM and Eger, A (2024) Interpreting and evaluating digital soil mapping prediction uncertainty: A case study using texture from SoilGrids. Geoderma 450, 117052. https://doi.org/10.1016/j.geoderma.2024.117052.
Lloyd-Hughes, B and Saunders, MA (2002) A drought climatology for Europe. International Journal of Climatology 22, 1571–1592. https://doi.org/10.1002/joc.846.
Masih, I, Maskey, S, Mussá, FE and Trambauer, P (2014) A review of droughts on the African continent. Hydrology and Earth System Sciences 18, 3635–3649. https://doi.org/10.5194/hess-18-3635-2014.
McKee, TB, Doesken, NJ and Kleist, J (1993) The relationship of drought frequency and duration to time scales. In Proceedings of the 8th Conference on Applied Climatology 17(22), pp. 179–183. https://climate.colostate.edu/pdfs/relationshipofdroughtfrequency.pdf.
Mir, YH, Mir, S, Ganie, MA, Bhat, JA, Shah, AM, Mushtaq, M and Irshad, I (2025) Overview of land use and land cover change and its impacts on natural resources. In Jatav, HS, Rajput, VD and Minkina, T (eds), Ecologically Mediated Development, Sustainable Development and Biodiversity, vol. 41. Singapore: Springer.
Mooers, G, Beucler, T, Pritchard, M and Mandt, S (2024) Understanding precipitation changes through unsupervised machine learning. Environmental Data Science 3, e3. https://doi.org/10.1017/eds.2024.1.
Nakach, F-Z, Idri, A and Tchokponhoue, GAD (2025) Multimodal random subspace for breast cancer molecular subtypes prediction by integrating multi-dimensional data. Multimedia Tools and Applications 84(27), 32671–32703. https://doi.org/10.1007/s11042-024-20504-4.
Oluwatayo, I and Ojo, A (2016) Is Africa’s dependence on agriculture the cause of poverty? The Journal of Developing Areas 50, 93–102. https://doi.org/10.1353/jda.2016.0016.
Oyerinde, GT and Olowookere, BT (2018) Observed shift and merge of hydrological regimes in the Sota catchment. International Journal of Research-GRANTHAALAYAH 6(6), 205–211.
Paeth, H, Capo-Chichi, A and Endlicher, W (2008) Climate change and food security in tropical West Africa. Erdkunde 62(2), 101–115. https://doi.org/10.3112/erdkunde.2008.02.01.
Pedregosa, F, Varoquaux, G, Gramfort, A, Michel, V, Thirion, B, Grisel, O, Blondel, M, Prettenhofer, P, Weiss, R, Dubourg, V, Vanderplas, J, Passos, A, Cournapeau, D, Brucher, M, Perrot, M and Duchesnay, E (2011) Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830.
Pison, G, Couppié, E and Caporali, A (2022) The population of the world, 2022. Population & Societies 603(8), 1–8.
Poggio, L, De Sousa, LM, Batjes, NH, Heuvelink, GBM, Kempen, B, Ribeiro, E and Rossiter, D (2021) SoilGrids 2.0: Producing soil information for the globe with quantified spatial uncertainty. Soil 7, 217–240. https://doi.org/10.5194/soil-7-217-2021.
République du Bénin – Ministère de l’Agriculture, de l’Élevage et de la Pêche (MAEP) (2017) Plan Stratégique de Développement du Secteur Agricole (PSDSA): Orientations stratégiques 2025. Plan National d’Investissements Agricoles et de Sécurité Alimentaire et Nutritionnelle (PNIASAN) 2017–2021. Version finale. Cotonou, Bénin: Ministère de l’Agriculture, de l’Élevage et de la Pêche (MAEP) et NEPAD.
Scheiterle, L and Birner, R (2018) Assessment of Ghana’s comparative advantage in maize production and the role of fertilizers. Sustainability 10(11), 4181. https://doi.org/10.3390/su10114181.
Shi, X, Chen, Z, Wang, H, Yeung, D-Y, Wong, W-K and Woo, W-C (2015) Convolutional LSTM network: A machine learning approach for precipitation nowcasting. Advances in Neural Information Processing Systems 28, 802–810.
Simperegui, KBD, Kouame, AKK, Kwesie, B, Bindraban, PS, Adzawla, W, Asamoah, E and El Gharous, M (2025) Digital mapping of Ghana’s soil properties and nutrients: Performance of spline and weighted average approaches. Geoderma 459, 117365. https://doi.org/10.1016/j.geoderma.2025.117365.
Stankevičiūtė, K, Alaa, AM and van der Schaar, M (2021) Conformal time-series forecasting. Advances in Neural Information Processing Systems 34, 6216–6228.
Stübinger, J and Adler, K (2020) How to identify varying lead–lag effects in time series. Algorithms 13(4), 95. https://doi.org/10.3390/a13040095.
Taylor, KE (2001) Summarizing multiple aspects of model performance in a single diagram. Journal of Geophysical Research: Atmospheres 106(D7), 7183–7192. https://doi.org/10.1029/2000JD900719.
United Nations Department of Economic and Social Affairs, Population Division (2022) World Population Prospects 2022.
Vodounon, RBW, Soude, H and Mamadou, O (2022) Drought forecasting in Alibori Department in Benin using the standardized precipitation index and machine learning approaches. International Journal of Advanced Computer Science and Applications (IJACSA) 13(12). https://doi.org/10.14569/IJACSA.2022.01312113.
Vovk, V, Gammerman, A and Shafer, G (2005) Algorithmic Learning in a Random World. New York: Springer.
WMO (2017) Guidelines on the Calculation of Climate Normals (WMO-No. 1203). Technical Report. Geneva: World Meteorological Organization.
Wright, S (1921) Correlation and causation. Journal of Agricultural Research 20, 557–585.
Xu, C and Xie, Y (2023) Conformal prediction for time series. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(10), 11575–11587. https://doi.org/10.1109/TPAMI.2023.3272339.
Yihdego, Y, Vaheddoost, B and Al-Weshah, RA (2019) Drought indices and indicators revisited. Arabian Journal of Geosciences 12, 69. https://doi.org/10.1007/s12517-019-4237-z.
Zerouaoui, H, Idri, A and El Alaoui, O (2025) DHHoE: Deep hybrid homogenous ensemble for digital histological breast cancer classification. Expert Systems 42(1), e13397. https://doi.org/10.1111/exsy.13397.
Figure 1. Flowchart of the proposed methodology: From multivariate time series preprocessing to model training, uncertainty quantification with EnbPI, and final performance evaluation using various statistical metrics.

Figure 2. Alibori department cartography and geographic distribution of the locations. The cropland cover is derived from the MODIS land cover.

Table 1. Description of weather variables used in the study

Table 2. SPI interpretation (Lloyd-Hughes and Saunders, 2002)

Table 3. Lag retained per covariate and locality

Figure 3. Barplots of the correlations between SPI6 and lagged weather variables.

Table 4. Architecture of the Conv1D model used in the study

Table 5. Architecture of the LSTM model used in the study

Table 6. Architecture of the GRU model used in the study

Table 7. Architecture of the Conv1D-LSTM hybrid model used in the study

Table 8. Performance of different models for Malanville

Figure 4. Malanville’s prediction intervals and observations of the most effective algorithm (a) and the least effective algorithm (b) using the EnbPI conformal prediction method for the SPI6. The shaded blue area represents the 90% prediction interval, the black line indicates the model’s point predictions, green dots show observations within the interval, and red dots denote observations outside the interval.

Figure 5. Malanville’s Taylor diagram.

Table 9. Performance of different models for Karimama

Figure 6. Karimama’s prediction intervals and observations of the most effective algorithm (a) and the least effective algorithm (b) using the EnbPI conformal prediction method for the SPI6. The shaded blue area represents the 90% prediction interval, the black line indicates the model’s point predictions, green dots show observations within the interval, and red dots denote observations outside the interval.

Figure 7. Karimama’s Taylor diagram.

Table 10. Performance of different models for Kandi

Figure 8. Kandi’s prediction intervals and observations of the most effective algorithm (a) and the least effective algorithm (b) using the EnbPI conformal prediction method for the SPI6. The shaded blue area represents the 90% prediction interval, the black line indicates the model’s point predictions, green dots show observations within the interval, and red dots denote observations outside the interval.

Figure 9. Kandi’s Taylor diagram.

Table 11. Performance of different models for Gogounou

Figure 10. Gogounou’s prediction intervals and observations of the most effective algorithm (a) and the least effective algorithm (b) using the EnbPI conformal prediction method for the SPI6. The shaded blue area represents the 90% prediction interval, the black line indicates the model’s point predictions, green dots show observations within the interval, and red dots denote observations outside the interval.

Figure 11. Gogounou’s Taylor diagram.

Table 12. Performance of different models for Banikoara

Figure 12. Banikoara’s prediction intervals and observations of the most effective algorithm (a) and the least effective algorithm (b) using the EnbPI conformal prediction method for the SPI6. The shaded blue area represents the 90% prediction interval, the black line indicates the model’s point predictions, green dots show observations within the interval, and red dots denote observations outside the interval.

Figure 13. Banikoara’s Taylor diagram.

Table 13. Performance of different models for Segbana

Figure 14. Segbana’s prediction intervals and observations of the most effective algorithm (a) and the least effective algorithm (b) using the EnbPI conformal prediction method for the SPI6. The shaded blue area represents the 90% prediction interval, the black line indicates the model’s point predictions, green dots show observations within the interval, and red dots denote observations outside the interval.

Figure 15. Segbana’s Taylor diagram.

Author comment: Prediction and uncertainty quantification of drought in North Benin — R0/PR1

Comments

Bernardin LIGAN

University Mohammed VI Polytechnic (UM6P)

Ben Guerir, Morocco, 43150

Bernardin.LIGAN@um6p.ma

September 11, 2024.

Environmental Data Science (EDS)

Dear Editors,

I am pleased to submit our manuscript titled Prediction and Uncertainty Quantification of Drought in North Benin for consideration for publication in the EDS Journal. Our study addresses a critical issue in climate science and agricultural planning by developing a novel approach for drought forecasting that incorporates uncertainty quantification in North Benin, a region where agriculture is the primary livelihood for the majority of the population.

In our research, we focused on six localities within the Alibori department of North Benin: Banikoara, Gogounou, Kandi, Karimama, Malanville, and Segbana. Each of these areas faces unique challenges due to frequent droughts, which have severe socio-economic impacts. To enhance the reliability of drought predictions, we employed an Ensemble Batch Prediction Interval (EnbPI) method, a conformal prediction approach tailored to time series data, to effectively quantify uncertainties in our forecasts.

Our study leveraged a comprehensive range of machine learning (ML) and deep learning (DL) models—linear regression, ridge regression, random forest, XGBoost, LightGBM, SVR, Conv1D, LSTM, GRU, and Conv1D-LSTM—to predict drought scenarios and assess their performance through a Borda count methodology. This comparative analysis was based on metrics such as R², RMSE, MAE, carbon footprint, and the coverage and width of prediction intervals. Our results demonstrated that the Conv1D-LSTM model provided the best balance between accuracy and uncertainty coverage, achieving a high R² value across all six localities.

We believe this research makes a significant contribution to the field by demonstrating the feasibility and importance of uncertainty quantification in drought forecasting, particularly in low-income countries like Benin. Our findings can inform decision-makers, agricultural researchers, and local communities to adopt more resilient strategies in response to drought scenarios.

Furthermore, in line with our commitment to transparency and reproducibility, we have made our full code and dataset publicly available on GitHub, enabling other researchers to build upon our work and further enhance the prediction of droughts in similar settings.

We believe that our manuscript is well-suited for publication in EDS, given its relevance to the journal’s readership and its potential to contribute valuable insights into the intersection of climate science, machine learning, and agricultural sustainability.

We appreciate your consideration of our manuscript and look forward to your feedback.

Thank you for your time and consideration.

Sincerely,

Bernardin LIGAN

Review: Prediction and uncertainty quantification of drought in North Benin — R0/PR2

Conflict of interest statement

No

Comments

Report on Prediction and uncertainty quantification of drought in North Benin

The study attempts to identify the uncertainties from the meteorological drought index, namely the SPI, through applying machine learning and deep learning in Northern Benin with datasets covering 1981–2021. Some results are shown, especially the R-square (R2), but others can be seen. The topic is interesting for water resource monitoring over the region and environmental management. It fits well with the journal, with clear objectives. However, the choice of the SPI time scale lacks justification. In addition, the introduction should be updated with at least three to five recent references (e.g., 2023 and 2024).

My major concerns are listed as follows:

1) The data description is missing, which is very critical for understanding the performance of the models used. I was particularly waiting to see the training model, the validation, and then the prediction with their associated uncertainties. Please, at least consider using station data for Kandi to validate the gridded data before proceeding.

2) While trying to predict the drought and assert the uncertainties, it is difficult to figure out from the methodology the employed equations for each machine learning method. I encourage the authors to show the equations for the methods used. For instance, RNN has been applied, how many layers and sub-layers (hidden) were utilised, the weight applied, etc.

3) The uncertainties are not shown except the R-square which is likely to be contribution in the section of abstract

4) Please, use more recent references. For example, one recent reference, you may refer to is: Environmental Data Science, Volume 3, 2024, e11; DOI: https://doi.org/10.1017/eds.2024.10 (Arsène Nounangnon Aïzansi et al, 2024)

5) Be aware of autocorrelation among the variables used to predict the SPI.

Specific comments

1) In the abstract, please clarify the values of the uncertainties and the SPI scale, and the reason for choosing the scale. Did you try 12-month, 15-month, 24-month, or 3-month?

2) Abstract: to tackle or to enable? Please, revise

3) I suggest that author (s) describe (s) in a table all the data used, time spanning, horizontal resolution, the variables.

4) Under the experimental part, “A batch size of 64 was used for all model training, with the number of epochs adjusted per locality to prevent overfitting”. Verify the spelling of overfitting

5) From “followed by three …. and linear activations, respectively”, why “D” are in capital ?

6) “which violate the exchangeability assumption these methods rely on (Barber et al., 2023; Xu and Xie, 2023)”. Rely on …; the sentence is incomplete.

7) “employing 30 bootstraps”, does it mean the size sample chosen for bootstrapping?

8) Figure 8, the title is confusing. Is it R-square that is presented in the figure or MAE? Please, clarify.

9) There are minor adjustments of some sentences, please read carefully and correct.

Review: Prediction and uncertainty quantification of drought in North Benin — R0/PR3

Conflict of interest statement

Reviewer declares none.

Comments

Ligan et al. present an uncertainty-aware drought prediction model for six localities in Benin’s Alibori department. The study addresses an important issue and suggests a potentially useful and efficient tool. However, the paper suffers from several significant flaws that prevent a proper assessment of the results' reliability. If the following suggested revisions are addressed, the paper will be suitable for publication.

Section 1

- Correct the typo “de” in line 1 on page 3.

Section 2.1

- Correct the typo “Pi” in equation (1) on page 5.

Section 2.2

- The sections in the manuscript are ordered incorrectly (2.2, 2.1, 2.2, 2.3) from page 4. Could you please correct the section numbering to follow a logical order? Also correct the table numbering to follow a logical order to match its references in the manuscript.

- Once you introduce an acronym, you don’t need to write out the full phrase again.

--- For example, Standardized Precipitation Index (SPI) should only be spelled out the first time, and after that, you can simply use SPI throughout the rest of the paper. Similarly to others.

--- Are you using atmospheric data specific to each site, or are you relying on gridded data for broader regions?

--- Could you provide details on how the data for each town were extracted?

--- Did you consider using the local grid cell corresponding to each town’s longitude and latitude coordinates?

- There are other data sources, such as ERA5 from the Climate Data Store, that provide finer spatial resolution (e.g., 0.25° × 0.25°) compared to NASA POWER (1/2° × 5/8°). What led you to choose NASA POWER over these alternatives?

--- Did you consider using ERA5 or similar datasets for better spatial detail?

--- Were there specific advantages of NASA POWER (e.g., data availability, temporal coverage, or preprocessing convenience) that influenced your choice?

- Add the source of the SPI interpretation Table 2 on page 5.

- The table of lagged variables (page 6) is interesting. It would be valuable to explore this data further.

--- Consider adding bar charts for each site, showing the correlations per month of lag for each variable. Although this would require additional space, it is an important aspect worth highlighting.

- Add the Min-Max scaler formula.

- Include a methodology flowchart for better clarity.

- Clarify the normalization process:

---Was normalization done before or after splitting the dataset into training, validation, and test sets?

---To avoid data leakage, it is crucial to split the dataset first and then apply normalization only to the training set. The same normalization parameters should then be used for the validation and test sets.

--- Since real-world applications do not have access to test data during training, the normalization approach needs to be explicitly justified. Was this procedure followed in this study?

- Can you give information about the computer specifications used for the analysis?

Section 2.3.1

- Discuss the limitations of each model used.

Section 2.3.2

- Clarify the description of the validation set:

--- The text states: “To train the models, an additional validation set was created using 80% of the training dataset.” This phrasing might confuse readers, as it could suggest that the validation set itself comprises 80% of the training data. Consider rephrasing for clarity.

- Specify the approach used to adjust the number of epochs (e.g., validation monitoring, early stopping) to prevent overfitting.

- Provide details on hyperparameters, including:Learning rate, Optimizer, Any regularization techniques applied…

Section 2.4

- Write the limitations of the Ensemble Batch Prediction Interval methods

- Explain how uncertainty quantification affects the usefulness of predictions.

- Explain the rationale for selecting the Borda Count method, with more references

- Add sources for the evaluation metrics used.

- The formula used for R² in your study appears to be different from the conventional definition. Did this formulation come from a specific reference or prior work?

- Consider adding a Taylor diagram to visually compare model performance in terms of correlation, variance, and overall error. This would provide a more intuitive and comprehensive assessment of model accuracy and robustness.

Section 3

- Please increase the font size of the plot’s titles and axes, which are currently difficult to read.

- Include a visualization of the loss function during both training and validation to monitor potential overfitting or underfitting.

- Did you rescale the predicted values back to their original units before computing evaluation metrics such as RMSE, MAE, and R²? Make it clear in the methodology flowchart for better clarity.

--- Since the data was scaled before training, computing metrics directly on the scaled data may lead to values that are not interpretable in the original context.

--- If rescaling was not done before metric calculation, could you clarify whether this affects the reported performance of the models?

Section 4

- More information regarding the impact of this work would be helpful. Provide stakeholders important insights.

- Discuss the limitations of relying solely on SPI6. Add the study’s limitations and your suggestions to the results.

The authors should make it clear how novel the research is. Without a compelling argument for its uniqueness, the paper may fail to captivate readers. It is critical for authors to explain what distinguishes their work and how it advances the field.

- What novel contributions does your research make in terms of methodology or application?

- How does your approach improve uncertainty quantification in drought forecasting for these specific regions?

Recommendation: Prediction and uncertainty quantification of drought in North Benin — R0/PR4

Comments

Please use the GitHub integration with Zenodo to create a permanent record of your data, code and models, and update your Data Availability Statement to include the Zenodo DOI: https://docs.github.com/en/repositories/archiving-a-github-repository/referencing-and-citing-content

Decision: Prediction and uncertainty quantification of drought in North Benin — R0/PR5

Comments

No accompanying comment.

Author comment: Prediction and uncertainty quantification of drought in North Benin — R1/PR6

Comments

Cover Letter – Revised Manuscript Submission

Bernardin LIGAN

University Mohammed VI Polytechnic (UM6P)

Ben Guerir, Morocco, 43150

Bernardin.LIGAN@um6p.ma

August 15, 2025

Environmental Data Science (EDS)

Dear Editors,

I am pleased to resubmit our revised manuscript titled Prediction and Uncertainty Quantification of

Drought in North Benin for consideration in Environmental Data Science. This work addresses the

critical challenge of drought forecasting in a vulnerable region where agriculture is the main source

of livelihood. Our approach combines state-of-the-art machine learning and deep learning models

with the Ensemble Batch Prediction Interval (EnbPI) method to produce reliable forecasts and

quantify predictive uncertainties.

In the first submission, we presented results for six localities in North Benin—Banikoara,

Gogounou, Kandi, Karimama, Malanville, and Segbana—highlighting the capacity of hybrid deep

learning models, particularly Conv1D-LSTM, to achieve a strong balance between predictive

accuracy and robust uncertainty coverage.

Following the reviewers’ valuable feedback, we have substantially improved the manuscript:

• Clarified methodological details, including normalization procedures, avoidance of data

leakage, and scaling back predictions before computing metrics.

• Added missing equations, expanded model descriptions with limitations, and integrated a

methodology flowchart.

• Included a Taylor diagram for visual model comparison and enhanced the explanation of the

Borda count method with relevant references.

• Corrected formula inconsistencies, reorganized section and table numbering, and updated

execution environment details.

• Ensured complete transparency and reproducibility by making the dataset available on

Zenodo and linking the source code on GitHub with a Zenodo DOI.

We believe that these revisions have significantly strengthened the manuscript, both in

methodological rigor and clarity of presentation. We are confident that the improved version offers

greater value to the journal’s readership and advances the understanding of uncertainty-aware

drought forecasting in data-scarce regions.

Thank you for considering our revised submission. We look forward to your favorable response.

Sincerely,

Bernardin LIGAN

Review: Prediction and uncertainty quantification of drought in North Benin — R1/PR7

Conflict of interest statement

Reviewer declares none.

Comments

It is partly improved and I have no further comments.

Review: Prediction and uncertainty quantification of drought in North Benin — R1/PR8

Conflict of interest statement

There are no competing interests

Comments

The study aims to develop an uncertainty-aware prediction model for drought forecasting in six key localities within the Alibori department—Banikoara, Gogounou, Kandi, Karimama, Malanville, and Segbana—each facing unique challenges due to drought. There are some minor concerns that should be addressed.

Specific comments

1) In the abstract, the full name should be given before using an abbreviation; for example, Deep Learning (DL) should be defined at first use. Please verify this throughout the MS, line 30.

2) Which model gave which performance, i.e., XGBoost, LightGBM, or which one?

Most work considers RMSE one of the best metrics for model comparison; however, the authors preferred interval coverage and R-square. What are the advantages of reporting interval coverage values over the RMSE?

3) Line 45; uncertainty quantification “of” drought ….

4) Line 13 on page 14, “cRMSD(Centered Root-Mean-Squared-Difference)”: a space is needed.

5) References section, https://scikit-learn.org/stable/about.html#citing-scikit-learn; Authors can get the reference as indicated (If you use scikit-learn in a scientific publication, we would appreciate citations to the following paper: Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011).

Recommendation: Prediction and uncertainty quantification of drought in North Benin — R1/PR9

Comments

There are still some open matters mentioned by one of the Reviewers. The Associate Editor would also like to point out that certain Authors’ responses to Reviewers’ comments could have been better or more prominently incorporated in the manuscript. For example, the Authors’ responses to Reviewers’ inquiries on “uncertainty quantification” with regard to the proposed method should have been stressed more clearly, as it seems a centerpiece of the Authors’ work.

Decision: Prediction and uncertainty quantification of drought in North Benin — R1/PR10

Comments

No accompanying comment.

Author comment: Prediction and uncertainty quantification of drought in North Benin — R2/PR11

Comments

Dear Editors,

We are pleased to resubmit our second revised manuscript entitled Prediction and Uncertainty Quantification of Drought in North Benin for further consideration in Environmental Data Science.

This study is one of the first of its kind in a low-income country context, focusing on six localities in North Benin that are particularly vulnerable to drought. Our approach integrates machine learning, deep learning, and conformal prediction methods to deliver both accurate forecasts and reliable uncertainty quantification. This dual perspective is central to our contribution, as it provides decision-makers, researchers, and local communities with robust tools to anticipate and respond to drought scenarios.

In this second revision, we carefully addressed the remaining open matters raised by the reviewers and the Associate Editor. In particular:

We have more clearly emphasized the role of uncertainty quantification, highlighting its centrality in both methodology and results.

We revised the abstract to ensure that abbreviations (e.g., Deep Learning, DL) are defined at first mention, and that key metrics (RMSE, R2, and interval coverage) are presented with greater clarity.

We explained why interval coverage is particularly relevant for assessing model performance in an uncertainty-aware setting, complementing conventional metrics such as RMSE.
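The distinction between the two metrics can be sketched in a few lines: RMSE scores only the point forecast, whereas empirical coverage (reported alongside the mean interval width) scores the calibration of the prediction intervals themselves. The sketch below uses synthetic data and a hypothetical fixed-width 90% interval purely for illustration; it is not the EnbPI procedure used in the manuscript.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical SPI-6 observations and point forecasts (illustrative only).
y_true = rng.normal(0.0, 1.0, size=200)
y_pred = y_true + rng.normal(0.0, 0.3, size=200)

# Hypothetical prediction intervals of fixed half-width around each forecast.
half_width = 0.6
lower, upper = y_pred - half_width, y_pred + half_width

# RMSE measures average point-forecast error only.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Empirical coverage is the fraction of observations falling inside their
# interval; for a well-calibrated 90% interval it should be close to 0.90.
coverage = np.mean((y_true >= lower) & (y_true <= upper))
mean_width = np.mean(upper - lower)

print(f"RMSE={rmse:.3f}  coverage={coverage:.3f}  width={mean_width:.3f}")
```

Two models can share the same RMSE while one issues intervals that are too narrow (coverage well below the nominal level) and the other intervals that are uselessly wide, which is why coverage and width are reported together.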

We implemented all minor corrections suggested (grammar, phrasing, and formatting issues such as spacing in cRMSD).

We updated the references to properly cite the scikit-learn library (Pedregosa et al., JMLR 2011), as recommended.

We believe that this second revised version strengthens the manuscript both in methodological rigor and in clarity of presentation. We are confident that the improved version now fully addresses the reviewers’ concerns and is suitable for publication.

We thank you for your continued consideration and look forward to your feedback.

Sincerely,

Bernardin LIGAN (on behalf of all co-authors)

Review: Prediction and uncertainty quantification of drought in North Benin — R2/PR12

Conflict of interest statement

Reviewer declares none.

Comments

I have no more comments to authors.

Recommendation: Prediction and uncertainty quantification of drought in North Benin — R2/PR13

Comments

No accompanying comment.

Decision: Prediction and uncertainty quantification of drought in North Benin — R2/PR14

Comments

No accompanying comment.