Impact Statement
We show the promise of machine learning for longer-term solar forecasting with probabilistic predictions, an area that has not been sufficiently explored in the literature. Our encouraging results suggest such methods could play a larger role in future power system operations when greater shares of renewable energy resources will require operational planning at these timescales. For example, these methods could inform the operation of hybrid power plants with storage capabilities, where information about expected future renewable power generation would weigh into decisions on storage charging and discharging.
1. Introduction
Renewable energy sources, like solar, wind, tidal, or geothermal energy, have the potential to reduce the world’s dependency on fossil fuels. These resources are not only abundantly available in nature but they are also clean energy sources, reducing greenhouse gas emissions that lead to global warming. However, many of these resources are variable and uncertain, posing challenges for integration into a power system which is predicated upon dispatchable supply. There is, therefore, a growing need for accurate renewable energy forecasting to ease integration into electric grids. Solar photovoltaic (PV) systems are experiencing exponential growth in deployment, and the output of PV systems is highly dependent on solar irradiance (Alzahrani et al., Reference Alzahrani, Shamsi, Dagli and Ferdowsi2017). A number of physical and statistical models have been used for making solar forecasts at different timescales from intra-hour to a few days ahead (Tuohy et al., Reference Tuohy, Zack, Haupt, Sharp, Ahlstrom, Dise, Grimit, Mohrlen, Lange, Casado, Black, Marquis and Collier2015; Wang et al., Reference Wang, Lei, Zhang, Zhou and Peng2019). Statistical methods have been shown to perform well at forecasting at very short time horizons, with numerical weather prediction (NWP) models outperforming them in the hours to days-ahead timeframe (Tuohy et al., Reference Tuohy, Zack, Haupt, Sharp, Ahlstrom, Dise, Grimit, Mohrlen, Lange, Casado, Black, Marquis and Collier2015).
Most physical models in this domain are based on NWP simulations that traditionally provide more accurate forecasts at hours to days-ahead lead times (Tuohy et al., Reference Tuohy, Zack, Haupt, Sharp, Ahlstrom, Dise, Grimit, Mohrlen, Lange, Casado, Black, Marquis and Collier2015). However, due to their computational expense, NWP model outputs are updated less frequently and with coarser resolution at longer prediction lead times, such as week(s) ahead. This motivates the need for data-driven machine learning models that can provide forecasts at longer periods in advance at a finer (1 hr) resolution (as opposed to e.g., the 12 hr resolution in the case of the European Centre for Medium-Range Weather Forecasts (ECMWF) model predictions). As a part of our study, we not only perform a direct comparison with the NWP baseline for our 1-week ahead forecasts, but we also evaluate our models’ performance when they incorporate NWP outputs as input features to see if it improves their forecasting ability.
Probabilistic forecasting provides a distribution over the prediction, and this additional knowledge of uncertainty can provide advantages over point forecasting. For example, knowing about future time periods of low and high uncertainty in advance can be very useful in planning plant maintenance (Zelikman et al., Reference Zelikman, Zhou, Irvin, Raterink, Sheng, Kelly, Rajagopal, Ng and Gagne2020). Until recently, probabilistic forecasting for solar energy had not received as much attention as for wind energy, as observed by Doubleday et al. (Reference Doubleday, Hernandez and Hodge2020). In their work, Doubleday et al. (Reference Doubleday, Hernandez and Hodge2020) introduce probabilistic benchmarks to evaluate probabilistic methods, which we utilize in this work.
1.1. Contributions
We propose deep sequence learning for this longer lead-time (1 week ahead) solar irradiance forecasting task, providing point as well as probabilistic predictions. Overall, these deep learning pipelines outperform several benchmarks from the literature, including NWP models and a machine learning-based probabilistic prediction method. The results fall slightly behind the complete history persistence ensemble (CH-PeEN) benchmark (Doubleday et al., Reference Doubleday, Hernandez and Hodge2020) in terms of continuous ranked probability score (CRPS), but are better in terms of forecast sharpness.
2. Related Work
A variety of deep learning approaches have been proposed for learning from sequence data, some of which have been applied in the solar energy domain. Recurrent neural networks (RNNs), unlike fully connected neural networks, have the ability to capture temporal dependencies in sequences by incorporating feedback from previous time steps. Long short-term memory (LSTM) models are especially useful for time series data when the inputs can have longer dependencies. The works of Gensler et al. (Reference Gensler, Henze, Sick and Raabe2016), Alzahrani et al. (Reference Alzahrani, Shamsi, Dagli and Ferdowsi2017), Mishra and Palanisamy (Reference Mishra and Palanisamy2018), and Brahma and Wadhvani (Reference Brahma and Wadhvani2020) show the potential of LSTMs for solar energy forecasting, where they outperform fully connected networks and traditional machine learning models at short forecasting lead times. Convolutional neural network (CNN)-based models that use dilated and causal convolutions along with residual connections (also referred to as temporal CNNs) were designed specifically for sequential modeling (Oord et al., Reference Oord, Dieleman, Zen, Simonyan, Vinyals, Graves, Kalchbrenner, Senior and Kavukcuoglu2016; Shaojie Bai and Koltun, Reference Shaojie Bai and Koltun2018). They are autoregressive prediction models based on the recent WaveNet architecture (Oord et al., Reference Oord, Dieleman, Zen, Simonyan, Vinyals, Graves, Kalchbrenner, Senior and Kavukcuoglu2016). Temporal CNNs have recently been applied to forecasting day-ahead PV power output, outperforming both LSTMs and multilayer feed-forward networks (Lin et al., Reference Lin, Koprinska and Rana2020). They are able to exploit a longer history in the time series, enabling more accurate forecasts. In this work, we study a significantly longer forecast horizon that challenges the limits of NWP forecasting and is expected to have emerging applications as power systems evolve.
We compare LSTMs, temporal CNNs, temporal CNNs with an added attention layer (Saha et al., Reference Saha, Naik and Monteleoni2020), and the transformer model (Vaswani et al., Reference Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin, Guyon, Luxburg, Bengio, Wallach, Fergus, Vishwanathan and Garnett2017).
Recently, Zelikman et al. (Reference Zelikman, Zhou, Irvin, Raterink, Sheng, Kelly, Rajagopal, Ng and Gagne2020) showed how probabilistic models such as Gaussian processes, neural networks with dropout for uncertainty estimation, and NGBoost (Duan et al., Reference Duan, Anand, Ding, Thai, Basu, Ng and Schuler2020) compare when making short-term solar forecasts. They explored post hoc calibration techniques for improving the forecasts produced by these models. NGBoost, or the Natural Gradient Boosting algorithm (Duan et al., Reference Duan, Anand, Ding, Thai, Basu, Ng and Schuler2020), is a gradient boosting pipeline that is extended to give a probability distribution as an output (the parameters of the distribution are the regressed outputs). We consider NGBoost with a Gaussian output distribution to be a machine learning benchmark in this domain, since it showed superior performance for intra-hour and hourly resolution forecasting (Zelikman et al., Reference Zelikman, Zhou, Irvin, Raterink, Sheng, Kelly, Rajagopal, Ng and Gagne2020). Deep learning-based probabilistic prediction models are, however, yet to be fully explored (Wang et al., Reference Wang, Lei, Zhang, Zhou and Peng2019). In this paper, we extend the deep learning point prediction models mentioned above to yield predictions at multiple quantiles (see Figure 1), as quantile regression is a nonparametric approach to obtain probabilistic forecasts (Wang et al., Reference Wang, Lei, Zhang, Zhou and Peng2019; Saha et al., Reference Saha, Naik and Monteleoni2020).

Figure 1. Fan plot showing the temporal CNN (TCN) model’s prediction intervals from the 5th to the 95th percentile on three March days at the Boulder station.
3. Data
We use data from NOAA’s open-source SURFRAD network (Surface Radiation Budget Network for atmospheric research; Augustine et al., Reference Augustine, DeLuisi and Long2000), which provides ground-truth solar irradiance and meteorological measurements from seven sites across the US in different climatic zones (https://gml.noaa.gov/grad/surfrad/). Models are trained on measurements from the years 2016–2017 and then evaluated on the year 2018. The test data (year 2018) is kept hidden and the rest of the data is split into training and validation sets (70/30 split). Data is converted to an hourly resolution and only the daytime values are considered for training and testing of all models including benchmarks (for relevance to the domain, as in Doubleday et al., Reference Doubleday, Hernandez and Hodge2020). Days with fewer than 24 hr of data points due to missing data were dropped. Following standard practice (Mishra and Palanisamy, Reference Mishra and Palanisamy2018; Doubleday et al., Reference Doubleday, Hernandez and Hodge2020; Zelikman et al., Reference Zelikman, Zhou, Irvin, Raterink, Sheng, Kelly, Rajagopal, Ng and Gagne2020), we take the ratio of the ground-truth Global Horizontal Irradiance (GHI) (W/m²) to the “clear sky” GHI value (an irradiance estimate under cloud-free conditions, obtained from the CAMS McClear Service; Granier et al., Reference Granier, Darras, van der Gon, Jana, Elguindi, Bo, Michael, Marc, Jalkanen, Kuenen, Liousse, Quack, Simpson and Sindelarova2019) to produce a clearness index, which is used as the prediction label for training. While trained on the clearness index, the models are evaluated on the GHI.
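The conversion between GHI and the clearness index can be illustrated with a minimal sketch. The function names and the `min_clear_sky` threshold are hypothetical and only serve the illustration; the original preprocessing code is not described in this paper.

```python
def clearness_index(ghi, clear_sky_ghi, min_clear_sky=1e-6):
    """Ratio of measured GHI to clear-sky GHI for each hour.

    Hours where the clear-sky value is (near) zero, e.g. night, yield None.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    result = []
    for g, c in zip(ghi, clear_sky_ghi):
        result.append(g / c if c > min_clear_sky else None)
    return result


def to_ghi(index, clear_sky_ghi):
    """Convert clearness-index values back to irradiance (W/m^2) for evaluation."""
    return [k * c for k, c in zip(index, clear_sky_ghi)]
```

A model trained on the index would have its predictions passed through something like `to_ghi` before error metrics are computed on GHI.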
Important predictor variables available in the data, such as solar zenith angle, hour of the day, month of the year, wind, pressure, temperature, and relative humidity, are included, along with the clearness index at the hour (a total of 16 input variables overall). These inputs are scaled (standardized) before the modeling procedure. All the sequence models take in a 3D input, where every row is a sequence of input feature vectors corresponding to previous timesteps: we use a history of $ 12\times 7 $ past daylight hours for all our models. Each row in the time series at hour h is assigned a label, which is the clearness index value at the hour h + 1 week.
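One way to picture the windowing is the sketch below, which builds (samples, history, features) inputs and week-ahead labels from an hourly series. The function name is hypothetical, and the default `lead=84` assumes that one week corresponds to 84 retained daylight steps, which is an assumption of this sketch rather than a detail stated in the text.

```python
def make_windows(features, target, history=84, lead=84):
    """Build sliding-window samples from a daylight-only hourly series.

    features: list of per-hour feature vectors (lists of floats)
    target:   list of clearness-index values, aligned with `features`
    history:  number of past steps per sample (12 x 7 = 84 in the text)
    lead:     steps between the last input hour h and the label (assumed value)
    Returns (X, y): X[i] holds `history` consecutive feature vectors ending at
    hour h, and y[i] is target[h + lead].
    """
    X, y = [], []
    for h in range(history - 1, len(features) - lead):
        X.append(features[h - history + 1 : h + 1])
        y.append(target[h + lead])
    return X, y
```

Stacking `X` gives the 3D array expected by sequence models, with one row of 84 feature vectors per sample.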
4. Methods
We focus on showing the potential of the following deep multivariate sequence models: LSTM, Temporal CNN, Temporal CNN with an attention layer, and Transformer, for point and probabilistic solar irradiance forecasting. We compare them to the NGBoost method (Duan et al., Reference Duan, Anand, Ding, Thai, Basu, Ng and Schuler2020) that has been shown to outperform various probabilistic models for short-term solar forecasting (Zelikman et al., Reference Zelikman, Zhou, Irvin, Raterink, Sheng, Kelly, Rajagopal, Ng and Gagne2020), along with benchmarks from the literature (as described in the next section). Hyperparameters were tuned on the validation dataset.
4.1. Long short-term memory
We use a simple LSTM pipeline with a single hidden layer of dimension 25.
4.2. Temporal CNN
Temporal CNN (TCN) consists of 1D dilated convolution filters and residual layers that are responsible for learning long-term dependencies efficiently (Shaojie Bai and Koltun, Reference Shaojie Bai and Koltun2018; Lin et al., Reference Lin, Koprinska and Rana2020). Figure 2 shows how dilations help to increase (exponentially) the receptive field of a kernel. This makes the model capable of learning correlations between data points far apart in the past. The convolutions are also causal, meaning that the output at time t depends only on elements from time t and earlier in the previous layer. Our TCN architecture comprises three levels with a hidden layer size of 25 and a kernel size of 3, with dilation factors d = 1, 2, and 4.
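To make the causal and dilated behavior concrete, the following sketch implements a single-channel causal dilated convolution and a receptive-field calculation. It is an illustration only: the function names are hypothetical, and the number of convolutions per level in the actual architecture is not stated here, so `convs_per_level` is an assumed parameter.

```python
def causal_dilated_conv(x, weights, dilation):
    """Single-channel 1D causal convolution with a given dilation.

    out[t] = sum_j weights[j] * x[t - j * dilation], where positions before
    the start of the sequence are treated as zeros (left padding). No future
    value of x ever contributes to out[t].
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t - j * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations, convs_per_level=1):
    """Number of input steps that can influence one output step."""
    return 1 + convs_per_level * (kernel_size - 1) * sum(dilations)
```

Under the assumption of one convolution per level, a kernel size of 3 with dilations 1, 2, and 4 gives a receptive field of 1 + 2 × (1 + 2 + 4) = 15 steps; with two convolutions per level, as in some residual-block designs, the same formula gives 29.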

Figure 2. Dilation in kernels (Oord et al., Reference Oord, Dieleman, Zen, Simonyan, Vinyals, Graves, Kalchbrenner, Senior and Kavukcuoglu2016; Borovykh et al., Reference Borovykh, Bohte and Oosterlee2017).
4.3. TCN with attention
The attention mechanism (Bahdanau et al., Reference Bahdanau, Cho, Bengio, Bengio and LeCun2015) has been used for sequential modeling and time series prediction problems (Qin et al., Reference Qin, Song, Chen, Cheng, Jiang, Cottrell and Sierra2017). It has the ability to model dependencies in long sequences without regard to their distance (Vaswani et al., Reference Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin, Guyon, Luxburg, Bengio, Wallach, Fergus, Vishwanathan and Garnett2017). We add a self-attention layer (adapted from Zhang et al., Reference Zhang, Goodfellow, Metaxas and Odena2019) on the convolution maps generated from the Temporal CNN network and observe the prediction outcomes. This enables the model to “pay attention” to various important parts of the feature maps that can help in making more accurate predictions.
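The core idea of self-attention over a sequence of feature vectors can be sketched as follows. This is a simplified scaled dot-product form with identity projections (queries, keys, and values all equal the input); the learned projections used in practical layers, including the one adapted from Zhang et al., are deliberately omitted, and the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Scaled dot-product self-attention with Q = K = V = seq.

    seq is a list of feature vectors, one per time step. Each output vector
    is a weighted average of all input vectors, with weights based on
    similarity rather than on distance in time.
    """
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        out.append([sum(weights[j] * seq[j][c] for j in range(len(seq)))
                    for c in range(d)])
    return out
```

Because the weights depend only on similarity, a time step can draw on any other step in the window regardless of how far apart they are.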
4.4. Transformer
Transformers are architectures composed only of attention layers, dispensing with recurrence and convolutions entirely (Vaswani et al., Reference Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin, Guyon, Luxburg, Bengio, Wallach, Fergus, Vishwanathan and Garnett2017). They have been adapted for the task of time series forecasting as they work very well with longer sequences (Song et al., Reference Song, Rajan, Thiagarajan and Spanias2018; Wu et al., Reference Wu, Green, Ben and O’Banion2020). For this work, we use the encoder structure of transformers and work with a single stack of two-headed self-attention modules and other standard layers based on Vaswani et al. (Reference Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin, Guyon, Luxburg, Bengio, Wallach, Fergus, Vishwanathan and Garnett2017).
4.5. Probabilistic prediction
For probabilistic forecasts, the above models are modified to output predictions at multiple quantiles (from 5 to 95%). While the point models are trained with mean squared error losses, their probabilistic counterparts are trained using quantile loss.
A fully connected layer at the end of each model is modified to produce either a single output (for point) or multiple outputs (for probabilistic). NGBoost is trained with default parameters and 2000 estimators as in Zelikman et al. (Reference Zelikman, Zhou, Irvin, Raterink, Sheng, Kelly, Rajagopal, Ng and Gagne2020).
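The quantile loss used for the probabilistic models is commonly written as the pinball loss. The sketch below shows it for a single sample with several quantile outputs; the function names and the three example quantile levels are illustrative assumptions, since the exact set of levels between 5 and 95% is not listed here.

```python
def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss for one prediction at quantile level q.

    Under-prediction is weighted by q and over-prediction by (1 - q), so
    minimizing it pushes y_pred toward the q-th quantile of y_true.
    """
    diff = y_true - y_pred
    return max(q * diff, (q - 1.0) * diff)


def multi_quantile_loss(y_true, preds, quantiles):
    """Average pinball loss over all quantile outputs of a model."""
    losses = [pinball_loss(y_true, p, q) for p, q in zip(preds, quantiles)]
    return sum(losses) / len(losses)


# Hypothetical example: three output heads at the 5%, 50%, and 95% levels.
example = multi_quantile_loss(0.6, [0.3, 0.55, 0.9], [0.05, 0.5, 0.95])
```

In a training loop, this quantity would be averaged over a batch and minimized in place of the mean squared error used for the point models.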
5. Evaluation
We provide the results of our experiments over all 7 SURFRAD stations for the test year (2018) in Tables 1 and 2. The benchmarks from the solar energy literature (derived from Doubleday et al., Reference Doubleday, Hernandez and Hodge2020) are described in Sections 5.1–5.4.
Table 1. Results of the point forecasting pipeline.

Note. Results are in terms of RMSE scores (lower is better). Comparisons are made with the SP, HC, and NWP benchmarks.
Abbreviations: HC, hourly climatology; LSTM, long short-term memory; NWP, numerical weather prediction; RMSE, root mean squared error; SP, smart persistence; TCN, temporal CNN.
Table 2. Results of the probabilistic forecasting pipeline.

Note. Results are in terms of CRPS scores. Comparisons are made with the probabilistic HC, CH-PeEN, and NWP benchmarks. The lower the CRPS, the better the model.
Abbreviations: CH-PeEN, complete history persistence ensemble; CRPS, continuous ranked probability score; HC, hourly climatology; LSTM, long short-term memory; NWP, numerical weather prediction; TCN, temporal CNN.
5.1. Hourly climatology
Hourly climatology (HC) is a model that assigns the irradiance at a certain hour in 2018 to be the average of all irradiance values at the same hour of every day in the training data. For the probabilistic forecast evaluation, we do not use the average but the cumulative distribution function (CDF) over these values.
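A minimal sketch of this benchmark, assuming the training data is available as parallel lists of hour-of-day labels and GHI values (the function names are hypothetical):

```python
from collections import defaultdict


def group_by_hour(hours, values):
    """Collect values by hour of day."""
    groups = defaultdict(list)
    for h, v in zip(hours, values):
        groups[h].append(v)
    return groups


def hc_point_forecast(train_hours, train_ghi, forecast_hour):
    """Point forecast: mean training GHI at the same hour of day."""
    members = group_by_hour(train_hours, train_ghi)[forecast_hour]
    return sum(members) / len(members)


def hc_cdf(train_hours, train_ghi, forecast_hour, x):
    """Probabilistic forecast: empirical CDF of training GHI at that hour."""
    members = group_by_hour(train_hours, train_ghi)[forecast_hour]
    return sum(1 for m in members if m <= x) / len(members)
```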
5.2. Complete history persistence ensemble
CH-PeEN is used as a probabilistic prediction benchmark, where for a certain forecast hour, we take a CDF over the clearness indices at the same hour of every day from the training data and these are further converted to irradiance measures.
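The difference from hourly climatology is that the ensemble is formed in clearness-index space and then scaled by the clear-sky GHI of the forecast hour. A sketch under that reading, with a hypothetical function name:

```python
def ch_peen_ensemble(train_hours, train_kt, forecast_hour, clear_sky_ghi):
    """Ensemble of irradiance values for one forecast hour.

    Takes every training clearness index observed at the same hour of day and
    multiplies it by the clear-sky GHI of the forecast hour. The sorted
    members define the empirical predictive CDF.
    """
    members = [k * clear_sky_ghi
               for h, k in zip(train_hours, train_kt)
               if h == forecast_hour]
    return sorted(members)
```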
5.3. NWP ensembles
We use the ECMWF 51-member ensemble as our NWP outputs. These members are updated only twice a day for 1-week ahead forecasts, and hence had to be repeated for the rest of the hours of the day (to be consistent with other forecasts). For point forecasts, we take the ensemble mean, while for probabilistic prediction we take an empirical CDF over them.
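The handling of ensemble members described above can be sketched as follows. The function names are hypothetical, and the repeat factor is left as a parameter because the exact alignment of the twice-daily values with daylight hours is not specified here.

```python
def ensemble_mean(members):
    """Point forecast: mean over ensemble members for one time step."""
    return sum(members) / len(members)


def empirical_cdf(members, x):
    """Probabilistic forecast: fraction of ensemble members at or below x."""
    return sum(1 for m in members if m <= x) / len(members)


def repeat_to_hourly(values, hours_per_value):
    """Repeat each coarse-resolution value to fill an hourly grid."""
    hourly = []
    for v in values:
        hourly.extend([v] * hours_per_value)
    return hourly
```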
5.4. Smart persistence
Smart persistence (SP) is a model that assumes the clearness index (ratio of GHI/clear-sky GHI) at time $ t+ $ lead-time to be the same as at time $ t $, and uses that to obtain the irradiance at $ t+ $ lead-time. This is a common benchmark from the short-term point forecasting literature, which we would not expect to perform well at longer forecast lead times, but include for the sake of completeness.
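In code, smart persistence amounts to carrying the clearness index forward and rescaling by the clear-sky GHI at the target time. The sketch below assumes aligned lists indexed by time step; the function name is hypothetical.

```python
def smart_persistence(kt, clear_sky_ghi, lead):
    """Forecast GHI at t + lead as kt[t] * clear_sky_ghi[t + lead].

    kt:            clearness index per time step
    clear_sky_ghi: clear-sky GHI per time step (same indexing as kt)
    lead:          forecast lead time in steps
    Returns one forecast per valid t, i.e. len(kt) - lead values.
    """
    return [kt[t] * clear_sky_ghi[t + lead] for t in range(len(kt) - lead)]
```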
5.5. Evaluation metrics
The evaluation metrics for point forecasting are the RMSE (root mean squared error) scores of each model. For probabilistic forecasting, we use CRPS or Continuous Ranked Probability Score. CRPS is a widely used metric for evaluating probabilistic forecasts as it balances reliability, resolution, and sharpness which are other criteria to measure the quality of probabilistic outputs (Gneiting et al., Reference Gneiting, Balabdaoui and Raftery2007). Intuitively, CRPS measures the area between the predicted and the observed CDF, the observed (true) CDF being a step function at the observation (Zelikman et al., Reference Zelikman, Zhou, Irvin, Raterink, Sheng, Kelly, Rajagopal, Ng and Gagne2020). The lower the CRPS, the better the model. To evaluate our probabilistic forecasts, that is, when our model outputs predictions at different quantiles ($ \xi \in \left(0,1\right) $), the CRPS score can be expressed as an integral over quantile scores ($ \mathrm{QS} $) at all quantiles (from Doubleday et al., Reference Doubleday, Hernandez and Hodge2020):
$$ \mathrm{CRPS} = \int_0^1 \frac{1}{T}\sum \limits_{t=1}^T {\mathrm{QS}}_{\xi}\left(P^{-1}\left(\xi, t\right), y(t)\right)\, d\xi, $$
where $ y $ is the observation, $ 1 $ is an indicator function, $ P $ the predicted CDF distribution, and $ T $ the number of data points. $ \mathrm{QS} $ at a particular $ \xi $ is defined as:
$$ {\mathrm{QS}}_{\xi} = 2\left(1\left\{y(t) \le P^{-1}\left(\xi, t\right)\right\} - \xi \right)\left(P^{-1}\left(\xi, t\right) - y(t)\right). $$
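When a model outputs only a finite set of quantiles, the integral over $ \xi $ has to be approximated. The sketch below uses the trapezoidal rule over the available quantile levels, which is one possible choice and an assumption of this example: it ignores the contribution of the tails outside the lowest and highest levels. The function names are hypothetical.

```python
def quantile_score(y, q_pred, xi):
    """QS at level xi: 2 * (1{y <= q_pred} - xi) * (q_pred - y)."""
    indicator = 1.0 if y <= q_pred else 0.0
    return 2.0 * (indicator - xi) * (q_pred - y)


def crps_from_quantiles(y_obs, q_preds, levels):
    """Approximate CRPS from predicted quantiles.

    y_obs:   list of T observations
    q_preds: q_preds[t][i] is the predicted quantile at levels[i] for time t
    levels:  increasing quantile levels in (0, 1)
    """
    T = len(y_obs)
    # Average the quantile score over time for each level.
    mean_qs = [
        sum(quantile_score(y_obs[t], q_preds[t][i], xi) for t in range(T)) / T
        for i, xi in enumerate(levels)
    ]
    # Trapezoidal integration over the available levels.
    total = 0.0
    for i in range(len(levels) - 1):
        total += 0.5 * (mean_qs[i] + mean_qs[i + 1]) * (levels[i + 1] - levels[i])
    return total
```

Each quantile score is non-negative and is zero only when the predicted quantile equals the observation, so lower values indicate better forecasts.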
Reliability looks at the statistical consistency between the forecast distribution and observed distribution, while sharpness looks at the concentration (narrowness) of the forecast (Gneiting et al., Reference Gneiting, Balabdaoui and Raftery2007; Lauret et al., Reference Lauret, David and Pinson2019; Doubleday et al., Reference Doubleday, Hernandez and Hodge2020). These characteristics are best observed with a visual analysis and we follow the work of Doubleday et al. (Reference Doubleday, Hernandez and Hodge2020) to visualize both reliability and sharpness. The sharpness plot in Figure 3b shows the average forecast width at the 10%, 20%, 30%, … central intervals. Sharpness does not consider the observation; it only measures the narrowness of the prediction interval. We also provide a reliability diagram (Figure 3a) where we compare the proportion of the observations that lie within a given quantile output, versus the quantile or nominal proportion itself (in an ideal scenario both are expected to be equal).
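Both diagnostics can be computed directly from the predicted quantiles. The sketch below is a simplified illustration with hypothetical function names; it takes the indices of the lower and upper quantiles of a central interval as arguments instead of deriving them from the nominal coverage.

```python
def reliability(y_obs, q_preds, n_levels):
    """Observed proportion of observations at or below each predicted quantile.

    Plotting these proportions against the nominal quantile levels gives a
    reliability diagram; a perfectly reliable forecast lies on the diagonal.
    """
    T = len(y_obs)
    return [
        sum(1 for t in range(T) if y_obs[t] <= q_preds[t][i]) / T
        for i in range(n_levels)
    ]


def average_interval_width(q_preds, lo_idx, hi_idx):
    """Sharpness of one central interval: mean width between two quantiles.

    The observations are not used; only the forecast spread matters.
    """
    widths = [row[hi_idx] - row[lo_idx] for row in q_preds]
    return sum(widths) / len(widths)
```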

Figure 3. Reliability and sharpness plots at Penn State station.
As we observe from Table 1, for the majority of the stations, all of our proposed deep learning models, including LSTM, TCN, TCN + Attention, and Transformers, outperform Smart Persistence, HC, and NWP for point forecasts. LSTM, TCN, and TCN + Attention perform very well for point prediction. NGBoost performs comparably, or better, but falls behind in probabilistic evaluation. The probabilistic prediction results in Table 2 show that TCN (and TCN + Attention) obtain superior results against all benchmarks except CH-PeEN in terms of the CRPS scores. Overall, LSTMs perform equally well. Transformers, however, do not come close to the other proposed models in either point or probabilistic evaluation (except for the Desert Rock station). The CH-PeEN benchmark consistently performs slightly better than the best-performing probabilistic models. To investigate this, we refer to the reliability and sharpness diagrams for the Penn State station in Figure 3. We clearly note that all our proposed models have better sharpness (as their curves are lower) in their forecasts than CH-PeEN, even though it is very reliable.
6. Discussion
Our encouraging results demonstrate that deep sequence learning algorithms hold promise for producing improved week-ahead forecasts as they outperform most of the literature benchmarks. Our methods also provide a distribution over the prediction, and this additional knowledge of uncertainty can be extremely important in efficient power system and generator planning. Our proposed models outperform a machine learning-based approach (Duan et al., Reference Duan, Anand, Ding, Thai, Basu, Ng and Schuler2020) in probabilistic forecasting.
While temporal CNNs are faster to train than LSTMs, the two show almost equal performance in this application, especially for probabilistic forecasts. The attention mechanism proved useful when used in conjunction with TCNs but, notably, not as much when we dispensed with the convolution layers and used a transformer, which is entirely based on attention.
Furthermore, as part of our study, we wanted to look into the potential performance of our existing deep learning models if they are provided with an additional input feature of the NWP model ensemble. Tables 3 and 4 provide the results obtained when the 51-member ensemble is incorporated into our models. With the poor temporal resolution of these NWP predictors, we did not expect to see a huge performance improvement in the forecasts. We do observe an overall slight enhancement in performance with LSTM but not a clear trend with the TCN, TCN + Attention, and Transformer models. Further investigation is left to future work.
Table 3. Results of the point forecasting pipeline with NWP ensemble included as features in our models.

Note. Results are in terms of RMSE scores.
Abbreviations: HC, hourly climatology; LSTM, long short-term memory; RMSE, root mean squared error; SP, smart persistence; TCN, temporal CNN.
Table 4. Results of the probabilistic forecasting pipeline with NWP ensemble included as features in our models.

Note. Results are in terms of CRPS scores.
Abbreviations: CH-PeEN, complete history persistence ensemble; CRPS, continuous ranked probability score; HC, hourly climatology; LSTM, long short-term memory; NWP, numerical weather prediction; TCN, temporal CNN.
7. Conclusion
We provide a quantitative study and demonstrate the valuable potential of deep learning methods for 1 week-ahead solar irradiance forecasting, especially when such longer-term predictions are ill-served by existing NWP models. Week-ahead and longer forecasts, coupled with uncertainty estimates, can be very significant for future power systems operations when efficient energy planning will become increasingly important with greater shares of renewable energy penetrating into power systems. We hope this paper will encourage future work leveraging machine learning for long-term point and probabilistic forecasting, not only for solar power but also for other renewables and applications mitigating climate change.
Author Contributions
Conceptualization: S.S., B.-M.H., C.M.; Data curation: S.S., B.-M.H.; Methodology: S.S., C.M.; Supervision: B.-M.H., C.M.; Writing—original draft: S.S., B.-M.H., C.M.; Writing—review and editing: S.S., B.-M.H., C.M. All authors approved the final submitted draft.
Competing Interests
C.M. is the Editor-in-Chief of Environmental Data Science. This paper was independently reviewed and accepted for publication.
Data Availability Statement
We use data from NOAA’s open-source SURFRAD network (Surface Radiation Budget Network for atmospheric research; Augustine et al., Reference Augustine, DeLuisi and Long2000), which provides ground-truth solar irradiance and meteorological measurements from seven sites across the US in different climatic zones (https://gml.noaa.gov/grad/surfrad/).
Funding Statement
This work received no specific grant from any funding agency, commercial or not-for-profit sectors.
Provenance
This article is part of the Climate Informatics 2022 proceedings and was accepted in Environmental Data Science on the basis of the Climate Informatics peer review process.
 
 






