Impact Statement
We leverage generative adversarial networks, a class of deep generative models, to produce regional temperature maps at hourly resolution, making them amenable to many downstream applications such as reliability impact studies for power systems and probabilistic estimates of impacts on food, health, and energy. Noting the lack of unifying metrics and procedures for evaluating machine learning models in this regime, this paper proposes systematic procedures and metrics for evaluation and empirically validates the reliability of the metrics through experiments in multiple regions.
1. Introduction
Over the past decade, the growing impacts of climate change have been disproportionate among regions and populations (Diffenbaugh and Burke, Reference Diffenbaugh and Burke2019). The frequency and severity of extreme events are impacted by climate change and pose a major threat to ecosystems, power systems, agriculture, and people (Henry and Pratson, Reference Henry and Pratson2017; Liang et al., Reference Liang, Wu, Chambers, Schmoldt, Gao, Liu, Liu, Sun and Kennedy2017; Somanathan et al., Reference Somanathan, Somanathan, Sudarshan and Tewari2021). For power systems, outages and curtailments can last for days when events like heat waves or wildfires impact generation and transmission units. In Thiery et al. (Reference Thiery, Lange, Rogelj, Schleussner, Gudmundsson, Seneviratne, Andrijevic, Frieler, Emanuel and Geiger2021), the authors estimate that the generation born in 2020 will experience a twofold to sevenfold increase in heat waves compared to people born in 1960. It has been shown that inequities can be spatially and temporally heterogeneous and are correlated with socioeconomic factors like race. For instance, Brockway et al. (Reference Brockway, Conde and Callaway2021) show that in California, regions with larger Black populations have lower solar grid hosting capacities, potentially limiting access to solar PV, which can act as a key home-level resiliency tool.
As the power sector transitions to cleaner energy technologies, fueled by the large-scale deployment of renewable energy resources, there is a massive opportunity to make the grid more resilient for everyone. It is therefore important to identify who is most at risk from the effects of climate change. However, it is very challenging to accurately forecast, in space and time, specific occurrences and attributes of significant weather events more than a few days to weeks in advance, which makes preparing for these events difficult.
General circulation models (GCMs) provide spatially coarse examples of plausible future climate states but lack the spatial resolution needed to study local phenomena. Characterizing the tails of temperature distributions requires generating many examples, as rarer events may occur only once in many samples, making computationally efficient sampling paramount. Stochastic weather generators (SWGs) are commonly used for generating statistically plausible examples of weather data (Semenov, Reference Semenov2008). SWGs commonly comprise statistical models (Markov chains, Gaussian fits, exponential fits, etc.) that are fit to observational weather data; Wilks and Wilby (Reference Wilks and Wilby1999) discuss several of these models and their applications in detail. LARS-WG (Semenov et al., Reference Semenov, Barrow and Lars-Wg2002), a popular stochastic weather generator based on the series weather generator (Racsko et al., Reference Racsko, Szeidl and Semenov1991), was created to simulate data at a single site, generating maximum and minimum daily temperatures but not hourly values. Generating atmospheric conditions at hourly resolution can be useful for grid reliability studies.
This motivates a method that can provide spatially and temporally resolved (hourly resolution) examples of atmospheric conditions that are physically realistic (i.e., reasonably fall within or near historically observed trends) for a specific region and time; we develop such an approach utilizing generative adversarial networks (GANs) (Goodfellow et al., Reference Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville and Bengio2014). A GAN is a deep learning (DL) generative modeling framework in which two artificial neural networks (ANNs) are trained in an alternating, repeated sequence. It comprises a discriminator/critic network, trained to distinguish between true and fake samples, and a generator network, whose goal is to produce samples that come from the same distribution as, or resemble, the true samples. As both networks train over time, the critic becomes better at distinguishing between fake and true samples, while the generator produces fake samples that are harder to distinguish. Mathematically, this results in a minimax expression with the critic and generator networks optimizing conflicting objectives.
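In the canonical formulation of Goodfellow et al. (2014), this adversarial game is written as a single minimax objective over the generator $ G $ and discriminator $ D $:

$$ \min_G \max_D\; {\mathbb{E}}_{x\sim {p}_{\mathrm{r}}}\left[\log D(x)\right]+{\mathbb{E}}_{z\sim {p}_z}\left[\log \left(1-D\left(G(z)\right)\right)\right], $$

where $ {p}_{\mathrm{r}} $ is the data distribution and $ {p}_z $ is the noise prior. The Wasserstein variant adopted in this work (Section 3.2) replaces these log-likelihood terms with unbounded critic scores.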
Deep generative models have gained prominence due to their ability to capture complex distributions without explicit statistical parameterization, yielding methods that generally outperform their more explicit counterparts, which rely on stronger prior assumptions about the data distribution; the authors in Buechler et al. (Reference Buechler, Balogun, Majumdar and Rajagopal2021) show an example of this. SWGs have been leveraged by various research communities for decades; thus, developing new methods that complement or improve on these models will drive the research community forward. In this paper, we propose an approach to modeling conditional regional temperatures utilizing GANs and propose metrics for evaluation.
The rest of this paper is organized as follows. In Section 2, we describe the problem, state our contributions, propose a framework for approaching the problem, and discuss related work. In Section 3, we describe the methodology in detail, including data treatment, model architecture, and training. In Section 4, we show results from experiments and evaluate the model's outputs. Finally, we conclude in Section 5 with an outlook on future work.
2. Problem statement
Understanding the distribution of impacts of temperatures on smaller regions or specific communities requires data at spatial resolutions finer than what GCMs currently provide, and the plausible states derived from GCMs do not provide the temporal resolution that some studies (e.g., a power distribution system reliability or resilience study) require. Additionally, because GCMs produce snapshots of the entire earth at once, it can be very inefficient and memory intensive to produce enough samples to study smaller regions of interest. In this work, we aim to:
1. Motivate and develop an approach for utilizing GANs to reliably generate daily surface temperature maps at an hourly temporal resolution at low computational cost.
2. Show that our model can be leveraged for period-, month-, and region-based sampling by conditioning the model on those priors at training time; to the best of our knowledge, no other work has used GANs in this regime for month- and region-based sampling.
3. Propose metrics and methods to evaluate our approach specifically for this task, measure performance, and empirically validate the quality of the generative model's outputs. These metrics can be leveraged in other domains for evaluating generative models with similar outputs.
GANs are most commonly used for image generation and have seen significant development in conditional and controllable generation (Radford et al., Reference Radford, Metz and Chintala2015; Brock et al., Reference Brock, Donahue and Simonyan2018; Karras et al., Reference Karras, Laine and Aila2018) and in the more nascent area of video generation, as in Clark et al. (Reference Clark, Donahue and Simonyan2019) and Xia et al. (Reference Xia, Mitchell, Ek, Sheffield, Cosgrove, Wood, Luo, Alonge, Wei and Meng2012). In this work, we aim to produce 24-hour temperature maps, which can be viewed as a video generation task. We propose a conditional GAN that can generate (2 m above ground) atmospheric temperatures, conditioned on region, month, and time period. By capturing region-based temperature distributions in a generative fashion, this can provide inputs to other impact assessment models (Gleick, Reference Gleick1987; Moriondo et al., Reference Moriondo, Giannakopoulos and Bindi2011; Semenza et al., Reference Semenza, Herbst, Rechenburg, Suk, Höser, Schreiber and Kistemann2012; Siebert et al., Reference Siebert, Ewert, Rezaei, Kage and Graß2014) that estimate the downstream impacts of weather and the changing climate. We display the overarching framework in Figure 1.
2.1. Related works
In recent years, there has been meaningful progress in applying deep learning techniques to problems related to weather and climate. The authors in Bihlo (Reference Bihlo2021) use an ensemble of GANs to predict the future (one year) weather using the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA5 ( $ {0.25}^{\circ}\times {0.25}^{\circ } $ spatial and three-hour time resolution) reanalysis data from the prior four years, and evaluate their models using root-mean-squared error (RMSE), anomaly correlation coefficients (ACC), and continuous ranked probability scores (Zamo and Naveau, Reference Zamo and Naveau2018), which are commonly used to evaluate forecasts. Meng et al. (Reference Meng, Rigall, Chen, Gao, Dong and Chen2021) develop a physics-informed GAN for sea subsurface temperature prediction at $ {0.25}^{\circ}\times {0.25}^{\circ } $ spatial and daily mean resolutions. In Keisler (Reference Keisler2022), the authors utilize graph neural networks (GNNs) for global weather forecasts. The model uses local information to step the current 3D atmospheric state forward by six hours and shows results that rival state-of-the-art forecasting models such as the ECMWF model, while reducing computational cost; some months later, Lam et al. (Reference Lam, Sanchez-Gonzalez, Willson, Wirnsberger, Fortunato, Pritzel, Ravuri, Ewalds, Alet and Eaton-Rosen2022) (also GNN-based) and Bi et al. (Reference Bi, Xie, Zhang, Chen, Gu and Tian2022) introduced global weather forecasting models that surpassed the state-of-the-art physics-based numerical weather prediction model on most metrics. In Bhatia et al. (Reference Bhatia, Jain and Hooi2021), the authors adopt a gradual distribution shifting and resampling approach to model extreme precipitation, but they evaluate the outputs using the Fréchet Inception Distance (FID) proposed by Heusel et al. (Reference Heusel, Ramsauer, Unterthiner, Nessler, Klambauer and Hochreiter2017), which does not capture the temporal veracity of the generated rainfall (distribution). Additionally, we suggest that FID is not appropriate for understanding a model's performance on generating physical variables, because the intermediate layer from which scores are obtained does not produce values with units that have an easily discernible physical meaning. In Puchko et al. (Reference Puchko, Link, Hutchinson, Kravitz and Snyder2020), the authors use a GAN to model average daily temperatures from the Coupled Model Intercomparison Project Phase (CMIP) data, taking an autoregressive approach, and evaluate their model by visually comparing histogram plots of true and generated data, which can be useful but is not rigorous and can be biased by the human eye. Additionally, they do not produce samples at an hourly resolution, but rather a daily average temperature, which may not be sufficient for some downstream applications (e.g., reliability assessments). They also acknowledge that more work needs to be done to thoroughly evaluate their GAN. In Besombes et al. (Reference Besombes, Pannekoucke, Lapeyre, Sanderson and Thual2021), the authors use a GAN to model daily mean climate variables at a 2.8° resolution and propose some approaches to evaluation, including principal component analysis (PCA), Wasserstein distance, and visual inspection, recognizing that traditional methods for evaluating GANs from the natural image synthesis community may not suffice. More recently, Izumi et al.
(Reference Izumi, Amagasaki, Ishida and Kiyama2022) leverage existing GANs for super-resolution of sea surface temperature, and evaluate the models by comparing the outputs with high resolution optimum interpolation sea surface temperature (OISST), using the learned perceptual image patch similarity (LPIPS) and RMSE as metrics.
Although recent work has shown that GANs have potential for modeling climate variables, utilizing GANs in this regime is still in its infancy. Consequently, there is a dearth of intuitive and consistent evaluation metrics/benchmarks specific to this application. Evaluation metrics that can be adopted by the machine learning (ML), power, and climate communities are critical for comparing generative models and their outputs, especially for researchers working at the confluence of these fields.
3. Methodology
TemperatureGAN’s task is analogous to video generation; we aim to produce 2D spatial maps of temperatures that iterate through multiple time steps. Consequently, such spatial maps are amenable to estimating the probability $ \hat{P}\left(T|R,M,k\right) $ , where $ \left\{T,R,M,k\right\} $ represent temperature, region, month, and period. This can be relevant for other downstream applications, such as empirically estimating power system resilience over a whole utility service territory under plausible ambient conditions (Saraiva et al., Reference Saraiva, Miranda and Pinto1996; Billinton and Karki, Reference Billinton and Karki1999; Billinton and Wangdee, Reference Billinton and Wangdee2006; Li et al., Reference Li2013). Though physics-based models have shown reliability and accuracy for weather forecasting, they are computationally expensive and thus cannot efficiently create ensembles with enough members to simulate multiple realizations rapidly.
3.1. Data
Direct station observations (e.g., the Automated Surface Observing System in the US) are reliable, accurate, and can have decades of historical records, but are spatially sparse. Satellites can provide spatially gridded observations of some atmospheric variables, but data may not exist for long time periods. Reanalysis datasets blend these sources together, along with forecasting/simulation models, to produce spatially and temporally consistent maps of climate variables. The dataset used for this work is the Mosaic land surface model forcing temperatures data from the North American Land Data Assimilation System (NLDAS) (Xia et al., Reference Xia, Mitchell, Ek, Sheffield, Cosgrove, Wood, Luo, Alonge, Wei and Meng2012). The dataset is well-suited for this task as it provides fine spatial and temporal resolution relevant to power systems studies. NLDAS (a collaborative effort between NOAA, NASA, and others) integrates satellite and surface-level observational data with reanalysis datasets and includes multiple surface state and flux variables in North America from 1979 to near-present day. The data has a $ {0.125}^{\circ}\times {0.125}^{\circ } $ ( $ 13.8\times 11 $ km) spatial resolution (taken from an equidistant cylindrical projection) at an hourly timescale and a size of about 700GB for the contiguous US. Training on data at a $ {0.125}^{\circ}\times {0.125}^{\circ } $ resolution enables the model to capture fine spatiotemporal dependencies that a physics-driven model cannot without a costly downscaling effort. We limit the scope of this work to the United States (US) West Coast region due to computational resource constraints, and regions without data are zero-padded. We aggregate the raw data both spatially and temporally, as single grid point measurements are not sufficient for generating distributions with a high degree of confidence.
3.1.1. Spatial data aggregation
To aggregate the data spatially, $ {1}^{\circ}\times {1}^{\circ } $ ( $ 111\times 88 $ km) grids are grouped as a single labeled region; a region $ R $ is a grid of 64 grid points ( $ 8\times 8 $ ) from the original dataset. This aggregation increases the number of samples for each region compared to the original NLDAS grid. It implicitly assumes that samples from each $ {1}^{\circ}\times {1}^{\circ } $ region come from a conditional distribution $ p\left(T|R\right) $ . Regions are indexed by their relative (integer) positions from the SW corner of the dataset, which corresponds to a position of (1,1). The GAN is conditioned on these relative integer positions during training. Note that this data aggregation may limit the downstream applications of this model. For example, independent system operators (ISOs) cover large regions, usually larger than $ {1}^{\circ}\times {1}^{\circ } $ , so studying weather-related power risks at that scale using this spatial extent will not be straightforward. Nonetheless, it is possible to model spatial extents larger than $ {1}^{\circ}\times {1}^{\circ } $ , as this mainly modifies the data-engineering step.
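As a minimal sketch of this spatial aggregation step (assuming the NLDAS temperature field is already loaded as a NumPy array indexed as time × latitude × longitude; function and variable names are illustrative, not from the released code):

```python
import numpy as np

def to_regions(temps: np.ndarray, block: int = 8) -> np.ndarray:
    """Group a (time, lat, lon) 0.125-degree grid into 1x1-degree regions.

    Returns an array of shape (time, lat_regions, lon_regions, block, block),
    where each (block, block) tile is one labeled region R of 64 grid points.
    """
    t, nlat, nlon = temps.shape
    nlat_r, nlon_r = nlat // block, nlon // block
    # Drop edge grid points that do not complete an 8x8 region.
    temps = temps[:, : nlat_r * block, : nlon_r * block]
    temps = temps.reshape(t, nlat_r, block, nlon_r, block)
    # Regions are then indexed by integer offsets from the SW corner, with (1, 1) as the origin.
    return temps.transpose(0, 1, 3, 2, 4)
```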
3.1.2. Temporal data aggregation
Temporally, we posit that temperature distributions are non-stationary over sufficiently long periods (Donat and Alexander, Reference Donat and Alexander2012). The data is aggregated such that each example is a 24-hour (daily) temperature map; one can imagine each example as a video with 24 frames. We introduce the idea of periods—a period is a stipulated number of years over which one can assume that the overall climate does not change significantly. To make this concrete, we aggregate the data into 24-hour daily time series, then group all the 24-hour time series by their respective months, and finally group the same month within the elected period into the same bucket. In this work, we elect 4-year periods. This makes the implicit assumption that for a 4-year period $ {k}_i $ , at a given region $ R $ and month $ M $ , the diurnal cycles come from the same distribution, or are independently and identically distributed (IID). For example, if the entire historically observed record spans 1979 to present day, then the first period, $ {k}_0 $ , encompasses observations from 1979–1982 inclusive, the second period, $ {k}_1 $ , spans 1983–1986 inclusive, the third period, $ {k}_2 $ , spans 1987–1990, and so on. Selecting quadrennial (4-year) periods has not been rigorously justified, as one might elect 1-, 2-, or 5-year periods instead. The choice of the number of years within a period constitutes a design trade-off. If there are too few years (e.g., a one-year period), the modeling assumption departs too far from known climatology and is difficult to evaluate empirically with confidence. If there are too many years, one might unintentionally average out temporal effects or years/periods of significant temperature distribution shift or dilation—we find 4 years to be a reasonable choice and defer a more thorough investigation to future work. Figure 2 describes the temporal data aggregation scheme.
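A sketch of the period assignment this scheme implies (assuming the record starts in 1979; the helper name is illustrative):

```python
def period_index(year: int, start_year: int = 1979, period_len: int = 4) -> int:
    """Map a calendar year to its quadrennial period k_i (k_0: 1979-1982, k_1: 1983-1986, ...)."""
    return (year - start_year) // period_len

# Each 24-hour example is then keyed by (region R, month M, period k),
# e.g. period_index(1985) -> 1, so June 1985 examples land in bucket (R, June, k_1).
```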
Adopting $ {k}_i $ , where $ i\in \left\{0,1,2,\dots, n\right\} $ , offers the ability to smooth over inter-year macro-scale climate events (e.g., El Niño) in the ground-truth observations and capture medium/longer-term trends. Another important motivation for this modeling decision is that it lends itself well to evaluation: by aggregating over regions and years, we have more data per conditional label, making parametric and non-parametric goodness-of-fit tests for evaluating the generated samples more tractable.
3.2. Model architecture and training
Video generation (Saito et al., Reference Saito, Matsumoto and Saito2017; Clark et al., Reference Clark, Donahue and Simonyan2019; Chu et al., Reference Chu, Xie, Mayer, Leal-Taixé and Thuerey2020; Gur et al., Reference Gur, Benaim and Wolf2020; Wang et al., Reference Wang, Bilinski, Bremond and Dantcheva2020; Gupta et al., Reference Gupta, Keshari and Das2022) is one of the most challenging GAN applications. In Clark et al. (Reference Clark, Donahue and Simonyan2019), the authors propose a dual video discriminator (DVD) to handle the memory bottleneck of video datasets (terming the model DVD-GAN), using two discriminators—a spatial and a temporal discriminator. The spatial discriminator $ {D}_{\mathrm{s}} $ inspects an individual video frame (a static image) for texture quality, while the temporal discriminator $ {D}_{\mathrm{t}} $ penalizes unrealistic frame-by-frame transitions. For TemperatureGAN, we also leverage two discriminator/critic networks, but take a different approach from DVD-GAN. We build a convolutional neural network-based temporal discriminator $ {D}_{\mathrm{t}} $ , and rather than training it on videos (in our case, a video is a spatiotemporal time series of temperature values), it is trained on temporal gradients, distinguishing our model from DVD-GAN. Training on the temporal gradients separately guides the model to focus on learning the daily (hourly) diurnal cycles and to produce hourly (or temporal) temperature transitions that are consistent with the ground-truth's diurnal cycles. Thus, for a given 24-hour sample, we have 23 gradients $ \frac{\partial T}{\partial t} $ . We use the Wasserstein loss with a gradient penalty (Gulrajani et al., Reference Gulrajani, Ahmed, Arjovsky, Dumoulin and Courville2017) as a soft constraint to satisfy 1-Lipschitz continuity (Gouk et al., Reference Gouk, Frank, Pfahringer and Cree2021). We also experimented with directly constraining the layer weights via spectral normalization, introduced in Miyato et al. (Reference Miyato, Kataoka, Koyama and Yoshida2018), to satisfy the 1-Lipschitz continuity conditions and found it to perform poorly, so it was not pursued further. The loss functions are:
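Following the WGAN-GP formulation of Gulrajani et al. (2017), with the spatial critic applied to temperature maps and the temporal critic applied to their hourly gradients, the losses take the form sketched below (a plausible rendering consistent with the description that follows; the published equations may group terms differently):

$$ {L}_{D_{\mathrm{t}}}={\mathbb{E}}_{\tilde{\boldsymbol{T}}\sim {\unicode{x2119}}_{\mathrm{g}}}\left[{D}_{\mathrm{t}}\left(\frac{\partial \tilde{\boldsymbol{T}}}{\partial t}\right)\right]-{\mathbb{E}}_{\boldsymbol{T}\sim {\unicode{x2119}}_{\mathrm{r}}}\left[{D}_{\mathrm{t}}\left(\frac{\partial \boldsymbol{T}}{\partial t}\right)\right]+{\lambda}_{\mathrm{GP}}\,{\mathbb{E}}_{\hat{\mathbf{T}}\sim {\mathrm{\mathbb{P}}}_{\hat{\mathbf{T}}}}\left[{\left({\left\Vert {\nabla}_{\hat{\mathbf{T}}}{D}_{\mathrm{t}}\left(\hat{\mathbf{T}}\right)\right\Vert}_2-1\right)}^2\right] \qquad (1) $$

$$ {L}_{D_{\mathrm{s}}}={\mathbb{E}}_{\tilde{\boldsymbol{T}}\sim {\unicode{x2119}}_{\mathrm{g}}}\left[{D}_{\mathrm{s}}\left(\tilde{\boldsymbol{T}}\right)\right]-{\mathbb{E}}_{\boldsymbol{T}\sim {\unicode{x2119}}_{\mathrm{r}}}\left[{D}_{\mathrm{s}}\left(\boldsymbol{T}\right)\right]+{\lambda}_{\mathrm{GP}}\,{\mathbb{E}}_{\hat{\mathbf{T}}\sim {\mathrm{\mathbb{P}}}_{\hat{\mathbf{T}}}}\left[{\left({\left\Vert {\nabla}_{\hat{\mathbf{T}}}{D}_{\mathrm{s}}\left(\hat{\mathbf{T}}\right)\right\Vert}_2-1\right)}^2\right] \qquad (2) $$

$$ {L}_G=-{\mathbb{E}}_{\tilde{\boldsymbol{T}}\sim {\unicode{x2119}}_{\mathrm{g}}}\left[{D}_{\mathrm{s}}\left(\tilde{\boldsymbol{T}}\right)+{D}_{\mathrm{t}}\left(\frac{\partial \tilde{\boldsymbol{T}}}{\partial t}\right)\right] \qquad (3) $$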
$ \overset{\sim }{\boldsymbol{T}}\sim {\unicode{x2119}}_{\mathrm{g}} $ represents examples sampled from the GAN (the generator) and $ \boldsymbol{T}\sim {\unicode{x2119}}_{\mathrm{r}} $ represents examples sampled from the real (observed) data. The discriminators/critics and the generator each seek to minimize their respective losses. The first two terms in Equations (1) and (2) instruct the discriminators to maximize the gap between the expected scores of the true and fake samples for the temperature gradients and the temperature values; by minimizing these losses, $ {D}_{\mathrm{t}} $ and $ {D}_{\mathrm{s}} $ are encouraged to assign higher scores to true (observed) data and lower scores to generated (“fake”) examples. The generator $ G $ , in turn, minimizes Equation (3), which encourages it to produce examples that yield high scores from the discriminators, that is, samples that resemble the observed (ground-truth) data, thereby implicitly estimating the true probability distribution $ {\unicode{x2119}}_{\mathrm{r}} $ . The final terms in Equations (1) and (2) address an important concept for the training stability of Wasserstein GANs (WGANs): Lipschitz continuity. For the Wasserstein distance approximation to be valid, the critic must be 1-Lipschitz continuous; this constrains how quickly its outputs can change with respect to its inputs. In other words, enforcing Lipschitz continuity ensures that the discriminator's scores do not change so sharply that the generator cannot learn. $ {\nabla}_{\hat{\mathbf{T}}}D\left(\hat{\mathbf{T}}\right) $ is the gradient of the discriminator's output with respect to its input temperature maps. We follow a similar convention as in Gulrajani et al. (Reference Gulrajani, Ahmed, Arjovsky, Dumoulin and Courville2017), where the authors define $ {\mathrm{\mathbb{P}}}_{\hat{\mathbf{T}}} $ as sampling uniformly along straight lines between pairs of points sampled from the data distribution $ {\mathrm{\mathbb{P}}}_r $ and the generator distribution $ {\unicode{x2119}}_{\mathrm{g}} $ . $ {\lambda}_{\mathrm{GP}} $ is a hyperparameter weighting the importance the model places on the final (Lipschitz continuity) terms in Equations (1) and (2); we elect $ {\lambda}_{\mathrm{GP}}=1 $ for training.
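As a minimal PyTorch sketch of the gradient-penalty term described above (following Gulrajani et al., 2017; the `critic` signature, `labels` argument, and tensor shapes are illustrative assumptions rather than the released implementation):

```python
import torch

def gradient_penalty(critic, real: torch.Tensor, fake: torch.Tensor, labels) -> torch.Tensor:
    """WGAN-GP penalty: mean of ((||grad_x D(x_hat)||_2 - 1)^2) over the batch.

    x_hat is sampled uniformly on straight lines between real and generated samples.
    """
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)  # one mixing weight per sample
    x_hat = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    scores = critic(x_hat, labels)
    grads = torch.autograd.grad(
        outputs=scores, inputs=x_hat,
        grad_outputs=torch.ones_like(scores),
        create_graph=True, retain_graph=True,
    )[0]
    grads = grads.reshape(grads.size(0), -1)
    return ((grads.norm(2, dim=1) - 1.0) ** 2).mean()
```

This term is added to each critic loss with weight $ {\lambda}_{\mathrm{GP}} $ .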
Figures 3–5 show the architectures of the generator and discriminator neural networks. They are convolution-based networks.
The generator is an arguably modest 562,206-parameter model. We normalize and standardize the data and train the GAN for 2000 epochs using a batch size of 4096 and the Adam optimizer (Kingma and Ba, Reference Kingma and Ba2014) ( $ {\beta}_1=0.5,{\beta}_2=0.99 $ ) for gradient descent, with an exponential learning rate (LR) decay every 100 epochs. We train on the first 8 periods (1979–2010), about 2.7 million examples, where each example is 3-dimensional (lon $ \times $ lat $ \times $ time). The model takes 4 days to train for 2000 epochs on a single 80GB NVIDIA A100 GPU. Because we do not explicitly constrain the network outputs (constraining the output would imply we know the upper bound on temperature values), it is important to use activation functions that are bounded, especially at later layers of the generator $ G $ . We observed that using only ReLU/LeakyReLU layers in the generator could yield implausibly extreme samples, which was rectified by choosing bounded activation functions (e.g., $ \tanh $ ).
4. Experiments and evaluation
Figures 6–8 show results from TemperatureGAN compared to the ground truth. The generated outputs are sampled from the model by passing in the input noise, region, month, and period labels as displayed in Figure 1. The displayed outputs are (random) representative days for the given month, region, and period. The displayed ground-truth data is similarly randomly sampled. The time zones for generated examples are in Coordinated Universal Time (UTC), and local times are included. More generated examples can be accessed via the hyperlink here, and additional distribution plots are in Figures B7 and B8 in Appendix B.
4.1. Model evaluation
Because generative models attempt to learn a probability distribution $ \hat{p}\left(x,y\right) $ (or $ \hat{p}(x) $ if labels are not observed) that estimates the true probability distribution $ p\left(x,y\right) $ (where $ x $ represents the data sample or its features and $ y $ represents the labels/class) of data, it is important to develop methods/metrics for evaluation that quantify how well a model performs this task. This can be achieved by estimating the likelihood that an example $ \hat{x}\sim \hat{p}(x) $ comes from the true distribution $ p(x) $ .
Using generative ML for climate variables differs from traditional ML tasks such as image, speech, or video synthesis, because models generate physical values for which a unit discrepancy has physical consequence. It is important to establish a simple yet efficient method for quantifying how well or poorly a model performs on any given climate variable, offering standard baselines that future models can be compared to. Metrics should ideally (1) be efficiently computed, (2) consistently track quality, and (3) be relevant and easily adopted by the ML, climate, and energy/power systems communities. We discuss and propose ideas for evaluation below.
If the true distribution $ p(x) $ of $ x $ is known or approximately admits a certain functional form (e.g., Gaussian), one can compute the distance between the estimated sufficient statistics of the generated distribution $ \hat{p}(x) $ and the true distribution $ p(x) $ . A metric that uses this approach is the Fréchet Inception Distance (FID) (Heusel et al., Reference Heusel, Ramsauer, Unterthiner, Nessler, Klambauer and Hochreiter2017), which is commonly used to evaluate GANs. FID implicitly assumes that the intermediate feature vectors extracted from images using an Inception V3 model trained on the ImageNet dataset come from a multivariate normal distribution. That is, $ {X}_r\sim \mathcal{N}\left({\mu}_{\mathbf{r}},{\mathtt{\varSigma}}_{\mathbf{r}}\right) $ and $ {X}_g\sim \mathcal{N}({\mu}_{\mathbf{g}},{\mathtt{\varSigma}}_{\mathbf{g}}) $ with $ \left({\mu}_{\mathbf{r}},{\mathtt{\varSigma}}_{\mathbf{r}}\right) $ and $ ({\mu}_{\mathbf{g}},{\mathtt{\varSigma}}_{\mathbf{g}}) $ as the mean-covariance pairs for the ground-truth and generated images, respectively. In other cases, the distribution $ p(x) $ may not be continuous, which can make evaluation more challenging. In discontinuous cases, one may take a piece-wise approach to evaluation if there exists a continuous form of the distribution $ {p}_q(x) $ over a certain interval $ {\mathbf{q}}_{\mathbf{0}}\le \mathbf{q}\le {\mathbf{q}}_{\mathbf{1}} $ . If the distribution within these intervals admits a known functional form, then a weighted average of the distances between the sufficient statistics over all intervals can be adopted; this approach is limited to real-valued distributions.
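For reference, the Fréchet (2-Wasserstein) distance between two multivariate Gaussians, which FID computes on the Inception feature statistics above, is

$$ {d}^2\left(\mathcal{N}\left({\mu}_{\mathbf{r}},{\mathtt{\varSigma}}_{\mathbf{r}}\right),\mathcal{N}\left({\mu}_{\mathbf{g}},{\mathtt{\varSigma}}_{\mathbf{g}}\right)\right)={\left\Vert {\mu}_{\mathbf{r}}-{\mu}_{\mathbf{g}}\right\Vert}_2^2+\operatorname{Tr}\left({\mathtt{\varSigma}}_{\mathbf{r}}+{\mathtt{\varSigma}}_{\mathbf{g}}-2{\left({\mathtt{\varSigma}}_{\mathbf{r}}{\mathtt{\varSigma}}_{\mathbf{g}}\right)}^{1/2}\right); $$

its one-dimensional specialization underlies the FDTD metric proposed in Section 4.1.4.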
In other cases, a functional form of $ p $ is unknown, so a non-parametric goodness-of-fit test (e.g., the Kolmogorov–Smirnov statistic) can be adopted, or entropy-based methods (Kullback–Leibler (KL) and Jensen–Shannon (JS) divergences) (Kullback and Leibler, Reference Kullback and Leibler1951; Lin, Reference Lin1991) can be leveraged. The JS divergence is the symmetric distance measure borne from the KL divergence, and its square root formally satisfies the properties of a metric. For evaluating a generative model when the functional form of $ p $ is unknown, one can empirically estimate the JS divergence by taking the following steps (a minimal sketch implementing these steps follows the list):
1. Choose the number of bins $ n $ , which decides how many quantile intervals the datapoints will be placed into,
2. Sort the data and place every datapoint/sample into a bin corresponding to its quantile range within the dataset,
3. Empirically calculate $ {D}_{\mathrm{JS}}\left(\boldsymbol{P}\Big\Vert \boldsymbol{Q}\right) $ for each bin and then compute the average JS-divergence.
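Concretely, a NumPy sketch of these steps (taking the quantile bin edges from the ground-truth sample and clipping generated values into that range are illustrative choices; the per-bin contributions are summed here to give the standard discrete JS divergence, and dividing by $ n $ recovers the per-bin average described in step 3):

```python
import numpy as np

def binned_js_divergence(real: np.ndarray, generated: np.ndarray, n_bins: int = 10) -> float:
    """Estimate D_JS(P || Q) between two scalar samples over quantile bins.

    Bin edges come from the quantiles of the real data (steps 1-2); the discrete
    JS divergence is accumulated from the per-bin probabilities (step 3).
    Natural log is used, so the result lies in [0, ln 2].
    """
    real, generated = np.ravel(real), np.ravel(generated)
    edges = np.quantile(real, np.linspace(0.0, 1.0, n_bins + 1))
    generated = np.clip(generated, edges[0], edges[-1])  # keep out-of-range values in the end bins
    p = np.histogram(real, bins=edges)[0].astype(float)
    q = np.histogram(generated, bins=edges)[0].astype(float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # treat 0 * log(0 / .) as 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```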
We discuss and formalize the evaluation of TemperatureGAN for the rest of this section.
4.1.1. Q-Q envelopes
Q–Q plots are typically leveraged to examine the plausibility that two separate datasets come from the same distribution. They can also help discern the distribution quantiles for which uncertainty is higher.
For each of the Q–Q plot envelopes in Figure 9, within each period (labeled by the legends), TemperatureGAN is sampled 100 times (to generate 100 realizations), while the ground truth contains one realization (we can only observe a single realization). The plot elucidates a few things. A key observation across all seasons is that the bulk of the distribution produces a tight envelope around the constant (black) line, while the envelope at the tails is wider. This is not too surprising: observations at the tails are sparse, so the spread around the constant line is wider there. The empirical evidence also suggests that the generated temperature maps are bounded, which is important for representing plausible physical temperature states. Samples generated from TemperatureGAN may be leveraged to expand the set of realizations that are plausible within a given region, which can aid robust planning by energy transition agencies. More Q–Q envelopes are displayed in Appendix B.4.
4.1.2. Baseline
Because we group $ {1}^{\circ}\times {1}^{\circ } $ areas as one region $ R $ , the spatial temperature patterns are not critical for analyzing the overall effects on that region, but the temperature distributions are. However, for capturing more granular, local effects within a region, the spatial patterns become important. For evaluating the spatial representation, we therefore propose a baseline model. Because daily temperatures typically exhibit cyclical (diurnal) patterns, we assume that for a given hour of the day, the temperatures within a ( $ {1}^{\circ}\times {1}^{\circ } $ ) region follow a normal distribution; that is, $ T\left(M,R,k,t\right)\sim \mathcal{N}\left(\mu, \sigma \right) $ , where $ M,R,k,t $ represent month, region, period, and hour-of-day, respectively. For each hour, we compute the empirical spatial means and variances of the ground-truth data. Thus, we have 24-dimensional mean and standard deviation vectors $ {\hat{\mu}}_{\mathrm{S}},{\hat{\sigma}}_{\mathrm{S}}\in {\mathcal{R}}^{24} $ , and can then generate multiple examples using these statistics. This yields a fairly simple model that generates temperature maps quickly. This baseline is compared to TemperatureGAN in the following section.
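A sketch of this baseline under those assumptions (the array shapes and the pooling of statistics over days as well as pixels are illustrative choices, not taken from the paper's code):

```python
import numpy as np

def fit_hourly_baseline(truth: np.ndarray):
    """Fit the Gaussian baseline B from ground-truth maps of shape (n_days, 24, 8, 8).

    For each hour of day, the mean and standard deviation are computed over days and
    pixels, giving the 24-dimensional vectors (mu_S, sigma_S).
    """
    return truth.mean(axis=(0, 2, 3)), truth.std(axis=(0, 2, 3))

def sample_baseline(mu_s: np.ndarray, sigma_s: np.ndarray, n_days: int, rng=None) -> np.ndarray:
    """Draw n_days independent 24 x 8 x 8 maps with pixels i.i.d. Gaussian per hour."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(mu_s[None, :, None, None], sigma_s[None, :, None, None],
                      size=(n_days, 24, 8, 8))
```

Because every pixel is drawn independently, this baseline can reproduce the hourly temperature distribution but carries no spatial structure, which is what SPAC'D (next section) is designed to expose.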
4.1.3. Spatial pixel-wise average correlation distance (SPAC’D)
We introduce SPAC’D (pronounced “spaced”). To evaluate the spatial representation, we leverage the idea of covariance. Specifically, we calculate the L1-norm distance between the Pearson product–moment correlation coefficient matrices (Benesty et al., Reference Benesty, Chen, Huang and Cohen2009) of the ground-truth and generated data samples. It is worth noting that the choice of the L1-norm is not arbitrary. We elect the L1-norm because it ensures the metric has a fixed range, making it suitable for frames of varying length and width; an intuitive description of SPAC’D is included in Appendix B.1. SPAC’D estimates how well the pixel-wise correlations (or relationships) in the ground truth are replicated by the generator. It has a range of $ \left[0,2\right] $ , with the quality of spatial representation increasing with decreasing value. Additionally, it is resolution-invariant, meaning the spatial resolution (or size) of the samples being evaluated does not alter its range. We define SPAC’D below with $ \overset{\sim }{T}\sim {\unicode{x2119}}_{\mathrm{g}} $ and $ T\sim {\unicode{x2119}}_{\mathrm{r}} $ , where the subscripts $ r $ and $ g $ represent the ground-truth and generated data, respectively.
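In symbols, a form consistent with this description, with $ \rho \left(T\right) $ and $ \rho \left(\overset{\sim }{T}\right) $ the $ N\times N $ pixel-wise correlation matrices of the ground-truth and generated samples, is

$$ \mathrm{SPAC'D}=\frac{1}{N}{\left\Vert \rho \left(T\right)-\rho \left(\overset{\sim }{T}\right)\right\Vert}_1, \qquad (4) $$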
where $ N=W\times D $ is the total number of pixels per frame, with W and D representing the number of pixels in the x and y directions. $ {\left\Vert \cdot \right\Vert}_1 $ , the matrix L1-norm, is the maximum sum of absolute values of the column vectors of a matrix. For any matrix $ A\in {\mathrm{\mathbb{R}}}^{m\times n} $ , the L1-norm is given by:

$$ {\left\Vert A\right\Vert}_1=\underset{1\le j\le n}{\max}\sum_{i=1}^m\left|{a}_{ij}\right|. \qquad (5) $$
$ \rho $ is the correlation coefficient matrix, with each entry given by Equation (6) below:

$$ {\rho}_{x,y}=\frac{\operatorname{cov}\left(x,y\right)}{{\sigma}_x{\sigma}_y}=\frac{\mathbb{E}\left[\left(x-{\mu}_x\right)\left(y-{\mu}_y\right)\right]}{{\sigma}_x{\sigma}_y}, \qquad (6) $$
where $ x $ and $ y $ represent the two features for which $ {\rho}_{x,y} $ is calculated.
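A compact NumPy sketch of this computation (pooling all hourly frames as rows of the sample matrix is an illustrative choice; function and variable names are not from the released code):

```python
import numpy as np

def spacd(truth: np.ndarray, generated: np.ndarray) -> float:
    """SPAC'D between two sets of frames of shape (n_frames, H, W).

    Each frame is unraveled into an N = H*W vector of pixel 'features'; the Pearson
    correlation matrices across frames are compared with the induced (max column
    sum) L1-norm and normalized by N, giving a value in [0, 2].
    """
    n_pixels = truth.shape[1] * truth.shape[2]
    rho_r = np.corrcoef(truth.reshape(len(truth), n_pixels), rowvar=False)       # (N, N)
    rho_g = np.corrcoef(generated.reshape(len(generated), n_pixels), rowvar=False)
    diff = np.abs(rho_r - rho_g)
    return float(diff.sum(axis=0).max() / n_pixels)  # max column sum, divided by N
```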
We call TemperatureGAN ‘ $ G $ ’ and the baseline ‘ $ B $ ’. We sample temperature maps $ {T}_G\sim G\left(z,M,R,k\right) $ and $ {T}_{\mathrm{B}}\sim B\left({\hat{\mu}}_{\mathrm{S}},{\hat{\sigma}}_{\mathrm{S}}\right) $ (where each sample $ {T}_{\mathrm{G},\mathrm{B}}\in {\mathcal{R}}^{24\times 8\times 8} $ ) from the GAN and baseline, respectively, and report this metric for different regions. Because we are introducing this metric for the first time in this regime, we cannot compare it against other existing models, but it offers a baseline for future comparison. The plots in Figure 10 show the SPAC’D values over training steps. The initial examples produced by the GAN display inferior SPAC’D values compared to the baseline; this is expected because, during the earlier stages of training, the parameters of the neural network are close to their randomly initialized values. Later in training (see the plots on the right and note the training steps), there is a significant decline (improvement) in the SPAC’D values. This shows that not only are the generated temperature ranges (distributions) accurate, but the generated spatial temperature fields also have realistic structure.
4.1.4. Fréchet daily-mean temperature distance (FDTD)
In addition to evaluating spatial correlations, it is important to evaluate the temperature values being generated. By visual inspection of Figures 6 and 7, observe that TemperatureGAN generates realistic temperatures for the given regions and months. However, we propose a metric to measure its performance. Daily mean temperatures are typically assumed to follow Gaussian distributions (Meehl et al., Reference Meehl, Karl, Easterling, Changnon, Pielke, Changnon, Evans, Groisman, Knutson and Kunkel2000). Thus, we compute the distance between the sufficient statistics of the ground-truth and generated data as parameterized by a Gaussian. For a certain region $ R $ , in a given month $ M $ and period $ k $ , the FDTD is calculated by taking the daily average temperature of every observation. Then, the central bulk of the data is fit by a normal distribution. The bulk is chosen as the 10th to 90th percentile daily mean temperatures from the ground-truth data, excluding the tails, as they usually admit a different functional form, typically characterized by generalized extreme value (GEV) or generalized Pareto (GPD) distributions. Thus, for a set of daily mean temperatures $ \overline{\mathbf{T}}=\left\{{\overline{T}}_1,{\overline{T}}_2,{\overline{T}}_3,\dots \right\} $ , we select a subset $ {\overline{\mathbf{T}}}_{\mathbf{bulk}}\subset \overline{\mathbf{T}} $ , fit a normal distribution to $ {\overline{\mathbf{T}}}_{\mathbf{bulk}} $ , and compute the distance of the sufficient statistics of the generated examples from those of the ground-truth data, as expressed in Equation (7).
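Written for the univariate Gaussian fits, a plausible form of Equation (7), consistent with the Fréchet distance between two normal distributions, is

$$ \mathrm{FDTD}=\sqrt{{\left({\mu}_{\mathrm{r}}-{\mu}_{\mathrm{g}}\right)}^2+{\left({\sigma}_{\mathrm{r}}-{\sigma}_{\mathrm{g}}\right)}^2}. \qquad (7) $$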
The subscripts $ r $ and $ g $ represent ground-truth and generated examples, respectively. $ {\mu}_{\mathrm{r}} $ and $ {\mu}_{\mathrm{g}} $ are ground-truth and generated data sample means respectively, and $ {\sigma}_{\mathrm{r}} $ and $ {\sigma}_{\mathrm{g}} $ are the ground-truth and generated data sample standard deviations, respectively. The results are reported in Tables 1–4, and B2–B4.
4.1.5. Temporal gradient distribution distance (TGDD)
For some applications, generating realistic hourly temperature maps can be useful; for example, power system reliability or capacity sufficiency studies may benefit from this (Panteli and Mancarella, Reference Panteli and Mancarella2015; Perera et al., Reference Perera, Nik, Chen, Scartezzini and Hong2020). Because we generate hourly spatial maps for any given day, we aim to estimate the integrity of the generated diurnal cycle. Visual inspection may be used to validate general cyclical patterns; however, we propose an approach to numerically estimate the model's performance. We obtain the distribution of temperature gradients by empirically estimating the cumulative distribution functions (CDFs) of the hourly temperature gradients $ \frac{\partial \mathbf{T}}{\partial \mathbf{t}} $ and $ \frac{\partial \tilde{\mathbf{T}}}{\partial \mathbf{t}} $ , where $ \boldsymbol{T}\sim {\unicode{x2119}}_{\mathrm{r}} $ and $ \overset{\sim }{\boldsymbol{T}}\sim {\unicode{x2119}}_{\mathrm{g}} $ , and $ {\unicode{x2119}}_{\mathrm{r}} $ and $ {\unicode{x2119}}_{\mathrm{g}} $ represent the ground-truth and generated distributions (see Figure 11), respectively. To do this, we split the data samples into $ n=10 $ bins, estimate $ {D}_{\mathrm{JS}}\left(p\Big\Vert q\right) $ for each bin,
after which we compute the average across all bins to obtain the metric we call TGDD,
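a plausible rendering of which, with $ {P}_i $ and $ {Q}_i $ the empirical ground-truth and generated gradient distributions in bin $ i $ , is

$$ \mathrm{TGDD}=\frac{1}{n}\sum_{i=1}^n{D}_{\mathrm{JS}}\left({P}_i\Big\Vert {Q}_i\right), $$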
with a range $ \left(0,\ln 2\right) $ , and improving with decreasing value. We report the monthly TGDD in Table 5. Diurnal cycle patterns are displayed in Figure B1 in Appendix B for visual inspection.
4.1.6. Comparison with existing stochastic weather generator
We conclude our evaluations by comparing TemperatureGAN to WeaGETS, an existing single-site stochastic weather generator (SWG) (Chen et al., Reference Chen, Brissette and Leconte2010) similar to WGEN. We are limited in the breadth of this comparison, as WeaGETS only produces minimum and maximum daily temperatures for a given site (Figures 12 and 13). Because the ground truth is three-dimensional, we take the spatial average temperature of the $ {1}^{\circ}\times {1}^{\circ } $ region and obtain the maximum and minimum temperatures for each day. WeaGETS performs better than TemperatureGAN in this comparison. We posit that this is largely because the models were given fundamentally different tasks. TemperatureGAN was developed to generate high-dimensional (3D) samples, while WeaGETS generates minimum and maximum temperatures for a single site and is trained on one-dimensional data. WeaGETS cannot learn the temporal and spatial structure of disparate regions, which is important for more comprehensive energy systems assessments.
5. Conclusion
In this paper, we introduced TemperatureGAN, a data-driven generative model that efficiently produces spatial temperature maps of a given region, month, and period at an hourly temporal resolution. We proposed metrics to evaluate this model and showed that it can learn and reproduce historically observed regional spatiotemporal temperature dynamics.
We discussed four evaluation methods, each serving an important purpose in assessing the spatiotemporal integrity of the generated samples. Q–Q envelopes examine distributional agreement across quantiles. Introduced for the first time is SPAC’D, a bounded metric that measures the veracity of generated spatial fields. FDTD evaluates how well the model captures the distribution of daily average temperatures. TGDD scrutinizes temporal integrity. Collectively, these metrics show that the GAN reasonably captures conditional temperature distributions. As discussed in the text, these metrics can be adopted for evaluating other models/approaches in this regime.
While this work leveraged only historical temperature measurements, we recognize that other input data streams may be useful. This approach is independent of input from general circulation models (GCMs) or other exogenous anthropogenic inputs. However, retrospective runs of GCMs could provide the GAN with a useful signal to better capture trends over time. We also recognize that regenerating accurate spatial gradients for multiple regions by virtue of their relative positions alone is rather arbitrary and may be improved by including more physically meaningful priors for various regions, for example, topographic maps. Topographic maps at a finer spatial resolution than the base temperature data could be ingested by the GAN to improve the integrity of the spatial representation and, potentially, its resolution. Future work will leverage this.
Although TemperatureGAN can be used to generate regional temperatures, we caution against leveraging it for extreme value analysis for a few reasons. First, the model has not been rigorously evaluated in the regime of producing rare, extreme temperatures. Second, the 4-year period $ k $ used in training TemperatureGAN may not produce a sufficient number of temperature observations to accurately capture the distribution of extreme temperature values; because there are fewer samples, the parameters of extreme value models may have high variance. Further work can be done to improve and validate TemperatureGAN's performance at the tails of temperature distributions; however, the current model demonstrates promising characteristics that can be built on in future work. Finally, TemperatureGAN in its current state should not be applied to future climate, as it has not been designed to handle nonstationarity; we defer this to future developments of TemperatureGAN.
There are many downstream applications for models that can produce realistic insights into regional temperature events and their effects on communities, power distribution circuits, and more, especially in a quick and scalable fashion; TemperatureGAN provides a method to do this.
Acknowledgements
The authors would like to acknowledge Rob Buechler for his helpful feedback and suggestions throughout the project, and his assistance with initial data retrieval/processing. The authors would also like to thank Jehangir Amjad for helpful discussions and suggestions at the beginning of the project. The authors also thank Emily Gordon and Amina Ly for helpful discussions at the latter stages of revising this paper.
Author contributions
E.B.: Conceptualization, Methodology, Software, Investigation, Validation, Visualization, Formal analysis, Writing — original draft. R.R.: Supervision, Funding acquisition. A.M.: Supervision, Funding acquisition, Writing. All authors approved the final submitted draft.
Competing interest
The authors declare no competing interests exist.
Data availability statement
The original NLDAS-2 forcing data used in this paper is publicly available and can be accessed here. The modified data for training the model is publicly accessible on Mendeley Data. The model training repository and trained models have been made publicly available on GitHub.
Funding statement
Emmanuel Balogun was supported by the Stanford Data Science Scholarship and Chevron Energy Fellowship.
Ethics statement
The research meets all ethical guidelines, including adherence to the legal requirements of the study country.
Appendix
Appendix A. Model architecture
Tables A1–A3 describe the neural network architectures implemented in this paper.
In all models, the labels are mapped from $ {\mathrm{\mathbb{R}}}^{15} $ to a learned $ {\mathrm{\mathbb{R}}}^{100} $ embedding during training.
A.1. GAN conditioning
The month is represented as a discrete one-hot vector, which means we cannot continuously vary its representation to interpolate between months. This is also an intuitive decision: the number of months in a year does not change, so it is convenient to think of each month as one of 12 classes. It can be desirable to generate temperatures for a given month because downstream impacts may also depend on the month/time of year. One example is grid reliability or resilience studies. Energy usage patterns are typically seasonal and can change depending on the month; thus, the impacts of temperatures can be much more severe during months when demand is also high. Additionally, the type of demand can be critical, for example, HVAC versus EV charging, or a mix of both. Being able to understand times of the year that pose higher risk can be particularly desirable with the proliferation of long-duration or seasonal storage and DERs. Energy transition planners or utilities can estimate potential risks for each month and strategically plan resources accordingly. We include month as a conditional variable because months are universal and their number each year is fixed, whereas seasons are region-dependent; intuitively, this makes the model more generalizable. Additionally, we carried out experiments using seasons (both continuous and discrete) as a conditional variable and found that the model did not perform as well as with the one-hot encoded month variable. We also note that once we are able to model each month accurately, it is much easier to aggregate the different months (see Figure B6) into desired seasons (barring some slight inaccuracies due to the smoothness of seasonal transitions). The converse is much harder, if not impossible; that is, once we aggregate certain months during training, we lose information about month-to-month variations that may exist, and this is very challenging to recover.
The region is represented as a dual-axis variable (X, Y), representing its relative position from the specified origin, which is selected to be the southwest (SW) corner of the dataset. The GAN is conditioned on these relative positions during training. A depiction of this is shown in Figure A1.
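As an illustrative sketch of how these labels could compose the $ {\mathrm{\mathbb{R}}}^{15} $ conditioning vector mentioned above (the 12 + 2 + 1 split across one-hot month, region axes, and period index, as well as the ordering and scaling, are assumptions, not taken from the released code):

```python
import numpy as np

def make_label(month: int, region_xy: tuple[int, int], period: int) -> np.ndarray:
    """Assemble a 15-dimensional conditioning vector:
    [12-dim one-hot month | region X | region Y | period index]."""
    one_hot = np.zeros(12)
    one_hot[month - 1] = 1.0          # months are 1..12
    x, y = region_xy                  # integer offsets from the SW corner, with (1, 1) as origin
    return np.concatenate([one_hot, [float(x), float(y), float(period)]])

# Example: June, region (3, 2), period k_5 -> a vector in R^15 that the GAN
# subsequently maps to a learned R^100 embedding.
label = make_label(month=6, region_xy=(3, 2), period=5)
```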
Appendix B. Model evaluation
B.1. SPAC’D
In the main text, we briefly touched on a key idea of this metric and the rationale for choosing the L1-norm as the preferred distance measure. We now discuss the details of the implementation. As shown in Figure A2, each video frame is unraveled into a vector whose length is equal to the total number of pixels within the frame (here we have 64 pixels per $ {1}^{\circ}\times {1}^{\circ } $ region). Each pixel can be viewed as a feature within the sample, and the goal is to calculate the Pearson product–moment correlation coefficient (PPCC) matrices for each month and then calculate the distance between these matrices for the ground-truth and generated data. The choice of the L1-norm was not arbitrary but a natural choice for this distance measure. If the Frobenius norm distance were used, then SPAC’D would no longer be implicitly bounded by a fixed interval. Because the PPCC takes on values between −1 and 1, the maximum difference between any two PPCC entries is 2. The Frobenius norm of a matrix X is

$$ {\left\Vert X\right\Vert}_{\mathrm{F}}=\sqrt{\sum_i\sum_j{x}_{ij}^2}, $$
implying that, for a given set of samples, the maximum Frobenius distance over all the pixels is $ 2N $ , because the maximum distance per pixel is 2. This means that the average per-pixel distance depends on N, because the PPCC matrix has shape $ N\times N $ , where $ N $ is the total number of pixels per sample. Therefore, using the Frobenius norm, the maximum average distance between the PPCC matrices is $ \frac{\sqrt{2^2\times {N}^2}}{N^2}=\frac{2N}{N^2}=\frac{2}{N} $ , making it size-variant. The L1-norm does not suffer from this, making it fairly straightforward to implement, as the metric remains bounded in [0, 2] for any $ N $ .
B.2. Temporal Integrity
For many practical purposes, visual inspection may be sufficient for evaluating the validity of time series generation, but for model comparisons and evaluation against a baseline, a standard process for comparison is essential (Figures B2–B5, B9, B10 and Tables B1, B3, B4).
The table above shows the temporal gradient distribution values from the real and generated datasets for the period 1979 to 1982. We observe that training with one discriminator suffices for the GAN to learn the overall temperature distributions, but the samples produced have temporal gradients that are not as true to the real data distribution, though the overall diurnal cycle patterns are captured. Given this observation, we explored two variants for training the GAN. In the first variant, which we call ExWGAN-TGP, we add to the cost function a temporal gradient penalty, which yields the generator (G) cost function:
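A plausible rendering of this cost, with $ {L}_G $ the generator loss in Equation (3) and consistent with the description below (the exact form in the original may differ), is

$$ {L}_G^{\mathrm{TGP}}={L}_G+{\lambda}_{\mathrm{TP}}{\left\Vert \frac{{\tilde{\boldsymbol{T}}}_{t+\Delta t}-{\tilde{\boldsymbol{T}}}_t}{\Delta t}\right\Vert}_{\mathrm{F}}, $$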
where $ \Delta t=1 $ hour for this work. $ {\lambda}_{\mathrm{TP}} $ represents the hyperparameter that can be adjusted to penalize the temporal gradients directly. The second term in the G cost above is the Frobenius norm of the 3D temporal gradient matrix. The plots below show diurnal cycles without and with the temporal penalty.
The bold values represent values from the generator with temporal penalty.
B.3. FDTD tables
B.4. More Q–Q envelopes
B.5. Sampled distributions