Impact Statement
Our application paper offers a substantial research contribution at the interface of statistical physics, environmental sciences, geography, and data analytics. It addresses the complex dynamics of air pollution in Europe, analyzing, in particular, the tails of the observed PDFs which describe high-pollution events. Our work enables a better understanding of the time-varying dynamics of air pollution, which is essential for policy formulation and the construction of suitable stochastic models, as well as for analyzing the medical consequences of exposure to polluted air.
1. Introduction
Outdoor air pollution is estimated to cause 4.2 million premature deaths per year worldwide, according to estimates of the WHO, and is thus a major killer. It also causes many diseases on a variety of time scales World Health Organization (2020); Shah et al. (Reference Shah2015). There are highly detrimental effects on the environment, as air pollution affects vegetation and natural ecosystems and plays a role in climate change Ortiz et al. (Reference Ortiz, Guerreiro and Soares2020). In all densely populated regions of the word, it represents a major environmental health risk Health Effects Institute (2019); World Health Organization et al. (2018); GBD 2016 Risk Factors Collaborators et al. (2017), impacting human health, ecosystems, climate, infrastructure, and the economy Ortiz et al. (Reference Ortiz, Guerreiro and Soares2020). There are many different types of air pollutants and they can interact in a complex way. Two air pollutants, namely particulate matter ( $ PM $ ) and nitrogen oxides ( $ {NO}_x $ ), pose a considerable threat to the health of citizens. In 2018, about 55000 and 417000 premature deaths in 41 European countries were attributed to $ {NO}_2 $ and $ {PM}_{2.5} $ , respectively Ortiz et al. (Reference Ortiz, Guerreiro and Soares2020). In this contribution to Climate Informatics 2024, we focus on $ {NO}_x $ and $ PM $ , but our statistical data analytics methods can be applied to other substances as well.
The impact of air pollution on health does not only depend on the pollutant type but also on the type of surrounding area. The EU European Parliament (2011) uses environmental area types to classify air quality monitoring sites into traffic, industrial, background, urban, suburban, and rural, based on predominant emission sources and building density. From an air quality policy perspective, this allows for evaluating the effectiveness of measures targeting specific emissions sectors, assessing the impact of those pollutants which dominate the area surrounding a given monitoring station.
A lot of air pollution research concentrates on mean values, but having a thorough understanding of the time-varying statistics, i.e., of the entire probability density function (PDF) of air pollution concentrations is crucial for policymakers involved in defining thresholds or reducing overall exposure to air pollution by suitable mitigation measures. It is also crucial for the construction of suitable statistical physics-based models using stochastic methods. PDFs such as gamma, log-normal, and Weibull distributions. Hsin-Chung Lu (Reference Lu2003) has been widely used for fitting air pollutant concentration data. However, these distributions decay approximately like exponential functions at large values, while previous investigations have found heavy tails in air pollution statistics Williams et al. (Reference Williams, Schäfer and Beck2020); He et al., (Reference He, Schäfer and Beck2022), which are not well-described by the above distributions.
Previous papers Giri et al. (Reference Giri, Schäfer and Verma2023); Baldasano (Reference Baldasano2020); Environmental Research Group et al. (2020) have also explored the COVID-19 lockdown effects on air quality (for example, in megacities such as London and Delhi), focusing on comparing the PDFs or given moments of the PDFs before and during the lockdown. Superstatistical methods, originating from turbulence modeling Beck (Reference Beck2007) and applied to many fields Metzler (Reference Metzler2020); Beck et al. (Reference Beck, Benedek, Livadiotis, Rapisarda, Tirnakli and Tsallis2020); Ourabah (Reference Ourabah2024), offer a powerful effective approach to describe the dynamics of air pollution. These types of models are based on the assumption of well-separated time scales Williams et al. (Reference Williams, Schäfer and Beck2020). Air pollutants such as $ {NO}_x $ have been described successfully using a superstatistical approach, taking into account nonequilibrium situations with fluctuating variance parameters Williams et al. (Reference Williams, Schäfer and Beck2020). However, the approach in that paper was verified for limited data sets only, chosen from the UK (London), and also only for a limited set of pollutants, namely $ NO $ and $ {NO}_2 $ only. For European data, recent progress was made by He et al., (Reference He, Schäfer and Beck2022), where it was shown that many measured air pollution time series data can be generically understood and modeled by superstatistical methods. In this conference proceedings, we follow up from that approach. We consider a very large set of measured time series data in Europe (3544 locations), extract the best-fitting parameters at each spatial location, and then plot a map of Europe where the best-fitting parameters are visualized at each point. The results are quite useful to know as they provide some information on what type of power law exponent is to be expected at a given location, thus describing the risk of high-pollution situations occurring at that location.
This paper is organized as follows. We first describe the data set considered and briefly comment on our method of finding the best-fitting parameters. Some relevant background on superstatitical methods and $ q $ -exponential functions will be given. We then present our main result on air pollution statistics in Europe in the form of a visual display of the map of Europe with the given power-law statistics parameters that we obtained in our detailed analysis. Finally, we discuss the observed patterns and spatial correlations.
2. Methods
2.1. The data set considered
We use air quality monitoring data from a large number of locations in Europe through the interface “Saqgetr”, which is an R package available on the Comprehensive R Archive Network (CRAN) Grange (Reference Grange2019). Most of the data are openly Available at the European Commission’s Airbase and air quality e-reporting (AQER) repositories European Parliament, Council of the European Union (2008); European Commission (2011). We also import surface meteorological data from NOAA ISD Smith et al. (Reference Smith, Lott and Vose2011) via the “worldmet” Carslaw (Reference Carslaw2017) R package. Initially, we started with 9698 different locations and used measured data from January 2017 to December 2021, recorded at 1-hour intervals. However, with certain selection criteria applied on the quality of the measured data, this reduces to a set containing 3544 sites. Our selection criteria are described in detail in He (Reference He2024). A main criterium was that at least one year of measured data should be available. Our interest is in the behavior of the tails of the PDFs, as illustrated in Figure 1. These tails correspond to high pollution states and are most damaging to health. The measuring stations are divided into three categories: traffic, industrial, and background based on predominant emission sources; the surrounding areas are classified as urban, suburban, or rural based on the building density. Measuring station types are combined with area types to provide an overall station classification, and we analyze our data conditioned on this station classification.
2.2. Fitting with q-exponentials
In He et al., (Reference He, Schäfer and Beck2022), evidence was presented that generic air pollution PDFs often decay as a power law. To extract the power law exponent, it is best to apply a fitting approach based on $ q $ -exponential functions. These functions asymptotically approach a power law (with an exponent of $ -1/\left(q-1\right) $ if $ q>1 $ ) and also take into account the next-to-leading order terms in the asymptotics. The form of $ q $ -exponentials is motivated by generalized versions of statistical mechanics based on nonadditive entropies Tsallis (Reference Tsallis2009), as well as on superstatistical approaches Beck and Cohen (Reference Beck and Cohen2003); Beck et al. (Reference Beck, Cohen and Swinney2005). The reason for the occurrence of $ q $ -exponentials is that exponentials with varying rate parameters, when integrated over the probability density of the rate parameter, in leading order lead to $ q $ -exponentials. The functional form of the $ q $ -exponential is given by
where $ q $ is the entropic index Tsallis (Reference Tsallis2009); Hanel et al. (Reference Hanel, Thurner and Gell-Mann2011); Jizba and Korbel (Reference Jizba and Korbel2019), $ \lambda $ is a positive width parameter and $ x $ , in our case, denotes the air pollutant concentration. Eq. (1) contains the exponential distribution as a special case, namely that is obtained in the limit $ q\to 1 $ , as the $ q $ -exponential function, defined as $ {e}_q(x)={\left[1+\left(1-q\right)x\right]}^{\frac{1}{1-q}} $ , converges to the exponential function in the limit $ q\to 1 $ . For $ q<1 $ , $ {f}_{q,\lambda }(x) $ is also well defined but in this case lives on a finite support and becomes exactly zero above a critical value x, since, by definition, $ {e}_q(x)=0 $ for $ 1-\lambda \left(1-q\right)x<0 $ . In contrast, if $ q>1 $ , $ 1-\lambda \left(1-q\right)x>0 $ , then Eq. (1) exhibits power-law asymptotic behavior. Both cases are relevant for air pollution fits.
The occurrence of $ q $ -exponentials with $ q>1 $ in PDFs of complex systems has been previously explained as originating from a superstatistical dynamics Beck and Cohen (Reference Beck and Cohen2003); Beck, Cohen, and Swinney (Reference Beck, Cohen and Swinney2005); Ourabah (Reference Ourabah2024). In these types of models, one assumes a temporally fluctuating parameter $ \hat{\lambda} $ for local exponential distributions $ \sim {e}^{-\hat{\lambda}x} $ . This means that in a sense the parameters of the system are random variables as well, but they change on a much larger time scale. These variations of $ \hat{\lambda} $ take place on a long-time scale, which is much longer than the local relaxation time to equilibrium. Physically, one may think of this as describing changing weather conditions (e.g. changes in the wind pattern) at the measuring site. The marginal distribution, obtained by integration over all possible values of $ \hat{\lambda} $ and describing the long-term behavior of the air pollution concentration dynamics, is then a $ q $ -exponential, with
Here, $ \left\langle \cdots \right\rangle $ denotes the expectation with respect to the PDF of $ \hat{\lambda} $ , see Beck and Cohen (Reference Beck and Cohen2003) for more details. Strictly speaking, a $ q $ -exponential is only obtained exactly if $ \hat{\lambda} $ is $ \Gamma $ -distributed, but the general idea of superstatistics is that a parameter $ q $ can be defined by Eq. (2) for more general distributions different than the $ \Gamma $ distribution as well.
In our automatically applied fitting procedure for the 3544 measuring stations, we obtain the optimal fitting parameters via MLE, i.e. maximizing the likelihood under the assumption that the data originates from the assumed $ q $ -exponential distribution. One may ask whether other distributions, such as the previously studied lognormal distribution, Weibull distribution, or gamma distribution might yield equally good fits as the $ q $ -exponential distribution. To illustrate the goodness of fit, we compare all these functions with the measured histogram for $ NO $ data in Figure 2 for an example site (Barnsley Gawber in the UK). Clearly, the $ q $ -exponential fit yields the best fit of the tails of the distribution. It also yields the largest log-likelihood value compared to the other distributions. We systematically compared the goodness-of-fit using maximum likelihood estimation to determine the best fitting parameters and hence computed the likelihood for each fitting function. Overall, we find that $ q $ -exponentials generate the highest log-likelihoods for all four pollutants, when averaging over all sites. Hence, we apply $ q $ -exponentials as our main method for analyzing the tails of air pollution statistics. See also github He (Reference He2024) for full technical details.
2.3. Extracting the long superstatistical time scale $ T $
Utilizing the $ q $ -exponential distribution for fitting the PDFs opens the option to identify a long time scale $ T $ , as discussed in the superstatistics approach Beck and Cohen (Reference Beck and Cohen2003); Beck et al. (Reference Beck, Cohen and Swinney2005). In this view, we assume that on time scales shorter than $ T $ , we observe random samples from a simple distribution, which we assume to be exponential distributions here. Meanwhile, the overall complex time series, characterized frequently by heavy-tailed probability distributions, is seen as a superposition of several of these simpler, local distributions.
When considering shorter time slices, if the local time series approximately follows an exponential distribution, the kurtosis for a local snapshot should be $ {K}_{\mathrm{exp}}=9 $ . To determine $ T $ , we compute the local average kurtosis Beck et al. (Reference Beck, Cohen and Swinney2005) as follows:
where $ {t}_{max} $ is the length of the time series and $ {\left\langle \right\rangle}_{t_0,\Delta t} $ denotes the expectation for the time slice of length $ \Delta t $ starting at $ {t}_0 $ . The long-time scale $ T $ is then defined by the condition $ \overline{\kappa}(T)={K}_{exp} $ . In this case, the average kurtosis of windows of length $ T $ has a kurtosis $ {K}_{exp}=9 $ . Once $ T $ is determined, the time series can be divided into multiple snapshots, each spanning a length of $ T $ . This provides a collection of slices with approximately local exponential distributions, each with a distinct decay parameter $ \hat{\lambda} $ that fluctuates on the time scale $ T $ .
3. Results
Figure 3 shows the best-fitting parameters $ q $ and $ \lambda $ for the tails of the observed PDFs at all 3544 sites evaluated. As mentioned above, it is natural to expect $ q $ -exponential distributions as these simply arise from the agglomeration of many exponential distributions that have temporal fluctuations of the effective decay rate.
What is noticeable is the fact that one observes an immensely large range of values of the parameter $ \lambda $ for the best-fitting $ q $ -exponential as given in Eq. (1) for the various measuring stations. Note the logarithmic scale of the plots. The parameter $ \lambda $ can take on values as small as $ {10}^{-2} $ up to values as large as $ {10}^2 $ , which spans four orders of magnitude. Typical $ q $ -values are in the range $ 0.8-1.4 $ , but there are subtle differences between the various substances, with $ NO $ reaching large $ q $ -values such as 1.6, and $ {NO}_2 $ reaching small $ q $ -values such as $ 0.6 $ in the scattering plots. Figure 3, panels a and b, indicate that roughly $ q $ grows linearly with $ \log \lambda $ for $ {NO}_x $ . Traffic areas are clustered on the left, while urban/suburban background points are in the middle part, and rural background points are scattered widely at the right. The PDF decay rate increases as $ \lambda $ increases from highly polluted urban traffic sites to less polluted rural background areas. For $ {PM}_{2.5} $ and $ {PM}_{10} $ , there is a slightly different statistical relationship between $ q $ values and $ \lambda $ , as can be seen in Figure 3 panels c and d. The urban points approach small $ \lambda $ values such as $ 0.01 $ for $ {PM}_{10} $ , and suburban/rural traffic, suburban/rural industrial, and rural background points cluster at the right-hand side with large $ \lambda $ . The observed range of $ \lambda $ values is smaller as compared to the case of $ {NO}_x $ , and the shape of possible values $ \left(q,\lambda \right) $ as displayed in the figure is more spherical. The stronger color mixing for $ PM $ may be due to the fact that by air movement there is transport of the pollutants, hence the distributions cannot be uniquely identified with the original environmental types where the $ PM $ -particles were produced.
So far we conditioned the observed distributions on the environmental type, such as high-traffic region, industrial region, or rural area. This leads to a pretty mixed pattern of states as displayed in Figure 4. However, it is useful to keep the spatial information where the actual measurements were done. Thus, the idea in the following is to plot an air pollution map of Europe where the color encodes the value of the best-fitting fitting parameters $ \left(q,\lambda \right) $ relevant in that given region of Europe. This map is shown in Figure 5. The novel approach here is that we do not only consider average values of pollution concentrations as observed in a given region but encode information on the entire probability distribution. The parameter $ q $ describes the power-law tails of the probability densities, which decay with an exponent given by $ -1/\left(q-1\right) $ . Thus, $ q $ is related to the frequency of occurrence of extreme events describing high-pollution states.
The $ q $ values for $ {NO}_2 $ are smaller in Germany and the UK as compared to other countries, in fact below 1 (green colors), which indicates there are no heavy tails. Since the primary emission source of $ {NO}_x $ is fuel combustion, this pattern suggests that these regions cope better with extreme situations and have a decreased susceptibility to extreme $ {NO}_x $ events. The $ q $ values for $ PM $ pollutants exhibit a slightly different spatial pattern, with lower values (green) in the Western regions and higher values (red) in the East. The $ \lambda $ values in Central Europe and the UK are smaller compared to those in other European regions, indicating a higher average pollution, in particular for $ NO $ . As for the spatial distribution of $ \lambda $ values for $ PM $ pollutants, there is a slight eastward shift in the clustering of red dots, suggesting a higher pollution in Eastern Europe.
Note that the average concentration of pollutants is proportional to $ {\lambda}^{-1} $ . This means the smaller $ \lambda $ the bigger is the average air pollution concentration in that region. For this reason, we have color-coded smaller values of $ \lambda $ in red and bigger values in blue. Some of the Scandinavian countries appear to host a couple of blue dots, whereas in Turkey there are big concentrations of red dots, in particular for $ {PM}_{10} $ . For the $ q $ -values, describing extreme events, the tail in the distribution is more pronounced and the bigger $ q $ . Some Eastern European countries exhibit particularly heavy tails, as indicated by red colors in the $ q $ panels. On the other hand, if $ q $ is smaller than 1 then the distributions live on a finite support, and there are no heavy tails. This case is indicated by green colors in the $ q $ -plane. Some regions of Germany and Western Europe exhibit predominantly green colors.
Figure 5 displays the superstatistical time scale $ T $ as extracted for the four pollutants, indicating distinct temporal dynamics. $ NO $ overall has a majority of shorter time scales $ T $ , typically less than a week. $ {NO}_2 $ , on the other hand, comprises higher values of time scales $ T $ , mostly ranging between a week to a year. This difference could be explained by the fact that $ NO $ is highly reactive and tends to quickly oxidize to form $ {NO}_2 $ in the presence of oxygen. This rapid transformation means that $ NO $ concentrations can change swiftly over short periods. The extracted time scales for $ {PM}_{2.5} $ are of similar order of magnitude as for $ {NO}_2 $ , whereas the extracted time scales for the $ {PM}_{10} $ dynamics appear to be smaller than for $ {PM}_{2.5} $ on average.
4. Discussion
In this paper, we applied data analytics tools to better understand the probability densities of measured air pollution time series. In particular, we were interested in the tail behavior. The parameters $ q $ and $ \lambda $ , as obtained from the optimum fittings of data from 3544 measuring stations in Europe, exhibit interesting patterns in the $ \left(q,\lambda \right) $ plane. They also exhibit interesting patterns when associated with a map of Europe. There are significant differences between the $ NO $ , $ {NO}_2 $ , and the $ PM $ statistics.
The shape of scattered $ \left(q,\lambda \right) $ values in the plane depends on the local characteristics, e.g. whether the measurements are done in a rural area or densely populated area. There is a typical range of parameter values for a given class of this type. We observe an immensely large range of values of the parameter $ \lambda $ for the various measuring stations.
We also observe a significant spatial heterogeneity in the best-fitting shape parameters of the PDFs. Eastern Europe shows a tendency for more severe $ PM $ pollution events and higher average concentration exposure, described by higher $ q $ values and lower $ \lambda $ values, respectively. Northern Europe shows a tendency for lower air pollutant concentration, described by higher $ \lambda $ values, but is more susceptible to extreme events, described by higher $ q $ values. Additionally, there are strong variations of PDFs at local level as well. These disparities highlight the importance of tailored strategies for air pollution mitigation strategies in different areas. Recall that $ q $ contains information about the tail, i.e. extreme events and $ \lambda $ about the scale and thereby the mean pollution level, i.e. we distinguish thereby between regions with low and high average pollution and simultaneously regions with many or few extreme events.
We also extracted the long superstatistical time scales $ T $ for the measured time series for each of the 3544 measuring sites and each of the four pollutants considered, thus incorporating dynamical information that amends the static information contained in the PDFs. This information was displayed in the form of a map of Europe, too, and shows a variety of time scales relevant at different spatial locations. The observed wide range of time scales, in particular for $ {NO}_2 $ and $ {PM}_{2.5} $ , can be attributed to a variety of factors, including but not limited to fluctuating local meteorological conditions and anthropogenic activities.
Future research, for each given location taking into account the local circumstances, should help to gain a clearer systematic understanding of the relationship between meteorological and anthropogenic factors and the relevant time scales and $ \left(q,\lambda \right) $ -values of air pollutants. A main result of our investigation is strong local heterogeneity, i.e. the temporal dynamics of air pollution can be quite different if one proceeds from one local measurement station to the next. Different local wind conditions are expected to play an important role in this context.
Further statistical analysis of more detailed data at local level could support policymakers to produce more precise rules and thresholds for individual types of environmental conditions and meteorological conditions, taking into account fluctuations, extreme events, and variations of concentrations on different time scales.
For our statistical analysis, the different types of air pollutants have different data availability in the various geographical regions. For this reason, in the air pollution maps that we have produced in this paper, some regions are less frequently covered by points than others. We hope that in the future more measurements at further locations will become available, given the importance of the air pollution problem. Our analysis could also be extended to other substances, such as sulfur oxide, carbon oxide, and ozone, besides $ {NO}_x $ and $ PM $ . We stress again that the detailed description of the entire pollution PDF, including the exact behavior of the tails, seems critical here to better estimate the risks of very high pollution situations.
Open peer review
To view the open peer review materials for this article, please visit http://doi.org/10.1017/eds.2024.43.
Author contribution
Conceptualization - All authors; Methodology - All authors; Data curation - H.H., B.S. Data visualization - All authors; Writing original draft - All authors. All authors approved the final submitted draft.
Competing interest
The authors declare no competing interests.
Data availability statement
Data sets analyzed in the present study can be obtained via the saqgetr package from https://github.com/skgrange/saqgetr. The code to generate the figures in the paper, as well as the implementation of the method for the data sets used in the paper, are available at https://github.com/hurst0415/Spatial-heterogeneity-of-air-pollution-statistics.
Funding statement
We gratefully acknowledge funding from a 2024 QMUL ISPF grant awarded to C.B. and funding from the Helmholtz Association and the Networking Fund through Helmholtz AI and under grant no. VH-NG-1727 to B.S.
Provenance statement
This article is part of the Climate Informatics 2024 proceedings and was accepted in Environmental Data Science on the basis of the Climate Informatics peer review process.