Impact Statement
Today, we have the capability to continuously monitor a broad array of processes within the Earth system at high spatio-temporal resolutions. However, it is only through combined analysis that these data reveal their full potential in advancing Earth system research. Earth System Data Cubes (ESDCs) possess transformative potential in this regard, yet they are accompanied by significant challenges throughout their life cycle. This paper offers a detailed exploration of these challenges, highlighting the importance of rigorous ESDC analysis while warning against potentially misleading outcomes from naive applications.
1. Introduction
Humanity possesses the capability to observe and model the majority of Earth’s subsystems, generating vast amounts of data with unprecedented resolution, quality, and coverage (Simmons et al., 2016; Peng et al., 2021; Bauer et al., 2021a). The co-interpretation of these diverse datasets represents a unique opportunity for understanding the intricacies of the Earth system (Runge et al., 2019; Mahecha et al., 2020; Tuia et al., 2023). However, this wealth of heterogeneous data comes with substantial challenges. The sheer volume of data, characterised by variations in spatial and temporal resolution as well as data curation levels, coupled with the high complexity of processes encoded in these multi-dimensional datasets, renders conventional data processing and interpretation methods unsuitable (Boulton, 2018; Sudmanns et al., 2020).
Recognising the need for a simple yet robust data infrastructure to facilitate Earth system data interoperability led to the emergence of various data cube concepts (Nativi et al., 2017; Baumann et al., 2019; Giuliani et al., 2019; Kopp et al., 2019; Mahecha et al., 2020, and others). We refer to Earth System Data Cubes (ESDCs) as frameworks where diverse datasets are integrated into a unified, highly interoperable system, organised on a common spatio-temporal grid (a more formal definition is given in Section 2.1). The essence of ESDCs is to convert the vast array of Earth system data into readily accessible data streams, apt for a variety of Earth system research domains. Such frameworks are gaining widespread acceptance in Earth system research as a solution for managing complex Earth Observation (EO) data.
Given the simplicity of such structures, various initiatives have greatly enhanced the use of EO data derived from satellite remote sensing and other large-scale array data, such as climate model outputs. Initiatives building on an ESDC concept initially built their data cubes in hand-crafted ways (e.g. Mahecha et al., 2020; Estupiñan Suarez et al., 2021; Walther et al., 2022) or created systems supporting the on-demand generation of ESDCs (e.g. Appel and Pebesma, 2019; Killough, 2018; Schramm et al., 2021). Earth system data providers have invested tremendous efforts in compiling extensive data catalogues, which can be used for the development of further ESDCs. Notable examples of such catalogues are provided by Google Earth Engine (GEE, Gorelick et al., 2017), Microsoft Planetary Computer, or the Open Geospatial Data Catalogue of Amazon Web Services (AWS). Additionally, there is a constant effort to increase the adoption of ESDCs (including generation and analysis) within cloud environments (Zellner et al., 2024). Therefore, ESDCs can be efficiently generated and used in virtual laboratories, such as the DeepESDL (Brandt et al., 2023; Sturm, 2023), or the agricultural virtual lab.
This access to readily aligned Earth system data has facilitated addressing numerous Earth system research questions. For instance, researchers have employed linear and non-linear dimensionality reduction methods to generate global indicators for the terrestrial biosphere (Kraemer et al., 2020) and to uncover the main modes of variability of Earth system variables (Bueso et al., 2020), quantified the spatial dynamics of vegetation responses to ENSO in South America (Estupinan-Suarez et al., 2023), and gained major insights into Land Use and Cover Change (LUCC, Santos et al., 2019). Specifically, EO data cubes, or ESDCs comprising satellite remote sensing imagery, have been instrumental in applications such as learning the vegetation response to climate drivers using Recurrent Neural Network (RNN) architectures (Martinuzzi et al., 2023), quantifying drought legacy effects on gross primary production (Yu et al., 2022), and detecting spatio-temporal extreme events (Mahecha et al., 2017).
However, if the goal is for ESDCs to evolve and become sustainable data infrastructures, it is essential to develop robust ESDC life cycles. Considering the unique characteristics of ESDCs, we cannot merely apply existing research data life-cycle concepts; instead, we must identify and address the peculiarities specific to ESDCs. It is necessary to create opportunities for continuous improvement and to address current challenges by leveraging contemporary technological advancements, specifications, and research paradigms. For instance, data formats and sharing protocols must evolve to align with the current status of cloud-based technologies and standards, in accordance with the adoption of FAIR Open Science principles (Wilkinson et al., 2016). Moreover, transforming heterogeneous data into an analysis-ready format aligned with a multidimensional spatio-temporal grid is often complex and subject to application-specific variations (Giuliani et al., 2019; Zuefle et al., 2021).
The resulting data format, though straightforward and relatively easy to analyse, encompasses inherent complexities (Béjar et al., 2023). These intricacies necessitate careful consideration during data analysis, requiring a profound understanding of the nature of Earth system processes. Naive analyses based on ESDCs can potentially lead to misleading interpretations, as pointed out, e.g., by Meyer et al. (2018), Rußwurm et al. (2023), and Sweet et al. (2023). Common pitfalls include model performance inflation caused by spatio-temporal auto-correlation, biased sampling, and inaccurate spatial aggregations. It is only by adequately addressing these challenges that the full potential of ESDCs can be realised, aligning with the perspectives of various authors (Reichstein et al., 2019; Irrgang et al., 2021; Hsieh, 2022; Sun et al., 2022; Persello et al., 2022; Tuia et al., 2023). Topics widely discussed today include generative processes in Artificial Intelligence (AI) that could enable researchers to reconstruct unseen data (Rüttgers et al., 2019; Oyama et al., 2023). Another promising direction is the potential for making causal inferences solely from data (Runge et al., 2019; Krich et al., 2021; Christiansen et al., 2022; Camps-Valls et al., 2023). Also, integrating physical constraints and domain knowledge in the inference process can lead to more plausible semi-empirical predictions (Ilie et al., 2017; Karniadakis et al., 2021; Camps-Valls et al., 2021; Cortés-Andrés et al., 2022). Concurrently, advances in data processing and visualisation technologies not only enhance data exploration and analysis but also aid in disseminating research findings (Söchting et al., 2023).
This paper seeks to identify the challenges inherent in the complete ESDC life cycle while, at the same time, highlighting the potential to advance Earth system research through these data structures. The manuscript is organised as follows: Section 2 introduces the concept of ESDCs and their relationship to information-preserving systems for Earth system data. In Section 3, we elaborate on the ESDC life cycle, outlining the obstacles encountered during data processing and proposing pathways toward creating analysis-ready ESDCs. Section 4 explores the transformative possibilities stemming from contemporary AI advancements in Earth system research, while Section 5 cautions against the risks of uninformed ESDC analysis. Section 6 addresses the technical facets of manipulating ESDCs throughout their life cycle, offering insights into technologies that can streamline Earth system data processing. Lastly, in Section 7, we examine the challenges associated with data visualisation in the context of ESDCs. Through this paper, we aim to outline the complexities and opportunities associated with employing ESDCs, hopefully paving the way for advancements in Earth system research.
2. The Art of Data Cubes
Data cubes are renowned for their capacity to serve as multidimensional arrays of data, enabling the representation of values across various dimensions of interest within a specific domain. Specialised data cubes designed for analytical queries in database systems, such as Online Analytical Processing (OLAP, Chaudhuri and Dayal, 1997) cubes, have been integrated with Geographical Information System (GIS) databases to give rise to Spatial OLAP (SOLAP, Rivest et al., 2005) cubes. SOLAP infrastructures, although traditionally associated with vector data, are also available for raster data (Kasprzyk and Donnay, 2017). Database systems have proven effective in storing and managing Earth system data in the form of data cubes, exemplified by array database solutions like Rasdaman (Baumann et al., 1998). Additionally, data cube infrastructures can be employed to store indexed files (Killough, 2018), thus safeguarding the information that might otherwise be lost during data transformation processes, such as reprojection. Here, we rely on an interpretation of data cubes specifically tailored to tackle the vast volumes and interoperability challenges of Earth system data. We first explain the concept of ESDCs and then provide an overview of related information-preserving structures, namely image collections and information-preserving data cubes, showcasing how they interface with ESDCs.
2.1. What are Earth System Data Cubes (ESDCs)?
The concept of ESDCs was introduced along with the Earth System Data Lab (ESDL, Mahecha et al., 2020), an integrated data and analytical hub that aimed to unify multiple heterogeneous Earth system data streams into a standard data model with a unique Coordinate Reference System (CRS). ESDCs represent multidimensional data structures designed to facilitate streamlined access, analysis, and manipulation of Earth system data. ESDCs comprise labels as dimensions defining the cube’s axes, an array of grids with their associated coordinate values distributed along these dimensions, and univariate data associated with each grid cell. Furthermore, in this paper, we add a new component: a suite of attributes that characterise the data, the dimensions, and the complete ESDC entity.
The dimensions are a set of labels describing the axes of the ESDC. Generally, these dimensions comprise space (e.g. “x” and “y”), time, and variables. Nevertheless, further dimensions can be added (e.g. “pressure levels”, “model ensembles”, or “time series components”). It is crucial to emphasise that while ESDCs conventionally incorporate spatial and temporal dimensions (e.g. latitude, longitude, and time), they are not confined to this paradigm (cf. Table 1 of Mahecha et al., 2020). ESDCs can exhibit different dimensions, and the number of dimensions is called the order of the ESDC. The order thus quantifies the dimensional complexity of an ESDC (e.g. a univariate ESDC on a spatio-temporal grid has an order of 3, while a multivariate ESDC on the same grid has an order of 4).
The grouping of grids consists of discrete subsets derived from the domain of each dimension’s axis. The values of these subsets are referred to as coordinates, and, in the case of a regular grid, they determine the data’s resolution along that specific dimension. For instance, a grid determining a resolution of 0.5 degrees for the “latitude” dimension in a global ESDC may have a set of coordinates $\mathrm{grid}(\mathrm{latitude}) = \{-89.75, -89.25, \dots, 89.25, 89.75\}$. While coordinates are often associated with numerical values (e.g. latitudes and longitudes), they can encompass a wide range of values, for instance, timestamps in a “time” dimension with a set of coordinates $\mathrm{grid}(\mathrm{time}) = \{\text{"2022-01-01"}, \text{"2022-01-02"}, \dots, \text{"2022-12-31"}\}$, or components derived from a time series decomposition approach in a “component” dimension with a set of coordinates $\mathrm{grid}(\mathrm{component}) = \{\text{"raw"}, \text{"trend"}, \text{"seasonal"}, \text{"residual"}\}$. The grids within an ESDC exhibit the following characteristics: 1) in the case of spatial dimensions, they reference the same CRS, 2) the coordinates within a grid share identical units, and 3) they must consist of at least two coordinates; otherwise, the dimension (and consequently the grid) is omitted. It is important to note that, given these properties, irregular grids are also possible, with the temporal dimension grid being a typical example in EO data due to the irregular revisit times of some satellite missions (e.g. Sentinel-2).
The array of data represents scalar values corresponding to each grid cell. Typically, the data spans from observed measurements to modelled values. Nevertheless, one can also encounter higher-order features (i.e. data derived from operations performed on the original values), such as outcomes from time series decomposition or AI-generated products. Furthermore, flag values, which delineate data status, can be incorporated. Cells without data are denoted as “NA” (not available).
The collection of attributes comprises a series of key-value objects that provide additional details about the data. These objects serve as metadata and can offer descriptions ranging from individual variables (including their associated dimensions) to the entire ESDC. The information contained within these attributes typically encompasses a wide range of elements, including, but not limited to, names, acronyms, units, flag definitions, versions, and source details.
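To make these components concrete, the following minimal sketch constructs an ESDC in Python with the xarray library (on which the cited ESDL and DeepESDL ecosystems build); the variable name, grid, and attribute values are purely illustrative.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Dimensions and grids: a regular 0.5-degree global grid and a monthly time axis.
lat = np.arange(-89.75, 90, 0.5)    # grid(latitude) = {-89.75, ..., 89.75}
lon = np.arange(-179.75, 180, 0.5)
time = pd.date_range("2022-01-01", "2022-12-31", freq="MS")

# Array of data: one scalar value per grid cell; NaN encodes "NA" cells.
lst = xr.DataArray(
    np.full((time.size, lat.size, lon.size), np.nan, dtype="float32"),
    dims=("time", "lat", "lon"),
    coords={"time": time, "lat": lat, "lon": lon},
    attrs={"long_name": "land surface temperature", "units": "K"},
)

# The ESDC entity: a multivariate cube (order 4: variable, time, lat, lon)
# carrying cube-level attributes as key-value metadata.
esdc = xr.Dataset(
    {"lst": lst},
    attrs={"title": "Minimal example ESDC", "crs": "EPSG:4326", "version": "0.1.0"},
)
```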
2.2. Relation of ESDCs to Image Collections and Data Cubes
Earth system data often exhibit heterogeneity and irregularity, particularly within EO data. Variability can manifest in different spatial resolutions, time units, projections, formats, and more, sometimes even within the same data product. Consequently, two robust approaches to retaining data integrity without succumbing to information loss due to transformations (e.g. reprojection, reduction, and resampling) are to utilise image collections (refer to Appel and Pebesma, 2019 for a comprehensive distinction between image collections and conventional data cubes) or to adopt a process where original files are stored and indexed within an information-preserving data cube infrastructure based on their file metadata (Figure 1). In the latter approach, the original files can be stored locally or in the cloud while preserving the essential information intact.
A successful example of image collections is the GEE Catalogue. This extensive, multi-petabyte catalogue stores data in tiled images, where each image may encompass multiple bands, thereby preserving essential information. Furthermore, these images can be organised into an image collection if they share relevancy. GEE also offers the computational resources necessary for accessing and analysing their catalogued data. Within GEE, data cube-like operations can be seamlessly executed through dynamic on-the-fly reprojection, resampling, and reduction for the tiles where a subset of pixels was explicitly requested. It is worth noting, however, that users are required to conform to the specific Application Programming Interfaces (APIs) provided by GEE for processing and analysing the data effectively.
Standardising image collections and their access brings simplicity and promotes data usage across platforms. Currently, a widely recognised standard is the Spatio-Temporal Assets Catalog (STAC) specification. This specification empowers users to query data assets based on metadata and spatio-temporal criteria. Coupled with domain-specific API clients available for multiple programming languages (cf. Section 6.2) and GIS software (e.g. QGIS STAC Plugin), users can easily retrieve data. The flexibility of the STAC specification has prompted numerous data providers to adopt it for creating their own data catalogues, with notable examples including the Microsoft Planetary Computer Catalogue and the United States Geological Survey (USGS) Landsat Archive Catalogue (stored in the Amazon Simple Storage Service, S3).
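As a minimal illustration of such a metadata query, the sketch below uses the pystac-client Python library against the Planetary Computer STAC API; the collection identifier, bounding box, and cloud-cover threshold are illustrative choices, not prescriptions.

```python
from pystac_client import Client

# Open a public STAC API endpoint (here: Microsoft Planetary Computer).
catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

# Query assets by collection, spatial extent, time range, and metadata
# exposed through the Electro-Optical extension.
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[12.3, 51.3, 12.5, 51.4],        # lon/lat bounding box
    datetime="2022-06-01/2022-08-31",
    query={"eo:cloud_cover": {"lt": 20}},
)

for item in search.items():
    print(item.id, item.properties["eo:cloud_cover"])
```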
The information-preserving data cube approach is exemplified by the Open Data Cube (ODC) initiative, a prominent model in this field (Killough, 2018; Killough et al., 2020). This approach has played a pivotal role in informing governmental actions and policies, as evidenced by its integration into national and regional data cube frameworks (Dhu et al., 2019; Sudmanns et al., 2022). Noteworthy instances of these initiatives include Digital Earth Africa (DE Africa, formerly known as Africa Regional Data Cube, Killough, 2019), Digital Earth Australia (DE Australia, previously Australian Geoscience Data Cube, Lewis et al., 2017; Dhu et al., 2017), the Colombian Data Cube (CDCol, Ariza-Porras et al., 2017; Bravo et al., 2017; Villamizar et al., 2018), and the Swiss Data Cube (SDC, Giuliani et al., 2017).
While both of these approaches excel in preserving data integrity and offering flexibility for various analyses, achieving Earth system data interoperability necessitates their integration into a unified structure through ESDCs. These ESDCs can be constructed from either approach. For instance, in the case of image collections, it is feasible to request pixels from GEE (Clinton, 2023), and data transformations can be executed within the GEE cloud-based environment before downloading the data. It is important to note that limitations on the size of the requested data can be a potential concern in this process. Alternatively, STAC simplifies the process, particularly when combined with cloud-ready formats, enabling the lazy creation of ESDCs. In the information-preserving data cube approach, platforms like ODC offer a comprehensive system for transforming original data into ESDCs and even provide mechanisms for storing the resulting ESDCs within the data cube infrastructure. Noteworthy is openEO (Schramm et al., 2021), an API striving to connect multiple backends in a standardised way, including image collection providers (e.g. GEE) and information-preserving data cubes (e.g. ODC), to generate ESDCs.
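As one possible realisation of this lazy cubing step, the sketch below extends the previous STAC search and stacks the returned items into a dask-backed cube with the stackstac library (odc-stac offers similar functionality); the signing step is specific to the Planetary Computer, and the asset names are illustrative.

```python
import planetary_computer
import stackstac
from pystac_client import Client

# Sign asset links so they can be read directly from cloud object storage.
catalog = Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)
items = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[12.3, 51.3, 12.5, 51.4],
    datetime="2022-06-01/2022-08-31",
).item_collection()

# Build a lazy space-time cube: no pixel is read until a computation requests it.
cube = stackstac.stack(items, assets=["B04", "B08"], resolution=10)
print(cube.dims)  # ('time', 'band', 'y', 'x')
```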
3. The ESDC Life Cycle
Creating an ESDC from multiple sources, including source files, data cubes, or image collections, is a multifaceted process. The ESDC life cycle, as illustrated in Figure 2, encompasses several crucial stages, each playing a vital role in the generation, analysis, and effective utilisation of these data structures. The ESDC life cycle comprises the following key phases: data collection, curation, cubing, harmonisation, transformation, analysis, and reuse. These phases are linked, reflecting the meticulous efforts involved in ESDCs’ development. Alongside these stages, metadata generation proceeds concurrently with data transformation, exploration, visualisation, and dissemination. This section provides an overview of the ESDC life cycle, emphasising relevant considerations that contribute to the streamlined development and utilisation of ESDCs.
3.1. Collection
Given that data providers frequently utilise diverse formats and protocols for data sharing, with a complexity that tends to grow with the multidimensionality of the data, the establishment of streamlined access mechanisms becomes imperative. Traditionally, File Transfer Protocol (FTP) servers have been used for data sharing. However, to enhance data discoverability and usability, data providers are increasingly adopting data stores that offer persistent and standardised data storage. Repositories play a vital role in this process by standardising metadata, enabling easy search and retrieval of assets through metadata queries. Recently, more and more data providers offer APIs to facilitate efficient querying of metadata and access to the data itself by adopting specifications such as STAC and enabling range requests for cloud-optimised data (e.g. Zenodo recently started to support HTTP range requests).
The flexibility of these specifications enhances data interoperability by enabling the development of extensions that simplify data integration. For instance, the Electro-Optical STAC extension has been created to facilitate the integration of multispectral remote sensing data by expanding the capabilities of STAC to accommodate specific requirements and metadata associated with this kind of data. Looking ahead, the advantages of data interoperability may potentially extend beyond the realm of raw source data, encompassing entire ESDCs. The datacube STAC extension (currently at the “candidate” Extension Maturity level) has been developed with the primary objective of advancing the integration and interoperability of structured data representations like ESDCs within the STAC ecosystem. This effort aims to broaden the scope of opportunities for reusing ESDCs in new data processing pipelines.
Additionally, the efficiency of data access and collection is contingent upon data formats. GeoTIFF is arguably the most widely used and renowned data format for georeferenced raster data. It extends the TIFF format with a standard specification that describes the spatial properties of the raster and is widely used for EO products such as Landsat imagery. The need to operate in cloud environments has driven the development of cloud-optimised geospatial data formats. Consequently, the GeoTIFF format has evolved into the Cloud-Optimised GeoTIFF (COG) format, enhanced to function efficiently in cloud environments through HTTP range requests. COGs offer several advantages over traditional GeoTIFFs, including reduced latency in data retrieval, faster visualisation of large datasets, and a tiled structure that enables parallel processing. The significance of this format is underscored by its recent approval as an Open Geospatial Consortium (OGC) standard.
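The practical benefit of COGs can be sketched as follows: using rasterio, a small window can be read from a remote file without downloading it entirely, since GDAL translates the windowed read into HTTP range requests. The URL is a placeholder.

```python
import rasterio
from rasterio.windows import Window

url = "https://example.com/data/scene_cog.tif"  # hypothetical COG location
with rasterio.open(url) as src:
    print(src.block_shapes, src.overviews(1))    # internal tiling and overview levels
    window = Window(col_off=1024, row_off=1024, width=512, height=512)
    block = src.read(1, window=window)           # only the overlapping tiles are fetched
```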
When the dimensionality of the data increases, formats such as NetCDF or HDF5 are typically used to encapsulate data and coordinate values. For both formats, tiling and chunking allow efficient access to large data arrays. However, these formats are not inherently optimised for cloud environments. The Zarr specification addresses this limitation and can be used directly in cloud environments, offering several advantages over NetCDF and HDF5. Zarr enables more efficient chunk access for parallel processing, provides better support for distributed computing, and offers improved read and write speeds, particularly in cloud storage systems. Moreover, Zarr’s flexible chunking scheme allows for optimised data access patterns, and its simpler metadata structure facilitates easier data discovery and management. Additionally, specifications such as geo-zarr and the xcube dataset convention have been introduced to further enhance data interoperability and compatibility within the context of Earth system data.
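A minimal sketch of cloud-native access with Zarr, assuming a hypothetical publicly readable store and an fsspec-capable environment: xarray opens the store lazily, and only the chunks overlapping the requested subset are transferred.

```python
import xarray as xr

# Lazily open a (hypothetical) cloud-hosted Zarr cube; nothing is downloaded yet.
ds = xr.open_zarr("https://example.com/cubes/esdc.zarr")

# Subsetting and reducing triggers reads of the overlapping chunks only.
subset = ds["lst"].sel(lat=slice(60, 40), lon=slice(0, 20))  # slice order follows coordinate order
result = subset.mean("time").compute()
```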
3.2. Curation
Effective data curation stands as a critical anchor in the preparation of data for subsequent spatio-temporal processes and analysis via ESDCs (Marujo et al., 2022). The transformation of raw data into Analysis-Ready Data (ARD) has emerged as an essential prerequisite across multiple initiatives. ARD ensures that data are readily amenable to queries, analysis, and application development. Notable instances of these initiatives include DE Africa, DE Australia, and the Brazil Data Cube (Ferreira et al., 2020; Marujo et al., 2022), among others.
The Committee on Earth Observation Satellites (CEOS) has precisely defined ARD as “satellite data that have been processed to a minimum set of requirements and organized into a form that allows immediate analysis with a minimum of additional user effort and interoperability both through time and with other datasets”. In this definition, ARD also exhibit interoperability both across time and with other datasets (refer to Siqueira et al., 2019 for an overview of the CEOS ARD for Land initiative, CARD4L). CEOS has established a comprehensive set of Product Family Specifications (PFS) tailored to various data groups, including Surface Reflectance, Surface Temperature, Polarimetric Radar, and more. These specifications undergo rigorous peer review processes across multiple satellite platforms, such as the Landsat and Sentinel collections, to obtain CEOS ARD approval. It is worth noting that there are ongoing efforts to develop additional PFS, including Interferometric Radar and LiDAR Terrain and Canopy Height. Note also that the OGC has recently addressed the CEOS ARD concept by forming a new Standards Working Group (SWG) to define a generic multi-part standard specifying a set of minimum requirements for geospatial products to be considered ARD.
It is important to recognise that achieving an ARD level can extend beyond minimum standard specifications. Obtaining ARD often involves crucial preprocessing and data curation tasks that are tailored to the unique requirements of the application domain. For instance, in the context of EO data, these tasks may encompass, but are not limited to, cloud and cloud shadow masking (refer to Skakun et al., 2022 for a comprehensive intercomparison exercise of multiple cloud and cloud shadow masking methods), snow masking (e.g. Richiardi et al., 2021), and the correction of Bidirectional Reflectance Distribution Function (BRDF) effects to derive Nadir BRDF Adjusted Reflectance (NBAR) values (e.g. Roy et al., 2016).
3.3. Cubing
The concept of ARD may exhibit some subjectivity depending on the specific application. This subjectivity pertains to the data that populates an ESDC. In contrast, ESDCs inherently represent straightforward yet robust analysis-ready integrated entities, capable of simplifying a broad spectrum of analytical tasks (Baumann et al., 2019). An ESDC filled with ARD is often called an Analysis-Ready Data Cube (ARDC), a concept widely employed in DeepESDL. To generate an ARDC, the critical step involves aligning data onto a unified grid. Domain experts predefine this grid, and all data sources must conform. Furthermore, the efficacy of the ARDC processing is significantly influenced by the implementation of an optimal chunking strategy for this grid. This strategy must be tailored to facilitate efficient data processing across diverse analytical scenarios. For instance, analyses focused on temporal dynamics benefit from chunking strategies that preserve the temporal dimension, whereas spatial analyses or cartographic visualisations are optimised by maintaining spatial dimensions within chunks. In scenarios requiring multi-temporal spatial analysis, a hybrid approach combining both temporal and spatial preservation in chunking can be advantageous.
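These chunking strategies translate directly into cube layouts; below is a sketch with xarray/dask, assuming the hypothetical cube from above, with chunk sizes that are illustrative rather than recommended.

```python
import xarray as xr

ds = xr.open_zarr("https://example.com/cubes/esdc.zarr")  # hypothetical store

# Temporal analyses: keep the full time axis in each chunk, tile space.
ts_layout = ds.chunk({"time": -1, "lat": 128, "lon": 128})

# Spatial analyses and cartographic visualisation: contiguous space, thin time slices.
map_layout = ds.chunk({"time": 1, "lat": -1, "lon": -1})

# Hybrid layout for multi-temporal spatial analysis.
hybrid_layout = ds.chunk({"time": 32, "lat": 256, "lon": 256})
```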
Once the target grid is fixed in the spatio-temporal domain, the varying spatio-temporal resolutions and coverage among multiple data sources require selecting adequate methods to fit the data onto the predefined grid. Datasets with varying spatial resolutions and coverage must be resampled onto a standard spatial grid. This process often requires modifying the data (Cracknell, 1998). While non-destructive algorithms such as nearest neighbours can preserve data values (at the cost of duplicating or ignoring values), significant differences in spatial resolution often require transformation through (non-)linear resampling methods, such as cubic convolution or advanced fusion techniques (Nikolakopoulos, 2008). Complex AI methods can be employed to perform spatial transformations while preserving the quality of the measured variable (e.g. multi-image super-resolution algorithms, Michel et al., 2022; Razzak et al., 2023). Another often overlooked issue arises when dealing with extensive variables, i.e. quantities that scale with the area over which they are measured. In such instances, it is crucial to ensure that, for example, mass balances are not distorted in the process of creating new products.
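The choice between non-destructive and value-transforming resampling can be expressed, for instance, with rioxarray (a thin wrapper around GDAL’s warping); the file names are hypothetical, and both inputs are assumed to carry a CRS.

```python
import rioxarray  # registers the .rio accessor on xarray objects
import xarray as xr
from rasterio.enums import Resampling

coarse = xr.open_dataarray("soil_moisture_25km.nc")  # hypothetical coarse input
target = xr.open_dataarray("ndvi_500m.nc")           # hypothetical target grid

# Non-destructive: nearest neighbour preserves original values
# (at the cost of duplicating or dropping cells).
sm_nearest = coarse.rio.reproject_match(target, resampling=Resampling.nearest)

# Value-transforming: cubic convolution interpolates between cells.
sm_cubic = coarse.rio.reproject_match(target, resampling=Resampling.cubic)
```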
It is important to note that the application of resampling methods, particularly in the context of generating Global ESDCs covering the entire planet (e.g. Mahecha et al., 2020), may introduce geometric distortions. Projecting global datasets onto a plane can distort the data in terms of area, distances, and angles (Snyder and Voxland, 1989), posing challenges for subsequent analysis (cf. Section 5.1). This can be alleviated by using a Discrete Global Grid System (DGGS, Kmoch et al., 2022). This kind of grid system seeks to minimise distortions, harmonise cell sizes, and maintain consistent distances from neighbours. Defining standards and solutions for efficient chunk storage, subsetting, and integration into the ESDC framework will be a challenging future task. Still, it could lead to significant improvements in both the performance and accuracy of spatial algorithms.
In the case of Regional ESDCs (e.g. Estupiñan Suarez et al., 2021), which may cover entire continents, oceans, or administrative regions at various hierarchical levels, selecting an appropriate CRS is crucial to ensure minimal geometric distortion. On local scales, Local ESDCs (also referred to as mini cubes, Requena-Mesa et al., 2021) cover smaller areas of interest (e.g. Walther et al., 2022), ideally characterised by high spatial resolutions ranging from sub-metre to metres. Using Local ESDCs together with a local CRS makes it possible to minimise distortions.
When dealing with datasets characterised by varying temporal grids, even if they share the same date-time units, irregular temporal grids may emerge. These discrepancies can introduce temporal gaps within the time dimension. In cases where datasets exhibit varying date-time units, especially when working with datasets featuring finer date-time units (e.g. daily records), it becomes necessary to aggregate them to align with a predefined coarser temporal grid (e.g. monthly records). While this process is straightforward for regularly sampled data, it can pose challenges for EO data with long revisit periods (e.g. Landsat data). These challenges can potentially introduce uncertainties during aggregation. Substantial gaps in EO data can have a detrimental impact on the accuracy and representativeness of the aggregated results. This concern is further exacerbated when additional gaps arise due to data disturbances, such as cloud and shadow interference.
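A common aggregation step onto a coarser, regular temporal grid looks as follows in xarray; the input file is hypothetical, and counting the valid observations per aggregate is one simple way to expose the representativeness concern raised above.

```python
import xarray as xr

da = xr.open_dataarray("daily_reflectance.nc")  # hypothetical daily series with gaps

# Aggregate the daily grid onto monthly bins ("MS" = month-start labels).
monthly_mean = da.resample(time="MS").mean()

# Number of valid observations behind each monthly value: months supported by
# few cloud-free acquisitions are less representative.
monthly_count = da.notnull().resample(time="MS").sum()
```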
3.4. Harmonisation
Additional post-processing of data variables may be necessary to address Earth system challenges. This entails further data curation to obtain a fully gap-filled, harmonised product with evenly spaced time steps. In alignment with the naming conventions established for EO data cubes by Frantz (2019), we refer to a thoroughly harmonised ESDC as a highly ARDC (hARDC).
Data harmonisation is crucial to ensure the consistency and compatibility of variables obtained or generated using different methodological or technical approaches (Wulder et al., 2015). When discrepancies exist between data measurement or production methods, they can introduce inconsistencies that hinder subsequent analyses involving the specific variables (Vogeler et al., 2018). To address this, one approach is to create separate variables that represent the same measured quantity, highlighting the differences between them. However, to enhance spatio-temporal resolution and coverage, harmonisation of variables is often necessary (e.g. harmonising reflectance values from Sentinel-2 and Landsat, Claverie et al., 2018; Marujo et al., 2023).
This can be achieved through simple methods that involve sampling data at the same spatio-temporal indices in both variables to establish a direct conversion model (e.g. using matched observations to cross-calibrate Landsat 8 and Sentinel-2, Shang and Zhu, 2019). Alternatively, more advanced AI models can harmonise data by incorporating one or more additional variables (e.g. creating a global product of OCO-2 Sun-Induced Fluorescence, SIF, Li and Xiao, 2019). This may require the development of an entire AI pipeline to extend a variable with newly available data or reconstruct it, especially in cases where the variable was not previously measured (e.g. reconstructing SIF from TROPOMI, Chen et al., 2022). In this sense, data harmonisation also encompasses projecting data in simulated future scenarios (e.g. projecting vegetation dynamics for the rest of the century, Mahowald et al., 2016). In addition, it is crucial to incorporate uncertainty metrics to facilitate accurate and reliable future analysis using the harmonised data variables (cf. Section 4.3).
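The simplest variant, a direct conversion model fitted on matched observations, amounts to an ordinary least-squares regression; the sketch below uses purely illustrative reflectance pairs.

```python
import numpy as np

# Hypothetical reflectance pairs sampled at identical spatio-temporal indices.
sensor_a = np.array([0.10, 0.17, 0.23, 0.30, 0.38])  # e.g. Sentinel-2 band
sensor_b = np.array([0.12, 0.18, 0.25, 0.31, 0.40])  # e.g. Landsat 8 band

# Fit sensor_b ≈ slope * sensor_a + intercept and apply the conversion.
slope, intercept = np.polyfit(sensor_a, sensor_b, deg=1)
harmonised = slope * sensor_a + intercept
```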
Additionally, to effectively use algorithms that incorporate temporal structures, such as Recurrent Neural Networks (RNNs, Sherstinsky, 2020), a regularly spaced and gapless time dimension is usually required. Hence, data from an irregular time dimension should be aggregated or interpolated to fit into a regular temporal grid (e.g. gap-filling Landsat reflectances on a monthly basis, Moreno-Martínez et al., 2020). A suitable predefined temporal resolution must be selected, and data must be gap-filled. Various gap-filling techniques, ranging from simple linear interpolation to more complex AI-based modelling approaches, can be employed to address this (e.g. using Long Short-Term Memory networks, LSTMs, Ren et al., 2022). The choice of the gap-filling method depends on factors such as the data’s nature, the desired accuracy level, and the specific requirements of the analysis or application.
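As a baseline among these techniques, linear interpolation along the time axis is a one-liner in xarray; the input is hypothetical, and max_gap illustrates how to keep long data gaps from being filled with overconfident values.

```python
import xarray as xr

da = xr.open_dataarray("monthly_ndvi_with_gaps.nc")  # hypothetical gappy series

# Fill NAs by linear interpolation along time.
filled = da.interpolate_na(dim="time", method="linear")

# Optionally refuse to interpolate across gaps longer than 90 days.
filled_capped = da.interpolate_na(dim="time", method="linear", max_gap="90D")
```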
3.5. Transformation
Expertly crafted higher-order features often prove highly relevant for addressing Earth system challenges. These new features span a spectrum, encompassing operations that range from simple transformations of the original variables to the creation of entirely novel features derived from advanced AI models. Examples of such features include the computation of spectral indices derived from reflectance bands (Montero et al., 2023), the extraction of frequencies through time series decomposition (Mahecha et al., 2010), the creation of spatio-temporal compositions (e.g. Griffiths et al., 2013), summaries of high-dimensional dynamics (e.g. Kraemer et al., 2020), and outputs generated by AI models (e.g. Brown et al., 2022). To illustrate, consider a study focusing on the impact of climate extremes, such as heatwaves and droughts, on the terrestrial biosphere. In such cases, calculating anomalies for critical variables (e.g. air temperature and soil moisture as proxies for heatwaves and droughts, with Gross Primary Production as the target biosphere variable) is pivotal (see Figure 3). Creating these novel features introduces a new dimension to distinguish between variable values corresponding to raw data and those representing anomalies. In line with the naming conventions introduced by Frantz (2019), we designate an ESDC with higher-order features as a hARDC Plus (hARDC+).
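Two of these feature types, spectral indices and anomalies, can be sketched directly on a cube; the variable names (nir, red, t2m) are hypothetical, and the anomaly is computed against a per-month climatology as one common convention.

```python
import xarray as xr

ds = xr.open_zarr("https://example.com/cubes/esdc.zarr")  # hypothetical cube

# Spectral index from reflectance bands (here: NDVI).
ndvi = (ds["nir"] - ds["red"]) / (ds["nir"] + ds["red"])

# Anomalies of air temperature relative to its mean seasonal cycle.
clim = ds["t2m"].groupby("time.month").mean("time")
anom = (ds["t2m"].groupby("time.month") - clim).drop_vars("month")

# Stack raw values and anomalies along a new "component" dimension.
t2m = xr.concat([ds["t2m"], anom], dim="component")
t2m = t2m.assign_coords(component=["raw", "anomaly"])
```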
3.6. Reuse
ESDCs, after generation and analysis, can either evolve into dynamic versions through continuous updates or become static ESDCs, serving as input for the generation of new ESDCs. In the first scenario, establishing a Continuous Integration (CI) pipeline becomes essential for automating ESDC updates. This pipeline can be scheduled to align with the release of new dataset versions, ensuring the ESDC remains current. However, this approach may prove inefficient for EO products that are delivered regularly (e.g. Sentinel-2 or MODIS) and that may be constantly reprocessed by data providers, releasing new versions with updated processing pipelines. In this case, the update schedule should align with the specific needs of the ESDC (e.g. monthly, semi-annually, or annually). In the second scenario, an automatic update of dataset versions is also feasible, eliminating the necessity to extend the ESDC to the most recent date. In either scenario, it is crucial to implement clear reproducibility and traceability practices to ensure the data integrity of future ESDCs.
As highlighted in Section 3.1, standardisation is pivotal in promoting fluid data interoperability within this context. OGC has recognised the growing significance of data cube approaches for geospatial data. OGC recently established the GeoDataCubes SWG to create an API that facilitates interoperability among various solutions. This standard covers a broad scope, explicitly including API functionalities for access and processing, exchange format recommendations, profiles, and a metadata model.
Additionally, cloud technologies have ushered in the development of data cube services that abstract the underlying file structures and formats, replacing them with APIs offering diverse processing functionalities and promoting interoperability. For instance, platforms like Sentinel Hub serve as sources for ESDC generation through tools like xcube. Moreover, the openEO platform aims to connect multiple clients to various cloud backends through a unified API (Schramm et al., 2021). Approaches like these allow the tailored specification of ESDCs on demand, with server-side processing relieving requesters of the complexities of the generation task. However, this convenience often comes with a trade-off, as the processing engine’s code base, the processing environment, and the input data are not known to requesters. Any modifications to these specifications can result in different outcomes for identical requests to the data cube API, hindering a streamlined update of dynamic ESDCs and a transparent basis for reusing static ESDCs.
In contrast, less convenient but more transparent approaches fully document the ESDC generation process through “recipes”. These recipes contain versioned source code used for input data processing. Examples include the Pangeo Forge (Stern et al., 2022) and DeepESDL recipes. Recipes, coupled with versioned input data and fully specified processing environments, enable practical reproducibility of resulting ESDCs. This approach supports the seamless updating of dynamic ESDCs when new data becomes available and provides transparency for incorporating static ESDCs into new datasets.
Ongoing efforts to enhance data lineage and provenance transparency are integral to the Copernicus Data Space Ecosystem. The development of the “traceability” service, currently in progress, is designed to empower users to trace all modifications to the data from its origin to its delivery to the end user, ensuring greater transparency and accountability in the ESDC life cycle.
3.7. Metadata generation
Traceability and self-explanatory power are essential aspects alongside the data values themselves. When an ESDC is generated, end users may access its description through various sources, including documentation that adheres to best practices for open data publishing within the Earth sciences. Such practices are supported by data journals (e.g. ESSD) and scientific associations (e.g. AGU Open Science), provided that the data producers have furnished comprehensive documentation. However, the data must carry its own encapsulated description in the form of metadata, which typically comprises a set of attributes represented as key-value pairs. This ensures the data contain relevant information about their characteristics, facilitating understanding and utilisation.
Metadata generation should begin at the initial stage of data collection, encompassing crucial information such as data descriptors (e.g. name, units, measurement methods and equipment, resolution), data transformations (e.g. resampling or interpolation methods), metadata transformations (e.g. renaming procedures, conventions conversion), and responsible producers (e.g. creator entity, data provider). This metadata generation process should be consistently maintained throughout the entire ESDC life cycle, documenting each step undertaken to derive the final product (e.g. storing the process graphs from openEO when using this platform). This ensures comprehensive self-contained documentation of the history and processing of the ESDC.
While flexibility exists in metadata management, conventions are crucial when dealing with Earth system data. The Climate and Forecast Metadata Conventions (CF Conventions, Hassell et al., 2017), for instance, represent a comprehensive set of standards specifically designed for Earth system data stored in formats such as NetCDF (although they can be readily applied to other formats like Zarr). These conventions facilitate the creation of clear and detailed descriptions of data variables and coordinate dimensions. Furthermore, software like xarray (Hoyer and Hamman, 2017) can parse CF Conventions and leverage them for different ESDC processes. Compliance with CF Conventions not only simplifies data sharing but also promotes interoperability among various data sources, ensuring that ESDCs adhere to established standards.
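A minimal sketch of CF-style metadata attached in xarray; the attribute values are illustrative, not an authoritative CF profile.

```python
import xarray as xr

da = xr.open_dataarray("tas.nc")  # hypothetical near-surface temperature

# Variable-level attributes following the CF Conventions.
da.attrs.update({
    "standard_name": "air_temperature",
    "long_name": "near-surface air temperature",
    "units": "K",
})

ds = da.to_dataset(name="tas")
# Cube-level attributes documenting provenance and processing history.
ds.attrs.update({
    "Conventions": "CF-1.10",
    "institution": "Example institute",
    "source": "ERA5 reanalysis",
    "history": "2024-01-15: regridded to 0.25 deg; monthly aggregation",
})
```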
4. Leveraging ESDCs for Earth system research
ESDCs offer promising opportunities for advancing Earth system research, particularly with recent AI developments. The spatio-temporal, tensor-like structure of ESDCs makes them especially amenable to Deep Learning (DL). In this context, several key subjects emerge as highly relevant for Earth system research. We present three pertinent topics where the potential of ESDCs can be leveraged for advancing Earth system research: Physics-Informed Machine Learning (PIML), the adoption of complex sampling strategies, and the quantification of uncertainties.
4.1. Adding factual knowledge via PIML
A great addition to Machine Learning (ML) modelling is combining the purely data-driven approach with factual knowledge of the system under investigation (Karniadakis et al., 2021). PIML leverages domain knowledge (typically mechanistic models or differential equations) and flexible data-driven ML methods (typically neural networks). Consequently, PIML models respect physical boundaries more faithfully while being flexible enough to approximate arbitrarily complex non-linear functions from data (cf. discussion and references in Reichstein et al., 2019). ESDCs provide a unified structure for accessing multiple Earth system data streams, while equation-based models describe the underlying processes. Thanks to this ready availability of data and equations, exploring PIML models against a wide array of baseline models becomes far easier and faster. The equations detailing a given variable could be added to the cube as a sub-field of the variable of interest, in the same way that space and time are. The eventual implementation should consider the multi-platform and multi-language nature of ESDCs. As illustrated above, this requires a unified and robust approach that suits multiple use cases.
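The core PIML pattern, independent of any particular framework, is a loss that combines a data-fit term with a penalty for violating the known equation; below is a generic PyTorch-style sketch in which physics_residual stands for whatever mechanistic constraint applies (e.g. a mass-balance or rate equation), and all names are illustrative.

```python
import torch

def piml_loss(model, x, y_obs, physics_residual, weight=0.1):
    """Data-fit term plus a soft penalty for violating a known equation.

    physics_residual(x, y_hat) should evaluate to ~0 wherever the prediction
    satisfies the mechanistic constraint; `weight` balances the two terms.
    """
    y_hat = model(x)
    data_term = torch.mean((y_hat - y_obs) ** 2)
    physics_term = torch.mean(physics_residual(x, y_hat) ** 2)
    return data_term + weight * physics_term
```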
4.2. Sampling for AI in a complex system
Sampling on ESDCs is essential for learning the concrete interactions of drivers, spatial conditions, timing, and other determinants of specific processes and their implications. This involves strategically selecting a manageable subset from the ESDC. This selection process is particularly important for ML algorithms, as they rely on these subsets to establish a foundational understanding of the process to be analysed (Atkinson et al., 2022; Nikparvar and Thill, 2021). Pseudo-random sampling facilitates a broad and diverse data selection, while regionalised sampling uses specific patterns within the data for a more targeted analysis. The latter proves particularly advantageous when the research goal is to comprehend specific phenomena.
Constructing representative samples of Earth system processes must ensure an unbiased representation of the target variable. The multidimensional nature of Earth system processes poses sampling challenges across multiple variables. Consider, for instance, a study that aims at understanding the effects of climate extremes on the terrestrial biosphere using AI (Sippel et al., 2018). We know that climate extremes such as heatwaves, droughts, extreme precipitation, flooding, etc., are typically associated with multiple variables (Flach et al., 2021). Additionally, such events can co-occur in unfavourable sequences, i.e. compounding heatwaves, droughts, or floods following droughts (Zscheischler et al., 2020). To understand such circumstances, one should consider the full spatio-temporal extent in all relevant dimensions, including derived meta-variables that describe the characteristics of these events, such as timing, duration, extent, and intensity (Flach et al., 2017). Often, additional factors gain significance. For example, ecosystem responses to extremes vary in space depending on ecosystem conditions (Mahecha et al., 2017), land-cover types (Flach et al., 2021), and associated impacts, e.g., on the carbon cycle (Sippel et al., 2018). Building suitable AI models that predict such impacts requires including static data (e.g. vegetation type).
Yet, the critical question is then: how does one obtain adequate and balanced training and validation data? Earth system processes often involve rare events of extreme conditions, which may occur sporadically over time and space. This rarity can lead to imbalanced datasets, in which certain classes of the target variable or ranges of continuous values are underrepresented. Imbalanced datasets affect the performance and generalisation of models trained on these samples. Achieving spatio-temporal representativeness in this context can be challenging. To train ML algorithms for effective recognition and understanding of these events, it is crucial to include additional sampling within the specific domains where these events occur. For example, when constructing datasets for global flood (Li et al., 2023) or cloud detection (Aybar et al., 2022), the methodology involves initiating automatic sampling that covers a broad spectrum of ecosystem conditions. Simultaneously, manually selected events are introduced. This approach ensures a balanced representation of different classes in the dataset, thereby enhancing the algorithm’s capability to accurately predict such events. Figure 4 showcases a potential workflow where event detection is performed based on global ESDCs, and samples for high-resolution ML are extracted based on a systematic sampling strategy (e.g. Ji et al., 2024). Here, analysing land cover purity is an option (a relatively homogeneous land cover dominated by a single vegetation type allows for easier comparisons and subsequent analyses), as is incorporating mixed land covers (which introduces heterogeneity and interactions among land cover types), providing more comprehensive information for model training.
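A minimal class-balanced (stratified) sampler over a label field, as a NumPy sketch; the label array and sample sizes are illustrative, and in practice the strata could be land-cover classes or event masks as described above.

```python
import numpy as np

def stratified_sample(labels, n_per_class, seed=0):
    """Draw up to n_per_class flat indices for each class in `labels`."""
    rng = np.random.default_rng(seed)
    flat = labels.ravel()
    picks = []
    for cls in np.unique(flat):
        idx = np.flatnonzero(flat == cls)
        picks.append(rng.choice(idx, size=min(n_per_class, idx.size), replace=False))
    return np.concatenate(picks)

# Example: balance "extreme" (1) against "non-extreme" (0) grid cells.
event_mask = (np.random.default_rng(1).random((100, 100)) < 0.02).astype(int)
sample_idx = stratified_sample(event_mask, n_per_class=50)
```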
Finally, the selection of samples with the necessary data dimensions must align with the chosen algorithm. For instance, tabular-based algorithms like tree-based methods require 2-dimensional batches (sample and variable), which are selected as individual points from the spatio-temporal domain. DL methods like Transformers (Vaswani et al., Reference Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser and Polosukhin2017) or RNNs, e.g., LSTMs (Hochreiter and Schmidhuber, Reference Hochreiter and Schmidhuber1997), which consider sequence (or positional) dependencies, require 3-dimensional batches (e.g. sample, timestep, variable) and usually extract samples as time-series subsets at individual locations in the spatial domain. Convolutional Neural Networks (CNNs, LeCun et al., Reference LeCun, Boser, Denker, Henderson, Howard, Hubbard and Jackel1989) may be used with 4-dimensional batches (e.g. sample, height, width, variable), taking spatial subsets (grids) at individual timesteps. DL methods accounting for both spatio-temporal dependencies, such as 3DCNNs or Convolutional LSTMs (ConvLSTMs, Shi et al., Reference Shi, Chen, Wang, Yeung, Wong and Woo2015), require 5-dimensional batches (e.g. sample, height, width, timestep, variable) and extract samples as spatio-temporal subsets of ESDCs.
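The following sketch illustrates these four batch layouts on a synthetic ESDC held as a plain numpy array (dimension sizes, batch sizes, and patch sizes are arbitrary assumptions):

```python
import numpy as np

# Synthetic ESDC with dimensions (time, lat, lon, variable).
rng = np.random.default_rng(0)
cube = rng.random((365, 180, 360, 5))

# Tabular methods: (sample, variable) batches of individual space-time points.
t = rng.integers(0, 365, 64); y = rng.integers(0, 180, 64); x = rng.integers(0, 360, 64)
tabular_batch = cube[t, y, x, :]                                   # (64, 5)

# Sequence models (RNNs/Transformers): (sample, timestep, variable) batches.
sequence_batch = cube[:32, y[:16], x[:16], :].transpose(1, 0, 2)   # (16, 32, 5)

# CNNs: (sample, height, width, variable) spatial patches at single timesteps.
cnn_batch = np.stack([cube[i, :64, :64, :] for i in range(8)])     # (8, 64, 64, 5)

# ConvLSTMs / 3D-CNNs: (sample, height, width, timestep, variable) blocks.
st_batch = np.stack(
    [cube[i:i + 16, :64, :64, :].transpose(1, 2, 0, 3) for i in range(4)]
)                                                                  # (4, 64, 64, 16, 5)
```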
4.3. Quantifying uncertainties
Uncertainty quantification is crucial to Earth science, providing a comprehensive assessment of the reliability and confidence associated with scientific predictions, model simulations, and observational data. Capturing and modelling uncertainty is a complex task as it arises from various sources such as data limitations, model approximations, and the inherent complexity of Earth system dynamics.
Uncertainty can be broadly categorised into two types: epistemic uncertainty and aleatoric uncertainty (Kiureghian and Ditlevsen, Reference Kiureghian and Ditlevsen2009). Epistemic uncertainty refers to the model’s confidence in its predictions and is related to the choice of model parameters. Techniques such as Bayesian inference or Dropout can estimate epistemic uncertainty (Srivastava et al., Reference Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov2014; Gal and Ghahramani, Reference Gal and Ghahramani2016). Bayesian methods assign probability distributions to model parameters, directly quantifying uncertainty. In DL, dropout-based methods create model ensembles by randomly dropping out units during training, providing a measure of uncertainty based on the variability among the ensemble members. While these techniques may not completely capture the underlying uncertainty due to assumptions made during modelling or training, they are practical and widely employed to estimate uncertainty. These methods can be computationally demanding and time-consuming, particularly in real-time applications. However, advancements in cloud platforms and the Monte Carlo (MC)-Dropout technique have enabled reliable uncertainty estimates, even when working with massive amounts of data (Martínez-Ferrer et al., Reference Martínez-Ferrer, Moreno-Martínez, Campos-Taberner, García-Haro, Muñoz-Marí, Running, Kimball, Clinton and Camps-Valls2022). On the other hand, aleatoric uncertainty is associated with the noise or variability present in the data (e.g. data affected by natural variability, measurement errors, or other sources of noise) and cannot be reduced. Instead, it can be identified and quantified as part of the uncertainty characterisation.
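The MC-Dropout idea can be sketched in a few lines of pytorch (the network architecture, layer sizes, and number of forward passes below are hypothetical choices for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical regressor: 5 predictors -> 1 target, with one Dropout layer.
model = nn.Sequential(
    nn.Linear(5, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 1),
)

x = torch.randn(128, 5)   # a batch of 128 samples
model.train()             # keep Dropout stochastic at prediction time (MC-Dropout)

with torch.no_grad():
    samples = torch.stack([model(x) for _ in range(100)])  # 100 stochastic passes

mean_prediction = samples.mean(dim=0)  # ensemble mean
epistemic_std = samples.std(dim=0)     # spread across passes approximates epistemic uncertainty
```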
ESDCs involving measurements or modelled data can be accompanied by associated uncertainty values. Data assimilation techniques are key in incorporating data into ESDCs while considering the associated uncertainties. Approaches such as Kalman filtering, variational data assimilation, or ensemble-based assimilation can effectively merge different data sources and quantify the resulting uncertainties (Mathieu and O’Neill, Reference Mathieu and O’Neill2008).
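As a schematic of how assimilation weighs information sources by their uncertainty, consider the textbook scalar Kalman update (an illustration of the principle, not the implementation of any specific assimilation system; all numbers are invented):

```python
# Scalar Kalman update: merge a model prior with one observation,
# each carrying its own variance.
def kalman_update(prior_mean, prior_var, obs, obs_var):
    gain = prior_var / (prior_var + obs_var)       # Kalman gain in [0, 1]
    post_mean = prior_mean + gain * (obs - prior_mean)
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var

# A 290 K prior (std 2 K) assimilating a 292 K observation (std 1 K):
mean, var = kalman_update(290.0, 2.0**2, 292.0, 1.0**2)
# mean = 291.6 K, var = 0.8 K^2; the posterior is pulled towards the more certain source.
```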
5. Challenges in ESDC analysis
While ESDCs present significant opportunities, it is crucial to approach them with a well-informed strategy to avoid naive applications of analytical methods. In this section, we describe challenges associated with ESDC analysis, focusing on two key issues: geometric distortions (introduced during the cubing process) and spatio-temporal autocorrelation problems.
5.1. Geometric challenge on planet Earth
Most ESDCs covering the whole globe use a simple longitude-latitude plate carrée projection, which fits the ESDC model very well. The approach also allows for efficient storage and for subsetting cubes to user-defined bounding boxes. However, for advanced data analysis, equirectangular projections have two main drawbacks: 1) grid cells at different latitudes do not have equal area, and 2) the distances to nearest neighbours are not constant.
The first drawback introduces a sampling bias towards high latitudes in the data. This bias can affect the representativeness and accuracy of analyses (cf. Section 5.2), particularly for regions located closer to the equator. The most trivial cases are computations of scalars, such as global means (e.g. Figure 5), which need to be area-weighted, or approaches like principal component analysis, which require area-weighted covariance matrices. Effects of this kind have been known for decades and are considered climate textbook knowledge (Storch et al., Reference Storch, Zwiers and Livezey2000). However, they remain a challenge, as we find them often ignored in ESDC analytics. Issues of this kind can be alleviated using area-weighted statistics, suitable for most linear algorithms, or by performing weighted sampling from grid cells. For advanced, often non-linear data science methods, considering the spherical geometry is much more challenging, and careful consideration is advised before naive applications are performed. Even when applying area-weighted statistics correctly, oversampled areas lead to unnecessary increases in storage requirements and computation time.
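On a regular longitude-latitude grid, the area weighting reduces to cosine-of-latitude weights. A minimal xarray sketch with synthetic data:

```python
import numpy as np
import xarray as xr

# Synthetic 1-degree global field.
lat = np.arange(-89.5, 90, 1.0)
lon = np.arange(-179.5, 180, 1.0)
temp = xr.DataArray(
    np.random.rand(lat.size, lon.size),
    coords={"lat": lat, "lon": lon}, dims=("lat", "lon"), name="temperature",
)

weights = np.cos(np.deg2rad(temp.lat))      # proportional to grid-cell area
weighted_mean = temp.weighted(weights).mean(dim=("lat", "lon"))

naive_mean = temp.mean(dim=("lat", "lon"))  # biased towards the poles
```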
The second drawback is particularly significant when applying spatial convolutions or moving window operations. To address this, several approaches can be employed. One option is to use Spherical Harmonics for simple convolutions, providing a transformation that respects the spherical nature of the data (Wieczorek and Meschede, Reference Wieczorek and Meschede2018). Spherical Harmonics can also be used as coordinate embeddings for neural networks (Rußwurm et al., Reference Rußwurm, Klemmer, Rolf, Zbinden and Tuia2023). Another approach involves graph convolutions that consider varying distances to neighbours.
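The magnitude of this effect is easy to verify: on a 1° grid, the great-circle distance to the east-west neighbour shrinks from roughly 111 km at the equator to below 20 km at 80° latitude, as this small haversine check (pure numpy) shows:

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, r=6371.0):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * r * np.arcsin(np.sqrt(a))

for lat in (0.0, 45.0, 80.0):
    # Distance to the neighbouring cell 1 degree to the east:
    print(lat, haversine_km(lat, 0.0, lat, 1.0))  # ~111 km, ~79 km, ~19 km
```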
5.2. Spatio-temporal representativeness for an accurate model evaluation
Diagnostics on predictive modelling with ESDCs can be challenged by the representativeness and spatio-temporal structure of training data (Tobler, Reference Tobler1970; Meyer and Pebesma, Reference Meyer and Pebesma2021; Ploton et al., Reference Ploton, Mortier, Réjou-Méchain, Barbier, Picard, Rossi, Dormann, Cornu, Viennois and Bayol2020; Kattenborn et al., Reference Kattenborn, Schiefer, Frey, Feilhauer, Mahecha and Dormann2022). Assessing the accuracy of a prediction is statistically straightforward as long as reference data is available for the entire population or if a respective sample represents the spatio-temporal structure of the population (Wadoux et al., Reference Wadoux, Heuvelink, De Bruin and Brus2021; Brus, Reference Brus2021). However, many modelling tasks build on observations that are representative neither of the underlying temporal dynamics nor of the entire land-surface variability (e.g. upscaling functional ecosystem properties from sparse and clustered FLUXNET sites). Such an imbalance in reference data may not necessarily lead to a bias in model coefficients (Pabon-Moreno et al., Reference Pabon-Moreno, Migliavacca, Reichstein and Mahecha2022). However, it may lead to inflated prediction accuracy estimates, given the commonly limited capacities of ML to extrapolate into the unknown, where the predictor-response relationship may change (Ludwig et al., Reference Ludwig, Moreno-Martinez, Hölzel, Pebesma and Meyer2023). Thus, the accuracy assessment of a prediction estimated from clustered samples will not represent the factual accuracy of predictions beyond the reference data availability. This is critical for assessing the quality of a prediction itself and potential error propagation in subsequent analysis (Yates et al., Reference Yates, Bouchet, Caley, Mengersen, Randin, Parnell, Fielding, Bamford, Ban and Barbosa2018; Meyer and Pebesma, Reference Meyer and Pebesma2021; Mila et al., Reference Mila, Mateu, Pebesma and Meyer2022). It is therefore advised that predictions report their area of applicability (Meyer and Pebesma, Reference Meyer and Pebesma2021), i.e., the area in which the predictor space is covered by the reference data and for which the obtained predictive accuracies are assumed to hold.
However, assessing the predictive performance of a model inside the area of applicability may be challenged by the spatio-temporal structure of the training and test data. Commonly, adjacent observations (both in time and space) are more similar (autocorrelated in space and time), and therefore accuracies determined from test observations near the training data will be optimistically biased (Roberts et al., Reference Roberts, Bahn, Ciuti, Boyce, Elith, Guillera-Arroita, Hauenstein, Lahoz-Monfort, Schröder and Thuiller2017; Dormann et al., Reference Dormann, M. McPherson, B. Araújo, Bivand, Bolliger, Carl, G. Davies, Hirzel, Jetz and Daniel Kissling2007). For instance, seasonal effects can inflate model performance when using test observations near training data in the temporal dimension. Dependence among training and reference data results in any case in overly optimistic estimates of model performance, meaning that such accuracies do not reflect the actual transferability of the model to unseen areas or time steps (Roberts et al., Reference Roberts, Bahn, Ciuti, Boyce, Elith, Guillera-Arroita, Hauenstein, Lahoz-Monfort, Schröder and Thuiller2017). For instance, Ploton et al. (Reference Ploton, Mortier, Réjou-Méchain, Barbier, Picard, Rossi, Dormann, Cornu, Viennois and Bayol2020) showed that ML-based models found to be accurate in the presence of spatially dependent training and validation data may learn spatial data structures instead of transferable relationships between a response (biomass) and the predictors (environmental variables and optical reflectance). This may not only lead to erroneous model transferability and extrapolation to new spatial or temporal domains but also prevent an adequate interpretation of model functioning and attribution to variables and processes (Sweet et al., Reference Sweet, Müller, Anand and Zscheischler2023). Therefore, model performance assessment and interpretation should minimise the spatio-temporal dependence of observations via suitable cross-validation strategies (cf. Roberts et al., Reference Roberts, Bahn, Ciuti, Boyce, Elith, Guillera-Arroita, Hauenstein, Lahoz-Monfort, Schröder and Thuiller2017; Meyer et al., Reference Meyer, Reudenbach, Hengl, Katurji and Nauss2018; Ploton et al., Reference Ploton, Mortier, Réjou-Méchain, Barbier, Picard, Rossi, Dormann, Cornu, Viennois and Bayol2020; Kattenborn et al., Reference Kattenborn, Schiefer, Frey, Feilhauer, Mahecha and Dormann2022).
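A common building block for such strategies is grouped cross-validation, where spatial blocks rather than individual samples are held out. A minimal sketch with scikit-learn (synthetic data; the 10° block size and the model choice are arbitrary assumptions, not the exact protocol of the cited studies):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n = 2_000
lat, lon = rng.uniform(-60, 60, n), rng.uniform(-180, 180, n)
X = rng.normal(size=(n, 5))                  # hypothetical predictors
y = X[:, 0] + rng.normal(scale=0.5, size=n)  # hypothetical response

# Assign each sample to a 10 x 10 degree spatial block used as a CV group.
blocks = np.floor(lat / 10).astype(int) * 100 + np.floor(lon / 10).astype(int)

cv = GroupKFold(n_splits=5)
for train_idx, test_idx in cv.split(X, y, groups=blocks):
    model = RandomForestRegressor(n_estimators=50).fit(X[train_idx], y[train_idx])
    # R^2 on spatially disjoint blocks; a less optimistic estimate than random splits.
    print(model.score(X[test_idx], y[test_idx]))
```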
6. Technical considerations for managing ESDCs
Managing ESDCs throughout their entire life cycle is complex and resource-intensive. This section outlines the technical considerations and limitations associated with the current state-of-the-art technological resources for ESDC management. This encompasses aspects such as computing resources, software tools, and scalable solutions that are crucial for effectively handling the challenges involved in ESDC management.
6.1. Computing resources
The data size and available computing resources determine data processing feasibility throughout the ESDC life cycle. Computing resources vary from a single laptop to a local cluster with multi-threaded or distributed processing capabilities and can extend to cloud computing environments composed of multiple clusters. Modern computers are equipped with high-speed Solid-State Drives (SSDs) featuring fast random access and the potential for multiple Gigabytes per second throughput. However, the challenge lies in their limited capacity. In data centres, this is solved by using arrays of disks, but this introduces additional challenges, including latency, throughput, reliability, and security. Computation on local systems typically involves single-threaded or lightly multi-threaded computations with a higher level of interactivity. In High-Performance Computing (HPC) environments, the software operates in a multi-threaded or multi-core manner and is usually installed by a local system administrator. HPC environments are well-suited for extensive processing tasks but offer reduced interactivity due to the involvement of job schedulers for managing computation resources. Cloud computing environments offer a promising solution for managing vast amounts of Earth system data. These environments can be further improved in terms of scalability by utilising technologies like Kubernetes and Argo, which allow for specialised workflows. Platforms such as GEE, the European Open Science Cloud (EOSC)Footnote 36, Google ColaboratoryFootnote 37, Amazon SageMakerFootnote 38, DeepESDLFootnote 39, Copernicus Data Space Ecosystem (CDSE)Footnote 40, and KaggleFootnote 41 provide opportunities for efficient data storage, processing, and collaboration in scientific research. However, it is essential to note that these platforms often impose certain limitations on the users. These limitations include storage capacity, computational resources, available tools for ESDC management, access permissions, and usage restrictions.
6.2. Software capabilities
In the context of managing ESDCs, diverse tools are available. Here, we present a compendium of useful tools for processing Earth system data within the ESDC life cycle in three prominent programming languages: Python, R, and Julia.
Python, arguably the most used language for ESDC management, offers xarray with labelled multidimensional arrays (Hoyer and Hamman, Reference Hoyer and Hamman2017), built on top of numpy (Harris et al., Reference Harris, Millman, van der Walt, Gommers, Virtanen, Cournapeau, Wieser, Taylor, Berg, Smith, Kern, Picus, Hoyer, van Kerkwijk, Brett, Haldane, del Río, Wiebe, Peterson, Gérard-Marchant, Sheppard, Reddy, Weckesser, Abbasi, Gohlke and Oliphant2020), and supporting on-disk reading and parallel processing via dask (Rocklin, Reference Rocklin, Huff and Bergstra2015) (a Python library for parallel computing, enhancing array objects by partitioning data into chunks and employing dynamic task scheduling). Multiple tools are tailored to construct and process xarray datasets, which represent ESDCs. For data collection, rasterio (Gillies et al., Reference Gillies2013), rioxarray Footnote 42, satpy (Raspaud et al., Reference Raspaud, Hoese, Lahtinen, Holl, Finkensieper, Proud, Dybbroe, Meraner and Strandgren2023), or EOreader (Maxant et al., Reference Maxant, Braun, Caspard and Clandillon2022) are instrumental for reading GeoTIFFs and COGs, returning xarray objects. Xarray excels in reading NetCDF files and cloud-based data via zarr as dask-arrays. Vector data can be converted into xarray objects using geocube (Snow et al., Reference Snow, Taves, SlapDrone, Abdalla, Pierrick and Bell2023). Data sourced from STAC catalogues can be sought through pystac-client and directly transformed into xarray objects via stackstac Footnote 43, odc-stac Footnote 44, or cubo (Montero et al., Reference Montero, Aybar, Ji, Kraemer, Söchting, Teber and Mahecha2024). These tools support data collection and immediate cubing, including the temporal dimension. GEE enables data retrieval as numpy arrays through its API, which can be directly converted into xarray objects using Xee or wxee. GEE’s API (Gorelick et al., Reference Gorelick, Hancher, Dixon, Ilyushchenko, Thau and Moore2017) and extensions (Montero, Reference Montero2021) allow data curation before cubing. Xcube has various data stores for data acquisition and xarray object generationFootnote 45. XDGGS (Kmoch et al., Reference Kmoch, Bovy, Magin, Abernathey, Coca-Castro, Strobl, Fouilloux, Loos, Uuemaa, Chan, Delouis and Odaka2024) simplifies working with different DGGS in xarray. The curation, harmonisation, and transformation stages, being subjective and application-dependent, can be accomplished through xarray or numpy processing. Libraries like scipy (Virtanen et al., Reference Virtanen, Gommers, Oliphant, Haberland, Reddy, Cournapeau, Burovski, Peterson, Weckesser, Bright, van der Walt, Brett, Wilson, Millman, Mayorov, Nelson, Jones, Kern, Larson, Carey, Polat, Feng, Moore, VanderPlas, Laxalde, Perktold, Cimrman, Henriksen, Quintero, Harris, Archibald, Ribeiro, Pedregosa and van Mulbregt2020), built on top of numpy, offer additional resources leveraging ESDCs as multidimensional arrays. The analysis phase leverages a plethora of tools. ESDCs as multidimensional arrays are compatible with numpy, scipy, and related tools.
Moreover, ESDCs represented as tensors interface effectively with tensorflow (Abadi et al., Reference Abadi, Agarwal, Barham, Brevdo, Chen, Citro, Corrado, Davis, Dean, Devin, Ghemawat, Goodfellow, Harp, Irving, Isard, Jia, Jozefowicz, Kaiser, Kudlur, Levenberg, Mane, Monga, Moore, Murray, Olah, Schuster, Shlens, Steiner, Sutskever, Talwar, Tucker, Vanhoucke, Vasudevan, Viegas, Vinyals, Warden, Wattenberg, Wicke, Yu and Zheng2016) or pytorch (Paszke et al., Reference Paszke, Gross, Massa, Lerer, Bradbury, Chanan, Killeen, Lin, Gimelshein, Antiga, Desmaison, Köpf, Yang, DeVito, Raison, Tejani, Chilamkurthy, Steiner, Fang, Bai and Chintala2019). Furthermore, developments that are not designed for direct ESDC use can also be leveraged by representing ESDCs as tensors (e.g. torchgeo, Stewart et al., Reference Stewart, Robinson, Corley, Ortiz, Lavista Ferres and Banerjee2022; GeoTorchAI, Chowdhury and Sarwat, Reference Chowdhury and Sarwat2022; pytorch-metric-learning, Musgrave et al., Reference Musgrave, Belongie and Lim2020; and TorchIO, Pérez-García et al., Reference Pérez-García, Sparks and Ourselin2021).
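A typical Python entry point combining xarray, dask, and zarr as described above might look as follows (the store URL and variable name are hypothetical placeholders):

```python
import xarray as xr

ds = xr.open_zarr(
    "https://example.com/esdc.zarr",  # hypothetical Zarr store
    chunks={},                        # keep the on-disk chunking; arrays stay lazy
)

# Operations only build a dask task graph; nothing is loaded yet.
anomaly = ds["t2m"] - ds["t2m"].mean("time")

# Only the selected subset is actually read and computed
# (the lat slice assumes a descending latitude coordinate).
result = anomaly.sel(lat=slice(60, 40), lon=slice(0, 30)).compute()
```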
R, a widely used programming language for statistical analysis, has assumed increasing significance in geospatial data processing and management. Raster data sourced from image collections can be managed seamlessly, from data collection through to analysis, with the assistance of libraries like raster Footnote 46 or its more recent counterpart, terra Footnote 47. ESDCs can be collected and analysed through dedicated tools like gdalcubes (Appel and Pebesma, Reference Appel and Pebesma2019) and stars (Pebesma and Bivand, Reference Pebesma and Bivand2023). Regionalised sampling using geospatial data can be conducted using stpp (Gabriel et al., Reference Gabriel, Rowlingson and Diggle2013) and spatstat (Baddeley et al., Reference Baddeley, Rubak and Turner2015). Recent developments have introduced the capability for lazy on-disk reading of Zarr filesFootnote 48. Furthermore, data can be sourced and cubed directly from STAC catalogues using rstac (Simoes et al., Reference Simoes, de Souza, Zaglia, de Queiroz, dos Santos and Ferreira2021b) in combination with gdalcubes. Another comprehensive package for ESDC management is sits (Simoes et al., Reference Simoes, Camara, Queiroz, Souza, Andrade, Santos, Carvalho and Ferreira2021a), offering an end-to-end solution that additionally includes various tools for AI-related tasks, encompassing sampling, tuning, prediction, and the computation of uncertainty values.
Julia, a high-speed programming language, has gained popularity in scientific computing, making it an excellent choice for processing the large volumes of data found in ESDCs. Julia offers tools that cover crucial parts of the ESDC life cycle. These tools include YAXArrays.jl Footnote 49 and Rasters.jl Footnote 50 for multidimensional labelled array operations, GriddingMachine.jl (Wang et al., Reference Wang, Köhler, Braghiere, Longo, Doughty, Bloom and Frankenberg2022) for data acquisition, and experimental libraries like STAC.jl Footnote 51 for data discovery within STAC catalogues. For analysis, Julia provides specialised tools such as EarthDataLab.jl Footnote 52 for the direct processing of the Earth System Data Cube (Mahecha et al., Reference Mahecha, Gans, Brandt, Christiansen, Cornell, Fomferra, Kraemer, Peters, Bodesheim and Camps-Valls2020). Moreover, data distortions introduced during the cubing process can be addressed using libraries like OnlineStats.jl Footnote 53 (Day and Zhou, Reference Day and Zhou2020) and WeightedOnlineStats.jl Footnote 54 (Kraemer et al., Reference Kraemer, Camps-Valls, Reichstein and Mahecha2020). Julia’s ecosystem also includes ML tools like Flux.jl (Innes, Reference Innes2018), DiffEqFlux.jl (Rackauckas et al., Reference Rackauckas, Innes, Ma, Bettencourt, White and Dixit2019), and ReservoirComputing.jl (Martinuzzi et al., Reference Martinuzzi, Rackauckas, Abdelrehim, Mahecha and Mora2022), enabling advanced data analysis, including novel techniques like PIML.
6.3. Scalability obstacles
The size of ESDCs poses several challenges for analysis. Generally, in most programming languages for data science (e.g. Python, Julia, R), data has to be completely loaded into memory before calculating even a simple statistic (e.g. the median). However, ESDCs often surpass the memory limit, hindering computations or resulting in significant slowdowns due to frequent disk read-write operations. Instead, users can apply specialised algorithms that calculate statistics iteratively (Welford, Reference Welford1962; Schubert and Gertz, Reference Schubert and Gertz2018). Such online (out-of-core) algorithms allow the user to track statistics (e.g. means, sums, and standard deviations) iteratively and give the user complete control (and responsibility) over the order of the data reads. Because of the spherical nature of the Earth and the resulting differences in the area covered by pixels, these computations require weighted versions of the statistics (cf. Section 5.1). Errors arising from floating-point arithmetic must be minimised, including the potential for catastrophic cancellation (Kahan, Reference Kahan1965; Goldberg, Reference Goldberg1991).
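Welford's update is compact enough to sketch directly (an unweighted, scalar version for illustration; production use would add area weights, cf. Section 5.1, and compensated summation):

```python
# Welford's online algorithm (Welford, 1962): running mean and variance
# without holding the full dataset in memory.
class OnlineStats:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # second factor uses the updated mean

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")

stats = OnlineStats()
for value in (3.1, 2.7, 3.4, 2.9):  # e.g. values streamed chunk by chunk from disk
    stats.update(value)
print(stats.mean, stats.variance)
```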
Often, analyses can be performed independently on timesteps, maps, or any other discrete chunks of an ESDC (e.g. dimensions, periods, spatial slices). First, users split the data into those chunks and then apply the transformation. In the end, users combine the elements back together into a new ESDC (see Figure 6). Many analyses can be expressed in terms of split-apply-combine (Wickham, Reference Wickham2011; Mahecha et al., Reference Mahecha, Gans, Brandt, Christiansen, Cornell, Fomferra, Kraemer, Peters, Bodesheim and Camps-Valls2020), such as calculating mean seasonal cycle maps by collapsing the time axis to a day-of-year axis, or a global mean temperature time series that collapses latitude and longitude into a scalar value per timestep. This method is also known as map-reduce in distributed data processing; in contrast to general map-reduce, however, split-apply-combine is made for array-like or tabular data (and the reduce step always consists of concatenating the results of the map step, cf. Wickham, Reference Wickham2011). Implementations of split-apply-combine can trade off between memory consumption and performance by adjusting the amount of data being loaded into memory simultaneously. They may also take advantage of parallel reading, processing, and writing of data, which is especially important if the data is not stored on local storage but on object stores with high access latency.
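Using xarray, both examples mentioned above reduce to one-liners over a (here synthetic) cube:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic cube: two years of daily data on a coarse 10-degree grid.
time = pd.date_range("2000-01-01", periods=730, freq="D")
da = xr.DataArray(
    np.random.rand(730, 18, 36),
    coords={"time": time, "lat": np.arange(-85, 90, 10), "lon": np.arange(-175, 180, 10)},
    dims=("time", "lat", "lon"),
)

# Split by day of year, apply the mean, combine into a day-of-year climatology.
seasonal_cycle = da.groupby("time.dayofyear").mean("time")  # dims: (dayofyear, lat, lon)

# Collapse latitude and longitude into one value per timestep.
global_series = da.mean(dim=("lat", "lon"))  # dims: (time,); should be area-weighted, cf. Section 5.1
```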
The chunked, compressed storage typically employed by ESDCs, where reading a single element requires loading an entire chunk into memory, presents an opportunity for optimising sampling during ML training. Reading points individually is inefficient, as sampling two points from the same chunk necessitates reading the entire chunk twice. To mitigate this, reordering the points within a batch enables points from the same chunk to be read jointly, reducing the number of read operations. With this approach, each chunk needs to be read at most once per batch, optimising the data access process.
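A minimal sketch of this reordering (chunk sizes and sample counts are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# On-disk chunking of a hypothetical (time, lat, lon) store, and 256 random samples.
chunk_size = np.array([30, 64, 64])
samples = rng.integers(0, [360, 512, 512], size=(256, 3))  # (t, y, x) indices

# Chunk coordinates of every sample.
chunk_ids = samples // chunk_size

# Lexicographic sort groups samples from the same chunk together,
# so each chunk is decompressed at most once per batch.
order = np.lexsort((chunk_ids[:, 2], chunk_ids[:, 1], chunk_ids[:, 0]))
batch_in_read_order = samples[order]
```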
Ensuring that scalability obstacles are transparent for end users during Earth system data analysis is essential. While experienced users may be able to address scalability issues effectively, less experienced users may struggle with the process if it is not fully transparent. It is important to provide a user-friendly interface that hides the complexities of scalability, allowing users to focus on their analysis tasks. Not all users can access sufficient computing resources for scaling processes, resulting in additional processing costs. Therefore, providing accessible and cost-effective solutions for scalability, such as cloud-based platforms, is crucial to enable a broader range of users to harness the benefits of scaling in Earth system data analysis.
7. Visual interaction with ESDCs
Data and process visualisation are critical for communicating Earth system science because big data are often hard to understand intuitively based on metadata alone, especially for non-expert audiences (Hibbard et al., Reference Hibbard, Böttinger, Schultz and Biercamp2002; Kendall et al., Reference Kendall, Glatter, Huang, Hoffman and Bernholdt2008; Kehrer and Hauser, Reference Kehrer and Hauser2012). The gap between analytic capability and the means to effectively visualise results slows our progress in understanding complex Earth system phenomena. Specialised tools are needed to visualise ESDCs and address their specific needs. Helbig et al. (Reference Helbig, Dransch, Böttinger, Devey, Haas, Hlawitschka, Kuenzer, Rink, Schäfer-Neth and Scheuermann2017) defined the key challenges of data visualisation for advancing Earth system sciences. Their ambition was to use ESDC visualisation for visual data exploration, facilitating multidisciplinary and collaborative research and also emphasising their educational role.
Much progress has been made in visualising ESDCs in Earth system research. Several viewers now provide researchers with the means to explore and visualise multidimensional environmental datasets and generate scientific illustrations for publicationsFootnote 55,Footnote 56. However, most approaches still rely on the classical geographical interpretation of georeferenced data and are restricted to displaying maps, extracting singular time series, or Hovmöller diagrams. Few advances have been made in visualising ESDCs, particularly multivariate ESDCs, for a better data understanding (cf. static attempts, Mahecha et al., Reference Mahecha, Fürst, Gobron and Lange2010; Mahecha, Reference Mahecha2017; Mahecha et al., Reference Mahecha, Gans, Brandt, Christiansen, Cornell, Fomferra, Kraemer, Peters, Bodesheim and Camps-Valls2020). The long-standing challenge is the trade-off between interactive tools that were not designed for ESDCs and standard libraries that generate only static visualisations. Recent developments like Lexcube (Söchting et al., Reference Söchting, Mahecha, Montero and Scheuermann2023, cf. Interactions in Figure 7)Footnote 57 and xcube-viewerFootnote 58 enable interactive and barrier-free visualisation, allowing users to inspect any ESDC dimension (especially space, time, and variable) interactively. Enabling interactions on large-scale spatio-temporal data on the web is key to democratising our science (Steed et al., Reference Steed, Evans, Harney, Jewell, Shipman, Smith, Thornton and Williams2014).
A significant challenge will be the integration of data analytics with interactive visualisations through visual analytics (cf. the review of Cui, Reference Cui2019). The existing suite of methods is only partially suited for dealing with highly multivariate ESDCs, and most sophisticated visual analytic tools depend on a highly developed local computing infrastructure. There is a pressing need for web-based solutions to address this limitation. The goal should be to incorporate visualisations into any complex workflow to enhance comprehension of data inputs, monitor intermediate outcomes, and observe spatiotemporally structured results. One approach could be the tight integration of visualisation in developer workflows, particularly in popular environments like Jupyter Notebooks.
Integrating analytics tools with visualisation frameworks would allow researchers to dynamically explore, analyse, and visualise ESDCs in a unified environment in real-time. This would empower researchers to gain immediate insights into the relationships and patterns within the data. Additionally, incorporating visualisation into developer workflows would facilitate seamless visualisation generation at any stage of the ESDC life cycle, allowing researchers to visualise intermediate and final results and facilitating a more intuitive, iterative exploration of Earth system data.
ESDC visualisation extends its potential beyond the scientific community to engage and inform a wider audience. Nevertheless, this is particularly effective when accompanied by expert guidance such as tutorials, workshops, or annotations. Interactive open-access visualisations, exemplified by tools like Lexcube, allow political stakeholders and the general public to directly access and examine climate data (e.g. global or regional climate anomalies and trends). Open-access interactive visualisations enable scientifically literate individuals and those with less technical expertise to delve into ESDCs easily and rapidly by visualising anomalies, trends, and the interplay of variables. Such accessibility encourages a broader understanding and appreciation of Earth system research among diverse stakeholders, fostering a more informed and constructive dialogue about climate-related issues.
8. Conclusions and perspective
This paper reviews and explores the challenges and opportunities of leveraging ESDCs for Earth system research. This becomes particularly important in developing Earth Digital Twins (i.e. “a digital replication of the state and temporal evolution of the Earth system”, Bauer et al., Reference Bauer, Stevens and Hazeleger2021b). In this sense, the topics discussed here are of significance in initiatives like Destination Earth (DestinE)Footnote 59. The inherent simplicity and versatility of ESDCs enable a comprehensive exploration of the complex Earth system, facilitating a deeper understanding of intricate processes and phenomena. For advancing our understanding of the Earth system, the following key considerations emerge and need to be addressed by the research community to tap into the full potential of ESDCs:
1. Artificial Intelligence on ESDCs: The abundance of large-scale Earth system data, coupled with recent advancements in AI methods, compels the application of the latest developments in deep learning to ESDCs. Capitalising on the tensor-like structure of ESDCs in DL and incorporating factual knowledge through Physics-Informed Machine Learning approaches promise great advances in modelling and understanding. Recent advancements in AI, particularly in attention mechanisms, have opened up new possibilities for Earth system research. Techniques such as LLMs, generative image models (e.g. Stable Diffusion, Rombach et al., Reference Rombach, Blattmann, Lorenz, Esser and Ommer2021), as well as recent image and video segmentation models (e.g. Segment Anything Model, SAM and SAM 2, Kirillov et al., Reference Kirillov, Mintun, Ravi, Mao, Rolland, Gustafson, Xiao, Whitehead, Berg, Lo, Dollár and Girshick2023; Ravi et al., Reference Ravi, Gabeur, Hu, Hu, Ryali, Ma, Khedr, Rädle, Rolland, Gustafson, Mintun, Pan, Alwala, Carion, Wu, Girshick, Dollár and Feichtenhofer2024), may hold the potential to significantly advance our understanding of the Earth system (Wu and Osco, Reference Wu and Osco2023; Osco et al., Reference Osco, Wu, de Lemos, Gonçalves, Ramos, Li and Marcato2023). The ability to ‘communicate’ with ESDCs to extract valuable insights (e.g. Lobry et al., Reference Lobry, Marcos, Murray and Tuia2020) is within reach (e.g. using text prompts to extract variable anomalies from a specific land cover over a particular region). Furthermore, there is potential to generate ESDCs using text prompts, images, videos, or additional data inputs simultaneously by leveraging the power of multi-modal mechanisms (e.g. ImageBind, Girdhar et al., Reference Girdhar, El-Nouby, Liu, Singh, Alwala, Joulin and Misra2023), e.g., simulating the impact on vegetation due to an extreme event over a real ESDC using text prompts and geographical data. However, caution must be exercised when applying AI methods to ESDCs to avoid erroneous predictions and interpretations. Factors such as spatio-temporal auto-correlation, the spherical nature of the Earth, and biased sampling in the spatio-temporal and multivariate domains pose risks. Still, the abstract nature of ESDCs provides an opportunity to establish a de facto standard for AI in Earth system science, benefiting from optimised data access and technical enhancements. To ensure reliable outcomes, standardised methods are needed to address spatial dependency, the model’s area of applicability, and model uncertainty within ESDC structures.
2. Interacting with ESDCs: The heterogeneity, size, and multivariate nature of the underlying datasets can make ESDCs unintuitive to use, which hampers interpretation. Effective ways of communicating with such data are crucial throughout the ESDC life cycle, both for scientists and a wider audience. Visualisation plays a key role in this regard. While visualisation tools are available to support the analysis process and scientific dissemination, there is still considerable potential for further exploration and development of visualisations. We believe that interactive visualisations are one key, as demonstrated by Lexcube. One promising avenue is the integration of visualisation directly into the analytics workflow (e.g. within Jupyter Notebooks or similar environments), and another is enabling visual analytics of ESDCs. In both cases, the challenge is making such interactions possible during the analysis process to enable the scientific exploitation of large ESDCs.
3. Technical challenges of large ESDCs: The multidimensional nature, varying spatio-temporal scales and resolutions, and applicability of ESDCs imply a series of technical challenges. These include interoperability issues, different geographical projections, interpolation and aggregation questions, and varying readiness levels for further analyses. Ensuring data integrity and interpretability while making Earth system data analysis-ready and interoperable requires tracing and encoding all data transformations and modifications in ESDC metadata. To address these challenges, developing guidelines and standards for geospatial datacubes is crucial for promoting FAIR and Open Earth System Science. The ever-increasing size and complexity of datasets demand scalable solutions to tackle associated challenges. The ongoing efforts of the open-source software community are commendable in this regard, as they contribute to the advancement of tools and frameworks tailored to handle big Earth system data. Furthermore, cloud environments present a possible solution to quickly scale workloads when processing data within the ESDC life cycle. They offer the advantages of on-demand resource allocation and scalability, allowing researchers to access the necessary computational power and storage capacity when needed.
4. Integrating (geospatial) data beyond cubes: ESDCs already offer the potential for advancing Earth system research and analysis in multiple domains. However, ESDCs can benefit from integrating different methodological approaches or data sources at different scales. One example is the integration of Unoccupied Aerial Vehicle (UAV)- and Light Detection and Ranging (LiDAR)-based data. These data provide a means to collect highly localised, high-resolution measurements, making them particularly suitable for localised studies and for gaining valuable insights into fine-scale processes. Another example is the integration of vector dataFootnote 60, which typically represents categorical information and carries great importance in multiple Earth system spheres (e.g. socioeconomic features). Additionally, in-situ collections of any process (e.g. via ecological monitoring data) are essential. Today, users increasingly request the integration of such additional data sources while the resulting cubes remain fully valid. Yet, this poses a challenge, as it raises important questions regarding interoperability and the encapsulation of multi-resolution cubes that incorporate multi-scale raster data and the combination of raster and vector data within a unified framework.
5. Towards flexible cube-based structures: To extend the benefits of ESDCs, it is essential to advance the standards of ESDC structures and start considering hierarchical data structures that include ESDCs as “leaves” (e.g. xarray’s DataTree structure) or even unstructured grid systems (e.g. Project RaijinFootnote 61 with uxarrayFootnote 62). Given the abundance of insightful (but heterogeneous) datasets, this would enhance Earth system research, regardless of data resolution or dimensionality. Nevertheless, this implies that we must ensure data traceability and interpretability as heterogeneity increases in the resolution or dimensionality domains. A prime example lies in integrating AI models’ predictions within ESDCs. In such instances, additional dimensions must be incorporated to capture uncertainties (or quality flag systems) associated with AI-based predictions. This provides valuable insights into the reliability and robustness of the data. Leveraging the power of ESDCs in diverse fields can drive innovation, advance scientific knowledge, and enable more informed decision-making in a wide range of domains.
Open peer review
To view the open peer review materials for this article, please visit http://doi.org/10.1017/eds.2024.22.
Acknowledgments
We are grateful for the European Space Agency (ESA) funding for the DeepESDL and the DeepExtremes projects. Also, we thank the DLR for funding the ML4Earth project and VW for funding the Digital Forest project. We also thank the DFG for supporting NFDI4Earth and NFDI4Biodiversity. We thank Pablo Mahecha for generating Figure 3 using inputs from Lexcube. Comments by the editor and the anonymous reviewers greatly improved the quality of the paper. Furthermore, we thank Peter Zellner for his comments and suggestions.
Author contribution
Conceptualisation: D.M.; M.D.M.; F.G.; G.K. Writing - Original Draft: D.M. with contributions from M.D.M.; G.K.; A.A.; C.A.; F.C.; I.F.; F.G.; S.H.; C.J.; T.K.; L.M.F.; F.M.; M.R.; M.S.; K.T. Review and Editing: D.M.; M.D.M.; G.K.; G.C.V.; T.K.; S.H. Visualisation: D.M.; M.D.M.; G.K.; C.J.; M.S. Supervision: M.D.M. All authors approved the final submitted draft.
Competing interest
The authors declare no competing interests exist.
Data availability statement
No data were used in this paper.
Ethical standard
The research meets all ethical guidelines, including adherence to the legal requirements of the study country.
Funding statement
This research was supported by grants from the European Space Agency ESA (“AI4Science - Deep Extremes” and “DeepESDL”). D.M. and M.D.M. acknowledge support from the “Digital Forest” project, Ministry of Lower-Saxony for Science and Culture (MWK) via the program Niedersächsisches Vorab (ZN 3679), and the “RS4BEF” project via the iDiv’s Flexpool program. M.D.M. and M.R. acknowledge support by the German Aerospace Center, DLR representing the Bundesministerium für Wirtschaft und Klimaschutz (ML4Earth, 50EE2201B). M.D.M., M.S., F.C. and F.G. acknowledge support by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) for funding the “NFDI4Earth”, project number: 460036893. T.K. acknowledges support by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) for funding the “PANOPS”, project number: 504978936. C.A. acknowledges support by the National Council of Science, Technology, and Technological Innovation (CONCYTEC, Peru) through the “PROYECTOS DE INVESTIGACIÓN BÁSICA – 2023-01” program with contract number PE501083135–2023-PROCIENCIA. M.D.M. and F.M. acknowledge support by the Federal Ministry of Education and Research of Germany and by the Sächsische Staatsministerium für Wissenschaft, Kultur und Tourismus in the programme Center of Excellence for AI-research “Center for Scalable Data Analytics and Artificial Intelligence Dresden/Leipzig”, project identification number: ScaDS.AI.