1. Introduction
Deep learning is a machine learning method that uses multilayered neural networks; recently, it has been used to detect objects and structures in atmospheric science. In particular, deep convolutional neural networks (DCNNs), which are specialized for image pattern recognition, have exhibited excellent performance in detecting and/or classifying tropical cyclones (TCs) (Matsuoka et al., 2018), cloud types (Gorooh et al., 2020), weather fronts (Biard & Kunkel, 2019), and atmospheric rivers (Prabhat et al., 2021) from atmospheric data.
In general, machine learning using DCNNs requires extensive training data to achieve high performance. However, in atmospheric science, targets such as TCs occur infrequently. Consequently, the detection accuracy for extreme phenomena is limited by the small number of observed cases.
Although numerical simulations employing an atmospheric model can generate data of extreme events under various initial conditions and scenarios, they do not correspond perfectly to real atmospheric conditions. If the simulated data can interpolate between the small number of observed cases, they could help improve the performance of machine learning-based models for recognizing extreme events. This paper reports the initial results of applying a classification model trained only on simulation data to satellite observation data, using typhoon classification as a simple example.
2. Materials and methods
2.1. Observation and simulation data
The satellite observation data were infrared (IR) data from GridSat, which merges multiple satellite observations onto a grid with a horizontal resolution of 7 km (Knapp et al., 2011). The simulation data were the outgoing longwave radiation (OLR) with a horizontal resolution of 14 km, which was produced by the cloud-resolving model NICAM (Kodama et al., 2015).
To detect TCs, we prepared patch images of TCs (positive examples) and non-TCs (negative examples) cropped from the original data using the TC track data in the northwest Pacific Ocean. The size of the patch images was 64 × 64 grid points for the simulation data and 128 × 128 grid points for the observation data, corresponding to a square of approximately 1,000 km on a side. The best-track data of actual TCs, the International Best Track Archive for Climate Stewardship (IBTrACS) provided by NCAR, were used for the observation data. For the simulation data, the TCs were detected by applying a TC tracking algorithm (Yamada et al., 2017) to 6-hourly outputs of the horizontal components of wind, air temperature, and sea-level pressure. For the negative example data, the entire area was horizontally scanned in eight grids, and the patch areas depicting portions of clouds were cropped.
Examples of cloud images from the simulation and observation data are shown in Figure 1. To match the resolution of the observation dataset to that of the simulation dataset, the observation data were resized to half their original resolution. To treat the two kinds of data identically, we applied min–max normalization (also known as min–max scaling) to IR and OLR.
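The two preprocessing steps described above can be sketched as follows. This is a minimal, dependency-free illustration and not the code used in the study: the function names are hypothetical, halving the resolution by averaging 2 × 2 blocks is an assumption (the resizing method is not stated), and normalizing each patch by its own minimum and maximum, as opposed to dataset-wide values, is likewise an assumption.

```python
def downsample_by_two(image):
    """Halve the resolution by averaging non-overlapping 2 x 2 blocks.

    `image` is a list of rows (nested lists of floats) with even dimensions.
    Block averaging is an assumed resizing method, chosen for simplicity.
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image dimensions must be even")
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


def min_max_normalize(image):
    """Rescale values linearly to [0, 1] using the patch's own min and max."""
    flat = [value for row in image for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant patch carries no contrast; map it to all zeros.
        return [[0.0 for _ in row] for row in image]
    return [[(value - lo) / (hi - lo) for value in row] for row in image]


# Hypothetical usage: a 128 x 128 observation patch becomes a 64 x 64 patch
# with values in [0, 1], matching the size of the simulation patches.
observation_patch = [[float(r + c) for c in range(128)] for r in range(128)]
prepared = min_max_normalize(downsample_by_two(observation_patch))
```

Under these assumptions, IR and OLR patches end up with the same size and value range, although their physical units differ.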
The numbers of positive and negative examples in the training and test data are listed in Table 1, wherein the classifiers trained on observation and simulation data are referred to as ObsCNN and SimCNN, respectively. Moreover, we randomly sampled from the large number of negative examples to construct both classifiers, such that the positive and negative examples were equal in number, as in related work (Matsuoka et al., 2018).
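The class balancing described above can be illustrated with a short sketch. The function name and the fixed seed are hypothetical, and sampling without replacement is an assumption, since the sampling procedure is not specified in detail.

```python
import random


def balance_by_undersampling(positives, negatives, seed=0):
    """Return (positives, sampled_negatives) with equal numbers of examples.

    Negatives are drawn at random without replacement (an assumption).
    """
    if len(negatives) < len(positives):
        raise ValueError("not enough negative examples to match the positives")
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    sampled_negatives = rng.sample(negatives, k=len(positives))
    return list(positives), sampled_negatives
```

Any identifiers can stand in for the examples, for instance patch file names or array indices.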
2.2. Deep convolutional neural networks
We developed a binary classification model using a DCNN to classify TC and non-TC images. Generally, a DCNN comprises a stack of convolutional layers, pooling layers, and fully connected layers (LeCun & Bengio, 1995). In classification, the output layer outputs two scores, $ P(0) $ and $ P(1) $, corresponding to the probabilities of the negative (non-TC) and positive (TC) classes, respectively. Ultimately, the final class $ \hat{y} $ was inferred using the following equation employing the threshold value.
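The rule can be written as follows. This form is a sketch that assumes a sample is assigned to the positive class when $ P(1) $ is at least the threshold; the use of a non-strict inequality in particular is an assumption.

```latex
\hat{y} =
\begin{cases}
1 & \text{if } P(1) \geq \mathrm{Threshold}, \\
0 & \text{otherwise}.
\end{cases}
```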
Here, $ P(0)+P(1)=1.0 $, $ 0\leq P(0)\leq 1.0 $, $ 0\leq P(1)\leq 1.0 $, and $ 0\leq \mathrm{Threshold}\leq 1.0 $.
Although several DCNN models have been proposed in the literature, we used VGG16 in this study, which is known for its high recognition accuracy despite being a relatively lightweight model (Simonyan & Zisserman, 2015). Based on VGG16, we constructed two classification models for comparison: one trained only on simulated data and the other trained only on observed data.
Moreover, we used recall and precision as metrics to evaluate the classification performance on the test data. Recall is the ratio of correctly classified TCs to all samples whose correct class is TC, whereas precision is the ratio of correctly classified TCs to all samples whose inferred class is TC. They are given by the following equations.
$ \mathrm{Recall}=\frac{TP}{TP+FN} $, $ \mathrm{Precision}=\frac{TP}{TP+FP} $
where TP, FN, and FP denote the numbers of true positives, false negatives, and false positives, respectively.
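As a concrete illustration of these definitions, the following sketch computes both metrics from lists of true and inferred labels (1 for TC, 0 for non-TC). The function names are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not one taken from the study.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false negatives, and false positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp, fn, fp


def recall_precision(y_true, y_pred):
    """Return (recall, precision); 0.0 is used when a ratio is undefined."""
    tp, fn, fp = confusion_counts(y_true, y_pred)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision
```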
3. Results
The classification performance of ObsCNN and SimCNN on the observation data is shown in Figure 2a as a precision–recall curve (P–R curve), which was plotted by varying the threshold applied to the output value of the DCNN. In most cases, the model trained on the observation data delivered higher classification performance than the model trained on the simulation data. At a precision of 0.5, the recall of ObsCNN was 0.984, whereas that of SimCNN was 0.829. Similarly, at precisions of 0.7 and 0.9, the recall of ObsCNN was higher than that of SimCNN. This signified that the classification of the simulation data was more challenging than that of the observation data. For reference, the classification performance of both models when applied to simulation data is also shown in Figure 2a. SimCNN showed better performance than ObsCNN for the simulation data.
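A P–R curve of this kind can be traced by sweeping the threshold over the classifier scores, as in the sketch below. The function names are hypothetical. Two choices are assumptions and not details from the study: a sample counts as positive when its score is at least the threshold, and the recall "at" a given precision is read off as the highest recall among thresholds whose precision reaches that value.

```python
def pr_curve(scores, labels):
    """Return (threshold, precision, recall) for each distinct score.

    `scores` are the P(1) outputs and `labels` the true classes (1 or 0).
    """
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        # tp + fp >= 1 here because the threshold is itself one of the scores.
        precision = tp / (tp + fp)
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        points.append((threshold, precision, recall))
    return points


def recall_at_precision(scores, labels, target_precision):
    """Highest recall among thresholds with precision >= target, or None."""
    candidates = [r for _, p, r in pr_curve(scores, labels) if p >= target_precision]
    return max(candidates) if candidates else None
```

Calling `recall_at_precision` with targets of 0.5, 0.7, and 0.9 would give one recall value per operating point, analogous to the comparison reported above.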
The classification performance for each TC intensity is presented in Figure 2b. We classified the TCs into the following five categories according to their maximum wind speed (10-min average): TS (17–24 m/s), STS (25–32 m/s), TY1 (33–43 m/s), TY2 (44–53 m/s), and TY3 (54 m/s or higher). Figure 2b shows the recall of ObsCNN and SimCNN for each TC intensity with the precision fixed at a specific value (0.5, 0.7, or 0.9). For both ObsCNN and SimCNN, the recall was higher for stronger TCs. In addition, the performances of SimCNN and ObsCNN were similar for strong TCs. The recalls of ObsCNN and SimCNN were approximately 0.75 and 0.37 for TS, whereas they were approximately 1.0 and 0.94 for TY3, implying that the features of the observation and simulation data differ for weaker TCs.
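The per-intensity evaluation can be sketched as a binning step followed by a recall computation within each bin. The function names are hypothetical. Using the lower bound of each category as the cut-off, so that non-integer speeds such as 24.5 m/s fall into the weaker category and TY3 starts at 54 m/s, is an assumption about how the boundaries were handled.

```python
def intensity_category(max_wind_ms):
    """Map the 10-min maximum wind speed (m/s) to an intensity category."""
    lower_bounds = [
        (54.0, "TY3"),
        (44.0, "TY2"),
        (33.0, "TY1"),
        (25.0, "STS"),
        (17.0, "TS"),
    ]
    for lower, label in lower_bounds:
        if max_wind_ms >= lower:
            return label
    return None  # weaker than a tropical storm


def recall_by_category(max_winds, predictions):
    """Recall per category for TC samples (every true label is positive)."""
    hits, totals = {}, {}
    for wind, predicted in zip(max_winds, predictions):
        category = intensity_category(wind)
        if category is None:
            continue
        totals[category] = totals.get(category, 0) + 1
        hits[category] = hits.get(category, 0) + (1 if predicted == 1 else 0)
    return {category: hits[category] / totals[category] for category in totals}
```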
4. Discussion
The differences in the properties of ObsCNN and SimCNN were visualized using Grad-CAM, a technique that highlights the regions contributing significantly to a CNN prediction (Selvaraju et al., 2020). The areas that Grad-CAM identified as important to the decision are depicted in Figure 3. Correct inferences obtained using ObsCNN and SimCNN for observation data (only TCs) are shown in Figure 3a,b. The results from ObsCNN revealed that the regions of high importance were clustered around the center of the TC. In contrast, the SimCNN results indicated that the regions of high importance were clustered a little farther from the center of the TC. Especially in the TY3 example with a clear eye, ObsCNN indicated that the eye of the TC was a more important factor in the classification.
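The core of the Grad-CAM computation can be sketched in a few lines: each feature map of a convolutional layer is weighted by the spatial average of the gradient of the class score with respect to that map, the weighted maps are summed, and negative values are clipped to zero. In the sketch below, the function name is hypothetical, and the activations and gradients are assumed to have been extracted beforehand from a deep learning framework; which layer was used in the study is not stated.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM map from K feature maps and their gradients.

    `activations` and `gradients` are lists of K maps, each an H x W nested
    list. `gradients[k][i][j]` is the derivative of the class score with
    respect to `activations[k][i][j]`.
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weights: global average of the gradients over all positions.
    weights = [
        sum(sum(row) for row in grad_map) / (height * width)
        for grad_map in gradients
    ]
    cam = [[0.0] * width for _ in range(height)]
    for weight, feature_map in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * feature_map[i][j]
    # ReLU: keep only regions that contribute positively to the class score.
    return [[max(0.0, value) for value in row] for row in cam]
```

In practice, the resulting map is usually upsampled to the input size and overlaid on the image, which is how regions near or away from the TC center become visible.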
The important regions inferred from the simulation data (only TCs) using SimCNN are visualized in Figure 3c. In every TC category, the pattern slightly outside the center of the TC was used to identify it as a TC. Based on these results, with respect to the classification capability of CNNs, the patterns outside the center in the simulation data were similar to those in the observation data. The detection performance for strong TCs is high for both ObsCNN and SimCNN, as shown in Figure 2b, suggesting that the observed TCs also have a characteristic pattern in the off-center region.
Finally, the observed and simulated TC images (only TY2 and TY3) were dimensionally reduced and mapped into a two-dimensional feature space using Uniform Manifold Approximation and Projection (UMAP) (McInnes et al., 2018) and t-distributed Stochastic Neighbor Embedding (t-SNE) (van der Maaten & Hinton, 2008), as shown in Figure 4a,b. Both results show that some of the observed TCs have features similar to those of the simulated TCs, while the rest have different features.
5. Conclusion
In this study, we developed a CNN-based classifier trained using simulation data and applied it to the observation data. The classification model trained on the simulation data cannot be directly applied to the observation data owing to differences in the cloud patterns to be recognized. Although negative experimental results were obtained, there is no doubt that the simulation data have great potential. Clarifying the representation capability of both types of data and integrating them will lead to advanced machine learning models as well as simulation models.
Acknowledgments
The author thanks Drs. M. Nakano, C. Kodama, and Y. Yamada for producing the training and test data on tropical cyclones.
Data availability statement
Please contact the corresponding author for data requests.
Funding statement
This work was supported by JST, PRESTO (Grant Number JPMJPR1777), and JST, CREST (Grant Number JPMJCR1663).
Conflict of interest
The author has no conflicts of interest to declare.
Authorship Contributions
D.M. designed the study, performed the experiments, analyzed the data, and wrote the manuscript.
Comments
Comments to the Author: This manuscript investigated and compared tropical cyclone detection with a deep learning algorithm using either satellite observations or model output as the training dataset. The authors found that using model output as the training dataset to detect TCs is less accurate than using satellite observations. It is generally well written but needs some further clarifications.
Major comments:
1. Are the criteria of detecting/identifying TC the same for satellite observations and NICAM model output? 10-m max. wind can be used in a model to define a TC while the Dvorak technique is commonly used to decide TC categories from cloud patterns of the satellite visible/IR images. If the criteria are different, how will the interpretation of the results be affected?
2. Fig. 2b shows similar recall skill of SimCNN and ObsCNN for strong typhoons (TY3) but Fig. 3b shows a much more off-center pattern of SimCNN for TY3 when compared to ObsCNN in Fig. 3a. How do the authors explain the contradiction?
Minor comments:
1. Table 1: the numbers of positive and negative cases in the training dataset are always the same for both model and obs. Is this a requirement or just a coincidence?
2. Page 3: the denotation of recall/precision. “TN denotes true negative” should be “FN denotes false negative”.
3. Page 4: the third line from bottom “in both the TC categories”. Which two categories does “both” refer to?