Introduction
In recent years, convolutional neural networks (CNNs) have achieved astonishing performance in the tasks of object classification, segmentation, and object detection (Girshick, 2015; Ronneberger et al., 2015; Simonyan & Zisserman, 2015; He et al., 2016; Chen et al., 2018). A CNN is a type of machine learning model that consists of a series of convolution layers [as well as other layers in modern architectures, such as pooling layers and/or batch normalization (BN) layers (Ioffe & Szegedy, 2015)]. A convolution layer contains a set of learnable filters with a specified kernel size. Each learnable filter slides across an input and, at each position, computes an inner product between the filter and the region of the input that it overlaps.
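The sliding inner product described above can be illustrated with a minimal sketch in plain Python. The function name `conv2d_valid`, the absence of padding, and the stride of 1 are assumptions made for illustration only; they are not taken from any model used in this work.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (both lists of rows) and take the inner
    product with the overlapping patch at each position (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            # Inner product between the filter and the patch it currently overlaps.
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        output.append(row)
    return output


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]  # a hand-picked filter; in a CNN these values are learned
result = conv2d_valid(image, kernel)
```

In a trained CNN the filter values are learned from data rather than chosen by hand, and many such filters are applied in parallel at each layer.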
A CNN “learns” to select a meaningful set of filters through data, which enables it to be used in diverse sets of problems. Thus, we have witnessed a large volume of works in the field of materials science utilizing CNNs to tackle various aspects of the field. For example, Kaufmann et al. (2020) adopted a well-known object classification CNN model, called Xception, to determine the phase of diffraction patterns of crystalline materials captured in electron backscatter diffraction (EBSD). Matson et al. (2019) utilized CNNs to categorize structures of carbon nanotubes and nanofibers. Similarly, Hanson et al. (2019) and Heffernan et al. (2019) also used CNNs for the task of categorizing the characteristics of materials. In addition to classification, CNNs have also been deployed for other tasks such as estimating optimal operational parameters (such as the focus setting) during the image acquisition of scanning electron microscopy (SEM) images (Yang et al., 2020), segmenting structures characterized in SEM images (Ly et al., 2019; Pazdernik et al., 2020), denoising the drifted microscopic images (Vasudevan & Jesse, 2019), and reconstructing sparse SEM images (Trampert et al., 2019).
Moreover, CNNs have also performed impressively in the image synthesis task thanks to the seminal work of Goodfellow et al. (2014), who proposed a new approach, called generative adversarial training. More specifically, they used two models competing against each other, in which one model tries to generate realistic samples whereas the other simultaneously seeks to distinguish the synthetic samples from the real samples. They were able to generate perceptually realistic images using two CNNs via adversarial training, which is referred to as generative adversarial networks (GANs). Since then, many works have proposed to further improve image quality, the complexity of the type of images being generated, the diversity of generated images, etc. (Arjovsky et al., 2017; Isola et al., 2017; Odena et al., 2017; Karras et al., 2018, 2019; Brock et al., 2019). Concurrently, another area of image synthesis, called neural style transfer, has seen rapid advancement (Gatys et al., 2016; Huang & Belongie, 2017; Li et al., 2017). Neural style transfer is the process of representing the semantic content of an image in different styles, for example, rendering an image as it would appear in different seasons or at different times of day. Similarly, these image synthesis approaches have inspired many works in materials science. For instance, GANs were utilized to synthesize microstructures of alloys (Singh et al., 2018; Iyer et al., 2019).
Meanwhile, motivated by the style transfer model, Ma et al. (2020) proposed a model to transform the style of simulated labels from the Potts model to be similar to how they would have appeared had they been captured by a microscope.
Determining the composition of a mixed material is of interest in many fields (Sarkar et al., 2009; Samad et al., 2014; Rossen & Scrivener, 2017; Heffernan et al., 2019). For instance, composite metal oxides show tremendous benefits in absorption, separation, and photosensitive operations over single metal oxides in catalytic and electrocatalytic processes. When using mixed oxides, knowing the portion of each oxide in a mixture is essential in understanding the electro- and physico-chemical properties (Samad et al., 2014). Meanwhile, in India, measuring the percentage of uranium in a mixture of thorium–uranium mixed oxide is one of the required steps in quality assurance of nuclear fuels (Sarkar et al., 2009). On the other hand, determining the composition of calcium aluminum silicate hydrate (C–A–S–H) in a cement paste is part of the study of phase assemblages (Rossen & Scrivener, 2017). Various elemental analysis tools, such as powder X-ray diffraction (pXRD), SEM coupled with energy-dispersive spectroscopy (EDS), and laser-induced breakdown spectroscopy (LIBS) have been employed in these studies. Diverging from the elemental analysis methods and taking full advantage of the powerful performance of CNNs, we proposed a novel approach to estimate the composition of mixed materials characterized in the SEM images in our previous work (Ly et al., 2021). Our proposed method deployed two CNN models. The first CNN is tasked with generating SEM images of mixtures from images of the pure materials appearing in the mixtures.
The synthesized images are then used to train a second CNN model that is used to estimate the composition of a given input image. The main advantage of this proposed approach is that it does not require SEM images of the mixtures; thereby, it eliminates the monetary cost and laborious process of preparing and imaging samples of the mixtures (Ly et al., 2021). This advantage is further amplified when more materials are involved in the mixtures.
In the present study, we derive the mathematical details of the approach proposed in Ly et al. (2021), and present extensive experiments and analyses for further validation. Specifically, we validate the approach on two types of mixtures (binary and tertiary): (a) mixtures of triuranium octoxide (U3O8) synthesized from ammonium diuranate (ADU) and uranyl peroxide (UO4); and (b) mixtures of U3O8 synthesized from ADU, uranyl hydroxide (UH), and sodium diuranate (SDU). The image synthesis process of these two sets of mixtures is the same, which emphasizes the ease of extending the proposed image synthesis model to many other mixtures. Furthermore, we implemented two variants for the mixture estimation model that are tasked with (a) determining the presence of materials and (b) estimating the precise composition of a given SEM image. These experiments show that the approach proposed in Ly et al. (2021) can reliably determine the materials present in a mixture characterized in an SEM image (with the area under the ROC curve >0.9). Moreover, this approach can also provide estimated compositions in agreement with the actual compositions.
Materials and Methods
Mixtures of Uranium Oxides
Two sets of uranium oxide mixtures were used in this study: (a) mixtures of U3O8 synthesized from ADU and UO4 and (b) mixtures of U3O8 synthesized from ADU, UH, and SDU. We abbreviate these sets of mixtures as ADU–UO4 and ADU–UH–SDU, respectively, for the rest of the paper. For mixtures of ADU–UO4, we utilized images from Heffernan et al. (2019). These images were acquired at a resolution of 1,024 × 884 with a horizontal field width (HFW) of 5.11 μm, which represents the scale across the width of the image. The details of how the ADU–UO4 mixtures were prepared and imaged can be found in Sections 2.1 and 2.2 in Heffernan et al. (2019).
For ADU–UH–SDU mixtures, we utilized images from Schwerdt et al. (2019) as well as prepared and imaged samples. Specifically, we utilized images of pure materials (i.e., images of 100% ADU, 100% UH, and 100% SDU) from Schwerdt et al. (2019). The images of 100% ADU were acquired at a resolution of 1,024 × 884 with a HFW of 1.53 and 3.06 μm, whereas the images of 100% UH and 100% SDU were acquired at the same resolution but with a HFW of 3.06 and 6.13 μm, respectively. The images of mixtures of ADU–UH–SDU were acquired by first preparing the mixtures with U3O8 samples that were previously described individually by Schwerdt et al. (2019). The samples were stored under vacuum and at room temperature between their initial synthesis and mixing. Three binary U3O8 mixtures were prepared: ADU with UH, ADU with SDU, and UH with SDU. A tertiary mixture was prepared of U3O8 from ADU, UH, and SDU. Each mixture was prepared by aliquoting approximately 40 mg of each U3O8 component into a small PTFE vial containing a Teflon-coated stir bar, followed by 15 min of agitation with a Vortex mixer at the medium intensity setting as described by Heffernan et al. (2019). Table 1 lists the measured mass and weight% of each mixture.
Samples were prepared for analysis by SEM by dusting approximately 5–10 mg of mixed sample powder onto conductive double-sided carbon tape and aluminum pin stub mounts. An FEI Nova NanoSEM 630 scanning electron microscope was used to image the samples in immersion mode with the through-lens detector (TLD). Acquisitions were made at an image resolution of 1,024 × 884 at a HFW of 6.13 μm. Moreover, all the images were acquired with a high-voltage (HV) field setting of 7.00 kV with the exception of the images of the ADU–SDU mixture, which were acquired at 5.00 kV. Each sample was imaged without sputter coating, except for the ADU–SDU mixture, which showed signs of charging during SEM analysis; this sample was sputter coated with 20.0 ± 0.1 nm of Au/Pd film with a Gatan 682 Precision Etching and Coating System (PECS).
Image Synthesis Model
Synthesizing Mixed Samples
The proposed image synthesis model is based on the texture synthesis work in Gatys et al. (2017). They proposed to control the spatial location of a specific reference texture appearing in the generated image by minimizing the difference between the Gram matrices of the reference texture image and the generated image only in that specific region. Multiple regions with multiple reference textures can be easily synthesized at once by summing up that difference. In the present work, there is no constraint on where the reference textures should be located in the generated image. Thus, we instead minimized the difference between the Gram matrices of the generated images and the weighted sum of the Gram matrices of the reference textures. Formally speaking, to generate a new image, ${\bf x}_G$, from a given set of desired reference texture images $T = \{ {\bf x}_{T_1},\; {\bf x}_{T_2},\; \ldots ,\; {\bf x}_{T_n}\}$, the following objective function is optimized:
where $L$ is a set of extracted features from a pre-trained CNN model (refer to Section A.1.1 for more detail), and ${\bf G}^{l}({\bf x})$ is the Gram matrix at layer $l$ representing the normalized correlation of the vectorized feature maps, ${\bf F}^{l}( {{\bf x}}) \in {\cal R}^{C^{l}\times N^{l}( {\bf x}) }$ with $C^{l}$ the number of channels, and $N^{l}({\bf x})$ the product of the spatial dimensions, $H^{l} \times W^{l}$:
$\omega_k$ is a scalar that dictates the influence of texture $k$ on the generated image. Hence, by controlling $\omega_k$, we can condition a certain percentage of a texture $k$ to appear on the synthesized image.
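The weighted Gram-matrix objective described above can be sketched for a single layer as follows. This is a minimal illustration in plain Python, not the implementation used in this work: the normalization of the Gram matrix by the number of spatial positions, the squared Frobenius distance, and all function names are assumptions, and the full objective would sum this quantity over every layer in the extracted feature set.

```python
def gram_matrix(features):
    """Normalized Gram matrix F F^T / N for one layer.

    `features` holds C rows (one per channel), each with N spatial values.
    Dividing by N is an assumed normalization.
    """
    n = len(features[0])
    return [[sum(a * b for a, b in zip(row_i, row_j)) / n
             for row_j in features] for row_i in features]


def weighted_target(reference_features, weights):
    """Weighted sum of the reference Gram matrices, one weight per texture."""
    grams = [gram_matrix(f) for f in reference_features]
    size = len(grams[0])
    return [[sum(w * g[i][j] for w, g in zip(weights, grams))
             for j in range(size)] for i in range(size)]


def layer_texture_loss(generated_features, reference_features, weights):
    """Squared Frobenius distance between the generated Gram matrix and the target."""
    g = gram_matrix(generated_features)
    t = weighted_target(reference_features, weights)
    return sum((g[i][j] - t[i][j]) ** 2
               for i in range(len(g)) for j in range(len(g)))


# Toy features: 2 channels x 2 spatial positions per texture.
texture_a = [[1.0, 0.0], [0.0, 1.0]]
texture_b = [[2.0, 2.0], [0.0, 0.0]]

# With all the weight on texture A, an image whose features equal A has zero loss.
loss_pure_a = layer_texture_loss(texture_a, [texture_a, texture_b], [1.0, 0.0])
# With equal weights, the same image no longer matches the blended target.
loss_half = layer_texture_loss(texture_a, [texture_a, texture_b], [0.5, 0.5])
```

Shifting the weights moves the target statistics between the reference textures, which is the mechanism by which the weights condition how much of each texture appears in the generated image.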
The proposed image synthesis model adopts equation (1) to synthesize images of mixed material. To achieve this objective, we first defined each pure material in a desired mixture as a texture. Each image in the set of reference texture images used as input for each synthesis process is an image of the pure material present in a desired mixture. By conditioning $\omega_k$, a new image of a mixture can easily be synthesized with the desired percentage of each pure material appearing in the mixture. In other words, the percentage of a specific pure material occupying the synthesized image corresponds to $\omega_k$. Furthermore, we added the total variation (TV; Chambolle, 2004) objective function to increase the smoothness of the generated images. The final objective function used in the present study is
where α and γ are adjustable weights to control the influence of each objective function on the overall function.
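As a rough sketch of how a smoothness term can be combined with the texture term, the snippet below computes an anisotropic total variation (the sum of absolute differences between neighboring pixels) and a weighted sum of the two terms. The anisotropic form, the assignment of α to the texture term and γ to the TV term, and the function names are all assumptions; the original formulation may differ.

```python
def total_variation(image):
    """Anisotropic TV: sum of absolute differences between vertical and
    horizontal neighbors. `image` is a list of rows of pixel intensities."""
    height, width = len(image), len(image[0])
    tv = 0.0
    for i in range(height):
        for j in range(width):
            if i + 1 < height:
                tv += abs(image[i + 1][j] - image[i][j])
            if j + 1 < width:
                tv += abs(image[i][j + 1] - image[i][j])
    return tv


def combined_objective(texture_term, tv_term, alpha=1.0, gamma=1.0):
    """Weighted sum of the two terms (which weight scales which term is assumed)."""
    return alpha * texture_term + gamma * tv_term


flat = [[0.5, 0.5], [0.5, 0.5]]        # perfectly smooth image
checker = [[0.0, 1.0], [1.0, 0.0]]     # maximally rough 2 x 2 image

tv_flat = total_variation(flat)
tv_checker = total_variation(checker)
total = combined_objective(texture_term=2.0, tv_term=tv_checker)
```

A smooth image incurs no TV penalty, whereas high-frequency noise is penalized, which is why adding this term encourages smoother generated images.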
Pyramid Optimization
Here, we present the details of the proposed pyramid optimization strategy that speeds up the process of generating an image of size 512 × 512 by more than 50% compared with the optimization strategy used in Gatys et al. (2015, 2017); consequently, a large amount of data can be generated in a much more efficient manner.
The optimization strategy used in Gatys et al. (2015, 2017) initializes a generated image with white noise sampled from a uniform distribution ${\cal U}(0,\; 1)$ and optimizes for a certain number of iterations. To speed up the optimization process, we drew on the observation presented in the progressively growing GANs work (Karras et al., 2018). In that work, Karras et al. (2018) discovered that generating a large-scale structure at a smaller resolution first and then focusing on the fine detail at a larger resolution reduces training time. Taking advantage of that observation, we first initialized the generated image with white noise at a lower resolution. We then optimized the low-resolution generated image for a certain number of iterations. Next, we upsampled the generated image to twice its current size and optimized it further. This process is repeated until the final resolution of the desired generated image is reached. Hence, we refer to this optimization strategy as pyramid optimization. Furthermore, unlike Karras et al. (2018), the proposed pyramid optimization does not add more layers as the spatial resolution increases.
Another advantage of using the pyramid optimization strategy is its ability to capture large structures in an image. For each filter in a convolution layer, a kernel with a fixed size is chosen to iterate through a given input. The fixed size limits how large of a region the output of that convolution layer represents. By using the proposed pyramid optimization, we essentially reduce the spatial dimension of the input (at the first few levels) to the convolution layers, while maintaining the kernel sizes; thereby, we ultimately enlarge the region the outputs of the convolution layers represent.
Figure 1 provides the progression details of the proposed pyramid optimization. Each row in that figure represents a specific resolution. In this study, we used three scales for the pyramid optimization strategy. In other words, we started the pyramid optimization process with white noise input at a resolution 4× smaller than the resolution of the final output. We optimized that input for $K_1$ iterations. After $K_1$ iterations, we upsampled the generated image to twice its current size and optimized for an additional $K_2$ iterations. We again upsampled the generated image to twice its current size and optimized for another $K_3$ iterations to obtain the desired image. The values of $K_1$, $K_2$, and $K_3$ were empirically determined and correspond to 10,000, 10,000, and 1,000, respectively.
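The three-scale schedule described above can be sketched as follows. This is an illustrative outline in plain Python, not the implementation used in this work: the function names are hypothetical, nearest-neighbor interpolation is an assumed upsampling method, and `optimize` stands in for whatever routine minimizes the synthesis objective for a given number of iterations.

```python
def pyramid_schedule(final_size, iterations=(10000, 10000, 1000)):
    """Return (resolution, iterations) per level, coarsest level first."""
    levels = len(iterations)
    return [(final_size // 2 ** (levels - 1 - level), n_iter)
            for level, n_iter in enumerate(iterations)]


def upsample2x(image):
    """Nearest-neighbor 2x upsampling (the interpolation method is an assumption)."""
    out = []
    for row in image:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def pyramid_optimize(init_image, optimize, iterations=(10000, 10000, 1000)):
    """Optimize at the coarsest scale, then repeatedly upsample and refine.

    `optimize(image, n_iter)` is a hypothetical callable that returns the
    image after n_iter optimization steps at the current resolution.
    """
    image = init_image
    for level, n_iter in enumerate(iterations):
        if level > 0:
            image = upsample2x(image)
        image = optimize(image, n_iter)
    return image


schedule = pyramid_schedule(512)

# Dummy optimizer that leaves the image unchanged, used only to show the flow.
final = pyramid_optimize([[0.5]], optimize=lambda image, n_iter: image)
```

For a 512 × 512 output, the schedule starts at 128 × 128, so most iterations are spent on images with 16× and 4× fewer pixels than the final image.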
Mixture Estimation Model
For the mixture estimation model, we implemented two variants for two separate tasks. The complete architecture of these two models can be found in Section A.2.1. Both variants share a similar architecture and differ only in the last layer and the objective function, reflecting the task for which each variant was designed. The first variant predicts the presence of materials in a given image. For example, this variant is used to determine whether ADU, UO4, or both are present in a given SEM image of an ADU–UO4 mixture. We refer to the first variant as MEM-A. The second variant is tasked with estimating the exact composition of a given image, and is referred to as MEM-B.
Results and Discussion
Image Synthesis Model
We used the proposed image synthesis model to generate images of two sets of mixtures: ADU–UO4 and ADU–UH–SDU. The images of ADU–UO4 were generated at a resolution of 512 × 512, whereas the images of ADU–UH–SDU were generated at a resolution of 128 × 128 to account for the difference in scale between input images used for the image synthesis model. The scale correction process is detailed in Section A.1.2.
For each mixture, we generated two sets of images. In the first set, we manually selected the weights, $\omega$, in equation (3). We named this set dataset A. The purpose of generating dataset A is to have a similar set of compositions as real images. We also generated images containing various compositions by randomly sampling $\omega$ for each synthesized sample, and we refer to this set as dataset B. For this dataset, we generated twice as many images as for dataset A to be able to sample all the possible compositions and obtain more than one image per composition. Dataset B is constructed to accurately represent the real-world scenario in which we would like to have an approach that is able to estimate all possible compositions. Table 2 details the number of synthesized samples for each dataset as well as the number of real images of ADU–UO4 and ADU–UH–SDU mixtures.
Figures 2 and 3 show a side-by-side comparison between a few representative samples of real and synthesized images. As seen in these two figures, the synthesized images are qualitatively similar to real images. For instance, in Figure 2, one noticeable characteristic of ADU–UO4 mixtures is the correlation between the size of particles and the percentage of UO4 in the images. The higher the percentage of UO4 in a given image, the more large particles appear in it. This particular characteristic can be easily identified in both real and synthesized images. For ADU–UH–SDU mixtures, the morphological characteristics of each individual material are distinguishably different from each other. The particles of SDU have a rough surface and are granular, whereas the particles of ADU are more rounded and smooth. On the other hand, the particles of UH are much larger in size compared with those of ADU or SDU and have smooth plate-like structures. These characteristics are clearly visible in both real and synthesized images. For instance, the large plate-like structures can be located in both real and synthesized images of 100% UH (second row in Fig. 3). Moreover, both plate-like and smaller rounded particles are found in both real and synthesized images of the 50% ADU–50% UH mixture (fourth row in Fig. 3).
Mixture Estimation Model
Identifying Materials
The main purpose of this experiment is to demonstrate that the synthesized images can be used to train a model used for determining the materials present in a given SEM image. We trained the mixture estimation model, MEM-A, using only synthesized images and then tested the trained model on only real samples. In addition, we also trained MEM-A on only real images and tested this trained model on the same test set for performance comparison. Tables 3 and 4 show the area under the receiver operating characteristic curve (AUROC or AUC) of each material in the ADU–UO4 and ADU–UH–SDU mixtures, respectively. We also reported the micro-average and macro-average. The micro-average represents the weighted performance based on the frequency of each class. In other words, a class with more samples has more influence on the final result. In contrast, the macro-average treats each class equally. Refer to Section A.3 for the formal definition of the micro-average and macro-average of AUC as well as AUC itself.
As seen in the tables, the mixture estimation model MEM-A achieved high AUC values (>0.9 in both the micro-average and macro-average) for both mixtures when trained with synthesized images. Even though the AUC results of MEM-A trained with synthetic data are lower than when the model is trained with real images, the high AUC values of the model trained with synthesized images imply that the model can still reliably identify the presence of pure materials in a mixture. Moreover, these results further validate that the synthesized images have similar characteristics to those of real images.
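To make the difference between the two averaging schemes concrete, the sketch below computes AUC per material and then averages it in two ways. It is a plain-Python illustration with made-up labels and scores. The rank-based AUC, the pooling-based micro-average (a common convention, used here as an assumption), and all function names are illustrative; the formal definitions used in this work are those given in Section A.3.

```python
def auc(labels, scores):
    """AUROC as the probability that a random positive outscores a random
    negative (ties count one half). Labels are 0 or 1."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def macro_auc(labels_per_class, scores_per_class):
    """Unweighted mean of the per-class AUC values."""
    values = [auc(y, s) for y, s in zip(labels_per_class, scores_per_class)]
    return sum(values) / len(values)


def micro_auc(labels_per_class, scores_per_class):
    """AUC over all (label, score) pairs pooled across classes (assumed convention)."""
    pooled_labels = [y for column in labels_per_class for y in column]
    pooled_scores = [s for column in scores_per_class for s in column]
    return auc(pooled_labels, pooled_scores)


# Made-up presence labels and predicted scores for two materials, four images.
labels = [[1, 1, 0, 0], [1, 0, 1, 1]]
scores = [[0.9, 0.8, 0.3, 0.1], [0.6, 0.7, 0.8, 0.4]]

macro = macro_auc(labels, scores)
micro = micro_auc(labels, scores)
```

In this toy example, the second material is ranked poorly, which lowers the macro-average more than the micro-average because the pooled computation is dominated by the well-separated pairs.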
Composition Estimation
In this experiment, we used the second variant of the mixture estimation model, MEM-B, to estimate the composition of a given SEM image. The overall results of both mixtures are shown in Tables 5 and 6. As seen in both tables, the MEM-B model provides a reasonable estimate for both mixtures. Since this is a much more challenging task compared with materials prediction in the previous section, the results are not as accurate as for the previous task. However, the overall performance of the model trained with only synthesized images and only real images is still comparable in the ADU–UO4 mixtures, as indicated by the coefficient of determination ($R^2$) and root-mean-square error (RMSE) metrics. For the ADU–UH–SDU mixtures, the performance of the model trained with only synthesized images is reasonable, but the gap in performance between the model trained with only real images and with only synthesized images is larger than the one observed in ADU–UO4. The larger gap in performance in ADU–UH–SDU is attributed to the smaller resolution of the synthesized images and the larger number of materials involved in the mixtures.
Moreover, as expected, the model trained with dataset A outperformed the one trained with dataset B since the compositions in dataset A are tailored to the test set. However, it can be argued that the model trained with dataset B would perform better in practice because capturing a broader set of compositions would help eliminate bias on unseen compositions in the test set.
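For reference, the two regression metrics used in this comparison can be computed as in the following sketch. The composition values are made up for illustration and are not results from this study; the function names are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between actual and predicted compositions."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up fractions of one material in five images and a model's estimates.
actual = [0.0, 0.25, 0.5, 0.75, 1.0]
predicted = [0.1, 0.25, 0.4, 0.75, 1.0]

error = rmse(actual, predicted)
fit = r_squared(actual, predicted)
```

RMSE is expressed in the same units as the composition, whereas $R^2$ measures how much of the variation in the actual compositions the estimates explain, so the two metrics are complementary.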
Analysis
Computation Time
One of our motivations is to provide an alternative approach that can be used to accurately determine materials in a mixture while eliminating the high cost and time-consuming process of sample preparation and imaging by building a synthetic dataset. Thus, the computation time of an image synthesis model is one of the key criteria that justifies deploying the proposed approach. Table 7 lists the computation time for generating an image using the image synthesis model without and with our proposed pyramid optimization scheme, implemented using the PyTorch library (Paszke et al., 2019) on a single Titan RTX graphics processing unit (GPU). As seen from the table, the image synthesis model can synthesize an image in a short period of time. The computation time of the image synthesis model further decreases with the pyramid optimization scheme. Furthermore, this computation time can be improved with model parallelization on multiple GPUs if resources are available.
Diversity Analysis
In this analysis, we analyzed the diversity of synthetic images. The diversity measures the variation of images within a given class. A small variation indicates that the generated images look too similar to each other. Consequently, having generated images with less diversity means that the synthetic data fail to capture the underlying distribution of the dataset of interest. We used multi-scale structural similarity (MS-SSIM) (Wang et al., 2004) to measure the diversity in this study. The MS-SSIM of two given input images has a value between 0.0 and 1.0. A larger value indicates that the two images are more similar to each other. For this analysis, we computed the MS-SSIM for each composition in the mixture. For each composition, we first computed the mean MS-SSIM of 7,000 randomly selected distinct pairs of images in that mixture. Then, the mean MS-SSIM across all compositions was evaluated. Tables 8 and 9 show the MS-SSIM of real images and synthetic images for each composition and for the entire dataset. As seen in Table 8, the synthetic images achieved an MS-SSIM comparable to that of the real images for ADU–UO4 mixtures. Meanwhile, the difference in MS-SSIM between real images and synthetic images for ADU–UH–SDU is larger. We hypothesize that this larger gap is due to the smaller resolution of synthesized images compared with the real images. However, the MS-SSIM of synthetic images for ADU–UH–SDU mixtures is still relatively small. Thus, we believe that the synthetic images of ADU–UH–SDU mixtures still reasonably capture the underlying distribution.
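The pair-sampling procedure behind this diversity measure can be sketched as follows. The snippet is an illustration only: `toy_similarity` is a hypothetical stand-in for MS-SSIM (it is not MS-SSIM), the images are tiny made-up vectors, and the function names and seed are assumptions. In practice, the similarity callable would be replaced by an actual MS-SSIM implementation.

```python
import itertools
import random


def mean_pairwise_similarity(images, similarity, n_pairs, seed=0):
    """Mean similarity over up to `n_pairs` randomly chosen distinct image pairs."""
    # Enumerating all pairs is fine for a sketch; large sets would sample lazily.
    all_pairs = list(itertools.combinations(range(len(images)), 2))
    rng = random.Random(seed)
    chosen = rng.sample(all_pairs, min(n_pairs, len(all_pairs)))
    return sum(similarity(images[i], images[j]) for i, j in chosen) / len(chosen)


def toy_similarity(a, b):
    """Hypothetical stand-in for MS-SSIM: 1 minus the mean absolute difference."""
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Two made-up "compositions": one with identical images, one with opposite images.
images_by_composition = {
    "identical": [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    "diverse": [[0.0, 0.0], [1.0, 1.0]],
}

per_composition = {
    name: mean_pairwise_similarity(images, toy_similarity, n_pairs=7000)
    for name, images in images_by_composition.items()
}
dataset_score = sum(per_composition.values()) / len(per_composition)
```

A per-composition score near 1.0 signals that the images in that composition are nearly identical, whereas lower scores indicate more diversity.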
Conclusion
In this study, we demonstrated that the approach proposed in Ly et al. (2021) can be easily applied to many different mixtures. At the same time, the proposed approach provides an accurate prediction (AUC >0.9) of the presence of materials in a mixture characterized in an SEM image and a reasonable estimation of the composition. Furthermore, the approach proposed in Ly et al. (2021) achieves these accuracies relying solely on the synthetic data generated without requiring any images of mixed materials. This advantage eliminates the cumbersome process of sample preparation and imaging, which scales with the number of materials involved in the mixtures.
The proposed approach provides promising results for how generation of synthetic data can be beneficial in materials science research. However, many challenges still remain that follow-up studies need to address. First and foremost, the performance of both mixture estimation models, MEM-A and MEM-B, when trained on synthetic data is still lagging behind the performance of models when trained with real images. This finding indicates that a gap between synthesized images and real images still exists. Thus, the next essential step is to address this gap by developing an image synthesis model that can generate much more realistic images.
Second, even though the mixture estimation model can estimate the compositions fairly well (when trained with either real or synthesized images), a more accurate estimation is still of interest. This challenge can be tackled by developing new CNN architectures or learning methodologies to improve the estimation. For example, a semi-supervised learning method that combines a small number of real images of mixed materials with a larger number of synthetic images has the potential to improve the overall performance.
Funding
This work is supported by the Department of Homeland Security, Domestic Nuclear Detection Office, under Grant Number 2015-DN-077-ARI092.
Appendix A
A.1. Image Synthesis Model Architecture and Synthesis Process
A.1.1. Model Architecture
Figure A.1 shows the architecture of the proposed image synthesis model. This model uses VGG-16 (Simonyan & Zisserman, 2015) pretrained on ImageNet (Russakovsky et al., 2015) to extract feature maps for Gram matrix computation. We extracted five layers from the network in which each layer corresponds to the first convolution layer of each block in VGG-16 (Simonyan & Zisserman, 2015). Equation (3) is then carried out to minimize the difference between the Gram matrices of the extracted layers of a set of reference images and those of the synthesized image for a certain number of iterations to obtain the desired synthesized image. Moreover, for each synthesized sample, the vector $\omega$ can be either randomly sampled or controlled by a user to control the presence of each reference image in the synthesized image.
A.1.2. Image Synthesis Process
We used the architecture presented in Figure A.1 to generate images of mixtures of ADU–UO4 and ADU–UH–SDU. We set both α and γ to 1 and optimized equation (3) with a learning rate of 0.001. For the ADU–UO4 mixtures, we generated images of size 512 × 512. We generated images of size 128 × 128 for the ADU–UH–SDU mixtures. The smaller resolution of the generated images of ADU–UH–SDU mixtures is used to account for the difference in scale between the reference images (i.e., images of 100% ADU, 100% UH, and 100% SDU). Specifically, the images of 100% ADU, 100% UH, and 100% SDU are of size 512 × 512. These images were obtained by cropping four overlapping regions of size 512 × 512 from the original SEM images. However, the images of 100% ADU and 100% UH were acquired with HFW of 1.53 and 3.06 μm, whereas the images of 100% SDU were acquired with HFW of 6.13 μm, which means that the images of 100% ADU and 100% UH are 4× and 2× larger than the images of SDU. Thus, we performed a scale correction process before using them as input to the image synthesis model. In the scale correction process, any image that is 4× larger is resized to 128 × 128, and any image that is 2× larger is first resized to 256 × 256 and then randomly cropped to a region of 128 × 128. Finally, we randomly cropped a region of 128 × 128 for images of SDU.
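The arithmetic behind the scale correction can be sketched as follows. The snippet only derives the resize and crop sizes from the HFW values; it does not perform any image processing. The function names, the choice of 6.13 μm as the reference HFW, and the rounding of the magnification ratio to an integer are assumptions made for illustration.

```python
def scale_factor(hfw_um, reference_hfw_um=6.13):
    """How many times larger features appear relative to the reference HFW."""
    return round(reference_hfw_um / hfw_um)


def correction_plan(hfw_um, source_size=512, crop_size=128):
    """Resize so that one pixel spans the same physical width as in the
    reference images, then crop to the common input size."""
    factor = scale_factor(hfw_um)
    return {
        "factor": factor,
        "resize_to": source_size // factor,
        "crop_to": crop_size,
    }


plan_1_53 = correction_plan(1.53)  # 4x larger: resized straight to 128
plan_3_06 = correction_plan(3.06)  # 2x larger: resized to 256, then cropped to 128
plan_6_13 = correction_plan(6.13)  # reference scale: cropped to 128 without resizing
```

After correction, every 128 × 128 input covers the same physical width, which is why the synthesized ADU–UH–SDU images are generated at that resolution.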
A.2. Mixture Estimation Model Architecture and Operation
A.2.1. Model Architecture
The mixture estimation model is built based on the ResNet-50 (He et al., Reference He, Zhang, Ren and Sun2016) model. We replaced the last fully connected (FC) layer with a set of layers including FC, BN (Ioffe & Szegedy, Reference Ioffe and Szegedy2015), and dropout. Moreover, we added a global max pooling (GMP) layer in conjunction with global average pooling (GAP) to improve the stability of feature selection. The two variants, MEM-A and MEM-B, have the same architecture except the last FC layer and the objective function used to train them. In the MEM-A model, the number of nodes in the last FC corresponds to the number of materials in the mixture. In other words, the MEM-A used for ADU–UO4 mixtures has two nodes in the last FC layer, whereas the last FC layer in the model used for ADU–UH–SDU mixtures has three nodes. This model was trained with binary cross entropy objective function defined as
where $\hat{y}_i$ and $y_i$ are the predicted and ground-truth values of sample $i$, respectively, and $M$ is the total number of training samples. The MEM-B model has three nodes in the last FC layer when used for the ADU–UH–SDU mixtures, and a single node when used for the ADU–UO4 mixtures, since the percentages of ADU and UO4 in a mixture are complementary and each can be inferred from the other. This variant was trained with the $L_1$ objective function, defined as
$$\mathcal{L}_{L_1} = \frac{1}{M}\sum_{i=1}^{M}\left|y_i - \hat{y}_i\right|.$$
Figure A.2 details the architecture of both variants. The top output (dashed blue box) is used to determine the presence of materials in a given image, and the bottom output (dashed red box) is used to estimate the precise composition of a given image.
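The two objective functions named in this section can be sketched in a few lines of plain Python. This is a minimal illustration that assumes the standard mean-reduced forms of binary cross entropy and the L1 loss over M samples. The clipping constant `eps` and the helper `complete_binary_mixture` are assumptions added for the example and do not come from the original model.

```python
import math


def binary_cross_entropy(y_pred, y_true, eps=1e-7):
    """Mean binary cross entropy over M samples (assumed standard form)."""
    total = 0.0
    for p, t in zip(y_pred, y_true):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0); eps is an assumption
        total += t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return -total / len(y_true)


def l1_loss(y_pred, y_true):
    """Mean absolute error over M samples (assumed standard form)."""
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)


def complete_binary_mixture(adu_fraction):
    """Hypothetical helper: recover (ADU, UO4) fractions from a single output node."""
    return adu_fraction, 1.0 - adu_fraction


print(binary_cross_entropy([0.5, 0.5], [1.0, 0.0]))
print(l1_loss([0.2, 0.9], [0.0, 1.0]))
print(complete_binary_mixture(0.25))
```

The last helper mirrors the remark that a single node suffices for the two-material case, because the second fraction is the complement of the first.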
A.2.2. Training and Inference
For both variants, we first trained all the layers except the convolution layers within the ResNet-50 (He et al., 2016) model for 20 epochs with a batch size of 8 and a learning rate of 0.002. We then trained the entire model for another 30 epochs, with a learning rate of 0.0002 for the convolution layers within ResNet-50 and 0.002 for the remaining layers. We also used learning rate decay, which decreases the learning rates by a factor of 0.95 every 800 iterations.
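The two-stage schedule can be summarized as a small function. The sketch below rests on several assumptions that the text does not settle: the decay is read as a stepwise multiplication by 0.95, the iteration counter is taken to run continuously across both stages, and the frozen convolution layers are represented by a learning rate of 0. All names are hypothetical.

```python
HEAD_LR = 0.002        # learning rate of the newly added layers
BACKBONE_LR = 0.0002   # learning rate of the ResNet-50 convolution layers in stage 2
STAGE1_EPOCHS = 20     # convolution layers frozen
STAGE2_EPOCHS = 30     # entire model trained
DECAY = 0.95           # assumed multiplicative decay factor
DECAY_EVERY = 800      # iterations between decay steps


def learning_rates(epoch, iteration):
    """Return (backbone_lr, head_lr) for a 0-based epoch and a global iteration count."""
    if not 0 <= epoch < STAGE1_EPOCHS + STAGE2_EPOCHS:
        raise ValueError("epoch outside the 50-epoch schedule")
    scale = DECAY ** (iteration // DECAY_EVERY)  # assumption: counter is never reset
    backbone = 0.0 if epoch < STAGE1_EPOCHS else BACKBONE_LR * scale
    return backbone, HEAD_LR * scale


print(learning_rates(0, 0))
print(learning_rates(20, 1600))
```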
For mixtures of ADU–UO4, we trained the mixture estimation models with input images of size 512 × 512 and used the same input resolution during the inference stage. However, since there is a difference in scale between the real images of ADU, UH, and the rest of the mixtures in the ADU–UH–SDU dataset, we needed to account for this difference. During the training process, we performed a scale correction process similar to the one described for the image synthesis model above. During inference, in contrast, we wanted to predict materials or estimate the composition over the entire image. Thus, instead of cropping, we only resized any images that are 4× larger to 128 × 128 and any images that are 2× larger to 256 × 256. Furthermore, the scale correction process is applied only to real images, since the scale difference for synthesized images is already taken into account in the image synthesis model.
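The difference between the training-time and inference-time scale correction can be sketched as follows. This is an illustrative sketch only: images are represented as square nested lists, resizing is approximated by block averaging (the interpolation method is not stated in the text), and images at the reference scale are assumed to be left unchanged at inference. The function names are hypothetical.

```python
import random


def downsample(img, factor):
    """Shrink a square 2-D list by averaging non-overlapping factor x factor blocks."""
    n = len(img) // factor
    return [
        [
            sum(img[r * factor + dr][c * factor + dc]
                for dr in range(factor) for dc in range(factor)) / factor ** 2
            for c in range(n)
        ]
        for r in range(n)
    ]


def random_crop(img, side, rng):
    """Cut a random side x side window out of a 2-D list."""
    top = rng.randrange(len(img) - side + 1)
    left = rng.randrange(len(img[0]) - side + 1)
    return [row[left:left + side] for row in img[top:top + side]]


def scale_correct(img, scale, training, rng=random):
    """Apply scale correction for an image whose structures appear `scale` times larger."""
    if scale not in (1, 2, 4):
        raise ValueError("scale must be 1, 2, or 4")
    out = downsample(img, scale) if scale > 1 else img
    if training and len(out) > 128:
        out = random_crop(out, 128, rng)  # cropping is used during training only
    return out
```

For a 512 × 512 input, training yields 128 × 128 in every case, whereas inference keeps the whole field of view at 128 × 128 (4×) or 256 × 256 (2×).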
A.3. Micro-Average and Macro-Average AUC
The AUC value is the area under the curve defined by the true positive rate (TPR) as a function of the false positive rate (FPR). Thus, the AUC of a class, $k$, can be defined as
$$\mathrm{AUC}_k = \mathrm{TRAPZ}(\mathrm{FPR}_k, \mathrm{TPR}_k),$$
where TRAPZ is the area under the curve defined by $\mathrm{TPR}_k$ as a function of $\mathrm{FPR}_k$, computed using the trapezoid rule (2021). Then, the micro-average AUC is computed as
Meanwhile, the macro-average AUC is defined as
$$\mathrm{AUC}_{\mathrm{macro}} = \frac{1}{N_k}\sum_{k=1}^{N_k}\mathrm{AUC}_k,$$
where $N_k$ is the total number of classes.
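The quantities in this section can be illustrated with a small, dependency-free sketch. It assumes a common convention for the micro-average: all (sample, class) decisions are pooled into one binary problem before the ROC curve is built. This may differ in detail from the formulation used here. The function names are hypothetical, and each class is assumed to contain at least one positive and one negative sample.

```python
def roc_points(scores, labels):
    """Return (fpr, tpr) lists obtained by sweeping a threshold over the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    order = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(order):
        j = i
        # Samples with tied scores are added to the curve together.
        while j < len(order) and order[j][0] == order[i][0]:
            if order[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        fpr.append(fp / neg)
        tpr.append(tp / pos)
        i = j
    return fpr, tpr


def trapz(x, y):
    """Area under the piecewise-linear curve y(x) by the trapezoid rule."""
    return sum((x[i + 1] - x[i]) * (y[i + 1] + y[i]) / 2 for i in range(len(x) - 1))


def class_auc(scores, labels):
    fpr, tpr = roc_points(scores, labels)
    return trapz(fpr, tpr)


def macro_auc(score_rows, label_rows):
    """Unweighted mean of the per-class AUC values."""
    n_classes = len(score_rows[0])
    aucs = [class_auc([s[k] for s in score_rows], [l[k] for l in label_rows])
            for k in range(n_classes)]
    return sum(aucs) / n_classes


def micro_auc(score_rows, label_rows):
    """AUC of the pooled (sample, class) decisions; an assumed convention."""
    flat_scores = [v for row in score_rows for v in row]
    flat_labels = [v for row in label_rows for v in row]
    return class_auc(flat_scores, flat_labels)


scores = [[0.9, 0.2], [0.8, 0.6], [0.3, 0.7], [0.1, 0.4]]
labels = [[1, 0], [1, 0], [0, 1], [0, 1]]
print(macro_auc(scores, labels), micro_auc(scores, labels))
```

The macro-average weights every class equally, whereas the pooled micro-average weights every individual decision equally, so the two can differ when classes are imbalanced or unevenly separable.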
A.4. Large Structures Synthesis with Pyramid Optimization
Generating images with the pyramid optimization strategy helps reduce the computation time, as indicated in Table 7. In this section, we demonstrate another advantage of the proposed pyramid optimization strategy. The strategy operates on different resolution scales while maintaining the pre-determined kernel size of the filters in the convolution layers; in turn, it enlarges the region that the outputs of the convolution layers represent. Hence, the proposed strategy is able to capture larger structures. Figure A.3 presents an example of this advantage. The last two images in that figure were synthesized without and with the pyramid optimization strategy, respectively, from the same reference image shown on the left. As seen in that figure, the image synthesized without the pyramid optimization strategy fails to capture large structures. In contrast, the image synthesized with the pyramid optimization strategy properly reproduces large structures similar to those in the reference image.
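The enlargement argument can be made concrete with a little arithmetic: when the image is downsampled by a factor s and the kernel size k is kept fixed, a single kernel application spans roughly k · s pixels of the full-resolution image. The sketch below assumes, purely for illustration, a 3 × 3 kernel and a pyramid in which each level halves the resolution; neither value is taken from the text.

```python
def covered_pixels(kernel_size, downsample_factor):
    """Side length, in full-resolution pixels, spanned by one kernel application."""
    return kernel_size * downsample_factor


def pyramid_coverage(kernel_size, levels):
    """Coverage per level, assuming each level halves the resolution (an assumption)."""
    return [covered_pixels(kernel_size, 2 ** level) for level in range(levels)]


# Hypothetical 3 x 3 kernel over a three-level pyramid.
print(pyramid_coverage(3, 3))
```

The same scaling applies to the receptive fields of deeper layers, which is consistent with the coarser pyramid levels capturing larger structures.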