Impact Statement
This research paper describes a particle picking and micrograph denoising method for cryo-electron microscopy (cryo-EM) images designed to work under challenging signal-to-noise ratio (SNR) conditions. Based on the idea of multitask learning, we propose to use a deep-learning framework that enables accurate particle localization and micrograph denoising simultaneously. We validate the proposed approach on multiple single-particle cryo-EM datasets and cryo-electron tomography tilted images, showing substantial improvements in SNR and particle picking accuracy. This work is addressed to people working at the interface between biological imaging and computer vision. Advances in denosising and particle identification are important to accelerate and automate the processing of challenging cryo-EM datasets of biomedical significance, such as low-molecular-weight complexes or targets imaged in their native state.
1. Introduction
Cryo-electron microscopy (EM) combined with single-particle analysis (SPA) is a popular technique that enables visualization of proteins and macromolecular complexes at near-atomic resolution.(Reference Bartesaghi, Aguerrebere, Falconieri, Banerjee, Earl, Zhu, Grigorieff, Milne, Sapiro, Wu and Subramaniam1) Data collected using this technique have an extremely low signal-to-noise ratio (SNR), as electron dosage has to be kept low to prevent radiation damage to the biological sample.(Reference Bendory, Bartesaghi and Singer2) Therefore, large amounts of data have to be acquired in order to obtain high-resolution 3D reconstructions. A typical SPA dataset usually contains thousands of micrographs and each micrograph contains a few hundred copies of the protein of interest. To obtain a single high-resolution structure, hundreds of thousands of 2D projections of particles with random orientations need to be detected within micrographs in a process commonly referred to as particle picking. Particle projections are then extracted, aligned, and back-projected to obtain a 3D reconstruction (Figure 1). The resolution of the reconstructed 3D structure increases log-linearly with the number of good particles used, making accurate particle picking a critical task for the downstream processing tasks. However, particle detection can be quite challenging due to the low SNR nature of cryo-EM images caused by the limited electron doses used during acquisition to prevent radiation damage. Currently, on datasets of larger-sized proteins with relatively high SNR, particle picking can be performed automatically using either size-based, template matching, or more advanced deep-learning-based algorithms, such as semi-supervised learning-based Topaz(Reference Bepler, Morin, Noble, Brasch, Shapiro and Berger3) and fully supervised learning-based crYOLO.(Reference Wagner, Merino, Stabrin, Moriya, Antoni, Apelbaum, Hagel, Sitsel, Raisch, Prumbaum, Quentin, Roderer, Tacke, Siebolds, Schubert, Shaikh, Lill, Gatsogiannis and Raunser4) On datasets of smaller-sized proteins with extremely low SNR, however, existing automatic picking algorithms are susceptible to under/over picking.(Reference Kumar, Zavaliev, Wu, Zhou, Cheng, Dillard, Powers, Withers, Zhao, Guan, Borgnia, Bartesaghi, Dong and Zhou5) While manual picking can be performed as a last resort, it is laborious and time consuming. As a result, the particle picking of small-sized proteins from images with very low SNR remains a significant bottleneck.
Recent efforts to tackle particle picking under extremely low SNR mainly focus on image quality enhancement through denoising. For example, simple denoising techniques such as low-pass filtering have been widely adopted in the field. Despite its simplicity, frequency cutoff-based approaches have one major drawback: the risk of over-smoothness, which can remove important high-frequency information and make particles indistinguishable from the background. Recent advancements have seen the emergence of neural network-based methodologies that leverage the Noise2Noise (N2N) principle(Reference Wagner, Merino, Stabrin, Moriya, Antoni, Apelbaum, Hagel, Sitsel, Raisch, Prumbaum, Quentin, Roderer, Tacke, Siebolds, Schubert, Shaikh, Lill, Gatsogiannis and Raunser4, Reference Lehtinen, Munkberg, Hasselgren, Laine, Karras, Aittala and Aila6–Reference Tegunov and Cramer8) or noise distribution modeling using generative models on synthetic data.(Reference Li, Zhang, Wan, Yang, Li, Li, Han, Zhu and Zhang9) Methods utilizing synthetic data, however, often yield suboptimal results when applied to real datasets due to their inability to fully capture the true noise distribution. Consequently, N2N-based techniques have gained prevalence since they do not depend on synthetic data. Nonetheless, these methods typically require complex training protocols and the need for paired noisy images, which are usually generated by splitting data movies into even and odd frames (and aggregating them separately), inherently compromising their SNR. This limitation is particularly significant for small-sized proteins where the SNR is already low in the average of all movie frames, and using half the data further diminishes their SNR and consequently the efficacy of N2N denoising. Recent work in natural(Reference Milyaev and Laptev10) and biomedical imaging(Reference Laine, Jacquemet and Krull11) has demonstrated that high-quality denoised images can substantially enhance downstream task performance, such as object detection and segmentation. Inspired by these findings, we hypothesize that particle picking, especially under the challenging conditions of low SNR, can be significantly improved if the input micrographs are denoised effectively, ensuring that critical particle features are preserved.
Therefore, to overcome low SNR and expensive data labeling challenges, we propose a framework that enables joint image denoising and particle identification. The information learned from these tasks is complementary and therefore by enabling information sharing, we are able to improve the performance of both tasks. To enable image denoising without ground-truth clean images or paired noisy images, and particle picking with sparsely annotated data, our framework incorporates both self-supervised Bayesian denoising(Reference Laine, Karras, Lehtinen and Aila12) and positive-unlabeled (PU) learning.(Reference Bepler, Morin, Noble, Brasch, Shapiro and Berger3, Reference Kiryo, Niu, du Plessis and Sugiyama13, Reference Mann and McCallum14) By leveraging the prior distribution of the underlying clean signal, our model learns to denoise without being trained using paired noisy images. The joint learning framework is able to improve the visibility of particles on low SNR micrographs and identify more particles under challenging scenarios while maintaining a low false positive (FP) rate. Training of the framework eliminates the need for splitting movie stacks into two separate sub-stacks for paired denoising training and requires only a few partially labeled micrographs, thus greatly reducing laborious data labeling.
The proposed framework can be applied both to cryo-EM micrographs and cryo-electron tomography (CET) tilt series.(Reference Bartesaghi, Sprechmann, Liu, Randall, Sapiro and Subramaniam15, Reference Liu, Zhou and Bartesaghi16) To validate the performance of our approach, we performed experiments on three cryo-EM datasets and one CET dataset. The cryo-EM datasets correspond to one high-molecular-weight target (70S ribosome, 2.5 MDa) and two low-molecular-weight proteins (trehalose-6-phosphate synthase from fungal pathogen Cryptococcus neoformans [CnTPS, 280 kDa]) and a sample of aldolase (157 kDa) from EMPIAR-10215.
On the ribosome dataset, we evaluated the performance of our method under different SNR levels. Since micrographs obtained by averaging all movie frames have relatively high SNR, we are able to treat these frame averages as ground truth and use frame averages from subsets of frames to mimic decreasing SNR levels. Using these images, we show that our model is more robust to noise as SNR conditions worsen, with the denoised outputs exhibiting a $ 10\% $ improvement in peak SNR (PSNR) and the detection algorithm producing $ 30\% $ more particles.
For the CnTPS dataset, we labeled a total of 140 particles and were able to produce a set of particles that were missed by other algorithms, which led to a resolution of 3.15 Å after 3D reconstruction.(Reference Punjani, Rubinstein, Fleet and Brubaker17) On EMPIAR-10215, we used a total of 180 labeled particles for training and identified $ 8\% $ more particles than other approaches, which improved the final 3D resolution from 2.7 to 2.6 Å. We also show that compared to performing denoising and detection sequentially, joint learning is able to achieve better results in both tasks.
On the CET dataset, we show a significant improvement in protein visualization in both raw tilt series and reconstructed tomograms. We validate the detection results by identifying particles from 2D tilt series images, achieving a $ 50\% $ increase in precision and recall scores. For specific tasks that require either denoising or detection alone, the proposed framework also has a mode that can perform these two tasks separately.
To summarize, the main contributions of this paper are as follows:
-
1. We propose a joint learning framework that allows simultaneous micrograph denoising and particle identification. By leveraging self-supervised Bayesian denoising and PU learning, training of our model does not require ground-truth clean images or paired noisy images, and does not need large amounts of annotated data.
-
2. Through extensive experimental validation, we show that by performing two tasks simultaneously, we are able to achieve better denoising performance (i.e., increased particle visibility without over-smoothing) and identify particles even under extremely low SNR conditions, thus outperforming existing methods.
-
3. This method can be applied to single-particle cryo-EM micrographs and CET tilt series for both denoising and detection. Particle coordinates obtained from 2D tilt series can be converted to 3D tomogram coordinates through backprojection and used for sub-tomogram averaging.
2. Related work
Recent developments in deep learning have led to breakthrough performance in tasks such as image enhancement, object detection and segmentation. In this section, we introduce relevant recent work, including denoising without clean images, semi-supervised object detection, and multitask learning (MTL).
2.1. Denoising without clean images
Unlike denoising algorithms based on supervised learning which are trained using noisy-clean image pairs, blind image denoising is usually achieved by leveraging internal data statistics. Traditional methods based on internal statistics include nonlocal means, which predicts clean pixel values based on similar local neighborhoods,(Reference Coll and Morel18) and block-matching 3D (BM3D), which similarly relies on data repetitiveness.(Reference Dabov, Foi, Katkovnik and Egiazarian19) More recently, denoising methods based on convolutional neural networks have been proposed, including deep image prior which trains a neural network to learn the prior distribution of data from pure noise,(Reference Ulyanov, Vedaldi and Lempitsky20) N2N that learns to denoise using pairs of independently corrupted training images that share the same underlying signal,(Reference Lehtinen, Munkberg, Hasselgren, Laine, Karras, Aittala and Aila6) and Noise2Void, which assumes the independence of noise corruption for each pixel and trains the denoising network only using the single input noisy image by masking the central pixel.(Reference Krull, Buchholz and Jug21) Built on Noise2Void, a more generalized formulation was proposed in Reference (Reference Batson and Royer22) and was further improved by incorporating Bayesian statistics in Reference (Reference Laine, Karras, Lehtinen and Aila12). In addition, a number of related image denoising methods based on deep learning have been proposed.(Reference Lee and Jeong23–Reference Prakash, Krull and Jug27) For the case of cryo-EM images, implementations of N2N that also provide generalized pre-trained models for image denoising have been successfully applied.(Reference Bepler, Kelley, Noble and Berger7, Reference Palovcak, Asarnow, Campbell, Yu and Cheng28)
2.2. Semi-supervised object detection and its application in cryo-EM
Built upon supervised object detection frameworks,(Reference Redmon, Divvala, Girshick and Farhadi29–Reference Ren, He, Girshick and Sun32) semi-supervised methods that incorporate the use of unlabeled data through pseudo-labeling and consistency regularization, along with supervised learning of labeled data, have been proposed.(Reference Xie, Dai, Hovy, Luong and Le33–Reference Sohn, Berthelot, Li, Zhang, Carlini, Cubuk, Kurakin, Zhang and Raffel35) In Reference (Reference Jeong, Lee, Kim and Kwak36), data augmentation is applied and the model is trained through maximizing consistency between detection and classification outputs of labeled/unlabeled images and their augmented pairs. Sohn et al.(Reference Sohn, Zhang, Li, Zhang, Lee and Pfister37) proposed a semi-supervised framework that use labeled data to first train a teacher model and then update the model by maximizing the consistency between output pseudo-labels of unlabeled data and strongly augmented pairs. To alleviate the problem of confirmation bias and further improve the quality of pseudo annotations, Feng Zhou et al.(Reference Feng Zhou, Yu, Wang, Qian and Li38) proposed an adoption of instant teacher that uses instant pseudo labeling with extended weak-strong data augmentations for teaching during each training iteration, instead of training in a sequential manner. In the cryo-EM field, several deep learning-based object detection methods have been used successfully for particle detection.(Reference Bepler, Morin, Noble, Brasch, Shapiro and Berger3, Reference Wagner, Merino, Stabrin, Moriya, Antoni, Apelbaum, Hagel, Sitsel, Raisch, Prumbaum, Quentin, Roderer, Tacke, Siebolds, Schubert, Shaikh, Lill, Gatsogiannis and Raunser4, Reference Tegunov and Cramer8, Reference Wang, Gong, Liu, Li, Yan, Xia, Li and Zeng39, Reference Al-Azzawi, Ouadou, Max, Duan, Tanner and Cheng40) crYOLO(Reference Wagner, Merino, Stabrin, Moriya, Antoni, Apelbaum, Hagel, Sitsel, Raisch, Prumbaum, Quentin, Roderer, Tacke, Siebolds, Schubert, Shaikh, Lill, Gatsogiannis and Raunser4) is a fully supervised picking method trained with a variety of annotated data built upon the popular YOLO model. Topaz(Reference Bepler, Morin, Noble, Brasch, Shapiro and Berger3) is a semi-supervised particle picking method based on PU learning that aims to minimize the difference between empirical data distribution and its prior. While generalized pre-trained models are available both in crYOLO and Topaz, these tend to underperform when applied to lower SNR datasets where the data distribution deviates from the distribution present in the training set.
2.3. MTL and object detection
While neural networks have shown impressive results for various tasks such as the aforementioned denoising and detection, these tasks are solved in isolation, that is, a different network is trained for each task. Recently, MTL which aims to train a single network to perform different tasks simultaneously has shown promising results.(Reference Ruder41–Reference Liu, Qiu and Huang47) MTL leverages complementary information shared between related tasks to improve model generalization and the performance of the original task. Associated tasks can also act as regularizers for one another. Earlier applications of MTL in object detection include Mask-RCNN, which extends the Faster R-CNN(Reference Ren, He, Girshick and Sun48) by adding a branch for predicting segmentation masks for each object detected in an image. More recently, object detection tasks are often combined with other tasks such as estimating distance between detected objects,(Reference Chen, Zhao, Lv and Zhang49) using segmentation masks as attention,(Reference Perreault, Bilodeau, Saunier and H’eritier50) using attention mechanisms to dynamically weigh the contributions of different tasks during training and inference,(Reference Liu, Johns and Davison51) and performing denoising and detection through a cascaded network in a supervised manner.(Reference Gubins and Veltkamp52) Inspired by these MTL applications, Buchholz et al. combined denoising and supervised segmentation and applied the joint model to fluorescence microscopy images.(Reference Buchholz, Prakash, Krull, Jug, Bartoli and Fusiello53) Most of these strategies operate on images with a relatively high SNR and require full supervision on at least one task. In earlier versions of our work,(Reference Huang, Zhou and Bartesaghi54, Reference Huang, Zhou, Liu and Bartesaghi55) there have also been advances in joint learning techniques applied to lower SNR cryo-EM images that require less supervision. However, these methods have limitations, as they either rely heavily on pixel intensity assumptions or perform poorly when applied to images of crowded environments. Building on the success of these methods, we seek to extend their applicability to lower SNR datasets such as those obtained from SPA and CET modalities of cryo-EM.
3. Method
As shown in Figure 2, the proposed framework is composed of three main components: (i) a U-Net-based feature extraction backbone; (ii) a denoising head that outputs the clean image and a detection head that outputs the probability of a pixel being at the center of a particle; and (iii) an auxiliary noise estimation network that estimates the probability distribution of the noise. Non-max suppression is applied to the detection head output to get the final particle coordinates. If only denoising or detection is needed, the other head can be easily turned off so that the framework performs only a single task. During the training stage, cropped patches of noisy micrographs are fed into the network. During the inference stage, the network takes in the whole micrograph and outputs a denoised image and a heatmap used for particle localization.
3.1. Denoiser training using a single noisy micrograph
The denoiser is trained based upon a blindspot convolutional neural network proposed in Reference (Reference Laine, Karras, Lehtinen and Aila12) that learns the underlying clean signal by maximizing the posterior probability of the observed noisy signal. Denote the underlying clean pixel value as $ x $ and the observed noisy pixel value as $ y $ . As pixels are not independent, we assume that $ x $ not only depends on $ y $ , but also its surrounding context, which we denote as $ {\Omega}_y $ . Specifically, we model the dependence of $ x $ and its surrounding context $ {\Omega}_y $ as $ p\left(x|{\Omega}_y\right) $ . In a convolutional neural network setting, $ {\Omega}_y $ can be viewed as the receptive field surrounding the center pixel $ x $ . If we assume noise is independent between pixels and independent of the context, we can model the noise distribution $ p\left(y|x\right) $ explicitly. With this, we can connect the observed noisy data to the unobserved clean signal using the noise model $ p\left(y|x\right) $ and the signal model $ p\left(x|{\Omega}_y\right) $ through marginalization:
Specifically, we assume the noise model $ p\left(y|x\right) $ to be Gaussian or Poisson (as commonly assumed in cryo-EM applications). We model the prior $ p\left(x|{\Omega}_y\right) $ as Gaussian $ \mathcal{N}\left({\mu}_x,{\sigma}_x\right) $ as well, so that the marginal likelihood $ p\left(y|{\Omega}_y\right) $ can be computed in closed form. Note that since all micrographs are grayscale images, the prior is a univariate Gaussian. Finally, we are able to infer $ {\mu}_x $ and $ {\sigma}_x $ by maximizing the data likelihood under Equation 1. Once we approximate the prior statistics, the final denoised image incorporates the observed noisy signal information through Bayesian reasoning. The posterior $ p\left(x|y,{\Omega}_y\right) $ is proportional to the product of the noise model $ p\left(y|x\right) $ and the prior $ p\left(x|{\Omega}_y\right) $ :
With this, the denoised image can be obtained by calculating the maximum a posteriori estimate.
As mentioned above, our proposed model supports both Gaussian noise and Poisson noise assumptions. Next, we discuss how our model adapts to these noise models.
3.1.1. Gaussian noise model
Since we model the signal corruption as zero-mean Gaussian noise,(Reference Penczek56) the noise model $ p\left(y|x\right) $ can be expressed as a normal distribution $ \mathcal{N}\left({\mu}_y,{\sigma}_y\right) $ where:
with $ \sigma $ being the standard deviation of the noise which is learned through an auxiliary noise estimator. For simplicity, we decided to use the same backbone architecture (UNet) for this purpose. However, a simpler architecture with fewer parameters such as a multilayer feed-forward convolutional neural network can also be used. Next, we minimize the negative log-likelihood of the observed training data:
where $ C $ is a constant that can be ignored. Once the prior statistics are learned, we obtain the denoised image using the posterior mean:
3.1.2. Poisson noise model
While in most cryo-EM applications, the noise is assumed to be Gaussian, a more accurate assumption is to use a Poisson distribution parameterized by the electron count $ \lambda $ . Therefore, we model the noise distribution as $ y=\mathrm{Poisson}\left(\lambda x\right)/\lambda $ and approximate Poisson noise as signal-dependent Gaussian noise.(Reference Hasinoff and Ikeuchi57) In this case, the resulting noise standard deviation is $ \sigma =\sqrt{\left(x/\lambda \right)} $ and the noise model becomes:
We substituted the above model into the loss function in Equation 4 and the posterior mean in Equation 5, where $ \sigma $ becomes $ {\lambda}^{-1}{\mu}_x $ and $ \lambda $ is learned using the auxiliary noise estimator.
This approach allows us to train the network without the need of paired noisy images, which reduces the time needed to divide the aligned cryo-EM movie stacks into two parts. It also reduces potential cumulative errors originating from frame alignment as one of the principles of N2N is that the underlying clean signal has to be the same for both noisy inputs. This condition is hard to satisfy in practice due to the low SNR of cryo-EM frames, which results in inaccurate image alignments.
3.2. Detector training using a partially annotated micrograph
The particle detector produces a center point heatmap $ \hat{Y}\in {\left[0,1\right]}^{W\times H} $ , where $ W $ and $ H $ are the width and height for the input micrograph. We extend the method used by Zhou et al.(Reference Zhou, Wang and Krähenbühl58) to generate the ground-truth heatmap $ Y\in {\left[-1,1\right]}^{W\times H} $ using the partially annotated micrograph. For each annotated center coordinate position $ p=\left(x,y\right) $ , we apply a Gaussian kernel $ {K}_{xy}=\exp \left(-\frac{{\left(x-{p}_x\right)}^2+{\left(y-{p}_y\right)}^2}{2{\sigma}_k^2}\right) $ , where $ {\sigma}_k $ is determined by the particle size based on Reference (Reference Law and Deng59). The remaining unlabeled coordinates on $ Y $ have a value of $ -1 $ . With such formulation, the detector learns to classify each pixel $ {v}_{i,j} $ at position $ \left(i,j\right) $ to the corresponding $ {\hat{y}}_{i,j}\in \left[0,1\right] $ . Denote $ p(v) $ as the underlying data distribution from which $ {v}_{i,j} $ is sampled from. Denote $ g:{\mathrm{\mathbb{R}}}^d\to \mathrm{\mathbb{R}} $ as an arbitrary classifier that can be parameterized by a neural network, and $ l\left(g(v)=\hat{y},y\right) $ as the loss incurred when the model outputs $ \hat{y} $ when the ground truth is $ y $ . If all the pixels are labeled, this is essentially a binary classification problem that can be optimized using a standard positive–negative learning approach with the following risk minimization function:
where $ {\tilde{R}}_p^{+}(g) $ is $ {\unicode{x1D53C}}_{v\sim {p}_p(v)}\left[l\left(g\left({v}^p\right),\hskip0.35em y=1\right)\right] $ and can be estimated as $ 1/{n}_p{\sum}_{i=1}^{n_p}l\left({\hat{y}}_p^i,1\right) $ , and $ {\tilde{R}}_n^{-}(g) $ is $ {\unicode{x1D53C}}_{v\sim {p}_n(v)}\left[l\left(g\left({v}^n\right),y=0\right)\right] $ which can be estimated as $ 1/{n}_n{\sum}_{i=1}^{n_n}l\left({\hat{y}}_n^i,0\right) $ , with $ {n}_p $ and $ {n}_n $ being the number of positive and negative pixels, respectively. In the partially annotated case, only a few positive pixels are labeled (the remainder of the data is unlabeled), which allows us to reformulate the problem into the PU setting. The positive labeled pixels are sampled from $ {p}_p\left(v|y=1\right) $ and the remaining unlabeled pixels are sampled from $ p(v) $ . The classifier $ g $ should learn to classify labeled positives, and the empirical distribution of output probabilities of unlabeled pixels should follow the prior distribution $ p(v) $ , that is, $ {argmin}_g\hskip0.35em KL\left[{q}_g(v)\parallel p(v)\right] $ . We can therefore rewrite the risk minimization expression in the PU setting as follows:
As suggested in Reference (Reference Bepler, Morin, Noble, Brasch, Shapiro and Berger3), we model $ p(v) $ as a binomial distribution parameterized by $ {\pi}_p $ , which represents the probability of observing a particle. For a mini-batch that contains K samples (pixels):
where $ n $ is the number of positive samples. The empirical distribution of the number of positives in the unlabeled samples from the mini-batch obtained from the classifier $ g $ is:
with $ \mathbf{y} $ being the indicator vector with $ K $ elements, where $ {y}_i=1 $ for positive data and $ {y}_i=0 $ for negatives, and $ Y(n) $ denotes the collection of all such vectors with $ n $ positive data points in total. By substituting $ {q}_g(v) $ and $ p(v) $ using Equations 9 and 10, we can therefore rewrite Equation 8 as follows:
In practice, since the ground-truth heatmap is splatted with Gaussian kernels, the labels are not strictly binary. Therefore, we adopt a pixel-wise logistic regression with focal loss for $ l\left(g(v),y>0\right) $ . Specifically, we have:
where $ \alpha $ and $ \beta $ are the focal loss parameters. We use $ \alpha =2,\hskip0.35em \beta =4 $ in our experiments.
3.3. Overall training objective
In addition to self-supervised denoising and pixel-level PU learning, we imposed consistency regularization to further improve detection performance. For each noisy input $ I $ , we randomly apply horizontal and vertical flipping, and $ \mathrm{90,180},\mathrm{and}\;270 $ degree rotations to generate augmented pairs $ A(I) $ , where $ A\left(\cdot \right) $ denotes the applied transformation. The augmented images are fed into the same network and the network outputs the denoised statistics and detection heatmap. We assume equivariance in the heatmap, that is, $ {\hat{Y}}_{A(I)}=A\left({\hat{Y}}_I\right) $ . Therefore, to impose consistency regularization, we define the consistency loss as follows:
where $ {\left\Vert \cdot \right\Vert}_2^2 $ denotes the L2 norm error. The final training objective for the joint framework is thus:
where $ \theta $ is the weight for each task and $ \gamma $ is the weight used for consistency regularization.
3.4. Implementation details
3.4.1. Model architecture
We adopt a U-Net-based architecture for the feature extraction backbone that uses a shifted convolutional layer, as proposed in Reference (Reference Laine, Karras, Lehtinen and Aila12) (instead of a normal convolutional layer). The shifted convolutional layer creates a restrictive receptive field that does not include the center pixel. The model contains three encoding layers (downsampling) and three decoding layers (upsampling) with skip connections. For smaller-sized proteins, it is beneficial to use a shallow architecture so that finer details can be learned. The noise estimation network is implemented using the same U-Net architecture, but using standard convolutional layers.
3.4.2. Training of the network
During training, patches of size $ 64\times 64 $ that contain particles are used as input to the network. Random horizontal and vertical flipping are applied as data augmentation. We use a batch size of 16. In order to prevent training instability and over-fitting, we adopted two strategies: (i) learning rate warm up and (ii) early stopping. We start learning with a zero learning rate and gradually increase the learning rate to $ 0.001 $ for the first $ 20\% $ of training iterations. After reaching the initial rate, we adopted cosine decay scheduler that progressively lowers the rate down until the end of training. Early stopping is used when there is less than $ 0.5\% $ of change in the overall loss. We experimentally found out that $ \theta =0.75 $ , $ \gamma =0.1 $ , and $ \lambda =2 $ work well and we adopt the ADAM optimizer. Training with $ \mathrm{200,000} $ iterations takes less than an hour, and usually the model converges within $ \mathrm{150,000} $ iterations. The model is trained on a single NVIDIA Tesla V100 GPU with 32GB of RAM.
3.4.3. Inference of the network
Given a trained joint model, we perform image denoising and particle detection on full-size micrographs. Given the trained model, inference takes less than a second on each micrograph. To obtain the final particle coordinates, non-maximum suppression is applied. The radius of non-maximum suppression is determined based on the half major-axis length of the particles.
4. Results
4.1. Description of datasets
We evaluate our methods on one cryo-EM dataset of K63 ribosomes (high-molecular weight), two challenging low-molecular-weight cryo-EM datasets: trehalose-6-phosphate synthase from fungal pathogen C. neoformans (CnTPS) and aldolase from EMPIAR-10215, and one CET dataset of 70S ribosomes from EMPIAR-10304.
4.1.1. K63 ribosomes
This dataset contains 1,000 movies with defocus values ranging from 0.8 to 3.0 μm. The pixel size is 1.08 Å. Each movie contains 60 frames. Full frame averages have relatively high SNR as ribosomes have high molecular weight. In order to simulate increasingly lower SNR conditions, such as those observed for lower molecular weight proteins and lower defocus datasets, we use partial averages calculated from $ 10\% $ , $ 20\% $ , and $ 60\% $ of the total number of frames (6, 12, and 36 frames, respectively). We treat full frame averages as “pseudo clean images” for the denoising task and particles identified using these averages as pseudo ground truth. Movies are motion corrected using MotionCor2.(Reference Kremer, Mastronarde and McIntosh61) We used 16 micrographs for training and the remaining ones for testing. Each micrograph is labeled with 20–40 particles. The total training set is composed of 500 particles, which accounts to around $ 0.04\% $ of the total number of particles in the dataset. Micrographs are downsampled by a factor of $ 8 $ from $ 4096\times 4096 $ to $ 512\times 512 $ and no exposure weighting was used.
4.1.2. Trehalose-6-phosphate synthase from fungal pathogen C. neoformans (CnTPS)
This dataset contains 3,177 movies with defocus values ranging from 0.5 to 2.5 μm, each of size $ 5760\times 4092 $ pixels. Movies were collected at a nominal magnification of $ \mathrm{81,000}\times $ using a pixel size of 1.08 Å. Each movie contains 60 frames. As the dataset is of low-molecular weight and already has low SNR, we used the full frame averages in this case. We followed the same motion correction and downsampling procedure (4× binning) as in the K63 ribosome dataset. The entire training set is composed of 140 labeled particles selected from 2 micrographs.
4.1.3. Rabbit muscle aldolase (EMPIAR-10215)
This dataset of rabbit muscle aldolase contains 1,052 movies with defocus values ranging from 1 to 2 μm, each of size $ 7420\times 7676 $ pixels. Movies were collected in super-resolution mode using a pixel size of 0.416 Å. Similar to CnTPS, we used full frame averages for both denoising and detection tasks. We followed the same motion correction and downsampling procedure (4× binning). The training set was composed of 180 labeled particles selected from 2 micrographs.
4.1.4. 70S ribosomes (EMPIAR-10304)
This CET dataset consists of 12 tilt series from a sample of purified 70S ribosomes. Each tilt series is composed of 41 projection images ranging from −60° to +60°. A single tilt image has a size of $ 4092\times 5760 $ pixels, with a pixel size of 2.1 Å. We downsampled all tilt images by a factor of 8 and performed both tasks on the binned images. The training set for detection is composed of 211 particles selected from 3 images at zero degree tilt.
4.2. Joint training improves micrograph SNR and particle visibility
First, we assessed the ability of joint training to denoise images from all datasets and compare our method against BM3D and Topaz denoise. For Topaz denoise (both for 2D micrographs and 3D tomograms), in order to achieve optimal performance, we followed the guidelines in the tutorial (https://github.com/tbepler/topaz) and trained dataset-specific models for each dataset. As model training requires pairs of noisy images, we used the aligned stacks from MotionCor2(Reference Zheng, Palovcak, Armache, Verba, Cheng and Agard60) and divided the stack into even and odd stacks based on the frame number. The frame averages are obtained from each stack and then fed into the network for training. We used the same training sets for both denoising and detection. For the K63 ribosome dataset, we treat the full dose frame averages as ground-truth clean images and use PSNR values to evaluate denoising performance. As full dosage micrographs can still be noisy, we apply a low-pass filter to remove potential high-frequency noise. As shown in Table 1, the left half shows the mean and standard deviation of PSNR values computed against full dosage frame averages and the right half shows values calculated against low-pass filtered images. Since low-pass filtering can remove high-frequency noise as well as information, we expect the actual PSNR value to lie in between the two reported values. As SNR decreases, our joint learning model is able to significantly outperform other methods. Additionally, joint learning with consistency regularization performs better than learning without consistency across all SNR levels. Overall, neural network-based methods perform better than the traditional denoising method BM3D, especially at lower SNR levels. In particular, we can confirm that at the lowest SNR, our denoising method based on Reference (Reference Laine, Karras, Lehtinen and Aila12) performs much better than Topaz denoise. It is worth pointing out that inputs to Topaz denoise have even lower SNR because training of the network requires splitting the noisy images into odd and even subsets of frame averages.
Our results also confirm that the performance of the N2N-based methods is limited by the input noise levels. When input images have extremely low SNR, the network is not able to fully distinguish real signal from noise, thus decreasing the probability of learning the correct noise distribution from pairs of noise-corrupted images. At lower SNR, our joint training method with consistency regularization also performs better than the pure denoising method, further demonstrating that the information obtained from the detection task can be used to improve the denoising task. It should be noted that pure denoising has better performance than joint learning without consistency. This shows the importance of consistency regularization: information shared between multiple tasks is beneficial if properly regularized. Otherwise, naive information sharing can affect the performance negatively. In addition, we observed that using smaller batch sizes can sometimes benefit the denoising task, whereas bigger sizes can improve the performance of the detection task but worsen the denoising performance. With consistency regularization, the influence of the batch size on both tasks is mitigated.
At higher SNR, our joint learning method represents less of an advantage and only performs slightly better than or similar to other methods. This shows that the benefits of the joint learning approach are most noticeable at lower SNRs where the information from the different tasks is successfully aggregated. Conversely, when a sufficient amount of information is present, complementary information from additional tasks is less useful. In addition, since our method incorporates information from the noisy input during the inference process, compared to both low-pass filtering and Topaz denoising, our approach is able to avoid over-smoothness while preserving more structural information.
For the remaining datasets, as clean images do not exist and the main purpose of image denoising is to facilitate downstream particle identification process by enhancing contrast between background and particle, instead of calculating quantitative measurements such as PSNR, we provide qualitative evaluations, that is, whether particles become more visible after denoising. As shown in Figures 3 and 4, for cryo-EM single-particle micrographs from K63, CnTPS and EMPIAR-10215, our proposed solution is able to better preserve high-frequency particle information without over-smoothing.
For the CET dataset (EMPIAR-10304), we provide qualitative evaluations for both the denoised tilt series and 3D tomograms reconstructed using denoised tilt series (Figure 5). After obtaining affine transforms from tilt series alignment using fiducial-based alignment in IMOD,(Reference Kremer, Mastronarde and McIntosh61) 3D tomogram reconstruction is performed using IMOD’s backprojection algorithm. The model is able to smooth the background noise while preserving the structural features in all cases. Figure 4 shows that for low-molecular weight-particles that are barely visible on raw micrographs, after denoising, the particles become more visible. Similarly, for CET datasets, there is substantial contrast improvement for tomograms reconstructed using denoised tilt series, where reconstructions show a smooth background.
4.3. Joint training improves particle identification in challenging datasets
Next, we assess the ability of the proposed joint framework to detect particles under low SNR conditions and in densely populated environments. In addition, we compare the particles picked using our method with Topaz and crYOLO (K63 ribosome and EMPIAR-10304 datasets only). Since both our proposed framework and Topaz require a few annotated positive particles for training, we used the same training set in all the experiments except for the CnTPS dataset. In this case, we have to use more labeled particles (280 particles) to train Topaz, because the model trained using the original labels was unable to produce a sufficient amount of particles.
For Topaz, instead of fine-tuning the generalized model using new annotated data from all datasets, we trained data-specific models from scratch and adjusted the hyperparameters based on the instructions provided in Reference (Reference Bepler, Morin, Noble, Brasch, Shapiro and Berger3). It is worth noting that because our datasets are very densely populated with irregular-shaped particles, we encountered difficulties with convergence during training when we attempted to fine-tune the pre-trained model.
For crYOLO, we evaluated picking performance using the following: (i) the pre-trained generalized model (https://cryolo.readthedocs.io/en/stable/tutorials/) with crYOLO’s built-in low-pass filtering (crYOLO column in Table 2), (ii) a fine-tuned model trained using our annotations with crYOLO’s built-in low-pass filtering (crYOLO Fine-Tune column in Table 2), and (iii) a fine-tuned model using our denoised micrographs as input (crYOLO with our denoise column in Table 2). As crYOLO is a fully supervised algorithm and training data require specifying a bounding box size in addition to particle locations, we manually annotated different sets of micrographs and particles. Specifically, for K63 ribosomes, we manually annotated three micrographs, which included 500 particles in total. The newly annotated data were used as training data for model refinement.
We evaluated detection performance at both 2D and 3D levels. For K63 ribosomes and EMPIAR-10304 that have ground-truth annotations, 2D results are evaluated using precision, recall, and precision–recall area under curve, as shown in Table 2. For the K63 ribosome dataset, we compared the particles identified using $ 10\% $ of frames against ground-truth particles identified using full dose frame averages in terms of precision and recall scores and the total number of good particles after 2D classification. For EMPIAR-10304, scores are obtained by comparing against manual labels. We treat particles picked using the full dose as ground truth. True positives (TP) correspond to particles that are detected on both full dose and partial dose micrographs. FPs correspond to particles that were identified on partial dose but not on full dose micrographs. False negatives (FNs) are particles located on full dose but not on the averages obtained from the partial dose micrographs. To account for small variations in the detected particle centers, instead of looking at a single pixel, we also look at pixels around the center pixel within a certain radius. For example, suppose a particle is detected on a full dose micrograph; if the particle is also detected on the partial dose image within a radius of $ r $ , then the two particles are considered as a true positive match. We use a radius of $ 5 $ pixels. This way, precision and recall are calculated as follows:
For CnTPS and EMPIAR-10215, as ground-truth labels for the whole dataset do not exist, in 2D, the performance is evaluated only in terms of the total number of good particles after 2D classification and the percentage of good particles. In 3D, the performance is evaluated using the final resolution of the 3D reconstruction. Final resolution is determined using the Fourier shell correlation (FSC), which measures the normalized cross-correlation coefficient between two independently determined reconstructions, each based on half of the particles in the dataset:
where $ r $ represents a frequency shell in Fourier space, and $ F\left({r}_i\right) $ is the Fourier component at the $ i $ th shell. All processing steps (2D classification, 3D reconstruction, and map sharpening) were performed using the cryoSPARC software.(Reference Punjani, Rubinstein, Fleet and Brubaker17)
4.3.1. K63 ribosome
Similar to the denoising task, we found that as the SNR increases, all methods tend to produce similar results both in terms of precision–recall values and total number of clean particles picked. At the lowest SNR ( $ 10\% $ frame averages), however, our joint learning method with consistency regularization is able to perform much better than others. At a cutoff probability threshold of $ p=0.5 $ (if the probability of being a particle is greater than $ 0.5 $ , we treat is as a particle), our method is able to obtain precision, recall, and number of good particles of $ 0.83 $ , $ 0.43 $ , and $ \mathrm{61,955} $ , respectively, compared to $ 0.87 $ , $ 0.27 $ , and $ \mathrm{40,712} $ from Topaz, and $ 0.573 $ , $ 0.4 $ , and $ \mathrm{38,472} $ from crYOLO’s fine-tuned model. crYOLO tends to identify more FPs compared to Topaz and our algorithm. Both crYOLO’s generalized model with low-pass filtering and crYOLO’s fine-tuned model using our denoised micrographs were not able to perform well. As both denoised micrographs and $ 10\% $ frame averages are very different from micrographs used for training crYOLO’s generalized model, the pre-trained model is not able to generalize well to unseen data from a different data distribution.
4.3.2. EMPIAR-10215
After 2D classification and filtering, 529,212 clean particles from an initial set of 797,229 particles picked by Topaz were selected. For our approach, 574,073 good particles were identified from an initial set of 851,882 positions. Similar to CnTPS, clean particles obtained from 2D classification were used for further processing with fixed seeding. Ab initio reconstruction was performed in cryoSPARC followed by eight iterations of nonuniform refinement using D2 symmetry. Particles selected using our method resulted in a reconstruction at 2.6 Å resolution (using the 0.143-cutoff FSC criteria), compared to 2.7 Å obtained using particles picked by Topaz.
4.3.3. CnTPS
After 2D classification and filtering, $ \mathrm{1,228,334} $ clean particles were left from an initial set of $ \mathrm{1,638,619} $ positions picked by Topaz. Our approach resulted in $ \mathrm{1,927,359} $ good particles from a total of $ \mathrm{2,403,560} $ starting positions. As shown in Figure 6, despite using fewer particles for training, our method is able to identify more good particles while keeping a low FP rate (ours $ 21\% $ vs. Topaz $ 25\% $ ). We then compared the quality of each particle set through 3D reconstruction. For each set of particles, clean particles obtained from 2D classification were used for further processing using fixed seeding. We confirmed that the 2D class averages showed the characteristic views of the complex. With that knowledge, ab initio reconstruction was performed in cryoSPARC followed by 7 iterations of homogeneous refinement using D2 symmetry. We found that particles picked using our method yield a structure at 3.0 Å resolution determined using the 0.143-cutoff of the FSC curve, compared to 3.1 Å obtained from Topaz. While our method results in almost $ \mathrm{600,000} $ more particles compared to Topaz, the resolution difference is not significant, indicating that factors other than the number of particles may be limiting the resolution in this case. In order to further assess the quality of the particles, we removed particles identified using Topaz from the set of particles obtained using our method, and used these particles to produce a 3D reconstruction. Using the resulting $ \mathrm{866,339} $ particles (picked by our method and missed by Topaz), we obtained a 3D structure at 3.15 Å (Figure 6). This validates the quality of the set of additional particles identified using our method, showing that they can be used to produce a reconstruction of similar resolution to that obtained using Topaz particles only.
4.3.4. EMPIAR-10304
We manually labeled 12 tilt series and treated manual labels as the ground truth. Precision, recall, and F1 scores are calculated as described in Equation 15. Using a radius of three pixels, our method obtains a precision of 0.70 and recall of 0.86, compared to 0.48 and 0.67 obtained from Topaz picking, and 0.34 and 0.22 obtained from crYOLO. Similar to the K63 ribosome dataset, we observed a lack of generalizability of Topaz and crYOLO to unseen data distributions.
5. Discussion
In this section, we delve into a deeper analysis of the proposed joint-learning framework. Specifically, we look at how denoising can be beneficial to detection, how to choose the proper positive prior $ {\pi}_p $ , and analyze the effect of hyperparameters on overall performance.
5.1. Features learned through denoising can be directly used for detection
In order to better understand how denoising can be beneficial to detection, we froze the weights of the detection head, trained the model with denoise loss exclusively (without the PU loss for detection), and assessed the resulting particle center heatmap generated by the untrained detection head (the input features to the detection head are trained through denosing loss only). The model received inputs consisting of both particle-containing patches and random patches cropped from the input micrograph. As Figure 7 suggests, at the beginning of training, where all weights are randomly initialized, it is very hard to separate particles from the background using the resulting heatmap. As weights are updated based on the denoising loss, input features to the detection head are no longer random. Therefore, even though the weights of the detection head remain random, by using learned features, it is easier to differentiate particles from the background on the generated output. This shows that with denoising, the background and particle features become more easily separable. This separability improves particle detection accuracy. This also means that features extracted through denoising learning are good enough for particle detection and additional feature extraction is not necessary. Therefore, instead of performing denoising and detection independently where features are learned multiple times, these two tasks can be performed jointly as the same set of features can be used to perform both tasks. In addition, as shown in Table 1, joint learning leads to better denoising results compared to performing denoising alone.
5.2. Prior belief and its effect on performance
As particles are identified on a per-pixel basis, even for crowded datasets, we use a relatively small class prior probability $ \pi $ . For CnTPS, EMPIAR-10215, and EMPIAR-10304, we used $ \pi =0.08 $ , and for the K63 ribosome, we used $ \pi =0.01 $ . This is because the actual number of pixels that belong to the center of particles is relatively small compared to the total number of pixels in a micrograph. If we use a larger positive prior, more FPs get identified and the resulting heatmaps tend to have higher scores. Similarly, if we use smaller positive priors, we are able to maintain a high true-positive rate while sacrificing the rate of FNs. While having a reliable estimation of $ \pi $ is important, this can be obtained by simple visual inspection of micrographs and the size of the particles when doing annotations. For instance, $ \pi $ can be estimated by calculating the product of the expected number of particles per micrograph by the number of pixels per particle, and then dividing this product by the total number of pixels in a micrograph.
5.3. Effects of other hyperparameters on performance
In order to assess how hyperparameters in the loss function (Equation 14) can affect the performance of our algorithm, we performed experiments with $ \theta =\mathrm{0.3,0.5,0.75} $ , and $ 0.95 $ . We found that as $ \theta $ decreases, the performance of both denoising and detection tasks worsens. This makes sense as while information provided by both tasks is complementary, the detection task is more dependent upon the denoising task: information from the detection task can be used to refine learned features from the denoising task, instead of generating those features (as suggested in the previous section). Therefore, the denoising task should be the dominating task in this case. The choice of $ \alpha $ and $ \beta $ depends on the SNR level of the datasets. For datasets with lower SNR where proteins are harder to identify, $ \alpha >1 $ results in better performance. Increasing $ \beta $ forces pixels located at the center of particles to have a higher detection probability (classified as particle) and pixels around the center of particles to have a lower probability (classified as background). Therefore, for well-annotated training data, increasing $ \beta $ leads to better center localization. Experimentally, we found that $ \alpha =2 $ and $ \beta =4 $ works for most cases. The consistency regularization weight $ \gamma $ has only minor effects on the overall performance.
6. Conclusion
Even though many deep-learning-based denoising and particle picking methods have been proposed, existing methods either require a large amount of training data or additional manual paired image curation, and most importantly, they yield unsatisfying results on extremely low SNR and crowded datasets. In these cases, the use of denoising as a preprocessing step can aid the detection task. However, this step takes additional time, and in many cases, naive denoising can lead to over-smoothing, which leads to a high FN rate. Intuitively, micrograph denoising and particle identification should not be treated as two completely different problems. Instead, these two tasks are complementary: the underlying data distribution for background and particles are different, and by knowing the location of particles, we should be able to produce a denoised micrograph that better differentiates particles from background. Similarly, with cleaner micrographs, we should be able to identify particles more easily. Therefore, by enabling information sharing through joint training, we expect the performance of both tasks to improve. Here, we present a framework that performs micrograph denoising and particle detection simultaneously without the need of noiseless images or fully annotated datasets. The framework is able to identify small particles under low SNR conditions and enables better visualization of these particles in both single-particle cryo-EM and CET images. In addition, this approach can be easily adapted to perform a single task if only one of the tasks is needed.
Even though the proposed framework uses a dataset-specific model, training requires only a few partially annotated micrographs with a total of 100+ particle annotations, which significantly reduces manual labor and training time. In addition, as shown in Reference (Reference Bepler, Morin, Noble, Brasch, Shapiro and Berger3), dataset-specific models often outperform generalized particle picking models, especially in the case of challenging datasets. Similar to Reference (Reference Bepler, Morin, Noble, Brasch, Shapiro and Berger3), our method requires only one hyperparameter, the class prior, which can be estimated by visual inspection of micrographs during data labeling. While the default value for the task weight ( $ \alpha $ ) works well on multiple datasets, further dataset-specific tuning can further improve the performance.
Our joint framework offers a viable way to enable particle localization under some of the most challenging scenarios, such as those encountered when imaging proteins with low-molecular weight or in crowded environments. The proposed method is modular, which allows it to be easily incorporated into existing data processing software and is currently integrated into nextPYP.(Reference Liu, Zhou, Huang, Piland, Jin, Mandel, Du, Martin and Bartesaghi62) A pre-trained model for micrograph denoising can be downloaded from Zenodo. In addition, the framework can be extended to the preprocessing of CET tilt series to improve tilt series alignment and tomogram reconstruction. We hope that our algorithm will facilitate and expedite the structural analysis of challenging biomolecular targets such as low-molecular-weight complexes or thick specimens imaged in their native context.
Data availability statement
The source code is freely available from https://github.com/nextpyp/spr_pick. A pre-trained model for the denoising module is available at https://doi.org/10.5281/zenodo.10445797.
Acknowledgments
The authors would like to thank Drs. Richard Brennan and Erica Washington for providing access to the single-particle dataset of CnTPS, and Dr. Gustavo Silva for providing access to the K63 ribosome dataset. This study utilized the computational resources offered by Duke Research Computing (http://rc.duke.edu). The authors would also like to thank Tracy Futhey, Charley Kneifel, Katie Kilroy, Mike Newton, Victor Orlikowski, Tom Milledge, and David Lane from the Duke Office of Information Technology and Research Computing for providing assistance with the computing environment.
Author contribution
Conceptualization: Q.H. and A.B.; Methodology and implementation: Q.H.; Data processing and visualization: Y.Z. and H.L.; Writing original draft: Q.H. All authors approved the final submitted draft.
Funding statement
This study has been made possible by a CZI grant DAF2021-234602 and grant DOI https://doi.org/10.37921/181830egjglp from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation (funder DOI https://doi.org/10.13039/100014989).
Competing interest
The authors declare no competing interests exist.
Ethical standard
The research meets all ethical guidelines, including adherence to the legal requirements of the study country.