1. Introduction
Visual place recognition (VPR) is essential in autonomous robot navigation. VPR enables a robot to recognize previously visited places using visual data. VPR provides loop closure information for a simultaneous localization and mapping (SLAM) algorithm to obtain a globally consistent map. Furthermore, VPR can support re-localization in a pre-built map of an environment. Due to its essential role, many VPR methods [Reference Lowry, Sünderhauf, Newman, Leonard, Cox, Corke and Milford1] have been proposed. However, in long-term navigation tasks, significant appearance variation, typically caused by seasonal change, illumination change, weather change, and dynamic objects, such as those shown in Fig. 1, is still a challenge to VPR.
VPR is typically formulated as an image-matching procedure, which can be divided into two steps. The first step, also known as loop closure detection in the literature, selects candidates: map images are represented by global descriptors, and they are compared with the current robot view in terms of image similarity. In the second step, verification is conducted via multi-view geometry, which uses keypoints in the images to determine whether a query image (current robot view) is geometrically consistent with a candidate map image [Reference Alahi, Ortiz and Vandergheynst2–Reference Wang and Lin8]. In this paper, we focus on the first step of loop closure detection, namely, the efficient and accurate generation of loop closure candidates. Traditionally, a global descriptor is obtained by aggregating handcrafted local descriptors, like SIFT [Reference Lowe6], ORB [Reference Rublee, Rabaud, Konolige and Bradski7], and SURF [Reference Bay, Ess, Tuytelaars and Van Gool3]. Under significant appearance variations caused by, for example, day–night transitions, seasonal change, and dynamic objects, handcrafted descriptors often fail to recognize places, since local keypoint descriptors can change significantly with the condition-dependent appearance. Convolutional neural networks (CNNs) have shown their advantages in various visual recognition tasks [Reference Girshick, Donahue, Darrell and Malik9–Reference Ronneberger, Fischer and Brox11] and have been used to generate global image descriptors for visual loop closure detection. In ref. [Reference Sünderhauf, Shirazi, Dayoub, Upcroft and Milford12], a pre-trained CNN is used to produce a global descriptor directly. Alternatively, end-to-end trained descriptors with aggregation methods [Reference Arandjelovic, Gronat, Torii, Pajdla and Sivic13–Reference Radenović, Tolias and Chum16] have been proposed to achieve higher performance.
However, deep learning-based VPR methods have some limitations. First, a pre-trained CNN can easily generate descriptors with dimensions in the tens of thousands, which leads to time and storage problems. Second, the generalization ability of CNN descriptors is often poor. To tackle these limitations, we use a convolutional autoencoder (CAE) to compress a CNN-generated descriptor and improve its generalization ability. Experiments on challenging datasets show that, by compressing the local feature maps of a CNN with a CAE, the compressed descriptor achieves better results than the uncompressed descriptor in both seen and unseen environments at a lower computational cost.
2. Related work
2.1. Visual place recognition
In this paper, our concern is using a global descriptor to represent an image for loop closure detection. Early works attain VPR by first extracting handcrafted local keypoints and descriptors, such as SIFT [Reference Lowe6], ORB [Reference Rublee, Rabaud, Konolige and Bradski7], and SURF [Reference Bay, Ess, Tuytelaars and Van Gool3]. These local features are then aggregated into a global descriptor by vector quantization such as bag-of-words [Reference Jégou, Douze, Schmid and Pérez17–Reference Sivic and Zisserman19], VLAD [Reference Arandjelovic and Zisserman20], and Fisher Vectors [Reference Jégou, Perronnin, Douze, Sánchez, Pérez and Schmid21]. Through clustering, a low-dimensional global descriptor can be obtained, although the spatial relations between the local descriptors are not encoded. Although these traditional methods have been widely used in SLAM research, they still struggle in large-scale environments with severe appearance changes [Reference Lowry, Sünderhauf, Newman, Leonard, Cox, Corke and Milford1].
Recently, researchers have proposed to use CNNs to extract features for loop closure detection in large-scale environments. At first, pre-trained classification CNNs were directly used to extract dense local feature maps [Reference Sünderhauf, Shirazi, Dayoub, Upcroft and Milford12, Reference Babenko, Slesarev, Chigorin and Lempitsky22, Reference Razavian, Azizpour, Sullivan and Carlsson23], which serve as the visual features for place description. However, due to their high dimensionality and inability to adapt to crowded environments, end-to-end trained models composed of a feature extractor and a pooling layer have been proposed, for example, NetVLAD [Reference Arandjelovic, Gronat, Torii, Pajdla and Sivic13], SFRS [Reference Ge, Wang, Zhu, Zhao and Li14], generalized-mean pooling [Reference Radenović, Tolias and Chum16], max pooling [Reference Tolias, Sicre and Jégou24], and average pooling [Reference Razavian, Sullivan, Carlsson and Maki25]. Although end-to-end models can perform well in crowded environments with low-dimensional descriptors, they inherit a bias from their training datasets, which leads to poor generalization of the end-to-end trained descriptors to unseen environments. Here, we instead use an unsupervised CAE to learn an image descriptor by minimizing the reconstruction loss of the high-level features of a CNN. This enables the encoded descriptor to retain discriminative features and generalize to unseen environments with a lower dimension.
2.2. Convolutional autoencoder
CAEs have shown superior performance in many applications. One usage is to learn a feature extractor in an unsupervised manner by reconstructing input images as a pre-processing step, and then to fine-tune the encoder on downstream tasks [Reference Mao, Shen and Yang26]. Another usage is to learn a mapping function by reconstructing the input images into another domain, for example, flow images, depth images, or path planning images. Conditional generative adversarial nets [Reference Mirza and Osindero27] and pix2pix [Reference Isola, Zhu, Zhou and Efros28] use a CAE as their basic architecture for such translations, for example, from day to night, from labels to faces, and from edges to photos. Moreover, in U-Net [Reference Ronneberger, Fischer and Brox11], semantic segmentation is achieved by a CAE-like architecture. Recently, Vankadari et al. [Reference Vankadari, Garg, Majumder, Kumar and Behera29] proposed a CAE-based GAN to estimate depth maps from night-time images; notably, it also uses the descriptor from the encoder to accomplish the day–night VPR task. Merrill and Huang [Reference Merrill and Huang30] use a CAE to force the output of the decoder to be similar to the histogram of oriented gradients, and the output of the encoder is used as a global descriptor at inference time.
Unlike the above usages of CAE, in this paper, we propose to use a CAE to reconstruct the high-level features of a CNN, which is a post-processing method for reducing the dimension of the features and improving place recognition performance. The closest work to this idea is that of Dai et al. [Reference Dai, Huang, Chen, Chen, He, Wen and Zhang31], who use a CAE to compress and fuse the local feature maps of image patches to improve loop closure verification. In contrast, the CAE in our method reconstructs the feature maps of the whole image, instead of the feature maps of local image patches. In this way, our encoder can capture the most relevant features of the whole image for VPR.
3. Approach
In this section, we describe our network architecture and training strategy. The overall structure is shown in Fig. 2. In our framework, a local feature map is extracted from a high-level layer of a pre-trained CNN. The feature map is then normalized [Reference Ba, Kiros and Hinton32] and fed into the CAE. In the training procedure, the CAE, consisting of an encoder and a decoder, is trained with a reconstruction loss. In the inference step, the decoder is dropped and only the encoder is kept to produce the image descriptor.
3.1. Feature extraction
Different layers of CNNs describe an image at different levels of semantics [Reference Sünderhauf, Shirazi, Dayoub, Upcroft and Milford12, Reference Hou, Zhang and Zhou33]. In the VPR task, we choose the feature map of a deep layer, which is found in previous works to be condition-invariant and low-dimensional.
Similar to ref. [Reference Arandjelovic, Gronat, Torii, Pajdla and Sivic13], we choose AlexNet [Reference Krizhevsky, Sutskever and Hinton10] and VGG16 [Reference Simonyan, Zisserman, Bengio and LeCun34] as our backbone. The local feature map $F$ is computed as
$$F = f_{\theta }(I),$$
where $I$ is an input image with a dimension of $3\times{H}\times{W}$, with $H$ and $W$ the height and width of the input image, and $f_{\theta }$ is a VPR-trained or pre-trained CNN used without fine-tuning. In our work, $F$ is taken from the last convolution layer of the CNN, before ReLU. For AlexNet, the dimension of $F$ is $256\times{\left(\frac{1}{16}H-2\right)}\times{\left(\frac{1}{16}W-2\right)}$; for VGG16, it is $512\times{\frac{1}{16}H}\times{\frac{1}{16}W}$. At such high dimensions, the global descriptor is too costly to store and compare for real-time performance. To tackle this problem, we use a CAE to compress the descriptor into a low-dimensional representation while promoting its condition-invariant capacity.
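As an illustration, the following is a minimal PyTorch sketch of how such a feature map could be extracted (not the original implementation); the use of torchvision's ImageNet-pretrained VGG16 and the layer index 29, which cuts the network after conv5_3 and before its ReLU, are assumptions for this sketch.

```python
# Minimal sketch: extract the local feature map F from a frozen VGG16 backbone,
# truncated before the ReLU that follows the last convolution layer.
import torch
from torchvision import models

vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
backbone = torch.nn.Sequential(*list(vgg16.features.children())[:29])  # up to conv5_3, pre-ReLU
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False            # the backbone stays frozen

image = torch.rand(1, 3, 480, 640)     # placeholder for a 640x480 RGB input
with torch.no_grad():
    F = backbone(image)                # shape (1, 512, 30, 40), i.e., 614,400 values
```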
3.2. Convolutional autoencoder
Given a high-dimensional local feature map $F$, we first normalize it with layer normalization [Reference Ba, Kiros and Hinton32]. Then, a CAE with three encoder layers and three decoder layers is trained to reconstruct the normalized feature map. The architecture and the training strategy are
$$\hat{F} = g_{\text{dec}}\left(g_{\text{enc}}\left(h(F)\right)\right),$$
where $h(x)$ is a layer normalization function [Reference Ba, Kiros and Hinton32] and $g_\theta$ is a CAE composed of an encoder $g_{\text{enc}}$ and a decoder $g_{\text{dec}}$. In the training procedure, the CAE is trained to reconstruct the normalized local feature map $h(F)$ using mean squared error and backpropagation. The mean squared error is defined as
$$L = \frac{1}{n}\sum _{i=1}^{n}\left(h(F)_{i}-\hat{F}_{i}\right)^{2},$$
where $n$ is the dimension of the local feature map $F$. The layer normalization $h(x)$ is defined as
$$h(x) = \frac{x-E[x]}{\sqrt{\text{Var}(x)+\epsilon }},$$
where $x$ is a sample, and $E[x]$ and $\text{Var}(x)$ are the mean and variance of the sample, respectively, both of which are updated during training but frozen in the inference step. $\epsilon$ is a small constant added to the denominator for numerical stability, set to $10^{-5}$ in our study. Empirically, layer normalization helps to stabilize the optimization procedure and speed up convergence.
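A minimal sketch of this objective is given below (an illustration, not the released code); the function name `reconstruction_loss` and the use of PyTorch's functional layer normalization are assumptions.

```python
# Sketch: layer-normalize the feature map F and train the CAE to reconstruct it with MSE.
import torch
import torch.nn as nn

def reconstruction_loss(cae: nn.Module, F: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """MSE between the normalized feature map h(F) and its CAE reconstruction."""
    h_F = nn.functional.layer_norm(F, F.shape[1:], eps=eps)  # h(x) over the (C, H', W') dims
    return nn.functional.mse_loss(cae(h_F), h_F)
```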
Classic dimension reduction methods, for example, PCA, only detect the linear relationship between features. For deep-learning-based pooling approaches, for example, GeM [Reference Radenović, Tolias and Chum16], max pooling [Reference Tolias, Sicre and Jégou24], and average pooling [Reference Razavian, Sullivan, Carlsson and Maki25], they directly aggregate the $D^{'}\times{H^{'}\times{W^{'}}}$ local feature map into a descriptor with $D^{'}$ dimensions. Because the features across spatial dimensions are directly aggregated, the spatial information in the feature map is therefore lost. In contrast, our CAE compresses the feature map nonlinearly while maintaining the spatial relationship. In addition, the local feature map $F$ is usually sparse and high-dimensional [Reference Chen, Maffra, Sa and Chli35], indicating that only a few regions of a feature map have a solid response to a particular task like VPR or classification. With these attributes, our CAE can keep the most relevant features by reconstructing the input.
As shown in Fig. 3, in our CAE, each block in the encoder/decoder is composed of a convolutional/deconvolutional unit, a batch normalization unit [Reference Ioffe and Szegedy36], and a parametric rectified linear unit [Reference He, Zhang, Ren and Sun37].
For the VGG16-based CAE, the kernel sizes of the three encoder blocks are $4\times{4}$, $7\times{5}$, and $5\times{3}$, with strides 1, 2, and 2, respectively. The channels of the first two encoder blocks are $d_1$ and $d_2$, respectively. Similar to ref. [Reference Dai, Huang, Chen, Chen, He, Wen and Zhang31], to generate descriptors of different dimensions for comparison, the number of channels of the last encoder block, $d_3$, is set to 8, 16, 32, 64, 128, 256, and 512 accordingly. For AlexNet, the kernel sizes of the three encoder blocks are $4\times{4}$, $5\times{3}$, and $5\times{3}$, with strides 1, 2, and 2, respectively. We adopt the same channel configuration as the VGG16 encoder in our AlexNet encoder. For both architectures, the parameters of the decoder are similar to those of the encoder.
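The sketch below shows one possible realization of this encoder/decoder for the VGG16 backbone, following the stated kernel sizes, strides, and channels; zero padding, the exact block composition, and the choice $d_3 = 128$ (which yields a 4096-dimensional code for a $512\times 30\times 40$ feature map) are assumptions of this illustration.

```python
# Sketch of the VGG16-based CAE: three Conv-BN-PReLU encoder blocks and a mirrored
# ConvTranspose decoder (padding assumed to be zero).
import torch
import torch.nn as nn

def enc_block(c_in, c_out, kernel, stride):
    return nn.Sequential(nn.Conv2d(c_in, c_out, kernel, stride),
                         nn.BatchNorm2d(c_out), nn.PReLU())

def dec_block(c_in, c_out, kernel, stride):
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, kernel, stride),
                         nn.BatchNorm2d(c_out), nn.PReLU())

class CAE(nn.Module):
    def __init__(self, c_feat=512, d1=128, d2=128, d3=128):
        super().__init__()
        self.encoder = nn.Sequential(enc_block(c_feat, d1, (4, 4), 1),
                                     enc_block(d1, d2, (7, 5), 2),
                                     enc_block(d2, d3, (5, 3), 2))
        self.decoder = nn.Sequential(dec_block(d3, d2, (5, 3), 2),
                                     dec_block(d2, d1, (7, 5), 2),
                                     dec_block(d1, c_feat, (4, 4), 1))

    def forward(self, x):
        return self.decoder(self.encoder(x))

cae = CAE()
x = torch.rand(1, 512, 30, 40)
assert cae(x).shape == x.shape                       # decoder mirrors the encoder
assert cae.encoder(x).flatten(1).shape[1] == 4096    # compressed descriptor size
```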
In the inference step, the decoder is not involved and only the encoder is used to infer the compressed descriptor:
$$X = g_{\text{enc}}\left(h(F)\right),$$
where $X$ is then flattened and L2-normalized to generate the final global descriptor. In the matching step, images are represented by their descriptors, and cosine similarity is used to find the best match in the reference set for a query image.
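A minimal sketch of the inference and matching step follows; the helper names `describe` and `best_match` are illustrative, and the encoder is assumed to be the one from the sketch above.

```python
# Inference: only the encoder is used; its output is flattened and L2-normalized.
# Matching: cosine similarity reduces to a dot product on unit-norm descriptors.
import torch
import torch.nn.functional as Fn

def describe(feature_map: torch.Tensor, encoder: torch.nn.Module) -> torch.Tensor:
    """Global descriptor: X = g_enc(h(F)), flattened and L2-normalized."""
    h = Fn.layer_norm(feature_map, feature_map.shape[1:])
    x = encoder(h).flatten(1)              # one row per image
    return Fn.normalize(x, p=2, dim=1)     # L2 normalization

def best_match(query_desc: torch.Tensor, reference_descs: torch.Tensor) -> int:
    """Index of the reference descriptor with the highest cosine similarity."""
    similarity = reference_descs @ query_desc.squeeze(0)
    return int(similarity.argmax())
```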
4. Experimental setup
4.1. Dataset
To evaluate the performance of our proposed method, four datasets are utilized in the experiments, including Oxford RobotCar [Reference Maddern, Pascoe, Linegar and Newman38], Nordland [Reference Olid, Fácil and Civera39], and UACampus [Reference Liu, Feng and Zhang40]. Only a part of Oxford RobotCar is used as the training set, and the other data are used as test sets. These datasets contain significant appearance changes in urban, train track, university campus, and city simulation environments. Sample images from the datasets are shown in Fig. 1. A detailed description of the datasets is provided in Table I as well as below.
Note: RobotCar (dbNight vs. qAutumn) indicates that a night sequence is used as the reference set (database) and an autumn sequence is the query set.
Oxford RobotCar [Reference Maddern, Pascoe, Linegar and Newman38] is an urban dataset that records a 10-km route through central Oxford multiple times over 1 year. Within this dataset, challenging views with appearance changes are captured due to season, weather, and time of day. We choose a subset consisting of five sequences involving sun cloud, autumn, snow, and night conditions, all of which contain strong appearance changes. To validate the effectiveness of our method, the dataset is split with no overlap. Specifically, we extract one front-view image per meter for all sequences to construct the datasets. As shown in Fig. 4, the red route is the training set, which includes 24k images, the green route is the test set, and the blue route is the validation set. In the matching procedure, we have a query set and a reference set. The query set contains the images of the green route, and the reference set includes the images of the whole route to increase the difficulty of matching. If the distance between a matched pair is within 25 m, the match is considered a true positive.
Nordland [Reference Olid, Fácil and Civera39] is a train journey dataset that contains significant seasonal changes. In this paper, the summer and winter traverses are used as reference and query, respectively. If the reference image is within two frames of the query, it is treated as a true positive.
UACampus [Reference Liu, Feng and Zhang40] is a dataset with day–night illumination changes recorded on the campus of the University of Alberta. Two subsets were captured in the morning (06:20) and evening (22:15) along the same route. Ground-truth matches are obtained by manual annotation.
4.2. Evaluation metric
Recall@1,5,10. To verify the overall performance of an image descriptor, we follow the common evaluation metric defined in ref. [Reference Arandjelovic, Gronat, Torii, Pajdla and Sivic13], which is based on the top $K$ nearest neighbors among all database descriptors to a query. It measures the matching ability of the descriptor within a tolerance interval: matching is considered successful if a correct match exists within the top $K$ nearest pairs. $K$ is set to 1, 5, and 10 in our experiments.
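A possible implementation of Recall@K is sketched below; the descriptor matrices and the ground-truth match matrix `is_match` (e.g., built from the 25 m criterion on RobotCar) are assumed inputs, and this is an illustration rather than the exact evaluation script.

```python
# Recall@K: a query counts as correct if any of its K nearest database
# descriptors is a true match according to the ground truth.
import numpy as np

def recall_at_k(query_descs, db_descs, is_match, ks=(1, 5, 10)):
    sims = query_descs @ db_descs.T              # cosine similarity (unit-norm rows)
    ranked = np.argsort(-sims, axis=1)           # nearest neighbours first
    recalls = {}
    for k in ks:
        hits = [is_match[i, ranked[i, :k]].any() for i in range(len(query_descs))]
        recalls[k] = float(np.mean(hits))
    return recalls
```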
Precision−Recall curve. Precision−Recall (PR) is another key evaluation metric in VPR. In robotics, the top-1 match is vital because a decision must be made while the robot is running. Given the matched pairs and a threshold on the cosine similarity between image descriptors, we obtain the numbers of true positives (TPs), false positives (FPs), and false negatives (FNs). Precision and recall are defined as
$$\text{Precision} = \frac{TP}{TP+FP}, \qquad \text{Recall} = \frac{TP}{TP+FN}.$$
Multiple pairs of PR values are produced by varying the threshold, and the PR curve is traced by the points formed by these pairs. A high threshold often causes low recall and high precision, because a strict matching policy reduces FPs but at the cost of many FNs. The ideal performance is when both precision and recall are high.
Average Precision (AP). The overall performance is usually represented by the AP, which summarizes a PR curve as the weighted mean of precisions achieved at all recall values:
$$AP = \sum _{n}\left(R_{n}-R_{n-1}\right)P_{n},$$
where $P_{n}$ and $R_{n}$ are the precision and recall at the $n$-th threshold, respectively. Intuitively, AP is the area under the PR curve.
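The sketch below computes AP from PR pairs obtained by sweeping the similarity threshold; it illustrates the formula above and is not the exact evaluation code.

```python
# AP as the weighted mean of precisions over recall increments:
# AP = sum_n (R_n - R_{n-1}) * P_n.
import numpy as np

def average_precision(precisions, recalls):
    order = np.argsort(recalls)                   # sort PR pairs by increasing recall
    p, r = np.asarray(precisions)[order], np.asarray(recalls)[order]
    r_prev = np.concatenate(([0.0], r[:-1]))
    return float(np.sum((r - r_prev) * p))
```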
L2 distance distribution. To test the discriminative capacity of our global descriptor, we plot the histograms of L2 distances of the true matches and the false matches. These can be used as prior knowledge to tell which distance interval can be trusted more when making a place recognition decision. First, given a query vector, the reference vectors in the reference set within 25 m (for RobotCar) are regarded as true matches and the remaining vectors as false matches. Second, the L2 distance is calculated for each pair consisting of a query vector and a reference vector. Third, traversing the whole query set, we obtain the L2 distances of all the true matched pairs and the false matched pairs. Lastly, the L2 distances of the true matched pairs and the false matched pairs form the true-match and false-match histogram distributions, respectively. Intuitively, a small overlap between the two distributions indicates good discrimination.
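A minimal sketch of how these histograms could be produced is given below; the numpy descriptor arrays, the boolean ground-truth matrix, and the matplotlib calls are assumptions of this illustration.

```python
# Histograms of L2 distances for true-match pairs vs. false-match pairs,
# pooled over the whole query set.
import numpy as np
import matplotlib.pyplot as plt

def l2_histograms(query_descs, ref_descs, is_match, bins=50):
    d = np.linalg.norm(query_descs[:, None, :] - ref_descs[None, :, :], axis=-1)
    plt.hist(d[is_match], bins=bins, alpha=0.5, density=True, label="true matches")
    plt.hist(d[~is_match], bins=bins, alpha=0.5, density=True, label="false matches")
    plt.xlabel("L2 distance")
    plt.legend()
    plt.show()
```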
4.3. Baseline and our method
NetVLAD [Reference Arandjelovic, Gronat, Torii, Pajdla and Sivic13] is a popular method trained on Pitts30k. We choose the best model by evaluating the method on the Pitts30k validation set. After training, we compress the descriptor dimension to 4096 with PCA and whitening. In our experiments, we use four tuples (one query, one positive, one negative) for training in order to reduce computational resource usage. SFRS [Reference Ge, Wang, Zhu, Zhao and Li14] is the state-of-the-art method, also trained on Pitts30k with the same network architecture as NetVLAD; its best model is obtained in the same way as NetVLAD's. NetVLAD_VGG16 and SFRS_VGG16 are the backbones of the Pitts30k-trained NetVLAD and SFRS, respectively, and output descriptors of $512\times{\left(\frac{1}{16}H\right)}\times{\left(\frac{1}{16}W\right)}$ dimensions. AlexNet [Reference Krizhevsky, Sutskever and Hinton10] is the MatConvNet version pre-trained on ImageNet, with a descriptor of $256\times{\left(\frac{1}{16}H-2\right)}\times{\left(\frac{1}{16}W-2\right)}$ dimensions.
NetVLAD_VGG16+OursV is composed of NetVLAD_VGG16 and the CAE introduced in Section 3.2. Similarly, SFRS_VGG16+OursV includes SFRS_VGG16 and our CAE. AlexNet+OursA consists of AlexNet and our CAE.
4.4. Implementation details
In the experiments, the resolution of the input image is $640\times{480}$. For VGG16, the local feature map has a dimension of 614,400; for AlexNet, the dimension is 272,384. The hyperparameters of our CAE are optimized empirically by experiments conducted on RobotCar (dbNight vs. qSnow). The results are shown in Table II, where $d_1$, $d_2$, and $d_3$ represent the numbers of channels of the three encoder blocks, and c1 and c2 denote different settings (kernel sizes and strides) of the encoder blocks. Specifically, c1 adopts the original setting described in Section 3.2, and c2 uses a kernel size of $3\times{3}$ with a stride of 1. Hence, c1 yields a lower spatial dimension of the encoder output, while c2 does the opposite. To balance effectiveness and computational resources, $d_1$ and $d_2$ are set to 128, and c1 is adopted.
Note: $d_1$, $d_2$, and $d_3$ represent the numbers of channels of the three encoder blocks. c1 and c2 are different settings (kernel sizes and strides) of the encoder module, where c1 yields a lower spatial dimension of the encoder output and c2 does the opposite. This experiment is conducted on RobotCar (dbNight vs. qSnow).
During the CAE training, the backbone CNN is frozen. The Adam optimization algorithm is used to learn the model parameters, with a learning rate of 0.001 and a batch size of 128. The model is trained for 50 epochs. All training is executed in PyTorch on four TITAN Xp GPUs.
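For completeness, a training-loop sketch matching these hyperparameters is shown below; it reuses the `CAE` and `reconstruction_loss` sketches from Section 3, and the placeholder `feature_loader` of pre-extracted backbone feature maps is an assumption.

```python
# Training sketch: Adam, lr = 0.001, 50 epochs; the paper uses batches of 128
# pre-extracted feature maps (tiny placeholder batches are used here).
import torch

feature_loader = [torch.rand(8, 512, 30, 40) for _ in range(2)]  # placeholder data

optimizer = torch.optim.Adam(cae.parameters(), lr=1e-3)
cae.train()
for epoch in range(50):
    for F_batch in feature_loader:
        loss = reconstruction_loss(cae, F_batch)   # MSE on layer-normalized features
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```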
5. Results and discussion
5.1. Effectiveness and stability
We first compare the performance of our method with representative CNN-based image descriptors, namely, NetVLAD, SFRS, and those from VGG16 and AlexNet. In Table III, we can observe that NetVLAD and SFRS perform better on RobotCar than on Nordland and UACampus. VGG16 (the backbone of NetVLAD and SFRS) shows quite different results in this test. On Nordland, SFRS_VGG16 surpasses SFRS by a significant margin, with an AP of 0.969 versus 0.465 and recall@1 of 0.889 versus 0.282. Nevertheless, it is slightly worse, with a recall@1 of 0.772 versus 0.834, on RobotCar (dbNight vs. qAutumn). This result could be attributed to the training bias introduced by the Pitts30k dataset, which is an urban dataset similar to RobotCar. For VGG16, only conv5 and the following layers are trained. NetVLAD and SFRS, with VGG16 as their backbone, include a deep-learning-based VLAD module, which is optimized in the clustering space of the training datasets.
Boldface indicates the largest value in each comparison.
Compared to NetVLAD and NetVLAD_VGG16, OursV achieves better results with a higher AP, with the output dimension set to 4096 for a fair comparison. Even on a dataset with large appearance changes, such as RobotCar (dbNight vs. qSnow), the recall@1 of NetVLAD_VGG16+OursV is 0.861, versus NetVLAD's 0.691 and NetVLAD_VGG16's 0.523. It is worth noting that our method is trained unsupervised on RobotCar in this experiment, and it nonetheless performs well on nonurban datasets such as Nordland and UACampus. To further validate the effectiveness and generalization ability of our method, we conduct experiments with a different feature extractor, AlexNet pre-trained on ImageNet. AlexNet+OursA, which also has a dimension of 4096, consistently produces better results than AlexNet, with an AP of 0.950 versus 0.657 on RobotCar (dbNight vs. qSnow) and recall@1 of 0.984 versus 0.956 on Nordland.
As shown in Table III, our CAE is effective and memory-efficient on all datasets and outperforms NetVLAD, SFRS, and their backbones in most tests. Furthermore, at 4096, the dimension of our descriptor is two orders of magnitude smaller than VGG16's 614,400 and AlexNet's 272,384.
5.2. Comparison of encoded dimensions
In this section, we present the results from our study of the relationship between the output dimension of our CAE and matching performance. Figure 5 shows the AP results on different datasets as the encoded dimension varies. As shown in the left subfigure, SFRS_VGG16+OursV and AlexNet+OursA achieve results similar to AlexNet, whose output dimension is 272,384. Although the performance of both methods is slightly worse than AlexNet's at small dimensions, for example, 512 or 256, they still achieve better results than SFRS and SFRS_VGG16 at a moderate dimension.
From the right subfigure, we can observe that, even on the urban-scale RobotCar dataset (dbNight vs. qSnow), OursV and OursA achieve the same results as SFRS when the dimension is higher than 1024. From these observations, we infer that our CAE can attain high performance with low dimensions. However, as the descriptor dimension is reduced further, the performance deteriorates.
5.3. Discriminative capacity
We also plot the distributions of L2 distances of true matches and false matches to evaluate the discriminative power of our CAE. For a fair comparison, we set the dimension of SFRS_VGG16+OursV to 4096, the same as SFRS. From Fig. 6(a), we can observe that the overlapping area of OursV is smaller than that of SFRS, with a mean gap of 0.322 versus 0.121. As shown in Fig. 6(b), the distributions of L2 distances of the true and false matches of SFRS are close, with half of the true matches overlapping with the false matches, resulting in a low mean gap of 0.087. For SFRS_VGG16+OursV, the gap is 0.151, and half of the true matches do not overlap with the false ones.
These results show that our CAE is more discriminative than SFRS. However, the distributions of L2 distances of the true and false matches still overlap considerably. This could be caused by the fact that Nordland consists of only train track views, while RobotCar is more complicated, with dynamic objects.
5.4. Ability to avoid false positives
As mentioned in ref. [Reference Lowry, Sünderhauf, Newman, Leonard, Cox, Corke and Milford1], FP matches are fatal to VPR, since false matches lead to incorrect input to robot pose trajectory optimization. Consequently, recall at 100% precision is the prime metric for many tasks. On the Nordland dataset, shown in Fig. 7(a), SFRS_VGG16+OursV surpasses SFRS in terms of recall at 100% precision, while AlexNet+OursA performs poorly in this test. However, as shown in Fig. 7(b) for the RobotCar (dbNight vs. qSnow) experiment, AlexNet+OursA and SFRS_VGG16+OursV perform significantly better than the other baselines.
5.5. Failure case analysis
Some true positive and FP examples are shown in Fig. 8. From the left image pair, we can observe that a similar structure is a key to recognizing the same place. However, this might lead to failures in environments where similar structures are widespread, for example, the environment shown in the middle image pair. In this wrongly matched pair, the tree distributions are similar, which is why our algorithm matches them, yet the two places are not the same. The right pair is also a failure case, in which dynamic objects occupy most of the image region. In this situation, our method fails because little discriminative information is captured. From the analysis of these examples, we conclude that loop closure verification is necessary for more accurate place recognition. In the verification, local descriptor matching should be able to exclude meaningless regions (e.g., dynamic objects in RobotCar) and adaptively attend to discriminative objects (e.g., trail direction in Nordland).
6. Conclusion
In this paper, we propose a simple method that uses a CAE to construct an image descriptor from the feature maps produced by a CNN. The experimental results show that the CNN descriptor compressed by the CAE can attain high performance, better than state-of-the-art image descriptors such as NetVLAD and SFRS, and better than CNN-based descriptors of much higher dimensions such as VGG16 and AlexNet. Specifically, our CAE consistently achieves a higher AP and recall than SFRS when using the same descriptor dimension; in addition, our CAE achieves comparable results to other baseline descriptors while using a lower dimension. On RobotCar (dbNight vs. qSnow), OursV achieves a top-1 recall of 0.861 with 4096 dimensions, outperforming NetVLAD and SFRS. Furthermore, from the system perspective, our CAE achieves higher recall at 100% precision than the other methods. These quantitative results indicate that dimension reduction by our CAE can produce a compact and condition-invariant global descriptor while reducing the computational cost.
Data availability statement
A preprint of an old version of this paper is available at https://arxiv.org/pdf/2204.07350.pdf.
Author contributions
Hanjing Ye raised the main idea and completed the experiments and the draft. Weinan Chen helped with the code and with revising the idea. Jingwen Yu provided help with the code. Li He, Yisheng Guan, and Hong Zhang shared their suggestions for revising the idea in this paper.
Funding
This work was supported in part by the Leading Talents Program of Guangdong Province under Grant No. 2019QN01X761 and the National Natural Science Foundation of China (62103179).
Conflicts of interest
The authors declare that no conflicts of interest exist.