
Condition-invariant and compact visual place description by convolutional autoencoder

Published online by Cambridge University Press: 15 March 2023

Hanjing Ye
Affiliation:
Shenzhen Key Laboratory of Robotics and Computer Vision, Southern University of Science and Technology, Shenzhen, China; Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China
Weinan Chen
Affiliation:
School of Mechanical and Electrical Engineering, Guangdong University of Technology, Guangzhou, China
Jingwen Yu
Affiliation:
Shenzhen Key Laboratory of Robotics and Computer Vision, Southern University of Science and Technology, Shenzhen, China; Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China
Li He
Affiliation:
Shenzhen Key Laboratory of Robotics and Computer Vision, Southern University of Science and Technology, Shenzhen, China; Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China
Yisheng Guan
Affiliation:
School of Mechanical and Electrical Engineering, Guangdong University of Technology, Guangzhou, China
Hong Zhang*
Affiliation:
Shenzhen Key Laboratory of Robotics and Computer Vision, Southern University of Science and Technology, Shenzhen, China; Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China
*Corresponding author. E-mail: hzhang@sustech.edu.cn

Abstract

Visual place recognition (VPR) in condition-varying environments remains an open problem. Popular solutions are image descriptors based on convolutional neural networks (CNNs), which have been shown to outperform traditional descriptors based on hand-crafted visual features. However, current CNN-based descriptors have two drawbacks: (a) high dimensionality and (b) a lack of generalization, which lead to low efficiency and poor performance in real robotic applications. In this paper, we propose using a convolutional autoencoder (CAE) to tackle this problem. We employ a high-level layer of a pre-trained CNN to generate features and train a CAE to map these features to a low-dimensional space, which improves the condition invariance of the descriptor and reduces its dimension at the same time. We verify our method on four challenging real-world datasets involving significant illumination changes, and it is shown to outperform state-of-the-art methods. The code for our work is publicly available at https://github.com/MedlarTea/CAE-VPR.
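
To make the pipeline in the abstract concrete, the following is a minimal sketch of the encoder half of a convolutional autoencoder applied to a CNN feature map, followed by descriptor matching. It is not the authors' implementation (that is in the linked repository). The layer sizes, the stride, the random untrained weights, the random stand-in feature maps, and the cosine-similarity matching step are all assumptions chosen for illustration, and the names `conv2d`, `encode`, and `match` are hypothetical.

```python
import numpy as np

def conv2d(x, w, stride=2):
    """Valid strided convolution. x: (C_in, H, W), w: (C_out, C_in, k, k)."""
    c_out, _, k, _ = w.shape
    h_out = (x.shape[1] - k) // stride + 1
    w_out = (x.shape[2] - k) // stride + 1
    out = np.empty((c_out, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[:, i * stride:i * stride + k, j * stride:j * stride + k]
            # Contract the (C_in, k, k) axes of each filter with the patch.
            out[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

def encode(feature_map, weights):
    """Compress a feature map into an L2-normalized compact descriptor."""
    z = feature_map
    for w in weights:
        z = np.maximum(conv2d(z, w), 0.0)  # strided convolution + ReLU
    d = z.ravel()
    return d / (np.linalg.norm(d) + 1e-12)

def match(query, database):
    """Index of the database descriptor with the highest cosine similarity."""
    return int(np.argmax(database @ query))

rng = np.random.default_rng(0)
# Assumed encoder: two untrained layers with random weights (illustration only).
weights = [
    rng.standard_normal((32, 64, 3, 3)) / np.sqrt(64 * 9),
    rng.standard_normal((16, 32, 3, 3)) / np.sqrt(32 * 9),
]
# Stand-ins for high-level CNN feature maps of three places: 64 x 14 x 14 each.
places = [rng.standard_normal((64, 14, 14)) for _ in range(3)]
database = np.stack([encode(p, weights) for p in places])

# A revisit of place 1 under a slightly perturbed appearance.
revisit = places[1] + 0.01 * rng.standard_normal((64, 14, 14))
descriptor = encode(revisit, weights)
best = match(descriptor, database)
print(descriptor.shape, best)
```

In this sketch each 12,544-value feature map (64 x 14 x 14) is reduced to a 64-dimensional descriptor, so matching costs one small matrix-vector product per query. In a trained CAE the encoder weights would be learned together with a decoder through a reconstruction objective rather than drawn at random.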

Type
Research Article
Copyright
© The Author(s), 2023. Published by Cambridge University Press


Footnotes

Hanjing Ye and Weinan Chen are co-first authors and contributed equally to this paper.
