Abstract
Visual place recognition (VPR) using deep networks has achieved state-of-the-art performance. However, most such methods require a training set with ground-truth sensor poses, from which positive and negative samples in each observation's spatial neighborhood are drawn for supervised learning. When such information is unavailable, temporal neighborhoods from a sequentially collected data stream can be exploited for self-supervised training, although we find the resulting performance suboptimal. Inspired by noisy-label learning, we propose a novel self-supervised framework, TF-VPR, that uses temporal neighborhoods and learnable feature neighborhoods to discover unknown spatial neighborhoods. Our method follows an iterative training paradigm that alternates between: (1) representation learning with data augmentation, (2) positive-set expansion to include current feature-space neighbors, and (3) positive-set contraction via geometric verification. We conduct auto-labeling and generalization tests on both simulated and real datasets, with either RGB images or point clouds as input. The results show that our method outperforms self-supervised baselines in recall rate, robustness, and heading diversity, a novel metric we propose for VPR.
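The expansion and contraction steps of the iterative paradigm can be illustrated with a minimal toy sketch. This is not the paper's implementation; the function names `expand_positives` and `contract_positives`, the `verify` predicate, and all data below are illustrative assumptions, with temporal neighbors serving as the initial (noisy) positive sets.

```python
import numpy as np

def expand_positives(features, positives, k=1):
    """Positive-set expansion (step 2, illustrative): add each anchor's
    k nearest feature-space neighbors to its current positive set."""
    # Pairwise Euclidean distances between all descriptors.
    dists = np.linalg.norm(features[:, None] - features[None, :], axis=-1)
    expanded = []
    for i, pos in enumerate(positives):
        order = np.argsort(dists[i])
        neighbors = [j for j in order if j != i][:k]  # skip the anchor itself
        expanded.append(sorted(set(pos) | set(neighbors)))
    return expanded

def contract_positives(positives, verify):
    """Positive-set contraction (step 3, illustrative): keep only the
    candidates that pass a geometric-verification predicate verify(i, j)."""
    return [[j for j in pos if verify(i, j)] for i, pos in enumerate(positives)]

# Toy 1-D descriptors for four observations; index 3 revisits index 0's place.
features = np.array([[0.0], [1.0], [5.0], [0.2]])
# Initial positives from temporal adjacency only (no poses needed).
temporal_positives = [[1], [0, 2], [1, 3], [2]]

expanded = expand_positives(features, temporal_positives, k=1)
# Stand-in for geometric verification: here a toy ground-truth position check.
gt_positions = [0.0, 1.0, 50.0, 0.5]
contracted = contract_positives(
    expanded, lambda i, j: abs(gt_positions[i] - gt_positions[j]) < 2.0
)
```

In this toy run, expansion lets anchor 0 discover the revisit at index 3 through feature-space proximity, while contraction removes candidates that fail verification (e.g. the spatially distant index 2); alternating these steps with representation learning is what gradually denoises the positive sets.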
Original language | English (US) |
---|---|
Pages (from-to) | 248-255 |
Number of pages | 8 |
Journal | IEEE Robotics and Automation Letters |
Volume | 10 |
Issue number | 1 |
DOIs | |
State | Published - 2025 |
Keywords
- Convolutional neural network
- feature maps
- global representation
- image descriptors
- place recognition
- point cloud
- retrieval results
- robust representation
- self-supervised learning
- self-supervised task
- street view
- viewpoint changes
- visual localization
- visual place recognition
ASJC Scopus subject areas
- Control and Systems Engineering
- Biomedical Engineering
- Human-Computer Interaction
- Mechanical Engineering
- Computer Vision and Pattern Recognition
- Computer Science Applications
- Control and Optimization
- Artificial Intelligence