Depth-aware vision-and-language navigation using scene query attention network
2022 International Conference on Robotics and Automation (ICRA) • ieeexplore.ieee.org
Vision-and-language navigation (VLN) is an important task at the intersection of robotics and computer vision. However, most existing VLN models use only features extracted from RGB observations as input, even though robots in the real world can also rely on depth sensors. Existing research has shown that simply adding a depth stream to neural models yields only a marginal improvement on the VLN task. In this work, we therefore develop a novel method for VLN that uses semantic map observations built from RGB-D input. We apply vision pretraining to efficiently encode the semantic map with a CNN and a scene query attention network, trained by answering queries about the semantic content of specific regions of a scene. The proposed method can be used with a simple model and does not require large-scale vision-language transformer pretraining, bringing a more than 10% increase in success rate compared with a baseline model. When combined with the Speaker-Follower training technique, it achieves a success rate of 58% on the R2R test set in the single-run setting, outperforming the previous RGB-D method and most existing RGB-only models that do not use large-scale vision-language transformer pretraining.
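The abstract describes encoding a semantic map with a CNN and a scene query attention network that answers queries about the semantic content of scene regions. The sketch below is only a rough illustration of that kind of module, not the authors' actual architecture: it attends over CNN-encoded semantic-map features with learnable region queries, and all class names, dimensions, and the query-answering head are hypothetical.

# Minimal sketch of a scene-query-attention-style module over a semantic map
# built from RGB-D input. Names, dimensions, and the query design are
# hypothetical and are not taken from the paper.
import torch
import torch.nn as nn


class SceneQueryAttention(nn.Module):
    """Attend over CNN-encoded semantic-map features with learnable queries."""

    def __init__(self, map_channels: int = 40, embed_dim: int = 128,
                 num_queries: int = 16, num_heads: int = 4):
        super().__init__()
        # Small CNN encoder for the semantic map (illustrative architecture).
        self.encoder = nn.Sequential(
            nn.Conv2d(map_channels, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(embed_dim, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # Learnable queries, e.g. one per scene region of interest (hypothetical).
        self.queries = nn.Parameter(torch.randn(num_queries, embed_dim))
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Head that "answers" each query, here by predicting a distribution
        # over semantic classes for the attended region (illustrative choice).
        self.head = nn.Linear(embed_dim, map_channels)

    def forward(self, semantic_map: torch.Tensor) -> torch.Tensor:
        # semantic_map: (B, C, H, W) one-hot or soft semantic map from RGB-D.
        feats = self.encoder(semantic_map)                 # (B, D, H', W')
        b, d, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)          # (B, H'*W', D)
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)
        attended, _ = self.attn(queries, tokens, tokens)   # (B, Q, D)
        return self.head(attended)                         # per-query answers


if __name__ == "__main__":
    # Toy forward pass on a random 40-class semantic map.
    sqa = SceneQueryAttention(map_channels=40)
    out = sqa(torch.randn(2, 40, 64, 64))
    print(out.shape)  # torch.Size([2, 16, 40])

In such a pretraining setup, the query-answering loss would supervise the encoder so that its features capture region-level semantics before the encoder is plugged into the navigation policy; the specific queries and losses used by the paper are not given in the abstract.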