SPT: Spatial pyramid transformer for image captioning

Y Mao, J Xiao, D Zhang, M Cao, J Shao… - ACM Transactions on …, 2023 - dl.acm.org

Distinctive Image Captioning (DIC)—generating distinctive captions that describe the unique
details of a target image—has received considerable attention over the last few years. A …

Cited by 4 Related articles All 2 versions

Exploring refined dual visual features cross-combination for image captioning

J Hu, Z Li, Q Su, Z Tang, H Ma - Neural Networks, 2024 - Elsevier

For current image caption tasks used to encode region features and grid features
Transformer-based encoders have become commonplace, because of their multi-head self …

Center-enhanced video captioning model with multimodal semantic alignment

B Zhang, J Gao, Y Yuan - Neural Networks, 2024 - Elsevier

Video captioning aims at automatically generating descriptive sentences based on the given
video, establishing an association between the visual contents and textual languages, has …

Exploring Vision-Language Foundation Model for Novel Object Captioning

J Luo, Y Li, Y Pan, T Yao, J Feng… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org

It is always well believed that pre-trained vision-language foundation models (e.g., CLIP)
would substantially facilitate vision-language tasks. Nevertheless, there has been less …

Clustering swap prediction for image-text pre-training

S Fayou, HC Ngo, YW Sek, Z Meng - Scientific Reports, 2024 - nature.com

It is essential to delve into the strategy of multimodal model pre-training, which is an obvious
impact on downstream tasks. Currently, clustering learning has achieved noteworthy …

Relation-aware Multi-pass Comparison Deconfounded Network for Change Captioning

Z Lu, L Jin, Z Chen, C Tian, X Sun, X Li… - … on Circuits and …, 2024 - ieeexplore.ieee.org

Change captioning aims to describe the semantic change between a pair of images with
natural language while remaining immune to viewpoint change. Based on the encoder …

HIST: Hierarchical and sequential transformer for image captioning

F Lv, R Wang, L Jing, P Dai - IET Computer Vision, 2024 - Wiley Online Library

Image captioning aims to automatically generate a natural language description of a given
image, and most state-of-the-art models have adopted an encoder–decoder transformer …