Pix2Struct: Screenshot parsing as pretraining for visual language understanding

K Lee, M Joshi, IR Turc, H Hu, F Liu… - International …, 2023 - proceedings.mlr.press
Visually-situated language is ubiquitous—sources range from textbooks with diagrams to
web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to …

Towards future internet: The metaverse perspective for diverse industrial applications

P Bhattacharya, D Saraswat, D Savaliya, S Sanghavi… - Mathematics, 2023 - mdpi.com
The Metaverse allows the integration of physical and digital versions of users, processes,
and environments where entities communicate, transact, and socialize. With the shift …

Prompt Learns Prompt: Exploring Knowledge-Aware Generative Prompt Collaboration For Video Captioning

L Yan, C Han, Z Xu, D Liu, Q Wang - IJCAI, 2023 - ijcai.org
Fine-tuning large vision-language models is a challenging task. Prompt tuning approaches
have been introduced to learn fixed textual or visual prompts while freezing the pre-trained …

Label-efficient video object segmentation with motion clues

Y Lu, J Zhang, S Sun, Q Guo, Z Cao… - … on Circuits and …, 2023 - ieeexplore.ieee.org
Video object segmentation (VOS) plays an important role in video analysis and
understanding, which in turn facilitates a number of diverse applications, including video …

Feature fusion Vision Transformers using MLP-Mixer for enhanced deepfake detection

E Essa - Neurocomputing, 2024 - Elsevier
Deepfake technology, utilizing deep learning and computer vision, presents significant
security threats by generating highly realistic synthetic media, such as images and videos. In …

WebLINX: Real-world website navigation with multi-turn dialogue

XH Lù, Z Kasner, S Reddy - arXiv preprint arXiv:2402.05930, 2024 - arxiv.org
We propose the problem of conversational web navigation, where a digital agent controls a
web browser and follows user instructions to solve real-world tasks in a multi-turn dialogue …

Learning to generate question by asking question: A primal-dual approach with uncommon word generation

Q Wang, L Yang, X Quan, F Feng, D Liu… - Proceedings of the …, 2022 - aclanthology.org
Automatic question generation (AQG) is the task of generating a question from a given
passage and an answer. Most existing AQG methods aim at encoding the passage and the …

WIERT: web information extraction via render tree

Z Li, B Shao, L Shou, M Gong, G Li… - Proceedings of the AAAI …, 2023 - ojs.aaai.org
Web information extraction (WIE) is a fundamental problem in web document understanding,
with a significant impact on various applications. Visual information plays a crucial role in …

A triangulation-based visual localization for field robots

J Liang, Y Wang, Y Chen, B Yang… - IEEE/CAA Journal of …, 2022 - ieeexplore.ieee.org
Dear Editor, Visual localization relies on local features and searches a pre-stored GPS-
tagged image database to retrieve the reference image with the highest similarity in feature …

SMARTAVE: Structured multimodal transformer for product attribute value extraction

Q Wang, L Yang, J Wang, J Krishnan… - Findings of the …, 2022 - aclanthology.org
Automatic product attribute value extraction refers to the task of identifying values of an
attribute from the product information. Product attributes are essential in improving online …