Pix2struct: Screenshot parsing as pretraining for visual language understanding
Visually-situated language is ubiquitous—sources range from textbooks with diagrams to
web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to …
Towards future internet: The metaverse perspective for diverse industrial applications
The Metaverse allows the integration of physical and digital versions of users, processes,
and environments where entities communicate, transact, and socialize. With the shift …
Prompt Learns Prompt: Exploring Knowledge-Aware Generative Prompt Collaboration For Video Captioning.
Fine-tuning large vision-language models is a challenging task. Prompt tuning approaches
have been introduced to learn fixed textual or visual prompts while freezing the pre-trained …
Label-efficient video object segmentation with motion clues
Video object segmentation (VOS) plays an important role in video analysis and
understanding, which in turn facilitates a number of diverse applications, including video …
Feature fusion Vision Transformers using MLP-Mixer for enhanced deepfake detection
E Essa - Neurocomputing, 2024 - Elsevier
Deepfake technology, utilizing deep learning and computer vision, presents significant
security threats by generating highly realistic synthetic media, such as images and videos. In …
Weblinx: Real-world website navigation with multi-turn dialogue
We propose the problem of conversational web navigation, where a digital agent controls a
web browser and follows user instructions to solve real-world tasks in a multi-turn dialogue …
Learning to generate question by asking question: A primal-dual approach with uncommon word generation
Automatic question generation (AQG) is the task of generating a question from a given
passage and an answer. Most existing AQG methods aim at encoding the passage and the …
WIERT: web information extraction via render tree
Web information extraction (WIE) is a fundamental problem in web document understanding,
with a significant impact on various applications. Visual information plays a crucial role in …
A triangulation-based visual localization for field robots
Dear Editor, Visual localization relies on local features and searches a pre-stored GPS-
tagged image database to retrieve the reference image with the highest similarity in feature …
Smartave: Structured multimodal transformer for product attribute value extraction
Automatic product attribute value extraction refers to the task of identifying values of an
attribute from the product information. Product attributes are essential in improving online …