Sora: A review on background, technology, limitations, and opportunities of large vision models
Sora is a text-to-video generative AI model, released by OpenAI in February 2024. The
model is trained to generate videos of realistic or imaginative scenes from text instructions …
Automated diagnosis of cardiovascular diseases from cardiac magnetic resonance imaging using deep learning models: A review
In recent years, cardiovascular diseases (CVDs) have become one of the leading causes of
mortality globally. At early stages, CVDs appear with minor symptoms and progressively get …
Joint feature learning and relation modeling for tracking: A one-stream framework
The current popular two-stream, two-stage tracking framework extracts the template and the
search region features separately and then performs relation modeling, thus the extracted …
Focal modulation networks
We propose focal modulation networks (FocalNets in short), where self-attention (SA) is
completely replaced by a focal modulation module for modeling token interactions in vision …
Scaling open-vocabulary object detection
M Minderer, A Gritsenko… - Advances in Neural …, 2024 - proceedings.neurips.cc
Open-vocabulary object detection has benefited greatly from pretrained vision-language
models, but is still limited by the amount of available detection training data. While detection …
Patch n'pack: Navit, a vision transformer for any aspect ratio and resolution
The ubiquitous and demonstrably suboptimal choice of resizing images to a fixed resolution
before processing them with computer vision models has not yet been successfully …
Confident adaptive language modeling
Recent advances in Transformer-based large language models (LLMs) have led to
significant performance improvements across many tasks. These gains come with a drastic …
Flexivit: One model for all patch sizes
Vision Transformers convert images to sequences by slicing them into patches. The size of
these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher …
Propainter: Improving propagation and transformer for video inpainting
Flow-based propagation and spatiotemporal Transformer are two mainstream mechanisms
in video inpainting (VI). Despite the effectiveness of these components, they still suffer from …
Global context vision transformers
We propose global context vision transformer (GC ViT), a novel architecture that enhances
parameter and compute utilization for computer vision. Our method leverages global context …