Self-supervised learning: A succinct review
Abstract Machine learning has made significant advances in the field of image processing.
The foundation of this success is supervised learning, which necessitates annotated labels …
The foundation of this success is supervised learning, which necessitates annotated labels …
Beyond just vision: A review on self-supervised representation learning on multimodal and temporal data
Recently, Self-Supervised Representation Learning (SSRL) has attracted much attention in
the field of computer vision, speech, natural language processing (NLP), and recently, with …
the field of computer vision, speech, natural language processing (NLP), and recently, with …
Open-vocabulary object detection using captions
Despite the remarkable accuracy of deep neural networks in object detection, they are costly
to train and scale due to supervision requirements. Particularly, learning more object …
to train and scale due to supervision requirements. Particularly, learning more object …
End-to-end learning of visual representations from uncurated instructional videos
Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video
models still rely on manually annotated data. With the recent introduction of the HowTo100M …
models still rely on manually annotated data. With the recent introduction of the HowTo100M …
A survey of self-supervised and few-shot object detection
Labeling data is often expensive and time-consuming, especially for tasks such as object
detection and instance segmentation, which require dense labeling of the image. While few …
detection and instance segmentation, which require dense labeling of the image. While few …
Noise estimation using density estimation for self-supervised multimodal learning
One of the key factors of enabling machine learning models to comprehend and solve real-
world tasks is to leverage multimodal data. Unfortunately, annotation of multimodal data is …
world tasks is to leverage multimodal data. Unfortunately, annotation of multimodal data is …
Contrastive learning for neural topic model
Recent empirical studies show that adversarial topic models (ATM) can successfully capture
semantic patterns of the document by differentiating a document with another dissimilar …
semantic patterns of the document by differentiating a document with another dissimilar …
Look at what i'm doing: Self-supervised spatial grounding of narrations in instructional videos
We introduce the task of spatially localizing narrated interactions in videos. Key to our
approach is the ability to learn to spatially localize interactions with self-supervision on a …
approach is the ability to learn to spatially localize interactions with self-supervision on a …
Detours for navigating instructional videos
We introduce the video detours problem for navigating instructional videos. Given a source
video and a natural language query asking to alter the how-to video's current path of …
video and a natural language query asking to alter the how-to video's current path of …
Localized vision-language matching for open-vocabulary object detection
In this work, we propose an open-vocabulary object detection method that, based on image-
caption pairs, learns to detect novel object classes along with a given set of known classes …
caption pairs, learns to detect novel object classes along with a given set of known classes …