A review of modern recommender systems using generative models (Gen-RecSys)

Y Deldjoo, Z He, J McAuley, A Korikov… - Proceedings of the 30th …, 2024 - dl.acm.org
Traditional recommender systems typically use user-item rating histories as their main data
source. However, deep generative models now have the capability to model and sample …

Leveraging temporal contextualization for video action recognition

M Kim, D Han, T Kim, B Han - European Conference on Computer Vision, 2025 - Springer
We propose a novel framework for video understanding, called Temporally Contextualized
CLIP (TC-CLIP), which leverages essential temporal information through global interactions …

Rethinking CLIP-based video learners in cross-domain open-vocabulary action recognition

KY Lin, H Ding, J Zhou, YM Tang, YX Peng… - arXiv preprint arXiv …, 2024 - arxiv.org
Building upon the impressive success of CLIP (Contrastive Language-Image Pretraining),
recent pioneering works have proposed adapting the powerful CLIP to video data, leading to …

Recommendation with generative models

Y Deldjoo, Z He, J McAuley, A Korikov… - arXiv preprint arXiv …, 2024 - arxiv.org
Generative models are a class of AI models capable of creating new instances of data by
learning and sampling from their statistical distributions. In recent years, these models have …

AWT: Transferring vision-language models via augmentation, weighting, and transportation

Y Zhu, Y Ji, Z Zhao, G Wu, L Wang - arXiv preprint arXiv:2407.04603, 2024 - arxiv.org
Pre-trained vision-language models (VLMs) have shown impressive results in various visual
classification tasks. However, we often fail to fully unleash their potential when adapting …

TabPedia: Towards comprehensive visual table understanding with concept synergy

W Zhao, H Feng, Q Liu, J Tang, S Wei, B Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Tables contain factual and quantitative data accompanied by various structures and
contents that pose challenges for machine comprehension. Previous methods generally …

MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer

M Zhu, Z Wang, M Hu, R Dang, X Lin, X Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
Transferring visual-language knowledge from large-scale foundation models for video
recognition has proved to be effective. To bridge the domain gap, additional parametric …

Multi-modal Generative Models in Recommendation System

A Ramisa, R Vidal, Y Deldjoo, Z He, J McAuley… - arXiv preprint arXiv …, 2024 - arxiv.org
Many recommendation systems limit user inputs to text strings or behavior signals such as
clicks and purchases, and system outputs to a list of products sorted by relevance. With the …

Advancing Myopia to Holism: Fully Contrastive Language-Image Pre-training

H Wang, C Ju, W Lin, S Xiao, M Chen, Y Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
In the rapidly evolving field of vision-language models (VLMs), contrastive language-image
pre-training (CLIP) has made significant strides, becoming the foundation for various downstream …

LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living

R Chakraborty, A Sinha, D Reilly, MK Govind… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Vision Models (LLVMs) have demonstrated effectiveness in processing
internet videos, yet they struggle with the visually perplexing dynamics present in Activities …