Review of lightweight deep convolutional neural networks

F Chen, S Li, J Han, F Ren, Z Yang - Archives of Computational Methods …, 2024 - Springer
Lightweight deep convolutional neural networks (LDCNNs) are vital components of mobile
intelligence, particularly in mobile vision. Although various heavy networks with increasingly …

YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

CY Wang, A Bochkovskiy… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Real-time object detection is one of the most important research topics in computer vision.
As new approaches regarding architecture optimization and training optimization are …

RepViT: Revisiting mobile CNN from ViT perspective

A Wang, H Chen, Z Lin, J Han… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Recently, lightweight Vision Transformers (ViTs) have demonstrated superior performance
and lower latency compared with lightweight Convolutional Neural Networks (CNNs) on …

MobileNetV4: Universal Models for the Mobile Ecosystem

D Qin, C Leichner, M Delakis, M Fornoni, S Luo… - … on Computer Vision, 2025 - Springer
We present the latest generation of MobileNets: MobileNetV4 (MNv4). They feature
universally-efficient architecture designs for mobile devices. We introduce the Universal …

Ego-Exo4D: Understanding skilled human activity from first- and third-person perspectives

K Grauman, A Westbury, L Torresani… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present Ego-Exo4D, a diverse, large-scale, multimodal, multiview video dataset
and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric …

MobileVLM: A fast, reproducible and strong vision language assistant for mobile devices

X Chu, L Qiao, X Lin, S Xu, Y Yang, Y Hu, F Wei… - arXiv preprint arXiv …, 2023 - arxiv.org
We present MobileVLM, a competent multimodal vision language model (MMVLM) targeted
to run on mobile devices. It is an amalgamation of a myriad of architectural designs and …

MobileCLIP: Fast image-text models through multi-modal reinforced training

PKA Vasu, H Pouransari, F Faghri… - Proceedings of the …, 2024 - openaccess.thecvf.com
Contrastive pre-training of image-text foundation models such as CLIP demonstrated
excellent zero-shot performance and improved robustness on a wide range of downstream …

CLIPPING: Distilling CLIP-based models with a student base for video-language retrieval

R Pei, J Liu, W Li, B Shao, S Xu… - Proceedings of the …, 2023 - openaccess.thecvf.com
Pre-training a vision-language model and then fine-tuning it on downstream tasks has
become a popular paradigm. However, pre-trained vision-language models with the …

From Near-Sensor to In-Sensor: A State-of-the-Art Review of Embedded AI Vision Systems

W Fabre, K Haroun, V Lorrain, M Lepecq, G Sicard - Sensors, 2024 - mdpi.com
In modern cyber-physical systems, the integration of AI into vision pipelines is now a
standard practice for applications ranging from autonomous vehicles to mobile devices …

Temporal dynamic quantization for diffusion models

J So, J Lee, D Ahn, H Kim… - Advances in Neural …, 2024 - proceedings.neurips.cc
Diffusion models have gained popularity in vision applications due to their remarkable
generative performance and versatility. However, their high storage and computation …