Online multi-modal person search in videos

Q Huang, Y Xiong, A Rao, J Wang, D Lin - Computer Vision–ECCV 2020 …, 2020 - Springer

Recent years have seen remarkable advances in visual understanding. However, how to
understand a story-based long video with artistic styles, e.g., movie, remains challenging. In …

Cited by 221 · Related articles · All 4 versions

[PDF] thecvf.com

HVPR: Hybrid voxel-point representation for single-stage 3D object detection

J Noh, S Lee, B Ham - … of the IEEE/CVF conference on …, 2021 - openaccess.thecvf.com

We address the problem of 3D object detection, that is, estimating 3D object bounding boxes
from point clouds. 3D object detection methods exploit either voxel-based or point-based …

Cited by 132 · Related articles · All 6 versions

[PDF] arxiv.org

Category-level 6D object pose estimation via cascaded relation and recurrent reconstruction networks

J Wang, K Chen, Q Dou - 2021 IEEE/RSJ International …, 2021 - ieeexplore.ieee.org

Category-level 6D pose estimation, aiming to predict the location and orientation of unseen
object instances, is fundamental to many scenarios such as robotic manipulation and …

Cited by 77 · Related articles · All 5 versions

[PDF] arxiv.org

A unified framework for shot type classification based on subject centric lens

A Rao, J Wang, L Xu, X Jiang, Q Huang, B Zhou… - Computer Vision–ECCV …, 2020 - Springer

Shots are key narrative elements of various videos, e.g., movies, TV series, and
user-generated videos that are thriving over the Internet. The types of shots greatly influence how …

Cited by 67 · Related articles · All 4 versions

[PDF] acm.org

AVA-AVD: Audio-visual speaker diarization in the wild

EZ Xu, Z Song, S Tsutsui, C Feng, M Ye… - Proceedings of the 30th …, 2022 - dl.acm.org

Audio-visual speaker diarization aims at detecting "who spoke when" using both auditory
and visual signals. Existing audio-visual diarization datasets are mainly focused on indoor …

Cited by 32 · Related articles · All 2 versions

[PDF] arxiv.org

MovieCuts: A new dataset and benchmark for cut type recognition

A Pardo, FC Heilbron, JL Alcázar, A Thabet… - … on Computer Vision, 2022 - Springer

Understanding movies and their structural patterns is a crucial task in decoding the craft of
video editing. While previous works have developed tools for general analysis, such as …

Cited by 26 · Related articles · All 7 versions

[PDF] thecvf.com

Face, body, voice: Video person-clustering with multiple modalities

A Brown, V Kalogeiton… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com

The objective of this work is person-clustering in videos--grouping characters according to
their identity. Previous methods focus on the narrower task of face-clustering, and for the …

Cited by 29 · Related articles · All 17 versions

[PDF] google.com

[PDF] End-to-end audio-visual neural speaker diarization

MK He, J Du, CH Lee - Proc. Interspeech, 2022 - drive.google.com

In this paper, we propose a novel end-to-end neural-network-based audio-visual speaker
diarization method. Unlike most existing audio-visual methods, our audio-visual model takes …

Cited by 15 · Related articles · All 4 versions

[PDF] thecvf.com

Learning to cut by watching movies

A Pardo, F Caba, JL Alcázar… - Proceedings of the …, 2021 - openaccess.thecvf.com

Video content creation keeps growing at an incredible pace; yet, creating engaging stories
remains challenging and requires non-trivial video editing expertise. Many video editing …

Cited by 20 · Related articles · All 7 versions

[PDF] arxiv.org

Transformation vs tradition: Artificial general intelligence (AGI) for arts and humanities

Z Liu, Y Li, Q Cao, J Chen, T Yang, Z Wu, J Hale… - arXiv preprint arXiv …, 2023 - arxiv.org

Recent advances in artificial general intelligence (AGI), particularly large language models
and creative image generation systems have demonstrated impressive capabilities on …

Cited by 9 · Related articles · All 2 versions