Relational Learning in Computer Vision

N Messina, F Falchi, G Amato, M Avvenuti, J Lokoc… - 2022 - researchgate.net
Summary
The increasing interest in social networks, smart cities, and Industry 4.0 is encouraging the development of techniques for processing, understanding, and organizing vast amounts of data. Recent advances in Artificial Intelligence have brought to life a subfield of Machine Learning called Deep Learning, which can automatically learn common patterns directly from raw data, without relying on manual feature selection. This framework has transformed many computer science fields, such as Computer Vision and Natural Language Processing, obtaining astonishing results. Nevertheless, many challenges are still open. Although deep neural networks have obtained impressive results on many tasks, they cannot perform non-local processing by explicitly relating potentially interconnected visual or textual entities. This relational aspect is fundamental for capturing high-level semantic interconnections in multimedia data or for understanding the relationships between spatially distant objects in an image.

This thesis tackles the relational understanding problem in Deep Neural Networks, considering three different yet related tasks. First, we introduce a challenging variant of the Content-Based Image Retrieval (CBIR) task, called Relational CBIR (R-CBIR). In R-CBIR, we aim to retrieve images that also have similar relationships among the multiple objects they contain. We define architectures able to extract relationship-aware visual descriptors, and we extend the CLEVR synthetic dataset to obtain a suitable ground truth for evaluating R-CBIR.

Then, we move a step further, considering real-world images and focusing on cross-modal visual-textual retrieval. We use the Transformer Encoder, a recently introduced module that relies on the power of self-attention, to relate different sentence words and image regions, with large-scale retrieval as the main goal (see the self-attention sketch below). We show that the obtained features contain very high-level semantics and outperform current image descriptors on the challenging Semantic CBIR task. We then propose solutions for scaling the search to possibly millions of images or texts (see the indexing sketch below). We deploy the developed networks in VISIONE, a large-scale interactive video retrieval system developed in our laboratory.

Sticking to the multi-modal Transformer framework, we tackle another critical task on the modern Internet: detecting persuasion techniques in memes spread on social networks during disinformation campaigns. Finally, we probe current state-of-the-art CNNs on challenging visual reasoning tasks.
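The summary mentions relating sentence words and image regions through the self-attention of a Transformer Encoder. The following is a minimal sketch of that idea in PyTorch; the embedding size, token counts, layer depth, and mean pooling are illustrative assumptions, not the thesis' actual architecture.

```python
# Minimal sketch: relating image regions and sentence words with a
# Transformer encoder via self-attention. All shapes and names are
# hypothetical, chosen only to illustrate the mechanism.
import torch
import torch.nn as nn

d_model = 512  # shared embedding size (assumed)
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

# Hypothetical inputs: 36 detected region features and 20 word embeddings,
# both already projected into the shared d_model space.
regions = torch.randn(1, 36, d_model)
words = torch.randn(1, 20, d_model)

# Self-attention lets every token attend to every other token, so spatially
# distant regions and semantically related words can be connected directly.
tokens = torch.cat([regions, words], dim=1)   # (1, 56, d_model)
contextualized = encoder(tokens)

# Pool into a single relationship-aware descriptor usable for retrieval.
descriptor = contextualized.mean(dim=1)       # (1, d_model)
```

Because attention is computed between all token pairs in a single pass, this kind of module performs exactly the non-local processing that the summary identifies as missing from plain CNNs.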
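The summary also mentions scaling the search to possibly millions of images or texts. A common approach is approximate nearest-neighbour indexing over the extracted descriptors; the sketch below uses FAISS with an inverted-file index as one illustrative choice, since the abstract does not name the actual indexing method used in VISIONE.

```python
# Hedged sketch: approximate nearest-neighbour search over descriptors
# with FAISS. Dataset size, dimensionality, and index parameters are
# illustrative assumptions.
import numpy as np
import faiss

d = 512  # descriptor dimensionality (assumed)
db = np.random.rand(100_000, d).astype('float32')
faiss.normalize_L2(db)  # cosine similarity via inner product

# Inverted-file index: vectors are bucketed into 1024 cells so queries
# only scan a few cells instead of the whole database.
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024, faiss.METRIC_INNER_PRODUCT)
index.train(db)  # learn the coarse cell centroids
index.add(db)

query = np.random.rand(1, d).astype('float32')
faiss.normalize_L2(query)
index.nprobe = 16  # probe 16 of the 1024 cells (speed/recall trade-off)
scores, ids = index.search(query, 10)  # top-10 approximate neighbours
```

The nprobe parameter controls the trade-off between query latency and recall, which is the central concern when scaling descriptor search to millions of items.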