Multimodal datasets: misogyny, pornography, and malignant stereotypes

A Birhane, VU Prabhu, E Kahembwe - arXiv preprint arXiv:2110.01963, 2021 - arxiv.org
We have now entered the era of trillion parameter machine learning models trained on
billion-sized datasets scraped from the internet. The rise of these gargantuan datasets has …

What's in the box? a preliminary analysis of undesirable content in the common crawl corpus

AS Luccioni, JD Viviano - arXiv preprint arXiv:2105.02732, 2021 - arxiv.org
Whereas much of the success of the current generation of neural language models has
been driven by increasingly large training corpora, relatively little research has been …

Consent in crisis: The rapid decline of the AI data commons

S Longpre, R Mahari, A Lee, C Lund, H Oderinwale… - NeurIPS, 2024 - hal.science
General-purpose artificial intelligence (AI) systems are built on massive swathes of public
web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge …

When Sally met trackers: Web tracking from the users' perspective

S Dambra, I Sanchez-Rola, L Bilge… - 31st USENIX Security …, 2022 - usenix.org
Web tracking has evolved to become a norm on the Internet. As a matter of fact, the web
tracking market has grown to raise billions of dollars. Privacy cautious web practitioners and …

The Hitchhiker's guide to Facebook web tracking with invisible pixels and click IDs

P Bekos, P Papadopoulos, EP Markatos… - Proceedings of the ACM …, 2023 - dl.acm.org
Over the past years, advertisement companies have used various tracking methods to
persistently track users across the web. Such tracking methods usually include first and third …

Towards website domain name classification using graph based semi-supervised learning

A Faroughi, A Morichetta, L Vassio, F Figueiredo… - Computer Networks, 2021 - Elsevier
In this work, we tackle the problem of classifying websites domain names to a category, e.g.,
mapping bbc.com to the "News and Media" class. Domain name classification is …

Measuring web cookies in governmental websites

M Gotze, S Matic, C Iordanou, G Smaragdakis… - Proceedings of the 14th …, 2022 - dl.acm.org
In recent years, governments worldwide have moved their services online to better serve
their citizens. Benefits aside, this choice increases the danger of tracking via such sites. This …

Securing federated sensitive topic classification against poisoning attacks

T Chu, A Garcia-Recuero, C Iordanou… - arXiv preprint arXiv …, 2022 - arxiv.org
We present a Federated Learning (FL) based solution for building a distributed classifier
capable of detecting URLs containing GDPR-sensitive content related to categories such as …

Discovering obscure looking glass sites on the web to facilitate internet measurement research

S Zhuang, JH Wang, J Wang, Z Pan, T Wu, F Li… - Proceedings of the 17th …, 2021 - dl.acm.org
Although researchers have noticed that Looking Glass (LG) vantage points (VPs) are
valuable for Internet measurement research, they can only exploit VPs from well-known …

Decoding the Kodi Ecosystem

Y Xiao, M Varvello, M Warrior… - ACM Transactions on the …, 2023 - dl.acm.org
Free and open-source media centers are experiencing a boom in popularity for the
convenience they offer users seeking to remotely consume digital content. Kodi is today's …