Multimodal datasets: misogyny, pornography, and malignant stereotypes
We have now entered the era of trillion parameter machine learning models trained on
billion-sized datasets scraped from the internet. The rise of these gargantuan datasets has …
What's in the box? a preliminary analysis of undesirable content in the common crawl corpus
AS Luccioni, JD Viviano - arXiv preprint arXiv:2105.02732, 2021 - arxiv.org
Whereas much of the success of the current generation of neural language models has
been driven by increasingly large training corpora, relatively little research has been …
Consent in crisis: The rapid decline of the ai data commons
General-purpose artificial intelligence (AI) systems are built on massive swathes of public
web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge …
When Sally met trackers: Web tracking from the users' perspective
S Dambra, I Sanchez-Rola, L Bilge… - 31st USENIX Security …, 2022 - usenix.org
Web tracking has evolved to become a norm on the Internet. As a matter of fact, the web
tracking market has grown to raise billions of dollars. Privacy cautious web practitioners and …
The Hitchhiker's guide to facebook web tracking with invisible pixels and click IDs
Over the past years, advertisement companies have used various tracking methods to
persistently track users across the web. Such tracking methods usually include first and third …
Towards website domain name classification using graph based semi-supervised learning
In this work, we tackle the problem of classifying websites domain names to a category, e.g.,
mapping bbc.com to the "News and Media" class. Domain name classification is …
Measuring web cookies in governmental websites
In recent years, governments worldwide have moved their services online to better serve
their citizens. Benefits aside, this choice increases the danger of tracking via such sites. This …
Securing federated sensitive topic classification against poisoning attacks
We present a Federated Learning (FL) based solution for building a distributed classifier
capable of detecting URLs containing GDPR-sensitive content related to categories such as …
Discovering obscure looking glass sites on the web to facilitate internet measurement research
S Zhuang, JH Wang, J Wang, Z Pan, T Wu, F Li… - Proceedings of the 17th …, 2021 - dl.acm.org
Despite researchers have noticed that Looking Glass (LG) vantage points (VPs) are
valuable for Internet measurement researches, they can only exploit VPs from well-known …
Decoding the Kodi Ecosystem
Y Xiao, M Varvello, M Warrior… - ACM Transactions on the …, 2023 - dl.acm.org
Free and open-source media centers are experiencing a boom in popularity for the
convenience they offer users seeking to remotely consume digital content. Kodi is today's …