A survey on classifying big data with label noise

JM Johnson, TM Khoshgoftaar - ACM Journal of Data and Information …, 2022 - dl.acm.org
ACM Journal of Data and Information Quality, 2022 - dl.acm.org
Class label noise is a critical component of data quality that directly inhibits the predictive performance of machine learning algorithms. While many data-level and algorithm-level methods exist for treating label noise, the challenges associated with big data call for new and improved methods. This survey addresses these concerns by providing an extensive literature review on treating label noise within big data. We begin with an introduction to the class label noise problem and traditional methods for treating label noise. Next, we present 30 methods for treating class label noise in a range of big data contexts, i.e., high-volume, high-variety, and high-velocity problems. The surveyed works include distributed solutions capable of operating on datasets of arbitrary sizes, deep learning techniques for large-scale datasets with limited clean labels, and streaming techniques for detecting class noise in the presence of concept drift. Common trends and best practices are identified in each of these areas, implementation details are reviewed, empirical results are compared across studies when applicable, and references to 17 open-source projects and programming packages are provided. An emphasis on label noise challenges, solutions, and empirical results as they relate to big data distinguishes this work as a unique contribution that will inspire future research and guide machine learning practitioners.