A survey of data partitioning and sampling methods to support big data analysis

MS Mahmud, JZ Huang, S Salloum… - Big Data Mining and …, 2020 - ieeexplore.ieee.org
Computer clusters with the shared-nothing architecture are the major computing platforms
for big data processing and analysis. In cluster computing, data partitioning and sampling …

Survey of distributed computing frameworks for supporting big data analysis

X Sun, Y He, D Wu, JZ Huang - Big Data Mining and Analytics, 2023 - ieeexplore.ieee.org
Distributed computing frameworks are the fundamental component of distributed computing
systems. They provide an essential way to support the efficient processing of big data on …

Random sample partition: a distributed data model for big data analysis

S Salloum, JZ Huang, Y He - IEEE Transactions on Industrial …, 2019 - ieeexplore.ieee.org
With the ever-increasing volume of data, alternative strategies are required to divide big data
into statistically consistent data blocks that can be used directly as representative samples of …

A scalable and flexible basket analysis system for big transaction data in Spark

X Sun, A Ngueilbaye, K Luo, Y Cai, D Wu… - Information Processing & …, 2024 - Elsevier
Basket analysis is a prevailing technique to help retailers uncover patterns and associations
of sold products in customer shopping transactions. However, as the size of transaction …

Approximate clustering ensemble method for big data

MS Mahmud, JZ Huang, R Ruby… - … Transactions on Big …, 2023 - ieeexplore.ieee.org
Clustering a big distributed dataset of a hundred gigabytes or more is a challenging task in
distributed computing. A popular method to tackle this problem is to use a random sample of …

Distributed data strategies to support large-scale data analysis across geo-distributed data centers

TZ Emara, JZ Huang - IEEE Access, 2020 - ieeexplore.ieee.org
As the volume of data grows rapidly, storing big data in a single data center is no longer
feasible. Hence, companies have developed two scenarios to store their big data in multiple …

Clustering approximation via a fusion of multiple random samples

MS Mahmud, JZ Huang, S García - Information Fusion, 2024 - Elsevier
In big data clustering exploration, the situation is paradoxical because there is no prior or
insufficient domain knowledge. Moreover, clustering a big dataset is a challenging task in …

Non-MapReduce computing for intelligent big data analysis

X Sun, L Zhao, J Chen, Y Cai, D Wu… - Engineering Applications of …, 2024 - Elsevier
MapReduce is a popular paradigm in distributed computing, but it is not efficient when
executing iterative algorithms over a distributed big dataset due to its heavy data …

Exploring and cleaning big data with random sample data blocks

S Salloum, JZ Huang, Y He - Journal of Big Data, 2019 - Springer
Data scientists need scalable methods to explore and clean big data before applying
advanced data analysis and mining algorithms. In this paper, we propose the RSP-Explore …

A distributed data management system to support large-scale data analysis

TZ Emara, JZ Huang - Journal of Systems and Software, 2019 - Elsevier
Distributed data management is a key technology to enable efficient massive data
processing and analysis in cluster-computing environments. Specifically, in environments …