A two-stage data processing algorithm to generate random sample partitions for big data analysis

MS Mahmud, JZ Huang, S Salloum… - Big Data Mining and …, 2020 - ieeexplore.ieee.org

Computer clusters with the shared-nothing architecture are the major computing platforms
for big data processing and analysis. In cluster computing, data partitioning and sampling …

Cited by 312 · Related articles · All 5 versions

[PDF] ieee.org

Survey of distributed computing frameworks for supporting big data analysis

X Sun, Y He, D Wu, JZ Huang - Big Data Mining and Analytics, 2023 - ieeexplore.ieee.org

Distributed computing frameworks are the fundamental component of distributed computing
systems. They provide an essential way to support the efficient processing of big data on …

Cited by 26 · Related articles · All 2 versions

[PDF] arxiv.org

Random sample partition: a distributed data model for big data analysis

S Salloum, JZ Huang, Y He - IEEE Transactions on Industrial …, 2019 - ieeexplore.ieee.org

With the ever-increasing volume of data, alternative strategies are required to divide big data
into statistically consistent data blocks that can be used directly as representative samples of …

Cited by 117 · Related articles · All 3 versions

[HTML] sciencedirect.com

[HTML] A scalable and flexible basket analysis system for big transaction data in Spark

X Sun, A Ngueilbaye, K Luo, Y Cai, D Wu… - Information Processing & …, 2024 - Elsevier

Basket analysis is a prevailing technique to help retailers uncover patterns and associations
of sold products in customer shopping transactions. However, as the size of transaction …

Cited by 6 · Related articles · All 2 versions

Approximate clustering ensemble method for big data

MS Mahmud, JZ Huang, R Ruby… - … Transactions on Big …, 2023 - ieeexplore.ieee.org

Clustering a big distributed dataset of hundred gigabytes or more is a challenging task in
distributed computing. A popular method to tackle this problem is to use a random sample of …

Cited by 19 · Related articles · All 2 versions

[PDF] ieee.org

Distributed data strategies to support large-scale data analysis across geo-distributed data centers

TZ Emara, JZ Huang - IEEE Access, 2020 - ieeexplore.ieee.org

As the volume of data grows rapidly, storing big data in a single data center is no longer
feasible. Hence, companies have developed two scenarios to store their big data in multiple …

Cited by 37 · Related articles · All 3 versions

Clustering approximation via a fusion of multiple random samples

MS Mahmud, JZ Huang, S García - Information Fusion, 2024 - Elsevier

In big data clustering exploration, the situation is paradoxical because there is no prior or
insufficient domain knowledge. Moreover, clustering a big dataset is a challenging task in …

Cited by 9 · Related articles · All 2 versions

Non-MapReduce computing for intelligent big data analysis

X Sun, L Zhao, J Chen, Y Cai, D Wu… - Engineering Applications of …, 2024 - Elsevier

MapReduce is a popular paradigm in distributed computing, but it is not efficient when
executing iterative algorithms over a distributed big dataset due to its heavy data …

Cited by 4 · Related articles · All 2 versions

[PDF] springer.com

Exploring and cleaning big data with random sample data blocks

S Salloum, JZ Huang, Y He - Journal of Big Data, 2019 - Springer

Data scientists need scalable methods to explore and clean big data before applying
advanced data analysis and mining algorithms. In this paper, we propose the RSP-Explore …

Cited by 24 · Related articles · All 10 versions

A distributed data management system to support large-scale data analysis

TZ Emara, JZ Huang - Journal of Systems and Software, 2019 - Elsevier

Distributed data management is a key technology to enable efficient massive data
processing and analysis in cluster-computing environments. Specifically, in environments …

Cited by 25 · Related articles