A survey of data partitioning and sampling methods to support big data analysis
Computer clusters with the shared-nothing architecture are the major computing platforms
for big data processing and analysis. In cluster computing, data partitioning and sampling …
Survey of distributed computing frameworks for supporting big data analysis
Distributed computing frameworks are the fundamental component of distributed computing
systems. They provide an essential way to support the efficient processing of big data on …
Random sample partition: a distributed data model for big data analysis
With the ever-increasing volume of data, alternative strategies are required to divide big data
into statistically consistent data blocks that can be used directly as representative samples of …
A scalable and flexible basket analysis system for big transaction data in Spark
Basket analysis is a prevailing technique to help retailers uncover patterns and associations
of sold products in customer shopping transactions. However, as the size of transaction …
Approximate clustering ensemble method for big data
Clustering a big distributed dataset of hundreds of gigabytes or more is a challenging task in
distributed computing. A popular method to tackle this problem is to use a random sample of …
Distributed data strategies to support large-scale data analysis across geo-distributed data centers
TZ Emara, JZ Huang - IEEE Access, 2020 - ieeexplore.ieee.org
As the volume of data grows rapidly, storing big data in a single data center is no longer
feasible. Hence, companies have developed two scenarios to store their big data in multiple …
Clustering approximation via a fusion of multiple random samples
In big data clustering exploration, the situation is paradoxical because there is no prior or
insufficient domain knowledge. Moreover, clustering a big dataset is a challenging task in …
Non-MapReduce computing for intelligent big data analysis
MapReduce is a popular paradigm in distributed computing, but it is not efficient when
executing iterative algorithms over a distributed big dataset due to its heavy data …
Exploring and cleaning big data with random sample data blocks
Data scientists need scalable methods to explore and clean big data before applying
advanced data analysis and mining algorithms. In this paper, we propose the RSP-Explore …
A distributed data management system to support large-scale data analysis
TZ Emara, JZ Huang - Journal of Systems and Software, 2019 - Elsevier
Distributed data management is a key technology to enable efficient massive data
processing and analysis in cluster-computing environments. Specifically, in environments …