A survey on automatic parameter tuning for big data processing systems

H Herodotou, Y Chen, J Lu - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
Big data processing systems (e.g., Hadoop, Spark, Storm) contain a vast number of
configuration parameters controlling parallelism, I/O behavior, memory settings, and …

A comprehensive performance analysis of Apache Hadoop and Apache Spark for large scale data sets using HiBench

N Ahmed, ALC Barczak, T Susnjak, MA Rashid - Journal of Big Data, 2020 - Springer
Big Data analytics for storing, processing, and analyzing large-scale datasets has become
an essential tool for the industry. The advent of distributed computing frameworks such as …

Using machine learning to optimize parallelism in big data applications

ÁB Hernández, MS Perez, S Gupta… - Future Generation …, 2018 - Elsevier
In-memory cluster computing platforms have gained momentum in the last years, due to their
ability to analyse big amounts of data in parallel. These platforms are complex and difficult-to …

A methodology for Spark parameter tuning

A Gounaris, J Torres - Big Data Research, 2018 - Elsevier
Spark has been established as an attractive platform for big data analysis, since it manages
to hide most of the complexities related to parallelism, fault tolerance and cluster setting from …

Efficient performance prediction for Apache Spark

G Cheng, S Ying, B Wang, Y Li - Journal of Parallel and Distributed …, 2021 - Elsevier
Spark is a more efficient distributed big data processing framework following Hadoop. It
provides users with more than 180 adjustable configuration parameters, and how to choose …

LOCAT: Low-overhead online configuration auto-tuning of Spark SQL applications

J Xin, K Hwang, Z Yu - … of the 2022 International Conference on …, 2022 - dl.acm.org
Spark SQL has been widely deployed in industry but it is challenging to tune its
performance. Recent studies try to employ machine learning (ML) to solve this problem, but …

Towards general and efficient online tuning for Spark

Y Li, H Jiang, Y Shen, Y Fang, X Yang, D Huang… - arXiv preprint arXiv …, 2023 - arxiv.org
The distributed data analytic system--Spark is a common choice for processing massive
volumes of heterogeneous data, while it is challenging to tune its parameters to achieve …

Rover: An online Spark SQL tuning service via generalized transfer learning

Y Shen, X Ren, Y Lu, H Jiang, H Xu, D Peng… - Proceedings of the 29th …, 2023 - dl.acm.org
Distributed data analytic engines like Spark are common choices to process massive data in
industry. However, the performance of Spark SQL highly depends on the choice of …

Runtime prediction of big data jobs: performance comparison of machine learning algorithms and analytical models

N Ahmed, ALC Barczak, MA Rashid, T Susnjak - Journal of Big Data, 2022 - Springer
Due to the rapid growth of available data, various platforms offer parallel infrastructure that
efficiently processes big data. One of the critical issues is how to use these platforms to …

You only run once: Spark auto-tuning from a single run

DB Prats, FA Portella, CHA Costa… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Tuning configurations of Spark jobs is not a trivial task. State-of-the-art auto-tuning systems
are based on iteratively running workloads with different configurations. During the …