XDataExplorer: a three-stage comprehensive self-tuning tool for Big Data platforms
Q Guo, Y Xie, Q Li, Y Zhu - Big Data Research, 2022 - Elsevier
To meet the challenges of massive data, many big data platforms have been used in
practice. In these data processing platforms, there are many correlated parameters that have …
An empirical study on the challenges that developers encounter when developing Apache Spark applications
Apache Spark is one of the most popular big data frameworks that abstract the underlying
distributed computation details. However, even though Spark provides various abstractions …
Approximation with error bounds in spark
Many decision-making queries are based on aggregating massive amounts of data, where
sampling is an important approximation technique for reducing execution times. It is …
Evolutionary scheduling of dynamic multitasking workloads for big-data analytics in elastic cloud
Scheduling of dynamic and multitasking workloads for big-data analytics is a challenging
issue, as it requires a significant amount of parameter sweeping and iterations. Therefore …
DeepCAT: A Cost-Efficient Online Configuration Auto-Tuning Approach for Big Data Frameworks
To support different application scenarios, big data frameworks usually provide a large
number of performance-related configuration parameters. Online auto-tuning these …
Large scale distributed data science from scratch using Apache Spark 2.0
J Shanahan, L Dai - Proceedings of the 26th International Conference on …, 2017 - dl.acm.org
Apache Spark is an open-source cluster computing framework. It has emerged as the next
generation big data processing engine, overtaking Hadoop MapReduce which helped ignite …
MapReduce/Bigtable for distributed optimization
With large data sets, it can be time consuming to run gradient based optimization, for
example to minimize the log-likelihood for maximum entropy models. Distributed methods …
SWAT: A programmable, in-memory, distributed, high-performance computing platform
M Grossman, V Sarkar - Proceedings of the 25th ACM International …, 2016 - dl.acm.org
The field of data analytics is currently going through a renaissance as a result of ever-
increasing dataset sizes, the value of the models that can be trained from those datasets …
Per-run algorithm selection with warm-starting using trajectory-based features
Per-instance algorithm selection seeks to recommend, for a given problem instance and a
given performance criterion, one or several suitable algorithms that are expected to perform …