XDataExplorer: a three-stage comprehensive self-tuning tool for Big Data platforms

Q Guo, Y Xie, Q Li, Y Zhu - Big Data Research, 2022 - Elsevier
To meet the challenges of massive data, many big data platforms have been used in
practice. In these data processing platforms, there are many correlated parameters that have …

An empirical study on the challenges that developers encounter when developing Apache Spark applications

Z Wang, THP Chen, H Zhang, S Wang - Journal of Systems and Software, 2022 - Elsevier
Apache Spark is one of the most popular big data frameworks that abstract the underlying
distributed computation details. However, even though Spark provides various abstractions …

Approximation with error bounds in Spark

G Hu, S Rigo, D Zhang… - 2019 IEEE 27th …, 2019 - ieeexplore.ieee.org
Many decision-making queries are based on aggregating massive amounts of data, where
sampling is an important approximation technique for reducing execution times. It is …
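The snippet only gestures at the technique, so here is a minimal sketch, in plain PySpark rather than the paper's system, of sample-based approximate aggregation with a CLT-style error bound; the synthetic table, the column name amount, and the 1% sampling fraction are illustrative assumptions.

```python
# Sketch only: approximate an aggregate from a sample and report a
# 95% confidence interval, assuming rows are sampled i.i.d.
import math
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("approx-agg-sketch").getOrCreate()

# Hypothetical input: one numeric column whose mean we want.
df = spark.range(0, 1_000_000).selectExpr("id", "rand() AS amount")

sample = df.sample(withReplacement=False, fraction=0.01, seed=42)

stats = sample.selectExpr(
    "avg(amount) AS mean",
    "stddev_samp(amount) AS sd",
    "count(*) AS n",
).first()

# Central-limit-theorem error bound on the sampled mean.
half_width = 1.96 * stats["sd"] / math.sqrt(stats["n"])
print(f"mean ~ {stats['mean']:.4f} +/- {half_width:.4f} (95% CI)")
```

Only the 1% sample is scanned and aggregated; the bound tightens as the sampling fraction grows.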

Evolutionary scheduling of dynamic multitasking workloads for big-data analytics in elastic cloud

F Zhang, J Cao, W Tan, SU Khan, K Li… - IEEE Transactions on …, 2014 - ieeexplore.ieee.org
Scheduling of dynamic and multitasking workloads for big-data analytics is a challenging
issue, as it requires a significant amount of parameter sweeping and iterations. Therefore …

[CITATION][C] DeepSpark: Spark-based deep learning supporting asynchronous updates and Caffe compatibility

H Kim, J Park, J Jang, S Yoon - CoRR, vol. abs/1602.08191, 2016

DeepCAT: A Cost-Efficient Online Configuration Auto-Tuning Approach for Big Data Frameworks

H Dou, Y Wang, Y Zhang, P Chen - Proceedings of the 51st International …, 2022 - dl.acm.org
To support different application scenarios, big data frameworks usually provide a large
number of performance-related configuration parameters. Online auto-tuning these …
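As a rough illustration of what such a tuner manipulates (not DeepCAT's algorithm), the sketch below varies one runtime-settable Spark SQL parameter and uses the wall-clock time of a fixed job as the feedback signal; the benchmark workload and the three candidate values are assumptions.

```python
# Sketch only: probe one real Spark configuration knob and measure the
# resulting job latency, the signal an online auto-tuner would optimize.
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("config-probe").getOrCreate()

def run_workload() -> float:
    start = time.time()
    # Fixed benchmark: a wide aggregation whose cost depends on shuffle settings.
    (spark.range(0, 5_000_000)
          .selectExpr("id % 1000 AS key", "id AS value")
          .groupBy("key").sum("value")
          .count())
    return time.time() - start

# An auto-tuner would search this space; here we just enumerate three values.
for partitions in (8, 64, 200):
    spark.conf.set("spark.sql.shuffle.partitions", partitions)
    print(partitions, "->", round(run_workload(), 2), "s")
```

A real tuner explores a much larger, correlated parameter space and amortizes the probing cost online; this loop only makes the feedback signal concrete.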

Large scale distributed data science from scratch using Apache Spark 2.0

J Shanahan, L Dai - Proceedings of the 26th International Conference on …, 2017 - dl.acm.org
Apache Spark is an open-source cluster computing framework. It has emerged as the next-
generation big data processing engine, overtaking Hadoop MapReduce, which helped ignite …

[PDF] MapReduce/Bigtable for distributed optimization

KB Hall, S Gilpin, G Mann - NIPS LCCC Workshop, 2010 - researchgate.net
With large data sets, it can be time-consuming to run gradient-based optimization, for
example to minimize the log-likelihood for maximum entropy models. Distributed methods …
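The key observation behind distributed gradient methods is that the log-likelihood gradient is a sum over examples, so each partition can contribute a partial gradient that is combined in a reduce step. Below is a minimal sketch using PySpark RDDs, with logistic regression on synthetic data standing in for a maximum-entropy model; it is not the paper's MapReduce/Bigtable implementation, and the learning rate and iteration count are arbitrary.

```python
# Sketch only: batch gradient descent where per-example gradients are
# computed in parallel across partitions and summed with a reduce.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distributed-gradient-sketch").getOrCreate()
sc = spark.sparkContext

# Synthetic, linearly separable data generated from a known weight vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = (X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) > 0).astype(float)
data = sc.parallelize(list(zip(X, y)), numSlices=8).cache()
n = data.count()

w = np.zeros(5)
lr = 0.5
for _ in range(50):
    # Map: gradient of the negative log-likelihood for one example.
    # Reduce: sum the partial gradients from all partitions.
    grad = data.map(
        lambda xy: (1.0 / (1.0 + np.exp(-(xy[0] @ w))) - xy[1]) * xy[0]
    ).reduce(lambda a, b: a + b)
    w -= lr * grad / n

print("learned weights (direction should match the generating vector):", np.round(w, 2))
```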

SWAT: A programmable, in-memory, distributed, high-performance computing platform

M Grossman, V Sarkar - Proceedings of the 25th ACM International …, 2016 - dl.acm.org
The field of data analytics is currently going through a renaissance as a result of ever-
increasing dataset sizes, the value of the models that can be trained from those datasets …

Per-run algorithm selection with warm-starting using trajectory-based features

A Kostovska, A Jankovic, D Vermetten… - … Conference on Parallel …, 2022 - Springer
Per-instance algorithm selection seeks to recommend, for a given problem instance and a
given performance criterion, one or several suitable algorithms that are expected to perform …