Matrix factorizations at scale: A comparison of scientific data analytics in Spark and C+MPI using three case studies

A Gittens, A Devarakonda, E Racah… - … Conference on Big …, 2016 - ieeexplore.ieee.org
We explore the trade-offs of performing linear algebra using Apache Spark, compared to
traditional C and MPI implementations on HPC platforms. Spark is designed for data …

ArrayUDF: User-defined scientific data analysis on arrays

B Dong, K Wu, S Byna, J Liu, W Zhao… - Proceedings of the 26th …, 2017 - dl.acm.org
User-Defined Functions (UDF) allow application programmers to specify analysis operations
on data, while leaving the data management tasks to the system. This general approach …

A high performance query analytical framework for supporting data-intensive climate studies

Z Li, Q Huang, GJ Carbone, F Hu - Computers, Environment and Urban …, 2017 - Elsevier
Climate observations and model simulations produce vast amounts of data. The
unprecedented data volume and the complexity of geospatial statistics and analysis requires …

The case for alternative web archival formats to expedite the data-to-insight cycle

X Wang, Z Xie - Proceedings of the ACM/IEEE Joint Conference on …, 2020 - dl.acm.org
The WARC file format is widely used by web archives to preserve collected web content for
future use. With the rapid growth of web archives and the increasing interest to reuse these …

Spark and HPC for high energy physics data analyses

S Sehrish, J Kowalkowski… - 2017 IEEE International …, 2017 - ieeexplore.ieee.org
A full High Energy Physics (HEP) data analysis is divided into multiple data reduction
phases. Processing within these phases is extremely time consuming, therefore …

Zero-cost, Arrow-enabled data interface for Apache Spark

SA Rodriguez, J Chakraborty, A Chu… - … Conference on Big …, 2021 - ieeexplore.ieee.org
Distributed data processing ecosystems are widespread and their components are highly
specialized, such that efficient interoperability is urgent. Recently, Apache Arrow was …

SciDP: Support HPC and big data applications via integrated scientific data processing

K Feng, XH Sun, X Yang, S Zhou - 2018 IEEE International …, 2018 - ieeexplore.ieee.org
Modern High Performance Computing (HPC) applications, such as Earth science
simulations, produce large amounts of data due to the surging of computing power, while big …

Distributed interactive visualization using GPU-optimized Spark

S Hong, J Choi, WK Jeong - IEEE Transactions on Visualization …, 2020 - ieeexplore.ieee.org
With the advent of advances in imaging and computing technologies, large-scale data
acquisition and processing have become commonplace in many science and engineering …

FITS data source for Apache Spark

J Peloton, C Arnault, S Plaszczynski - Computing and Software for Big …, 2018 - Springer
We investigate the performance of Apache Spark, a cluster computing framework, for
analyzing data from future LSST-like galaxy surveys. Apache Spark attempts to address big …

Towards Implicit Parallel Programming for Systems

S Ertel - 2019 - core.ac.uk
Processor architectures have reached a physical boundary that prevents scaling
performance with the number of transistors. Effectively, this means that the sequential …