Data management in machine learning: Challenges, techniques, and systems

A Kumar, M Boehm, J Yang - Proceedings of the 2017 ACM International …, 2017 - dl.acm.org
Large-scale data analytics using statistical machine learning (ML), popularly called
advanced analytics, underpins many modern data-driven applications. The data …

MAD skills: new analysis practices for big data

J Cohen, B Dolan, M Dunlap, JM Hellerstein… - Proceedings of the …, 2009 - dl.acm.org
As massive data acquisition and storage becomes increasingly affordable, a wide variety of
enterprises are employing statisticians to engage in sophisticated data analysis. In this …

KeystoneML: Optimizing pipelines for large-scale advanced analytics

ER Sparks, S Venkataraman, T Kaftan… - 2017 IEEE 33rd …, 2017 - ieeexplore.ieee.org
Modern advanced analytics applications make use of machine learning techniques and
contain multiple steps of domain-specific and general-purpose processing with high …

Ricardo: integrating R and Hadoop

S Das, Y Sismanis, KS Beyer, R Gemulla… - Proceedings of the …, 2010 - dl.acm.org
Many modern enterprises are collecting data at the most detailed level possible, creating
data repositories ranging from terabytes to petabytes in size. The ability to apply …

Putting pandas in a box

S Hagedorn, S Kläbe, KU Sattler - Conference on Innovative Data …, 2021 - db-thueringen.de
Pandas, the Python Data Analysis Library, is a powerful and widely used
framework for data analytics. In this work we present our approach to push down the …

Hybrid parallelization strategies for large-scale machine learning in SystemML

M Boehm, S Tatikonda, B Reinwald, P Sen… - Proceedings of the …, 2014 - dl.acm.org
SystemML aims at declarative, large-scale machine learning (ML) on top of MapReduce,
where high-level ML scripts with R-like syntax are compiled to programs of MR jobs. The …

Cumulon: Optimizing statistical data analysis in the cloud

B Huang, S Babu, J Yang - Proceedings of the 2013 ACM SIGMOD …, 2013 - dl.acm.org
We present Cumulon, a system designed to help users rapidly develop and intelligently
deploy matrix-based big-data analysis programs in the cloud. Cumulon features a flexible …

Resource elasticity for large-scale machine learning

B Huang, M Boehm, Y Tian, B Reinwald… - Proceedings of the …, 2015 - dl.acm.org
Declarative large-scale machine learning (ML) aims at flexible specification of ML algorithms
and automatic generation of hybrid runtime plans ranging from single node, in-memory …

Flexible rule-based decomposition and metadata independence in Modin: a parallel dataframe system

D Petersohn, D Tang, R Durrani… - Proceedings of the …, 2021 - par.nsf.gov
Dataframes have become universally popular as a means to represent data in various
stages of structure, and manipulate it using a rich set of operators, thereby becoming an …

Enabling and optimizing non-linear feature interactions in factorized linear algebra

S Li, L Chen, A Kumar - … of the 2019 International Conference on …, 2019 - dl.acm.org
Accelerating machine learning (ML) over relational data is a key focus of the database
community. While many real-world datasets are multi-table, most ML tools expect single …