Data management in machine learning: Challenges, techniques, and systems
Large-scale data analytics using statistical machine learning (ML), popularly called
advanced analytics, underpins many modern data-driven applications. The data …
MAD skills: new analysis practices for big data
J Cohen, B Dolan, M Dunlap, JM Hellerstein… - Proceedings of the …, 2009 - dl.acm.org
As massive data acquisition and storage becomes increasingly affordable, a wide variety of
enterprises are employing statisticians to engage in sophisticated data analysis. In this …
KeystoneML: Optimizing pipelines for large-scale advanced analytics
ER Sparks, S Venkataraman, T Kaftan… - 2017 IEEE 33rd …, 2017 - ieeexplore.ieee.org
Modern advanced analytics applications make use of machine learning techniques and
contain multiple steps of domain-specific and general-purpose processing with high …
Ricardo: integrating R and Hadoop
Many modern enterprises are collecting data at the most detailed level possible, creating
data repositories ranging from terabytes to petabytes in size. The ability to apply …
Putting pandas in a box
S Hagedorn, S Kläbe, KU Sattler - Conference on Innovative Data …, 2021 - db-thueringen.de
Pandas, the Python Data Analysis Library, is a powerful and widely used
framework for data analytics. In this work we present our approach to push down the …
Hybrid parallelization strategies for large-scale machine learning in SystemML
SystemML aims at declarative, large-scale machine learning (ML) on top of MapReduce,
where high-level ML scripts with R-like syntax are compiled to programs of MR jobs. The …
Cumulon: Optimizing statistical data analysis in the cloud
We present Cumulon, a system designed to help users rapidly develop and intelligently
deploy matrix-based big-data analysis programs in the cloud. Cumulon features a flexible …
Resource elasticity for large-scale machine learning
Declarative large-scale machine learning (ML) aims at flexible specification of ML algorithms
and automatic generation of hybrid runtime plans ranging from single node, in-memory …
Flexible rule-based decomposition and metadata independence in Modin: a parallel dataframe system
D Petersohn, D Tang, R Durrani… - Proceedings of the …, 2021 - par.nsf.gov
Dataframes have become universally popular as a means to represent data in various
stages of structure, and manipulate it using a rich set of operators, thereby becoming an …
Enabling and optimizing non-linear feature interactions in factorized linear algebra
Accelerating machine learning (ML) over relational data is a key focus of the database
community. While many real-world datasets are multi-table, most ML tools expect single …