Occupy the cloud: Distributed computing for the 99%

E Jonas, Q Pu, S Venkataraman, I Stoica… - Proceedings of the 2017 …, 2017 - dl.acm.org
Distributed computing remains inaccessible to a large number of users, in spite of many
open source platforms and extensive commercial offerings. While distributed computation …

Cluster frameworks for efficient scheduling and resource allocation in data center networks: A survey

K Wang, Q Zhou, S Guo, J Luo - IEEE Communications Surveys …, 2018 - ieeexplore.ieee.org
Data centers are widely used for big data analytics, which often involve data-parallel jobs,
including query and web service. Meanwhile, cluster frameworks are rapidly developed for …

Sparrow: distributed, low latency scheduling

K Ousterhout, P Wendell, M Zaharia… - Proceedings of the twenty …, 2013 - dl.acm.org
Large-scale data analytics frameworks are shifting towards shorter task durations and larger
degrees of parallelism to provide low latency. Scheduling highly parallel jobs that complete …

Making sense of performance in data analytics frameworks

K Ousterhout, R Rasti, S Ratnasamy… - … USENIX Symposium on …, 2015 - usenix.org
There has been much research devoted to improving the performance of data analytics
frameworks, but comparatively little effort has been spent systematically identifying the …

Firmament: Fast, centralized cluster scheduling at scale

I Gog, M Schwarzkopf, A Gleave, RNM Watson… - … USENIX Symposium on …, 2016 - usenix.org
Centralized datacenter schedulers can make high-quality placement decisions when
scheduling tasks in a cluster. Today, however, high-quality placements come at the cost of …

Shark: SQL and rich analytics at scale

RS Xin, J Rosen, M Zaharia, MJ Franklin… - Proceedings of the …, 2013 - dl.acm.org
Shark is a new data analysis system that marries query processing with complex analytics
on large clusters. It leverages a novel distributed memory abstraction to provide a unified …

Drizzle: Fast and adaptable stream processing at scale

S Venkataraman, A Panda, K Ousterhout… - Proceedings of the 26th …, 2017 - dl.acm.org
Large scale streaming systems aim to provide high throughput and low latency. They are
often used to run mission-critical applications, and must be available 24x7. Thus such …

Accelerating distributed MoE training and inference with Lina

J Li, Y Jiang, Y Zhu, C Wang, H Xu - 2023 USENIX Annual Technical …, 2023 - usenix.org
Scaling model parameters improves model quality at the price of high computation
overhead. Sparsely activated models, usually in the form of Mixture of Experts (MoE) …

Hopper: Decentralized speculation-aware cluster scheduling at scale

X Ren, G Ananthanarayanan, A Wierman… - Proceedings of the 2015 …, 2015 - dl.acm.org
As clusters continue to grow in size and complexity, providing scalable and predictable
performance is an increasingly important challenge. A crucial roadblock to achieving …

GRASS: Trimming stragglers in approximation analytics

G Ananthanarayanan, MCC Hung, X Ren… - … USENIX Symposium on …, 2014 - usenix.org
In big data analytics, timely results, even if based on only part of the data, are often good
enough. For this reason, approximation jobs, which have deadline or error bounds and …