Understanding data storage and ingestion for large-scale deep recommendation model training: Industrial product

M Zhao, N Agarwal, A Basant, B Gedik, S Pan… - Proceedings of the 49th …, 2022 - dl.acm.org
Datacenter-scale AI training clusters consisting of thousands of domain-specific accelerators
(DSA) are used to train increasingly-complex deep learning models. These clusters rely on a …

SoftSKU: Optimizing server architectures for microservice diversity @scale

A Sriraman, A Dhanotia, TF Wenisch - Proceedings of the 46th …, 2019 - dl.acm.org
The variety and complexity of microservices in warehouse-scale data centers has grown
precipitously over the last few years to support a growing user base and an evolving product …

BOLT: a practical binary optimizer for data centers and beyond

M Panchenko, R Auler, B Nell… - 2019 IEEE/ACM …, 2019 - ieeexplore.ieee.org
Performance optimization for large-scale applications has recently become more important
as computation continues to move towards data centers. Data-center applications are …

Mira: A program-behavior-guided far memory system

Z Guo, Z He, Y Zhang - Proceedings of the 29th Symposium on …, 2023 - dl.acm.org
Far memory, where memory accesses are non-local, has become more popular in recent
years as a solution to expand memory size and avoid memory stranding. Prior far memory …

CodeScope: An execution-based multilingual multitask multidimensional benchmark for evaluating LLMs on code understanding and generation

W Yan, H Liu, Y Wang, Y Li, Q Chen, W Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large Language Models (LLMs) have demonstrated remarkable performance on assisting
humans in programming and facilitating programming automation. However, existing …

AsmDB: understanding and mitigating front-end stalls in warehouse-scale computers

G Ayers, NP Nagendra, DI August, HK Cho… - Proceedings of the 46th …, 2019 - dl.acm.org
The large instruction working sets of private and public cloud workloads lead to frequent
instruction cache misses and costs in the millions of dollars. While prior work has identified …

Classifying memory access patterns for prefetching

G Ayers, H Litz, C Kozyrakis… - Proceedings of the Twenty …, 2020 - dl.acm.org
Prefetching is a well-studied technique for addressing the memory access stall time of
contemporary microprocessors. However, despite a large body of related work, the memory …

Unleashing SmartNIC packet processing performance in P4

J Xing, Y Qiu, KF Hsu, S Sui, K Manaa… - Proceedings of the …, 2023 - dl.acm.org
SmartNICs are on the rise as a packet processing platform, with the trend towards a uniform
P4 programming model. However, unleashing SmartNIC packet processing performance in …

Beyond malloc efficiency to fleet efficiency: a hugepage-aware memory allocator

AH Hunter, C Kennelly, P Turner, D Gove… - … on Operating Systems …, 2021 - usenix.org
Memory allocation represents significant compute cost at the warehouse scale and its
optimization can yield considerable cost savings. One classical approach is to increase the …

I-SPY: Context-driven conditional instruction prefetching with coalescing

TA Khan, A Sriraman, J Devietti… - 2020 53rd Annual …, 2020 - ieeexplore.ieee.org
Modern data center applications have rapidly expanding instruction footprints that lead to
frequent instruction cache misses, increasing cost and degrading data center performance …