AutoFDO: Automatic feedback-directed optimization for warehouse-scale applications

Understanding data storage and ingestion for large-scale deep recommendation model training: Industrial product

M Zhao, N Agarwal, A Basant, B Gedik, S Pan… - Proceedings of the 49th …, 2022 - dl.acm.org

Datacenter-scale AI training clusters consisting of thousands of domain-specific accelerators
(DSA) are used to train increasingly-complex deep learning models. These clusters rely on a …

Cited by 78 · Related articles · All 4 versions

SoftSKU: Optimizing server architectures for microservice diversity @scale

A Sriraman, A Dhanotia, TF Wenisch - Proceedings of the 46th …, 2019 - dl.acm.org

The variety and complexity of microservices in warehouse-scale data centers has grown
precipitously over the last few years to support a growing user base and an evolving product …

Cited by 131 · Related articles · All 3 versions

BOLT: A practical binary optimizer for data centers and beyond

M Panchenko, R Auler, B Nell… - 2019 IEEE/ACM …, 2019 - ieeexplore.ieee.org

Performance optimization for large-scale applications has recently become more important
as computation continues to move towards data centers. Data-center applications are …

Cited by 134 · Related articles · All 5 versions

Mira: A program-behavior-guided far memory system

Z Guo, Z He, Y Zhang - Proceedings of the 29th Symposium on …, 2023 - dl.acm.org

Far memory, where memory accesses are non-local, has become more popular in recent
years as a solution to expand memory size and avoid memory stranding. Prior far memory …

Cited by 12 · Related articles · All 3 versions

CodeScope: An execution-based multilingual multitask multidimensional benchmark for evaluating LLMs on code understanding and generation

W Yan, H Liu, Y Wang, Y Li, Q Chen, W Wang… - arXiv preprint arXiv …, 2023 - arxiv.org

Large Language Models (LLMs) have demonstrated remarkable performance on assisting
humans in programming and facilitating programming automation. However, existing …

Cited by 13 · Related articles · All 4 versions

AsmDB: Understanding and mitigating front-end stalls in warehouse-scale computers

G Ayers, NP Nagendra, DI August, HK Cho… - Proceedings of the 46th …, 2019 - dl.acm.org

The large instruction working sets of private and public cloud workloads lead to frequent
instruction cache misses and costs in the millions of dollars. While prior work has identified …

Cited by 100 · Related articles · All 15 versions

Classifying memory access patterns for prefetching

G Ayers, H Litz, C Kozyrakis… - Proceedings of the Twenty …, 2020 - dl.acm.org

Prefetching is a well-studied technique for addressing the memory access stall time of
contemporary microprocessors. However, despite a large body of related work, the memory …

Cited by 75 · Related articles · All 5 versions

Unleashing SmartNIC packet processing performance in P4

J Xing, Y Qiu, KF Hsu, S Sui, K Manaa… - Proceedings of the …, 2023 - dl.acm.org

SmartNICs are on the rise as a packet processing platform, with the trend towards a uniform
P4 programming model. However, unleashing SmartNIC packet processing performance in …

Cited by 12 · Related articles · All 9 versions

Beyond malloc efficiency to fleet efficiency: a hugepage-aware memory allocator

AH Hunter, C Kennelly, P Turner, D Gove… - … on Operating Systems …, 2021 - usenix.org

Memory allocation represents significant compute cost at the warehouse scale and its
optimization can yield considerable cost savings. One classical approach is to increase the …

Cited by 42 · Related articles · All 5 versions

I-SPY: Context-driven conditional instruction prefetching with coalescing

TA Khan, A Sriraman, J Devietti… - 2020 53rd Annual …, 2020 - ieeexplore.ieee.org

Modern data center applications have rapidly expanding instruction footprints that lead to
frequent instruction cache misses, increasing cost and degrading data center performance …

Cited by 50 · Related articles · All 9 versions