Understanding data storage and ingestion for large-scale deep recommendation model training: Industrial product
Datacenter-scale AI training clusters consisting of thousands of domain-specific accelerators
(DSA) are used to train increasingly-complex deep learning models. These clusters rely on a …
SoftSKU: Optimizing server architectures for microservice diversity @scale
The variety and complexity of microservices in warehouse-scale data centers has grown
precipitously over the last few years to support a growing user base and an evolving product …
BOLT: a practical binary optimizer for data centers and beyond
M Panchenko, R Auler, B Nell… - 2019 IEEE/ACM …, 2019 - ieeexplore.ieee.org
Performance optimization for large-scale applications has recently become more important
as computation continues to move towards data centers. Data-center applications are …
Mira: A program-behavior-guided far memory system
Far memory, where memory accesses are non-local, has become more popular in recent
years as a solution to expand memory size and avoid memory stranding. Prior far memory …
Codescope: An execution-based multilingual multitask multidimensional benchmark for evaluating llms on code understanding and generation
Large Language Models (LLMs) have demonstrated remarkable performance on assisting
humans in programming and facilitating programming automation. However, existing …
Asmdb: understanding and mitigating front-end stalls in warehouse-scale computers
The large instruction working sets of private and public cloud workloads lead to frequent
instruction cache misses and costs in the millions of dollars. While prior work has identified …
Classifying memory access patterns for prefetching
Prefetching is a well-studied technique for addressing the memory access stall time of
contemporary microprocessors. However, despite a large body of related work, the memory …
Unleashing SmartNIC packet processing performance in P4
SmartNICs are on the rise as a packet processing platform, with the trend towards a uniform
P4 programming model. However, unleashing SmartNIC packet processing performance in …
Beyond malloc efficiency to fleet efficiency: a hugepage-aware memory allocator
AH Hunter, C Kennelly, P Turner, D Gove… - … on Operating Systems …, 2021 - usenix.org
Memory allocation represents significant compute cost at the warehouse scale and its
optimization can yield considerable cost savings. One classical approach is to increase the …
I-spy: Context-driven conditional instruction prefetching with coalescing
Modern data center applications have rapidly expanding instruction footprints that lead to
frequent instruction cache misses, increasing cost and degrading data center performance …