CODA: Enabling Co-location of Computation and Data for Multiple GPU Systems
ACM Transactions on Architecture and Code Optimization (TACO), 2018
To exploit the parallelism and scalability of multiple GPUs in a system, it is critical to place compute and data together. However, two key techniques that traditional GPU systems use to hide memory latency and improve thread-level parallelism (TLP), namely memory interleaving and thread block scheduling, are at odds with efficient use of multiple GPUs. Distributing data across multiple GPUs to improve overall memory bandwidth utilization incurs high remote traffic when the data and compute are misaligned. Nondeterministic thread block scheduling, which improves compute resource utilization, impedes co-placement of compute and data. Our goal in this work is to enable co-placement of compute and data in the presence of fine-grained interleaved memory with a low-cost approach.
To this end, we propose a mechanism that identifies exclusively accessed data and places it in the same GPU as the thread block that accesses it. The key ideas are that (1) the amount of data exclusively used by a thread block can be estimated, and that exclusive data (of any size) can be localized to one GPU with coarse-grained interleaved pages; (2) an affinity-based thread block scheduling policy can co-place compute and data; and (3) a dual address mode, with lightweight changes to virtual-to-physical page mappings, lets us selectively choose different interleaved memory pages for each data structure. Our evaluations across a wide range of workloads show that the proposed mechanism improves performance by 31% and reduces remote traffic by 38% over a baseline system.
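The interaction between interleaving granularity and thread block placement can be illustrated with a toy model. The sketch below is not the paper's implementation: the GPU count, chunk and page sizes, the per-block data layout, and the `misaligned` and `affinity` schedules are all assumptions chosen for illustration. It only counts how many accesses land on a GPU other than the one running the thread block.

```python
# Toy model of remote traffic in a multi-GPU system.
# Every constant and policy here is an illustrative assumption,
# not a parameter taken from the paper.
NUM_GPUS = 4
FINE_GRAIN = 256     # bytes per fine-grained interleave chunk (assumed)
PAGE_SIZE = 4096     # bytes per coarse-grained interleaved page (assumed)
BLOCK_BYTES = 4096   # exclusive data touched by each thread block (assumed)
ACCESS_SIZE = 64     # bytes per memory access (assumed)
NUM_BLOCKS = 64


def home_gpu(addr, granularity):
    """GPU that owns `addr` when memory is interleaved round-robin
    across GPUs in units of `granularity` bytes."""
    return (addr // granularity) % NUM_GPUS


def misaligned(block):
    """Hypothetical locality-oblivious schedule: the block runs on a GPU
    chosen without regard to where its data lives."""
    return (block + 1) % NUM_GPUS


def affinity(block):
    """Hypothetical affinity-based schedule: the block runs on the GPU
    that owns the coarse-grained page holding its exclusive data."""
    return home_gpu(block * BLOCK_BYTES, PAGE_SIZE)


def remote_fraction(granularity, schedule):
    """Fraction of accesses served by a GPU other than the one running
    the accessing thread block."""
    remote = total = 0
    for block in range(NUM_BLOCKS):
        gpu = schedule(block)
        base = block * BLOCK_BYTES
        for addr in range(base, base + BLOCK_BYTES, ACCESS_SIZE):
            total += 1
            remote += home_gpu(addr, granularity) != gpu
    return remote / total


if __name__ == "__main__":
    print("fine-grained, misaligned schedule:", remote_fraction(FINE_GRAIN, misaligned))
    print("coarse pages, misaligned schedule:", remote_fraction(PAGE_SIZE, misaligned))
    print("coarse pages, affinity schedule:  ", remote_fraction(PAGE_SIZE, affinity))
```

In this toy setup, fine-grained interleaving spreads each block's data evenly over all four GPUs, so three quarters of its accesses are remote wherever the block runs. Coarse-grained pages alone do not help if the block lands on the wrong GPU; remote accesses disappear only when coarse-grained placement is paired with a schedule that follows the data. The numbers are properties of the model and say nothing about the 31% and 38% figures reported in the evaluation.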