Improving effective bandwidth through compiler enhancement of global cache reuse

Z Guo, Z He, Y Zhang - Proceedings of the 29th Symposium on …, 2023 - dl.acm.org

Far memory, where memory accesses are non-local, has become more popular in recent
years as a solution to expand memory size and avoid memory stranding. Prior far memory …

Cited by 12 · Related articles · All 3 versions

On-the-fly elimination of dynamic irregularities for GPU computing

EZ Zhang, Y Jiang, Z Guo, K Tian, X Shen - ACM SIGPLAN Notices, 2011 - dl.acm.org

The power-efficient massively parallel Graphics Processing Units (GPUs) have become
increasingly influential for general-purpose computing over the past few years. However …

Cited by 253 · Related articles · All 12 versions

The evicted-address filter: A unified mechanism to address both cache pollution and thrashing

V Seshadri, O Mutlu, MA Kozuch… - Proceedings of the 21st …, 2012 - dl.acm.org

Off-chip main memory has long been a bottleneck for system performance. With increasing
memory pressure due to multiple on-chip cores, effective cache utilization is important. In a …

Cited by 167 · Related articles · All 20 versions

Program locality analysis using reuse distance

Y Zhong, X Shen, C Ding - ACM Transactions on Programming …, 2009 - dl.acm.org

On modern computer systems, the memory performance of an application depends on its
locality. For a single execution, locality-correlated measures like average miss rate or …

Cited by 203 · Related articles · All 12 versions

Flat: An optimized dataflow for mitigating attention bottlenecks

SC Kao, S Subramanian, G Agrawal… - Proceedings of the 28th …, 2023 - dl.acm.org

Attention mechanisms, primarily designed to capture pairwise correlations between words,
have become the backbone of machine learning, expanding beyond natural language …

Cited by 41 · Related articles · All 6 versions

Scalable kernel fusion for memory-bound GPU applications

M Wahib, N Maruyama - SC'14: Proceedings of the …, 2014 - ieeexplore.ieee.org

GPU implementations of HPC applications relying on finite difference methods can include
tens of kernels that are memory-bound. Kernel fusion can improve performance by reducing …

Cited by 116 · Related articles · All 7 versions

A general probabilistic framework for worst case timing analysis

M Orshansky, K Keutzer - Proceedings of the 39th Annual Design …, 2002 - dl.acm.org

The traditional approach to worst-case static-timing analysis is becoming unacceptably
conservative due to an ever-increasing number of circuit and process effects. We propose a …

Cited by 277 · Related articles · All 12 versions

Array regrouping and structure splitting using whole-program reference affinity

Y Zhong, M Orlovich, X Shen, C Ding - ACM SIGPLAN Notices, 2004 - dl.acm.org

While the memory of most machines is organized as a hierarchy, program data are laid out
in a uniform address space. This paper defines a model of reference affinity, which …

Cited by 189 · Related articles · All 20 versions

Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPU

B Wu, Z Zhao, EZ Zhang, Y Jiang, X Shen - ACM SIGPLAN Notices, 2013 - dl.acm.org

The performance of Graphic Processing Units (GPU) is sensitive to irregular memory
references. Some recent work shows the promise of data reorganization for eliminating non …

Cited by 127 · Related articles · All 11 versions

Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping

EZ Zhang, Y Jiang, Z Guo, X Shen - Proceedings of the 24th ACM …, 2010 - dl.acm.org

Because of their tremendous computing power and remarkable cost efficiency, GPUs
(graphic processing unit) have quickly emerged as a kind of influential platform for high …

Cited by 148 · Related articles · All 11 versions