Mira: A program-behavior-guided far memory system

Z Guo, Z He, Y Zhang - Proceedings of the 29th Symposium on …, 2023 - dl.acm.org
Far memory, where memory accesses are non-local, has become more popular in recent
years as a solution to expand memory size and avoid memory stranding. Prior far memory …

On-the-fly elimination of dynamic irregularities for GPU computing

EZ Zhang, Y Jiang, Z Guo, K Tian, X Shen - ACM SIGPLAN Notices, 2011 - dl.acm.org
The power-efficient massively parallel Graphics Processing Units (GPUs) have become
increasingly influential for general-purpose computing over the past few years. However …

The evicted-address filter: A unified mechanism to address both cache pollution and thrashing

V Seshadri, O Mutlu, MA Kozuch… - Proceedings of the 21st …, 2012 - dl.acm.org
Off-chip main memory has long been a bottleneck for system performance. With increasing
memory pressure due to multiple on-chip cores, effective cache utilization is important. In a …

Program locality analysis using reuse distance

Y Zhong, X Shen, C Ding - ACM Transactions on Programming …, 2009 - dl.acm.org
On modern computer systems, the memory performance of an application depends on its
locality. For a single execution, locality-correlated measures like average miss rate or …

FLAT: An optimized dataflow for mitigating attention bottlenecks

SC Kao, S Subramanian, G Agrawal… - Proceedings of the 28th …, 2023 - dl.acm.org
Attention mechanisms, primarily designed to capture pairwise correlations between words,
have become the backbone of machine learning, expanding beyond natural language …

Scalable kernel fusion for memory-bound GPU applications

M Wahib, N Maruyama - SC'14: Proceedings of the …, 2014 - ieeexplore.ieee.org
GPU implementations of HPC applications relying on finite difference methods can include
tens of kernels that are memory-bound. Kernel fusion can improve performance by reducing …

A general probabilistic framework for worst case timing analysis

M Orshansky, K Keutzer - Proceedings of the 39th Annual Design …, 2002 - dl.acm.org
The traditional approach to worst-case static-timing analysis is becoming unacceptably
conservative due to an ever-increasing number of circuit and process effects. We propose a …

Array regrouping and structure splitting using whole-program reference affinity

Y Zhong, M Orlovich, X Shen, C Ding - ACM SIGPLAN Notices, 2004 - dl.acm.org
While the memory of most machines is organized as a hierarchy, program data are laid out
in a uniform address space. This paper defines a model of reference affinity, which …

Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPU

B Wu, Z Zhao, EZ Zhang, Y Jiang, X Shen - ACM SIGPLAN Notices, 2013 - dl.acm.org
The performance of Graphic Processing Units (GPU) is sensitive to irregular memory
references. Some recent work shows the promise of data reorganization for eliminating non …

Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping

EZ Zhang, Y Jiang, Z Guo, X Shen - Proceedings of the 24th ACM …, 2010 - dl.acm.org
Because of their tremendous computing power and remarkable cost efficiency, GPUs
(graphic processing unit) have quickly emerged as a kind of influential platform for high …