Affinity-based thread and data mapping in shared memory systems

M Diener, EHM Cruz, MAZ Alves, POA Navaux… - ACM Computing …, 2016 - dl.acm.org
Shared memory architectures have recently experienced a large increase in thread-level
parallelism, leading to complex memory hierarchies with multiple cache memory levels and …

MemProf: A Memory Profiler for NUMA Multicore Systems

R Lachaize, B Lepers, V Quéma - 2012 USENIX Annual Technical …, 2012 - usenix.org
Modern multicore systems are based on a Non-Uniform Memory Access (NUMA) design.
Efficiently exploiting such architectures is notoriously complex for programmers. One of the …

Handling the problems and opportunities posed by multiple on-chip memory controllers

M Awasthi, DW Nellans, K Sudan… - Proceedings of the 19th …, 2010 - dl.acm.org
Modern processors such as Tilera's Tile64, Intel's Nehalem, and AMD's Opteron are
migrating memory controllers (MCs) on-chip, while maintaining a large, flat memory address …

NUMA-aware algorithms: the case of data shuffling

Y Li, I Pandis, R Mueller, V Raman, GM Lohman - CIDR, 2013 - pandis.net
In recent years, a new breed of non-uniform memory access (NUMA) systems has emerged:
multi-socket servers of multicores. This paper makes the case that data management …

Memory system performance in a NUMA multicore multiprocessor

Z Majo, TR Gross - Proceedings of the 4th Annual International …, 2011 - dl.acm.org
Modern multicore processors with an on-chip memory controller form the base for NUMA
(non-uniform memory architecture) multiprocessors. Each processor accesses part of the …

A tool to analyze the performance of multithreaded programs on NUMA architectures

X Liu, J Mellor-Crummey - ACM Sigplan Notices, 2014 - dl.acm.org
Almost all of today's microprocessors contain memory controllers and directly attach to
memory. Modern multiprocessor systems support non-uniform memory access (NUMA): it is …

A study on communication issues for systems-on-chip

CA Zeferino, ME Kreutz, L Carro… - … . 15th Symposium on …, 2002 - ieeexplore.ieee.org
Present days cores composing a system-on-chip might be interconnected by means of both
dedicated channels or shared buses. Nevertheless, future systems will have strong …

Locality-centric data and threadblock management for massive GPUs

M Khairy, V Nikiforov, D Nellans… - 2020 53rd Annual IEEE …, 2020 - ieeexplore.ieee.org
Recent work has shown that building GPUs with hundreds of SMs in a single monolithic chip
will not be practical due to slowing growth in transistor density, low chip yields, and …

Switched real-time ethernet with earliest deadline first scheduling protocols and traffic handling

H Hoang, M Jonsson, U Hagstrom… - Proceedings 16th …, 2002 - ieeexplore.ieee.org
There is a strong interest of using the cheap and simple Ethernet technology for industrial
and embedded systems. This far, however, the lack of real-time services has prevented this …

A data-centric profiler for parallel programs

X Liu, J Mellor-Crummey - … of the International Conference on High …, 2013 - dl.acm.org
It is difficult to manually identify opportunities for enhancing data locality. To address this
problem, we extended the HPCToolkit performance tools to support data-centric profiling of …