Beyond the roofline: Cache-aware power and energy-efficiency modeling for multi-cores

A Ilic, F Pratas, L Sousa - IEEE Transactions on Computers, 2016 - ieeexplore.ieee.org
To foster the energy-efficiency in current and future multi-core processors, the benefits and
trade-offs of a large set of optimization solutions must be evaluated. For this purpose, it is …

CIMAR, NIMAR, and LMMA: Novel algorithms for thread and memory migrations in user space on NUMA systems using hardware counters

R Laso, OG Lorenzo, JC Cabaleiro, TF Pena… - Future Generation …, 2022 - Elsevier
This paper introduces two novel algorithms for thread migrations, named CIMAR (Core-
aware Interchange and Migration Algorithm with performance Record–IMAR–) and NIMAR …

An extended roofline model with communication-awareness for distributed-memory HPC systems

D Cardwell, F Song - Proceedings of the International Conference on …, 2019 - dl.acm.org
Performance modeling of parallel applications on distributed memory systems is a
challenging task due to the effects of CPU speed, memory access time, and communication …

Using an extended Roofline Model to understand data and thread affinities on NUMA systems

OG Lorenzo, TF Pena, JCC Domínguez… - Annals of Multicore …, 2014 - dialnet.unirioja.es
Today's microprocessors include multicores that feature a diverse set of compute cores and
onboard memory subsystems connected by complex communication networks and …

Using performance attributes for managing heterogeneous memory in HPC applications

B Goglin, AR Proaño - 2022 IEEE International Parallel and …, 2022 - ieeexplore.ieee.org
The complexity of memory systems has increased considerably over the past decade.
Supercomputers may now include several levels of heterogeneous and non-uniform …

Multiobjective optimization technique based on monitoring information to increase the performance of thread migration on multicores

OG Lorenzo, TF Pena, JC Cabaleiro… - 2014 IEEE …, 2014 - ieeexplore.ieee.org
Multicore systems present on-board memory hierarchies and communication networks that
influence their performance when they execute shared memory parallel codes …

Performance debugging toolbox for binaries: sensitivity analysis and dependence profiling

F Gruber - 2019 - theses.hal.science
Debugging, as usually understood, revolves around finding and removing defects in
software that prevent it from functioning correctly. That is, when one talks about bugs and …

Performance analysis of applications in the context of architectural rooflines

B Norris, W Spear, A Malony - Proceedings of the 8th ACM/SPEC on …, 2017 - dl.acm.org
Intuitive visual representations of architecture capabilities and the performance of
applications are critical to enabling effective performance analysis, which in turn guides …

LBMA and IMAR2: Weighted lottery based migration strategies for NUMA multiprocessing servers

R Laso, OG Lorenzo, FF Rivera… - Concurrency and …, 2021 - Wiley Online Library
Multicore NUMA systems present on‐board memory hierarchies and communication
networks that influence performance when executing shared memory parallel codes …

MD-Roofline: A Training Performance Analysis Model for Distributed Deep Learning

T Miao, Q Wu, T Liu, P Cui, R Ren… - 2022 IEEE Symposium …, 2022 - ieeexplore.ieee.org
Due to the bulkiness and sophistication of the Distributed Deep Learning (DDL) systems, it
leaves an enormous challenge for AI researchers and operation engineers to analyze …