Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines
Image processing pipelines combine the challenges of stencil computations and stream
programs. They are composed of large graphs of different stencil stages, as well as complex …
Flashmeta: A framework for inductive program synthesis
Inductive synthesis, or programming-by-examples (PBE), is gaining prominence with
disruptive applications for automating repetitive tasks in end-user programming. However …
The design and implementation of FFTW3
M Frigo, SG Johnson - Proceedings of the IEEE, 2005 - ieeexplore.ieee.org
FFTW is an implementation of the discrete Fourier transform (DFT) that adapts to the
hardware in order to maximize performance. This paper shows that such an approach can …
Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU
Recent advances in computing have led to an explosion in the amount of data being
generated. Processing the ever-growing data in a timely manner has made throughput …
Auto-tuning a high-level language targeted to GPU codes
Determining the best set of optimizations to apply to a kernel to be executed on the graphics
processing unit (GPU) is a challenging problem. There are large sets of possible …
BLIS: A framework for rapidly instantiating BLAS functionality
FG Van Zee, RA Van De Geijn - ACM Transactions on Mathematical …, 2015 - dl.acm.org
The BLAS-like Library Instantiation Software (BLIS) framework is a new infrastructure for
rapidly instantiating Basic Linear Algebra Subprograms (BLAS) functionality. Its fundamental …
Precimonious: Tuning assistant for floating-point precision
C Rubio-González, C Nguyen, HD Nguyen… - Proceedings of the …, 2013 - dl.acm.org
Given the variety of numerical errors that can occur, floating-point programs are difficult to
write, test and debug. One common practice employed by developers without an advanced …
Modern development methods and tools for embedded reconfigurable systems: A survey
Heterogeneous reconfigurable systems provide drastically higher performance and lower
power consumption than traditional CPU-centric systems. Moreover, they do it at much lower …
Memory coherence in shared virtual memory systems
The memory coherence problem in designing and implementing a shared virtual memory on
loosely coupled multiprocessors is studied in depth. Two classes of algorithms, centralized …
Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping
Heterogeneous multiprocessors are increasingly important in the multi-core era due to their
potential for high performance and energy efficiency. In order for software to fully realize this …