Better, Faster Systems with Heterogeneous Design
AUG 05, 2016 16:18 PM


By Paul Blinzer, John Glossner, Jarmo Takala






Heterogeneous systems typically consist of a CPU, GPU and other accelerator devices, all of which are integrated in a single platform with a shared high-bandwidth memory system. The special accelerators are used to obtain both power and performance benefits. While the shared memory system eliminates data copies between the CPU and accelerator-owned memory, the different memory access models for most of the accelerators (e.g. for cache coherency) may still result in “expensive” software synchronization overhead when passing data between the different system components.

In this post, we’ll see how the Heterogeneous System Architecture (HSA) specification addresses the key issues in using accelerator execution units for processing. In a later post, we’ll discuss how programmers can implement the standard.

Intermediaries can diminish gains

Typically, accelerators rely on a software driver application programming interface (API) as an intermediary. This greatly increases the overhead of using accelerator functions, so enough work must be queued up on the accelerator to amortize that software overhead. While GPUs have demonstrated extraordinary gains in compute performance, in the range of several teraflops, the additional overhead caused by the software APIs diminishes some of these gains.
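
To make the amortization argument concrete, here is a rough back-of-the-envelope model in Python. The figures (10 microseconds of driver overhead per dispatch and a 4-teraflops peak rate) are illustrative assumptions rather than measurements, and `effective_rate` is a hypothetical helper introduced only for this sketch.

```python
def effective_rate(flops_per_dispatch, overhead_s, peak_flops_per_s):
    """Achieved FLOP/s when every dispatch pays a fixed software overhead."""
    compute_s = flops_per_dispatch / peak_flops_per_s
    return flops_per_dispatch / (overhead_s + compute_s)


# Assumed, illustrative numbers: 10 us of driver/API overhead per dispatch
# and an accelerator that peaks at 4 teraflops.
OVERHEAD_S = 10e-6
PEAK = 4e12

for work in (1e6, 1e8, 1e10):
    rate = effective_rate(work, OVERHEAD_S, PEAK)
    print(f"{work:.0e} FLOPs per dispatch -> {rate / PEAK:.1%} of peak")
```

Under these assumptions, a small dispatch reaches only a few percent of the peak rate, while a dispatch ten thousand times larger comes close to it. Shrinking the fixed overhead is what lets smaller units of work become worth offloading.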

The need for a heterogeneous computing standard has been driven by the popularity of general-purpose GPUs (GPGPUs), but it also applies to other programmable accelerators. One of the first computing platforms and programming models for GPGPUs was Compute Unified Device Architecture (CUDA), developed by NVIDIA. However, this is a proprietary interface. Several industry players have introduced open APIs for heterogeneous computing, such as DirectCompute extensions for GPGPU computing on Windows, and Renderscript for heterogeneous computing on Android. Similarly, the Khronos Group announced an open standard framework for heterogeneous computing, the OpenCL standard, which supports both task-level and data-level parallelism and allows platform control and program execution on compute devices.

Yet a pervasive challenge in heterogeneous computing relates to memory organization and the need to copy data structures between various local memories. OpenCL, for instance, does not define its memory model to the level that HSA does: it still depends on memory handles in version 1.2 and on runtime-allocated memory in a large part of the 2.0 implementations. Few OpenCL 2.x implementations (AMD’s is one of them) even support shared virtual memory (SVM) via OpenCL, especially the fine-grain access that is equivalent to the HSA paradigm, and AMD’s implementation leverages HSA features to do so. The HSA specification addresses this problem while aiming to become a royalty-free industry standard for heterogeneous computing.
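
The difference between copying data into accelerator-owned memory and sharing one address space can be sketched with a small simulation. This is only a model in plain Python: `CopyingDevice` and `SharedMemoryDevice` are hypothetical classes invented for illustration and do not correspond to any OpenCL or HSA API.

```python
class CopyingDevice:
    """Models an accelerator with its own memory: data is staged in and out."""

    def __init__(self):
        self.bytes_copied = 0

    def scale(self, host_buf, factor):
        device_buf = bytearray(host_buf)          # copy host -> device
        self.bytes_copied += len(device_buf)
        for i, value in enumerate(device_buf):
            device_buf[i] = (value * factor) % 256
        host_buf[:] = device_buf                  # copy device -> host
        self.bytes_copied += len(device_buf)


class SharedMemoryDevice:
    """Models shared virtual memory: the device works on the host's buffer."""

    def __init__(self):
        self.bytes_copied = 0

    def scale(self, host_buf, factor):
        view = memoryview(host_buf)               # same storage, no copy
        for i in range(len(view)):
            view[i] = (view[i] * factor) % 256


data_a = bytearray(range(8))
data_b = bytearray(range(8))
copying, shared = CopyingDevice(), SharedMemoryDevice()
copying.scale(data_a, 3)
shared.scale(data_b, 3)
print(data_a == data_b, copying.bytes_copied, shared.bytes_copied)
```

Both paths produce the same result, but the copying path moves every byte twice. The sketch leaves out the synchronization that real hardware still needs when the memory access models differ.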

The HSA specifications define platform requirements for virtual memory, memory coherency, architected dispatch mechanisms, and power-efficient signals. The architecture refers to accelerators as kernel agents and is designed to reduce or eliminate software overhead in performance-critical dispatch paths. Together, these definitions dramatically reduce the overhead and latency of dispatching work to the accelerator.
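
One way to picture an architected dispatch mechanism is as a queue that the application writes packets into directly, paired with a signal that the agent updates on completion. The following is a simplified, single-threaded Python sketch of that idea. All class names and the packet layout are hypothetical and are not the HSA runtime API; on real hardware the agent runs concurrently with the host rather than inside a function call.

```python
from collections import namedtuple

# Hypothetical packet layout: what to run, its argument, a completion signal.
Packet = namedtuple("Packet", "kernel arg completion")


class Signal:
    """A shared counter that the producer sets and the agent decrements."""

    def __init__(self, value):
        self.value = value


class UserModeQueue:
    """Fixed-size ring buffer written by the application, read by the agent."""

    def __init__(self, size):
        self.slots = [None] * size
        self.write_index = 0   # advanced by the application
        self.read_index = 0    # advanced by the kernel agent

    def enqueue(self, packet):
        if self.write_index - self.read_index == len(self.slots):
            raise RuntimeError("queue full")
        self.slots[self.write_index % len(self.slots)] = packet
        self.write_index += 1


class KernelAgent:
    """Stands in for the accelerator's packet processor."""

    def __init__(self, queue):
        self.queue = queue

    def ring_doorbell(self):
        # Drain every packet published so far, with no driver call in between.
        q = self.queue
        while q.read_index < q.write_index:
            packet = q.slots[q.read_index % len(q.slots)]
            packet.kernel(packet.arg)
            packet.completion.value -= 1   # mark this packet as done
            q.read_index += 1


results = []
queue = UserModeQueue(size=4)
agent = KernelAgent(queue)
done = Signal(2)                           # two packets outstanding
queue.enqueue(Packet(lambda x: results.append(x * x), 3, done))
queue.enqueue(Packet(lambda x: results.append(x + 1), 10, done))
agent.ring_doorbell()
print(results, done.value)
```

In this model the producer only writes to memory and rings a doorbell, so the dispatch path contains no driver call. The completion signal reaching zero tells the producer that both packets have finished.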

The design allows high-level compilers and data-parallel and managed runtimes to target the accelerator hardware directly, without the translation steps typically needed to interface with a high-level API in the dispatch. The architecture also allows compute kernels running on the accelerator to efficiently call back to the host for OS services such as file I/O, networking and similar functions that typically would not be available, thereby allowing the accelerator to operate as a true peer of the host CPU.

We’ve outlined some features of the HSA system architecture, including the use of agents, kernels and the runtime, but we’ve yet to address the programmer’s model for using the architecture. In our next post, we’ll dive deeper into the programmer’s model and the HSA Runtime Specification. We’ll also discuss mapping HSA agent concepts to modern DSP accelerators, using HSAIL implementations of Finite Impulse Response (FIR) filters to remove memory loads.


Paul Blinzer works on a wide variety of Platform System Software architecture projects, and specifically on the Heterogeneous System Architecture (HSA) System Software, at Advanced Micro Devices, Inc. (AMD) as a Fellow in the System Software group. He lives in the Seattle, WA area, and since the early '90s he has worked in various roles on system-level driver development, system software development, graphics architecture, and graphics and compute acceleration. Paul is the chairperson of the "System Architecture Workgroup" of the HSA Foundation. He has a degree in Electrical Engineering (Dipl.-Ing) from TU Braunschweig, Germany.

Jarmo Takala received his M.Sc. (hons) degree in Electrical Engineering and Dr. Tech. degree in Information Technology from Tampere University of Technology, Tampere, Finland (TUT). From 1992 to 1995, he was a Research Scientist at Tampere-based VTT-Automation. Between 1995 and 1996, he was a Senior Research Engineer at Nokia Research Center in Tampere. From 1996 to 1999, he was a Researcher at TUT. Since 2000, he has been a professor of computer engineering at TUT, and he is currently Dean of the Faculty of Computing and Electrical Engineering. From 2007 to 2011, he was Associate Editor and Area Editor for IEEE Transactions on Signal Processing, and from 2012 to 2013 he was the Chair of the IEEE Signal Processing Society's Design and Implementation of Signal Processing Systems Technical Committee. He is currently Co-Editor-in-Chief of the Journal of Signal Processing Systems.

Dr. John Glossner is president of the HSA Foundation. He also serves as CEO of Optimum Semiconductor Technologies, Inc., dba General Processor Technologies, the US division of China-based Wuxi DSP in partnership with Beijing-based Hua Xia GPT. Dr. Glossner received his Ph.D. in Electrical Engineering from TU Delft in the Netherlands, M.S. degrees in electrical engineering and engineering management from NTU, and holds a B.S.E.E. degree from Penn State. He has published over 120 articles and has been issued 36 patents.

