Name: Ask the Experts: Heterogeneous and GPU Compute with AMD’s Manju Hegde
Item: Ask the Experts: Heterogeneous and GPU Compute with AMD’s Manju Hegde
Author: Anand Lal Shimpi

Ask the Experts: Heterogeneous and GPU Compute with AMD’s Manju Hegde

by Anand Lal Shimpi on 5/14/2012 3:46 PM EST

Post Your Comment
Please log in or sign up to comment.

Comments Locked

101 Comments

Back to Article

B3an - Monday, May 14, 2012 - link
I have some questions for Manju...

1: Could an OS use GPU compute in the future to speed up everyday tasks, apart from the usual stuff like the UI? What possible tasks would this be? And is it possible we'll see this happen within the next few years?

2: Are you excited about Microsofts C++ Accelerated Massive Parallelism (AMP)? Do you think we'll see a lot more software using GPU compute now that Visual Studio 11 will include C++ AMP support?

3: Do you expect the next gen consoles to make far more use of GPU compute?
BenchPress - Monday, May 14, 2012 - link
1: Your best bet is AVX2, not GPGPU. Any 32-bit code loop with independent iterations can be speeded up by a factor of up to eight using AVX2. And since it's part of the CPU's instruction set there's no data or commands to send back and forth between the CPU and GPU. Also, you won't have to wait long for AVX2 to make an impact. Compilers are ready to support it today, and it takes very little if any developer effort.

2: It's just OpenCL is disguise. Yes it supports a few C++ constructs but it still has many of the same limitations. AVX2 doesn't impose any limitations. In fact you can use it with any programming language you like.

3: I'd rather hope next gen consoles have AVX2 or similar technology (i.e. a vector equivalent of every scalar instruction, including gather).
Omoronovo - Monday, May 14, 2012 - link
That wasn't really what he had asked, though. Is there anything stopping AVX2 AND GPGPU being used in parallel to speed up *more* tasks than either one combined? This is the focus (and direction) of AMD's current work into heterogeneous compute, and it remains a question that has been fully answered.

I would love to see the day where simple everyday business tasks like running Excel have the same kind of integrated gpu compute ability as, say, web browsers have gained over the last few years. I personally am of the opinion that we are still a long way from that, but AMD seems to be betting on the "wathershed" moment for it happening a lot sooner.
Omoronovo - Monday, May 14, 2012 - link
I apologize for my typing and spelling errors, I hit reply before reading it over, and forgot AnandTech has no edit capability in the inline comments.
BenchPress - Monday, May 14, 2012 - link
AVX2 is more versatile than GPGPU, and just as powerful. So why would you want them both? We could just have a homogeneous CPU with more cores instead. Of course that makes TSX another piece of critical technology, and AVX-1024 will be required to lower the power consumption. But it's obvious that GPGPU has no future when the CPU can incorporate the same technology as the GPU.

AMD is betting on something that will never happen. Developers are very reluctant to invest time and money on technology from one small vendor. The ROI is very low and will decline over time. The CPU and GPU have been growing closer together, and the next step is to merge them together. AVX2 is a major step toward that, making it a safe bet for developers to support.
SleepyFE - Tuesday, May 15, 2012 - link
I'm not sure you noticed but more cores is the problem. Not everything is a server and even in servers power consumption matters. For desktop processors you still only have 4 core from Intel, and they don't seem too keen on making 6 or 8 core parts. The GCN is good because SIMD is paired into larger units, that might allow more flexibility. If you don't need as many just split the units in 2 and you can have 2 apps running on one physical unit. SIMD is already in CPU-s but AMD put them in a GPU, when they could just put more SIMD-s on the CPU and try to make the system recognize them as a GPU.
Power gating my friend. If the SIMD is on the CPU core it has to run with the core (i think). So here it is power gating and flexibility. And they can probably move the AVX to the GPU as well.
BenchPress - Tuesday, May 15, 2012 - link
Yes, too many cores is a problem currently, but that's precisely why Haswell adds TSX technology!

Sandy Bridge features power gating for the upper lane of AVX. So there's no waste when not using it.

And no, AVX cannot move to the GPU. It's an integral part of x86 and moving all of it over to the GPU would simply turn the GPU into more CPU cores.

The only remaining problem with AVX2 is high power consumption. Not from the ALUs, but from the rest of the pipeline. But this can be fixed with AVX-1024 by executing them in four cycles on 256-bit units. This allows clock gating large parts of the pipeline for 3/4 of the time and lowers switching activity elsewhere.
A5 - Monday, May 14, 2012 - link
AVX2 is nice, but it isn't the solution to all of these problems.

For one, it is Intel only, and will only be available on Haswell and later CPUs. Considering that all MMX, SSEn, etc "required" was a compiler update and new hardware as well, you can look at those for your implementation timelines in normal applications (aka a couple years at best).

GPGPU is good for now because it works on existing hardware (there are far more compute-capable GPUs than Haswell processors at the moment...).
BenchPress - Monday, May 14, 2012 - link
AVX2 is not Intel only. AMD has already implemented AVX support and will add AVX2 support as soon as possible.

Furthermore, you can't compare AVX2 to MMX and SSE. The latter two are 'horizontal' SIMD instruction set extensions. They're only suitable for explicit vector math. AVX2 on the other hand is a 'vertical' SIMD instruction set extension. It is highly suitable for the SPMD programming model also used by GPUs. It allows you to write scalar code and just have multiple loop iterations execute in parallel. It's a whole new paradigm for CPUs.

So it will be adopted extremely fast. It is instantly applicable to any GPGPU workload, but more flexible in every way. Meanwhile NVIDIA has crippled GPGPU performance in the GTX 6xx series so developers are not inclined to rely on the GPU for generic computing. One small manufacturer offering APUs isn't going to change that. Intel has the upper hand here.

AMD has to embrace homogeneous computing to stand a chance. With hardware quickly becoming bandwidth limited, ease of programmability will be a primary concern. GPGPU is horrendous in this regard. It's currently impossible to write code which runs well on GPUs from each manufacturer. AVX2 won't suffer from this because it has no heterogeneous bottlenecks, low instruction and memory latencies, great data locality, a large call stack, etc.
Jaybus - Thursday, May 17, 2012 - link
There is little doubt that coding for AVX2 is less complex than coding for GPGPU. But what would happen if those bandwidth bottlenecks were drastically mitigated? Say Intel gets their on chip silicon photonics stuff working and enables a chip-to-chip optical bus with a subsequent orders of magnitude increase of bandwidth. Would it still be better to have all linked chips have identical cores? Or would it be better to have a mix, where all cores of a particular chip were homogenous, but each chip may have a different type of core? II can see advantages to both, but for programming and OS scheduling, a bunch of like cores are certainly simpler.
BenchPress - Thursday, May 17, 2012 - link
That's an interesting thought experiment, but I don't think AMD should be hoping for some miracle technology to save HSA. On-chip optical interconnects won't be viable for the consumer market for at least another 10 years, and heterogeneous computing will run into bandwidth walls long before that. And it remains to be seen whether the bandwidth offered by optical technology will make the whole issue go away or just postpone it a little longer.

Secondly, the issue isn't just bandwidth, but also latency. Light travels only slightly faster than an electrical signal in copper, and that's without accounting for transmitter and receiver latency. So while a homogeneous CPU core can switch between scalar code and vector code from one cycle to the next, for a heterogeneous architecture it still takes a lot of time to send some data over and signal the processing.
BenchPress - Monday, May 14, 2012 - link
I don't think HSA is going to work. With Haswell we'll have AVX2 which brings key GPU technology right into the CPU cores. And the CPU is way more suitable for generic computing anyway thanks to its large caches and out-of-order execution. With AVX2 there's also no overhead from round-trip delays or bandwidth or APIs. Future extensions like AVX-1024 would totally eradicate the chances of the GPU ever becoming superior at general purpose computing without sacrificing a lot of graphics performance.
MrSpadge - Monday, May 14, 2012 - link
Think big: there could be a couple of "AVX-1024" FPUs.. which could just as well be used as shaders by the GPU. That's the true fusion.
BenchPress - Monday, May 14, 2012 - link
HSA uses a specific binary format that is not compatible with VEX (the encoding format used by AVX2 and AVX which is an extension of x86). So it's not going to support AVX-1024.

But yes, AVX-1024 could be used for shader processing. It's just not going to be heterogeneous. It's a homogeneous part of the CPU's micro-architecture and instruction set.
codedivine - Monday, May 14, 2012 - link
According to publically available information, HSA is also not tied to a specific binary format and will be JIT-compiled to the actual ISA.

Given that AMD already supports AVX, FMA4 and will support FMA3 going forward, (i.e. most of the AVX functionality), I expect that they will support AVX through HSA just fine.

Please stop your misinformed posts.
BenchPress - Tuesday, May 15, 2012 - link
Duh, of course it can be JIT-compiled. But JIT-compilation doesn't solve the actual problem. We've had JIT-compiled throughput computing for many years and it got us nowhere...

The real problem is heterogeneous computing itself. You just can't get good performance by moving work between generic computing units with different instruction sets.

It's going to be quite ironic when HSA actually runs better on an Intel CPU with AVX2.
Penti - Monday, May 14, 2012 - link
Actually Trinity/Piledriver uses a normal FMA3 AVX, when HNI will come into effect at AMD too is anyones guess. Some of the FMA3 will be compatible with Haswell and Piledriver. For the HSA virtual ISA you just need to tune the LLVM backend to the Intel processors or simply compile for Intel to begin with. It's not something tide to hardware any way. That is not all what HSA is though. HSAIL isn't really a toolkit or API either. Not yet at least. It won't really replace all the other tools.
TeXWiller - Monday, May 14, 2012 - link
Heterogeneous computing is ultimately not about using graphics cores as vector processors but about the resource utilization of the whole chip assuming a parallel workload with a set of sequential sections. Without an open abstraction layer the developer would have less choice on those "wimpy" throughput cores and the additional accelerators in the system might be left underutilized.
I'm personally expecting Intel and AMD to have a selection of wide and narrow cores on a same x86 chip, executing sequential and parallel sections respectively. Otherwise scaling would hit a concrete wall in the near future, even assuming a software with 95-97% executable in a parallel way.
BenchPress - Tuesday, May 15, 2012 - link
You're assuming that code can be strictly categorized as sequential or parallel. This is never the case. You're always losing performance when forcing code to run on either a scalar CPU or a parallel GPU.

A CPU with AVX2 simply combines the best of both worlds. No need to move data around to process it by another core. Switch instantly between sequential and parallel code, without synchronization overhead.
TeXWiller - Tuesday, May 15, 2012 - link
I'm simply assuming that Amdahl's law is still in force in the future. It sounds like your are creating a false dilemma with the data moving argument. A heterogeneous model as it emerges in the future is precisely going to this direction apparently for Nvidia as well.
BenchPress - Wednesday, May 16, 2012 - link
How is Amdahl's Law an argument in favor of heterogeneous computing? It tells us that when you scale up the parallel processing, you get diminishing returns. So there's also a need to focus on sequential processing speed.

GPUs are incredibly slow at processing a single thread. They just achieve high throughput by using many threads. That's not a good thing in light of Amdahl's Law. Even more so since the explicit parallelism is finite. And so it's really no coincidence that the pipeline length of GPUs has been shortening ever since they became programmable, to improve the latencies and lower the number of threads.

Please observe that this means the GPU is slowly evolving in the direction of a CPU architecture, where instruction latencies are very low, and caches, prefetching and out-of-order execution ensure that every thread advances as fast as possible so you only need a few and don't suffer from limited parallelism.

This convergence hasn't been slowing down. So it's obvious that in the future we'll end up with devices which combine the advantages of the CPU and GPU into one. With AVX2 that future isn't very far away any more.
TeXWiller - Friday, May 18, 2012 - link
Say you want to have 1024 threads at your disposal and have limited chip resources. Integrating 256 or more SB level cores on a single chip while using low power is quite difficult.
Instead if you have 16 high performance threads of computing with 4 to 8 very wide cores with all the vector goodness that fits in the power budgets combined with 246 Larrabee style of narrow cores with very good energy efficiency, you can have the cake and eat it as well, so to speak.
Heterogeneous computing is all about using parallel cores for parallel problems and powerful sequential cores for sequential problems. Scaling simply stops otherwise. The concept of heterogeneous computing does not imply anything about the actual ISAs used, instead it does imply a model of computation. This seems to be the true issue in this discussion.
BenchPress - Friday, May 18, 2012 - link
You're offering a solution in search of a problem. Nobody "wants" 1024 threads. In an ideal world we'd have one thread per process.

I'm afraid though you're confused about the difference between a CPU thread and a GPU thread. You have to carefully distinguish between threads, warps, wavefronts, strands, fibers, bundles, tiles, grids, etc. For instance by using Intel's terminology, a quad-core Haswell CPU will have no problem running 8 threads, 64 fibers and 2048 strands. In fact you can freely choose the number of fibers per thread and the number of strands per fiber. The optimal amount depends on the kernel's register count and available cache space. But a higher strand count definitely doesn't automatically equal higher performance.

Likewise, a high number of cores is never the right answer. You have to balance core count, SIMD width, and issue width. And when GPU manufacturers give you a 'compute core' count, they multiply all these numbers. Using this logic, mainstream Haswell will have 64 compute cores (at three times the clock frequency of most GPUs).

And this is why CPUs are actually much closer in compute density compared to GPUs, than the marketing terminology might have you conclude.
TeXWiller - Saturday, May 19, 2012 - link
I was talking only about hardware parallelism and the problems of scaling software once the parallelism available on single a chip exceeds certain limit. I wasn't talking anything about programming frameworks or zones of locality.
You don't sound like you would believe Intel's MIC will ever be a useful solution for HPC. That would be ironic considering your other posts. The ideal world where we have single thread per process we also have that 10GHz Pentium 4, right?
BenchPress - Sunday, May 20, 2012 - link
Intel's MIC is aimed at supercomputers, which is very different from the consumer market. The problem size an run times are both many orders of magnitude larger. So they require a very different architecture. The MIC will do fine in the HPC market, but it's not an architecture suited for consumer software.

Consumer software is a complex mix of ILP, DLP and TLP, and can switch between them on a microsecond scale or less. So the hardware has to cater for each type of parallelism. GPUs have a strong focus on DLP and can do a bit of TLP. CPUs can deal with ILP and TLP and next year they'll add DLP to the list using AVX2.

Beyond that the core count will go up again, and we'll have TSX to efficiently synchronize between them.
iwod - Tuesday, May 15, 2012 - link
I think BenchPress actually got it right. Unless someone could explain to me otherwise.

Hardware without Software means nothing. And just by looking at GPU compute adoption you can tell after all these years Nvidia trying to push for it, it is only going to take off in HPC space where software re engineering cost outweigh traditional method.

Someday, sometimes GPGPU will eventually be so powerful that we simply can not ignore it. But as x86 continues to evolve with AVX adding even more FP power, taking advantage of this continual improvement with x86 is going to be much much easier for developers.

Ofcoz, that is assuming Intel will be pushing AVX forward for this. Otherwise there simply isn't enough incentive to rewrite apps in OpenCL.

I have a feeling that even Apple has sort of abandon ( or slow down ) with OpenCL.
GullLars - Tuesday, May 15, 2012 - link
I think he is partially correct. Having a wide vector unit within the out-of-order domain and connected to the flexible CPU caches will allow acceleration of sequential code for instructions and methods within a program. However making use of explicit parallelism for larger heavy tasks which are massively or embarrisingly parallell will allow for a more efficient speedup.

Hetrogenous computing also allows easier addition of modular special purpose accelerators which may be powergated. Intel's quick-sync, and AES-NI are examples of the performance and power-efficiency of such accelerators.

As geometries become smaller, transistors become cheaper and power-density increases. Dedicating die area to accelerators that may be powergated, and having a large GPGPU area which can run at a lower lock and voltage and/or be clock/power gated in sections will help manage thermal limitations. Thermals will likely become a problem of geometry (active transistor density and distribution), not just total chip powerconsumption.
Denithor - Monday, May 14, 2012 - link
Guess that's the big question - what other regions of software are going to benefit substantially from being able to use GPU acceleration?

I was asking like a week ago in the forums if anyone thought we'd see physics for games show up being run on the iGPU (either on Intel or AMD processors). Especially in cases where a user already has a powerful discrete GPU, is there any advantage to buying a CPU with an on-die GPU as well or are those going to be just extra baggage for most power users?
Denithor - Monday, May 14, 2012 - link
And one other question, these days the drive in CPU is to lower power/heat generation. Compared to discrete GPU where it's nice to use less power but that's not really as much of a driver as increased performance.

I imagine that ntegrated GPU gains a serious advantage from sharing a cache at some level with the CPU - making workflow much more efficient.

However, for these integrated GPUs to seriously challenge discrete cards, I think they are going to have to push the power consumption up significantly. Currently we don't mind using 70-100+W for the CPU and another 150-300W for the discrete GPU. Are there any plans to release a combined CPU+iGPU that will use like 200W or more? Or is the iGPU going to continue to be held back to a minimal performance level by power concerns?
Matias - Monday, May 14, 2012 - link
When will we see GPU usage in parallel tasks such as file compression/decompression (RAR, mp3, flac), database management (SQL), even booting Windows etc? Sorry if my questions are too simplistic or ignorant.
jamyryals - Monday, May 14, 2012 - link
While we already see some of these tasks being GPU accelerated, they are by and large experimental or extremely expensive to implement.

My modification to your question would be, when will these tasks be easily accessible to perform these tasks to a non-specialized developer? Only when this happens will this technology become ubiquitous.
BenchPress - Monday, May 14, 2012 - link
Heterogeneous computing will never be easy, which is one of the main reasons why homogeneous computing using AVX2 will prevail. It offers the same parallel computing advantages, without the disadvantages. Any programming language can use AVX2 to speed up vectorizable work, without the developer even having to know about it.

Also, AVX2 will be supported by every Intel processor from Haswell forward, and AMD will have no other choice but to support it as well soon after. So few developers will be inclined to support a proprietary architecture that is harder to develop for.
Denithor - Monday, May 14, 2012 - link
Well, the GPU is already going to be there so why not find some use for it? For gamers & workstations with discrete GPU the iGPU is just going to go waste otherwise...
BenchPress - Tuesday, May 15, 2012 - link
Only a fraction of systems will have an IGP and a discrete GPU. Also, they'll come in widely varying configurations. This is a nightmare for developers.

Things become much easier when the GPU concentrates on graphics alone (whether discrete or integrated), and all generic computing is handled by the CPU. NVIDIA has already realized that and backed away from GPGPU to focus more on graphics.

AVX2 will be available in every Intel CPU from Haswell forward, and AMD will soon follow suit. And with a modest quad-core/module you'll be getting 500 GFLOPS of very flexible and highly efficient throughput computing power.

I know it seems tempting that if you have three processors you give each of them a purpose, but it's just way too hard to ensure good performance on each configuration. Concentrating on AVX2 (and multi-threading with TSX) will simply give developers higher returns with less effort.
MJEvans - Monday, May 14, 2012 - link
I have heard that there are ways of leveraging parallelism in more common programming and scripting languages; and there are also more implicitly parallel languages such as Erlang, but that might be too high level in some aspects for approaching the tasks that GPUs are best at.

What sort of open platforms is AMD participating in, or even spearheading that would be useful for developers who are more familiar with more traditional/common languages?

Are there any older languages you'd recommend developers experiment with for fun and educational purposes to help refine thought and use patterns?

If the best performance might be had by combining existing language inspirations in to a new set of programming/scripting languages can you please link to more information about these new languages?

Finally, hardware support under diverse operating systems: I know the answer is somewhat of a paradox; the impression (and reality) in gaming is that even when using the binary AMD driver under Linux there's less performance than windows users see on similar hardware, I worry that this might also extended in to higher end workstation and super-computing applications with the same hardware; from a marketing perspective would it not make sense to allocate slightly more resources for supporting the latest updates to the popular windowing systems in a timely manor, to support the latest hardware (with the binary driver) on the date of release (and ensure the community driver gets sufficient documentation and helpful hints to get feature parity sooner rather than later) and to make sure that benchmarks using OpenCL or whatever tools you expose the massively parallel processing tool to programmers with benchmark well under all operating systems?
codedivine - Monday, May 14, 2012 - link
1. One of the big problems on Windows using GPU computing is Timeout Detection and Recovery. If a GPU is also driving the display, then that GPU is essentially limited to only small kernels, of say around 2 seconds of length. Will this get better in the future?
Basically, will the GPU be able to seamlessly context-switch between computing apps, UI rendering and 3D apps etc in a seamless fashion?

2. Will we see good performance fp64 support for more consumer GPUs? Will the GPU side of APUs ever get fp64?

3. AMD's OpenCL implementation currently does not expose all the GPU capabilities. For example, no function pointers even though GCN hardware supports it (if I am understanding correctly).

4. Will we see a new Firestream product? Also, why hasnt AMD pushed APUs in HPC more?
BenchPress - Monday, May 14, 2012 - link
1. Not an issue with AVX2.

2. You get great FP64 performance with AVX2.

3. Any programming feature can still be supported when compiling for AVX2.
codedivine - Monday, May 14, 2012 - link
I am well aware of AVX2. I didn't ask about that. GPUs, especially discrete GPUs, continue to hold massive advantage when it comes to floating point performance and AVX2 will not change that a whole lot. Also, as already pointed out, HSA is not about CPU vs GPU, but rather CPU+GPU so I am not sure why you keep comparing the two.

Would be great if you could just focus on the thread.
BenchPress - Tuesday, May 15, 2012 - link
GPUs will only have a 2x advantage in theoretical compute density over CPU cores with AVX2. I wouldn't call this a massive advantage, especially since the GPU suffers badly from having only a tiny amount of cache space per thread. Furthermore, discrete GPUs perform horribly due to the round-trip delay and limited CPU-GPU bandwidth.

This really is about CPU vs. GPU, because CPU+GPU has no long term future and the GPU can't exist on its own. Hence the only possible outcome is that GPU technology will be merged into the CPU. Maybe we shouldn't call this a CPU any more, but it's definitely not a heterogeneous APU.
shawkie - Tuesday, May 15, 2012 - link
What about memory bandwidth? The latest GPUs have 4GB of device memory running at 200GB/s.
BenchPress - Tuesday, May 15, 2012 - link
GPUs are very wasteful with bandwidth. They have very little cache space per thread, and so they're forced to store a lot of things in RAM and constantly read it and write things back.

CPUs are way more efficient because they process threads much faster and hence they need fewer, resulting in high amounts of cache space per thread. This in turn gives it very high cache hit rates, which has lower latency, consumes less power, and offers higher net bandwidth.

In other words, higher RAM bandwidth for GPUs doesn't actually make them any better at extracting effective performance from it. Also, CPUs will still have DDR4 and beyond once required, while GPUs are already pushing the limits and will have to resort to bigger caches in the near future, effectively sacrificing computing density and becoming more like a CPU.

Last but not least, the APU is limited by the socket bandwidth, so it's GPU has no bandwidth advantage over the CPU.
DarkUltra - Tuesday, May 15, 2012 - link
1. WDDM 1.2 require preemtive multitasking, so the gpu should never be clogged up anymore. Threads will be swapped in and out very quickly.
BenchPress - Tuesday, May 15, 2012 - link
What makes you think that switching contexts can be done quickly? There's way more register state to be stored/restored, buffers to be flushed, caches to be warmed, etc. than on a CPU.
suty455 - Monday, May 14, 2012 - link
i just need to understand why AMD made such a poor job of the latest CPUs even with Win 8 they lag so far behind intel its crazy, is the unified approach ever going to allow AMD to leap the gap to intels processors, and what kind of influence do you have with the major software houses eg MS to get the unified processor used to its fullest extent to actually make a difference in real world usage and not just benchmarks?

i ask as a confirmed AMD fan who frankly can no longer ignore the massive performance increase i can get from swapping to intel.
Jedibeeftrix - Monday, May 14, 2012 - link
1. When will AMD be able to demonstrate a real competence at GPU compute, vis-a-vis Nvidia and its CUDA platform, by having its GPU's able to properly function as a render source for Blender/CYCLES?

2. What steps are necessary to get it there?

----------------------------

Blender (and its new CYCLES GPU renderer) is a poster-child for the GPU compute world.
It already runs on OpenCL, however, only properly on the CPU or via Nvidia CUDA.
Blender themselves are already trying to make OpenCL the default platform because it would be a cross-platform and cross-architecture solution, however, on AMD it does not function adequately .

What are you doing to help the development of AMD-GPU on OpenCL with the Blender foundation?
With what driver release would you hope Catalyst will reach an acceptable level of functionality?
With what driver release would you hope Catalyst will reach broad parity with Nvidia/CUDA?
Is the upcoming AMD OpenCL APP SDK v1.2 a part of this strategy?

Above all; when will my 7970 rock at CYCLES?

Kind regards
palladium - Monday, May 14, 2012 - link
At the moment the CPU and GPU are relatively independant of each other in terms of operations, and both enjoy an (almost) equal area in terms of die space. Do you expect in the near future for AMD to head in a similar direction as the Cell processor (in PS3), where the CPU handles the OS and passing on most of the intensive calculations over to the GPU?
SilthDraeth - Monday, May 14, 2012 - link
Just saying. These questions are for the AMD guy, and this benchpress guy comes in here spamming AVX2 to answer all the questions posed to AMD.

Makes you go hmmm...
palladium - Monday, May 14, 2012 - link
yes, very suspicious indeed.
BenchPress - Tuesday, May 15, 2012 - link
I just want what's best for all of us: homogeneous computing.

Computing density is increasing quadratically, but bandwidth only increases linearly. Hence computing has to be done as locally as possible. It's inevitable that sooner or later the CPU and GPU will fully merge (they've been converging for many years). So HSA has no future, while AVX2 is exactly the merging of GPU technology into the CPU.

Gather used to be a GPU exclusive feature, giving it a massive benefit, but now it's part of AVX2.
_vor_ - Wednesday, May 16, 2012 - link
Give it a rest guy. People would like to hear from AMD.
Fergy - Wednesday, May 16, 2012 - link
So why not but a cpu in the gpu? If you are worried about round trips and caches.
BenchPress - Wednesday, May 16, 2012 - link
Because to make a GPU run sequential workloads efficiently it would need lots of CPU technology like out-of-order execution and a versatile cache hierarchy, which sacrifices more graphics performance than people are willing to part with. The CPU itself however is a lot closer to becoming the ideal general purpose high throughput computing device. All it needs is wide vectors with FMA and gather: AVX2. It doesn't have to make any sacrifices for other workloads.

AVX2 is also way easier to adopt by software developers (including compiler developers like me). And even if AMD puts hundreds of millions of dollars into HSA's software ecosystem (which I doubt) to make it a seamless experience for application developers (i.e. just switching a compiler flag), it's still going to suffer from fundamental heterogeneous communication overhead which makes things run slower than the theoretical peak. Figuring out why that happens takes highly experienced engineers, again costing companies lots of money. And some of that overhead just can't be avoided.

Last but not least, AVX2 will be ubiquitous in a few years from now, while dedicated HSA will only be available in a minority of systems. The HSA roadmap even shows that the hardware won't be complete before 2014, and then they still have to roll out all of the complex software to support it. AVX2 compiler support on the other hand is in beta today, for all major platforms and programming languages/frameworks.
hwhacker - Monday, May 14, 2012 - link
Love this open dialogue, thanks Manju/Anand.

What balance of Radeon cores do you see as a pertinent mix to execute fp128 and 256-bit instructions? Is one 64sp unit realistic, or does the unit need to be comparably larger (or a multiple) to justify it's allocation within the multipurpose nature not only within the APU but across discrete GPU product lines that may also use the same DNA?

What are the obstacles in the transition from the current FPU unit(s) within bulldozer CPUs to such a design? Clockspeed/unit pairings per transistor budget that may mesh better on future process nodes, for example?
mrdude - Monday, May 14, 2012 - link
1 - The recent Kepler design has shown that there might be a chasm developing between how AMD and nVidia treat desktop GPUs. While GCN showed that it can deliver both fantastic compute performance (particularly on supported openCL tasks), it also weighs in heavier than Kepler and lags behind in terms of gaming performance. The added vRAM, bus width and die space for the 7970 allow for greater compute performance but at a higher cost; is this the road ahead and will this divide only further broaden as AMD pushes ahead? I guess what I'm asking is: Can AMD provide both great gaming performance as well as compute without having to sacrifice by increasing the overall price and complexity of the GPU?

2 - It seems to me that HSA is going to require a complete turnaround for AMD as far as how they approach developers. Personally speaking, I've always thought of AMD as the engineers in the background who did very little to reach out and work with developers, but now in order to leverage the GPU as a compute tool in tasks other than gaming it's going to require a lot of cooperation with developers who are willing to put in the extra work. How is AMD going about this? and what apps will we see being transitioned into GPGPU in the near future?

3 - Offloading FP-related tasks to the GPU seems like a natural transition for a type of hardware that already excels in such tasks, but was HSA partly the reason for the single FPU in a Bulldozer module compared to the 2 ALUs?

4 - Is AMD planning to transition into an 'All APU' lineup for the future, from embedded to mobile to desktop and server?
ToTTenTranz - Tuesday, May 15, 2012 - link
This I'm also really interested in knowing.
Especially the 3rd question.

It seems Bulldozer/Piledriver sacrificed quite a bit of parallel FP performance.
Does this mean tha HSA's purpose is to have only a couple of powerful FP units for some (rare ?) FP128 workloads while leaving the rest of the FP calculations (FP64 and below) for the GPU? Will that eventually be completely transparent for a developer?

And please, will someone just kick the spamming avx dude?
A5 - Monday, May 14, 2012 - link
What is AMD doing to make OpenCL more pleasant to work with?

The obvious deficiency at the moment is the toolchain, and (IMO) the language itself is more difficult to work with for people who are not experienced with OpenGL. As someone with a C background, I was able to get a basic CUDA program running in under 1/3rd of the time it took me to get the same program implemented and functional in OpenCL.
ltcommanderdata - Monday, May 14, 2012 - link
1. WinZip and AMD have been promoting their joint venture in implementing OpenCL hardware accelerated compression and decompression. Owning an AMD GPU I appreciate it. However, it's been reported that WinZip's OpenCL acceleration only works on AMD CPUs. What is the reasoning behind this? Isn't it hypocritical, given AMD's previous stance against proprietary APIs, namely CUDA, that AMD would then support development of a vendor specific OpenCL program?

2. This may be related to the above situation. Even with standardized, cross-platform, cross-vendor APIs like OpenCL, to get the best performance, developers would need to do vendor specific, even device generation within a vendor specific optimizations. Is there anything that can be done whether at the API level, at the driver level or at the hardware level to achieve the write-once, run-well anywhere ideal?

3. Comparing the current implementations of on-die GPUs, namely AMD Llano and Intel Sandy Bridge/Ivy Bridge, it appears that Intel's GPU is more tightly integrated with CPU and GPU sharing the last level cache for example. Admittedly, I don't believe CPU/GPU data sharing is exposed to developers yet and only available to Intel's driver team for multimedia operations. Still, what are the advantages and disadvantages of allowing CPUs and GPUs to share/mix data? I believe memory coherency is a concern. Is data sharing the direction that things are eventually going to be headed?

4. Related to the above, how much is CPU<>GPU communications a limitation for current GPGPU tasks? If this is a significant bottleneck, then tightly integrated on-die CPU/GPUs definitely show their worth. However, the amount of die space that can be devoted to an IGP is obviously more limited than what can be done with a discrete GPU. What can then be done to make sure the larger computational capacity of discrete GPUs isn't wasted doing data transfers? Is PCIe 3.0 sufficient? I don't remember if memory coherency was adopted for the final PCIe 3.0 spec, but would a new higher speed bus, dedicated to coherent memory transfers between the CPU and discrete GPU be needed?

5. In terms of gaming, when GPGPU began entering consumer consciousness with the R500 series, GPGPU physics seemed to be the next big thing. Now that highly programmable GPUs are common place and the APIs have caught up, mainstream GPGPU physics is no where to be found. It seems the common current use cases for GPGPU in games is to decompress textures and to do ambient occlusion. What happened to GPGPU physics? Did developers determine that since multi-core CPUs are generally underutilized in games, there is plenty of room to expand physics on the CPU without having to bother with the GPU? Is GPGPU physics coming eventually? I could see concerns about contention between running physics and graphics on the same GPU, but given most CPUs are coming integrated with a GPGPU IGP anyways, the ideal configuration would be a multi-core CPU for game logic, an IGP as a physics accelerator, and a discrete GPU for graphics.
ltcommanderdata - Tuesday, May 15, 2012 - link
As a follow up, it looks like the just released Trinity brings improved CPU/GPU data sharing as per question 3 above. Maybe you could compare and contrast Trinity and Ivy Bridge's approach to data sharing and give an idea of future directions in this area?
GullLars - Monday, May 14, 2012 - link
My question:
Will the GPGPU acceleration mainly improve embarrassingly parallel and compute bandwidth constrained applications, or will it also be able to accelerate smaller pieces of work that are parallel to a significant degree.
And what is the latency associated with branching off and running a piece of code on the parallel part of the APU? (f.ex. as a method called by a program to work on a large set of independent data in parallel)
j1o2h3n4 - Monday, May 14, 2012 - link
Under the impression that OPENCL for 3D Rendering is finally as fast as CUDA, Really? If yes what rendering systems can we use?

Nvidia has exclusivity on big names of GPU renderers like IRAY, VRAY, OCTANE, ARION, these companies has taken years developing & optimizing CUDA and under their requirements only Nvidia is mentioned. By going AMD we are giving all that up, what foreseeable effort does AMD take to boost this market?
Ashkal - Monday, May 14, 2012 - link
I am just layman but I suggest instead of having more and more cores and transistor counts
just reduce transistor count and optimise for computing use with respect to power it is as good as not having power reserve meter in rollsroyce and huge engine or Honda with selective V8 engine, but Suzukis 660cc 60BHP 26km/lt engine or like that.
sfooo - Tuesday, May 15, 2012 - link
What are your thoughts on OpenCL and its adoption rate, or lack thereof? How about DirectCompute?
caecrank - Tuesday, May 15, 2012 - link
When is ansys going to start using amd hardware again? When can we expect to see an apu that can beat a Tesla on memory size and match it in terms of performance?
Can we also please have a 4 or 8 memory channel bulldozer? Fp performance on it is quite good, the only thing stopping us from adopting it is the max memory capacity. (sandy bridge e)

J
Loki726 - Tuesday, May 15, 2012 - link
AMD Fellow Mike Mantor has a nice statement that I believe captures the core difference between GPU and CPU design.

"CPUs are fast because they include hardware that automatically discovers and exploits parallelism (ILP) in sequential programs, and this works well as long as the degree of parallelism is modest. When you start replicating cores to exploit highly parallel programs, this hardware becomes redundant and inefficient; it burns power and area rediscovering parallelism that the programmer explicitly exposed. GPUs are fast because they spend the least possible area and energy on executing instructions, and run thousands of instructions in parallel."

Notice that nothing in here prevents a high degree of interoperability between GPU and CPU cores.

1) When will we see software stacks catch up with heterogeneous hardware? When can we target GPU cores with standard languages (C/C++/Objective-C/Java), compilers(LLVM, GCC, MSVS), and operating systems (Linux/Windows)? The fact that ATI picked a different ISA for their GPUs than x86 is not an excuse; take a page out of ARM's book and start porting compiler backends.

2) Why do we need new languages for programming GPUs that inherit the limitations of graphics shading languages? Why not toss OpenCL and DirectX compute, compile C/C++ programs, and launch kernels with a library call? You are crippling high level languages like C++-AMP, Python, and Matlab (not to mention applications) with a laundry list of pointless limitations.

3) Where's separable compilation? Why do you have multiple address spaces? Where is memory mapped IO? Why not support arbitrary control flow? Why are scratchpads not virtualized? Why can't SW change memory mappings? Why are thread schedulers not fair? Why can't SW interrupt running threads? The industry solved these problems in the 80s. Read about how they did it, you might be surprised that the exact same solutions apply.

Please fix these software problems so we can move onto the real hard problems of writing scalable parallel applications.
BenchPress - Tuesday, May 15, 2012 - link
AMD Fellow Mike Mantor is wrong, and he'd better know it. GPUs don't run thousands of instructions in parallel. They have wide vector units which execute THE SAME instruction across each element. So we're looking at only 24 instructions executing in parallel in the case of Trinity.

CPUs have wide vector units too now. A modest quad-core Haswell CPU will be capable of 128 floating-point operations per cycle, at three times the clock frequency of most GPUs!

AVX2 is bringing a massive amount of GPU technology into the CPU. So the only thing still setting them apart is the ILP technology and versatile caches of the CPU. Those are good features to have around for generic computing.

Last but not least, high power consumption of out-of-order execution will be solved by AVX-1024. Executing 1024-bit instructions on 256-bit vector units in four cycles allows the CPU's front-end to be clock gated, and there would be much less switching frequency in the schedulers. Hence you'd get GPU-like execution behavior within the CPU, without sacrificing any of the CPUs other advantages!

GPGPU is a dead end and the sooner AMD realizes this the sooner we can create new experiences for consumers.
Loki726 - Tuesday, May 15, 2012 - link
So your main argument is that AMD should just start bolting very wide vector units onto existing CPU cores?

If you want to summarize it at a very high level like that, then sure, that is the type of design the everyone is evolving towards, and I'd say that you are mostly right.

Doing this would go a long way towards solving the problems of supporting standard SW stacks. If GPGPU means heterogeneous ISAs and programming models, then I agree with you, it should be dead. If it means efficient SIMD and multi-core harwdare, then I think that we need software that can use it yesterday, and I agree that tightly integrating it with existing CPU designs is important. AVX is a good start.

Like so many other things, though, the details are not so straightforward.
Intel already built a multi-core CPU with wide vector units. They called it Knight's Corner/Ferry/Landing. However, even they used light-weight atom-like cores with wide vector units, not complex OOO cores. Why do you think they did that? Making CPU vector units reduces some overhead, but even with that do you think they could have fit 50 Haswell cores into the same area? I'll stand by Mike's point about factoring out redundant HW. A balanced number of OOO CPU cores is closer to 2-4, not 50.

Also, I disagree with your point about a GPU only executing one instruction for each vector instruction. GPUs procedurally factor out control overhead when threads are executing the same instruction at the same time. Although it is not true for all applications, for a class of programs (the ones that GPUs are good at), broadcasting a single instruction to multiple data paths with a single vector operation really is equivalent to running different instances of that same instruction in multiple threads on different cores. You can get the same effect with vector units plus some additional HW and compiler support. Wide AVX isn't enough, but it is almost enough.

Finally, you mention that Haswell will run at three times the frequency of most GPU designs. That is intentional. The high frequency isn't an advantage. Research and designs have shown over and over again that the VLSI techniques required to hit 3Ghz+ degrade efficiency compared to more modestly clocked designs. Maybe someone will figure out how to do it efficiently in the future, but AFAIK, my statement is true for all standard library based flows, and the semi-custom layout that Intel/others sometimes use.
BenchPress - Tuesday, May 15, 2012 - link
I'm not suggesting to just "bolt on" wide vector units. One of the problems is you still need scalar units for things like address pointers and loop counters. So you want them to be as close as possible to the rest of the execution core, not tacked on to the side. Fortunately all AMD has to do in the short term is double the width of its Flex FP unit, and implement gather. This shouldn't be too much of a problem with the next process shrink.

Indeed Knight's Corner uses very wide vectors, but first and foremost let's not forget that the consumer product got cancelled. Intel realized that no matter how much effort you put into making it suitable for generic computing, there are still severe bottlenecks inherent to heterogeneous computing. LRBni will reappear in the consumer market as AVX2.

And yes I do believe Haswell could in theory compete with MICs. You can't fit the same number of cores on it, but clock speed makes up for it in large part and you don't actually want too many cores. And most importantly, GPGPU has proven to never reach anywhere near the peak performance for real-world workloads. Haswell can do more, with less, thanks to the cache hierarchy and out-of-order execution.

Would do care to explain why you think AVX2 isn't enough for SPMD processing on SIMD?
Tanclearas - Tuesday, May 15, 2012 - link
Although I do agree that there are many opportunities for HSA, I am concerned that AMD's own efforts in using heterogeneous computing have been half-baked. The AMD Video Converter has a smattering of conversion profiles, lacks any user-customizable options (besides a generic "quality" slider), and hasn't seen any update to the profiles in a ridiculously long time (unless there were changes/additions within the last few months).

It is no secret that Intel has put considerable effort into compiler optimizations that required very little effort on the part of developers to take advantage of. AMD's approach to heterogeneous computing appears simply to wait for developers to do all the heavy lifting.

The question therefore is, when is AMD going to show real initiative with development, and truly enable developers to easily take advantage of HSA? If this is already happening, please provide concrete examples of such. (Note that a 3-day conference that also invites investors is hardly a long-term, on-going commitment to improvement in this area.)
jeff_rigby - Tuesday, May 15, 2012 - link
I'm sure you can't answer direct questions so I'll try a work around:

1) Would next generation game consoles benefit from HSA efficiencies?
2) Will AMD 2014 SOC designs be available earlier to a large 3rd party with game console volume.
3) Can third party CPUs like the IBM Power line be substituted for X86 processors in AMD SOCs. Is this a better choice than a Steamroller or Jaguar core for instance? Assume that the CPU will prefetch for the GPGPU.
4) Is my assumption correct that a Cell processor, because it already has an older attempt at HSA internal to the cell makes it unsuitable for inclusion in a AMD HSA SOC but individual PPC or SPU elements could be included. This is a simplistic question and does not take into account that a GPGPU is more efficient at many tasks for which SPUs would be used.
5) If using a SOC which has a CPU and GPU internal but more GPU needed, is there going to be a line of external GPUs that use system memory and a common address scheme rather than a PCIe buss and must have it's own RAM like PC cards.
6) In the future, will 3D stacked memory replace GDDR memory in AMD GPU cards for PCs. I understand that it will be faster, more energy efficient and eventually cheaper.

Foundries questions that applies to AMD SOC "process optimized" building blocks made to consortium standards that will reduce cost and time to market for SOCs.
1) Are we going to see very large SOCs in the near future.
2) Design tools to custom create a large substrate that has bumps and traces the AMD building blocks 2.5D attach to. Mentioned in PDFs was Global Foundries needing 2.5 years lead time to design custom chips. How much lead time will be needed in the future if using AMD building blocks to build SOCs?
3) Will 3D stacked memory be inside the SOC, Outside the SOC or a combination of the two due to DRAM being temp sensitive.
4) Are a line of FPGAs part of AMD building blocks?

Thanks for any questions you can answer.

.
SleepyFE - Tuesday, May 15, 2012 - link
Where is AMD going with their APU-s. Are you trying to make it more like a CPU or a GPU that can stand on it's own. Since you used SIMD for GCN it seems to me that you are going for a more CPU like design with a large amount of SIMD-s that will be recognized as a graphics card for backward compatibility purposes (possibly implemented in the driver). But since you promote GPGPU and your APU-s have weaker CPU capabilities, do you eventually intend to make it a GPU that can handle x86 instructions (for backwards compatibility).
PrezWeezy - Tuesday, May 15, 2012 - link
It seems like with the advent of using the GPU for some tasks and the CPU for others, the biggest technical hurdle is the idea of programing something to make use of the best suited processor for the job. What are the possibilities of adding a chip, or core, on the hardware side to branch off different tasks to the processor which can complete it fastest? That way no software must be changed in order to make use of the GPGPU. Would that even be feasible? It seems like it might make a rather drastic change in the way x86 works, but I see many more possibilities if a hardware level branch happened instead of software level.
BenchPress - Tuesday, May 15, 2012 - link
No, you can't just "branch off" a CPU workload and make it run on the GPU, let alone faster.

That said, AVX2 enables compilers to do auto-vectorization very effectively. And such compilers are nearly ready today, a year before Haswell arrives. So it will take very little effort from application developers to take advantage of AVX2. And there will also be many middleware frameworks and libraries which make use of it, so you merely have to install an update to see each application that makes use of it gain performance.

So you can get the benefits of GPGPU right at the core of the CPU, with minimal effort.
kyuu - Tuesday, May 15, 2012 - link
Goddamn, what are you, some Intel/nVidia shill/shareholder? Or just a troll?

We get it. You think AVX2 rules and GPGPU drools. Cool, thanks. This is supposed to be for people who want to ask Mr. Hegde questions about GPGPU, not for people to come in and troll everyone else's questions with what comes down to, "AVX2 > GPGPU, lol idiot".

If you really want to contribute, how about posting an actual question. Y'know, like maybe one about what role, if any, GPGPU will have in the future assuming the widespread adoption of AVX2. That would be a legitimate question and much better than you trolling everyone with your AVX2 propaganda.
mrdude - Tuesday, May 15, 2012 - link
indeed. Can a mod just delete his posts please?
SleepyFE - Wednesday, May 16, 2012 - link
BenchPress must be an Intel employee or something because i can read all his posts between the lines like so:"No no, don't use GPGPU. Our GPU-s suck too much to be used for that. We will add such functions to the CPU where we have a monopoly so everyone will be forced to use it."
BenchPress - Wednesday, May 16, 2012 - link
I'm not an Intel employee. Not even close. So please don't try to make this personal when you're out of technical arguments why a homogeneous CPU with throughput computing technology can't be superior to a heterogeneous solution.

Have you seen the OpenCL Accelerated Handbrake review? That's Trinity against Intel CPUs without AVX2. Trinity still loses against the CPU codec. So Intel won't need my help selling AVX2. The technology stands for itself. And I would be hailing AMD if they implemented it first.

AVX2 will have *four* times the floating-point throughput of SSE4, and also adds support for gather which is critical for throughput computing (using OpenCL or any other language/framework of your choice). This isn't some kneejerk reaction to GPGPU. This is a carefully planned paradigm shift and these instructions will outlive any GPGPU attempt.

It's a scientific fact that computing density increases quadratically while bandwidth increases linearly. And while this can be mitigated to some extent using caches, it demands centralizing data and computation. Hence heterogeneous general purpose computing is doomed to fail sooner or later.
maximumGPU - Wednesday, May 16, 2012 - link
Well said!
BenchPress - Wednesday, May 16, 2012 - link
No, I'm not employed by any of these companies, nor am I a shareholder, nor am I troll. I'm a high performance compiler engineer and I just want to help people avoid wasting their time with technology that I strongly believe has no future in the consumer market.

I'm sorry but I won't ask Manju any direct questions. He's here only to help sell HSA, so he won't admit that AVX2+ is very strong competing technology which will eventually prevail. AMD has invested a lot of money into HSA and will be late to the homogeneous computing scene. They want to capitalize on this investment while it lasts.

If you do think HSA is the future, then please tell me which bit of GPU technology could never be merged into the CPU. The GPU already lost the unique advantage of multi-core, SMT, FMA, wide vectors, gather, etc. And AVX-1024 would bring the power consumption on par. Anything still missing to make the CPU superior at general purpose throughput computing?

Don't blame me for AMD making the wrong technology choices.
SleepyFE - Wednesday, May 16, 2012 - link
You have stated that you strongly belive that HSA is a fail and that AVX is superior. Nothing wrong with speaking your mind.
Doing so over and over and over (over times number of your posts) makes you a troll
BenchPress - Wednesday, May 16, 2012 - link
Each of my posts highlights different aspects of homogeneous versus heterogeneous throughput computing, backed by verifiable facts. So I'm doing a lot more than just sharing some uninformed opinion and repeating it. I'm trying to help each individual with their specific questions. I can't help it that the conclusion is always the same though. We have AMD to blame for that.

Manju will answer questions about individual HSA aspects, which is fine and dandy except that the *whole* concept of HSA is problematic compared to the future of AVX.

Someone has yet to come up with a strong technical argument why general purpose computing on the GPU and general purpose computing on the CPU will always be superior to merging both into one. Physics is telling us otherwise, and I didn't make the rules.

If nobody else has the guts to point that out, I will. It doesn't make me a troll.
SleepyFE - Wednesday, May 16, 2012 - link
THE TROLL strikes again
gcor - Wednesday, May 16, 2012 - link
I ask because I used to work on a Telecom's platform that used PPC chips, with vector processors that *I think* are quite analogous to GPGPU programming. We off loaded as much as possible to the vector processors (e.g. huge quantities of realtime audio processing). Unfortunately it was extremely difficult to write reliable code for the vector processors. The software engineering costs wound up being so high, that after 4-5 years of struggling, the company decided to ditch the vector processing entirely and put in more general compute hardware power instead. This was on a project with slightly less than 5,000 software engineers, so there were a lot of bodies available. The problem wasn't so much the number of people, as the number of very high calibre people required. In fact, having migrated back to generalised code, the build system took out the compiler support for the vector processing to ensure that it could never be used again. Those vector processors now sit idle in telecoms nodes all over the world.

Also, wasn't the lack of developer take up of vector processing one of the reasons why Apple gave up on PPC and moved to Intel? Apple initially touted that they had massively more compute available than Windows Intel based machines. However, in the long run no, or almost no, applications used the vector processing compute power available, making the PPC platform no advantage.

Anyway, I hope the problem isn't intrinsically too hard for mainstream adoption. It'll be interesting to see how x264 development gets through it's present quality issues with OpenCL.
BenchPress - Wednesday, May 16, 2012 - link
Any chance this is IBM's Cell processor you're talking about? Been there, done that. It's indeed largely analogous to GPGPU programming.

To be fair though HSA will have advantages over Cell, such as a unified coherent memory space. But that's not going to suffice to eliminate the increase in engineering cost. You still have to account for latency issues, bandwidth bottlenecks, register limits, call stack size, synchronization overhead, etc.

AVX2 doesn't have these drawbacks and the compiler can do auto-vectorization very effectively thanks to finally having a complete 'vertical' SIMD instruction set. So you don't need "high caliber people" to ensure that you'll get good speedups out of SPMD processing.
_vor_ - Wednesday, May 16, 2012 - link
Enough with the AVX2 Nerdrage. Seriously.
BenchPress - Wednesday, May 16, 2012 - link
What is your problem with AVX2?

If hypothetically some technology was superior to GPGPU, wouldn't you want to know about it so you can stop wasting your time with GPGPU? What if that technology is AVX2?

I'm completely open to the opinion that it's not, but I haven't seen technical arguments yet of the contrary. So please be open to possibility that GPGPU won't ever deliver on its promise and will be surpassed by homogeneous high throughput computing technology.
_vor_ - Wednesday, May 16, 2012 - link
lol. Ok seriously. Are they paying you per post?
BenchPress - Wednesday, May 16, 2012 - link
No, nobody's paying me to post here.

Please read gcor's post again. He raised very serious real world concerns about heterogeneous computing. So I'm just trying to help him and everyone else by indicating that with AVX2 we'll get the performance boost of SPMD processing without the disadvantages of a heterogeneous architecture.

Is it so hard to believe that someone might be willing to help other people without getting paid for it? I don't see why you have a problem with that.
SleepyFE - Friday, May 18, 2012 - link
How would AVX2 handle graphics processing?
BenchPress - Friday, May 18, 2012 - link
I am only suggesting using AVX2 for general purpose high throughput computing. Graphics can still be done on a highly dedicated IGP or discrete GPU.

This is the course NVIDIA is taking. With GK104, they invested less die space and power consumption on features that would only benefit GPGPU. They realize the consumer market has no need for heterogeneous computing since the overhead is high, it's hard to develop for, it sacrifices graphics performance, and last but not least the CPU will be getting high throughput technology with AVX2.

So let the CPU concentrate on anything general purpose, and let the GPU concentrate on graphics. The dividing line depends on whether you intend on reading anything back. Not reading results back, like with graphics, allows the GPU to have higher latencies and increase the computing density. General purpose computing demands low latencies and this is the CPU's strong point. AVX2 offers four times the floating-point throughput of SSE4 so that's no longer a reason to attempt to use the GPU.
SleepyFE - Saturday, May 19, 2012 - link
So you want a GPU and an extralarge CPU (to acomodate AVX)?
That is whet they do NOT want. In case you didn't notice they are trying to make better use of the resources available. With Bulldozer modules they cut the FPU-s in half. The GPU will handle showing the desktop and doing some serios computing (what idiot would play a game while waiting). That is the point. Smaller dies doing more (possibly at all times). Efficiency.
BenchPress - Saturday, May 19, 2012 - link
No, the CPU doesn't have to be extra large. AVX2 barely increases the die size. ALUs are pretty tiny these days.

And it doesn't matter what "they" want. AMD doesn't write all the applications, other companies do. It's a substantial investment to adopt HSA, which has to pay off in increased revenue. AVX2 is much easier to adopt and will be more widely supported. So from a ROI and risk perspective AVX2 wins hands down.

Also, Trinity's GPU still loses against Intel's CPUs at things like video transcoding (which it was supposed to be good at). And that's before AVX2 is even supported! So how is AMD going to be able to make the GPU catch up with CPUs supporting AVX2? Actually they have to substantially *exceed* it for HSA to make sense. How much die space will that cost? How power efficient will it be at that point?

And it gets worse. AVX2 won't be the last challenge HSA will face. There are rumors about AVX-1024 which would execute 1024-bit instructions (already mentioned by Intel in 2010) over four cycles, reducing the power consumption. So this would erase yet another reason for adopting HSA.

So you have to consider the possibility that AMD might be going in the wrong direction. Them "wanting" to cripple the CPU and beef up the iGPU isn't sufficient for developers to take the effort in supporting such an architecture, and doesn't cancel the competition's advances in CPU technology.

They need a reality check and implement AVX2 sooner rather than later, or some miracle GPGPU technology we haven't heard of yet. So I'm very curious what Manju has to say on how they will overcome *all* the obstacles.
SleepyFE - Saturday, May 19, 2012 - link
They are not going in the wrong direction. They are merging the CPU and the GPU, and since the GPU can't stan on it's own (and since their new GPU-s are more or less SIMD-s) the CPU will probably take over the GPU operations. That means that the CPU will have the GPU cores available (and the program will have to be written to make use of them), but AVX will require more ALU-s (unless these same ALU-s are in the GPU part of the chip, making this chat a non issue).

BTW when i said extralarge i meant without otherwise not added parts.

And "THEY" also stands for us consumers, because we want things thinner and faster (only possible if you make use of all available resources)
markstock - Wednesday, May 16, 2012 - link
Mr. Hegde, I have two questions which I hope you will answer.

1) To your knowledge, what are the major impediments preventing developers from thinking about this new hierarchy of computation and begin programming for heterogenous architectures?

2) AMD clearly aims to fill a void for silicon with tightly-coupled CPU-like and GPU-like computational elements, but are they only targeting the consumer market, or will future hardware be designed to also appeal to HPC users?

Thank you.

-Mark
tspacie - Thursday, May 17, 2012 - link
My question is, where does he see the market for these APUs? NVIDIA tried to get consumers interested in GPU compute and it largely flopped. As AT has shown, the main usage of GPU compute for consumers was video transcode and QuickSync does at least as good a job. There appears to be a heterogeneous compute market at the high end (Tesla / Quadro) and the very high end (Cray, Amazon cloud, etc.), but almost none in the consumer space which seems to be where the APUs are being targeted.
MySchizoBuddy - Friday, May 18, 2012 - link
What are AMDs plans for supporting compiler directive based speed ups like OpenACC supported by PGI, Cray and Nvidia
MySchizoBuddy - Friday, May 18, 2012 - link
When will APU allow both CPU and GPU to access the same memory address, without requiring any data movement for CPU to GPU compute.
BenchPress - Friday, May 18, 2012 - link
AMD previously released a roadmap for HSA indicating that the CPU and GPU would share a unified and coherent memory space in 2013: http://www.anandtech.com/show/5493/amd-outlines-hs...

This doesn't mean that it eliminates any data movement though. When the GPU reads memory that was previously written by the CPU, it still has to travel all the way from the CPU's cache into the GPU's cache. The only thing you get is that this process will be managed by the hardware, instead of having to do it explicitly in your code. The actual overhead is still there.

The only way to truly eliminate additional data movement is by using a homogeneous high throughput architecture instead, which will become available in the 2013 timeframe as well.
SleepyFE - Saturday, May 19, 2012 - link
Not sure but i think the GPU has access to L2 cache (or they plan on doing that).
c0d1f1ed - Friday, May 18, 2012 - link
Hi Manju,

I would be very grateful if you could answer these questions:

In case the GPU becomes swamped by the graphics workload and some CPU cores are un(der)utilized, will it be possible to preempt a general purpose task (say, physics) and migrate it from the GPU to the CPU? If so, to what extent would the developer be responsible for load balancing? Will the operating system be able to schedule GPU kernels the same way as CPU threads, or would the HSA software framework relief both the application developers and operating system developers of this complex task? How will you insure that the overhead of migrating the work will be lower than what you gain?

Are GPU context switches only considered to be a QoS feature (i.e. insuring that one application can't hog the GPU and make the system unresponsive), or will it also be a viable way to achieve TLP by creating more kernels than the GPU supports concurrently, and regularly switching between them? In other words, how will the context switch overhead compare to that of a CPU in terms of wall time? If suspending and resuming kernels is not recommended, what other synchronization primitives will developers have access to for structuring tasks and dependencies (something analogous to CPU fibers perhaps)? Will the GPU support hardware transactional memory one day?

It's clear that by making the GPU much more suitable for general purpose computing, some compromises will be made for graphics performance. At the same time, CPUs are gaining high throughput vector processing technology while also focusing on performance/Watt. How does AMD envision balancing (1) scalar execution on the CPU, (2) parallel execution on the CPU, (3) general-purpose execution on the GPU, and (4) graphics execution on the GPU? How do you expect it to evolve in the long term given semiconductor trends and the evolution to platform independent higher level programming?

AMD has indicated to open up HSA to the competition and the software community, obviously in the interest of making it become the dominant GPGPU technology. How will fragmentation of features be avoided? OpenCL and C++ AMP still have some way to go so we'll have many versions, which prevents creating a single coherent ecosystem in which GPGPU code can easily be shared or sold. How will HSA ensure both backward and forward compatibility, even across vendors? To what extent will hardware details be shared with developers? Or will there always be 'black box' components under tight control of AMD, preventing both taking advantage of all hardware characteristics and fixing issues by third parties?

Where do you expect integrated and discrete GPUs are heading? Integrated GPUs clearly benefit from lower latency, but in many cases lack in computing power (while also doing graphics) to outperform the CPU, while discrete GPUs can be severely held back by the PCIe bandwidth and latency. Will AMD provide powerful APUs aimed at hardcore gamers, or will all future CPUs include an IGP, sacrificing cores?

Thank you,
Nick
chiddy - Friday, May 18, 2012 - link
One area in which Intel has excelled is ensuring that it is easy for developers to use their technologies, and that the potential of their hardware is well utilized on all platforms.

On the x86 front for example the Intel Windows and Linux x86 C/C++ and Fortran compilers and profilers are considered to be some of the best available, and Intel are very active in all major x86 open source platforms ensuring that their hardware works (e.g Intel contributed heavily to the port of KVM to Illumos so VT-x/VT-d worked from the beginning).

This is a similar story on the GPU compute front with Nvidia who provide both tools and code to developers; and Intel will surely follow as indicated with the launch of their new 'Intel® SDK for OpenCL* Applications 2012' and associated marketing.

What steps will AMD take to ensure firstly that developers have the tools and materials they need to develop on AMD platforms, and more importantly to reassure both developers and purchasers alike that their hardware can run x86 and GPU compute workloads on all platforms with acceptable performance and is thus a solid investment?
biswa60 - Saturday, October 19, 2019 - link
The hardware strategy is clear: don’t just build discrete CPUs and GPUs, but instead transition to APUs.
https://certificate-template.com/
https://businessletter.org/

Ask the Experts: Heterogeneous and GPU Compute with AMD’s Manju Hegde

Post Your Comment

101 Comments

Back to Article

B3an - Monday, May 14, 2012 - link

BenchPress - Monday, May 14, 2012 - link

Omoronovo - Monday, May 14, 2012 - link

Omoronovo - Monday, May 14, 2012 - link

BenchPress - Monday, May 14, 2012 - link

SleepyFE - Tuesday, May 15, 2012 - link

BenchPress - Tuesday, May 15, 2012 - link

A5 - Monday, May 14, 2012 - link

BenchPress - Monday, May 14, 2012 - link

Jaybus - Thursday, May 17, 2012 - link

BenchPress - Thursday, May 17, 2012 - link

BenchPress - Monday, May 14, 2012 - link

MrSpadge - Monday, May 14, 2012 - link

BenchPress - Monday, May 14, 2012 - link

codedivine - Monday, May 14, 2012 - link

BenchPress - Tuesday, May 15, 2012 - link

Penti - Monday, May 14, 2012 - link

TeXWiller - Monday, May 14, 2012 - link

BenchPress - Tuesday, May 15, 2012 - link

TeXWiller - Tuesday, May 15, 2012 - link

BenchPress - Wednesday, May 16, 2012 - link

TeXWiller - Friday, May 18, 2012 - link

BenchPress - Friday, May 18, 2012 - link

TeXWiller - Saturday, May 19, 2012 - link

BenchPress - Sunday, May 20, 2012 - link

iwod - Tuesday, May 15, 2012 - link

GullLars - Tuesday, May 15, 2012 - link

Denithor - Monday, May 14, 2012 - link

Denithor - Monday, May 14, 2012 - link

Matias - Monday, May 14, 2012 - link

jamyryals - Monday, May 14, 2012 - link

BenchPress - Monday, May 14, 2012 - link

Denithor - Monday, May 14, 2012 - link

BenchPress - Tuesday, May 15, 2012 - link

MJEvans - Monday, May 14, 2012 - link

codedivine - Monday, May 14, 2012 - link

BenchPress - Monday, May 14, 2012 - link

codedivine - Monday, May 14, 2012 - link

BenchPress - Tuesday, May 15, 2012 - link

shawkie - Tuesday, May 15, 2012 - link

BenchPress - Tuesday, May 15, 2012 - link

DarkUltra - Tuesday, May 15, 2012 - link

BenchPress - Tuesday, May 15, 2012 - link

suty455 - Monday, May 14, 2012 - link

Jedibeeftrix - Monday, May 14, 2012 - link

palladium - Monday, May 14, 2012 - link

SilthDraeth - Monday, May 14, 2012 - link

palladium - Monday, May 14, 2012 - link

BenchPress - Tuesday, May 15, 2012 - link

_vor_ - Wednesday, May 16, 2012 - link

Fergy - Wednesday, May 16, 2012 - link

BenchPress - Wednesday, May 16, 2012 - link

hwhacker - Monday, May 14, 2012 - link

mrdude - Monday, May 14, 2012 - link

ToTTenTranz - Tuesday, May 15, 2012 - link

A5 - Monday, May 14, 2012 - link

ltcommanderdata - Monday, May 14, 2012 - link

ltcommanderdata - Tuesday, May 15, 2012 - link

GullLars - Monday, May 14, 2012 - link

j1o2h3n4 - Monday, May 14, 2012 - link

Ashkal - Monday, May 14, 2012 - link

sfooo - Tuesday, May 15, 2012 - link

caecrank - Tuesday, May 15, 2012 - link

Loki726 - Tuesday, May 15, 2012 - link

BenchPress - Tuesday, May 15, 2012 - link

Loki726 - Tuesday, May 15, 2012 - link

BenchPress - Tuesday, May 15, 2012 - link

Tanclearas - Tuesday, May 15, 2012 - link

jeff_rigby - Tuesday, May 15, 2012 - link

SleepyFE - Tuesday, May 15, 2012 - link

PrezWeezy - Tuesday, May 15, 2012 - link

BenchPress - Tuesday, May 15, 2012 - link

kyuu - Tuesday, May 15, 2012 - link

mrdude - Tuesday, May 15, 2012 - link

SleepyFE - Wednesday, May 16, 2012 - link

BenchPress - Wednesday, May 16, 2012 - link