Will the A53 finally be `the big` breakthrough in the data server market you want? Google, Apple and Facebook have all been testing ARM servers for a while now and good `noises` are heard about them. Also, will the new tie-in with AMD reap its rewards sooner rather than later?
It's going to need more than 2MB cache per core to compete in the enterprise class because it's safe to say the branch prediction and cache hit rates will trail Intel.
Cache does not matter for data servers, as much as you have it is never enough to keep a significant amount of the data it needs to serve to make a substantial difference, data servers need fast storage and plenty of ram for caching frequently accessed stuff.
Cache serves computation purposes, when you need to put data on the registers and perform operations with it, because cpu cache is much faster than ram, not to mention mechanical storage, but for a data server, the latency of ram and large scale storage is not an issue. And furthermore, unless you have gigabytes of cpu cache, it won't make any difference.
Also, what branch prediction, the data server is gonna predict what data is needed? This is BS, branch prediction is important for computation, not for serving data, and for a typical data server application, a cache miss doesn't really matter, because in a data server application you will get predominantly misses, no data server uses a data set small enough to fit and stay in the cpu cache.
Your comment makes zero sense. My guess is you "heard" intel makes good prefetchers and you try to pretend to be smart... well, DON'T!
I'd recommend you research how and why enterprise class processors have historically had large caches and why they ARE important before you vouch for a mobile platform being retrofitted for use in datacenter. Any programmer or hardware engineer would disagree with your ridiculous trolling.
As I said, unless your data set is small enough to fit and stay in the cpu cache, you won't see significant improvements, and for a data server this scenario is completely out of the question.
Unlike you, I am a programmer, and a pretty low level at that (low level at the hardware that is, not at skill). Also, I know a fanboy troll when I see one, you saw max L2 cache is 2MB and made the brilliant deduction an entire CPU architecture is noncompetitive to intel, which is what is really ridiculous.
When hardware engineers design CPUs, they run very accurate performance simulations to determine the optimal cache capacity, and if that chip is capped at 2MB of L2 cache, that means increasing it any more is no longer efficient in terms of "die area/performance" ratio.
You guys talk like there can only be ONE CPU to rule them all! That's not how big guys like Facebook or Google work. They need different varieties of CPUs to handle different jobs. Low power ARMs for simple tasks like data retrieval or serving static data while conserving power usage. Enterprise CPU for data analysis and predictions. There are simply too many different tasks for just using one type of CPU if they want to run the company efficiently. A few watts of power differences could mean millions of savings.
I agree that if you are just serving data on key (data servers) then cache size doesn't matter too much (although I can imagine that being able to keep indexing structures and such in processor cache might be useful depending on the application).
But regardless of that, in my experience servers tend to do quite a bit of computing as well.
So I'm going to have to disagree with you that processor cache in general doesn't matter for enterprise servers (and agree with happycamperjack I suppose).
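For what it's worth, both sides of this argument can be played with in a toy simulation. The sketch below is only an illustration, not a model of any real CPU: it assumes a plain LRU cache, and the data set size, cache capacity and the 500-item "hot" set (standing in for index structures) are all made-up numbers.

```python
import random
from collections import OrderedDict

def lru_hit_rate(accesses, capacity):
    """Fraction of accesses served by an LRU cache holding `capacity` items."""
    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)          # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)   # evict the least recently used item
    return hits / len(accesses)

rng = random.Random(0)    # fixed seed so the run is repeatable
DATASET = 100_000         # hypothetical: distinct items, far larger than the cache
CAPACITY = 1_000          # hypothetical: the cache holds 1% of the data set
HOT = 500                 # hypothetical: small "hot" set, e.g. index structures
N = 200_000               # number of simulated accesses

# Pure data serving: every item is equally likely to be requested.
uniform = [rng.randrange(DATASET) for _ in range(N)]

# Mixed workload: 90% of accesses touch the hot set, 10% touch random data.
skewed = [rng.randrange(HOT) if rng.random() < 0.9 else rng.randrange(DATASET)
          for _ in range(N)]

uniform_rate = lru_hit_rate(uniform, CAPACITY)
skewed_rate = lru_hit_rate(skewed, CAPACITY)
print(f"uniform access hit rate: {uniform_rate:.1%}")
print(f"skewed access hit rate:  {skewed_rate:.1%}")
```

With uniform access over a data set much bigger than the cache, nearly everything misses, which is the "cache doesn't matter" case. As soon as a small hot set dominates the accesses, the same cache catches most of them, which is the "indexing structures and computation" case.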
That was some utterly *bull*. Disable the L3 cache on a Xeon class CPU and run whatever you call "server-workload", your performance will be absolutely shit compared to when the L3 is enabled, even more so if you run on Sandy Bridge or later where I/O-devices DMA directly into (and out of) L3-cache.
I concur with everyone else on this one, ddriver. You are really wrong about this. Just because a data segment might be 2GB, you still need a L3 cache for all of those machine code level commands that actually do something with that 2GB chunk of data.
In that case, there's also Cortex A57, which will be faster than anything Atom can do.
But my guess is he was referring to energy efficiency, and if the extreme energy efficiency of Cortex A53 can be more useful in servers than perhaps Cortex A57.
Hi, it would be interesting to know two things: - are the cache memories (L1/L2) write-back or write-through? Inclusive or exclusive? - are multiprocessor capabilities limited to 4 cores, or can they scale to 8+ cores without additional glue logic?
The cache coherency bus (CCN) supports a maximum of 8 cpu-id per socket. That is why. L2 cache is actually a RAM accelerator. Filling cache with data (in and out) allows for interleaved and delayed writes to slow memory at roughly the cache speed. This means an order of magnitude faster since most L2 caches get 95% hits anyway. Branch-prediction logic will reduce the stalling of the pipeline and cache misses, thus enhancing the performance. Yes, server apps needing lots of RAM means the cache size and efficiency is vitally important there ...
All cacheable attributes are supported, but Cortex-A53 is optimised around write-back, write-allocate. The L2 cache is inclusive on the instruction side and exclusive on the data side.
A Cortex-A53 cluster only supports up to 4-cores. If more than 4-cores are required in a platform then multiple clusters can be implemented and coherently connected using an interconnect such as CCI-400. The reason for not scaling to 8-cores per cluster is that the L2 micro-architecture would need to either compromise energy-efficiency in the 1-4 core range to achieve performance in the 4-8 core range, or compromise performance in the 4-8 core range to maximise energy-efficiency in the 1-4 core range. This isn’t a hard and fast rule for all clusters, but is the case for a cluster at the Cortex-A53 power/performance point. For the majority of mobile use cases it is best to focus on energy efficiency and enable more than 4-cores through multi-cluster solutions.
We have seen MediaTek introducing an 8xA7 SOC, instead of going to the big.LITTLE configuration of some sorts. Do you expect the same thing to happen with the A53 and A57 generation for low budget SOCs or will this generation's combo be a little easier and cheaper to implement?
If it includes A57, it's high-end by default. That chip you're talking about isn't big.Little, nor does it contain Cortex A15 in it. It's an 8-core Cortex A7 chip, so yes, I assume Mediatek will make another 8-core one with Cortex A53, but I wouldn't exactly call it high-end, more like mid-to high-end.
Interesting: you apparently completely misunderstood his question, yet "I assume Mediatek will make another 8-core one with Cortex A53" is what I would answer as well. 8 smaller cores are cheaper than 2*4 in big.LITTLE and do sound impressive to the uninformed.
Big cores are larger than small cores (surprise!), so the SoC will be more expensive to produce if it has big cores rather than only little cores. But then again it will be faster too.
Excluding L2, A15 is about 4 times as large as A7 (http://chip-architect.com/news/2013_core_sizes_768... So 2xA15 + 2xA7 is about the size of 10xA7, ie. larger than 8xA7. A15 will also need a larger L2 than A7 due to its higher performance.
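The arithmetic in that comparison is easy to check. A quick sketch, assuming only the "about 4 times" ratio from the comment above (L2 excluded) and normalising one A7 core to 1 area unit:

```python
A7_AREA = 1.0              # normalised: one Cortex-A7 core (excluding L2) = 1 unit
A15_AREA = 4.0 * A7_AREA   # "about 4 times as large", per the comment above

def cores_area(n_a15, n_a7):
    """Total core area in A7-equivalents, ignoring L2 and everything else on the SoC."""
    return n_a15 * A15_AREA + n_a7 * A7_AREA

big_little_2_2 = cores_area(2, 2)   # 2xA15 + 2xA7
octa_a7 = cores_area(0, 8)          # 8xA7
big_little_4_4 = cores_area(4, 4)   # 4xA15 + 4xA7

print(f"2xA15 + 2xA7 = {big_little_2_2:g} A7-equivalents")
print(f"8xA7         = {octa_a7:g} A7-equivalents")
print(f"4xA15 + 4xA7 = {big_little_4_4:g} A7-equivalents")
```

This only counts core area; the larger L2 that an A15 cluster would want is left out, so the real gap is probably bigger.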
We expect to see a range of platform configurations using Cortex-A53. A 4+4 Cortex-A53 platform configuration is fully supported and a logical progression from a 4+4 Cortex-A7 platform. A Cortex-A57 in the volume smartphone markets is less likely, but that’s a decision in the hands of the ARM partners. It will be interesting to see the range of Cortex-A53 platforms and configurations announced by partners over the coming months.
We have already seen how well Qualcomm's Cortex A7 can perform thanks to Moto G. How much will it improve with the new Cortex A53? What will be the core and performance wise difference? How will you compare it against Cortex A9, A12 and A15 in terms of performance, battery consumption and all.
With the Exynos Octa core processor Battery Test we haven't seen much battery improvements compared to Qualcomm's Snapdragon 600 and 800 Processor. How will it perform this time?
What is ARM planning to do with its Mali GPU? What will be next after Cortex A53 and A57?
Cortex-A53 has the same pipeline length as Cortex-A7 so I would expect to see similar frequencies when implemented on the same process geometry. Within the same pipeline length the design team focussed on increasing dual-issue, in-order performance as far as we possibly could. This involved symmetric dual-issue of most of the instruction set, more forwarding paths in the datapaths, reduced issue latency, larger & more associative TLB, vastly increased conditional and indirect branch prediction resources and expanded instruction and data prefetching. The result of all these changes is an increase in SPECInt-2000 performance from 0.35-SPEC/MHz on Cortex-A7 to 0.50-SPEC/MHz on Cortex-A53. This should provide a noticeable performance uplift on the next generation of smartphones using Cortex-A53.
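To put those two SPEC/MHz figures in perspective, here is a small back-of-the-envelope calculation. The 1200 MHz clock is just an assumed example frequency, not something stated in the answer; the per-MHz figures are the ones quoted.

```python
A7_SPEC_PER_MHZ = 0.35    # SPECint2000 per MHz quoted for Cortex-A7
A53_SPEC_PER_MHZ = 0.50   # SPECint2000 per MHz quoted for Cortex-A53

def estimated_specint2000(spec_per_mhz, freq_mhz):
    """Estimated score, assuming performance scales linearly with clock."""
    return spec_per_mhz * freq_mhz

# Per-clock uplift, independent of frequency.
uplift = A53_SPEC_PER_MHZ / A7_SPEC_PER_MHZ - 1.0

FREQ_MHZ = 1200           # hypothetical clock, same for both cores
a7_score = estimated_specint2000(A7_SPEC_PER_MHZ, FREQ_MHZ)
a53_score = estimated_specint2000(A53_SPEC_PER_MHZ, FREQ_MHZ)

print(f"per-clock uplift: {uplift:.0%}")
print(f"at {FREQ_MHZ} MHz: A7 ~{a7_score:.0f}, A53 ~{a53_score:.0f}")
```

So at similar frequencies the quoted numbers work out to roughly a 43% gain in SPECint2000 per clock.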
We shouldn't infer anything from there being a nice-sized gap between Cortex A53 and A57 which might be the 64-bit version of Cortex A12 which in a hypothetical universe might be named Cortex A55, should we ? :)
As someone who's worked at ARM fairly recently, plenty of activity was happening around the A53/A57 as well as a new M-class core (it's supposed to be M5 or M7, still undecided) but I never heard anything about a hypothetical mid-range A55. Right now, it's just a gap in the naming scheme, so it might be used in future.
Why are the current A7 quad core phones performing similar to the A9 quad (Exynos 4412, Tegra 3), although A9 is more advanced and OoO? What is the main difference between A5 and A7, because the A7 is just a bit faster than the A5?
Overall platform performance is dependent on many factors including processor, interconnect, memory controller, GPU, video and more. While the Cortex-A9 is a higher performance processor both in IPC and frequency, ARM partners are continuously improving their platforms and porting them to new process geometries. This allows a new generation Cortex-A7 based platform to improve on an older generation Cortex-A9 based platform.
Compared to Cortex-A5, Cortex-A7 increased load-store bandwidth, allowed more common data-processing operations to dual-issue and made some small improvements in the branch-predictors.
Why isn't there a more logical naming convention for the ARM cores? I can't tell which is faster, an A7 or an A9 core. It seems like you're getting better with the A15 being faster than an A9 or A7.
It sort of seems like A7 was just named that because they ran out of numbers. My understanding is it was designed to try to get as much of A9's performance as possible in a smaller die, and as such it should be better than A8 (and can have multi-core versions).
What is the main difference in A5, A7, A9, A12, A15, A53 and A57 for single threaded performance? Is it still worth having 4+4 big.LITTLE compared to 8x A7 ?
1. Do you think that Cortex A12 should have been announced earlier (as the gap between A9 and A15 was huge and something in between those two was required)?
2. Similar to A12, will there be anything in between A53 and A57 in the near future?
3. Instead of using a big.LITTLE config and putting in 8 cores (4 A15 and 4 A7), why can't we have efficient power gating and other innovative techniques inside the high performance cores so that they can run as efficiently as A7 cores? Die area of an individual A15 will increase but we can save some total area as only 4 A15 will be required. How difficult is it from an architecture point of view?
The answer is the same as to "Why can't Intel put a Haswell into my phone?" When you need T transistors to implement some functionality in some basic way you might get performance P at power consumption Q. Implement some clever trick which increases performance to 1.5*P. This will probably require more than 1.5*T transistors, probably more like 2*T. And hence power consumption will increase to approximately 2*Q as well. And you can't power-gate this additional power draw because you actually need those transistors to work in order to perform the desired function. That's why building ever more complex single cores comes at the price of power efficiency.
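Spelled out as numbers, the argument looks like this. The sketch takes the comment's rough figures at face value (1.5x performance for 2x transistors) and assumes, as the comment does, that power scales with transistor count; P and Q are normalised to 1.

```python
def perf_per_watt(perf, power):
    """Performance delivered per unit of power."""
    return perf / power

P, Q = 1.0, 1.0   # baseline performance and power, normalised

simple_core = perf_per_watt(P, Q)
# "Clever" core: 1.5x the performance, ~2x the transistors, hence ~2x the power
# (assumption from the comment above: power tracks transistor count).
complex_core = perf_per_watt(1.5 * P, 2.0 * Q)

relative_efficiency = complex_core / simple_core
print(f"complex core efficiency vs simple core: {relative_efficiency:.2f}x")
```

Under those assumptions the faster core delivers only three quarters of the work per joule, which is the efficiency gap that a little core is meant to close.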
When designing a processor and deciding on which performance attributes to emphasize, do you target current workloads for short term market concerns, or do you target possible future workloads for the market a year or two from now? Or is performance tuning more workload agnostic, and do you say "I want this to be fast for everything"?
For example, since ARM processors are very popular in the Android market, do you tune for content consumption and gaming? Or since Android may be trending towards more of a primary computing device in the future, is it important to tune for desktop applications?
And finally, what are the considerations of performance tuning for thermally constrained devices?
A good question. A general purpose processor has to be good on all workloads. After all, we expect to see Cortex-A53 in everything from mobile phones to cars, set-top-box to micro-servers and routers. However we do track the direction software is evolving – for example the increased use of Javascript which puts pressure on structures such as indirect predictors. Therefore we design structures within the processor to cope with future code as well as existing expected workloads.
Do you have any plans to support various forms of turbo functionality within your next generation ARM cores? As a potential example, in a 28nm quad-core A53 setup at 1.2GHz, you could support dual-core at >1.4GHz and single core at >1.6GHz within the same power consumption (core design allowing, of course), yet single threaded performance could improve significantly.
ARM cores have been historically low power, however that doesn't mean there aren't more power savings to be made. Examples include deeper sleep states, power gated cores, and so on - features that Intel and AMD have had to include in order to reduce their TDPs whereas ARM cores haven't needed them (yet). What are the future power saving methods that ARM is considering for its future cores (that you can give away)?
A Turbo mode is typically a form of voltage overdrive for brief periods of time to maximise performance, which ARM partners have been implementing on mobile platforms for many years. Whether this is applied to 1, 2 or more cores is a decision of the Operating System and the platform power management software. If there is only one dominant thread you can bet that mobile platforms will be using Turbo mode. Due to the power-efficiency of Cortex-A53 on a 28nm platform, all 4 cores can comfortably be executing at 1.4GHz in less than 750mW which is easily sustainable in a current smartphone platform even while the GPU is in operation.
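Breaking the quoted budget down per core gives a feel for the numbers. This is just arithmetic on the figures in the answer; since the answer says "less than 750mW", the results are upper bounds, and it is an assumption here that the figure is split evenly across the 4 cores.

```python
TOTAL_POWER_W = 0.750   # "less than 750mW" for the 4 cores, so an upper bound
CORES = 4
FREQ_HZ = 1.4e9         # 1.4GHz, as quoted

# Assumption: the budget is shared evenly between the cores.
per_core_w = TOTAL_POWER_W / CORES

# Energy spent per clock cycle per core, converted from joules to picojoules.
energy_per_cycle_pj = per_core_w / FREQ_HZ * 1e12

print(f"per core: under {per_core_w * 1000:.1f} mW")
print(f"per core, per cycle: under {energy_per_cycle_pj:.0f} pJ")
```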
In terms of further power saving techniques, power gating unused cores is a technique that has been used since the first Cortex-A9 platforms appeared on the market several years ago. The technique is so fundamental that I think many in the mobile industry use it automatically and forget that it’s a highly beneficial power saving technique. But you are correct that there is more mileage to come from deeper sleep states which is why both Cortex-A53 and Cortex-A57 support state retention techniques in caches and the NEON unit to further improve leakage power.
We see Qualcomm winning most designs these days with their implementation of ARM's ISAs, does ARM wish to bring back their own architecture (not just the ISA) to the front line of mobile products (consumer ones) ?
What is your view on Intel newest mobile architecture BayTrail and how do you see Intel (and others) competing with ARM on the consumer grade mobile products in the near future ?
Are there any plans for desktop/server designs for ARM cpus? (i mean with pci-e lanes, ddr4 controllers, strong FPU (AVX2 like), without baseband logic and any other bloat hardware). If yes, is there a time frame? Thank you!
Take a look at TI's KeyStone II architecture, there are already quad core A15 based chips with powerful DSP, PCIE support, multiple 10 and 1 GBit interfaces with hardware accelerated packets and security and mind-boggling on-chip interconnect capable of over 2 TBit per second.
But it is a strictly server chip, considering it doesn't even have a GPU, and it is not like consumer devices need multiport 10 GBit switches embedded.
Peter, thanks for offering your time to Anandtech!
I was curious if you could talk a bit about how easy/difficult A53-derived SoCs might be to integrate into solutions that are already using A7/A9 type chips? i.e. Devices like Beagleboards, Raspberry Pis, ODROIDs, etc. Is there anything that makes the A53 particularly difficult or easy to suit to these types of devices?
Also, for Micro and "regular" servers, do you see A57/A53 big.LITTLE being the norm, or do you anticipate a variety of A53-only and A57-only designs? Any predictions on market split between the A5x series here?
Cortex-A53 has been designed to be able to easily replace Cortex-A7. For example, Cortex-A53 supports the same bus-interface standards (and widths) as Cortex-A7, which allows a partner who has already built a Cortex-A7 platform to rapidly convert to Cortex-A53.
With servers I think we will see a mix of solutions. The most popular approach will be to use Cortex-A57 due to the performance that micro-architecture is capable of providing, but I still expect some Cortex-A53 servers and big.LITTLE too!
I would love to know ARM position towards open source GPU drivers now that Intel is putting a big amount of money and effort into developing theirs.
It seems to me that not taking the open source road for GPU drivers (as ARM is doing) is a big mistake. Even more so when the primary OS their hardware runs is Android.
ARM could create much better binaries for free with the work of other talented developers and get better integration with Android in the process. That will be a great selling point for any manufacturer and possible client! The Lima project currently has better performance (5% more) than the closed driver distributed by ARM and it is being coded in the developers' free time! It would be great to see both teams working together.
What are the reasons the driver is closed-source? - Is it that they would reveal a lot of Mali's internal core? - Is ARM afraid of IP lawsuits? Does being closed source deter patent trolls? Intel doesn't seem to have those problems.
PS: The question is not about whether the drivers should be open or not just because it is morally right or wrong. Obviously it would be nice for the clients and a good selling point, but I was wondering how ARM management see one of their biggest competitor embracing open source when developing GPU drivers.
Intel makes a ton of money, plenty to purchase pretty much every IP it needs, plus it holds a lot of IP itself, so it can also trade IP with other vendors. In contrast ARM makes a modest profit, even though I guess it can still trade IP, since both nvidia and amd license ARM themselves.
But it is a good point, open source GPU drivers are a crucial step towards empowering Linux and ending the MS monopoly. It is hilarious that Linux powers like 99% of the supercomputers and like 1% of the personal computers. If Linux is good enough for supercomputers, it has got to be good enough for personal computers, but the lack of good and often any drivers is what cripples Linux for the regular user.
I have to wonder how long it will take ARM's legal department to review the driver's code and assess risks. Even if they have part of the blob which is legally binding, they can still open up a big chunk and start chipping in on the advantages of open source developing. There are already developers knowledgeable on their architecture.
ARM and other chip producers know how much of a pain it is to support a badly integrated blob of code. When an ARM customer has a problem with the driver, they have to communicate with ARM, wait till they receive an answer, wait till the problem is solved (if it is at all), and then integrate the new blob with their software. So many steps where you can fail! It takes so much time! Mobile products have a really short time-to-market. It would be so easy for everyone if they let their customers help out with the development. Plus, it is free!!
Samsung (ARM's biggest customer) has been testing Intel's products for a while and I am pretty sure that by now they have some developers who know Intel's driver architecture. Don't you think one reason when deciding which platform to choose would be support and time-to-market?
Good questions, but I'm not familiar with the ARM GPU graphics drivers. Perhaps persuade Anand for an Ask the Expert with one of the ARM graphics team? :)
ARM CPU vendors (Qualcomm, Nvidia, etc) seem to be choosing slower quad core over faster dual core, and I'm suspecting it's all a marketing game (e.g. more cores is better, see Motorola's X8 announcement of an "8 core" phone). Do those non-technical decisions impact the decisions of the engineers in developing the ARM architecture?
NVidia used 4+1 A15 cores (fastest available at the time) for Tegra 4. And Qualcomm doesn't use generic ARM cores. They have their own (krait) architecture and the most popular SoCs based on their fastest architectures (krait 300/400) are almost exclusively quad-core.
You are quite correct that there are a variety of frequencies and core-counts being offered by ARM partners. However, these choices do not affect ARM-designed micro-architectures, as we must be able to support a variety of target frequencies and core-counts across many different process geometries.
How does designing a CPU "by hand" differ from using an automated layout tool? What sort of trade-offs does/would using automated tools cause for ARM's cores?
Second question: With many chips from many manufacturers now implementing technologies like fine-grained power gating, extremely fine control of power and clock states, and efficient out-of-order execution pipelines, where does ARM go from here to keep its leadership in low-power compute IP?
Hand layout versus automated layout is an interesting trade-off. From one perspective, full hand-layout for all circuits in a processor is rarely used now. Aside from cache RAMs which are always custom, hand-layout is reserved for datapath and queues which are regular structures that allow a human to spot the regularity and ‘beat’ an automated approach. However, control logic is not amenable to hand-layout as it’s very difficult to beat automated tools which means that the control logic can end up setting the frequency of the processor without significant effort.
In general the benefit from hand-layout has been reducing in recent years. Partly this is due to the complexity of the design rules for advanced process generations reducing the scope for more specific circuit tuning techniques to be used. But another factor is the development of advanced standard cell libraries that have a large variety of cells and drive strengths which lessens the need for special circuit techniques. When we’re developing our processors we’re fortunate to have access to a large team in ARM designing standard cell libraries and RAMs who can advise us about upcoming nodes (for example 16nm and 10nm). In turn the processor teams can suggest & trial new advanced cells for the libraries which we call POPs (Processor Optimization Packages) that improve frequency, power and area.
A final trade-off to consider is process portability. After an ARM processor is licensed we see it on many different process geometries which is only possible because the designs are fully synthesizable. For example, there are Cortex-A7 implementations on all the major foundries from 65nm to 16nm. In combination with the advanced standard cell libraries for these processes there is little need to go to a hand-layout approach and we instead enable our partners to get to market more rapidly on the process geometry and foundry of their choosing.
When can we expect an end to software based dvfs scaling? It seems to me to be the biggest hurdle in the armsphere towards higher single threaded performance.
The current takes on your big.LITTLE architecture have been somewhat suboptimal (the Exynos cache flush as an example), so what can we expect from ARM themselves to skirt/address these issues? It seems to me to be a solid approach given the absolutely minuscule power and die budget that your little cores occupy, but there are still the issues of software and hardware implementation before it becomes widely accepted.
Though this question might be better posed to the GPU division, are we going to be seeing unified memory across the GPU and CPU cores in the near future? ARM joining HSA seems to point to a more coherent hardware architecture and programming emphasis.
Pardon the grammatical errors as I'm typing this on my phone. Big thanks to Anand and Peter.
While there are platforms that use hardware event monitors to influence DVFS policy, this is usually underneath a Software DVFS framework. Software DVFS is powerful in that it has a global view of all activity across a platform in time whereas Hardware DVFS relies on building up a picture from lots of individual events which have little to no relationship with one another. As an analogy, Software DVFS is like directing traffic from a helicopter with a very clear view of what is going on all roads in a city (but greater latency when forcing a change), whereas Hardware DVFS is like trying to pool information from hundreds of traffic cops all feeding traffic information in from their street corner. A traffic cop might be able to change traffic flow on their street corner, but it may not be the best policy for the traffic in the city!
Like all things in life, there are trade-offs with neither approach being absolutely perfect in all situations and hardware DVFS solutions rely on the Software DVFS helicopter too.
This may not be something you can answer, but is there a timeline for a 64 bit follow on to Krait?
Also, do you have any thoughts regarding clock speed vs. instruction width scaling and which route Qualcomm plans to take (with Apple going the instruction width route with the A7 and Qualcomm currently going the clock speed route with recent SoCs)?
ARM != Qualcomm. Qualcomm designs their own stuff, this guy is from ARM. Even if he knew the answers to those questions, they're neither on topic, nor is he at liberty to discuss them. He probably doesn't even want to talk about that, considering Qualcomm isn't exactly giving ARM any compliments by throwing out all of ARM's work and starting from scratch.
Really try to inform yourself a little bit better before asking all these questions. Krait 600 and 800 that are in most phones and tablets are 100% new designs from Qualcomm. Krait 410 is not a new design and is licenced from ARM.
I mistakenly thought we had someone from Qualcomm answering the questions. I didn't say anything about the "Krait" (or Snapdragon) 410 having a Qualcomm designed CPU.
With Apple and yourselves taking different approaches to ARM64, do you have any thoughts on what the different trade-offs you both made were, and what the knock-on effects are in terms of where the two implementations might shine?
What emotion comes to mind on the fact that ARM wishes to forget the big.LITTLE with a 64 bit equivalent of A12 limited to a Quad-Core configuration for consumer electronics?
ARM continues to believe in big.LITTLE which is why we improved on interoperability in the Cortex-A53 and Cortex-A57 generation of processors. In future processor announcements I’m sure you’ll see our continued focus on big.LITTLE as a key technology that enables best possible energy efficiency.
1. We don't seem to have quite seen the promised power savings for big.little yet (thinking of the Exynos 5420 in particular since it has hmp working, not sure if any devices have correct Linux kernel yet though). Are you still as bullish on this aspect of the big.little story?
2. Are there particular synergies to using Mali with the CCI vs. other brand GPU's?
3. What is your general response to the criticism of big.little that has come out of Intel and Qualcomm? Intel, in particular, tends to argue dynamic frequency scaling is a better approach.
In answer to (3), DVFS is complementary to big.LITTLE, not a replacement for it.
A partner building a big.LITTLE platform can use DVFS across the same voltage and frequency range as another vendor on the same process with a single processor. The difference is that once the voltage floor of the process is reached with the 'big' processor the thread can be rapidly switched to the 'LITTLE' processor further increasing energy efficiency.
Mobile workloads have an extremely large working range from gaming and web browsing through to screen-off updates. The challenge with a single processor is that it must compromise between absolute performance at the high-end and energy efficiency at the low-end. However a big.LITTLE solution allows the big processor to be implemented for maximum performance since it will only be operating when needed. Conversely the LITTLE processor can be implemented for best energy efficiency.
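The voltage-floor point can be illustrated with a toy energy model. This is only a sketch: it uses the textbook approximation that dynamic energy per cycle scales with C*V^2, ignores leakage and switching costs, and every number in it (capacitances, IPC, voltages, frequencies) is a hypothetical placeholder rather than data for any real core.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Core:
    name: str
    capacitance: float   # hypothetical switched capacitance, arbitrary units
    ipc: float           # hypothetical instructions per cycle
    v_floor: float       # hypothetical minimum voltage of the process (V)
    v_max: float         # hypothetical voltage at f_max (V)
    f_floor: float       # hypothetical frequency reachable at v_floor (MHz)
    f_max: float         # hypothetical maximum frequency (MHz)

def voltage_for(core, f_mhz):
    """Assumed linear V/f curve; below f_floor the voltage cannot drop further."""
    if f_mhz <= core.f_floor:
        return core.v_floor
    frac = (f_mhz - core.f_floor) / (core.f_max - core.f_floor)
    return core.v_floor + frac * (core.v_max - core.v_floor)

def energy_per_instruction(core, f_mhz):
    """Dynamic energy per cycle ~ C * V^2, divided by IPC (leakage ignored)."""
    v = voltage_for(core, f_mhz)
    return core.capacitance * v * v / core.ipc

BIG = Core("big", capacitance=4.0, ipc=2.0,
           v_floor=0.8, v_max=1.2, f_floor=600, f_max=1800)
LITTLE = Core("LITTLE", capacitance=1.0, ipc=1.0,
              v_floor=0.8, v_max=1.1, f_floor=400, f_max=1200)

# DVFS on the big core helps until the voltage floor is reached...
big_fast = energy_per_instruction(BIG, 1800)
big_at_floor = energy_per_instruction(BIG, 600)
# ...after which lowering the clock no longer lowers energy per instruction.
big_below_floor = energy_per_instruction(BIG, 300)

# The LITTLE core delivering the same throughput as the big core at 300 MHz
# (300 MHz * IPC 2.0 = 600 MHz * IPC 1.0).
little_same_work = energy_per_instruction(LITTLE, 600)

for label, e in [("big @1800", big_fast), ("big @600", big_at_floor),
                 ("big @300", big_below_floor), ("LITTLE @600", little_same_work)]:
    print(f"{label:12s} energy/instruction = {e:.3f}")
```

In this toy model the big core stops getting more efficient once it hits the floor, while the LITTLE core does the same amount of work for less energy. The real trade-off obviously depends on leakage, migration cost and the actual V/f curves.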
So far it doesn't look like any chip maker is in a hurry to go to 20nm next year, even with the jump to ARMv8. Can he share his opinion on why this is happening? Why aren't we seeing all ARMv8 chips arrive at 20nm, as it was supposed to happen (just like the previous generation, Krait/Cortex A15, jumped to 28nm)?
As a follow-up question, since I assume he'll hint at a combination of failures from both fabs and chip OEM's to move to 20nm fast enough, will this situation be rectified at least in 2015, with an early push to 14/16nm FinFET? Can we expect chip makers to move to that in EARLY 2015? (Nvidia has kind of hinted at that, but who can trust Nvidia with keeping their own schedule?!)
"So far it doesn't look like any chip maker is in a hurry to go to 20nm next year, even with the jump to ARMv8."
This is the kind of thing that NO-ONE is going to talk about until they have things working. Assuming otherwise is just silly. Take, for example, Apple. They are quite likely porting the A7 to TSMC's 20nm process as we speak, with the goal of both learning about the process and introducing a speed-bumped iPad lineup (maybe even also a speed bumped iPhone) in Q2 2014. (They did the same thing with the die-shrunk A5, though in that case they did not publicize it as a speed bump, it's just that newer iPads had better battery life.) But they're not going to tell anyone about this. If the project slips, they'll look dumb. They may prevent a whole lot of people buying today, then those people get sick of waiting and buy Android (Osborne effect), etc. Meanwhile there is no upside to telling the world about this move.
Oh, and one more question. FreeBSD developers have just said that they will stay away from using Intel and VIA's hardware encryption features, because they could be backdoored by the NSA.
ARM is from UK - the home of GCHQ, which is just as bad, if not worse, than NSA - so is there a way to reassure us that ARMv8, which comes with hardware encryption, is free of such backdoors? Is he willing to go on the record with that?
I'm sure he realizes, that if people stop trusting these features, and they (including Intel, AMD, VIA, etc) can't prove to us that their hardware isn't in fact backdoored, will just mean NO ONE will end up using those features, and will stick with a software solution instead, which means their hardware encryption will just waste space on the die, so I hope they take this issue very seriously, for their sake, too.
Sorry for piggy backing on Krysto's post but I'd love it if Peter talks not only about hardware RNGs (that's what I assume Krysto meant by "hardware encryption") but also about general security features in ARM - specifically TrustZone (which is what Apple uses in the iPhone's Touch ID solution).
The Cortex-A15 has really struggled on mobile. Neither Tegra 4 nor Exynos 5 (nor OMAP 5 cough cough) have sold well at all compared to Snapdragon 600/800. Possibly related to their lack of success with A15, Nvidia and Samsung (and AMD) have already announced that they are going to be designing their own CPU (rather than an ARM design). Is this worrying to ARM? Doesn't it show that big.LITTLE was a mistake required to cover up A15 power hunger? Krait 400 proves that big.LITTLE is not needed to be both powerful and very power efficient. How will A57 succeed where A15 failed?
Hit the nail on the head there. I bet they say A15 wasn't really meant for mobile but then there's a glaring hole after A9 until the only just announced A12, so I reckon A15 was supposed to be for mobile but they messed it up and dreamt up big.LITTLE to cover themselves. Not that big.LITTLE is a crazy idea, just so far it's a botch job. And A53 and A57 seem to be just incrementals of A7, A15 with ARMv8 chucked in so I wouldn't expect any improved success with these.
Since ARM released the big.LITTLE guidelines, is there a plan for ARM to also release guidelines for processor and co-processor implementations such as Motorola's 'X8' system and Apple's 'M7' co-processor?
It's far worse. This is not a secret and not worth asking about. The two target completely different spaces --- expensive and high performance vs dirt cheap and adequate performance.
I have just been through the extensive ARMv8 instruction set (and there must be several hundred instructions in total), so my question is whether ARM believes that compilers, such as gcc, can be set up to take advantage of most of the instruction set, or whether one will still depend on assembly coding for a lot of the advanced stuff.
The AArch64 instruction set in the ARMv8 architecture is simpler than the ARMv7 instruction set. The AArch32 instruction set in the ARMv8 architecture is an evolution of ARMv7 with a few extra instructions. From this perspective, just as compilers such as GCC can produce optimised code for the billions of ARMv7 devices on the market, I don’t see any new challenge for ARMv8 compilers.
Hello. Now that you have a 64bit ISA are you planning something bigger (size wise)? So far ARM CPU-s are built into SOC-s but i would like to know if you are going to make an A1000 core that will be four large cores with Mali 600 and will compete for a space in the desktops. It makes sense since all major systems (Linux, WindowsRT, iOS) are already running on ARM CPU-s. This is less of a question and more of a request.
The time is not yet right. The top of the line ARM ISA CPU, Cyclone, has IPC comparable with Intel --- which is great --- BUT at a third of Intel's frequency. Apple (and the rest of ARM) have to get to at least close to Intel's frequency while not losing that IPC. Not impossible, but not trivial; and until that happens the CPUs are just not that interesting for the desktop.
The first step (which I expect Apple to take with the A8) would be an architecture like Sandy Bridge and later:
- smallish high bandwidth per core L2's
- unified large L3 shared by all cores and graphics [Cyclone has something that plays this role, but it's effectively an "off-chip" cache as far as the cores are concerned, being about 150 cycles away from the cores]
- ring (or something similar) tying together the cores, L3 slices, graphics and memory controller
Done right, I expect this gets Apple to same IPC as before, but 2x the frequency, in 20nm FinFET.
Of course that's still not good enough. Then for the A9 they have to add in a new µArch to either ramp up the IPC significantly, or improve the circuits and physical enough to turbo up to near 4GHz for reasonably long periods of time... As I said, not impossible, but there is still plenty of work to do.
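To put rough numbers on the argument above, here is a minimal Python sketch that treats single-thread performance as IPC times frequency, normalised to Intel. The ratios come straight from the post (equal IPC, one third of the frequency, then a hoped-for doubling); the function name and the simple multiplicative model are assumptions for illustration, not anything Apple or ARM has published.

```python
from fractions import Fraction

def relative_performance(ipc_ratio, freq_ratio):
    """Single-thread throughput relative to a baseline: IPC ratio times frequency ratio."""
    return ipc_ratio * freq_ratio

# From the post: IPC on par with Intel, clock at one third of Intel's.
today = relative_performance(Fraction(1), Fraction(1, 3))

# Hoped-for next step in the post: same IPC, twice the frequency.
next_step = relative_performance(Fraction(1), Fraction(2, 3))

# IPC uplift still needed for parity if the clock stays at two thirds of Intel's.
ipc_needed = 1 / next_step

print(today, next_step, ipc_needed)
```

Under this simple model the doubled clock still leaves a gap that needs either a 1.5x IPC gain or the remaining frequency, which is the "plenty of work to do" in the post.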
No one said the first CPU has to be perfect. Considering low end PC's and laptops it's a good idea to start selling to OEM's. That way you can get the ball rolling on software development. Also the GPU does not have to be good since you would use a discrete one (might finally force AMD and Nvidia to write good Linux drivers).
Given the flexibility ARM has with the instruction set (compared to x86) I would like to know where ARM sees itself going mid- to long-term. The specific question being: how can we get strong single threaded performance (like in a fat Intel core) and a massive amount of energy efficient number crunchers for parallel tasks (like GPU cores)? The current state of treating them as co-processors (CUDA, OpenCL etc.) and trying to bring them closer to the cores (HSA) ultimately seem like crutches to me, because it still takes significant effort on the software side to actually use those units.
What I imagine as the "ultimate Fusion" of these resources is a group of fat integer cores (like in AMD's modules, Haswells with 2/4 way HT, with big.LITTLE.. whatever you want) sharing a large pool of GPU-shader-like number crunchers, presented like regular floating point units are now. Dispatching instructions to these cores should be as simple as using the FPU from the software side. Sure, latency would go up (hence some faster scalar local units might still be needed) but throughput could go up by orders of magnitude. Even a single thread might get access to all of them, or in case of many threads there'd be excellent load-balancing. The GPU and maybe other functions would use them as well. The number of integer / FP cores / execution units could relatively easily be scaled, depending on the application (server, HPC, all-round).
Intel and AMD have the hardware building blocks, but apart from the next version of SSE/AVX I don't think there is any chance to implement such functionality in x86 efficiently. And it surely wouldn't be backwards compatible, hence take years or tens of years to trickle down the software stacks. The ARM software is much younger and more agile, as Apples quick and almost completely seamless transition to 64 bit iOS has shown. I'd even say: if anyone could pull something like this off it's ARM. What do you think?
I wonder if this sort of fusion is ultimately a bad idea.
Even at the basic HW level, tying the GPU in with the CPU is tough because the two are so different, and it doesn't help to destroy the primary value of the GPU in this quest. Specifically, using the same memory space clearly has value (in performance and programmer ease). Which means using the same virtual address space and TLBs. Again not in itself too problematic. But then what if we decide that we use that TLB to support VM on the GPU side? Now life gets really tough because GPUs are not set up for precise exceptions... (Using the TLB to track privilege violations is less of a problem because no-one [except debuggers!] cares if the exception generated bubbles up to the OS hundreds of instructions away from its root cause.)
WRT the more immediate issue, the implication seems to be that a unified instruction set could be used to drive both the CPU and GPU. While this sounds cool, I fear that it's the same sort of issue --- a TREMENDOUS amount of pain to solve a not especially severe problem. The issue is that the processing model of the GPU is just different from a CPU --- that's a fact. Making it the same as a CPU is to throw away the value of the GPU. But since these models are so different, the only feasible instructions would seem to be some sort of "create a parameter block then execute it" instructions --- at which point, how is this any more efficient or useful than the current scheme of using the existing CPU instructions to do this?
I think we can gauge the value of this idea, to some extent, by the late Larrabee. Intel seem (as far as I can tell) to have started with a plan vaguely like what's described --- let's make the GPU bits more obviously part of the CPU, using more or less standard CPU concepts --- and it flat out did not work. It's mutated into the Knights SomethingOrOther series which, regardless of their value or not as HPC accelerators cards, no longer look like any part of the future of GPUs or desktop CPUs.
I've talked about this before. CS engineers are peculiarly susceptible to the siren song of virtualization and masquerading because the digital world is so malleable. But not all virtualization is a good idea. The 90s spent god knows how much money on the idea of process and network transparent objects in various forms, from OLE to CORBA, but it all went basically nowhere; what won in that space was the totally non-transparent HTTP/HTML combo, I would say because they actually mapped onto the problem properly, rather than trying to make the problem look like a pre-existing solution.
Some valid concerns, for sure. And I didn't say it would be easy :) But I think I can address at least some of them.
First, my idea is not to fuse CPU and GPU into each other. It's about sharing that pool of shaders, which eats a major amount of transistors and power budget in both chips and ultimately limits their performance (provided you can feed and cool the beasts). In current AMD APUs 2 cores in a module share the 2 FPUs because these units are simply huge. Intel is already on the way to 512 bit AVX, requiring even more transistors & area. Yet their throughput pales in comparison to GPUs. And to use them all we have to go fully multi-threaded, with all its software and synchronization issues. If what I have in mind works perfectly a single CPU core could easily get access to the entire pool of shaders/FPUs, if needed. It just fires off the instructions to these massively parallel, high latency FPUs instead of the local scalar one and gets massive throughput. That's the ultimate load-balancing and very efficient use of those transistors, if it works well.
The hard-wired logic in the GPU cores (TMUs, ROPs, rasterizer etc.) would still remain. At the point where they'd usually dispatch instructions to their shaders they would now also go into that "sea of FPUs".
Sure, internal and external bandwidth, registers and such would all need to scale to hide the increased latency from putting the execution units further away from the CPU/GPU cores. But if these costs become too large one could segment the whole thing again, like combining 1 to 4 GCN compute units with one CPU module. The amount of raw FPU horsepower available to the CPU could still increase tremendously, while the "fast path" local scalar FPU could be reduced from 2x128 bit (or more) to one double precision unit again.
You see, I'd not necessarily want or need a unified instruction set for CPU and GPU, just the same micro-ops (or however you want to call them) to access the shaders /FPUs. Larrabee is almost a "traditional many-core CPU" in comparison ;) (if there already is such a thing)
1. MIPS - Opinions On it against ARMv8 ? 2. I Quote "There is nothing worse than scrambled bytes on a network. All Intel implementations and the vast majority of ARM implementations are little endian. The vast majority of Power Architecture implementations are big endian. Mark says MIPS is split about half and half – network infrastructure implementations are usually big endian, consumer electronics implementations are usually little endian. The inference is: if you have a large pile of big endian networking infrastructure code, you’ll be looking at either MIPS or Power. "
How true is that? And if true, does ARM have any bigger plans to tackle this problem? Obviously there are huge opportunities now that SDN is exploding.
3. Thoughts on current integration of IP ( ARM ), Implementer ( Apple/Qualcomm ) and Fab ( TSMC ) ? Especially on the speed of execution. Where previously it would take years for any IP maker to get from announcement to something that is on the market, we are now seeing Apple coming in much sooner and Qualcomm is also well ahead of ARM's projected schedule for 64-bit SoCs in terms of shipment date.
4. Thoughts on Apple's implementation of ARMv8?
5. Thoughts on Economy of Scale in Fab and Nodes. Post 16/14nm and 450mm wafers. Development Cost etc. How would that impact ARM?
6. By having a Pure ARMv8 implementation and Not supporting the older ARMv7. How much, in terms of % transistor does it save?
7. What technical hurdles do you see for ARM in the near future?
Addressing question-2, all ARM architecture and processor implementations support big and little endian data. There is an operating system visible bit that can be changed dynamically during execution.
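For anyone wondering what "scrambled bytes" looks like in practice, here is a minimal Python sketch of the byte-order issue from question 2. It only shows how the same 32-bit value is laid out in each order and what a reader gets when it assumes the wrong one; the sample value is arbitrary, and nothing here models the ARM endianness control bit itself.

```python
value = 0x0A0B0C0D

# The same 32-bit value laid out in each byte order.
big = value.to_bytes(4, byteorder="big")        # most significant byte first (network order)
little = value.to_bytes(4, byteorder="little")  # least significant byte first

# A little-endian reader that is handed big-endian bytes without swapping them.
scrambled = int.from_bytes(big, byteorder="little")

print(big.hex(), little.hex(), hex(scrambled))
```

Code written for one byte order has to swap at every such boundary when moved to the other, which is why a large body of big endian networking code is easier to carry over to a core that can run big endian.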
On question-6, certainly an AArch64 only implementation would save a few transistors compared to an ARMv8 implementation supporting both AArch32 and AArch64. However probably not as much as you think and is very dependent on the micro architecture since the proportion of decode (or AArch32 specific gates) will be less in a wide OOO design than an in-order design. For now, code compatibility with the huge amount of applications written for Cortex-A5, Cortex-A7, Cortex-A9, etc is more important.
Next gen consoles have been noted for their use of SoCs, especially in the context of hUMA. Of course, SoC have long been the standard in the mobile space. What is the current state of hUMA-like functionality between the CPU and the GPU in mobile? And what can and/or will be done in the future to improve this, both within ARM's family of products (ARM CPU + ARM GPU) and working with third-parties (ARM CPU + any other GPU)?
Intel has adopted a cache model where each core has small pools of private, fast L1 and L2 cache and sharing/integration between cores and even the GPU happens in a larger, slower L3 cache. ARM's designs favour a private, fast L1 with sharing happening on the level of the L2 cache. What are the advantages/disadvantages between these design choices in terms of performance, power, die area, and scalability/flexibility?
Intel and AMD are busy expanding the width of their SIMD instruction set to 256-bits and beyond. Are 256-bit vectors relevant to mobile and NEON or are the use cases not there in mobile and/or the power/die area not worth it?
On the topic of ISA extensions to accelerate common functionality what other opportunities are out there? ARMv8 is adding acceleration for cryptography. Could acceleration for image processing, face recognition or voice recognition be useful or are those best left for specific chips outside the CPU?
* What are the latencies, in CPU cycles, of the CPU caches? Is it possible in future to create a design that uses a shared L3 cache?
* How many general purpose CPU registers are in Cortex-A53 compared with its predecessors?
* Can Cortex-A53 be expected to be part of netbooks in the years to come? What about micro-servers?
While not yet in mobile, ARM already produces solutions with L3 caches such as our CCN-504 and CCN-508 products which Cortex-A53 (and Cortex-A57) can be connected to.
Since Cortex-A53 is an in-order, non-renamed processor the number of integer general purpose registers in AArch64 is 31, the same as specified by the architecture.
How closely does a company like ARM follow academic ideas, and how long does it take to move those ideas into silicon. For example: - right now the king of academic branch prediction appears to be TAGE. Is ARM looking at changing its branch predictor to TAGE, and if so would we expect that to debut in 2015? 2017?
- there have been some very interesting ideas for improving memory performance through having LLC and Memory Controller know about each other. For example Virtual Write Queue attempts to substantially reduce the cost of writing out data, while another scheme has predictors for when various ranks will be idle long enough that writes to them should be attempted, and a third scheme has prefetch requests prioritized to match ranks that are least busy. Once again, how long before we expect this sort of tech in ARM CPUs?
- in a handwaving fashion, for a high end CPU, I think it's fair to say that the single biggest cause of slowdowns is memory latency, which everyone knows; but the second biggest cause of slowdowns is the less well known problem of fetch bandwidth, specifically frequent taken branches, coupled with a four-wide fetch from a SINGLE cache line, and edge effects that result in many of those fetches being less than four wide. The heavy duty solution for this is a trace cache, a somewhat weaker solution is a loop buffer. Does ARM plan to introduce either of these? (Surely they are not going to allow the fact that Intel completely bollixed their version of a trace cache destroy what is conceptually a good idea, especially if you just use it as a small loop driven augmentation of a regular I-cache, rather than trying to have it replace the I-cache?)
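The fetch-bandwidth point in the last question can be sketched with a toy model in Python. It assumes a four-wide fetch that cannot cross a cache line and stops at a taken branch, as the question describes; the 16-instruction line (64 bytes of 4-byte instructions) and the function name are assumptions for illustration, and real front ends have further alignment rules that are ignored here.

```python
FETCH_WIDTH = 4          # instructions per fetch, as in the question
LINE_INSTRUCTIONS = 16   # assumption: 64-byte line holding 4-byte instructions

def fetch_cycles_per_iteration(loop_start, loop_length):
    """Count fetch cycles for one pass of a loop that ends in a taken branch.

    loop_start is the instruction index of the first instruction in the loop.
    """
    cycles = 0
    pc = loop_start
    end = loop_start + loop_length  # one past the taken branch
    while pc < end:
        line_end = (pc // LINE_INSTRUCTIONS + 1) * LINE_INSTRUCTIONS
        # A fetch group stops at the fetch width, the cache line edge, or the taken branch.
        pc = min(pc + FETCH_WIDTH, line_end, end)
        cycles += 1
    return cycles

aligned = fetch_cycles_per_iteration(0, 4)      # loop sits at the start of a line
straddling = fetch_cycles_per_iteration(14, 4)  # same loop, split across two lines
print(aligned, straddling)
```

The same four-instruction loop takes twice as many fetch cycles when it straddles a line boundary, which is the kind of edge effect a loop buffer or trace cache is meant to hide.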
Getting away from the technical questions, I'm interested in these two.
ARM has been used in many different devices, what do you consider the most innovative use of what you designed, possibly something that was outside of how you envisioned it originally being used?
As a creator, what devices made you look at what you created and had the most pride?
I'd suggest all of us who work for ARM are proud that the vast majority of mobile devices use ARM technology!
Some of the biggest innovations with ARM devices are coming in the Internet of Things (IOT) space which isn't as technically complex from a processor perspective as high-end mobile or server, but is a space that will certainly affect our everyday lives.
Hey Mr Greenhalgh, I currently work at a great SoC company, and I have recently been working on an A12-based SoC, which was supposed to be an improvement over the A9 by a certain amount of DMIPS/MHz but much more power-efficient than the A7/A15 pair in a big.LITTLE. It got to the point that a sole 4x A12 was better at the low end of the performance range than the 4x A7, and better at high speed too because it dissipates less power than the 4x A15, which makes today's SoCs throttle at around 1.3GHz, thus allowing sustainable performance at a higher frequency. Best of all, it allows not using the CCI, which has been the subject of controversy (HMP, SMP, ...). This CPU has not really been heavily marketed, because today the fashion is the A53/A57 big.LITTLE pair and the possibly useless 64-bit platforms. That was my background (personal thoughts). Now my questions: are we really going to see a CPU performance improvement for small platforms (smartphones) with the A53/A57, or are these CPUs specified for heavy use, which would indicate that thermal dissipation will prevent heavy use on a smartphone? Should SoC vendors concentrate on 32-bit A7/A15/A12 versions that could be improved again in the future in order to really see more performance? Are you packaging an 8x A12 that might make sense in a high-end SoC? Are you going to improve the power domain sharing inside your deliveries? It's still nonsense to have CoreSight IP inside a CPU domain, as it prevents debugging once the CPUs are sleeping...
What is ARM's most power efficient processing core? I don't mean using the least power, I mean work per watt. How does that compare to Intel and IBM? Also, I know that ARM is trying to grow in the server market, given the rise of the GPGPU market, do you foresee ARM leveraging their MALI GPUs for this in the future? Finally, does ARM have any interest or ambition in scaling up to the desktop market?
I have another question. Why is ARM pursuing the big.LITTLE paradigm? Wouldn't it be more economical to use the extra silicon to make larger, more powerful cores that run at a lower clockspeed?
That is a very good point. I can soo imagine ARM telling their customers: you are anyways using 4X area, let's swap that for a BIG core for laptop/ultra-book class products
In the traditional applications class, Cortex-A5, Cortex-A7 and Cortex-A53 have very similar energy efficiency. Once a micro-architecture moves to Out-of-Order and increases the ILP/MLP speculation window and frequency there is a trade-off of power against performance which reduces energy efficiency. There’s no real way around this as higher performance requires more speculative transistors. This is why we believe in big.LITTLE as we have simple (relatively) in-order processors that minimise wasted energy through speculation and higher-performance out-of-order cores which push single-thread performance.
Across the entire portfolio of ARM processors a good case could be made for Cortex-M0+ being the more energy efficient processor depending on the workload and the power in the system around the Cortex-M0+ processor.
When running 32bit apps on a 64bit OS, is there any performance hit compared to 64bit apps on a 64bit OS?
And from an IPC/Watt perspective, how are A53/A57 doing compared to A7/A15... I mean how much more performance will we get for the same power usage compared to A7/A15... talking about the whole platform (memory included)
The performance per watt (energy efficiency) of Cortex-A53 is very similar to Cortex-A7. Certainly within the variation you would expect with different implementations. Largely this is down to learning from Cortex-A7 which was applied to Cortex-A53 both in performance and power.
ARM has an active architecture research team and, as I'm sure you would expect, look at all new architectural developments.
It would be possible to design a CPU with on-chip FPGA (after all, most things in design are possible), but the key to a processor architecture is code compatibility so that any application can run on any device. If a specific instruction can only run on one device it is unlikely to be taken advantage of by software since the code is no longer portable. If you look at the history of the ARM architecture it's constantly evolved with new instructions added to support changes in software models. These instructions are only introduced after consultation with the ARM silicon and software partners.
You may also be interested in recent announcements concerning Cortex-A53 implemented on an FPGA. This allows standard software to run on the processor, but provides flexibility around the other blocks in the system.
I'm pretty sure no one asked you and that the question was meant to be answered by the ARM engineer, should he choose to answer it. Instead of trolling perhaps you should come up with your own question for our guest.
If you don't have anything nice to say, don't say it at all.
Is this move to 64 bit driven by a need from the hardware and/or software or pressure from competitors? If the former can you indicate some of the improvements users will see and feel with 64 bit?
Why aren't you helping to make Terminator 2 (specifically, not the first one, that one's robot was just scary while Terminator 2 had a friendly robot and a scary robot as well) a reality in our world? Do you have something against robots? Seems vaguely speciesist to be honest...
Can you talk a bit about your personal philosophy regarding pipeline lengths, as the A53 and A57 diverge significantly on the subject? Too short and it's difficult to implement goodness like a scheduler, but as you increase the length you also contribute to design bloat: you need large branch target arrays with both global and local history to avoid stalls, more complicated redirects in the decoder and execution units to avoid bubbles, and generally just more difficult loops to converge in your design. Are you pleased with the pipeline in the A53, and what do you see happening with the pipeline both in the big cores and the little ones going forward (anticipate a vague answer on this one, but not going to stop me from asking)?
I'd expect my view of pipeline lengths to be similar to most other micro-architects. The design team have to balance the shortest possible pipeline to minimise branch mis-prediction penalty and wasted pipelining of control/data against the gates-per-cycle needed to hit the frequency target. Balance being the operative word as the aim is to have a similar amount of timing pressure on each pipeline stage since there's no point in having stages which are near empty (unless necessary due to driving long wires across the floorplan) and others which are full to bursting.
Typically a pipeline is built around certain structures taking a specific amount of time. For example you don't want an ALU to be pipelined across two cycles due to the IPC impact. Another example would be the instruction scheduler where you want the pick->update path to have a single-cycle turnaround. And L1 data cache access latency is important, particularly in pointer chasing code, so getting a good trade-off against frequency & the micro-architecture is required (a 4-cycle latency may be tolerable on a wide OOO micro-architecture which can scavenge IPC from across a large window, but an in-order pipeline wants 1-cycle/2-cycle).
We're pretty happy with the 8-stage (integer) Cortex-A53 pipeline and it has served us well across the Cortex-A53, Cortex-A7 and Cortex-A5 family. So far it's scaled nicely from 65nm to 16nm and frequencies approaching 2GHz so there's no reason to think this won't hold true in the future.
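The balance described in this answer can be sketched with the usual back-of-envelope model in Python: time per instruction is CPI divided by frequency, and a mispredicted branch costs a penalty that grows with pipeline depth. Every number below (penalties, frequencies, branch fraction, mispredict rates) is hypothetical and chosen only to show the trade-off; none of them are ARM figures.

```python
def time_per_instruction_ns(base_cpi, branch_fraction, mispredict_rate,
                            penalty_cycles, freq_ghz):
    """Average time per instruction with a simple branch mispredict penalty term."""
    cpi = base_cpi + branch_fraction * mispredict_rate * penalty_cycles
    return cpi / freq_ghz

def compare(mispredict_rate):
    # Hypothetical shorter pipeline: smaller refill penalty, lower clock.
    short = time_per_instruction_ns(1.0, 0.2, mispredict_rate, 8, 1.8)
    # Hypothetical longer pipeline: higher clock, but twice the refill penalty.
    long_ = time_per_instruction_ns(1.0, 0.2, mispredict_rate, 16, 2.0)
    return short, long_

short_good, long_good = compare(0.05)  # well-predicted code
short_bad, long_bad = compare(0.10)    # branchy, poorly predicted code
print(short_good, long_good, short_bad, long_bad)
```

With these made-up inputs the longer pipeline wins on well-predicted code and loses once the mispredict rate doubles, which is why the pipeline length has to be balanced against the frequency target rather than simply maximised.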
Peter, Thank you for taking the time to answer our questions! As mobile devices become more and more useful/powerful, they have encroached on territory historically dominated by Intel and AMD...and they feel the pressure. Does ARM feel pressure from those two companies? As ARM progresses, will you actively target the desktop space as room for growth?
Here's my question: Implementations of previous ARM cores by licensees, most notably the A15, feature much higher clocks than what ARM recommends. How has that influenced the design of the A53? Do you expect ARM's clock frequency design targets to be closer to the clocks in actual implementations?
ARM processor pipelines allow the processor to be built to achieve certain frequencies, but we don't recommend or advise what they should be. After all, there are still ARM1136 processors being implemented today on 40nm, yet we designed the processor on 180nm!
We and our partners like the freedom to choose whether to push the frequency as far as it will go or to back off a bit and save a bit of area/power. This freedom allows differentiation, optimisation around the rest of the platform and time-to-market (higher frequency = more effort = more time).
Naturally our pipelines have a range of sweet-spot frequencies on a given process node and there is a lot of discussion with lead partners about a new micro-architecture, but we aren't changing the pipelines based on the frequencies we're seeing in current mobile implementations.
Most good EE/CE degrees will have a reasonable amount of micro-architecture/architecture courses, but it doesn't hurt to understand what makes all the popular micro-architectures tick. For that matter, a lot of the designs in the 90's were impressive too - check out the DEC Alpha EV8 which never got to market, but was a really interesting processor.
Hi all (and hopefully this gets a response from our guest Mr. Greenhalgh),
I'm not exactly too well informed in the CPU department, so I won't pretend that I am. I'm just curious as to how A53 will fare against the likes of Krait 450 and Cyclone in terms of DMIPS (as obsolete as some people may think it is, i'd just like to get a sense of it performance-wise) and pipeline-depth.
We're all assuming that Apple has gone ahead and used a ARMv8a instruction set and, as per their own usual routine, swelled up the cores to many times that of their competitors and marketed it as a custom architecture. Since A53 is also based off ARMv8, I'm wondering how this will translate into speed. I think someone's mentioned before that A53 is the logical successor to Cortex-A7, but my mind is telling me that there's more to the number in the name than just a random number that is a few integers below 57.
If this is essentially a quad-core part and succeeds the A7, then are we looking at placement in the Snapdragon 400 segment of the market? It would certainly satisfy the conditions of "mid-to-high end" but I'm a little disappointed in Cortex-A at the moment considering that the A7 was introduced as a sort of energy-efficient, slightly lower performing A9. I mean, the A12 is seen as the A9's successor but it's still ARMv7a and it won't be hitting the market anytime soon, so would it be possible that we could see A53, with its ARMv8 set, on par with the Cortex-A12 in terms of rough performance estimates?
Can't wait until A57; it's bound to be a great performer!
Speaking broadly about Dhrystone, the pipeline length is not relevant to the benchmark as perfect branch prediction is possible which means issue width to multiple execution units and fetch bandwidth largely dictates the performance. This is the reason Dhrystone isn't great as a benchmark as it puts no pressure on the data or instruction side memory systems (beyond the L1 cache interfaces), TLBs and little pressure on the branch predictors.
Cortex-A12 is a decent performance uplift from Cortex-A53 so we're not worried about overlap, and while the smartphone market is moving in the direction of 64-bit, there are still a lot of sockets for Cortex-A12. In addition there are many other markets where Cortex-A9 has been successful (Set Top Box, Digital TV, etc) where 64-bit isn't a near-term requirement and Cortex-A12 will be a great follow-on.
Question: What is the competitive advantage of ARM powered devices over other manufacturers' products and what your company will do in the future to preserve and enhance it?
Can you explain what you mean by a 'weak' memory model and how this differs from other architectures and how it translates into memory models in common languages like Java?
A weakly ordered memory model essentially allows reads (loads) and writes (stores) to overtake each other and to be observed by other CPUs/GPUs/etc in the system at different times or in a different order.
A weakly ordered memory model allows for the highest performance system to be built, but requires the program writer to enforce order where necessary through barriers (sometimes termed fences). There are many types of barrier in the ARM architecture from instruction only (ISB) to full-system barriers (DSB) and memory barriers (DMB) with various variants that, for example, only enforce ordering on writes rather than reads.
The Alpha architecture is the most weakly ordered of all the processor architectures I'm aware of, though ARM runs it close. x86 is an example of a strongly ordered memory model.
Recent programming standards such as C++11 assume weakly ordered and may need ordering directives even on strongly ordered processors to prevent the compiler from optimising the order.
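As a rough illustration of the answer above, here is a toy Python sketch of the classic message-passing test: one thread writes data and then sets a flag, the other reads the flag and then the data. It enumerates every interleaving, first with each thread's program order kept, then with each thread's two accesses allowed to swap. Modelling weak ordering as "any permutation within a thread" is a deliberate simplification and an assumption of this sketch, not a description of the actual ARM memory model.

```python
from itertools import permutations

# Thread 0 publishes data, then sets a flag. Thread 1 reads the flag, then the data.
T0 = [("store", "data", 1), ("store", "flag", 1)]
T1 = [("load", "flag", "r1"), ("load", "data", "r2")]

def interleavings(a, b):
    """Yield every merge of a and b that keeps the internal order of each."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(schedule):
    """Execute one global order of operations and return (r1, r2)."""
    mem = {"data": 0, "flag": 0}
    regs = {}
    for kind, var, x in schedule:
        if kind == "store":
            mem[var] = x
        else:
            regs[x] = mem[var]
    return regs["r1"], regs["r2"]

def outcomes(allow_reordering):
    """Collect every (r1, r2) result reachable under the chosen ordering rule."""
    orders0 = list(permutations(T0)) if allow_reordering else [tuple(T0)]
    orders1 = list(permutations(T1)) if allow_reordering else [tuple(T1)]
    results = set()
    for o0 in orders0:
        for o1 in orders1:
            for schedule in interleavings(list(o0), list(o1)):
                results.add(run(schedule))
    return results

strong = outcomes(allow_reordering=False)
weak = outcomes(allow_reordering=True)
print(sorted(strong), sorted(weak))
```

The result (r1, r2) = (1, 0), where the flag is seen set but the data is stale, only appears once reordering is allowed. Placing a barrier between the two accesses in each thread is what rules it out again on a weakly ordered machine.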
Memory bandwidth to RAM is more important than huge on-chip caches. The on-chip cache is more like a buffer for prefetching and write-back. 10% more RAM bandwidth is better than 50% more cache. Even caching of instructions is getting harder (keeping enough in cache) because the amount of RAM in use has grown far more than cache size as the number of active programs and their complexity have increased. Compare: a mid-90's Pentium with 256KB of off-chip L2 cache and 64MB RAM; a P3 with 256KB of on-chip L2 and 1024MB RAM; a Sandy Bridge Xeon with about 2.5MB L3 per core (up to 20MB) and 16GB RAM per core (128GB) or more.
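The three configurations quoted in this post can be turned into RAM-to-cache ratios with a few lines of Python. The sizes are the ones given above (per core for the Xeon); the labels and the helper name are only for illustration.

```python
KB, MB, GB = 1024, 1024**2, 1024**3

# (label, cache bytes, RAM bytes) as quoted in the post above.
systems = [
    ("Pentium, mid 90s", 256 * KB, 64 * MB),
    ("P3", 256 * KB, 1024 * MB),
    ("Xeon, per core", 2.5 * MB, 16 * GB),
]

def ram_to_cache(cache_bytes, ram_bytes):
    """How many times larger the RAM is than the cache."""
    return ram_bytes / cache_bytes

ratios = {name: ram_to_cache(cache, ram) for name, cache, ram in systems}
for name, ratio in ratios.items():
    print(f"{name}: RAM is about {ratio:.0f}x the cache")
```

On these figures the ratio grows from 256x to several thousand, which is the trend the post is pointing at.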
Hi Peter! Can 32bit performance degrade in future ARMv8 processor designs? ARMv7 requires some features omitted in ARMv8 - I mean arbitrary shifts, direct access to R15, conditional execution. I guess this extra hardware is not free, especially the latter.
Fortunately, while the ARM instruction set has evolved over the years, ARMv8 AArch32 (which is effectively ARMv7) isn't that far away from ARMv8 AArch64. A couple of big differences in ARMv8 AArch64 are constant length instructions (32-bit) rather than variable (16-bit or 32-bit) and essentially no conditional execution, but most of the main instructions in AArch32 have an AArch64 equivalent. As a micro-architect, one of the aspects I like the most about the AArch64 instruction set is the regularity of the instruction decoding as it makes decoding them faster.
As such the hardware cost of continuing to support AArch32 is not that significant and it is more important to be able to support the thousands of software applications that have been compiled for ARMv7 which are fully compatible and run just fine on the generation of 64-bit ARM processors that are now arriving.
How is ARM A57 matching up with performance related to Intel Haswell? Though very good at power, the ARM cores are traditionally weak in performance compared to Intel. The Haswell arch seems to be beating ARM A15 very easily in Chromebook. Is this due to memory b/w issue? Can ARM arch with big.LITTLE including CCI support higher BW? Also why doesn't ARM go to Qualcomm arch like asynchronous Freq scaling? Why is freq tied to cluster instead of cores?
- Does ARM work on GCC development?
- Are there special instructions for Cryptostuff defined in the 64-Bit ISA?
- If yes, are there patches for the upstream Linux kernel available?
- Are there instructions for SHA-3 available?
- Would ARM change their mind about free Mali drivers?
- Would ARM support device-trees?
Yes, ARM works on GCC development and, yes, there are special Crypto instructions defined in the v8 Architecture (for AES and SHA).
As for patches, Mali drivers and device trees, these are handled by other teams in ARM. If you're interested in these wider questions about ARM technology, forums such as http://community.arm.com can help you.
I hope you guys intend to add ChaCha20 to your next-generation chips or architecture. I'm pretty sure it's the next big cipher to be adopted by software makers. Google's security chief, Adam Langley, has already shown his support for it, but there are others who are looking to adopt it as an alternative to AES. So I hope you too can adopt it as soon as possible.
As for SHA-3, the jury is still out on that one, since nobody trusts NIST anymore, and there has been even some recent controversy about them wanting to lower the security of the final SHA-3 standard, so you might want to hold off on that one. You should support SHA-512 in the meantime, though, which is oddly missing from ARMv8.
Also, you guys should move to 4-wide and at least 256-bit NEON for Cortex A57's successor (I know, not your job, but still). And as others have said, try to match the low-end, mid-end and high-end release of the cores next time. One more thing - try to support OpenCL 2.0 as soon as you can.
My question is about processor architecture design in general - there can't be very many positions in the world for "lead processor/architecture designer" - so how does one become one? Obviously promotion from within, but how do you get the opportunity to show your company you have what it takes to make the tough calls? There can't be very many textbooks on the subject since you guys are constantly evolving the direction these things go.
How many people does it take to design a bleeding edge ARM processor? How are they split up? Can you give a brief overview of the duties assigned to the various teams that work on these projects?
I'd imagine that ARM is not so different from any other processor company in that it is the strength of the engineering team that is key to producing great products.
Perhaps where ARM differs from more traditional companies is the level of discussion with the ARM partners. Even before an ARM product has been licensed by an ARM partner they get input in to the product and there will be discussions with partners at all levels from junior engineers a few years out of college, through to multi-project veterans, lead architects, product marketing, segment marketing, sales, compiler teams, internal & external software developers, etc etc.
As a result, there are rarely 'tough calls' to be made as there's enough input from all perspectives to make a good decision.
In answer to your question about processor teams, these are typically made up of unit design engineers responsible for specific portions of the processor (e.g. Load-Store Unit) working alongside unit verification engineers. In addition to this there will be top-level verification teams responsible for making sure the processor works as a whole (using a variety of different verification techniques), implementation engineers building the design and providing feedback about speed-paths/power, performance teams evaluating the IPC on important benchmarks/micro-benchmarks.
And this is just the main design team! The wider team in ARM will include physical library teams creating advanced standard cells and RAMs (our POP technology), IT managing our compute cluster, marketing/sales working with partners, software teams understanding instruction performance, system teams understanding wider system performance/integration and test-chip teams creating a test device.
All in all it takes a lot of people and a lot of expertise!
I. Core count inflation. Everyone but Apple lately has equated high-end with quad-core, which is unfortunate. I have a four-core phone, but would rather have a dual-core one that used those two cores' worth of die area for a higher-IPC dual-core design, or low-power cores for a big.LITTLE setup, or more L2, or most anything other than a couple of cores that are essentially always idle. Is there anything ARM could do (e.g., in its own branding and marketing or anything else) to try to push licensees away from this arms race that sort of reminds me of the megapixel/GHz wars and towards more balanced designs?
II. Secure containers. There has been a lot of effort put in to light-weight software sandboxes lately: Linux containers are newly popular (see Docker, LXC, etc.); Google's released Native Client; process-level sandboxing is used heavily now. Some of those (notably NaCl) seem to be clever hacks implemented in spite of the processor architecture, not with its help. Virtualization progressed from being that sort of hack to getting ISA support in new chips. Do you see ARM having a role in helping software implementers build secure sandboxes, somewhat like its role in supporting virtualization?
III. Intel. How does it feel to work for the first company in a long while to make Intel nervously scramble to imitate your strategy? Not expecting an answer to that in a thousand years but had to ask.
Core counts are certainly a popular subject at the moment!
From our perspective we've consistently supported a Quad-Core capability on every one of our multi-core processors all the way back to ARM11 MPCore which was released in the mid-2000s. And while there's a lot of focus from the tech industry and websites like Anandtech on high-end mobile, our multi-core processors go everywhere from Set-Top-Box to TVs, in-car entertainment, home networking, etc, etc some of which can easily and consistently use 4-cores (and more, which is why we've built coherent interconnects to allow multiple clusters to be connected together).
The processors are designed to allow an ARM partner to choose between 1, 2, 3 or 4 cores and the typical approach is to implement a single core then instance it 4 times to make a Quad-Core with the coherency+L2 cache layer connecting the cores together and power switching to turn unused Cores off. The nice thing about this approach is that it is technically feasible to design a coherency+L2 cache solution that scales in frequency, energy-efficiency and IPC terms from 1-4 cores rather than compromising in any one area.
The result of this is that a Dual-Core implementation will be very similar in overall performance terms as a Quad-Core implementation. So while it may be that for thermal reasons running all 4-Cores at maximum frequency for a sustained period of time is not possible, if two Cores are powered off on a Quad-Core implementation it isn't any different from only having a Dual-Core implementation to start with. Indeed, for brief periods of time 4-Cores can be turned-on as a Turbo mode for responsiveness in applications that only want a burst of performance (e.g. web browsing). Overall there are few downsides to multiple Core implementations outside of silicon area and therefore yield.
From a product perspective we've been consistent for almost a decade on the core counts provided by our processors and allow the ARM partners to choose how they want to configure their platforms with our technology.
First, wow--had no idea you were going to go through all these comments and answer them; good work.
FWIW, your answer dismisses a problem slightly different from the one I asked about. Yes, 4xA15/2MB L2 performs no worse than 2xA15/2MB L2; you just spend more because of higher die area. But if the SoC design had stuck to two cores, I imagine they could use extra die area for more useful things--enlarging the L2, adding A7s in a big.LITTLE config, using a higher-IPC but larger core design if one were available, upgrading other SoC components like the GPU, etc. My understanding, mostly from AnandTech, is that Apple's A7 (where'd they get that name from?) SoC basically does this--it's largish, but that's from Apple doing basically everything *but* going quad-core.
Still, indirectly, you have answered my bottom-line question--ARM doesn't really see the proliferation of quad-core SoCs as a problem, and maybe doesn't see it as their job to push licensees towards one config or another in general.
Can we expect the fully flexible A53+A57 octa core successor of the current big.LITTLE implementation? Can you estimate the improvements it could bring regarding performance /efficiency, on the 20nm?
Will Aarch64 add coprocessor instructions? The CP15 interface exists in the 32 bit ARMv7 but is missing in ARMv8. Will this be added later so that 3rd parties may add their custom coprocessors to an ARM design? If not, is this to prevent Aarch64 from becoming diversified with proprietary extensions?
The 64 bit ARM ISA is unique in that it allows legacy 32 bit support to be omitted from a design. How much additional power consumption is needed to support the 32 bit legacy modes? Also if 32 bit legacy support was removed and the core optimized for 64 bit only mode, what would the performance gains be, all other things being equal?
big.LITTLE is an interesting concept that launched with the Cortex A7 and A15. Several years later the A12 arrives and can be used in big.LITTLE as well. Can big.LITTLE scale to span three architectures (A7, A12, A15) for fine grained performance/watt stepping as load increases? Similarly, there is a gap between the Cortex A53 and Cortex A57. Presumably there is room for a hypothetical Cortex A55 but that'd arrive several years later. Are there plans to synchronize the release schedule for the low, middle and high end designs?
HalloweenJack - Tuesday, December 10, 2013 - link
Will the A53 finally be `the big` breakthrough in the data server market you want? Google , apple and Facebook have all been testing ARM servers for a while now and good `noises` are heard about them. Also will the new tie in with AMD reap its rewards sooner than later ?
Samus - Tuesday, December 10, 2013 - link
It's going to need more than 2MB cache per core to compete in the enterprise class because it's safe to say the branch prediction and cache hit rates will trail Intel.
ddriver - Tuesday, December 10, 2013 - link
Cache does not matter for data servers, as much as you have it is never enough to keep a significant amount of the data it needs to serve to make a substantial difference, data servers need fast storage and plenty of ram for caching frequently accessed stuff.
Cache serves computation purposes, when you need to put data on the registers and perform operations with it, because cpu cache is much faster than ram, not to mention mechanical storage, but for a data server, the latency of ram and large scale storage is not an issue. And furthermore, unless you have gigabytes of cpu cache, it won't make any difference.
Also, what branch prediction, the data server is gonna predict what data is needed? This is BS, branch prediction is important for computation, not for serving data, and for a typical data server application, a cache miss doesn't really matter, because in a data server application you will get predominantly misses, no data server uses a data set small enough to fit and stay in the cpu cache.
Your comment makes zero sense. My guess is you "heard" intel makes good prefetchers and you try to pretend to be smart... well, DON'T!
Samus - Tuesday, December 10, 2013 - link
Very constructive insult driver, nicely done.
I'd recommend you research how and why enterprise class processors have historically had large caches and why they ARE important before you vouch for a mobile platform being retrofitted for use in datacenter. Any programmer or hardware engineer would disagree with your ridiculous trolling.
ddriver - Wednesday, December 11, 2013 - link
As I said, unless your data set is small enough to fit and stay in the cpu cache, you won't see significant improvements, and for a data server this scenario is completely out of the question.
Unlike you, I am a programmer, and a pretty low level at that (low level at the hardware that is, not at skill). Also, I know a fanboy troll when I see one, you saw max L2 cache is 2MB and made the brilliant deduction an entire CPU architecture is noncompetitive to intel, which is what is really ridiculous.
When hardware engineers design CPUs, they run very accurate performance simulations to determine the optimal cache capacity, and if that chip is capped at 2MB of L2 cache, that means increasing it any more is no longer efficient in terms of "die area/performance" ratio.
happycamperjack - Wednesday, December 11, 2013 - link
You guys talk like there can only be ONE CPU to rule them all! That's not how big guys like Facebook or Google work. They need different varieties of CPUs to handle different jobs. Low power ARMs for simple tasks like data retrieval or serving static data while conserving power usage. Enterprise CPU for data analysis and predictions. There are simply too many different tasks for just using one type of CPU if they want to run the company efficiently. A few watts of power differences could mean millions of savings.
eriohl - Friday, December 13, 2013 - link
I agree that if you are just serving data on key (data servers) then cache size doesn't matter too much (although I can imagine that being able to keep indexing structures and such in processor cache might be useful depending on the application).
But regardless of that, in my experience servers tend to do quite a bit of computing as well.
So I'm going to have to disagree with you that processor cache in general doesn't matter for enterprise servers (and agree with happycamperjack I suppose).
virtual void - Wednesday, December 11, 2013 - link
That was some utterly *bull*. Disable the L3 cache on a Xeon class CPU and run whatever you call "server-workload", your performance will be absolutely shit compared to when the L3 is enabled, even more so if you run on Sandy Bridge or later where I/O-devices DMA directly into (and out of) L3-cache.
Samus - Wednesday, December 11, 2013 - link
I know driver may be a software engineer of some sort but he obviously has no clue about memory bandwidth and how large caches work around it.
Pobu - Friday, December 13, 2013 - link
enterprise class = data servers ? Plz look carefully at what Samus said.
Moreover, data server = data storage server or data analysis server ?
Cache is absolutely a critical part for data analysis server.
theduckofdeath - Saturday, December 14, 2013 - link
I concur with everyone else on this one, ddriver. You are really wrong about this. Just because a data segment might be 2GB, you still need a L3 cache for all of those machine code level commands that actually do something with that 2GB chunk of data.
Krysto - Thursday, December 12, 2013 - link
In that case, there's also Cortex A57, which will be faster than anything Atom can do.
But my guess is he was referring to energy efficiency, and if the extreme energy efficiency of Cortex A53 can be more useful in servers than perhaps Cortex A57.
shodanshok - Tuesday, December 10, 2013 - link
Hi, it would be interesting to know two things:
- the cache memories (L1/L2) are write-back or write-through? Inclusive or exclusive?
- multiprocessor capabilities are limited to 4 cores or they can scale to 8+ cores without additional glue logic?
Thanks.
SleepyFE - Tuesday, December 10, 2013 - link
Hi.
I know the second one. The new big.LITTLE spec allows the use of all 8+ cores.
fteoath64 - Thursday, December 12, 2013 - link
The cache coherency bus (CCN) supports a maximum of 8 cpu-id per socket. That is why.
L2 cache is actually a RAM accelerator. Filling cache with data (in and out) allows for interleaved and delayed writes to slow memory at roughly the cache speed. This means an order of magnitude faster since most L2 caches get 95% hits anyways. Branch-prediction logic will reduce the stalling of the pipeline and cache misses, thus enhancing the performance. Yes, server apps needing lots of RAM means the cache size and efficiency are vitally important there ...
Peter Greenhalgh - Wednesday, December 11, 2013 - link
Hi Shodanshok,
All cacheable attributes are supported, but Cortex-A53 is optimised around write-back, write-allocate. The L2 cache is inclusive on the instruction side and exclusive on the data side.
A Cortex-A53 cluster only supports up to 4-cores. If more than 4-cores are required in a platform then multiple clusters can be implemented and coherently connected using an interconnect such as CCI-400. The reason for not scaling to 8-cores per cluster is that the L2 micro-architecture would need to either compromise energy-efficiency in the 1-4 core range to achieve performance in the 4-8 core range, or compromise performance in the 4-8 core range to maximise energy-efficiency in the 1-4 core range. This isn’t a hard and fast rule for all clusters, but is the case for a cluster at the Cortex-A53 power/performance point. For the majority of mobile use cases it is best to focus on energy efficiency and enable more than 4-cores through multi-cluster solutions.
shodanshok - Wednesday, December 11, 2013 - link
Thank you very much Peter :)
lukarak - Tuesday, December 10, 2013 - link
We have seen MediaTek introducing an 8xA7 SOC, instead of going to the big.LITTLE configuration of some sorts. Do you expect the same thing to happen with the A53 and A57 generation for low budget SOCs or will this generation's combo be a little easier and cheaper to implement?
Krysto - Tuesday, December 10, 2013 - link
If it includes A57, it's high-end by default. That chip you're talking about isn't big.Little, nor does it contain Cortex A15 in it. It's an 8-core Cortex A7 chip, so yes, I assume Mediatek will make another 8-core one with Cortex A53, but I wouldn't exactly call it high-end, more like mid-to high-end.
MrSpadge - Tuesday, December 10, 2013 - link
Interesting: you apparently completely misunderstood his question, yet "I assume Mediatek will make another 8-core one with Cortex A53" is what I would answer as well. 8 smaller cores are cheaper than 2*4 in big.LITTLE and do sound impressive to the uninformed.
lukarak - Tuesday, December 10, 2013 - link
Exactly, I was wondering that while today it is cheaper/easier to make 8x A7 than say 2x A15 2xA7 big.LITTLE, will that be the case with A53 and A57?
Wilco1 - Tuesday, December 10, 2013 - link
Big cores are larger than small cores (surprise!), so the SoC will be more expensive to produce if it has big cores rather than only little cores. But then again it will be faster too.
lukarak - Wednesday, December 11, 2013 - link
Yes, but I'm talking about an 8 core A7 vs 4 core A15/A7 combo, as in 2 A15 cores and 2 A7 cores in big.LITTLE. So it's not the same number of cores.
Wilco1 - Thursday, December 12, 2013 - link
Excluding L2, A15 is about 4 times as large as A7 (http://chip-architect.com/news/2013_core_sizes_768... So 2xA15 + 2xA7 is about the size of 10xA7, ie. larger than 8xA7. A15 will also need a larger L2 than A7 due to its higher performance.
Peter Greenhalgh - Wednesday, December 11, 2013 - link
Hi Lukarak,
We expect to see a range of platform configurations using Cortex-A53. A 4+4 Cortex-A53 platform configuration is fully supported and a logical progression from a 4+4 Cortex-A7 platform. A Cortex-A57 in the volume smartphone markets is less likely, but that’s a decision in the hands of the ARM partners. It will be interesting to see the range of Cortex-A53 platforms and configurations announced by partners over the coming months.
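As an aside on the die-area comparison earlier in this sub-thread, the arithmetic can be sketched in a few lines of C++. The function name `area_in_a7_units` is hypothetical, and the 4:1 ratio is simply the rough A15-to-A7 figure (excluding L2) quoted by Wilco1, not a measured number.

```cpp
#include <cassert>

// Core area of a cluster mix, expressed in "A7 equivalents".
// big_to_little_ratio is an assumed area ratio (about 4 for A15 vs A7,
// excluding L2, per the estimate quoted in the thread).
double area_in_a7_units(int big_cores, int little_cores,
                        double big_to_little_ratio) {
    return big_cores * big_to_little_ratio + little_cores;
}
```

Calling `area_in_a7_units(2, 2, 4.0)` gives 10 A7 equivalents against 8 for `area_in_a7_units(0, 8, 4.0)`, which is the "2xA15 + 2xA7 is larger than 8xA7" point, before any extra L2 is counted.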
ehsan.nitol - Tuesday, December 10, 2013 - link
Hi there, I have some questions.
We have already seen how well Qualcomm's Cortex A7 can perform thanks to Moto G. How much will it improve with the new Cortex A53? What will be the core and performance wise difference? How will you compare it against Cortex A9, A12 and A15 in terms of performance, battery consumption and all.
With the Exynos Octa core processor Battery Test we haven't seen much battery improvements compared to Qualcomm's Snapdragon 600 and 800 Processor. How will it perform this time?
What is ARM planning do with its Mali GPU? What will be next after Cortex A53 and A57?
deputc26 - Tuesday, December 10, 2013 - link
This what will the IPC improve,nets be from A7 to A53
deputc26 - Tuesday, December 10, 2013 - link
Typing on an iPad, I blame Tim Cook for the errors above.
Peter Greenhalgh - Wednesday, December 11, 2013 - link
Hi Ehsan,
Cortex-A53 has the same pipeline length as Cortex-A7 so I would expect to see similar frequencies when implemented on the same process geometry. Within the same pipeline length the design team focussed on increasing dual-issue, in-order performance as far as we possibly could. This involved symmetric dual-issue of most of the instruction set, more forwarding paths in the datapaths, reduced issue latency, larger & more associative TLB, vastly increased conditional and indirect branch prediction resources and expanded instruction and data prefetching. The result of all these changes is an increase in SPECInt-2000 performance from 0.35-SPEC/MHz on Cortex-A7 to 0.50-SPEC/MHz on Cortex-A53. This should provide a noticeable performance uplift on the next generation of smartphones using Cortex-A53.
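For a sense of scale, the two SPEC/MHz figures above can be turned into a relative uplift with a short C++ sketch. The helper names are hypothetical, and the linear scaling of score with clock in `estimated_score` is a simplifying assumption that ignores memory-system effects.

```cpp
#include <cassert>
#include <cmath>

// Fractional improvement of `after` over `before` (0.25 means +25%).
double relative_uplift(double before, double after) {
    return after / before - 1.0;
}

// Estimated SPECint2000 score at a given clock, assuming the score scales
// linearly with frequency (a simplification).
double estimated_score(double spec_per_mhz, double mhz) {
    return spec_per_mhz * mhz;
}
```

`relative_uplift(0.35, 0.50)` works out to roughly 0.43, i.e. about 43% more SPECint2000 per MHz, which at similar frequencies would translate into a similar gain in score under the linear-scaling assumption.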
jeffkibuule - Tuesday, December 10, 2013 - link
We shouldn't infer anything from there being a nice-sized gap between Cortex A53 and A57 which might be the 64-bit version of Cortex A12 which in a hypothetical universe might be named Cortex A55, should we ? :)
r3loaded - Tuesday, December 10, 2013 - link
As someone who's worked at ARM fairly recently, plenty of activity was happening around the A53/A57 as well as a new M-class core (it's supposed to be M5 or M7, still undecided) but I never heard anything about a hypothetical mid-range A55. Right now, it's just a gap in the naming scheme, so it might be used in future.
Techguy99X - Tuesday, December 10, 2013 - link
Why are the current A7 quad core phones performing similar to the A9 quad (exynos 4412 , tegra 3), although A9 is more advanced and OoO? What is the main difference between A5 and A7, because the A7 is just a bit faster than the A5?
Peter Greenhalgh - Wednesday, December 11, 2013 - link
Hi Techguy99X,
Overall platform performance is dependent on many factors including processor, interconnect, memory controller, GPU, video and more. While the Cortex-A9 is a higher performance processor both in IPC and frequency, ARM partners are continuously improving their platforms and porting them to new process geometries. This allows a new generation Cortex-A7 based platform to improve on an older generation Cortex-A9 based platform.
Compared to Cortex-A5, Cortex-A7 increased load-store bandwidth, allowed more common data-processing operations to dual-issue and made some small improvements in the branch-predictors.
bdub951 - Tuesday, December 10, 2013 - link
Lady luck bless this post!
Techguy99X - Tuesday, December 10, 2013 - link
What is the power hit from A5, A7, A9, A12, A15, A53 and A57? And also the die area?
deputc26 - Tuesday, December 10, 2013 - link
Great questions
shing3232 - Tuesday, December 10, 2013 - link
+1discotea - Tuesday, December 10, 2013 - link
Why isn't there a more logical naming convention for the ARM cores? I can't tell which is faster, an A7 or an A9 core. It seems like you're getting better with the A15 being faster than an A9 or A7.
Wolfpup - Tuesday, December 10, 2013 - link
It sort of seems like A7 was just named that because they ran out of numbers. My understanding is it was designed to try to get as much of A9's performance as possible in a smaller die, and as such it should be better than A8 (and can have multi-core versions).
shing3232 - Tuesday, December 10, 2013 - link
A7 usually has almost the same performance as A8, although with fewer execution resources, but A7 has a much more efficient BPU.
Techguy99X - Tuesday, December 10, 2013 - link
What is the main difference in A5, A7, A9, A12, A15, A53 and A57 for single threaded performance? Is it still worth having 4+4 big.LITTLE compared to 8x A7 ?
archgeek - Tuesday, December 10, 2013 - link
1. Do you think that Cortex A12 should have been announced earlier (as the gap between A9 and A15 was huge and something in between those two was required)?
2. Similar to A12, will there be anything in between A53 and A57 in the near future?
3. Instead of using a big.LITTLE config and putting in 8 cores (4 A15 and 4 A7), why can't we have efficient power gating and other innovative techniques inside the high performance cores so that they can run as efficiently as A7 cores? Die area of an individual A15 will increase but we can save some total area as only 4 A15 will be required. How difficult is it from an architecture point of view?
MrSpadge - Tuesday, December 10, 2013 - link
The answer is the same as to "Why can't Intel put a Haswell into my phone?" When you need T transistors to implement some functionality in some basic way you might get performance P at power consumption Q. Implement some clever trick which increases performance to 1.5*P. This will probably require more than 1.5*T transistors, probably more like 2*T. And hence power consumption will increase to approximately 2*Q as well. And you can't power-gate this additional power draw because you actually need those transistors to work in order to perform the desired function.
That's why building ever more complex single cores comes at the price of power efficiency.
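The trade-off in that comment can be put into numbers with a tiny C++ sketch. The function name `relative_efficiency` is hypothetical, and the 1.5x performance and 2x power factors are the commenter's illustrative guesses, not data for any real core.

```cpp
#include <cassert>

// Performance-per-watt of a more complex core relative to the baseline.
// perf_scale and power_scale are multiples of the baseline's performance P
// and power Q (illustrative values only).
double relative_efficiency(double perf_scale, double power_scale) {
    return perf_scale / power_scale;
}
```

With the figures from the comment, `relative_efficiency(1.5, 2.0)` is 0.75: the bigger core is 50% faster but delivers only three quarters of the baseline's performance per watt.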
barleyguy - Tuesday, December 10, 2013 - link
When designing a processor and deciding on which performance attributes to emphasize, do you target current workloads for short term market concerns, or do you target possible future workloads for the market a year or two from now? Or is performance tuning more workload agnostic, and do you say "I want this to be fast for everything"?
For example, since ARM processors are very popular in the Android market, do you tune for content consumption and gaming? Or since Android may be trending towards more of a primary computing device in the future, is it important to tune for desktop applications?
And finally, what are the considerations of performance tuning for thermally constrained devices?
Peter Greenhalgh - Wednesday, December 11, 2013 - link
Hi Barleyguy,
A good question. A general purpose processor has to be good on all workloads. After all, we expect to see Cortex-A53 in everything from mobile phones to cars, set-top-box to micro-servers and routers. However we do track the direction software is evolving – for example the increased use of Javascript which puts pressure on structures such as indirect predictors. Therefore we design structures within the processor to cope with future code as well as existing expected workloads.
psychobriggsy - Tuesday, December 10, 2013 - link
Do you have any plans to support various forms of turbo functionality within your next generation ARM cores? As a potential example, in a 28nm quad-core A53 setup at 1.2GHz, you could support dual-core at >1.4GHz and single core at >1.6GHz within the same power consumption (core design allowing, of course), yet single threaded performance could improve significantly.
ARM cores have been historically low power, however that doesn't mean there aren't more power savings to be made. Examples include deeper sleep states, power gated cores, and so on - features that Intel and AMD have had to include in order to reduce their TDPs whereas ARM cores haven't needed them (yet). What are the future power saving methods that ARM is considering for its future cores (that you can give away)?
Peter Greenhalgh - Wednesday, December 11, 2013 - link
Hi Psychobriggsy,
A Turbo mode is typically a form of voltage overdrive for brief periods of time to maximise performance, which ARM partners have been implementing on mobile platforms for many years. Whether this is applied to 1,2 or more cores is a decision of the Operating System and the platform power management software. If there is only one dominant thread you can bet that mobile platforms will be using Turbo mode. Due to the power-efficiency of Cortex-A53 on a 28nm platform, all 4 cores can comfortably be executing at 1.4GHz in less than 750mW which is easily sustainable in a current smartphone platform even while the GPU is in operation.
In terms of further power saving techniques, power gating unused cores is a technique that has been used since the first Cortex-A9 platforms appeared on the market several years ago. The technique is so fundamental that I think many in the mobile industry use it automatically and forget that it’s a highly beneficial power saving technique. But you are correct that there is more mileage to come from deeper sleep states which is why both Cortex-A53 and Cortex-A57 support state retention techniques in caches and the NEON unit to further improve leakage power.
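The quad-core power figure above implies a simple per-core upper bound, sketched here in C++. The helper name `per_core_budget_mw` is hypothetical, and the even split across cores is an assumption; real clusters also spend power in the shared L2 and coherency logic.

```cpp
#include <cassert>

// Upper bound on average power per core, assuming the cluster budget is
// split evenly across cores (an assumption; shared logic is ignored).
double per_core_budget_mw(double cluster_budget_mw, int cores) {
    return cluster_budget_mw / cores;
}
```

`per_core_budget_mw(750.0, 4)` gives 187.5, so "less than 750mW" for four cores at 1.4GHz means under roughly 190mW per core on average.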
gregounech - Tuesday, December 10, 2013 - link
We see Qualcomm winning most designs these days with their implementation of ARM's ISAs, does ARM wish to bring back their own architecture (not just the ISA) to the front line of mobile products (consumer ones) ?
What is your view on Intel newest mobile architecture BayTrail and how do you see Intel (and others) competing with ARM on the consumer grade mobile products in the near future ?
adrian_sev - Tuesday, December 10, 2013 - link
Are there any plans for desktop/server designs for ARM cpus? (i mean with pci-e lanes, ddr4 controllers, strong FPU (AVX2 like), without baseband logic and any other bloat hardware).
If yes, is there a time frame?
Thank you!
ddriver - Tuesday, December 10, 2013 - link
Take a look at TI's KeyStone II architecture, there are already quad core A15 based chips with powerful DSP, PCIE support, multiple 10 and 1 GBit interfaces with hardware accelerated packets and security and mind-boggling on-chip interconnect capable of over 2 TBit per second.
But it is a strictly server chip, considering it doesn't even have a GPU, and it is not like consumer devices need multiport 10 GBit switches embedded.
Xebec - Tuesday, December 10, 2013 - link
Peter, thanks for offering your time to Anandtech!
I was curious if you could talk a bit about how easy/difficult A53-derived SoCs might be to integrate into solutions that are already using A7/A9 type chips? i.e. Devices like Beagleboards, Raspberry Pis, ODROIDs, etc. Is there anything that makes the A53 particularly difficult or easy to suit to these types of devices?
Also, for Micro and "regular" servers, do you see A57/A53 big.LITTLE being the norm, or do you anticipate a variety of A53-only and A57-only designs? Any predictions on market split between the A5x series here?
Respectfully,
John
Peter Greenhalgh - Wednesday, December 11, 2013 - link
Hi Xebec,
Cortex-A53 has been designed to be able to easily replace Cortex-A7. For example, Cortex-A53 supports the same bus-interface standards (and widths) as Cortex-A7 which allows a partner who has already built a Cortex-A7 platform to rapidly convert to Cortex-A53.
With servers I think we will see a mix of solutions. The most popular approach will be to use Cortex-A57 due to the performance that micro-architecture is capable of providing, but I still expect some Cortex-A53 servers and big.LITTLE too!
san_dehesa - Tuesday, December 10, 2013 - link
I would love to know ARM's position on open source GPU drivers now that Intel is putting a large amount of money and effort into developing theirs.
It seems to me that not taking the open source road for GPU drivers (as ARM is doing) is a big mistake, all the more so when the primary OS their hardware runs is Android.
ARM could create much better binaries for free with the work of other talented developers and get better integration with Android in the process. That will be a great selling point for any manufacturer and possible client!
The Lima project currently has better performance (5% more) than the closed driver distributed by ARM, and it is being coded in the developers' free time! It would be great to see both teams working together.
What are the reasons the driver is closed-source?
- Is it that they would reveal a lot of Mali's internal core?
- Is ARM afraid of IP lawsuits? Would staying closed source deter patent trolls?
Intel doesn't seem to have those problems.
PS: The question is not about whether the drivers should be open or not just because it is morally right or wrong. Obviously it would be nice for the clients and a good selling point, but I was wondering how ARM management sees one of their biggest competitors embracing open source when developing GPU drivers.
coder543 - Tuesday, December 10, 2013 - link
Yes! This post is excellent.
bji - Tuesday, December 10, 2013 - link
Seconded.
ddriver - Tuesday, December 10, 2013 - link
Intel makes a ton of money, plenty to purchase pretty much every IP it needs, plus it holds a lot of IP itself, so it can also trade IP with other vendors. In contrast ARM makes a modest profit, even though I guess it can still trade IP, since both nvidia and amd license ARM themselves.
But it is a good point, open source GPU drivers are a crucial step towards empowering Linux and ending the MS monopoly. It is hilarious that Linux powers like 99% of the supercomputers and like 1% of the personal computers. If Linux is good enough for supercomputers, it has got to be good enough for personal computers, but the lack of good and often any drivers is what cripples Linux for the regular user.
san_dehesa - Wednesday, December 11, 2013 - link
I have to wonder how long it will take ARM's legal department to review the driver's code and assess risks. Even if part of the blob is legally encumbered, they can still open up a big chunk and start benefiting from the advantages of open source development. There are already developers knowledgeable about their architecture.
ARM and other chip producers know how much of a pain it is to support a badly integrated blob of code. When an ARM customer has a problem with the driver, they have to communicate with ARM, wait till they receive an answer, wait till the problem is solved (if it is at all), and then integrate the new blob with their software. So many steps where you can fail! It takes so much time! Mobile products have a really short time-to-market. It would be so easy for everyone if they let their customers help out with the development. Plus, it is free!!
Samsung (ARM's biggest customer) has been testing Intel's products for a while and I am pretty sure that by now they have some developers who know Intel's driver architecture. Don't you think support and time-to-market would be one reason when deciding which platform to choose?
Peter Greenhalgh - Wednesday, December 11, 2013 - link
Hi San_Dehesa,
Good questions, but I'm not familiar with the ARM GPU graphics drivers. Perhaps persuade Anand for an Ask the Expert with one of the ARM graphics team? :)
san_dehesa - Thursday, December 12, 2013 - link
Haha, yes, you are right! Sorry for the mis-targeted question.
I appreciate the time you are spending here answering our questions. Thank you.
Doormat - Tuesday, December 10, 2013 - link
ARM CPU vendors (Qualcomm, Nvidia, etc) seem to be choosing slower quad core over faster dual core, and I suspect it's all a marketing game (e.g. more cores is better, see Motorola's X8 announcement of an "8 core" phone). Do those non-technical decisions impact the decisions of the engineers in developing the ARM architecture?
dishayu - Wednesday, December 11, 2013 - link
Care to present some examples, please?
NVidia used 4+1 A15 cores (fastest available at the time) for Tegra 4. And Qualcomm doesn't use generic ARM cores. They have their own (Krait) architecture and the most popular SoCs based on their fastest architectures (Krait 300/400) are almost exclusively quad-core.
Peter Greenhalgh - Wednesday, December 11, 2013 - link
Hi Doormat,
You are quite correct that there are a variety of frequencies and core-counts being offered by ARM partners. However, these do not have an effect on ARM-designed micro-architectures, as we must be able to support a variety of target frequencies and core-counts across many different process geometries.
Factory Factory - Tuesday, December 10, 2013 - link
How does designing a CPU "by hand" differ from using an automated layout tool? What sort of trade-offs does/would using automated tools cause for ARM's cores?
Second question: With many chips from many manufacturers now implementing technologies like fine-grained power gating, extremely fine control of power and clock states, and efficient out-of-order execution pipelines, where does ARM go from here to keep its leadership in low-power compute IP?
Peter Greenhalgh - Wednesday, December 11, 2013 - link
Hi Factory,
Hand layout versus automated layout is an interesting trade-off. From one perspective, full hand-layout for all circuits in a processor is rarely used now. Aside from cache RAMs which are always custom, hand-layout is reserved for datapath and queues which are regular structures that allow a human to spot the regularity and ‘beat’ an automated approach. However, control logic is not amenable to hand-layout as it’s very difficult to beat automated tools which means that the control logic can end up setting the frequency of the processor without significant effort.
In general the benefit from hand-layout has been reducing in recent years. Partly this is due to the complexity of the design rules for advanced process generations reducing the scope for more specific circuit tuning techniques to be used. But another factor is the development of advanced standard cell libraries that have a large variety of cells and drive strengths which lessens the need for special circuit techniques. When we’re developing our processors we’re fortunate to have access to a large team in ARM designing standard cell libraries and RAMs who can advise us about upcoming nodes (for example 16nm and 10nm). In turn the processor teams can suggest & trial new advanced cells for the libraries which we call POPs (Processor Optimization Packages) that improve frequency, power and area.
A final trade-off to consider is process portability. After an ARM processor is licensed we see it on many different process geometries which is only possible because the designs are fully synthesizable. For example, there are Cortex-A7 implementations on all the major foundries from 65nm to 16nm. In combination with the advanced standard cell libraries for these processes there is little need to go to a hand-layout approach and we instead enable our partners to get to market more rapidly on the process geometry and foundry of their choosing.
mrdude - Tuesday, December 10, 2013 - link
A few questions:
When can we expect an end to software-based DVFS scaling? It seems to me to be the biggest hurdle in the ARM-sphere towards higher single-threaded performance.
The current takes on your big.LITTLE architecture have been somewhat suboptimal (the Exynos cache flush as an example), so what can we expect from ARM themselves to skirt/address these issues? It seems to me to be a solid approach given the absolutely minuscule power and die budget that your LITTLE cores occupy, but there are still the issues of software and hardware implementation before it becomes widely accepted.
Though this question might be better posed to the GPU division, are we going to be seeing unified memory across the GPU and CPU cores in the near future? ARM joining HSA seems to point to a more coherent hardware architecture and programming emphasis.
Pardon the grammatical errors as I'm typing this on my phone. Big thanks to Anand and Peter.
Peter Greenhalgh - Wednesday, December 11, 2013 - link
Hi Mrdude,
While there are platforms that use hardware event monitors to influence DVFS policy, this is usually underneath a Software DVFS framework. Software DVFS is powerful in that it has a global view of all activity across a platform in time whereas Hardware DVFS relies on building up a picture from lots of individual events which have little to no relationship with one another. As an analogy, Software DVFS is like directing traffic from a helicopter with a very clear view of what is going on on all roads in a city (but greater latency when forcing a change), whereas Hardware DVFS is like trying to pool information from hundreds of traffic cops all feeding traffic information in from their street corner. A traffic cop might be able to change traffic flow on their street corner, but it may not be the best policy for the traffic in the city!
Like all things in life, there are trade-offs with neither approach being absolutely perfect in all situations and hardware DVFS solutions rely on the Software DVFS helicopter too.
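As a rough illustration of the "helicopter" view described above, here is a minimal sketch of a software DVFS policy that looks at the load on every core before choosing one frequency. The operating-point table, the 80% utilisation target and the function name are all hypothetical; this is not how any real kernel governor is implemented.

```python
from typing import Dict, List

# Hypothetical operating points in MHz, lowest to highest (invented values).
OPP_MHZ: List[int] = [400, 800, 1100, 1400]


def pick_frequency(core_loads: Dict[str, float], current_mhz: int,
                   target_util: float = 0.8) -> int:
    """Return the lowest operating point that keeps the busiest core
    at or below target_util.

    core_loads maps a core name to its utilisation (0.0 to 1.0) as measured
    at current_mhz. Seeing every core at once is the "global view" part.
    """
    busiest = max(core_loads.values())
    # Frequency at which the busiest core would sit at the target utilisation.
    needed_mhz = busiest * current_mhz / target_util
    for mhz in OPP_MHZ:
        if mhz >= needed_mhz:
            return mhz
    return OPP_MHZ[-1]  # demand exceeds the table: run flat out


# Mostly idle at 1400 MHz: drop to the lowest point.
print(pick_frequency({"cpu0": 0.2, "cpu1": 0.1}, current_mhz=1400))
# One core nearly saturated at 800 MHz: step up.
print(pick_frequency({"cpu0": 0.95, "cpu1": 0.3}, current_mhz=800))
```

The latency cost mentioned in the analogy shows up here too: the policy only reacts when it is next sampled, whereas a hardware monitor could respond to a single event immediately.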
nafhan - Tuesday, December 10, 2013 - link
This may not be something you can answer, but is there a timeline for a 64 bit follow on to Krait?
Also, do you have any thoughts regarding clock speed vs. instruction width scaling and which route Qualcomm plans to take (with Apple going the instruction width route with the A7 and Qualcomm currently going the clock speed route with recent SoCs)?
coder543 - Tuesday, December 10, 2013 - link
ARM != Qualcomm. Qualcomm designs their own stuff, this guy is from ARM. Even if he knew the answers to those questions, they're neither on topic, nor is he at liberty to discuss them. He probably doesn't even want to talk about that, considering Qualcomm isn't exactly giving ARM any compliments by throwing out all of ARM's work and starting from scratch.
nafhan - Tuesday, December 10, 2013 - link
I just finished reading the Snapdragon 410 article, and I thought I read Qualcomm in here somewhere... you are absolutely correct.
Still, I don't think licensing ARM's ISA (a la Krait) is an insult to ARM. That's a big part of their business model.
Fergy - Tuesday, December 10, 2013 - link
Really try to inform yourself a little bit better before asking all these questions. Krait 600 and 800 that are in most phones and tablets are 100% new designs from Qualcomm. Krait 410 is not a new design and is licenced from ARM.
phoenix_rizzen - Wednesday, December 11, 2013 - link
There's no such thing as "Krait 600" or "Krait 800". You're mixing up CPU and SoC names.
The newest Krait CPU from Qualcomm is the Krait 450 CPU, part of the upcoming Snapdragon 805 SoC.
The current Krait CPUs available in phones are the Krait 200 (Snapdragon S4 Pro), Krait 300 (Snapdragon S600), and Krait 400 (Snapdragon S800).
Yes, their naming scheme is horrible.
nafhan - Wednesday, December 11, 2013 - link
I mistakenly thought we had someone from Qualcomm answering the questions. I didn't say anything about the "Krait" (or Snapdragon) 410 having a Qualcomm designed CPU.
hlovatt - Tuesday, December 10, 2013 - link
With Apple and yourselves taking different approaches to ARM64, do you have any thoughts on what different trade-offs you each made and what the knock-on effects are in terms of where the two implementations might shine?
Thanks for taking questions on AnandTech.
Try-Catch-Me - Tuesday, December 10, 2013 - link
What do you have to do to get into chip design? Is it really difficult to get into companies like ARM?
Peter Greenhalgh - Wednesday, December 11, 2013 - link
An Engineering degree in electronics and/or software for a start. Passion for micro-architecture & architecture certainly helps! :)
mercury555 - Tuesday, December 10, 2013 - link
Peter:
What emotion comes to mind on the fact that ARM wishes to forget the big.LITTLE with a 64 bit equivalent of A12 limited to a Quad-Core configuration for consumer electronics?
Thanks.
Fergy - Tuesday, December 10, 2013 - link
Where did you read that ARM is stopping with big.LITTLE?
Peter Greenhalgh - Wednesday, December 11, 2013 - link
Hi Mercury,
ARM continues to believe in big.LITTLE which is why we improved on interoperability in the Cortex-A53 and Cortex-A57 generation of processors. In future processor announcements I’m sure you’ll see our continued focus on big.LITTLE as a key technology that enables best possible energy efficiency.
mercury555 - Thursday, December 12, 2013 - link
Thank you for taking out time to answer.
mrtanner70 - Tuesday, December 10, 2013 - link
1. We don't seem to have quite seen the promised power savings for big.LITTLE yet (thinking of the Exynos 5420 in particular since it has HMP working, not sure if any devices have the correct Linux kernel yet though). Are you still as bullish on this aspect of the big.LITTLE story?
2. Are there particular synergies to using Mali with the CCI vs. other brands of GPU?
3. What is your general response to the criticism of big.LITTLE that has come out of Intel and Qualcomm? Intel, in particular, tends to argue dynamic frequency scaling is a better approach.
Cheers
Peter Greenhalgh - Wednesday, December 11, 2013 - link
Hi MrTanner,
In answer to (3), DVFS is complementary to big.LITTLE, not a replacement for it.
A partner building a big.LITTLE platform can use DVFS across the same voltage and frequency range as another vendor on the same process with a single processor. The difference is that once the voltage floor of the process is reached with the 'big' processor the thread can be rapidly switched to the 'LITTLE' processor further increasing energy efficiency.
Mobile workloads have an extremely large working range from gaming and web browsing through to screen-off updates. The challenge with a single processor is that it must compromise between absolute performance at the high-end and energy efficiency at the low-end. However a big.LITTLE solution allows the big processor to be implemented for maximum performance since it will only be operating when needed. Conversely the LITTLE processor can be implemented for best energy efficiency.
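The switch described above, where a thread moves to the LITTLE processor once the big processor has hit its voltage floor, can be sketched as picking the cheapest operating point across both clusters. Every number in the table below is invented purely for illustration, and `choose_point` is a hypothetical helper, not part of any ARM or Linux scheduler API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OperatingPoint:
    cluster: str      # "big" or "LITTLE"
    perf: float       # relative throughput, arbitrary units (hypothetical)
    power_mw: float   # power at this point (hypothetical)


# Invented table: DVFS points for each cluster.
POINTS: List[OperatingPoint] = [
    OperatingPoint("LITTLE", 1.0, 50.0),
    OperatingPoint("LITTLE", 2.0, 120.0),
    OperatingPoint("big", 2.5, 300.0),    # big core at its voltage floor
    OperatingPoint("big", 4.0, 600.0),
    OperatingPoint("big", 6.0, 1200.0),
]


def choose_point(required_perf: float) -> OperatingPoint:
    """Pick the lowest-power point, on either cluster, that meets demand."""
    candidates = [p for p in POINTS if p.perf >= required_perf]
    if not candidates:
        # Demand exceeds everything available: use the fastest point.
        return max(POINTS, key=lambda p: p.perf)
    return min(candidates, key=lambda p: p.power_mw)


print(choose_point(3.0))   # needs the big cluster
print(choose_point(1.5))   # below the big floor, so LITTLE is cheaper
```

With a single processor the table would stop at the "big" voltage floor; the LITTLE entries are what extend the range downwards.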
Krysto - Tuesday, December 10, 2013 - link
So far it doesn't look like any chip maker is in a hurry to go to 20nm next year, even with the jump to ARMv8. Can he share his opinion on why this is happening? Why aren't we seeing all ARMv8 chips arrive at 20nm, as it was supposed to happen (just like the previous generation, Krait/Cortex A15, jumped to 28nm)?
As a follow-up question, since I assume he'll hint at a combination of failures from both fabs and chip OEMs to move to 20nm fast enough, will this situation be rectified at least in 2015, with an early push to 14/16nm FinFET? Can we expect chip makers to move to that in EARLY 2015? (Nvidia has kind of hinted at that, but who can trust Nvidia with keeping their own schedule?!)
name99 - Wednesday, December 11, 2013 - link
"So far it doesn't look like any chip maker is in a hurry to go to 20nm next year, even with the jump to ARMv8."
This is the kind of thing that NO-ONE is going to talk about until they have things working. Assuming otherwise is just silly.
Take, for example, Apple. They are quite likely porting the A7 to TSMC's 20nm process as we speak, with the goal of both learning about the process and introducing a speed-bumped iPad lineup (maybe even also a speed bumped iPhone) in Q2 2014. (They did the same thing with the die-shrunk A5, though in that case they did not publicize it as a speed bump, it's just that newer iPads had better battery life.)
But they're not going to tell anyone about this. If the project slips, they'll look dumb. They may prevent a whole lot of people buying today, then those people get sick of waiting and buy Android (Osborne effect), etc. Meanwhile there is no upside to telling the world about this move.
Krysto - Tuesday, December 10, 2013 - link
Oh, and one more question. FreeBSD developers have just said that they will stay away from using Intel and VIA's hardware encryption features, because they could be backdoored by the NSA.
http://arstechnica.com/security/2013/12/we-cannot-...
ARM is from UK - the home of GCHQ, which is just as bad, if not worse, than NSA - so is there a way to reassure us that ARMv8, which comes with hardware encryption, is free of such backdoors? Is he willing to go on the record with that?
I'm sure he realizes that if people stop trusting these features, and they (including Intel, AMD, VIA, etc) can't prove to us that their hardware isn't in fact backdoored, it will just mean NO ONE will end up using those features and everyone will stick with a software solution instead, which means their hardware encryption will just waste space on the die. So I hope they take this issue very seriously, for their sake, too.
smoohta - Tuesday, December 10, 2013 - link
Sorry for piggybacking on Krysto's post but I'd love it if Peter talks not only about hardware RNGs (that's what I assume Krysto meant by "hardware encryption") but also about general security features in ARM - specifically TrustZone (which is what Apple uses in the iPhone's Touch ID solution).
Thanks!
syxbit - Tuesday, December 10, 2013 - link
The Cortex-A15 has really struggled on mobile. Neither Tegra 4 nor Exynos 5 (nor OMAP 5 cough cough) have sold well at all compared to Snapdragon 600/800.
Possibly related to their lack of success with A15, Nvidia and Samsung (and AMD) have already announced that they are going to be designing their own CPU (rather than an ARM design).
Is this worrying to ARM? Doesn't it show that big.LITTLE was a mistake required to cover up A15 power hunger?
Krait 400 proves that big.LITTLE is not needed to be both powerful and very power efficient.
How will A57 succeed where A15 failed?
marblearch - Monday, December 30, 2013 - link
Hit the nail on the head there. I bet they say A15 wasn't really meant for mobile but then there's a glaring hole after A9 until the only just announced A12, so I reckon A15 was supposed to be for mobile but they messed it up and dreamt up big.LITTLE to cover themselves. Not that big.LITTLE is a crazy idea, just so far it's a botch job. And A53 and A57 seem to be just incrementals of A7, A15 with ARMv8 chucked in so I wouldn't expect any improved success with these.
silenceisgolden - Tuesday, December 10, 2013 - link
Since ARM released the big.LITTLE guidelines, is there a plan for ARM to also release guidelines for processor and co-processor implementations such as Motorola's 'X8' system and Apple's 'M7' co-processor?
jameskatt - Tuesday, December 10, 2013 - link
How does the A53 compare in performance to Apple's custom A7 64-bit Dual-core Chip?
name99 - Wednesday, December 11, 2013 - link
It's far worse. This is not a secret and not worth asking about. The two target completely different spaces --- expensive and high performance vs dirt cheap and adequate performance.
sverre_j - Tuesday, December 10, 2013 - link
I have just been through the extensive ARMv8 instruction set (and there must be several hundred instructions in total), so my question is whether ARM believes that compilers, such as gcc, can be set up to take advantage of most of the instruction set, or whether one will still depend on assembly coding for a lot of the advanced stuff.
Peter Greenhalgh - Wednesday, December 11, 2013 - link
Hi Sverre,
The AArch64 instruction set in the ARMv8 architecture is simpler than the ARMv7 instruction set. The AArch32 instruction set in the ARMv8 architecture is an evolution of ARMv7 with a few extra instructions. From this perspective, just the same as compilers such as GCC can produce optimised code for the billions of ARMv7 devices on the market I don’t see any new challenge for ARMv8 compilers.
SleepyFE - Tuesday, December 10, 2013 - link
Hello.
Now that you have a 64bit ISA are you planning something bigger (size wise)? So far ARM CPUs are built into SoCs but I would like to know if you are going to make an A1000 core that will be four large cores with Mali 600 and will compete for a space in the desktops. It makes sense since all major systems (Linux, WindowsRT, iOS) are already running on ARM CPUs.
This is less of a question and more of a request.
name99 - Wednesday, December 11, 2013 - link
The time is not yet right.
The top of the line ARM ISA CPU, Cyclone, has IPC comparable with Intel --- which is great --- BUT at a third of Intel's frequency. Apple (and the rest of ARM) have to get to at least close to Intel's frequency while not losing that IPC.
Not impossible, but not trivial; and until that happens the CPUs are just not that interesting for the desktop.
The first step (which I expect Apple to take with the A8) would be an architecture like Sandy Bridge and later:
- smallish high bandwidth per core L2's
- unified large L3 shared by all cores and graphics [Cyclone has something that plays this role, but it's effectively an "off-chip" cache as far as the cores are concerned, being about 150 cycles away from the cores]
- ring (or something similar) tying together the cores, L3 slices, graphics and memory controller
Done right, I expect this gets Apple to same IPC as before, but 2x the frequency, in 20nm FinFET.
Of course that's still not good enough. Then for the A9 they have to add in a new µArch to either ramp up the IPC significantly, or improve the circuits and physical enough to turbo up to near 4GHz for reasonably long periods of time...
As I said, not impossible, but there is still plenty of work to do.
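The argument above boils down to a first-order model where single-thread performance is roughly IPC times clock frequency. A quick sketch of that arithmetic, using only the ratios stated in the comment (equal IPC, one third of the frequency today, double that later); the model ignores memory latency, turbo behaviour and everything else that makes real comparisons messy.

```python
def relative_performance(ipc_ratio: float, freq_ratio: float) -> float:
    """First-order single-thread throughput relative to a reference core.

    Both arguments are ratios against the reference (1.0 means equal).
    This is a simplification: performance ~ IPC x frequency.
    """
    return ipc_ratio * freq_ratio


# Today, per the comment: comparable IPC at a third of the frequency.
today = relative_performance(ipc_ratio=1.0, freq_ratio=1 / 3)
# Same IPC at 2x the current frequency.
doubled = relative_performance(ipc_ratio=1.0, freq_ratio=2 / 3)

print(f"today:   {today:.2f}x of the reference")    # 0.33x
print(f"doubled: {doubled:.2f}x of the reference")  # 0.67x
```

Which is why doubling frequency alone is described as "still not good enough": under this model it only closes the gap to about two thirds.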
SleepyFE - Wednesday, December 11, 2013 - link
No one said the first CPU has to be perfect. Considering low end PCs and laptops it's a good idea to start selling to OEMs. That way you can get the ball rolling on software development. Also the GPU does not have to be good since you would use a discrete one (might finally force AMD and Nvidia to write good Linux drivers).
mercury555 - Wednesday, December 11, 2013 - link
Though that seems like a logical progression for ARM, single thread performance is nowhere close to that of Intel.
MrSpadge - Tuesday, December 10, 2013 - link
Topic: ultimate Fusion
Given the flexibility ARM has with the instruction set (compared to x86) I would like to know where ARM sees itself going mid- to long-term. The specific question being: how can we get strong single threaded performance (like in a fat Intel core) and a massive amount of energy efficient number crunchers for parallel tasks (like GPU cores)? The current state of treating them as co-processors (CUDA, OpenCL etc.) and trying to bring them closer to the cores (HSA) ultimately seem like crutches to me, because it still takes significant effort on the software side to actually use those units.
What I imagine as the "ultimate Fusion" of these resources is a group of fat integer cores (like in AMD's modules, Haswells with 2/4 way HT, with big.LITTLE.. whatever you want) sharing a large pool of GPU-shader-like number crunchers, presented like regular floating point units are now. Dispatching instructions to these cores should be as simple as using the FPU from the software side. Sure, latency would go up (hence some faster scalar local units might still be needed) but throughput could go up by orders of magnitude. Even a single thread might get access to all of them, or in case of many threads there'd be excellent load-balancing. The GPU and maybe other functions would use them as well. The number of integer / FP cores / execution units could relatively easily be scaled, depending on the application (server, HPC, all-round).
Intel and AMD have the hardware building blocks, but apart from the next version of SSE/AVX I don't think there is any chance to implement such functionality in x86 efficiently. And it surely wouldn't be backwards compatible, hence take years or tens of years to trickle down the software stacks. The ARM software is much younger and more agile, as Apple's quick and almost completely seamless transition to 64 bit iOS has shown. I'd even say: if anyone could pull something like this off it's ARM. What do you think?
name99 - Wednesday, December 11, 2013 - link
I wonder if this sort of fusion is ultimately a bad idea.
Even at the basic HW level, tying the GPU in with the CPU is tough because the two are so different, and it doesn't help to destroy the primary value of the GPU in this quest.
Specifically, using the same memory space clearly has value (in performance and programmer ease). Which means using the same virtual address space and TLBs.
Again not in itself too problematic.
But then what if we decide that we use that TLB to support VM on the GPU side? Now life gets really tough because GPUs are not set up for precise exceptions...
(Using the TLB to track privilege violations is less of a problem because no-one [except debuggers!] cares if the exception generated bubbles up to the OS hundreds of instructions away from its root cause.)
WRT the more immediate issue, the implication seems to be that a unified instruction set could be used to drive both the CPU and GPU. While this sounds cool, I fear that it's the same sort of issue --- a TREMENDOUS amount of pain to solve a not especially severe problem.
The issue is that the processing model of the GPU is just different from a CPU --- that's a fact. Making it the same as a CPU is to throw away the value of the GPU. But since these models are so different, the only feasible instructions would seem to be some sort of "create a parameter block then execute it" instructions --- at which point, how is this any more efficient or useful than the current scheme of using the existing CPU instructions to do this?
I think we can gauge the value of this idea, to some extent, by the late Larrabee. Intel seems (as far as I can tell) to have started with a plan vaguely like what's described --- let's make the GPU bits more obviously part of the CPU, using more or less standard CPU concepts --- and it flat out did not work. It's mutated into the Knights SomethingOrOther series which, regardless of their value or not as HPC accelerator cards, no longer look like any part of the future of GPUs or desktop CPUs.
I've talked about this before. CS engineers are peculiarly susceptible to the siren song of virtualization and masquerading because the digital world is so malleable. But not all virtualization is a good idea. The 90s spent god knows how much money on the idea of process and network transparent objects in various forms, from OLE to CORBA, but it all went basically nowhere; what won in that space was the totally non-transparent HTTP/HTML combo, I would say because they actually mapped onto the problem properly, rather than trying to make the problem look like a pre-existing solution.
MrSpadge - Wednesday, December 11, 2013 - link
Some valid concerns, for sure. And I didn't say it would be easy :) But I think I can address at least some of them.
First, my idea is not to fuse CPU and GPU into each other. It's about sharing that pool of shaders, which eats a major amount of transistors and power budget in both chips and ultimately limits their performance (provided you can feed and cool the beasts). In current AMD APUs 2 cores in a module share the 2 FPUs because these units are simply huge. Intel is already on the way to 512 bit AVX, requiring even more transistors & area. Yet their throughput pales in comparison to GPUs. And to use them all we have to go fully multi-threaded, with all its software and synchronization issues. If what I have in mind works perfectly a single CPU core could easily get access to the entire pool of shaders/FPUs, if needed. It just fires off the instructions to these massively parallel, high latency FPUs instead of the local scalar one and gets massive throughput. That's the ultimate load-balancing and very efficient use of those transistors, if it works well.
The hard-wired logic in the GPU cores (TMUs, ROPs, rasterizer etc.) would still remain. At the point where they'd usually dispatch instructions to their shaders they would now also go into that "sea of FPUs".
Sure, internal and external bandwidth, registers and such would all need to scale to hide the increased latency from putting the execution units further away from the CPU/GPU cores. But if these costs become too large one could segment the whole thing again, like combining 1 to 4 GCN compute units with one CPU module. The amount of raw FPU horsepower available to the CPU could still increase tremendously, while the "fast path" local scalar FPU could be reduced from 2x128 bit (or more) to one double precision unit again.
You see, I'd not necessarily want or need a unified instruction set for CPU and GPU, just the same micro-ops (or however you want to call them) to access the shaders /FPUs. Larrabee is almost a "traditional many-core CPU" in comparison ;) (if there already is such a thing)
Fergy - Tuesday, December 10, 2013 - link
Have OEMs spoken about plans for 'real' laptops with ARM CPUs? Like the Intel and AMD laptops.
iwod - Tuesday, December 10, 2013 - link
1. MIPS - Opinions on it against ARMv8?
2. I quote:
"There is nothing worse than scrambled bytes on a network. All Intel implementations and the vast majority of ARM implementations are little endian. The vast majority of Power Architecture implementations are big endian. Mark says MIPS is split about half and half – network infrastructure implementations are usually big endian, consumer electronics implementations are usually little endian. The inference is: if you have a large pile of big endian networking infrastructure code, you’ll be looking at either MIPS or Power. "
How true is that? And if true, does ARM have any bigger plans to tackle this problem? Obviously there are huge opportunities now that SDN is exploding.
3. Thoughts on the current integration of IP (ARM), implementer (Apple/Qualcomm) and fab (TSMC)? Especially on the speed of execution. Previously it would take years for any IP maker to go from announcement to something on the market. We are now seeing Apple coming in much sooner, and Qualcomm is also well ahead of ARM's projected schedule for 64-bit SoCs in terms of shipment date.
4. Thoughts on Apple's implementation of ARMv8?
5. Thoughts on Economy of Scale in Fab and Nodes. Post 16/14nm and 450mm wafers. Development Cost etc. How would that impact ARM?
6. By having a pure ARMv8 implementation and not supporting the older ARMv7, how much does it save, as a percentage of transistors?
7. What technical hurdles do you see for ARM in the near future?
Peter Greenhalgh - Wednesday, December 11, 2013 - link
Hi iwod,
Addressing question-2, all ARM architecture and processor implementations support big and little endian data. There is an operating system visible bit that can be changed dynamically during execution.
On question-6, certainly an AArch64 only implementation would save a few transistors compared to an ARMv8 implementation supporting both AArch32 and AArch64. However probably not as much as you think and is very dependent on the micro architecture since the proportion of decode (or AArch32 specific gates) will be less in a wide OOO design than an in-order design. For now, code compatibility with the huge amount of applications written for Cortex-A5, Cortex-A7, Cortex-A9, etc is more important.
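To make the byte-order point from question-2 a little more concrete, here is a minimal C++ sketch of how portable networking code usually sidesteps host endianness altogether: it serialises values byte by byte instead of relying on how the CPU lays them out in memory. This is only an illustration, not ARM code; the helper names (`host_is_little_endian`, `to_big_endian`, `from_big_endian`, `check_round_trip`) are made up for this example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper names, chosen for this example only.

// True when the least significant byte of an integer is stored first.
bool host_is_little_endian() {
    const std::uint32_t probe = 1;
    std::uint8_t first_byte = 0;
    std::memcpy(&first_byte, &probe, 1);  // read the lowest-addressed byte
    return first_byte == 1;
}

// Serialise a 32-bit value most significant byte first (network order).
// Shifts operate on the value, not on memory, so the result is the same
// on big and little endian hosts.
std::array<std::uint8_t, 4> to_big_endian(std::uint32_t v) {
    return {{static_cast<std::uint8_t>(v >> 24),
             static_cast<std::uint8_t>(v >> 16),
             static_cast<std::uint8_t>(v >> 8),
             static_cast<std::uint8_t>(v)}};
}

// Rebuild the value from four bytes received in network order.
std::uint32_t from_big_endian(const std::array<std::uint8_t, 4>& b) {
    return (static_cast<std::uint32_t>(b[0]) << 24) |
           (static_cast<std::uint32_t>(b[1]) << 16) |
           (static_cast<std::uint32_t>(b[2]) << 8) |
            static_cast<std::uint32_t>(b[3]);
}

// Small self-check that can be called from main().
void check_round_trip() {
    const std::uint32_t value = 0x11223344u;
    const auto wire = to_big_endian(value);
    assert(wire[0] == 0x11 && wire[3] == 0x44);
    assert(from_big_endian(wire) == value);
}
```

Code written this way behaves the same whichever way the endianness bit is set; it is the code that copies structs straight onto the wire that ties a "large pile of networking infrastructure code" to one byte order.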
ltcommanderdata - Tuesday, December 10, 2013 - link
Next gen consoles have been noted for their use of SoCs, especially in the context of hUMA. Of course, SoC have long been the standard in the mobile space. What is the current state of hUMA-like functionality between the CPU and the GPU in mobile? And what can and/or will be done in the future to improve this, both within ARM's family of products (ARM CPU + ARM GPU) and working with third-parties (ARM CPU + any other GPU)?
Intel has adopted a cache model where each core has small pools of private, fast L1 and L2 cache and sharing/integration between cores and even the GPU happens in a larger, slower L3 cache. ARM's designs favour a private, fast L1 with sharing happening on the level of the L2 cache. What are the advantages/disadvantages between these design choices in terms of performance, power, die area, and scalability/flexibility?
Intel and AMD are busy expanding the width of their SIMD instruction set to 256-bits and beyond. Are 256-bit vectors relevant to mobile and NEON or are the use cases not there in mobile and/or the power/die area not worth it?
On the topic of ISA extensions to accelerate common functionality what other opportunities are out there? ARMv8 is adding acceleration for cryptography. Could acceleration for image processing, face recognition or voice recognition be useful or are those best left for specific chips outside the CPU?
ciplogic - Wednesday, December 11, 2013 - link
* What are the latencies in CPU cycles for the CPU caches? Is it possible in future to create a design that uses a shared L3 cache?
* How many general purpose CPU registers are in Cortex-A53 compared with its predecessors?
* Can we expect Cortex-A53 to be part of netbooks in the years to come? What about micro-servers?
Peter Greenhalgh - Sunday, December 15, 2013 - link
Hi Ciplogic,
While not yet in mobile, ARM already produces solutions with L3 caches such as our CCN-504 and CCN-508 products, which Cortex-A53 (and Cortex-A57) can be connected to.
Since Cortex-A53 is an in-order, non-renamed processor, the number of integer general purpose registers in AArch64 is 31, the same as specified by the architecture.
name99 - Wednesday, December 11, 2013 - link
How closely does a company like ARM follow academic ideas, and how long does it take to move those ideas into silicon?
For example:
- right now the king of academic branch prediction appears to be TAGE. Is ARM looking at changing its branch predictor to TAGE, and if so would we expect that to debut in 2015? 2017?
- there have been some very interesting ideas for improving memory performance through having LLC and Memory Controller know about each other. For example Virtual Write Queue attempts to substantially reduce the cost of writing out data, while another scheme has predictors for when various ranks will be idle long enough that writes to them should be attempted, and a third scheme has prefetch requests prioritized to match ranks that are least busy. Once again, how long before we expect this sort of tech in ARM CPUs?
- in a handwaving fashion, for a high end CPU, I think it's fair to say that the single biggest cause of slowdowns is memory latency, which everyone knows; but the second biggest cause of slowdowns is the less well known problem of fetch bandwidth, specifically frequent taken branches, coupled with a four-wide fetch from a SINGLE cache line, and edge effects that result in many of those fetches being less than four wide. The heavy duty solution for this is a trace cache, a somewhat weaker solution is a loop buffer. Does ARM plan to introduce either of these? (Surely they are not going to allow the fact that Intel completely bollixed their version of a trace cache destroy what is conceptually a good idea, especially if you just use it as a small loop driven augmentation of a regular I-cache, rather than trying to have it replace the I-cache?)
secretmanofagent - Wednesday, December 11, 2013 - link
Getting away from the technical questions, I'm interested in these two.
ARM has been used in many different devices. What do you consider the most innovative use of what you designed, possibly something that was outside of how you envisioned it originally being used?
As a creator, which devices made you look at what you created and feel the most pride?
Peter Greenhalgh - Wednesday, December 11, 2013 - link
Hi,
I'd suggest all of us who work for ARM are proud that the vast majority of mobile devices use ARM technology!
Some of the biggest innovations with ARM devices are coming in the Internet of Things (IoT) space, which isn't as technically complex from a processor perspective as high-end mobile or server, but is a space that will certainly affect our everyday lives.
NeBlackCat - Wednesday, December 11, 2013 - link
> Do a good job here and I might be able to even convince him to give away some ARM powered goodies
What's his favourite type of sausage?
Since, as any halfwit should be able to work out from the above spelling, I've got bugger all chance of being eligible for giveaways.
Peter Greenhalgh - Wednesday, December 11, 2013 - link
Hi,
Pork with chilli is good.
lightshapers - Wednesday, December 11, 2013 - link
Hey Mr Greenhalgh,
I currently work at a great SoC company, and I have recently been working on an A12-based SoC. The A12 was supposed to be an improvement over the A9 by a certain amount of DMIPS/MHz, but much more power-efficient than the A7/A15 pair in a big.LITTLE. It got to the point that a single 4x A12 cluster was better at the low-performance end than the 4x A7, and also better at high speed because it dissipates less power than the 4x A15 (which makes today's SoCs throttle at around 1.3GHz), thus allowing sustainable performance at a higher frequency. Best of all, it allows not using the CCI, which has been the subject of controversy (HMP, SMP, ...).
This CPU has not really been heavily marketized (forgive this ugly word), because today the fashion is for the A53/A57 big.LITTLE pair and the possibly useless 64-bit platforms.
That was my background (personal thoughts).
Now my question: are we really going to see a CPU performance improvement for small platforms (smartphones) with the A53/A57, or are these CPUs specified for heavy use, which would suggest that thermal dissipation will prevent sustained heavy use on a smartphone? Should SoC vendors concentrate on the 32-bit A7/A15/A12 versions, which could be improved again in the future, in order to really see more performance?
Are you packaging an 8x A12 that would possibly make sense in a high-end SoC?
Are you going to improve the power domain sharing inside your deliveries? It still makes no sense to have CoreSight IP inside a CPU domain, as it prevents debugging once the CPUs are sleeping...
thanks,
lightshapers - Friday, December 13, 2013 - link
I wish I had gotten your interest on my questions...
wrkingclass_hero - Wednesday, December 11, 2013 - link
What is ARM's most power efficient processing core? I don't mean using the least power, I mean work per watt. How does that compare to Intel and IBM? Also, I know that ARM is trying to grow in the server market. Given the rise of the GPGPU market, do you foresee ARM leveraging their Mali GPUs for this in the future? Finally, does ARM have any interest or ambition in scaling up to the desktop market?
wrkingclass_hero - Wednesday, December 11, 2013 - link
I have another question. Why is ARM pursuing the big.LITTLE paradigm? Wouldn't it be more economical to use the extra silicon to make larger, more powerful cores that run at a lower clockspeed?
mercury555 - Wednesday, December 11, 2013 - link
That is a very good point. I can so imagine ARM telling their customers: you are using 4X area anyway, let's swap that for a BIG core for laptop/ultra-book class products
Peter Greenhalgh - Wednesday, December 11, 2013 - link
Hi wrkingclass_hero,
In the traditional applications class, Cortex-A5, Cortex-A7 and Cortex-A53 have very similar energy efficiency. Once a micro-architecture moves to Out-of-Order and increases the ILP/MLP speculation window and frequency there is a trade-off of power against performance which reduces energy efficiency. There’s no real way around this as higher performance requires more speculative transistors. This is why we believe in big.LITTLE as we have simple (relatively) in-order processors that minimise wasted energy through speculation and higher-performance out-of-order cores which push single-thread performance.
Across the entire portfolio of ARM processors a good case could be made for Cortex-M0+ being the most energy-efficient processor, depending on the workload and the power in the system around the Cortex-M0+ processor.
Xajel - Wednesday, December 11, 2013 - link
When running 32-bit apps on a 64-bit OS, is there any performance hit compared to 64-bit apps on a 64-bit OS?
And from an IPC/Watt perspective, how are A53/A57 doing compared to A7/A15? I mean, how much more performance will we get for the same power usage compared to A7/A15, talking about the whole platform (memory included)?
Peter Greenhalgh - Wednesday, December 11, 2013 - link
The performance per watt (energy efficiency) of Cortex-A53 is very similar to Cortex-A7. Certainly within the variation you would expect with different implementations. Largely this is down to learning from Cortex-A7 which was applied to Cortex-A53 both in performance and power.
pigafetta - Wednesday, December 11, 2013 - link
Is ARM thinking of adding hardware transactional memory instructions, similar to Intel's TSX-NI?
And would it be possible to design a CPU with an on-chip FPGA, where a program could define its own instructions?
Peter Greenhalgh - Wednesday, December 11, 2013 - link
Hi pigfetta,
ARM has an active architecture research team which, as I'm sure you would expect, looks at all new architectural developments.
It would be possible to design a CPU with on-chip FPGA (after all, most things in design are possible), but the key to a processor architecture is code compatibility so that any application can run on any device. If a specific instruction can only run on one device it is unlikely to be taken advantage of by software since the code is no longer portable. If you look at the history of the ARM architecture it's constantly evolved with new instructions added to support changes in software models. These instructions are only introduced after consultation with the ARM silicon and software partners.
You may also be interested in recent announcements concerning Cortex-A53 implemented on an FPGA. This allows standard software to run on the processor, but provides flexibility around the other blocks in the system.
mpjesse - Wednesday, December 11, 2013 - link
I'm pretty sure no one asked you and that the question was meant to be answered by the ARM engineer, should he choose to answer it. Instead of trolling perhaps you should come up with your own question for our guest.
If you don't have anything nice to say, don't say it at all.
kenyee - Wednesday, December 11, 2013 - link
How low a speed can the ARM chips be underclocked?
i.e., what limits the lowest speed?
Peter Greenhalgh - Wednesday, December 11, 2013 - link
Hi Kenyee,
If you wished to clock an ARM processor at a few kHz you could. Going slower is always possible!
BoyBawang - Wednesday, December 11, 2013 - link
Hi Peter,
With ARM being one of the key founders of the HSA Foundation organized by AMD, what is the current state of progress on the ARM implementation?
Hulk - Wednesday, December 11, 2013 - link
Is this move to 64-bit driven by a need from the hardware and/or software, or by pressure from competitors? If the former, can you indicate some of the improvements users will see and feel with 64-bit?
AgreedSA - Wednesday, December 11, 2013 - link
Why aren't you helping to make Terminator 2 (specifically, not the first one, that one's robot was just scary while Terminator 2 had a friendly robot and a scary robot as well) a reality in our world? Do you have something against robots? Seems vaguely speciesist to be honest...
Alpha21264 - Wednesday, December 11, 2013 - link
Can you talk a bit about your personal philosophy regarding pipeline lengths, as the A53 and A57 diverge significantly on the subject? Too short and it's difficult to implement goodness like a scheduler, but as you increase the length you also contribute to design bloat: you need large branch target arrays with both global and local history to avoid stalls, more complicated redirects in the decoder and execution units to avoid bubbles, and generally just more difficult loops to converge in your design. Are you pleased with the pipeline in the A53? Where do you see things going with the pipeline, both in the big cores and the little ones, going forward (I anticipate a vague answer on this one, but that's not going to stop me from asking)?
Peter Greenhalgh - Thursday, December 12, 2013 - link
Hi Alpha,
I'd expect my view of pipeline lengths to be similar to most other micro-architects. The design team have to balance the shortest possible pipeline to minimise branch mis-prediction penalty and wasted pipelining of control/data against the gates-per-cycle needed to hit the frequency target. Balance being the operative word as the aim is to have a similar amount of timing pressure on each pipeline stage since there's no point in having stages which are near empty (unless necessary due to driving long wires across the floorplan) and others which are full to bursting.
Typically a pipeline is built around certain structures taking a specific amount of time. For example you don't want an ALU to be pipelined across two cycles due to the IPC impact. Another example would be the instruction scheduler where you want the pick->update path to have a single-cycle turnaround. And L1 data cache access latency is important, particularly in pointer chasing code, so getting a good trade-off against frequency & the micro-architecture is required (a 4-cycle latency may be tolerable on a wide OOO micro-architecture which can scavenge IPC from across a large window, but an in-order pipeline wants 1-cycle/2-cycle).
We're pretty happy with the 8-stage (integer) Cortex-A53 pipeline and it has served us well across the Cortex-A53, Cortex-A7 and Cortex-A5 family. So far it's scaled nicely from 65nm to 16nm and frequencies approaching 2GHz so there's no reason to think this won't hold true in the future.
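For readers wondering why L1 data cache latency matters so much in the pointer-chasing code mentioned above, a minimal C++ sketch of the pattern follows. It is an illustration only, not a benchmark and not ARM code; `Node`, `make_chain`, `sum_list` and `sum_array` are hypothetical names. In the list walk, the address of each load is the result of the previous load, so every hop pays the full load-to-use latency; in the array walk the addresses are known up front, so a core can overlap the loads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types and names, for illustration only.
struct Node {
    Node* next;
    int value;
};

// Pointer chasing: the load of n->next must complete before the address
// of the following node is even known, so loads are serialised.
long sum_list(const Node* n) {
    long total = 0;
    while (n != nullptr) {
        total += n->value;
        n = n->next;  // dependent load on the critical path
    }
    return total;
}

// Array walk: addresses depend only on the index, not on loaded data,
// so independent loads can be in flight at the same time.
long sum_array(const std::vector<int>& values) {
    long total = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        total += values[i];
    }
    return total;
}

// Build a chain of `count` nodes holding 1..count, linked in order.
// The nodes live in the returned vector; moving the vector keeps the
// element addresses (and therefore the links) valid.
std::vector<Node> make_chain(std::size_t count) {
    std::vector<Node> nodes(count);
    for (std::size_t i = 0; i < count; ++i) {
        nodes[i].value = static_cast<int>(i + 1);
        nodes[i].next = (i + 1 < count) ? &nodes[i + 1] : nullptr;
    }
    return nodes;
}
```

Both functions compute the same sum; the difference is only in how much of the load latency the hardware can hide, which is the trade-off being described between an in-order pipeline and a wide out-of-order one.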
Tchamber - Wednesday, December 11, 2013 - link
Peter,
Thank you for taking the time to answer our questions!
As mobile devices become more and more useful/powerful, they have encroached on territory historically dominated by Intel and AMD... and they feel the pressure. Does ARM feel pressure from those two companies? As ARM progresses, will you actively target the desktop space as room for growth?
OreoCookie - Wednesday, December 11, 2013 - link
Here's my question: Implementations of previous ARM cores by licensees, most notably the A15, feature much higher clocks than what ARM recommends. How has that influenced the design of the A53? Do you expect ARM's clock frequency design targets to be closer to the clocks in actual implementations?
Peter Greenhalgh - Thursday, December 12, 2013 - link
Hi OreoCookie,
ARM processor pipelines allow the processor to be built to achieve certain frequencies, but we don't recommend or advise what they should be. After all, there are still ARM1136 processors being implemented today on 40nm, yet we designed the processor on 180nm!
We and our partners like the freedom to choose whether to push the frequency as far as it will go or to back off a bit and save a bit of area/power. This freedom allows differentiation, optimisation around the rest of the platform and time-to-market (higher frequency = more effort = more time).
Naturally our pipelines have a range of sweet-spot frequencies on a given process node and there is a lot of discussion with lead partners about a new micro-architecture, but we aren't changing the pipelines based on the frequencies we're seeing in current mobile implementations.
msm595 - Wednesday, December 11, 2013 - link
As someone starting their Computer Engineering degree and really interested in computer architecture, how can I give myself a head start?
Peter Greenhalgh - Thursday, December 12, 2013 - link
Hi Msm,
Most good EE/CE degrees will have a reasonable amount of micro-architecture/architecture courses, but it doesn't hurt to understand what makes all the popular micro-architectures tick. For that matter, a lot of the designs in the 90's were impressive too - check out the DEC Alpha EV8 which never got to market, but was a really interesting processor.
tabascosauz - Thursday, December 12, 2013 - link
Hi all (and hopefully this gets a response from our guest Mr. Greenhalgh),
I'm not exactly too well informed in the CPU department, so I won't pretend that I am. I'm just curious as to how A53 will fare against the likes of Krait 450 and Cyclone in terms of DMIPS (as obsolete as some people may think it is, I'd just like to get a sense of it performance-wise) and pipeline depth.
We're all assuming that Apple has gone ahead and used an ARMv8a instruction set and, as per their own usual routine, swelled up the cores to many times that of their competitors and marketed it as a custom architecture. Since A53 is also based off ARMv8, I'm wondering how this will translate into speed. I think someone's mentioned before that A53 is the logical successor to Cortex-A7, but my mind is telling me that there's more to the number in the name than just a random number that is a few integers below 57.
If this is essentially a quad-core part and succeeds the A7, then are we looking at placement in the Snapdragon 400 segment of the market? It would certainly satisfy the conditions of "mid-to-high end" but I'm a little disappointed in Cortex-A at the moment considering that the A7 was introduced as a sort of energy-efficient, slightly lower performing A9. I mean, the A12 is seen as the A9's successor but it's still ARMv7a and it won't be hitting the market anytime soon, so would it be possible that we could see A53, with its ARMv8 set, on par with the Cortex-A12 in terms of rough performance estimates?
Can't wait until A57; it's bound to be a great performer!
Peter Greenhalgh - Thursday, December 12, 2013 - link
Hi Tabascosauz,
Speaking broadly about Dhrystone, the pipeline length is not relevant to the benchmark as perfect branch prediction is possible, which means issue width to multiple execution units and fetch bandwidth largely dictate the performance. This is the reason Dhrystone isn't great as a benchmark, as it puts no pressure on the data or instruction side memory systems (beyond the L1 cache interfaces) or the TLBs, and little pressure on the branch predictors.
Cortex-A12 is a decent performance uplift from Cortex-A53, so we're not worried about overlap, and while the Smartphone market is moving in the direction of 64-bit, there are still a lot of sockets for Cortex-A12. In addition there are many other markets where Cortex-A9 has been successful (Set Top Box, Digital TV, etc) where 64-bit isn't a near-term requirement and Cortex-A12 will be a great follow-on.
ThanosPAS - Thursday, December 12, 2013 - link
Question: What is the competitive advantage of ARM powered devices over other manufacturers' products, and what will your company do in the future to preserve and enhance it?
hlovatt - Thursday, December 12, 2013 - link
Can you explain what you mean by a 'weak' memory model and how this differs from other architectures and how it translates into memory models in common languages like Java?
Peter Greenhalgh - Sunday, December 15, 2013 - link
Hi hlovatt,
A weakly ordered memory model essentially allows reads (loads) and writes (stores) to overtake each other and to be observed by other CPUs/GPUs/etc in the system at different times or in a different order.
A weakly ordered memory model allows for the highest performance system to be built, but requires the program writer to enforce order where necessary through barriers (sometimes termed fences). There are many types of barrier in the ARM architecture from instruction only (ISB) to full-system barriers (DSB) and memory barriers (DMB) with various variants that, for example, only enforce ordering on writes rather than reads.
The Alpha architecture is the most weakly ordered of all the processor architectures I'm aware of, though ARM runs it close. x86 is an example of a strongly ordered memory model.
Recent programming standards such as C++11 assume weakly ordered and may need ordering directives even on strongly ordered processors to prevent the compiler from optimising the order.
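As a rough illustration of "enforcing order where necessary" in C++11 terms, here is a minimal message-passing sketch. It is not ARM-specific code and the names (`payload`, `ready`, `producer`, `consumer`, `run_once`) are invented for the example. The release store and acquire load are the ordering directives: without them, neither the compiler nor a weakly ordered CPU is obliged to make the write to `payload` visible before the write to `ready`.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, for illustration only.
int payload = 0;                 // ordinary, non-atomic data
std::atomic<bool> ready{false};  // flag used to publish the payload

void producer() {
    payload = 42;  // 1. write the data
    // 2. publish it: earlier writes may not be reordered after this store
    ready.store(true, std::memory_order_release);
}

int consumer() {
    // 3. wait for the flag: later reads may not be reordered before this load
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;  // the acquire/release pair guarantees this reads 42
}

// Runs one producer/consumer exchange and returns what the consumer saw.
int run_once() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = 0;
    std::thread reader([&seen] { seen = consumer(); });
    std::thread writer(producer);
    writer.join();
    reader.join();
    return seen;
}
```

How a toolchain turns these orderings into actual instructions (a DMB, or dedicated load-acquire/store-release instructions) depends on the compiler and the target, and isn't shown here. On a strongly ordered machine the same source may need no extra barrier instruction, but the annotations still stop the compiler from reordering the two writes.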
tygrus - Thursday, December 12, 2013 - link
Memory bandwidth to RAM is more important than huge on-chip caches. The on-chip cache is more like a buffer for prefetching and write-back. 10% more RAM bandwidth is better than 50% more cache. Even caching of instructions is getting harder (keeping enough in cache) because the amount of RAM in use has grown far more than cache as the number of active programs and their complexity have increased. Compare a mid-90's Pentium with 256KB off-chip L2 cache and 64MB RAM, a P3 with 256KB on-chip L2 and 1024MB RAM, and a Sandy Bridge Xeon with about 2.5MB L3 per core (up to 20MB) and 16GB RAM per core (128GB) or more.
vvid - Thursday, December 12, 2013 - link
Hi Peter!
Can 32bit performance degrade in future ARMv8 processor designs? ARMv7 requires some features omitted in ARMv8 - I mean arbitrary shifts, direct access to R15, conditional execution. I guess this extra hardware is not free, especially the latter.
Peter Greenhalgh - Sunday, December 15, 2013 - link
Hi vvid,
Fortunately, while the ARM instruction set has evolved over the years, ARMv8 AArch32 (which is effectively ARMv7) isn't that far away from ARMv8 AArch64. A couple of big differences in ARMv8 AArch64 are constant length instructions (32-bit) rather than variable (16-bit or 32-bit) and essentially no conditional execution, but most of the main instructions in AArch32 have an AArch64 equivalent. As a micro-architect, one of the aspects I like the most about the AArch64 instruction set is the regularity of the instruction decoding as it makes decoding them faster.
As such the hardware cost of continuing to support AArch32 is not that significant and it is more important to be able to support the thousands of software applications that have been compiled for ARMv7 which are fully compatible and run just fine on the generation of 64-bit ARM processors that are now arriving.
ARMtech - Friday, December 13, 2013 - link
Thanks for the time.
I have a few questions.
How does the ARM A57 match up in performance against Intel Haswell? Though very good at power, ARM cores are traditionally weak in performance compared to Intel. The Haswell arch seems to be beating the ARM A15 very easily in Chromebooks. Is this due to a memory bandwidth issue? Can an ARM arch with big.LITTLE including CCI support higher bandwidth?
Also, why doesn't ARM go for asynchronous frequency scaling like the Qualcomm arch? Why is frequency tied to the cluster instead of to individual cores?
elabdump - Friday, December 13, 2013 - link
Very nice,
Here are some questions:
- Does ARM work on GCC development?
- Are there special instructions for Cryptostuff defined in the 64-Bit ISA?
- If yes, are there patches for the upstream linux kernel available?
- Are there Instructions for SHA-3 available?
- Would ARM change their mind about free Mali drivers?
- Would ARM support device-trees?
Peter Greenhalgh - Sunday, December 15, 2013 - link
Hi Elabdump,
Yes, ARM works on GCC development and, yes, there are special Crypto instructions defined in the v8 Architecture (for AES and SHA).
As for patches, Mali drivers and device trees, these are handled by other teams in ARM. If you're interested in these wider questions about ARM technology, forums such as http://community.arm.com can help you.
Krysto - Monday, December 16, 2013 - link
I hope you guys intend to add ChaCha20 to your next-generation chips or architecture. I'm pretty sure it's the next big cipher to be adopted by software makers. Google's security chief, Adam Langley, has already shown his support for it, but there are others who are looking to adopt it as an alternative to AES. So I hope you too can adopt it as soon as possible.
http://googleonlinesecurity.blogspot.com/2013/11/a...
As for SHA-3, the jury is still out on that one, since nobody trusts NIST anymore, and there has been even some recent controversy about them wanting to lower the security of the final SHA-3 standard, so you might want to hold off on that one. You should support SHA-512 in the meantime, though, which is oddly missing from ARMv8.
Also, you guys should move to 4-wide and at least 256-bit NEON for Cortex A57's successor (I know, not your job, but still). And as others have said, try to match the low-end, mid-end and high-end release of the cores next time. One more thing - try to support OpenCL 2.0 as soon as you can.
JDub8 - Friday, December 13, 2013 - link
My question is about processor architecture design in general. There can't be very many positions in the world for "lead processor/architecture designer", so how does one become one? Obviously promotion from within, but how do you get the opportunity to show your company you have what it takes to make the tough calls? There can't be very many textbooks on the subject since you guys are constantly evolving the direction these things go.
How many people does it take to design a bleeding edge ARM processor? How are they split up? Can you give a brief overview of the duties assigned to the various teams that work on these projects?
Thanks.
Peter Greenhalgh - Sunday, December 15, 2013 - link
Hi JDub8,
I'd imagine that ARM is not so different from any other processor company in that it is the strength of the engineering team that is key to producing great products.
Perhaps where ARM differs from more traditional companies is the level of discussion with the ARM partners. Even before an ARM product has been licensed by an ARM partner they get input in to the product and there will be discussions with partners at all levels from junior engineers a few years out of college, through to multi-project veterans, lead architects, product marketing, segment marketing, sales, compiler teams, internal & external software developers, etc etc.
As a result, there are rarely 'tough calls' to be made as there's enough input from all perspectives to make a good decision.
In answer to your question about processor teams, these are typically made up of unit design engineers responsible for specific portions of the processor (e.g. Load-Store Unit) working alongside unit verification engineers. In addition to this there will be top-level verification teams responsible for making sure the processor works as a whole (using a variety of different verification techniques), implementation engineers building the design and providing feedback about speed-paths/power, performance teams evaluating the IPC on important benchmarks/micro-benchmarks.
And this is just the main design team! The wider team in ARM will include physical library teams creating advanced standard cells and RAMs (our POP technology), IT managing our compute cluster, marketing/sales working with partners, software teams understanding instruction performance, system teams understanding wider system performance/integration and test-chip teams creating a test device.
All in all it takes a lot of people and a lot of expertise!
twotwotwo - Saturday, December 14, 2013 - link
I. Core count inflation. Everyone but Apple lately has equated high-end with quad-core, which is unfortunate. I have a four-core phone, but would rather have a dual-core one that used those two cores' worth of die area for a higher-IPC dual-core design, or low-power cores for a big.LITTLE setup, or more L2, or most anything other than a couple of cores that are essentially always idle. Is there anything ARM could do (e.g., in its own branding and marketing or anything else) to try to push licensees away from this arms race that sort of reminds me of the megapixel/GHz wars and towards more balanced designs?
II. Secure containers. There has been a lot of effort put in to light-weight software sandboxes lately: Linux containers are newly popular (see Docker, LXC, etc.); Google's released Native Client; process-level sandboxing is used heavily now. Some of those (notably NaCl) seem to be clever hacks implemented in spite of the processor architecture, not with its help. Virtualization progressed from being that sort of hack to getting ISA support in new chips. Do you see ARM having a role in helping software implementers build secure sandboxes, somewhat like its role in supporting virtualization?
III. Intel. How does it feel to work for the first company in a long while to make Intel nervously scramble to imitate your strategy? Not expecting an answer to that in a thousand years but had to ask.
Peter Greenhalgh - Sunday, December 15, 2013 - link
Hi twotwotwo,
Core counts are certainly a popular subject at the moment!
From our perspective we've consistently supported a Quad-Core capability on every one of our multi-core processors all the way back to ARM11 MPCore which was released in the mid-2000's. And while there's a lot of focus from the tech industry and websites like Anandtech on high-end mobile, our multi-core processors go everywhere from Set-Top-Box to TVs, in-car entertainment, home networking, etc, etc some of which can easily and consistently use 4-cores (and more, which is why we've built coherent interconnects to allow multiple clusters to be connected together).
The processors are designed to allow an ARM partner to choose between 1, 2, 3 or 4 cores and the typical approach is to implement a single core then instance it 4 times to make a Quad-Core with the coherency+L2 cache layer connecting the cores together and power switching to turn unused Cores off. The nice thing about this approach is that it is technically feasible to design a coherency+L2 cache solution that scales in frequency, energy-efficiency and IPC terms from 1-4 cores rather than compromising in any one area.
The result of this is that a Dual-Core implementation will be very similar in overall performance terms as a Quad-Core implementation. So while it may be that for thermal reasons running all 4-Cores at maximum frequency for a sustained period of time is not possible, if two Cores are powered off on a Quad-Core implementation it isn't any different from only having a Dual-Core implementation to start with. Indeed, for brief periods of time 4-Cores can be turned-on as a Turbo mode for responsiveness in applications that only want a burst of performance (e.g. web browsing). Overall there are few downsides to multiple Core implementations outside of silicon area and therefore yield.
From a product perspective we've been consistent for almost a decade on the core counts provided by our processors and allow the ARM partners to choose how they want to configure their platforms with our technology.
Krysto - Monday, December 16, 2013 - link
I think many chip companies will be moving to 8-core solutions soon. What would the purpose of that be (other than marketing)?
Krysto - Monday, December 16, 2013 - link
To clarify, I'm referring to 8-same-core solutions, not big.LITTLE.
twotwotwo - Monday, December 16, 2013 - link
First, wow--had no idea you were going to go through all these comments and answer them; good work.
FWIW, your answer dismisses a problem slightly different from the one I asked about. Yes, 4xA15/2MB L2 performs no worse than 2xA15/2MB L2; you just spend more because of higher die area. But if the SoC design had stuck to two cores, I imagine they could use extra die area for more useful things--enlarging the L2, adding A7s in a big.LITTLE config, using a higher-IPC but larger core design if one were available, upgrading other SoC components like the GPU, etc. My understanding, mostly from AnandTech, is that Apple's A7 (where'd they get that name from?) SoC basically does this--it's largish, but that's from Apple doing basically everything *but* going quad-core.
Still, indirectly, you have answered my bottom-line question--ARM doesn't really see the proliferation of quad-core SoCs as a problem, and maybe doesn't see it as their job to push licensees towards one config or another in general.
darkich - Sunday, December 15, 2013 - link
Can we expect a fully flexible A53+A57 octa-core successor to the current big.LITTLE implementation? Can you estimate the improvements it could bring in performance/efficiency on 20nm?
Kevin G - Monday, December 16, 2013 - link
Will AArch64 add coprocessor instructions? The CP15 interface exists in the 32-bit ARMv7 but is missing in ARMv8. Will this be added later so that 3rd parties may add their custom coprocessors to an ARM design? If not, is this to prevent AArch64 from becoming diversified with proprietary extensions?
The 64-bit ARM ISA is unique in that it allows legacy 32-bit support to be omitted from a design. How much additional power consumption is needed to support the 32-bit legacy modes? Also, if 32-bit legacy support were removed and the core optimized for 64-bit-only mode, what would the performance gains be, all other things being equal?
big.LITTLE is an interesting concept that launched with the Cortex A7 and A15. Several years later the A12 arrives and can be used in big.LITTLE as well. Can big.LITTLE scale to span three architectures (A7, A12, A15) for fine-grained performance/watt stepping as load increases?
Similarly, there is a gap between the Cortex A53 and Cortex A57. Presumably there is room for a hypothetical Cortex A55, but that’d arrive several years later. Are there plans to synchronize the release schedule for the low, middle and high end designs?
Zisch - Monday, December 16, 2013 - link
How would you compare ARM to MIPS? Where do you see MIPS going?
barbare64 - Tuesday, December 17, 2013 - link
Is ARM planning to create high-TDP CPUs for servers? Is there any work towards an APU?