AMD's Quad-Core Barcelona: Defending New Territory
by Johan De Gelas on September 10, 2007 12:15 AM EST - Posted in IT Computing
64-bit Linux Java Performance: SPECjbb2005
SPECjbb needs good integer performance and an excellent memory subsystem, especially if you test with several instances as we do. So what integer improvements could help Barcelona here?
Fetching 32 bytes per cycle instead of 16 bytes (as Intel's Core and AMD's previous Opterons do) makes decoding a bit faster, as the average decoding bandwidth increases. However, it only helps performance when the CPU is able to execute many instructions per cycle, which is not the case in a lot of applications, including SPECjbb (IPC of 0.2 - 0.5). It might help with some branch-intensive code, however (unaligned branch targets).
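A rough back-of-the-envelope calculation illustrates why the wider fetch matters so little at low IPC. The sketch below assumes an average x86-64 instruction length of 4 bytes; that is a hypothetical round figure, not a number measured for SPECjbb, and the function name `bytes_per_cycle` is made up for illustration.

```shell
#!/bin/sh
# Rough estimate of how many instruction bytes per cycle a core consumes.
# AVG_INSN_BYTES is an assumed round figure, not a measured value.
AVG_INSN_BYTES=4

# bytes_per_cycle IPC: prints IPC * AVG_INSN_BYTES with one decimal place.
bytes_per_cycle() {
  awk -v ipc="$1" -v len="$AVG_INSN_BYTES" 'BEGIN { printf "%.1f\n", ipc * len }'
}

# The IPC range quoted for SPECjbb in the text.
for ipc in 0.2 0.5; do
  echo "IPC $ipc consumes about $(bytes_per_cycle "$ipc") bytes/cycle (fetch width: 16 or 32)"
done
```

Under that assumption the core consumes on the order of one or two instruction bytes per cycle on average, far below either fetch width. Real fetch demand is bursty rather than averaged, so this is only an order-of-magnitude argument.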
The biggest improvement for integer code, and especially for code that accesses memory a lot, is that AMD finally has an architecture that can reorder loads ahead of other loads and, in some cases, ahead of stores. This feature has been lacking in the AMD family, while it has been present in Intel CPUs since the Pentium Pro. It makes the newest AMD quad-core CPUs more "out of order" than their predecessors. Intel's Core architecture is still a lot more flexible in this respect, but Barcelona should like the SPECjbb benchmark quite a bit: it has more memory bandwidth available than the Core CPUs, and the gap with Core in out-of-order (OOO) integer processing has narrowed quite a bit.
SPECjbb2005 from SPEC (Standard Performance Evaluation Corporation) evaluates the performance of server-side Java by emulating a three-tier client/server system with emphasis on the middle tier. Instead of testing with a separate, possibly disk-intensive database system, SPECjbb uses tables of objects, implemented by Java Collections. A longer description can be found here.
Again, it is not our objective to show the best possible scores. Very few people will take the time to fully tune the JVM and take the risk that some of the ultra-aggressive optimizations backfire. So we tested with some decent but rather generic tuning that we could use on all systems. The JVM is Sun's version 1.5.0_08, which allows us to compare scores with previous results, as we have had only a few days to test the newly arrived systems.
We tested SPECjbb2005 with four application instances. Using numactl, a clever utility written by Andi Kleen, we were able to bind each Java application to a separate node. We didn't bind instances to CPUs on the Intel platforms (though it is possible with taskset) as it gives lower performance. The parameters in bold show the actual JVM optimizations.
On the Opteron we used:
numactl --cpunodebind=$node --membind=$node -- java -cp jbb.jar:check.jar -Xms2g -Xmx2g -Xmn1g -Xss128K -XX:+AggressiveOpts -XX:+UseParallelOldGC -XX:+UseParallelGC spec.jbb.JBBmain -propfile SPECjbb.props -id $x
On the Xeons we used:
java -classpath jbb.jar:check.jar -Xms2g -Xmx2g -Xmn1g -Xss128K -XX:+AggressiveOpts -XX:+UseParallelOldGC -XX:+UseParallelGC spec.jbb.JBBmain -propfile SPECjbb.props -id $x
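The two command lines above can be wrapped in a small launcher. What follows is a minimal dry-run sketch, not necessarily the script used for these tests: it only prints the command for each of the four instances instead of starting a JVM. The variable `NODES`, the round-robin mapping of instances to nodes, the instance IDs 1 to 4, and the helper name `launch_cmd` are all assumptions for illustration; the article only gives `$node` and `$x` as placeholders.

```shell
#!/bin/sh
# Dry run: print one launch command per SPECjbb2005 instance.
# NODES and the round-robin node mapping are assumptions, not from the article.
INSTANCES=4
NODES=2
JVM_OPTS="-Xms2g -Xmx2g -Xmn1g -Xss128K -XX:+AggressiveOpts -XX:+UseParallelOldGC -XX:+UseParallelGC"
JBB="spec.jbb.JBBmain -propfile SPECjbb.props"

# launch_cmd ID [NODE]
# With a node, prefix the numactl CPU and memory binding (Opteron case);
# without one, emit the plain java command (Xeon case).
launch_cmd() {
  id=$1
  node=$2
  if [ -n "$node" ]; then
    printf 'numactl --cpunodebind=%s --membind=%s -- ' "$node" "$node"
  fi
  echo "java -cp jbb.jar:check.jar $JVM_OPTS $JBB -id $id"
}

# Print the bound variant for every instance, spreading them over the nodes.
x=1
while [ "$x" -le "$INSTANCES" ]; do
  launch_cmd "$x" "$(( (x - 1) % NODES ))"
  x=$((x + 1))
done
```

Calling `launch_cmd "$x"` without a node argument yields the unbound form used on the Xeons (`-cp` is the short form of `-classpath`). To actually start the instances, each printed line would have to be run in the background and followed by a final `wait`.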
Below you can find the final score reported by SPECjbb2005, which is an average of the last four runs.
The newest Opteron does well and performs like a 2.4GHz Clovertown. Note that it cannot outperform the old (but more expensive) four-socket Opteron 880 system, as that platform has even more bandwidth available and runs at an almost 20% higher clock speed. Still, we can conclude that the improved memory subsystem does pay off in SPECjbb. That's a good sign for the majority of server applications, but what about the HPC world?
46 Comments
kalyanakrishna - Tuesday, September 11, 2007 - link
I don't deny people use MKL ... I don't agree that anyone targeting performance on AMD Opteron will use MKL. No one running HPL/Linpack for Top 500 submission would use MKL on Opteron. No one who wishes to test his Opteron for performance would use MKL to do so. No one wishing to have the fastest possible results from his Opteron will do so. Even ISVs now provide code that is optimized for Xeon and Opteron separately.
JohanAnandtech - Tuesday, September 11, 2007 - link
Ok, point taken. Give us some time, and we'll follow up with new compilations of Linpack.
kalyanakrishna - Wednesday, September 12, 2007 - link
Thank you. Appreciate the effort.
leexgx - Monday, September 10, 2007 - link
and how often do you read AnandTech's previews and reviews
unlike when Intel's Core 2 came out all the hype was real, too bad for AMD this time
this CPU is going to be good, problem is will it be able to compete with Intel's new CPU when it comes out
I'm still using an AMD system if you're wondering, and so are all the rest of my PCs apart from my server, as I just threw in an old P4 mobo just for file sharing in house (all second-hand parts apart from the HDDs)
phaxmohdem - Monday, September 10, 2007 - link
I wonder if it would be feasible for AMD to take the Intel approach, and slap two of their new native quad cores together and release an octal core CPU in the near future. Or would they remain the multi-core purists they have become... Similarly I wonder if 2 65nm Barcelona cores could even fit under that heat spreader... or come in under an acceptable thermal envelope.
Accord99 - Monday, September 10, 2007 - link
It won't fit on Socket F: http://www.madboxpc.com/news/am2/AMD_barcelona.jpg
fic2 - Monday, September 10, 2007 - link
Page 8, 3DS Max 9 last paragraph: "Dual 3GHz Opteron 2222 is capable of generating about 29 frames per hour", but then
"potential 3GHz Barcelona will be able to spit out ~35 frames per second". I think that is supposed to be ~35 frames per hour. Otherwise that is an extremely impressive speedup!
JohanAnandtech - Monday, September 10, 2007 - link
No, it is "per second". We used an octal-core 2THz Barcelona there.... Thanks, fixed that one :-)
phaxmohdem - Monday, September 10, 2007 - link
Got SuperPi times for that beast? ;)
Roy2001 - Monday, September 10, 2007 - link
Kentsfield has 2*143mm^2 dies. Barcelona is 280+ mm^2. Penryn would be even smaller, 2*100 mm^2. So unless AMD can increase the frequency to 3.0+GHz soon and price their new quad-core processors higher than Intel's, AMD would still be in the red unless it outsources Athlon 64 to TSMC.