Dynamic Power Management: A Quantitative Approach


by Johan De Gelas on January 18, 2010 2:00 AM EST


Saving Power at Low Load

Measuring idle power matters for some applications because operating system schedulers may choose to "race to idle", i.e. perform the task as quickly as possible so the CPU can return to an idle state. This strategy is only worthwhile if the idle state consumes very little power. Many server applications, however, run at a relatively low but almost never "zero" load; one example is a web server that is visited from all around the globe. It is therefore equally interesting to see how the processors deal with this kind of situation. We started Fritzmark with two threads to see how the operating system and hardware cope with this. First we look at the delivered performance.

Fritzmark integer processing: 2 thread performance

In performance mode, the Xeon L3426 is capable of pushing its clock speed up to 2.66GHz, but not always: its performance is equal to that of a similar Xeon at 2.5GHz. This is in contrast with the Xeon X3470, which can almost always keep its clock speed at 3.33GHz and as such delivers performance equal to that of a Xeon that always runs at that speed. The reason for this difference is that the PCU (Power Control Unit) of the L3426 has less headroom: the chip cannot dissipate more than 45W, while the X3470 is allowed to dissipate up to 95W. Still, the performance boost is quite impressive: Turbo Boost offers 34% better performance on the L3426 compared to the "normal" 1.86GHz clock.
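
As a quick sanity check, the 34% gain over the 1.86GHz base clock can be converted back into an effective clock speed. The sketch below assumes that performance in this benchmark scales linearly with clock speed, which the text implies but does not state; the function name is illustrative only.

```python
def effective_clock_ghz(base_clock_ghz: float, speedup: float) -> float:
    """Clock speed that would deliver `speedup` times the base performance,
    assuming performance scales linearly with frequency (an assumption)."""
    return base_clock_ghz * speedup


# Xeon L3426: 1.86GHz base clock, 34% faster with Turbo Boost enabled
l3426_effective = effective_clock_ghz(1.86, 1.34)
print(f"{l3426_effective:.2f}GHz")
```

The result lands just under 2.5GHz, which agrees with the observation that the L3426 performs like a similar Xeon fixed at 2.5GHz rather than one running at its 2.66GHz peak.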

Now let's confront the performance levels with the power consumption.

The six-core Opteron is clearly a better choice than its faster clocked quad-core sibling. In power saving mode it is capable of reducing the power by 8W more while offering the same level of performance. This is a small surprise: do not forget that with two threads running, the "Istanbul" Opteron has twice as many idle cores leaking power as the "Shanghai" CPU (four versus two).

The Nehalem based core offers very high performance per thread, about 40% higher than the Opteron's architecture is capable of achieving, but it does come at a price, as we see power shoot up very quickly. Part of the reason, of course, is that the Nehalem is more efficient at idle. We assume - based on early component level power measurements - that the idle power of the Xeons is about 9W (power plan Balanced) and that of the Opterons about 14W (power plan Balanced). Note that the exact numbers are not really important. Since the RAM is hardly touched, we assume that power is only raised by 1W per DIMM on average. Based on these assumptions we can estimate CPU + VRM power, measured at the outlet.

System Power Estimates
System	Power Calculation	CPU + VRM Power	Notes
Xeon X3470 performance	119W - 4W (4 x 1W per DIMM) - 60W idle + 13W CPU	= 68W	(idle power of system was 73W = 13W CPU, 60W for the rest of the system)
Xeon L3426 performance	99W - 4W - 60W + 11W	= 46W
Xeon L3426	90W - 4W - 60W + 9W	= 35W
Opteron 2435 performance	102W - 4W - 70W idle + 18W	= 42W	(total idle power was 88W, 18W CPU)
Opteron 2435 balanced	100W - 4W - 70W idle + 14W	= 40W
Opteron 2389 performance	114W - 4W - 70W idle + 22W	= 62W

First of all, you might be surprised that the Turbo Boosted L3426 needs 46W. Don't forget this is measured at the power outlet, so 46W at 90% power supply efficiency means that the CPU + VRMs got about 41W delivered. Yes, these numbers are not entirely accurate, but that is not the point. Our component level power measurements still need some work, but we have reason to assume that the numbers above are close enough to draw some conclusions.
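
The arithmetic behind these estimates can be replayed in a few lines. The sketch below only repeats the calculation as the table prints it, for the three Xeon rows, and then applies the 90% efficiency figure. The function and variable names are illustrative, and treating 90% as one overall efficiency factor between outlet and CPU + VRMs is an assumption.

```python
def cpu_vrm_power_w(load_w, dimm_w, rest_idle_w, cpu_idle_w):
    """Replay the table's arithmetic as printed:
    load power - DIMM increase - rest-of-system idle power + CPU idle power."""
    return load_w - dimm_w - rest_idle_w + cpu_idle_w


def delivered_power_w(outlet_w, efficiency=0.90):
    """Power reaching the CPU + VRMs for a given draw at the outlet,
    assuming a single overall efficiency factor."""
    return outlet_w * efficiency


# (system, load W at outlet, DIMM W, rest-of-system idle W, CPU idle W)
XEON_ROWS = [
    ("Xeon X3470 performance", 119, 4, 60, 13),
    ("Xeon L3426 performance", 99, 4, 60, 11),
    ("Xeon L3426", 90, 4, 60, 9),
]

estimates = {name: cpu_vrm_power_w(*terms) for name, *terms in XEON_ROWS}
for name, watts in estimates.items():
    print(f"{name}: {watts}W at the outlet, "
          f"{delivered_power_w(watts):.1f}W delivered")
```

For the Turbo Boosted L3426 this gives 46W at the outlet and roughly 41W delivered, the two figures quoted above.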

AMD's platform consumes a bit too much at idle, but...
The six-core Opteron CPUs are much more efficient than the quad-core in these circumstances
Intel's 95W Xeons offer stellar performance but the high IPC requires quite a bit of power
The low power versions offer an excellent performance / Watt ratio

So if we take the platform out of the picture, the low power Xeon with Turbo Boost consumes about the same as the "normal" six-core Opteron, but performance is 16% better. Is this a success or a failure? Did Intel's Power Control Unit (PCU) save a considerable amount of power? Or in other words, would the power of the Xeons be much higher if they didn't have a PCU? Let's dive deeper.

35 Comments

View All Comments

n0nsense - Monday, January 18, 2010
Here is what system sees ...
only one is 2.5, other three are 2.0 :)

nons ~ # cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz
stepping : 7
cpu MHz : 2497.000
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority
bogomips : 5009.38
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz
stepping : 7
cpu MHz : 1998.000
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 4
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority
bogomips : 7012.69
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz
stepping : 7
cpu MHz : 1998.000
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 2
cpu cores : 4
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority
bogomips : 5009.08
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz
stepping : 7
cpu MHz : 1998.000
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority
bogomips : 5009.09
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
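
The per-core clock speeds in a dump like the one above can also be pulled out programmatically. This is a minimal sketch, assuming text in the /proc/cpuinfo layout shown above; the SAMPLE string is abridged from that output and the function name is illustrative.

```python
def per_core_mhz(cpuinfo_text: str) -> dict[int, float]:
    """Map each logical processor number to its reported 'cpu MHz' value."""
    speeds: dict[int, float] = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processor blocks
        key, value = key.strip(), value.strip()
        if key == "processor":
            current = int(value)
        elif key == "cpu MHz" and current is not None:
            speeds[current] = float(value)
    return speeds


# Abridged from the output above: only the two fields the parser needs.
SAMPLE = """\
processor : 0
cpu MHz : 2497.000

processor : 1
cpu MHz : 1998.000

processor : 2
cpu MHz : 1998.000

processor : 3
cpu MHz : 1998.000
"""

for cpu, mhz in per_core_mhz(SAMPLE).items():
    print(f"cpu{cpu}: {mhz:.0f} MHz")
```

On an x86 Linux system the same function could be fed the live file, e.g. `per_core_mhz(open("/proc/cpuinfo").read())`.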

VJ - Tuesday, January 19, 2010
These are mobile CPUs, however:

With Linux on a Latitude (Intel T7200 or T7500), CPU Frequency Scaling Monitor allows one to scale the frequency of one core to its max while leaving the other core at its minimum.

With an AMD TL62, this is not possible. The induced scaling of one core causes the frequency of the other core to follow.

With an AMD ZM84 this is possible. Just like with the Latitude, one can have one core at its max with the other core at its minimum.

Maybe what's shown is not what's taking place.

Additionally;

http://www.intel.com/technology/itj/2006/volume10i...

"For example, in a Dual-Processor system, when the OS decides to reduce the frequency of a single core, the other core can still run at full speed. In the Intel Core Duo system, however, lowering the frequency to one core slows down the other core as well."

VJ - Tuesday, January 19, 2010
Additionally: AMD's ZM84 allows each core to operate at a different frequency. The lowest frequency is 575MHz while the highest is 2300MHz.

I can set one core to 1150MHz with the other set at 2300MHz. This is different from the Intel (Mobile) CPUs I've come across, where a difference in frequency between cores is only possible when one core is (seemingly) operating at its lowest frequency (in a dual core system).

What is also interesting in the aforementioned cpuinfo output is that only one core is running at its max frequency while all (3) other cores are (seemingly) at their minimum frequency. Considering my previous conjecture on C2 and C0 states, it would be surprising if one could show cpuinfo output where 2 cores are running at max frequency while the other 2 cores are running at any frequency other than max frequency. That shouldn't be possible at all.

valnar - Thursday, May 6, 2010
Does anyone know if this kind of power management for Lynnfield processors is available in Windows 2003?

hshen1 - Sunday, June 23, 2013
This is really a good article for power management researchers like me!!
