*Note: this writeup was done privately at the end of 2012 and so is fairly outdated, but I still think it is a decent summary of the state of ARM at that time, so I am publishing it for fun. I am considering doing an updated ARM analysis given the changes that have happened in the past year.*
Will Eatherton – email@example.com
The buzz around ARM-based CPUs for the server ecosystem has been growing notably since mid-2012. Gartner predicts that ARM servers will own 15 percent of the server CPU market within four years.
Numerous ARM-based CPUs are planned for the server market: ST Micro; Applied Micro, which has an early 64-bit prototype and seems to be leading support for ARM v8; Marvell, which was early with a 32-bit server-focused ARM processor; AMD, which has announced 64-bit ARM CPUs by 2014; LSI; and the startup Calxeda. Additionally, there are indications within the industry that IBM/Samsung may be considering opportunities in the server space using ARM (perhaps more as SOC designs with partners).
- Beyond silicon plays, early-stage startups like Netspeed are looking at enabling the growing number of players interested in high-end multi-core ARM SOCs by providing a coherent interconnect.
- ARM is also working on a coherent interconnect for massively multi-core designs, with support for 16 cores, going to 32 in the future.
From a systems standpoint, Dell has announced an ARM-based server starting with a 32-bit version, and HP has indicated that its proof-of-concept project (Moonshot) will support Atom and ARM-based CPUs.
- Beyond the system vendors, mega data center operators like Facebook have also shown interest in helping to support ARM insertion into the data center.
Note that for system vendors as well as DIY hyperscale DC integrators like Facebook, support of ARM in the DC is somewhat self-serving, for the basic reason that it provides negotiating leverage with Intel on pricing.
Looking at the specific application of a Map and Reduce (M&R) focused compute node, the question is what an ARM CPU based server could look like in 2014/2015 and whether it would offer any notable differentiation versus Intel. The analysis below explores the implications of using 64-bit ARM cores in the data center for M&R applications.
Conclusion: The analysis below concludes that ARM-based servers for the target application do not seem worth the risks at this point. If future projected ARM solutions could show a 3–5x advantage over x86 in final performance after all factors (including software) in the same power profile, then it would be worth revisiting.
Challenges for ARM as Server CPU ISA
Before getting into a low-level performance/hardware analysis of ARM as a server CPU, it is important to look at the implications that replacing x86 with the ARM ISA in the data center will have on the software and system architecture ecosystem.
OS Support Optimizations
There are clearly gaps today in generalized Linux support for the upcoming ARM silicon for the DC. A recent initiative called Linaro is focused on fleshing out full Linux support for ARM, starting with topics such as boot sequences and going from there. Beyond base driver-level support, a major area of concern for performance-oriented software developers is that the extensive tooling for benchmarking and tuning Linux on x86 will have to mature on ARM.
Compilation Tool Chains
It has been common wisdom for a number of years that Intel’s optimizing compiler (ICC) for C/C++ is one of the better compilers in the industry, and it is another example of the multi-year optimization effort around x86. Studies in 2012 continue to show a notable benefit of ICC over alternatives such as GCC and Microsoft’s compiler.
From the perspective of Hadoop (a Java application), the impact of not having the Intel C/C++ compiler may not be major from a performance/memory standpoint, but it is representative of the types of optimization issues that ARM will have entering widespread data center use.
ARM v7/v8 include some support for hardware and I/O virtualization, which is expected to be available in silicon by 2014. Paravirtualization, the technique of presenting modified software interfaces up to virtual machines, is used to compensate for the gaps that remain (it is accomplished by replacing or intercepting certain guest OS instructions and handling them differently for the actual hardware).
* Note that maintaining paravirtualization over time with consistent performance and reliability can be difficult.
VMware has indicated pretty bluntly that it is in no rush to support ARM. This is of course a major issue for adoption of ARM in the data center, as open source solutions will be the only real option for virtualization on ARM servers for the foreseeable future.
KVM support for ARM seems to be making progress, but it is still at an early phase and has many years of work ahead on performance tuning, benchmarking, and paravirtualization enhancements to better compensate for the virtualization extensions and hardware missing in ARM v7/v8.
* The x86 architecture (Intel and AMD) started providing support for virtualization many years ago, and it will take ARM a long time to reach parity.
From the perspective of a Hadoop cluster, weak hypervisor support on ARM is not necessarily a show stopper, as it is common to run the Hadoop stack on a non-virtualized OS on bare metal, but it is still representative of the types of optimization issues that ARM will have entering widespread data center use.
Java Virtual Machines
As with the prior topics, the immature state of JVM optimization for ARM compared to x86 will again be an impediment to rapid ARM adoption. While performance analysis data in this area is generally sketchy, some experiments in 2012 showed that with OpenJDK on x86 platforms the performance ratio between C and Java is in the ballpark of 1:1, but when the same benchmarks are run on ARM, Java can be 3.6x to 8.9x slower than C. This indicates that JVM tuning for ARM is still very immature.
Oracle has recently announced that it will support its JRE on ARM. This is very important, as the Oracle JRE is commonly viewed as the clear industry grade/performance leader compared to alternative commercial and open source options. There are some functional limitations in Oracle’s planned support of ARM, but they do not appear to have a major impact on server applications. However, there is no benchmarking data available yet for Oracle’s JRE on ARM, and it is expected that a multi-year evolution will be required. Additionally, the first port is focused on 32-bit ARM v7, so support for the new set of 64-bit cores will not arrive until well into 2013.
JVM support for ARM is key for Hadoop, which is Java based.
CPU level Analysis of ARM vs x86
Based on a discussion with a processor architect, some interesting data points:
- He expects to see 32-core, 64-bit devices in prototype by 2H 2013.
- For integer/text manipulation (common in M&R applications, the example application considered here), he expects that to first order ARM v8 cores can be approximated as on par with x86 E5 cores at the machine code level and at the same clock (ignoring the virtualization and tool chain differences explored above).
- From a power standpoint, he expects a 32-core 64-bit ARM v8 CPU to draw similar power (100W) to a 10-core x86 CPU in a similar time frame for the same system-level functionality. This implies an upper bound of roughly 3x benefit in power per core for ARM.
- Industry back-of-the-envelope statistics citing a 5–10x delta between the power of x86 server cores and ARM cores are generally comparing against ARM v6 32-bit cores, which do not represent the power per core once the ARM ISA moves to 64-bit and starts adding overheads like full floating point support, virtualization support (e.g. nested page tables), and coherent interconnect.
- He does not expect the coherent interconnect across 32 cores to be a bottleneck, based on his analysis of ARM’s recent multi-core interconnect (CCN-504).
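The per-core power arithmetic behind the roughly 3x figure can be sketched as follows. This is a minimal Python illustration using only the numbers quoted above (32 ARM cores and 10 x86 cores, each CPU at about 100W). The function and variable names are hypothetical, and the sketch assumes package power divides evenly across cores, which ignores uncore and I/O power.

```python
# Figures quoted in the text: both CPUs are assumed to draw about 100 W.
ARM_CORES = 32
X86_CORES = 10
PACKAGE_WATTS = 100.0


def watts_per_core(cores, watts=PACKAGE_WATTS):
    """Naive per-core power: package power split evenly across cores."""
    return watts / cores


arm_w = watts_per_core(ARM_CORES)
x86_w = watts_per_core(X86_CORES)

# Upper bound on the per-core power advantage, before any derating for
# ISA efficiency, OS maturity, or JVM performance.
core_power_advantage = x86_w / arm_w

print(f"ARM: {arm_w:.3f} W/core, x86: {x86_w:.3f} W/core")
print(f"Upper-bound advantage: {core_power_advantage:.1f}x")
```

Because the cores are taken as roughly equal at the same clock, this ratio is an upper bound: every software or clock penalty on the ARM side reduces it.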
Looking beyond a single CPU, there is not yet a concrete plan for multi-CPU mesh configurations, like those enabled by Intel’s QPI, for a larger shared memory complex.
- The implication of not having this multi-CPU configuration is that each silicon instance is a standalone CPU that cannot leverage shared memory, which requires finer-grained segregation of tasks, with separate distributed application instances on each CPU.
The Applied Micro CEO has stated that up to 1024 cores across 64 CPUs is planned for the future, though there is not much additional detail on this yet.
Analysis of ARM CPU for an M&R Compute Node
At a system level, beyond the potential compute benefits (per watt) of ARM vs x86 and the software complexities, there is the question of how relevant this trade-off is for a given application area.
First, let’s consider a very rough estimate that normalizes performance to BIPS (billions of instructions per second) for integer operations of an ARM and an x86 based CPU within a 100W budget, for silicon expected to be available by the end of 2013. Note that the relative JVM performance estimates may be optimistic in favor of ARM.
| Factor | ARM | x86 |
|---|---|---|
| Number of cores | 32 | 10 |
| ISA efficiency for integer operations | 0.5 | 1 |
| Linux OS performance relative to x86 | 0.95 | 1 |
| JVM performance implications | 0.7 | 1 |
| Final relative BIPS | 32 | 24 |
In summary, while this is a crude estimate, the final ratio of relative performance within 100W (roughly 1.3:1) is close enough to 1:1 that it is not interesting.
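The estimate in the table can be sketched as a simple product of factors. Note that the listed factors alone (cores times the three derating factors) give roughly 10.6 vs 10, not the 32 vs 24 in the final row. The final row is matched if a per-core clock term is included. The clock values used below (3.0 GHz for ARM, 2.4 GHz for x86) are an assumption chosen here to reproduce the table and are not stated in the text, and the function name is hypothetical.

```python
from math import prod


def relative_bips(cores, clock_ghz, factors):
    """Cores x clock x derating factors, as a relative BIPS score."""
    return cores * clock_ghz * prod(factors)


# Derating factors from the table: ISA efficiency, Linux OS, JVM.
ARM_FACTORS = (0.5, 0.95, 0.7)
X86_FACTORS = (1.0, 1.0, 1.0)

# Step 1: the listed factors only (clock normalized to 1.0).
arm_listed = relative_bips(32, 1.0, ARM_FACTORS)
x86_listed = relative_bips(10, 1.0, X86_FACTORS)

# Step 2: ASSUMPTION, not from the source: per-core clocks in GHz that
# happen to reproduce the table's final row.
ASSUMED_ARM_GHZ = 3.0
ASSUMED_X86_GHZ = 2.4
arm_bips = relative_bips(32, ASSUMED_ARM_GHZ, ARM_FACTORS)
x86_bips = relative_bips(10, ASSUMED_X86_GHZ, X86_FACTORS)

print(f"Listed factors only: ARM {arm_listed:.2f} vs x86 {x86_listed:.2f}")
print(f"With assumed clocks: ARM {arm_bips:.2f} vs x86 {x86_bips:.2f}")
print(f"ARM/x86 ratio: {arm_bips / x86_bips:.2f}")
```

Either way the ARM-to-x86 ratio stays well under 2:1, which is the point the summary makes.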
Going to the system level, assume that over time, with software and further silicon optimization, ARM-based CPUs improve to a solid 2x benefit over x86 in final performance (relative BIPS) per watt after all overheads. How much would this impact system optimization for the Map and Reduce application in a rack server?
- Taking into account memory, disk, IO and other overheads, the estimated impact of a 2:1 CPU advantage on system density is a <20% benefit at the system level. This does not seem like enough impact to warrant the significant risk and effort.
As an example, consider that in recent Hadoop cluster analysis the ratio of x86 cores to spindles may be as high as 1:5, which implies that a blade with, say, 32 cores would match up with 160 spinning disks. That is a significant amount of space and power compared to the CPU complex, making raw CPU performance less relevant.
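One way to see why a 2x CPU gain shrinks to under 20% at the system level is an Amdahl-style estimate: only the CPU's share of the node's power or space budget improves, while disks, memory and IO stay fixed. The sketch below is illustrative only. The CPU share values are hypothetical inputs, not measurements from the text, and the function names are made up for this example.

```python
def system_gain(cpu_share, cpu_gain):
    """Amdahl-style system-level gain when only the CPU share improves.

    cpu_share: fraction of the node budget (power or space) spent on CPU.
    cpu_gain:  CPU performance-per-watt improvement factor (e.g. 2.0).
    """
    return 1.0 / ((1.0 - cpu_share) + cpu_share / cpu_gain)


def spindles_for(cores, spindles_per_core=5):
    """Disks implied by the 1:5 core-to-spindle ratio quoted in the text."""
    return cores * spindles_per_core


# Hypothetical CPU shares of the node budget, for illustration only.
for share in (0.2, 0.3, 0.4):
    gain = system_gain(share, 2.0)
    print(f"CPU share {share:.0%}: system-level gain {gain:.2f}x")

print(f"Disks for a 32-core blade: {spindles_for(32)}")
```

Under this model a 2x CPU gain yields less than a 20% system gain whenever the CPU accounts for less than about a third of the node budget, which is plausible when each blade is paired with well over a hundred disks.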
If it were possible to achieve a ratio of, say, 3–5x for ARM over x86 in final relative BIPS per watt, and to aim for hundreds of ARM cores on a blade (or in a 2RU rack server), then merging this optimization with a major overhaul of the system design to match the massively multi-core architecture in the areas of disk/memory/IO could result in a significantly different optimization point than x86 servers today.