My post on Raspberry Pi overclocking and benchmarking raised a few questions about the differences in performance between the armel architecture used by Debian and the armhf architecture used by Raspbian.
Debian officially provides two ARM architectures, armel and armhf. armel targets lower-end hardware: it supports the ARMv4 instruction set and uses hardware floating point only through a compatibility mode, which slows performance but retains compatibility with code written for processors without floating point units. armhf targets higher-end hardware: it requires ARMv7 and offers faster, direct hardware floating point support without that backwards compatibility. These are roughly analogous to the i386 and i686 architectures.
Note that in Ubuntu, both armel and armhf are compiled for ARMv7 and above, so neither will work on the Raspberry Pi.
The Raspberry Pi has an ARM11 processor supporting the ARMv6 instruction set and VFPv2 hardware floating point. That is not enough for armhf, which needs ARMv7, so on Debian the Pi has to run armel. There, performance is being sacrificed to retain compatibility with code compiled without support for the VFP. The Pi is also potentially missing out on faster, more optimised instructions introduced with the ARMv5 and ARMv6 instruction sets. The Raspberry Pi is a victim of the compromises made between performance and compatibility when standardising the Debian architectures.
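One way to see what a given board reports is to look at the feature flags the Linux kernel lists in /proc/cpuinfo. The sketch below is only an illustration: the helper names `cpu_features` and `has_vfp` are made up for this post, the sample text is illustrative rather than captured from a real board, and the exact layout of /proc/cpuinfo can vary between kernel versions.

```python
def cpu_features(cpuinfo_text):
    """Return the set of flags on the first 'Features' line of cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Features":
            return set(value.split())
    return set()


def has_vfp(cpuinfo_text):
    """True if any reported feature flag starts with 'vfp'."""
    return any(flag.startswith("vfp") for flag in cpu_features(cpuinfo_text))


# Illustrative sample only (assumption), not output copied from a real Pi.
SAMPLE = """\
Processor : ARMv6-compatible processor rev 7 (v6l)
Features  : swp half thumb fastmult vfp edsp java tls
"""

print(has_vfp(SAMPLE))
```

On a real system the same check could be fed with the contents of /proc/cpuinfo, for example `has_vfp(open("/proc/cpuinfo").read())`.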
ARM Instruction Sets and Architectures
At this point it’s worth explaining these version numbers. There are two numbers used to describe the ‘version’ of an ARM processor.
The first is the instruction set version, which on the Pi is ARMv6. Most newer devices use ARMv7, and ARMv8 will be the next version.
The second is the ARM processor design version, which on the Pi is 11. ARM11 introduced the ARMv6 instruction set. The next version was the Cortex-A, which introduced the ARMv7 instructions. Your smartphone will probably have an ARM11 or an ARM Cortex-A processor.
Each processor design comes in several variants depending upon which optional features it includes. The Raspberry Pi uses the ARM1176JZF-S, with the F meaning that it includes the Vector Floating Point (VFP) instruction set.
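To make the naming scheme a little more concrete, here is a small sketch that splits a variant name into its core number and feature letters. Only the meaning of F comes from the text above. The meanings given for J, Z and S are my understanding of ARM's naming convention, and the function name `decode_variant` is purely hypothetical.

```python
# Letter meanings: F is from the post; J, Z and S are my reading of
# ARM's naming convention and should be treated as assumptions.
SUFFIX_MEANINGS = {
    "J": "Jazelle (Java bytecode acceleration)",
    "Z": "TrustZone security extensions",
    "F": "Vector Floating Point (VFP) unit",
    "S": "synthesizable core",
}


def decode_variant(name):
    """Split a name like 'ARM1176JZF-S' into (core number, {letter: meaning})."""
    upper = name.upper()
    if not upper.startswith("ARM"):
        raise ValueError("expected a name starting with 'ARM'")
    rest = upper[3:]
    digits = ""
    for ch in rest:
        if not ch.isdigit():
            break
        digits += ch
    # Everything after the digits that is a letter counts as a feature suffix.
    letters = [ch for ch in rest[len(digits):] if ch.isalpha()]
    return digits, {ch: SUFFIX_MEANINGS.get(ch, "unknown") for ch in letters}


core, features = decode_variant("ARM1176JZF-S")
print(core, features)
```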
Softfp and Hardfp
For ARM there are two different ABIs (Application Binary Interfaces), soft/softfp and hard. ‘soft’ doesn’t use the FPU at all and relies on gcc’s maths replacement functions to emulate floating point arithmetic. ‘softfp’ uses the FPU, but arguments to functions are passed through the integer registers and then moved to the floating point unit. ‘hard’ uses the FPU directly, with data passed straight into the floating point unit registers. soft and softfp are compatible in one direction only: a ‘soft’ app can run on a softfp system, but not vice versa. A ‘hardfloat’ application can run on neither of those systems. This means that in order to use hardfloat the system has to be completely recompiled for the hardfloat ABI, down to the last library and program.
The Raspbian distribution is in essence the Debian source recompiled for the ARMv6 instruction set with ‘hardfloat’ direct access to the VFPv2 instructions. There are also some other minor changes to make things work better on the Raspberry Pi, such as optimised implementations of memcpy() and memset(), but no real changes to the distribution. It’s intended to optimise the genuine Debian installation as much as possible for the Raspberry Pi.
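The compatibility rules described above can be written down as a small lookup table. This is only a model of the paragraph, not a tool that inspects real binaries, and the names `RUNS_ON` and `can_run` are made up for illustration. The text doesn't say whether a ‘soft’ binary can run on a hardfloat system, so I've assumed it can't, in line with the point that the whole system has to be recompiled. For reference, these three modes correspond to gcc's `-mfloat-abi=soft`, `-mfloat-abi=softfp` and `-mfloat-abi=hard` options.

```python
# For each ABI a binary was built for, the system ABIs it can run on,
# following the rules in the paragraph above. Keeping 'soft' and 'softfp'
# binaries off a 'hard' system is an assumption, not stated in the text.
RUNS_ON = {
    "soft": {"soft", "softfp"},
    "softfp": {"softfp"},
    "hard": {"hard"},
}


def can_run(app_abi, system_abi):
    """True if a binary built for app_abi can run on a system built for system_abi."""
    return system_abi in RUNS_ON[app_abi]


for app in RUNS_ON:
    for system in RUNS_ON:
        print(f"{app:6} app on {system:6} system: {can_run(app, system)}")
```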
To try to quantify the optimisations in Raspbian I extended my benchmarking tests and ran them on the two different architectures, Debian armel and Raspbian armhf.
The tests were run on a single Raspberry Pi using two different partitions on the same USB hard disk. The same kernel was used for both sets of tests. Tests were performed multiple times until a “best” score was reached. For the GTKPerf tests both Raspbian and Debian were set to the same GTK theme. Rather than show the raw data, I chose to present it in an easier to understand format: armhf performance is shown relative to armel performance.
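For anyone wanting to reproduce this kind of presentation, here is a rough sketch of how a relative score can be derived from timed runs. The exact formula isn't spelled out in this post, so treat this as one reasonable way to do it: it assumes each test produces a time in seconds where lower is better, and the function names and sample timings are made up.

```python
def best_time(run_times):
    """The 'best' score for a timed test is the shortest run."""
    return min(run_times)


def relative_performance(armel_seconds, armhf_seconds):
    """armhf speed as a percentage of armel speed (100 means identical)."""
    return 100.0 * armel_seconds / armhf_seconds


def improvement_percent(armel_seconds, armhf_seconds):
    """How much faster armhf is than armel, in percent (negative means slower)."""
    return relative_performance(armel_seconds, armhf_seconds) - 100.0


# Hypothetical timings in seconds, not measurements from these tests.
armel_runs = [12.4, 12.0, 12.1]
armhf_runs = [10.3, 10.0, 10.2]

armel = best_time(armel_runs)
armhf = best_time(armhf_runs)
print(f"armhf is {improvement_percent(armel, armhf):.0f}% faster than armel")
```

With a score where higher is better, such as frames per second, the ratio would be the other way up.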
Keep in mind the golden rule of benchmarking: All benchmarks are flawed benchmarks.
The chart above shows the performance difference in various applications between Debian armel and Raspbian armhf. Performance improvement varied from 4% to 40% depending upon the application.
The performance improvements seen by non-floating point applications like Gzip and Bzip2 are down to the ARMv6 instructions being used instead of the ARMv4 instructions. We seem to gain 4-10% in these applications.
GTKPerf showed a 19% improvement in X Windows GUI operations, which should make the Raspberry Pi more usable as a desktop. Quake 3 showed a more modest improvement, but that is because it is already limited by the GPU at 1080p resolutions.
The applications with larger performance improvements are those making heavy use of floating point maths, particularly media en/decoding, which see a huge performance increase. I also saw a 600% increase in MPEG Layer 3 and Layer 2 encoding performance, but I didn’t include that on this chart as it made the other data difficult to read.
Users of the latest OpenSSL packages will see a ~100% performance increase over these numbers, which is related to an ASM optimisation patch applied by the Raspbian team and isn’t relevant to this test. It does demonstrate how important optimisation is though!
The chart above shows a comparison of the MP3 encoding performance of the armel and armhf architectures. The numbers show significant speed improvements with the code compiled for hardfp/ARMv6 and demonstrate the performance gains possible from optimised code.
When encoding the test files the armhf encoder managed more than double real time, while armel managed less than half real time.
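In case the “real time” terminology is unfamiliar, this sketch shows how such a figure can be worked out: the length of the audio divided by the time taken to encode it. The function name and the numbers are hypothetical and are not the values measured here.

```python
def realtime_factor(audio_seconds, encode_seconds):
    """How many seconds of audio are encoded per second of wall-clock time.

    Above 1.0 is faster than real time, below 1.0 is slower.
    """
    return audio_seconds / encode_seconds


# Hypothetical example: a 240 second track.
track_length = 240.0
print(f"{realtime_factor(track_length, 100.0):.1f}x real time")  # finished in 100 s
print(f"{realtime_factor(track_length, 600.0):.1f}x real time")  # finished in 600 s
```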
The chart above shows the amount of CPU time required to decode various audio formats in real time on the Raspberry Pi. This is particularly important for XBMC users as all audio streams are decoded in software on the ARM CPU, and high-bitrate audio streams can be troublesome. This data shows a significant performance increase when decoding these formats on Raspbian.
The above chart shows some unusual results. The performance of bc is significantly lower on armhf than it is on armel. This is particularly strange as it should be impossible for this to happen! It’s almost certainly the result of the gcc compiler choosing a less efficient instruction when it compiled the armhf version of the binary. This is the only test that showed this behaviour.
This does demonstrate why we’re seeing improvements on the other tests though: it’s all down to the particular instructions that the code uses to perform each calculation. Newer instruction sets often have more efficient ways of doing things, allowing you to do the same work in less time, either with fewer instructions or with a more efficient instruction. Here we see what happens when the compiler picks badly.
I should stress that this doesn’t mean there is a performance trade-off with Raspbian. It means that we’ve uncovered a bug in this version of the gcc compiler that produces inefficient code in this one instance. In every other test I ran, Raspbian was faster.
I believe the QtonPi wiki best sums up my experiences:
Given the preponderance of hardfp performance over its register ignorant peers, this will be useful in eking every last drop of performance out of the hardware.