memetic dot org

Raspbian Benchmarking – armel vs armhf

by on Jul.13, 2012, under Linux, Raspberry Pi

My post on Raspberry Pi overclocking and benchmarking raised a few questions about the differences in performance between the armel architecture used by Debian and the armhf architecture used by Raspbian.

Debian officially provides two ARM architectures, armel and armhf. armel is for lower-end hardware: it targets the ARMv4 instruction set and reaches hardware floating point only through a compatibility mode, which slows performance but keeps compatibility with code written for processors without floating point units. armhf is for higher-end hardware: it requires ARMv7 and uses faster, direct hardware floating point, with no backwards compatibility. These are roughly analogous to the i386 and i686 architectures.

Note that in Ubuntu, both armel and armhf are compiled for ARMv7 and above, so neither will work on the Raspberry Pi.

The Raspberry Pi uses an ARM11 processor supporting the ARMv6 instruction set and VFPv2 hardware floating point. Under Debian armel, performance is being sacrificed to retain compatibility with code compiled without support for the VFP. The Pi is also potentially missing out on faster, more optimised instructions introduced with the ARMv5 and ARMv6 instruction sets. The Raspberry Pi is a victim of the compromises made between performance and compatibility when standardising the Debian architectures.

ARM Instruction Sets and Architectures

At this point it’s worth explaining these version numbers. There are two numbers used to describe the ‘version’ of an ARM processor.

The first is the instruction set version, which on the Pi is ARMv6. Most newer devices use ARMv7, and ARMv8 will be the next version.

The second is the ARM processor design version, which on the Pi is 11. ARM11 introduced the ARMv6 instruction set. The next version was the Cortex A, which introduced the ARMv7 instructions. Your smartphone will probably have a processor with an ARM11 or an ARM Cortex A.

Each processor design comes in several variants depending upon which optional features it includes. The Raspberry Pi uses the ARM1176JZF-S, with the F meaning that it includes the Vector Floating Point (VFP) instruction set.
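
If you want to check this on your own board, the Linux kernel lists the CPU's optional features in /proc/cpuinfo, and a VFP-equipped core shows `vfp` in the Features line. The sketch below is a minimal illustration rather than a definitive tool: the `parse_cpuinfo` helper and the `SAMPLE` text are my own assumptions, and the exact field names and values vary between kernels and boards.

```python
def parse_cpuinfo(text):
    """Return (model, feature_set) parsed from /proc/cpuinfo-style text."""
    model = None
    features = set()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip().lower()
        value = value.strip()
        # Some kernels use "Processor" for the name, others use "model name"
        # and keep "processor" for the core index (a plain number).
        if key in ("model name", "processor") and model is None and not value.isdigit():
            model = value
        elif key == "features":
            features.update(value.split())
    return model, features


# Illustrative sample only (an assumption); real output differs per system.
SAMPLE = """\
Processor\t: ARMv6-compatible processor rev 7 (v6l)
Features\t: swp half thumb fastmult vfp edsp java tls
"""

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back to the sample on systems without /proc
    model, features = parse_cpuinfo(text)
    print("CPU:", model)
    print("VFP available:", "vfp" in features)
```

On a non-ARM machine the Features line is absent, so the script simply reports that VFP is not available.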

Softfp and Hardfp

For ARM there are two different ABIs (Application Binary Interfaces), soft/softfp and hard. ‘soft’ doesn’t use the FPU at all and relies on gcc’s replacement maths functions to emulate floating point arithmetic. ‘softfp’ uses the FPU, but function arguments are passed in the integer registers and then copied to the floating point unit. ‘hard’ uses the FPU directly, with arguments passed straight into the floating point registers. soft and softfp are forwards compatible because they share a calling convention, i.e. a ‘soft’ app can run on a softfp system, but not vice versa. A ‘hardfloat’ application can run on neither of those systems. This means that in order to use hardfloat the system has to be completely recompiled for the hardfloat ABI, down to the last library and program.
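
One way to see which ABI a binary was built for is to look at its ELF header. The sketch below is a rough illustration under a few assumptions: it only handles 32-bit ELF files, the function name `arm_float_abi` is my own, and it relies on the `EF_ARM_ABI_FLOAT_SOFT` (0x200) and `EF_ARM_ABI_FLOAT_HARD` (0x400) bits in `e_flags`, which older toolchains may not set at all. Note that soft and softfp cannot be told apart this way, since they share the same calling convention.

```python
import struct

EM_ARM = 40                      # e_machine value for 32-bit ARM
EF_ARM_ABI_FLOAT_SOFT = 0x200    # soft/softfp calling convention
EF_ARM_ABI_FLOAT_HARD = 0x400    # hard-float (VFP register) calling convention


def arm_float_abi(header):
    """Classify the float ABI from the first 40 bytes of an ELF32 file."""
    if len(header) < 40 or header[:4] != b"\x7fELF":
        return "not an ELF file"
    if header[4] != 1:           # EI_CLASS: 1 means 32-bit
        return "not a 32-bit ELF file"
    endian = "<" if header[5] == 1 else ">"   # EI_DATA: 1 means little-endian
    (machine,) = struct.unpack_from(endian + "H", header, 18)  # e_machine
    (flags,) = struct.unpack_from(endian + "I", header, 36)    # e_flags
    if machine != EM_ARM:
        return "not an ARM binary"
    if flags & EF_ARM_ABI_FLOAT_HARD:
        return "hard"
    if flags & EF_ARM_ABI_FLOAT_SOFT:
        return "soft/softfp"
    return "unmarked"            # flag absent, e.g. built by an older toolchain
```

To try it, read the first 40 bytes of a binary opened in `"rb"` mode and pass them to `arm_float_abi`. A dynamic loader makes a similar distinction, which is why a hardfloat program cannot simply be dropped onto a soft/softfp system.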

The Raspbian distribution is in essence the Debian source recompiled for the ARMv6 instruction set with ‘hardfloat’ direct access to the VFPv2 instructions. There are also some other minor changes to make things work better on the Raspberry Pi, such as optimised implementations of memcpy() and memset(), but no real changes to the distribution. It’s intended to optimise the genuine Debian installation as much as possible for the Raspberry Pi.

Performance Results

To try to quantify the optimisations in Raspbian I extended my benchmarking tests and ran them on the two different architectures, Debian armel and Raspbian armhf.

The tests were run on a single Raspberry Pi using two different partitions on the same USB hard disk. The same kernel was used for both sets of tests. Each test was run multiple times until a “best” score was reached. For the GTKPerf tests both Raspbian and Debian were set to the same GTK theme. Rather than the raw data, I present the results in an easier to understand format: armhf performance relative to armel performance.
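
As a rough sketch of how the relative figures can be derived, the snippet below takes the best of several timed runs on each architecture and expresses armhf as a ratio of armel. It assumes each test reports an elapsed time where lower is better; the function names and the timings are hypothetical and are not the measurements behind the charts.

```python
def best_time(samples):
    """Best (lowest) elapsed time from repeated runs of one test."""
    return min(samples)


def relative_performance(armel_seconds, armhf_seconds):
    """armhf speed relative to armel: 1.0 = equal, 1.25 = 25% faster."""
    return armel_seconds / armhf_seconds


# Hypothetical timings in seconds, NOT the data from this post.
runs = {
    "gzip": {"armel": [10.4, 10.1, 10.0], "armhf": [9.6, 9.3, 9.4]},
}

for name, r in runs.items():
    ratio = relative_performance(best_time(r["armel"]), best_time(r["armhf"]))
    print(f"{name}: armhf is {100 * (ratio - 1):.0f}% faster than armel")
```

Taking the best run rather than the mean filters out interference from background activity, at the cost of hiding run-to-run variation.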

Keep in mind the golden rule of benchmarking: All benchmarks are flawed benchmarks.

Performance comparison between Debian armel and Raspbian armhf

The chart above shows the performance difference in various applications between Debian armel and Raspbian armhf. Performance improvement varied from 4% to 40% depending upon the application.

The performance improvements seen by non-floating point applications like Gzip and Bzip2 were related to the ARMv6 instructions being used instead of the ARMv4 instructions. We seem to gain 4-10% in these applications.

GTKPerf showed a 19% improvement in X Windows GUI operations, which should make the Raspberry Pi more usable as a desktop. Quake 3 showed a more modest improvement, but that is because it is already limited by the GPU at 1080p resolutions.

The applications with larger performance improvements are those making heavy use of floating point maths, particularly media en/decoding, which see a huge performance increase. I also saw a 600% increase in Mpeg Layer 3 and Layer 2 encoding performance, but I didn’t include that on this chart as it made the other data difficult to read.

Users of the latest OpenSSL packages will see a ~100% performance increase over these numbers, which is related to an ASM optimisation patch applied by the Raspbian team and isn’t relevant to this test. It does demonstrate how important optimisation is though!

MP3 Encoding

MP3 encoding performance comparison of armel and armhf

The chart above shows a comparison of the MP3 encoding performance of the armel and armhf architectures. The numbers show significant speed improvements with the code compiled for hardfp/ARMv6 and demonstrate the performance gains possible from optimised code.

When encoding the test files, the armhf encoder ran at more than double real time, while the armel encoder ran at less than half real time.

Media Decoding

Resources required to decode various audio formats in real time on the Raspberry Pi

The chart above shows the amount of CPU time required to decode various audio formats in real time on the Raspberry Pi. This is particularly important for XBMC users as all audio streams are decoded in software on the ARM CPU, and high-bitrate audio streams can be troublesome. This data shows a significant performance increase when decoding these formats on Raspbian.
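
To make the "CPU time required to decode in real time" figure concrete, here is a minimal sketch of the arithmetic, assuming you have measured the CPU seconds spent decoding a clip of known length. The helper names and the example numbers are hypothetical.

```python
def cpu_share_percent(cpu_seconds, audio_seconds):
    """Percentage of the CPU needed to decode the audio in real time."""
    return 100.0 * cpu_seconds / audio_seconds


def can_decode_realtime(cpu_seconds, audio_seconds, headroom_percent=0.0):
    """True if decoding fits in real time while leaving the given headroom."""
    return cpu_share_percent(cpu_seconds, audio_seconds) <= 100.0 - headroom_percent


# Hypothetical example: 60 s of audio that takes 21 s of CPU time to decode.
share = cpu_share_percent(21.0, 60.0)
print(f"{share:.0f}% of the CPU is needed for real-time playback")
```

Anything approaching 100% leaves nothing for the rest of the media player, so a headroom margin is a more useful test than simply staying under 100%.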

Compiler Anomaly

Performance comparison of bc on armel and armhf architectures

The above chart shows some unusual results. The performance of bc is significantly lower on armhf than it is on armel. This is particularly strange as it should be impossible for this to happen! It’s almost certainly the result of the gcc compiler choosing a less efficient instruction when it compiled the armhf version of the binary. This is the only test that showed this behaviour.

This does demonstrate why we’re seeing improvements on the other tests, though: it’s all down to the particular instructions that the code uses to perform each calculation. Newer instruction sets often have more efficient ways of doing things, allowing you to do the same work in less time, either with fewer instructions or with a more efficient instruction. Here we see what happens when a compiler picks poorly.

I should stress that this doesn’t mean there is a performance trade-off with Raspbian. It means that we’ve uncovered a bug in this version of the gcc compiler that produces inefficient code in this one instance. Raspbian is undoubtedly faster for all purposes.

Conclusion

I believe the QtonPi wiki best sums up my experiences:

Given the preponderance of hardfp performance over its register ignorant peers, this will be useful in eking every last drop of performance out of the hardware.

 

 


16 Comments for this entry

  • futairs

    Thanks for clearing that up. Having tried to deal with hard float before (MaverickCrunch), I know the difference it can make.

  • Sebastian

    Hi, I really appreciate what you did. Good work. One question though: Is it with “Raspbian armhf” okay to use rpi-update,
    or does it break anything? Thanks in advance. Sebastian.

  • willieboy

    A very interesting, comprehensive and useful analysis. Your analysis is much appreciated by this pi user. Many thanks!

  • The_Ross_

    Thank you for taking the time to explain armel, armhf, and the benchmark results. This Pi noobie is very appreciative.

  • Wookey

    Did you check whether both the armel and armhf binaries were compiled with the same version of the compiler? If you just used armel binaries from Debian they could have been built some time ago with an older compiler version, depending on when the last upload was done. Whereas all of Raspbian was built recently with the same (recent) compiler version.

    The 4-10% differences in most of the tests are the sort of change you get from one gcc version to the next, so it’s important to control for this. The change from armel to armhf ABI should make a (sometimes large) difference for FP code, but should make almost no difference at all for integer code, and I wouldn’t expect there to be any in at least the bzip and gzip tests. There should be some room for optimisation in the differences between armv4t and armv6 instructions, although the differences are not huge. This may account for most of the difference, or it may be down to compiler version variation if that wasn’t controlled for.

    Benchmarking is a difficult art to get reproducible results from, so some more details of what you actually did would be most helpful in order to make it possible to get a good handle on how much difference v4t->v6, armel->armhf, and maybe gcc x.y to gcc x.y+1 respectively make.

    One final comment. I wouldn’t really describe the armel ABI as a ‘compatibility layer’. That gives the wrong idea – it’s just a different ABI spec. for how function values are passed in registers. It _is_ less efficient for FP values due to extra copies. But this is a nitpick, and your tests are interesting and well-presented. Thank you for doing them.

    • adama

      Very good points. I tried many different ways to explain softfp/hardfp.

      Benchmarks for softfp/hardfp I’ve seen for other platforms don’t produce the overall benefit we see here, so I suspect a lot of it is down to either compiler changes or memcpy/memset. Some platforms even see frequent performance regressions(!).

    • Michas

      There is a big difference in integer performance between ARMv4 and ARMv6. There are ARMv5E DSP instructions and ARMv6 SIMD. It has nothing to do with floating point maths. The compilers don’t use them, but there is hand-written assembler code for ARMv6 in ffmpeg. So this very big difference for encoding of popular codecs looks possible.

      • adama

        This sounds like a good explanation for the encoder performance. The difference between softfp and hardfp shouldn’t be very large at all.

  • Alex Buell

    Which version of the compiler was used?

  • Tim Snyder

    Nice writeup on armel vs armhf. What did you use to make your graphs? They look really nice.
