Support for ARMv8 Cryptography Extensions on Vero4k

Unlike the Raspberry Pi, the Vero 4k’s CPU supports ARMv8 cryptography extensions for AES, SHA1 and SHA2-256.

osmc@osmc:~$ cat /proc/cpuinfo | grep Features
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 wp half thumb fastmult vfp edsp neon vfpv3 tlsi vfpv4 idiva idivt
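
A minimal sketch of the same check, for anyone who wants a yes/no answer per extension rather than reading the whole Features line. The helper name `crypto_flags` is hypothetical, and the flag names (`aes`, `pmull`, `sha1`, `sha2`) are simply the ones shown in the output above.

```shell
# Report which ARMv8 crypto flags appear in a cpuinfo-style "Features" line.
# crypto_flags is a hypothetical helper name, not an existing tool.
crypto_flags() {
    # $1: path to a cpuinfo-style file (defaults to /proc/cpuinfo)
    features=$(grep -m1 '^Features' "${1:-/proc/cpuinfo}" | cut -d: -f2)
    for flag in aes pmull sha1 sha2; do
        # Pad with spaces so only whole words match
        case " $features " in
            *" $flag "*) echo "$flag: yes" ;;
            *)           echo "$flag: no" ;;
        esac
    done
}
```

Calling `crypto_flags` with no argument reads /proc/cpuinfo; passing a file name lets you check a saved copy from another machine.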

However, it appears that openssl is not making use of these crypto extensions:

osmc@osmc:~$ openssl speed -evp aes-128-cbc
Doing aes-128-cbc for 3s on 16 size blocks: 8066354 aes-128-cbc’s in 2.99s
Doing aes-128-cbc for 3s on 64 size blocks: 2589229 aes-128-cbc’s in 2.99s
Doing aes-128-cbc for 3s on 256 size blocks: 703339 aes-128-cbc’s in 2.99s
Doing aes-128-cbc for 3s on 1024 size blocks: 179647 aes-128-cbc’s in 2.99s
Doing aes-128-cbc for 3s on 8192 size blocks: 22599 aes-128-cbc’s in 2.99s
OpenSSL 1.0.1t 3 May 2016
built on: Fri Jan 27 00:26:25 2017
options:bn(64,32) rc4(ptr,char) des(idx,cisc,16,long) aes(partial) blowfish(ptr)
compiler: gcc -I. -I… -I…/include -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -DTERMIO -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -Wl,-z,relro -Wa,--noexecstack -Wall -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DAES_ASM -DGHASH_ASM
The ‘numbers’ are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-128-cbc 43164.44k 55421.62k 60218.99k 61524.59k 61916.73k
osmc@osmc:~$ openssl speed aes-128-cbc
Doing aes-128 cbc for 3s on 16 size blocks: 9004567 aes-128 cbc’s in 2.97s
Doing aes-128 cbc for 3s on 64 size blocks: 2683427 aes-128 cbc’s in 2.99s
Doing aes-128 cbc for 3s on 256 size blocks: 709304 aes-128 cbc’s in 2.99s
Doing aes-128 cbc for 3s on 1024 size blocks: 180315 aes-128 cbc’s in 3.00s
Doing aes-128 cbc for 3s on 8192 size blocks: 22633 aes-128 cbc’s in 2.99s
OpenSSL 1.0.1t 3 May 2016
built on: Fri Jan 27 00:26:25 2017
options:bn(64,32) rc4(ptr,char) des(idx,cisc,16,long) aes(partial) blowfish(ptr)
compiler: gcc -I. -I… -I…/include -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -DTERMIO -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -Wl,-z,relro -Wa,--noexecstack -Wall -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DAES_ASM -DGHASH_ASM
The ‘numbers’ are in 1000s of bytes per second processed.
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes
aes-128 cbc 48509.45k 57437.90k 60729.71k 61547.52k 62009.88k

It is my understanding (confirmed on other larger systems) that using the -evp flag will cause openssl to use the AES hardware acceleration, where it is available. Though the results differ with and without -evp, they do not suggest a significant (or even any) improvement when -evp is used.

My first thought is that the openssl version from raspbian has probably not been compiled with hardware acceleration enabled, since it is aimed at the Raspberry Pi, which lacks such a feature. Plus the version of openssl on raspbian is 1.0.1t, which is a bit old and might not even have the equivalent of AES-NI for ARMv8.

I’m not averse to compiling my own openssl but perhaps someone more versed in these matters could tell me whether openssl 1.0.1t supports the crypto extensions and would it also need to be supported in the kernel and/or the Vero 4k’s equivalent of a “BIOS”?

Also, assuming there are no significant roadblocks on the way, what would be the chances of OSMC providing an “enhanced” AES-enabled build of openssl?

Hello

This is a good question.

Implementation of cryptography extensions in ARMv8 is actually optional. Our SoC provider has their own modules for crypto. These usually require additional modules, and some changes to the Device Tree. But they didn’t provide any improvements in my testing.

There isn’t always an easy way for userspace to use kernel based cryptography, but OpenSSL is an application that can take advantage of cryptography extensions when available.

I tried using the SoC’s cryptography modules, and you’ll see in /proc/crypto what’s available, but didn’t find performance benefits with them at this time.

There is recent discussion on the Linux kernel mailing list about improving cryptography support in a new kernel, so we may be able to benefit from this in the near future.

If you’re running a web server with SSL, you might want to terminate before the Vero 4K.

Sam

Hi Sam,

First off, thank you for your detailed reply. I sometimes wonder whether you ever manage to get any sleep, given the gold-standard support you give us.

You mention that the SoC’s crypto modules are there “but didn’t provide any improvements” during your testing. That in itself is a bit odd, since we know that Intel’s AES-NI extensions provide significant performance improvements and, though I have no idea how the ARM extensions are supposed to perform, I’m sure we’d notice a difference if they were working. Plus the CPU is advertising that it (allegedly) has AES, SHA1 and SHA2 already baked in. It wouldn’t be the first time that a company has pushed out a product before it is ready, with the intention of removing the rough edges at a later date, now would it? Perhaps their crypto modules are little more than stub placeholders, for example, just waiting to be finished. Ultimately, “time to market” is what counts for these people.

Anyway, being of a curious disposition, I checked the openssl binaries on my Pi3 and Vero4k and they are identical. So I decided to try to build openssl on the Vero4k, since the config step would / should see that the CPU has the hardware-acceleration options built in and possibly try to make use of them.

The good news is that the config step seems to have identified the existence of the hardware crypto extensions. The bad news is that the make step fails because it claims that the processor doesn’t support the extensions:

arm64cpuid.S: Assembler messages:
arm64cpuid.S:4: Error: unknown architecture `armv8-a+crypto'

arm64cpuid.S:10: Error: ARM register expected -- `orr v15.16b,v15.16b,v15.16b'
arm64cpuid.S:11: Error: bad instruction `ret'
arm64cpuid.S:17: Error: ARM register expected -- `mrs x0,CNTVCT_EL0'
arm64cpuid.S:18: Error: bad instruction `ret'
arm64cpuid.S:24: Error: selected processor does not support ARM mode `aese v0.16b,v0.16b'
arm64cpuid.S:25: Error: bad instruction `ret'
arm64cpuid.S:31: Error: selected processor does not support ARM mode `sha1h s0,s0'
arm64cpuid.S:32: Error: bad instruction `ret'
arm64cpuid.S:38: Error: selected processor does not support ARM mode `sha256su0 v0.4s,v0.4s'
arm64cpuid.S:39: Error: bad instruction `ret'
arm64cpuid.S:44: Error: bad instruction `pmull v0.1q,v0.1d,v0.1d'
arm64cpuid.S:45: Error: bad instruction `ret'
<builtin>: recipe for target 'arm64cpuid.o' failed
make[1]: *** [arm64cpuid.o] Error 1
make[1]: Leaving directory '/media/48891270-a51c-4c01-be54-db9f5f8c09b4/ossl/openssl-1.0.2k/crypto'
Makefile:287: recipe for target 'build_crypto' failed
make: *** [build_crypto] Error 1

This was using openssl v1.0.2k, BTW, since it’s easy to download and I didn’t intend to install it. It was just an experiment. I’m a bit surprised to see messages like `unknown architecture armv8-a+crypto'. Perhaps the compiler (gcc 4.9.2-10) is too old or it might need some directive tweaking to get it to work. Is this worth pursuing further or should I give up for now?

Bottom line: it’s looking a bit borked, right now. It’ll be useful if / when it finally arrives, but I can wait! :slight_smile:

Before building, try running setarch linux32. I suspect the compiler is confused by the AArch64 kernel and armv7 userland.

The SoC implements their own crypto modules, but they don’t implement the ARM cryptography extensions properly from what I can tell. I suspect this may be for licensing reasons, so they have instead provided their own modules for crypto, but documentation on them is rather sparse. When I first tested them, I found they would deadlock on ioctl, so it’s indeed possible that they are stubs for now. I suspect until a large customer has a use for these modules, they may be on the backburner.

There is a lot of interesting discussion about this on the kernel mailing list.

Sam

It didn’t seem to be too promising because I was seeing things like Configuring for linux-armv4 and the compiler directive -march=armv7-a but it compiled ok. However, the real test is in the running:

$ ./apps/openssl speed -evp aes-128-cbc
WARNING: can't open config file: /usr/local/ssl/openssl.cnf
Doing aes-128-cbc for 3s on 16 size blocks: 26868125 aes-128-cbc's in 2.99s
Doing aes-128-cbc for 3s on 64 size blocks: 19475940 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 8870689 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 2855763 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 389699 aes-128-cbc's in 3.00s
OpenSSL 1.0.2k  26 Jan 2017
built on: reproducible build, date unspecified
options:bn(64,32) rc4(ptr,char) des(idx,cisc,16,long) aes(partial) idea(int) blowfish(ptr) 
compiler: gcc -I. -I.. -I../include  -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -march=armv7-a -Wa,--noexecstack -O3 -Wall -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DAES_ASM -DBSAES_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     143775.92k   415486.72k   756965.46k   974767.10k  1064138.07k
$ ./apps/openssl speed aes-128-cbc
WARNING: can't open config file: /usr/local/ssl/openssl.cnf
Doing aes-128 cbc for 3s on 16 size blocks: 9696742 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 64 size blocks: 2655967 aes-128 cbc's in 2.99s
Doing aes-128 cbc for 3s on 256 size blocks: 694189 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 1024 size blocks: 175086 aes-128 cbc's in 3.00s
Doing aes-128 cbc for 3s on 8192 size blocks: 21947 aes-128 cbc's in 3.00s
OpenSSL 1.0.2k  26 Jan 2017
built on: reproducible build, date unspecified
options:bn(64,32) rc4(ptr,char) des(idx,cisc,16,long) aes(partial) idea(int) blowfish(ptr) 
compiler: gcc -I. -I.. -I../include  -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -march=armv7-a -Wa,--noexecstack -O3 -Wall -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DAES_ASM -DBSAES_ASM -DGHASH_ASM
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128 cbc      51715.96k    56850.13k    59237.46k    59762.69k    59929.94k

With hardware extensions:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     143775.92k   415486.72k   756965.46k   974767.10k  1064138.07k

Without hardware extensions:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128 cbc      51715.96k    56850.13k    59237.46k    59762.69k    59929.94k

Success! Give that man a cigar!

For completeness, here are the same tests, using the same newly-compiled code, on the Pi3.
With -evp flag:
aes-128-cbc 39351.31k 46509.40k 49894.91k 50424.78k 50552.04k
Without -evp flag:
aes-128 cbc 43417.83k 48102.90k 49957.93k 50531.98k 50629.88k

No appreciable difference between the two results, which is to be expected, and roughly in line with the Vero4k non-accelerated figures.

Edit: sha1 and sha256 should also be supported in hardware but I’m getting strange figures:

sha1 with -evp flag:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
sha1              7235.88k    28153.60k    99893.85k   273392.64k   558115.50k

sha1 without -evp flag:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
sha1              7307.21k    28169.39k    99848.70k   273652.05k   560201.29k

sha256 with -evp flag:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
sha256            7189.42k    27786.75k    97804.03k   263942.49k   528731.95k

sha256 without -evp flag:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
sha256           37835.67k   124624.70k   311541.73k   492438.18k   596473.49k

There is no difference for sha1, and sha256 is actually slower with the -evp flag. The sha256 with -evp is also performing the same as sha1, which is odd. It might be that the -evp flag disables hardware acceleration for sha256. Needs investigating.
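
One possible way to investigate, sketched under assumptions: OpenSSL 1.0.2 on ARM reads an `OPENSSL_armcap` environment variable at startup and uses its value as the capability mask instead of probing the CPU. The helper name `armcap_mask` is hypothetical, and the bit values below are taken from `crypto/arm_arch.h` in the 1.0.2 source as I remember it, so check them against your own source tree before relying on them.

```shell
# Build an OPENSSL_armcap mask from capability names.
# armcap_mask is a hypothetical helper; the bit values are an assumption
# based on crypto/arm_arch.h in OpenSSL 1.0.2 (verify in your source tree).
armcap_mask() {
    mask=0
    for name in "$@"; do
        case "$name" in
            neon)   bit=1 ;;
            tick)   bit=2 ;;
            aes)    bit=4 ;;
            sha1)   bit=8 ;;
            sha256) bit=16 ;;
            pmull)  bit=32 ;;
            *) echo "unknown capability: $name" >&2; return 1 ;;
        esac
        mask=$((mask | bit))
    done
    printf '0x%x\n' "$mask"
}
```

For example, `OPENSSL_armcap=$(armcap_mask neon aes sha1 pmull) ./apps/openssl speed -evp sha256` would run the sha256 test with the SHA256 extension masked out, and `OPENSSL_armcap=0` would mask everything. Only name capabilities that the CPU really advertises, since forcing a bit the hardware lacks is likely to end in an illegal instruction.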

The setarch line is needed because Vero 4K uses an AArch64 kernel, but an armhf userland. This can confuse some build tools which check uname -m to determine the system architecture.
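
A rough sketch of that decision as a script, assuming the build only needs the wrapper when `uname -m` reports `aarch64`. The helper name `build_prefix` is hypothetical; `setarch linux32` is the command suggested earlier in the thread.

```shell
# Pick a command prefix for building 32-bit software, based on `uname -m`.
# build_prefix is a hypothetical helper; $1 is the machine name to inspect.
build_prefix() {
    case "$1" in
        aarch64) echo "setarch linux32" ;;  # 64-bit kernel: present a 32-bit personality
        *)       echo "" ;;                 # anything else: run the build unwrapped
    esac
}

machine=$(uname -m)
prefix=$(build_prefix "$machine")
echo "uname -m reports: $machine"
echo "build would run as: $prefix ./config && $prefix make"
```

Under `setarch linux32`, `uname -m` should report a 32-bit ARM machine name instead of `aarch64`, which is what lets the configure step pick a 32-bit target.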

I haven’t looked in to this in great detail on Vero 4K, but it’s possible there can be improvements made in the future.

Sam

The performance improvements should be worth the effort, IMO. For example, the aes-128-cbc performance with a 1k block size is roughly sixteen times faster using hardware extensions.
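
For reference, the speedup at every block size can be worked out from the two Vero 4K result rows posted above. This is only arithmetic on those figures; the function name `speedups` is made up for the example.

```shell
# Per-block-size speedup from the two aes-128-cbc rows quoted in this thread.
evp="143775.92 415486.72 756965.46 974767.10 1064138.07"    # with -evp
plain="51715.96 56850.13 59237.46 59762.69 59929.94"        # without -evp
sizes="16 64 256 1024 8192"

speedups() {
    # $1: accelerated row, $2: baseline row, $3: block sizes (same length)
    echo "$1 $2 $3" | awk '{
        n = NF / 3
        for (i = 1; i <= n; i++)
            printf "%s bytes: %.1fx\n", $(2*n + i), $i / $(n + i)
    }'
}

speedups "$evp" "$plain" "$sizes"
```

The ratio grows with block size, from just under 3x at 16 bytes to nearly 18x at 8192 bytes, with the 1024-byte case coming out at about 16.3x.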

Would this logic also apply to openvpn encryption? I’m seeing bandwidth cut roughly in half when I’m connected to IPVanish.

I doubt it: if CPU use is low when you are using the connection then it’s unlikely to be the limiting factor.

It probably depends on the time of the day and the endpoint you’re using

Yes. Openvpn uses openssl for its encryption.

Details??? As Sam says, it could be the location of the server that’s a problem (closer is better/faster) or it could be that IPVanish isn’t very fast, especially at certain times of the day. You’ll need to run top or htop to see how much you’re loading the CPU.

But NB: since openvpn is only a single-threaded process, it can use just one processor core out of the four on the Vero 4K. So there’s the potential to max out one core faster than you might have expected if you’re pushing the VPN hard, while the other three cores just sit there doing very little.

Actually I think it’s mostly IPVanish’s server connection: when I connect to their closest server (Auburn VA for me), I get bad transfer rates to my media servers in France (I use real-debrid, which is based in France). I get much better transfer rates when I connect to IPVanish servers actually in France.

Logically I would assume that means that openvpn isn’t the bottleneck at all? But between the combination of multi-threading and the SHA hardware, the compression part would at the very least reduce lag, correct?

Not necessarily; I think the Crypto Extensions would help for running a server, but not necessarily a client.

Sam

Gotcha. So now I’m going down the route of figuring out how to automate the change between US and FR servers when I go from Kodi to X11, then the reverse back again.

I’d rephrase that to:

Crypto Extensions are more helpful for a server than for a client.

A client will also benefit when the download data rate is sufficiently high, for example torrents and perhaps high bit-rate streaming media. It’ll depend on your internet connection being fast enough.
