[HOW-TO] SSHFS tutorial

Benchmarks

People might be curious to see what kind of performance can be expected from such a setup.

The environment is this:

  • client: Vero 4k
  • switch: Turris Omnia
  • server: an old dual-core Intel E6500 machine with a Seagate ST4000DM000 4TB HDD

Everything is connected with gigabit, end to end.
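
All of the tests below run against an SSHFS mount set up as described earlier in this how-to, roughly along these lines (the remote export path /srv and the absence of extra options are my assumptions here; adjust to your own setup):

osmc@osmc:~$ sshfs osmc@server:/srv /mnt/server  # remote /srv exposed as /mnt/server (paths assumed)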

First read of a 10MB file:

osmc@osmc:~$ dd if=/mnt/server/10M of=/dev/null
19531+1 records in
19531+1 records out
10000000 bytes (10 MB, 9.5 MiB) copied, 1.03572 s, 9.7 MB/s

Subsequent reads are, obviously, easily cached:

osmc@osmc:~$ dd if=/mnt/server/10M of=/dev/null
19531+1 records in
19531+1 records out
10000000 bytes (10 MB, 9.5 MiB) copied, 0.0420413 s, 238 MB/s
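
To get cold-read numbers again without remounting, the client's page cache can be dropped between runs. A quick sketch (this only clears the kernel cache on the client; whatever sshfs itself or the server caches is untouched):

osmc@osmc:~$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches  # drop pagecache, dentries and inodes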

bonnie++ benchmarks:

Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
osmc             4G    25  22 16714  11  9056   5  1032  88 10748   2  1183  34
Latency               391ms     635ms    1348ms     254ms     259ms     212ms
Version  1.97       ------Sequential Create------ --------Random Create--------
osmc                -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
               1024   253   4   881   4   274   2   254   4   923   4   246   2
Latency              3260ms     981ms   17924ms    3472ms   55357us   30419ms
1.97,1.97,osmc,1,1548699449,4G,,25,22,16714,11,9056,5,1032,88,10748,2,1183,34,1024,,,,,253,4,881,4,274,2,254,4,923,4,246,2,391ms,635ms,1348ms,254ms,259ms,212ms,3260ms,981ms,17924ms,3472ms,55357us,30419ms
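
The exact bonnie++ invocation is not reproduced above; something along these lines (the target directory and the -s size are assumptions derived from the report) produces that kind of output:

osmc@osmc:~$ bonnie++ -d /mnt/server/osmc -s 4G  # -d: directory under test, -s: file size (assumed flags)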

During the tests, top showed SSH and SSHFS each taking about 20% of a CPU, with the load average around 1.5/1.6, so it seems like the CPU is not much of a bottleneck, at first glance. The network was far from saturated, of course: it spiked at around 14MiB/s in the earlier parts of the test and then leveled off at a much slower ~200KiB/s.
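
To watch the link during a run without extra tools, a crude sampling loop over /proc/net/dev is enough; a rough sketch, assuming the client interface is called eth0:

osmc@osmc:~$ prev=$(awk -F'[: ]+' '/eth0:/ {print $3}' /proc/net/dev)  # RX bytes counter
osmc@osmc:~$ while sleep 1; do
>   cur=$(awk -F'[: ]+' '/eth0:/ {print $3}' /proc/net/dev)
>   echo "$(( (cur - prev) / 1024 )) KiB/s received"
>   prev=$cur
> done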


Here’s the same benchmark performed directly on the server:

Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
server          16G   448  98 84327  12 45061   7  1112  90 86878   8  72.7   2
Latency             19506us    4374ms    4166ms   30304us     122ms    4964ms
Version  1.97       ------Sequential Create------ --------Random Create--------
server              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
               1024 18074  33 290437  99   750   1 21873  39 552006  99   420   0
Latency              7862ms    3316us   32984ms    1483ms     797us   33467ms
1.97,1.97,server,1,1548786028,16G,,448,98,84327,12,45061,7,1112,90,86878,8,72.7,2,1024,,,,,18074,33,290437,99,750,1,21873,39,552006,99,420,0,19506us,4374ms,4166ms,30304us,122ms,4964ms,7862ms,3316us,32984ms,1483ms,797us,33467ms

There is definitely a performance hit there: the test took over 6 hours when running over sshfs, and only a fraction of that locally. Sequential writes are about 5 times faster locally and reads about 8 times faster. Creating/reading/deleting lots of small files is especially painful, but in my case I treat the storage as read-only.

A test run with fio gives different results, which are surprising. On the OSMC SSHFS endpoint, I get around 3324KB/s:

osmc@osmc:~$ fio --name=stressant --group_reporting --directory=/mnt/server/osmc/ --size=100M --readwrite=randrw --direct=1 --numjobs=4
stressant: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
...
fio-2.16
Starting 4 processes
stressant: Laying out IO file(s) (1 file(s) / 100MB)
stressant: Laying out IO file(s) (1 file(s) / 100MB)
stressant: Laying out IO file(s) (1 file(s) / 100MB)
stressant: Laying out IO file(s) (1 file(s) / 100MB)
Jobs: 2 (f=2): [_(1),m(2),_(1)] [100.0% done] [2138KB/2298KB/0KB /s] [534/574/0 iops] [eta 00m:00s]
stressant: (groupid=0, jobs=4): err= 0: pid=11394: Mon Jan 28 21:54:34 2019
  read : io=205352KB, bw=3324.9KB/s, iops=831, runt= 61763msec
    clat (msec): min=1, max=3924, avg= 4.62, stdev=62.55
     lat (msec): min=1, max=3924, avg= 4.62, stdev=62.55
    clat percentiles (usec):
     |  1.00th=[ 1512],  5.00th=[ 1816], 10.00th=[ 2008], 20.00th=[ 2256],
     | 30.00th=[ 2448], 40.00th=[ 2608], 50.00th=[ 2800], 60.00th=[ 3024],
     | 70.00th=[ 3280], 80.00th=[ 3600], 90.00th=[ 4128], 95.00th=[ 4640],
     | 99.00th=[ 6880], 99.50th=[21888], 99.90th=[209920], 99.95th=[218112],
     | 99.99th=[3719168]
  write: io=204248KB, bw=3306.1KB/s, iops=826, runt= 61763msec
    clat (usec): min=38, max=2818, avg=105.09, stdev=53.16
     lat (usec): min=39, max=2831, avg=106.92, stdev=53.77
    clat percentiles (usec):
     |  1.00th=[   52],  5.00th=[   59], 10.00th=[   63], 20.00th=[   70],
     | 30.00th=[   77], 40.00th=[   88], 50.00th=[   98], 60.00th=[  104],
     | 70.00th=[  112], 80.00th=[  125], 90.00th=[  147], 95.00th=[  187],
     | 99.00th=[  302], 99.50th=[  346], 99.90th=[  620], 99.95th=[  748],
     | 99.99th=[ 1064]
    lat (usec) : 50=0.17%, 100=25.91%, 250=22.62%, 500=1.08%, 750=0.06%
    lat (usec) : 1000=0.02%
    lat (msec) : 2=4.92%, 4=39.36%, 10=5.48%, 20=0.11%, 50=0.15%
    lat (msec) : 100=0.01%, 250=0.09%, 2000=0.01%, >=2000=0.02%
  cpu          : usr=0.97%, sys=1.35%, ctx=104291, majf=0, minf=138
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=51338/w=51062/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=205352KB, aggrb=3324KB/s, minb=3324KB/s, maxb=3324KB/s, mint=61763msec, maxt=61763msec
  WRITE: io=204248KB, aggrb=3306KB/s, minb=3306KB/s, maxb=3306KB/s, mint=61763msec, maxt=61763msec

But the same test run directly on the server is over 10 times slower, at around 218KB/s:

root@server:~# sudo -u osmc fio --name=stressant --group_reporting --directory=/srv/osmc --size=100M --readwrite=randrw --direct=1 --numjobs=4
stressant: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
...
fio-2.16
Starting 4 processes
stressant: Laying out IO file(s) (1 file(s) / 100MB)
stressant: Laying out IO file(s) (1 file(s) / 100MB)
stressant: Laying out IO file(s) (1 file(s) / 100MB)
stressant: Laying out IO file(s) (1 file(s) / 100MB)
Jobs: 1 (f=1): [_(3),m(1)] [99.9% done] [96KB/84KB/0KB /s] [24/21/0 iops] [eta 00m:01s]
stressant: (groupid=0, jobs=4): err= 0: pid=11310: Mon Jan 28 22:14:19 2019
  read : io=205352KB, bw=224047B/s, iops=54, runt=938554msec
    clat (usec): min=190, max=653298, avg=54744.57, stdev=50035.60
     lat (usec): min=191, max=653299, avg=54745.93, stdev=50035.59
    clat percentiles (msec):
     |  1.00th=[    4],  5.00th=[   11], 10.00th=[   15], 20.00th=[   21],
     | 30.00th=[   27], 40.00th=[   34], 50.00th=[   43], 60.00th=[   57],
     | 70.00th=[   69], 80.00th=[   82], 90.00th=[  101], 95.00th=[  123],
     | 99.00th=[  204], 99.50th=[  424], 99.90th=[  523], 99.95th=[  545],
     | 99.99th=[  603]
  write: io=204248KB, bw=222842B/s, iops=54, runt=938554msec
    clat (usec): min=203, max=649080, avg=17813.42, stdev=42354.94
     lat (usec): min=204, max=649082, avg=17814.94, stdev=42354.96
    clat percentiles (usec):
     |  1.00th=[  245],  5.00th=[  302], 10.00th=[  366], 20.00th=[  580],
     | 30.00th=[  868], 40.00th=[ 2704], 50.00th=[ 7264], 60.00th=[12096],
     | 70.00th=[18304], 80.00th=[26752], 90.00th=[40704], 95.00th=[59648],
     | 99.00th=[104960], 99.50th=[423936], 99.90th=[536576], 99.95th=[552960],
     | 99.99th=[593920]  
    lat (usec) : 250=0.60%, 500=7.91%, 750=5.25%, 1000=2.43%
    lat (msec) : 2=2.37%, 4=3.57%, 10=7.75%, 20=15.55%, 50=28.83%
    lat (msec) : 100=20.08%, 250=4.93%, 500=0.57%, 750=0.18%
  cpu          : usr=0.05%, sys=0.13%, ctx=103589, majf=0, minf=42
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=51338/w=51062/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1
                         
Run status group 0 (all jobs):
   READ: io=205352KB, aggrb=218KB/s, minb=218KB/s, maxb=218KB/s, mint=938554msec, maxt=938554msec
  WRITE: io=204248KB, aggrb=217KB/s, minb=217KB/s, maxb=217KB/s, mint=938554msec, maxt=938554msec                         
                        
Disk stats (read/write): 
    dm-6: ios=108343/51634, merge=0/0, ticks=3904652/989000, in_queue=4901284, util=100.00%, aggrios=102071/51643, aggrmerge=6184/179, aggrticks=3605260/975972, aggrin_queue=4581132, aggrutil=100.00%
  sdb: ios=102071/51643, merge=6184/179, ticks=3605260/975972, in_queue=4581132, util=100.00%

That’s pretty slow! I suspect something else might have slowed down the test, or that there’s some caching going on in the SFTP layer that the test run directly on the server cannot leverage: the SFTP protocol has no way to pass the client’s O_DIRECT flag along, so the server presumably still serves those random reads and writes out of its page cache. Indeed, a more reasonable test (sequential read/write) yields more similar results. Client hits 4148KB/s:

osmc@osmc:~$ fio --name=stressant --group_reporting --directory=/mnt/server/osmc/ --size=100M --readwrite=readwrite --direct=1 --numjobs=4
stressant: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
...
fio-2.16
Starting 4 processes
Jobs: 2 (f=2): [_(1),M(2),_(1)] [100.0% done] [576KB/704KB/0KB /s] [144/176/0 iops] [eta 00m:00s]
stressant: (groupid=0, jobs=4): err= 0: pid=11649: Mon Jan 28 22:35:04 2019
  read : io=205352KB, bw=4148.6KB/s, iops=1037, runt= 49500msec
    clat (usec): min=897, max=406359, avg=3663.71, stdev=10687.28
     lat (usec): min=898, max=406360, avg=3665.18, stdev=10687.28
    clat percentiles (usec):
     |  1.00th=[ 1512],  5.00th=[ 1832], 10.00th=[ 2024], 20.00th=[ 2256],
     | 30.00th=[ 2448], 40.00th=[ 2640], 50.00th=[ 2864], 60.00th=[ 3088],
     | 70.00th=[ 3344], 80.00th=[ 3696], 90.00th=[ 4256], 95.00th=[ 4832],
     | 99.00th=[ 8768], 99.50th=[34560], 99.90th=[209920], 99.95th=[214016],
     | 99.99th=[257024]
  write: io=204248KB, bw=4126.2KB/s, iops=1031, runt= 49501msec
    clat (usec): min=38, max=2591, avg=102.47, stdev=51.50
     lat (usec): min=39, max=2599, avg=104.28, stdev=51.93
    clat percentiles (usec):
     |  1.00th=[   51],  5.00th=[   58], 10.00th=[   62], 20.00th=[   68],
     | 30.00th=[   75], 40.00th=[   85], 50.00th=[   96], 60.00th=[  103],
     | 70.00th=[  110], 80.00th=[  122], 90.00th=[  143], 95.00th=[  175],
     | 99.00th=[  298], 99.50th=[  350], 99.90th=[  572], 99.95th=[  724],
     | 99.99th=[ 1144]
    lat (usec) : 50=0.26%, 100=27.26%, 250=21.25%, 500=1.03%, 750=0.04%
    lat (usec) : 1000=0.02%
    lat (msec) : 2=4.63%, 4=38.68%, 10=6.37%, 20=0.14%, 50=0.21%
    lat (msec) : 250=0.11%, 500=0.01%
  cpu          : usr=1.01%, sys=1.66%, ctx=104084, majf=0, minf=142
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=51338/w=51062/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=205352KB, aggrb=4148KB/s, minb=4148KB/s, maxb=4148KB/s, mint=49500msec, maxt=49500msec
  WRITE: io=204248KB, aggrb=4126KB/s, minb=4126KB/s, maxb=4126KB/s, mint=49501msec, maxt=49501msec

Server hits 831KB/s, about 5 times slower:

root@server:~# sudo -u osmc fio --name=stressant --group_reporting --directory=/srv/osmc --size=100M --readwrite=readwrite --numjobs=4 --direct=1
stressant: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
...
fio-2.16
Starting 4 processes
Jobs: 2 (f=2): [_(2),M(2)] [99.6% done] [832KB/872KB/0KB /s] [208/218/0 iops] [eta 00m:01s]
stressant: (groupid=0, jobs=4): err= 0: pid=19495: Mon Jan 28 22:33:01 2019
  read : io=205352KB, bw=856134B/s, iops=209, runt=245616msec
    clat (usec): min=129, max=146001, avg=18635.62, stdev=18942.67
     lat (usec): min=129, max=146002, avg=18636.92, stdev=18942.62
    clat percentiles (usec):
     |  1.00th=[  175],  5.00th=[  205], 10.00th=[  211], 20.00th=[  223],
     | 30.00th=[  241], 40.00th=[  258], 50.00th=[11072], 60.00th=[35072],
     | 70.00th=[35584], 80.00th=[36096], 90.00th=[37120], 95.00th=[38144],
     | 99.00th=[57088], 99.50th=[68096], 99.90th=[98816], 99.95th=[107008],
     | 99.99th=[134144]
  write: io=204248KB, bw=851532B/s, iops=207, runt=245616msec
    clat (usec): min=198, max=125692, avg=426.80, stdev=2142.22
     lat (usec): min=199, max=125693, avg=428.22, stdev=2142.22
    clat percentiles (usec):
     |  1.00th=[  211],  5.00th=[  225], 10.00th=[  239], 20.00th=[  255],
     | 30.00th=[  262], 40.00th=[  270], 50.00th=[  278], 60.00th=[  290],
     | 70.00th=[  306], 80.00th=[  326], 90.00th=[  350], 95.00th=[  502],
     | 99.00th=[  868], 99.50th=[ 3344], 99.90th=[32640], 99.95th=[40192],
     | 99.99th=[80384]
    lat (usec) : 250=25.50%, 500=46.05%, 750=1.87%, 1000=0.94%
    lat (msec) : 2=0.10%, 4=0.08%, 10=0.17%, 20=0.17%, 50=24.39%
    lat (msec) : 100=0.69%, 250=0.05%
  cpu          : usr=0.09%, sys=0.43%, ctx=102530, majf=0, minf=48
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=51338/w=51062/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=205352KB, aggrb=836KB/s, minb=836KB/s, maxb=836KB/s, mint=245616msec, maxt=245616msec
  WRITE: io=204248KB, aggrb=831KB/s, minb=831KB/s, maxb=831KB/s, mint=245616msec, maxt=245616msec

Disk stats (read/write):
    dm-6: ios=52096/51157, merge=0/0, ticks=980180/23600, in_queue=1004148, util=100.00%, aggrios=51449/51226, aggrmerge=690/32, aggrticks=955344/20240, aggrin_queue=975528, aggrutil=100.00%
  sdb: ios=51449/51226, merge=690/32, ticks=955344/20240, in_queue=975528, util=100.00%

I believe the key is the cache: if we disable direct mode (drop --direct=1) and let the server do its own caching, the local test definitely outperforms the SSHFS endpoint. Client, 11178KB/s:

osmc@osmc:~$ fio --name=stressant --group_reporting --directory=/mnt/server/osmc/ --size=100M --readwrite=readwrite --numjobs=4
stressant: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
...
fio-2.16
Starting 4 processes
Jobs: 3 (f=3): [M(3),_(1)] [90.0% done] [10598KB/10238KB/0KB /s] [2649/2559/0 iops] [eta 00m:02s]
stressant: (groupid=0, jobs=4): err= 0: pid=11665: Mon Jan 28 22:40:13 2019
  read : io=205352KB, bw=11178KB/s, iops=2794, runt= 18371msec
    clat (usec): min=3, max=207422, avg=692.04, stdev=8703.16
     lat (usec): min=3, max=207423, avg=693.04, stdev=8703.17
    clat percentiles (usec):
     |  1.00th=[    4],  5.00th=[    4], 10.00th=[    5], 20.00th=[    5],
     | 30.00th=[    5], 40.00th=[    6], 50.00th=[    6], 60.00th=[    7],
     | 70.00th=[    9], 80.00th=[   11], 90.00th=[   14], 95.00th=[   20],
     | 99.00th=[10048], 99.50th=[49408], 99.90th=[152576], 99.95th=[173056],
     | 99.99th=[193536]
  write: io=204248KB, bw=11118KB/s, iops=2779, runt= 18371msec
    clat (usec): min=37, max=167522, avg=646.50, stdev=2714.98
     lat (usec): min=38, max=167523, avg=647.81, stdev=2715.04
    clat percentiles (usec):
     |  1.00th=[   46],  5.00th=[   52], 10.00th=[   56], 20.00th=[   64],
     | 30.00th=[   73], 40.00th=[   87], 50.00th=[  110], 60.00th=[  141],
     | 70.00th=[  179], 80.00th=[ 1848], 90.00th=[ 2224], 95.00th=[ 2416],
     | 99.00th=[ 2960], 99.50th=[ 3472], 99.90th=[10816], 99.95th=[82432],
     | 99.99th=[126464]
    lat (usec) : 4=0.07%, 10=36.92%, 20=10.59%, 50=2.45%, 100=21.96%
    lat (usec) : 250=14.82%, 500=1.10%, 750=0.15%, 1000=0.19%
    lat (msec) : 2=2.29%, 4=8.64%, 10=0.27%, 20=0.25%, 50=0.04%
    lat (msec) : 100=0.10%, 250=0.17%
  cpu          : usr=1.29%, sys=2.92%, ctx=59179, majf=0, minf=146
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=51338/w=51062/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=205352KB, aggrb=11178KB/s, minb=11178KB/s, maxb=11178KB/s, mint=18371msec, maxt=18371msec
  WRITE: io=204248KB, aggrb=11117KB/s, minb=11117KB/s, maxb=11117KB/s, mint=18371msec, maxt=18371msec

Server, ~89700KB/s, that is, the original 8-fold performance difference:

root@server:~# sudo -u osmc fio --name=stressant --group_reporting --directory=/srv/osmc --size=100M --readwrite=readwrite --numjobs=4 
stressant: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=psync, iodepth=1
...
fio-2.16
Starting 4 processes
Jobs: 3 (f=3)
stressant: (groupid=0, jobs=4): err= 0: pid=23136: Mon Jan 28 22:40:17 2019
  read : io=205352KB, bw=90185KB/s, iops=22546, runt=  2277msec
    clat (usec): min=2, max=834095, avg=107.14, stdev=7682.75
     lat (usec): min=2, max=834096, avg=107.90, stdev=7682.78
    clat percentiles (usec):
     |  1.00th=[    2],  5.00th=[    2], 10.00th=[    2], 20.00th=[    2],
     | 30.00th=[    3], 40.00th=[    3], 50.00th=[    3], 60.00th=[    3],
     | 70.00th=[    3], 80.00th=[    4], 90.00th=[    4], 95.00th=[    4],
     | 99.00th=[    7], 99.50th=[  115], 99.90th=[ 2544], 99.95th=[ 5024],
     | 99.99th=[626688]
  write: io=204248KB, bw=89700KB/s, iops=22425, runt=  2277msec
    clat (usec): min=3, max=912823, avg=35.75, stdev=4235.81
     lat (usec): min=3, max=912824, avg=36.62, stdev=4235.84
    clat percentiles (usec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    4],
     | 30.00th=[    4], 40.00th=[    4], 50.00th=[    4], 60.00th=[    4],
     | 70.00th=[    5], 80.00th=[    5], 90.00th=[    5], 95.00th=[    6],
     | 99.00th=[    8], 99.50th=[   15], 99.90th=[ 1752], 99.95th=[ 2576],
     | 99.99th=[15936]
    lat (usec) : 4=43.64%, 10=55.52%, 20=0.30%, 50=0.12%, 100=0.06%
    lat (usec) : 250=0.11%, 500=0.02%, 750=0.02%, 1000=0.02%
    lat (msec) : 2=0.06%, 4=0.08%, 10=0.01%, 20=0.01%, 50=0.01%
    lat (msec) : 250=0.01%, 750=0.01%, 1000=0.01%
  cpu          : usr=2.99%, sys=7.04%, ctx=573, majf=0, minf=41
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=51338/w=51062/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: io=205352KB, aggrb=90185KB/s, minb=90185KB/s, maxb=90185KB/s, mint=2277msec, maxt=2277msec
  WRITE: io=204248KB, aggrb=89700KB/s, minb=89700KB/s, maxb=89700KB/s, mint=2277msec, maxt=2277msec

Disk stats (read/write):
    dm-6: ios=233/152, merge=0/0, ticks=10104/110200, in_queue=120288, util=93.35%, aggrios=257/189, aggrmerge=11/5, aggrticks=4500/64196, aggrin_queue=68684, aggrutil=88.16%
  sdb: ios=257/189, merge=11/5, ticks=4500/64196, in_queue=68684, util=88.16%

Almost an order of magnitude faster…
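
If you want to dig into the caching side further, the SSHFS end has knobs of its own that could be flipped for comparison. A sketch I have not benchmarked here; the option names are version-dependent (cache= is the sshfs 2.x spelling, newer releases use dir_cache= and related options):

osmc@osmc:~$ sshfs -o cache=no -o direct_io osmc@server:/srv /mnt/server  # disable sshfs/FUSE caching (paths assumed)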

Note that a lot of that overhead is specific to the Vero. During tests on another computer I could almost (but not quite) saturate the gigabit network, something I cannot reproduce on the Vero 4K+. I suspect this is because Intel processors have a special instruction set for cryptographic operations which is missing from the Vero's CPU, so it has to run those operations as normal instructions, in software.

Benchmarking the cipher used by default (chacha20) with openssl speed gives the following results on the three machines:

The 'numbers' are in 1000s of bytes per second processed.
host             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
vero            47715.23k    96421.80k   217529.96k   240623.22k   248491.58k   249343.66k
server         215857.37k   416072.46k   751905.03k   811660.11k   906759.14k   829639.77k
workstation    245314.57k   441711.83k   918491.99k  1894956.71k  1967540.91k  1973321.73k

So that’s probably the bottleneck. Forcing another, faster cipher would probably help with performance. For example, here are the same results but with AES-256:

host             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
vero4k+         94186.17k   265138.88k   474621.61k   605420.65k   654475.26k   659305.31k
server         137186.89k   169273.62k   181124.07k   183319.34k   185431.77k   184395.53k
workstation    409389.73k   579395.61k   626040.06k   629537.79k   627444.18k   625431.89k
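
Numbers like these come from openssl speed run on each host; the exact EVP cipher names below are assumptions and depend on the OpenSSL version installed:

$ openssl speed -evp chacha20      # the OpenSSH default here
$ openssl speed -evp aes-256-ctr   # for the AES-256 table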

Switching to AES128-CTR gives a two-fold performance increase, for example:

$ sudo sftp osmc@server:/test/1000M /tmp
Connected to server.
Fetching /test/1000M to /tmp/1000M
/test/1000M                                      100%  954MB  19.4MB/s   00:49 
$ sudo sftp  -c aes128-ctr osmc@server:/test/1000M /tmp
Connected to server.
Fetching /test/1000M to /tmp/1000M
/test/1000M                                      100%  954MB  42.1MB/s   00:22 
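
The same cipher override should carry over to the mount itself, since sshfs hands unrecognized -o options (such as ssh_config options) down to ssh; a sketch, with paths as in the earlier examples and not benchmarked here:

osmc@osmc:~$ sshfs -o Ciphers=aes128-ctr osmc@server:/srv /mnt/server  # force the faster cipher for the mount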

Which algorithm to pick is left as an exercise for the reader for now. :slight_smile: (Hint: a script like this one will generate nice reports; a bare-bones sketch of the idea follows below.) But before you go crazy optimizing all of this, keep in mind that a 1080p Blu-ray is at most 40Mbit/s (5MB/s) and 4K is at most 128Mbit/s (16MB/s). So even with the default algorithms, it might just be possible to stream 4K videos over SSH.
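
Going back to the hint above, a bare-bones version of such a comparison script just loops over the ciphers the client offers and pushes a fixed amount of data through ssh with each one (hostname and transfer size are placeholders; ciphers the server refuses will simply error out, and dd prints the throughput after each run):

osmc@osmc:~$ for c in $(ssh -Q cipher); do
>   echo "== $c"
>   dd if=/dev/zero bs=1M count=256 | ssh -c "$c" osmc@server 'cat > /dev/null'
> done

This measures only the network and the cipher, not the disks, which is exactly the part we are trying to isolate.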

Anecdotally, it seems the Vero comes pretty close to saturating the gigabit network using a pure network benchmark:

osmc@osmc:~$ iperf3 --client server
Connecting to host server, port 5201
[  4] local 192.168.0.222 port 47802 connected to 192.168.0.223 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec   110 MBytes   923 Mbits/sec    0   1.85 MBytes       
[  4]   1.00-2.00   sec   112 MBytes   941 Mbits/sec    0   2.83 MBytes       
[  4]   2.00-3.00   sec   112 MBytes   940 Mbits/sec    0   2.22 MBytes       
[  4]   3.00-4.00   sec   112 MBytes   939 Mbits/sec    0   2.40 MBytes       
[  4]   4.00-5.00   sec   112 MBytes   940 Mbits/sec    0   2.55 MBytes       
[  4]   5.00-6.00   sec   112 MBytes   939 Mbits/sec    0   2.67 MBytes       
[  4]   6.00-7.00   sec   112 MBytes   941 Mbits/sec    0   2.77 MBytes       
[  4]   7.00-8.00   sec   112 MBytes   939 Mbits/sec    0   2.03 MBytes       
[  4]   8.00-9.00   sec   112 MBytes   940 Mbits/sec    0   2.14 MBytes       
[  4]   9.00-10.00  sec   112 MBytes   940 Mbits/sec    0   2.22 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  1.09 GBytes   938 Mbits/sec    0             sender
[  4]   0.00-10.00  sec  1.09 GBytes   935 Mbits/sec                  receiver

iperf Done.

The theoretical maximum throughput for a TCP connection over gigabit Ethernet is around 950Mbit/s, i.e. about 5% of protocol overhead (a standard 1538-byte Ethernet frame on the wire carries 1460 bytes of TCP payload, hence the ~5%). The above test shows only an extra 1-1.5% of overhead, so the raw network is clearly not the problem: the bottleneck is the encryption layer in SSH, which overloads the CPU. But I believe performance is still acceptable for many use cases.
