Vero4k+ random freezes with skb_panic()

The PC-to-server figures look fine, so it seems that the connection from the SRW2024P switch to the server is ok, and the PC is auto-negotiating correctly when connected to the same switch.

If it’s a duplex mismatch I’d therefore expect the V4K to be the one running at half duplex, since it’s the one getting poor transmit figures. Running ethtool on the V4K+ should answer this one.

Just did that. It now runs Linux vero4k 3.14.29-139-osmc #1 SMP Tue Feb 19 04:09:47 UTC 2019 aarch64 GNU/Linux.

vero4k+ runs full duplex:

root@vero4k:~# ethtool eth0
Settings for eth0:
	Supported ports: [ TP MII ]
	Supported link modes:   10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	                        1000baseT/Half 1000baseT/Full 
	Supported pause frame use: Symmetric Receive-only
	Supports auto-negotiation: Yes
	Advertised link modes:  10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	                        1000baseT/Half 1000baseT/Full 
	Advertised pause frame use: Symmetric Receive-only
	Advertised auto-negotiation: Yes
	Link partner advertised link modes:  10baseT/Half 10baseT/Full 
	                                     100baseT/Half 100baseT/Full 
	                                     1000baseT/Half 1000baseT/Full 
	Link partner advertised pause frame use: Symmetric Receive-only
	Link partner advertised auto-negotiation: Yes
	Speed: 1000Mb/s
	Duplex: Full
	Port: MII
	PHYAD: 0
	Transceiver: external
	Auto-negotiation: on
	Supports Wake-on: ug
	Wake-on: d
	Current message level: 0x0000003d (61)
			       drv link timer ifdown ifup
	Link detected: yes
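
For reference, had the link come up at half duplex, it could have been pinned manually with something along these lines (values purely illustrative; 1000BASE-T requires auto-negotiation, so forcing only really makes sense at 100 Mb/s or below):

# force 100 Mb/s full duplex on the Vero's NIC
ethtool -s eth0 speed 100 duplex full autoneg off
# put it back to auto-negotiation afterwards
ethtool -s eth0 autoneg on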

The MTU on the NFS server is 9000:

box ~ # ip link show dev eth0
2: eth0: <BROADCAST,MULTICAST,ALLMULTI,UP,LOWER_UP> mtu 9000 qdisc htb state UP mode DEFAULT group default qlen 1000
    link/ether 74:d4:35:e7:ac:e6 brd ff:ff:ff:ff:ff:ff

The MTU on vero4k+ is 1500:

root@vero4k:~# ip link show dev eth0
2: eth0: <BROADCAST,MULTICAST,DYNAMIC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether c4:4e:ac:29:33:a4 brd ff:ff:ff:ff:ff:ff

hm… I am not sure what this will show. If there is no traffic at all, I wouldn’t expect the Ethernet driver to crash. How would you like me to test it?

The default rsize/wsize is 1 MB.

root@vero4k:~# cat /etc/fstab 
# rootfs is not mounted in fstab as we do it via initramfs. Uncomment for remount (slower boot)
#/dev/vero-nand/root  /    ext4      defaults,noatime    0   0
#10.11.12.1:/data /data   nfs   defaults,auto,rsize=1048576,wsize=1048576,noatime,nodiratime,intr,cto,tcp,vers=3 0 0
10.11.12.1:/data /data    nfs   noauto,x-systemd.automount,noatime,nodiratime,vers=3 0 0

# mount | grep /data
systemd-1 on /data type autofs (rw,relatime,fd=29,pgrp=1,timeout=0,minproto=5,maxproto=5,direct)
10.11.12.1:/data on /data type nfs (rw,noatime,nodiratime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.11.12.1,mountvers=3,mountport=49800,mountproto=udp,local_lock=none,addr=10.11.12.1)
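
For what it's worth, the negotiated options can also be read back with nfsstat (part of nfs-common), which reports the same rsize/wsize the kernel actually agreed on:

# list mounted NFS filesystems together with their negotiated mount options
nfsstat -m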

I installed the development updates. Let’s see if it crashes again…

That’s the latest and greatest.

You could try Kodi based NFS access for a while. This will use libnfs.
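
If you go that route, the Kodi source points straight at the export rather than at the fstab mount, i.e. a path along the lines of (server/export taken from the fstab you posted):

nfs://10.11.12.1/data/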

We cannot do a frame size above 3052 (IIRC, off the top of my head).
1500 on Vero will be fine.

If you keep getting issues, I can send a debug kernel. I only recently found a cause of eth0 dying due to low traffic. We only had one user affected, but he was running a DHCP-less and Avahi-less environment, and the low RX packet count caused us to reset the PHY because we thought we were not getting ACKs (a TCP-oriented patch series, for sure…).

Personally, I think we’ll just end up finding a very strange bug exposed by your network configuration. I’d prefer a hardware fault though – the solution is easier :wink:

Sam

I asked you if there was anything out of the ordinary about the network. Surely running 9K jumbo frames across the network qualifies as being “out of the ordinary”.

AFAICT, the only Pi that supports 9K jumbo frames is the 3B+.

I would have thought that the next step has to be removing the jumbo frames and running the server with an MTU of 1500. Then (a) re-run the iperf3 figures from the V4K+ and (b) see if the network panics still occur.
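
Concretely, something like this (interface name assumed to be eth0, as in the output above):

# on the NFS server: drop the interface back to the standard MTU for the test
ip link set dev eth0 mtu 1500
ip link show dev eth0      # confirm the change took

# then, from the V4K+, repeat the throughput test
iperf3 -c box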

It has frozen again with the new kernel.

I made these two changes:

  1. I have set the MTU on the NFS server to 1500.
  2. I have connected the vero4k+ via WiFi (5GHz) and removed the ethernet cable from it.
root@vero4k:~# iperf3 -c box         
Connecting to host box, port 5201
[  4] local 10.11.12.242 port 52799 connected to 10.11.12.1 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  27.5 MBytes   230 Mbits/sec    2    290 KBytes       
[  4]   1.00-2.00   sec  28.3 MBytes   237 Mbits/sec    2    233 KBytes       
[  4]   2.00-3.00   sec  25.4 MBytes   213 Mbits/sec    0    255 KBytes       
[  4]   3.00-4.00   sec  27.1 MBytes   228 Mbits/sec    3    194 KBytes       
[  4]   4.00-5.00   sec  25.5 MBytes   214 Mbits/sec    0    212 KBytes       
[  4]   5.00-6.00   sec  27.2 MBytes   229 Mbits/sec    0    221 KBytes       
[  4]   6.00-7.00   sec  25.5 MBytes   214 Mbits/sec    0    222 KBytes       
[  4]   7.00-8.00   sec  26.4 MBytes   221 Mbits/sec    3    173 KBytes       
[  4]   8.00-9.00   sec  23.5 MBytes   198 Mbits/sec    1    143 KBytes       
[  4]   9.00-10.00  sec  24.0 MBytes   202 Mbits/sec    0    156 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec   261 MBytes   219 Mbits/sec   11             sender
[  4]   0.00-10.00  sec   259 MBytes   218 Mbits/sec                  receiver

iperf Done.

Let’s see if it crashes again…

Ideally, you should only be changing one parameter at a time.

Yes, I know.

It has not crashed on just WiFi so far. It played 2 movies.
Let me check now with Ethernet.

It seems that jumbo frames were the problem.

I disabled them on the NFS server and I still got freezes, but now with these errors:

Mar  4 09:51:04 vero4k kernel: [52328.412424@0] ndesc_get_rx_status: Oversized frame spanned multiple buffers
Mar  4 09:51:05 vero4k kernel: [52329.413215@0] ndesc_get_rx_status: Oversized frame spanned multiple buffers
Mar  4 09:51:07 vero4k kernel: [52331.412880@0] ndesc_get_rx_status: Oversized frame spanned multiple buffers
Mar  4 09:51:11 vero4k kernel: [52335.413162@0] ndesc_get_rx_status: Oversized frame spanned multiple buffers
Mar  4 09:51:19 vero4k kernel: [52343.413311@0] ndesc_get_rx_status: Oversized frame spanned multiple buffers
Mar  4 09:51:35 vero4k kernel: [52359.413573@0] ndesc_get_rx_status: Oversized frame spanned multiple buffers
Mar  4 09:52:07 vero4k kernel: [52391.414186@0] ndesc_get_rx_status: Oversized frame spanned multiple buffers
Mar  4 09:53:11 vero4k kernel: [52455.413769@0] ndesc_get_rx_status: Oversized frame spanned multiple buffers
Mar  4 09:55:19 vero4k kernel: [52584.267413@0] ndesc_get_rx_status: Oversized frame spanned multiple buffers
Mar  4 09:55:20 vero4k kernel: [52585.268803@0] ndesc_get_rx_status: Oversized frame spanned multiple buffers

So, I have now disabled them on all computers on the network.
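
For reference, that is just the same MTU change on each machine, plus persisting it so it survives a reboot; how you persist it depends on the distro (ifupdown shown as an example, interface name assumed):

# immediate change
ip link set dev eth0 mtu 1500
# to persist with ifupdown, add "mtu 1500" to the iface stanza in /etc/network/interfaces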

Let’s see if it now crashes again…

No crashes yet.

It is funny that you can crash a Vero4K+ just by plugging in a computer with 9K jumbo frames, even if there is no real connection between that computer and the Vero4K+…

AFAIK the generally accepted rule is that if you’re going to use jumbo frames, then every node in the network needs to support jumbo frames. Clearly, this wasn’t the case here.

I’m a bit rusty on this stuff but it’s unclear to me why you were still seeing those “oversized frame” messages with the NFS server’s MTU set to 1500. They are probably related to things such as ARP, broadcast and multicast traffic from other devices on the LAN, which were at the time still on a 9K MTU. That said, I thought that the switch/router should have dealt with any 9K frames – either rejecting or fragmenting them, as appropriate – before sending them to the Vero4K+.

To answer your specific point:

There is always “chatter” between devices on a LAN. At the time you saw those messages, those other devices were still using a 9K MTU. With every node on the network now using an MTU of 1500, I would expect such messages to disappear.

Well, zeroconf and Avahi, to name just two protocols, constantly exchange packets between all devices. If you want to check, install tcpdump and you can see them.
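
A quick way to watch just that background chatter (interface name assumed):

# show only broadcast/multicast traffic, without resolving names
tcpdump -i eth0 -n 'broadcast or multicast'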

Agreed.
I will work out how to stop the crashes, although network stability with mismatched MTUs is not going to be realistic.

MTU of 3000 should work OK.
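
If you want to try that, it is the same ip link command again, on the Vero this time (assuming the driver accepts it; as discussed above, the rest of the network would need to match):

ip link set dev eth0 mtu 3000
ip link show dev eth0      # should now report mtu 3000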

Sam