Vero 4K is crashing randomly during idle

CestLeBob · 4 January 2018 21:49

Yeah I get you, swapping the cable shouldn’t be a problem.

But as said, the V4K does its “Wake on Lan” on the NAS and then immediately sends pings to it, which results first in dropped packages until the NAS is booting up, then in “slow responses” from 500ms to 1sec and then it stops pinging as soon as the server is up. After this pinging the NAS from OSMC results in sub 1ms ping responses. I can 100% reproduce this behaviour by setting the NAS to sleep and then doing a WOL from OSMC while monitoring the logs.

Here are the crashlogs of the most recent crash with Debug Mode enabled:
https://paste.osmc.tv/feriwuvaha

Crash happened (according to my SSH client) on 22:36, which is exactly when the logs stop. I’m completely out of ideas what could be the cause I will try swapping cables next.

Edit:
When I restart with the NAS already active, the log looks like this (https://paste.osmc.tv/gevuvubomu), as far as I can see there are no more high pings then:

Jan 04 22:53:07 Vero4K mediacenter[404]: PING 192.168.196.50 (192.168.196.50): 56 data bytes
Jan 04 22:53:07 Vero4K mediacenter[404]: 64 bytes from 192.168.196.50: seq=0 ttl=64 time=2.909 ms
Jan 04 22:53:07 Vero4K mediacenter[404]: — 192.168.196.50 ping statistics —
Jan 04 22:53:07 Vero4K mediacenter[404]: 1 packets transmitted, 1 packets received, 0% packet loss
Jan 04 22:53:07 Vero4K mediacenter[404]: round-trip min/avg/max = 2.909/2.909/2.909 ms

dillthedog · 4 January 2018 22:36

eth0      Link encap:Ethernet  HWaddr c4:4e:ac:0a:ee:fd  
          inet addr:192.168.196.60  Bcast:192.168.196.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST DYNAMIC  MTU:1500  Metric:1
          RX packets:3271 errors:0 dropped:273 overruns:0 frame:0
          TX packets:1239 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:862827 (842.6 KiB)  TX bytes:601238 (587.1 KiB)
          Interrupt:40

AFAICT, the system was only up for a short time but you have 273 dropped RX packets.

The Kodi log is showing NFS-related issues:

22:53:09.044 T:3695277040   ERROR: NFS: Failed to stat(enderep/xbmc_videodb_2017-03-12/movies/Warcraft_2016-poster.jpg) stat call failed with "NFS: Lookup of /xbmc_videodb_2017-03-12/movies/Warcraft_2016-poster.jpg failed with NFS3ERR_NOENT(-2)"
22:53:09.046 T:3695277040   ERROR: NFS: Failed to stat(enderep/xbmc_videodb_2017-03-12/movies/Watchmen_2009-poster.jpg) stat call failed with "NFS: Lookup of /xbmc_videodb_2017-03-12/movies/Watchmen_2009-poster.jpg failed with NFS3ERR_NOENT(-2)"
22:53:09.048 T:3695277040   ERROR: NFS: Failed to stat(enderep/xbmc_videodb_2017-03-12/movies/The_Wolf_of_Wall_Street_2013-poster.jpg) stat call failed with "NFS: Lookup of /xbmc_videodb_2017-03-12/movies/The_Wolf_of_Wall_Street_2013-poster.jpg failed with NFS3ERR_NOENT(-2)"
22:53:09.050 T:3695277040   ERROR: NFS: Failed to stat(enderep/xbmc_videodb_2017-03-12/movies/Jason_Bourne_2016-poster.jpg) stat call failed with "NFS: Lookup of /xbmc_videodb_2017-03-12/movies/Jason_Bourne_2016-poster.jpg failed with NFS3ERR_NOENT(-2)"
22:53:09.052 T:3695277040   ERROR: NFS: Failed to stat(enderep/xbmc_videodb_2017-03-12/movies/The_Wolverine_2013-poster.jpg) stat call failed with "NFS: Lookup of /xbmc_videodb_2017-03-12/movies/The_Wolverine_2013-poster.jpg failed with NFS3ERR_NOENT(-2)"

You don’t seem to have defined a share path for nfs://192.168.196.50/enderep/xbmc_videodb_2017-03-12/movies

Edit: I’d suggest you edit /usr/share/kodi/addons/script.module.osmcsetting.logging/resources/lib/grablogs.py and add a -b to the sudo journalctl command since the log is getting a bit difficult to read.

sam_nazarko · 4 January 2018 22:48

That’s also what I suspected.

Personally, I would disconnect Ethernet and run it on 5Ghz WiFi for 24 hours and see what happens. Most routers won’t forward Wired broadcasts to WiFi and vice versa for security reasons.

Sam

CestLeBob · 4 January 2018 23:44

Okay, I’m not a network expert, so no idea why this could happen but at least there shouldn’t be any loops in my network cabling.

For now I added -b to the journalctl cmd in the .py file as requested by @dillthedog and switched the network over to 5Ghz wifi as told by @sam_nazarko. I will report back tomorrow if there is a crash again since this is easier for me compared to switching out the cable behind the TV rack etc.

Regarding the missing share on nfs://192.168.196.50/enderep/xbmc_videodb_2017-03-12/movies: Where does this come from? My library db (is that what the videodb is?) is stored on the v4k, not on the NAS. Moviefiles themselves are stored in nfs://…/Movies/ - I’m confused here?! Why does it reference a videodb that is supposedly on the NAS (there is no /xbmc_videodb_2017-03-12/ on there)

Thanks for the support and have a good night

CestLeBob · 5 January 2018 10:39

Fully disconnected ethernet and had it running via WiFi over night → this morning it was crashed again.

Logs are here: https://paste.osmc.tv/mumiyefesa

Had to change the settings once, since NFS access to the NAS is only allowed from the .60 IP which I had not used when first using WiFi, hence the ERROR: NFS: Failed to mount nfs share: /enderep (mount/mnt call failed with "RPC error: Mount failed with error MNT3ERR_ACCES(13) Permission denied(13)") in Kodi Old Log.

However I can see quite a few errors before that, e.g.:
Jan 04 23:00:13 Vero4K kernel: Reserved memory: incorrect alignment of CMA region

Jan 04 23:00:13 Vero4K kernel: hub 2-0:1.0: USB hub found
Jan 04 23:00:13 Vero4K kernel: hub 2-0:1.0: config failed, hub doesn't have any ports! (err -19) → Definitely no USB Hub connected, only the V4K Remote

Jan 04 23:00:13 Vero4K systemd[1]: Job rpcbind.service/start deleted to break ordering cycle starting with sysinit.target/start

Jan 04 23:00:13 Vero4K systemd[1]: Job kbd.service/start deleted to break ordering cycle starting with sysinit.target/start

Jan 04 23:00:16 Vero4K kernel: CONFIG-ERROR) dhd_conf_set_fw_string_struct_cmd: country setting failed -2

Jan 04 23:00:16 Vero4K kernel: CONFIG-ERROR) dhd_conf_set_fw_string_cmd: txbf setting failed -23

I also checked the configuration of my switches (they also have a cable-test function) and all my cables are ok according to the cable-tester of my switches and my regular old “hardware” cable tester. Since this happens with WiFi as well, I’d say we can rule out a broken cable. I can go ahead and activate Broadcast Filtering on the switch as well, which allows me to set a “Storm-Control Rate” in mbps-steps. I already have Loop-Detection enabled in the switches.

dillthedog · 5 January 2018 14:41

The messages you highlight appear on my V4K, so you can regard them as “normal”.

I guess the next step must be to remove the V4K entirely from the network. You’ll need to remove the data sources (take a copy of sources.xml first), and disconnect ethernet and WiFi. If it still crashes out, then that would suggest a hardware issue.

CestLeBob · 5 January 2018 14:49

That’s what I’m currently doing, I have it running without any network connections enabled. Have not cleared out the sources.xml though. Will do that ASAP.

I’m currently searching for any broadcast storms within my network through wireshark, so far it is looking good though (at least with the V4K not connected).

sam_nazarko · 5 January 2018 15:05

Those messages are not to be considered errors. Perhaps ‘overly verbose’ messages is the best way to describe them.

There was a guy who kept getting freezes every couple of hours. He gave me access to his device via TeamViewer. I was pretty convinced it had to be a hardware fault until I found out it was caused by a crappy DLNA server on his network. A basic iptables rule to ignore all traffic from this device (he wasn’t using it with Vero 4K) and the freezes stopped.

I never dug in to it further as we were both happy with the solution, and I could not reproduce it at my end.

You seem to get DNS resolution errors in your log as well which is quite strange. Can you check the uptime on your router and make sure it’s not rebooting itself sporadically?

Sam

CestLeBob · 5 January 2018 15:26

I have a PiHole (RasPi based hardware ad-blocker) which is propagated as a DNS server through DHCP on my router. But this should only affect your domains if they are on a blacklist somewhere. Is there a way to manually set DNS within OSMC? I can switch this back to the Google DNS as well for testing.

A teamviewer session shouldn’t be a problem as well if you’re willing to take a look.

sam_nazarko · 5 January 2018 16:01

Yes, this can be done under My OSMC → Network.

If PiHole is blocking any *.osmc.tv domains, that could cause problems with updates. But if it was, you shouldn’t be able to post here.

Sam

CestLeBob · 5 January 2018 16:03

Another thing that vexxes me while looking at those logs is the referral to:
23:00:23.429 T:4094686192 NOTICE: WakeOnAccess [192.168.196.50] trigged by accessing : nfs://192.168.196.50/enderep/xbmc_videodb_2017-03-12/...

There is no such path on the NAS and I checked all the sources I could find. They all have the correct paths to the actual folders. Do you have any idea on how to fix that?

dillthedog · 5 January 2018 16:10

Take a look here: http://kodi.wiki/view/Wake_on_lan

JimKnopf · 5 January 2018 16:17

See http://kodi.wiki/view/Import-export_library. Might be you or someone else exported the lib to a single file? Or there was a database migration in the past by an OSMC update? Perhaps one of the mods or Sam do remember?

CestLeBob · 5 January 2018 16:23

I have done this in order to backup the library (probably somewhere around 03-17 acc. to the timestamp ) and since have removed this particular file. But how do I get rid of the remnants?

JimKnopf · 5 January 2018 16:32

Rebuilding the databases like here Clean Library question - #12 by JimKnopf could be a solution.
But before going this extreme hard way, wait for other suggestions how to get rid off orphan data in the database.
I’m also not convinced that your crashes have their root cause by this library issue.

CestLeBob · 6 January 2018 16:56

I had it running several hours without network connection without a crash.

After this (yesterday around 23:00) I have reconnected ethernet and manually set the DNS to 8.8.8.8 and Quad9.

So far it’s still going without a crash - the longest time since several days! Fingers crossed. So the PiHole might have been a problem after all?!

No idea why the V4K crashed to a freeze due to a DNS resolve problem (mainly on your RSS feed) though. I’m using the PiHole since early December, but the crashes only started very recently. Very confusing issue.

I will keep an eye out for more crashes, but for now it is looking good with the G-DNS set as primary.

@sam_nazarko: Do you have an idea how to “clean” my database from the old entries under nfs://…/xbmc_videodb_2017-03-12/… without losing everything? I only exported the DB in order to have a backup, it was never intended to have the DB running this way on the NAS.

dillthedog · 6 January 2018 17:01

Have the ifconfig RX dropped packets also disappeared?

CestLeBob · 6 January 2018 17:10

No, they seem to have not. I streamed several files today and get the following:

> osmc@Vero4K:~$ ifconfig

eth0 Link encap:Ethernet HWaddr c4:4e:ac:0a:ee:fd
inet addr:192.168.196.60 Bcast:192.168.196.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST DYNAMIC MTU:1500 Metric:1
RX packets:4225464 errors:0 dropped:94937 overruns:0 frame:0
TX packets:938873 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:4755649770 (4.4 GiB) TX bytes:95081640 (90.6 MiB)
Interrupt:40

The weird thing is: Right now NOTHING should be going on: TV is off, NAS is off, V4K is “idling”, yet when I ifconfig again, the dropped packages increase to 94940, 94944, etc…

When running watch --interval 1 ifconfig I get exactly 1 dropped package per sec. Do you have any idea how I can find out what is running? netstat is only giving me CONNECTED “domain socket” entries and (at the very top) 3 active “Internet connections” → 1x SSH from my PC and 2x DNS (google and quad9).

dillthedog · 6 January 2018 17:13

Well, there are 3600 seconds in an hour. Do you have 26 hours of uptime?

CestLeBob · 6 January 2018 17:14

osmc@Vero4K:~$ uptime
18:13:45 up 1 day, 4:20, 1 user, load average: 2.42, 2.38, 2.40