Raspberry Pi 3 Wifi Disconnects

I’m having an issue where my Raspberry Pi 3s are disconnecting from the network and not reconnecting.

When I view the network status, it reads “Busy” for the Internet and MAC address, and “Not connected” for the link. In the myOSMC → network settings, everything looks normal. When in this problem state, if I exit to the command line and run ifconfig wlan0, I see that the interface still has its proper IP address/gateway settings, but I cannot ping external hosts, nor will it respond to incoming pings.

The issue only seems to occur when the network connection is left idle for a period of time (usually less than an hour, sometimes as little as a few minutes). If I SSH into the Pi, the network connection will usually remain up no matter how long I’m using the SSH connection. The usual remedy for the problem is a reboot.

I was thinking this is related to a power management issue, previously discussed here – but supposedly the fix to this issue has been applied as a default in OSMC, if I’m reading that correctly. Can anyone confirm the default power management state of wlan0?

I have updated osmc and am on the latest version.

  • Power management is disabled for the internal adapter, but because of the way it is done, it may not show up through iwconfig.
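
If you want to double-check on your own unit, here is a minimal sketch that asks the driver directly for its power-save state (assuming the iw utility is present; this is just an illustration, not an OSMC tool):

# Sketch: query the driver's power-save state for wlan0.
# Assumes the `iw` utility is installed; output is e.g. "Power save: off".
import subprocess

def power_save_state(interface="wlan0"):
    output = subprocess.check_output(["iw", "dev", interface, "get", "power_save"])
    return output.decode().strip()

if __name__ == "__main__":
    print(power_save_state())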

I have used WiFi extensively on my Pi 3 and don’t really have an issue. The ‘Busy’ display is not an accurate display of network status.

Doesn’t sound like power management if your device refuses to ping out when you control it with a keyboard.

Kernel messages log might give clues.

Hi Sam, thanks for the quick reply.

I uploaded the output of dmesg to this pastebin – Most of this is foreign to me, but maybe something will be more apparent to someone with more skilled eyes. Also, the log seems to confirm that power management is disabled.

Some things that stuck out to me were:

[    4.500539] brcmfmac: brcmf_c_preinit_dcmds: Firmware version = wl0: Dec 15 2015 18:10:45 version 7.45.41.23 (r606571) FWID 01-cc4eda9c
[    4.517411] brcmfmac: brcmf_cfg80211_reg_notifier: not a ISO3166 code
[    4.520839] FAT-fs (mmcblk0p1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
[    4.610940] uart-pl011 3f201000.uart: no DMA platform data
[    5.042711] lirc_rpi: auto-detected active high receiver on GPIO pin 18

Seems like there was a hiccup with the firmware for wlan0. Also not sure what the GPIO pin 18 line is about; nothing is connected to the GPIO, but I’m guessing that doesn’t mean what I might think it means.

The other thing was the tidbit between these two lines.

[    3.537693] systemd[1]: Cannot add dependency job for unit display-manager.service, ignoring: Unit display-manager.service failed to load: No such file or directory.
...
[    3.537893] systemd[1]: Job rpcbind.service/start deleted to break ordering cycle starting with basic.target/start

I suppose it's worth mentioning I have the 'wait for network' option enabled.

This all looks fine. No display-manager because we don’t have a window server.

Is the log posted after exhibiting the problem?

Thanks Sam. This is actually a log I pulled via SSH from a unit that (obviously) wasn’t exhibiting the problem at the time. However, I did check beforehand and it looks the same as the logs from the non-working machines. It was just easier to copy/paste from the ssh terminal.

I snapped a couple pics of the tail end of the logs of two units that were not working, but they appear pretty much identical to the unit not currently exhibiting the issue.

Nothing interesting in the kodi log from what I can tell. I’m enabling debugging on these and rebooting to see if anything interesting comes up in the Kodi logs with debugging enabled.

I noticed something occurring in the Kodi log which seemed to coincide with the network connectivity issue. I think I may have narrowed the issue down to the osmcsetting.updates module (or a related module).

I believe that something is going wrong, causing the updates service to initialize when it should not. The service appears to get stuck in a holding pattern indefinitely, and may be blocking other network traffic as a result.

Here’s a debug log taken from a unit that was working normally with network connection up.

Here’s the debug log of a unit that was working, but eventually stopped working. – I had to cut parts out because it was too big to fit in a single paste, but the messages in that chunk were just repeating ad infinitum.

The problem I’m having seems to coincide with the time when the following osmcsetting.updates messages begin repeating infinitely. These messages do not appear when the network is working normally.

00:19:14 38730.066406 T:1796068336   DEBUG: script.module.osmcsetting.updates :  - update_service.update_now received: pos0: <update_service.Main object at 0x5f3cff10>
00:19:14 38730.148438 T:1796068336   DEBUG: script.module.osmcsetting.updates :  - update_service.check_update_conditions received: pos0: <update_service.Main object at 0x5f3cff10>
00:19:14 38730.152344 T:1796068336   DEBUG: script.module.osmcsetting.updates : result of Player.GetActivePlayers - {u'jsonrpc': u'2.0', u'id': 1, u'result': [{u'playerid': 2, u'type': u'picture'}]}
00:19:14 38730.152344 T:1796068336   DEBUG: script.module.osmcsetting.updates :  - Update CONDITION : player playing
00:19:14 38730.152344 T:1796068336   DEBUG: script.module.osmcsetting.updates :  - update_service.check_update_conditions [0.00170302391052]  returns: (False, 'Update CONDITION : player playing')
00:19:14 38730.152344 T:1796068336   DEBUG: script.module.osmcsetting.updates :  - update_service.update_now -nowait
00:19:14 38730.152344 T:1796068336   DEBUG: script.module.osmcsetting.updates : trigger_time - 2016-09-01 00:19:13.625363
00:19:14 38730.152344 T:1796068336   DEBUG: script.module.osmcsetting.updates :  - update_service.holding_pattern_update received: pos0: <update_service.Main object at 0x5f3cff10>
00:19:14 38730.152344 T:1796068336   DEBUG: script.module.osmcsetting.updates :  - update_service.check_update_conditions received: pos0: <update_service.Main object at 0x5f3cff10>
00:19:14 38730.152344 T:1796068336   DEBUG: script.module.osmcsetting.updates : result of Player.GetActivePlayers - {u'jsonrpc': u'2.0', u'id': 1, u'result': [{u'playerid': 2, u'type': u'picture'}]}
00:19:14 38730.152344 T:1796068336   DEBUG: script.module.osmcsetting.updates :  - Update CONDITION : player playing

I don’t do much socket programming at all, but taking a look at the comms.py file, I notice that the socket accept loop is essentially an infinite try/except block:

log('Comms started')
while <some_condition>:
    try:
        conn, addr = self.sock.accept()
    except socket.timeout:
        continue
    except:
        log('An error occured while waiting for a connection')
        break
    log('Connection active.')
    #more stuff

(The fact that it’s stuck here forever can be verified by looking at the logs, where "Comms started" is logged, but "Connection active." is never logged until I came along and interrupted things. 'An error occurred while waiting for a connection.' is never logged, so the except condition being hit must be socket.timeout)

Like I said, I’m no expert with sockets, but that doesn’t stop me from trying to string together a vague idea of what might be wrong… Some line items that almost make this click for me:

  1. A socket with a timeout blocks during the timeout period.
  2. If the accept is retried immediately after the timeout, that leaves essentially no gap between blocking calls.
  3. Given 1 and 2, wouldn’t this essentially be the same thing as setting the timeout to None?
  4. Do you suppose this behavior would be fixed if the socket were either nonblocking and/or waited a period of time (30 secs?) between accept attempts? (Rough sketch of what I mean below.)
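
Here’s a rough, self-contained sketch of what I mean (not the actual comms.py code, just a toy listener on a made-up local port), where the loop idles between accept attempts instead of immediately re-blocking:

# Toy sketch only, not comms.py: a listener that waits between accept attempts.
import socket
import time

RETRY_DELAY = 30  # seconds to idle after a timed-out accept (made-up value)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("127.0.0.1", 59999))   # hypothetical port
sock.listen(1)
sock.settimeout(1.0)              # accept() blocks for at most 1 second

while True:
    try:
        conn, addr = sock.accept()
    except socket.timeout:
        time.sleep(RETRY_DELAY)   # idle instead of calling accept() again right away
        continue
    except Exception:
        break
    print("Connection active from %s" % (addr,))
    conn.close()                  # real code would handle the connection here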

All that said, this could be a byproduct of the problem and not necessarily the cause and I’m just barking up the wrong tree.

Anyhow, I’d like to test my theory. Is there a way to disable the settings update? If not, I can try monkeypatching the module to do effectively nothing to test the behavior, or adjust the socket behavior to try and fix what I believe the problem to be. That is, unless that will definitely break things in an unexpected way… who knows… I’m also tossing around the idea of increasing the _daemon wait time to more than 500 ms.

To give a little background on my setup:

I’ve got several Raspberry Pis running OSMC. I have an autoexec.py script that contacts a central server, provides its location, and queries for what should be in its playlist. From the result, the script downloads and adds/removes the appropriate files in the playlist folder (/home/osmc/current), then starts a slideshow of that directory. In other cases it may start a video playlist or something else, but the idea is the same.
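
For context, the autoexec.py flow is roughly along these lines (heavily simplified sketch; the server URL and JSON fields are placeholders, not my real code):

# Simplified sketch of the autoexec.py flow; URL and JSON shape are placeholders.
import os
import json
import urllib2   # Kodi on OSMC still runs Python 2 here
import xbmc

PLAYLIST_DIR = '/home/osmc/current'
SERVER_URL = 'http://example.local/playlist'   # placeholder for the central server

# Ask the central server which files this unit should be showing.
wanted = json.load(urllib2.urlopen(SERVER_URL))   # e.g. {"files": ["a.jpg", "b.jpg"]}
wanted_names = set(wanted['files'])

# Remove anything that should no longer be in the folder.
for name in os.listdir(PLAYLIST_DIR):
    if name not in wanted_names:
        os.remove(os.path.join(PLAYLIST_DIR, name))

# (Downloading of missing files is omitted here.)

# Start the slideshow on the synced folder.
xbmc.executebuiltin('SlideShow(%s)' % PLAYLIST_DIR)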

I deployed the different units by setting up one unit, then cloning the SD card to the other units. After loading the image onto a unit, I remove the network settings (the ones tied to the base unit’s MAC address) from connman, then set up a new network connection through the GUI.
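
(Removing the cloned settings basically amounts to deleting the saved wifi service entries that connman keeps, something like the sketch below; the /var/lib/connman layout here is from memory and may differ between connman versions.)

# Sketch: clear connman's saved wifi services so the cloned unit starts fresh.
# Run this while connman is stopped. The wifi_* naming under /var/lib/connman
# is from memory; verify the paths on your own system before using.
import glob
import shutil

for service_dir in glob.glob('/var/lib/connman/wifi_*'):
    shutil.rmtree(service_dir)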

It seems like a crapshoot as to when the problem will occur. Sometimes it’s multiple hours after starting the slideshow, sometimes it’s within minutes. But at some point, it always happens.

Thoughts?

I take that all back. I changed the update settings so checks only occur after business hours, but the Pis still lost network connectivity despite moving the time to check for updates.

I don’t think the issue is with the update code.

Don’t think it’s power related either. I suggest installing something like GNU screen and keeping a ping running constantly to see what happens.
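
Something along these lines inside a screen session would do it (rough sketch; pick whatever target and interval suit you):

# Rough sketch: record a timestamp whenever a ping fails, so the moment
# connectivity drops is captured even if the GUI still looks fine.
import subprocess
import time

TARGET = "8.8.8.8"                       # use your gateway or any reliable host
LOGFILE = "/home/osmc/ping_watch.log"

with open("/dev/null", "w") as devnull:
    while True:
        ok = subprocess.call(["ping", "-c", "1", "-W", "2", TARGET],
                             stdout=devnull, stderr=devnull) == 0
        if not ok:
            with open(LOGFILE, "a") as log:
                log.write(time.strftime("%Y-%m-%d %H:%M:%S") + " ping to " + TARGET + " failed\n")
        time.sleep(10)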

Yeah, so after regrouping, I think I’ve finally got something useful in the grabbed logs at around line 1315 or so.

It looks like something is happening that causes it to delete the existing interface and then create a new one that defaults to an APIPA (169.254.x.x) address.

It’s very much possible this is an internal networking problem. I just haven’t the faintest idea what it might be.

The IP address for each unit is reserved in DHCP, so the IP should be the same every time, and it is when working normally.
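
To catch the exact moment it happens, I’m thinking of logging wlan0’s address periodically and flagging when it flips to link-local, something like this rough sketch (it parses ip output, so the exact format may vary):

# Rough sketch: note when wlan0 falls back to a 169.254.x.x (APIPA) address.
import subprocess
import time

def wlan0_addresses():
    out = subprocess.check_output(["ip", "-4", "addr", "show", "wlan0"]).decode()
    return [line.split()[1] for line in out.splitlines()
            if line.strip().startswith("inet ")]

while True:
    addrs = wlan0_addresses()
    if any(a.startswith("169.254.") for a in addrs):
        with open("/home/osmc/apipa_watch.log", "a") as log:
            log.write("%s wlan0 fell back to link-local: %s\n"
                      % (time.strftime("%Y-%m-%d %H:%M:%S"), addrs))
    time.sleep(30)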

Any thoughts?

Don’t mean to bump an old thread, but it turns out the issue is a problem between wpasupplicant 2.3 and our Cisco routers (probably related to FreeRADIUS). When the Pis roam to particular APs, there’s a compatibility problem with those APs. Haven’t worked out a resolution yet; going to try setting a preferred AP, and possibly changing the wpasupplicant version.

Do you have a link to a confirmed bug report? I can ping the Debian maintainer for this package and get the error resolved upstream.

I’ve seen a few open bugs related to wpasupplicant for Ubuntu and WPA Enterprise. Not sure if that’s helpful at all.

The problem always seems to coincide with a kernel warning being thrown related to cfg80211, which is preceded by an association with particular access points (Cisco routers running older firmware than the other APs).

Our network services team noted that on these APs there are numerous authentication failures from our Pis running OSMC.

The changelog for wpasupplicant mentions that version 2.4 included a fix for some cfg80211 issue as well as for EAP reauthentication. I’m wondering if this addresses the problem I’m having, since the initial connection is never a problem, but reauthenticating after we get deauthed seems to be.