TRS-0001
The Ethernet network interface freezes after some time
Symptoms
After some time, from weeks to months, the network interface becomes unusable.
ALL of these symptoms must appear for this problem to be confirmed:
- LoRa devices that are only in range of this gateway are no longer connected
- The packet forwarder is no longer connected to the Network Server
- Remote SSH connection over the network is not possible
- The system is still running (blue LED blinking in "heartbeat" mode)
- Connection is possible through the USB service port using SSH
- The gateway has no network access (pinging a host on the local network fails)
- Restarting the network interface brings the network back online
Restarting the network interface brings the network back online:
$ sudo ifconfig eth0 down
$ sudo ifconfig eth0 up
This only clears the current freeze; it does not fix the problem permanently!
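Before restarting, it can help to confirm that the local network is really unreachable. The sketch below is meant to be run from the USB service port: it pings a host and restarts eth0 only if no reply comes back. The function names and the 3-ping probe are illustrative assumptions, not part of LORIX OS.

```shell
#!/bin/sh
# Sketch: restart the interface only when the local network is unreachable.
# All function names here are hypothetical helpers.

IFACE="${IFACE:-eth0}"

# Returns 0 if at least one of 3 pings to $1 is answered (5 s wait per reply).
host_reachable() {
    ping -c 3 -W 5 "$1" >/dev/null 2>&1
}

# Brings the interface down and up again, like the two commands above.
restart_iface() {
    sudo ifconfig "$IFACE" down && sudo ifconfig "$IFACE" up
}

# Restart only when the probe fails, and report what was done.
recover_if_frozen() {
    if host_reachable "$1"; then
        echo "network OK, nothing to do"
    else
        echo "no reply from $1, restarting $IFACE"
        restart_iface
    fi
}
```

For example, `recover_if_frozen 192.168.1.1` probes the main router first and leaves the interface untouched when it answers.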
Description
The network interface driver used by the LORIX One in the mainline Linux kernel (all versions) has a bug: it enters a deadlock after an overflow of the RX buffer.
The internal MAC peripheral of the SAMA5D4 itself works correctly. In this situation it raises an error that can be handled perfectly well, but the current version of the driver does not handle it.
The problem is most likely to appear in environments with heavy broadcast traffic, and how often it occurs varies widely with the network configuration, for example the DHCP lease renewal time.
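When investigating a suspected freeze from the USB service port, the kernel's generic receive error counters may give a hint. The sketch below reads them from the standard sysfs statistics directory. Whether this driver actually increments any of these counters when the deadlock occurs is an assumption, so treat the values as a clue rather than proof. `rx_error_counters` and `STATS_ROOT` are hypothetical names.

```shell
#!/bin/sh
# Sketch: print receive-side error counters of an interface from sysfs.
# STATS_ROOT can be overridden; /sys/class/net is the standard Linux location.
STATS_ROOT="${STATS_ROOT:-/sys/class/net}"

rx_error_counters() {
    iface="${1:-eth0}"
    for name in rx_errors rx_dropped rx_over_errors rx_fifo_errors; do
        file="$STATS_ROOT/$iface/statistics/$name"
        # Skip counters that this kernel does not expose.
        if [ -r "$file" ]; then
            printf '%s=%s\n' "$name" "$(cat "$file")"
        fi
    done
}
```

Calling `rx_error_counters eth0` twice, a few minutes apart, shows whether any counter is still growing.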
Solution
After weeks of investigation, Wifx finally discovered the source of the issue and submitted a patch to the Linux kernel maintainers.
This patch has not yet been merged into the mainline sources, but it is already included in LORIX OS 1.0.0 and later.
The best option is therefore to update your LORIX One to LORIX OS 1.0.0 (or higher) through USB, following this documentation.
Workaround
If you cannot update to LORIX OS, Monit can monitor the network interface and restart it when the deadlock symptoms appear.
Install Monit
Install Monit as explained here.
Add a new monitoring script
Using vi or nano, create the Monit script /etc/monit.d/eth0-ping.monit with the following content:
check host eth0-ping with address <ping address to check>
    if failed ping count 5 size 128 with timeout 60 seconds then exec "/bin/bash -c '/sbin/ifconfig eth0 down; /sbin/ifconfig eth0 up;'"
        repeat every 5 cycles
Replace <ping address to check> with the address of the host used to test connectivity, for example 192.168.1.1 (your main router) or 8.8.8.8 (Google's DNS server).
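If you prefer not to edit the file by hand, the same rule can be generated from a small shell function. This is only a sketch: `write_eth0_check` is a hypothetical helper, and the rule text is the one shown above with the address filled in.

```shell
#!/bin/sh
# Sketch: write the Monit check for a given ping target.
# $1 = address to ping (e.g. 192.168.1.1)
# $2 = output file (e.g. /etc/monit.d/eth0-ping.monit)
write_eth0_check() {
    target="$1"
    outfile="$2"
    # Unquoted heredoc delimiter so that $target is expanded.
    cat > "$outfile" <<EOF
check host eth0-ping with address $target
    if failed ping count 5 size 128 with timeout 60 seconds then exec "/bin/bash -c '/sbin/ifconfig eth0 down; /sbin/ifconfig eth0 up;'"
        repeat every 5 cycles
EOF
}
```

Run as root, `write_eth0_check 192.168.1.1 /etc/monit.d/eth0-ping.monit` creates the file. `sudo monit -t` then checks the syntax of the control file before you reload Monit.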
On each Monit cycle (30 seconds by default, as set in the general configuration file /etc/monitrc), this check sends up to 5 pings of 128 bytes with a 60-second timeout. If none of them is answered, Monit executes the command /sbin/ifconfig eth0 down; /sbin/ifconfig eth0 up; in the bash interpreter, and repeats it every 5 cycles for as long as the check keeps failing.
This will restart the network interface and remove the deadlock caused by the driver bug.
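To make the timing concrete: with the default 30-second cycle, 5 cycles correspond to 5 × 30 = 150 seconds. The sketch below derives that figure from the `set daemon` line of the Monit control file. Both function names are hypothetical, and reading /etc/monitrc normally requires root.

```shell
#!/bin/sh
# Sketch: compute how many seconds a number of Monit cycles represents.

# Print the cycle length from the first uncommented "set daemon N" line.
cycle_seconds() {
    awk '$1 == "set" && $2 == "daemon" { print $3; exit }' "${1:-/etc/monitrc}"
}

# $1 = number of cycles, $2 = control file (defaults to /etc/monitrc).
repeat_interval() {
    cycles="$1"
    echo $(( cycles * $(cycle_seconds "$2") ))
}
```

For example, `repeat_interval 5` run as root on a gateway with the default configuration gives the number of seconds covered by the 5 cycles used in the rule.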
Once this file is saved, reload Monit and check the current status:
eth0-ping status
$ sudo monit reload
Reinitializing monit daemon
$ sudo monit status
Monit 5.25.2 uptime: 25m
Remote Host 'eth0-ping'
status OK
monitoring status Monitored
monitoring mode active
on reboot start
ping response time 7.179 ms
data collected Mon, 15 Jun 2020 08:32:17
System 'sama5d4-lorix-one-512'
status OK
monitoring status Monitored
monitoring mode active
on reboot start
load average [0.00] [0.00] [0.00]
cpu 1.5%us 0.7%sy 0.0%wa
memory usage 12.9 MB [10.7%]
swap usage 0 B [0.0%]
uptime 29m
boot time Mon, 15 Jun 2020 08:02:58
data collected Mon, 15 Jun 2020 08:32:17
Test
If you are connected through SSH over the network, you will lose connectivity during this test. If your script is incorrect, you may not regain access to the gateway.
A good general approach is to test the script on a local unit that you can easily reach through the USB serial console.
You can verify that it works by disabling the eth0 interface:
Script verification
$ sudo ip link set eth0 down
[ after some time ]
$ sudo monit status
Monit 5.25.2 uptime: 30m
Remote Host 'eth0-ping'
status ICMP failed
monitoring status Monitored
monitoring mode active
on reboot start
ping response time connection failed
data collected Mon, 15 Jun 2020 08:37:48
System 'sama5d4-lorix-one-512'
[..]
$
[ Monit executes the script command, the eth0 interface is restarted ]
IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
macb f8020000.ethernet eth0: link up (100/Full)
IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Your gateway will come online again after a couple of minutes.
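Rather than checking by hand, you can poll from the USB serial console until the network answers again. The sketch below retries an arbitrary probe command a fixed number of times. `wait_until` is a hypothetical helper, and the attempt count and delay in the usage example are arbitrary choices.

```shell
#!/bin/sh
# Sketch: retry a probe command until it succeeds or the attempts run out.
# $1 = maximum attempts, $2 = delay between attempts in seconds,
# remaining arguments = the probe command to run.
wait_until() {
    attempts="$1"; delay="$2"; shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            echo "ok after $i retries"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "gave up after $attempts attempts"
    return 1
}
```

For example, `wait_until 20 15 ping -c 1 -W 5 192.168.1.1` probes the router every 15 seconds for up to 20 attempts and reports when it first replies.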