TRS-0001
The Ethernet network interface freezes after some time
After some time, from weeks to months, the network interface becomes unusable.
Symptoms
ALL of these symptoms must appear for this problem to be confirmed:
- LoRa devices that are in range of this gateway only are no longer connected
- The packet forwarder is no longer connected to the Network Server
- Remote connection over SSH is not possible
- The system is still running (blue LED blinking in "heartbeat" mode)
- Connection is possible through the USB service port using SSH
- The gateway has no network access (pinging a host on the local network fails)
- Restarting the network interface brings the network back online
Restarting the network interface only fixes the current occurrence; it does not solve the problem permanently!
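For reference, a minimal sketch of such a restart, to be run as root from the USB service port. It uses the same ifconfig down/up sequence as the workaround described further down. The name restart_iface is a hypothetical helper, not a LORIX OS command, and ifconfig is assumed to be on root's PATH.

```shell
# Restart a network interface by bringing it down and up again.
# restart_iface is a hypothetical helper name used for illustration only.
# $1: interface name (defaults to eth0, the interface named in this article)
restart_iface() {
    iface="${1:-eth0}"
    ifconfig "$iface" down
    ifconfig "$iface" up
}

# Usage (as root): restart_iface
```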
Description
The mainline Linux kernel driver for the LORIX One's network interface (all versions) has a bug: it enters a deadlock after an overflow of the RX buffer.
The internal MAC peripheral of the SAMA5D4 itself works correctly: in this situation it raises an error that can be handled. However, the current version of the driver does not handle it.
The problem is most likely to appear in environments with heavy broadcast traffic, and its frequency varies widely with the network configuration, for example the DHCP lease renewal time.
Solution
After weeks of investigation, Wifx finally discovered the source of the issue and submitted a patch to the Linux kernel maintainers.
This patch has not yet been merged into the mainline sources, but it is already included in LORIX OS from version 1.0.0 onwards.
The best option is therefore simply to update your LORIX One to LORIX OS 1.0.0 (or higher) through USB, following this documentation.
Workaround
If you can't update to LORIX OS, Monit can monitor the network interface and restart it when the deadlock symptoms appear.
Install Monit
Install Monit as explained here.
Add a new monitoring script
Using vi or nano, create the Monit script /etc/monit.d/eth0-ping.monit and add the following text to it:
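A minimal sketch of such a script, written from the behaviour described below. The exact Monit statements are an assumption and should be checked against the Monit manual for your version, for example with monit -t. The four lines between the EOF markers are the script text. The surrounding function, write_eth0_ping_monit, is a hypothetical convenience that writes them to the file without an editor.

```shell
# Write the eth0 ping check into the Monit script directory given as $1.
# write_eth0_ping_monit is a hypothetical helper name; the Monit statements
# are assumed from the description in this article, not copied from a
# reference configuration.
write_eth0_ping_monit() {
    cat > "$1/eth0-ping.monit" <<'EOF'
check host eth0-ping with address <ping address to check>
    every 5 cycles
    if failed ping count 5 with timeout 60 seconds
    then exec "/bin/bash -c '/sbin/ifconfig eth0 down; /sbin/ifconfig eth0 up;'"
EOF
}

# Usage (as root): write_eth0_ping_monit /etc/monit.d
```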
Replace the <ping address to check> text with the address of the host to check connectivity against, for example 192.168.1.1 (your main router) or 8.8.8.8 (Google's DNS server).
This script runs every 5 cycles (the Monit cycle time is set to 30 seconds by default in the general configuration file /etc/monitrc, so the check runs every 150 seconds). If 5 pings with a timeout of 60 seconds all fail, it executes the command /sbin/ifconfig eth0 down; /sbin/ifconfig eth0 up; in the bash interpreter.
This restarts the network interface and clears the deadlock caused by the driver bug.
Once this file is saved, you can reload Monit and check the current status:
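A minimal sketch of these two steps using the standard monit reload and monit status commands. The name reload_and_show_status is a hypothetical helper. Right after a reload, Monit may still report the check as initializing, so the status may need to be queried again a little later.

```shell
# Reload Monit so that it picks up the new script, then print the status
# of all monitored services (the eth0-ping check should be listed).
# reload_and_show_status is a hypothetical helper name.
reload_and_show_status() {
    monit reload && monit status
}

# Usage (as root): reload_and_show_status
```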
eth0-ping status
Test
If you are connected through SSH, you will lose connectivity. If your script is incorrect, you may not regain access to the gateway.
A good general approach is to test the script on a local device that you can easily access through the USB serial console.
You can test that it works correctly by disabling the eth0 interface:
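A minimal sketch of such a test, meant to be run as root from the USB console rather than over the Ethernet link. It takes eth0 down and then polls until the interface reports the UP flag again. The name check_eth0_recovery, the polling defaults, and the assumption that ifconfig prints the word UP for an enabled interface are all illustrative and not taken from the LORIX documentation.

```shell
# Take eth0 down, then poll until Monit has brought it up again.
# check_eth0_recovery is a hypothetical helper name.
# $1: maximum number of polls (default 60)
# $2: seconds to wait before each poll (default 10)
check_eth0_recovery() {
    polls="${1:-60}"
    delay="${2:-10}"
    ifconfig eth0 down
    while [ "$polls" -gt 0 ]; do
        sleep "$delay"
        # Assumption: an enabled interface lists the UP flag in ifconfig output
        if ifconfig eth0 | grep -qw UP; then
            echo "eth0 is up again"
            return 0
        fi
        polls=$((polls - 1))
    done
    echo "eth0 is still down" >&2
    return 1
}

# Usage (as root): check_eth0_recovery
```

With the defaults, the function gives up after about 10 minutes, which leaves room for several Monit check intervals.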
Script verification
Your gateway will come back online after a couple of minutes.