The Ethernet network interface freezes after some time

Symptoms

After a period ranging from weeks to months, the network interface becomes unusable.


ALL of these symptoms must be present to confirm this problem:

  • Only the LoRa devices near this gateway are no longer connected
  • The packet forwarder is no longer connected to the Network Server
  • Remote connection over SSH is not possible
  • The system is still running (blue LED blinking in "heartbeat" mode)
  • Connection over SSH is still possible through the USB service port
  • The gateway has no network access (pinging a host on the local network fails)
  • Restarting the network interface brings the network back online

To restart the network interface:

$ sudo ifconfig eth0 down
$ sudo ifconfig eth0 up

This only clears the current freeze; it does not solve the problem permanently!
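For repeated use, the same restart can be wrapped in a small shell function. The following is only a sketch: it uses `ip link`, which this page also uses in the test section, instead of `ifconfig`. The function name `restart_iface` and the `RUN` variable are hypothetical names introduced here for illustration.

```shell
# Sketch: restart a network interface with iproute2.
# restart_iface and RUN are hypothetical names, not part of LORIX OS.
# RUN is the command prefix: it defaults to sudo; set RUN=echo for a dry run
# that only prints the commands.
restart_iface() {
    iface="${1:-eth0}"                     # interface name, eth0 by default
    ${RUN:-sudo} ip link set "$iface" down
    ${RUN:-sudo} ip link set "$iface" up
}
```

Calling `RUN=echo restart_iface eth0` prints the two commands without touching the interface, which is a safe way to check the function before running it over a remote session.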

Description

The mainline Linux kernel driver (all versions) for the LORIX One's network interface has a bug: it enters a deadlock after an overflow of the RX buffer.
The internal MAC peripheral of the SAMA5D4 itself works correctly and, in this situation, raises an error that can be handled. However, the current version of the driver does not handle it.

The problem is most likely to appear in environments with heavy broadcast traffic, and how often it occurs varies widely with the network configuration, for example the DHCP lease renewal time.

Solution

After weeks of investigation, Wifx found the source of the issue and submitted a patch to the Linux kernel maintainers.
This patch has not yet been merged into the mainline sources, but it has been included since LORIX OS 1.0.0.

The best option is therefore to update your LORIX One to LORIX OS 1.0.0 (or higher) through USB, following this documentation.

Workaround

If you can't update to LORIX OS, Monit can monitor the network interface and restart it when the deadlock symptoms appear.

Install Monit

Install Monit as explained here

Add a new monitoring script

Using vi or nano, create the Monit script /etc/monit.d/eth0-ping.monit with the following content:

check host eth0-ping with address <ping address to check>
    if failed ping count 5 size 128 with timeout 60 seconds then exec "/bin/bash -c '/sbin/ifconfig eth0 down; /sbin/ifconfig eth0 up;'"
        repeat every 5 cycles

Replace <ping address to check> with the address of the host to test connectivity against, for example 192.168.1.1 (your main router) or 8.8.8.8 (Google's DNS server).
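If you prefer not to edit the file by hand, the substitution can be scripted. The following is a minimal sketch that prints the check from this page with the address filled in; `render_eth0_ping_check` is a hypothetical helper name, not a Monit or LORIX OS command.

```shell
# Sketch: print the Monit check from this page with the placeholder filled in.
# render_eth0_ping_check is a hypothetical helper name.
render_eth0_ping_check() {
    # Abort with a usage message if no address is given.
    addr="${1:?usage: render_eth0_ping_check <address>}"
    cat <<EOF
check host eth0-ping with address $addr
    if failed ping count 5 size 128 with timeout 60 seconds then exec "/bin/bash -c '/sbin/ifconfig eth0 down; /sbin/ifconfig eth0 up;'"
        repeat every 5 cycles
EOF
}
```

For example, `render_eth0_ping_check 192.168.1.1 | sudo tee /etc/monit.d/eth0-ping.monit` would write the file for a router at 192.168.1.1.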

Monit runs this check on every cycle (the cycle time is set to 30 seconds by default in the general configuration file /etc/monitrc). Each check sends up to 5 pings of 128 bytes with a timeout of 60 seconds. If the ping test fails, Monit executes the command /sbin/ifconfig eth0 down; /sbin/ifconfig eth0 up; in the bash interpreter and, as long as the failure persists, repeats it every 5 cycles.
This restarts the network interface and clears the deadlock caused by the driver bug.
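To put the cycle counts in wall-clock terms, the cycle length can be read from the configuration and multiplied out. This is a sketch under two assumptions: that the cycle is declared with a `set daemon <seconds>` line in the Monit configuration, and that 30 seconds applies when no such line is found, as stated above. `monit_cycle_seconds` and `cycles_to_seconds` are hypothetical helper names.

```shell
# Sketch: convert a number of Monit cycles to seconds.
# monit_cycle_seconds and cycles_to_seconds are hypothetical helper names.

# Print the cycle length from the monitrc-style file given as $1 by reading
# its "set daemon <seconds>" line; fall back to 30 if no such line exists.
monit_cycle_seconds() {
    awk '$1 == "set" && $2 == "daemon" { print $3; found = 1; exit }
         END { if (!found) print 30 }' "$1"
}

# Multiply a cycle count ($1) by the cycle length in seconds ($2).
cycles_to_seconds() {
    echo $(( $1 * $2 ))
}
```

With the 30-second cycle quoted above, `cycles_to_seconds 5 30` gives 150, so 5 cycles correspond to two and a half minutes. Reading /etc/monitrc itself usually requires root privileges.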

Once this file is saved, reload Monit and check the current status:

eth0-ping status

$ sudo monit reload
Reinitializing monit daemon
$ sudo monit status
Monit 5.25.2 uptime: 25m

Remote Host 'eth0-ping'
  status                       OK
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  ping response time           7.179 ms
  data collected               Mon, 15 Jun 2020 08:32:17

System 'sama5d4-lorix-one-512'
  status                       OK
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  load average                 [0.00] [0.00] [0.00]
  cpu                          1.5%us 0.7%sy 0.0%wa
  memory usage                 12.9 MB [10.7%]
  swap usage                   0 B [0.0%]
  uptime                       29m
  boot time                    Mon, 15 Jun 2020 08:02:58
  data collected               Mon, 15 Jun 2020 08:32:17
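When checking the status from a script rather than by eye, the relevant field can be extracted from this output. The following is a minimal sketch that assumes the output layout shown above (an unindented header line ending in the quoted service name, followed by indented fields); `monit_service_status` is a hypothetical helper name.

```shell
# Sketch: print the "status" field of one service from `monit status` output
# read on stdin. monit_service_status is a hypothetical helper name.
monit_service_status() {
    awk -v svc="'$1'" '
        # A service block starts with an unindented line ending in the quoted name.
        /^[^ ]/ { in_block = ($NF == svc) }
        # Inside the wanted block, the line whose first word is "status" holds the value.
        in_block && $1 == "status" {
            sub(/^[ ]*status[ ]+/, "")   # drop the label, keep the value
            print
            exit
        }
    '
}
```

For example, `sudo monit status | monit_service_status eth0-ping` would print the status line value for the eth0-ping check.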

Test

If you are connected through SSH, you will lose connectivity during this test. If your script is incorrect, you may not regain access to the gateway.
A good general approach is to test the script on a local product that you can easily access using the USB serial console.

You can verify that it works correctly by disabling the eth0 interface:

Script verification

$ sudo ip link set eth0 down
[ after some time ]
$ sudo monit status
Monit 5.25.2 uptime: 30m

Remote Host 'eth0-ping'
  status                       ICMP failed
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  ping response time           connection failed
  data collected               Mon, 15 Jun 2020 08:37:48

System 'sama5d4-lorix-one-512'
[..]
$
[ Monit executes the script command, the eth0 interface is restarted ]
IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
macb f8020000.ethernet eth0: link up (100/Full)
IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

Your gateway will come online again after a couple of minutes.

