
Network monitoring

Kerlink provides two scripts to monitor the network or restart network interfaces: networkmonitoring.py and fixnetwork.py.

fixnetwork.py

This script is designed to be called by client applications that monitor several network links.
Client applications should pass the status of the monitored links to the script each time this status is refreshed. The script then takes actions to fix the defective links.

Since this script monitors several links at the same time, it is in most cases up to the client application to choose which link to use (instead of relying on the default route).

This script uses a configuration file (/etc/network/fixnetwork.conf) that describes which actions should be taken and when.

This script uses the LOG_LOCAL2 syslog facility to output logs. By default, the traces are written to /var/log/networkmonitoring.log.

Script Usage

root@klk-lpbs-0504B4:/user/test # fixnetwork.py -h
Usage: /usr/bin/fixnetwork.py [OPTION] device1HwPos,connectionState,durationSinceLastOk .... deviceXHwPos,connectionState,durationSinceLastOk
With:
 deviceXHwPos can be:
  gsmmodemXSlot-modemXPosition for a GSM modem inside a WAN module with:
   modemXSlot: module slot number (1 to n)
   modemXPosition: 1 for mono WAN modules, 1 or 2 for Dual WAN modules
  extusb: device must be the only modem plugged on external USB
  cable0: device is POE/LAN ethernet on iBTS
  cable1: device is Local ethernet on iBTS, or ethernet on iFemtoCell
  wifi: device is wifi device (on iFemtoCell only)
 connectionState: OK (last applicative ping using this device was OK) or KO (applicative ping KO or not done because no corresponding interface exists)
 durationSinceLastOk: error duration in seconds (since last successful applicative ping or first applicative ping)
Options are:
 -h: display help
 -f confFile: give configuration file. Default is /etc/network/fixnetwork.conf

Given information on network links to monitor, this script will take actions to try to bring monitored links up.
Actions can be: connection restart, device hardware reset, board reboot, ...
Example:
/usr/bin/fixnetwork.py gsm1-1,OK,0 gsm1-2,KO,50 cable0,KO,3400
 means:
 - connection on modem 1 on slot 1 is OK
 - connection on modem 2 on slot 1 is KO since 50s
 - connection on POE/LAN device is KO since 3400s
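A client application can build these arguments from its own link table and call the script once per refresh. The following Python sketch is only an illustration. `LinkStatus`, `build_args` and `report_links` are hypothetical names. Forcing the duration to 0 for OK links is an assumption that follows the examples on this page. Only the argument format and the `-f` option come from the help text above.

```python
import subprocess
from dataclasses import dataclass

FIXNETWORK = "/usr/bin/fixnetwork.py"


@dataclass
class LinkStatus:
    """Hypothetical per-link record kept by the client application."""
    hw_pos: str          # e.g. "gsm1-1", "cable0", "wifi"
    ok: bool             # result of the last applicative ping on this link
    error_duration: int  # seconds since the last successful applicative ping


def build_args(links):
    """Format each link as deviceXHwPos,connectionState,durationSinceLastOk."""
    args = []
    for link in links:
        state = "OK" if link.ok else "KO"
        # Assumption: the examples on this page always report 0 for an OK link.
        duration = 0 if link.ok else int(link.error_duration)
        args.append(f"{link.hw_pos},{state},{duration}")
    return args


def report_links(links, conf_file=None):
    """Call fixnetwork.py once with the current status of every monitored link."""
    cmd = [FIXNETWORK]
    if conf_file is not None:
        cmd += ["-f", conf_file]  # option documented in the help text
    cmd += build_args(links)
    return subprocess.run(cmd, check=False).returncode
```

With the three links of the help example, `build_args` returns `gsm1-1,OK,0`, `gsm1-2,KO,50` and `cable0,KO,3400`. These match the arguments of the documented command line.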

Configuration file

Example configuration file with default values:

[general]
# Log level:
#  - 0: No messages
#  - 1: Messages every time an action is taken
#  - 2: Messages every time a monitored connection status changes
#  - 3: Messages every time script is called
#  - 4 or more: Script debugging, many messages
# default is 1
log_level=1
 
[onelinkactions]
# These actions are done when a link is down during a given amount of time.
# If parameter is 0, corresponding action will never be taken.
# Number of seconds before reconnecting service (using ConnMan command). Default is 30.
error_duration_before_service_reconnect=30
# Number of seconds before device hw reset (if possible for this device)
error_duration_before_device_hw_reset=90
# Number of seconds before reconnecting service (using ConnMan command). Default is 150.
error_duration_before_service_reconnect_2=150
# Number of seconds before actions are retried.
# If not 0, this parameter should be more than all other action parameters
error_duration_before_action_retry=300
 
[alllinksactions]
# These actions are done when all links are down during a given amount of time.
# Number of seconds before restarting Ofono and ConnMan servers. Default is 50.
error_duration_before_servers_restart=50
# Number of seconds before restarting Ofono and ConnMan servers and devices hardware. Default is 200
error_duration_before_network_hardware_reboot=200
# Number of seconds before restarting board. Default is 0.
# If not 0, this parameter should be more than all other action parameters
error_duration_before_board_reboot=0
# Number of seconds before actions are retried.
# If not 0, this parameter should be more than all other action parameters
error_duration_before_action_retry=400

In the [onelinkactions] and [alllinksactions] sections, the error duration of each action must be either 0 (the action is never executed) or greater than the error duration of the previous action.
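You can check the ordering rule before deploying a modified file. This Python sketch uses the standard `configparser` module. `check_action_order` is a hypothetical helper. It assumes that the options appear in the file in execution order, as in the example above, and it compares each non-zero duration with the largest non-zero duration listed before it.

```python
import configparser

ACTION_SECTIONS = ("onelinkactions", "alllinksactions")


def check_action_order(conf_text):
    """Return a list of ordering problems found in the action sections.

    Each error duration must be 0 (action disabled) or greater than every
    non-zero duration listed before it in the same section.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    problems = []
    for section in ACTION_SECTIONS:
        if not parser.has_section(section):
            continue
        previous_key, previous_value = None, 0
        # Options come back in file order, assumed to be the action order.
        for key, value in parser.items(section):
            duration = int(value)
            if duration == 0:
                continue  # disabled action: not compared
            if duration <= previous_value:
                problems.append(
                    f"[{section}] {key}={duration} must be greater than "
                    f"{previous_key}={previous_value}"
                )
            else:
                previous_key, previous_value = key, duration
    return problems
```

Passing the content of /etc/network/fixnetwork.conf to `check_action_order` returns an empty list when the file follows the rule.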

error_duration_before_action_retry is not a real action. Once this error duration is reached, all actions (for one link or for all links) are executed again, starting from the first one.

Actions execution

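All actions described in the configuration have an “error duration” parameter. If this parameter is 0, the corresponding action is never executed. Otherwise, durations must be interpreted this way:
- the first action is executed once the link has been reported down for at least its error duration;
- each following action is executed once the time elapsed since the execution of the previous action reaches the difference between the error durations of the two actions.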

The following configuration is used to illustrate an execution cycle:

error_duration_action_1=30
error_duration_action_2=90
error_duration_action_3=150
error_duration_action_4=300

Example of execution cycle:

t0: GSM link stops working

t0+25: first call to fixnetwork.py
fixnetwork.py cable0,OK,0 gsm1-1,KO,25
=> no action executed

t0+60:
fixnetwork.py cable0,OK,0 gsm1-1,KO,60
=> action 1 executed (60s since first error report)

t0+95:
fixnetwork.py cable0,OK,0 gsm1-1,KO,95
=> no action executed (only 35s since first action execution)

t0+130:
fixnetwork.py cable0,OK,0 gsm1-1,KO,130
=> action 2 executed (70s since first action execution)

...

t1: GSM link starts working
fixnetwork.py cable0,OK,0 gsm1-1,OK,0
=> link is up

t2: GSM link stops working

t2+40:
fixnetwork.py cable0,OK,0 gsm1-1,KO,40
=> action 1 executed (link reported down for 40s; the action sequence restarted when the link came back up)
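You can reproduce the cycle above with a small simulation. This Python sketch encodes the timing rule as it can be inferred from the example. The first action is compared with the reported error duration. Later actions are spaced by the difference between consecutive durations. The sequence restarts when the link is reported OK. The sketch is not taken from fixnetwork.py. `OneLinkActions` is a hypothetical name, and the retry parameter is not modelled.

```python
class OneLinkActions:
    """Inferred one-link action sequencing (not Kerlink's implementation)."""

    def __init__(self, durations):
        # Disabled actions (duration 0) are skipped; the others keep their order.
        self.durations = [d for d in durations if d > 0]
        self.reset()

    def reset(self):
        self.next_index = 0           # index of the next action to execute
        self.last_action_time = None  # time at which the previous action ran

    def report(self, now, state, error_duration):
        """Process one call; return the executed action position (1-based) or None."""
        if state == "OK":
            self.reset()  # link is up: the next failure starts from action 1
            return None
        if self.next_index >= len(self.durations):
            return None  # every enabled action has already been executed
        if self.next_index == 0:
            # First action: compare with the error duration reported by the client.
            due = error_duration >= self.durations[0]
        else:
            # Following actions: wait for the gap between two consecutive
            # durations, counted from the previous action execution.
            gap = self.durations[self.next_index] - self.durations[self.next_index - 1]
            due = now - self.last_action_time >= gap
        if not due:
            return None
        self.next_index += 1
        self.last_action_time = now
        return self.next_index
```

The sketch receives the GSM reports of the example at t0+25, t0+60, t0+95 and t0+130 with durations 30, 90, 150 and 300. It returns no action, action 1, no action and action 2. A report of 40s after the link came back up triggers action 1 again.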

Second example of execution cycle:

t0: GSM link stops working

t0+35:
fixnetwork.py cable0,OK,0 gsm1-1,KO,35
=> onelinkactions action 1 executed on GSM (35s since first error report)

t0+40: Ethernet link also stops working

t0+100:
fixnetwork.py cable0,KO,60 gsm1-1,KO,100
=> alllinksactions action 1 executed (60s since eth0 down and 100s since GSM down)

t0+120: eth0 starts working again

t0+135:
fixnetwork.py cable0,OK,0 gsm1-1,KO,135
=> onelinkactions action 1 executed on GSM (35s since alllinksactions action 1 execution)
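The following sketch covers two points from this example: the all-links condition, and the restart of the one-link sequence at t0+135. This Python sketch is an interpretation of the example, not Kerlink code. It makes two assumptions. First, when every link is KO, the all-links error duration is the shortest reported duration. Second, an all-links action becomes the new starting point for one-link action 1. Times are seconds relative to t0, and the helper names are hypothetical.

```python
def parse_link(arg):
    """Split one 'deviceXHwPos,connectionState,durationSinceLastOk' argument."""
    hw_pos, state, duration = arg.split(",")
    return hw_pos, state, int(duration)


def all_links_error_duration(args):
    """Return for how long all links have been down, or None if one link is OK."""
    links = [parse_link(arg) for arg in args]
    if not links or any(state == "OK" for _, state, _ in links):
        return None
    # All links are down since the most recent failure: the shortest duration.
    return min(duration for _, _, duration in links)


def first_one_link_action_due(now, error_duration, last_all_links_action, threshold):
    """Check one-link action 1, assuming an all-links action restarts the sequence.

    The delay is counted from the later of the link failure
    (now - error_duration) and the last all-links action (None if none ran).
    """
    reference = now - error_duration
    if last_all_links_action is not None:
        reference = max(reference, last_all_links_action)
    return now - reference >= threshold
```

The call at t0+100 reports cable0 as KO, because the Ethernet link is down at that time. For this call, `all_links_error_duration(["cable0,KO,60", "gsm1-1,KO,100"])` returns 60. This exceeds error_duration_before_servers_restart=50 from the default [alllinksactions] section. At t0+135, `first_one_link_action_due(135, 135, 100, 30)` returns True because 35s have passed since the all-links action.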

networkmonitoring.py

Presentation

This script monitors the network and takes actions to fix it if the connection fails.
Only the default route is monitored by the script. It relies on ConnMan to define the default route and to bring up the network links.
Since networkmonitoring.py monitors the default route, the client application should use the default route.

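The script regularly checks whether it can reach a server. The check can be done by:
- sending an ICMP ping to the server (protocol=ping);
- opening a TCP connection to a given port of the server (protocol=tcp).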

Once a check fails, the check is repeated every 10 seconds. Actions are taken after a given number of consecutive failed attempts to get an answer from the monitored server.
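For protocol=tcp, the check amounts to opening a TCP connection within the configured timeout. The Python sketch below uses the standard `socket` module and a hypothetical `tcp_check` helper. You can also run it by hand to verify that a candidate server is reachable before writing it in the configuration file. It does not cover protocol=ping, which needs raw ICMP sockets or the system ping command. It does not cover the 10-second retry loop either.

```python
import socket


def tcp_check(server, port=80, timeout=5.0):
    """Return True if a TCP connection to server:port succeeds within timeout."""
    try:
        with socket.create_connection((server, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts (socket.timeout) and name
        # resolution errors (socket.gaierror), which all derive from OSError.
        return False
```

For example, `tcp_check("8.8.8.8", 53, 5)` tests the same kind of reachability from the gateway shell, with the server and port of your choice.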

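The actions taken when the check keeps failing are, in this order:
- reconnecting the service (using a ConnMan command);
- restarting the Ofono and ConnMan servers;
- restarting the Ofono and ConnMan servers and the Ethernet and modem hardware;
- rebooting the board.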

Configuration

The behavior of the script is defined in the /etc/network/networkmonitoring.conf file. This script uses the LOG_LOCAL2 syslog facility to output logs. By default, the traces are written to /var/log/networkmonitoring.log.

Here is a commented example of the configuration file:

[general]
# Monitor network. 0 means no monitoring. This is the default value.
monitor_network=1
# Number of seconds to wait before first check. Default: 60
first_check_delay=1200
# Interval in seconds between two checks when the network is OK. Default: 1200
check_interval=120
# Log level:
#  - 0: No messages
#  - 1: Messages every time an action is taken
#  - 2: Messages every time monitoring fails
#  - 3: Messages every time monitoring is done
#  - 4 or more: Script debugging, many messages
# default is 0
log_level=3
# monitor_external_usb:
#  - 0: external usb is not monitored (no reset of external USB port), default
#  - 1: external usb is monitored (external USB port is reset with all WAN modules when needed)
monitor_external_usb=0
 
[ping]
# Server used to check if network is up or not. It can be an IP address or a name. Default: 8.8.8.8
server=8.8.8.8
#server=google.com
# Protocol used to check if server is reachable. Possible values are:
# - ping: will send ICMP ping to given server. Port is useless
# - tcp: will try to connect to given server on given port
# Default is ping
protocol=ping
# Port used by the tcp protocol. Default is 80.
port=80
# Timeout in seconds before saying we failed to connect to monitored server.
# Default is 5
timeout=5
 
[actions]
# Once a network failure is detected, a ping is done every 10s.
# Actions are taken after a number of consecutive failures.
# If parameter is 0, corresponding action will never be taken.
# Number of failed pings before reconnecting service (using ConnMan command). Default is 3.
ping_error_before_service_reconnect=3
# Number of failed pings before restarting Ofono and ConnMan servers. Default is 5.
ping_error_before_servers_restart=20
# Number of failed pings before restarting Ofono and ConnMan servers and eth and modems hardware. Default is 10
ping_error_before_network_hardware_reboot=50
# Number of failed pings before restarting board. Default is 20
# If not 0, this parameter should be more than all previous action parameters
ping_error_before_board_reboot=100
# Number of failed pings before actions are retried. Default is 0 (not done)
ping_error_before_reset_to_first_action=0
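A companion tool that needs the same settings can merge the file with the defaults stated in its comments. In this Python sketch, `DEFAULTS` is transcribed from the comments above and `load_config` is a hypothetical helper. It does not reproduce how networkmonitoring.py itself reads the file.

```python
import configparser

# Default values transcribed from the comments of networkmonitoring.conf.
DEFAULTS = {
    "general": {
        "monitor_network": "0",
        "first_check_delay": "60",
        "check_interval": "1200",
        "log_level": "0",
        "monitor_external_usb": "0",
    },
    "ping": {"server": "8.8.8.8", "protocol": "ping", "port": "80", "timeout": "5"},
    "actions": {
        "ping_error_before_service_reconnect": "3",
        "ping_error_before_servers_restart": "5",
        "ping_error_before_network_hardware_reboot": "10",
        "ping_error_before_board_reboot": "20",
        "ping_error_before_reset_to_first_action": "0",
    },
}


def load_config(text):
    """Parse networkmonitoring.conf text; missing keys fall back to DEFAULTS."""
    parser = configparser.ConfigParser()
    parser.read_dict(DEFAULTS)  # defaults first...
    parser.read_string(text)    # ...then values from the file override them
    protocol = parser.get("ping", "protocol")
    if protocol not in ("ping", "tcp"):
        raise ValueError(f"unsupported protocol: {protocol}")
    return parser
```

Typed values can then be read with the standard getters, for example `load_config(text).getint("actions", "ping_error_before_board_reboot")`.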

In most cases, it is advised to use the same server address as the one in /etc/network/connman/main.conf, and to use the LNS address for the monitoring.
It is strongly advised to check that the server answers the ping before modifying the configuration file.
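The first recommendation can be checked automatically by comparing the host of the ConnMan online-check URL with the monitored server. This Python sketch is illustrative, and the helper names are hypothetical. The key name OnlineCheckServerIpV4Url is taken from the main.conf example on this page. Host names are compared case-insensitively. The sketch only compares configuration values; it does not test whether the server answers.

```python
import configparser
from urllib.parse import urlparse


def connman_check_host(main_conf_text):
    """Return the host of OnlineCheckServerIpV4Url in main.conf, or None."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep ConnMan's CamelCase option names
    parser.read_string(main_conf_text)
    url = parser.get("General", "OnlineCheckServerIpV4Url", fallback=None)
    return urlparse(url).hostname if url else None  # hostname is lowercased


def same_server(main_conf_text, monitoring_conf_text):
    """True if networkmonitoring.conf monitors the ConnMan online-check host."""
    monitoring = configparser.ConfigParser()
    monitoring.read_string(monitoring_conf_text)
    server = monitoring.get("ping", "server", fallback="8.8.8.8")
    host = connman_check_host(main_conf_text)
    return host is not None and host == server.strip().lower()
```

With the two files of the example later in this section, `same_server` returns True.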

If ping_error_before_board_reboot=0, the board is never rebooted, and once all actions have been executed, the script stops. In that case, it is advised to set ping_error_before_reset_to_first_action to a value greater than all previous action parameters. The script then goes back to the first action and executes the actions in a loop without rebooting.
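The escalation can be sketched as a lookup from the consecutive failure count to an action. This Python sketch rests on two assumptions. First, each action fires exactly when the counter reaches its threshold. Second, with a non-zero ping_error_before_reset_to_first_action, failure number ping_error_before_reset_to_first_action + k behaves like failure k. `action_for_failure` and the action labels are hypothetical.

```python
# Action order and labels follow the [actions] comments of networkmonitoring.conf.
ACTIONS = (
    ("ping_error_before_service_reconnect", "reconnect service"),
    ("ping_error_before_servers_restart", "restart Ofono and ConnMan"),
    ("ping_error_before_network_hardware_reboot", "restart servers and network hardware"),
    ("ping_error_before_board_reboot", "reboot board"),
)


def action_for_failure(failures, thresholds, reset_after=0):
    """Return the action label for this consecutive failure count, or None.

    thresholds maps each ping_error_before_* key to its value (0 = disabled);
    reset_after is ping_error_before_reset_to_first_action (0 = no loop).
    """
    if failures <= 0:
        return None
    if reset_after > 0:
        # Assumed loop: failure reset_after + k is treated like failure k.
        failures = (failures - 1) % reset_after + 1
    for key, label in ACTIONS:
        threshold = thresholds.get(key, 0)
        if threshold > 0 and threshold == failures:
            return label
    return None
```

The example thresholds are 3, 20 and 50, with board reboot disabled and ping_error_before_reset_to_first_action=60. With these values, the 63rd consecutive failure triggers the service reconnection again.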

By default, the script is enabled: it waits 20 minutes before its first check (first_check_delay=1200) and checks connectivity by pinging 8.8.8.8. Make sure that the network in which the gateway is installed allows this ping, otherwise the gateway may reboot because of this script. If it does not, change the server in this file or disable the monitoring (monitor_network=0).

actions execution

Since the combined behaviour of ConnMan and networkmonitoring.py is somewhat complex, this section describes a typical network failure scenario.

The configuration of ConnMan and networkmonitoring.py used in this example is shown below (without comments):

/etc/network/connman/main.conf
[General]
DefaultAutoConnectTechnologies = ethernet, cellular
PreferredTechnologies = ethernet, cellular
EnableOnlineCheck = true
OnlineCheckUseConnmanHeaders = false
OnlineCheckServerIpV4Url = http://myServer123456.com

/etc/network/networkmonitoring.conf
[general]
monitor_network=1
first_check_delay=30
check_interval=1200
log_level=1
monitor_external_usb=0
[ping]
server=myServer123456.com
protocol=ping
port=80
timeout=5
[actions]
ping_error_before_service_reconnect=3
ping_error_before_servers_restart=5
ping_error_before_network_hardware_reboot=10
ping_error_before_board_reboot=20

example: