
Network monitoring

Kerlink provides two scripts to monitor the network or restart network interfaces: networkmonitoring.py and fixnetwork.py.

fixnetwork.py

This script is designed to be called by client applications that monitor several network links.
Client applications should pass the status of every monitored link to the script each time the status is refreshed. The script then takes actions to fix defective links.

Because the script monitors multiple links at the same time, it is in most cases up to the client application to choose which link to use (instead of relying on the default route).

This script uses a configuration file (/etc/network/fixnetwork.conf) that describes which actions should be taken and when.

This script uses the LOG_LOCAL2 syslog facility to output logs. By default, the traces are written to /var/log/networkmonitoring.log.

Script Usage

root@klk-lpbs-0504B4:/user/test # fixnetwork.py -h                                                                                                                        
Usage: /usr/bin/fixnetwork.py [OPTION] device1HwPos,connectionState,durationSinceLastOk .... deviceXHwPos,connectionState,durationSinceLastOk                             
With:                                                                                                                                                                     
 deviceXHwPos can be:                                                                                                                                                     
  gsmmodemXSlot-modemXPosition for a GSM modem inside a WAN module with:                                                                                                  
   modemXSlot: module slot number (1 to n)                                                                                                                                
   modemXPosition: 1 for mono WAN modules, 1 or 2 for Dual WAN modules                                                                                                    
  extusb: device must be the only modem plugged on external USB                                                                                                           
  cable0: device is POE/LAN ethernet on iBTS                                                                                                                              
  cable1: device is Local ethernet on iBTS, or ethernet on iFemtoCell                                                                                                     
  wifi: device is wifi device (on iFemtoCell only)                                                                                                                        
 connectionState: OK (last applicative ping using this device was OK) or KO (applicative ping KO or not done because no corresponding interface exists)                   
 durationSinceLastOk: error duration in seconds (since last successful applicative ping or first applicative ping)                                                        
Options are:                                                                                                                                                              
 -h: display help                                                                                                                                                         
 -f confFile: give configuration file. Default is /etc/network/fixnetwork.conf                                                                                            
                                                                                                                                                                          
Given information on network links to monitor, this script will take actions to try to bring monitored links up.                                                          
Actions can be: connection restart, device hardware reset, board reboot, ...                                                                                              
Example:                                                                                                                                                                  
/usr/bin/fixnetwork.py gsm1-1,OK,0 gsm1-2,KO,50 cable0,KO,3400                                                                                                             
 means:                                                                                                                                                                   
 - connection on modem 1 on slot 1 is OK                                                                                                                                  
 - connection on modem 2 on slot 1 is KO since 50s                                                                                                                        
 - connection on POE/LAN device is KO since 3400s
  • Each time the script is called, the status of all monitored network links must be given to the script. Otherwise, unwanted behaviour could occur. For example, if only the GSM status is reported (KO) while Ethernet is working properly, ConnMan could be restarted, which would stop Ethernet from working for a short period of time.
  • Script execution can be long when an action is taken. When no action is taken (the normal case), execution takes less than one second; when actions are taken, it can take up to 20 seconds per monitored link.
  • The script takes at most one action per failing monitored link per script call.
  • The parameters are case sensitive: use OK, not ok.
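
The calling rules above can be sketched from the client side as follows. This is a minimal illustration, not Kerlink code: the helper names build_fixnetwork_args and report_links, and the timeout_s parameter, are hypothetical. Only the argument format (device,state,duration) and the script path come from the help output above.

```python
import subprocess

VALID_STATES = ("OK", "KO")  # case sensitive, as required by fixnetwork.py


def build_fixnetwork_args(links):
    """Build the positional arguments for fixnetwork.py.

    `links` is a list of (device, state, duration_s) tuples that must
    cover every monitored link, not only the failing ones.
    """
    if not links:
        raise ValueError("the status of every monitored link must be reported")
    args = []
    for device, state, duration_s in links:
        if state not in VALID_STATES:
            raise ValueError(f"state must be 'OK' or 'KO', got {state!r}")
        if duration_s < 0:
            raise ValueError("duration must be >= 0")
        args.append(f"{device},{state},{int(duration_s)}")
    return args


def report_links(links, script="/usr/bin/fixnetwork.py", timeout_s=None):
    """Call the script once with the status of all links (hypothetical helper).

    The call blocks while the script runs, which can take up to about
    20 seconds per monitored link when actions are taken.
    """
    cmd = [script] + build_fixnetwork_args(links)
    return subprocess.run(cmd, check=False, timeout=timeout_s).returncode
```

For instance, build_fixnetwork_args([("gsm1-1", "OK", 0), ("cable0", "KO", 3400)]) yields the two arguments gsm1-1,OK,0 and cable0,KO,3400. Because the script may block for a while, a client would probably call report_links outside its own monitoring loop or with a generous timeout.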

Configuration file

Configuration file example with default values.

[general]
# Log level:
#  - 0: No messages
#  - 1: Messages every time an action is taken
#  - 2: Messages every time a monitored connection status changes
#  - 3: Messages every time script is called
#  - 4 or more: Script debugging, many messages
# default is 1
log_level=1
 
[onelinkactions]
# These actions are done when a link is down during a given amount of time.
# If parameter is 0, corresponding action will never be taken.
# Number of seconds before reconnecting service (using ConnMan command). Default is 30.
error_duration_before_service_reconnect=30
# Number of seconds before device hw reset (if possible for this device)
error_duration_before_device_hw_reset=90
# Number of seconds before reconnecting service (using ConnMan command). Default is 150.
error_duration_before_service_reconnect_2=150
# Number of seconds before actions are retried.
# If not 0, this parameter should be more than all other action parameters
error_duration_before_action_retry=300
 
[alllinksactions]
# These actions are done when all links are down during a given amount of time.
# Number of seconds before restarting Ofono and ConnMan servers. Default is 50.
error_duration_before_servers_restart=50
# Number of seconds before restarting Ofono and ConnMan servers and devices hardware. Default is 200
error_duration_before_network_hardware_reboot=200
# Number of seconds before restarting board. Default is 0.
# If not 0, this parameter should be more than all other action parameters
error_duration_before_board_reboot=0
# Number of seconds before actions are retried.
# If not 0, this parameter should be more than all other action parameters
error_duration_before_action_retry=400

In sections [onelinkactions] and [alllinksactions], the error duration of each action must be either 0 (the action is never executed) or greater than the error duration of the previous action.

error_duration_before_action_retry is not a real action. It is used to re-execute all actions (for one link or for all links) once the error has lasted for the given duration.
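
The ordering rule can be checked before deploying a modified file. The sketch below is an assumption-laden illustration: check_durations is a hypothetical helper, the key order is taken from the default file shown above, and missing keys are simply skipped rather than replaced by the script's real defaults.

```python
import configparser

# Action parameters in the order they appear in the default file.
ORDERED_KEYS = {
    "onelinkactions": [
        "error_duration_before_service_reconnect",
        "error_duration_before_device_hw_reset",
        "error_duration_before_service_reconnect_2",
        "error_duration_before_action_retry",
    ],
    "alllinksactions": [
        "error_duration_before_servers_restart",
        "error_duration_before_network_hardware_reboot",
        "error_duration_before_board_reboot",
        "error_duration_before_action_retry",
    ],
}


def check_durations(conf_text):
    """Return a list of ordering problems found in a fixnetwork.conf text."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    problems = []
    for section, keys in ORDERED_KEYS.items():
        previous = 0
        for key in keys:
            # Assumption: a missing key is treated like a disabled action.
            value = parser.getint(section, key, fallback=0)
            if value == 0:
                continue  # 0 disables the action, so it is not compared
            if value <= previous:
                problems.append(
                    f"[{section}] {key}={value} must be greater than {previous}"
                )
            previous = value
    return problems
```

An empty list means the durations respect the rule. With the default values shown above, no problem is reported.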

Actions execution

All actions described in the configuration have an “error duration” parameter. If this parameter is 0, the corresponding action never occurs. Otherwise, durations must be interpreted this way:

  • First action: the action is taken when the monitored link has been reported down for longer than this parameter.
  • Other actions: an action is taken only once the time elapsed since the previous action is greater than or equal to its error duration minus the previous action's error duration (e.g. error_duration_2 = 90 and error_duration_1 = 30 give 90 - 30 = 60s). This gives the previous action time to complete before something else is tried. For example, resetting the modem would be pointless while the associated ConnMan service is still reconnecting.
  • Each time the monitored link status goes back to OK and is then reported KO again, the script restarts from the first action.

The following configuration is used to illustrate an execution cycle:

error_duration_action_1=30
error_duration_action_2=90
error_duration_action_3=150
error_duration_action_4=300

Example of execution cycle:

t0: GSM link stops working

t0+25: first call to fixnetwork
fixnetwork.py cable0,OK,0 gsm1-1,KO,25
  => no action executed

t0+60:
fixnetwork.py cable0,OK,0 gsm1-1,KO,60
=> action 1 executed (35s since first error report)

t0+95:
fixnetwork.py cable0,OK,0 gsm1-1,KO,95
=> no action executed (only 35s since first action execution)

t0+130:
fixnetwork.py cable0,OK,0 gsm1-1,KO,130
=> action 2 executed (70s since first action execution)

...

t1: GSM link starts working
fixnetwork.py cable0,OK,0 gsm1-1,OK,0
=> link is up

t2: GSM link stops working
fixnetwork.py cable0,OK,0 gsm1-1,KO,40
=> action 1 executed (link is reported down for 40s)
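
The cycle above can be replayed with a small simulation. This is only a sketch of the rules as documented on this page, not the script's actual implementation: simulate_link is a hypothetical function, elapsed time is derived from the reported durations, disabled actions (value 0) are not handled, and the retry parameter (action 4) is left out.

```python
def simulate_link(thresholds, reports):
    """Replay (state, duration_s) reports for one monitored link.

    Returns, for each report, the 1-based number of the action taken,
    or None when no action is taken.
    """
    results = []
    next_action = 0        # index of the next action to try
    last_action_at = None  # reported duration when the last action ran
    for state, duration_s in reports:
        if state == "OK":
            # Link is back: the next failure restarts from action 1.
            next_action, last_action_at = 0, None
            results.append(None)
            continue
        taken = None
        if next_action < len(thresholds):
            if next_action == 0:
                # First action: link down for longer than its threshold.
                due = duration_s > thresholds[0]
            else:
                # Later actions: wait for the gap between the two thresholds.
                gap = thresholds[next_action] - thresholds[next_action - 1]
                due = duration_s - last_action_at >= gap
            if due:
                taken = next_action + 1  # at most one action per call
                last_action_at = duration_s
                next_action += 1
        results.append(taken)
    return results


calls = [("KO", 25), ("KO", 60), ("KO", 95), ("KO", 130), ("OK", 0), ("KO", 40)]
print(simulate_link([30, 90, 150], calls))
# prints [None, 1, None, 2, None, 1]
```

The printed list matches the gsm1-1 column of the cycle: no action at 25s, action 1 at 60s, nothing at 95s, action 2 at 130s, then action 1 again after the link recovered and failed once more.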

Second example of execution cycle:

t0: GSM link stops working

t0+35:
fixnetwork.py cable0,OK,0 gsm1-1,KO,35
=> onelinkaction action 1 executed on GSM (35s since first error report)

t0+40: Ethernet link also stops working

t0+100:
fixnetwork.py cable0,KO,60 gsm1-1,KO,100
=> alllinksactions action 1 executed (eth0 down for 60s and GSM down for 100s)

t0+120: eth0 starts working again

t0+135:
fixnetwork.py cable0,OK,15 gsm1-1,KO,135
=> onelinkaction action 1 executed on GSM (35s since alllinksactions action 1 execution)

networkmonitoring.py

Presentation

This script monitors the network and takes actions to fix it if the connection fails.
Only the default route is monitored by the script. It relies on ConnMan to define the default route and to bring up the network links.
Since networkmonitoring.py monitors the default route, the client application should use the default route.

The script periodically checks whether it can reach a server. The check can be done by:

  • ICMP pings to a given server (IP address or DNS name).
  • TCP connection to a given server (IP address or DNS name) and port.

Once a check fails, monitoring is done every 10 seconds. Actions are taken after a given number of consecutive failed attempts to get an answer from the monitored server.
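
As an illustration of the TCP variant of the check, here is a minimal sketch. It is not the script's own code: tcp_check is a hypothetical function, and its port and timeout_s defaults simply mirror the documented [ping] defaults (80 and 5 seconds).

```python
import socket


def tcp_check(server, port=80, timeout_s=5.0):
    """Return True if a TCP connection to server:port succeeds in time."""
    try:
        # create_connection resolves DNS names and applies the timeout
        # to the connection attempt.
        with socket.create_connection((server, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False
```

Running such a check by hand against the intended server is one way to confirm that the server is reachable before it is written into the configuration.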

Actions taken when the check fails are:

  • Restarting the ConnMan service corresponding to the network default route (if any).
  • Restarting the ConnMan and oFono daemons.
  • Hardware reboot of all WAN modules and the Ethernet PHY.
  • Rebooting the board.

Configuration

The behavior of the script is defined in the /etc/network/networkmonitoring.conf file. This script uses the LOG_LOCAL2 syslog facility to output logs. By default, the traces are written to /var/log/networkmonitoring.log.

Here is a commented example of a configuration file:

[general]
# Monitor network. 0 means no monitoring. This is the default value.
monitor_network=1
# Number of seconds to wait before first check. Default: 60
first_check_delay=1200
# Interval in seconds between two checks when network is OK. Default: 1200
check_interval=120
# Log level:
#  - 0: No messages
#  - 1: Messages every time an action is taken
#  - 2: Messages every time monitoring fails
#  - 3: Messages every time monitoring is done
#  - 4 or more: Script debugging, many messages
# default is 0
log_level=3
# monitor_external_usb:
#  - 0: external usb is not monitored (no reset of external USB port), default
#  - 1: external usb is monitored (external USB port is reset with all WAN modules when needed)
monitor_external_usb=0
 
[ping]
# Server used to check if network is up or not. It can be an IP address or a name. Default: 8.8.8.8
server=8.8.8.8
#server=google.com
# Protocol used to check if server is reachable. Possible values are:
# - ping: will send ICMP ping to given server. Port is ignored
# - tcp: will try to connect to given server on given port
# Default is ping
protocol=ping
# Port on which ping is done. Default is 80.
port=80
# Timeout in seconds before saying we failed to connect to monitored server.
# Default is 5
timeout=5
 
[actions]
# Once network failure is detected, ping is done every 10s.
# Actions are taken after a number of consecutive failures.
# If parameter is 0, corresponding action will never be taken.
# Number of failed ping before reconnecting service (using connman command). Default is 3.
ping_error_before_service_reconnect=3
# Number of failed ping before restarting Ofono and Connman servers. Default is 5.
ping_error_before_servers_restart=20
# Number of failed ping before restarting Ofono and Connman servers and eth and modems hardware. Default is 10
ping_error_before_network_hardware_reboot=50
# Number of failed ping before restarting board. Default is 20
# If not 0, this parameter should be more than all previous action parameters
ping_error_before_board_reboot=100
# Number of failed ping before actions are retried. Default is 0 (not done)
ping_error_before_reset_to_first_action=0

In most cases, it is advised to use the same server address as the one in /etc/network/connman/main.conf.
In most cases, it is advised to use the LNS address for the monitoring.
It is strongly advised to check that the ping to the server works before modifying the configuration file.

If ping_error_before_board_reboot=0, there will be no board reboot: once all actions have been executed, the script stops. In that case, it is advised to set ping_error_before_reset_to_first_action to a value greater than all previous action parameters. The script then goes back to the first action and performs the actions in a loop, without rebooting.
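
To get a feel for when each action would fire, the [actions] values can be turned into a rough timeline. This is a sketch under stated assumptions: action_schedule and its labels are hypothetical, and the delay is only a lower bound computed from the documented 10-second check period, ignoring ping timeouts and the time the actions themselves take.

```python
ACTION_KEYS = [
    ("ping_error_before_service_reconnect", "service reconnect"),
    ("ping_error_before_servers_restart", "servers restart"),
    ("ping_error_before_network_hardware_reboot", "network hardware reboot"),
    ("ping_error_before_board_reboot", "board reboot"),
    ("ping_error_before_reset_to_first_action", "reset to first action"),
]


def action_schedule(actions, check_period_s=10):
    """List (failures, min_delay_s, action) for every enabled action.

    `actions` maps the [actions] parameter names to their values.
    min_delay_s is the minimum time between the first failed check and
    the check that triggers the action.
    """
    schedule = []
    for key, label in ACTION_KEYS:
        failures = int(actions.get(key, 0))
        if failures > 0:  # 0 means the action is never taken
            schedule.append((failures, (failures - 1) * check_period_s, label))
    return sorted(schedule)


for entry in action_schedule({
    "ping_error_before_service_reconnect": 3,
    "ping_error_before_servers_restart": 5,
    "ping_error_before_network_hardware_reboot": 10,
    "ping_error_before_board_reboot": 0,
    "ping_error_before_reset_to_first_action": 15,
}):
    print(entry)
```

With these illustrative values, the board reboot is disabled and the reset to the first action comes last, after 15 consecutive failures, which is the loop-without-reboot setup described above.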

By default, the script is enabled. It performs its first check after 20 minutes and checks the connectivity by pinging 8.8.8.8. Make sure that the network in which the gateway is installed allows this ping, otherwise the gateway may reboot because of this script. If the ping is not allowed, change the server in the configuration file, or disable the script.

Actions execution

Since the combined behaviour of ConnMan and networkmonitoring.py is somewhat complex, this section describes an example of a typical network failure.

The ConnMan and networkmonitoring.py configurations used in this example are shown below (comments removed).

/etc/network/connman/main.conf
[General]
DefaultAutoConnectTechnologies = ethernet, cellular
PreferredTechnologies = ethernet, cellular
EnableOnlineCheck = true
OnlineCheckUseConnmanHeaders = false
OnlineCheckServerIpV4Url = http://myServer123456.com

/etc/network/networkmonitoring.conf
[general]
monitor_network=1
first_check_delay=30
check_interval=1200
log_level=1
monitor_external_usb=0
[ping]
server=myServer123456.com
protocol=ping
port=80
timeout=5
[actions]
ping_error_before_service_reconnect=3
ping_error_before_servers_restart=5
ping_error_before_network_hardware_reboot=10
ping_error_before_board_reboot=20

Example:

  • Initial state:
    • The gateway has an Ethernet connection (eth0) and a cellular modem (ppp0, not shown by ifconfig) ready to be used.
    • The default route is Ethernet and this route can access the server.
  • The connection to the server is lost (network problem on eth0)
    • The script detects the problem
    • The default service is restarted
    • ConnMan detects another service (ppp0) with an access to the server
    • This service becomes the default one
  • The network problem on eth0 disappears
    • The default route stays on ppp0 (Since networkmonitoring.py monitors the default route and the default route is properly working, no action is taken)
  • The connection to the server is lost (network problem on ppp0)
    • The script detects the problem
    • The default service is restarted
    • ConnMan detects another service (eth0) with an access to the server
    • This service becomes the default one
wiki/network_mana/networkmonitoring.txt · Last modified: 2023/02/13 16:15 by ehe