====== Network monitoring ======

Kerlink provides two scripts to monitor the network or restart network interfaces: ''networkmonitoring.py'' and ''fixnetwork.py''.

===== fixnetwork.py =====

This script is designed to be called by client applications that monitor different network links.\\
Client applications should send the status of the monitored links to the script each time the status is refreshed. The script then takes actions to fix defective links.\\
Since this script is used to monitor multiple links at the same time, in most cases it is up to the client application to choose which link should be used (instead of relying on the default route).

This script uses a configuration file (''/etc/network/fixnetwork.conf'') describing which actions should be taken and when.

This script uses the LOG_LOCAL2 Syslog facility to output logs. By default, the traces are written in ''/var/log/networkmonitoring.log''.

==== Script Usage ====

  root@klk-lpbs-0504B4:/user/test # fixnetwork.py -h
  Usage: /usr/bin/fixnetwork.py [OPTION] device1HwPos,connectionState,durationSinceLastOk .... deviceXHwPos,connectionState,durationSinceLastOk
  With:
      deviceXHwPos can be:
          gsmmodemXSlot-modemXPosition for a GSM modem inside a WAN module with:
              modemXSlot: module slot number (1 to n)
              modemXPosition: 1 for mono WAN modules, 1 or 2 for Dual WAN modules
          extusb: device must be the only modem plugged on external USB
          cable0: device is POE/LAN ethernet on iBTS
          cable1: device is Local ethernet on iBTS, or ethernet on iFemtoCell
          wifi: device is wifi device (on iFemtoCell only)
      connectionState: OK (last applicative ping using this device was OK) or KO (applicative ping KO or not done because no corresponding interface exists)
      durationSinceLastOk: error duration in seconds (since last successful applicative ping or first applicative ping)
  Options are:
      -h: display help
      -f confFile: give configuration file.
          Default is /etc/network/fixnetwork.conf
  Given information on network links to monitor, this script will take actions to try to bring monitored links up.
  Actions can be: connection restart, device hardware reset, board reboot, ...
  Example:
      /usr/bin/fixnetwork.py gsm1-1,OK,0 gsm1-2,KO,50 cable0,KO,3400
  means:
      - connection on modem 1 on slot 1 is OK
      - connection on modem 2 on slot 1 is KO since 50s
      - connection on POE/LAN device is KO since 3400s

==== Reporting network links status ====

  * Each time the script is called, the status of every monitored network link must be given to the script. Otherwise, unwanted behaviours could occur. For example, if only the GSM status is reported (KO) while Ethernet is working properly, ConnMan could be restarted, which would stop Ethernet from working for a short period of time.
  * Script execution can be long if an action is taken. When no action is taken (the normal case), script execution takes less than one second. When actions are taken, it can take up to 20 seconds per monitored link.
  * The script takes a maximum of one action per failing monitored link per script call.
  * The parameters are case sensitive: use ''OK'', not ''ok''.

==== Configuration file ====

Configuration file example with default values.

  [general]
  # Log level:
  # - 0: No messages
  # - 1: Messages every time an action is taken
  # - 2: Messages every time a monitored connection status changes
  # - 3: Messages every time script is called
  # - 4 or more: Script debugging, many messages
  # default is 1
  log_level=1

  [onelinkactions]
  # These actions are done when a link is down during a given amount of time.
  # If parameter is 0, corresponding action will never be taken.
  # Number of seconds before reconnecting service (using ConnMan command). Default is 30.
  error_duration_before_service_reconnect=30
  # Number of seconds before device hw reset (if possible for this device)
  error_duration_before_device_hw_reset=90
  # Number of seconds before reconnecting service (using ConnMan command).
  # Default is 150.
  error_duration_before_service_reconnect_2=150
  # Number of seconds before actions are retried.
  # If not 0, this parameter should be more than all other action parameters
  error_duration_before_action_retry=300

  [alllinksactions]
  # These actions are done when all links are down during a given amount of time.
  # Number of seconds before restarting Ofono and ConnMan servers. Default is 50.
  error_duration_before_servers_restart=50
  # Number of seconds before restarting Ofono and ConnMan servers and devices hardware. Default is 200
  error_duration_before_network_hardware_reboot=200
  # Number of seconds before restarting board. Default is 0.
  # If not 0, this parameter should be more than all other action parameters
  error_duration_before_board_reboot=0
  # Number of seconds before actions are retried.
  # If not 0, this parameter should be more than all other action parameters
  error_duration_before_action_retry=400

In sections ''[onelinkactions]'' and ''[alllinksactions]'', the error duration of any action must be either 0 (the action is never executed) or greater than the error duration of the previous action.\\
''error_duration_before_action_retry'' is not a real action. It is used to re-execute all actions (for one link or for all links) after a certain error duration.

==== Actions execution ====

All actions described in the configuration have an "error duration" parameter. If this parameter is 0, the corresponding action never occurs. Otherwise, durations must be interpreted this way:

  * First action: the action is taken when the monitored link has been reported down for longer than this parameter.
  * Other actions: these are only taken if the time elapsed since the previous action is greater than or equal to the action error duration minus the previous error duration (e.g. (error_duration_2 = 90) - (error_duration_1 = 30) = 60s). The aim is to give the script time to finish its actions before doing something else.
For example, asking the modem to reset could be useless while the associated ConnMan service is reconnecting.

  * Each time the monitored link status goes back to OK and is then reported as KO again, the script restarts from the first action.

Below is the configuration file used to illustrate an execution cycle:

  error_duration_action_1=30
  error_duration_action_2=90
  error_duration_action_3=150
  error_duration_action_4=300

Example of execution cycle:

  t0:     GSM link stops working
  t0+25:  first call to fixnetwork
          fixnetwork.py cable0,OK,0 gsm1-1,KO,25
          => no action executed
  t0+60:  fixnetwork.py cable0,OK,0 gsm1-1,KO,60
          => action 1 executed (link reported down for 60s, more than 30s)
  t0+95:  fixnetwork.py cable0,OK,0 gsm1-1,KO,95
          => no action executed (only 35s since first action execution)
  t0+130: fixnetwork.py cable0,OK,0 gsm1-1,KO,130
          => action 2 executed (70s since first action execution)
  ...
  t1:     GSM link starts working
          fixnetwork.py cable0,OK,0 gsm1-1,OK,0
          => link is up
  t2:     GSM link stops working
          fixnetwork.py cable0,OK,0 gsm1-1,KO,40
          => action 1 executed (link reported down for 40s)

Second example of execution cycle:

  t0:     GSM link stops working
  t0+35:  fixnetwork.py cable0,OK,0 gsm1-1,KO,35
          => onelinkactions action 1 executed on GSM (35s since first error report)
  t0+40:  Ethernet link also stops working
  t0+100: fixnetwork.py cable0,KO,60 gsm1-1,KO,100
          => alllinksactions action 1 executed (60s since eth0 down and 100s since GSM down)
  t0+120: eth0 starts working again
  t0+135: fixnetwork.py cable0,OK,15 gsm1-1,KO,135
          => onelinkactions action 1 executed on GSM (35s since alllinksactions action 1 execution)

===== networkmonitoring.py =====

==== Presentation ====

This script monitors the network and takes actions to fix it if the connection fails.\\
Only the default route is monitored by the script.
It relies on ConnMan to define the default route and to bring up the network links.\\
Since ''networkmonitoring.py'' monitors the default route, the client application should use the default route.

The script regularly checks whether it can reach a server. The check can be done by:

  * ICMP pings to a given server (IP address or DNS name).
  * TCP connection to a given server (IP address or DNS name) and port.

Once a check fails, monitoring is done every 10 seconds. Actions are taken after a certain number of consecutive failed attempts to receive an answer from the monitored server.

The actions taken when the check fails are:

  * Restarting the ConnMan service corresponding to the network default route (if any).
  * Restarting the ConnMan and oFono daemons.
  * Hardware reboot of all WAN modules and the Ethernet PHY.
  * Rebooting the board.

==== Configuration ====

The behavior of the script is defined in the ''/etc/network/networkmonitoring.conf'' file.

This script uses the LOG_LOCAL2 Syslog facility to output logs. By default, the traces are written in ''/var/log/networkmonitoring.log''.

Here is a commented example of a configuration file:

  [general]
  # Monitor network. 0 means no monitoring. This is the default value.
  monitor_network=1
  # Number of seconds to wait before first check. Default: 60
  first_check_delay=1200
  # Interval in seconds between two monitoring when network is OK. Default: 1200
  check_interval=120
  # Log level:
  # - 0: No messages
  # - 1: Messages every time an action is taken
  # - 2: Messages every time monitoring fails
  # - 3: Messages every time monitoring is done
  # - 4 or more: Script debugging, many messages
  # default is 0
  log_level=3
  # monitor_external_usb:
  # - 0: external usb is not monitored (no reset of external USB port), default
  # - 1: external usb is monitored (external USB port is reset with all WAN modules when needed)
  monitor_external_usb=0

  [ping]
  # Server used to check if network is up or not. It can be an IP address or a name.
  # Default: 8.8.8.8
  server=8.8.8.8
  #server=google.com
  # Protocol used to check if server is reachable. Possible values are:
  # - ping: will send ICMP ping to given server. Port is useless
  # - tcp: will try to connect to given server on given port
  # Default is ping
  protocol=ping
  # Port on which ping is done. Default is 80.
  port=80
  # Timeout in seconds before saying we failed to connect to monitored server.
  # Default is 5
  timeout=5

  [actions]
  # Once network failure is detected, ping is done every 10s.
  # Actions are taken after a number of consecutive failures.
  # If parameter is 0, corresponding action will never be taken.
  # Number of failed pings before reconnecting service (using connman command). Default is 3.
  ping_error_before_service_reconnect=3
  # Number of failed pings before restarting Ofono and Connman servers. Default is 5.
  ping_error_before_servers_restart=20
  # Number of failed pings before restarting Ofono and Connman servers and eth and modems hardware. Default is 10
  ping_error_before_network_hardware_reboot=50
  # Number of failed pings before restarting board. Default is 20
  # If not 0, this parameter should be more than all previous action parameters
  ping_error_before_board_reboot=100
  # Number of failed pings before actions are retried. Default is 0 (not done)
  ping_error_before_reset_to_first_action=0

In most cases, it is advised to use the same server address as the one in ''/etc/network/connman/main.conf''.\\
In most cases, it is advised to use the LNS address for the monitoring.\\
It is strongly advised to check that the ping to the server works before modifying the configuration file.

If ''ping_error_before_board_reboot'' is 0, there is no board reboot. Once all actions have been executed, the script stops. In that case, it is advised to set the ''ping_error_before_reset_to_first_action'' parameter to a value greater than all previous action parameters.
This resets the cycle to the first action, so that actions are performed in a loop without a reboot.

By default, the script is **enabled**. It takes the first action after 20 minutes and checks the connectivity by pinging ''8.8.8.8''. Make sure that the network in which the gateway is installed allows this ping, otherwise the gateway may reboot because of this script. If it does not, change the server configured in this file, or disable the script.

==== Actions execution ====

Since the behaviour of ConnMan and ''networkmonitoring.py'' is a bit complicated, an example of a classical network failure is described in this section.

Below is the configuration of ConnMan and ''networkmonitoring.py'' used in this example (comments removed):

  [General]
  DefaultAutoConnectTechnologies = ethernet, cellular
  PreferredTechnologies = ethernet, cellular
  EnableOnlineCheck = true
  OnlineCheckUseConnmanHeaders = false
  OnlineCheckServerIpV4Url = http://myServer123456.com

  [general]
  monitor_network=1
  first_check_delay=30
  check_interval=1200
  log_level=1
  monitor_external_usb=0

  [ping]
  server=myServer123456.com
  protocol=ping
  port=80
  timeout=5

  [actions]
  ping_error_before_service_reconnect=3
  ping_error_before_servers_restart=5
  ping_error_before_network_hardware_reboot=10
  ping_error_before_board_reboot=20

**Example:**

  * Initial state:
    * The gateway has an Ethernet connection (eth0) and a cellular modem (ppp0, not shown by ifconfig) ready to be used.
    * The default route is Ethernet and this route can access the server.
  * The connection to the server is lost (network problem on eth0):
    * The script detects the problem.
    * The default service is restarted.
    * ConnMan detects another service (ppp0) with access to the server.
    * This service becomes the default one.
  * The network problem on eth0 disappears:
    * The default route stays on ppp0 (since ''networkmonitoring.py'' monitors the default route and the default route is working properly, no action is taken).
  * The connection to the server is lost (network problem on ppp0):
    * The script detects the problem.
    * The default service is restarted.
    * ConnMan detects another service (eth0) with access to the server.
    * This service becomes the default one.
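
The escalation behind the step "the default service is restarted" can be pictured with a small Python model of the ''[actions]'' thresholds from the example configuration. This is only a sketch, not the code of ''networkmonitoring.py'': the function and action names are hypothetical, and it assumes that each action fires once, when the count of consecutive failed checks reaches its threshold.

```python
from typing import List, Optional, Tuple

# Thresholds copied from the example [actions] section of this page,
# ordered from the mildest action to the most drastic one.
# The short action names are hypothetical labels for this sketch.
EXAMPLE_ACTIONS: List[Tuple[str, int]] = [
    ("service_reconnect", 3),
    ("servers_restart", 5),
    ("network_hardware_reboot", 10),
    ("board_reboot", 20),
]


def action_for_failure_count(
    failures: int, actions: List[Tuple[str, int]] = EXAMPLE_ACTIONS
) -> Optional[str]:
    """Return the action due after `failures` consecutive failed checks.

    Assumption: an action fires when the counter equals its threshold.
    A threshold of 0 disables the action, as the configuration comments state.
    """
    for name, threshold in actions:
        if threshold != 0 and failures == threshold:
            return name
    return None


def timeline(
    max_failures: int, actions: List[Tuple[str, int]] = EXAMPLE_ACTIONS
) -> List[Tuple[int, str]]:
    """List (failure_count, action) pairs for a link that never recovers."""
    events = []
    for count in range(1, max_failures + 1):
        action = action_for_failure_count(count, actions)
        if action is not None:
            events.append((count, action))
    return events


if __name__ == "__main__":
    for count, action in timeline(20):
        print(f"after {count} consecutive failures: {action}")
```

In the walkthrough above, the connection recovers through the other service right after the first action, so under this model only the first threshold (3 failed checks) would be reached before the failure counter stops growing.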