Ensure and maintain the availability of ServiceNav Boxes

Purpose of this documentation

If the central platform is the heart of ServiceNav, monitoring boxes are the eyes.
A ServiceNav Box (SNB or monitoring box) allows you to:

  • Collect monitoring information on the customer’s LAN or from an external source.
  • Transmit the collected data to the central platform via a VPN tunnel.
  • Send email alerts (regardless of access to the VPN tunnel).
  • Receive instructions given by a user via the web interface (immediate check execution, acknowledgment, application configuration).

It is therefore essential to keep the monitoring boxes available at all times.

This documentation explains how to avoid unavailability of SNBs and how to troubleshoot common issues when they occur.

 

Monitoring the ServiceNav Box

It should come as no surprise that the best way to prevent equipment failure is to monitor it.
Fortunately, ServiceNav does exactly that!

When setting up a ServiceNav Box, the first two actions must be:

  • Self-monitoring of the box via the ServiceNav Box host template “ServiceNav Box – self-monitoring”.
    In terms of resources, it is very important to consider:
    • CPU load: about 1 vCPU per 1000 service checks (varying according to the service checks used). A lack of CPU causes instability of the box and delays in the execution of service checks.
    • RAM: a lack of RAM can prevent service checks from being executed and may cause nagios or openvpn service failures, leading to a shutdown of monitoring.
    • Disk space: a lack of disk space can make the file system read-only, cause instability, or even stop monitoring.
  • Cross-SNB monitoring via another box with the ServiceNav Box host template “ServiceNav Box – Monitoring by monitoring agent”.
    The VSBox-Live-Status check ensures that the monitored box has sent monitoring data in the last X minutes.
    If this check reports CRITICAL, the monitored SNB is no longer sending data to the central platform, and the statuses shown on the web interface are therefore no longer accurate. It is imperative to take action to restore communication.

Important to note: the monitoring and the maintenance of the boxes are the responsibility of the customer.

The ServiceNav Box is described at the end of the following document: Installing an SNB
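
The CPU rule of thumb given for self-monitoring (about 1 vCPU per 1000 service checks) can be turned into a quick sizing check. The sketch below is only an illustration: recommended_vcpus and resource_snapshot are hypothetical helper names, not ServiceNav tools, and the result is a starting point that still depends on the service checks actually used.

```shell
#!/bin/sh
# Hypothetical helpers; they are not part of the ServiceNav tooling.

# Rule of thumb from the text: about 1 vCPU per 1000 service checks.
recommended_vcpus() {
    checks=$1
    # Round up: 1..1000 checks -> 1 vCPU, 1001..2000 -> 2, and so on.
    echo $(( (checks + 999) / 1000 ))
}

# Print a quick snapshot of the three resources to watch (Linux only).
resource_snapshot() {
    echo "vCPUs available: $(nproc)"
    echo "Load average (1/5/15 min): $(cut -d' ' -f1-3 /proc/loadavg)"
    free -m | awk '/^Mem:/ {print "RAM used/total (MB): " $3 "/" $2}'
    df -P / | awk 'NR == 2 {print "Root filesystem use: " $5}'
}
```

For example, recommended_vcpus 2500 prints 3, which can then be compared with the vCPU count reported by resource_snapshot on the box.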

Resolving issues with a ServiceNav Box

Even though monitoring greatly reduces the risk, a ServiceNav Box can still become unavailable.
The next section presents some common scenarios and how to resolve them.

Scenarios

  • Unable to connect to the VPN tunnel, high latency of the box in the VPN tunnel, unexpected connection drops.
    -> Follow the solution: Check network access.
  • All control points are in Unknown state.
    -> Follow the solution: Check network access.
    -> If the problem is still not resolved: follow Restart remoteOperationBox and nagios.
  • The checks performed by a ServiceNav Box have a very old time stamp.
    -> Follow the solution: Restart remoteOperationBox and nagios.
  • The configuration cannot be reloaded on a ServiceNav Box.
    -> Follow the solution: Restart remoteOperationBox and nagios.
  • Acknowledgments are not taken into account.
    -> Follow the solution: Restart remoteOperationBox and nagios.
  • Immediate checks launched from the web interface are not taken into account.
    -> Follow the solution: Restart remoteOperationBox and nagios.

Solutions


Check network access

  1. Check the performance of the SNB (CPU load, RAM, disk space) and increase resources if necessary.
  2. Check that the clock of the box is correct with the date command.
  3. Ensure that no firewall rules have recently been changed or deleted.
  4. Check that the box has outbound access to the ServiceNav VPN port on the central platform:
    For the https://servicenav.io platform -> telnet vpn.servicenav.io $(awk -F '[ ]' 'NR == 42 {print int($3)}' /etc/openvpn/client.conf)
    For the https://azure.servicenav.io platform -> telnet vpn-azure.servicenav.io $(awk -F '[ ]' 'NR == 42 {print int($3)}' /etc/openvpn/client.conf)
    For an on-premise platform -> telnet <ip-public-platform> <port>
    If access is functional, telnet reports a successful connection. If there is no access, make the necessary changes on the firewall.
  5. Make sure that the LAN IP address of the box is not also assigned to another machine on the same network.
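
The VPN port check in step 4 can also be scripted. The sketch below is a hedged alternative to the telnet command, under two assumptions that are not confirmed by this documentation: that the OpenVPN client configuration contains a standard "remote <host> <port>" directive (instead of relying on a fixed line number), and that bash and the coreutils timeout command are installed on the box. The helper names vpn_port_from_conf and tcp_port_open are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers; they are not part of the ServiceNav tooling.

# Print the port of the first "remote <host> <port>" directive found in
# an OpenVPN client configuration file (assumed format).
vpn_port_from_conf() {
    awk '$1 == "remote" && $3 != "" {print int($3); exit}' "$1"
}

# Return 0 if a TCP connection to host ($1) and port ($2) opens within
# 5 seconds. Uses the /dev/tcp feature of bash, so bash must be present.
tcp_port_open() {
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example usage (not executed here):
#   port=$(vpn_port_from_conf /etc/openvpn/client.conf)
#   tcp_port_open vpn.servicenav.io "$port" && echo "VPN port reachable"
```

Like telnet, this only tests a TCP connection; a non-zero return code means the connection was refused, filtered, or timed out, which points to the firewall checks above.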

Restart remoteOperationBox and nagios

The remoteOperationBox process ensures the sending and receiving of messages between the box and the central platform.
If it does not work:

  • Monitoring data collected by the box will no longer be sent to the central platform.
  • All actions performed on the web interface to the box will no longer be received.

The nagios process schedules the control points. It communicates with remoteOperationBox to take into account immediate checks or acknowledgments requested from the web interface.

Perform the following operations:

  • Connect to the ServiceNav Box with an SSH client.
  • Stop the remoteOperationBox process:
    • Run: service remoteOperationBox stop
    • Check that no more processes are running: ps aux | grep remoteOperationBox
    • If remoteOperationBox processes remain, kill them manually: kill <id>, or kill -9 <id> if they do not terminate.
  • Stop the nagios process:
    • Run: service nagios stop
    • Check that no more processes are running: ps aux | grep nagios (stopping nagios can take a little time; repeat the ps command several times).
    • If nagios processes remain, kill them manually: kill <id>, or kill -9 <id> if they do not terminate.
  • At this point, remoteOperationBox and nagios should no longer be running, and no process should appear in the output of the ps command.
  • Restart the nagios service: service nagios start
  • Restart the remoteOperationBox service: service remoteOperationBox start, then check that 6 instances of the service are present.

  • Check on the web interface that the application is functioning again.

If problems persist after restarting both services, please contact ServiceNav Support.
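
The restart procedure above can be sketched as a small script. This is an illustration only, not an official ServiceNav tool: the function names are hypothetical, the retry count and delays (10 tries, 3 and 5 seconds) are arbitrary assumptions, and the script deliberately stops and reports instead of killing leftover processes, so that kill or kill -9 remains a manual decision. It is meant to be run as root on the box.

```shell
#!/bin/sh
# Hypothetical helpers sketching the manual procedure; not ServiceNav tools.

# Count the lines of `ps aux` output (read from stdin) that mention a
# name, ignoring the grep command itself.
count_procs() {
    grep -v 'grep' | grep -c -- "$1" || true
}

# Stop a service, then poll until no matching process remains.
# Returns 1 if processes survive, so they can be killed manually.
stop_and_wait() {
    name=$1
    service "$name" stop
    tries=0
    while [ "$(ps aux | count_procs "$name")" -gt 0 ]; do
        tries=$((tries + 1))
        if [ "$tries" -ge 10 ]; then
            echo "$name still running; kill the remaining PIDs manually" >&2
            return 1
        fi
        sleep 3
    done
}

# Full sequence: stop both services, then start nagios before
# remoteOperationBox, and report the number of running instances.
restart_snb_services() {
    stop_and_wait remoteOperationBox || return 1
    stop_and_wait nagios || return 1
    service nagios start
    service remoteOperationBox start
    sleep 5
    echo "remoteOperationBox instances: $(ps aux | count_procs remoteOperationBox) (expected: 6)"
}
```

Calling restart_snb_services runs the whole sequence. The final check on the web interface still has to be done by hand.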

Ensuring the resumption of activity of a ServiceNav Box

Three use cases:

  • The ServiceNav Box is completely unusable despite a restart: it cannot be reached via SSH or through a local console.
    -> Follow this document: Ensure the availability of a ServiceNav Box, chapter “Complete replacement of a ServiceNav Box”.
  • A faulty, but still accessible, ServiceNav Box must be migrated to a new one.
    -> Follow this document: Migration ServiceNav Box
  • A rollback of the ServiceNav Box must be performed from a backup.
    -> Follow this document: Ensure the availability of a ServiceNav Box, chapter “Rollback from a backup of the ServiceNav Box”.

 
