
14.1 Setting up Auto-Recovery of Components

Checking the system

How often does System Monitor check the health of the other components?

The following mandatory parameters configure the monitoring and restart capabilities of the system. Most have default values, but review them for your specific circumstances: the size of the instance, and whether or not it is distributed, can affect how these restarts apply.

Each process reports back to System Monitor by sending a 'heartbeat' over a PCI connection. Using the parameters below, System Monitor determines whether a component is still sending heartbeats as expected and, if it is not, may kill and restart the component in an attempt to recover it.

Parameter    Description

Heartbeats

mon_heartbeat_interval

Number of seconds between heartbeat messages sent from the components to the master system monitor.

mon_max_missed_heartbeats

Maximum number of heartbeats that can be missed before the monitor investigates.

After (mon_heartbeat_interval * mon_max_missed_heartbeats) seconds the monitor will investigate the "missing" component and recover it if required. 

A longer period means that the system monitor will be more tolerant of missing heartbeats in a very busy system or on an unreliable network, but will take longer to recover a failed component and will take longer to start up the instance.

mon_tol_first_hb_delay

The number of seconds allowed for a process to send the system monitor its first heartbeat. 

How to set this depends on the speed of the machine running the processes and on how many processes start at once: the more processes there are on a slower or smaller machine, the higher this value may need to be. Its minimum sensible setting is mon_heartbeat_interval * mon_max_missed_heartbeats, since that is the earliest that the condition will be detected anyway. It will usually need to be higher than this, because the processes need time to initialize. If the value is set too low, processes started by the system monitor will be terminated during start up, and the system will either fail to initialize or start with an incomplete set of components. In that case, increase the value until the system completes start up successfully.
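The heartbeat arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the numbers are example values, not product defaults.

```python
def detection_window(heartbeat_interval: int, max_missed_heartbeats: int) -> int:
    """Seconds of silence before the monitor investigates a component."""
    return heartbeat_interval * max_missed_heartbeats


def first_hb_delay_is_sensible(first_hb_delay: int,
                               heartbeat_interval: int,
                               max_missed_heartbeats: int) -> bool:
    """True if mon_tol_first_hb_delay is at least the detection window."""
    return first_hb_delay >= detection_window(heartbeat_interval,
                                              max_missed_heartbeats)


# Example values only (assumed for illustration).
print(detection_window(10, 6))                 # 60
print(first_hb_delay_is_sensible(120, 10, 6))  # True
print(first_hb_delay_is_sensible(30, 10, 6))   # False
```

A value that passes this check is only the minimum sensible setting; as noted above, processes usually need additional time to initialize.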

Control

mon_audit_events

Determines whether events reported by the System Monitor to the OS system log are recorded in Database sp419_event_audit.

mon_kill_unresponsive

Should always be set to 'YES'.

mon_max_conc_start

Max number of concurrently starting processes. Used to restrict the number of processes being started at any one time. It is recommended that this parameter is not set to more than twice the number of processors on the SP server machine (lower if SP is not the only process on that machine).

mon_pci_command_retry_limit

Number of retry attempts that should be made for a failed PCI command or state transition.

mon_pci_command_timeout

TCP/IP timeout, in seconds, for monitor PCI commands.

Note that this parameter must be greater than mon_termination_timeout and less than or equal to mon_pci_archive_timeout or the System Monitor will not start.

mon_termination_timeout

Number of seconds to wait for a process to terminate. Used when a ServiceOptimizer process fails. 

Note that this parameter must be smaller than the value for mon_pci_command_timeout or the System Monitor will not start.
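The ordering constraints on the timeout parameters, and the guidance for mon_max_conc_start, can be checked before the values are applied. The sketch below is a hypothetical helper, not part of the product; it takes plain values and does not read the system parameters itself.

```python
def check_timeouts(termination_timeout: int,
                   pci_command_timeout: int,
                   pci_archive_timeout: int) -> list[str]:
    """Return a list of violated constraints (empty if the values are valid)."""
    problems = []
    if not termination_timeout < pci_command_timeout:
        problems.append(
            "mon_termination_timeout must be smaller than mon_pci_command_timeout")
    if not pci_command_timeout <= pci_archive_timeout:
        problems.append(
            "mon_pci_command_timeout must be less than or equal to "
            "mon_pci_archive_timeout")
    return problems


def max_conc_start_ceiling(processor_count: int) -> int:
    """Recommended upper bound for mon_max_conc_start: twice the processors."""
    return 2 * processor_count


# Example values only (assumed for illustration).
print(check_timeouts(30, 60, 120))  # []
print(check_timeouts(60, 60, 120))  # one problem reported
print(max_conc_start_ceiling(4))    # 8
```

An empty list means the three timeouts satisfy the ordering that System Monitor requires in order to start.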

Distributed Optimizer (see 14.1 Distributed System)

mon_controlling_host

Identifies the host computer that runs the master system monitor. Must be set to a host_machine entry from Database sp416_host_machines 
(For a non-distributed server this must be left set to NULL or 'localhost'.)

mon_host_check_interval

Number of seconds between checks for lost/missing hosts in a distributed server.

mon_ping_timeout

Timeout applied when the master system monitor is attempting to establish communications with a missing slave system monitor.

mon_slave_conn_attempt_limit

Number of attempts that should be made by the master system monitor to connect to a slave system monitor before giving up (on system startup).

mon_slave_connect_retry_delay

Delay in seconds between attempts by the master system monitor to connect to a slave system monitor (on system startup).
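As a rough guide to how the two startup connection parameters combine, the sketch below estimates how long the master system monitor spends waiting between attempts before it gives up. It assumes the delay applies only between consecutive attempts and ignores the time each attempt itself takes; the source does not confirm either point, and the function name is hypothetical.

```python
def startup_retry_wait(conn_attempt_limit: int, retry_delay: int) -> int:
    """Seconds spent waiting between attempts (assumed: delay between attempts only)."""
    if conn_attempt_limit < 1:
        raise ValueError("conn_attempt_limit must be at least 1")
    return (conn_attempt_limit - 1) * retry_delay


# Example values only (assumed for illustration).
print(startup_retry_wait(5, 10))  # 40
```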

Alerting on failure

The database tables sp410_component_alerts and sp413_alert_arguments can be set up to define an alert process, which is invoked when a component has failed and cannot be restarted. If the master System Monitor itself fails, it will also invoke the alert process (if it can). This is a user-written process that could, for example, email the relevant system administrator, log the problem, or take some other appropriate user-defined action.
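A user-written alert process can be very small. The sketch below is one hypothetical shape for it in Python. It assumes the configured alert arguments reach the process as command-line arguments, which this page does not confirm, and it only logs the problem to standard error; a real process might send an email instead.

```python
import logging
import sys


def build_alert_message(args: list[str]) -> str:
    """Turn the received arguments into a single log line."""
    if not args:
        return "component failure reported (no details supplied)"
    return "component failure reported: " + " ".join(args)


def main(argv: list[str]) -> int:
    # Log to standard error; replace with email or file logging as required.
    logging.basicConfig(level=logging.ERROR,
                        format="%(asctime)s %(levelname)s %(message)s")
    logging.error(build_alert_message(argv[1:]))
    return 0


if __name__ == "__main__":
    main(sys.argv)
```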

Logging on Restart

What happens to the logfiles when a component restarts?

14.1 Application Logfiles covers this topic in detail, including the setup of the mandatory system parameters that drive the behaviour.