Setting up Auto-Recovery of Components
Checking the system
How often does System Monitor check the health of the other components?
The following mandatory parameters configure the monitoring and restart capabilities of the system. Most have default values, but review them for your specific circumstances: the size of the instance, and whether or not it is distributed, can affect how these restarts apply.
Each process communicates back to System Monitor via a 'heartbeat' using a PCI connection. Based on the parameters below, System Monitor determines whether a component is still sending heartbeats as expected and, if it is not, may kill and restart the component in an attempt to recover it.
Parameter | Description |
---|---|
Heartbeats | |
mon_heartbeat_interval | Number of seconds between heartbeat messages sent from the components to the master system monitor. |
mon_max_missed_heartbeats | Maximum number of heartbeats a component can miss before System Monitor investigates it. |
mon_tol_first_hb_delay | The number of seconds allowed for a process to send the system monitor its first heartbeat. |
Control | |
mon_audit_events | Determines whether events reported by the System Monitor to the OS system log are recorded in Database sp419_event_audit. |
mon_kill_unresponsive | Should always be set to 'YES'. |
mon_max_conc_start | Max number of concurrently starting processes. Used to restrict the number of processes being started at any one time. It is recommended that this parameter is not set to more than twice the number of processors on the SP server machine (lower if SP is not the only process on that machine). |
mon_pci_command_retry_limit | Number of retry attempts that should be made for a failed PCI command or state transition. |
mon_pci_command_timeout | The number of seconds for TCP/IP timeout for monitor PCI commands. |
mon_termination_timeout | Number of seconds to wait for a process to terminate. Used when a ServiceOptimizer process fails. |
Distributed Optimizer (see Distributed System) | |
mon_controlling_host | Identifies the host computer that runs the master system monitor. Must be set to a host_machine entry from Database sp416_host_machines. |
mon_host_check_interval | Number of seconds between checks for lost/missing hosts in a distributed server. |
mon_ping_timeout | Timeout applied when master system monitor is attempting to establish communications with a missing slave system monitor. |
mon_slave_conn_attempt_limit | Number of attempts that should be made by the master system monitor to connect to a slave system monitor before giving up (on system startup). |
mon_slave_connect_retry_delay | Delay in seconds between attempts by the master system monitor to connect to a slave system monitor (on system startup). |
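
The way the heartbeat parameters interact can be illustrated with a minimal Python sketch. It is not System Monitor's actual logic: the class and function names are hypothetical, and it assumes that a component is investigated once it has missed more than mon_max_missed_heartbeats whole heartbeat intervals, or has not sent its first heartbeat within mon_tol_first_hb_delay seconds. The last function simply applies the recommendation that mon_max_conc_start should not exceed twice the number of processors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatSettings:
    """Hypothetical container for the heartbeat parameters in the table."""
    heartbeat_interval: float   # mon_heartbeat_interval, in seconds
    max_missed_heartbeats: int  # mon_max_missed_heartbeats
    first_hb_delay: float       # mon_tol_first_hb_delay, in seconds


def needs_investigation(settings: HeartbeatSettings,
                        started_at: float,
                        last_heartbeat_at: Optional[float],
                        now: float) -> bool:
    """Return True if a component looks unresponsive (assumed semantics)."""
    if last_heartbeat_at is None:
        # No heartbeat yet: allow the start-up grace period to elapse first.
        return (now - started_at) > settings.first_hb_delay
    # Count the whole heartbeat intervals that have passed in silence.
    missed = int((now - last_heartbeat_at) // settings.heartbeat_interval)
    return missed > settings.max_missed_heartbeats


def recommended_max_conc_start(processor_count: int) -> int:
    """Upper bound suggested above: twice the number of processors."""
    return 2 * processor_count
```

Under these assumptions, a 10 second interval with a limit of 3 missed heartbeats means a component is flagged roughly 40 seconds after its last heartbeat. Use a lower value than `recommended_max_conc_start` returns if other software shares the server.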
Alerting on failure
The database tables sp410_component_alerts and sp413_alert_arguments can be set up to define an alert process, which is invoked when a component has failed and cannot be restarted. If the master System Monitor itself fails, it will also invoke the alert process (if it can). The alert process is user-written and could, for example, email the relevant system administrator, log the problem, or take some other appropriate user-defined action.
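
As an illustration only, a user-written alert process could be as small as the Python sketch below, which appends a timestamped line to a log file. How the arguments defined in sp413_alert_arguments reach the process is not described here, so the sketch assumes they arrive as command-line arguments; the function names and the log file name are hypothetical.

```python
from datetime import datetime, timezone


def format_alert(arguments, timestamp):
    """Build one log line from the alert arguments (assumed to be strings)."""
    details = " ".join(arguments) if arguments else "(no arguments supplied)"
    return f"{timestamp.isoformat()} component alert: {details}"


def main(argv, log_path="component_alerts.log"):
    """Append the alert to a log file; log_path is a hypothetical default."""
    # argv[0] is the script name; the rest are assumed to be alert arguments.
    line = format_alert(argv[1:], datetime.now(timezone.utc))
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(line + "\n")
    return 0
```

A wrapper script would call `main(sys.argv)`. Sending an email instead of writing to a log would follow the same shape, with the file write replaced by a mail call.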
Logging on Restart
What happens to the logfiles when a component restarts?
Application Logfiles covers this topic in detail, including the setup of the mandatory system parameters that drive the behaviour.