Checking for Hardware crashes
Did (any one of) the server machine(s) crash?
Check the uptime of each machine
Microsoft Windows® |
Copy the uptime.exe executable to the server and run it from the command line. Note that this executable supports Intel processors; see http://support.microsoft.com/kb/232243#appliesto for additional information and an Alpha version. |
This shows how long the machine has been up. If the uptime is short, the outage was most likely caused by a machine problem. Report this to the system administrator for further investigation.
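When uptime.exe is not to hand, the same check can be sketched in Python. This is a minimal sketch, not a SERVICEPower tool. The live reading assumes Windows and uses the Win32 GetTickCount64 call, which returns milliseconds since the system started. The one-hour window for deciding whether a reboot is "close enough" to the outage is a hypothetical value.

```python
import ctypes
import sys
from datetime import datetime, timedelta


def windows_boot_time() -> datetime:
    """Return the local boot time on Windows, derived from GetTickCount64
    (milliseconds since the system started)."""
    if sys.platform != "win32":
        raise OSError("this helper only works on Windows")
    kernel32 = ctypes.windll.kernel32
    kernel32.GetTickCount64.restype = ctypes.c_ulonglong
    uptime_ms = kernel32.GetTickCount64()
    return datetime.now() - timedelta(milliseconds=uptime_ms)


def rebooted_near_outage(boot_time: datetime, outage_time: datetime,
                         window: timedelta = timedelta(hours=1)) -> bool:
    """True if the machine booted within `window` before the outage was
    noticed, or at any time after it. This hints at a machine-level problem.
    The default window is a hypothetical choice."""
    return boot_time >= outage_time - window
```

On the server itself you would call rebooted_near_outage(windows_boot_time(), outage_time) with the time the outage was first reported.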
Checking for failed processes/start up
Has the system as a whole failed to start?
If one or more of the key processes fails to start, the whole server will not start. If a non-key process does not start, the rest of the system may appear to run smoothly but may in fact require investigation.
Check which process has failed to start
Microsoft Windows® |
If the system was started from a cmd prompt using spsysmon, you will see a message indicating which process failed to start. If SERVICEPower was started as a service, find the cmd prompt running the spsysmon process and view its screen output. |
If the screen output cannot be checked, find the latest system monitor (sysmon) log file, which will state which process failed to start. |
If any process has crashed or failed to start, you will find a recovery directory within the log directory (see Recovered Components). If there is no recovery directory, no process has failed or restarted.
An error that causes a failure to start or restart usually appears near the last entry in a log file. At the very least, these recovery files give you the exact time of the crash from the log file name. For FRU crashes, however, the optimizer log file may not have the same timestamp as the main log file, because it is only written to when an optimizer is run.
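The recovery-directory check can also be scripted. This is a minimal sketch. The directory name "recovery" and the log directory path are assumptions, so check Recovered Components for your actual layout. The sketch reports file modification times. These approximate, but may not exactly match, the timestamps in the file names.

```python
from datetime import datetime
from pathlib import Path


def recovery_files(log_dir, recovery_name="recovery"):
    """List files under the recovery directory (assumed name) inside log_dir,
    oldest first, each paired with its modification time.
    An empty list means there is no recovery directory, i.e. no process has
    failed or restarted."""
    recovery_dir = Path(log_dir) / recovery_name
    if not recovery_dir.is_dir():
        return []
    files = [p for p in recovery_dir.rglob("*") if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime)
    return [(p, datetime.fromtimestamp(p.stat().st_mtime)) for p in files]
```

The newest entry in the returned list is usually the one to open first, since it marks the most recent failure.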
Has any ServiceOptimizer server process crashed? Are there any recovery files on the server?
First, check whether the ServiceOptimizer processes are still present.
Check which processes are running
Microsoft Windows® |
Start Task Manager and click the Processes tab. Note: when the Optimizer is running as a service, the processes appear only in Task Manager; no command prompt windows will be open. |
In most cases the processes will still be running. The system monitor checks the state of all its child processes and automatically restarts any that it finds missing (see System Monitor for further details). Only a very serious problem would prevent a process from restarting and leave some processes absent.
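To check the process list from a script rather than from Task Manager, you can compare the output of the Windows tasklist command against the processes you expect. This is a minimal sketch. "tasklist /FO CSV /NH" is a standard Windows command. The expected image names used in the example (spsysmon.exe, optimizer.exe) are hypothetical and must be replaced with the executables your installation actually runs.

```python
import csv
import io
import subprocess


def running_images(tasklist_csv: str) -> set:
    """Parse `tasklist /FO CSV /NH` output into a set of lower-cased
    image names (the first CSV column of each row)."""
    reader = csv.reader(io.StringIO(tasklist_csv))
    return {row[0].lower() for row in reader if row}


def missing_processes(expected, tasklist_csv: str) -> list:
    """Return the expected image names that do not appear in the process list."""
    running = running_images(tasklist_csv)
    return sorted(name for name in expected if name.lower() not in running)


def current_tasklist() -> str:
    """Windows only: capture the live process list as CSV without a header."""
    return subprocess.run(["tasklist", "/FO", "CSV", "/NH"],
                          capture_output=True, text=True, check=True).stdout
```

On the server this would be used as missing_processes(["spsysmon.exe", ...], current_tasklist()). An empty result means every expected process is present.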
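A toy supervisor can illustrate the check-and-restart behaviour described above. This is not how the SERVICEPower system monitor is implemented, since its internals are not described here. It is a minimal sketch that assumes a hypothetical per-process restart limit (max_restarts). Once that limit is reached, the supervisor gives up, which mirrors the "very serious problem" case where a process stays down.

```python
import subprocess
import time


def supervise(commands, max_restarts=3, poll_interval=0.1, run_for=5.0):
    """Start each command, restart any child that exits (up to max_restarts
    times per child), and stop supervising after run_for seconds.
    Returns the number of restarts performed for each child."""
    procs = {name: subprocess.Popen(cmd) for name, cmd in commands.items()}
    restarts = dict.fromkeys(commands, 0)
    deadline = time.monotonic() + run_for
    try:
        while procs and time.monotonic() < deadline:
            for name in list(procs):
                if procs[name].poll() is None:
                    continue  # child is still running
                if restarts[name] >= max_restarts:
                    del procs[name]  # give up: this child stays down
                    continue
                restarts[name] += 1
                procs[name] = subprocess.Popen(commands[name])
            time.sleep(poll_interval)
    finally:
        # Stop any children still running when supervision ends.
        for proc in procs.values():
            if proc.poll() is None:
                proc.terminate()
                proc.wait()
    return restarts
```

A child that keeps crashing is restarted max_restarts times and then abandoned. A healthy child is never restarted.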
Irrespective of whether the system is now working OK, it is advisable to follow up on the cause of the issue:
Common things to look for when a process crashes
Error | Description/Cause/next steps |
---|
Bad Data in an API | Under normal circumstances ServiceOptimizer returns an error code if it cannot accept the data supplied through an API call. However, if the data seriously corrupts ServiceOptimizer's state, it may need to restart a process to recover. Interface code is often set to retry the API call when an error occurs, and the retry is likely to cause the same problem again. You can spot this by looking in the recovery folder for more than one log file of the same type produced close together. If each file shows a similar error near the bottom of the log and against the same API call (see Checking for failing API calls), this could be the problem. In that case it is important to stop the call from coming through again, because it will just have the same effect. Once you have stopped the call, contact SERVICEPower support to diagnose the root cause (see section 8). |
---|
UNEXPECTED_ERROR | If SERVICEOptimizer manages to catch the error, it reports it in the log file as an UNEXPECTED_ERROR. Find the last entry in the log file, then search upwards for UNEXPECTED_ERROR appearing after the timestamp on a line. The error may have occurred several seconds before the last entry was written. |
---|
No error indication | If SERVICEOptimizer hits an uncaught exception then it may crash without leaving an error behind in the logs. |
---|
Database error | Check that the database is available. A database problem is the most common reason for the whole server not coming back up after an issue. SERVICEOptimizer components that lose connectivity to the database will try to re-establish a connection, and they will crash if they are unsuccessful. If no component can connect to the database, the whole system will remain down. To check client-server connectivity to the SERVICEOptimizer database, try the following from the affected SERVICEOptimizer server (in a cmd window or console). ORACLE (Microsoft Windows®): run tnsping %LOCAL%, then sqlplus <dbuser/password>. SQL Server: if you have a SQL Server client, attempt to connect to the SERVICEOptimizer database; if you cannot, contact your DBA to confirm its state. |
---|
Bad data in the database | If a single process goes down and does not recover, a data issue is the likely cause. This is uncommon. When it does occur, the only remedy is to remove the data from the database, so contact ServicePower Support as your next step. |
---|
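The "Bad Data in an API" row suggests looking in the recovery folder for several log files of the same type produced close together. The sketch below shows one way to do that check. It assumes a hypothetical file naming scheme of <component>_<YYYYMMDD>_<HHMMSS>.log and a ten-minute window. Adjust both to match the files your system actually writes.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical naming scheme: <component>_<YYYYMMDD>_<HHMMSS>.log
NAME_RE = re.compile(r"^(?P<component>.+)_(?P<stamp>\d{8}_\d{6})\.log$")


def repeated_crashes(filenames, window=timedelta(minutes=10)):
    """Group log file names by component and return {component: timestamps}
    for every component with at least two files within `window` of each other.
    Names that do not match the assumed scheme are ignored."""
    by_component = defaultdict(list)
    for name in filenames:
        match = NAME_RE.match(name)
        if match:
            stamp = datetime.strptime(match["stamp"], "%Y%m%d_%H%M%S")
            by_component[match["component"]].append(stamp)
    suspects = {}
    for component, stamps in by_component.items():
        stamps.sort()
        if any(later - earlier <= window
               for earlier, later in zip(stamps, stamps[1:])):
            suspects[component] = stamps
    return suspects
```

A component flagged here is worth opening first. Compare the errors near the bottom of each of its files to see whether they point at the same API call.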
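You can script the UNEXPECTED_ERROR search described in the table by reading the log from the last entry upwards. This is a minimal sketch. It assumes only that the marker text appears literally on the log line. It does not parse timestamps, so you still need to compare the matching line's time with the last entry by eye.

```python
def last_unexpected_error(lines, marker="UNEXPECTED_ERROR"):
    """Scan log lines from the end upwards and return (line_number, text)
    for the nearest line containing `marker`, or None if it never appears.
    Line numbers are 1-based."""
    for index in range(len(lines) - 1, -1, -1):
        if marker in lines[index]:
            return index + 1, lines[index].rstrip("\n")
    return None


def last_unexpected_error_in_file(path):
    """Convenience wrapper that reads a log file and runs the search."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return last_unexpected_error(handle.readlines())
```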
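In the "Database error" row, tnsping only confirms that the Oracle listener for the alias can be reached. It does not test a login, which is why sqlplus is tried afterwards. The minimal Python wrapper below is a sketch. It assumes that tnsping is on the PATH and that it reports success with a line beginning "OK (", as in "OK (20 msec)". If no alias is passed, it falls back to the LOCAL environment variable, matching tnsping %LOCAL%.

```python
import os
import subprocess


def tnsping_succeeded(output: str) -> bool:
    """True if tnsping output contains a success line such as 'OK (20 msec)'.
    Failures show a TNS- error code instead."""
    return any(line.strip().startswith("OK (") for line in output.splitlines())


def check_oracle_alias(alias=None) -> bool:
    """Run tnsping against `alias` (default: the LOCAL environment variable)
    and report whether the listener answered."""
    alias = alias or os.environ.get("LOCAL")
    if not alias:
        raise ValueError("no alias given and LOCAL is not set")
    # Raises FileNotFoundError if tnsping is not on the PATH.
    result = subprocess.run(["tnsping", alias], capture_output=True, text=True)
    return tnsping_succeeded(result.stdout)
```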
Checking Communications
Has Connection from Client/Interface Processes Failed?
This could be due to network traffic issues, which are usually short-term and resolve themselves. The actual effect on the ServiceOptimizer system depends on the nature of the incident and on whether ServiceOptimizer is deployed in a distributed configuration.
Checking System Resource/Filestore Usage
Is there evidence in the O/S logs of a shortage of any resource the instance requires?
Following this up requires O/S knowledge, but Resources may give some pointers.
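One starting point for filestore usage is to check the free space on the volume holding the ServiceOptimizer log directory, using Python's standard library. This is a minimal sketch. The 10% threshold is an arbitrary example, not a SERVICEPower recommendation. The path you pass in is whichever directory you want to check.

```python
import shutil


def low_on_space(path, min_free_fraction=0.10):
    """Return (is_low, free_fraction) for the filesystem that holds `path`.
    is_low is True when the free fraction falls below min_free_fraction."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return True, 0.0
    free_fraction = usage.free / usage.total
    return free_fraction < min_free_fraction, free_fraction
```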
Checking for Database Errors
There are two main types of database error that can occur:
Has the application attempted any illegal operations?
Is there a problem on the database server?
The ServiceOptimizer processes can withstand some temporary database issues, but any lasting issue will cause the affected process to restart. ServiceOptimizer must ensure the integrity of the data it holds in memory and so cannot risk the side effects of an error. It therefore takes the safe route of a restart. Any database errors that occur are recorded in the failed process's log file. Reading Log files gives details of what to look for and what it means.
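Oracle database errors are easy to pick out in the failed process's log file, because they carry codes such as ORA-03113 or TNS-12541. This minimal sketch counts them. It assumes an Oracle back end. SQL Server errors would need a different pattern.

```python
import re
from collections import Counter

# Oracle error codes have the form ORA-nnnnn or TNS-nnnnn.
ORACLE_ERROR = re.compile(r"\b(?:ORA|TNS)-\d{5}\b")


def database_errors(lines):
    """Count each Oracle/TNS error code mentioned in the given log lines."""
    counts = Counter()
    for line in lines:
        counts.update(ORACLE_ERROR.findall(line))
    return counts
```

A single code that dominates the counts, such as a repeated ORA-03113 (end-of-file on communication channel), points towards a problem on the database server rather than an illegal operation by the application.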