AWG Blogs

Thursday, July 14, 2011

SSIM Red Queues


When all three queue bars on the left are completely red and the EPS counts are 0, here are some troubleshooting tips:

- Check the Sesa Agent if it's a collector server
- SSH into the collector server in question
- "su -" (change to root)
- tail the log of whichever collector produces the most EPS or the highest spikes on this server
e.g. "tail -200 /opt/Symantec/sesa/Agent/logs/stonegate.log"
- If you see WARN entries containing "the agent queue is full", you may need to adjust the agent queue configuration (a quick grep for these warnings is sketched after this list): SSIM console: System > Product Configurations > SES > STATE > SSIM Agent and Manager > Agent Configurations > Standard (or the configuration applied to this server) > Logging
- adjust the queue properties as needed, e.g. increase the queue size and flush size, and decrease the queue flush time (so that events are flushed more frequently)
- Save the configuration (the agent should pick up the settings automatically)
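A quick way to gauge how often the agent queue fills is to count the WARN entries in the collector log. This is just a sketch; stonegate.log stands in for whichever collector log you tailed above:
grep -c "the agent queue is full" /opt/Symantec/sesa/Agent/logs/stonegate.log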

- Check the Recv-Q and Send-Q columns in netstat output
- on both the archiver and the correlator do "netstat -anp | grep 100"
- look for signs of downstream bottlenecks
- Examples
- if the archiver is queued up sending to correlator:10010, check the correlator queues. E.g. if the correlator is queued up sending to 127.0.0.1:10080, there may be a problem with the asset service. Check assetsvc.log for errors, try restarting the assetsvc service, and monitor "netstat -anp | grep 100" for movement in the queues (a simple watch loop for this is sketched after this example).
As db2admin, cd ~/sqllib/bin
db2 connect to SESA
db2 list applications show detail
Check the SSIM-ASSET application for signs that it is currently Executing. Let it run for a while to see if it finishes.
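To monitor those queues for movement without re-running netstat by hand, a simple watch loop works (generic Linux, not SSIM-specific):
watch -n 5 'netstat -anp | grep 100'
If the Recv-Q/Send-Q values stay pinned at the same large numbers across refreshes, the downstream service is stuck; if they drain and refill, it is just slow.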
- on the correlator, check "netstat -anp | grep 555" for a large number of connections in FIN_WAIT2
- on the correlator check the icesvc.log, assetsvc.log, and rulesvc.log for recent errors (see /opt/Symantec/simserver/logs)
- Temporarily disable any rules that reference lookup tables to determine whether any of them are causing the backlog.
- Check swap space usage on the correlator. If any swap is in use, reboot the server.
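A couple of standard Linux commands to check swap usage (nothing SSIM-specific here):
free -m
swapon -s
A non-zero "used" value on the Swap line of free means the box has swapped.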
- Check SSIM Event logs for warnings or errors; query all recent logs where product = Symantec Security Information Manager or product = SSIM System
- Do a query on Event Type ID = Conclusion Updated. Verify that no more than a few events per minute are being created
- Check top on both the archiver and the correlator. Look for processes whose VIRT is close to the same size as RES; such a process may be overworked. The memory allocation can be increased in /opt/Symantec/simserver/svclauncher.cfg.
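As an alternative to eyeballing top, a generic ps one-liner lists the biggest processes with virtual and resident sizes side by side; processes where rss is close to vsz are the ones to look at:
ps -eo pid,comm,vsz,rss --sort=-rss | head -15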
- Investigate whether simserver (correlator service) is overloaded.
- ascertain the pid of the simserver in top
- do lsof | grep <simserver pid> | awk '{print $7,$1,$2,$3,$4,$5,$6,$8,$9}' | sort -n (the awk moves the size column to the front so the output sorts numerically by file size)
- look for any unusually large files in the output, e.g. .que files larger than 10MB might indicate an overworked correlator
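A shortcut for spotting the large .que files directly, assuming the queue files live under /opt/Symantec/simserver/queues:
find /opt/Symantec/simserver/queues -name "*.que" -size +10M -exec ls -lh {} \;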
- Do an advanced SQL query in the UI to determine whether there is an unusually large number of incidents for a particular rule:
SELECT count(incident_code), incident_code FROM SYMCMGMT.SYMC_IMR_INCIDENT_LIST_VIEW
group by incident_code
order by count(incident_code) desc
or:
SELECT desc_id, sum(event_count) as sumevtcnt FROM symc_sim_conclusion
where modified_time >= (current timestamp - 1 DAY)
group by desc_id
order by sumevtcnt desc
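If the UI is slow, the second query can also be run from the db2 command line in the same db2admin session shown earlier (a sketch; assumes db2admin can read the SIM schema):
db2 connect to SESA
db2 "SELECT desc_id, sum(event_count) as sumevtcnt FROM symc_sim_conclusion where modified_time >= (current timestamp - 1 DAY) group by desc_id order by sumevtcnt desc"
db2 connect reset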
- Disable any rules with more than 50,000 events in one day. For example, a system rule that can cause issues if not adapted is Internal Port Sweep. Simserver process memory will typically be exhausted after 24 hours, and queues will increase steadily in /opt/Symantec/simserver/queues/ice/input. IIRC the max is three 64MB files here. If there is more than one file, icesvc is backed up due to the number of event tracking updates waiting to be pushed to the database. Verify that SSIM-ICE is "UOW Executing" in db2 list applications show detail.
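To see how backed up icesvc is, just list the input queue directory; per the note above, more than one file sitting there means event tracking updates are waiting:
ls -lh /opt/Symantec/simserver/queues/ice/input
ls -1 /opt/Symantec/simserver/queues/ice/input | wc -l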
- Also check Events > System Queries > SSIM > SSIM System > Count of Conclusions by Rule Name
- In netstat output, look for FIN_WAIT connections associated with large queues in a frozen state for the simserver process (get the pid from `status`). E.g.:
tcp 80384 0 ::ffff:127.0.0.1:10010 ::ffff:127.0.0.1:36446 ESTABLISHED 6131/java
tcp 0 145961 ::ffff:127.0.0.1:36446 ::ffff:127.0.0.1:10010 FIN_WAIT1 -
This could indicate a memory resource problem in the simserver service or another service. Check the target port and cross-reference it against the admin guide to identify the service in question, e.g. 10010 is simserver, 55562 is icesvc, etc.
If it's really in a frozen state, the process may have crashed (i.e. you see 0 EPS). If the Recv-Q/Send-Q varies and EPS varies, this could be indicative of high IO waits. Check the IOWait in the Statistics tile of the GUI. Also check vmstat 1 (a quick example follows below). Sustained IO waits of more than 2% could indicate the drive is not fast enough.
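For the vmstat check, sample for ten seconds or so and read the wa column (the percentage of CPU time spent waiting on IO); this is plain vmstat usage, not SSIM-specific:
vmstat 1 10
Sustained values above 2 in the wa column line up with the 2% threshold mentioned above.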