In the area I work in, one of our duties is to maximize the availability of the network management systems (NMSs). They must be running almost all the time because the NOCs (network operation centers) monitor them 24x7.
But that's not an easy task. Sometimes the planning area delivers the NMS implementation with many flaws; other times the machine you get is not what you expected, or the application server freezes continuously and researching the root cause can take weeks.
As we're not big fans of attending service disruption calls at 3am, we deployed a nice and useful service on all our Linux machines: Monit. This program monitors the existence, availability and performance of services and takes automated actions when rules or thresholds are exceeded.
For example, we had a Tomcat container that was freezing several times a week. The rule for Monit was something like this:
check process tomcat5 with pidfile /var/run/tomcat5.pid
    group tomcat5
    start program = "/etc/init.d/tomcat5 start" with timeout 120 seconds
    stop program = "/etc/init.d/tomcat5 stop" with timeout 120 seconds
    if failed host 127.0.0.1 port 8080 protocol HTTP
        request /archivos/gestion.jpg
        TIMEOUT 3 SECONDS
        then restart
    if cpu usage > 95% for 10 cycles then restart
    if 5 restarts within 5 cycles then timeout
So this loyal automated employee checks port 8080, checks that the HTTP request for gestion.jpg is answered in under 3 seconds, and checks that the CPU usage of the process has not stayed above 95% for 10 cycles in a row. If Monit sees any of these rules broken, it restarts the service. If it ends up restarting the service 5 times within 5 cycles, it gives up and stops monitoring it. As the good employee he is, Monit sends an email notification for every step he takes.
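Both the "cycles" in the rule and the email notifications depend on global settings in Monit's control file. A minimal sketch of those settings, assuming a 60-second polling interval, a mail relay on localhost and a placeholder address (these values are illustrative, not our actual setup):

```
# Length of one cycle: poll all services every 60 seconds (assumed value)
set daemon 60

# Mail relay used to deliver alerts (assumed to run on the same host)
set mailserver localhost

# Recipient of alert emails (placeholder address)
set alert noc@example.com
```

With a 60-second cycle, "for 10 cycles" in the CPU rule would mean roughly 10 minutes of sustained high usage before Monit restarts Tomcat.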
Hope this application is useful for you.
Quick update: Even though Monit is doing a great job, it is important to find the root cause. Regarding the Tomcat issue, I found this useful site for tuning the JVM parameters: http://wiki.alfresco.com/wiki/JVM_Tuning