The Casablanca Method of Server Monitoring

By: Joey Robichaux

Server monitoring tools can capture such a wide variety of server metrics that many people find it difficult to decide which ones they should monitor.

Actually, actor Claude Rains gave us the answer back in 1942. In the movie "Casablanca", Rains's character "Captain Renault" announces "Round up the usual suspects" -- and that's exactly where your server monitoring should begin!

Who are the usual suspects? Simple enough -- the main components of our server are the CPU, Memory, and the Disk. Applications can become CPU bound, Memory bound, or Disk bound -- so we should collect metrics that reflect this.

Our CPU metric usual suspects should be:

Processor Queue Length -- How many jobs are waiting to run? (Note: The CPU Load Average available on Unix boxes is a better metric since it's time-weighted.)

Processor Utilization -- On a multi-processor machine, is the workload evenly distributed? Or is one processor pegged while the others are lightly used?

Context Switch Rate -- My candidate for the single most useful "usual suspect". Context Switch Rate offers a good measure of how "busy" your machine is -- or, how much time is wasted on overhead rather than being used for application processing.

CPU Utilization -- Not as useful as you might think, but can be useful if you can use it to identify what processes are running at the moment.
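
As a rough illustration of the CPU suspects above, here is a minimal sketch that assumes a Linux box, where /proc/stat exposes per-processor tick counters, a cumulative context switch count ("ctxt"), and the number of runnable tasks ("procs_running"). The function names are my own, not part of any monitoring product, and the runnable-task count is only a stand-in for Processor Queue Length.

```python
import time

def parse_proc_stat(text):
    """Parse /proc/stat-style text into per-CPU (busy, total) ticks plus counters."""
    cpus, counters = {}, {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        name = fields[0]
        if name.startswith("cpu") and name[3:].isdigit():
            # Columns: user nice system idle iowait irq softirq steal
            ticks = [int(v) for v in fields[1:9]]
            idle = ticks[3] + ticks[4]  # treat iowait as not busy
            cpus[name] = (sum(ticks) - idle, sum(ticks))
        elif name in ("ctxt", "procs_running"):
            counters[name] = int(fields[1])
    return cpus, counters

def cpu_report(prev_text, curr_text, interval_s):
    """Compare two snapshots taken interval_s seconds apart."""
    prev_cpus, prev_cnt = parse_proc_stat(prev_text)
    curr_cpus, curr_cnt = parse_proc_stat(curr_text)
    util = {}
    for name, (busy, total) in curr_cpus.items():
        if name not in prev_cpus:
            continue
        prev_busy, prev_total = prev_cpus[name]
        elapsed = total - prev_total
        util[name] = 100.0 * (busy - prev_busy) / elapsed if elapsed else 0.0
    return {
        "per_cpu_util_pct": util,
        "ctx_switches_per_s": (curr_cnt["ctxt"] - prev_cnt["ctxt"]) / interval_s,
        "runnable": curr_cnt["procs_running"],
    }

def sample_cpu(interval_s=1.0, path="/proc/stat"):
    """Take two live snapshots interval_s apart (Linux only)."""
    with open(path) as f:
        before = f.read()
    time.sleep(interval_s)
    with open(path) as f:
        after = f.read()
    return cpu_report(before, after, interval_s)
```

Calling sample_cpu() on a Linux machine returns one reading; a lopsided "per_cpu_util_pct" is the "one processor pegged" case described above.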

Memory usual suspects are:

Pages/Sec -- Time spent paging is time wasted.

Pages Out/Sec (or Swaps Out/Sec) -- Paging out (or swaps out) suggests desperation; active applications get swapped out because memory has become dear.

Memory Utilization -- Paging is more critical, but Total Memory Utilization is useful for capacity planning and trending.
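
The memory suspects can be sketched the same way. This example assumes Linux, where /proc/vmstat carries cumulative "pswpin" and "pswpout" counters and /proc/meminfo reports "MemTotal" and "MemAvailable" (the latter only on reasonably recent kernels). The helper names are hypothetical, and the swap counters are used here as a stand-in for Swaps In/Sec and Swaps Out/Sec.

```python
def parse_counters(text):
    """Parse 'name value' lines (the /proc/vmstat layout) into a dict of ints."""
    out = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            out[parts[0]] = int(parts[1])
    return out

def swap_rates(prev_text, curr_text, interval_s):
    """Pages swapped in and out per second between two /proc/vmstat snapshots."""
    prev, curr = parse_counters(prev_text), parse_counters(curr_text)
    return {key: (curr[key] - prev[key]) / interval_s
            for key in ("pswpin", "pswpout")}

def memory_utilization_pct(meminfo_text):
    """Percent of memory in use, from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if rest.strip():
            fields[name.strip()] = int(rest.split()[0])  # values are in kB
    return 100.0 * (1 - fields["MemAvailable"] / fields["MemTotal"])
```

A sustained non-zero "pswpout" rate is the "desperation" signal; the utilization figure is the one to log for trending.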

Disk usual suspects are:

Disk Queue Length -- How many jobs are waiting to go to disk?

% Disk Time -- How much time is spent accessing the disk?
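
For the disk suspects, here is a similar sketch, again assuming Linux. In /proc/diskstats, each device line carries (among other fields) the number of I/Os currently in progress, the milliseconds spent doing I/O, and a weighted milliseconds figure; the deltas of the last two give something close to % Disk Time and an average Disk Queue Length. The function names are my own invention.

```python
def parse_diskstats(text):
    """Parse /proc/diskstats-style text into per-device counters."""
    disks = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 14:
            continue  # skip short or malformed lines
        # f[0]=major, f[1]=minor, f[2]=device name, f[3:] = statistics
        disks[f[2]] = {
            "in_flight": int(f[11]),    # I/Os currently in progress
            "io_ms": int(f[12]),        # ms the device spent doing I/O
            "weighted_ms": int(f[13]),  # ms weighted by outstanding I/Os
        }
    return disks

def disk_report(prev_text, curr_text, interval_s):
    """Busy percentage and average queue length between two snapshots."""
    prev, curr = parse_diskstats(prev_text), parse_diskstats(curr_text)
    interval_ms = interval_s * 1000.0
    report = {}
    for name, now in curr.items():
        if name not in prev:
            continue
        before = prev[name]
        report[name] = {
            "busy_pct": 100.0 * (now["io_ms"] - before["io_ms"]) / interval_ms,
            "avg_queue_len": (now["weighted_ms"] - before["weighted_ms"]) / interval_ms,
            "in_flight_now": now["in_flight"],
        }
    return report
```

A device that stays near 100 percent busy with a growing queue is one that your applications are waiting on.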

This small set of machine metrics makes up a very effective group of usual suspects. Basically, you're looking for situations where your machine is waiting for something. "Waiting" means work isn't happening!

You can wait for CPU, you can wait for memory, and you can wait for disk access. When you're waiting for a resource, your system overhead increases -- which is reflected in spiking Context Switch rates (the reason I find it the most effective metric). It's possible for a machine that's only using 50 or 60 percent of the CPU to be completely hung because of system overhead.
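
One simple way to notice a "spiking" Context Switch rate is to compare each new reading against a short rolling baseline. The sketch below is only one possible approach; the class name, the window size, and the multiplier are assumptions you would tune for your own machines.

```python
from collections import deque
from statistics import mean

class SpikeDetector:
    """Flag a reading that exceeds `factor` times the mean of recent readings."""

    def __init__(self, window=5, factor=3.0):
        self.history = deque(maxlen=window)  # most recent readings only
        self.factor = factor

    def observe(self, value):
        """Record a reading; return True if it looks like a spike."""
        baseline_ready = len(self.history) == self.history.maxlen
        is_spike = baseline_ready and value > self.factor * mean(self.history)
        self.history.append(value)
        return is_spike
```

Feed it one context-switches-per-second reading per sampling interval. Note that a spike is itself added to the history, so the baseline rises for a few samples afterward; whether that is desirable depends on how long your spikes tend to last.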

Quite often, you'll see symptoms ripple through these metrics when situations occur.

For instance, suppose you have an application that's spending a great amount of time writing to disk. This causes the disk queues to back up (because all requests can't get through at once) and the Disk Time Percent to increase.

Because of the Disk backup, disk writes get posted to memory cache. This decreases the amount of free memory available -- meaning paging increases. The system overhead involved in managing this situation means Context Switches go up.

The end result -- no work gets out; everyone is waiting for something. Your CPU may not even be spiking -- because you're not waiting on CPU, you're waiting on Disk or Memory! If you're able to capture the processes running at this time, you can identify the jobs responsible for this situation.
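
The ripple described above can be turned into a crude triage rule: check each usual suspect against a limit and report which resource the machine appears to be waiting on. This is a sketch only; the metric names and every threshold in the usage example are made-up placeholders, not recommended values.

```python
def over_limit(metrics, limits):
    """Return the names of metrics whose value exceeds its configured limit."""
    return [name for name, limit in limits.items()
            if metrics.get(name, 0) > limit]

def waiting_on(metrics, limits):
    """Guess which resources the machine is waiting for."""
    hot = set(over_limit(metrics, limits))
    suspects = []
    if hot & {"disk_queue_len", "disk_busy_pct"}:
        suspects.append("disk")
    if "swap_out_per_s" in hot:
        suspects.append("memory")
    if "run_queue_len" in hot:
        suspects.append("cpu")
    return suspects
```

For example, with hypothetical readings that match the disk-bound scenario:

```python
limits = {"disk_queue_len": 2, "disk_busy_pct": 90, "swap_out_per_s": 0,
          "run_queue_len": 4, "ctx_switches_per_s": 20000}
readings = {"disk_queue_len": 12, "disk_busy_pct": 98, "swap_out_per_s": 150,
            "run_queue_len": 1, "ctx_switches_per_s": 40000, "cpu_util_pct": 55}
print(waiting_on(readings, limits))
```

Here the CPU sits at a modest 55 percent, yet the rule points at disk and memory, which is exactly the situation where CPU Utilization alone would mislead you.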

Specific applications may require special monitoring -- databases, daemons, numbers of users, etc. However, for general system monitoring, it's hard to go wrong if you simply "Round up the usual suspects"!
