iostat command, /proc/stat

PXE2 · posted on 2005-1-9 23:00:55

Example for the external program interface
So, you want to do it yourself. First of all, make sure you have a clear picture of what you want to achieve. Make sure that you know how to access the data sources that you want to monitor.

The next thing you need is a script. Use a scripting language that you're familiar with. Some will use Perl, some will use bash. It doesn't really matter what you use as long as you can get the job done.

I'll try to explain how I monitor processor usage on a Linux box without using SNMP. I'll be using the proc file system, the file /proc/stat to be exact. Don't say this is of no use to you because you are using NT. This web page is not there to provide you with a script; it is there to show you how to develop your own script. Use your own imagination and creativity, or hire someone to do so.

This file contains, among other things, the usage of the cpu(s) in hundredths of a second. It has fields for user, nice, system and idle time. Added up, they give the total uptime of the system (on a single-processor machine).

After the last involuntary power down (thank you, electricity supplier!) the output of "uptime" looks like this:

10:47pm  up 6 days,  8:31,  4 users,  load average: 0.72, 1.12, 1.11

The first line of /proc/stat is:
cpu  13364287 0 1679284 39864010
First of all, check the numbers and verify that you understand what you want to do. The machine has been up for 6 days, 8 hours and 31 minutes. That would be 6*24*60*60 + 8*60*60 + 31*60 seconds = 549060 seconds. /proc/stat tells us it was up for (13364287 + 0 + 1679284 + 39864010)/100 = 54907581/100 = 549075.81 seconds. The difference of about 15 seconds is easily explained: uptime doesn't display seconds.
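The same check can be done with a few lines of shell. This is only a sketch: the function name ticks_to_seconds is made up for this example, and it assumes the four-field "cpu" line shown above on a single-processor machine. Newer kernels add more fields to that line, so the sum would need adjusting there.

```shell
# Hypothetical helper: sum the four counters on the aggregate "cpu" line
# and convert hundredths of a second to seconds.
ticks_to_seconds() {
    awk '/^cpu / { printf "%.2f\n", ($2 + $3 + $4 + $5) / 100 }'
}

# Sample line quoted in the text; on a live box, feed it /proc/stat instead.
echo "cpu  13364287 0 1679284 39864010" | ticks_to_seconds
```

On a real machine you would run `ticks_to_seconds < /proc/stat` and compare the result with what uptime reports.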

Now we need to know which number represents what. Knowing that it would be there, I started looking in the kernel sources. I stumbled upon the file fs/proc/root.c where the "file" /proc/stat is defined. This has something to do with PROC_STAT which in its turn led me to the file fs/proc/array.c procedure get_kstat. In there, the cpu info is easily spotted and as it turns out the order is:
user nice system idle

What happens is that the elapsed time is split into three variables; added up, they represent the useful time spent by the processor, and when we add idle time to those three, we end up with the total elapsed time again. This means that we get three counters (which count in hundredths of a second) while MRTG can only display two. Problem? Not really: just add the user time and the nice time. Now you have two variables and are left with the remainder. If you stack one value on top of the other, you can split the time in three: first graph a green area representing user time. Then graph a blue line representing the sum of user and system time (so it looks as if system time is stacked on top of user time). The remaining white space in the graph now represents idle time.

The counters we get from our Linux box count hundredths of a second. This is very nice, as we will see in the next part, where I show what MRTG does with the values we feed it. Every time interval (let's say one second) is split into user, nice, system and idle time. The deltas between two samples, divided by the elapsed time, therefore represent fractions of that time. However, as the counters are multiplied by 100, they really represent a percentage of the time!
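To see the percentage idea in isolation, here is a small sketch. The function name cpu_percentages and the two sample lines are invented for illustration (they are not from my machine). It normalises by the total number of ticks gained rather than by wall-clock time, which should give the same answer as long as the counters advance 100 ticks per second in total.

```shell
# Hypothetical helper: read two "cpu" lines (older sample first) on stdin
# and print each state's share of the elapsed ticks as a percentage.
cpu_percentages() {
    awk '/^cpu / {
            n++
            # After the second line, d[] holds new minus old for each field.
            for (i = 2; i <= 5; i++) { d[i] = $i - prev[i]; prev[i] = $i }
         }
         END {
            total = d[2] + d[3] + d[4] + d[5]
            if (n < 2 || total == 0) exit 1
            printf "user %.0f nice %.0f system %.0f idle %.0f\n",
                   100 * d[2] / total, 100 * d[3] / total,
                   100 * d[4] / total, 100 * d[5] / total
         }'
}

# Made-up samples taken one second apart (100 ticks in total).
printf 'cpu 1000 0 500 8500\ncpu 1030 0 510 8560\n' | cpu_percentages
```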

I will get some sample values from my machine. They are not five minutes apart, as they normally would be, but this really doesn't matter. The method used by MRTG to calculate the "rate" compensates for this.

Fri Mar  3 00:06:31 CET 2000 cpu  5513569 2263 322134 561389390
Fri Mar  3 00:11:14 CET 2000 cpu  5517438 5778 322546 561409848

What we would feed to MRTG:
at 00:06:31 we would feed 5513569+2263 and 5513569+2263+322134
at 00:11:14 we would feed 5517438+5778 and 5517438+5778+322546

MRTG will calculate the differences:
delta in =  5523216 -  5515832 = 7384
delta out  =  5845762 -  5837966 = 7796
delta time = 00:11:14 - 00:06:31 =  283

The calculated "rates" will be:
in:  7384/283 = 26
out: 7796/283 = 27
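The arithmetic above can be replayed in the shell. This is a sketch only: rate_per_second is a name I made up, not something MRTG provides, and the result is truncated to a whole number simply to match the figures shown here. How MRTG itself rounds is not covered.

```shell
# Hypothetical helper, not part of MRTG.
# usage: rate_per_second <previous_counter> <current_counter> <elapsed_seconds>
rate_per_second() {
    awk -v prev="$1" -v cur="$2" -v dt="$3" \
        'BEGIN { printf "%d\n", (cur - prev) / dt }'
}

# The two samples from the text, 283 seconds apart.
rate_per_second 5515832 5523216 283   # "in":  user + nice
rate_per_second 5837966 5845762 283   # "out": user + nice + system
```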

It's time to create a script. This is not too hard. I chose to use awk, the programmable filter for Unix. First of all, get the stats. We need one line from the output of 'cat /proc/stat'. On a multiprocessor machine this file starts with:

cpu  <total user> <total nice> <total system> <total idle>
cpu0 <cpu0  user> <cpu0  nice> <cpu0  system> <cpu0  idle>
cpu1 <cpu1  user> <cpu1  nice> <cpu1  system> <cpu1  idle>

We need only the total so we use:
awk '/cpu /' < /proc/stat
(mind the space after "cpu" to filter out cpu0, cpu1 and so on).
Next step is to add up user and nice:
awk '/cpu / {print $2+$3}' < /proc/stat

Next few steps: output the first+second value, the uptime and the hostname, each on a separate line.
awk '/cpu / {print $2+$3; print $2+$3+$4; print "quite some time"; print "myname"}' < /proc/stat
The resulting output does conform to the MRTG specifications.
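If you prefer something you can try without a /proc file system at hand, the one-liner can be wrapped in a function that reads from standard input. This is just a sketch: the name cpu_for_mrtg and its two arguments are my own invention, and the pattern is anchored with ^ as a slightly stricter variant of the one used above.

```shell
# Hypothetical wrapper: print the four lines MRTG expects from an
# external program. Reads /proc/stat-style text on stdin.
# $1 = uptime string, $2 = target name (both free-form text).
cpu_for_mrtg() {
    awk -v up="$1" -v name="$2" '
        /^cpu / {
            print $2 + $3          # first value: user + nice
            print $2 + $3 + $4     # second value: user + nice + system
            print up               # third line: uptime text
            print name             # fourth line: name of the target
        }'
}

# Sample data from the text; the cpu0 line is there to show it is skipped.
printf 'cpu  5513569 2263 322134 561389390\ncpu0 5513569 2263 322134 561389390\n' |
    cpu_for_mrtg "quite some time" "myname"
```

On the monitored box itself the call would be `cpu_for_mrtg "quite some time" "myname" < /proc/stat`.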

The next thing to write is a config file for MRTG. As the maximum "rate" seen by MRTG will never exceed 100, this will be the MaxBytes setting. The maximum is 100 for this reason: suppose all available time on the processor has been allocated to the system. In that case, the system time counter will have grown by 100*<time lapsed>. MRTG will divide that growth by <time lapsed> and will therefore end up with 100!

The other items in the config file are just to produce a nice looking graph (the Target line being the most important one of course). In the MRTG doc directory, and on the website, you can find out what each configurable item does. I had to use "Unscaled" to show the idle time, "Options" to get rid of the percentages calculated by MRTG (together with growright which I just like) and several "Legend" settings.

Workdir: /some/path
Target[home.cpu]: `/usr/bin/awk '/cpu /{print $2+$3; print $2+$3+$4; print "quite some time"; print "home"}'</proc/stat`
Title[home.cpu]: Processor stats at home
PageTop[home.cpu]: <H1>Processor stats</H1>
MaxBytes[home.cpu]: 100
Unscaled[home.cpu]: ymwd
Options[home.cpu]: growright,nopercent
LegendI[home.cpu]: &nbsp;user:
LegendO[home.cpu]: &nbsp;total:
Ylegend[home.cpu]: %
ShortLegend[home.cpu]: %
Legend1[home.cpu]: Time spent in user mode
Legend2[home.cpu]: Time spent in user mode + time spent in system mode
Legend3[home.cpu]: Maximum occurrence of time spent in user mode
Legend4[home.cpu]: Maximum occurrence of (time spent in user mode + time spent in system mode)

Add the job to your scheduler, bring the box down to its knees by an unhealthy amount of flood ping and wait a while. Enjoy.
