Looking for resources on system performance monitoring & tuning
I'm the de facto admin for a 32-core research server running Ubuntu Server 10.04
LTS on Dell hardware. 'uname -a' gives
Linux UTxxxxx 2.6.32-33-server #72-Ubuntu SMP Fri Jul 29 21:21:55 UTC 2011
An external collaborator is running a long-running multi-threaded CPU-intensive
process on our machine. They claim that it runs fine on all CPUs for about 6
days before being reduced to about 4 cores. We sometimes renice it a bit, but we
aren't taking any action to throttle it. The other users of this system have
occasional processes that run for an hour or two and consume most of one or two
cores, but nothing like our collaborator's process in either resource
intensity or duration.
I'm tasked with working on the problem on the server side (checking the
application is up to them). Can you recommend resources that will bring me up to
speed on performance monitoring and tuning? I know how to run tools such as
dstat, and I have a feel for the CPU, memory, and disk I/O numbers, but I'm not
sure how to interpret the hardware and software interrupt numbers.
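To make the interrupt question concrete: on Linux the raw per-CPU counters are exposed as text in /proc/interrupts. Below is a minimal Python sketch, assuming the usual layout (a header row of CPU names, then one "label: counts... description" row per interrupt source), that totals each source across CPUs and ranks sources by how much they grew between two snapshots. The function names parse_interrupts and busiest are made up for illustration, and the parsing is deliberately simple rather than robust.

```python
def parse_interrupts(text):
    """Map each interrupt label to (total count across CPUs, description)."""
    lines = text.strip("\n").splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        if ":" not in line:
            continue
        label, _, rest = line.partition(":")
        parts = rest.split()
        counts = []
        # Rows such as ERR or MIS carry fewer counts than there are CPUs,
        # so stop at the first non-numeric token.
        for token in parts[:ncpus]:
            if not token.isdigit():
                break
            counts.append(int(token))
        description = " ".join(parts[len(counts):])
        totals[label.strip()] = (sum(counts), description)
    return totals


def busiest(before, after, top=5):
    """Return (delta, label, description) tuples, largest increase first."""
    deltas = []
    for label, (count, description) in after.items():
        previous = before.get(label, (0, ""))[0]
        deltas.append((count - previous, label, description))
    deltas.sort(reverse=True)
    return deltas[:top]
```

One way to use it would be to read /proc/interrupts twice, a few seconds apart, pass each snapshot through parse_interrupts, and hand both results to busiest to see which devices are generating the most interrupts in that window.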
One other oddity that we've noticed with this process in the past: after
running for a few days, it disappears from the top of 'top', and its ps output
shows a ridiculously high value for % of CPU time (it should be around 3200%,
i.e. all 32 cores, but it can show something like 1393215%). This struck us as
weird even before our partners told us about the performance problem.
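For context on comparing the two tools: as far as I understand the ps man page, ps reports %CPU as total CPU time used divided by the time the process has been running, while top reports usage over its most recent refresh interval, so the two need not agree. Here is a minimal Python sketch of the interval-style calculation, assuming the field layout documented in proc(5), where utime and stime are fields 14 and 15 of /proc/<pid>/stat and are measured in clock ticks. The function names are hypothetical, and this does not by itself explain the 1393215% reading.

```python
import os
import time


def cpu_ticks(stat_line):
    """Return utime + stime, in clock ticks, from one /proc/<pid>/stat line."""
    # The command name sits in parentheses and may contain spaces,
    # so split on the last ')' before splitting on whitespace.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state), so fields 14 and 15 are at indexes 11 and 12.
    return int(fields[11]) + int(fields[12])


def interval_cpu_percent(ticks_before, ticks_after, seconds, ticks_per_second):
    """CPU usage over an interval; 100.0 means one fully busy core."""
    return 100.0 * (ticks_after - ticks_before) / ticks_per_second / seconds


def sample(pid, seconds=5.0):
    """Measure a live process by reading /proc/<pid>/stat twice (Linux only)."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    path = "/proc/%d/stat" % pid
    with open(path) as f:
        before = cpu_ticks(f.read())
    time.sleep(seconds)
    with open(path) as f:
        after = cpu_ticks(f.read())
    return interval_cpu_percent(before, after, seconds, ticks_per_second)
```

Calling sample(pid) on the collaborator's process should give a number near 3200 while it is using all 32 cores, and a number near 400 once it has dropped to about 4.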
Many thanks in advance. I really appreciate any pointers to good, reliable material.
Programmer Analyst IV
The University of Texas Health Science Center at Houston
School of Biomedical Informatics
ubuntu-server mailing list
More info: https://wiki.ubuntu.com/ServerTeam