Florin Andrei 04-13-2011 08:06 PM

bizarre system slowness
 
Running CentOS 5 (64-bit) on a Dell 1950.

A cluster of 3 DB machines, identical hardware. One of them suddenly
became slower 2 weeks ago.

Running tar -zxf on a large file takes 1.5 minutes on this machine, but
only 10 seconds on any of its siblings. CPU usage is high while
untarring, with lots of user and sys cycles being used, but almost no
I/O wait. It doesn't matter whether I untar onto a local disk or onto a
Fibre Channel SAN volume; it's slow either way.

Copying a file over the network with scp is slow too: 6 MB/s to this
machine, 70 MB/s to its siblings.

However, this is just as fast on all systems, including the "sick" one:

# time dd if=/dev/zero of=/dev/null bs=1M count=100000
100000+0 records in
100000+0 records out
104857600000 bytes (105 GB) copied, 2.59213 seconds, 40.5 GB/s

real 0m2.600s
user 0m0.025s
sys 0m2.550s
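
Note that the dd run above stays almost entirely in the kernel (sys
2.55s against user 0.025s), so it says little about user-space CPU
speed, which is what gunzip and scp encryption mostly consume. A minimal
sketch of a user-space probe that could be run on each sibling; the
function name cpu_loop and the default iteration count are made up for
illustration:

```sh
# Time a fixed amount of pure shell arithmetic (no disk, no network).
# cpu_loop is a hypothetical helper; the default count is arbitrary.
cpu_loop() {
    n=${1:-2000000}
    i=0
    sum=0
    start=$(date +%s)
    while [ "$i" -lt "$n" ]; do
        sum=$((sum + i % 7))
        i=$((i + 1))
    done
    end=$(date +%s)
    # The checksum only proves that every machine did the same work.
    echo "iterations=$n checksum=$sum seconds=$((end - start))"
}
```

Calling cpu_loop with no argument on all three machines and comparing
the seconds field should show whether plain user-space work is also
slower on the sick box.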

/proc/cpuinfo looks fine. Nothing suspect in dmesg.

Rebooting doesn't fix it. Powering off and back on doesn't fix it.
Single-user mode is slow too, and I tried a couple of different kernels.

Dell's online diagnostics program could find nothing wrong with it.

/var/log/messages was full of "ntpd[7313]: frequency error -1707 PPM
exceeds tolerance 500 PPM" messages. There were also a lot of messages
saying "the system limit for the maximum number of semaphore sets has
been exceeded"; there were indeed a lot of leftover semaphores created
by NRPE (owned by the nagios user). I deleted them but nothing changed,
so they were a symptom, not the cause.
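
For reference, a sketch of how leftover semaphore sets like these could
be listed per owner. It assumes the column layout printed by the Linux
ipcs -s (key, semid, owner, perms, nsems); sem_ids_for_owner is a
made-up helper name:

```sh
# Print the semid of every semaphore set owned by the given user.
# Reads `ipcs -s` output on stdin (assumed columns: key semid owner perms nsems).
sem_ids_for_owner() {
    awk -v owner="$1" '$3 == owner { print $2 }'
}

# Usage (as root), shown as comments so nothing is removed by accident:
#   ipcs -s | sem_ids_for_owner nagios
#   ipcs -s | sem_ids_for_owner nagios | while read -r id; do ipcrm -s "$id"; done
#   cat /proc/sys/kernel/sem    # fourth value is SEMMNI, the limit on semaphore sets
```

If the count climbs again after the cleanup, whatever creates the sets
is still leaking them.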

I'm still kind of hoping it's a software issue, but chances are slim.
OTOH, I can't imagine any hardware problem that would exhibit these
symptoms.

Any idea what to test?

--
Florin Andrei
http://florin.myip.org/
_______________________________________________
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos

"Brunner, Brian T." 04-13-2011 08:16 PM

bizarre system slowness
 
Florin Andrei wrote:
> Running CentOS 5 (64-bit) on a Dell 1950.
>
> A cluster of 3 DB machines, identical hardware. One of them suddenly
> became slower 2 weeks ago.

<snip proof the overall system is slow>

<snip proof that the CPU is not the problem>

> /var/log/messages was full of "ntpd[7313]: frequency error -1707 PPM
> exceeds tolerance 500 PPM" messages.

Sounds like your CMOS time-keeping chip may be dying.

> I'm still kind of hoping it's a software issue, but chances are slim.
> OTOH, I can't imagine any hardware problem that would exhibit these
> symptoms.
>
> Any idea what to test?

Have any RAID setups gone into self-repair (rebuild) mode?

Run

dd if=/dev/${nextdrive}1 of=/dev/null count=100000

for each disk on each system, just to compare the read speeds off each
spindle.
If ONE drive is slower than the others, blame that drive or the data on
it (RAID rebuild).
If ALL drives on the slow system are slow, blame the mobo.
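
The per-spindle comparison described above could be scripted roughly
like this. A sketch only: read_secs is a made-up helper, the /dev/sd?
device names are an assumption, and data already in the page cache will
make a repeat run look faster than the disk really is:

```sh
# Read the first N MiB of a device (or file) and report the elapsed seconds.
# read_secs is a hypothetical helper; N defaults to 1000.
read_secs() {
    dev=$1
    mb=${2:-1000}
    start=$(date +%s)
    dd if="$dev" of=/dev/null bs=1M count="$mb" 2>/dev/null
    status=$?
    end=$(date +%s)
    echo "$dev status=$status seconds=$((end - start))"
}

# Usage (as root), device names assumed:
#   for d in /dev/sd?; do read_secs "$d"; done
```

Running the same loop on a healthy sibling gives the baseline to compare
against.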

Run smartctl -t short (or -t long) on each of the disks, then smartctl -a
once the test has finished.

Are these systems busy serving customers, or can they be opened and the
drive sets swapped between systems?


Insert spiffy .sig here:
Life is complex: it has both real and imaginary parts.

//me

Cal Webster 04-13-2011 08:34 PM

bizarre system slowness
 
On Wed, 2011-04-13 at 13:06 -0700, Florin Andrei wrote:
> /var/log/messages was full of "ntpd[7313]: frequency error -1707 PPM
> exceeds tolerance 500 PPM" messages. There were also a lot of messages
> saying "the system limit for the maximum number of semaphore sets has
> been exceeded"; there were indeed a lot of leftover semaphores created
> by NRPE (owned by the nagios user). I deleted them but nothing changed,
> so they were a symptom, not the cause.

Are the system times different between the siblings?
Are all 3 siblings running ntpd and using the same time source
(server(s))?
Do the symptoms change with ntpd stopped/running?
Are the frequency offsets the same on each sibling?

Since your log messages appear to be ntp related, you might try
resetting your frequency offset and drift values. Having a -1707 PPM
offset could cause many issues like you describe.

service ntpd stop
ntptime -f 0
echo "0" > /var/lib/ntp/drift
service ntpd start
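
To compare the siblings, it may help to pull the reported frequency
errors out of the logs first. A minimal sketch that extracts the PPM
value from ntpd messages of the form quoted above; freq_errors is a
hypothetical helper name:

```sh
# Print the PPM value from every "frequency error N PPM" line on stdin.
freq_errors() {
    sed -n 's/.*frequency error \(-\{0,1\}[0-9][0-9]*\) PPM.*/\1/p'
}

# Usage on each sibling:
#   grep 'frequency error' /var/log/messages | freq_errors | sort -n | uniq -c
#   cat /var/lib/ntp/drift     # current drift value in PPM
#   ntpq -p                    # peers, offset and jitter
```

If only the sick machine reports values far outside the 500 PPM
tolerance, that points at its clock rather than at the time servers.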



Benjamin Franz 04-13-2011 08:55 PM

bizarre system slowness
 
On 04/13/2011 01:34 PM, Cal Webster wrote:
>
>> tar -zxf with a large file on this machine takes 1.5 minutes, but takes
>> only 10 seconds on any of its siblings. CPU usage seems high while
>> untarring, with lots of user and sys cycles being used, but almost no
>> wait cycles. It doesn't matter whether I untar on a local disk, or on a
>> fiber channel SAN volume, it's slow anyway.

1) Are you untarring from *and* to the SAN volume or is the source on
the local volume?
2) What kind of local drives? If the local drive is IDE or SATA it is
possible the machine is using PIO mode. That would match the symptoms of
very high CPU usage and very slow I/O (yes - I've seen it happen with
SATA drives with certain Supermicro chipsets).
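
One way to look for this, sketched under the assumption that the disks
are driven by libata, which logs the negotiated transfer mode at boot
(for example "configured for UDMA/133"). transfer_modes is a
hypothetical helper name:

```sh
# Keep only kernel log lines that mention a negotiated transfer mode or PIO.
# Reads dmesg-style text on stdin.
transfer_modes() {
    grep -E 'configured for|PIO'
}

# Usage:
#   dmesg | transfer_modes
#   hdparm -d /dev/hda    # old IDE driver only: reports the using_dma flag
```

A drive that shows a PIO mode here while its siblings show UDMA would
fit the high-CPU, slow-I/O pattern.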

--
Benjamin Franz



Florin Andrei 04-13-2011 09:01 PM

bizarre system slowness
 
On 04/13/2011 01:55 PM, Benjamin Franz wrote:
>
> 1) Are you untarring from *and* to the SAN volume or is the source on
> the local volume?

Source on SAN, destination on SAN. Still slow.

--
Florin Andrei
http://florin.myip.org/

Florin Andrei 04-13-2011 09:14 PM

bizarre system slowness
 
On 04/13/2011 01:16 PM, Brunner, Brian T. wrote:
>
> Any RAID setups go into self-repair mode?

No RAID here, just LVM - not too different from the default redhat-style
setup of the system drives (except the additional SAN stuff and DB).

Anyway, if the drives are the cause, then riddle me this:

scp test.tar.gz bad-server:/dev/null

Fast to the healthy siblings, slow to the sick machine.

Could the network be the problem? I doubt it. Each system has 4 physical
Ethernet ports, bonded two by two. Each bonded pair is connected to a
different VLAN. This system is slow on both VLANs.

In any case, the explanation would have to account for the fact that
network transfers, *and* local disk activity, are both slow.
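
Since scp to /dev/null takes the disks out of the picture, what is left
is the network plus the CPU work of ssh encryption. A rough sketch for
separating those two; pipe_secs is a made-up helper, and bad-server is
the placeholder host name used above:

```sh
# Push N MiB of zeros through an arbitrary command and report the elapsed
# seconds. pipe_secs is a hypothetical helper; the command's output is discarded.
pipe_secs() {
    mb=$1
    shift
    start=$(date +%s)
    dd if=/dev/zero bs=1M count="$mb" 2>/dev/null | "$@" > /dev/null
    status=$?
    end=$(date +%s)
    echo "mib=$mb status=$status seconds=$((end - start))"
}

# Usage:
#   pipe_secs 500 gzip -c                           # user-space CPU only
#   pipe_secs 500 ssh localhost 'cat > /dev/null'   # ssh crypto, loopback only
#   pipe_secs 500 ssh bad-server 'cat > /dev/null'  # ssh crypto plus the real network
```

If the gzip and loopback cases are already slow on the sick machine,
the network is probably not the culprit.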

--
Florin Andrei
http://florin.myip.org/

Devin Reade 04-13-2011 10:33 PM

bizarre system slowness
 
Maybe check /proc/interrupts?
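
A small sketch for comparing interrupt counts between the sick machine
and a healthy sibling. It assumes the usual /proc/interrupts layout (a
header row naming the CPUs, then one row per IRQ with one count per
CPU); irq_totals is a hypothetical helper name:

```sh
# Sum the per-CPU counts of every IRQ row and list the busiest first.
# Reads /proc/interrupts-style text on stdin.
irq_totals() {
    awk '
        NR == 1 { ncpu = NF; next }   # header row: one column per CPU
        {
            total = 0
            for (i = 2; i <= ncpu + 1 && i <= NF; i++)
                if ($i ~ /^[0-9]+$/) total += $i
            print $1, total
        }
    ' | sort -k2,2nr
}

# Usage:
#   irq_totals < /proc/interrupts | head
```

Taking two samples a few seconds apart on each machine and comparing
the growth would show whether one interrupt source is firing far more
often on the slow box.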


Michael Simpson 04-14-2011 09:37 AM

bizarre system slowness
 
On 13 April 2011 21:06, Florin Andrei <florin@andrei.myip.org> wrote:
> Running CentOS 5 (64-bit) on a Dell 1950.
> /var/log/messages was full of "ntpd[7313]: frequency error -1707 PPM
> exceeds tolerance 500 PPM" messages. There were also a lot of messages
> saying "the system limit for the maximum number of semaphore sets has
> been exceeded"; there were indeed a lot of leftover semaphores created
> by NRPE (owned by the nagios user). I deleted them but nothing changed,
> so they were a symptom, not the cause.
>

Sounds as though something in software is leaking semaphores rather
than cleaning them up. Are you using Intel NICs? If you are, you might
want to check whether you have the e1000 dkms rpm installed. No idea
why it should only affect one box if they are all the same.
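
To see which driver and version each NIC is actually using, something
like the following could be run on all three boxes. It assumes ethtool
is installed and that the interface is called eth0; nic_driver is a
hypothetical helper name:

```sh
# Reduce `ethtool -i` output (read on stdin) to the driver name and version.
nic_driver() {
    awk -F': ' '$1 == "driver" || $1 == "version" { printf "%s=%s\n", $1, $2 }'
}

# Usage (interface name assumed):
#   ethtool -i eth0 | nic_driver
#   rpm -qa | grep -i e1000    # is an extra e1000 package installed?
```

A different driver or version on the slow box would be the thing to
chase.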

regards

mike

Ljubomir Ljubojevic 04-14-2011 09:37 AM

bizarre system slowness
 
I was wondering if you have normal internet access on that machine. I
have found that systems I set up just freeze and are horribly slow
when there is no internet access. A terminal would take a few minutes
to open, and if I try several terminals, they all open at once when the
system "wakes up". I keep forgetting to ask what the cause might be. A
DNS issue? Routing? Even on boot, the system would hang on "internet"
stuff like the NTP update. I do use NFS with automount; maybe that
causes it?
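
A quick way to test the DNS theory, sketched under the assumption that
getent is available (it ships with glibc); lookup_secs is a made-up
helper name and the host names are placeholders:

```sh
# Time a single name lookup through the system resolver (nsswitch + DNS).
lookup_secs() {
    name=$1
    start=$(date +%s)
    getent hosts "$name" > /dev/null 2>&1
    status=$?
    end=$(date +%s)
    echo "$name status=$status seconds=$((end - start))"
}

# Usage:
#   lookup_secs localhost
#   lookup_secs some-host.example.com
```

Lookups that take several seconds before failing would point at an
unreachable name server in /etc/resolv.conf.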

Ljubomir



All times are GMT.