Linux Archive

Linux Archive (http://www.linux-archive.org/)
-   Fedora Directory (http://www.linux-archive.org/fedora-directory/)
-   -   Strange Disk IO issue (http://www.linux-archive.org/fedora-directory/666854-strange-disk-io-issue.html)

Brad Schuetz 05-14-2012 11:54 PM

Strange Disk IO issue
 
I have recently upgraded our 389 servers from pretty old versions that
were a mix and match of 389 releases and CentOS-released versions (all on
CentOS 5) to the latest (on CentOS 6); specific RPMs are listed below.

I did this through a full ldif dump of the original server, imported
into a freshly installed new master server. Then I set up the
replication agreements with the 7 slave servers and everything was
running fine.

After about a week I started having a problem with the hub servers,
where all of them, after (possibly exactly) 24 hours, would start going
crazy on disk IO (95-100% according to sysstat), making queries to LDAP
slow. The master server does not exhibit this problem; it runs
completely fine.

A simple restart of the dirsrv process corrects the issue and then it
will run for another 24 hours before repeating the issue.

The hardware running each node is somewhat different, with varying
underlying disk speeds, but all nodes exhibit the same behavior.

This happens the same on the 2 nodes that get relatively little traffic
and the 5 nodes that get a lot of traffic.

I was originally on the 389-ds-base release that shipped with CentOS 6
and have changed to the version from the
<http://repos.fedorapeople.org/repos/rmeggins/389-ds-base/epel-389-ds-base.repo>
repo; both do the same thing.

Any thoughts/suggestions on how to fix or further diagnose this? I've
had no luck finding any issues with strace or the error logs. At this
point I've unfortunately had to resort to a cron job to restart all of
my LDAP hubs.
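
For reference, the workaround is roughly a cron entry like the following
on each hub (the file name and times here are only illustrative):

  # /etc/cron.d/dirsrv-restart  (hypothetical example)
  # m  h  dom mon dow  user  command
  17  3   *   *   *   root  /sbin/service dirsrv restart
  41 15   *   *   *   root  /sbin/service dirsrv restart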

Installed RPMs:
389-ds-console-1.2.6-1.el6.noarch
389-ds-1.2.2-1.el6.noarch
389-console-1.1.7-1.el6.noarch
389-admin-console-1.1.8-1.el6.noarch
389-ds-console-doc-1.2.6-1.el6.noarch
389-dsgw-1.1.9-1.el6.x86_64
389-admin-1.1.29-1.el6.x86_64
389-ds-base-1.2.10.7-1.el6.x86_64
389-adminutil-1.1.15-1.el6.x86_64
389-admin-console-doc-1.1.8-1.el6.noarch
389-ds-base-libs-1.2.10.7-1.el6.x86_64

--
Brad
--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users

Paul Robert Marino 05-15-2012 01:18 PM

Strange Disk IO issue
 
That is odd behavior.

Do all of the replicas have the same applications connecting to them? Not necessarily the same instances, but the same applications configured in a similar way. The reason I ask is I'm wondering if there might be a rogue app sending heavy queries repeatedly to the servers. It sounds like issues I've seen in environments where nscd wasn't properly tuned (hundreds of processes querying nscd but only the default low thread count limit) and had stopped functioning correctly. The result was that the LDAP servers were victims of hundreds of boxes launching an unintended DoS attack.
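
For reference, the nscd tuning I mean is mostly the thread limits in
/etc/nscd.conf; the values below only illustrate the knobs, they are not
a recommendation for any particular site:

  # /etc/nscd.conf (excerpt, illustrative values)
  threads       8     # worker threads started at launch
  max-threads   128   # ceiling nscd may grow to under heavy load
  paranoia      no    # periodic self-restart disabled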


--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users

Brad Schuetz 05-15-2012 06:22 PM

Strange Disk IO issue
 
On 05/15/2012 06:18 AM, Paul Robert Marino wrote:
>
> That is odd behavior.
> Do all of the replicas have the same applications connecting to them?
> Not necessarily the same instances, but the same applications configured
> in a similar way. The reason I ask is I'm wondering if there might be
> a rogue app sending heavy queries repeatedly to the servers. It sounds
> like issues I've seen in environments where nscd wasn't properly tuned
> (hundreds of processes querying nscd but only the default low thread
> count limit) and had stopped functioning correctly. The result was that
> the LDAP servers were victims of hundreds of boxes launching an
> unintended DoS attack.
>

Of the 7 replicas, 5 are attached to one network, and 2 are on another
network. The 5 are queried a LOT, the other 2 barely get any traffic at
all. All, however, are getting the same traffic that they were getting
when I was using previous versions of the LDAP server.

The 2 that are barely used I've checked for excessive queries being run
at the point when load goes crazy and they are getting the usual minimal
load.

Also that doesn't explain why it's always 24 hours that it goes
haywire. It doesn't matter when I restart the service, it could be
restarted at 2am, then in 24 hours it will go crazy IO load.

--
Brad
--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users

Gregory Matthews 05-16-2012 11:51 AM

Strange Disk IO issue
 
On 15/05/12 19:22, Brad Schuetz wrote:

> Of the 7 replicas, 5 are attached to one network, and 2 are on another
> network. The 5 are queried a LOT, the other 2 barely get any traffic at
> all. All, however, are getting the same traffic that they were getting
> when I was using previous versions of the LDAP server.
>
> The 2 that are barely used I've checked for excessive queries being run
> at the point when load goes crazy and they are getting the usual minimal
> load.
>
> Also that doesn't explain why it's always 24 hours that it goes
> haywire. It doesn't matter when I restart the service, it could be
> restarted at 2am, then in 24 hours it will go crazy IO load.


do you know what the IO is? is it swapping? are you running collectl or
similar so you can look at historic performance data?
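
If the sysstat collector is running (on CentOS it usually is, via
/etc/cron.d/sysstat), its own history can show this; the file name below
is just an example for the 14th of the month:

  # per-device utilization history for that day; watch the %util column
  sar -d -f /var/log/sa/sa14

  # or sample it live, one-second intervals, extended device stats
  iostat -dx 1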


G



--
Greg Matthews 01235 778658
Scientific Computing Group Leader
Diamond Light Source Ltd. OXON UK

--
This e-mail and any attachments may contain confidential, copyright and or privileged material, and are for the use of the intended addressee only. If you are not the intended addressee or an authorised recipient of the addressee please notify us of receipt by returning the e-mail and do not use, copy, retain, distribute or disclose the information in or attached to the e-mail.
Any opinions expressed within this e-mail are those of the individual and not necessarily of Diamond Light Source Ltd.
Diamond Light Source Ltd. cannot guarantee that this e-mail or any attachments are free from viruses and we cannot accept liability for any damage which you may sustain as a result of software viruses which may be transmitted in or with the message.

Diamond Light Source Limited (company no. 4375679). Registered in England and Wales with its registered office at Diamond House, Harwell Science and Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom




--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users

Paul Robert Marino 05-16-2012 01:16 PM

Strange Disk IO issue
 
The exact timing of the issue is too strange. Is there a backup job running at midnight, or some other timed job that could be eating the RAM or disk IO? Possibly one that relies on LDAP queries that would otherwise be innocuous.


--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users

Brad Schuetz 05-16-2012 06:15 PM

Strange Disk IO issue
 
On 05/16/2012 04:51 AM, Gregory Matthews wrote:
> On 15/05/12 19:22, Brad Schuetz wrote:
>> Of the 7 replicas, 5 are attached to one network, and 2 are on another
>> network. The 5 are queried a LOT, the other 2 barely get any traffic at
>> all. All, however, are getting the same traffic that they were getting
>> when I was using previous versions of the LDAP server.
>>
>> The 2 that are barely used I've checked for excessive queries being run
>> at the point when load goes crazy and they are getting the usual minimal
>> load.
>>
>> Also that doesn't explain why it's always 24 hours that it goes
>> haywire. It doesn't matter when I restart the service, it could be
>> restarted at 2am, then in 24 hours it will go crazy IO load.
>
> do you know what the IO is? is it swapping? are you running collectl
> or similar so you can look at historic performance data?
>
> G
>
It's not swap, and it happens regardless of the amount of RAM in the
server.

I've run "iostat -x 1" for random periods of time both before and after
the issue hits, and the IO is very low: < 20% most of the time and
typically < 5% usage.

But when the problem happens, iostat reports > 95% usage.

--
Brad
--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users

Brad Schuetz 05-16-2012 06:19 PM

Strange Disk IO issue
 
On 05/16/2012 06:16 AM, Paul Robert Marino wrote:
>
> The exact timing of the issue is too strange. Is there a backup job
> running at midnight, or some other timed job that could be eating the
> RAM or disk IO? Possibly one that relies on LDAP queries that would
> otherwise be innocuous.
>
>

It doesn't happen at midnight, it's 24 hours from when the process was
started, so I can restart dirsrv at 3:17pm on Wednesday and at right
around 3:17pm on Thursday that server will go to 100% disk IO usage.

I've restarted the servers at totally random times to reproduce this
issue, and I currently restart, via cron, all my LDAP servers twice per
day at randomly selected times, both so that they are restarted before
the 24 hours hits and so that no more than 1 dirsrv process is being
restarted at the same time.

Keep in mind, the LDAP query load has not changed from the setup I was
running prior to this, which used (much) older versions of the 389
server software. In fact, as part of this system upgrade, additional
servers were added to reduce the individual load on each server.

--
Brad
--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users

Nathan Kinder 05-16-2012 06:19 PM

Strange Disk IO issue
 
On 05/16/2012 06:16 AM, Paul Robert Marino wrote:


> The exact timing of the issue is too strange. Is there a backup job
> running at midnight, or some other timed job that could be eating the
> RAM or disk IO? Possibly one that relies on LDAP queries that would
> otherwise be innocuous.


It is possible that it is the tombstone reap thread, which runs
periodically. Do you do a lot of entry deletion operations
throughout the typical day?


--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users

Nathan Kinder 05-16-2012 06:54 PM

Strange Disk IO issue
 
On 05/16/2012 11:19 AM, Brad Schuetz wrote:


> On 05/16/2012 06:16 AM, Paul Robert Marino wrote:
>> The exact timing of the issue is too strange. Is there a backup job
>> running at midnight, or some other timed job that could be eating the
>> RAM or disk IO? Possibly one that relies on LDAP queries that would
>> otherwise be innocuous.
>
> It doesn't happen at midnight, it's 24 hours from when the process was
> started, so I can restart dirsrv at 3:17pm on Wednesday and at right
> around 3:17pm on Thursday that server will go to 100% disk IO usage.

The default tombstone purge interval is 1 day, which seems to fit what
you are seeing. The tombstone reap thread will start every 24 hours to
find tombstone entries that can be deleted. The default retention
period for tombstones is 1 week. It is possible that you have a large
number of tombstone entries that need to be deleted. This will occur
independently on all of your server instances. This is controlled by
the "nsDS5ReplicaTombstonePurgeInterval" and "nsDS5ReplicaPurgeDelay"
attributes in your "cn=replica,cn=<suffixDN>,cn=mapping tree,cn=config"
entry.


You can search for "(objectclass=nstombstone)" as Directory Manager to
see how many tombstone entries you have.
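
For example, something along these lines (dc=example,dc=com is a
placeholder for your actual suffix):

  # count tombstone entries under the suffix
  ldapsearch -x -D "cn=Directory Manager" -W -b "dc=example,dc=com" \
      "(objectclass=nstombstone)" dn | grep -c "^dn:"

  # check the current purge settings on the replica entry
  ldapsearch -x -D "cn=Directory Manager" -W \
      -b "cn=replica,cn=\"dc=example,dc=com\",cn=mapping tree,cn=config" \
      -s base nsDS5ReplicaTombstonePurgeInterval nsDS5ReplicaPurgeDelay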




--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users

Brad Schuetz 05-16-2012 07:19 PM

Strange Disk IO issue
 
On 05/16/2012 11:19 AM, Nathan Kinder wrote:


> On 05/16/2012 06:16 AM, Paul Robert Marino wrote:
>> The exact timing of the issue is too strange. Is there a backup job
>> running at midnight, or some other timed job that could be eating the
>> RAM or disk IO? Possibly one that relies on LDAP queries that would
>> otherwise be innocuous.
>
> It is possible that it is the tombstone reap thread, which runs
> periodically. Do you do a lot of entry deletion operations
> throughout the typical day?

I don't have the exact numbers, but yes, I would say there are a
significant number of entry deletions on any given day.



--

Brad



--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users

