10-05-2010, 08:44 PM
Mike McGrath

Nagios event handlers

In an effort to further hide the fas issues we've been running into, I've
added an event handler to the app servers. Briefly, the problem is that
when fas hangs, httpd processes on the app servers stack up. When they do,
they become unresponsive.

Until now, nagios did this on failure:

Failed check 1: nothing (soft)
Failed check 2: nothing (soft)
Failed check 3: send notification (hard)

Once it hits that hard state, nagios claims it's dead. We get paged, the
alert shows up in #fedora-noc. Doom.

Now what it does is this:

Failed check 1: nothing (soft)
Failed check 2: send notification to #fedora-noc, issue a service httpd
                reload
Failed check 3: send paged / emailed notifications, issue a service httpd
                restart
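
For anyone who wants to picture the escalation above, here is a minimal sketch of the decision an event handler script could make. It is not the handler that was actually deployed. It assumes the handler is invoked with the standard Nagios macros $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$ as its three arguments, that the check uses three attempts, and that the script runs on the host where httpd lives. The function name choose_action and the /sbin/service path are made up for illustration, and the #fedora-noc notification is not shown.

```python
import subprocess
import sys


def choose_action(state, state_type, attempt):
    """Return the httpd action for a failed check, or None to do nothing.

    Hypothetical helper: the arguments mirror the Nagios macros
    $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$.
    """
    if state != "CRITICAL":
        return None          # OK / WARNING / UNKNOWN: leave httpd alone
    if state_type == "SOFT" and attempt == 2:
        return "reload"      # second failed check: gentle fix first
    if state_type == "HARD":
        return "restart"     # third failed check (hard state): full restart
    return None              # first failed check: nothing


def main(argv):
    # Assumed argument order: state, state type, attempt number.
    state, state_type, attempt = argv[1], argv[2], int(argv[3])
    action = choose_action(state, state_type, attempt)
    if action is not None:
        # Assumption: httpd is local to wherever this script runs.
        subprocess.call(["/sbin/service", "httpd", action])
    return 0


if __name__ == "__main__" and len(sys.argv) == 4:
    sys.exit(main(sys.argv))
```

Keeping the decision in its own function makes it easy to check the escalation table without touching httpd.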


This is a big change from how things were, and as such we should track it
closely. The reason for sending the notification to #fedora-noc is to
ensure things aren't auto-correcting without us knowing. But at the same
time we're not generating a lot of unneeded email / paged alerts. I'm
going to let this run for a while, and let's see how it goes.

pkgdb, for whatever reason, has always been an excellent canary, which is
why I'm checking it.

Questions / comments?

-Mike
_______________________________________________
infrastructure mailing list
infrastructure@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/infrastructure
 
10-06-2010, 02:58 PM
"Carlos (casep) Sepulveda"

Nagios event handlers

On 5 October 2010 16:44, Mike McGrath <mmcgrath@redhat.com> wrote:
> In an effort to further hide the fas issues we've been running into I've
> added an event handler to the app servers.  A brief description of the
> problem is when fas hangs, app server httpd processes stack up.  When they
> do they become unresponsive.
>
> Currently nagios does this on failure:
>
> Failed check 1: nothing (Soft)
> Failed check 2: nothing (Soft)
> Failed check 3: Send notification (hard)
>
> Once it hits that hard state, nagios claims its dead.  We get paged, the
> alert shows up in #fedora-noc.  Doom.
>
> Now what it does is this:
>
> Failed check 1: nothing (Soft)
> Failed Check 2: send notification to #fedora-noc, issue a service httpd
>      reload
> Failed Check 3: Send paged / emailed notifications, issue a service httpd
>      restart
>
>
> This is a very different change from how things were and as such we should
> track this closely.  The reason for the notification issue to #fedora-noc
> is to ensure things aren't auto-correcting without us knowing.  But at the
> same time we're not generating a lot of un-needed email / paged alerts.
> I'm going to let this run for a while and lets see how it goes.
>
> pkgdb, for whatever reason, has always been an excellent canary which is
> why I'm checking it.
>


Hi:
It looks OK to me, but do you have stats on how many times you get a
2nd failed check without reaching a 3rd? I'm thinking of network
micro-outages, load peaks or something funny on the server. Maybe it
needs to be a 4-check service (reload at the third).
On the other hand, it's just a reload of httpd.
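
A rough sketch of how that statistic could be computed from a list of past check results. The function name count_escalations and the sample history are invented for illustration; this does not read real Nagios logs. It only counts how many failure streaks reach the 2nd check and how many go on to reach the 3rd.

```python
def count_escalations(results, reload_at=2, hard_at=3):
    """Count failure streaks that reach the reload and hard thresholds.

    results: iterable of booleans in time order, True meaning the check
    passed. Returns (reached_reload, reached_hard).
    """
    reached_reload = 0
    reached_hard = 0
    streak = 0  # consecutive failed checks so far
    for ok in results:
        if ok:
            streak = 0
            continue
        streak += 1
        if streak == reload_at:
            reached_reload += 1
        if streak == hard_at:
            reached_hard += 1
    return reached_reload, reached_hard


# Made-up history: one streak of two failures, one of three, one of one.
history = [True, False, False, True, False, False, False, True, False]
reloads, hards = count_escalations(history)
recovered_before_third = reloads - hards
```

If recovered_before_third is large relative to reloads, moving the reload to the third of four checks may be worth it.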

Regards
--
"My name is Ozymandias, king of kings:
Look on my works, ye Mighty, and despair!"
Percy Bysshe Shelley
http://sites.google.com/site/carlossepulveda
 
10-06-2010, 03:35 PM
Mike McGrath

Nagios event handlers

On Wed, 6 Oct 2010, Carlos (casep) Sepulveda wrote:

> Hi:
> It looks OK to me, but, do you've stats about how many time you get a
> 2nd fail check without reaching a 3rd? I'm thinking in network
> micro-outage, load peaks or something funny in the server. Maybe it
> needs to be a 4 checks service (reload at third).
> In the other hand, it's just a reload of httpd
>

I had considered doing a graceful, but my feeling is it wouldn't recover
quickly enough for this change to have a full impact :-/

-Mike
 
