08-20-2011, 02:45 AM
Toshio Kuratomi

Unplanned Proxy Outage: - 2011-08-19 16:30 UTC

Summary of Event
================

Tonight there was an unplanned outage of two proxy servers (proxy01 and
proxy02). The proxies were unresponsive and had to be rebooted to come
back online. Proxy01 being down caused a cascade of other issues that
should have had very little end-user impact. As far as we know, the
applications on admin.fp.o would have been up but very slow, and the wiki
would have been readable but logins would have failed. Explanation to
follow.

Proxy01 is the only proxy server used by the app servers (web apps,
cronjobs, etc) in phx2 when they need to talk to our web applications in
phx2. This was set up because the router that handles traffic into and out
of phx2 does not allow us to "hairpin", that is, to send a request from
phx2 to an external IP address that then routes back to a server in phx2.
As currently implemented, we have an /etc/hosts entry that points
admin.fedoraproject.org at proxy01's internal IP address in phx2.
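
As a rough illustration of the /etc/hosts override described above, the
sketch below parses hosts-style text and checks that the name maps to a
private address. The address 10.0.0.10 and the helper name lookup_in_hosts
are hypothetical; the real entry is not shown in this report.

```python
import ipaddress

# Hypothetical hosts content; 10.0.0.10 stands in for the real internal address.
HOSTS_TEXT = """\
127.0.0.1   localhost
# Override so phx2 hosts reach the proxy without leaving the datacenter.
10.0.0.10   admin.fedoraproject.org
"""


def lookup_in_hosts(hosts_text, hostname):
    """Return the first address mapped to hostname, or None if absent."""
    for line in hosts_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        address, *names = line.split()
        if hostname in names:
            return address
    return None


address = lookup_in_hosts(HOSTS_TEXT, "admin.fedoraproject.org")
# A private address means the request never has to hairpin through the router.
is_internal = ipaddress.ip_address(address).is_private
```

Because the override names a single address, every phx2 client depends on
that one host being up.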

When proxy01 went down, hosts in phx2 that needed to talk to
admin.fedoraproject.org were no longer able to get the data they needed.
For the wiki, this meant that login attempts during the outage failed
because the wiki could not verify the password in FAS. For the TurboGears
apps on admin.fedoraproject.org the situation was worse. TG1 apps' identity
management depends on visit tracking, and visit tracking hits FAS on
every request. This means that no page could be served for the TG1 apps
from the phx2 app servers.

We have two app servers that reside outside of phx2. Because of network
latency between these servers and the database server in phx2, these servers
are configured as backups for the servers in phx2 and do not handle requests
unless the phx2 servers are unable to. The remaining proxy servers detected
that the app servers within phx2 were down and properly switched over to the
app servers outside of phx2, so there was no apparent outage for people
trying to use admin.fedoraproject.org, although responses would have been
drastically slower.

Looking at the haproxy status page for proxy03 during the outage we noticed
that only one of the two app servers outside of phx2 (app05 at ibiblio) was
handling traffic. app06 (at telia) was not. We are not sure why this is.
One possibility is that telia's network latency is just too high so haproxy
decided that app06 was also down and did not pass traffic to it.
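
The failover behaviour and the latency theory can be sketched as a small
model. This is not haproxy's code: the server names phx2-app-a and
phx2-app-b, the timings, and the 200 ms threshold are all hypothetical. One
related fact from the HAProxy documentation: when every non-backup server is
down, by default only the first operational backup receives traffic unless
"option allbackups" is set. This thread does not establish whether that
option was set here, so the model exposes it as a parameter.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Server:
    name: str
    backup: bool
    check_ms: Optional[float]  # health-check response time; None means no answer


def is_up(server: Server, timeout_ms: float) -> bool:
    """A server counts as up only if its check answered within the timeout."""
    return server.check_ms is not None and server.check_ms <= timeout_ms


def eligible(servers: List[Server], timeout_ms: float,
             all_backups: bool = False) -> List[str]:
    """Return the names of the servers that would receive traffic."""
    up = [s for s in servers if is_up(s, timeout_ms)]
    primaries = [s.name for s in up if not s.backup]
    if primaries:
        return primaries  # backups stay idle while any primary is healthy
    backups = [s.name for s in up if s.backup]
    # Without all_backups, only the first healthy backup is used.
    return backups if all_backups else backups[:1]


# Hypothetical snapshot: both phx2 app servers fail their checks.
SERVERS = [
    Server("phx2-app-a", backup=False, check_ms=None),
    Server("phx2-app-b", backup=False, check_ms=None),
    Server("app05", backup=True, check_ms=80.0),
    Server("app06", backup=True, check_ms=400.0),
]
```

In this model, eligible(SERVERS, 200.0, all_backups=True) yields only app05
because app06 misses the timeout, and eligible(SERVERS, 1000.0) also yields
only app05 because a single backup is used by default. Either mechanism
would match what the status page showed.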

Action Items
============

There are some open questions to try to resolve:

* Why did proxy01 and proxy02 die? A brief look at the logs has not
revealed a cause for this.
* Why didn't app06 take up any of the slack when haproxy started passing
traffic to the backups?

We have identified one means of mitigating this in the future:

If we ran internal DNS for phx2 then we could have admin.fedoraproject.org
resolve to several proxy servers (using internal IP addresses for the
proxies inside of phx2). This should remove the single point of failure
(SPOF) on proxy01. We have not yet determined whether we'd need to run more
proxy servers inside of phx2 or whether hairpinning would stop being an
issue if we used proxy servers outside of phx2.
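
A minimal sketch of why several addresses help: a client that receives more
than one address for the name can fall through to the next one when a proxy
is down. The function name first_reachable, the port, and the timeout are
assumptions for illustration; the connect parameter exists only so the
logic can be exercised without a network.

```python
import socket


def first_reachable(addresses, port=443, timeout=2.0,
                    connect=socket.create_connection):
    """Return the first address that accepts a TCP connection, else None."""
    for address in addresses:
        try:
            conn = connect((address, port), timeout=timeout)
        except OSError:
            # Unreachable or timed out: try the next address.
            continue
        conn.close()
        return address
    return None
```

With a single /etc/hosts entry there is no second address to fall through
to, which is the weakness proxy01 exposed.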

-Toshio
_______________________________________________
infrastructure mailing list
infrastructure@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/infrastructure
 
08-22-2011, 03:38 PM
Kevin Fenzi

Unplanned Proxy Outage: - 2011-08-19 16:30 UTC

On Fri, 19 Aug 2011 19:45:45 -0700
Toshio Kuratomi <a.badger@gmail.com> wrote:

...snip...

> Action Items
> ============
>
> There are some open questions to try to resolve:
>
> * Why did proxy01 and proxy02 die? A brief look at the logs has not
> revealed a cause for this.

I can't find any cause here. The logs just stop; the machines were locked
up hard. ;(

As a side note: libvirt/kvm supports a watchdog device. We could possibly
set up a watchdog on all our guests so they at least reboot if they become
unresponsive. Of course that could lead to problems if they get stuck
in a reboot/lockup cycle.
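
One way to picture a guard against that cycle is a limit on automatic
resets within a time window. This is only a sketch of the idea, not a
libvirt feature; the class name ResetLimiter and its parameters are
hypothetical.

```python
from collections import deque


class ResetLimiter:
    """Allow at most max_resets automatic resets per window_seconds."""

    def __init__(self, max_resets, window_seconds):
        self.max_resets = max_resets
        self.window_seconds = window_seconds
        self._resets = deque()  # timestamps of recent allowed resets

    def allow_reset(self, now):
        # Forget resets that have aged out of the window.
        while self._resets and now - self._resets[0] >= self.window_seconds:
            self._resets.popleft()
        if len(self._resets) >= self.max_resets:
            return False  # too many recent resets: leave it for a human
        self._resets.append(now)
        return True
```

For example, ResetLimiter(2, 600) would permit two resets in ten minutes
and refuse a third until the oldest one ages out.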

> * Why didn't app06 take up any of the slack when haproxy started
> passing traffic to the backups?

Yeah, all I can think of is that it was too slow to answer and haproxy
didn't want to add it.

> We have identified one means of mitigating this in the future:
>
> If we ran internal DNS for phx2 then we could have
> admin.fedoraproject.org resolve to different proxy servers (using
> internal ip addresses for the proxies inside of PHX2). This should
> remove the SPOF on proxy01. We have not yet determined whether we'd
> need to run more proxy servers inside of PHX2 or if hairpinning would
> not be an issue if we used proxy servers outside of phx2.

Well, we do run dns there, so we can tweak it.

Hairpinning only comes into play if we try to list a phx2 external IP
in there. The problem with listing another external proxy is that
it's likely to be slow: the request would need to go all the way out,
then back in to FAS.

We could run another proxy that's just internal to phx2.
That seems like overkill though. ;(

I think I might sit down and draw up our proxy/app/fas/etc setup and
perhaps we can look at a picture and see how we can simplify it or make
it more robust.

kevin