01-12-2011, 02:06 PM
seth vidal

Outage notes

Hi Everyone,
I took some notes while we were rebooting boxes and wanted to share them
with everyone for future outages.

Ordering of the bounces:
1. xen14: puppet is on there and if that is back up first we have a
place to stand for pushing out any changes (dns changes for example via
puppet) - xen14 takes about 4 minutes to restart/POST

2. xen15: bastion01, db02 are on there - same 4 minute restart window.
Once this is up you'll want to log out of bastion02 and into bastion01
so you have a firm place to do the xen05 reboots from, which will take
out bastion02.

3. edit dns on puppet to remove proxy01 from the wildcard/roundrobin and
push that to the ns* servers and verify.

4. xen05: bastion02 (openvpn), proxy01 - 4-5 minutes for this machine to
restart.

Once xen05 is completely up, log back in and verify the vpn is back
online.

5. edit dns: remove all the other proxy hosts and put proxy01 back in.
Push and verify.

6. virthost01 - I had to halt each of the kvms from a login - virsh
shutdown didn't work. - 4 minute restart time on the hw.
Note: make sure virthost01 is completely up - especially fas03. Since
taking down virthost02 next will take out fas02, you want to make sure
you don't leave fas01 all by itself.

7. virthost02 - fas02 was not set up to autostart - that's now fixed.

8. virthost13 - uneventful

9. xen03 - spin01 spewed lots of umount issues - those are from the spin
creation paths - they can be safely ignored
- fas01.stg was running on xen03 according to the logs but
there's no definition for it on the system - so not sure what the story
is there.
- neither of the other staging hosts were set to autostart

10. xen04: we apparently have a number of hosts w/only one dns record
internally and they point to ns03 only. When ns03 went away, lots
of things got VERY VERY SLOW trying to resolve names. This is on my list
to address. You must wait for xen04 to be completely up and ns03 running
before you can take down xen07. Otherwise we'll be w/o dns internally to
phx.

11. xen07: iscsi disks didn't come up right away - this kept ns04 from
coming up immediately - needed to run /etc/init.d/iscsi start and they
showed up.

12. xen09: uneventful

13. xen10: log01 needed an fsck b/c of the time since last mount - this
took a long time.

14. xen11: secondary1 needed an fsck. also a 5-6 minute hw reboot time.

15. xen12: db1->db01 naming change kept it from coming up at boot b/c of
the 'auto' symlink to db1. db01 had to fsck

16. cnode01 - 6-7 minute reboot time - nothing was set to autostart in
xen - this is now fixed - autoqa01 and dhcp02 are set to autostart

17. db03: fsck took FOREVER to complete and this takes a lot of things
down - for the future move db03 reboot higher up the stack, just in
case. This machine's restart/POST time is REALLY high like 7-10 minutes.
The console for it is less than forthcoming, too.

18. backup01: uneventful

At this point internal was back online - except for the build xen
systems and servers.
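
A minimal sketch for the single-resolver problem noted in item 10, assuming it shows up as hosts whose /etc/resolv.conf lists only one nameserver. The host names and 192.0.2.x addresses below are made up for illustration, and collecting the resolv.conf text from each host is left out.

```python
def nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        # resolv.conf treats lines starting with '#' or ';' as comments
        if not line or line.startswith(("#", ";")):
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def single_resolver_hosts(configs):
    """configs maps hostname -> resolv.conf text.

    Returns the hosts that list fewer than two nameservers, i.e. the ones
    that stall when their only resolver is rebooted.
    """
    return sorted(
        host for host, text in configs.items() if len(nameservers(text)) < 2
    )


# Hypothetical sample data (host names and addresses are placeholders)
SAMPLE_CONFIGS = {
    "app01": "search example.org\nnameserver 192.0.2.3\n",
    "app02": "# two resolvers\nnameserver 192.0.2.3\nnameserver 192.0.2.4\n",
}

if __name__ == "__main__":
    for host in single_resolver_hosts(SAMPLE_CONFIGS):
        print(host, "has only one resolver")
```

Running a check like this before the outage would give the list of hosts to fix before ns03 goes away.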

External hosts:

19. - bodhost01: 5-6 minute machine reboot time
- people01 - uneventful.
- ibiblio01 - 5-7 minute machine reboot time. uneventful
20. - internetx01: uneventful
- osuosl01: uneventful
21. - sb2 - must wait for ibiblio01 to be up b/c of not having any
external name servers
- sb3 - uneventful
- sb4 - hosted1 listed more 'maxmem' in its config than sb4 had
available - so that had to be edited down. Not sure how that EVER
started
- sb5 - uneventful
22. telia01 - proxy5 did not restart on its own - unknown as to WHY yet
- but it did start manually.
- retrace01 was not set to autostart
tummy1 - uneventful

Now all the proxy* rebooting is over so we can:

23. edit dns: put the other proxy hosts in the wildcard/RR - push and
verify
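
The "must be up before" constraints scattered through the list above can be written down as a small dependency table and sorted, so the order does not have to be remembered. This is only a sketch: the table is one reading of the notes (xen15 and xen14 before xen05, virthost01 before virthost02, xen04 before xen07, ibiblio01 before sb2), not a complete list, and it uses graphlib from the Python 3.9+ standard library.

```python
from graphlib import TopologicalSorter

# host -> set of hosts that must be completely up before it is rebooted.
# Assumed from the notes above; extend as more constraints are found.
MUST_BE_UP_FIRST = {
    "xen05": {"xen14", "xen15"},   # puppet and bastion01 before bastion02 goes down
    "virthost02": {"virthost01"},  # fas03 up before fas02 goes down
    "xen07": {"xen04"},            # ns03 up before ns04 goes down
    "sb2": {"ibiblio01"},          # external name servers
}


def bounce_order(constraints):
    """Return one reboot order that satisfies every constraint."""
    return list(TopologicalSorter(constraints).static_order())


if __name__ == "__main__":
    for position, host in enumerate(bounce_order(MUST_BE_UP_FIRST), start=1):
        print(position, host)
```

Hosts with no constraints can go anywhere in the order, so they are simply left out of the table.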


Build boxes:
- bxen03 had koji2 listed in its set of hosts - but it wasn't running.
This led to some confusion as to how to start the hosts on bxen03 b/c
of insufficient memory for all guests. Eventually I realized bxen04 is
where koji02 was running and that the leftover guest file was never
cleaned up on bxen03.


Things to think about post-outage:
- check all the raid arrays for lost disks - we saw this a couple of
times - it's not pleasant.
- check for downed vpns and/or broken resolution - we need to get a
firm handle on why this is a hassle so often.


Overall things to think about for the future:
1. dumping a complete virsh list - including how much memory is actually
being used per vm per server before we start reboots
2. checking what disks need fscks because of mounted time and doing
those earlier or separately.
3. verifying that all running vms are:
a. intended to be running
b. have a config file
c. are set to autostart
4. verifying that all NOT running vms are:
a. intended to be off
b. are NOT set to autostart
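
Checks 3 and 4 above are mechanical enough to script. A minimal sketch, assuming the per-guest facts (running, intended state, config file present, autostart flag) have already been gathered somehow. The Guest record and the sample values are hypothetical and only illustrate the rules.

```python
from dataclasses import dataclass


@dataclass
class Guest:
    name: str
    running: bool
    intended_running: bool
    has_config: bool
    autostart: bool


def audit(guests):
    """Return (guest name, problem) pairs for every rule a guest breaks."""
    problems = []
    for g in guests:
        if g.running:
            if not g.intended_running:
                problems.append((g.name, "running but not intended to be"))
            if not g.has_config:
                problems.append((g.name, "running without a config file"))
            if not g.autostart:
                problems.append((g.name, "running but not set to autostart"))
        else:
            if g.intended_running:
                problems.append((g.name, "off but intended to be running"))
            if g.autostart:
                problems.append((g.name, "off but set to autostart"))
    return problems


# Illustrative values only, loosely modelled on cases from this outage
SAMPLE_GUESTS = [
    Guest("fas02", running=True, intended_running=True, has_config=True, autostart=False),
    Guest("fas01.stg", running=True, intended_running=True, has_config=False, autostart=False),
    Guest("koji2", running=False, intended_running=False, has_config=True, autostart=False),
]

if __name__ == "__main__":
    for name, problem in audit(SAMPLE_GUESTS):
        print(f"{name}: {problem}")
```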

thoughts welcome.
-sv




_______________________________________________
infrastructure mailing list
infrastructure@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/infrastructure
 
01-12-2011, 08:13 PM
Stephen John Smoogen

Outage notes

On Wed, Jan 12, 2011 at 08:06, seth vidal <skvidal@fedoraproject.org> wrote:
> Hi Everyone,
> I took some notes while we were rebooting boxes I wanted to share them
> with everyone for future outages.
>
> Ordering of the bounces:
> 1. xen14: puppet is on there and if that is back up first we have a
> place to stand for pushing out any changes (dns changes for example via
> puppet) - xen14 takes about 4 minutes to restart/POST

Most of the new IBM hardware can take 4-6 minutes to reboot. I don't
know if there are some flags I should have put in it, but it is deadly
slow.


> Overall things to think about for the future:
> 1. dumping a complete virsh list - including how much memory is actually
> being used per vm per server before we start reboots
> 2. checking what disks need fscks because of mounted time and doing
> those earlier or separately.
> 3. verifying that all running vms are:
> a. intended to be running
> b. have a config file
> c. are set to autostart
> 4. verifying that all NOT running vms are:
> a. intended to be off
> b. are NOT set to autostart

looks good. I thought koji2 was running before the reboots but it may
have been a ghost vm.

> thoughts welcome.
> -sv



--
Stephen J Smoogen.
"The core skill of innovators is error recovery, not failure avoidance."
Randy Nelson, President of Pixar University.
"Let us be kind, one to another, for most of us are fighting a hard
battle." -- Ian MacLaren
 
01-12-2011, 08:15 PM
seth vidal

Outage notes

On Wed, 2011-01-12 at 14:13 -0700, Stephen John Smoogen wrote:
> looks good. I thought koji2 was running before the reboots but it may
> have been a ghost vm.
>

bxen03.phx2.fedoraproject.org:cvs1:running
bxen03.phx2.fedoraproject.org:koji2:shutdown
bxen03.phx2.fedoraproject.org:kojipkgs1:running
bxen03.phx2.fedoraproject.org:pkgs01:running
bxen03.phx2.fedoraproject.org:releng2:running


note the 'shutdown'

it had been moved to bxen04 - but it wasn't cleaned up from bxen03.
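
A listing in the host:guest:state form above is easy to scan for leftovers like this. A minimal sketch, assuming one record per line with exactly three colon-separated fields; how the listing is produced is not covered here, and the sample reuses three lines from above.

```python
def parse_guest_states(text):
    """Parse 'host:guest:state' lines into (host, guest, state) tuples."""
    records = []
    for line in text.splitlines():
        parts = line.strip().split(":")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        records.append(tuple(parts))
    return records


def not_running(records):
    """Return (host, guest) pairs whose state is anything other than 'running'."""
    return [(host, guest) for host, guest, state in records if state != "running"]


SAMPLE = """
bxen03.phx2.fedoraproject.org:cvs1:running
bxen03.phx2.fedoraproject.org:koji2:shutdown
bxen03.phx2.fedoraproject.org:releng2:running
"""

if __name__ == "__main__":
    for host, guest in not_running(parse_guest_states(SAMPLE)):
        print(f"{guest} is defined on {host} but not running")
```

Anything this reports is either a guest that should be started or a stale definition that should be cleaned up.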

-sv

 
01-12-2011, 08:16 PM
Kevin Fenzi

Outage notes

On Wed, 12 Jan 2011 10:06:58 -0500
seth vidal <skvidal@fedoraproject.org> wrote:

...snip...

> 2. checking what disks need fscks because of mounted time and doing
> those earlier or separately.

I wonder if we shouldn't (as a matter of course on install, and before
the next reboot cycle) just disable automatic fscks. They are disabled
by default in fedora, and I have never seen one actually find an issue with
a controlled reboot of a machine with no known disk problems. Disks
that are marked dirty will of course keep checking on boot.
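
Either way, it helps to know ahead of time which filesystems are due for a periodic check. A rough sketch, assuming ext2/3 filesystems and the field names printed by `tune2fs -l` (Mount count, Maximum mount count, Last checked, Check interval). The sample text is invented for illustration, and the date parsing assumes the C locale.

```python
from datetime import datetime

# Date format used in the "Last checked" field (ctime style, C locale assumed)
CTIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_tune2fs(text):
    """Turn 'Key:   value' lines into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def fsck_reasons(fields, now):
    """Return the reasons a periodic fsck would trigger at the next boot."""
    reasons = []
    max_mounts = int(fields["Maximum mount count"])
    if max_mounts > 0 and int(fields["Mount count"]) >= max_mounts:
        reasons.append("mount count")
    interval = int(fields["Check interval"].split()[0])  # seconds, 0 = disabled
    if interval > 0:
        last = datetime.strptime(fields["Last checked"], CTIME_FORMAT)
        if (now - last).total_seconds() >= interval:
            reasons.append("check interval")
    return reasons


# Invented sample output, not taken from any real host
SAMPLE_DUE = """
Mount count:              31
Maximum mount count:      30
Last checked:             Mon Jul 12 09:00:00 2010
Check interval:           15552000 (6 months)
"""

SAMPLE_DISABLED = """
Mount count:              5
Maximum mount count:      -1
Last checked:             Mon Jul 12 09:00:00 2010
Check interval:           0 (<none>)
"""

if __name__ == "__main__":
    now = datetime(2011, 1, 12, 15, 0, 0)
    print(fsck_reasons(parse_tune2fs(SAMPLE_DUE), now))
    print(fsck_reasons(parse_tune2fs(SAMPLE_DISABLED), now))
```

The second sample shows what a filesystem looks like once the periodic checks are turned off (for example with `tune2fs -c 0 -i 0`).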

Just a thought.

kevin
 
01-12-2011, 08:16 PM
Stephen John Smoogen

Outage notes

On Wed, Jan 12, 2011 at 14:15, seth vidal <skvidal@fedoraproject.org> wrote:
> On Wed, 2011-01-12 at 14:13 -0700, Stephen John Smoogen wrote:
>> looks good. I thought koji2 was running before the reboots but it may
>> have been a ghost vm.
>>
>
> bxen03.phx2.fedoraproject.org:cvs1:running
> bxen03.phx2.fedoraproject.org:koji2:shutdown
> bxen03.phx2.fedoraproject.org:kojipkgs1:running
> bxen03.phx2.fedoraproject.org:pkgs01:running
> bxen03.phx2.fedoraproject.org:releng2:running

Ugh, my apologies then. I thought I had found all of them before we did
the reboots.

>
> note the 'shutdown'
>
> it had been moved to bxen04 - but it wasn't cleaned up from bxen03.
>
> -sv



--
Stephen J Smoogen.
"The core skill of innovators is error recovery, not failure avoidance."
Randy Nelson, President of Pixar University.
"Let us be kind, one to another, for most of us are fighting a hard
battle." -- Ian MacLaren
 
01-12-2011, 10:38 PM
Gareth Marchant

Outage notes

On Wed, 2011-01-12 at 14:13 -0700, Stephen John Smoogen wrote:
> On Wed, Jan 12, 2011 at 08:06, seth vidal <skvidal@fedoraproject.org> wrote:
> > Hi Everyone,
> > I took some notes while we were rebooting boxes I wanted to share them
> > with everyone for future outages.
> >
> > Ordering of the bounces:
> > 1. xen14: puppet is on there and if that is back up first we have a
> > place to stand for pushing out any changes (dns changes for example via
> > puppet) - xen14 takes about 4 minutes to restart/POST
>
> Most of the new IBM hardware can take 4-6 minutes to reboot. I don't
> know if there is some flags I should have put in it, but it is deadly
> slow.
>

I have seen in the past that IBM Intel boxes are not always configured for
fast POST. This could be a cause of the slow reboot times, especially given
the amount of installed system memory checked during POST.

>
> > Overall things to think about for the future:
> > 1. dumping a complete virsh list - including how much memory is actually
> > being used per vm per server before we start reboots
> > 2. checking what disks need fscks because of mounted time and doing
> > those earlier or separately.
> > 3. verifying that all running vms are:
> > a. intended to be running
> > b. have a config file
> > c. are set to autostart
> > 4. verifying that all NOT running vms are:
> > a. intended to be off
> > b. are NOT set to autostart
>
> looks good. I thought koji2 was running before the reboots but it may
> have been a ghost vm.
>
> > thoughts welcome.
> > -sv


 
01-13-2011, 04:49 AM
seth vidal

Outage notes

On Wed, 2011-01-12 at 14:16 -0700, Kevin Fenzi wrote:
> On Wed, 12 Jan 2011 10:06:58 -0500
> seth vidal <skvidal@fedoraproject.org> wrote:
>
> ...snip...
>
> > 2. checking what disks need fscks because of mounted time and doing
> > those earlier or separately.
>
> I wonder if we shouldn't (as a matter of course on install, and before
> the next reboot cycle) just disable automatic fscks. They are disabled
> by default in fedora, and I have never seen one actually find an issue with
> a controlled reboot of a machine with no known disk problems. Disks
> that are marked dirty will of course keep checking on boot.
>
> Just a thought.

I've debated that, too.

I'm open to opinions on the subject.

-sv


 
