09-14-2011, 12:51 PM
Miles Fidelman

Worst Admin Mistake? was --> /usr broken, will the machine reboot ?

Bryan Irvine wrote:
Which brings me to another fun question. What's your worst
administration mistake and how did you recover? -Bryan


Discovered the hard way the symptoms of a failing drive in a RAID array,
leading to completely rebuilding an O/S install and restoring from backup.


Had a server that was running slower... and slower... and slower....
Still running, but taking forever to respond to even the simplest
prompts. Couldn't figure out what was wrong - some things made it look
like hardware, some like software.


Long story short: it turned out that one of the drives in a 4-drive RAID
array was experiencing a high, and increasing, raw-read-error rate. Since
the drive's internal firmware was doing re-reads, and eventually
succeeding, the result was that the drive simply slowed down, and pulled
down the response time of the entire array. That's when I discovered
(after the fact) that the Linux md driver doesn't consider long delays a
reason for failing a drive out of an array.


Worse: when you're running a high-availability configuration (Xen,
Pacemaker, DRBD, etc.), one slow drive in an array on one server drags
down the DRBD mirror as well. The good news: when I powered down the
failing system, the backup started to work just fine. The bad news: I
trashed some stuff before figuring this out. Sigh...


If I had known, I could have pulled one drive, plugged in a new one, let
the array rebuild, and kept on going. Unfortunately, what I did was
lots of diagnostics and lots of trial and error, ultimately trashing my
system and some user data (not a lot, thanks to good backups), and in the
end I had to reinstall the O/S and restore from backup.


Four lessons learned:
- RAID and high-availability configurations are vulnerable to a single
drive that is failing slowly: it drags everything down without ever
being failed out of the array
- keep a close eye on the raw-read-error rates of drives (anything over
0 raises questions)
- be sure to purchase server-grade drives (they assume that failures
will be handled by a RAID array, so they spend less time trying to
recover from a read error)
- when one disk starts going, replace them all (assuming that they went
online at the same time)... it's amazing how similar the lifetime is for
all the disks in an array
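
For the second lesson, one way to watch the raw-read-error rate is to parse the attribute table printed by "smartctl -A". What follows is only a rough sketch, not anything from my actual setup: the function names are made up for illustration, and it assumes the usual ten-column table layout with the raw value in the last column. On some drives the raw value of this attribute is a large vendor-specific number even when the drive is healthy, so the trend over time may matter more than the absolute value.

```python
def raw_read_error_rate(smartctl_output):
    """Return the raw value of SMART attribute 1, or None if it is absent.

    Assumes the text is the attribute table from `smartctl -A`, whose rows
    look like:
      1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 37
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0] == "1" \
                and fields[1] == "Raw_Read_Error_Rate":
            try:
                return int(fields[9])  # RAW_VALUE column
            except ValueError:
                return None  # raw value in a format this sketch does not handle
    return None


def needs_attention(smartctl_output):
    """Apply the 'anything over 0 raises questions' rule from the list above."""
    value = raw_read_error_rate(smartctl_output)
    return value is not None and value > 0
```

The text to feed in could come from something like subprocess.run(["smartctl", "-A", "/dev/sda"], capture_output=True, text=True).stdout, run once per member of the array (the device name here is just an example).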


Miles Fidelman

--
In theory, there is no difference between theory and practice.
In<fnord> practice, there is. .... Yogi Berra



--
To UNSUBSCRIBE, email to debian-user-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org

Archive: http://lists.debian.org/4E70A366.8010508@meetinghouse.net
 
09-14-2011, 01:11 PM
Camaleón

On Tue, 13 Sep 2011 15:15:13 -0700, Bryan Irvine wrote:

(...)

> Which brings me to another fun question. What's your worst
> administration mistake and how did you recover?

Hmm, I don't recall any... yet.

But that's because I started administering Linux boxes only a lustrum+3
years ago, so I guess I'm still waiting for "The Day of The Big Mistake"
to come :-P

Greetings,

--
Camaleón


Archive: http://lists.debian.org/pan.2011.09.14.13.11.32@gmail.com
 
09-14-2011, 02:02 PM
Aaron Toponce

On Tue, Sep 13, 2011 at 03:15:13PM -0700, Bryan Irvine wrote:
> Which brings me to another fun question. What's your worst
> administration mistake and how did you recover?

My worst administration mistake was rebooting a rack in our production data
center. I thought I had typed a specific IP address to get to a specific
rack, but fat-fingered one of the numbers in the IP, and it sent me to our
production rack.

My job was to set up the hard drives with software RAID, and put LVM on
them. There were plenty of hints the system was giving me that
should have warned me that I was on the wrong rack, but I continued anyway.

Getting frustrated that I was seeing more devices than expected, I issued a
reboot on most of the servers in that rack. Because those servers were part
of a clustered filesystem, and running many virtual machines, a lot of our
infrastructure went down, and we were down for about 3 hours.

Needless to say, it was a valuable lesson, one I'll never forget. In fact,
it prompted me to use LocalCommand in my ~/.ssh/config, and echo colored
prompts, depending on whether I'm on a production (blinking bold red),
staging (bold yellow) or development (bold green) server.
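
For anyone who wants to try the colored-banner idea, here is a minimal sketch of the kind of helper LocalCommand could call. It is not my actual script: the "prod-" and "stage-" hostname prefixes are a hypothetical naming rule, and the function names are invented for the example. The colors use standard ANSI SGR codes (1 = bold, 5 = blink, 31/33/32 = red/yellow/green).

```python
# ANSI SGR parameters for each environment, matching the colors above.
STYLES = {
    "production": "1;5;31",   # blinking bold red
    "staging": "1;33",        # bold yellow
    "development": "1;32",    # bold green
}


def classify(host):
    """Map a hostname to an environment.

    Hypothetical rule for this sketch: 'prod-*' is production,
    'stage-*' is staging, and anything else counts as development.
    """
    if host.startswith("prod-"):
        return "production"
    if host.startswith("stage-"):
        return "staging"
    return "development"


def banner(host):
    """Return a colored one-line banner naming the environment and host."""
    env = classify(host)
    return f"\033[{STYLES[env]}m*** {env.upper()}: {host} ***\033[0m"
```

To wire it up, a small script that prints banner(sys.argv[1]) can be named in a LocalCommand line with %h as its argument. Note that ssh only runs LocalCommand when PermitLocalCommand is set to yes.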

--
. o . o . o . . o o . . . o .
. . o . o o o . o . o o . . o
o o o . o . . o o o o . o o o
 
09-14-2011, 03:51 PM
Rob Owens

On Tue, Sep 13, 2011 at 11:32:38PM +0000, Walter Hurry wrote:
> On Tue, 13 Sep 2011 15:15:13 -0700, Bryan Irvine wrote:
>
> > Which brings me to another fun question. What's your worst
> > administration mistake and how did you recover?
>
> The worst admin mistake is failure to secure proper backups. Full stop.
>
Early one morning I was experimenting on one of my company's Linux
servers. In the home directory of a test user, I issued:

rm -rf *

I did it on purpose. But it took a long time to remove what should have
only been a handful of files. I hit Ctrl-C and then 'ls'. I realized
that I had a shared company network drive mounted as
/home/testuser/company, and it was deleting everything on that drive!

11 GB of data was deleted from that drive. Thanks to BackupPC, I had
everything restored in 15 minutes. Nothing was lost, and only one
person even noticed what happened.
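
A check I could have run first: look for mount points below the directory before deleting anything in it. The following is a rough sketch using os.path.ismount, with a function name invented for the example. One known limit: os.path.ismount cannot detect a bind mount of a directory on the same filesystem, so this catches a separately mounted network drive like mine but not every case.

```python
import os


def mount_points_under(root, ismount=os.path.ismount):
    """Return the directories below root that are mount points.

    root itself is not reported. The ismount parameter exists only so the
    check can be replaced with a stand-in when experimenting.
    """
    found = []
    for dirpath, dirnames, _filenames in os.walk(root):
        for name in list(dirnames):
            path = os.path.join(dirpath, name)
            if ismount(path):
                found.append(path)
                # Prune: do not walk into the mounted filesystem.
                dirnames.remove(name)
    return found
```

If mount_points_under("/home/testuser") returns anything, that is the moment to stop and unmount before running a recursive delete.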

-Rob


Archive: http://lists.debian.org/20110914155144.GB29689@aurora.owens.net
 
09-14-2011, 04:24 PM
Bryan Irvine

On Wed, Sep 14, 2011 at 7:02 AM, Aaron Toponce <aaron.toponce@gmail.com> wrote:
> On Tue, Sep 13, 2011 at 03:15:13PM -0700, Bryan Irvine wrote:
>> Which brings me to another fun question. What's your worst
>> administration mistake and how did you recover?
>
> My worst administration mistake was rebooting a rack in our production data
> center. I thought I had typed a specific IP address to get to a specific
> rack, but fat-fingered one of the numbers in the IP, and it sent me to our
> production rack.
>
> My job was to set up the hard drives with software RAID, and put LVM on
> them. There were plenty of hints the system was giving me that
> should have warned me that I was on the wrong rack, but I continued anyway.
>
> Getting frustrated that I was seeing more devices than expected, I issued a
> reboot on most of the servers in that rack. Because those servers were part
> of a clustered filesystem, and running many virtual machines, a lot of our
> infrastructure went down, and we were down for about 3 hours.
>
> Needless to say, it was a valuable lesson, one I'll never forget. In fact,
> it prompted me to use LocalCommand in my ~/.ssh/config, and echo colored
> prompts, depending on whether I'm on a production (blinking bold red),
> staging (bold yellow) or development (bold green) server.

Now THAT is genius! I'm going to have to do that. :-)


Archive: http://lists.debian.org/CAG367gb8Y_kA0J7ZNtxqkRRJLpHko1u62CO3s9d8Bf+c=p_q1g@mail.gmail.com
 
09-14-2011, 05:22 PM
Mike McClain

On Wed, Sep 14, 2011 at 08:51:50AM -0400, Miles Fidelman wrote:
> Bryan Irvine wrote:
> >Which brings me to another fun question. What's your worst
> >administration mistake and how did you recover? -Bryan
>
<snip>

I've never administered anyone else's system but my own, but
for several years I was running different versions of Linux,
FreeBSD and Solaris x86 on the same machine (different
partitions) to get a feel for how things were handled by the
different OSs. I was dismayed to find when I installed Red Hat
that it used my Solaris partition as swap during the install.
Ouch,
Mike
--
Satisfied user of Linux since 1997.
O< ascii ribbon campaign - stop html mail - www.asciiribbon.org


Archive: http://lists.debian.org/20110914172213.GA20486@playground
 
09-14-2011, 06:13 PM
Walter Hurry

On Wed, 14 Sep 2011 10:22:14 -0700, Mike McClain wrote:

> On Wed, Sep 14, 2011 at 08:51:50AM -0400, Miles Fidelman wrote:
>> Bryan Irvine wrote:
>> >Which brings me to another fun question. What's your worst
>> >administration mistake and how did you recover? -Bryan
>>
> <snip>
>
> I've never administered anyone else's system but my own, but for several
> years I was running different versions of Linux, FreeBSD and Solaris x86
> on the same machine (different partitions) to get a feel for how things
> were handled by the different OSs. I was dismayed to find when I installed
> Red Hat that it used my Solaris partition as swap during the install.
> Ouch,

It is perhaps unfortunate that partition type 82 is used for both Solaris
and Linux swap. However, *you* get to decide during installation which
partition is used for swap.
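
Since the type byte alone cannot tell the two apart, one way to check before handing a partition to an installer is to look at its contents instead. A formatted Linux swap area carries the signature "SWAPSPACE2" (or the older "SWAP-SPACE") in the last 10 bytes of its first page. Below is a rough sketch of that check; the function name is made up for the example and it assumes a 4096-byte page size, which is not true on every architecture.

```python
# Signatures written by mkswap: the current format and the old one.
SWAP_MAGICS = (b"SWAPSPACE2", b"SWAP-SPACE")


def looks_like_linux_swap(first_page, page_size=4096):
    """Return True if the buffer ends its first page with a swap signature.

    first_page: at least page_size bytes read from the start of the partition
    page_size:  assumed to be 4096 in this sketch
    """
    if len(first_page) < page_size:
        return False
    return first_page[page_size - 10:page_size] in SWAP_MAGICS
```

A Solaris partition with the same type byte should not carry this signature, so a False result is a good reason to stop and look more closely before letting anything use it as swap.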



Archive: http://lists.debian.org/j4qqrv$bfa$2@dough.gmane.org
 
09-14-2011, 06:14 PM
Brad Alexander

I had a case where it had snowed, and instead of driving 50 miles in snow and ice with dodgy DC drivers, I'd work from home. Had my laptop, was doing work. Well, they scheduled a meeting for that afternoon (at about lunch time), so I got ready and headed in to the office. I typed halt in a window on my machine, and went to get my stuff together. Came back a few minutes later and found the laptop was still up. Had inadvertently (I blame focus-follows-mouse) shut down a remote box, our production webserver...


As for non-Linux, I found out that the non-GNU version of killall (as found on AIX) doesn't take a process name; it does just what it says and kills everything, short of halting the box. Was on an AIX box, and needed to kill several licensing servers, so I typed "killall <processname>". After about 5 minutes, lost contact with the box, because it had killed all processes. Since then, I always prefer pkill...


--b

 
09-14-2011, 07:29 PM
Justin The Cynical

On 9/13/11 3:15 PM, Bryan Irvine wrote:

> Which brings me to another fun question. What's your worst
> administration mistake and how did you recover?

Years ago on my main workstation, back in my Slackware days, I was
upgrading Samba from the source tarballs.

I had everything compiled and installed in /usr/local and was trying to
remove the old binaries in /usr/bin. The command was something like this:

root:/usr# rm /usr/bin/smb *

The command ran quickly, and obviously I realised something was wrong
when various things, like ls, stopped working.

Fortunately I had an open instance of Midnight Commander, so I was able
to ftp the now-missing binaries from the original install media on a
second machine running Windows, and then use mc's built-in unarchiving
and chmod/chown functions to unpack the files and fix them so they could
be run.

From this, I learned the importance of watching the spaces in a command
and being aware of what the value of cwd is.
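
To see why that one space mattered, it helps to look at how the command line splits into arguments. A small sketch with Python's shlex follows (the function name is invented for the example): with the space, "*" becomes an argument of its own, which the shell then expands against the current directory, /usr in my case. shlex only splits words and does not expand globs, and it cannot tell a quoted '*' from a bare one, so treat this as an illustration rather than a safety net.

```python
import shlex


def bare_wildcards(command):
    """Return the arguments that consist of nothing but '*' or '?' characters.

    Such an argument matches names in the current directory, whatever
    path was typed just before it.
    """
    args = shlex.split(command)[1:]  # skip the command name itself
    return [arg for arg in args if arg and all(ch in "*?" for ch in arg)]
```

bare_wildcards("rm /usr/bin/smb *") flags the stray "*", while the command I meant to type, "rm /usr/bin/smb*", comes back with nothing flagged.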


Archive: http://lists.debian.org/4E710087.6030402@penguinness.org
 
09-15-2011, 01:53 AM
Andrew Reid

> I had a case where it had snowed, and instead of driving 50 miles in snow
> and ice with dodgy DC drivers, I'd work from home. Had my laptop, was doing
> work. Well, they scheduled a meeting for that afternoon (at about lunch
> time), so I got ready and headed in to the office. I typed halt in a window
> on my machine, and went to get my stuff together. Came back a few minutes
> later and found the laptop was still up. Had inadvertently (I blame
> focus-follows-mouse) shut down a remote box, our production webserver...

You can use "molly-guard" to protect against this -- installed on the
remote system, it prompts for confirmation if a shutdown, reboot, halt,
or poweroff command is entered in a remote shell.

<http://packages.debian.org/squeeze/molly-guard>

There's a legend that the name comes from an actual little girl named
Molly, who was visiting the workplace and tried out the shiny red button.
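
The idea behind molly-guard is simple enough to sketch: if the shell is an SSH session, make the operator type the name of the machine before the shutdown goes ahead. The code below is not molly-guard's own code, just an illustration with an invented function name. It relies on sshd setting the SSH_CONNECTION environment variable for remote sessions.

```python
def allow_shutdown(env, hostname, typed_name):
    """Decide whether a shutdown request should go ahead.

    env:        mapping of environment variables, e.g. os.environ
    hostname:   name of the machine the command is running on
    typed_name: what the operator typed when asked for the hostname
    """
    if "SSH_CONNECTION" not in env:
        return True  # local console: no extra check in this sketch
    # Remote session: only proceed if the operator named this machine.
    return typed_name.strip() == hostname
```

In real use the arguments would be something like os.environ, socket.gethostname() and the result of an input() prompt. Someone who thinks they are halting their laptop will type the laptop's name, and the webserver stays up.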

-- A.
--
Andrew Reid / reidac@bellatlantic.net


Archive: http://lists.debian.org/201109142153.00984.reidac@bellatlantic.net
 

All times are GMT.