02-26-2010, 02:33 AM
Mark Knecht

recovery from /var corruption?

So I got my wife's machine booted today using an install disk and
played a bit with e2fsck. The machine stopped being happy last night
due to some sort of corruption on the /var partition. e2fsck
complained about 3 or 4 files and then repaired the partition. The
machine booted cleanly as far as I can tell.

So, something went bad and I managed to sneak around it for a while
and now I'm sort of living with the machine wondering what to do.

Do I just watch the logs looking for problems? I have no way of
knowing right now whether this was a disk problem that's going to come
back, a one-time deal due to power, or something else entirely.

On these cheap machines that don't use RAID, what's the right way to
go? emerge -e @world and then wait for the next event? Do nothing and
wait?

We've got decent personal data backups as well as basic /etc data.

Thanks,
Mark
 
02-26-2010, 08:09 AM
Neil Bothwick

recovery from /var corruption?

On Thu, 25 Feb 2010 19:33:23 -0800, Mark Knecht wrote:

> So I got my wife's machine booted today using an install disk and
> played a bit with e2fsck. The machine stopped being happy last night
> due to some sort of corruption on the /var partition. e2fsck
> complained about 3 or 4 files and then repaired the partition. The
> machine booted cleanly as far as I can tell.
>
> So, something went bad and I managed to sneak around it for a while
> and now I'm sort of living with the machine wondering what to do.

Check the disk with smartmontools.


--
Neil Bothwick

All mail what i send is thoughly proof-red, definately!
 
02-26-2010, 08:46 AM
Alex Schuster

recovery from /var corruption?

Mark Knecht writes:

> Do I just watch the logs looking for problems? I have no way of
> knowing right now whether this was a disk problem that's going to come
> back, a one-time deal due to power, or something else entirely.
>
> On these cheap machines that don't use RAID, what's the right way to
> go? emerge -e @world and then wait for the next event? Do nothing and
> wait?

Emerge smartmontools, then:

smartctl -H /dev/sda # get overview of what the drive thinks about itself

smartctl -t short /dev/sda # start short self test
Wait
smartctl -l selftest /dev/sda # see results

smartctl -t long /dev/sda # start long self test
Wait a lot longer
smartctl -l selftest /dev/sda # see results

You can continue working in the meantime; there should be little
performance impact. You will see something like this in the log:

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      2275         -
# 2  Extended offline    Completed without error       00%      2270         -
# 3  Extended offline    Completed without error       00%      1799         -
# 4  Extended offline    Completed without error       00%       197         -
# 5  Extended offline    Completed without error       00%        26         -

If you have a '-' in the right column, the disk has found no errors. If
there is a number, then it's the position of the first error.
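
If you want to check that column without eyeballing it, something like this might do. It is only a rough sketch: selftest_errors is a made-up helper name, and the third sample entry is invented for illustration (the first two are from the log above). It reads the self-test log on stdin and prints the last column of every entry where that column is not '-'.

```shell
#!/bin/sh
# Hypothetical helper (not part of smartmontools): reads the output of
# `smartctl -l selftest` on stdin and prints the LBA_of_first_error field
# of every self-test entry whose last column is not '-'.
selftest_errors() {
    awk '/^# *[0-9]+ / && $NF != "-" { print $NF }'
}

# Sample log. Entries 1 and 2 are from the log above; entry 3 is an
# invented failing entry, for illustration only.
sample_log='# 1  Short offline       Completed without error       00%      2275         -
# 2  Extended offline    Completed without error       00%      2270         -
# 3  Extended offline    Completed: read failure       90%      1799         123456'

errors=$(printf '%s\n' "$sample_log" | selftest_errors)
if [ -n "$errors" ]; then
    echo "first error at LBA: $errors"
else
    echo "no self-test errors logged"
fi

# On a real system (as root) the call would be:
#   smartctl -l selftest /dev/sda | selftest_errors
```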

There's also badblocks, this will check every block and output the bad
ones: badblocks -sv /dev/sda

badblocks -svn /dev/sda will do a non-destructive read-write test. In case
of a bad block, the drive should exchange it with a spare one. Maybe this
happens already in read-only mode, I am not sure.

Also watch for errors in syslog or via dmesg, there should be some when
bad blocks are being accessed.

Wonko
 
02-26-2010, 10:47 AM
daid kahl

recovery from /var corruption?

On 26 February 2010 12:33, Mark Knecht <markknecht@gmail.com> wrote:
> So I got my wife's machine booted today using an install disk and
> played a bit with e2fsck. The machine stopped being happy last night
> due to some sort of corruption on the /var partition. e2fsck
> complained about 3 or 4 files and then repaired the partition. The
> machine booted cleanly as far as I can tell.

Hey buddy!

This happened to me, too! See below for my savage ranting for a good laugh.

My rule for this is to rsnapshot my present system as it is, grab a
disk image backup (taken less frequently), and then go to town with
portage.

I emerged 620 packages today. (Much more in fact if I count
rebuilding and stuff.) Only OO.o update is remaining in world.

I don't think there's a good and safe way around it. I find inode
corruption can be sneaky and hit other stuff. Assuming your backups all
exist and stuff, then you can hit up stuff like rsync with the update
flag for your personal files between newest and safest backups.

Rant:
Okay, so Mac OS is getting it to the face now, officially, and forever
in my world. I've almost kind of said this before, and I can't
remember why I don't follow my own advice, but nothing can be worse
than twice-monthly 10% inode corruption.

Now check this out:
The e2fs program is told "do not mount sda3" and "if you ever do,
mount it ro." Even though Mac OS is crazy enough not to use
/etc/fstab, it will still (supposedly) listen to rules in here. I
found some very convoluted way of effectively serial-device referencing
sda3, and I said, "do not mount this drive at boot, and if you do, do
it ro." Then I went into a Disk Utility thing. I told that the same
thing. So that's three times I've said, "Never touch this drive with
a 10 foot pole, plz thx!" Yeah, please explain to me how an
unmounted, only ro drive can receive rectal examination of 11.4%
inode corruption.

Others, please take this as a lesson (in some form or another). I
think it's the badly coded e2fs program, but that thing is so bad that
if it is to blame, it happened after I tried to uninstall the program
too, so who knows. So I'm going to put a tiny Tiger install this
weekend so I can get nice boot, a few firmware accesses (kill the
silly booting sound, and delay an annoying 20 second boot delay in the
case there is no EFI partition...ugh). And then I am never going to
look at its ugly face again.

System Rescue CD, partimage, and rsnapshot are my friends!

(I had so many packages because over the holidays I didn't do sync and
world updates, and then I decided to go back to the wonderful ~x86,
but since I was super busy and I don't like backing up a system that's
untested, then I didn't have good backups of the updates. Maybe a
poor choice, but in any case, that was not the reason I was trying to
kick myself in the face.)

Be bloody lucky,
or don't use broken softwarez---
daid

 
02-26-2010, 02:17 PM
Mark Knecht

recovery from /var corruption?

On Fri, Feb 26, 2010 at 1:46 AM, Alex Schuster <wonko@wonkology.org> wrote:
> Mark Knecht writes:
>
>> Do I just watch the logs looking for problems? I have no way of
>> knowing right now whether this was a disk problem that's going to come
>> back, a 1 time deal due to power, or something else entirely.
>>
>> As these cheap machines that don't use RAID what's the right way to
>> go? emerge -e @world and then wait for the next event? Do nothing and
>> wait?
>
> Emerge smartmontools, then:
>
> smartctl -H /dev/sda  # get overview of what the drive thinks about itself
>
> smartctl -t short /dev/sda  # start short self test
> Wait
> smartctl -l selftest /dev/sda  # see results
>
> smartctl -t long /dev/sda  # start long self test
> Wait a lot longer
> smartctl -l selftest /dev/sda  # see results
>
> You can continue working in the meantime; there should be little
> performance impact. You will see something like this in the log:
>
> === START OF READ SMART DATA SECTION ===
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Short offline       Completed without error       00%      2275         -
> # 2  Extended offline    Completed without error       00%      2270         -
> # 3  Extended offline    Completed without error       00%      1799         -
> # 4  Extended offline    Completed without error       00%       197         -
> # 5  Extended offline    Completed without error       00%        26         -
>
> If you have a '-' in the right column, the disk has found no errors. If
> there is a number, then it's the position of the first error.
>
> There's also badblocks, this will check every block and output the bad
> ones: badblocks -sv /dev/sda
>
> badblocks -svn /dev/sda will do a non-destructive read-write test. In case
> of a bad block, the drive should exchange it with a spare one. Maybe this
> happens already in read-only mode, I am not sure.
>
> Also watch for errors in syslog or via dmesg, there should be some when
> bad blocks are being accessed.
>
>         Wonko
>
>

Hi Wonko,
Yes, I do use smartctl on some other machines although I'm not very
good about it and your write-up is helpful so thanks for that.

My wife's machine is older and I don't think SMART is
supported on her drive. Note the lack of a * on the SMART line in
hdparm -I:

dragonfly ~ # hdparm -I /dev/hda

/dev/hda:

ATA device, with non-removable media
Model Number: WDC WD1600BB-00FTA0
Serial Number: WD-WMAES2091586
Firmware Revision: 15.05R15
Standards:
Supported: 6 5 4
Likely used: 6
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 268435455
LBA48 user addressable sectors: 312581808
Logical/Physical Sector size: 512 bytes
device size with M = 1024*1024: 152627 MBytes
device size with M = 1000*1000: 160041 MBytes (160 GB)
cache/buffer size = 2048 KBytes (type=DualPortCache)
Capabilities:
LBA, IORDY(can be disabled)
Standby timer values: spec'd by Standard, with device specific minimum
R/W multiple sector transfer: Max = 16 Current = 16
Recommended acoustic management value: 128, current value: 254
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
        SMART feature set
        Security Mode feature set
   *    Power Management feature set
   *    Write cache
   *    Look-ahead
   *    Host Protected Area feature set
   *    WRITE_BUFFER command
   *    READ_BUFFER command
   *    DOWNLOAD_MICROCODE
        SET_MAX security extension
        Automatic Acoustic Management feature set
   *    48-bit Address feature set
   *    Device Configuration Overlay feature set
   *    Mandatory FLUSH_CACHE
   *    FLUSH_CACHE_EXT
   *    SMART error logging
   *    SMART self-test
Security:
supported
not enabled
not locked
not frozen
not expired: security count
not supported: enhanced erase
HW reset results:
CBLID- above Vih
Device num = 0 determined by CSEL
Checksum: correct
dragonfly ~ #

dragonfly ~ # smartctl -H /dev/hda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

SMART Disabled. Use option -s with argument 'on' to enable it.
dragonfly ~ # smartctl -s on /dev/hda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
Error SMART Enable failed: Input/output error
Smartctl: SMART Enable Failed.

A mandatory SMART command failed: exiting. To continue, add one or
more '-T permissive' options.
dragonfly ~ #

I've not tried the -T permissive options.
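
For anyone wanting to script the "is there a * on that line" check from the hdparm -I listing, here is a minimal sketch. feature_enabled is a made-up helper name, and the sample text is just three lines in the style of the Commands/features section above. A feature listed without a * is supported by the drive but not currently enabled.

```shell
#!/bin/sh
# Hypothetical helper: reads `hdparm -I` output on stdin and succeeds if the
# line naming the given feature carries a leading '*' (i.e. is enabled).
feature_enabled() {
    grep -Eq "^[[:space:]]*\*[[:space:]]*$1[[:space:]]*\$"
}

# Three sample lines in the style of the Commands/features section.
sample='        SMART feature set
   *    Power Management feature set
   *    SMART self-test'

for f in "SMART feature set" "SMART self-test"; do
    if printf '%s\n' "$sample" | feature_enabled "$f"; then
        echo "$f: enabled"
    else
        echo "$f: not marked enabled"
    fi
done

# On a real system (as root) the call would be:
#   hdparm -I /dev/hda | feature_enabled "SMART feature set"
```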

I've never used badblocks as it seems I should only do that off-line.
This might be a good time to boot with a CD and try it out.

Maybe I should just get a new drive that supports SMART?

- Mark
 
02-26-2010, 03:01 PM
Alex Schuster

recovery from /var corruption?

Mark Knecht writes:

> Yes, I do use smartctl on some other machines although I'm not very
> good about it and your write-up is helpful so thanks for that.
>
> My wife's machine is older and I don't think SMART is
> supported on her drive. Note the lack of a * on the SMART line in
> hdparm -I:

Okay, but it still states:

> * SMART error logging
> * SMART self-test

So maybe smartctl -t long /dev/hda still works? Just give it a try.


> dragonfly ~ # smartctl -H /dev/hda
> smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce
> Allen Home page is http://smartmontools.sourceforge.net/
>
> SMART Disabled. Use option -s with argument 'on' to enable it.
> dragonfly ~ # smartctl -s on /dev/hda
> smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce
> Allen Home page is http://smartmontools.sourceforge.net/
>
> === START OF ENABLE/DISABLE COMMANDS SECTION ===
> Error SMART Enable failed: Input/output error
> Smartctl: SMART Enable Failed.
>
> A mandatory SMART command failed: exiting. To continue, add one or
> more '-T permissive' options.
> dragonfly ~ #
>
> I've not tried the -T permissive options.

I would. There is also a BIOS setting for SMART, but I think this does
not matter here; it's only for being able to report a failing drive
before booting.

> I've never used badblocks as it seems I should only do that off-line.
> This might be a good time to boot with a CD and try it out.

In read-only mode, you can use it when the system is running. Only the
write test (option -n) refuses to run if partitions are mounted from the
drive. So I'd do the 'badblocks -sv /dev/hda' right now, if you do not
need the drive at full speed for a while. You can interrupt it at any
point with Ctrl-Z and continue with the fg command.

> Maybe I should just get a new drive that supports SMART?

If the drive is so old that it does not support SMART, you can probably
get one ten times as large for much less than the old one cost you. And I
would trust a new drive much more than such an old one. It depends on how
important the data is: if a total loss would not be too painful, I had
backups, and I did not need more speed or size, I would keep it as long as
it shows no errors.

Wonko
 
02-26-2010, 03:53 PM
Mark Knecht

recovery from /var corruption?

On Fri, Feb 26, 2010 at 8:01 AM, Alex Schuster <wonko@wonkology.org> wrote:
> Mark Knecht writes:
>
>> Yes, I do use smartctl on some other machines although I'm not very
>> good about it and your write-up is helpful so thanks for that.
>>
>> My wife's machine is older and I don't think SMART is
>> supported on her drive. Note the lack of a * on the SMART line in
>> hdparm -I:
>
> Okay, but it still states:
>
>>    *    SMART error logging
>>    *    SMART self-test
>
> So maybe smartctl -t long /dev/hda still works? Just give it a try.

No, -t long fails the same way. Basically every time I try to use
smartctl on the drive it seems to issue one of these 3-line reports
about SectorIDNotFound in dmesg. My other machines don't do this. Not
a good sign I think...

hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
hda: task_no_data_intr: error=0x10 { SectorIdNotFound }, LBAsect=16777008, sector=18446744073709551615
hda: possibly failed opcode: 0xb0
hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
hda: task_no_data_intr: error=0x10 { SectorIdNotFound }, LBAsect=262192, sector=18446744073709551615
hda: possibly failed opcode: 0xb0
hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
hda: task_no_data_intr: error=0x10 { SectorIdNotFound }, LBAsect=48, sector=18446744073709551615
hda: possibly failed opcode: 0xb0
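
To keep an eye on reports like these over time, a small filter along the following lines could pull the LBAsect values out of the kernel log. This is a sketch only: failed_lbas is a made-up name, the sample text is the dmesg output above with the wrapped lines joined, and it relies on grep -o, which GNU and BSD grep provide. Note that 0xb0 is the ATA opcode for SMART commands, so in this particular case the values may say more about the rejected SMART command than about a bad spot on the platter.

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log text on stdin and prints the unique
# LBAsect values from error lines for the given device, sorted numerically.
failed_lbas() {
    grep "$1: .*error=" | grep -o 'LBAsect=[0-9]*' | cut -d= -f2 | sort -nu
}

# Sample input: the dmesg lines quoted above, with wrapped lines joined.
sample_dmesg='hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
hda: task_no_data_intr: error=0x10 { SectorIdNotFound }, LBAsect=16777008, sector=18446744073709551615
hda: possibly failed opcode: 0xb0
hda: task_no_data_intr: error=0x10 { SectorIdNotFound }, LBAsect=262192, sector=18446744073709551615
hda: task_no_data_intr: error=0x10 { SectorIdNotFound }, LBAsect=48, sector=18446744073709551615'

printf '%s\n' "$sample_dmesg" | failed_lbas hda

# On a real system the call would be:
#   dmesg | failed_lbas hda
```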

These commands create the same sort of lines in dmesg:

dragonfly ~ # smartctl -i /dev/hda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar family
Device Model: WDC WD1600BB-00FTA0
Serial Number: WD-WMAES2091586
Firmware Version: 15.05R15
User Capacity: 160,041,885,696 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 6
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Fri Feb 26 08:49:00 2010 PST
SMART support is: Available - device has SMART capability.
SMART support is: Disabled

SMART Disabled. Use option -s with argument 'on' to enable it.
dragonfly ~ # smartctl -P show /dev/hda
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Drive found in smartmontools Database. Drive identity strings:
MODEL: WDC WD1600BB-00FTA0
FIRMWARE: 15.05R15
match smartmontools Drive Database entry:
MODEL REGEXP: ^WDC WD(2|3|4|6|8|10|12|16|18|20|25)00BB-.*$
FIRMWARE REGEXP: .*
MODEL FAMILY: Western Digital Caviar family
ATTRIBUTE OPTIONS: None preset; no -v options are required.
dragonfly ~ #


<SNIP>
>>
>> I've not tried the -T permissive options.
>
> I would. There is also a BIOS setting for SMART, but I think this does
> not matter here; it's only for being able to report a failing drive
> before booting.

Tried -T permissive and -T verypermissive. Same result: more lines of
output, and I'm still told SMART is not turning on.

Could this have ANYTHING to do with kernel configuration? Is there
anything required at the kernel level that I might not have turned on?

>
>> I've never used badblocks as it seems I should only do that off-line.
>> This might be a good time to boot with a CD and try it out.
>
> In read-only mode, you can use it when the system is running. Only the
> write test (option -n) refuses to run if partitions are mounted from the
> drive. So I'd do the 'badblocks -sv /dev/hda' right now, if you do not
> need the drive at full speed for a while. You can interrupt it at any
> point with Ctrl-Z and continue with the fg command.
>
OK, I've started that test and will report back later what it says.

Thanks!

- Mark
 
02-26-2010, 04:27 PM
Alex Schuster

recovery from /var corruption?

Mark Knecht writes:

> On Fri, Feb 26, 2010 at 8:01 AM, Alex Schuster <wonko@wonkology.org>
> wrote:

> > Okay, but it still states:
> >> * SMART error logging
> >> * SMART self-test
> >
> > So maybe smartctl -t long /dev/hda still works? Just give it a try.
>
> No, -t long fails the same way. Basically every time I try to use
> smartctl on the drive it seems to issue one of these 3-line reports
> about SectorIDNotFound in dmesg. My other machines don't do this. Not
> a good sign I think...
>
> hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
> hda: task_no_data_intr: error=0x10 { SectorIdNotFound },
> LBAsect=16777008, sector=18446744073709551615
> hda: possibly failed opcode: 0xb0

Uh-oh. Okay, I guess it just won't work then.


> Could this have ANYTHING to do with kernel configuration? Is there
> anything required at the kernel level that I might not have turned on?

I'm pretty sure it has nothing to do with the kernel, but with your drive
being incapable of the SMART commands.

But I guess using badblocks is not that different in the end. The SMART
self-test runs inside the drive in the background and does not create disk
I/O from the host, but I think what it does is not much different from
badblocks.

Wonko
 
02-26-2010, 04:38 PM
daid kahl

recovery from /var corruption?

On 26 February 2010 12:33, Mark Knecht <markknecht@gmail.com> wrote:
> So I got my wife's machine booted today using an install disk and
> played a bit with e2fsck. The machine stopped being happy last night
> due to some sort of corruption on the /var partition. e2fsck
> complained about 3 or 4 files and then repaired the partition. The
> machine booted cleanly as far as I can tell.
>
> So, something went bad and I managed to sneak around it for a while
> and now I'm sort of living with the machine wondering what to do.
>
> Do I just watch the logs looking for problems? I have no way of
> knowing right now whether this was a disk problem that's going to come
> back, a one-time deal due to power, or something else entirely.
>
> On these cheap machines that don't use RAID, what's the right way to
> go? emerge -e @world and then wait for the next event? Do nothing and
> wait?
>
> We've got decent personal data backups as well as basic /etc data.
>
> Thanks,
> Mark
>

I reconsidered your problem, and I actually wonder if emerging world
is a valid notion in this case, as the world file is under /var
(/var/lib/portage/world) and /var is what was reported as corrupt.

In this sense, it may be entirely non-trivial to regenerate (without
backup) the correct world file for a system.
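
If an older copy of the world file does exist in a backup, one way to see whether the fsck repair cost any entries is to compare the two. A rough sketch, assuming such a backup copy is available: world_diff is a made-up helper name, and the package lists in the demo are invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: world_diff CURRENT BACKUP
# Prints entries present only in BACKUP as "missing:" and entries present
# only in CURRENT as "extra:".
world_diff() {
    cur=$(mktemp) || return 1
    bak=$(mktemp) || { rm -f "$cur"; return 1; }
    sort -u "$1" > "$cur"
    sort -u "$2" > "$bak"
    comm -13 "$cur" "$bak" | sed 's/^/missing: /'
    comm -23 "$cur" "$bak" | sed 's/^/extra: /'
    rm -f "$cur" "$bak"
}

# Demo with invented package lists in a scratch directory.
demo=$(mktemp -d)
printf '%s\n' app-editors/vim sys-apps/smartmontools > "$demo/world.backup"
printf '%s\n' sys-apps/smartmontools www-client/firefox > "$demo/world.current"
world_diff "$demo/world.current" "$demo/world.backup"

# On a real system the first argument would be /var/lib/portage/world and
# the second the copy from the backup.
```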

Am I out in the deep end, or is this, in fact, the critical point that
needs consideration here?

~daid
 
02-26-2010, 04:51 PM
Mark Knecht

recovery from /var corruption?

On Fri, Feb 26, 2010 at 9:27 AM, Alex Schuster <wonko@wonkology.org> wrote:
> Mark Knecht writes:
>
>> On Fri, Feb 26, 2010 at 8:01 AM, Alex Schuster <wonko@wonkology.org>
>> wrote:
>
>> > Okay, but it still states:
>> >>    *    SMART error logging
>> >>    *    SMART self-test
>> >
>> > So maybe smartctl -t long /dev/hda still works? Just give it a try.
>>
>> No, -t long fails the same way. Basically every time I try to use
>> smartctl on the drive it seems to issue one of these 3-line reports
>> about SectorIDNotFound in dmesg. My other machines don't do this. Not
>> a good sign I think...
>>
>> hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
>> hda: task_no_data_intr: error=0x10 { SectorIdNotFound },
>> LBAsect=16777008, sector=18446744073709551615
>> hda: possibly failed opcode: 0xb0
>
> Uh-oh. Okay, I guess it just won't work then.
>
>
>> Could this have ANYTHING to do with kernel configuration? Is there
>> anything required at the kernel level that I might not have turned on?
>
> I'm pretty sure it has nothing to do with the kernel, but with your drive
> being incapable of the SMART commands.
>
> But I guess using badblocks is not that different in the end. The SMART
> self-test runs inside the drive in the background and does not create disk
> I/O from the host, but I think what it does is not much different from
> badblocks.
>
>         Wonko
>
>

The machine _mostly_ crashed while running badblocks. I say mostly
because the mouse is still alive but I can no longer ssh in and cannot
open a terminal on my wife's desktop or get to the console.

I tried to Ctrl-C out of badblocks here (this is running shelled
in) before I figured out it was a total crash, which messed up the
terminal a bit, but you can see what it was reporting before the crash:

dragonfly ~ # badblocks -sv /dev/hda
Checking blocks 0 to 156290903
Checking for bad blocks (read-only test): 89360960done, 35:00 elapsed
89360961done, 35:09 elapsed
89360962
89360963
^C^C18% done, 35:27 elapsed

So, there seem to be problems, possibly with the drive, or maybe it's
some sort of overheating problem on the processor and this was just
the way the processor failed before the crash?

I ran memtest86 night before last for 8 hours and had no memory
problems. I'll remove memory and PCI cards, reseat everything, and
then see what happens.

- Mark
 

All times are GMT.
