Old 12-10-2008, 08:15 PM
lee
 
problem with SATA disk, difference between standard kernel and Debian kernel

Hi,

what's the difference between a standard kernel and a kernel that
comes as a Debian package?

I'm using a standard kernel, but I'm having problems with one of my
disks (see below). The disk "gets lost" every now and then, i.e. it
now takes a couple of days or weeks (I've seen it take as long as
about two months with the old board) before it happens. The disk
remains unavailable until I turn the power off and back on. Once the
disk is back, I can re-add the partitions on the failed disk to the md
devices, and they are being rebuilt just fine, and it works for some
time until the disk "gets lost" again.
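
A minimal dry-run sketch of the re-add step described above, assuming the array-to-partition mapping is filled in by hand. It only builds and prints the mdadm command lines and does not execute them. The pairing of /dev/md1 with /dev/sdb2 is taken from the mdadm message in the syslog excerpt; ARRAY_MEMBERS and readd_commands are hypothetical names, and any further pairs should be checked against /proc/mdstat first.

```python
# Dry-run sketch: build (but do not run) the mdadm commands that add the
# partitions of a returned disk back to their RAID1 arrays.

# Hypothetical mapping of md arrays to partitions on the disk that came back.
# Only md1 -> sdb2 is confirmed by the syslog excerpt; add the other arrays
# after checking `cat /proc/mdstat` on the real machine.
ARRAY_MEMBERS = {
    "/dev/md1": "/dev/sdb2",
}


def readd_commands(members):
    """Return one `mdadm <array> --add <partition>` argv list per array."""
    return [["mdadm", array, "--add", part]
            for array, part in sorted(members.items())]


if __name__ == "__main__":
    # Print the commands for review instead of running them as root.
    for argv in readd_commands(ARRAY_MEMBERS):
        print(" ".join(argv))
```

Printing the commands first keeps the step reviewable; running them needs root and a disk that the kernel has detected again.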

This problem isn't new; it has been there with another board/CPU/RAM,
cables and power supply ever since I got the two SATA disks new. It's
been there with every standard kernel I tried over the years, with
i386, and now it's the same with amd64. I've been thinking it was a
problem of the board I had, but as it's there with another board etc.,
it must be either the disk itself or the SATA driver.

Googling revealed that this isn't a rare problem. There are people
reporting it with all kinds of different disks and boards and
different distributions. Some suggest that it's a problem with the PSU
or the SATA cables, but imho that's unlikely. Interestingly, it seems
to be more common for this problem to show up in RAID setups.

Also interestingly, mdadm did *not* detect the disk failure for
/dev/md2 which is mounted read only.

And even more interestingly, the problem is and has always been with
/dev/sdb, never with /dev/sda. I can't tell if the disks have been
swapped when I connected them to the new board, though. But I'd rule
out a problem with the firmware of the disk as well since both disks
use the same firmware version.

So is there a difference between Debian and standard kernels so that I
might not have this problem if I'd use a Debian kernel? Has this
problem been solved in some way yet?

I might get another two disks, but I'm afraid that the same problem
would come up with other disks as well ...


Info:

cat:/home/lee# uname -a
Linux cat 2.6.27.7-cat-smp #4 SMP Thu Dec 4 16:03:29 CST 2008 x86_64 GNU/Linux
cat:/home/lee# smartctl -i /dev/sda
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family: Maxtor MaXLine III family (SATA/300)
Device Model: Maxtor 7V300F0
Serial Number: V604E3FG
Firmware Version: VA111630
User Capacity: 300,090,728,448 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 0
Local Time is: Wed Dec 10 15:00:04 2008 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

cat:/home/lee# smartctl -i /dev/sdb
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family: Maxtor MaXLine III family (SATA/300)
Device Model: Maxtor 7V300F0
Serial Number: V601T7VG
Firmware Version: VA111630
User Capacity: 300,090,728,448 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 0
Local Time is: Wed Dec 10 15:00:42 2008 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

cat:/home/lee# lspci
[...]
00:1f.2 SATA controller: Intel Corporation 82801IB (ICH9) 4 port SATA AHCI Controller (rev 02)


syslog:


Dec 10 00:09:10 cat kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 10 00:09:10 cat kernel: ata5.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Dec 10 00:09:10 cat kernel: res 40/00:00:00:4f:c2/00:00:00:c2:00/00 Emask 0x4 (timeout)
Dec 10 00:09:10 cat kernel: ata5.00: status: { DRDY }
Dec 10 00:09:10 cat kernel: ata5: hard resetting link
Dec 10 00:09:10 cat kernel: ata5: SATA link down (SStatus 0 SControl 300)
Dec 10 00:09:15 cat kernel: ata5: hard resetting link
Dec 10 00:09:16 cat kernel: ata5: SATA link down (SStatus 0 SControl 300)
Dec 10 00:09:21 cat kernel: ata5: hard resetting link
Dec 10 00:09:21 cat kernel: ata5: SATA link down (SStatus 0 SControl 300)
Dec 10 00:09:21 cat kernel: ata5.00: disabled
Dec 10 00:09:21 cat kernel: end_request: I/O error, dev sdb, sector 478543967
Dec 10 00:09:21 cat kernel: md: super_written gets error=-5, uptodate=0
Dec 10 00:09:21 cat kernel: raid1: Disk failure on sdb2, disabling device.
Dec 10 00:09:21 cat kernel: raid1: Operation continuing on 1 devices.
Dec 10 00:09:21 cat kernel: ata5: EH complete
Dec 10 00:09:21 cat kernel: ata5.00: detaching (SCSI 4:0:0:0)
Dec 10 00:09:21 cat kernel: sd 4:0:0:0: [sdb] Synchronizing SCSI cache
Dec 10 00:09:21 cat kernel: sd 4:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00
Dec 10 00:09:21 cat kernel: sd 4:0:0:0: [sdb] Stopping disk
Dec 10 00:09:21 cat kernel: sd 4:0:0:0: [sdb] START_STOP FAILED
Dec 10 00:09:21 cat kernel: sd 4:0:0:0: [sdb] Result: hostbyte=0x04 driverbyte=0x00
Dec 10 00:09:21 cat kernel: RAID1 conf printout:
Dec 10 00:09:21 cat kernel: --- wd:1 rd:2
Dec 10 00:09:21 cat kernel: disk 0, wo:0, o:1, dev:sda2
Dec 10 00:09:21 cat kernel: disk 1, wo:1, o:0, dev:sdb2
Dec 10 00:09:21 cat kernel: RAID1 conf printout:
Dec 10 00:09:21 cat kernel: --- wd:1 rd:2
Dec 10 00:09:21 cat kernel: disk 0, wo:0, o:1, dev:sda2
Dec 10 00:09:21 cat kernel: scsi 4:0:0:0: rejecting I/O to dead device
Dec 10 00:09:21 cat kernel: scsi 4:0:0:0: rejecting I/O to dead device
Dec 10 00:09:21 cat kernel: end_request: I/O error, dev sdb, sector 146496512
Dec 10 00:09:21 cat kernel: md: super_written gets error=-5, uptodate=0
Dec 10 00:09:21 cat kernel: raid1: Disk failure on sdb1, disabling device.
Dec 10 00:09:21 cat kernel: raid1: Operation continuing on 1 devices.
Dec 10 00:09:21 cat mdadm[1995]: Fail event detected on md device /dev/md1, component device /dev/sdb2
Dec 10 00:09:21 cat kernel: RAID1 conf printout:
Dec 10 00:09:21 cat kernel: --- wd:1 rd:2
Dec 10 00:09:21 cat kernel: disk 0, wo:0, o:1, dev:sda1
Dec 10 00:09:21 cat kernel: disk 1, wo:1, o:0, dev:sdb1
Dec 10 00:09:21 cat kernel: RAID1 conf printout:
Dec 10 00:09:21 cat kernel: --- wd:1 rd:2
Dec 10 00:09:21 cat kernel: disk 0, wo:0, o:1, dev:sda1
Dec 10 00:10:21 cat mdadm[1995]: Fail event detected on md device /dev/md0
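
To compare several occurrences of this failure, it may help to reduce each syslog excerpt to a short summary. The following is a sketch only: the regular expressions are assumptions fitted to the exact message wording in the excerpt above, and summarize is a hypothetical helper, not part of any tool.

```python
import re

# A few lines copied from the syslog excerpt above.
SAMPLE = """\
Dec 10 00:09:10 cat kernel: ata5: hard resetting link
Dec 10 00:09:10 cat kernel: ata5: SATA link down (SStatus 0 SControl 300)
Dec 10 00:09:21 cat kernel: ata5.00: disabled
Dec 10 00:09:21 cat kernel: raid1: Disk failure on sdb2, disabling device.
Dec 10 00:09:21 cat kernel: raid1: Disk failure on sdb1, disabling device.
"""

# Patterns fitted to the wording of this excerpt; other kernel versions
# may phrase these messages differently.
LINK_DOWN = re.compile(r"kernel: (ata\d+): SATA link down")
DISABLED = re.compile(r"kernel: (ata\d+\.\d+): disabled")
MD_FAIL = re.compile(r"kernel: raid1: Disk failure on (\w+), disabling device")


def summarize(log_text):
    """Count link-down events per port and list disabled devices and failed md members."""
    summary = {"link_down": {}, "disabled": [], "md_failed": []}
    for line in log_text.splitlines():
        match = LINK_DOWN.search(line)
        if match:
            port = match.group(1)
            summary["link_down"][port] = summary["link_down"].get(port, 0) + 1
            continue
        match = DISABLED.search(line)
        if match:
            summary["disabled"].append(match.group(1))
            continue
        match = MD_FAIL.search(line)
        if match:
            summary["md_failed"].append(match.group(1))
    return summary


if __name__ == "__main__":
    print(summarize(SAMPLE))
```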


--
"Don't let them, daddy. Don't let the stars run down."
http://adin.dyndns.org/adin/TheLastQ.htm


--
To UNSUBSCRIBE, email to debian-user-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
 
Old 12-10-2008, 09:52 PM
lee
 

On Wed, Dec 10, 2008 at 04:05:26PM -0600, Boyd Stephen Smith Jr. wrote:

> >So is there a difference between Debian and standard kernels so that I
> >might not have this problem if I'd use a Debian kernel?
>
> Not that I know of.

Hm, maybe I'll just try one and see what happens.

> Yeah, it's a problem, but it's virtually impossible to diagnose that kind of
> error without instrumenting (jargon: attaching real-/run-time sensors to) the
> kernel and reproducing the problem, many times.

It's not something I could reproduce, it "just happens" for no
apparent reason, at irregular intervals.

> Causing the kernel to "dump" (similar to a process coredumping, but
> the whole kernel) when some symptom (super_written get error = -5,
> maybe?) manifests might give you an image that a kernel hacker could
> perform a post-mortem on. Enough dumps might show a pattern.

Hm. It seems that there is already an attempt made to recover from
this error, so that might be a place to somehow put a hook on. The
problem is that the recovery attempt doesn't work; the only thing that
"works" is turning the power off and back on.

> If you can find a kernel that does work, you might be able to do a
> "git bisect" and identify the patch(es) that broke you -- but that
> would certainly be a project.

Well, that would go back about 4 years or so --- it might be in there
since they switched to libata (or whatever happened).

> How much resources do you want to spend on fixing the problem? (If
> you kick in enough, I'll bet the kernel hackers will kick in some,
> too.)

I can spend some time on it, try out different kernels, maybe get it
to produce dumps ... But I don't know where I would start, other than
looking at the source --- which probably won't tell me anything.

But I'm wondering how many people have this problem. There are
probably lots of people with SATA disks, and if most of them had this
problem, it might have already been solved. If lots of people have
SATA disks but don't have this problem, I might get away with getting
new disks. But maybe lots of people have it and just live with it?

Or maybe there are not so many people with SATA disks? The Debian
amd64 installer wasn't even able to install on SATA disks because the
kernel module for the controller wasn't available, and I don't have
any unusual hardware. I had to install on the IDE disk I wanted to get
rid of instead --- and next time I get a new board, it might not
have any IDE connectors and I'll be screwed when trying to install ...


 
Old 12-10-2008, 10:06 PM
martin f krafft
 

also sprach lee <lee@yun.yagibdah.de> [2008.12.10.2215 +0100]:
> I'm using a standard kernel, but I'm having problems with one of my
> disks (see below).
[...]

> So is there a difference between Debian and standard kernels so
> that I might not have this problem if I'd use a Debian kernel? Has
> this problem been solved in some way yet?

You don't provide much information. How about you try the Debian
kernels and see if the problem persists?

--
.'`. martin f. krafft <madduck@d.o> Related projects:
: :' : proud Debian developer http://debiansystem.info
`. `'` http://people.debian.org/~madduck http://vcs-pkg.org
`- Debian - when you have better things to do than fixing systems

remember, half the people are below average.
 
Old 12-10-2008, 11:25 PM
Florian Kulzer
 

On Wed, Dec 10, 2008 at 16:21:26 -0600, Ron Johnson wrote:
> On 12/10/08 16:10, Celejar wrote:
>> On Wed, 10 Dec 2008 16:05:26 -0600 "Boyd Stephen Smith Jr." wrote:

[...]

>>> The Debian kernel has some non-free (as in: source not available)
>>> parts removed. There are also Debian-specific patches added.
>>
>> The vanilla kernel has non-free stuff in it? I thought it's all GPL.
>
> Some drivers have firmware blobs encoded in them as really long byte
> arrays. They are de jure GPL, but, practically, are closed source.

Some of these blobs seem to have quite serious license problems, e.g.:

http://bugzilla.kernel.org/show_bug.cgi?id=10750

AFAICT, upstream might eventually have to remove such blobs as well, or
at least try harder to make sure that they were indeed licensed by the
copyright holder for distribution in the kernel. It seems reasonable to
me that Debian tries to be extra careful in these cases, keeping in mind
the "100% free" guarantee in the social contract.

--
Regards, | http://users.icfo.es/Florian.Kulzer
Florian |


 
Old 12-11-2008, 12:14 AM
lee
 

On Wed, Dec 10, 2008 at 06:30:02PM -0600, Boyd Stephen Smith Jr. wrote:
> On Wednesday 2008 December 10 16:52:03 lee wrote:
> >But I'm wondering how many people have this problem. There are
> >probably lots of people with SATA disks, and if most of them had this
> >problem, it might have already been solved. If lots of people have
> >SATA disks but don't have this problem, I might get away with getting
> >new disks. But maybe lots of people have it and just live with it?
>
> Two WD Raptors connected via SATA to my Tyan motherboard in my desktop since
> 2005. No drops.
>
> A varying number of Hitachi drives (both 500G and 1000G) connected via SATA to
> my Areca PCI-X controller. No drops.

Hm, so I might just have bad luck with (one of) these disks.

> If you can find enough people with the same problems and co-ordinate, you
> might be able to reduce the amount of effort each of you put forward. If
> time is what it takes to reproduce, that's fine -- just figure out what a
> good upper bound is. Is 1 day long enough without drops long enough to say
> the kernel is good? 7 days? 30 days?

I'd say 1/2 year in this case.


 
Old 12-11-2008, 12:19 AM
lee
 

On Thu, Dec 11, 2008 at 12:06:27AM +0100, martin f krafft wrote:
> also sprach lee <lee@yun.yagibdah.de> [2008.12.10.2215 +0100]:
> > I'm using a standard kernel, but I'm having problems with one of my
> > disks (see below).
> [...]
>
> > So is there a difference between Debian and standard kernels so
> > that I might not have this problem if I'd use a Debian kernel? Has
> > this problem been solved in some way yet?
>
> You don't provide much information.

Well, I can provide more. What information do I need to provide?

> How about you try the Debian kernels and see if the problem
> persists?

Yeah, I'll probably do that next time it happens. For now, I've used
hdparm to disable the powermanagement --- not that I enabled it, but
I'll just have to see if it makes a difference.


 
Old 12-11-2008, 12:59 AM
lee
 

On Wed, Dec 10, 2008 at 07:30:40PM -0600, Boyd Stephen Smith Jr. wrote:
> On Wednesday 2008 December 10 19:14:37 lee wrote:
> >On Wed, Dec 10, 2008 at 06:30:02PM -0600, Boyd Stephen Smith Jr. wrote:
> >> If
> >> time is what it takes to reproduce, that's fine -- just figure out what a
> >> good upper bound is. Is 1 day long enough without drops long enough to
> >> say the kernel is good? 7 days? 30 days?
> >
> >I'd say 1/2 year in this case.
>
> Without some help from others, you probably won't be able to check kernels as
> fast as they release new ones, so start looking for help.

Yeah, that's what I was thinking.
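
A rough back-of-the-envelope sketch of why this gets slow: a bisection needs roughly log2(N) test runs for N candidate kernels, and each run here would last about half a year. The candidate count of 16 and the 182-day soak time are illustrative assumptions, and the function names are hypothetical.

```python
def bisect_steps(candidates):
    """Approximate number of test runs to bisect `candidates` revisions.

    Uses ceil(log2(candidates)) computed with integers to avoid float rounding.
    """
    if candidates < 1:
        raise ValueError("need at least one candidate")
    return (candidates - 1).bit_length()


def bisect_duration_days(candidates, days_per_test):
    """Total wall-clock days if every test run must soak for `days_per_test`."""
    return bisect_steps(candidates) * days_per_test


if __name__ == "__main__":
    # Assumed numbers: 16 candidate kernels, about half a year (182 days) each.
    steps = bisect_steps(16)
    days = bisect_duration_days(16, 182)
    print(steps, "test runs,", days, "days")
```

With these assumed numbers the bisection alone would take about two years, which is why sharing the testing among several affected machines matters.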


 
Old 12-11-2008, 01:37 AM
Alex Samad
 

On Wed, Dec 10, 2008 at 03:15:56PM -0600, lee wrote:
> Hi,
>

[snip]

> smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> === START OF INFORMATION SECTION ===
> Model Family: Maxtor MaXLine III family (SATA/300)
> Device Model: Maxtor 7V300F0
> Serial Number: V604E3FG
> Firmware Version: VA111630
> User Capacity: 300,090,728,448 bytes
> Device is: In smartctl database [for details use: -P show]
> ATA Version is: 7
> ATA Standard is: ATA/ATAPI-7 T13 1532D revision 0
> Local Time is: Wed Dec 10 15:00:04 2008 CST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> cat:/home/lee# smartctl -i /dev/sdb
> smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> === START OF INFORMATION SECTION ===
> Model Family: Maxtor MaXLine III family (SATA/300)
> Device Model: Maxtor 7V300F0
> Serial Number: V601T7VG
> Firmware Version: VA111630
> User Capacity: 300,090,728,448 bytes
> Device is: In smartctl database [for details use: -P show]
> ATA Version is: 7
> ATA Standard is: ATA/ATAPI-7 T13 1532D revision 0
> Local Time is: Wed Dec 10 15:00:42 2008 CST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>


Have you tried smartctl -H <device> and smartctl -t short|long <device>?
Have you tried changing the cable?


[snip]


--
May cause drowsiness.
 
Old 12-11-2008, 05:58 AM
lee
 

On Thu, Dec 11, 2008 at 01:37:57PM +1100, Alex Samad wrote:

> have you tried smartctl -H <device> and smartctl -t short|long
> <device>

Yes, there doesn't seem to be anything unusual:


cat:/home/lee# smartctl -H /dev/sdb
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

cat:/home/lee# smartctl -t short /dev/sdb
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Wed Dec 10 23:53:43 2008

Use smartctl -X to abort test.
cat:/home/lee# smartctl -l selftest /dev/sdb
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      9158         -


I've started a long test, but it says it'll take about two hours. I'll
let you know the result.


BTW, what is this value for lifetime hours? It's the same value as
"smartctl -A" reports for "Power_On_Hours", but /dev/sda has 9663 and
/dev/sdb has 9158. Both values would have to be identical if they
represent what their name suggests: These disks have always been
powered on or turned off at the same time, with no exceptions. Their
actual "power on hours" are identical, if not to the second, then at
least to the minute. There's no way they could differ by 500 hours. --- Digging in
my mails turned up that they were probably bought in April 2006; they
have been used until June 2007 and then not been used until about this
month. That makes for about 9.5k hours.


cat:/home/lee# smartctl -A /dev/sda
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
9 Power_On_Hours 0x0032 226 226 000 Old_age Always - 9663
cat:/home/lee# smartctl -A /dev/sdb
9 Power_On_Hours 0x0032 227 227 000 Old_age Always - 9158
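
For scale, the gap between the two raw values can be worked out directly. This is plain arithmetic on the two numbers shown above; hours_gap is a hypothetical helper name, and the sketch says nothing about why the counters differ.

```python
# Raw Power_On_Hours values from the smartctl -A output above.
SDA_HOURS = 9663
SDB_HOURS = 9158


def hours_gap(a, b):
    """Absolute difference between two power-on-hour counters."""
    return abs(a - b)


if __name__ == "__main__":
    gap = hours_gap(SDA_HOURS, SDB_HOURS)
    # 505 hours is roughly three weeks of continuous operation.
    print(gap, "hours, about", round(gap / 24), "days")
```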


> tried changing the cable ?

Yes, I'm using different cables that came with the new board. The old
board (Asus A8N-SLI with an AMD64-4000) had a totally different
chipset as well:


0000:00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)


The kernel version was 2.6.16.2 when the disks were new, using the
sata_nv (or nv_sata) module.

The new board is a Gigabyte GA-P35-DS3L with a 3GHz Intel
Dual-Core. And the AMD was actually a bit faster, if you consider the
CPU alone ...


Anyway, if it's a software problem, it's probably not the module for
the particular controller but something else. That people with all
kinds of different hardware have this problem supports this theory.

Hm, and I haven't seen anyone using Debian reporting it ... Is there
anybody here who has seen it?


 
Old 12-11-2008, 05:39 PM
lee
 

On Thu, Dec 11, 2008 at 12:58:30AM -0600, lee wrote:

> I've started a long test, but it says it'll take about two hours. I'll
> let you know the result.

cat:/home/lee# smartctl -l selftest /dev/sdb
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      9160         -
# 2  Short offline       Completed without error       00%      9158         -

cat:/home/lee#


 
