08-17-2010, 03:07 PM
Gilboa Davara

kernel crash

On Tue, 2010-08-17 at 09:44 -0400, Steve Blackwell wrote:
> I leave my computer on 24/7 so that my backups can run at night.
> Lately, it has been crashing during the night usually leaving no trace
> of what happened. Last night it crashed but left this
> in /var/log/messages:
>
> Aug 17 01:04:56 steve kernel: INFO: task kjournald:1960 blocked for more than 120 seconds.
> Aug 17 01:04:56 steve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Could a hard drive get shut down because it was getting too hot? What would be a normal temp for a hard drive that has just completed a backup? 124C seems really hot. The HD cooling fan had been broken so I replaced it this past weekend but it doesn't seem to have helped. Too late? Permanent HD damage already done?
> Any other comments or suggestions?

Hello Steve,

This is not a crash.
The kjournald kernel process (which handles file-system journaling) was
blocked waiting on disk I/O for more than 120 seconds.
Your assumption that the HD went into some type of sleep/suspend mode
during write sounds reasonable to me.
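
To see how often this happens, the hung-task warnings can be pulled out of the log. What follows is only a minimal sketch: the function name `hung_tasks` is made up for illustration, and it assumes the message wording shown in the log excerpt above.

```shell
# Hypothetical helper: print "task:pid seconds" for every hung-task warning.
# Reads syslog-style text on stdin and assumes the kernel message wording
# quoted in this thread ("INFO: task NAME:PID blocked for more than N seconds").
hung_tasks() {
    sed -n 's/.*INFO: task \([^ ]*\) blocked for more than \([0-9]*\) seconds.*/\1 \2/p'
}

# Example using the line from the original report:
echo 'Aug 17 01:04:56 steve kernel: INFO: task kjournald:1960 blocked for more than 120 seconds.' | hung_tasks
```

On a live system this would be run as `hung_tasks < /var/log/messages`, probably as root.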

124C seems -very- hot. Even during heavy I/O.
Two things spring to mind:
A. Is it a normal desktop SATA drive or high-speed SCSI/SAS drive?
B. Please post the SMART log of the drive. (smartctl -a /dev/sdX).

- Gilboa

--
users mailing list
users@lists.fedoraproject.org
To unsubscribe or change subscription options:
https://admin.fedoraproject.org/mailman/listinfo/users
Guidelines: http://fedoraproject.org/wiki/Mailing_list_guidelines
 
08-17-2010, 04:05 PM
Steve Blackwell

kernel crash

On Tue, 17 Aug 2010 18:07:18 +0300
Gilboa Davara <gilboad@gmail.com> wrote:

> Hello Steve,
>
> This is not a crash.
> The kjournald kernel process (which handles file-system journaling) was
> blocked waiting on disk I/O for more than 120 seconds.
> Your assumption that the HD went into some type of sleep/suspend mode
> during write sounds reasonable to me.
>
> 124C seems -very- hot. Even during heavy I/O.
> Two things spring to mind:
> A. Is it a normal desktop SATA drive or high-speed SCSI/SAS drive?
> B. Please post the SMART log of the drive. (smartctl -a /dev/sdX).
>
> - Gilboa
>

Hello Gilboa,

Yes I realize that it was not a crash. When I first saw the kernel
messages I thought it was and started writing the e-mail. I neglected
to correct the subject line after I actually read the messages. Sorry
about that.

I had already run the command:
smartctl -t long /dev/sdb
before I got your reply. The results should be ready soon.

I've been looking at my logs some more. I don't understand these
messages:

Aug 17 10:30:50 steve kernel: CPU0: Temperature above threshold, cpu clock throttled (total events = 455)
Aug 17 10:30:50 steve kernel: CPU1: Temperature above threshold, cpu clock throttled (total events = 455)
Aug 17 10:30:50 steve kernel: CPU1: Temperature/speed normal
Aug 17 10:30:50 steve kernel: CPU0: Temperature/speed normal

These messages are repeated every hour or so. It seems unlikely that
every time the threshold is exceeded, it immediately (within one
second) drops back again. What is going on here?

The drive is an old IDE drive: WDC WD1600JB-00F

Thanks,
Steve
 
08-17-2010, 04:11 PM
JD

kernel crash

On 08/17/2010 06:44 AM, Steve Blackwell wrote:
> I leave my computer on 24/7 so that my backups can run at night.
> Lately, it has been crashing during the night usually leaving no trace
> of what happened. Last night it crashed but left this
> in /var/log/messages:
>
> Aug 17 01:04:56 steve kernel: INFO: task kjournald:1960 blocked for more than 120 seconds.
> Aug 17 01:04:56 steve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Aug 17 01:04:56 steve kernel: kjournald D 00002743 0 1960 2 0x00000080
> Aug 17 01:04:56 steve kernel: cf98fd9c 00000046 ff2f442e 00002743 00032558 00000000 f15c756c cf82d400
> Aug 17 01:04:56 steve kernel: c0a5e6ac c0a63140 f15c756c c0a63140 c0a63140 cf98fd74 c05b61ef f1714e18
> Aug 17 01:04:56 steve kernel: 00000001 00000000 00002743 f15c72c0 b39690c0 1b48082c f6630a60 c2208140
> Aug 17 01:04:56 steve kernel: Call Trace:
> Aug 17 01:04:56 steve kernel: [<c05b61ef>] ? cfq_may_queue+0x48/0xa8
> Aug 17 01:04:56 steve kernel: [<c0793ef7>] io_schedule+0x5f/0x98
> Aug 17 01:04:56 steve kernel: [<c05ac02f>] get_request_wait+0xc7/0x13c
> Aug 17 01:04:56 steve kernel: [<c0454641>] ? autoremove_wake_function+0x0/0x34
> Aug 17 01:04:56 steve kernel: [<c05ac4a4>] __make_request+0x27f/0x386
> Aug 17 01:04:56 steve kernel: [<c04cebd4>] ? __slab_alloc+0x269/0x3f6
> Aug 17 01:04:56 steve kernel: [<c05ab011>] generic_make_request+0x286/0x2d0
> Aug 17 01:04:56 steve kernel: [<c04a77e5>] ? mempool_alloc_slab+0x13/0x15
> Aug 17 01:04:56 steve kernel: [<c04a78b1>] ? mempool_alloc+0x5c/0xf2
> Aug 17 01:04:56 steve kernel: [<c05ab122>] submit_bio+0xc7/0xe0
> Aug 17 01:04:56 steve kernel: [<c04fc9d3>] ? bio_alloc_bioset+0x2a/0xb9
> Aug 17 01:04:56 steve kernel: [<c04f9038>] submit_bh+0xf4/0x114
> Aug 17 01:04:56 steve kernel: [<c0562f74>] journal_commit_transaction+0x38b/0xcc7
> Aug 17 01:04:56 steve kernel: [<c044747a>] ? lock_timer_base+0x26/0x45
> Aug 17 01:04:56 steve kernel: [<c0447696>] ? try_to_del_timer_sync+0x5e/0x66
> Aug 17 01:04:56 steve kernel: [<c0565f1d>] kjournald+0xb8/0x1cc
> Aug 17 01:04:56 steve kernel: [<c0454641>] ? autoremove_wake_function+0x0/0x34
> Aug 17 01:04:56 steve kernel: [<c0565e65>] ? kjournald+0x0/0x1cc
> Aug 17 01:04:56 steve kernel: [<c0454409>] kthread+0x64/0x69
> Aug 17 01:04:56 steve kernel: [<c04543a5>] ? kthread+0x0/0x69
> Aug 17 01:04:56 steve kernel: [<c04041e7>] kernel_thread_helper+0x7/0x10
>
> This happened in the middle of the backup which started at 1:00am and finished (successfully) at 1:28am so perhaps the backup blocked the kjournald process but it didn't crash the computer because there are later messages in the backup log and the messages file.
>
> The last entry in the messages file is:
>
> Aug 17 02:03:55 steve smartd[2347]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 167 to 168
> Aug 17 02:03:55 steve smartd[2347]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 122 to 124
>
> Could a hard drive get shut down because it was getting too hot? What would be a normal temp for a hard drive that has just completed a backup? 124C seems really hot. The HD cooling fan had been broken so I replaced it this past weekend but it doesn't seem to have helped. Too late? Permanent HD damage already done?
> Any other comments or suggestions?
>
> Thanks
> Steve
>
>
Hi Steve,
REPLACE THE DRIVE IMMEDIATELY!!
Otherwise, you are courting disaster!
See if it is still under warranty and ask the manufacturer for an RMA.

 
08-17-2010, 04:42 PM
Tim

kernel crash

On Tue, 2010-08-17 at 12:05 -0400, Steve Blackwell wrote:
> I've been looking at my logs some more. I don't understand these
> messages:
>
> Aug 17 10:30:50 steve kernel: CPU0: Temperature above threshold, cpu
> clock throttled (total events = 455)
> Aug 17 10:30:50 steve kernel: CPU1: Temperature above threshold, cpu
> clock throttled (total events = 455)
> Aug 17 10:30:50 steve kernel: CPU1: Temperature/speed normal
> Aug 17 10:30:50 steve kernel: CPU0: Temperature/speed normal

And the CPU overheating as well as your hard drive?

Is the computer in a hot room? Are the fans working? Is the
ventilation blocked? Is the computer wedged in between things that
restrict airflow? Are things full of fluff and dust?


 
08-17-2010, 04:48 PM
Steve Blackwell

kernel crash

Well, the long self test passed.
Here is the result of
# smartctl -a /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [i386-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar SE family
Device Model: WDC WD1600JB-00FUA0
Serial Number: WD-WCAES1024695
Firmware Version: 15.05R15
User Capacity: 160,041,885,696 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 6
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Tue Aug 17 12:36:35 2010 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85) Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (5073) seconds.
Offline data collection
capabilities:                    (0x79) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  67) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   146   142   021    Pre-fail  Always       -       3233
  4 Start_Stop_Count        0x0032   099   099   040    Old_age   Always       -       1681
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   070   070   000    Old_age   Always       -       22478
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1654
194 Temperature_Celsius     0x0022   116   253   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x0009   200   155   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error        00%              632  -
# 2  Short offline       Completed without error        00%              696  -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

This doesn't make much sense to me. If the overall health status is PASSED then why are all the vendor specific threshold values exceeded? Am I reading that wrong?

Thanks,
Steve
 
08-17-2010, 05:08 PM
Steve Blackwell

kernel crash

On Wed, 18 Aug 2010 02:12:16 +0930
Tim <ignored_mailbox@yahoo.com.au> wrote:

> On Tue, 2010-08-17 at 12:05 -0400, Steve Blackwell wrote:
> > I've been looking at my logs some more. I don't understand these
> > messages:
> >
> > Aug 17 10:30:50 steve kernel: CPU0: Temperature above threshold, cpu
> > clock throttled (total events = 455)
> > Aug 17 10:30:50 steve kernel: CPU1: Temperature above threshold, cpu
> > clock throttled (total events = 455)
> > Aug 17 10:30:50 steve kernel: CPU1: Temperature/speed normal
> > Aug 17 10:30:50 steve kernel: CPU0: Temperature/speed normal
>
> And the CPU overheating as well as your hard drive?
>
> Is the computer in a hot room? Are the fans working? Is the
> ventilation blocked? Is the computer wedged in between things that
> restrict airflow? Are things full of fluff and dust?
>
>
Well, it would seem so, but I don't trust the messages. It doesn't seem
reasonable that the CPUs go overtemp and then immediately cool down
enough to be OK.

As for your other questions, I spent the weekend replacing a broken
cooling fan, removing the dust build-up, rearranging the
internal components to maximize the space between them and rearranging
my office to place the computer in a more open space. None of these
actions appear to have helped.

Steve

 
08-17-2010, 11:22 PM
Bill Davidsen

kernel crash

Steve Blackwell wrote:

> This happened in the middle of the backup which started at 1:00am and finished (successfully) at 1:28am so perhaps the backup blocked the kjournald process but it didn't crash the computer because there are later messages in the backup log and the messages file.
>
> The last entry in the messages file is:
>
> Aug 17 02:03:55 steve smartd[2347]: Device: /dev/sda [SAT], SMART Prefailure Attribute: 3 Spin_Up_Time changed from 167 to 168
> Aug 17 02:03:55 steve smartd[2347]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 122 to 124
>
> Could a hard drive get shut down because it was getting too hot? What would be a normal temp for a hard drive that has just completed a backup? 124C seems really hot. The HD cooling fan had been broken so I replaced it this past weekend but it doesn't seem to have helped. Too late? Permanent HD damage already done?
> Any other comments or suggestions?
>
If this line is for real:
194 Temperature_Celsius 0x0022 116 253 000 Old_age Always - 34

Then your drive is running hotter than boiling water and has been close to
the melting point of solder. In spite of that the error count is fine, but
holding your hand an inch or so from the drive should tell you whether it is
really that hot.

I start taking some action on fans and dust if a drive hits 45C, so your drive
is either way too hot (probably) or reporting false bad news. That spin up time
is very long; even if that's in 10ths of a second, that's slow.

I would be sure that backup is good, and plan on replacing that drive sooner
rather than later. Tonight would probably be good...

--
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
 
08-18-2010, 01:44 AM
David

kernel crash

On 18 August 2010 09:22, Bill Davidsen <davidsen@tmr.com> wrote:
> If this line is for real:
> 194 Temperature_Celsius 0x0022 116 253 000 Old_age Always - 34
>
> Then your drive is running hotter than boiling water and has been close to
> melting point of solder. In spite of that the error count is fine, but holding
> your hand an inch or so from the drive should tell you if this is that hot.

Having rtfm, I think the values reported there are normalised values,
not degrees Celsius.

See
http://sourceforge.net/apps/trac/smartmontools/wiki/FAQ#Whyismydisktemperaturesreportedbysmartdas150Celsius

and read 'man smartctl' under option -A
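
To get the reading in degrees, look at the RAW_VALUE column rather than VALUE or WORST. A minimal sketch, assuming the ten-column attribute layout posted in this thread and assuming the drive stores plain degrees Celsius in the raw value (most do, but that is vendor specific). The function name `smart_raw_temp` is made up for illustration:

```shell
# Hypothetical helper: print the RAW_VALUE column of SMART attribute 194.
# Reads `smartctl -A` style output on stdin; the columns are assumed to be
# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE.
smart_raw_temp() {
    awk '$1 == 194 { print $10 }'
}

# Example using the attribute line quoted above (prints the raw value, 34):
echo '194 Temperature_Celsius 0x0022 116 253 000 Old_age Always - 34' | smart_raw_temp
```

On the real drive this would be `smartctl -A /dev/sdb | smart_raw_temp`, run as root. The smartd log entries ("changed from 122 to 124") look like normalized values of the same kind, not degrees.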
 
08-18-2010, 05:44 AM
Gilboa Davara

kernel crash

On Tue, 2010-08-17 at 13:08 -0400, Steve Blackwell wrote:
> On Wed, 18 Aug 2010 02:12:16 +0930
> Tim <ignored_mailbox@yahoo.com.au> wrote:
>
> > On Tue, 2010-08-17 at 12:05 -0400, Steve Blackwell wrote:
> > > I've been looking at my logs some more. I don't understand these
> > > messages:
> > >
> > > Aug 17 10:30:50 steve kernel: CPU0: Temperature above threshold, cpu
> > > clock throttled (total events = 455)
> > > Aug 17 10:30:50 steve kernel: CPU1: Temperature above threshold, cpu
> > > clock throttled (total events = 455)
> > > Aug 17 10:30:50 steve kernel: CPU1: Temperature/speed normal
> > > Aug 17 10:30:50 steve kernel: CPU0: Temperature/speed normal
> >
> > And the CPU overheating as well as your hard drive?
> >
> > Is the computer in a hot room? Are the fans working? Is the
> > ventilation blocked? Is the computer wedged in between things that
> > restrict airflow? Are things full of fluff and dust?
> >
> >
> Well, it would seem so, but I don't trust the messages. It doesn't seem
> reasonable that the CPUs go overtemp and then immediately cool down
> enough to be OK.

Actually it is possible.
Your CPU has auto-throttle support. That is, when the CPU passes a certain
temperature threshold, it automatically clocks down (or inserts NOPs)
in order to prevent it from burning out. Nevertheless, if your
machine's cooling is sufficient you shouldn't see this message.

If your CPU's high and low water marks are the same (e.g. 90C), the CPU
will reach 90C, throttle, and drop to 89C - all in one second.
I'd suggest you configure lm_sensors and monitor the CPU and board
temperature (run the first three commands as root):
# sensors-detect
# /etc/init.d/lm_sensors restart
# sensors -s
$ sensors
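
To check how often the throttle actually kicks in, the kernel messages can be counted per CPU. This is only a rough sketch: the function name `count_throttle_events` is made up, and it assumes the message wording from the log lines quoted above.

```shell
# Hypothetical helper: count "Temperature above threshold" messages per CPU.
# Reads syslog-style text on stdin and prints one "CPUn count" line per CPU.
count_throttle_events() {
    awk '/Temperature above threshold/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^CPU[0-9]+:$/) {
                     sub(/:$/, "", $i)   # strip the trailing colon
                     n[$i]++
                 }
         }
         END { for (cpu in n) print cpu, n[cpu] }' | sort
}

# Example using the messages from this thread (one throttle event per CPU):
printf '%s\n' \
    'Aug 17 10:30:50 steve kernel: CPU0: Temperature above threshold, cpu clock throttled (total events = 455)' \
    'Aug 17 10:30:50 steve kernel: CPU1: Temperature above threshold, cpu clock throttled (total events = 455)' \
    'Aug 17 10:30:50 steve kernel: CPU1: Temperature/speed normal' |
    count_throttle_events
```

Against the real log that would be `count_throttle_events < /var/log/messages`. Comparing the timestamps with the backup schedule should show whether the throttling follows the load.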

- Gilboa
P.S. can you post your hardware configuration?

 
08-18-2010, 05:47 AM
Gilboa Davara

kernel crash

On Tue, 2010-08-17 at 12:48 -0400, Steve Blackwell wrote:
> Well, the long self test passed.
> Here is the result of
> # smartctl -a /dev/sdb
> [...]
> SMART overall-health self-assessment test result: PASSED
> [...]
> 199 UDMA_CRC_Error_Count 0x000a 200 253 000 Old_age Always - 1
> [...]
> This doesn't make much sense to me. If the overall health status is PASSED then why are all the vendor specific threshold values exceeded? Am I reading that wrong?

The drive seems OK.
I'd look at the machine's cooling (see my suggestion concerning
lm_sensors).
(Even the UDMA CRC Error might be attributed to high temperature)
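
On the threshold question: according to the smartctl man page, an attribute counts as failed only when its normalized VALUE drops to or below THRESH. Higher is better, so values above the threshold are the healthy case. A minimal sketch of that check, assuming the ten-column `smartctl -A` layout posted earlier in this thread; the function name `smart_failing_attrs` is made up for illustration:

```shell
# Hypothetical helper (not part of smartmontools): print the ID and name of
# every attribute whose normalized VALUE is at or below its THRESH.
# Reads `smartctl -A` style output on stdin; the columns are assumed to be
# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE.
smart_failing_attrs() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && ($4 + 0) <= ($6 + 0) { print $1, $2 }'
}

# Two lines from the output posted earlier in this thread; neither VALUE is
# at or below its THRESH, so nothing is printed for them.
printf '%s\n' \
    '  5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0' \
    '194 Temperature_Celsius 0x0022 116 253 000 Old_age Always - 34' |
    smart_failing_attrs
```

On the real drive this would be `smartctl -A /dev/sdb | smart_failing_attrs`, run as root. Empty output is consistent with the PASSED overall-health result.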

- Gilboa

