FAQ Search Today's Posts Mark Forums Read
» Video Reviews

» Linux Archive

Linux-archive is a website aiming to archive linux email lists and to make them easily accessible for linux users/developers.


» Sponsor

» Partners

» Sponsor

Go Back   Linux Archive > Redhat > Device-mapper Development

 
 
LinkBack Thread Tools
 
Old 05-07-2011, 10:39 AM
Nikola Ciprich
 
Default 2.6.32.28 - md resync + pvmove - crash

Hi,
first, I'm sorry for crossposting and also CCing stable@, if that's not OK, please let me knows.
Anyways, we've experienced hang of system running 2.6.32.28.
After upgrading to 2.6.32 and replacing failed disk, md resync has started. Then when the technician started pvmove, dome deadlock must have occured, because all disk requests started to hang and the whole system had to be rebooted...

here's the backtrace:

[ 1229.645028] alg: No test for stdrng (krng)
[ 1229.668172] alg: No test for authenc(hmac(sha1),cbc(des3_ede)) (authenc(hmac(sha1-generic),cbc(des3_ede-generic)))
[ 1531.585167] md: bind<sda2>
[ 1531.927846] raid1: raid set md2 active with 1 out of 2 mirrors
[ 1531.934613] md2: detected capacity change from 0 to 2000133029888
[ 1549.850444] md1: bitmap file is out of date (0 < 439231) -- forcing full recovery
[ 1549.858719] md1: bitmap file is out of date, doing full recovery
[ 1550.068105] md1: bitmap initialized from disk: read 11/11 pages, set 357576 bits
[ 1550.076054] created bitmap (175 pages) for device md1
[ 1561.449841] md2: unknown partition table
[ 1561.501645] md2: bitmap file is out of date (0 < 4) -- forcing full recovery
[ 1561.509999] md2: bitmap file is out of date, doing full recovery
[ 1562.158515] md2: bitmap initialized from disk: read 15/15 pages, set 476869 bits
[ 1562.167764] created bitmap (233 pages) for device md2
[ 2400.956019] INFO: task kjournald:1038 blocked for more than 120 seconds.
[ 2400.963280] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2400.971356] kjournald D ffff8800016ac400 0 1038 2 0x00000000
[ 2400.978621] ffff88003cc33c60 0000000000000046 ffff88003cc33bd0 ffffffff8119ba6f
[ 2400.986513] 0000000000013780 ffff88003f9746b0 ffff88003f9745f0 ffff88003ea2c5f0
[ 2400.994426] ffff88003f9749a0 ffff88003cc33fd8 ffff88003d65b000 ffff880035600a00
[ 2401.002415] Call Trace:
[ 2401.005024] [<ffffffff8119ba6f>] ? blk_unplug+0x2f/0xa0
[ 2401.010530] [<ffffffff81076bb4>] ? ktime_get_ts+0xa4/0xd0
[ 2401.016182] [<ffffffff8133773e>] io_schedule+0x6e/0xc0
[ 2401.021643] [<ffffffff81136afe>] sync_buffer+0x3e/0x50
[ 2401.027029] [<ffffffff81337c75>] __wait_on_bit+0x55/0x80
[ 2401.032638] [<ffffffff81136ac0>] ? sync_buffer+0x0/0x50
[ 2401.038177] [<ffffffff81136ac0>] ? sync_buffer+0x0/0x50
[ 2401.043659] [<ffffffff81337d18>] out_of_line_wait_on_bit+0x78/0x90
[ 2401.050129] [<ffffffff8106e070>] ? wake_bit_function+0x0/0x30
[ 2401.056143] [<ffffffff81136a36>] __wait_on_buffer+0x26/0x30
[ 2401.062077] [<ffffffffa002a097>] journal_commit_transaction+0x657/0x13c0 [jbd]
[ 2401.069693] [<ffffffff8105e104>] ? try_to_del_timer_sync+0x44/0x110
[ 2401.076212] [<ffffffff81339ddd>] ? _spin_unlock_irqrestore+0x1d/0x50
[ 2401.082831] [<ffffffffa002e893>] kjournald+0xe3/0x260 [jbd]
[ 2401.088708] [<ffffffff8106e030>] ? autoremove_wake_function+0x0/0x40
[ 2401.095369] [<ffffffffa002e7b0>] ? kjournald+0x0/0x260 [jbd]
[ 2401.101337] [<ffffffff8106deee>] kthread+0x8e/0xa0
[ 2401.106354] [<ffffffff8100c30a>] child_rip+0xa/0x20
[ 2401.111477] [<ffffffff8106de60>] ? kthread+0x0/0xa0
[ 2401.116598] [<ffffffff8100c300>] ? child_rip+0x0/0x20
[ 2401.121893] INFO: task flush-253:2:3168 blocked for more than 120 seconds.
[ 2401.128983] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2401.137114] flush-253:2 D 0000000000000002 0 3168 2 0x00000000
[ 2401.144318] ffff88002c245a40 0000000000000046 ffff880035601600 ffff88002f621840
[ 2401.152248] 0000000000013780 ffff88003ceb9810 ffff88003ceb9750 ffff88003ea2c5f0
[ 2401.160169] ffff88003ceb9b00 ffff88002c245fd8 ffff88002c245a00 ffff880035601600
[ 2401.168048] Call Trace:
[ 2401.170608] [<ffffffff81076bb4>] ? ktime_get_ts+0xa4/0xd0
[ 2401.176303] [<ffffffff8133773e>] io_schedule+0x6e/0xc0
[ 2401.181723] [<ffffffff810ccb56>] sync_page+0x36/0x50
[ 2401.186970] [<ffffffff81337b3e>] __wait_on_bit_lock+0x4e/0xa0
[ 2401.192991] [<ffffffff810ccb20>] ? sync_page+0x0/0x50
[ 2401.198287] [<ffffffff810ccb05>] __lock_page+0x65/0x70
[ 2401.203687] [<ffffffff8106e070>] ? wake_bit_function+0x0/0x30
[ 2401.209687] [<ffffffff810d4366>] write_cache_pages+0x3d6/0x490
[ 2401.215802] [<ffffffff810d3da0>] ? __writepage+0x0/0x40
[ 2401.221291] [<ffffffff810d4442>] generic_writepages+0x22/0x30
[ 2401.227327] [<ffffffff810d4476>] do_writepages+0x26/0x30
[ 2401.232965] [<ffffffff8112fa24>] writeback_single_inode+0xa4/0x290
[ 2401.239412] [<ffffffff811304e2>] writeback_inodes_wb+0x2d2/0x420
[ 2401.245715] [<ffffffff81130756>] wb_writeback+0x126/0x1e0
[ 2401.251360] [<ffffffff81130a84>] wb_do_writeback+0x1a4/0x1c0
[ 2401.257287] [<ffffffff81130ad5>] bdi_writeback_task+0x35/0xd0
[ 2401.263317] [<ffffffff810e5cf0>] ? bdi_start_fn+0x0/0xf0
[ 2401.268886] [<ffffffff810e5d71>] bdi_start_fn+0x81/0xf0
[ 2401.274370] [<ffffffff810e5cf0>] ? bdi_start_fn+0x0/0xf0
[ 2401.279947] [<ffffffff8106deee>] kthread+0x8e/0xa0
[ 2401.285000] [<ffffffff8100c30a>] child_rip+0xa/0x20
[ 2401.290120] [<ffffffff8106de60>] ? kthread+0x0/0xa0
[ 2401.295247] [<ffffffff8100c300>] ? child_rip+0x0/0x20
[ 2401.300586] INFO: task reiserfs/0:3204 blocked for more than 120 seconds.
[ 2401.307590] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2401.315682] reiserfs/0 D ffff880016fdad48 0 3204 2 0x00000000
[ 2401.322884] ffff88002f1b1d10 0000000000000046 ffff88000180dda0 ffff88000180dec0
[ 2401.330754] 0000000000013780 ffff88003ea180c0 ffff88003ea18000 ffff88002f43aea0
[ 2401.338683] ffff88003ea183b0 ffff88002f1b1fd8 ffff88002f1b1cd0 ffffffff81048960
[ 2401.346684] Call Trace:
[ 2401.349252] [<ffffffff81048960>] ? update_curr+0xb0/0x170
[ 2401.354983] [<ffffffff813384f7>] __mutex_lock_slowpath+0x107/0x310
[ 2401.361480] [<ffffffff81338727>] mutex_lock+0x27/0x50
[ 2401.366791] [<ffffffffa0358b27>] flush_commit_list+0x137/0x6d0

I can't 100% separate out some hardware problem, but this system has been running 2.6.27.x rock solid for years till then..
Can somebody see something interesting in those backtraces?
If I can provide further information, I'll be glad to assist...
BR
nik


--
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava

tel.: +420 596 603 142
fax: +420 596 621 273
mobil: +420 777 093 799

www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis@linuxbox.cz
-------------------------------------
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 05-09-2011, 01:16 PM
Christian May
 
Default 2.6.32.28 - md resync + pvmove - crash

Hi,



I've seen the similar backtraces on
my RHEL6.1 Snap 5 test system. *When ruring good path I/O I found
the following:



INFO: task kswapd0:98 blocked for more than 120 seconds.

"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.

kswapd0 * * * D eeda7ab8 * * 0 * *98
* * *2 0x00000000

f7292ab0 00000046 00000002 eeda7ab8 c9643b54 00000000 00000001 00000010

ec77c800 ee6aaac0 00001ef7 f30eedfc 00001ef7 c0ae2120 c0ae2120 f7292d58

c0ae2120 c0addb54 c0ae2120 f7292d58 c9643b54 c05d7a20 0202fcfa f7292ab0

Call Trace:

[<c05d7a20>] ? blk_unplug_timeout+0x0/0x50

[<c0463b8e>] ? mod_timer+0xfe/0x1e0

[<c05d857f>] ? blk_plug_device+0x6f/0xd0

[<c047dd90>] ? ktime_get_ts+0xd0/0x100

[<c0822b49>] ? io_schedule+0x59/0xa0

[<c05dacf1>] ? get_request_wait+0xc1/0x190

[<c05d28a5>] ? elv_merge+0x185/0x1b0

[<c0473ec0>] ? autoremove_wake_function+0x0/0x40

[<c05dae26>] ? __make_request+0x66/0x4a0

[<f7e4cf48>] ? dm_request+0x108/0x150 [dm_mod]

[<c05d9c7d>] ? generic_make_request+0x38d/0x5f0

[<c044615d>] ? activate_task+0x1d/0x30

[<c043f4ad>] ? enqueue_entity+0x37d/0x400

[<c05d9f59>] ? submit_bio+0x79/0x120

[<c0555001>] ? bio_alloc_bioset+0x41/0xc0

[<c0550709>] ? submit_bh+0xd9/0x120

[<c055219a>] ? __block_write_full_page+0x20a/0x3d0

[<c043f4ad>] ? enqueue_entity+0x37d/0x400

[<c05563c0>] ? blkdev_get_block+0x0/0xd0

[<c0552c67>] ? block_write_full_page_endio+0xa7/0xe0

[<c0551b80>] ? end_buffer_async_write+0x0/0x140

[<c05563c0>] ? blkdev_get_block+0x0/0xd0

[<c0552caf>] ? block_write_full_page+0xf/0x20

[<c0551b80>] ? end_buffer_async_write+0x0/0x140

[<c04f247c>] ? pageout.clone.1+0xfc/0x2b0

[<c04f28e3>] ? shrink_page_list.clone.0+0x2b3/0x460

[<c04f2d4d>] ? shrink_inactive_list+0x2bd/0x640

[<c04f3bcd>] ? shrink_zone+0x30d/0x460

[<c04f4ae9>] ? kswapd+0x699/0x8d0

[<c04f4d20>] ? isolate_pages_global+0x0/0x2c0

[<c0473ec0>] ? autoremove_wake_function+0x0/0x40

[<c04f4450>] ? kswapd+0x0/0x8d0

[<c0473c84>] ? kthread+0x74/0x80

[<c0473c10>] ? kthread+0x0/0x80

[<c040a03f>] ? kernel_thread_helper+0x7/0x10



The system was still accessible and
I/O continued. I've noticed kjournald messages too. Will keep an eye on
that..



Best regards,



Christian



*







From: * * *
*Nikola Ciprich <nikola.ciprich@linuxbox.cz>

To: * * *
*linux-kernel@vger.kernel.org

Cc: * * *
*nikola.ciprich@linuxbox.cz,
linux-raid@vger.kernel.org, dm-devel@redhat.com, stable@kernel.org

Date: * * *
*07.05.2011 12:41

Subject: * *
* *[dm-devel] 2.6.32.28
- md resync + pvmove - crash

Sent by: * *
* *dm-devel-bounces@redhat.com








Hi,

first, I'm sorry for crossposting and also CCing stable@, if that's not
OK, please let me knows.

Anyways, we've experienced hang of system running 2.6.32.28.

After upgrading to 2.6.32 and replacing failed disk, md resync has started.
Then when the technician started pvmove, dome deadlock must have occured,
because all disk requests started to hang and the whole system had to be
rebooted...



here's the backtrace:



[ 1229.645028] alg: No test for stdrng (krng)

[ 1229.668172] alg: No test for authenc(hmac(sha1),cbc(des3_ede)) (authenc(hmac(sha1-generic),cbc(des3_ede-generic)))

[ 1531.585167] md: bind<sda2>

[ 1531.927846] raid1: raid set md2 active with 1 out of 2 mirrors

[ 1531.934613] md2: detected capacity change from 0 to 2000133029888

[ 1549.850444] md1: bitmap file is out of date (0 < 439231) -- forcing
full recovery

[ 1549.858719] md1: bitmap file is out of date, doing full recovery

[ 1550.068105] md1: bitmap initialized from disk: read 11/11 pages, set
357576 bits

[ 1550.076054] created bitmap (175 pages) for device md1

[ 1561.449841] *md2: unknown partition table

[ 1561.501645] md2: bitmap file is out of date (0 < 4) -- forcing full
recovery

[ 1561.509999] md2: bitmap file is out of date, doing full recovery

[ 1562.158515] md2: bitmap initialized from disk: read 15/15 pages, set
476869 bits

[ 1562.167764] created bitmap (233 pages) for device md2

[ 2400.956019] INFO: task kjournald:1038 blocked for more than 120 seconds.

[ 2400.963280] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.

[ 2400.971356] kjournald * * D ffff8800016ac400 * *
0 *1038 * * *2 0x00000000

[ 2400.978621] *ffff88003cc33c60 0000000000000046 ffff88003cc33bd0
ffffffff8119ba6f

[ 2400.986513] *0000000000013780 ffff88003f9746b0 ffff88003f9745f0
ffff88003ea2c5f0

[ 2400.994426] *ffff88003f9749a0 ffff88003cc33fd8 ffff88003d65b000
ffff880035600a00

[ 2401.002415] Call Trace:

[ 2401.005024] *[<ffffffff8119ba6f>] ? blk_unplug+0x2f/0xa0

[ 2401.010530] *[<ffffffff81076bb4>] ? ktime_get_ts+0xa4/0xd0

[ 2401.016182] *[<ffffffff8133773e>] io_schedule+0x6e/0xc0

[ 2401.021643] *[<ffffffff81136afe>] sync_buffer+0x3e/0x50

[ 2401.027029] *[<ffffffff81337c75>] __wait_on_bit+0x55/0x80

[ 2401.032638] *[<ffffffff81136ac0>] ? sync_buffer+0x0/0x50

[ 2401.038177] *[<ffffffff81136ac0>] ? sync_buffer+0x0/0x50

[ 2401.043659] *[<ffffffff81337d18>] out_of_line_wait_on_bit+0x78/0x90

[ 2401.050129] *[<ffffffff8106e070>] ? wake_bit_function+0x0/0x30

[ 2401.056143] *[<ffffffff81136a36>] __wait_on_buffer+0x26/0x30

[ 2401.062077] *[<ffffffffa002a097>] journal_commit_transaction+0x657/0x13c0
[jbd]

[ 2401.069693] *[<ffffffff8105e104>] ? try_to_del_timer_sync+0x44/0x110

[ 2401.076212] *[<ffffffff81339ddd>] ? _spin_unlock_irqrestore+0x1d/0x50

[ 2401.082831] *[<ffffffffa002e893>] kjournald+0xe3/0x260 [jbd]

[ 2401.088708] *[<ffffffff8106e030>] ? autoremove_wake_function+0x0/0x40

[ 2401.095369] *[<ffffffffa002e7b0>] ? kjournald+0x0/0x260 [jbd]

[ 2401.101337] *[<ffffffff8106deee>] kthread+0x8e/0xa0

[ 2401.106354] *[<ffffffff8100c30a>] child_rip+0xa/0x20

[ 2401.111477] *[<ffffffff8106de60>] ? kthread+0x0/0xa0

[ 2401.116598] *[<ffffffff8100c300>] ? child_rip+0x0/0x20

[ 2401.121893] INFO: task flush-253:2:3168 blocked for more than 120 seconds.

[ 2401.128983] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.

[ 2401.137114] flush-253:2 * D 0000000000000002 * * 0 *3168
* * *2 0x00000000

[ 2401.144318] *ffff88002c245a40 0000000000000046 ffff880035601600
ffff88002f621840

[ 2401.152248] *0000000000013780 ffff88003ceb9810 ffff88003ceb9750
ffff88003ea2c5f0

[ 2401.160169] *ffff88003ceb9b00 ffff88002c245fd8 ffff88002c245a00
ffff880035601600

[ 2401.168048] Call Trace:

[ 2401.170608] *[<ffffffff81076bb4>] ? ktime_get_ts+0xa4/0xd0

[ 2401.176303] *[<ffffffff8133773e>] io_schedule+0x6e/0xc0

[ 2401.181723] *[<ffffffff810ccb56>] sync_page+0x36/0x50

[ 2401.186970] *[<ffffffff81337b3e>] __wait_on_bit_lock+0x4e/0xa0

[ 2401.192991] *[<ffffffff810ccb20>] ? sync_page+0x0/0x50

[ 2401.198287] *[<ffffffff810ccb05>] __lock_page+0x65/0x70

[ 2401.203687] *[<ffffffff8106e070>] ? wake_bit_function+0x0/0x30

[ 2401.209687] *[<ffffffff810d4366>] write_cache_pages+0x3d6/0x490

[ 2401.215802] *[<ffffffff810d3da0>] ? __writepage+0x0/0x40

[ 2401.221291] *[<ffffffff810d4442>] generic_writepages+0x22/0x30

[ 2401.227327] *[<ffffffff810d4476>] do_writepages+0x26/0x30

[ 2401.232965] *[<ffffffff8112fa24>] writeback_single_inode+0xa4/0x290

[ 2401.239412] *[<ffffffff811304e2>] writeback_inodes_wb+0x2d2/0x420

[ 2401.245715] *[<ffffffff81130756>] wb_writeback+0x126/0x1e0

[ 2401.251360] *[<ffffffff81130a84>] wb_do_writeback+0x1a4/0x1c0

[ 2401.257287] *[<ffffffff81130ad5>] bdi_writeback_task+0x35/0xd0

[ 2401.263317] *[<ffffffff810e5cf0>] ? bdi_start_fn+0x0/0xf0

[ 2401.268886] *[<ffffffff810e5d71>] bdi_start_fn+0x81/0xf0

[ 2401.274370] *[<ffffffff810e5cf0>] ? bdi_start_fn+0x0/0xf0

[ 2401.279947] *[<ffffffff8106deee>] kthread+0x8e/0xa0

[ 2401.285000] *[<ffffffff8100c30a>] child_rip+0xa/0x20

[ 2401.290120] *[<ffffffff8106de60>] ? kthread+0x0/0xa0

[ 2401.295247] *[<ffffffff8100c300>] ? child_rip+0x0/0x20

[ 2401.300586] INFO: task reiserfs/0:3204 blocked for more than 120 seconds.

[ 2401.307590] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.

[ 2401.315682] reiserfs/0 * *D ffff880016fdad48 * *
0 *3204 * * *2 0x00000000

[ 2401.322884] *ffff88002f1b1d10 0000000000000046 ffff88000180dda0
ffff88000180dec0

[ 2401.330754] *0000000000013780 ffff88003ea180c0 ffff88003ea18000
ffff88002f43aea0

[ 2401.338683] *ffff88003ea183b0 ffff88002f1b1fd8 ffff88002f1b1cd0
ffffffff81048960

[ 2401.346684] Call Trace:

[ 2401.349252] *[<ffffffff81048960>] ? update_curr+0xb0/0x170

[ 2401.354983] *[<ffffffff813384f7>] __mutex_lock_slowpath+0x107/0x310

[ 2401.361480] *[<ffffffff81338727>] mutex_lock+0x27/0x50

[ 2401.366791] *[<ffffffffa0358b27>] flush_commit_list+0x137/0x6d0



I can't 100% separate out some hardware problem, but this system has been
running 2.6.27.x rock solid for years till then..

Can somebody see something interesting in those backtraces?

If I can provide further information, I'll be glad to assist...

BR

nik





--

-------------------------------------

Ing. Nikola CIPRICH

LinuxBox.cz, s.r.o.

28. rijna 168, 709 01 Ostrava



tel.: * +420 596 603 142

fax: * *+420 596 621 273

mobil: *+420 777 093 799



www.linuxbox.cz



mobil servis: +420 737 238 656

email servis: servis@linuxbox.cz

-------------------------------------

[attachment "attui1b8.dat" deleted by Christian May/Germany/IBM]
--

dm-devel mailing list

dm-devel@redhat.com

https://www.redhat.com/mailman/listinfo/dm-devel

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 05-09-2011, 01:16 PM
Christian May
 
Default 2.6.32.28 - md resync + pvmove - crash

Hi,



I've seen the similar backtraces on
my RHEL6.1 Snap 5 test system. *When ruring good path I/O I found
the following:



INFO: task kswapd0:98 blocked for more than 120 seconds.

"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.

kswapd0 * * * D eeda7ab8 * * 0 * *98
* * *2 0x00000000

f7292ab0 00000046 00000002 eeda7ab8 c9643b54 00000000 00000001 00000010

ec77c800 ee6aaac0 00001ef7 f30eedfc 00001ef7 c0ae2120 c0ae2120 f7292d58

c0ae2120 c0addb54 c0ae2120 f7292d58 c9643b54 c05d7a20 0202fcfa f7292ab0

Call Trace:

[<c05d7a20>] ? blk_unplug_timeout+0x0/0x50

[<c0463b8e>] ? mod_timer+0xfe/0x1e0

[<c05d857f>] ? blk_plug_device+0x6f/0xd0

[<c047dd90>] ? ktime_get_ts+0xd0/0x100

[<c0822b49>] ? io_schedule+0x59/0xa0

[<c05dacf1>] ? get_request_wait+0xc1/0x190

[<c05d28a5>] ? elv_merge+0x185/0x1b0

[<c0473ec0>] ? autoremove_wake_function+0x0/0x40

[<c05dae26>] ? __make_request+0x66/0x4a0

[<f7e4cf48>] ? dm_request+0x108/0x150 [dm_mod]

[<c05d9c7d>] ? generic_make_request+0x38d/0x5f0

[<c044615d>] ? activate_task+0x1d/0x30

[<c043f4ad>] ? enqueue_entity+0x37d/0x400

[<c05d9f59>] ? submit_bio+0x79/0x120

[<c0555001>] ? bio_alloc_bioset+0x41/0xc0

[<c0550709>] ? submit_bh+0xd9/0x120

[<c055219a>] ? __block_write_full_page+0x20a/0x3d0

[<c043f4ad>] ? enqueue_entity+0x37d/0x400

[<c05563c0>] ? blkdev_get_block+0x0/0xd0

[<c0552c67>] ? block_write_full_page_endio+0xa7/0xe0

[<c0551b80>] ? end_buffer_async_write+0x0/0x140

[<c05563c0>] ? blkdev_get_block+0x0/0xd0

[<c0552caf>] ? block_write_full_page+0xf/0x20

[<c0551b80>] ? end_buffer_async_write+0x0/0x140

[<c04f247c>] ? pageout.clone.1+0xfc/0x2b0

[<c04f28e3>] ? shrink_page_list.clone.0+0x2b3/0x460

[<c04f2d4d>] ? shrink_inactive_list+0x2bd/0x640

[<c04f3bcd>] ? shrink_zone+0x30d/0x460

[<c04f4ae9>] ? kswapd+0x699/0x8d0

[<c04f4d20>] ? isolate_pages_global+0x0/0x2c0

[<c0473ec0>] ? autoremove_wake_function+0x0/0x40

[<c04f4450>] ? kswapd+0x0/0x8d0

[<c0473c84>] ? kthread+0x74/0x80

[<c0473c10>] ? kthread+0x0/0x80

[<c040a03f>] ? kernel_thread_helper+0x7/0x10



The system was still accessible and
I/O continued. I've noticed kjournald messages too. Will keep an eye on
that..



Best regards,



Christian



*







From: * * *
*Nikola Ciprich <nikola.ciprich@linuxbox.cz>

To: * * *
*linux-kernel@vger.kernel.org

Cc: * * *
*nikola.ciprich@linuxbox.cz,
linux-raid@vger.kernel.org, dm-devel@redhat.com, stable@kernel.org

Date: * * *
*07.05.2011 12:41

Subject: * *
* *[dm-devel] 2.6.32.28
- md resync + pvmove - crash

Sent by: * *
* *dm-devel-bounces@redhat.com








Hi,

first, I'm sorry for crossposting and also CCing stable@, if that's not
OK, please let me knows.

Anyways, we've experienced hang of system running 2.6.32.28.

After upgrading to 2.6.32 and replacing failed disk, md resync has started.
Then when the technician started pvmove, dome deadlock must have occured,
because all disk requests started to hang and the whole system had to be
rebooted...



here's the backtrace:



[ 1229.645028] alg: No test for stdrng (krng)

[ 1229.668172] alg: No test for authenc(hmac(sha1),cbc(des3_ede)) (authenc(hmac(sha1-generic),cbc(des3_ede-generic)))

[ 1531.585167] md: bind<sda2>

[ 1531.927846] raid1: raid set md2 active with 1 out of 2 mirrors

[ 1531.934613] md2: detected capacity change from 0 to 2000133029888

[ 1549.850444] md1: bitmap file is out of date (0 < 439231) -- forcing
full recovery

[ 1549.858719] md1: bitmap file is out of date, doing full recovery

[ 1550.068105] md1: bitmap initialized from disk: read 11/11 pages, set
357576 bits

[ 1550.076054] created bitmap (175 pages) for device md1

[ 1561.449841] *md2: unknown partition table

[ 1561.501645] md2: bitmap file is out of date (0 < 4) -- forcing full
recovery

[ 1561.509999] md2: bitmap file is out of date, doing full recovery

[ 1562.158515] md2: bitmap initialized from disk: read 15/15 pages, set
476869 bits

[ 1562.167764] created bitmap (233 pages) for device md2

[ 2400.956019] INFO: task kjournald:1038 blocked for more than 120 seconds.

[ 2400.963280] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.

[ 2400.971356] kjournald * * D ffff8800016ac400 * *
0 *1038 * * *2 0x00000000

[ 2400.978621] *ffff88003cc33c60 0000000000000046 ffff88003cc33bd0
ffffffff8119ba6f

[ 2400.986513] *0000000000013780 ffff88003f9746b0 ffff88003f9745f0
ffff88003ea2c5f0

[ 2400.994426] *ffff88003f9749a0 ffff88003cc33fd8 ffff88003d65b000
ffff880035600a00

[ 2401.002415] Call Trace:

[ 2401.005024] *[<ffffffff8119ba6f>] ? blk_unplug+0x2f/0xa0

[ 2401.010530] *[<ffffffff81076bb4>] ? ktime_get_ts+0xa4/0xd0

[ 2401.016182] *[<ffffffff8133773e>] io_schedule+0x6e/0xc0

[ 2401.021643] *[<ffffffff81136afe>] sync_buffer+0x3e/0x50

[ 2401.027029] *[<ffffffff81337c75>] __wait_on_bit+0x55/0x80

[ 2401.032638] *[<ffffffff81136ac0>] ? sync_buffer+0x0/0x50

[ 2401.038177] *[<ffffffff81136ac0>] ? sync_buffer+0x0/0x50

[ 2401.043659] *[<ffffffff81337d18>] out_of_line_wait_on_bit+0x78/0x90

[ 2401.050129] *[<ffffffff8106e070>] ? wake_bit_function+0x0/0x30

[ 2401.056143] *[<ffffffff81136a36>] __wait_on_buffer+0x26/0x30

[ 2401.062077] *[<ffffffffa002a097>] journal_commit_transaction+0x657/0x13c0
[jbd]

[ 2401.069693] *[<ffffffff8105e104>] ? try_to_del_timer_sync+0x44/0x110

[ 2401.076212] *[<ffffffff81339ddd>] ? _spin_unlock_irqrestore+0x1d/0x50

[ 2401.082831] *[<ffffffffa002e893>] kjournald+0xe3/0x260 [jbd]

[ 2401.088708] *[<ffffffff8106e030>] ? autoremove_wake_function+0x0/0x40

[ 2401.095369] *[<ffffffffa002e7b0>] ? kjournald+0x0/0x260 [jbd]

[ 2401.101337] *[<ffffffff8106deee>] kthread+0x8e/0xa0

[ 2401.106354] *[<ffffffff8100c30a>] child_rip+0xa/0x20

[ 2401.111477] *[<ffffffff8106de60>] ? kthread+0x0/0xa0

[ 2401.116598] *[<ffffffff8100c300>] ? child_rip+0x0/0x20

[ 2401.121893] INFO: task flush-253:2:3168 blocked for more than 120 seconds.

[ 2401.128983] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.

[ 2401.137114] flush-253:2 * D 0000000000000002 * * 0 *3168
* * *2 0x00000000

[ 2401.144318] *ffff88002c245a40 0000000000000046 ffff880035601600
ffff88002f621840

[ 2401.152248] *0000000000013780 ffff88003ceb9810 ffff88003ceb9750
ffff88003ea2c5f0

[ 2401.160169] *ffff88003ceb9b00 ffff88002c245fd8 ffff88002c245a00
ffff880035601600

[ 2401.168048] Call Trace:

[ 2401.170608] *[<ffffffff81076bb4>] ? ktime_get_ts+0xa4/0xd0

[ 2401.176303] *[<ffffffff8133773e>] io_schedule+0x6e/0xc0

[ 2401.181723] *[<ffffffff810ccb56>] sync_page+0x36/0x50

[ 2401.186970] *[<ffffffff81337b3e>] __wait_on_bit_lock+0x4e/0xa0

[ 2401.192991] *[<ffffffff810ccb20>] ? sync_page+0x0/0x50

[ 2401.198287] *[<ffffffff810ccb05>] __lock_page+0x65/0x70

[ 2401.203687] *[<ffffffff8106e070>] ? wake_bit_function+0x0/0x30

[ 2401.209687] *[<ffffffff810d4366>] write_cache_pages+0x3d6/0x490

[ 2401.215802] *[<ffffffff810d3da0>] ? __writepage+0x0/0x40

[ 2401.221291] *[<ffffffff810d4442>] generic_writepages+0x22/0x30

[ 2401.227327] *[<ffffffff810d4476>] do_writepages+0x26/0x30

[ 2401.232965] *[<ffffffff8112fa24>] writeback_single_inode+0xa4/0x290

[ 2401.239412] *[<ffffffff811304e2>] writeback_inodes_wb+0x2d2/0x420

[ 2401.245715] *[<ffffffff81130756>] wb_writeback+0x126/0x1e0

[ 2401.251360] *[<ffffffff81130a84>] wb_do_writeback+0x1a4/0x1c0

[ 2401.257287] *[<ffffffff81130ad5>] bdi_writeback_task+0x35/0xd0

[ 2401.263317] *[<ffffffff810e5cf0>] ? bdi_start_fn+0x0/0xf0

[ 2401.268886] *[<ffffffff810e5d71>] bdi_start_fn+0x81/0xf0

[ 2401.274370] *[<ffffffff810e5cf0>] ? bdi_start_fn+0x0/0xf0

[ 2401.279947] *[<ffffffff8106deee>] kthread+0x8e/0xa0

[ 2401.285000] *[<ffffffff8100c30a>] child_rip+0xa/0x20

[ 2401.290120] *[<ffffffff8106de60>] ? kthread+0x0/0xa0

[ 2401.295247] *[<ffffffff8100c300>] ? child_rip+0x0/0x20

[ 2401.300586] INFO: task reiserfs/0:3204 blocked for more than 120 seconds.

[ 2401.307590] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.

[ 2401.315682] reiserfs/0 * *D ffff880016fdad48 * *
0 *3204 * * *2 0x00000000

[ 2401.322884] *ffff88002f1b1d10 0000000000000046 ffff88000180dda0
ffff88000180dec0

[ 2401.330754] *0000000000013780 ffff88003ea180c0 ffff88003ea18000
ffff88002f43aea0

[ 2401.338683] *ffff88003ea183b0 ffff88002f1b1fd8 ffff88002f1b1cd0
ffffffff81048960

[ 2401.346684] Call Trace:

[ 2401.349252] *[<ffffffff81048960>] ? update_curr+0xb0/0x170

[ 2401.354983] *[<ffffffff813384f7>] __mutex_lock_slowpath+0x107/0x310

[ 2401.361480] *[<ffffffff81338727>] mutex_lock+0x27/0x50

[ 2401.366791] *[<ffffffffa0358b27>] flush_commit_list+0x137/0x6d0



I can't 100% separate out some hardware problem, but this system has been
running 2.6.27.x rock solid for years till then..

Can somebody see something interesting in those backtraces?

If I can provide further information, I'll be glad to assist...

BR

nik





--

-------------------------------------

Ing. Nikola CIPRICH

LinuxBox.cz, s.r.o.

28. rijna 168, 709 01 Ostrava



tel.: * +420 596 603 142

fax: * *+420 596 621 273

mobil: *+420 777 093 799



www.linuxbox.cz



mobil servis: +420 737 238 656

email servis: servis@linuxbox.cz

-------------------------------------

[attachment "attui1b8.dat" deleted by Christian May/Germany/IBM]
--

dm-devel mailing list

dm-devel@redhat.com

https://www.redhat.com/mailman/listinfo/dm-devel

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 05-19-2011, 01:56 PM
Christian May
 
Default 2.6.32.28 - md resync + pvmove - crash

Hi,



currently I have a ppc64 IBM blade running
with RHEL6.1 RC2 (2.6.32-131.0.13.el6.ppc64). Twelve multipath devices
are attached and I/O is running.

I've checked dmesg and found:



[root@elm8a211 ~]# dmesg

INFO: task kjournald:7938 blocked for
more than 120 seconds.

"echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.

kjournald * * D 0000000000000000
* * 0 *7938 * * *2 0x00008080

Call Trace:

[c0000003c82f3420] [c0000003c82f34d0]
0xc0000003c82f34d0 (unreliable)

[c0000003c82f35f0] [c000000000014288]
.__switch_to+0xf8/0x1d0

[c0000003c82f3680] [c00000000059cc18]
.schedule+0x408/0xd30

[c0000003c82f3960] [c00000000059d5d0]
.io_schedule+0x90/0x110

[c0000003c82f39f0] [c0000000001fc2f8]
.sync_buffer+0x68/0x80

[c0000003c82f3a60] [c00000000059e104]
.__wait_on_bit+0xb4/0x130

[c0000003c82f3b10] [c00000000059e1f0]
.out_of_line_wait_on_bit+0x70/0x90

[c0000003c82f3bf0] [c0000000001fc1b0]
.__wait_on_buffer+0x30/0x50

[c0000003c82f3c70] [d000000004043bc4]
.journal_commit_transaction+0x624/0x1630 [jbd]

[c0000003c82f3de0] [d0000000040488fc]
.kjournald+0x11c/0x330 [jbd]

[c0000003c82f3ed0] [c0000000000bb02c]
.kthread+0xbc/0xd0

[c0000003c82f3f90] [c0000000000323f4]
.kernel_thread+0x54/0x70

[root@elm8a211 ~]#



I have no idea where it's coming from..system
i still up'n running.





Best regards,



Christian May

IBM

*--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 05-19-2011, 01:56 PM
Christian May
 
Default 2.6.32.28 - md resync + pvmove - crash

Hi,



currently I have a ppc64 IBM blade running
with RHEL6.1 RC2 (2.6.32-131.0.13.el6.ppc64). Twelve multipath devices
are attached and I/O is running.

I've checked dmesg and found:



[root@elm8a211 ~]# dmesg

INFO: task kjournald:7938 blocked for
more than 120 seconds.

"echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.

kjournald * * D 0000000000000000
* * 0 *7938 * * *2 0x00008080

Call Trace:

[c0000003c82f3420] [c0000003c82f34d0]
0xc0000003c82f34d0 (unreliable)

[c0000003c82f35f0] [c000000000014288]
.__switch_to+0xf8/0x1d0

[c0000003c82f3680] [c00000000059cc18]
.schedule+0x408/0xd30

[c0000003c82f3960] [c00000000059d5d0]
.io_schedule+0x90/0x110

[c0000003c82f39f0] [c0000000001fc2f8]
.sync_buffer+0x68/0x80

[c0000003c82f3a60] [c00000000059e104]
.__wait_on_bit+0xb4/0x130

[c0000003c82f3b10] [c00000000059e1f0]
.out_of_line_wait_on_bit+0x70/0x90

[c0000003c82f3bf0] [c0000000001fc1b0]
.__wait_on_buffer+0x30/0x50

[c0000003c82f3c70] [d000000004043bc4]
.journal_commit_transaction+0x624/0x1630 [jbd]

[c0000003c82f3de0] [d0000000040488fc]
.kjournald+0x11c/0x330 [jbd]

[c0000003c82f3ed0] [c0000000000bb02c]
.kthread+0xbc/0xd0

[c0000003c82f3f90] [c0000000000323f4]
.kernel_thread+0x54/0x70

[root@elm8a211 ~]#



I have no idea where it's coming from..system
i still up'n running.





Best regards,



Christian May

IBM

*--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 

Thread Tools




All times are GMT. The time now is 04:48 PM.

VBulletin, Copyright ©2000 - 2014, Jelsoft Enterprises Ltd.
Content Relevant URLs by vBSEO ©2007, Crawlability, Inc.
Copyright 2007 - 2008, www.linux-archive.org