FAQ Search Today's Posts Mark Forums Read
» Video Reviews

» Linux Archive

Linux-archive is a website aiming to archive linux email lists and to make them easily accessible for linux users/developers.


» Sponsor

» Partners

» Sponsor

Go Back   Linux Archive > Debian > Debian Kernel

 
 
LinkBack Thread Tools
 
Old 10-25-2011, 10:21 AM
Egon Eckert
 
Default Bug#644550: linux-image-2.6.32-5-amd64: indefinite soft lockup on rm

> > Package: linux-2.6
> > Version: 2.6.32-38
> > Severity: important
> > Tags: upstream
> >
> > the lockup happened on unlinking a large file (several GB) on XFS on fibre
> > channel storage (qla2xxx module). We experience very simiral trouble on
> > this machine though, using other kernels (from stable, stable-backports and
> > sid as well) and other fs/storage combinations (ext4 on local Dell PERC
> > controller).
>
> How long did the soft lockup state persist before you rebooted?

14 minutes

> > The full boot-time dmesg output may be found on
> >
> > http://joni.heaven-industries.com/~egon/tornado-dmesg.txt
> >
> > The bug is not too easy to trigger, so I hope there's another way to look
> > for it than bisecting 5 years of kernel commits... On the other hand, I'm
> > of course ready to spend a time on it if it helps to make the kernel better
> > .
> [...]
>
> When you say 'not too easy', do you mean that you don't know what
> specific circumstances it occurs in, or that it occurs as part of a long
> or complex sequence of operations?

The former.

> What sort of applications or services is this machine running?
>
> Given that you can reproduce this with the kernel version in sid, maybe
> we should take this straight to the upstream developers - but it's hard
> to guess which component or maintainer might be responsible.

I'm sorry for taking so long to respond--we tried to reproduce that in the
sid kernel and it didn't crash since then until today (well, the kernel from
squeeze-backports crashed here once in between). So I thought the BIOS
upgrade may have solved it and didn't want to bother you in that case. The
today's crash may not even be related to the original, so I should perhaps
open a new bug report... Anyway, here it goes (edited from netconsole
output):

[313919.501641] BUG: unable to handle kernel paging request at ffff88007f873133
[313919.501703] IP: [<ffffffff810ed6aa>] free_block+0x50/0xfc PGD 1604063 PUD 7f335067 PMD 0 Oops: 0002 [#1] SMP
[313919.501827] CPU 11
[313919.501836] Modules linked in:
dm_snapshot des_generic ecb md4 hmac nls_utf8 cifs nfs lockd fscache
auth_rpcgss nfs_acl sunrpc ext4 jbd2 crc16 xfs dm_round_robin
scsi_dh_emc dm_multipath scsi_dh netconsole configfs snd_pcm
snd_timer snd soundcore snd_page_alloc joydev tpm_tis pcspkr tpm
tpm_bios dcdbas i7core_edac evdev psmouse ghes serio_raw edac_core
acpi_power_meter hed button processor thermal_sys ext3 jbd mbcache
dm_mod sg sr_mod cdrom sd_mod ata_generic ses crc_t10dif enclosure
usbhid hid uhci_hcd qla2xxx scsi_transport_fc scsi_tgt ata_piix
libata ehci_hcd usbcore megaraid_sas scsi_mod bnx2 [last unloaded:
scsi_wait_scan]
[313919.502423] Pid: 78, comm: kswapd1 Not tainted 3.0.0-2-amd64 #1 Dell Inc. PowerEdge R810/0TT6JF
[313919.502486] RIP: 0010:[<ffffffff810ed6aa>] [<ffffffff810ed6aa>] free_block+0x50/0xfc
[313919.502540] RSP: 0018:ffff88085e35f9d0 EFLAGS: 00010082
[313919.502569] RAX: ffff880004013000 RBX: ffff88107f04a340 RCX: ffff880004018000
[313919.502616] RDX: ffff88087f4761c0 RSI: ffff88007f87312b RDI: ffff880004013200
[313919.502663] RBP: 000000000000000c R08: dead000000200200 R09: dead000000100100
[313919.502710] R10: dead000000200200 R11: dead000000100100 R12: 0000000000000003
[313919.502757] R13: ffff88107f01d5f0 R14: 0000000000000000 R15: ffff880004013200
[313919.502804] FS: 0000000000000000(0000) GS:ffff88107f4a0000(0000) knlGS:0000000000000000
[313919.502853] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[313919.502883] CR2: ffff88007f873133 CR3: 0000000001603000 CR4: 00000000000006e0
[313919.502930] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[313919.502977] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[313919.503024] Process kswapd1 (pid: 78, threadinfo ffff88085e35e000, task ffff88085e35ae60)
[313919.503072] Stack:
[313919.503094] 0000000000000082 ffff88107f01d5c0 ffff88087f4761c0 ffff88107f04a340
[313919.503160] 0000000000000000 0000000000000001 0000000000000000 ffffffff810ed7ea
[313919.503226] ffff8802f1a7c280 ffff88107f01d5c0 ffff88107f04a340 ffff880004716c90
[313919.503290] Call Trace:
[313919.503318] [<ffffffff810ed7ea>] ? __drain_alien_cache+0x94/0xa9
[313919.503351] [<ffffffff810ed486>] ? __cache_free.clone.30+0x8f/0x196
[313919.503383] [<ffffffff810ed5b2>] ? kmem_cache_free+0x25/0x69
[313919.503420] [<ffffffff8111d252>] ? free_buffer_head+0x1f/0x30
[313919.503452] [<ffffffff8111d4d8>] ? try_to_free_buffers+0x94/0xa6
[313919.508629] [<ffffffffa031ae19>] ? xfs_vm_releasepage+0x82/0x8b [xfs]
[313919.508668] [<ffffffff810c3b66>] ? shrink_page_list+0x2ef/0x496
[313919.508701] [<ffffffff810c412b>] ? shrink_inactive_list+0x23a/0x391
[313919.508739] [<ffffffff811a79ae>] ? radix_tree_gang_lookup_tag+0x7e/0xd2
[313919.508773] [<ffffffff810c4932>] ? shrink_zone+0x3ad/0x4ac
[313919.508810] [<ffffffff81335b15>] ? _cond_resched+0x9/0x20
[313919.508841] [<ffffffff810c55f9>] ? balance_pgdat+0x28e/0x514
[313919.508873] [<ffffffff810c5b33>] ? kswapd+0x2b4/0x2cd
[313919.508906] [<ffffffff8106023f>] ? wake_up_bit+0x23/0x23
[313919.508937] [<ffffffff810c587f>] ? balance_pgdat+0x514/0x514
[313919.508968] [<ffffffff810c587f>] ? balance_pgdat+0x514/0x514
[313919.508999] [<ffffffff8105fdc7>] ? kthread+0x7a/0x82
[313919.509032] [<ffffffff8133d1e4>] ? kernel_thread_helper+0x4/0x10
[313919.509064] [<ffffffff8105fd4d>] ? kthread_worker_fn+0x149/0x149
[313919.509096] [<ffffffff8133d1e0>] ? gs_change+0x13/0x13
[313919.509125] Code:
4c 89 ff e8 03 e8 ff ff 48 8b 48 08 48 8b 30 49 b9 00 01 10 00 00 00
ad de 49 b8 00 02 20 00 00 00 ad de 4a 8b 94 f3 68 10 00 00
[313919.509349]
89 4e 08 48 89 31 4c 89 08 4c 89 40 08 44 2b 78 18 8b 8b 10
[313919.509486] RIP [<ffffffff810ed6aa>] free_block+0x50/0xfc RSP <ffff88085e35f9d0> CR2: ffff88007f873133
[313919.509902] ---[ end trace 4e4c1859e166ba0c ]---

It was followed by (possibly irrelevant) backtrace from another CPU:

[313936.280974] ------------[ cut here ]------------
[313936.281068] WARNING: at /build/buildd-linux-2.6_3.0.0-5-amd64-UzNnAD/linux-2.6-3.0.0/debian/build/source_amd64_none/kernel/watchdog.c:240 watchdog_overflow_callback+0x96/0xa4()
[313936.281193] Hardware name: PowerEdge R810
[313936.281259] Watchdog detected hard LOCKUP on cpu 7
[313936.281314] Modules linked in:
dm_snapshot des_generic ecb md4 hmac nls_utf8 cifs nfs lockd fscache
auth_rpcgss nfs_acl sunrpc ext4 jbd2 crc16 xfs dm_round_robin
scsi_dh_emc dm_multipath scsi_dh netconsole configfs snd_pcm
snd_timer snd soundcore snd_page_alloc joydev tpm_tis pcspkr tpm
tpm_bios dcdbas i7core_edac evdev psmouse ghes serio_raw edac_core
acpi_power_meter hed button processor thermal_sys ext3 jbd mbcache
dm_mod sg sr_mod cdrom sd_mod ata_generic ses crc_t10dif enclosure
usbhid hid uhci_hcd qla2xxx scsi_transport_fc scsi_tgt ata_piix
libata ehci_hcd usbcore megaraid_sas scsi_mod bnx2 [last unloaded:
scsi_wait_scan]
[313936.284925] Pid: 38550, comm: sky-db Tainted: G D 3.0.0-2-amd64 #1
[313936.285001] Call Trace:
[313936.285066] <NMI> [<ffffffff81046591>] ? warn_slowpath_common+0x78/0x8c
[313936.285198] [<ffffffff81046646>] ? warn_slowpath_fmt+0x45/0x4a
[313936.285276] [<ffffffff8100e28d>] ? paravirt_read_tsc+0x5/0x8
[313936.285347] [<ffffffff81091b60>] ? watchdog_overflow_callback+0x96/0xa4
[313936.285424] [<ffffffff810b1a5c>] ? __perf_event_overflow+0x101/0x198
[313936.285500] [<ffffffff8101429b>] ? paravirt_read_msr+0x7/0xa
[313936.285571] [<ffffffff810142a9>] ? paravirt_write_msr+0xb/0xe
[313936.285646] [<ffffffff81017d48>] ? intel_pmu_handle_irq+0x35d/0x3c0
[313936.285726] [<ffffffff813383fa>] ? perf_event_nmi_handler+0x3e/0x87
[313936.285800] [<ffffffff81339ee3>] ? notifier_call_chain+0x2e/0x5b
[313936.285874] [<ffffffff81339f5f>] ? notify_die+0x2d/0x32
[313936.285945] [<ffffffff81337a3c>] ? do_nmi+0x61/0x202
[313936.286015] [<ffffffff81337750>] ? nmi+0x20/0x30
[313936.286089] [<ffffffff81072135>] ? do_raw_spin_lock+0x15/0x1b
[313936.286159] <<EOE>> [<ffffffff810ed471>] ? __cache_free.clone.30+0x7a/0x196
[313936.286299] [<ffffffff810ed5b2>] ? kmem_cache_free+0x25/0x69
[313936.286376] [<ffffffff8111d252>] ? free_buffer_head+0x1f/0x30
[313936.286448] [<ffffffff8111d4d8>] ? try_to_free_buffers+0x94/0xa6
[313936.286544] [<ffffffffa031ae19>] ? xfs_vm_releasepage+0x82/0x8b [xfs]
[313936.286623] [<ffffffff810c3b66>] ? shrink_page_list+0x2ef/0x496
[313936.286695] [<ffffffff810c412b>] ? shrink_inactive_list+0x23a/0x391
[313936.286767] [<ffffffff810c4932>] ? shrink_zone+0x3ad/0x4ac
[313936.286839] [<ffffffff810c4da0>] ? do_try_to_free_pages+0x106/0x315
[313936.286918] [<ffffffff810bc6bf>] ? get_page_from_freelist+0x651/0x697
[313936.286990] [<ffffffff810c5235>] ? try_to_free_pages+0xb6/0xf6
[313936.287062] [<ffffffff810bcd41>] ? __alloc_pages_nodemask+0x4d9/0x765
[313936.287137] [<ffffffff810beef7>] ? page_zone.clone.4+0x21/0x21
[313936.287214] [<ffffffff810e7acd>] ? alloc_pages_vma+0xdc/0xe1
[313936.287292] [<ffffffff810d263c>] ? handle_pte_fault+0x162/0x79b
[313936.287364] [<ffffffff810cfc31>] ? set_pmd+0x5/0x8
[313936.287434] [<ffffffff810cfd4c>] ? pte_offset_kernel+0x16/0x36
[313936.287506] [<ffffffff810d2ff1>] ? handle_mm_fault+0x1ea/0x22c
[313936.287578] [<ffffffff81339e90>] ? do_page_fault+0x2e9/0x30e
[313936.287651] [<ffffffff810d7c2e>] ? mmap_region+0x336/0x431
[313936.287722] [<ffffffff810d5c0e>] ? get_unmapped_area+0xe4/0x13e
[313936.287797] [<ffffffff81337495>] ? page_fault+0x25/0x30
[313936.287869] ---[ end trace 4e4c1859e166ba0d ]---

Here's the last atop sample (2 minutes until the crash):

ATOP - tornado 2011/10/25 03:45:02 600 seconds elapsed
PRC | sys 80.95s | user 67m20s | #proc 201 | #zombie 0 | #exit 4773 |
CPU | sys 14% | user 675% | irq 1% | idle 903% | wait 8% |
cpu | sys 1% | user 71% | irq 0% | idle 27% | cpu002 w 1% |
cpu | sys 1% | user 71% | irq 0% | idle 28% | cpu004 w 0% |
cpu | sys 1% | user 64% | irq 0% | idle 35% | cpu007 w 0% |
cpu | sys 1% | user 59% | irq 0% | idle 39% | cpu006 w 0% |
cpu | sys 1% | user 59% | irq 0% | idle 40% | cpu005 w 0% |
cpu | sys 1% | user 57% | irq 0% | idle 42% | cpu003 w 0% |
cpu | sys 1% | user 52% | irq 0% | idle 45% | cpu001 w 2% |
cpu | sys 1% | user 50% | irq 0% | idle 43% | cpu000 w 5% |
cpu | sys 1% | user 30% | irq 0% | idle 69% | cpu008 w 0% |
cpu | sys 1% | user 30% | irq 0% | idle 70% | cpu011 w 0% |
cpu | sys 0% | user 29% | irq 0% | idle 70% | cpu015 w 0% |
cpu | sys 1% | user 28% | irq 0% | idle 72% | cpu009 w 0% |
cpu | sys 1% | user 24% | irq 0% | idle 75% | cpu012 w 0% |
cpu | sys 1% | user 20% | irq 0% | idle 79% | cpu013 w 0% |
cpu | sys 0% | user 19% | irq 0% | idle 81% | cpu010 w 0% |
cpu | sys 0% | user 11% | irq 0% | idle 89% | cpu014 w 0% |
CPL | avg1 2.57 | avg5 1.94 | avg15 2.15 | csw 9919970 | intr 6679518 |
MEM | tot 63.0G | free 1.3G | cache 52.0G | buff 2.0M | slab 1.2G |
SWP | tot 9.3G | free 9.3G | | vmcom 13.7G | vmlim 40.8G |
PAG | scan 454328 | stall 0 | | swin 0 | swout 256 |
DSK | sda | busy 8% | read 9391 | write 17782 | avio 1 ms |
NET | transport | tcpi 738 | tcpo 684 | udpi 45 | udpo 45 |
NET | network | ipi 2033 | ipo 722 | ipfrw 0 | deliv 2032 |
NET | eth2 0% | pcki 3185 | pcko 601 | si 4 Kbps | so 2 Kbps |
NET | lo ---- | pcki 198 | pcko 198 | si 0 Kbps | so 0 Kbps |

PID SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPU CMD 1/217
38629 8.36s 8m29s 74264K 73832K 11916K 301.7M -- - R 86% sky-db
39009 14.48s 8m22s -0.1G -0.1G 556K 337.9M -- - R 86% sky-db
38674 8.55s 8m26s -13.3M -13.7M 9220K 215.9M -- - R 86% sky-db
38652 9.33s 8m25s 25112K 25016K 5184K 136.9M -- - R 86% sky-db
38563 8.28s 8m25s 20124K 19924K 8568K 68804K -- - R 86% sky-db
38957 8.55s 8m25s 34588K 34112K 7316K 52356K -- - R 86% sky-db
38576 8.75s 8m24s 5912K 5696K 1172K 14784K -- - R 85% sky-db
38550 10.55s 8m19s -10.0M -8288K 646.9M 555.2M -- - R 85% sky-db
3752 0.10s 0.42s -640K -528K 1616K 8K -- - S 0% sky-hps
4491 0.02s 0.48s -640K -528K 1196K 28K -- - S 0% sky-hps
3836 0.06s 0.42s 0K 0K 0K 0K -- - S 0% sky-hps
4075 0.45s 0.00s 0K 0K 0K 0K -- - S 0% flush-254:1
3910 0.10s 0.34s 0K -40K 0K 0K -- - S 0% skytrick
3828 0.22s 0.22s 0K 0K 0K 0K -- - S 0% sky-hps
59708 0.14s 0.30s 0K -120K 0K 0K -- - S 0% proxy
4438 0.09s 0.34s 0K -244K 0K 0K -- - S 0% sky-db
37660 0.35s 0.07s 0K -40K 0K 0K -- - S 0% skytrick
4445 0.08s 0.34s 0K -244K 0K 0K -- - S 0% sky-db
4459 0.09s 0.33s 0K -244K 0K 0K -- - S 0% sky-db
4184 0.07s 0.35s 0K -1268K 0K 0K -- - S 0% sky-db
4452 0.09s 0.32s 0K -244K 0K 0K -- - S 0% sky-db
4474 0.07s 0.34s 0K -244K 0K 0K -- - S 0% sky-db
4481 0.07s 0.30s 0K -244K 0K 0K -- - S 0% sky-db

All the sky-db process were messing (through libdb API) with the same big
(57GB) file on the XFS (over LVM2 logical volume over a local HW RAID)
filesystem. The application is being built locally against stock squeeze
libdb4.8-dev pkg.

I noticed a dramatic drop i the "buff" (2.0M value, see above) about an hour
before the crash--until then it was many hours steadily over 200 megabytes.
In the last hour it was only 1.8-2.0 MB as of the atop log.

The machine had 4 days uptime, the respective boot dmesg output is at

http://joni.heaven-industries.com/~egon/tornado-dmesg-111021.txt

It turned out we have a very similar another machine here which crashed in
the last days too with the outdated BIOS (no crash since the upgrade, but
it's only 4 days since then) running CentOS 5.x (kernel 2.6.18).

To clarify once again: the crash described above happened with a recent BIOS.

Thanks!

--
Egon Eckert, Heaven Industries, s.r.o.
E-mail: egon.eckert@heaven-industries.com

EARTH
smog | bricks
AIR -- mud -- FIRE
soda water | tequila
WATER
-- with thanks to fortune



--
To UNSUBSCRIBE, email to debian-kernel-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: 20111025102124.GB19316@heaven-industries.com">http://lists.debian.org/20111025102124.GB19316@heaven-industries.com
 
Old 10-27-2011, 12:19 PM
Egon Eckert
 
Default Bug#644550: linux-image-2.6.32-5-amd64: indefinite soft lockup on rm

> > > Package: linux-2.6
> > > Version: 2.6.32-38
> > > Severity: important
> > > Tags: upstream
> > >
> > > the lockup happened on unlinking a large file (several GB) on XFS on fibre
> > > channel storage (qla2xxx module). We experience very simiral trouble on
> > > this machine though, using other kernels (from stable, stable-backports and
> > > sid as well) and other fs/storage combinations (ext4 on local Dell PERC
> > > controller).
> >
> > How long did the soft lockup state persist before you rebooted?
>
> 14 minutes

...after switching back to the stable kernel (where this bug report started),
the same soft lockup happened again last night after 36h of uptime. It
lasted 9h (until reboot), so it's really indefinite I guess.

The "rm" was again stuck on the same file, being produced and steadily grew
up a few hours from zero size collaboratively by these "sky-db" processes,
see the atop sample above.

To summarize: the stable kernel locks up in matter of hours/days, and the
unstable kernel crashes in days/weeks, without link to "rm" command. The
other similar machine crashed twice too (running CentOS) and we are reporting
this to the HW vendor to not waste time of kernel developers/maintainers
here.

Thanks,

Egon



--
To UNSUBSCRIBE, email to debian-kernel-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: 20111027121923.GC7283@heaven-industries.com">http://lists.debian.org/20111027121923.GC7283@heaven-industries.com
 
Old 11-15-2011, 07:53 AM
Egon Eckert
 
Default Bug#644550: linux-image-2.6.32-5-amd64: indefinite soft lockup on rm

> To summarize: the stable kernel locks up in matter of hours/days, and the
> unstable kernel crashes in days/weeks

...the machine just reached 2 weeks of uptime after switching off the Turbo
Boost option in BIOS setup. Before that, it crashed the same (stable) kernel
repeatedly in max. 36 hours of uptime.

I think we should close this "bug" now, regardless of what's buggy or not
conforming to specs or whatever.

Thanks,

--
Egon Eckert



--
To UNSUBSCRIBE, email to debian-kernel-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: 20111115085301.GD13472@heaven-industries.com">http://lists.debian.org/20111115085301.GD13472@heaven-industries.com
 

Thread Tools




All times are GMT. The time now is 02:52 AM.

VBulletin, Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.
Content Relevant URLs by vBSEO ©2007, Crawlability, Inc.
Copyright ©2007 - 2008, www.linux-archive.org