09-28-2010, 08:47 AM
Proskurin Kirill

Bug#598323: linux-image-2.6.35.6: Servers reboot on heavy load on DRBD+OCFS2 partition

Package: linux-image-2.6.35.6
Version: 2.6.35.6-10.00.Custom
Severity: important


Hello.

First of all, this is my first bug report to Debian, and I am sorry if I have done something wrong. Just tell me what needs to be fixed in it.

I have two Dell 2950 servers and am trying to use them as an email cluster.
I use DRBD with OCFS2 on top of it. Both nodes reboot under heavy load every time.

I am reporting this bug against the package linux-image-2.6.35.6, but it is not specific to that package: I have the same problem on 2.6.26 (stable) and 2.6.32 (testing). I only tried the latest kernel to be sure.
I tried ocfs2-tools from stable and from testing: the nodes reboot. I tried DRBD8 from backports, then the native DRBD in 2.6.32, and then DRBD 8.3.8 compiled from source with 2.6.35.6: the nodes reboot.
So I think it is kernel related, but I could be quite wrong. I am not sure what causes these reboots.

What I do:
1) Create a DRBD md on both nodes
drbdadm create-md drbd0

2) Sync it
drbdadm -- --overwrite-data-of-peer primary drbd0
drbdsetup /dev/drbd0 syncer -r 110M

3) Make both primary
drbdadm primary drbd0

4) Make FS
mkfs.ocfs2 -L ocfs2_drbd -N 2 -T mail --fs-feature-level=max-features /dev/drbd0

5) Mount it on both nodes
mount /var/spool/dovecot
(fstab options - nodev,noauto,noatime,data=writeback)

6) Make folders for test
mkdir /var/spool/dovecot/iozone1
mkdir /var/spool/dovecot/iozone2

7) Start IO test on both nodes in different folders
iozone -RK -t 4 -s 10g -i 0 -i 1 -i 2 -b /tmp/`hostname`.xls

8) I always get a reboot after 30-180 minutes. Sometimes there is a stack trace and a halt, but not every time.
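
For reference, the plain "mount /var/spool/dovecot" in step 5 relies on an /etc/fstab entry. A minimal sketch of what that line might look like, using the device, mount point, and options given in the steps above; the trailing "0 0" dump/fsck fields are an assumption, since the report quotes only the options:

```
# /etc/fstab entry (sketch; the last two fields are assumed, not from the report)
/dev/drbd0  /var/spool/dovecot  ocfs2  nodev,noauto,noatime,data=writeback  0  0
```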

The OCFS2 partition seems to work fine under normal load.

P.S. If it was wrong to report this from a sid-like system, just tell me. This bug is easily reproducible on stable or testing.

-- System Information:
Debian Release: squeeze/sid
APT prefers testing
APT policy: (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.35.6 (SMP w/4 CPU cores)
Locale: LANG=ru_RU.UTF-8, LC_CTYPE=ru_RU.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages linux-image-2.6.35.6 depends on:
ii coreutils 8.5-1 GNU core utilities
ii debconf [debconf-2.0] 1.5.35 Debian configuration management sy

linux-image-2.6.35.6 recommends no packages.

Versions of packages linux-image-2.6.35.6 suggests:
pn fdutils <none> (no description available)
pn ksymoops <none> (no description available)
pn linux-doc-2.6.35.6 | linux-so <none> (no description available)
pn linux-image-2.6.35.6-dbg <none> (no description available)

-- debconf information:
linux-image-2.6.35.6/postinst/old-dir-initrd-link-2.6.35.6: true
linux-image-2.6.35.6/prerm/removing-running-kernel-2.6.35.6: true
linux-image-2.6.35.6/preinst/abort-overwrite-2.6.35.6:
linux-image-2.6.35.6/postinst/old-system-map-link-2.6.35.6: true
linux-image-2.6.35.6/preinst/already-running-this-2.6.35.6:
linux-image-2.6.35.6/preinst/overwriting-modules-2.6.35.6: true
linux-image-2.6.35.6/postinst/depmod-error-initrd-2.6.35.6: false
linux-image-2.6.35.6/postinst/kimage-is-a-directory:
linux-image-2.6.35.6/preinst/failed-to-move-modules-2.6.35.6:
linux-image-2.6.35.6/postinst/depmod-error-2.6.35.6: false
node:
ip_port = 7777
ip_address = 192.168.1.1
number = 0
name = mail01.fxclub.org
cluster = ocfs2

node:
ip_port = 7777
ip_address = 192.168.1.2
number = 1
name = mail02.fxclub.org
cluster = ocfs2

cluster:
node_count = 2
name = ocfs2
resource drbd0 {

on mail01.fxclub.org {
device /dev/drbd0;
disk /dev/sda9;
address 192.168.1.1:7789;
meta-disk internal;
}

on mail02.fxclub.org {
device /dev/drbd0;
disk /dev/sda9;
address 192.168.1.2:7789;
meta-disk internal;
}

}
global {
usage-count yes;
# minor-count dialog-refresh disable-ip-verification
}

common {
protocol C;

handlers {
# What should be done in case the node is primary, degraded (=no connection) and has inconsistent data.
#pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
#pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /sbin/ifconfig eth1 down";
# The node is currently primary, but lost the after split brain auto recovery procedure. As as consequence it should go away.
#pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
#pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /sbin/ifconfig eth1 down";
#local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
#outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
# fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
#split-brain "/usr/lib/drbd/notify-split-brain.sh root";
# out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
# before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
# after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
}

startup {
wfc-timeout 60;
degr-wfc-timeout 30;
outdated-wfc-timeout 15;
become-primary-on both;
# wait-after-sb;
}

disk {
fencing resource-and-stonith;
# RAID WITH BBU ONLY!!!
no-disk-flushes;
no-md-flushes;
no-disk-barrier;
# on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
# no-disk-drain no-md-flushes max-bio-bvecs
}

net {
cram-hmac-alg sha1;
shared-secret "password";
allow-two-primaries;
ping-timeout 20;
#after-sb-0pri discard-zero-changes;
#after-sb-1pri discard-secondary;
#after-sb-2pri disconnect;
data-integrity-alg sha1;
# Tuning
max-buffers 8000;
max-epoch-size 8000;
sndbuf-size 0;
# snd.buf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
# max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
# after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
}

syncer {
# MegaBYTE! Not Bit.
rate 40M;
al-extents 3389;
# rate after al-extents use-rle cpu-mask verify-alg csums-alg
}
}
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
Heartbeat dead threshold = 31
Network idle timeout: 15000
Network keepalive delay: 2000
Network reconnect delay: 2000
Checking O2CB heartbeat: Not active
Stable:
Message from syslogd@mail02 at Sep 16 09:03:19 ...
kernel:[92182.173794] ------------[ cut here ]------------

Message from syslogd@mail02 at Sep 16 09:03:19 ...
kernel:[92182.173872] invalid opcode: 0000 [#1] SMP

Message from syslogd@mail02 at Sep 16 09:03:19 ...
kernel:[92182.173899] last sysfs file: /sys/module/ocfs2/refcnt


Testing:
Message from syslogd@mail01 at Sep 16 15:18:37 ...
kernel:[ 1432.310479] ------------[ cut here ]------------

Message from syslogd@mail01 at Sep 16 15:18:37 ...
kernel:[ 1432.310648] invalid opcode: 0000 [#1] SMP

Message from syslogd@mail01 at Sep 16 15:18:37 ...
kernel:[ 1432.310801] last sysfs file: /sys/fs/o2cb/interface_revision

Message from syslogd@mail01 at Sep 16 15:18:37 ...
kernel:[ 1432.312251] Stack:

Message from syslogd@mail01 at Sep 16 15:18:37 ...
kernel:[ 1432.312251] Call Trace:

Message from syslogd@mail01 at Sep 16 15:18:37 ...
kernel:[ 1432.312251] Code: 83 c3 08 48 83 3b 00 eb ec 48 83 fd 10 0f 86 89 00 00 00 48 89 ef e8 b9 e8 ff ff 48 89 c7 48 8b 00 84 c0 78 13 66 a9 00 c0 75 04 <0f> 0b eb fe 5b 5d 41 5c e9 94 58 fd ff 48 8b 4c 24 18 4c 8b 4f

Testing: 2.6.35 + DRBD 8.3.8
mail01:/usr/local/sbin# mount /var/spool/dovecot

Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.451479] ------------[ cut here ]------------

Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.451530] invalid opcode: 0000 [#1] SMP

Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.451557] last sysfs file: /sys/module/drbd/parameters/cn_idx

Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.452451] Stack:

Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.452623] Call Trace:

Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.452841] Code: c5 10 48 83 7d 00 00 eb e6 48 83 fb 10 0f 86 80 00 00 00 48 89 df e8 a9 f0 ff ff 48 89 c6 48 8b 00 84 c0 78 16 66 a9 00 c0 75 04 <0f> 0b eb fe 5b 5d 41 5c 48 89 f7 e9 7d 75 fd ff 48 8b 4c 24 18

Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.461099] general protection fault: 0000 [#2] SMP

Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.461269] last sysfs file: /sys/module/drbd/parameters/cn_idx
mail01:/usr/local/sbin#
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.465065] Stack:

Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.465065] Call Trace:

Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.465065] Code: 0f 1f 44 00 00 49 89 c7 fa 66 0f 1f 44 00 00 65 4c 8b 04 25 b0 ea 00 00 48 8b 45 00 49 01 c0 49 8b 18 48 85 db 74 0d 48 63 45 18 <48> 8b 04 03 49 89 00 eb 11 83 ca ff 44 89 f6 48 89 ef e8 a1 f1


[55921.451479] ------------[ cut here ]------------
[55921.451506] kernel BUG at mm/slub.c:2834!
[55921.451530] invalid opcode: 0000 [#1] SMP
[55921.451557] last sysfs file: /sys/module/drbd/parameters/cn_idx
[55921.451584] CPU 1
[55921.451589] Modules linked in: ocfs2 jbd2 quota_tree drbd xt_multiport sha1_generic hmac lru_cache cn xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs ext2 loop snd_pcm i5000_edac edac_core i5k_amb snd_timer processor snd evdev button rng_core shpchp soundcore snd_page_alloc tpm_tis pci_hotplug psmouse dcdbas tpm pcspkr tpm_bios serio_raw ext3 jbd mbcache ide_cd_mod uhci_hcd cdrom ata_generic ata_piix libata ses sd_mod enclosure crc_t10dif ehci_hcd megaraid_sas piix ide_core usbcore scsi_mod nls_base bnx2 thermal thermal_sys [last unloaded: drbd]
[55921.451964]
[55921.451984] Pid: 2995, comm: udevd Not tainted 2.6.35.6 #1 0NH278/PowerEdge 2950
[55921.452027] RIP: 0010:[<ffffffff810df05d>] [<ffffffff810df05d>] kfree+0x5b/0xc8
[55921.452076] RSP: 0018:ffff88012aa61d58 EFLAGS: 00010246
[55921.452102] RAX: 0200000000000400 RBX: ffff880100000001 RCX: 0000000000000002
[55921.452131] RDX: ffffea0000000000 RSI: ffffea0003800000 RDI: ffff880100000001
[55921.452160] RBP: ffff8800375d8f00 R08: 0000000000000000 R09: 0000000000000000
[55921.452189] R10: ffff88012bce1070 R11: ffff8800375d8f00 R12: ffffffff810f061e
[55921.452219] R13: 0000000018000040 R14: ffff88012c375cf0 R15: ffff88012bce1070
[55921.452248] FS: 00007f7646a967a0(0000) GS:ffff880001a40000(0000) knlGS:0000000000000000
[55921.452293] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[55921.452319] CR2: 00007f7646a9c000 CR3: 000000012d245000 CR4: 00000000000006e0
[55921.452349] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[55921.452378] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[55921.452407] Process udevd (pid: 2995, threadinfo ffff88012aa60000, task ffff880121f4d890)
[55921.452451] Stack:
[55921.452471] 0000000000000000 ffff8800375d8f00 ffff88012bce1070 ffffffff810f061e
[55921.452505] <0> ffff880108000080 000000002bce1070 ffff88012c3759d0 ffff880100000001
[55921.452556] <0> 0000029d0000029d ffff8800375d8fa0 ffff88012f8a4900 ffff8800375d8f00
[55921.452623] Call Trace:
[55921.452647] [<ffffffff810f061e>] ? vfs_rename+0x3d3/0x3e4
[55921.452674] [<ffffffff810f1c78>] ? sys_renameat+0x1aa/0x22b
[55921.452702] [<ffffffff810d13ab>] ? free_pages_and_swap_cache+0x53/0x6e
[55921.452732] [<ffffffff810c83fb>] ? tlb_finish_mmu+0x2a/0x33
[55921.452759] [<ffffffff810c8470>] ? remove_vma+0x6c/0x74
[55921.452786] [<ffffffff810c95d8>] ? do_munmap+0x307/0x329
[55921.452814] [<ffffffff810089c2>] ? system_call_fastpath+0x16/0x1b
[55921.452841] Code: c5 10 48 83 7d 00 00 eb e6 48 83 fb 10 0f 86 80 00 00 00 48 89 df e8 a9 f0 ff ff 48 89 c6 48 8b 00 84 c0 78 16 66 a9 00 c0 75 04 <0f> 0b eb fe 5b 5d 41 5c 48 89 f7 e9 7d 75 fd ff 48 8b 4c 24 18
[55921.453030] RIP [<ffffffff810df05d>] kfree+0x5b/0xc8
[55921.453057] RSP <ffff88012aa61d58>
[55921.453437] ---[ end trace 3f96fca7c9cbfb03 ]---
[55921.454368] JBD: Ignoring recovery information on journal
[55921.461099] general protection fault: 0000 [#2] SMP
[55921.461269] last sysfs file: /sys/module/drbd/parameters/cn_idx
[55921.461338] CPU 1
[55921.461385] Modules linked in: ocfs2 jbd2 quota_tree drbd xt_multiport sha1_generic hmac lru_cache cn xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs ext2 loop snd_pcm i5000_edac edac_core i5k_amb snd_timer processor snd evdev button rng_core shpchp soundcore snd_page_alloc tpm_tis pci_hotplug psmouse dcdbas tpm pcspkr tpm_bios serio_raw ext3 jbd mbcache ide_cd_mod uhci_hcd cdrom ata_generic ata_piix libata ses sd_mod enclosure crc_t10dif ehci_hcd megaraid_sas piix ide_core usbcore scsi_mod nls_base bnx2 thermal thermal_sys [last unloaded: drbd]
[55921.464840]
[55921.464902] Pid: 9281, comm: mount.ocfs2 Tainted: G D 2.6.35.6 #1 0NH278/PowerEdge 2950
[55921.464990] RIP: 0010:[<ffffffff810dffaa>] [<ffffffff810dffaa>] __kmalloc+0xd3/0x136
[55921.465065] RSP: 0018:ffff880103e21ba8 EFLAGS: 00010006
[55921.465065] RAX: 0000000000000000 RBX: 0800000000000000 RCX: ffffffffa0449421
[55921.465065] RDX: 0000000000000000 RSI: ffff88012cfaf000 RDI: 0000000000000004
[55921.465065] RBP: ffffffff81625520 R08: ffff880001a524d0 R09: 0000000000000000
[55921.465065] R10: ffff88012cfaf260 R11: ffff88012ca24420 R12: 000000000000000a
[55921.465065] R13: 00000000000080d0 R14: 00000000000080d0 R15: 0000000000000246
[55921.465065] FS: 00007fee60afe720(0000) GS:ffff880001a40000(0000) knlGS:0000000000000000
[55921.465065] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[55921.465065] CR2: 00007f764630ab8c CR3: 000000012eae3000 CR4: 00000000000006e0
[55921.465065] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[55921.465065] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[55921.465065] Process mount.ocfs2 (pid: 9281, threadinfo ffff880103e20000, task ffff88012ca24420)
[55921.465065] Stack:
[55921.465065] 0000000000000000 ffffffffa0449421 ffff88012cfaf108 ffff88012cfaf000
[55921.465065] <0> ffff88012cfaf000 ffff88012cfaf000 ffff88012aa2e000 ffff88012ca24420
[55921.465065] <0> 0000000000000200 ffffffffa0449421 0000000000000000 ffffffffa044ccec
[55921.465065] Call Trace:
[55921.465065] [<ffffffffa0449421>] ? ocfs2_compute_replay_slots+0x31/0x10f [ocfs2]
[55921.465065] [<ffffffffa0449421>] ? ocfs2_compute_replay_slots+0x31/0x10f [ocfs2]
[55921.465065] [<ffffffffa044ccec>] ? ocfs2_journal_load+0x1d0/0x2b1 [ocfs2]
[55921.465065] [<ffffffffa0473525>] ? ocfs2_fill_super+0x19a2/0x2101 [ocfs2]
[55921.465065] [<ffffffff8118aa8f>] ? snprintf+0x36/0x3b
[55921.465065] [<ffffffff810e9f9e>] ? get_sb_bdev+0x137/0x19a
[55921.465065] [<ffffffffa0471b83>] ? ocfs2_fill_super+0x0/0x2101 [ocfs2]
[55921.465065] [<ffffffff810e9675>] ? vfs_kern_mount+0xa6/0x196
[55921.465065] [<ffffffff810e97c4>] ? do_kern_mount+0x49/0xe7
[55921.465065] [<ffffffff810fdabb>] ? do_mount+0x75c/0x7d6
[55921.465065] [<ffffffff810d829a>] ? alloc_pages_current+0x9f/0xc2
[55921.465065] [<ffffffff810fdbbd>] ? sys_mount+0x88/0xc3
[55921.465065] [<ffffffff810089c2>] ? system_call_fastpath+0x16/0x1b
[55921.465065] Code: 0f 1f 44 00 00 49 89 c7 fa 66 0f 1f 44 00 00 65 4c 8b 04 25 b0 ea 00 00 48 8b 45 00 49 01 c0 49 8b 18 48 85 db 74 0d 48 63 45 18 <48> 8b 04 03 49 89 00 eb 11 83 ca ff 44 89 f6 48 89 ef e8 a1 f1
[55921.465065] RIP [<ffffffff810dffaa>] __kmalloc+0xd3/0x136
[55921.465065] RSP <ffff880103e21ba8>
[55921.465065] ---[ end trace 3f96fca7c9cbfb04 ]---
[55941.839304] o2net: accepted connection from node mail02.fxclub.org (num 1) at 192.168.1.2:7777
[55946.003594] o2dlm: Node 1 joins domain E4B99C68B65449068DC403326917DC29
[55946.003673] o2dlm: Nodes in domain E4B99C68B65449068DC403326917DC29: 0 1


Message from syslogd@mail01 at Sep 28 07:27:03 ...
kernel:[57519.645448] general protection fault: 0000 [#3] SMP

Message from syslogd@mail01 at Sep 28 07:27:03 ...
kernel:[57519.645615] last sysfs file: /sys/module/drbd/parameters/cn_idx

Message from syslogd@mail01 at Sep 28 07:27:03 ...
kernel:[57519.649409] Stack:

Message from syslogd@mail01 at Sep 28 07:27:03 ...
kernel:[57519.649409] Call Trace:

Message from syslogd@mail01 at Sep 28 07:27:03 ...
kernel:[57519.649409] Code: 0f 1f 44 00 00 49 89 c7 fa 66 0f 1f 44 00 00 65 4c 8b 04 25 b0 ea 00 00 48 8b 45 00 49 01 c0 49 8b 18 48 85 db 74 0d 48 63 45 18 <48> 8b 04 03 49 89 00 eb 11 83 ca ff 44 89 f6 48 89 ef e8 a1 f1
 
09-28-2010, 09:08 PM
Ben Hutchings

Bug#598323: linux-image-2.6.35.6: Servers reboot on heavy load on DRBD+OCFS2 partition

On Tue, 2010-09-28 at 09:47 +0100, Proskurin Kirill wrote:
> Package: linux-image-2.6.35.6
> Version: 2.6.35.6-10.00.Custom
> Severity: important
>
>
> Hello.
>
> First of all - this it my first bugreport to debian and I sorry if I
> do something wrong - just tell me what need to fix in it.
>
> I have 2 servers Dell 2950 and try to use it as a email cluster.
> I use DRBD with OCFS2 over it. Both nodes is reboot on heavy load
> every time.
>
> I report bug for a package linux-image-2.6.35.6 but it is not true - I
> have this problem on 2.6.26(stable) and 2.6.32(testing). I just try
> latest kernel to be sure.
> I try ocfs2-tools from stable and from testing - nodes reboot. I try
> DRBD8 from backports and then on 2.6.32 native and compile DRBD-8.3.8
> from sourse with 2.6.35-6 - nodes reboot.
> So I think it is a kernel relaited but I can be really wrong. Im not
> sure what couse this reboots.

Can you reproduce this in 2.6.35 or 2.6.36-rc5 (current version in
experimental) using the version of drbd that is included in it rather
than a separately built version?

Ben.

--
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.
 
09-29-2010, 02:17 PM
Proskurin Kirill

Bug#598323: linux-image-2.6.35.6: Servers reboot on heavy load on DRBD+OCFS2 partition

On 29/09/10 01:08, Ben Hutchings wrote:
> On Tue, 2010-09-28 at 09:47 +0100, Proskurin Kirill wrote:
>> Package: linux-image-2.6.35.6
>> Version: 2.6.35.6-10.00.Custom
>> Severity: important
>>
>>
>> Hello.
>>
>> First of all - this it my first bugreport to debian and I sorry if I
>> do something wrong - just tell me what need to fix in it.
>>
>> I have 2 servers Dell 2950 and try to use it as a email cluster.
>> I use DRBD with OCFS2 over it. Both nodes is reboot on heavy load
>> every time.
>>
>> I report bug for a package linux-image-2.6.35.6 but it is not true - I
>> have this problem on 2.6.26(stable) and 2.6.32(testing). I just try
>> latest kernel to be sure.
>> I try ocfs2-tools from stable and from testing - nodes reboot. I try
>> DRBD8 from backports and then on 2.6.32 native and compile DRBD-8.3.8
>> from sourse with 2.6.35-6 - nodes reboot.
>> So I think it is a kernel relaited but I can be really wrong. Im not
>> sure what couse this reboots.
>
> Can you reproduce this in 2.6.35 or 2.6.36-rc5 (current version in
> experimental) using the version of drbd that is included in it rather
> than a separately built version?


OK, I am working on it. I am having trouble getting the bnx2 driver to work in 2.6.36-rc5:

update-initramfs: Generating /boot/initrd.img-2.6.36-rc5
W: Possible missing firmware /lib/firmware/bnx2/bnx2-rv2p-09ax-5.0.0.j10.fw for module bnx2
W: Possible missing firmware /lib/firmware/bnx2/bnx2-rv2p-09-5.0.0.j10.fw for module bnx2
W: Possible missing firmware /lib/firmware/bnx2/bnx2-mips-09-5.0.0.j15.fw for module bnx2
W: Possible missing firmware /lib/firmware/bnx2/bnx2-mips-06-5.0.0.j6.fw for module bnx2



The latest firmware-bnx2 does not help. Building from source fails with many errors.
In 2.6.35 it seems to work fine. Is the 2.6.36 check mandatory?

--
Best regards,
Proskurin Kirill



--
To UNSUBSCRIBE, email to debian-kernel-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: http://lists.debian.org/4CA34A6A.7090502@fxclub.org
 
09-30-2010, 12:49 AM
Ben Hutchings

Bug#598323: linux-image-2.6.35.6: Servers reboot on heavy load on DRBD+OCFS2 partition

On Wed, 2010-09-29 at 18:17 +0400, Proskurin Kirill wrote:
> On 29/09/10 01:08, Ben Hutchings wrote:
> > On Tue, 2010-09-28 at 09:47 +0100, Proskurin Kirill wrote:
> >> Package: linux-image-2.6.35.6
> >> Version: 2.6.35.6-10.00.Custom
> >> Severity: important
> >>
> >>
> >> Hello.
> >>
> >> First of all - this it my first bugreport to debian and I sorry if I
> >> do something wrong - just tell me what need to fix in it.
> >>
> >> I have 2 servers Dell 2950 and try to use it as a email cluster.
> >> I use DRBD with OCFS2 over it. Both nodes is reboot on heavy load
> >> every time.
> >>
> >> I report bug for a package linux-image-2.6.35.6 but it is not true - I
> >> have this problem on 2.6.26(stable) and 2.6.32(testing). I just try
> >> latest kernel to be sure.
> >> I try ocfs2-tools from stable and from testing - nodes reboot. I try
> >> DRBD8 from backports and then on 2.6.32 native and compile DRBD-8.3.8
> >> from sourse with 2.6.35-6 - nodes reboot.
> >> So I think it is a kernel relaited but I can be really wrong. Im not
> >> sure what couse this reboots.
> >
> > Can you reproduce this in 2.6.35 or 2.6.36-rc5 (current version in
> > experimental) using the version of drbd that is included in it rather
> > than a separately built version?
>
> Ok. I working on it. Have problem to get work bnx2 driver in 2.6.36-rc5
>
> update-initramfs: Generating /boot/initrd.img-2.6.36-rc5
> W: Possible missing firmware
> /lib/firmware/bnx2/bnx2-rv2p-09ax-5.0.0.j10.fw for module bnx2
> W: Possible missing firmware
> /lib/firmware/bnx2/bnx2-rv2p-09-5.0.0.j10.fw for module bnx2
> W: Possible missing firmware
> /lib/firmware/bnx2/bnx2-mips-09-5.0.0.j15.fw for module bnx2
> W: Possible missing firmware /lib/firmware/bnx2/bnx2-mips-06-5.0.0.j6.fw
> for module bnx2

Oops. I've added the new firmware here:
<http://svn.debian.org/wsvn/kernel/dists/trunk/firmware-nonfree/bnx2/bnx2/>

> Lates firmware-bnx2 not helps. Build from source fail with many errors.
> In 2.6.35 it is seems to work ok. 2.6.36 check is mandatory?

No, it's OK to test 2.6.35.

Ben.

--
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.
 
09-30-2010, 03:10 PM
Proskurin Kirill

Bug#598323: linux-image-2.6.35.6: Servers reboot on heavy load on DRBD+OCFS2 partition

On 30/09/10 04:49, Ben Hutchings wrote:
> On Wed, 2010-09-29 at 18:17 +0400, Proskurin Kirill wrote:
>> On 29/09/10 01:08, Ben Hutchings wrote:
>>> On Tue, 2010-09-28 at 09:47 +0100, Proskurin Kirill wrote:
>>>> Package: linux-image-2.6.35.6
>>>> Version: 2.6.35.6-10.00.Custom
>>>> Severity: important
>>>>
>>>>
>>>> Hello.
>>>>
>>>> First of all - this it my first bugreport to debian and I sorry if I
>>>> do something wrong - just tell me what need to fix in it.
>>>>
>>>> I have 2 servers Dell 2950 and try to use it as a email cluster.
>>>> I use DRBD with OCFS2 over it. Both nodes is reboot on heavy load
>>>> every time.
>>>>
>>>> I report bug for a package linux-image-2.6.35.6 but it is not true - I
>>>> have this problem on 2.6.26(stable) and 2.6.32(testing). I just try
>>>> latest kernel to be sure.
>>>> I try ocfs2-tools from stable and from testing - nodes reboot. I try
>>>> DRBD8 from backports and then on 2.6.32 native and compile DRBD-8.3.8
>>>> from sourse with 2.6.35-6 - nodes reboot.
>>>> So I think it is a kernel relaited but I can be really wrong. Im not
>>>> sure what couse this reboots.
>>>
>>> Can you reproduce this in 2.6.35 or 2.6.36-rc5 (current version in
>>> experimental) using the version of drbd that is included in it rather
>>> than a separately built version?
>>
>> Ok. I working on it. Have problem to get work bnx2 driver in 2.6.36-rc5
>>
>> update-initramfs: Generating /boot/initrd.img-2.6.36-rc5
>> W: Possible missing firmware
>> /lib/firmware/bnx2/bnx2-rv2p-09ax-5.0.0.j10.fw for module bnx2
>> W: Possible missing firmware
>> /lib/firmware/bnx2/bnx2-rv2p-09-5.0.0.j10.fw for module bnx2
>> W: Possible missing firmware
>> /lib/firmware/bnx2/bnx2-mips-09-5.0.0.j15.fw for module bnx2
>> W: Possible missing firmware /lib/firmware/bnx2/bnx2-mips-06-5.0.0.j6.fw
>> for module bnx2
>
> Oops. I've added the new firmware here:
> <http://svn.debian.org/wsvn/kernel/dists/trunk/firmware-nonfree/bnx2/bnx2/>
>
>> Lates firmware-bnx2 not helps. Build from source fail with many errors.
>> In 2.6.35 it is seems to work ok. 2.6.36 check is mandatory?
>
> No, it's OK to test 2.6.35.
>
> Ben.



Something is strange here:
http://packages.debian.org/experimental/linux-source-2.6.35

The link goes to
http://ftp.de.debian.org/debian/pool/main/l/linux-2.6/linux-2.6_2.6.36~rc5.orig.tar.gz


That is 2.6.36, not 2.6.35.

Anyway, your firmware helped, so I went with 2.6.36-rc5.

# cd /usr/src
# wget http://ftp.de.debian.org/debian/pool/main/l/linux-2.6/linux-2.6_2.6.36~rc5.orig.tar.gz
# wget http://ftp.de.debian.org/debian/pool/main/l/linux-2.6/linux-2.6_2.6.36~rc5-1~experimental.1.dsc
# wget http://ftp.de.debian.org/debian/pool/main/l/linux-2.6/linux-2.6_2.6.36~rc5-1~experimental.1.diff.gz

# tar xf linux-2.6_2.6.36~rc5.orig.tar.gz
# gzip -dc linux-2.6_2.6.36~rc5-1~experimental.1.diff.gz > linux-2.6_2.6.36~rc5-1~experimental.1.diff

# cd linux-2.6-2.6.36~rc5
# patch -p1 < ../linux-2.6_2.6.36~rc5-1~experimental.1.diff
# cp /boot/config-2.6.32-5-amd64 config-2.6.32-5-amd64.config
# make-kpkg --rootcmd fakeroot --initrd --us --uc kernel_image

*answered all questions with the defaults*

# dpkg -i ../linux-image-2.6.36-rc5_2.6.36-rc5-10.00.Custom_amd64.deb

# reboot

DRBD recommends using 8.3.8 with 2.6.35+, so I built it from experimental.

wget, patch, then build with:
# dpkg-buildpackage -us -uc -sa -rfakeroot

# dpkg -i drbd8-utils_8.3.8.1-1_amd64.deb

Then I installed the maintainer's global_common.conf on both nodes, but added:

net {
allow-two-primaries;
}

on both to make it usable with OCFS2, and:

syncer {
rate 30M;
}

to make the sync fast; the nodes are connected via 1 Gbit/s.
(DRBD recommends setting this rate to about a third of the available bandwidth.)
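
The "bandwidth / 3" rule of thumb mentioned above can be worked through as simple arithmetic. This is only a sketch: it assumes the full nominal 1 Gbit/s is available to DRBD and ignores protocol overhead and disk throughput, and the variable names are illustrative.

```shell
# Nominal link speed in megabits per second (assumed: 1 Gbit/s link).
link_mbit=1000

# Convert to megabytes per second: 1000 / 8 = 125.
link_mbyte=$(( link_mbit / 8 ))

# Take one third of that for resynchronisation: 125 / 3 = 41 (integer division).
sync_rate=$(( link_mbyte / 3 ))

# Print the value in the form the DRBD syncer section uses.
echo "rate ${sync_rate}M;"
```

The result is in the same range as the 30M and 40M values used in this thread; note that the unit is megabytes, not megabits, per second.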

So I get:
# drbd-overview
0:drbd0 Connected Primary/Primary UpToDate/UpToDate C r----

Summary:

Kernel: 2.6.36-rc5 SMP x86_64 (from experimental)
DRBD-utils-8.3.8(from experimental)
OCFS2-1.4.4-3(from testing)
iozone3-308-1(from testing)

While updating the first node (aptitude safe-upgrade) I got a kernel panic.
Screenshot attached.


Reboot.

I mounted the OCFS2 partition and... got another hang. See the attachment.

Hm, it seems this is not stable enough for testing, but I will try one more time.

NB: Most of the time, during previous tests and now, I see the panic on the first node;
the second one just reboots.


Reboot.

Now I am able to mount OCFS2 and start the iozone test.
It has been running for a few hours and looks like it will finish cleanly. I will report how it ends
tomorrow.

--
Best regards,
Proskurin Kirill
 
10-01-2010, 02:49 AM
Ben Hutchings

Bug#598323: linux-image-2.6.35.6: Servers reboot on heavy load on DRBD+OCFS2 partition

On Thu, 2010-09-30 at 19:10 +0400, Proskurin Kirill wrote:
[...]
> Summary:
>
> Kernel: 2.6.36-rc5 SMP x86_64 (from experimental)
> DRBD-utils-8.3.8(from experimental)
> OCFS2-1.4.4-3(from testing)

ocfs2 is already included in the kernel package and you should use that.

> iozone3-308-1(from testing)
>
> While update(aptitude safe-upgrade) first node I get kernel panic.
> Screenshot in attachment.
[...]

This panic shows "Tainted: G D" which means there was a previous "oops"
message. You need to record the first one.

Ben.

--
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.
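
Editor's sketch of the point about recording the first oops: in the logs earlier in this thread each oops carries a counter ("[#1]", "[#2]", ...), and later ones run on a kernel already marked with the "D" taint flag. Assuming a saved kernel log in that format, the first oops can be cut out as shown below. The function names first_oops and died_before are made up for this illustration, and the sample lines are copied from the original report.

```shell
# Print only the first oops from a kernel log: start at the first
# "cut here" marker or "[#1]" counter, stop after its "end trace" line.
first_oops() {
    awk '
        !p && (/\[ cut here \]/ || /\[#1\]/) { p = 1 }
        p { print }
        p && /end trace/ { exit }
    ' "$@"
}

# Succeed if a taint value (as read from /proc/sys/kernel/tainted)
# has bit 7 (128) set, which is shown as "D" in the oops header.
died_before() {
    [ $(( $1 & 128 )) -ne 0 ]
}

# Sample lines taken from the log in the original report.
sample='[55921.451479] ------------[ cut here ]------------
[55921.451506] kernel BUG at mm/slub.c:2834!
[55921.451530] invalid opcode: 0000 [#1] SMP
[55921.453437] ---[ end trace 3f96fca7c9cbfb03 ]---
[55921.461099] general protection fault: 0000 [#2] SMP
[55921.465065] ---[ end trace 3f96fca7c9cbfb04 ]---'

first=$(printf '%s\n' "$sample" | first_oops)
printf '%s\n' "$first"
```

On a live system the same function could be fed from "dmesg" or a saved kernel log file. When the machine reboots before the log reaches disk, a serial console or netconsole is the usual way to capture the text; neither is configured in this thread.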
 
10-22-2010, 06:32 AM
Proskurin Kirill

Bug#598323: linux-image-2.6.35.6: Servers reboot on heavy load on DRBD+OCFS2 partition

Hello!

Sorry for such a long delay: I was ill and then on vacation.
If you are still interested in this problem, I have new results.

On 01/10/10 06:49, Ben Hutchings wrote:
> On Thu, 2010-09-30 at 19:10 +0400, Proskurin Kirill wrote:
> [...]
>> Summary:
>>
>> Kernel: 2.6.36-rc5 SMP x86_64 (from experimental)
>> DRBD-utils-8.3.8(from experimental)
>> OCFS2-1.4.4-3(from testing)
>
> ocfs2 is already included in the kernel package and you should use that.

OCFS2-1.4.4-3 (from testing) is the userspace utilities, such as mkfs.ocfs2.
Of course I use the driver from the kernel.

>> While update(aptitude safe-upgrade) first node I get kernel panic.
>> Screenshot in attachment.
> [...]
>
> This panic shows "Tainted: G D" which means there was a previous "oops"
> message. You need to record the first one.

Well, I did not get it a second time.

I can confirm that with the configuration above (everything from testing plus kernel
2.6.36-rc5) I do not get a reboot. iozone completes successfully without
any problems, so yes, it is a kernel-related problem. I retested with the
latest 2.6.32 from testing and got a reboot.


So... what should I do now?


--
Best regards,
Proskurin Kirill



--
To UNSUBSCRIBE, email to debian-kernel-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: http://lists.debian.org/4CC12FF8.3000401@fxclub.org
 
10-25-2010, 08:05 PM
Ben Hutchings

Bug#598323: linux-image-2.6.35.6: Servers reboot on heavy load on DRBD+OCFS2 partition

On Fri, Oct 22, 2010 at 10:32:24AM +0400, Proskurin Kirill wrote:
> Hello!
>
> Sorry for such big delay - I was ill and then on vacation.
> I you still have an interest in this problem - I have new results.
>
> On 01/10/10 06:49, Ben Hutchings wrote:
>> On Thu, 2010-09-30 at 19:10 +0400, Proskurin Kirill wrote:
>> [...]
>>> Summary:
>>>
>>> Kernel: 2.6.36-rc5 SMP x86_64 (from experimental)
>>> DRBD-utils-8.3.8(from experimental)
>>> OCFS2-1.4.4-3(from testing)
>>
>> ocfs2 is already included in the kernel package and you should use that.
> OCFS2-1.4.4-3(from testing) - it is a userspace utility like mkfs.ocfs2.
> Of course I use driver from kernel.

OK, good.

>>> While update(aptitude safe-upgrade) first node I get kernel panic.
>>> Screenshot in attachment.
>> [...]
>>
>> This panic shows "Tainted: G D" which means there was a previous "oops"
>> message. You need to record the first one.
> Well I not got it twice.
>
> I can confirm what on configuration above(all testing + kernel
> 2.6.36-rc5) I don`t got a reboot. iozone complete successfully without
> any problems so yes - it is a kernel relaited problem. I retest it on
> latest 2.6.32 from testing - and got reboot.
>
> So... what should I do now?

I'm sorry but I don't have any idea where the problem is. So far as I
can see, there are no bug fixes to drbd or ocfs2 in 2.6.36-rc5 that are
not also in 2.6.35.6. Maybe the bug is elsewhere and just triggered by
this combination of storage driver and filesystem. Or, given that you
said that even 2.6.36-rc5 did crash once, it could be that the hardware
is unreliable.

So there are two things you could try, but I am not very hopeful:
1. Run a RAM test such as memtest86+.
2. Use 'git bisect' to find the change that makes the difference.
Normally you would use this to find when a bug was introduced, but
you can also use it to find when a bug was fixed if you reverse the
'good' and 'bad' labels.
See <http://book.git-scm.com/5_finding_issues_-_git_bisect.html>.

Ben.


--
Ben Hutchings
We get into the habit of living before acquiring the habit of thinking.
- Albert Camus
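
Editor's sketch of the reversed 'git bisect' suggested in item 2 above, run against a throwaway repository instead of the kernel tree. Rather than swapping 'good' and 'bad' by hand, it uses git's custom bisect terms (available in reasonably recent git versions), so the old state is called "broken" and the new state "fixed". The repository layout, the file named "state", and the choice of commit 7 as the fix are all invented for this demonstration.

```shell
set -e

# Throwaway repository with ten commits; the "fix" lands in commit 7.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.invalid"
git config user.name "demo"

i=1
while [ "$i" -le 10 ]; do
    if [ "$i" -ge 7 ]; then echo fixed > state; else echo broken > state; fi
    echo "$i" > counter
    git add state counter
    git commit -q -m "commit $i"
    i=$(( i + 1 ))
done

# Old commits are "broken", new commits are "fixed".
first=$(git rev-list --max-parents=0 HEAD)
git bisect start --term-old=broken --term-new=fixed > /dev/null
git bisect broken "$first" > /dev/null
git bisect fixed HEAD > /dev/null

# The test command exits 0 while the tree is still in the old (broken)
# state and non-zero once it is in the new (fixed) state.
git bisect run sh -c 'grep -q broken state' > /dev/null

# The ref named after the "new" term points at the first fixed commit.
fix_commit=$(git log -1 --format=%s refs/bisect/fixed)
echo "first fixed commit: $fix_commit"

git bisect reset > /dev/null 2>&1
```

In the kernel tree the two endpoints would be the last version that still reboots and the first one that survives the iozone run, and each bisect step means building, booting, and rerunning the test by hand instead of using "git bisect run".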
 
