Bug#598323: linux-image-2.6.35.6: Servers reboot on heavy load on DRBD+OCFS2 partition
Package: linux-image-2.6.35.6
Version: 2.6.35.6-10.00.Custom
Severity: important
Hello.
First of all, this is my first bug report to Debian, so I am sorry if I did something wrong; just tell me what needs to be fixed in it.
I have two Dell 2950 servers and am trying to use them as an email cluster.
I use DRBD with OCFS2 on top of it. Both nodes reboot under heavy load every time.
I am reporting this bug against the package linux-image-2.6.35.6, but that is not quite accurate: I have the same problem on 2.6.26 (stable) and 2.6.32 (testing). I only tried the latest kernel to be sure.
I tried ocfs2-tools from stable and from testing: the nodes reboot. I tried DRBD8 from backports, then the native DRBD on 2.6.32, and then compiled DRBD-8.3.8 from source against 2.6.35-6: the nodes reboot.
So I think it is kernel related, but I could be really wrong. I'm not sure what causes these reboots.
What I did:
1) Create the DRBD metadata on both nodes:
drbdadm create-md drbd0
Kernel: Linux 2.6.35.6 (SMP w/4 CPU cores)
Locale: LANG=ru_RU.UTF-8, LC_CTYPE=ru_RU.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Versions of packages linux-image-2.6.35.6 depends on:
ii coreutils 8.5-1 GNU core utilities
ii debconf [debconf-2.0] 1.5.35 Debian configuration management sy
linux-image-2.6.35.6 recommends no packages.
Versions of packages linux-image-2.6.35.6 suggests:
pn fdutils <none> (no description available)
pn ksymoops <none> (no description available)
pn linux-doc-2.6.35.6 | linux-so <none> (no description available)
pn linux-image-2.6.35.6-dbg <none> (no description available)
node:
	ip_port = 7777
	ip_address = 192.168.1.2
	number = 1
	name = mail02.fxclub.org
	cluster = ocfs2
cluster:
	node_count = 2
	name = ocfs2
resource drbd0 {
    on mail01.fxclub.org {
        device /dev/drbd0;
        disk /dev/sda9;
        address 192.168.1.1:7789;
        meta-disk internal;
    }
    on mail02.fxclub.org {
        device /dev/drbd0;
        disk /dev/sda9;
        address 192.168.1.2:7789;
        meta-disk internal;
    }
}
global {
    usage-count yes;
    # minor-count dialog-refresh disable-ip-verification
}
common {
    protocol C;
    handlers {
        # What should be done in case the node is primary, degraded (=no connection) and has inconsistent data.
        #pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        #pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /sbin/ifconfig eth1 down";
        # The node is currently primary, but lost the after split brain auto recovery procedure. As a consequence it should go away.
        #pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
        #pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /sbin/ifconfig eth1 down";
        #local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
        #outdate-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
        # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        #split-brain "/usr/lib/drbd/notify-split-brain.sh root";
        # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
        # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
        # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
    }
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.461099] general protection fault: 0000 [#2] SMP
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.461269] last sysfs file: /sys/module/drbd/parameters/cn_idx
mail01:/usr/local/sbin#
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.465065] Stack:
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.465065] Call Trace:
Message from syslogd@mail01 at Sep 28 07:00:25 ...
kernel:[55921.465065] Code: 0f 1f 44 00 00 49 89 c7 fa 66 0f 1f 44 00 00 65 4c 8b 04 25 b0 ea 00 00 48 8b 45 00 49 01 c0 49 8b 18 48 85 db 74 0d 48 63 45 18 <48> 8b 04 03 49 89 00 eb 11 83 ca ff 44 89 f6 48 89 ef e8 a1 f1
Message from syslogd@mail01 at Sep 28 07:27:03 ...
kernel:[57519.645448] general protection fault: 0000 [#3] SMP
Message from syslogd@mail01 at Sep 28 07:27:03 ...
kernel:[57519.645615] last sysfs file: /sys/module/drbd/parameters/cn_idx
Message from syslogd@mail01 at Sep 28 07:27:03 ...
kernel:[57519.649409] Stack:
Message from syslogd@mail01 at Sep 28 07:27:03 ...
kernel:[57519.649409] Call Trace:
Message from syslogd@mail01 at Sep 28 07:27:03 ...
kernel:[57519.649409] Code: 0f 1f 44 00 00 49 89 c7 fa 66 0f 1f 44 00 00 65 4c 8b 04 25 b0 ea 00 00 48 8b 45 00 49 01 c0 49 8b 18 48 85 db 74 0d 48 63 45 18 <48> 8b 04 03 49 89 00 eb 11 83 ca ff 44 89 f6 48 89 ef e8 a1 f1
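Note that the faults captured above are numbered [#2] and [#3]. The bracketed counter increments with each oops since boot, so the first one ([#1]) is not shown here. A minimal sketch for pulling that first entry out of a saved log, assuming the log was flushed to disk before the reboot; the default path /var/log/kern.log and the helper name first_oops are assumptions, not something taken from the report:

```shell
# Sketch: print the first oops header ("[#1]") found in a kernel log file.
# The function name and the default log path are assumptions.
first_oops() {
    logfile=${1:-/var/log/kern.log}
    # The bracketed counter increments with each oops since boot,
    # so "[#1]" marks the earliest one. -F treats the pattern as a
    # fixed string; -m 1 stops after the first matching line.
    grep -m 1 -F '[#1]' "$logfile"
}
```

If the machine reboots before syslog writes the message, nothing will be found; in that case the output has to be captured some other way (for example over a serial console).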
09-28-2010, 09:08 PM
Ben Hutchings
On Tue, 2010-09-28 at 09:47 +0100, Proskurin Kirill wrote:
> Package: linux-image-2.6.35.6
> Version: 2.6.35.6-10.00.Custom
> Severity: important
>
>
> Hello.
>
> First of all, this is my first bug report to Debian, so I am sorry if
> I did something wrong; just tell me what needs to be fixed in it.
>
> I have two Dell 2950 servers and am trying to use them as an email
> cluster. I use DRBD with OCFS2 on top of it. Both nodes reboot under
> heavy load every time.
>
> I am reporting this bug against the package linux-image-2.6.35.6, but
> that is not quite accurate: I have the same problem on 2.6.26 (stable)
> and 2.6.32 (testing). I only tried the latest kernel to be sure.
> I tried ocfs2-tools from stable and from testing: the nodes reboot. I
> tried DRBD8 from backports, then the native DRBD on 2.6.32, and then
> compiled DRBD-8.3.8 from source against 2.6.35-6: the nodes reboot.
> So I think it is kernel related, but I could be really wrong. I'm not
> sure what causes these reboots.
Can you reproduce this in 2.6.35 or 2.6.36-rc5 (current version in
experimental) using the version of drbd that is included in it rather
than a separately built version?
Ben.
--
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.
09-29-2010, 02:17 PM
Proskurin Kirill
On 29/09/10 01:08, Ben Hutchings wrote:
> Can you reproduce this in 2.6.35 or 2.6.36-rc5 (current version in
> experimental) using the version of drbd that is included in it rather
> than a separately built version?
OK, I am working on it. I have a problem getting the bnx2 driver to work in 2.6.36-rc5:
update-initramfs: Generating /boot/initrd.img-2.6.36-rc5
W: Possible missing firmware /lib/firmware/bnx2/bnx2-rv2p-09ax-5.0.0.j10.fw for module bnx2
W: Possible missing firmware /lib/firmware/bnx2/bnx2-rv2p-09-5.0.0.j10.fw for module bnx2
W: Possible missing firmware /lib/firmware/bnx2/bnx2-mips-09-5.0.0.j15.fw for module bnx2
W: Possible missing firmware /lib/firmware/bnx2/bnx2-mips-06-5.0.0.j6.fw for module bnx2
The latest firmware-bnx2 does not help. Building from source fails with many errors.
In 2.6.35 it seems to work OK. Is the 2.6.36 check mandatory?
--
Best regards,
Proskurin Kirill
--
To UNSUBSCRIBE, email to debian-kernel-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: http://lists.debian.org/4CA34A6A.7090502@fxclub.org
09-30-2010, 12:49 AM
Ben Hutchings
On Wed, 2010-09-29 at 18:17 +0400, Proskurin Kirill wrote:
> On 29/09/10 01:08, Ben Hutchings wrote:
> > Can you reproduce this in 2.6.35 or 2.6.36-rc5 (current version in
> > experimental) using the version of drbd that is included in it rather
> > than a separately built version?
>
> OK, I am working on it. I have a problem getting the bnx2 driver to
> work in 2.6.36-rc5:
>
> update-initramfs: Generating /boot/initrd.img-2.6.36-rc5
> W: Possible missing firmware /lib/firmware/bnx2/bnx2-rv2p-09ax-5.0.0.j10.fw for module bnx2
> W: Possible missing firmware /lib/firmware/bnx2/bnx2-rv2p-09-5.0.0.j10.fw for module bnx2
> W: Possible missing firmware /lib/firmware/bnx2/bnx2-mips-09-5.0.0.j15.fw for module bnx2
> W: Possible missing firmware /lib/firmware/bnx2/bnx2-mips-06-5.0.0.j6.fw for module bnx2
Oops. I've added the new firmware here:
<http://svn.debian.org/wsvn/kernel/dists/trunk/firmware-nonfree/bnx2/bnx2/>
> The latest firmware-bnx2 does not help. Building from source fails with many errors.
> In 2.6.35 it seems to work OK. Is the 2.6.36 check mandatory?
No, it's OK to test 2.6.35.
Ben.
--
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.
09-30-2010, 03:10 PM
Proskurin Kirill
On 30/09/10 04:49, Ben Hutchings wrote:
> ocfs2 is already included in the kernel package and you should use that.
OCFS2-1.4.4-3 (from testing) is the userspace utilities, like mkfs.ocfs2.
Of course I use the driver from the kernel.
>> While updating (aptitude safe-upgrade) the first node I got a kernel panic.
>> Screenshot in attachment.
> [...]
> This panic shows "Tainted: G D" which means there was a previous "oops"
> message. You need to record the first one.
Well, I did not get it a second time.
I can confirm that with the configuration above (all testing + kernel
2.6.36-rc5) I do not get a reboot. iozone completes successfully without
any problems, so yes, it is a kernel-related problem. I retested on the
latest 2.6.32 from testing and got a reboot.
So... what should I do now?
--
Best regards,
Proskurin Kirill
--
Archive: http://lists.debian.org/4CC12FF8.3000401@fxclub.org
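The "Tainted: G D" remark can also be checked on a running system: the D flag corresponds to bit 7 (value 128) of the mask in /proc/sys/kernel/tainted and is set once the kernel has hit an oops or BUG. A small sketch, with taint_has_oops as a hypothetical helper name:

```shell
# Sketch: report whether a kernel taint bitmask has the "D" (died) flag set.
# Bit 7 (value 128) is set after an oops or BUG; the helper name is illustrative.
taint_has_oops() {
    mask=$1
    if [ $(( mask & 128 )) -ne 0 ]; then
        echo yes
    else
        echo no
    fi
}

# On a live system the mask can be read from procfs:
#   taint_has_oops "$(cat /proc/sys/kernel/tainted)"
```

A "yes" here means a later panic is probably a follow-on failure, and the earlier oops is the one worth recording.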
10-25-2010, 08:05 PM
Ben Hutchings
On Fri, Oct 22, 2010 at 10:32:24AM +0400, Proskurin Kirill wrote:
> Hello!
>
> Sorry for such a long delay: I was ill and then on vacation.
> If you still have an interest in this problem, I have new results.
>
> On 01/10/10 06:49, Ben Hutchings wrote:
>> On Thu, 2010-09-30 at 19:10 +0400, Proskurin Kirill wrote:
>> [...]
>>> Summary:
>>>
>>> Kernel: 2.6.36-rc5 SMP x86_64 (from experimental)
>>> DRBD-utils-8.3.8(from experimental)
>>> OCFS2-1.4.4-3(from testing)
>>
>> ocfs2 is already included in the kernel package and you should use that.
> OCFS2-1.4.4-3 (from testing) is the userspace utilities, like mkfs.ocfs2.
> Of course I use the driver from the kernel.
OK, good.
>>> While updating (aptitude safe-upgrade) the first node I got a kernel panic.
>>> Screenshot in attachment.
>> [...]
>>
>> This panic shows "Tainted: G D" which means there was a previous "oops"
>> message. You need to record the first one.
> Well, I did not get it a second time.
>
> I can confirm that with the configuration above (all testing + kernel
> 2.6.36-rc5) I do not get a reboot. iozone completes successfully
> without any problems, so yes, it is a kernel-related problem. I
> retested on the latest 2.6.32 from testing and got a reboot.
>
> So... what should I do now?
I'm sorry but I don't have any idea where the problem is. So far as I
can see, there are no bug fixes to drbd or ocfs2 in 2.6.36-rc5 that are
not also in 2.6.35.6. Maybe the bug is elsewhere and just triggered by
this combination of storage driver and filesystem. Or, given that you
said that even 2.6.36-rc5 did crash once, it could be that the hardware
is unreliable.
So there are two things you could try, but I am not very hopeful:
1. Run a RAM test such as memtest86+.
2. Use 'git bisect' to find the change that makes the difference.
Normally you would use this to find when a bug was introduced, but
you can also use it to find when a bug was fixed if you reverse the
'good' and 'bad' labels.
See <http://book.git-scm.com/5_finding_issues_-_git_bisect.html>.
Ben.
--
Ben Hutchings
We get into the habit of living before acquiring the habit of thinking.
- Albert Camus
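A sketch of the reversed bisect described in point 2 above, assuming a kernel git checkout and git 2.7 or later, which accepts custom terms so that "good" and "bad" do not have to be swapped mentally (with an older git, swap the labels by hand as described). The helper name bisect_for_fix and the tags in the usage line are assumptions; pick the revisions that were actually tested:

```shell
# Sketch: bisect for the commit that FIXED a crash (old = broken, new = fixed).
# Assumes git >= 2.7 for the --term-old/--term-new options.
# Usage (inside a kernel git tree): bisect_for_fix <broken-rev> <fixed-rev>
bisect_for_fix() {
    broken_rev=$1
    fixed_rev=$2
    # Custom terms avoid the confusing "good = still crashes" inversion.
    git bisect start --term-old=broken --term-new=fixed || return 1
    git bisect broken "$broken_rev" || return 1
    git bisect fixed "$fixed_rev" || return 1
    # git now checks out a midpoint commit. Build it, boot it, run the
    # load test, then report the result with one of:
    #   git bisect broken   (the node still reboots under load)
    #   git bisect fixed    (the load test completes)
    # Repeat until git names the first "fixed" commit, then: git bisect reset
}

# Example call (tags are an assumption about which endpoints to use):
#   bisect_for_fix v2.6.35 v2.6.36-rc5
```

Each step needs a full kernel build and a load test on the cluster, so the number of steps (roughly log2 of the number of commits in the range) is worth estimating before starting.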