When running linux-image-2.6.32-trunk-amd64, the network stops
responding if large amounts of traffic are transmitted/received. Running
ifdown eth0 followed by ifup eth0 restores operation of the network.
There are no errors relating to this failure logged in /var/log that I
could see.
Downgrading to linux-image-2.6.30-2-amd64 results in a stable network.
Not sure if this is a forcedeth specific problem or a general problem in
the newer kernel (I have seen problems with forcedeth on other
distro/kernel combinations).
Happy to run further diagnostics to tie this down if you let me know
what to run.
-- Package-specific info:
** Kernel log: boot messages should be attached
** Model information
sys_vendor: Supermicro
product_name: H8DMU
product_version: 1234567890
chassis_vendor: To Be Filled By O.E.M.
chassis_version: To Be Filled By O.E.M.
bios_vendor: American Megatrends Inc.
bios_version: 080014
board_vendor: Supermicro
board_name: H8DMU
board_version: 1234567890
** Network interface configuration:
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).
# The loopback network interface
auto lo
iface lo inet loopback
Interrupt: pin A routed to IRQ 11
Region 0: I/O ports at dc00 [size=64]
Region 4: I/O ports at 2d00 [size=64]
Region 5: I/O ports at 2e00 [size=64]
Capabilities: [44] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Kernel driver in use: nForce2_smbus
00:02.0 USB Controller [0c03]: nVidia Corporation MCP55 USB Controller
[10de:036c] (rev a1) (prog-if 10 [OHCI])
Latency: 64 (2000ns min), Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 10
Region 0: Memory at f0000000 (32-bit, prefetchable) [size=128M]
Region 1: I/O ports at e000 [size=256]
Region 2: Memory at febf0000 (32-bit, non-prefetchable) [size=64K]
Expansion ROM at feb00000 [disabled] [size=128K]
Capabilities: [50] Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
** USB devices:
Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 001 Device 002: ID 14dd:0002 Raritan Computer, Inc.
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Kernel: Linux 2.6.30-2-amd64 (SMP w/8 CPU cores)
Locale: LANG=en_IE.UTF-8, LC_CTYPE=en_IE.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Versions of packages linux-image-2.6.32-trunk-amd64 depends on:
ii debconf [debconf-2.0] 1.5.28 Debian configuration management sy
ii initramfs-tools [linux-initr 0.93.4 tools for generating an initramfs
ii module-init-tools 3.12~pre1-1 tools for managing Linux kernel mo
Versions of packages linux-image-2.6.32-trunk-amd64 recommends:
ii firmware-linux-free 2.6.32-5 Binary firmware for various driver
Versions of packages linux-image-2.6.32-trunk-amd64 suggests:
pn grub | lilo <none> (no description available)
pn linux-doc-2.6.32 <none> (no description available)
Versions of packages linux-image-2.6.32-trunk-amd64 is related to:
pn firmware-bnx2 <none> (no description available)
pn firmware-bnx2x <none> (no description available)
pn firmware-ipw2x00 <none> (no description available)
pn firmware-ivtv <none> (no description available)
pn firmware-iwlwifi <none> (no description available)
pn firmware-linux <none> (no description available)
pn firmware-linux-nonfree <none> (no description available)
pn firmware-qlogic <none> (no description available)
pn firmware-ralink <none> (no description available)
--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie http://webstar.deri.ie http://sindice.com
--
To UNSUBSCRIBE, email to debian-kernel-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: http://lists.debian.org/4B8CE982.8080900@deri.org
03-05-2010, 02:54 AM
Ben Hutchings
Bug#572201: linux-image-2.6.32-trunk-amd64: forcedeth driver hangs under heavy load
On Tue, 2010-03-02 at 10:33 +0000, stephen mulcahy wrote:
> Package: linux-2.6
> Version: 2.6.32-5
> Severity: grave
>
> When running linux-image-2.6.32-trunk-amd64, the network stops
> responding if large amounts of traffic are transmitted/received. Running
> ifdown eth0 followed by ifup eth0 restores operation of the network.
> There are no errors relating to this failure logged in /var/log that I
> could see.
There is a 'TX watchdog' that will log a warning if a network device
stops transmitting and the driver has not reported that the link is
down. If you do not see such a warning in the log then the problem may
lie outside this driver.
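A minimal sketch of how to check a log for that warning. It assumes the watchdog message contains the text "NETDEV WATCHDOG" (the string printed by the kernel's generic network watchdog); has_tx_watchdog is a hypothetical helper name, not something from this thread.

```shell
#!/bin/sh
# Sketch: report whether a kernel log contains a TX watchdog warning.
# Assumption: the warning contains the text "NETDEV WATCHDOG".
# has_tx_watchdog is a hypothetical helper; it reads the log on stdin
# and returns 0 if the warning is present, non-zero otherwise.
has_tx_watchdog() {
    grep -q 'NETDEV WATCHDOG'
}
```

It could be used as `dmesg | has_tx_watchdog && echo "TX watchdog fired"`, or with `/var/log/kern.log` redirected to stdin.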
What protocol(s) are you using when this occurs?
> Downgrading to linux-image-2.6.30-2-amd64 results in a stable network.
> Not sure if this is a forcedeth specific problem or a general problem in
> the newer kernel (I have seen problems with forcedeth on other
> distro/kernel combinations).
I don't believe this is a general problem.
> Happy to run further diagnostics to tie this down if you let me know
> what to run.
We'll want to see the kernel log (output from dmesg) after this happens,
even if you can't spot anything in it.
The device statistics (output from ethtool -S eth0) might also be
informative.
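The two requests above could be captured in one step right after a hang, roughly as sketched here. capture_diag is a hypothetical helper, and the output directory and file names are assumptions chosen for illustration.

```shell
#!/bin/sh
# Sketch: save the kernel log and NIC statistics right after a hang.
# capture_diag is a hypothetical helper. Arguments:
#   $1 = interface name (e.g. eth0), $2 = output directory.
# The file naming scheme is an assumption, not a convention.
capture_diag() {
    iface=$1
    outdir=$2
    stamp=$(date +%Y%m%d-%H%M%S)
    mkdir -p "$outdir" || return 1
    # Kernel ring buffer (output from dmesg).
    dmesg > "$outdir/dmesg-$stamp.txt"
    # Per-device driver statistics (output from ethtool -S).
    ethtool -S "$iface" > "$outdir/ethtool-S-$iface-$stamp.txt"
}
```

For example, `capture_diag eth0 /root/bug572201` run from the console, before bringing the interface down and up again, so the statistics reflect the hung state.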
Ben.
--
Ben Hutchings
Q. Which is the greater problem in the world today, ignorance or apathy?
A. I don't know and I couldn't care less.
03-08-2010, 02:41 PM
stephen mulcahy
Bug#572201: linux-image-2.6.32-trunk-amd64: forcedeth driver hangs under heavy load
Ben Hutchings wrote:
> What protocol(s) are you using when this occurs?
I was running the Hadoop application (http://hadoop.apache.org/) which
uses TCP as far as I know.
I just tried to reproduce the problem using iperf but have sent GBs
between two machines running 2.6.32 without seeing any problems. Will
try running Hadoop across all my systems, with only 2 running 2.6.32 and
see if I can replicate the problem. Otherwise will roll them all back to
2.6.32 and work from there.
>> Happy to run further diagnostics to tie this down if you let me know
>> what to run.
> We'll want to see the kernel log (output from dmesg) after this happens,
> even if you can't spot anything in it.
> The device statistics (output from ethtool -S eth0) might also be
> informative.
Ok, will post both of those when I manage to reproduce the problem again.
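One possible way to make the statistics easier to read is to save `ethtool -S eth0` once while the network is healthy and once after it hangs, then list only the counters that changed. The sketch below assumes the usual "name: value" layout of that output; diff_counters is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print counters whose values differ between two saved
# `ethtool -S` outputs. Assumes each counter line looks like
# "     name: value". diff_counters is a hypothetical helper.
#   $1 = snapshot taken before the hang, $2 = snapshot taken after.
diff_counters() {
    awk -F': *' '
        # Strip the leading indentation from the counter name.
        { sub(/^[ \t]+/, "", $1) }
        # First file: remember each counter value.
        NR == FNR { before[$1] = $2; next }
        # Second file: report counters whose value changed.
        ($1 in before) && before[$1] != $2 {
            print $1 ": " before[$1] " -> " $2
        }
    ' "$1" "$2"
}
```

Each changed counter is printed as `name: old -> new`; unchanged counters are omitted.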
-stephen
Archive: http://lists.debian.org/4B951AB3.9050601@deri.org
03-09-2010, 08:41 AM
stephen mulcahy
Bug#572201: linux-image-2.6.32-trunk-amd64: forcedeth driver hangs under heavy load
So I changed all 45 nodes in the cluster back to the 2.6.32 kernel and
restarted a test job. After 15-20 minutes, some of the nodes had dropped
out: they still responded to pings, but it was impossible to ssh to them.
Output from one node follows (captured after connecting to the console).
Ben Hutchings wrote:
> We'll want to see the kernel log (output from dmesg) after this happens,
> even if you can't spot anything in it.
The problem first started at 08:56. As you can see, the last messages in
/var/log/kern.log are from 08:31 (just after booting), and the output
from dmesg is identical:
Mar 9 08:31:11 node20 kernel: [ 9.102710] sdc1
Mar 9 08:31:11 node20 kernel: [ 9.102890] sd 2:0:0:0: [sdc] Attached SCSI disk
Mar 9 08:31:11 node20 kernel: [ 9.104582] sda1 sda2 < sda5 sda6 >
Mar 9 08:31:11 node20 kernel: [ 9.131101] sd 0:0:0:0: [sda] Attached SCSI disk
Mar 9 08:31:11 node20 kernel: [ 9.133794] sd 0:0:0:0: Attached scsi generic sg0 type 0
Mar 9 08:31:11 node20 kernel: [ 9.133824] sd 1:0:0:0: Attached scsi generic sg1 type 0
Mar 9 08:31:11 node20 kernel: [ 9.133849] sd 2:0:0:0: Attached scsi generic sg2 type 0
Mar 9 08:31:11 node20 kernel: [ 9.133875] sd 3:0:0:0: Attached scsi generic sg3 type 0
Mar 9 08:31:11 node20 kernel: [ 9.133970] sr 4:0:0:0: Attached scsi generic sg4 type 5
Mar 9 08:31:11 node20 kernel: [ 9.324705] PM: Starting manual resume from disk
Mar 9 08:31:11 node20 kernel: [ 9.339131] EXT4-fs (sda1): INFO: recovery required on readonly filesystem
Mar 9 08:31:11 node20 kernel: [ 9.339135] EXT4-fs (sda1): write access will be enabled during recovery
Mar 9 08:31:11 node20 kernel: [ 10.429924] EXT4-fs (sda1): recovery complete
Mar 9 08:31:11 node20 kernel: [ 10.430834] EXT4-fs (sda1): mounted filesystem with ordered data mode
Mar 9 08:31:11 node20 kernel: [ 11.227223] udev: starting version 151
Mar 9 08:31:11 node20 kernel: [ 11.447716] processor LNXCPU:00: registered as cooling_device0
Mar 9 08:31:11 node20 kernel: [ 11.448024] processor LNXCPU:01: registered as cooling_device1
Mar 9 08:31:11 node20 kernel: [ 11.448329] processor LNXCPU:02: registered as cooling_device2
Mar 9 08:31:11 node20 kernel: [ 11.448646] processor LNXCPU:03: registered as cooling_device3
Mar 9 08:31:11 node20 kernel: [ 11.448950] processor LNXCPU:04: registered as cooling_device4
Mar 9 08:31:11 node20 kernel: [ 11.449253] processor LNXCPU:05: registered as cooling_device5
Mar 9 08:31:11 node20 kernel: [ 11.449557] processor LNXCPU:06: registered as cooling_device6
Mar 9 08:31:11 node20 kernel: [ 11.449858] processor LNXCPU:07: registered as cooling_device7
Mar 9 08:31:11 node20 kernel: [ 11.584225] i2c i2c-0: nForce2 SMBus adapter at 0x2d00
Mar 9 08:31:11 node20 kernel: [ 11.584244] i2c i2c-1: nForce2 SMBus adapter at 0x2e00
Mar 9 08:31:11 node20 kernel: [ 11.614078] input: PC Speaker as /devices/platform/pcspkr/input/input5
Mar 9 08:31:11 node20 kernel: [ 11.699803] EDAC MC: Ver: 2.1.0 Jan 10 2010
Mar 9 08:31:11 node20 kernel: [ 11.826247] EDAC amd64_edac: Ver: 3.2.0 Jan 10 2010
Mar 9 08:31:11 node20 kernel: [ 11.826765] Error: Driver 'pcspkr' is already registered, aborting...
Mar 9 08:31:11 node20 kernel: [ 11.826812] EDAC amd64: ECC is enabled by BIOS.
Mar 9 08:31:11 node20 kernel: [ 11.826992] EDAC amd64: ECC is enabled by BIOS.
Mar 9 08:31:11 node20 kernel: [ 11.827029] EDAC MC: F10h CPU detected
Mar 9 08:31:11 node20 kernel: [ 11.827095] EDAC MC0: Giving out device to 'amd64_edac' 'Family 10h': DEV 0000:00:18.2
Mar 9 08:31:11 node20 kernel: [ 11.827098] EDAC MC: F10h CPU detected
Mar 9 08:31:11 node20 kernel: [ 11.827157] EDAC MC1: Giving out device to 'amd64_edac' 'Family 10h': DEV 0000:00:19.2
Mar 9 08:31:11 node20 kernel: [ 11.827174] EDAC PCI0: Giving out device to module 'amd64_edac' controller 'EDAC PCI controller': DEV '0000:00:18.2' (POLLED)
Mar 9 08:31:11 node20 kernel: [ 12.124910] Adding 32170120k swap on /dev/sda6. Priority:-1 extents:1 across:32170120k
Mar 9 08:31:11 node20 kernel: [ 12.338321] loop: module loaded
Mar 9 08:31:11 node20 kernel: [ 14.407171] EXT4-fs (sda5): mounted filesystem with ordered data mode
Mar 9 08:31:11 node20 kernel: [ 15.002385] EXT4-fs (sdb1): recovery complete
Mar 9 08:31:11 node20 kernel: [ 15.002786] EXT4-fs (sdb1): mounted filesystem with ordered data mode
Mar 9 08:31:11 node20 kernel: [ 15.568677] EXT4-fs (sdc1): recovery complete
Mar 9 08:31:11 node20 kernel: [ 15.570225] EXT4-fs (sdc1): mounted filesystem with ordered data mode
Mar 9 08:31:11 node20 kernel: [ 16.180705] EXT4-fs (sdd1): recovery complete
Mar 9 08:31:11 node20 kernel: [ 16.180909] EXT4-fs (sdd1): mounted filesystem with ordered data mode
Mar 9 08:31:11 node20 kernel: [ 16.834881] alloc irq_desc for 30 on node 0
Mar 9 08:31:11 node20 kernel: [ 16.834885] alloc kstat_irqs on node 0
Mar 9 08:31:11 node20 kernel: [ 16.834895] forcedeth 0000:00:08.0: irq 30 for MSI/MSI-X
Mar 9 08:31:21 node20 kernel: [ 27.456017] eth0: no IPv6 routers present
> The device statistics (output from ethtool -S eth0) might also be
> informative.
If I ifdown eth0 and then ifup eth0, I can again connect to the system
without problems.
Thanks,
-stephen
Archive: http://lists.debian.org/4B9617B9.4030203@deri.org