Apologies if this has already been reported, but I couldn't find anything quite matching what I'm seeing.
I have a 26TB Debian squeeze fileserver providing NFS mounts to a large number of users. The system has been working flawlessly for a number of months, but twice in the last week NFS seems to have crashed. The first sign was users reporting that they could not access shares. On logging into the system I see a single nfsd process using 100% CPU with a very long run time. Restarting nfs-kernel-server has no effect. The process is unkillable (even with kill -9) and the system has needed a reboot to become usable again. jnettop shows no significant network traffic, and lsof on /export/ (where all my NFS exports are located) shows no NFS access to any files.
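For anyone wanting to capture the state of the stuck nfsd thread before rebooting, here is a minimal sketch that reads /proc/<pid>/stat for every process named nfsd and prints its scheduler state and accumulated CPU time. The function names (parse_stat_line, nfsd_threads) are made up for illustration, and the field positions are taken from proc(5). A thread sitting in state D (uninterruptible sleep), or in R with a steadily growing CPU time, would be consistent with the symptoms described above, though it would not by itself identify the cause.

```python
import os


def parse_stat_line(line):
    """Parse one /proc/<pid>/stat line into (pid, comm, state, cpu_ticks)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left, _, rest = line.rpartition(")")
    pid_text, _, comm = left.partition("(")
    fields = rest.split()
    state = fields[0]        # field 3 in proc(5): R, S, D, Z, ...
    utime = int(fields[11])  # field 14: clock ticks spent in user mode
    stime = int(fields[12])  # field 15: clock ticks spent in kernel mode
    return int(pid_text), comm, state, utime + stime


def nfsd_threads(proc_root="/proc"):
    """Return parsed stat tuples for every process whose comm is 'nfsd'."""
    if not os.path.isdir(proc_root):
        return []
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                info = parse_stat_line(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while we were reading, or unreadable
        if info[1] == "nfsd":
            results.append(info)
    return results


if __name__ == "__main__":
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    for pid, comm, state, cpu in nfsd_threads():
        print(f"{pid:>7} {comm} state={state} cpu={cpu / ticks_per_second:.1f}s")
```

Running it twice a few seconds apart and comparing the cpu column shows which thread, if any, is the one spinning.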
Please let me know if you need any further information. I am going to reboot the server now, so I may not be able to reproduce the problem straight away (but as it has happened twice, I am quite sure it will happen again at some point...).
Thanks in advance for your help.
Dan Tomlinson
My /etc/exports file is below:
# /etc/exports: the access control list for filesystems which may be exported
# to NFS clients. See exports(5).
#
# Example for NFSv2 and NFSv3:
# /srv/homes hostname1(no_subtree_check,rw,sync,no_subtree_check) hostname2(ro,sync,no_subtree_check)
#
# Example for NFSv4:
#
Kernel: Linux 2.6.32-5-amd64 (SMP w/16 CPU cores)
Locale: LANG=en_GB.UTF-8, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Versions of packages nfs-kernel-server depends on:
ii libblkid1 2.17.2-9 block device id library
ii libc6 2.11.2-10 Embedded GNU C Library: Shared lib
ii libcomerr2 1.41.12-2 common error description library
ii libgssapi-krb5-2 1.8.3+dfsg-4 MIT Kerberos runtime libraries - k
ii libgssglue1 0.1-4 mechanism-switch gssapi library
ii libk5crypto3 1.8.3+dfsg-4 MIT Kerberos runtime libraries - C
ii libkrb5-3 1.8.3+dfsg-4 MIT Kerberos runtime libraries
ii libnfsidmap2 0.23-2 An nfs idmapping library
ii librpcsecgss3 0.19-2 allows secure rpc communication us
ii libwrap0 7.6.q-19 Wietse Venema's TCP wrappers libra
ii lsb-base 3.2-23.2squeeze1 Linux Standard Base 3.2 init scrip
ii nfs-common 1:1.2.2-4 NFS support files common to client
ii ucf 3.0025+nmu1 Update Configuration File: preserv
nfs-kernel-server recommends no packages.
nfs-kernel-server suggests no packages.
-- no debconf information
--
To UNSUBSCRIBE, email to debian-kernel-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: http://lists.debian.org/20110310123058.28324.69854.reportbug@fileserver2.sysbiol.internal.cam.ac.uk
03-20-2011, 04:20 PM
Luk Claes
Bug#617666: nfs-kernel-server: Periodic nfsd failure - single nfsd process with high CPU and no mounts working
> On 10/03/11 12:54, Debian Bug Tracking System wrote:
>
> I have some extra information about this problem - the syslog contains
> some kernel error messages related to nfs and xfs (the filesystem of the
> /export partition). I have attached the relevant log section...
>
> It could be this is a problem with xfs or even with our hardware raid
> controller. I have rebooted the machine with /export unmounted and am
> currently running xfs_repair over it to see if that picks up any problems.
Hi
I guess your xfs_repair finished by now? Did it shed some more light on
the issue or should we look more closely into the nfs code?
Cheers
Luk
--
Archive: http://lists.debian.org/4D86374F.5070102@debian.org
03-21-2011, 10:33 AM
Dan Tomlinson
Bug#617666: nfs-kernel-server: Periodic nfsd failure - single nfsd process with high CPU and no mounts working
On 20/03/11 17:20, Luk Claes wrote:
>> On 10/03/11 12:54, Debian Bug Tracking System wrote:
>>
>> I have some extra information about this problem - the syslog contains
>> some kernel error messages related to nfs and xfs (the filesystem of the
>> /export partition). I have attached the relevant log section...
>>
>> It could be this is a problem with xfs or even with our hardware raid
>> controller. I have rebooted the machine with /export unmounted and am
>> currently running xfs_repair over it to see if that picks up any problems.
> Hi
> I guess your xfs_repair finished by now? Did it shed some more light on
> the issue or should we look more closely into the nfs code?
> Cheers
> Luk
Hi Luk,
thanks for getting back to me. My xfs_repair did finish and it found a
few errors, but I'm not sure if they are from hard resetting the machine
or some indication of a more serious hardware error. I am however
pretty sure that this is not a purely NFS problem - since the repair
finished, the system has crashed in a couple of different ways. Once it
dumped the kernel to the console and went completely unresponsive and
another time the /export partition unmounted itself and wouldn't remount
(giving IO errors). In both cases there was no weird NFS process
hanging around (the mounts just became inaccessible as you would expect
them to after such crashes).
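When telling storage failures apart from NFS ones, it can help to pull every XFS, NFS and I/O-error line out of the syslog into one list. The following is only a rough sketch: the keyword list is an assumption and will need adjusting to whatever the kernel actually logs, the function name storage_related is made up for illustration, and /var/log/syslog is simply the Debian default location.

```python
import re

# Assumed keywords; extend or narrow these to match the messages on your system.
PATTERN = re.compile(r"\b(xfs|nfsd?|i/o error)\b", re.IGNORECASE)

# Debian's default syslog location; change if your logs live elsewhere.
LOG_PATH = "/var/log/syslog"


def storage_related(lines):
    """Return the lines that mention XFS, NFS/nfsd, or an I/O error."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]


if __name__ == "__main__":
    try:
        with open(LOG_PATH, errors="replace") as handle:
            for line in storage_related(handle):
                print(line)
    except OSError as exc:
        print(f"could not read {LOG_PATH}: {exc}")
```

If the I/O errors consistently show up before any nfsd messages, that would point towards the disk or controller layer rather than NFS, although the ordering alone is not proof.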
At this point I am pretty sure that I have a hardware issue on my hands,
either with bad RAM or my raid controller. I think we can safely say
NFS is in the clear. Sorry for wasting your time!
Dan
--
Archive: http://lists.debian.org/4D8737A3.4010005@flymine.org