Old 01-04-2009, 04:05 PM
"Francesco Pietra"
 
Fwd: Failure to load amd64 overcome, though mem problems

Posted again from the e-mail address I am registered to


---------- Forwarded message ----------
From: Francesco Pietra <francesco.pietra@accademialucchese.it>
Date: Fri, Jan 2, 2009 at 8:26 PM
Subject: Failure to load amd64 overcome, though mem problems
To: amd64 Debian <debian-amd64@lists.debian.org>, debian-users
<debian-user@lists.debian.org>


Hi:
Near the end of last year, during a vacation period, I posted to amd64
about a failure to boot amd64 lenny on a Supermicro H8QC8
motherboard. This board has the nVidia CK804 chipset, which is also the
memory controller, and the AMD 8132. It carries four dual-core Opteron 875
CPUs, two WD Raptor drives under RAID, as well as 8 KVR400D4R3A/2G and
8 KVR400D4R3A/1G modules.
Lenny is set not to load the X system. The computer is powered through
an APC 1500 and an Enermax EGX1000EWL. Cooling is extremely efficient.
The system was shut down correctly when top indicated 24GB total RAM.
After a few days untouched, the OS did not boot; the screen showed a
series of lines starting with RDX RBP R10 R13 FS CS CR2 DR0 DR3,
followed by

Call Trace:
ffff do_oage
fff handle_mm_fault
fff vma_link
fff error_exit
fff clear_user
fff padzero
fff get_arg_page
fff copy_strings
fff search_binary_handler
fff do_execve
fff sys_execve
fff stub_execve

After these lines alternated and the whole Call Trace started anew
several times, everything disappeared from the screen and could not be
recovered with the keyboard.

Knoppix 5.3.1 loaded correctly and detected all 8 logical CPUs, and the
raid1 partitions (mdadm) were OK. However, it detected 20GB total mem
instead of the 24GB expected.

memtest86+-2.11 detected 17GB total mem and was left to run for the
whole 8 cycles (which took seven hours), reporting no mem errors. DMI
mem device info showed:

DIMM 0 to DIMM 7: size 64; speed 400; type DDR

DIMM 8 to DIMM 10: size empty; speed 200; type DDR

DIMM 11: size 2048; speed 200; type DDR

DIMM 12 to DIMM 15: size 64; speed 200; type DDR.

On rebooting, lenny started correctly. Top showed 18079572k total,
also when running a parallelized application that engaged all 8 CPUs.

lshw agreed with memtest as to the DIMMs, except for the one reported with
size = 2048, which lshw reported with size=64.

I was surprised that half of the slots were reported by both memtest
and lshw at speed=200; I tentatively assume this is a feature of the
mainboard, not of the memory modules.
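
For comparing what the firmware reports against what is actually installed, the DMI memory records can be tabulated with a short script. What follows is only a sketch: it assumes text in the layout printed by `dmidecode --type 17` (records headed `Memory Device` with `Locator:`, `Size:` and `Speed:` fields), and the sample records are made up for illustration, not taken from this machine. DMI tables are filled in by the BIOS, so odd values there do not by themselves prove a hardware fault.

```python
import re

# Hypothetical sample in the layout of `dmidecode --type 17`.
# These records are illustrative only, not real output from this board.
SAMPLE = """\
Memory Device
    Locator: DIMM0
    Size: 2048 MB
    Speed: 400 MHz

Memory Device
    Locator: DIMM8
    Size: No Module Installed
    Speed: Unknown

Memory Device
    Locator: DIMM11
    Size: 2048 MB
    Speed: 200 MHz
"""


def parse_memory_devices(text):
    """Split the text into one dict of 'Key: value' fields per Memory Device."""
    devices = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            current = {}
            devices.append(current)
        elif not stripped:
            current = None  # a blank line ends the current record
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return devices


def size_mb(device):
    """Return the module size in MB, or 0 when no numeric size is reported."""
    match = re.match(r"(\d+)\s*(MB|GB)", device.get("Size", ""))
    if not match:
        return 0
    value = int(match.group(1))
    return value * 1024 if match.group(2) == "GB" else value


devices = parse_memory_devices(SAMPLE)
for dev in devices:
    print(dev.get("Locator"), size_mb(dev), dev.get("Speed"))
print("total MB:", sum(size_mb(dev) for dev in devices))
```

On a live system the same functions could be fed the real command output (run as root) in place of `SAMPLE`.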
=============

The memory currently detected is insufficient for my computations, and
the empty DIMMs need attention, I believe. There is no system maintainer
here, and I have to try to restore the system alone, also because I
assembled the computer. My question is where to start at this point. The
memory modules seem to be seated as before, but I did not try to remove
and reseat them.

The four blocks on the mainboard were filled as follows:

DIMMA-2A 1GB
DIMMA-2B 1GB
DIMMA-1A 2GB
DIMMA-1B 2GB

DIMMB-1B 2GB
DIMMB-1A 2GB
DIMMB-2B 1GB
DIMMB-2A 1GB

DIMMC-2A 1GB
DIMMC-2B 1GB
DIMMC-1A 2GB
DIMMC-1B 2GB

DIMMD-1B 2GB
DIMMD-1A 2GB
DIMMD-2B 1GB
DIMMD-2A 1GB
=============================
This mail originally started under the hypothesis that the problem was
some degradation of lenny. I understand now that this mail is largely
off topic on both amd64 and users. I only hope that experienced users
can offer suggestions from their experience.

Thanks and happy 2009!
francesco pietra


--
To UNSUBSCRIBE, email to debian-user-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
 
Old 01-05-2009, 02:19 PM
 
Fwd: Failure to load amd64 overcome, though mem problems

On Sun, Jan 04, 2009 at 06:05:13PM +0100, Francesco Pietra wrote:
> memtest86+-2.11 detected 17GB total mem and was left to run for the
> whole 8 cycles (which took seven hours), reporting no mem errors. DMI
> mem device info showed:
>
> DIMM 0 to DIMM 7: size 64; speed 400; type DDR
>
> DIMM 8 to DIMM 10: size empty; speed 200; type DDR
>
> DIMM 11: size 2048; speed 200; type DDR
>
> DIMM 12 to DIMM 15: size 64; speed 200; type DDR.

So it looks like DIMM 0 to 7 and 12 to 15 are behaving properly. Now
assuming they are numbered in some kind of sensible order, that probably
means the ram on CPU 0, 1 and 3 is working properly, but that the ram on
CPU 2 is not working right. With your layout each CPU has 6GB (2x2GB plus
2x1GB), so if you lost all the ram on one CPU, that would drop you from
24 to 18GB, which seems to match what you are seeing.
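
The arithmetic behind that estimate can be checked with a few lines of Python. This is just a sketch built from the slot layout given in the original post; the bank letters A to D are taken from the board's slot names, and which letter corresponds to which CPU number is an assumption.

```python
# Module sizes in GB for the four slots of each bank, as listed in the
# original post (the same layout is used for banks A, B, C and D).
SLOTS_GB = {"1A": 2, "1B": 2, "2A": 1, "2B": 1}
BANKS = ["A", "B", "C", "D"]


def total_gb(missing_banks=()):
    """Total installed RAM in GB, skipping any bank that is not detected."""
    per_bank = sum(SLOTS_GB.values())
    return sum(per_bank for bank in BANKS if bank not in missing_banks)


per_cpu_gb = sum(SLOTS_GB.values())
full_gb = total_gb()
one_bank_lost_gb = total_gb(missing_banks=("C",))  # "C" is an arbitrary pick

# top reported 18079572k; converted to GiB for comparison.
top_gib = round(18079572 / 1024 ** 2, 2)

print(per_cpu_gb, full_gb, one_bank_lost_gb, top_gib)
```

The figure from top comes out a little under 18 because the kernel reserves part of the physical memory before reporting the total.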

Unfortunately that starts to sound not like a ram problem, but more
likely a failure of the memory controller of that CPU or perhaps of the
voltage regulator for the memory slots on that CPU.

You could try removing all the ram from the 3rd CPU (CPU 2) and see if
the system still reports 18GB. If it does, then that would confirm that
your ram on that CPU is not being detected.

If you then installed that ram in place of the ram on another CPU, you
could find out whether the ram itself still works: if the system still
shows 18GB, then most likely your ram is fine.

Determining whether it is a mainboard or CPU problem gets more annoying.

You would have to swap the 3rd CPU (CPU 2) with another CPU to see if the
ram failure follows the CPU to another socket, or remains with the slots
of the 3rd socket.
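
The isolation steps above can be written down as a small decision helper, which may make it easier to keep track of what each test rules out. This is only a sketch: the function name and its parameters are made up for illustration, the 18GB baseline is the figure reported in this thread, and the reading of a total that differs from the baseline is my own assumption rather than something stated above.

```python
def diagnose(after_removal_gb, after_move_gb=None, fault_follows_cpu=None,
             baseline_gb=18):
    """Turn the results of the three swap tests into plain-text findings.

    after_removal_gb:  total reported with the suspect CPU's slots emptied
    after_move_gb:     total reported with the suspect modules installed in
                       place of another CPU's modules (None if not tested)
    fault_follows_cpu: True if the missing RAM moved with the CPU after a
                       CPU swap, False if it stayed with the socket
                       (None if not tested)
    """
    findings = []

    # Test 1: empty the suspect slots.
    if after_removal_gb == baseline_gb:
        findings.append("RAM on the suspect CPU was not being detected")
    else:
        findings.append("suspect RAM was being counted; "
                        "slot numbering may differ from the assumption")

    # Test 2: move the suspect modules to another CPU's slots.
    if after_move_gb is not None:
        if after_move_gb == baseline_gb:
            findings.append("modules work elsewhere; RAM is most likely fine")
        else:
            findings.append("modules are not fully detected elsewhere; "
                            "suspect the RAM itself")

    # Test 3: swap the suspect CPU into another socket.
    if fault_follows_cpu is not None:
        if fault_follows_cpu:
            findings.append("fault follows the CPU; "
                            "suspect its memory controller")
        else:
            findings.append("fault stays with the socket; "
                            "suspect the mainboard slots or regulator")
    return findings


for line in diagnose(18, after_move_gb=18, fault_follows_cpu=False):
    print(line)
```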

Now perhaps the slots are not numbered sanely, in which case it could get
tricky to figure out what is what. Still, with 6GB missing, it sure looks
a lot like all the ram from one CPU has simply vanished. Does top still
say you have 4 working CPUs?
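
To check both numbers without reading them off top, the kernel's own view can be taken from /proc/meminfo and /proc/cpuinfo. The sketch below assumes the usual x86 Linux layout of those files (`MemTotal:` in kB, one `processor` entry per logical CPU, and a `physical id` per socket). The sample strings are made up for illustration, apart from the MemTotal value, which is the one reported in this thread.

```python
import re

# Illustrative samples only; on a real system read the files instead.
MEMINFO_SAMPLE = "MemTotal:       18079572 kB\nMemFree:         1024000 kB\n"
CPUINFO_SAMPLE = (
    "processor\t: 0\nphysical id\t: 0\n\n"
    "processor\t: 1\nphysical id\t: 0\n\n"
    "processor\t: 2\nphysical id\t: 1\n\n"
    "processor\t: 3\nphysical id\t: 1\n"
)


def mem_total_kib(meminfo_text):
    """Return the MemTotal value in kB, or None if the field is absent."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def cpu_counts(cpuinfo_text):
    """Return (logical CPUs, distinct physical packages)."""
    logical = len(re.findall(r"^processor\s*:", cpuinfo_text, re.MULTILINE))
    sockets = set(re.findall(r"^physical id\s*:\s*(\d+)", cpuinfo_text,
                             re.MULTILINE))
    return logical, len(sockets)


print(mem_total_kib(MEMINFO_SAMPLE))
print(cpu_counts(CPUINFO_SAMPLE))
```

On the machine itself, `mem_total_kib(open("/proc/meminfo").read())` and `cpu_counts(open("/proc/cpuinfo").read())` would give the live values. With four dual-core Opterons all working, the expected counts would be 8 logical CPUs in 4 packages.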

--
Len Sorensen

