08-17-2012, 07:53 AM
Stan Hoeppner

continuous reboots in a two nodes cluster with heartbeat and pacemaker.

On 8/17/2012 1:52 AM, Mauro wrote:
> On 14 August 2012 08:24, Mauro <mrsanna1@gmail.com> wrote:
>> On 13 August 2012 22:58, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>>
>>> That being the case I'd suspect something other than server hardware.
>>> To be sure, manually remove one node from the cluster and see how long
>>> the remaining node runs without rebooting. If it doesn't reboot at all,
>>> that eliminates hardware as the fault point.
>>
>> Good idea, I'll do it now.
>
> I've done what you suggested.
> It seems that the node reboots without reason.
> It is as if it were powered off: in the boot log I see that the
> filesystem journal is recovered.
> It seems very strange to me; perhaps bugged ram?

I'd be thoroughly inspecting the power circuits feeding those servers at
this point. Do you have the machines set to automatically power back on
after power loss? If you do, switch that mode so they stay off after AC
power loss. That should confirm whether the problem is total loss of AC
voltage or a severely deep sag.

If the problem is a less severe sag, however, this test won't isolate
the problem. For that you must dig into the UPS monitoring interface.
If you don't have a UPS, you'll have to put a tap on the AC circuit and
monitor the voltage. This will require specialized equipment, as it
must be able to log the sag. Some of the nicer Fluke meters can log the
lowest voltage, but probably can't tell you the time of day when the sag
occurs. Thus, you'll need a highly trained electrician with the proper
equipment.
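
If a UPS is attached and managed by Network UPS Tools (an assumption; nothing in this thread says which UPS software, if any, is in use), the timestamped sag logging described above could be sketched roughly like this. `upsc` and the `input.voltage` variable are NUT conventions; the UPS name `myups@localhost`, the 207 V threshold (about 10% under a nominal 230 V supply) and all function names are hypothetical.

```python
import subprocess
import time
from datetime import datetime


def read_input_voltage(ups="myups@localhost"):
    """Ask NUT's upsc client for the current input voltage in volts.

    The UPS name is a placeholder; not every UPS reports input.voltage.
    """
    out = subprocess.run(
        ["upsc", ups, "input.voltage"],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip())


def check_sag(voltage, threshold, now=None):
    """Return a timestamped log line if voltage is below threshold, else None."""
    if voltage >= threshold:
        return None
    stamp = (now or datetime.now()).isoformat(timespec="seconds")
    return f"{stamp} SAG input.voltage={voltage:.1f}V (threshold {threshold:.1f}V)"


def monitor(reader=read_input_voltage, threshold=207.0, interval=1.0,
            samples=None, log=print):
    """Poll the reader and log every sample that falls below the threshold.

    samples=None polls forever; a number stops after that many readings.
    """
    taken = 0
    while samples is None or taken < samples:
        line = check_sag(reader(), threshold)
        if line:
            log(line)
        taken += 1
        if samples is None or taken < samples:
            time.sleep(interval)
```

Calling `monitor()` with no arguments polls once per second until interrupted. A UPS typically refreshes its readings only every second or so, so a very brief sag can still slip between samples; this is a rough first check, not a substitute for a proper line logger.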

This could also be a thermal issue. Do you have hardware monitoring
installed and properly configured? The 'sensors' package? Over temp
conditions will often cause random reboots. Do the boxes have plenty of
zero restriction cool airflow? Less than 25 Celsius intake air temperature?
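
Once a hardware monitoring driver is loaded, the kernel exposes the same readings that 'sensors' prints under /sys/class/hwmon. A minimal sketch of checking them from a script follows. The 70 Celsius limit and the function names are hypothetical, and these are component temperatures as reported by the driver, not the intake air temperature mentioned above.

```python
from pathlib import Path


def read_temps(root="/sys/class/hwmon"):
    """Return {sensor_path: degrees_celsius} for every hwmon temp*_input file."""
    temps = {}
    for f in sorted(Path(root).glob("hwmon*/temp*_input")):
        try:
            # The kernel reports millidegrees Celsius as an integer.
            temps[str(f)] = int(f.read_text().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # sensor unreadable or not populated
    return temps


def over_limit(temps, limit_c=70.0):
    """Return only the sensors whose reading exceeds limit_c."""
    return {name: t for name, t in temps.items() if t > limit_c}
```

Running `over_limit(read_temps())` from cron every minute and appending the result to a file would show whether temperatures climb just before a reboot.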

The odds of having defective hardware in two HP servers causing random
reboots in both machines is extremely low, though possible. If this is
the case it's a design flaw, not simply two defective parts.

It's also possible you have the wrong memory installed. Can you provide
the specs on all DIMMs installed in both machines? Did all of the
memory come preinstalled from HP? Is it HP memory or aftermarket memory
from Kingston, Crucial, etc?
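
One way to collect those DIMM specs without opening the chassis is to read the SMBIOS memory records with dmidecode (run as root). The sketch below parses the output of `dmidecode --type 17` and reports how many distinct module kinds are installed. The function names and the choice of fields are my own, and the BIOS may fill these fields in incompletely, so treat the result as a hint and confirm against the module labels.

```python
import subprocess

FIELDS = ("Locator", "Size", "Type", "Type Detail", "Speed",
          "Manufacturer", "Part Number")


def parse_memory_devices(text):
    """Parse `dmidecode --type 17` output into one dict per populated slot."""
    devices = []
    for block in text.split("\n\n"):
        lines = [ln.strip() for ln in block.splitlines()]
        if "Memory Device" not in lines:
            continue
        info = {}
        for ln in lines:
            key, sep, value = ln.partition(":")
            if sep and key in FIELDS:
                info[key] = value.strip()
        # Empty slots are reported as "No Module Installed"; skip them.
        if info.get("Size", "").startswith("No Module"):
            continue
        devices.append(info)
    return devices


def distinct_kinds(devices):
    """Return the set of (Type, Type Detail, Speed) combinations present."""
    return {(d.get("Type"), d.get("Type Detail"), d.get("Speed"))
            for d in devices}


def read_dmidecode():
    """Run dmidecode for SMBIOS type 17 (Memory Device) records; needs root."""
    return subprocess.run(
        ["dmidecode", "--type", "17"],
        capture_output=True, text=True, check=True,
    ).stdout
```

If `distinct_kinds(parse_memory_devices(read_dmidecode()))` contains more than one entry, the machine has a mix of module types or speeds, which is worth comparing against what HP specifies for that model.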

--
Stan


--
To UNSUBSCRIBE, email to debian-user-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: http://lists.debian.org/502DF86A.6030505@hardwarefreak.com
 
08-18-2012, 11:36 AM
Mauro

On 17 August 2012 09:53, Stan Hoeppner <stan@hardwarefreak.com> wrote:


> I'd be thoroughly inspecting the power circuits feeding those servers at
> this point. Do you have the machines set to automatically power back on
> after power loss? If you do, switch that mode so they stay off after AC
> power loss. That should confirm whether the problem is total loss of AC
> voltage or a severely deep sag.

Is that setting in the bios?


> If the problem is a less severe sag, however, this test won't isolate
> the problem. For that you must dig into the UPS monitoring interface.
> If you don't have a UPS, you'll have to put a tap on the AC circuit and
> monitor the voltage. This will require specialized equipment, as it
> must be able to log the sag. Some of the nicer Fluke meters can log the
> lowest voltage, but probably can't tell you the time of day when the sag
> occurs. Thus, you'll need a highly trained electrician with the proper
> equipment.
>
> This could also be a thermal issue. Do you have hardware monitoring
> installed and properly configured? The 'sensors' package? Over temp
> conditions will often cause random reboots. Do the boxes have plenty of
> zero restriction cool airflow? Less than 25 Celsius intake air temperature?

I have other HP servers of the same type, some running Linux and others
running Windows.
They are all in the same room, so if it were a temperature problem I
would expect the other servers to have it too, but that is not
the case.
Only mine reboot.

>
> The odds of having defective hardware in two HP servers causing random
> reboots in both machines is extremely low, though possible. If this is
> the case it's a design flaw, not simply two defective parts.
>
> It's also possible you have the wrong memory installed. Can you provide
> the specs on all DIMMs installed in both machines? Did all of the
> memory come preinstalled from HP? Is it HP memory or aftermarket memory
> from Kingston, Crucial, etc?

I've upgraded ram from 32 to 64G.
I've reinstalled all simms.
The bios reports no ram problems.
The other servers were also upgraded to 64G.
Reboots are sometimes on node1 and sometimes on node2.
Bugged ram is on both servers? Strange to me.
The other servers don't reboot.


Archive: http://lists.debian.org/CAE17a0VAXxyaRZtoGT6i4XkaE-N9CY7D07NkZVpkwDkQ50NHwQ@mail.gmail.com
 
08-19-2012, 12:32 AM
Stan Hoeppner

On 8/18/2012 6:36 AM, Mauro wrote:

> I've upgraded ram from 32 to 64G.

Did the reboots occur before doing this?

> I've reinstalled all simms.

DIMMs. SIMMs haven't been used for over a decade. But the fact you
mentioned SIMMs tells me you've been at this game a while.

> The bios reports no ram problems.

It may not.

> The other servers were also upgraded to 64G.

Are all the DIMMs fully buffered ECC DDR2-667? If you added unbuffered
ECC DDR2-667 DIMMs to go from 32GB to 64GB, the BIOS may not throw any
errors, but you'd see things like random reboots, random lockups, kernel
errors, processes crashing, etc.

Do these two machines have all four processors and all four memory
cartridges installed, or only two of each?

> Bugged ram is on both servers? Strange to me.

It's probably not bad RAM, but it seems possible that you may have used
unregistered DIMMs when you upgraded to 64GB. Again, the BIOS may not
balk at this, and the combo might seem to work until the memory
subsystem is sufficiently exercised.
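
Since the BIOS may stay silent, the kernel's EDAC counters are another place to look for ECC trouble while the memory is being exercised. The sketch below reads them from sysfs. It assumes an EDAC driver for the chipset is loaded (otherwise the directory is empty or missing), and the function names are hypothetical.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return corrected/uncorrected ECC error counts per memory controller."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # controller present but counters unreadable
        counts[mc.name] = {"corrected": ce, "uncorrected": ue}
    return counts


def has_errors(counts):
    """True if any controller has logged at least one ECC error."""
    return any(c["corrected"] or c["uncorrected"] for c in counts.values())
```

The counters start from zero at every boot, so a clean reading right after a reboot proves nothing; a corrected-error count that keeps growing under load is the thing to watch for.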

--
Stan


Archive: http://lists.debian.org/5030342C.2090105@hardwarefreak.com
 
08-19-2012, 10:25 AM
Mauro

On 19 August 2012 02:32, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> On 8/18/2012 6:36 AM, Mauro wrote:
>
>> I've upgraded ram from 32 to 64G.
>
> Did the reboots occur before doing this?
>
>> I've reinstalled all simms.
>
> DIMMs. SIMMs haven't been used for over a decade. But the fact you
> mentioned SIMMs tells me you've been at this game a while.
>
>> The bios reports no ram problems.
>
> It may not.
>
>> The other servers were also upgraded to 64G.
>
> Are all the DIMMs fully buffered ECC DDR2-667? If you added unbuffered
> ECC DDR2-667 DIMMs to go from 32GB to 64GB, the BIOS may not throw any
> errors, but you'd see things like random reboots, random lockups, kernel
> errors, processes crashing, etc.
>
> Do these two machines have all four processors and all four memory
> cartridges installed, or only two of each?
>
>> Bugged ram is on both servers? Strange to me.
>
> It's probably not bad RAM, but it seems possible that you may have used
> unregistered DIMMs when you upgraded to 64GB. Again, the BIOS may not
> balk at this, and the combo might seem to work until the memory
> subsystem is sufficiently exercised.


Thank you for your replies.
I'll check the DIMMs and tell you.


Archive: http://lists.debian.org/CAE17a0Xg6+w_GkGeV0pj+NmgQTEEs1+QEcenWtDYPtA0tWMAC A@mail.gmail.com
 
