continuous reboots in a two nodes cluster with heartbeat and pacemaker.
Hello, I'm experiencing repeated reboots of both nodes in a two-node
heartbeat+pacemaker cluster.
The reboots are random: some days they happen and other days they don't.
Sometimes a week passes without one, and sometimes they happen at night.
The nodes are connected to a Cisco 3570 switch and a SAN storage system.
Perhaps there is a misconfiguration in the interfaces?
Here is my interfaces file:
--
To UNSUBSCRIBE, email to debian-user-REQUEST@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: http://lists.debian.org/CAE17a0VzO_Vz1gL1aBJmgQORWrEGoN=nHnx0Hc5NzciNoUgmu w@mail.gmail.com
08-11-2012, 05:23 PM
Stan Hoeppner
continuous reboots in a two nodes cluster with heartbeat and pacemaker.
On 8/11/2012 8:59 AM, Mauro wrote:
> Perhaps there is a misconfiguration in the interfaces?
> Here is my interfaces file:
....
> Do you think there are some errors?
To determine that, you need to look at your log files, not your config
files. If the nodes are rebooting due to fencing, it will be logged
somewhere, as should the underlying network errors that trigger the
fence.
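As a starting point for that log search, here is a minimal Python sketch that prints every log line mentioning fencing. The keywords in FENCE_PATTERN are an assumption, not the confirmed wording of heartbeat or pacemaker messages, so adjust them to what your logs actually contain. The log path in the usage comment is only an example.

```python
import re
import sys

# Assumed keywords: adjust to whatever your heartbeat/pacemaker logs print.
FENCE_PATTERN = re.compile(r"stonith|fenc", re.IGNORECASE)


def find_fencing_lines(lines, pattern=FENCE_PATTERN):
    """Return (line_number, text) pairs for lines that mention fencing."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]


if __name__ == "__main__":
    # Example usage: python find_fencing.py /var/log/syslog
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for number, text in find_fencing_lines(handle):
                print(f"{path}:{number}: {text}")
```

Run it against the logs of both nodes and compare the timestamps of the hits with the times of the reboots.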
--
Stan
08-12-2012, 09:44 AM
Mauro
continuous reboots in a two nodes cluster with heartbeat and pacemaker.
On 11 August 2012 19:23, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> To determine that, you need to look at your log files, not your config
> files. If the nodes are rebooting due to fencing, it will be logged
> somewhere, as should the underlying network errors that trigger the
> fence.
Yes, I looked at my logs, but the only thing I see is that node 1 fences
node 2, or node 2 fences node 1, because one node can't see the other.
I don't understand what the problem is, or whether it lies in my NIC or
somewhere else.
08-12-2012, 06:39 PM
Stan Hoeppner
continuous reboots in a two nodes cluster with heartbeat and pacemaker.
On 8/12/2012 4:44 AM, Mauro wrote:
> Yes, I looked at my logs, but the only thing I see is that node 1 fences
> node 2, or node 2 fences node 1, because one node can't see the other.
> I don't understand what the problem is, or whether it lies in my NIC or
> somewhere else.
Is there more than one set of these in any dmesg files on either host:
Jul 26 00:38:26 [host] kernel: e100 0000:00:0d.0: eth0: NIC Link is Down
Jul 26 00:38:28 [host] kernel: e100 0000:00:0d.0: eth0: NIC Link is Up 100 Mbps Full Duplex
If so, it may indicate a flaky NIC or switch port, or possibly a bad patch
cable. Is there a switch between the hosts, or a crossover cable?
But look at the time interval between the down and up states. If it's
always less than the cluster action threshold, then this shouldn't be an
issue. If it's greater than the threshold, it is likely the cause of the
software fence activating.
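The interval check can be sketched in Python as below. This is only an illustration: it pairs each "Link is Down" line with the next "Link is Up" line and compares the gap with a threshold. The 30-second threshold is an assumed value, and heartbeat's `deadtime` setting in ha.cf is the likely place to look for the real one. Syslog lines carry no year, so the `year` parameter is also an assumption.

```python
import re
from datetime import datetime, timedelta

# Matches the syslog timestamp and the link state in lines like the ones above.
LINK_EVENT = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s.*NIC Link is (Down|Up)")


def link_outages(lines, year=2012):
    """Pair each 'Link is Down' with the next 'Link is Up'.

    Returns a list of (down_time, duration) tuples.
    """
    outages = []
    down_at = None
    for line in lines:
        match = LINK_EVENT.search(line)
        if not match:
            continue
        stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        if match.group(2) == "Down":
            down_at = stamp
        elif down_at is not None:
            outages.append((down_at, stamp - down_at))
            down_at = None
    return outages


sample = [
    "Jul 26 00:38:26 [host] kernel: e100 0000:00:0d.0: eth0: NIC Link is Down",
    "Jul 26 00:38:28 [host] kernel: e100 0000:00:0d.0: eth0: NIC Link is Up 100 Mbps Full Duplex",
]
threshold = timedelta(seconds=30)  # assumed value, check your cluster configuration

for down_at, duration in link_outages(sample):
    verdict = "exceeds" if duration > threshold else "is within"
    print(f"{down_at}: link down for {duration.total_seconds():.0f}s, which {verdict} the threshold")
```

For the two sample lines the outage lasts 2 seconds, well under the assumed threshold, so a flap like that alone should not trigger a fence.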
There are other possible causes. This is simply the first that comes to
mind.
--
Stan
08-12-2012, 09:27 PM
Mauro
continuous reboots in a two nodes cluster with heartbeat and pacemaker.
On 12 August 2012 20:39, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> Is there more than one set of these in any dmesg files on either host:
>
> Jul 26 00:38:26 [host] kernel: e100 0000:00:0d.0: eth0: NIC Link is Down
> Jul 26 00:38:28 [host] kernel: e100 0000:00:0d.0: eth0: NIC Link is Up 100 Mbps Full Duplex
No, there is no link-down entry in any log file :-(
I really don't understand why the reboots happen :-(
> If so, it may indicate a flaky NIC or switch port, or possibly a bad patch
> cable. Is there a switch between the hosts, or a crossover cable?
There is a Cisco 3570 switch.
08-13-2012, 03:43 AM
Stan Hoeppner
continuous reboots in a two nodes cluster with heartbeat and pacemaker.
On 8/12/2012 4:27 PM, Mauro wrote:
> No, there is no link-down entry in any log file :-(
> I really don't understand why the reboots happen :-(
>
>> If so, it may indicate a flaky NIC or switch port, or possibly a bad patch
>> cable. Is there a switch between the hosts, or a crossover cable?
>
> There is a Cisco 3570 switch.
Are these controlled shutdowns? Or are these hardware crashes/reboots
that are occurring?
If the former, you should see syslog entries for the shutdown sequence.
If the latter, you won't see anything in the logs. This would suggest
you've got a hardware problem, and not one related to faulty NICs or switches.
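A rough Python sketch of that check follows. It walks a syslog-style file and labels each boot as "controlled" if a shutdown message appears shortly before it, or "unclean" if not. The marker strings are assumptions: the kernel prints "Linux version" early in each boot, but the shutdown wording depends on your init system and syslog daemon, so verify both against a reboot you triggered yourself. The `lookback` value is likewise arbitrary.

```python
BOOT_MARKER = "Linux version"  # kernel banner logged early in each boot
# Assumed shutdown wording: confirm against a known controlled reboot.
SHUTDOWN_MARKERS = ("shutting down", "exiting on signal 15")


def classify_boots(lines, lookback=50):
    """Label each boot by whether a shutdown marker precedes it.

    Returns (line_number, "controlled" | "unclean") for every boot marker.
    """
    results = []
    last_shutdown = None  # index of the latest shutdown marker since the previous boot
    for index, line in enumerate(lines):
        if any(marker in line for marker in SHUTDOWN_MARKERS):
            last_shutdown = index
        elif BOOT_MARKER in line:
            controlled = last_shutdown is not None and index - last_shutdown <= lookback
            results.append((index + 1, "controlled" if controlled else "unclean"))
            last_shutdown = None
    return results
```

Calling `classify_boots(open("/var/log/syslog").readlines())` (the path is only an example) gives one entry per boot. A boot labeled "unclean" at the time of an unexplained reboot points to a crash, a power loss, or a fence that power-cycled the node, not to an orderly shutdown.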
What kind of UPS are these machines powered from? Have you checked the
UPS units and verified they are functioning properly? If you have a power
event and the UPS drops the load, the machines will reboot without a hint
in the logs as to what caused the reboot.
Finally, what servers are these? Dell/HP/IBM or whitebox? Memory
mismatch or simply bad memory can cause inexplicable reboots. If the
machines are decent quality, their BIOS should log such events.
--
Stan
08-13-2012, 08:37 AM
Mauro
continuous reboots in a two nodes cluster with heartbeat and pacemaker.
> Finally, what servers are these? Dell/HP/IBM or whitebox? Memory
> mismatch or simply bad memory can cause inexplicable reboots. If the
> machines are decent quality, their BIOS should log such events.
Servers are HP ProLiant DL580 G5.
I'm afraid that I have hardware problems :-(
The strange thing is that it happens alternately on both nodes.
08-13-2012, 08:58 PM
Stan Hoeppner
continuous reboots in a two nodes cluster with heartbeat and pacemaker.
On 8/13/2012 3:37 AM, Mauro wrote:
> Servers are HP ProLiant DL580 G5.
> I'm afraid that I have hardware problems :-(
I don't think you have enough solid information yet to make that
assumption, unless you've discovered something you didn't share with us.
> The strange thing is that it happens alternately on both nodes.
That being the case I'd suspect something other than server hardware.
To be sure, manually remove one node from the cluster and see how long
the remaining node runs without rebooting. If it doesn't reboot at all,
that eliminates hardware as the fault point.
--
Stan
08-14-2012, 06:24 AM
Mauro
continuous reboots in a two nodes cluster with heartbeat and pacemaker.
On 13 August 2012 22:58, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> That being the case I'd suspect something other than server hardware.
> To be sure, manually remove one node from the cluster and see how long
> the remaining node runs without rebooting. If it doesn't reboot at all,
> that eliminates hardware as the fault point.
Good idea, I'll do it now.
08-17-2012, 06:52 AM
Mauro
continuous reboots in a two nodes cluster with heartbeat and pacemaker.
On 14 August 2012 08:24, Mauro <mrsanna1@gmail.com> wrote:
> On 13 August 2012 22:58, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>
>> That being the case I'd suspect something other than server hardware.
>> To be sure, manually remove one node from the cluster and see how long
>> the remaining node runs without rebooting. If it doesn't reboot at all,
>> that eliminates hardware as the fault point.
>
> Good idea, I'll do it now.
I've done what you suggested.
It seems that the node reboots for no reason.
It is as if it were powered off: in the boot log I see that the
filesystem journal is recovered.
It seems very strange to me. Perhaps the RAM is faulty?