continuous reboots in a two nodes cluster with heartbeat and pacemaker.
Hello, I'm experiencing continuous reboots of my two nodes in a heartbeat+pacemaker cluster.
Reboots are random: one day they happen, another day not; sometimes for 7 days they don't happen, sometimes they happen at night.
They happen on random days and at random times.
Nodes are connected to a Cisco 3570 switch and a SAN storage system.
Perhaps there is a misconfiguration in the interfaces?
Here is my interfaces file:

#
# XEN VLAN CONFIGURATION
#

# BACKEND MANAGEMENT VIRTUAL INFRASTRUCTURE - VLAN ID 118 - PH. IFACE eth0
auto eth0.118
iface eth0.118 inet static
    address 192.168.244.10
    netmask 255.255.255.0
    broadcast 192.168.244.255
    gateway 192.168.244.1
    vlan_raw_device eth0

# DMZ INTERNET VISIT-CAGLIARI - VLAN ID 109 - PH. IFACE eth1
auto eth1.109
auto xenbr.109
iface xenbr.109 inet manual
    bridge_ports eth1.109
    bridge_maxwait 0

# DMZ INTERNET - VLAN ID 111 - PH. IFACE eth1
auto eth1.111
auto xenbr.111
iface xenbr.111 inet manual
    bridge_ports eth1.111
    bridge_maxwait 0

# DMZ INTRANET - VLAN ID 112 - PH. IFACE eth1
auto eth1.112
auto xenbr.112
iface xenbr.112 inet manual
    bridge_ports eth1.112
    bridge_maxwait 0

# BACKEND APPLICATION INTERNET - VLAN ID 113 - PH. IFACE eth2
auto eth2.113
auto xenbr.113
iface xenbr.113 inet manual
    bridge_ports eth2.113
    bridge_maxwait 0

# BACKEND APPLICATION INTRANET - VLAN ID 114 - PH. IFACE eth2
auto eth2.114
auto xenbr.114
iface xenbr.114 inet manual
    bridge_ports eth2.114
    bridge_maxwait 0

# BACKEND DATABASE INTERNET - VLAN ID 115 - PH. IFACE eth2
uto eth2.115
auto xenbr.115
iface xenbr.115 inet manual
    bridge_ports eth2.115
    bridge_maxwait 0

# BACKEND DATABASE INTRANET - VLAN ID 116 - PH. IFACE eth2
auto eth2.116
auto xenbr.116
iface xenbr.116 inet manual
    bridge_ports eth2.116
    bridge_maxwait 0

# BACKEND AUTHENTICATION/AUTHORIZATION - VLAN ID 117 - PH. IFACE eth2
auto eth2.117
auto xenbr.117
iface xenbr.117 inet manual
    bridge_ports eth2.117
    bridge_maxwait 0

# BACKEND BACKUP - VLAN ID 119 - PH. IFACE eth3
auto eth3.119
auto xenbr.119
iface xenbr.119 inet manual
    bridge_ports eth3.119
    bridge_maxwait 0
    bridge_fd 0

# LOCAL XEN POOL NETWORKS
#

# LIVE MIGRATION - VLAN ID 2001 - PH. IFACE eth0
auto eth0.2001
auto eth3.2001
iface eth0.2001 inet manual
    vlan_raw_device eth0
iface eth3.2001 inet manual
    vlan_raw_device eth3
auto bond.2001
iface bond.2001 inet static
    address 10.1.0.1
    netmask 255.255.255.0
    bond-mode active-backup
    slaves eth0.2001 eth3.2001
    bond-miimon 100

# CLUSTER DOM0 - VLAN ID 2002 - PH. IFACE eth0/3
auto eth0.2002
iface eth0.2002 inet manual
    vlan_raw_device eth0
auto eth3.2002
iface eth3.2002 inet manual
    vlan_raw_device eth3
auto bond.2002
iface bond.2002 inet static
    address 10.2.0.1
    netmask 255.255.255.0
    bond-mode active-backup
    slaves eth0.2002 eth3.2002
    bond-miimon 100

# CLUSTER WEB-INTERNET - VLAN ID 2003 - PH. IFACE eth0/3
auto eth0.2003
auto eth3.2003
iface eth0.2003 inet manual
    vlan_raw_device eth0
iface eth3.2003 inet manual
    vlan_raw_device eth3
auto bond.2003
iface bond.2003 inet manual
    bond-mode active-backup
    slaves eth0.2003 eth3.2003
    bond-miimon 100
auto xenbr.2003
iface xenbr.2003 inet manual
    bridge_ports bond.2003
    bridge_maxwait 0
    bridge_fd 0

# CLUSTER WEB-INTRANET - VLAN ID 2004 - PH. IFACE eth0/3
auto eth0.2004
auto eth3.2004
iface eth0.2004 inet manual
    vlan_raw_device eth0
iface eth3.2004 inet manual
    vlan_raw_device eth3
auto bond.2004
iface bond.2004 inet manual
    bond-mode active-backup
    slaves eth0.2004 eth3.2004
    bond-miimon 100
auto xenbr.2004
iface xenbr.2004 inet manual
    bridge_ports bond.2004
    bridge_maxwait 0
    bridge_fd 0

# CLUSTER STREAMING - VLAN ID 2005 - PH. IFACE eth0/3
auto eth0.2005
auto eth3.2005
iface eth0.2005 inet manual
    vlan_raw_device eth0
iface eth3.2005 inet manual
    vlan_raw_device eth3
auto bond.2005
iface bond.2005 inet manual
    bond-mode active-backup
    slaves eth0.2005 eth3.2005
    bond-miimon 100
auto xenbr.2005
iface xenbr.2005 inet manual
    bridge_ports bond.2005
    bridge_maxwait 0
    bridge_fd 0

# CLUSTER MAIL - VLAN ID 2006 - PH. IFACE eth0/3
auto eth0.2006
auto eth3.2006
iface eth0.2006 inet manual
    vlan_raw_device eth0
iface eth3.2006 inet manual
    vlan_raw_device eth3
auto bond.2006
iface bond.2006 inet manual
    bond-mode active-backup
    slaves eth0.2006 eth3.2006
    bond-miimon 100
auto xenbr.2006
iface xenbr.2006 inet manual
    bridge_ports bond.2006
    bridge_maxwait 0
    bridge_fd 0

# CLUSTER CONTENT FILTER - VLAN ID 2007 - PH. IFACE eth0/3
auto eth0.2007
auto eth3.2007
iface eth0.2007 inet manual
    vlan_raw_device eth0
iface eth3.2007 inet manual
    vlan_raw_device eth3
auto bond.2007
iface bond.2007 inet manual
    bond-mode active-backup
    slaves eth0.2007 eth3.2007
    bond-miimon 100
auto xenbr.2007
iface xenbr.2007 inet manual
    bridge_ports bond.2007
    bridge_maxwait 0
    bridge_fd 0

# CLUSTER PAGHE - VLAND ID 2008 - PH. IFACE eth0/3
auto eth0.2008
auto eth3.2008
iface eth0.2008 inet manual
    vlan_raw_device eth0
iface eth3.2008 inet manual
    vlan_raw_device eth3
uto bond.2008
iface bond.2008 inet manual
    bond-mode active-backup
    slaves eth0.2008 eth3.2008
    bond-miimon 100
auto xenbr.2008
iface xenbr.2008 inet manual
    bridge_ports bond.2008
    bridge_maxwait 0
    bridge_fd 0

# CLUSTER DB PAGHE - VLAN ID 2009 - PH. IFACE eth0/3
auto eth0.2009
auto eth3.2009
iface eth0.2009 inet manual
    vlan_raw_device eth0
iface eth3.2009 inet manual
    vlan_raw_device eth3
auto bond.2009
iface bond.2009 inet manual
    bond-mode active-backup
    slaves eth0.2009 eth3.2009
    bond-miimon 100
auto xenbr.2009
iface xenbr.2009 inet manual
    bridge_ports bond.2009
    bridge_maxwait 0
    bridge_fd 0

# CLUSTER PROXY-LDAP - VLAN ID 2010 - PH. IFACE eth0/3
auto eth0.2010
auto eth3.2010
iface eth0.2010 inet manual
    vlan_raw_device eth0
iface eth3.2010 inet manual
    vlan_raw_device eth3
auto bond.2010
iface bond.2010 inet manual
    bond-mode active-backup
    slaves eth0.2010 eth3.2010
    bond-miimon 100
auto xenbr.2010
iface xenbr.2010 inet manual
    bridge_ports bond.2010
    bridge_maxwait 0
    bridge_fd 0

# CLUSTER LDAP - VLAN ID 2011 - PH. IFACE eth0/3
auto eth0.2011
auto eth3.2011
iface eth0.2011 inet manual
    vlan_raw_device eth0
iface eth3.2011 inet manual
    vlan_raw_device eth3
auto bond.2011
iface bond.2011 inet manual
    bond-mode active-backup
    slaves eth0.2011 eth3.2011
    bond-miimon 100
auto xenbr.2011
iface xenbr.2011 inet manual
    bridge_ports bond.2011
    bridge_maxwait 0
    bridge_fd 0

ha.cf is:

autojoin none
ucast bond.2001 10.1.0.2
warntime 15
deadtime 60
initdead 120
keepalive 5
node xen-p01
node xen-p02
crm respawn

Do you think there are some errors?
Thank you.

--
To UNSUBSCRIBE, email to debian-user-REQUEST@lists.debian.org with a subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
Archive: http://lists.debian.org/CAE17a0VzO_Vz1gL1aBJmgQORWrEGoN=nHnx0Hc5NzciNoUgmu w@mail.gmail.com
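For readers following along, the timer values in the ha.cf above can be sanity-checked with a short script. This is only a sketch: the helper names `parse_ha_cf` and `check_timings` are made up for illustration, the values are assumed to be whole seconds, and the ordering rules (keepalive < warntime < deadtime, initdead at least twice deadtime) are the usual Heartbeat guidance rather than anything stated in this thread.

```python
def parse_ha_cf(text):
    """Collect ha.cf directives into a dict of lists (node, ucast, etc. may repeat)."""
    directives = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        parts = line.split(None, 1)
        value = parts[1].strip() if len(parts) > 1 else ""
        directives.setdefault(parts[0], []).append(value)
    return directives


def check_timings(directives):
    """Return warnings about timer ordering (assumption: values are whole seconds)."""
    names = ("keepalive", "warntime", "deadtime", "initdead")
    t = {name: int(directives[name][0]) for name in names}
    warnings = []
    if not t["keepalive"] < t["warntime"] < t["deadtime"]:
        warnings.append("expected keepalive < warntime < deadtime")
    if t["initdead"] < 2 * t["deadtime"]:
        warnings.append("initdead is less than twice deadtime")
    return warnings


# The ha.cf exactly as posted above.
HA_CF = """\
autojoin none
ucast bond.2001 10.1.0.2
warntime 15
deadtime 60
initdead 120
keepalive 5
node xen-p01
node xen-p02
crm respawn
"""

directives = parse_ha_cf(HA_CF)
print(check_timings(directives))  # an empty list means the ordering rules hold
```

With the posted values the check raises no warnings, so the timers themselves look consistent with each other; that says nothing about whether the heartbeat link is reliable.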
continuous reboots in a two nodes cluster with heartbeat and pacemaker.
On 8/11/2012 8:59 AM, Mauro wrote:
> Hello, I'm experiencing continuous reboots of my two nodes in a
> heartbeat+pacemaker cluster.
> Reboots are random: one day they happen, another day not; sometimes
> for 7 days they don't happen, sometimes they happen at night.
> They happen on random days and at random times.
> Nodes are connected to a Cisco 3570 switch and a SAN storage system.
> Perhaps there is a misconfiguration in the interfaces?
> Here is my interfaces file:
....
> Do you think there are some errors?

To determine that you need to look at your log files, not your config files. If the nodes are rebooting due to fencing it will be logged somewhere, as should the underlying network errors that cause the fence to close.

--
Stan
continuous reboots in a two nodes cluster with heartbeat and pacemaker.
On 11 August 2012 19:23, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> On 8/11/2012 8:59 AM, Mauro wrote:
>> Hello, I'm experiencing continuous reboots of my two nodes in a
>> heartbeat+pacemaker cluster.
>> Reboots are random: one day they happen, another day not; sometimes
>> for 7 days they don't happen, sometimes they happen at night.
>> They happen on random days and at random times.
>> Nodes are connected to a Cisco 3570 switch and a SAN storage system.
>> Perhaps there is a misconfiguration in the interfaces?
>> Here is my interfaces file:
> ....
>
>> Do you think there are some errors?
>
> To determine that you need to look at your log files, not your config
> files. If the nodes are rebooting due to fencing it will be logged
> somewhere, as should the underlying network errors that cause the fence
> to close.

Yes, I looked at my logs, but the only thing I see is that node 1 fences node 2 or node 2 fences node 1 because one node doesn't see the other node. I don't understand what the problem is, whether it is a problem with my NIC or something else.
continuous reboots in a two nodes cluster with heartbeat and pacemaker.
On 8/12/2012 4:44 AM, Mauro wrote:
> On 11 August 2012 19:23, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>> On 8/11/2012 8:59 AM, Mauro wrote:
>>> Hello, I'm experiencing continuous reboots of my two nodes in a
>>> heartbeat+pacemaker cluster.
>>> Reboots are random: one day they happen, another day not; sometimes
>>> for 7 days they don't happen, sometimes they happen at night.
>>> They happen on random days and at random times.
>>> Nodes are connected to a Cisco 3570 switch and a SAN storage system.
>>> Perhaps there is a misconfiguration in the interfaces?
>>> Here is my interfaces file:
>> ....
>>
>>> Do you think there are some errors?
>>
>> To determine that you need to look at your log files, not your config
>> files. If the nodes are rebooting due to fencing it will be logged
>> somewhere, as should the underlying network errors that cause the fence
>> to close.
>
> Yes, I looked at my logs, but the only thing I see is that node 1 fences
> node 2 or node 2 fences node 1 because one node doesn't see the other node.
> I don't understand what the problem is, whether it is a problem with my
> NIC or something else.

Is there more than one set of these in any dmesg files on either host:

Jul 26 00:38:26 [host] kernel: e100 0000:00:0d.0: eth0: NIC Link is Down
Jul 26 00:38:28 [host] kernel: e100 0000:00:0d.0: eth0: NIC Link is Up 100 Mbps Full Duplex

If so it may indicate a flaky NIC or switch port, possibly a bad patch cable. Is there a switch between the hosts or a crossover cable?

But look at the time interval between the down/up states. If it's always less than the cluster action threshold then this shouldn't be an issue. If it's greater than the threshold it is likely the cause of the software fence activating.

There are other possible causes. This is simply the first that comes to mind.

--
Stan
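The down/up interval check described above can be automated. The following is a minimal sketch, assuming kernel messages in the same syslog format as the two sample lines, and assuming that the "cluster action threshold" corresponds to the `deadtime 60` value from the posted ha.cf. The function name `link_outages`, the regular expression, and the `year` parameter (syslog timestamps carry no year) are illustrative choices, not part of any tool mentioned in the thread.

```python
import re
from datetime import datetime

# Matches lines such as:
#   Jul 26 00:38:26 [host] kernel: e100 0000:00:0d.0: eth0: NIC Link is Down
LINK_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+kernel:.*?"
    r"(?P<iface>\w+): NIC Link is (?P<state>Down|Up)"
)


def link_outages(lines, year=2012):
    """Yield (iface, down_time, seconds_down) for each Down/Up pair per interface."""
    down_since = {}
    for line in lines:
        m = LINK_RE.search(line)
        if not m:
            continue
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        iface = m["iface"]
        if m["state"] == "Down":
            down_since[iface] = ts
        elif iface in down_since:
            start = down_since.pop(iface)
            yield iface, start, (ts - start).total_seconds()


# The two sample lines quoted in the message above.
SAMPLE = [
    "Jul 26 00:38:26 [host] kernel: e100 0000:00:0d.0: eth0: NIC Link is Down",
    "Jul 26 00:38:28 [host] kernel: e100 0000:00:0d.0: eth0: NIC Link is Up 100 Mbps Full Duplex",
]

DEADTIME = 60  # seconds, taken from the posted ha.cf (assumed to be the relevant threshold)

for iface, start, secs in link_outages(SAMPLE):
    verdict = "exceeds" if secs >= DEADTIME else "is below"
    print(f"{iface}: link down for {secs:.0f}s, which {verdict} deadtime ({DEADTIME}s)")
```

On the sample pair this reports a 2-second outage, well under the 60-second deadtime. To run it against a real log, pass an open file object (for example `open("/var/log/kern.log")`, path assumed) instead of `SAMPLE`.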
continuous reboots in a two nodes cluster with heartbeat and pacemaker.
On 12 August 2012 20:39, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> On 8/12/2012 4:44 AM, Mauro wrote:
>> On 11 August 2012 19:23, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>>> On 8/11/2012 8:59 AM, Mauro wrote:
>>>> Hello, I'm experiencing continuous reboots of my two nodes in a
>>>> heartbeat+pacemaker cluster.
>>>> Reboots are random: one day they happen, another day not; sometimes
>>>> for 7 days they don't happen, sometimes they happen at night.
>>>> They happen on random days and at random times.
>>>> Nodes are connected to a Cisco 3570 switch and a SAN storage system.
>>>> Perhaps there is a misconfiguration in the interfaces?
>>>> Here is my interfaces file:
>>> ....
>>>
>>>> Do you think there are some errors?
>>>
>>> To determine that you need to look at your log files, not your config
>>> files. If the nodes are rebooting due to fencing it will be logged
>>> somewhere, as should the underlying network errors that cause the fence
>>> to close.
>>
>> Yes, I looked at my logs, but the only thing I see is that node 1 fences
>> node 2 or node 2 fences node 1 because one node doesn't see the other node.
>> I don't understand what the problem is, whether it is a problem with my
>> NIC or something else.
>
> Is there more than one set of these in any dmesg files on either host:
>
> Jul 26 00:38:26 [host] kernel: e100 0000:00:0d.0: eth0: NIC Link is Down
> Jul 26 00:38:28 [host] kernel: e100 0000:00:0d.0: eth0: NIC Link is Up
> 100 Mbps Full Duplex

No, there is no link down in any log file :-(
I really don't understand why the reboots happen :-(

> If so it may indicate a flaky NIC or switch port, possibly a bad patch
> cable. Is there a switch between the hosts or a crossover cable?

There is a Cisco 3570 switch.
continuous reboots in a two nodes cluster with heartbeat and pacemaker.
On 8/12/2012 4:27 PM, Mauro wrote:
> On 12 August 2012 20:39, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>> On 8/12/2012 4:44 AM, Mauro wrote:
>>> On 11 August 2012 19:23, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>>>> On 8/11/2012 8:59 AM, Mauro wrote:
>>>>> Hello, I'm experiencing continuous reboots of my two nodes in a
>>>>> heartbeat+pacemaker cluster.
>>>>> Reboots are random: one day they happen, another day not; sometimes
>>>>> for 7 days they don't happen, sometimes they happen at night.
>>>>> They happen on random days and at random times.
>>>>> Nodes are connected to a Cisco 3570 switch and a SAN storage system.
>>>>> Perhaps there is a misconfiguration in the interfaces?
>>>>> Here is my interfaces file:
>>>> ....
>>>>
>>>>> Do you think there are some errors?
>>>>
>>>> To determine that you need to look at your log files, not your config
>>>> files. If the nodes are rebooting due to fencing it will be logged
>>>> somewhere, as should the underlying network errors that cause the fence
>>>> to close.
>>>
>>> Yes, I looked at my logs, but the only thing I see is that node 1 fences
>>> node 2 or node 2 fences node 1 because one node doesn't see the other node.
>>> I don't understand what the problem is, whether it is a problem with my
>>> NIC or something else.
>>
>> Is there more than one set of these in any dmesg files on either host:
>>
>> Jul 26 00:38:26 [host] kernel: e100 0000:00:0d.0: eth0: NIC Link is Down
>> Jul 26 00:38:28 [host] kernel: e100 0000:00:0d.0: eth0: NIC Link is Up
>> 100 Mbps Full Duplex
>
> No, there is no link down in any log file :-(
> I really don't understand why the reboots happen :-(
>
>> If so it may indicate a flaky NIC or switch port, possibly a bad patch
>> cable. Is there a switch between the hosts or a crossover cable?
>
> There is a Cisco 3570 switch.

Are these controlled shutdowns? Or are these hardware crash/reboots that are occurring?

If the former you should see syslog entries for the shutdown sequence. If the latter, you won't see anything in the logs. This would suggest you've got a hardware problem, and not related to faulty NICs or switches.

What kind of UPS are these machines powered from? Have you checked the UPS and verified they are functioning properly? If you have a power event and the UPS drop the load, the machines will reboot without a hint in the logs as to what caused the reboot.

Finally, what servers are these? Dell/HP/IBM or whitebox? Memory mismatch or simply bad memory can cause inexplicable reboots. If the machines are decent quality, the BIOS should log such events.

--
Stan
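The controlled-shutdown versus crash distinction above can also be checked from the login records, since a controlled shutdown normally writes a `shutdown` record that `last -x` displays alongside the `reboot` records. The sketch below assumes that behaviour and that the first column of each `last -x` line is the record type, listed newest first. The function names and the abbreviated sample text are hypothetical, for illustration only.

```python
def records_from_last_output(text):
    """Return record types (first column of `last -x` output) in chronological order."""
    kinds = [line.split()[0] for line in text.splitlines() if line.strip()]
    kinds.reverse()  # `last` lists the newest record first
    return kinds


def classify_boots(records):
    """Label each 'reboot' record: 'clean' if a 'shutdown' record precedes it
    since the previous boot, 'unclean' if not, 'unknown' for the oldest boot
    when the history starts without a shutdown record."""
    labels = []
    seen_boot = False
    clean_shutdown_seen = False
    for kind in records:
        if kind == "shutdown":
            clean_shutdown_seen = True
        elif kind == "reboot":
            if clean_shutdown_seen:
                labels.append("clean")
            elif seen_boot:
                labels.append("unclean")
            else:
                labels.append("unknown")
            seen_boot = True
            clean_shutdown_seen = False
    return labels


# Hypothetical, abbreviated `last -x` output (newest first); not real data.
SAMPLE_LAST = """\
reboot   system boot
runlevel (to lvl 2)
reboot   system boot
shutdown system down
reboot   system boot
"""

print(classify_boots(records_from_last_output(SAMPLE_LAST)))
```

An "unclean" label only means no shutdown record was written. A power loss, a hardware crash, and a reset triggered by the other node's fencing agent would all look the same here, so the timestamps still need to be compared against the cluster logs on the surviving node.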
continuous reboots in a two nodes cluster with heartbeat and pacemaker.
> Are these controlled shutdowns? Or are these hardware crash/reboots
> that are occurring?
>
> If the former you should see syslog entries for the shutdown sequence.
> If the latter, you won't see anything in the logs. This would suggest
> you've got a hardware problem, and not related to faulty NICs or switches.
>
> What kind of UPS are these machines powered from? Have you checked the
> UPS and verified they are functioning properly? If you have a power
> event and the UPS drop the load, the machines will reboot without a hint
> in the logs as to what caused the reboot.
>
> Finally, what servers are these? Dell/HP/IBM or whitebox? Memory
> mismatch or simply bad memory can cause inexplicable reboots. If the
> machines are decent quality, the BIOS should log such events.

Servers are HP ProLiant DL580 G5.
I'm afraid that I have hardware problems :-(
The strange thing is that it happens alternately on both nodes.
continuous reboots in a two nodes cluster with heartbeat and pacemaker.
On 8/13/2012 3:37 AM, Mauro wrote:
>> Are these controlled shutdowns? Or are these hardware crash/reboots
>> that are occurring?
>>
>> If the former you should see syslog entries for the shutdown sequence.
>> If the latter, you won't see anything in the logs. This would suggest
>> you've got a hardware problem, and not related to faulty NICs or switches.
>>
>> What kind of UPS are these machines powered from? Have you checked the
>> UPS and verified they are functioning properly? If you have a power
>> event and the UPS drop the load, the machines will reboot without a hint
>> in the logs as to what caused the reboot.
>>
>> Finally, what servers are these? Dell/HP/IBM or whitebox? Memory
>> mismatch or simply bad memory can cause inexplicable reboots. If the
>> machines are decent quality, the BIOS should log such events.
>
> Servers are HP ProLiant DL580 G5.
> I'm afraid that I have hardware problems :-(

I don't think you have enough solid information yet to make that assumption, unless you've discovered something you didn't share with us.

> The strange thing is that it happens alternately on both nodes.

That being the case I'd suspect something other than server hardware. To be sure, manually remove one node from the cluster and see how long the remaining node runs without rebooting. If it doesn't reboot at all, that eliminates hardware as the fault point.

--
Stan
continuous reboots in a two nodes cluster with heartbeat and pacemaker.
On 13 August 2012 22:58, Stan Hoeppner <stan@hardwarefreak.com> wrote:
> That being the case I'd suspect something other than server hardware.
> To be sure, manually remove one node from the cluster and see how long
> the remaining node runs without rebooting. If it doesn't reboot at all,
> that eliminates hardware as the fault point.

Good idea, I'll do it now.
continuous reboots in a two nodes cluster with heartbeat and pacemaker.
On 14 August 2012 08:24, Mauro <mrsanna1@gmail.com> wrote:
> On 13 August 2012 22:58, Stan Hoeppner <stan@hardwarefreak.com> wrote:
>
>> That being the case I'd suspect something other than server hardware.
>> To be sure, manually remove one node from the cluster and see how long
>> the remaining node runs without rebooting. If it doesn't reboot at all,
>> that eliminates hardware as the fault point.
>
> Good idea, I'll do it now.

I've done what you suggested.
It seems that the node reboots without reason.
It is as if it were powered off; in fact, in the boot log I see that the filesystem journal is recovered.
It seems very strange to me. Perhaps faulty RAM?