fence daemon problems
Linux Archive - Cluster Development
(http://www.linux-archive.org/cluster-development/709241-fence-daemon-problems.html)

Dietmar Maurer 10-03-2012 08:03 AM

fence daemon problems
 
I observe strange problems with fencing when a cluster loses quorum for a short time.

After regaining quorum, fenced reports a wait state of 'messages', and the whole cluster
is blocked waiting for fenced.

I can reproduce that bug here easily. It always happens with the following test:

Software: RHEL6.3 based kernel, corosync 1.4.4, cluster-3.1.93

I have 4 nodes. Node hp4 is turned off for this test:

hp2:~# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   X       0                       hp4
   2   M    1232  2012-10-03 08:59:08  hp1
   3   M    1228  2012-10-03 08:58:58  hp3
   4   M    1220  2012-10-03 08:58:58  hp2

hp2:~# fence_tool ls
fence domain
member count  3
victim count  0
victim now    0
master nodeid 3
wait state    none
members       2 3 4

Everything runs fine so far (the fence_tool ls output matches on all nodes).

Now I unplug the network cable on hp1:

hp2:~# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   X       0                       hp4
   2   X    1232                       hp1
   3   M    1228  2012-10-03 08:58:58  hp3
   4   M    1220  2012-10-03 08:58:58  hp2

hp2:~# fence_tool ls
fence domain
member count  2
victim count  1
victim now    0
master nodeid 3
wait state    quorum
members       2 3 4

Same output on hp3, so far so good.

In the fenced log I can find the following entries:

hp2:~# cat /var/log/cluster/fenced.log
Oct 03 08:59:08 fenced fenced 1349169030 started
Oct 03 08:59:09 fenced fencing deferred to hp3

On hp3:

hp3:~# cat /var/log/cluster/fenced.log
Oct 03 08:57:12 fenced fencing node hp4
Oct 03 08:57:21 fenced fence hp4 success

hp2:~# dlm_tool ls
dlm lockspaces
name          rgmanager
id            0x5231f3eb
flags         0x00000004 kern_stop
change        member 3 joined 1 remove 0 failed 0 seq 2,2
members       2 3 4
new change    member 2 joined 0 remove 1 failed 1 seq 3,3
new status    wait_messages 0 wait_condition 1 fencing
new members   3 4

Same output on hp3.

Now I reconnect the network on hp1:

# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   X       0                       hp4
   2   M    1240  2012-10-03 09:07:41  hp1
   3   M    1228  2012-10-03 08:58:58  hp3
   4   M    1220  2012-10-03 08:58:58  hp2

So we have quorum again.

hp2:~# fence_tool ls
fence domain
member count  3
victim count  1
victim now    0
master nodeid 3
wait state    messages
members       2 3 4

Same output on hp3; hp1 is different:

hp1:~# fence_tool ls
fence domain
member count  3
victim count  2
victim now    0
master nodeid 3
wait state    messages
members       2 3 4

Here are the fenced dumps, maybe someone can see what is wrong here?

hp2:~# fence_tool dump

…

1349247553 receive_complete 3:3 len 232

1349247751 cluster node 2 removed seq 1236

1349247751 fenced:daemon conf 2 0 1 memb 3 4 join left 2

1349247751 fenced:default conf 2 0 1 memb 3 4 join left 2

1349247751 add_change cg 3 remove nodeid 2 reason 3

1349247751 add_change cg 3 m 2 j 0 r 1 f 1

1349247751 add_victims node 2

1349247751 check_ringid cluster 1236 cpg 2:1232

1349247751 fenced:default ring 4:1236 2 memb 4 3

1349247751 check_ringid done cluster 1236 cpg 4:1236

1349247751 check_quorum not quorate

1349247751 fenced:daemon ring 4:1236 2 memb 4 3

1349248061 cluster node 2 added seq 1240

1349248061 check_ringid cluster 1240 cpg 4:1236

1349248061 fenced:daemon conf 3 1 0 memb 2 3 4 join 2 left

1349248061 cpg_mcast_joined retried 5 protocol

1349248061 fenced:daemon ring 2:1240 3 memb 2 4 3

1349248061 fenced:default conf 3 1 0 memb 2 3 4 join 2 left

1349248061 add_change cg 4 joined nodeid 2

1349248061 add_change cg 4 m 3 j 1 r 0 f 0

1349248061 check_ringid cluster 1240 cpg 4:1236

1349248061 fenced:default ring 2:1240 3 memb 2 4 3

1349248061 check_ringid done cluster 1240 cpg 2:1240

1349248061 check_quorum done

1349248061 send_start 4:4 flags 2 started 2 m 3 j 1 r 0 f 0

1349248061 receive_protocol from 4 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 4 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 4 join 1349247548 left 0 local quorum 1349248061

1349248061 receive_protocol from 3 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 3 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 3 join 1349247548 left 0 local quorum 1349248061

1349248061 receive_start 4:4 len 232

1349248061 match_change 4:4 skip cg 3 expect counts 2 0 1 1

1349248061 match_change 4:4 matches cg 4

1349248061 wait_messages cg 4 need 2 of 3

1349248061 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 2 max 0.0.0.0 run 0.0.0.0

1349248061 daemon node 2 join 1349248061 left 1349247751 local quorum 1349248061

1349248061 daemon node 2 stateful merge

1349248061 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 2 max 0.0.0.0 run 0.0.0.0

1349248061 daemon node 2 join 1349248061 left 1349247751 local quorum 1349248061

1349248061 daemon node 2 stateful merge

1349248061 receive_start 3:5 len 232

1349248061 match_change 3:5 skip cg 3 expect counts 2 0 1 1

1349248061 match_change 3:5 matches cg 4

1349248061 wait_messages cg 4 need 1 of 3

1349248061 receive_start 2:5 len 232

1349248061 match_change 2:5 skip cg 3 sender not member

1349248061 match_change 2:5 matches cg 4

1349248061 receive_start 2:5 add node with started_count 1

1349248061 wait_messages cg 4 need 1 of 3

hp3:~# fence_tool dump

…

1349247553 receive_complete 3:3 len 232

1349247751 cluster node 2 removed seq 1236

1349247751 fenced:daemon conf 2 0 1 memb 3 4 join left 2

1349247751 fenced:default conf 2 0 1 memb 3 4 join left 2

1349247751 add_change cg 4 remove nodeid 2 reason 3

1349247751 add_change cg 4 m 2 j 0 r 1 f 1

1349247751 add_victims node 2

1349247751 check_ringid cluster 1236 cpg 2:1232

1349247751 fenced:default ring 4:1236 2 memb 4 3

1349247751 check_ringid done cluster 1236 cpg 4:1236

1349247751 check_quorum not quorate

1349247751 fenced:daemon ring 4:1236 2 memb 4 3

1349248061 cluster node 2 added seq 1240

1349248061 check_ringid cluster 1240 cpg 4:1236

1349248061 fenced:daemon conf 3 1 0 memb 2 3 4 join 2 left

1349248061 cpg_mcast_joined retried 5 protocol

1349248061 fenced:daemon ring 2:1240 3 memb 2 4 3

1349248061 receive_protocol from 4 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 4 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 4 join 1349247548 left 0 local quorum 1349248061

1349248061 fenced:default conf 3 1 0 memb 2 3 4 join 2 left

1349248061 add_change cg 5 joined nodeid 2

1349248061 add_change cg 5 m 3 j 1 r 0 f 0

1349248061 check_ringid cluster 1240 cpg 4:1236

1349248061 fenced:default ring 2:1240 3 memb 2 4 3

1349248061 check_ringid done cluster 1240 cpg 2:1240

1349248061 check_quorum done

1349248061 send_start 3:5 flags 2 started 3 m 3 j 1 r 0 f 0

1349248061 receive_protocol from 3 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 3 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 3 join 1349247425 left 0 local quorum 1349248061

1349248061 receive_start 4:4 len 232

1349248061 match_change 4:4 skip cg 4 expect counts 2 0 1 1

1349248061 match_change 4:4 matches cg 5

1349248061 wait_messages cg 5 need 2 of 3

1349248061 receive_start 3:5 len 232

1349248061 match_change 3:5 skip cg 4 expect counts 2 0 1 1

1349248061 match_change 3:5 matches cg 5

1349248061 wait_messages cg 5 need 1 of 3

1349248061 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 2 max 0.0.0.0 run 0.0.0.0

1349248061 daemon node 2 join 1349248061 left 1349247751 local quorum 1349248061

1349248061 daemon node 2 stateful merge

1349248061 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 2 max 0.0.0.0 run 0.0.0.0

1349248061 daemon node 2 join 1349248061 left 1349247751 local quorum 1349248061

1349248061 daemon node 2 stateful merge

1349248061 receive_start 2:5 len 232

1349248061 match_change 2:5 skip cg 4 sender not member

1349248061 match_change 2:5 matches cg 5

1349248061 receive_start 2:5 add node with started_count 1

1349248061 wait_messages cg 5 need 1 of 3

hp1:~# fence_tool dump

…

1349247551 our_nodeid 2 our_name hp1

1349247552 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/fenced.log

1349247552 logfile cur mode 100644

1349247552 cpg_join fenced:daemon ...

1349247552 setup_cpg_daemon 10

1349247552 group_mode 3 compat 0

1349247552 fenced:daemon conf 3 1 0 memb 2 3 4 join 2 left

1349247552 fenced:daemon ring 2:1232 3 memb 2 4 3

1349247552 receive_protocol from 4 max 1.1.1.0 run 1.1.1.1

1349247552 daemon node 4 max 0.0.0.0 run 0.0.0.0

1349247552 daemon node 4 join 1349247552 left 0 local quorum 1349247551

1349247552 run protocol from nodeid 4

1349247552 daemon run 1.1.1 max 1.1.1

1349247552 receive_protocol from 3 max 1.1.1.0 run 1.1.1.1

1349247552 daemon node 3 max 0.0.0.0 run 0.0.0.0

1349247552 daemon node 3 join 1349247552 left 0 local quorum 1349247551

1349247552 receive_protocol from 2 max 1.1.1.0 run 0.0.0.0

1349247552 daemon node 2 max 0.0.0.0 run 0.0.0.0

1349247552 daemon node 2 join 1349247552 left 0 local quorum 1349247551

1349247552 receive_protocol from 2 max 1.1.1.0 run 1.1.1.0

1349247552 daemon node 2 max 1.1.1.0 run 0.0.0.0

1349247552 daemon node 2 join 1349247552 left 0 local quorum 1349247551

1349247553 client connection 3 fd 13

1349247553 added 4 nodes from ccs

1349247553 cpg_join fenced:default ...

1349247553 fenced:default conf 3 1 0 memb 2 3 4 join 2 left

1349247553 add_change cg 1 joined nodeid 2

1349247553 add_change cg 1 m 3 j 1 r 0 f 0

1349247553 add_victims_init nodeid 1

1349247553 check_ringid cluster 1232 cpg 0:0

1349247553 fenced:default ring 2:1232 3 memb 2 4 3

1349247553 check_ringid done cluster 1232 cpg 2:1232

1349247553 check_quorum done

1349247553 send_start 2:1 flags 1 started 0 m 3 j 1 r 0 f 0

1349247553 receive_start 3:3 len 232

1349247553 match_change 3:3 matches cg 1

1349247553 save_history 1 master 3 time 1349247441 how 1

1349247553 wait_messages cg 1 need 2 of 3

1349247553 receive_start 2:1 len 232

1349247553 match_change 2:1 matches cg 1

1349247553 wait_messages cg 1 need 1 of 3

1349247553 receive_start 4:2 len 232

1349247553 match_change 4:2 matches cg 1

1349247553 wait_messages cg 1 got all 3

1349247553 set_master from 0 to complete node 3

1349247553 fencing deferred to hp3

1349247553 receive_complete 3:3 len 232

1349247553 receive_complete clear victim nodeid 1 init 1

1349247750 cluster node 3 removed seq 1236

1349247750 cluster node 4 removed seq 1236

1349247751 fenced:daemon conf 2 0 1 memb 2 4 join left 3

1349247751 fenced:daemon conf 1 0 1 memb 2 join left 4

1349247751 fenced:daemon ring 2:1236 1 memb 2

1349247751 fenced:default conf 2 0 1 memb 2 4 join left 3

1349247751 add_change cg 2 remove nodeid 3 reason 3

1349247751 add_change cg 2 m 2 j 0 r 1 f 1

1349247751 add_victims node 3

1349247751 check_ringid cluster 1236 cpg 2:1232

1349247751 fenced:default conf 1 0 1 memb 2 join left 4

1349247751 add_change cg 3 remove nodeid 4 reason 3

1349247751 add_change cg 3 m 1 j 0 r 1 f 1

1349247751 add_victims node 4

1349247751 check_ringid cluster 1236 cpg 2:1232

1349247751 fenced:default ring 2:1236 1 memb 2

1349247751 check_ringid done cluster 1236 cpg 2:1236

1349247751 check_quorum not quorate

1349248061 cluster node 3 added seq 1240

1349248061 cluster node 4 added seq 1240

1349248061 check_ringid cluster 1240 cpg 2:1236

1349248061 fenced:daemon conf 2 1 0 memb 2 3 join 3 left

1349248061 cpg_mcast_joined retried 6 protocol

1349248061 fenced:daemon conf 3 1 0 memb 2 3 4 join 4 left

1349248061 fenced:daemon ring 2:1240 3 memb 2 4 3

1349248061 receive_protocol from 4 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 4 max 0.0.0.0 run 0.0.0.0

1349248061 daemon node 4 join 1349248061 left 1349247751 local quorum 1349248061

1349248061 daemon node 4 stateful merge

1349248061 receive_protocol from 3 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 3 max 0.0.0.0 run 0.0.0.0

1349248061 daemon node 3 join 1349248061 left 1349247751 local quorum 1349248061

1349248061 daemon node 3 stateful merge

1349248061 fenced:default conf 2 1 0 memb 2 3 join 3 left

1349248061 add_change cg 4 joined nodeid 3

1349248061 add_change cg 4 m 2 j 1 r 0 f 0

1349248061 check_ringid cluster 1240 cpg 2:1236

1349248061 fenced:default conf 3 1 0 memb 2 3 4 join 4 left

1349248061 add_change cg 5 joined nodeid 4

1349248061 add_change cg 5 m 3 j 1 r 0 f 0

1349248061 check_ringid cluster 1240 cpg 2:1236

1349248061 fenced:default ring 2:1240 3 memb 2 4 3

1349248061 check_ringid done cluster 1240 cpg 2:1240

1349248061 check_quorum done

1349248061 send_start 2:5 flags 2 started 1 m 3 j 1 r 0 f 0

1349248061 receive_start 4:4 len 232

1349248061 match_change 4:4 skip cg 2 created 1349247751 cluster add 1349248061

1349248061 match_change 4:4 skip cg 3 sender not member

1349248061 match_change 4:4 skip cg 4 sender not member

1349248061 match_change 4:4 matches cg 5

1349248061 receive_start 4:4 add node with started_count 2

1349248061 wait_messages cg 5 need 3 of 3

1349248061 receive_start 3:5 len 232

1349248061 match_change 3:5 skip cg 2 sender not member

1349248061 match_change 3:5 skip cg 3 sender not member

1349248061 match_change 3:5 skip cg 4 expect counts 2 1 0 0

1349248061 match_change 3:5 matches cg 5

1349248061 receive_start 3:5 add node with started_count 3

1349248061 wait_messages cg 5 need 3 of 3

1349248061 receive_start 2:5 len 232

1349248061 match_change 2:5 skip cg 2 expect counts 2 0 1 1

1349248061 match_change 2:5 skip cg 3 expect counts 1 0 1 1

1349248061 match_change 2:5 skip cg 4 expect counts 2 1 0 0

1349248061 match_change 2:5 matches cg 5

1349248061 wait_messages cg 5 need 2 of 3

1349248061 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 2 max 1.1.1.0 run 1.1.1.0

1349248061 daemon node 2 join 1349247552 left 0 local quorum 1349248061

1349248061 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 2 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 2 join 1349247552 left 0 local quorum 1349248061

Dietmar Maurer 10-03-2012 09:25 AM

fence daemon problems
 
> I observe strange problems with fencing when a cluster loses quorum for a
> short time.
>
> After regaining quorum, fenced reports a wait state of 'messages', and the
> whole cluster is blocked waiting for fenced.

Just found the following in fenced/cpg.c:

/* This is how we deal with cpg's that are partitioned and
then merge back together. When the merge happens, the
cpg on each side will see nodes from the other side being
added, and neither side will have zero started_count. So,
both sides will ignore start messages from the other side.
This causes the the domain on each side to continue waiting
for the missing start messages indefinately. To unblock
things, all nodes from one side of the former partition
need to fail. */

So the observed behavior is expected?

David Teigland 10-03-2012 02:46 PM

fence daemon problems
 
On Wed, Oct 03, 2012 at 09:25:08AM +0000, Dietmar Maurer wrote:
> So the observed behavior is expected?

Yes, it's a stateful partition merge, and I think /var/log/messages should
have mentioned something about that. When a node is partitioned from the
others (e.g. network disconnected), it has to be cleanly reset before it's
allowed back. "cleanly reset" typically means rebooted. If it comes back
without being reset (e.g. network reconnected), then the others ignore it,
which is what you saw.
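
For reference, one manual way to perform such a clean reset from a surviving
member is to fence the merged node again. This is only a sketch; it assumes
working fence devices are configured for that node in cluster.conf:

# run on a member that stayed up (hp2 or hp3 in this thread); hp1 is the
# node that rejoined without being reset
fence_node hp1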

Dietmar Maurer 10-03-2012 04:08 PM

fence daemon problems
 
> Subject: Re: [Cluster-devel] fence daemon problems
>
> On Wed, Oct 03, 2012 at 09:25:08AM +0000, Dietmar Maurer wrote:
> > So the observed behavior is expected?
>
> Yes, it's a stateful partition merge, and I think /var/log/messages should have
> mentioned something about that.

What message should I look for?

Dietmar Maurer 10-03-2012 04:12 PM

fence daemon problems
 
> Yes, it's a stateful partition merge, and I think /var/log/messages should have
> mentioned something about that. When a node is partitioned from the
> others (e.g. network disconnected), it has to be cleanly reset before it's
> allowed back. "cleanly reset" typically means rebooted. If it comes back
> without being reset (e.g. network reconnected), then the others ignore it,
> which is what you saw.

I don't really understand why 'dlm_controld' initiates fencing when the node
does not have quorum.

I thought 'dlm_controld' should wait until the cluster is quorate before starting fence actions?

David Teigland 10-03-2012 04:24 PM

fence daemon problems
 
On Wed, Oct 03, 2012 at 04:12:10PM +0000, Dietmar Maurer wrote:
> > Yes, it's a stateful partition merge, and I think /var/log/messages should have
> > mentioned something about that. When a node is partitioned from the
> > others (e.g. network disconnected), it has to be cleanly reset before it's
> > allowed back. "cleanly reset" typically means rebooted. If it comes back
> > without being reset (e.g. network reconnected), then the others ignore it,
> > which is what you saw.

> What message should I look for?

I was wrong, I was thinking about the "daemon node %d stateful merge"
messages which are debug, but should probably be changed to error.
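
A quick way to check for that condition after the fact, based on the dump
output shown earlier in this thread (simple sketch):

# any output means this member saw a peer rejoin without a clean reset
fence_tool dump | grep "stateful merge"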

> I don't really understand why 'dlm_controld' initiates fencing, although
> the node does not has quorum?
>
> I thought 'dlm_controld' should wait until cluster is quorate before
> starting fence actions?

I guess you're talking about the dlm_tool ls output? The "fencing" there
means it is waiting for fenced to finish fencing before it starts dlm
recovery. fenced waits for quorum.

hp2:~# dlm_tool ls
dlm lockspaces
name          rgmanager
id            0x5231f3eb
flags         0x00000004 kern_stop
change        member 3 joined 1 remove 0 failed 0 seq 2,2
members       2 3 4
new change    member 2 joined 0 remove 1 failed 1 seq 3,3
new status    wait_messages 0 wait_condition 1 fencing
new members   3 4

Dietmar Maurer 10-03-2012 04:26 PM

fence daemon problems
 
> I guess you're talking about the dlm_tool ls output?

Yes.

> The "fencing" there
> means it is waiting for fenced to finish fencing before it starts dlm recovery.
> fenced waits for quorum.

So who actually starts fencing when the cluster is not quorate? rgmanager?

David Teigland 10-03-2012 04:44 PM

fence daemon problems
 
On Wed, Oct 03, 2012 at 04:26:35PM +0000, Dietmar Maurer wrote:
> > I guess you're talking about the dlm_tool ls output?
>
> Yes.
>
> > The "fencing" there
> > means it is waiting for fenced to finish fencing before it starts dlm recovery.
> > fenced waits for quorum.
>
> So who actually starts fencing when cluster is not quorate? rgmanager?

fenced always starts fencing, but it waits for quorum first. In other
words, if your cluster loses quorum, nothing happens, not even fencing.

The intention of that is to prevent an inquorate node/partition from
killing a quorate group of nodes that are running normally. e.g. if a 5
node cluster is partitioned into 2/3 or 1/4. You don't want the 2 or 1
node group to fence the 3 or 4 nodes that are fine.

The difficult cases, which I think you're seeing, are partitions where no
group has quorum, e.g. 2/2. In this case we do nothing, and the user has
to resolve it by resetting some of the nodes. You might be able to assign
different numbers of votes to reduce the likelihood of everyone losing
quorum.
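
For illustration, a hedged cluster.conf fragment along those lines, using the
node names and node IDs from this thread (the votes values are just an
example, adjust them to your own layout):

<clusternodes>
  <!-- votes values are illustrative: hp2 gets one extra vote -->
  <clusternode name="hp4" nodeid="1" votes="1"/>
  <clusternode name="hp1" nodeid="2" votes="1"/>
  <clusternode name="hp3" nodeid="3" votes="1"/>
  <clusternode name="hp2" nodeid="4" votes="2"/>
</clusternodes>

With 5 total votes the quorum is 3, so in a 2/2 split the half containing hp2
stays quorate and fencing and recovery can proceed there; the inquorate half
still has to be reset before it may rejoin.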

Dietmar Maurer 10-03-2012 04:55 PM

fence daemon problems
 
> The intention of that is to prevent an inquorate node/partition from killing a
> quorate group of nodes that are running normally. e.g. if a 5 node cluster is
> partitioned into 2/3 or 1/4. You don't want the 2 or 1 node group to fence
> the 3 or 4 nodes that are fine.

sure, I understand that.

> The difficult cases, which I think you're seeing, are partitions where no group
> has quorum, e.g. 2/2. In this case we do nothing, and the user has to resolve
> it by resetting some of the nodes

The problem with that is that those 'difficult' cases are very likely. For example,
a switch reboot results in that state if you do not have a redundant network (yes,
I know that this setup is simply wrong).

And things get worse: it is not possible to reboot such nodes, because the
rgmanager shutdown simply hangs. Is there any way to avoid that, so that it is at
least possible to reboot those nodes?

David Teigland 10-03-2012 05:10 PM

fence daemon problems
 
On Wed, Oct 03, 2012 at 04:55:55PM +0000, Dietmar Maurer wrote:
> > The difficult cases, which I think you're seeing, are partitions where
> > no group has quorum, e.g. 2/2. In this case we do nothing, and the
> > user has to resolve it by resetting some of the nodes
>
> The problem with that is that those 'difficult' cases are very likely.
> For example a switch reboot results in that state if you do not have
> redundant network (yes, I know that this setup is simply wrong).
>
> And things get worse, because it is not possible to reboot such nodes,
> because rgmanager shutdown simply hangs. Is there any way to avoid that,
> so that it is at least possible to reboot those nodes?

Fabio's checkquorum script will reboot nodes that lose quorum.
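
For illustration only, a simplified sketch of that idea (this is not Fabio's
actual script; the "Activity blocked" test against cman_tool status output is
an assumption, verify what your version prints before relying on it):

#!/bin/sh
# Hypothetical watchdog loop: force-reboot this node if it stays inquorate
# for longer than TIMEOUT seconds.
TIMEOUT=60
LOST=0
while sleep 10; do
    # Assumption: cman_tool status appends "Activity blocked" to the Quorum
    # line while the cluster is inquorate.
    if cman_tool status 2>/dev/null | grep -q "Activity blocked"; then
        LOST=$((LOST + 10))
    else
        LOST=0
    fi
    if [ "$LOST" -ge "$TIMEOUT" ]; then
        # Immediate reset, bypassing the hanging rgmanager shutdown
        # (requires kernel.sysrq to be enabled).
        echo b > /proc/sysrq-trigger
    fi
done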

