11-19-2009, 10:35 AM
"Fabio M. Di Nitto"

fencing conditions: what should trigger a fencing operation?

Hi guys,

I have just hit what I think is a bug, and I think we need to review our
fencing policies.

This is what I saw:

- 6-node cluster (node1-3 x86, node4-6 x86_64)
- node1 and node4 perform a simple mount gfs2 -> wait -> umount -> wait
-> mount -> and loop forever
- node2 and node5 perform read/write operation on the same gfs2
partition (nothing fancy really)
- node3 is in charge of creating and removing clustered lv volumes.
- node6 is in charge of constantly relocating rgmanager services.

The cluster is running qdisk too.

It is a known issue that node1 will crash at some point (kernel OOPS).

Here are the interesting bits:

node1 is hanging in mount/umount (expected)
node2, node4, node5 will continue to operate as normal.
node3 is now hanging creating a vg.
node6 is trying to stop the service on node1 (it happened to be located
there at the time of the crash).

I was expecting that, after a failure, node1 would be fenced, but nothing
happens automatically.

Manually fencing the node will recover all hanging operations.

Talking to Steven W., it appears that our methods for defining and detecting a
failure should be improved.

My questions, simply driven by the fact that I am not a fence expert, are:

- what are the current fencing policies?
- what can we do to improve them?
- should we monitor for more failures than we do now?

Cheers
Fabio
 
11-19-2009, 04:04 PM
David Teigland

fencing conditions: what should trigger a fencing operation?

On Thu, Nov 19, 2009 at 12:35:05PM +0100, Fabio M. Di Nitto wrote:

> - what are the current fencing policies?

node failure

> - what can we do to improve them?

node failure is a simple, black and white, fact

> - should we monitor for more failures than we do now?

corosync *exists* to detect node failure

> It is a known issue that node1 will crash at some point (kernel OOPS).

oops is not necessarily node failure; if you *want* it to be, then you
sysctl -w kernel.panic_on_oops=1

(gfs has also had its own mount options over the years to force this
behavior, even if the sysctl isn't set properly; it's a common issue.
It seems panic_on_oops has had inconsistent default values over various
releases, sometimes 0, sometimes 1; setting it has historically been part
of cluster/gfs documentation since most customers want it to be 1.)
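
For reference, a minimal sketch of checking and setting that sysctl at
runtime (run as root; 1 means any oops triggers a panic, 0 means it does not):

  # check the current value
  sysctl kernel.panic_on_oops

  # make every oops panic the node, so the cluster sees a clean node failure
  sysctl -w kernel.panic_on_oops=1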

Dave
 
11-19-2009, 04:16 PM
David Teigland

fencing conditions: what should trigger a fencing operation?

On Thu, Nov 19, 2009 at 11:04:04AM -0600, David Teigland wrote:
> (gfs has also had its own mount options over the years to force this
> behavior, even if the sysctl isn't set properly; it's a common issue.

gfs1 does still have "-o oopses_ok"; I think gfs2 recently changed this
due to a customer who couldn't get it to work right.

http://git.fedorahosted.org/git/cluster.git?p=cluster.git;a=blob;f=gfs/man/gfs_mount.8;h=faf5d8345801070b7ce3183a62d81c21db6b6023;hb=RHEL4#l137
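
As a rough usage sketch only (device path and mount point are made-up
placeholders; see the man page above for the exact semantics of the option):

  # gfs1 mount with the oopses_ok option from gfs_mount(8)
  mount -t gfs -o oopses_ok /dev/cluster_vg/gfs_lv /mnt/gfs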
 
11-19-2009, 04:28 PM
David Teigland

fencing conditions: what should trigger a fencing operation?

On Thu, Nov 19, 2009 at 04:15:58PM +0000, Steven Whitehouse wrote:
> Hi,
>
> On Thu, 2009-11-19 at 11:04 -0600, David Teigland wrote:
> > On Thu, Nov 19, 2009 at 12:35:05PM +0100, Fabio M. Di Nitto wrote:
> >
> > > - what are the current fencing policies?
> >
> > node failure
> >
> I think what Fabio is asking is what event is considered to be a node
> failure? It sounds from your description that it means a failure of
> corosync communications.

corosync's main job is to define node up/down states and notify everyone
when they change, i.e. "cluster membership".

> Are there other things which can feed into this though? For example dlm
> seems to have some kind of timeout mechanism which sends a message to
> userspace, and I wonder whether that contributes to the decision too?

lock timeouts? lock timeouts are just a normal lock manager feature,
although we don't use them. (The dlm also has a variation on lock
timeouts where it doesn't cancel the timed-out lock, but instead sends a
notice to the deadlock detection code that there may be a deadlock, so a
new deadlock detection cycle is started.)

> It certainly isn't desirable for all types of filesystem failure to
> result in fencing & automatic recovery. I think we've got that wrong in
> the past. I posted a patch a few days back to try and address some of
> that. In the case we find an invalid block in a journal during recovery
> we certainly don't want to try and recover the journal on another node,
> nor even kill the recovering node since it will only result in another
> node trying to recover the same journal and hitting the same error.
> Eventually it will bring down the whole cluster.
>
> The aim of the patch was to return a suitable status indicating why
> journal recovery failed so that it can then be handled appropriately,

Historically, gfs will panic if it finds an error that would keep it from
making progress or from handling further fs access. This, of course, was in
the interest of HA, since you don't want one bad fs on one node to prevent
all the *other* nodes from working too.

Dave
 
11-19-2009, 05:10 PM
"Fabio M. Di Nitto"

fencing conditions: what should trigger a fencing operation?

David Teigland wrote:
> On Thu, Nov 19, 2009 at 12:35:05PM +0100, Fabio M. Di Nitto wrote:
>
>> - what are the current fencing policies?
>
> node failure
>
>> - what can we do to improve them?
>
> node failure is a simple, black and white, fact
>
>> - should we monitor for more failures than we do now?
>
> corosync *exists* to detect node failure
>
>> It is a known issue that node1 will crash at some point (kernel OOPS).
>
> oops is not necessarily node failure; if you *want* it to be, then you
> sysctl -w kernel.panic_on_oops=1
>
> (gfs has also had its own mount options over the years to force this
> behavior, even if the sysctl isn't set properly; it's a common issue.
> It seems panic_on_oops has had inconsistent default values over various
> releases, sometimes 0, sometimes 1; setting it has historically been part
> of cluster/gfs documentation since most customers want it to be 1.)

So a cluster can hang because our code failed, but we don't detect that
it did fail... So what determines a node failure? Only when corosync dies?

panic_on_oops is not cluster-specific, and not every OOPS is a panic == not
a clean solution.

Fabio
 
11-19-2009, 06:49 PM
David Teigland

fencing conditions: what should trigger a fencing operation?

On Thu, Nov 19, 2009 at 07:10:54PM +0100, Fabio M. Di Nitto wrote:
> David Teigland wrote:
> > On Thu, Nov 19, 2009 at 12:35:05PM +0100, Fabio M. Di Nitto wrote:
> >
> >> - what are the current fencing policies?
> >
> > node failure
> >
> >> - what can we do to improve them?
> >
> > node failure is a simple, black and white, fact
> >
> >> - should we monitor for more failures than we do now?
> >
> > corosync *exists* to detect node failure
> >
> >> It is a known issue that node1 will crash at some point (kernel OOPS).
> >
> > oops is not necessarily node failure; if you *want* it to be, then you
> > sysctl -w kernel.panic_on_oops=1
> >
> > (gfs has also had its own mount options over the years to force this
> > behavior, even if the sysctl isn't set properly; it's a common issue.
> > It seems panic_on_oops has had inconsistent default values over various
> > releases, sometimes 0, sometimes 1; setting it has historically been part
> > of cluster/gfs documentation since most customers want it to be 1.)
>
> So a cluster can hang because our code failed, but we don't detect that
> it did fail... So what determines a node failure? Only when corosync dies?

The error is detected in gfs. For every error in every bit of code, the
developer needs to consider what the appropriate error handling should be:
What are the consequences (with respect to availability and data
integrity), both locally and remotely, of the error handling they choose?
It's case by case.

If the error could lead to data corruption, then the proper error handling
is usually to fail fast and hard.

If the error can result in remote nodes being blocked, then the proper
error handling is usually self-sacrifice to avoid blocking other nodes.

Self-sacrifice means forcibly removing the local node from the cluster so
that others can recover for it and move on. There are different ways of
doing self-sacrifice:

- panic the local machine (kernel code usually uses this method)
- killing corosync on the local machine (daemons usually do this)
- calling reboot (I think rgmanager has used this method)
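
As a rough illustration only (not the exact code paths our components use),
those three methods correspond to commands along the lines of:

  # 1. force an immediate crash/panic of the local machine
  echo c > /proc/sysrq-trigger

  # 2. kill corosync so the rest of the cluster treats this node as failed
  killall -9 corosync

  # 3. force a reboot of the node
  reboot -f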

> panic_on_oops is not cluster-specific, and not every OOPS is a panic == not
> a clean solution.

So you want gfs oopses to result in a panic, and non-gfs oopses to *not*
result in a panic? There's probably a combination of options that would
produce this effect. Most people interested in HA will want all oopses to
result in a panic and recovery since an oops puts a node in a precarious
position regardless of where it came from.

Dave
 
11-20-2009, 06:26 AM
"Fabio M. Di Nitto"

fencing conditions: what should trigger a fencing operation?

David Teigland wrote:
> On Thu, Nov 19, 2009 at 07:10:54PM +0100, Fabio M. Di Nitto wrote:
>
> The error is detected in gfs. For every error in every bit of code, the
> developer needs to consider what the appropriate error handling should be:
> What are the consequences (with respect to availability and data
> integrity), both locally and remotely, of the error handling they choose?
> It's case by case.
>
> If the error could lead to data corruption, then the proper error handling
> is usually to fail fast and hard.

Of course, agreed.

>
> If the error can result in remote nodes being blocked, then the proper
> error handling is usually self-sacrifice to avoid blocking other nodes.

OK, so this is the case we are seeing here: the cluster is half blocked,
but there is no self-sacrifice action happening.

>
> Self-sacrifice means forcibly removing the local node from the cluster so
> that others can recover for it and move on. There are different ways of
> doing self-sacrifice:
>
> - panic the local machine (kernel code usually uses this method)
> - killing corosync on the local machine (daemons usually do this)
> - calling reboot (I think rgmanager has used this method)

I don't have an opinion on how it happens really, as long as it works.

>
>> panic_on_oops is not cluster-specific, and not every OOPS is a panic == not
>> a clean solution.
>
> So you want gfs oopses to result in a panic, and non-gfs oopses to *not*
> result in a panic?

Well, partially yes.

We can't make decisions for OOPSes that are not generated within our
code. The user will have to configure that via panic_on_oops or other
means. Maybe our task is to make sure users are aware of this
situation/option (I didn't check if it is documented).

You have a point in saying that it depends from error to error, and this
is exactly where I'd like to head. Maybe it's time to review our error
paths and make better decisions about what to do, at least within our code.

> There's probably a combination of options that would
> produce this effect. Most people interested in HA will want all oopses to
> result in a panic and recovery since an oops puts a node in a precarious
> position regardless of where it came from.

I agree, but I don't think we can kill the node on every OOPS by
default. We can agree that it has to be a user-configurable choice, but we
can improve our stuff to do the right thing (or do better at what it does now).

Fabio
 
11-20-2009, 04:40 PM
David Teigland

fencing conditions: what should trigger a fencing operation?

On Fri, Nov 20, 2009 at 08:26:57AM +0100, Fabio M. Di Nitto wrote:
> We can't make decisions for OOPSes that are not generated within our
> code. The user will have to configure that via panic_on_oops or other
> means. Maybe our task is to make sure users are aware of this
> situation/option (I didn't check if it is documented).

Yeah, in the past we've told people (and documented) to set panic_on_oops=1 if
it's not already set that way (see the gfs_mount man page I linked to above
for one example).

As I said, in some releases, like RHEL4 and RHEL5, panic_on_oops is 1 by
default, so everyone tends to forget about it. But I think upstream
kernels currently default to 0, so this will bite people using upstream
kernels who don't happen to read our documentation about setting the
sysctl.
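
To make the setting persistent across reboots, a minimal sketch (standard
sysctl config file; apply it without rebooting via "sysctl -p"):

  # /etc/sysctl.conf
  # panic on any oops so the node fails cleanly and gets fenced/recovered
  kernel.panic_on_oops = 1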

Dave
 
