fencing conditions: what should trigger a fencing operation?
On Thu, 2009-11-19 at 11:04 -0600, David Teigland wrote:
> On Thu, Nov 19, 2009 at 12:35:05PM +0100, Fabio M. Di Nitto wrote:
> > - what are the current fencing policies?
> node failure
I think what Fabio is asking is: what event is considered to be a node
failure? It sounds from your description that it means a failure of
corosync communications. Are there other things which can feed into this
though? For example, dlm seems to have some kind of timeout mechanism
which sends a message to userspace, and I wonder whether that
contributes to the decision too?
It certainly isn't desirable for all types of filesystem failure to
result in fencing & automatic recovery. I think we've got that wrong in
the past. I posted a patch a few days back to try and address some of
that. In the case where we find an invalid block in a journal during
recovery, we certainly don't want to try and recover the journal on
another node, nor even kill the recovering node, since that will only
result in another node trying to recover the same journal and hitting
the same error. Eventually it will bring down the whole cluster.
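As a rough sketch of the distinction (all names here are hypothetical and
purely illustrative; this is not the patch itself, nor any real gfs2, dlm
or fenced interface), the idea is that a recovery failure caused by a bad
journal should be reported differently from one caused by the recovering
node going away, so that only the latter is retried elsewhere:

```c
/* Hypothetical status codes for a journal recovery attempt. */
enum recovery_status {
    RECOVERY_OK,           /* journal replayed cleanly */
    RECOVERY_NODE_FAILED,  /* the recovering node failed during replay */
    RECOVERY_BAD_JOURNAL   /* an invalid block was found in the journal */
};

/* Hypothetical actions the cluster could take in response. */
enum recovery_action {
    ACTION_NONE,                /* nothing further to do */
    ACTION_RETRY_ON_OTHER_NODE, /* fence if needed, then recover elsewhere */
    ACTION_STOP_AND_REPORT      /* do not fence, do not retry; needs fsck */
};

/*
 * Map the reason for the failure to a response. A bad journal is never
 * retried: another node would read the same invalid block and fail the
 * same way.
 */
enum recovery_action choose_action(enum recovery_status status)
{
    switch (status) {
    case RECOVERY_OK:
        return ACTION_NONE;
    case RECOVERY_NODE_FAILED:
        return ACTION_RETRY_ON_OTHER_NODE;
    case RECOVERY_BAD_JOURNAL:
        return ACTION_STOP_AND_REPORT;
    }
    /* Unknown status: take the conservative path and do not retry. */
    return ACTION_STOP_AND_REPORT;
}
```

The point is only that the caller needs to know why recovery failed; the
exact set of statuses and responses is an assumption for illustration.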
The aim of the patch was to return a suitable status indicating why
journal recovery failed so that it can then be handled appropriately,