04-25-2011, 05:29 PM
John Ruemker

multipath failover & rhcs

On 04/25/2011 01:01 PM, Dave Sullivan wrote:

Hi Guys,

It seems recently that we have just run into this problem where we
don't fully understand the timeouts that drive multipath fail-over.


We did thorough testing of pulling fibre and failing HBAs manually, and
multipath handled things perfectly.


Recently we encountered SCSI block errors, where the multipath
fail-over did not occur before the qdisk timeout.


This was attributed to the SCSI block errors and the SCSI LUN timeout
of 60 seconds, which is set by default.


I added a comment to the first link below that discusses a situation
that would cause this to occur. We think that this was due to a
defective HBA under high I/O load.


Once we get the HBA in question we will run some tests to validate
that modifying the scsi block timeouts in fact allows multipath to
fail-over in time to beat the qdisk timeout.


I'm getting ready to take a look at the code to see if I can
validate these theories. The area that is still somewhat gray is the
true definition of multipath timings for failover.


I don't think there is a true definition of a multipath timeout, per
se. I see it as the following:


multipath check = every 20 seconds for no failed paths
multipath check (if failed paths) = every 5 seconds on failed paths only

multipath failover occurs = driver timeout attribute met (Emulex
lpfc_devloss_tmo value)

--capture pulling fibre
--capture disabling hba

or (for other types of failures)

multipath failover occurs = scsi block timeout + driver timeout (not
sure if the driver timeout attribute is added)


https://access.redhat.com/kb/docs/DOC-2881
https://docspace.corp.redhat.com/docs/DOC-32822

Hmm, I just found out that there was apparently a new fix for this in
RHEL 5.5, according to Salesforce case 00085953.


-Dave



Hi Dave,
These are issues we have recently been working to resolve with this and
other qdisk articles. The problem is as you described it: we don't have
an accurate definition of how long it will take multipath to fail a path
in all scenarios. The formula used in the article is basically wrong,
and we're working to fix it, but coming up with a formula for a path
timeout has been difficult. This calculation should not be based on
no_path_retry at all, as we are really only concerned with the amount of
time it takes for the SCSI layer to return an error, allowing qdisk's
I/O operation to be sent down an alternate path.


Regarding the formula you posted:

>> multipath check = every 20 seconds for no failed paths
>> multipath check (if failed paths) = every 5 seconds on failed paths only


Just to clarify, the polling interval doubles after each successful path
check, up to 4 times the original. So you're correct that for a
healthy path you should see it checking every 20s after the first few
checks. Likewise, your second statement is also accurate: after
a failed check, it drops back to the configured polling interval until
the path returns to active status.
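
To make the doubling behaviour concrete, here is a minimal sketch in Python. It is a simplified model of the description above, not multipathd's actual code; the function name check_intervals and its parameters are made up for illustration, and the 5-second base interval is taken from the figures quoted in this thread.

```python
def check_intervals(base_interval, outcomes, max_multiplier=4):
    """Model the delay (in seconds) scheduled after each path check.

    base_interval  -- configured polling interval (hypothetical parameter name)
    outcomes       -- iterable of booleans, True for a successful check
    max_multiplier -- cap on growth, 4x the base as described in the thread
    """
    max_interval = base_interval * max_multiplier
    interval = base_interval
    delays = []
    for ok in outcomes:
        if ok:
            # A successful check doubles the interval, up to the cap.
            interval = min(interval * 2, max_interval)
        else:
            # A failed check drops back to the configured interval.
            interval = base_interval
        delays.append(interval)
    return delays


# Three good checks, one failure, then a recovery, with a 5 s base interval.
print(check_intervals(5, [True, True, True, False, True]))
# prints [10, 20, 20, 5, 10]
```

Under this model a healthy path settles at 20s between checks after two successes, and a failing path is rechecked every 5s.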


Regarding case 00085953, I was actually the owner of that one. A
change went into 5.5 which lowered the default tur/readsector0 SCSI
I/O timeout from 300 to the checker_timeout value (which defaults to
the timeout value in /sys/block/sdX/device/timeout).
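
For anyone who wants to check the value on a given path device, a small Python sketch follows. It only reads the sysfs file named above; the helper name read_scsi_timeout is made up for illustration, "sda" is a placeholder device name, and the sys_block argument exists only so the function can be pointed at a test directory.

```python
from pathlib import Path


def read_scsi_timeout(device, sys_block="/sys/block"):
    """Return the SCSI command timeout for a block device, as an integer.

    Reads <sys_block>/<device>/device/timeout, e.g.
    /sys/block/sda/device/timeout on a live system.
    """
    path = Path(sys_block) / device / "device" / "timeout"
    return int(path.read_text().strip())


if __name__ == "__main__":
    # "sda" is a placeholder; substitute one of the multipath path devices.
    try:
        print(read_scsi_timeout("sda"))
    except OSError as err:
        print(f"could not read timeout: {err}")
```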


I am very interested in any information you come up with on the
calculation of how long a path failure will take. We will integrate
that into this article if you can come up with anything.


Let me know if you have any questions.

--
John Ruemker, RHCA
Technical Account Manager
Global Support Services
Red Hat, Inc.
Office: 919-754-4941
Cell: 919-793-8549

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
All times are GMT.