Hmm, I just found out there was a new fix for this in rhel5u5, it looks
like, from this case in Salesforce: 00085953.
These are issues we have recently been working to resolve with this and
other qdisk articles. The problem is as you described it: we don't have
an accurate definition of how long it will take multipath to fail a path
in all scenarios. The formula used in the article is basically wrong,
and we're working to fix it, but coming up with a formula for a path
timeout has been difficult. This calculation should not be based on
no_path_retry at all, as we are really only concerned with the amount of
time it takes for the SCSI layer to return an error, allowing qdisk's
I/O operation to be sent down an alternate path.
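As the above makes clear, we don't yet have an accurate formula, but the
shape of the calculation can be sketched. This is a rough illustrative
upper bound only, not an endorsed formula; the parameter names and the
4x-interval assumption come from multipath's documented checker behavior,
and the specific numbers are assumptions:

```python
# Rough sketch (NOT an authoritative formula): an upper bound on how long
# qdisk's I/O might block before the SCSI layer returns an error and the
# I/O can be reissued down an alternate path.

def worst_case_failover_seconds(polling_interval, checker_timeout):
    # A path can fail just after a successful check, so detection may take
    # up to one full (possibly doubled) polling interval, plus however long
    # the checker's I/O waits before timing out. no_path_retry is
    # deliberately excluded: qdisk only needs the error returned so the
    # I/O can be retried on another path.
    max_interval = 4 * polling_interval  # interval doubles up to 4x
    return max_interval + checker_timeout

# Example with a configured polling_interval of 5s and a 30s checker timeout:
print(worst_case_failover_seconds(5, 30))  # 50
```

Again, this ignores transport-specific retries below the SCSI layer, which
is exactly why pinning down a formula for all scenarios has been hard.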
Regarding the formula you posted:
>> multipath check = every 20 seconds for no failed paths
>> multipath check (if failed paths) = every 5 seconds on failed paths
Just to clarify, the polling interval doubles after each successful path
check, up to 4 times the original. So you're correct that for a
healthy path you should see it checking every 20s after the first few
checks. Likewise, your second statement is also accurate in that after
a failed check, it drops back to the configured polling interval until
the path returns to active status.
Regarding case 00085953, I was actually the owner of that one. There
was a change that went into 5.5 which lowered the default
tur/readsector0 SCSI I/O timeout from 300 seconds to the checker_timeout
value (which defaults to the timeout value in
I am very interested in any information you come up with on the
calculation of how long a path failure will take. We will integrate
that into this article if you can come up with anything.
Let me know if you have any questions.
John Ruemker, RHCA
Technical Account Manager
Global Support Services
Red Hat, Inc.
dm-devel mailing list