Linux Archive


Bernd Schubert 05-16-2012 12:28 PM

multipath_busy() stalls IO due to scsi_host_is_busy()
 
Hello,

while I actually want to benchmark FhGFS on a NetApp system, I keep
running from one kernel problem to another.
Yesterday we had to recable, and while we are still using multipath,
each priority group now only has one underlying device (we don't have
sufficient IB SRP ports on our test systems, but still want to benchmark
a system as close as possible to a production system).
So after recabling all failover paths disappeared, which *shouldn't*
have any influence on performance. However, performance is now
unexpectedly down by about 50% when I'm doing buffered IO. With direct
IO it is still fine, and reducing nr_requests of the multipath device
to 8 also 'fixes' the problem. I then guessed right and simply made
multipath_busy() always return 0, which also fixes the issue.



Problem:
- iostat -x -m 1 shows that, alternating, one multipath device starts to
stall IO for several minutes
- the other multipath device then does IO during that time at about
600 to 700 MB/s, until it in turn starts to stall IO
- the active NetApp controller could serve both multipath devices at
about 600 to 700 MB/s


Workarounds:
- add another passive sdX device to the multipath group
- use direct IO
- reduce /sys/block/dm-X/queue/nr_requests to 8
  (/sys/block/sdX/queue/nr_requests does not need to be updated)
- disable multipath_busy() by letting it return 0 (see the stub below)
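
For reference, the 'disable' test was nothing more than a stub along
these lines (illustrative only; the real dm-mpath function walks the
priority groups, and the exact source differs between kernel versions):

static int multipath_busy(struct dm_target *ti)
{
	return 0;	/* test hack: never report the multipath target as busy */
}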

Looking through the call chain, I see the underlying problem seems to be
in scsi_host_is_busy().



static inline int scsi_host_is_busy(struct Scsi_Host *shost)
{
	if ((shost->can_queue > 0 && shost->host_busy >= shost->can_queue) ||
	    shost->host_blocked || shost->host_self_blocked)
		return 1;

	return 0;
}



shost->can_queue -> 62 here
shost->host_busy -> 62 when one of the multipath groups does IO, further
multipath groups then seem to get stalled.
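
For reference, the call chain I followed looks roughly like this (a
sketch from reading the 3.x sources; the intermediate helpers may differ
between kernel versions):

dm_request_fn()                       [drivers/md/dm.c, request-based dm]
  -> multipath_busy()                 [drivers/md/dm-mpath.c, via ti->type->busy]
       -> __pgpath_busy()             (for each path in the current priority group)
            -> dm_underlying_device_busy()
                 -> blk_lld_busy()    [block/blk-core.c]
                      -> scsi_lld_busy()        [drivers/scsi/scsi_lib.c]
                           -> scsi_host_is_busy()  -> returns 1 once host_busy
                                                      reaches can_queue (62)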


I'm not sure yet why multipath_busy() does not stall IO when there is a
passive path in the prio group.


Any idea how to properly address this problem?


Thanks,
Bernd


James Bottomley 05-16-2012 02:06 PM

multipath_busy() stalls IO due to scsi_host_is_busy()
 
On Wed, 2012-05-16 at 14:28 +0200, Bernd Schubert wrote:
> shost->can_queue -> 62 here
> shost->host_busy -> 62 when one of the multipath groups does IO, further
> multipath groups then seem to get stalled.
>
> I'm not sure yet why multipath_busy() does not stall IO when there is a
> passive path in the prio group.
>
> Any idea how to properly address this problem?

shost->can_queue is supposed to represent the maximum number of possible
outstanding commands per HBA (i.e. the HBA hardware limit). Assuming
the driver got it right, the only way of increasing this is to buy a
better HBA.

James



Bernd Schubert 05-16-2012 02:29 PM

multipath_busy() stalls IO due to scsi_host_is_busy()
 
On 05/16/2012 04:06 PM, James Bottomley wrote:

On Wed, 2012-05-16 at 14:28 +0200, Bernd Schubert wrote:

shost->can_queue -> 62 here
shost->host_busy -> 62 when one of the multipath groups does IO, further
multipath groups then seem to get stalled.

I'm not sure yet why multipath_busy() does not stall IO when there is a
passive path in the prio group.

Any idea how to properly address this problem?


shost->can_queue is supposed to represent the maximum number of possible
outstanding commands per HBA (i.e. the HBA hardware limit). Assuming
the driver got it right, the only way of increasing this is to buy a
better HBA.


The HBA is a Mellanox IB adapter. I have not yet checked where the limit
of 62 queue entries comes from. That is also not the real problem; the
real problem is that multipath suspends IO although it should not.
As I said, if I remove the functionality of those busy functions,
everything is fine. I think what happens is that dm-multipath suspends
IO for too long and in the meantime the other path already submits IO
again. So I guess the underlying problem is an unfair queuing strategy.



Cheers,
Bernd


Mike Christie 05-16-2012 03:27 PM

multipath_busy() stalls IO due to scsi_host_is_busy()
 
On 05/16/2012 09:29 AM, Bernd Schubert wrote:
> On 05/16/2012 04:06 PM, James Bottomley wrote:
>> On Wed, 2012-05-16 at 14:28 +0200, Bernd Schubert wrote:
>>> shost->can_queue -> 62 here
>>> shost->host_busy -> 62 when one of the multipath groups does IO,
>>> further
>>> multipath groups then seem to get stalled.
>>>
>>> I'm not sure yet why multipath_busy() does not stall IO when there is a
>>> passive path in the prio group.
>>>
>>> Any idea how to properly address this problem?
>>
>> shost->can_queue is supposed to represent the maximum number of possible
>> outstanding commands per HBA (i.e. the HBA hardware limit). Assuming
>> the driver got it right, the only way of increasing this is to buy a
>> better HBA.
>
> HBA is a mellanox IB adapter. I have not checked yet where the limit of

What driver is this with? SRP or iSER or something else?


Bernd Schubert 05-16-2012 03:54 PM

multipath_busy() stalls IO due to scsi_host_is_busy()
 
On 05/16/2012 05:27 PM, Mike Christie wrote:

On 05/16/2012 09:29 AM, Bernd Schubert wrote:

On 05/16/2012 04:06 PM, James Bottomley wrote:

On Wed, 2012-05-16 at 14:28 +0200, Bernd Schubert wrote:

shost->can_queue -> 62 here
shost->host_busy -> 62 when one of the multipath groups does IO,
further
multipath groups then seem to get stalled.

I'm not sure yet why multipath_busy() does not stall IO when there is a
passive path in the prio group.

Any idea how to properly address this problem?


shost->can_queue is supposed to represent the maximum number of possible
outstanding commands per HBA (i.e. the HBA hardware limit). Assuming
the driver got it right, the only way of increasing this is to buy a
better HBA.


HBA is a mellanox IB adapter. I have not checked yet where the limit of


What driver is this with? SRP or iSER or something else?



It's SRP. The command queue limit comes from SRP_RQ_SIZE. The value seems
a bit low, IMHO, and it's definitely lower than needed for optimal
performance. However, given that I get good performance when
multipath_busy() is a no-op, I think the busy check is the primary issue
here. And it is always possible that a single LUN uses up all command
queue entries; other LUNs still shouldn't be stalled completely.
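
For reference, if I read the driver correctly the 62 falls out of the
ib_srp constants roughly like this (a sketch; names and values may
differ in other kernel versions):

/* drivers/infiniband/ulp/srp/ib_srp.h (approximate) */
enum {
	SRP_RQ_SHIFT         = 6,
	SRP_RQ_SIZE          = 1 << SRP_RQ_SHIFT,                 /* 64 */
	SRP_SQ_SIZE          = SRP_RQ_SIZE,
	SRP_RSP_SQ_SIZE      = 1,
	SRP_REQ_SQ_SIZE      = SRP_SQ_SIZE - SRP_RSP_SQ_SIZE,
	SRP_TSK_MGMT_SQ_SIZE = 1,
	SRP_CMD_SQ_SIZE      = SRP_REQ_SQ_SIZE - SRP_TSK_MGMT_SQ_SIZE, /* 62 */
};

/* ib_srp.c then reports that as the SCSI host queue depth: */
	.can_queue	= SRP_CMD_SQ_SIZE,	/* -> shost->can_queue == 62 */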


So in summary we actually have two issues:

1) Unfair queuing/waiting in dm-mpath, which stalls an entire path and
brings down overall performance.

2) A low SRP command queue depth. Is there a reason why
SRP_RQ_SHIFT/SRP_RQ_SIZE and the values that depend on them are so small?



Thanks,
Bernd



David Dillow 05-16-2012 05:03 PM

multipath_busy() stalls IO due to scsi_host_is_busy()
 
On Wed, 2012-05-16 at 11:54 -0400, Bernd Schubert wrote:
> 2) A low SRP command queue depth. Is there a reason why
> SRP_RQ_SHIFT/SRP_RQ_SIZE and the values that depend on them are so small?

That's a decision that has been around since the beginning of the driver
as far as I can tell. It looks to be a balance between device needs and
memory usage, and it can certainly be raised -- I've run locally with
SRP_RQ_SHIFT set to 7 (shost.can_queue 126) and I'm sure 8 would be no
problem, either. I didn't see a performance improvement on my workload,
but maybe you will.

Because we take the minimum of our initiator queue depth and the initial
credits from the target (each request consumes a credit), going higher
than a shift of 8 doesn't buy us much, as I don't know off-hand of any
target that gives out more than 256 credits.

The memory used for the command ring will vary depending on the value of
SRP_RQ_SHIFT and the number of s/g entries allowed to be put in the
command. 255 s/g entries requires ~4200 bytes per request, which means
an 8 KB allocation, so we currently need 512 KB of buffers for the send
queue for each target. Going to a shift of 8 would require 2 MB max per
target, which probably isn't a real issue.

There's also a response ring with an allocation size that depends on the
target, but those should be pretty small buffers, say 1 KB * (1 <<
SRP_RQ_SHIFT).
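
Spelling out the send-ring math (assuming the 8 KB per-request
allocation described above):

SRP_RQ_SHIFT = 6  ->   64 requests  ->   64 * 8 KB = 512 KB per target
SRP_RQ_SHIFT = 7  ->  128 requests  ->  128 * 8 KB =   1 MB per target
SRP_RQ_SHIFT = 8  ->  256 requests  ->  256 * 8 KB =   2 MB per target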

--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office



Bernd Schubert 05-16-2012 08:34 PM

multipath_busy() stalls IO due to scsi_host_is_busy()
 
On 05/16/2012 07:03 PM, David Dillow wrote:
> On Wed, 2012-05-16 at 11:54 -0400, Bernd Schubert wrote:
>> 2) A low SRP command queue depth. Is there a reason why
>> SRP_RQ_SHIFT/SRP_RQ_SIZE and the values that depend on them are so small?
>
> That's a decision that has been around since the beginning of the driver
> as far as I can tell. It looks to be a balance between device needs and
> memory usage, and it can certainly be raised -- I've run locally with
> SRP_RQ_SHIFT set to 7 (shost.can_queue 126) and I'm sure 8 would be no
> problem, either. I didn't see a performance improvement on my workload,
> but maybe you will.

Ah, thanks a lot! In the past I tested the DDN S2A and figured out that a
queue size of 16 per device provides optimal performance. With typically
7 primary devices per server that makes 112, so SRP_RQ_SHIFT=7 is
perfectly fine. But with another typical configuration of 14 devices per
server, and with the current multipath busy strategy, you should already
see a performance drop.
Right now I'm running tests on a NetApp and don't yet know the optimal
parameters. So I set the queue size to the maximum, but didn't expect
such multipath issues...
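
To put numbers on that (with a per-LUN queue depth of 16):

 7 LUNs * 16 = 112 outstanding commands -> fits within can_queue 126 (SRP_RQ_SHIFT=7)
14 LUNs * 16 = 224 outstanding commands -> exceeds 126, so the busy check kicks in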

>
> Because we take the minimum of our initiator queue depth and the initial
> credits from the target (each request consumes a credit), going higher
> than a shift of 8 doesn't buy us much, as I don't know off-hand of any
> target that gives out more than 256 credits.
>
> The memory used for the command ring will vary depending on the value of
> SRP_RQ_SHIFT and the number of s/g entries allowed to be put in the
> command. 255 s/g entries requires ~4200 bytes per request, which means
> an 8 KB allocation, so we currently need 512 KB of buffers for the send
> queue for each target. Going to a shift of 8 would require 2 MB max per
> target, which probably isn't a real issue.
>
> There's also a response ring with an allocation size that depends on the
> target, but those should be pretty small buffers, say 1 KB * (1 <<
> SRP_RQ_SHIFT).
>

Maybe we should convert the parameter to a module option? I will look
into it tomorrow.
And unless someone else comes up with a dm-mpath patch first, I will try
to fix the first issue. I think I will simply always allow a few requests
per prio group. Let's see if that gets accepted.
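
Something like this is what I have in mind for the module option (an
untested sketch only -- since the ring sizes are currently compile-time
constants used to size arrays, the rings would also have to be allocated
dynamically):

/* drivers/infiniband/ulp/srp/ib_srp.c -- hypothetical sketch */
static unsigned int srp_rq_shift = 6;	/* default: 1 << 6 = 64 ring entries */
module_param(srp_rq_shift, uint, 0444);
MODULE_PARM_DESC(srp_rq_shift, "log2 of the SRP request ring size (default 6)");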


Thanks,
Bernd


PS: Thanks a lot for the ib-srp large-IO patches you sent last year! I
only noticed them last week.


"Jun'ichi Nomura" 05-17-2012 09:09 AM

multipath_busy() stalls IO due to scsi_host_is_busy()
 
Hi,

On 05/16/12 21:28, Bernd Schubert wrote:
> Looking through the call chain, I see the underlying problem seems to be in scsi_host_is_busy().
>
>> static inline int scsi_host_is_busy(struct Scsi_Host *shost)
>> {
>> 	if ((shost->can_queue > 0 && shost->host_busy >= shost->can_queue) ||
>> 	    shost->host_blocked || shost->host_self_blocked)
>> 		return 1;
>>
>> 	return 0;
>> }

multipath_busy() was introduced because, without it, a request would be
sent down to SCSI prematurely, losing the chance of additional merges
and resulting in bad performance.

However, when it is the target/host that is busy, I think dm should
send the request down and let SCSI, which has better knowledge
about the shared resource, do appropriate starvation control.

Could you try the attached patch?

---
Jun'ichi Nomura, NEC Corporation

If sdev is not busy but the starget and/or host is busy,
it is better to accept a request from the stacking driver.
Otherwise, the stacking device could be starved by other devices
sharing the same target/host.

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 5dfd749..0eb4602 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1378,16 +1378,13 @@ static int scsi_lld_busy(struct request_queue *q)
 {
 	struct scsi_device *sdev = q->queuedata;
 	struct Scsi_Host *shost;
-	struct scsi_target *starget;
 
 	if (!sdev)
 		return 0;
 
 	shost = sdev->host;
-	starget = scsi_target(sdev);
 
-	if (scsi_host_in_recovery(shost) || scsi_host_is_busy(shost) ||
-	    scsi_target_is_busy(starget) || scsi_device_is_busy(sdev))
+	if (scsi_host_in_recovery(shost) || scsi_device_is_busy(sdev))
 		return 1;
 
 	return 0;


Mike Snitzer 05-17-2012 01:46 PM

multipath_busy() stalls IO due to scsi_host_is_busy()
 
On Thu, May 17 2012 at 5:09am -0400,
Jun'ichi Nomura <j-nomura@ce.jp.nec.com> wrote:

> Hi,
>
> On 05/16/12 21:28, Bernd Schubert wrote:
> > Looking through the call chain, I see the underlying problem seems to be in scsi_host_is_busy().
> >
> >> static inline int scsi_host_is_busy(struct Scsi_Host *shost)
> >> {
> >> 	if ((shost->can_queue > 0 && shost->host_busy >= shost->can_queue) ||
> >> 	    shost->host_blocked || shost->host_self_blocked)
> >> 		return 1;
> >>
> >> 	return 0;
> >> }
>
> multipath_busy() was introduced because, without it, a request would be
> sent down to SCSI prematurely, losing the chance of additional merges
> and resulting in bad performance.
>
> However, when it is the target/host that is busy, I think dm should
> send the request down and let SCSI, which has better knowledge
> about the shared resource, do appropriate starvation control.
>
> Could you try the attached patch?
>
> ---
> Jun'ichi Nomura, NEC Corporation
>
> If sdev is not busy but the starget and/or host is busy,
> it is better to accept a request from the stacking driver.
> Otherwise, the stacking device could be starved by other devices
> sharing the same target/host.

Great suggestion.

It should be noted that DM mpath is the only caller of blk_lld_busy (and
scsi_lld_busy). So even though this patch may _seem_ like the tail
(mpath) wagging the dog (SCSI), it is reasonable to change SCSI's
definition of a LLD being "busy" if it benefits multipath.


Bernd Schubert 05-21-2012 03:42 PM

multipath_busy() stalls IO due to scsi_host_is_busy()
 
On 05/17/2012 11:09 AM, Jun'ichi Nomura wrote:

Hi,

On 05/16/12 21:28, Bernd Schubert wrote:

Looking through the call chain, I see the underlying problem seems to be in scsi_host_is_busy().


static inline int scsi_host_is_busy(struct Scsi_Host *shost)
{
	if ((shost->can_queue > 0 && shost->host_busy >= shost->can_queue) ||
	    shost->host_blocked || shost->host_self_blocked)
		return 1;

	return 0;
}


multipath_busy() was introduced because, without it, a request would be
sent down to SCSI prematurely, losing the chance of additional merges
and resulting in bad performance.

However, when it is the target/host that is busy, I think dm should
send the request down and let SCSI, which has better knowledge
about the shared resource, do appropriate starvation control.

Could you try the attached patch?

---
Jun'ichi Nomura, NEC Corporation

If sdev is not busy but the starget and/or host is busy,
it is better to accept a request from the stacking driver.
Otherwise, the stacking device could be starved by other devices
sharing the same target/host.

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 5dfd749..0eb4602 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1378,16 +1378,13 @@ static int scsi_lld_busy(struct request_queue *q)
 {
 	struct scsi_device *sdev = q->queuedata;
 	struct Scsi_Host *shost;
-	struct scsi_target *starget;
 
 	if (!sdev)
 		return 0;
 
 	shost = sdev->host;
-	starget = scsi_target(sdev);
 
-	if (scsi_host_in_recovery(shost) || scsi_host_is_busy(shost) ||
-	    scsi_target_is_busy(starget) || scsi_device_is_busy(sdev))
+	if (scsi_host_in_recovery(shost) || scsi_device_is_busy(sdev))
 		return 1;
 
 	return 0;


Thanks, that works fine! I had something else in mind for multipath,
but if this can go into SCSI we don't need a multipath patch anymore.

Are you going to submit it officially (it is missing a Signed-off-by)?

Thanks,
Bernd


