multipath_busy() stalls IO due to scsi_host_is_busy()
Hello,
while I actually want to benchmark FhGFS on a NetApp system, I'm somehow running from one kernel problem to another. Yesterday we had to recable, and while we are still using multipath, each priority group now has only one underlying device (we don't have sufficient IB SRP ports on our test systems, but still want to benchmark a system as close as possible to a production system). So after recabling, all failover paths disappeared, which *shouldn't* have any influence on performance. However, unexpectedly, performance has now dropped to less than 50% when I'm doing buffered IO. With direct IO it is still fine, and reducing nr_requests of the multipath device to 8 also 'fixes' the problem. I then guessed right and simply made multipath_busy() always return 0, which also fixes the issue.

Problem:
- iostat -x -m 1 shows that, alternating, one multipath device starts to stall IO for several minutes
- the other multipath device then does IO during that time with about 600 to 700 MB/s, until it starts to stall IO
- the active NetApp controller could serve both multipath devices with about 600 to 700 MB/s

Problem solutions:
- add another passive sdX device to the multipath group
- use direct IO
- reduce /sys/block/dm-X/queue/nr_requests to 8
  - /sys/block/sdX does not need to be updated
- disable multipath_busy() by letting it return 0

Looking through the call chain, I see the underlying problem seems to be in scsi_host_is_busy().

	static inline int scsi_host_is_busy(struct Scsi_Host *shost)
	{
		if ((shost->can_queue > 0 && shost->host_busy >= shost->can_queue) ||
		    shost->host_blocked || shost->host_self_blocked)
			return 1;

		return 0;
	}

shost->can_queue -> 62 here
shost->host_busy -> 62 when one of the multipath groups does IO; further multipath groups then seem to get stalled.

I'm not sure yet why multipath_busy() does not stall IO when there is a passive path in the prio group.

Any idea how to properly address this problem?
Thanks,
Bernd

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
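A minimal user-space sketch of the stall described in this report, assuming that both multipath devices share one SCSI host and that dispatch stops whenever the busy check fires. The struct host_model type and the dispatch() helper are hypothetical and exist only for illustration; only the predicate itself is taken from the scsi_host_is_busy() code quoted in the report.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; field names mirror the struct Scsi_Host fields
 * used by the quoted scsi_host_is_busy(). */
struct host_model {
	int can_queue;           /* per-host command limit (62 in the report) */
	int host_busy;           /* commands currently outstanding on the host */
	bool host_blocked;
	bool host_self_blocked;
};

/* Same condition as the quoted scsi_host_is_busy(). */
static bool host_is_busy(const struct host_model *h)
{
	return (h->can_queue > 0 && h->host_busy >= h->can_queue) ||
	       h->host_blocked || h->host_self_blocked;
}

/*
 * Hypothetical dispatch loop for one multipath device whose only path
 * goes through host h. It stops as soon as the host reports busy and
 * returns the number of requests that were actually sent down.
 */
static int dispatch(struct host_model *h, int want)
{
	int sent = 0;

	assert(want >= 0);
	while (sent < want && !host_is_busy(h)) {
		h->host_busy++;
		sent++;
	}
	return sent;
}
```

In this model, once the first device has filled all 62 slots, a call to dispatch() for the second device returns 0 until a command completes and host_busy drops again, which matches the alternating stalls seen in iostat.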
multipath_busy() stalls IO due to scsi_host_is_busy()
On Wed, 2012-05-16 at 14:28 +0200, Bernd Schubert wrote:
> shost->can_queue -> 62 here
> shost->host_busy -> 62 when one of the multipath groups does IO, further
> multipath groups then seem to get stalled.
>
> I'm not sure yet why multipath_busy() does not stall IO when there is a
> passive path in the prio group.
>
> Any idea how to properly address this problem?

shost->can_queue is supposed to represent the maximum number of possible outstanding commands per HBA (i.e. the HBA hardware limit). Assuming the driver got it right, the only way of increasing this is to buy a better HBA.

James
multipath_busy() stalls IO due to scsi_host_is_busy()
On 05/16/2012 04:06 PM, James Bottomley wrote:
> On Wed, 2012-05-16 at 14:28 +0200, Bernd Schubert wrote:
>> shost->can_queue -> 62 here
>> shost->host_busy -> 62 when one of the multipath groups does IO, further
>> multipath groups then seem to get stalled.
>>
>> I'm not sure yet why multipath_busy() does not stall IO when there is a
>> passive path in the prio group.
>>
>> Any idea how to properly address this problem?
>
> shost->can_queue is supposed to represent the maximum number of possible
> outstanding commands per HBA (i.e. the HBA hardware limit). Assuming
> the driver got it right, the only way of increasing this is to buy a
> better HBA.

The HBA is a Mellanox IB adapter. I have not checked yet where the limit of 62 queue entries comes from. This is also not the real problem. The real problem is that multipath suspends IO, although it should not. As I said, if I remove the functionality of those busy functions, everything is fine. I think what happens is that dm-multipath suspends IO for too long, and in the meantime the other path already submits IO again. So I guess the underlying problem is an unfair queuing strategy.

Cheers,
Bernd
multipath_busy() stalls IO due to scsi_host_is_busy()
On 05/16/2012 09:29 AM, Bernd Schubert wrote:
> On 05/16/2012 04:06 PM, James Bottomley wrote:
>> On Wed, 2012-05-16 at 14:28 +0200, Bernd Schubert wrote:
>>> shost->can_queue -> 62 here
>>> shost->host_busy -> 62 when one of the multipath groups does IO,
>>> further
>>> multipath groups then seem to get stalled.
>>>
>>> I'm not sure yet why multipath_busy() does not stall IO when there is a
>>> passive path in the prio group.
>>>
>>> Any idea how to properly address this problem?
>>
>> shost->can_queue is supposed to represent the maximum number of possible
>> outstanding commands per HBA (i.e. the HBA hardware limit). Assuming
>> the driver got it right, the only way of increasing this is to buy a
>> better HBA.
>
> HBA is a mellanox IB adapter. I have not checked yet where the limit of

What driver is this with? SRP or iSER or something else?
multipath_busy() stalls IO due to scsi_host_is_busy()
On 05/16/2012 05:27 PM, Mike Christie wrote:
> On 05/16/2012 09:29 AM, Bernd Schubert wrote:
>> On 05/16/2012 04:06 PM, James Bottomley wrote:
>>> On Wed, 2012-05-16 at 14:28 +0200, Bernd Schubert wrote:
>>>> shost->can_queue -> 62 here
>>>> shost->host_busy -> 62 when one of the multipath groups does IO, further
>>>> multipath groups then seem to get stalled.
>>>>
>>>> I'm not sure yet why multipath_busy() does not stall IO when there is a
>>>> passive path in the prio group.
>>>>
>>>> Any idea how to properly address this problem?
>>>
>>> shost->can_queue is supposed to represent the maximum number of possible
>>> outstanding commands per HBA (i.e. the HBA hardware limit). Assuming
>>> the driver got it right, the only way of increasing this is to buy a
>>> better HBA.
>>
>> HBA is a mellanox IB adapter. I have not checked yet where the limit of
>
> What driver is this with? SRP or iSER or something else?

It's SRP. The command queue limit comes from SRP_RQ_SIZE. The value seems a bit low, IMHO, and it's definitely lower than needed for optimal performance. However, given that I get good performance when multipath_busy() is a noop, I think multipath is the primary issue here. It is always possible that a single LUN could use all command queue entries, but other LUNs still shouldn't be stalled completely.

So in summary we actually have two issues:

1) Unfair queuing/waiting of dm-mpath, which stalls an entire path and brings down overall performance.

2) Low SRP command queues. Is there a reason why SRP_RQ_SHIFT/SRP_RQ_SIZE and their dependent values such as SRP_RQ_SIZE are so small?

Thanks,
Bernd
multipath_busy() stalls IO due to scsi_host_is_busy()
On Wed, 2012-05-16 at 11:54 -0400, Bernd Schubert wrote:
> 2) Low SRP command queues. Is there a reason why
> SRP_RQ_SHIFT/SRP_RQ_SIZE and their dependent values such as SRP_RQ_SIZE are
> so small?

That's a decision that has been around since the beginning of the driver as far as I can tell. It looks to be a balance between device needs and memory usage, and it can certainly be raised -- I've run locally with SRP_RQ_SHIFT set to 7 (shost.can_queue 126) and I'm sure 8 would be no problem, either. I didn't see a performance improvement on my workload, but maybe you will.

Because we take the minimum of our initiator queue depth and the initial credits from the target (each req consumes a credit), going higher than 8 doesn't buy us much, as I don't know off-hand of any target that gives out more than 256 credits.

The memory used for the command ring will vary depending on the value of SRP_RQ_SHIFT and the number of s/g entries allowed to be put in the command. 255 s/g entries requires an 8 KB allocation for each request (~4200 bytes), so we currently require 512 KB of buffers for the send queue for each target. Going to 8 will require 2 MB max per target, which probably isn't a real issue.

There's also a response ring with an allocation size that depends on the target, but those should be pretty small buffers, say 1 KB * (1 << SRP_RQ_SHIFT).

--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
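The buffer arithmetic in this message can be sketched as a few helper functions. This is only an illustration: the function names are hypothetical, the 8 KB and 1 KB buffer sizes are the round figures given in the message, and the "ring size minus 2" rule for can_queue is an inference from the two data points in the thread (62 with the current setting, 126 with SRP_RQ_SHIFT set to 7), not something read from the driver source.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes, taken from the figures quoted in the message. */
#define CMD_BUF_BYTES (8u * 1024u)  /* 8 KB allocation per request */
#define RSP_BUF_BYTES (1u * 1024u)  /* "say 1 KB" per response buffer */

/* Number of ring entries for a given SRP_RQ_SHIFT: 1 << shift. */
static size_t ring_entries(unsigned int rq_shift)
{
	assert(rq_shift < 16);  /* keep the shift well inside size_t */
	return (size_t)1 << rq_shift;
}

/* Send-queue buffer memory per target. */
static size_t send_ring_bytes(unsigned int rq_shift)
{
	return ring_entries(rq_shift) * CMD_BUF_BYTES;
}

/* Rough response-ring buffer memory per target. */
static size_t response_ring_bytes(unsigned int rq_shift)
{
	return ring_entries(rq_shift) * RSP_BUF_BYTES;
}

/*
 * can_queue as observed in the thread. The "- 2" is an assumption that
 * fits both reported values (62 and 126).
 */
static size_t can_queue_for(unsigned int rq_shift)
{
	return ring_entries(rq_shift) - 2;
}
```

With these assumptions, a shift of 6 gives 64 entries and 512 KB of send buffers, and a shift of 8 gives 256 entries and 2 MB, in line with the numbers above.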
multipath_busy() stalls IO due to scsi_host_is_busy()
On 05/16/2012 07:03 PM, David Dillow wrote:
> On Wed, 2012-05-16 at 11:54 -0400, Bernd Schubert wrote:
>> 2) Low SRP command queues. Is there a reason why
>> SRP_RQ_SHIFT/SRP_RQ_SIZE and their dependent values such as SRP_RQ_SIZE are
>> so small?
>
> That's a decision that has been around since the beginning of the driver
> as far as I can tell. It looks to be a balance between device needs and
> memory usage, and it can certainly be raised -- I've run locally with
> SRP_RQ_SHIFT set to 7 (shost.can_queue 126) and I'm sure 8 would be no
> problem, either. I didn't see a performance improvement on my workload,
> but maybe you will.

Ah, thanks a lot! In the past I tested the DDN S2A and figured out that a queue size of 16 per device provides optimal performance. So with typically 7 primary devices per server that makes 112, so SRP_RQ_SHIFT=7 is perfectly fine. But then with another typical configuration of 14 devices per server and with the current multipath-busy strategy, you should already see a performance drop.
Right now I'm running tests on a NetApp and don't yet know the optimal parameters. So I set the queue size to the maximum, but didn't expect such multipath issues...

> Because we take the minimum of our initiator queue depth and the initial
> credits from the target (each req consumes a credit), going higher than
> 8 doesn't buy us much, as I don't know off-hand of any target that gives
> out more than 256 credits.
>
> The memory used for the command ring will vary depending on the value of
> SRP_RQ_SHIFT and the number of s/g entries allowed to be put in the
> command. 255 s/g entries requires an 8 KB allocation for each request
> (~4200 bytes), so we currently require 512 KB of buffers for the send
> queue for each target. Going to 8 will require 2 MB max per target,
> which probably isn't a real issue.
>
> There's also a response ring with an allocation size that depends on the
> target, but those should be pretty small buffers, say 1 KB * (1 <<
> SRP_RQ_SHIFT).

Maybe we should convert the entire parameter to a module option? I will look into it tomorrow.
And unless someone already comes up with a dm-mpath patch, I will try to fix the first issue. I think I will simply always allow a few requests per prio-group. Let's see if this gets accepted.

Thanks,
Bernd

PS: Thanks a lot for the ib-srp large IO patches you sent last year! I just noticed those last week.
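The queue-budget reasoning in this message (16 commands per device, 7 or 14 devices per server) can be checked with a one-line predicate. This is a sketch under the assumption that all devices share one host-wide can_queue limit; fits_host_queue() is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns true if every device can keep per_dev_depth commands in
 * flight at the same time without exceeding the host's can_queue.
 */
static bool fits_host_queue(int devices, int per_dev_depth, int can_queue)
{
	assert(devices >= 0 && per_dev_depth >= 0);
	return devices * per_dev_depth <= can_queue;
}
```

For example, fits_host_queue(7, 16, 126) holds because 112 <= 126, while 14 devices at the same depth need 224 slots and do not fit, which is the configuration expected to hit the host-busy condition.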
multipath_busy() stalls IO due to scsi_host_is_busy()
Hi,
On 05/16/12 21:28, Bernd Schubert wrote:
> Looking through the call chain, I see the underlying problem seems to be in scsi_host_is_busy().
>
>> static inline int scsi_host_is_busy(struct Scsi_Host *shost)
>> {
>>         if ((shost->can_queue > 0 && shost->host_busy >= shost->can_queue) ||
>>             shost->host_blocked || shost->host_self_blocked)
>>                 return 1;
>>
>>         return 0;
>> }

multipath_busy() was introduced because, without that, a request would be prematurely sent down to SCSI, lose the chance of additional merges and result in bad performance.

However, when it is target/host that is busy, I think dm should send the request down and let SCSI, which has better knowledge about the shared resource, do appropriate starvation control.

Could you try the attached patch?

---
Jun'ichi Nomura, NEC Corporation

If sdev is not busy but starget and/or host is busy, it is better to accept a request from stacking driver. Otherwise, the stacking device could be starved by other device sharing the same target/host.

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 5dfd749..0eb4602 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1378,16 +1378,13 @@ static int scsi_lld_busy(struct request_queue *q)
 {
 	struct scsi_device *sdev = q->queuedata;
 	struct Scsi_Host *shost;
-	struct scsi_target *starget;
 
 	if (!sdev)
 		return 0;
 
 	shost = sdev->host;
-	starget = scsi_target(sdev);
 
-	if (scsi_host_in_recovery(shost) || scsi_host_is_busy(shost) ||
-	    scsi_target_is_busy(starget) || scsi_device_is_busy(sdev))
+	if (scsi_host_in_recovery(shost) || scsi_device_is_busy(sdev))
 		return 1;
 
 	return 0;
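To make the effect of the patch easier to see, here is a small user-space sketch of the busy predicate before and after the change. It is an illustration only: struct lld_state and both function names are hypothetical, and each flag simply stands in for the result of the corresponding kernel helper (scsi_host_in_recovery(), scsi_host_is_busy(), scsi_target_is_busy(), scsi_device_is_busy()). The !sdev early return of the real function is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flattened state; one flag per kernel helper result. */
struct lld_state {
	bool host_in_recovery;
	bool host_busy;
	bool target_busy;
	bool device_busy;
};

/* Condition used by scsi_lld_busy() before the patch. */
static bool lld_busy_before(const struct lld_state *s)
{
	return s->host_in_recovery || s->host_busy ||
	       s->target_busy || s->device_busy;
}

/* Condition after the patch: host and target business are ignored. */
static bool lld_busy_after(const struct lld_state *s)
{
	return s->host_in_recovery || s->device_busy;
}
```

The two predicates differ only when the host or target is busy while the device itself is idle. In that case the patched version reports "not busy", so dm-multipath sends the request down and leaves starvation control to SCSI.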
multipath_busy() stalls IO due to scsi_host_is_busy()
On Thu, May 17 2012 at 5:09am -0400,
Jun'ichi Nomura <j-nomura@ce.jp.nec.com> wrote:

> Hi,
>
> On 05/16/12 21:28, Bernd Schubert wrote:
> > Looking through the call chain, I see the underlying problem seems to be in scsi_host_is_busy().
> >
> >> static inline int scsi_host_is_busy(struct Scsi_Host *shost)
> >> {
> >>         if ((shost->can_queue > 0 && shost->host_busy >= shost->can_queue) ||
> >>             shost->host_blocked || shost->host_self_blocked)
> >>                 return 1;
> >>
> >>         return 0;
> >> }
>
> multipath_busy() was introduced because, without that,
> a request would be prematurely sent down to SCSI,
> lose the chance of additional merges and result in
> bad performance.
>
> However, when it is target/host that is busy, I think dm should
> send the request down and let SCSI, which has better knowledge
> about the shared resource, do appropriate starvation control.
>
> Could you try the attached patch?
>
> ---
> Jun'ichi Nomura, NEC Corporation
>
> If sdev is not busy but starget and/or host is busy,
> it is better to accept a request from stacking driver.
> Otherwise, the stacking device could be starved by other device
> sharing the same target/host.

Great suggestion. It should be noted that DM mpath is the only caller of blk_lld_busy (and scsi_lld_busy). So even though this patch may _seem_ like the tail (mpath) wagging the dog (SCSI), it is reasonable to change SCSI's definition of a LLD being "busy" if it benefits multipath.
multipath_busy() stalls IO due to scsi_host_is_busy()
On 05/17/2012 11:09 AM, Jun'ichi Nomura wrote:
> Hi,
>
> On 05/16/12 21:28, Bernd Schubert wrote:
>> Looking through the call chain, I see the underlying problem seems to be in scsi_host_is_busy().
>>
>>> static inline int scsi_host_is_busy(struct Scsi_Host *shost)
>>> {
>>>         if ((shost->can_queue > 0 && shost->host_busy >= shost->can_queue) ||
>>>             shost->host_blocked || shost->host_self_blocked)
>>>                 return 1;
>>>
>>>         return 0;
>>> }
>
> multipath_busy() was introduced because, without that,
> a request would be prematurely sent down to SCSI,
> lose the chance of additional merges and result in
> bad performance.
>
> However, when it is target/host that is busy, I think dm should
> send the request down and let SCSI, which has better knowledge
> about the shared resource, do appropriate starvation control.
>
> Could you try the attached patch?
>
> ---
> Jun'ichi Nomura, NEC Corporation
>
> If sdev is not busy but starget and/or host is busy,
> it is better to accept a request from stacking driver.
> Otherwise, the stacking device could be starved by other device
> sharing the same target/host.
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 5dfd749..0eb4602 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1378,16 +1378,13 @@ static int scsi_lld_busy(struct request_queue *q)
>  {
>  	struct scsi_device *sdev = q->queuedata;
>  	struct Scsi_Host *shost;
> -	struct scsi_target *starget;
>  
>  	if (!sdev)
>  		return 0;
>  
>  	shost = sdev->host;
> -	starget = scsi_target(sdev);
>  
> -	if (scsi_host_in_recovery(shost) || scsi_host_is_busy(shost) ||
> -	    scsi_target_is_busy(starget) || scsi_device_is_busy(sdev))
> +	if (scsi_host_in_recovery(shost) || scsi_device_is_busy(sdev))
>  		return 1;
>  
>  	return 0;

Thanks, that works fine! I had something else in mind for multipath, but if this can go into SCSI we don't need the multipath patch anymore. Are you going to officially submit it (it is missing a Signed-off-by)?

Thanks,
Bernd