dm-crypt: parallel processing
On 08/21/2012 11:08 AM, Milan Broz wrote:
> Hello, > > I promised to Mikulas that I will post remaining of his dm-crypt parallel > processing patchset (plus some related changes) with some comments. > > The problem: > > The current implementation (using per cpu struct) always use encryption > on the same CPU which submitted IO. > > With very fast storage (usually SSD or MD RAID0) one CPU core can > be limiting and the throughput of encrypted disk is worse in > comparison with underlying storage. > > Idea here is to distribute encryption to other CPU cores/CPUs. > > (Side effect of patches is nice clean up dmcrypt code. :) Better adding cc to Tejun here, I still think there are several things which perhaps should be done through kernel wq... (I would prefer to use kernel wq as well btw.) Milan > > Mikulas Patocka (20): > dm-crypt: remove per-cpu structure > dm-crypt: use unbound workqueue for request processing > dm-crypt: remove completion restart > dm-crypt: use encryption threads > dm-crypt: Unify spinlock > dm-crypt: Introduce an option that sets the number of threads. > dm-crypt: don't use write queue > dm-crypt: simplify io queue > dm-crypt: unify io_queue and crypt_queue > dm-crypt: don't allocate pages for a partial request. > dm-crypt: avoid deadlock in mempools > dm-crypt: simplify cc_pending > dm-crypt merge convert_context and dm_crypt_io > dm-crypt: move error handling to crypt_convert. > dm-crypt: remove io_pending field > dm-crypt: small changes > dm-crypt: move temporary values to stack > dm-crypt: offload writes to thread > dm-crypt: retain write ordering > dm-crypt: sort writes > > drivers/md/dm-crypt.c | 838 +++++++++++++++++++++++++++---------------------- > 1 file changed, 464 insertions(+), 374 deletions(-) > > > My notes: > > I extensively tested this (on top of 3.6.0-rc2) and while I like > simplification and the main logic (if we have enough power why > not use other CPUs) I see several problems here. > > 1) The implementation is not much useful on modern CPUs with hw accelerated > AES (with AES-NI even one core can saturate very fast storage). > (Some optimized crypto behaves similar, like twofish optimized modules.) > > 2) The patchset targets linear access pattern mainly and one > process generating IOs. (With more processes/CPUs generating IOs > you get parallel processing even with current code.) > > I can see significant improvement (~30%) for linear read > (if not using AES-NI) and if underlying storage is zero target > (ideal situation removing scheduler from the stack). > > For random access pattern (and more IO producers) I cannot find > reliable improvement except ideal zero target case (where improvement > is always >30%). For more producers it doesn't help at all. > I tested RAID0 over SATA disks and very fast SSD on quad core cpu, > dmcrypt running with 1, 3 or 4 threads (and with cfq and noop scheduler) > using fio threaded test with 1 or 4 threads. > > Notes to implementation: > > 1) Last two patches (19/20) provides sorting of IO requests, this > logically should be done in IO scheduler. > > I don't think this should be in dmcrypt, if scheduler doesn't work > properly, it should be fixed or tuned for crypt access pattern. > > 2) Could be kernel workqueue used/fixed here instead? Basically all it needs > is to prefer submitting CPU, if it is busy just move work to another CPU. > > 3) It doesn't honour CPU hotplugging. Number of CPU is determined > in crypt constructor. If hotplugging is used for CPU power saving, > it will not work properly. > > 4) With debugging options on, everything is extremely slower > (Perhaps bug in some debug option, I just noticed this on Fedora > rawhide with all debug on.) > > > I posted it mainly because I think this should move forward, > whatever direction... > > Milan -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel |
dm-crypt: parallel processing
[cc'ing Chris Mason]
On Tue, Aug 21 2012 at 5:08am -0400, Milan Broz <mbroz@redhat.com> wrote: > Hello, > > I promised to Mikulas that I will post remaining of his dm-crypt parallel > processing patchset (plus some related changes) with some comments. > > The problem: > > The current implementation (using per cpu struct) always use encryption > on the same CPU which submitted IO. > > With very fast storage (usually SSD or MD RAID0) one CPU core can > be limiting and the throughput of encrypted disk is worse in > comparison with underlying storage. > > Idea here is to distribute encryption to other CPU cores/CPUs. > > (Side effect of patches is nice clean up dmcrypt code. :) > > Mikulas Patocka (20): > dm-crypt: remove per-cpu structure > dm-crypt: use unbound workqueue for request processing > dm-crypt: remove completion restart > dm-crypt: use encryption threads > dm-crypt: Unify spinlock > dm-crypt: Introduce an option that sets the number of threads. > dm-crypt: don't use write queue > dm-crypt: simplify io queue > dm-crypt: unify io_queue and crypt_queue > dm-crypt: don't allocate pages for a partial request. > dm-crypt: avoid deadlock in mempools > dm-crypt: simplify cc_pending > dm-crypt merge convert_context and dm_crypt_io > dm-crypt: move error handling to crypt_convert. > dm-crypt: remove io_pending field > dm-crypt: small changes > dm-crypt: move temporary values to stack > dm-crypt: offload writes to thread > dm-crypt: retain write ordering > dm-crypt: sort writes > > drivers/md/dm-crypt.c | 838 +++++++++++++++++++++++++++---------------------- > 1 file changed, 464 insertions(+), 374 deletions(-) > > > My notes: > > I extensively tested this (on top of 3.6.0-rc2) and while I like > simplification and the main logic (if we have enough power why > not use other CPUs) I see several problems here. > > 1) The implementation is not much useful on modern CPUs with hw accelerated > AES (with AES-NI even one core can saturate very fast storage). > (Some optimized crypto behaves similar, like twofish optimized modules.) > > 2) The patchset targets linear access pattern mainly and one > process generating IOs. (With more processes/CPUs generating IOs > you get parallel processing even with current code.) > > I can see significant improvement (~30%) for linear read > (if not using AES-NI) and if underlying storage is zero target > (ideal situation removing scheduler from the stack). > > For random access pattern (and more IO producers) I cannot find > reliable improvement except ideal zero target case (where improvement > is always >30%). For more producers it doesn't help at all. > I tested RAID0 over SATA disks and very fast SSD on quad core cpu, > dmcrypt running with 1, 3 or 4 threads (and with cfq and noop scheduler) > using fio threaded test with 1 or 4 threads. > > Notes to implementation: > > 1) Last two patches (19/20) provides sorting of IO requests, this > logically should be done in IO scheduler. > > I don't think this should be in dmcrypt, if scheduler doesn't work > properly, it should be fixed or tuned for crypt access pattern. > > 2) Could be kernel workqueue used/fixed here instead? Basically all it needs > is to prefer submitting CPU, if it is busy just move work to another CPU. > > 3) It doesn't honour CPU hotplugging. Number of CPU is determined > in crypt constructor. If hotplugging is used for CPU power saving, > it will not work properly. > > 4) With debugging options on, everything is extremely slower > (Perhaps bug in some debug option, I just noticed this on Fedora > rawhide with all debug on.) > > > I posted it mainly because I think this should move forward, > whatever direction... I had a conversation with Chris Mason, maybe 2 years ago, where he expressed concern about dm-crypt growing too much intelligence about parallel cpu usage. I seem to recall Chris' concern being relative to btrfs and the requirement that IOs not get reordered (his point was any dm-crypt change to improve cpu utilization shouldn't result in the possibility to reorder IOs). Could be that IO reordering isn't a concern dm-crypt (before or after this patchset). But I just wanted to make this point so that it is kept in mind (also allows Chris to raise any other new dm-crypt issue/concern he might have?). Mike -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel |
dm-crypt: parallel processing
Hello,
(cc'ing Jens and Vivek, hi!) On Tue, Aug 21, 2012 at 11:37:43AM +0200, Milan Broz wrote: > Better adding cc to Tejun here, I still think there are several things > which perhaps should be done through kernel wq... > > (I would prefer to use kernel wq as well btw.) What do you mean by kernel wq? One of the system_*_wq's? If not, from scanning the patch names, it seems like it's converting to unbound workqueue from bound one. > > 1) Last two patches (19/20) provides sorting of IO requests, this > > logically should be done in IO scheduler. > > > > I don't think this should be in dmcrypt, if scheduler doesn't work > > properly, it should be fixed or tuned for crypt access pattern. I kinda agree but preserving (not strictly but at least most of the time) issuing order across stacking driver like dm probably isn't a bad idea. I *think* the direction block layer should be headed is to reduce the amount of work it does as the speed and characteristics of underlying devices improve. We could afford to do a LOT of things to better cater to devices with spindles but the operating margin continues to become narrower. Jens, Vivek, what do you guys think? > > 2) Could be kernel workqueue used/fixed here instead? Basically all it needs > > is to prefer submitting CPU, if it is busy just move work to another CPU. The problem, I suppose, is that w/ wq, it's either bound or completely unbound. If bound, the local CPU can become the bottleneck. If unbound, wq doesn't discern local and remote at all and thus loses any benefit from locality association. It would be nice if workqueue can somehow accomodate the situation better - maybe by migrating the worker to the issuing CPU before setting it loose so that the scheduler needs to migrate it away explicitly. Maybe we can do it opportunistically - e.g. record which CPU an unbound worker was on before entering idle and queue to local one if it exists. It wouldn't be trivial to implement tho. I'll think more about it. Thanks. -- tejun -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel |
dm-crypt: parallel processing
On Tue, Aug 21, 2012 at 11:23:40AM -0700, Tejun Heo wrote:
[..] > > > 1) Last two patches (19/20) provides sorting of IO requests, this > > > logically should be done in IO scheduler. > > > > > > I don't think this should be in dmcrypt, if scheduler doesn't work > > > properly, it should be fixed or tuned for crypt access pattern. > > I kinda agree but preserving (not strictly but at least most of the > time) issuing order across stacking driver like dm probably isn't a > bad idea. I *think* the direction block layer should be headed is to > reduce the amount of work it does as the speed and characteristics of > underlying devices improve. We could afford to do a LOT of things to > better cater to devices with spindles but the operating margin > continues to become narrower. Jens, Vivek, what do you guys think? I think we can make various block layer functionalities optional and faster drivers can choose which ones do they want. For example, "queue/nomerges" disables merging. May be another sys tunable or a queue flag (which driver provides at the time of registration) for "nosorting" can be used to reduce the amount of work block layer does. That way slower devices can still make use of loaded block layer and faster devices can opt out of those features. Thanks Vivek -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel |
dm-crypt: parallel processing
On 08/21/2012 08:23 PM, Tejun Heo wrote:
> Hello, > > (cc'ing Jens and Vivek, hi!) > > On Tue, Aug 21, 2012 at 11:37:43AM +0200, Milan Broz wrote: >> Better adding cc to Tejun here, I still think there are several things >> which perhaps should be done through kernel wq... >> >> (I would prefer to use kernel wq as well btw.) > > What do you mean by kernel wq? One of the system_*_wq's? If not, > from scanning the patch names, it seems like it's converting to > unbound workqueue from bound one. I meant just extend bound workqueue as you mentioned below. ... >>> 2) Could be kernel workqueue used/fixed here instead? Basically all it needs >>> is to prefer submitting CPU, if it is busy just move work to another CPU. > > The problem, I suppose, is that w/ wq, it's either bound or completely > unbound. If bound, the local CPU can become the bottleneck. If > unbound, wq doesn't discern local and remote at all and thus loses any > benefit from locality association. > > It would be nice if workqueue can somehow accomodate the situation > better - maybe by migrating the worker to the issuing CPU before > setting it loose so that the scheduler needs to migrate it away > explicitly. Maybe we can do it opportunistically - e.g. record which > CPU an unbound worker was on before entering idle and queue to local > one if it exists. It wouldn't be trivial to implement tho. I'll > think more about it. Yes, something like this. dmcrypt basically should have parameter which says how it will use workqueue(s) IMHO three basic cases: 1) just use bound 1 wq per device. (this mode is usual for specific cases, some people have several dmcrypt devices with RAID on top - it was workaround to 1 thread per dmcrypt device in older kernel. This increased throughput but with recent kernel it does exact opposite... So this mode provides workaround for them.) 2) Use all CPU possible just prefer local CPU if available (something between bound and unbound wq) 3) the same as 2) just with limited # of CPU per crypt device. (but Mikulas' code uses also some batching of request - not sure how to incorporate this) Whatever, if logic is implemented in workqueue code, others can use t as well. I would really prefer not to have "too smart" dmcrypt... (Someone mentioned btrfs on top of it with all workqueues - how it can behave nicely if every layer will try to implement own smart logic.) Anyway, thanks for discussion, this is exactly what was missing here :) Milan -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel |
dm-crypt: parallel processing
Hello,
On Wed, Aug 22, 2012 at 12:28:41PM +0200, Milan Broz wrote: > Whatever, if logic is implemented in workqueue code, others can use t as well. > I would really prefer not to have "too smart" dmcrypt... > (Someone mentioned btrfs on top of it with all workqueues - how it can behave nicely > if every layer will try to implement own smart logic.) > > Anyway, thanks for discussion, this is exactly what was missing here :) I've been thinking about it and am still unsure. If there are enough use cases, we can try to manage unbound workers such that each CPU has some standby ones, but at this point such approach seems a bit overkill and I'm skeptical how useful a purely opportunistic approach would be. Another thing is that this is something which really belongs to the scheduler. The scheduler can know better and do things like this much better. Unfortunately, kthreads don't have mechanisms to be discovered in terms of its optimal association (as opposed to, say, autonuma for userland). So... I don't know. dm-crypt probably is the most extreme use case in kernel, so maybe going for specialized solution might not be too crazy. Also, how much of a problem is this? Is it really worth solving? Thanks. -- tejun -- dm-devel mailing list dm-devel@redhat.com https://www.redhat.com/mailman/listinfo/dm-devel |
| All times are GMT. The time now is 04:57 AM. |
VBulletin, Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.
Content Relevant URLs by vBSEO ©2007, Crawlability, Inc.