Linux Archive: Device-mapper Development
dm-crypt: parallel processing (http://www.linux-archive.org/device-mapper-development/696262-dm-crypt-parallel-processing.html)

Milan Broz 08-21-2012 09:37 AM

dm-crypt: parallel processing
 
On 08/21/2012 11:08 AM, Milan Broz wrote:
> Hello,
>
> I promised Mikulas that I would post the remainder of his dm-crypt parallel
> processing patchset (plus some related changes) with some comments.
>
> The problem:
>
> The current implementation (using a per-CPU struct) always performs
> encryption on the same CPU that submitted the IO.
>
> With very fast storage (usually an SSD or MD RAID0), one CPU core can
> become the bottleneck, and the throughput of the encrypted disk is worse
> than that of the underlying storage.
>
> The idea here is to distribute encryption across other CPU cores/CPUs.
>
> (A side effect of the patches is a nice cleanup of the dmcrypt code. :)

Better to add a cc for Tejun here; I still think there are several things
which perhaps should be done through the kernel wq...

(I would prefer to use the kernel wq as well, btw.)

Milan

>
> Mikulas Patocka (20):
> dm-crypt: remove per-cpu structure
> dm-crypt: use unbound workqueue for request processing
> dm-crypt: remove completion restart
> dm-crypt: use encryption threads
> dm-crypt: Unify spinlock
> dm-crypt: Introduce an option that sets the number of threads.
> dm-crypt: don't use write queue
> dm-crypt: simplify io queue
> dm-crypt: unify io_queue and crypt_queue
> dm-crypt: don't allocate pages for a partial request.
> dm-crypt: avoid deadlock in mempools
> dm-crypt: simplify cc_pending
> dm-crypt: merge convert_context and dm_crypt_io
> dm-crypt: move error handling to crypt_convert.
> dm-crypt: remove io_pending field
> dm-crypt: small changes
> dm-crypt: move temporary values to stack
> dm-crypt: offload writes to thread
> dm-crypt: retain write ordering
> dm-crypt: sort writes
>
> drivers/md/dm-crypt.c | 838 +++++++++++++++++++++++++++----------------------
> 1 file changed, 464 insertions(+), 374 deletions(-)
>
>
> My notes:
>
> I tested this extensively (on top of 3.6.0-rc2), and while I like the
> simplification and the main logic (if we have enough power, why not use
> other CPUs?), I see several problems here.
>
> 1) The implementation is not very useful on modern CPUs with hw-accelerated
> AES (with AES-NI, even one core can saturate very fast storage).
> (Some optimized crypto behaves similarly, e.g. the optimized twofish modules.)
>
> 2) The patchset mainly targets a linear access pattern and a single
> process generating IOs. (With more processes/CPUs generating IOs,
> you get parallel processing even with the current code.)
>
> I can see a significant improvement (~30%) for linear reads
> (if not using AES-NI) and if the underlying storage is the zero target
> (the ideal situation, removing the scheduler from the stack).
>
> For random access patterns (and more IO producers) I cannot find a
> reliable improvement except in the ideal zero-target case (where the
> improvement is always >30%). For more producers it doesn't help at all.
> I tested RAID0 over SATA disks and a very fast SSD on a quad-core CPU,
> with dmcrypt running 1, 3 or 4 threads (and with the cfq and noop
> schedulers), using a threaded fio test with 1 or 4 threads.
>
> Notes to implementation:
>
> 1) The last two patches (19/20) provide sorting of IO requests; this
> logically should be done in the IO scheduler.
>
> I don't think this belongs in dmcrypt; if the scheduler doesn't work
> properly, it should be fixed or tuned for the crypt access pattern.
>
> 2) Could the kernel workqueue be used/fixed here instead? Basically, all
> it needs is to prefer the submitting CPU and, if that is busy, just move
> the work to another CPU.
>
> 3) It doesn't honour CPU hotplugging. The number of CPUs is determined
> in the crypt constructor. If hotplugging is used for CPU power saving,
> it will not work properly.
>
> 4) With debugging options on, everything is extremely slow.
> (Perhaps a bug in some debug option; I just noticed this on Fedora
> rawhide with all debugging enabled.)
>
>
> I posted it mainly because I think this should move forward,
> in whatever direction...
>
> Milan
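
Note 3 above is concrete enough to sketch. In a 3.6-era kernel, honouring
CPU hotplug would mean registering a CPU notifier and recomputing the
thread count as CPUs come and go - a minimal sketch, assuming the hotplug
notifier API of that period; crypt_resize_threads() is a hypothetical
helper, not something from the patchset:

static int crypt_cpu_callback(struct notifier_block *nb,
                              unsigned long action, void *hcpu)
{
        switch (action & ~CPU_TASKS_FROZEN) {
        case CPU_ONLINE:
        case CPU_DEAD:
                /* recompute the thread count here instead of fixing it
                 * once in the crypt constructor */
                crypt_resize_threads(num_online_cpus());
                break;
        }
        return NOTIFY_OK;
}

static struct notifier_block crypt_cpu_nb = {
        .notifier_call = crypt_cpu_callback,
};

/* registered once at module init: register_cpu_notifier(&crypt_cpu_nb); */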


Mike Snitzer 08-21-2012 01:32 PM

dm-crypt: parallel processing
 
[cc'ing Chris Mason]

On Tue, Aug 21 2012 at 5:08am -0400,
Milan Broz <mbroz@redhat.com> wrote:

> Hello,
>
> I promised Mikulas that I would post the remainder of his dm-crypt parallel
> processing patchset (plus some related changes) with some comments.
>
> [..]
>
> I posted it mainly because I think this should move forward,
> in whatever direction...

I had a conversation with Chris Mason, maybe 2 years ago, where he
expressed concern about dm-crypt growing too much intelligence about
parallel CPU usage. I seem to recall Chris's concern being related to
btrfs and the requirement that IOs not get reordered (his point was that
any dm-crypt change to improve CPU utilization shouldn't introduce the
possibility of reordering IOs).

It could be that IO reordering isn't a concern for dm-crypt (before or
after this patchset). But I just wanted to make the point so that it is
kept in mind (it also allows Chris to raise any other new dm-crypt
issues/concerns he might have).
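
The ordering concern maps onto the tail of the series ("dm-crypt: offload
writes to thread", "dm-crypt: retain write ordering"): if encryption
completes out of order across CPUs, a single submission thread can restore
the original order. A hypothetical illustration of that idea - the
structures and names below are not taken from the patchset:

struct pending_write {
        struct list_head list;
        u64 seq;                /* assigned when the bio was submitted */
        struct bio *bio;
};

static u64 next_seq_to_issue;

/* run by the single writer thread over completed, encrypted bios */
static void issue_in_order(struct list_head *done)
{
        struct pending_write *pw, *tmp;
        bool progress = true;

        /* keep scanning until the next expected sequence number is
         * absent; bios encrypted out of order simply wait their turn */
        while (progress) {
                progress = false;
                list_for_each_entry_safe(pw, tmp, done, list) {
                        if (pw->seq != next_seq_to_issue)
                                continue;
                        list_del(&pw->list);
                        generic_make_request(pw->bio);
                        next_seq_to_issue++;
                        progress = true;
                }
        }
}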

Mike


Tejun Heo 08-21-2012 06:23 PM

dm-crypt: parallel processing
 
Hello,

(cc'ing Jens and Vivek, hi!)

On Tue, Aug 21, 2012 at 11:37:43AM +0200, Milan Broz wrote:
> Better to add a cc for Tejun here; I still think there are several things
> which perhaps should be done through the kernel wq...
>
> (I would prefer to use the kernel wq as well, btw.)

What do you mean by kernel wq? One of the system_*_wq's? If not,
from scanning the patch names, it seems like it's converting from a
bound workqueue to an unbound one.

> > 1) The last two patches (19/20) provide sorting of IO requests; this
> > logically should be done in the IO scheduler.
> >
> > I don't think this belongs in dmcrypt; if the scheduler doesn't work
> > properly, it should be fixed or tuned for the crypt access pattern.

I kinda agree, but preserving issuing order (not strictly, but at least
most of the time) across a stacking driver like dm probably isn't a bad
idea. I *think* the direction the block layer should be headed is to
reduce the amount of work it does as the speed and characteristics of
the underlying devices improve. We could afford to do a LOT of things to
better cater to devices with spindles, but the operating margin
continues to become narrower. Jens, Vivek, what do you guys think?

> > 2) Could the kernel workqueue be used/fixed here instead? Basically, all
> > it needs is to prefer the submitting CPU and, if that is busy, just move
> > the work to another CPU.

The problem, I suppose, is that w/ wq, it's either bound or completely
unbound. If bound, the local CPU can become the bottleneck. If
unbound, the wq doesn't distinguish local from remote at all and thus
loses any benefit of locality association.
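
For concreteness, the dichotomy is a single flag at allocation time. A
minimal sketch using the 3.6-era workqueue API - alloc_workqueue() and the
flags are real, while the queue names and max_active values are only
illustrative:

static int crypt_wq_sketch(void)
{
        struct workqueue_struct *bound_wq, *unbound_wq;

        /* bound (current dm-crypt behaviour): work items run on the
         * queueing CPU, so one fast submitter saturates one core */
        bound_wq = alloc_workqueue("kcryptd",
                                   WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM, 1);

        /* unbound (the patchset's direction): work may run anywhere,
         * gaining parallelism but losing submitter/worker locality */
        unbound_wq = alloc_workqueue("kcryptd_unbound",
                                     WQ_UNBOUND | WQ_MEM_RECLAIM,
                                     num_online_cpus());

        return (bound_wq && unbound_wq) ? 0 : -ENOMEM;
}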

It would be nice if the workqueue could somehow accommodate the
situation better - maybe by migrating the worker to the issuing CPU
before setting it loose, so that the scheduler has to migrate it away
explicitly. Maybe we can do it opportunistically - e.g. record which
CPU an unbound worker was on before entering idle and queue to the
local one if it exists. It wouldn't be trivial to implement, though.
I'll think more about it.
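
A rough sketch of that opportunistic heuristic - nothing below exists in
the workqueue code; it only illustrates "prefer an idle worker that last
ran on the submitting CPU":

struct unbound_worker {
        struct list_head idle_entry;
        int last_cpu;           /* recorded before the worker went idle */
};

static struct unbound_worker *pick_idle_worker(struct list_head *idle)
{
        struct unbound_worker *w, *fallback = NULL;
        int cpu = smp_processor_id();

        list_for_each_entry(w, idle, idle_entry) {
                if (w->last_cpu == cpu)
                        return w;       /* locality hit: cache-warm CPU */
                if (!fallback)
                        fallback = w;   /* otherwise take any idle worker */
        }
        return fallback;                /* NULL: caller must spawn/queue */
}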

Thanks.

--
tejun


Vivek Goyal 08-21-2012 07:26 PM

dm-crypt: parallel processing
 
On Tue, Aug 21, 2012 at 11:23:40AM -0700, Tejun Heo wrote:

[..]
> > > 1) The last two patches (19/20) provide sorting of IO requests; this
> > > logically should be done in the IO scheduler.
> > >
> > > I don't think this belongs in dmcrypt; if the scheduler doesn't work
> > > properly, it should be fixed or tuned for the crypt access pattern.
>
> I kinda agree, but preserving issuing order (not strictly, but at least
> most of the time) across a stacking driver like dm probably isn't a bad
> idea. I *think* the direction the block layer should be headed is to
> reduce the amount of work it does as the speed and characteristics of
> the underlying devices improve. We could afford to do a LOT of things to
> better cater to devices with spindles, but the operating margin
> continues to become narrower. Jens, Vivek, what do you guys think?

I think we can make various block layer functionalities optional so that
faster drivers can choose which ones they want. For example,
"queue/nomerges" disables merging. Maybe another sysfs tunable or a queue
flag (which the driver provides at registration time) for "nosorting"
could be used to reduce the amount of work the block layer does.

That way, slower devices can still make use of the fully loaded block
layer while faster devices can opt out of those features.
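
A sketch of that opt-out model: QUEUE_FLAG_NOMERGES and its queue/nomerges
sysfs file are real, while the QUEUE_FLAG_NOSORT analogue below is purely
hypothetical:

#define QUEUE_FLAG_NOSORT 20    /* hypothetical; value for illustration */

static void submit_sketch(struct request_queue *q, struct bio *bio)
{
        if (!blk_queue_nomerges(q)) {   /* real: honours queue/nomerges */
                /* ... try to merge the bio into an existing request ... */
        }
        if (!test_bit(QUEUE_FLAG_NOSORT, &q->queue_flags)) {
                /* ... elevator sorting / insertion ... */
        }
        /* ... dispatch to the driver ... */
}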

Thanks
Vivek


Milan Broz 08-22-2012 10:28 AM

dm-crypt: parallel processing
 
On 08/21/2012 08:23 PM, Tejun Heo wrote:
> Hello,
>
> (cc'ing Jens and Vivek, hi!)
>
> On Tue, Aug 21, 2012 at 11:37:43AM +0200, Milan Broz wrote:
>> Better to add a cc for Tejun here; I still think there are several things
>> which perhaps should be done through the kernel wq...
>>
>> (I would prefer to use the kernel wq as well, btw.)
>
> What do you mean by kernel wq? One of the system_*_wq's? If not,
> from scanning the patch names, it seems like it's converting from a
> bound workqueue to an unbound one.

I meant just extending the bound workqueue as you mentioned below.

...

>>> 2) Could the kernel workqueue be used/fixed here instead? Basically, all
>>> it needs is to prefer the submitting CPU and, if that is busy, just move
>>> the work to another CPU.
>
> The problem, I suppose, is that w/ wq, it's either bound or completely
> unbound. If bound, the local CPU can become the bottleneck. If
> unbound, the wq doesn't distinguish local from remote at all and thus
> loses any benefit of locality association.
>
> It would be nice if the workqueue could somehow accommodate the
> situation better - maybe by migrating the worker to the issuing CPU
> before setting it loose, so that the scheduler has to migrate it away
> explicitly. Maybe we can do it opportunistically - e.g. record which
> CPU an unbound worker was on before entering idle and queue to the
> local one if it exists. It wouldn't be trivial to implement, though.
> I'll think more about it.

Yes, something like this.

dmcrypt basically should have a parameter which says how it will use its
workqueue(s).

IMHO there are three basic cases:

1) Just use one bound wq per device. (This mode is usual for specific
cases; some people have several dmcrypt devices with RAID on top - it was
a workaround for the single thread per dmcrypt device in older kernels.
That increased throughput, but with recent kernels it does the exact
opposite... So this mode provides a workaround for them.)

2) Use all possible CPUs, just prefer the local CPU if it is available
(something between a bound and an unbound wq).

3) The same as 2), just with a limited number of CPUs per crypt device.

(But Mikulas' code also uses some batching of requests - I am not sure
how to incorporate that.)
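
A rough sketch of how those three cases could surface in the crypt
configuration - the enum and field names are illustrative, not from the
patchset:

enum crypt_wq_mode {
        CRYPT_WQ_BOUND,         /* 1) one bound wq per device */
        CRYPT_WQ_LOCAL_PREF,    /* 2) any CPU, but prefer the submitter */
        CRYPT_WQ_LIMITED,       /* 3) like 2), capped at num_threads */
};

struct crypt_config_sketch {
        enum crypt_wq_mode wq_mode;
        unsigned int num_threads;       /* used only by CRYPT_WQ_LIMITED */
};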

Whatever the case, if the logic is implemented in the workqueue code, others
can use it as well. I would really prefer not to have a "too smart"
dmcrypt... (Someone mentioned btrfs on top of it, with all its workqueues -
how can things behave nicely if every layer tries to implement its own
smart logic?)

Anyway, thanks for the discussion, this is exactly what was missing here :)

Milan


Tejun Heo 08-23-2012 08:15 PM

dm-crypt: parallel processing
 
Hello,

On Wed, Aug 22, 2012 at 12:28:41PM +0200, Milan Broz wrote:
> Whatever the case, if the logic is implemented in the workqueue code, others
> can use it as well. I would really prefer not to have a "too smart"
> dmcrypt... (Someone mentioned btrfs on top of it, with all its workqueues -
> how can things behave nicely if every layer tries to implement its own
> smart logic?)
>
> Anyway, thanks for the discussion, this is exactly what was missing here :)

I've been thinking about it and am still unsure. If there are enough
use cases, we can try to manage unbound workers such that each CPU has
some standby ones, but at this point such an approach seems a bit
overkill, and I'm skeptical about how useful a purely opportunistic
approach would be.

Another thing is that this really belongs to the scheduler. The
scheduler knows better and can do things like this much better.
Unfortunately, kthreads don't have mechanisms for their optimal
association to be discovered (as opposed to, say, autonuma for
userland).

So... I don't know. dm-crypt probably is the most extreme use case in
the kernel, so maybe going for a specialized solution might not be too
crazy. Also, how much of a problem is this? Is it really worth
solving?

Thanks.

--
tejun


