Old 07-16-2012, 08:28 AM
 
Default DM RAID: Add support for MD RAID10

On Mon, Jul 16, 2012 at 04:14:31PM +1000, NeilBrown wrote:
> On Fri, 13 Jul 2012 10:29:23 +0200 keld@keldix.com wrote:
>
> > On Fri, Jul 13, 2012 at 11:27:17AM +1000, NeilBrown wrote:
> > > On Fri, 13 Jul 2012 03:15:05 +0200 keld@keldix.com wrote:
> > >
> > Well, I have an idea for the odd number of devices:
> > Have the disks arranged in groups (for N=2 in pairs) and then the last group extended with
> > the leftover disks in the way it is done now.
> >
> > For 2 copies, this would be a number of pairs, and then a rest group of 3 disks.
> > For 3 copies, this would be a number of triplets, and then 4 or 5 disks in the last group.
>
> Certainly possible, but it feels clumsy. I'm not convinced it is a good idea.

Well, at this point it is only a proposal. I do think it preserves all the benefits of raid10,far -
especially the striping for reading - and I believe it brings redundancy beyond 1 drive, which is
currently zero for odd-drive arrays, up to almost raid-1+0 levels - in my mind this is as good as it can get theoretically.
My gut feeling is that it is a good design. I am excited about it :-)

Why do you feel it is clumsy? Because it is not as general as the current code, even though it would only take a few more lines to do?
We do gain a lot of redundancy for, say, 20 more lines of code.
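
To make it concrete, here is a sketch of how I imagine it for 5 drives with 2 copies
(one pair, then a rest group of 3 that keeps the current rotation):

    disk1 disk2 | disk3 disk4 disk5
     A1    A2   |  A3    A4    A5
     ..    ..   |  ..    ..    ..
     A2    A1   |  A5    A3    A4

Only drives within the same group can form a vulnerable pair, so we get 4 such pairs instead of 5.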

> > Especially that we should only advise the new layout, and there is no reason for the
> > current implementation except for backwards compatibility?
>
> The main reason for the current implementation is that is currently
> implemented.
> Until an alternate implementation exists, it seems pointless to recommend
> that people use it.

That is not the intention. The intention is for new implementers not
to implement the old scheme. That is, for new implementers to do it right.

> Maybe you are suggesting that dmraid should not support raid10-far or
> raid10-offset until the "new" approach is implemented.

I don't know. It may take a while to get it implemented as long as no seasoned
kernel hackers are working on it. As it is implemented now by Barrow, why not then go
forward as planned.

For the offset layout I don't have a good idea on how to improve the redundancy.
Maybe you or others have good ideas. Or is the offset layout an implementation
of a standard layout? Then there is not much ado. Except if we could find a layout that has
the same advantages but with better redundancy.

> Maybe that is sensible, but only if someone steps forwards and actually
> implements the "new" approach.

I would have hoped you would do it!

I am not a seasoned kernel hacker like you, but could you point me to the code where
the sequence of the blocks is done for far and offset? I would think it would only be a few places.

And then how to handle the two different layouts, and how to do a migration? Where would that be done?

Best regards
keld

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 07-16-2012, 10:06 PM
Brassow Jonathan
 
Default DM RAID: Add support for MD RAID10

On Jul 12, 2012, at 1:32 AM, NeilBrown wrote:

>>
>> @@ -76,6 +80,7 @@ static struct raid_type {
>> const unsigned algorithm; /* RAID algorithm. */
>> } raid_types[] = {
>> {"raid1", "RAID1 (mirroring)", 0, 2, 1, 0 /* NONE */},
>> + {"raid10", "RAID10 (striped mirrors)", 0, 2, 10, -1 /* Varies */},
>> {"raid4", "RAID4 (dedicated parity disk)", 1, 2, 5, ALGORITHM_PARITY_0},
>> {"raid5_la", "RAID5 (left asymmetric)", 1, 2, 5, ALGORITHM_LEFT_ASYMMETRIC},
>> {"raid5_ra", "RAID5 (right asymmetric)", 1, 2, 5, ALGORITHM_RIGHT_ASYMMETRIC},
>
> Initialising the "unsigned" algorithm to "-1" looks like it is asking for
> trouble.

I'm initializing to -1 because I know it is a value that will fail if not set later in the setup process. I could just as easily choose the default values (near = 2).

>
>> @@ -493,7 +554,7 @@ static int parse_raid_params(struct raid
>> */
>> value /= 2;
>>
>> - if (rs->raid_type->level < 5) {
>> + if (rs->raid_type->level != 5) {
>> rs->ti->error = "Inappropriate argument: stripe_cache";
>> return -EINVAL;
>> }
>
> This leaves RAID6 out in the cold. Maybe
> level < 5 || level > 6
> or !=5 !=6
> or a switch statement?

Thanks. For some reason I had thought that raid6 had level=5, like raid4... Fixed.



>> @@ -536,9 +605,25 @@ static int parse_raid_params(struct raid
>> if (dm_set_target_max_io_len(rs->ti, max_io_len))
>> return -EINVAL;
>>
>> - if ((rs->raid_type->level > 1) &&
>> - sector_div(sectors_per_dev, (rs->md.raid_disks - rs->raid_type->parity_devs))) {
>> + if (rs->raid_type->level == 10) {
>> + /* (Len * Stripes) / Mirrors */
>> + sectors_per_dev *= rs->md.raid_disks;
>> + if (sector_div(sectors_per_dev, raid10_copies)) {
>> + rs->ti->error = "Target length not divisible by number of data devices";
>> + return -EINVAL;
>> + }
>
> I'm not entirely sure what you are trying to do here, but I don't think it
> works.
>
> At the very least you would need to convert the "sectors_per_dev" number to a
> 'chunks_per_dev' before multiplying by raid_disks and dividing by copies.
>
> But even that isn't necessary. If you have a 3-device near=2 array with an
> odd number of chunks per device, that will still work. The last chunk won't
> be mirrored, so won't be used.
> Until a couple of weeks ago a recovery of the last device would trigger an
> error when we try to recover that last chunk, but that is fixed now.
>
> So if you want to impose this limitation (and there is some merit in making
> sure people don't confuse themselves), I suggest it get imposed in the
> user-space tools which create the RAID10.

I have seen what you do in calc_sectors(), I suppose I could bring that in. However, I'm simply trying to find the value that will be preliminarily assigned to mddev->dev_sectors. It is an enforced (perhaps unnecessary) constraint that the array length be divisible by the number of data devices. I'm not accounting for chunk boundaries - I leave that to userspace to configure and MD to catch.

brassow


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 07-16-2012, 10:53 PM
Brassow Jonathan
 
Default DM RAID: Add support for MD RAID10

On Jul 16, 2012, at 3:28 AM, keld@keldix.com wrote:

>>
>> Maybe you are suggesting that dmraid should not support raid10-far or
>> raid10-offset until the "new" approach is implemented.
>
> I don't know. It may take a while to get it implemented as long as no seasoned
> kernel hackers are working on it. As it is implemented now by Barrow, why not then go
> forward as planned.
>
> For the offset layout I don't have a good idea on how to improve the redundancy.
> Maybe you or others have good ideas. Or is the offset layout an implementation
> of a standard layout? Then there is not much ado. Except if we could find a layout that has
> the same advantages but with better redundancy.

Excuse me, s/Barrow/Brassow/ - my parents insist.

I've got a "simple" idea for improving the redundancy of the "far" algorithms. Right now, when calculating the device on which the far copy will go, we perform:
d += geo->near_copies;
d %= geo->raid_disks;
This effectively "shifts" the copy rows over by 'near_copies' (1 in the simple case), as follows:
disk1 disk2        disk1 disk2 disk3
===== =====   or   ===== ===== =====
 A1    A2           A1    A2    A3
 ..    ..           ..    ..    ..
 A2    A1           A3    A1    A2
For all odd numbers of 'far' copies, this is what we should do. However, for an even number of far copies, we should shift "near_copies + 1" - unless (far_copies == (raid_disks / near_copies)), in which case it should be simply "near_copies". This should provide maximum redundancy for all cases, I think. I will call the number of devices the copy is shifted the "device stride", or dev_stride. Here are a couple examples:
2-devices, near=1, far=2, offset=0/1: dev_stride = nc (SAME AS CURRENT ALGORITHM)

3-devices, near=1, far=2, offset=0/1: dev_stride = nc + 1. Layout changes as follows:
disk1 disk2 disk3
===== ===== =====
A1 A2 A3
.. .. ..
A2 A3 A1

4-devices, near=1, far=2, offset=0/1: dev_stride = nc + 1. Layout changes as follows:
disk1 disk2 disk3 disk4
===== ===== ===== =====
A1 A2 A3 A4
.. .. .. ..
A3 A4 A1 A2
This gives max redundancy for 3, 4, 5, etc far copies too, as long as each stripe that's copied is laid down at: d += geo->dev_stride * copy#; (where far=2, copy# would be 0 and 1. Far=3, copy# would be 0, 1, 2).
Here's a couple more quick examples to make that clear:
3-devices, near=1, far=3, offset=0/1: dev_stride = nc (SHOULD BE SAME AS CURRENT)
disk1 disk2 disk3
===== ===== =====
A1 A2 A3
.. .. ..
A3 A1 A2
.. .. ..
A2 A3 A1 -- Each copy "shifted" 'nc' from the last
.. .. ..

5-devices, near=1, far=4, offset=0/1: dev_stride = nc + 1. Layout changes to:
disk1 disk2 disk3 disk4 disk5
===== ===== ===== ===== =====
A1 A2 A3 A4 A5
.. .. .. .. ..
A4 A5 A1 A2 A3
.. .. .. .. ..
A2 A3 A4 A5 A1 -- Each copy "shifted" (nc + 1) from the last
.. .. .. .. ..
A5 A1 A2 A3 A4
.. .. .. .. ..

This should require a new bit in 'layout' (bit 17) to signify a different calculation in the way the copy device selection happens. We then need to replace 'd += geo->near_copies' with 'd += geo->dev_stride' and set dev_stride in 'setup_geo'. I'm not certain how much work it is beyond that, but I don't *think* it looks that bad and I'd be happy to do it.
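
Very roughly, and just to show the shape of the idea, here is a throwaway userspace sketch
(not the raid10.c patch itself - the rule for picking dev_stride is the one described above):

#include <stdio.h>

/* Userspace sketch only: print which chunk each disk holds in every copy row
 * when successive far copies are shifted by 'dev_stride' devices.  With
 * dev_stride == near_copies this reproduces the current layout. */
struct geom {
	int raid_disks;
	int far_copies;
	int dev_stride;
};

static void print_layout(const char *name, const struct geom *geo)
{
	int n = geo->raid_disks;

	printf("%s: %d disks, far=%d, dev_stride=%d\n",
	       name, n, geo->far_copies, geo->dev_stride);
	for (int copy = 0; copy < geo->far_copies; copy++) {
		for (int disk = 0; disk < n; disk++) {
			/* chunk whose copy number 'copy' lands on this disk */
			int chunk = ((disk - copy * geo->dev_stride) % n + n) % n;
			printf("  A%d", chunk + 1);
		}
		printf("\n");
	}
}

int main(void)
{
	struct geom current  = { .raid_disks = 4, .far_copies = 2, .dev_stride = 1 };
	struct geom proposed = { .raid_disks = 4, .far_copies = 2, .dev_stride = 2 };

	print_layout("current (stride = nc)", &current);
	print_layout("proposed (stride = nc + 1)", &proposed);
	return 0;
}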

So, should I allow the current "far" and "offset" in dm-raid, or should I simply allow "near" for now?

brassow


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 07-17-2012, 02:29 AM
NeilBrown
 
Default DM RAID: Add support for MD RAID10

On Mon, 16 Jul 2012 17:53:53 -0500 Brassow Jonathan <jbrassow@redhat.com>
wrote:

>
> On Jul 16, 2012, at 3:28 AM, keld@keldix.com wrote:
>
> >>
> >> Maybe you are suggesting that dmraid should not support raid10-far or
> >> raid10-offset until the "new" approach is implemented.
> >
> > I don't know. It may take a while to get it implemented as long as no seasoned
> > kernel hackers are working on it. As it is implemented now by Barrow, why not then go
> > forward as planned.
> >
> > For the offset layout I don't have a good idea on how to improve the redundancy.
> > Maybe you or others have good ideas. Or is the offset layout an implementation
> > of a standard layout? Then there is not much ado. Except if we could find a layout that has
> > the same advantages but with better redundancy.
>
> Excuse me, s/Barrow/Brassow/ - my parents insist.
>
> I've got a "simple" idea for improving the redundancy of the "far" algorithms. Right now, when calculating the device on which the far copy will go, we perform:
> d += geo->near_copies;
> d %= geo->raid_disks;
> This effectively "shifts" the copy rows over by 'near_copies' (1 in the simple case), as follows:
> disk1 disk2 or disk1 disk2 disk3
> ===== ===== ===== ===== =====
> A1 A2 A1 A2 A3
> .. .. .. .. ..
> A2 A1 A3 A1 A2
> For all odd numbers of 'far' copies, this is what we should do. However, for an even number of far copies, we should shift "near_copies + 1" - unless (far_copies == (raid_disks / near_copies)), in which case it should be simply "near_copies". This should provide maximum redundancy for all cases, I think. I will call the number of devices the copy is shifted the "device stride", or dev_stride. Here are a couple examples:
> 2-devices, near=1, far=2, offset=0/1: dev_stride = nc (SAME AS CURRENT ALGORITHM)
>
> 3-devices, near=1, far=2, offset=0/1: dev_stride = nc + 1. Layout changes as follows:
> disk1 disk2 disk3
> ===== ===== =====
> A1 A2 A3
> .. .. ..
> A2 A3 A1
>
> 4-devices, near=1, far=2, offset=0/1: dev_stride = nc + 1. Layout changes as follows:
> disk1 disk2 disk3 disk4
> ===== ===== ===== =====
> A1 A2 A3 A4
> .. .. .. ..
> A3 A4 A1 A2

Hi Jon,
This looks good for 4 devices, but I think it breaks down for e.g. 6 devices.

I think a useful measure is how many different pairs of devices exist such
that when both fail we lose data (thinking of far=2 layouts only). We want to
keep this number low. Call it the number of vulnerable pairs.

With the current layout with N devices, there are N pairs that are vulnerable.
(x and x+1 for each x). If N==2, the two pairs are 0,1 and 1,0. These
pairs are identical so there is only one vulnerable pair.

With your layout there are still N pairs (x and x+2), except when there are 4
devices (N=4): we get 0,2 1,3 2,0 3,1, in which case two of the pairs duplicate
others (0,2 == 2,0 and 1,3 == 3,1), leaving only 2 distinct vulnerable pairs.
With N=6 the 6 pairs are
0,2 1,3 2,4 3,5 4,0 5,1
and no two pairs are identical. So there is no gain.

The layout where data stored on device 'x' is mirrored on device 'x^1' has
N/2 vulnerable pairs.
An alternate way to gain this low level of vulnerability would be to mirror
data on X onto 'X+N/2'. This is the same as your arrangement for N==4.
For N==6 it would be:

A B C D E F
G H I J K L
....
D E F A B C
J K L G H I
...

so the vulnerable pairs are 0,3 1,4 2,5
This might be slightly easier to implement (as you suggest: have a
dev_stride, only set it to raid_disks/fc*nc).
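
For what it's worth, the pair counts above are easy to sanity-check with a few lines of
throwaway userspace code (nothing to do with the kernel side):

#include <stdbool.h>
#include <stdio.h>

/* far=2: the chunk whose first copy is on device x has its second copy on
 * device (x + stride) % n.  A "vulnerable pair" is a distinct pair of devices
 * whose simultaneous loss loses data. */
static int vulnerable_pairs(int n, int stride)
{
	bool seen[16][16] = { { false } };
	int count = 0;

	for (int x = 0; x < n; x++) {
		int y = (x + stride) % n;
		int a = x < y ? x : y;
		int b = x < y ? y : x;

		if (!seen[a][b]) {
			seen[a][b] = true;
			count++;
		}
	}
	return count;
}

int main(void)
{
	for (int n = 4; n <= 8; n += 2)
		printf("n=%d: stride 1 -> %d pairs, stride 2 -> %d, stride n/2 -> %d\n",
		       n, vulnerable_pairs(n, 1),
		       vulnerable_pairs(n, 2),
		       vulnerable_pairs(n, n / 2));
	return 0;
}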

>
> This should require a new bit in 'layout' (bit 17) to signify a different calculation in the way the copy device selection happens. We then need to replace 'd += geo->near_copies' with 'd += geo->dev_stride' and set dev_stride in 'setup_geo'. I'm not certain how much work it is beyond that, but I don't *think* it looks that bad and I'd be happy to do it.

I'm tempted to set bit 31 to mean "bits 0xFF are number of copies and bits
0xFF00 define the layout of those copies".. but just adding a bit17 probably
makes more sense.
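
For reference, the existing layout word only uses bits 0-16 (near copies, far copies and the
offset flag), so bit 17 is indeed the next one free. A throwaway decode sketch, with the
proposed bit under a placeholder name:

#include <stdio.h>

/* Userspace sketch of how the RAID10 'layout' word decodes today (cf.
 * setup_geo() in raid10.c); 'use_new_far' is only a placeholder for the
 * proposed bit 17, not an existing flag. */
struct geo_bits {
	unsigned near_copies;   /* bits 0-7  */
	unsigned far_copies;    /* bits 8-15 */
	unsigned far_offset;    /* bit 16    */
	unsigned use_new_far;   /* proposed bit 17 */
};

static struct geo_bits decode_layout(unsigned layout)
{
	struct geo_bits g = {
		.near_copies = layout & 0xff,
		.far_copies  = (layout >> 8) & 0xff,
		.far_offset  = !!(layout & (1 << 16)),
		.use_new_far = !!(layout & (1 << 17)),
	};
	return g;
}

int main(void)
{
	/* e.g. the value dm-raid builds for "near" with 2 copies: (1 << 8) | 2 */
	struct geo_bits g = decode_layout((1 << 8) | 2);

	printf("near=%u far=%u offset=%u new-far=%u\n",
	       g.near_copies, g.far_copies, g.far_offset, g.use_new_far);
	return 0;
}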

If you create and test a patch using the calculation I suggested, I'll be
happy to review it.

>
> So, should I allow the current "far" and "offset" in dm-raid, or should I simply allow "near" for now?

That's up to you. However it might be sensible not to rush into supporting
the current far and offset layouts until this conversation has run its course.

Thanks,
NeilBrown

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 07-17-2012, 02:34 AM
NeilBrown
 
Default DM RAID: Add support for MD RAID10

On Mon, 16 Jul 2012 17:06:32 -0500 Brassow Jonathan <jbrassow@redhat.com>
wrote:

>
> On Jul 12, 2012, at 1:32 AM, NeilBrown wrote:
>
> >>
> >> @@ -76,6 +80,7 @@ static struct raid_type {
> >> const unsigned algorithm; /* RAID algorithm. */
> >> } raid_types[] = {
> >> {"raid1", "RAID1 (mirroring)", 0, 2, 1, 0 /* NONE */},
> >> + {"raid10", "RAID10 (striped mirrors)", 0, 2, 10, -1 /* Varies */},
> >> {"raid4", "RAID4 (dedicated parity disk)", 1, 2, 5, ALGORITHM_PARITY_0},
> >> {"raid5_la", "RAID5 (left asymmetric)", 1, 2, 5, ALGORITHM_LEFT_ASYMMETRIC},
> >> {"raid5_ra", "RAID5 (right asymmetric)", 1, 2, 5, ALGORITHM_RIGHT_ASYMMETRIC},
> >
> > Initialising the "unsigned" algorithm to "-1" looks like it is asking for
> > trouble.
>
> I'm initializing to -1 because I know it is a value that will fail if not set later in the setup process. I could just as easily choose the default values (near = 2).

I don't much care what value you use as long as it is of the same type as the
variable you assign it to. So maybe UINT_MAX, or maybe '2'.


>
> >
> >> @@ -493,7 +554,7 @@ static int parse_raid_params(struct raid
> >> */
> >> value /= 2;
> >>
> >> - if (rs->raid_type->level < 5) {
> >> + if (rs->raid_type->level != 5) {
> >> rs->ti->error = "Inappropriate argument: stripe_cache";
> >> return -EINVAL;
> >> }
> >
> > This leaves RAID6 out in the cold. Maybe
> > level < 5 || level > 6
> > or !=5 !=6
> > or a switch statement?
>
> Thanks. For some reason I had thought that raid6 had level=5, like raid4... Fixed.

I love consistency ... or at least I think I would if only I could find it
somewhere!

>
>
>
> >> @@ -536,9 +605,25 @@ static int parse_raid_params(struct raid
> >> if (dm_set_target_max_io_len(rs->ti, max_io_len))
> >> return -EINVAL;
> >>
> >> - if ((rs->raid_type->level > 1) &&
> >> - sector_div(sectors_per_dev, (rs->md.raid_disks - rs->raid_type->parity_devs))) {
> >> + if (rs->raid_type->level == 10) {
> >> + /* (Len * Stripes) / Mirrors */
> >> + sectors_per_dev *= rs->md.raid_disks;
> >> + if (sector_div(sectors_per_dev, raid10_copies)) {
> >> + rs->ti->error = "Target length not divisible by number of data devices";
> >> + return -EINVAL;
> >> + }
> >
> > I'm not entirely sure what you are trying to do here, but I don't think it
> > works.
> >
> > At the very least you would need to convert the "sectors_per_dev" number to a
> > 'chunks_per_dev' before multiplying by raid_disks and dividing by copies.
> >
> > But even that isn't necessary. If you have a 3-device near=2 array with an
> > odd number of chunks per device, that will still work. The last chunk won't
> > be mirrored, so won't be used.
> > Until a couple of weeks ago a recovery of the last device would trigger an
> > error when we try to recover that last chunk, but that is fixed now.
> >
> > So if you want to impose this limitation (and there is some merit in making
> > sure people don't confuse themselves), I suggest it get imposed in the
> > user-space tools which create the RAID10.
>
> I have seen what you do in calc_sectors(), I suppose I could bring that in. However, I'm simply trying to find the value that will be preliminarily assigned to mddev->dev_sectors. It is an enforced (perhaps unnecessary) constraint that the array length be divisible by the number of data devices. I'm not accounting for chunk boundaries - I leave that to userspace to configure and MD to catch.

In that case I suggest you keep it out of dmraid.

It might make sense to check that the resulting array size matches what
user-space said to expect - or is at-least-as-big-as. However I don't see
much point in other size changes there.

Thanks,
NeilBrown

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 07-17-2012, 02:40 AM
NeilBrown
 
Default DM RAID: Add support for MD RAID10

On Mon, 16 Jul 2012 10:28:43 +0200 keld@keldix.com wrote:

> On Mon, Jul 16, 2012 at 04:14:31PM +1000, NeilBrown wrote:
> > On Fri, 13 Jul 2012 10:29:23 +0200 keld@keldix.com wrote:
> >
> > > On Fri, Jul 13, 2012 at 11:27:17AM +1000, NeilBrown wrote:
> > > > On Fri, 13 Jul 2012 03:15:05 +0200 keld@keldix.com wrote:
> > > >
> > > Well, I have an idea for the odd number of devices:
> > > Have the disks arranged in groups (for N=2 in pairs) and then the last group extended with
> > > the leftover disks in the way it is done now.
> > >
> > > For 2 copies, this would be a number of pairs, and then a rest group of 3 disks.
> > > For 3 copies, this would be a number of triplets, and then 4 or 5 disks in the last group.
> >
> > Certainly possible, but it feels clumsy. I'm not convinced it is a good idea.
>
> Well, at this point it is only a proposal. I do think it preserves all the benefits of raid10,far -
> especially the striping for reading - and I believe it brings redundancy beyond 1 drive, which is
> currently zero for odd-drive arrays, up to almost raid-1+0 levels - in my mind this is as good as it can get theoretically.
> My gut feeling is that it is a good design. I am excited about it :-)
>
> Why do you feel it is clumsy? Because it is not as general as the current code, even though it would only take a few more lines to do?
> We do gain a lot of redundancy for, say, 20 more lines of code.

Yes, because it is not general. That makes it hard to explain or describe.
That in turn makes mistakes easier.

That doesn't mean that I am dead-set against it, but I'm not sure I want to
encourage it.
I don't think RAID10 array with 'odd' numbers of drives are that common in
the first place.
Your suggestion would make no difference for 3 devices, a small difference for
5 (4 vulnerable pairs instead of 5) and only becomes significant with 7 or
more devices. Are people going to make arrays with 7 devices (with 5
vulnerable pairs) instead of 8 devices (with 4 vulnerable pairs)?

>
> > Maybe that is sensible, but only if someone steps forwards and actually
> > implements the "new" approach.
>
> I would have hoped you would do it!

In my copious spare time.


>
> I am not a seasoned kernel hacker like you, but could you point me to the code where
> the sequence of the blocks is done for far and offset? I would think it would only be a few places.
>
> And then how to handle the two different layouts, and how to do a migration? Where would that be done?

Let's see if Jon comes up with something.

NeilBrown

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 07-17-2012, 04:15 PM
Brassow Jonathan
 
Default DM RAID: Add support for MD RAID10

On Jul 16, 2012, at 9:34 PM, NeilBrown wrote:

>>>>
>>>> @@ -536,9 +605,25 @@ static int parse_raid_params(struct raid
>>>> if (dm_set_target_max_io_len(rs->ti, max_io_len))
>>>> return -EINVAL;
>>>>
>>>> - if ((rs->raid_type->level > 1) &&
>>>> - sector_div(sectors_per_dev, (rs->md.raid_disks - rs->raid_type->parity_devs))) {
>>>> + if (rs->raid_type->level == 10) {
>>>> + /* (Len * Stripes) / Mirrors */
>>>> + sectors_per_dev *= rs->md.raid_disks;
>>>> + if (sector_div(sectors_per_dev, raid10_copies)) {
>>>> + rs->ti->error = "Target length not divisible by number of data devices";
>>>> + return -EINVAL;
>>>> + }
>>>
>>> I'm not entirely sure what you are trying to do here, but I don't think it
>>> works.
>>>
>>> At the very least you would need to convert the "sectors_per_dev" number to a
>>> 'chunks_per_dev' before multiplying by raid_disks and dividing by copies.
>>>
>>> But even that isn't necessary. If you have a 3-device near=2 array with an
>>> odd number of chunks per device, that will still work. The last chunk won't
>>> be mirrored, so won't be used.
>>> Until a couple of weeks ago a recovery of the last device would trigger an
>>> error when we try to recover that last chunk, but that is fixed now.
>>>
>>> So if you want to impose this limitation (and there is some merit in making
>>> sure people don't confuse themselves), I suggest it get imposed in the
>>> user-space tools which create the RAID10.
>>
>> I have seen what you do in calc_sectors(), I suppose I could bring that in. However, I'm simply trying to find the value that will be preliminarily assigned to mddev->dev_sectors. It is an enforced (perhaps unnecessary) constraint that the array length be divisible by the number of data devices. I'm not accounting for chunk boundaries - I leave that to userspace to configure and MD to catch.
>
> In that case I suggest you keep it out of dmraid.
>
> It might make sense to check that the resulting array size matches what
> user-space said to expect - or is at-least-as-big-as. However I don't see
> much point in other size changes there.

I'm not changing any sizes. I'm finding a value for mddev->dev_sectors (what should I set it to if not the above?), which is the size of each device as computed from the total array size. If the value can't be computed evenly, then an error is given. I'm not sure what you are suggesting my alternative is when you say, "keep it out of dm-raid"... I think this is the least I can do in dm-raid.

brassow

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 07-17-2012, 08:30 PM
Brassow Jonathan
 
Default DM RAID: Add support for MD RAID10

On Jul 16, 2012, at 9:29 PM, NeilBrown wrote:

> The layout where data stored on device 'x' is mirrored on device 'x^1' has
> N/2 vulnerable pairs.
> An alternate way to gain this low level of vulnerability would be to mirror
> data on X onto 'X+N/2'. This is the same as your arrangement for N==4.
> For N==6 it would be:
>
> A B C D E F
> G H I J K L
> ....
> D E F A B C
> J K L G H I
> ...
>
> so the vulnerable pairs are 0,3 1,4 2,5
> This might be slightly easier to implement (as you suggest: have a
> dev_stride, only set it to raid_disks/fc*nc).

Yes, that looks nice. Plus, it doesn't have as many conditions as my idea.

>
>>
>> This should require a new bit in 'layout' (bit 17) to signify a different calculation in the way the copy device selection happens. We then need to replace 'd += geo->near_copies' with 'd += geo->dev_stride' and set dev_stride in 'setup_geo'. I'm not certain how much work it is beyond that, but I don't *think* it looks that bad and I'd be happy to do it.
>
> I'm tempted to set bit 31 to mean "bits 0xFF are number of copies and bits
> 0xFF00 define the layout of those copies".. but just adding a bit17 probably
> makes more sense.
>
> If you create and test a patch using the calculation I suggested, I'll be
> happy to review it.

ok. Might come after my current dm-raid patch (that will skip over the current "far" implementations).

If the 17th bit makes it simple to change the 'far' algorithms, that's probably the right idea. We could save the very last bits for when we need a completely changed view of how the variable is used. I don't think we are there yet.

brassow

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 07-17-2012, 08:43 PM
Jonathan Brassow
 
Default DM RAID: Add support for MD RAID10

Neil,

I've pulled out the 'far' and 'offset' format options for now. Also
note that I haven't changed the 'sectors_per_dev' (aka
mddev->dev_sectors) calculations...

brassow

dm raid: add md raid10 support

Support the MD RAID10 personality through dm-raid.c

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>

Index: linux-upstream/drivers/md/dm-raid.c
===================================================================
--- linux-upstream.orig/drivers/md/dm-raid.c
+++ linux-upstream/drivers/md/dm-raid.c
@@ -11,6 +11,7 @@
#include "md.h"
#include "raid1.h"
#include "raid5.h"
+#include "raid10.h"
#include "bitmap.h"

#include <linux/device-mapper.h>
@@ -52,7 +53,10 @@ struct raid_dev {
#define DMPF_MAX_RECOVERY_RATE 0x20
#define DMPF_MAX_WRITE_BEHIND 0x40
#define DMPF_STRIPE_CACHE 0x80
-#define DMPF_REGION_SIZE 0X100
+#define DMPF_REGION_SIZE 0x100
+#define DMPF_RAID10_COPIES 0x200
+#define DMPF_RAID10_FORMAT 0x400
+
struct raid_set {
struct dm_target *ti;

@@ -76,6 +80,7 @@ static struct raid_type {
const unsigned algorithm; /* RAID algorithm. */
} raid_types[] = {
{"raid1", "RAID1 (mirroring)", 0, 2, 1, 0 /* NONE */},
+ {"raid10", "RAID10 (striped mirrors)", 0, 2, 10, UINT_MAX /* Varies */},
{"raid4", "RAID4 (dedicated parity disk)", 1, 2, 5, ALGORITHM_PARITY_0},
{"raid5_la", "RAID5 (left asymmetric)", 1, 2, 5, ALGORITHM_LEFT_ASYMMETRIC},
{"raid5_ra", "RAID5 (right asymmetric)", 1, 2, 5, ALGORITHM_RIGHT_ASYMMETRIC},
@@ -86,6 +91,17 @@ static struct raid_type {
{"raid6_nc", "RAID6 (N continue)", 2, 4, 6, ALGORITHM_ROTATING_N_CONTINUE}
};

+static unsigned raid10_md_layout_to_copies(int layout)
+{
+ return layout & 0xFF;
+}
+
+static int raid10_format_to_md_layout(char *format, unsigned copies)
+{
+ /* 1 "far" copy, and 'copies' "near" copies */
+ return (1 << 8) | (copies & 0xFF);
+}
+
static struct raid_type *get_raid_type(char *name)
{
int i;
@@ -339,10 +355,16 @@ static int validate_region_size(struct r
* [max_write_behind <sectors>] See '-write-behind=' (man mdadm)
* [stripe_cache <sectors>] Stripe cache size for higher RAIDs
* [region_size <sectors>] Defines granularity of bitmap
+ *
+ * RAID10-only options:
+ * [raid10_copies <# copies>] Number of copies. (Default: 2)
+ * [raid10_format <near>] Layout algorithm. (Default: near)
*/
static int parse_raid_params(struct raid_set *rs, char **argv,
unsigned num_raid_params)
{
+ char *raid10_format = "near";
+ unsigned raid10_copies = 2;
unsigned i, rebuild_cnt = 0;
unsigned long value, region_size = 0;
sector_t sectors_per_dev = rs->ti->len;
@@ -416,11 +438,28 @@ static int parse_raid_params(struct raid
}

key = argv[i++];
+
+ /* Parameters that take a string value are checked here. */
+ if (!strcasecmp(key, "raid10_format")) {
+ if (rs->raid_type->level != 10) {
+ rs->ti->error = "'raid10_format' is an invalid parameter for this RAID type";
+ return -EINVAL;
+ }
+ if (strcmp("near", argv[i])) {
+ rs->ti->error = "Invalid 'raid10_format' value given";
+ return -EINVAL;
+ }
+ raid10_format = argv[i];
+ rs->print_flags |= DMPF_RAID10_FORMAT;
+ continue;
+ }
+
if (strict_strtoul(argv[i], 10, &value) < 0) {
rs->ti->error = "Bad numerical argument given in raid params";
return -EINVAL;
}

+ /* Parameters that take a numeric value are checked here */
if (!strcasecmp(key, "rebuild")) {
rebuild_cnt++;
rs->ti->error = NULL;
@@ -436,6 +475,7 @@ static int parse_raid_params(struct raid
if (rebuild_cnt > rs->raid_type->parity_devs)
rs->ti->error = "Too many rebuild devices specified for given RAID type";
break;
+ case 10:
default:
DMERR("The rebuild parameter is not supported for %s", rs->raid_type->name);
rs->ti->error = "Rebuild not supported for this RAID type";
@@ -493,7 +533,8 @@ static int parse_raid_params(struct raid
*/
value /= 2;

- if (rs->raid_type->level < 5) {
+ if ((rs->raid_type->level != 5) &&
+ (rs->raid_type->level != 6)) {
rs->ti->error = "Inappropriate argument: stripe_cache";
return -EINVAL;
}
@@ -518,6 +559,14 @@ static int parse_raid_params(struct raid
} else if (!strcasecmp(key, "region_size")) {
rs->print_flags |= DMPF_REGION_SIZE;
region_size = value;
+ } else if (!strcasecmp(key, "raid10_copies") &&
+ (rs->raid_type->level == 10)) {
+ if ((value < 2) || (value > 0xFF)) {
+ rs->ti->error = "Bad value for 'raid10_copies'";
+ return -EINVAL;
+ }
+ rs->print_flags |= DMPF_RAID10_COPIES;
+ raid10_copies = value;
} else {
DMERR("Unable to parse RAID parameter: %s", key);
rs->ti->error = "Unable to parse RAID parameters";
@@ -536,9 +585,25 @@ static int parse_raid_params(struct raid
if (dm_set_target_max_io_len(rs->ti, max_io_len))
return -EINVAL;

- if ((rs->raid_type->level > 1) &&
- sector_div(sectors_per_dev, (rs->md.raid_disks - rs->raid_type->parity_devs))) {
+ if (rs->raid_type->level == 10) {
+ /* (Len * Stripes) / Mirrors */
+ sectors_per_dev *= rs->md.raid_disks;
+ if (sector_div(sectors_per_dev, raid10_copies)) {
+ rs->ti->error = "Target length not divisible by number of data devices";
+ return -EINVAL;
+ }
+ if (raid10_copies > rs->md.raid_disks) {
+ rs->ti->error = "Not enough devices to satisfy specification";
+ return -EINVAL;
+ }
+ rs->md.layout = raid10_format_to_md_layout(raid10_format,
+ raid10_copies);
+ rs->md.new_layout = rs->md.layout;
+ } else if ((rs->raid_type->level > 1) &&
+ sector_div(sectors_per_dev,
+ (rs->md.raid_disks - rs->raid_type->parity_devs))) {
rs->ti->error = "Target length not divisible by number of data devices";
+
return -EINVAL;
}
rs->md.dev_sectors = sectors_per_dev;
@@ -564,6 +629,9 @@ static int raid_is_congested(struct dm_t
if (rs->raid_type->level == 1)
return md_raid1_congested(&rs->md, bits);

+ if (rs->raid_type->level == 10)
+ return md_raid10_congested(&rs->md, bits);
+
return md_raid5_congested(&rs->md, bits);
}

@@ -882,6 +950,9 @@ static int analyse_superblocks(struct dm
case 6:
redundancy = rs->raid_type->parity_devs;
break;
+ case 10:
+ redundancy = raid10_md_layout_to_copies(mddev->layout) - 1;
+ break;
default:
ti->error = "Unknown RAID type";
return -EINVAL;
@@ -1201,6 +1272,13 @@ static int raid_status(struct dm_target
DMEMIT(" region_size %lu",
rs->md.bitmap_info.chunksize >> 9);

+ if (rs->print_flags & DMPF_RAID10_COPIES)
+ DMEMIT(" raid10_copies %u",
+ raid10_md_layout_to_copies(rs->md.layout));
+
+ if (rs->print_flags & DMPF_RAID10_FORMAT)
+ DMEMIT(" raid10_format near");
+
DMEMIT(" %d", rs->md.raid_disks);
for (i = 0; i < rs->md.raid_disks; i++) {
if (rs->dev[i].meta_dev)
@@ -1275,7 +1353,7 @@ static void raid_resume(struct dm_target

static struct target_type raid_target = {
.name = "raid",
- .version = {1, 2, 0},
+ .version = {1, 3, 0},
.module = THIS_MODULE,
.ctr = raid_ctr,
.dtr = raid_dtr,
@@ -1302,6 +1380,8 @@ module_init(dm_raid_init);
module_exit(dm_raid_exit);

MODULE_DESCRIPTION(DM_NAME " raid4/5/6 target");
+MODULE_ALIAS("dm-raid1");
+MODULE_ALIAS("dm-raid10");
MODULE_ALIAS("dm-raid4");
MODULE_ALIAS("dm-raid5");
MODULE_ALIAS("dm-raid6");
Index: linux-upstream/Documentation/device-mapper/dm-raid.txt
===================================================================
--- linux-upstream.orig/Documentation/device-mapper/dm-raid.txt
+++ linux-upstream/Documentation/device-mapper/dm-raid.txt
@@ -27,6 +27,10 @@ The target is named "raid" and it accept
- rotating parity N (right-to-left) with data restart
raid6_nc RAID6 N continue
- rotating parity N (right-to-left) with data continuation
+ raid10 Various RAID10 inspired algorithms chosen by additional params
+ - RAID10: Striped Mirrors (aka 'Striping on top of mirrors')
+ - RAID1E: Integrated Adjacent Stripe Mirroring
+ - and other similar RAID10 variants

Reference: Chapter 4 of
http://www.snia.org/sites/default/files/SNIA_DDF_Technical_Position_v2.0.pdf
@@ -59,6 +63,28 @@ The target is named "raid" and it accept
logical size of the array. The bitmap records the device
synchronisation state for each region.

+ [raid10_copies <# copies>]
+ [raid10_format near]
+ These two options are used to alter the default layout of
+ a RAID10 configuration. The number of copies can be
+ specified, but the default is 2. There are other variations
+ to how the copies are laid down - the default and only current
+ option is "near". Near copies are what most people think of
+ with respect to mirroring. If these options are left
+ unspecified, or 'raid10_copies 2' and/or 'raid10_format near'
+ are given, then the layouts for 2, 3 and 4 devices are:
+    2 drives       3 drives          4 drives
+    --------       ----------        --------------
+    A1  A1         A1  A1  A2        A1  A1  A2  A2
+    A2  A2         A2  A3  A3        A3  A3  A4  A4
+    A3  A3         A4  A4  A5        A5  A5  A6  A6
+    A4  A4         A5  A6  A6        A7  A7  A8  A8
+    ..  ..         ..  ..  ..        ..  ..  ..  ..
+ The 2-device layout is equivalent to 2-way RAID1. The
+ layout is what a traditional RAID10 would look like. The
+ 3-device layout is what might be called a 'RAID1E - Integrated
+ Adjacent Stripe Mirroring'.
+
<#raid_devs>: The number of devices composing the array.
Each device consists of two entries. The first is the device
containing the metadata (if any); the second is the one containing the
data.
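
For illustration, a table line exercising the new type might look something like this
(a sketch only, following the parameter format documented above; the length, chunk size
and device numbers are placeholders):

0 1960893648 raid \
        raid10 5 2048 raid10_copies 2 raid10_format near \
        4 - 8:17 - 8:33 - 8:49 - 8:65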

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 07-18-2012, 01:11 AM
NeilBrown
 
Default DM RAID: Add support for MD RAID10

On Tue, 17 Jul 2012 11:15:56 -0500 Brassow Jonathan <jbrassow@redhat.com>
wrote:

>
> On Jul 16, 2012, at 9:34 PM, NeilBrown wrote:
>
> >>>>
> >>>> @@ -536,9 +605,25 @@ static int parse_raid_params(struct raid
> >>>> if (dm_set_target_max_io_len(rs->ti, max_io_len))
> >>>> return -EINVAL;
> >>>>
> >>>> - if ((rs->raid_type->level > 1) &&
> >>>> - sector_div(sectors_per_dev, (rs->md.raid_disks - rs->raid_type->parity_devs))) {
> >>>> + if (rs->raid_type->level == 10) {
> >>>> + /* (Len * Stripes) / Mirrors */
> >>>> + sectors_per_dev *= rs->md.raid_disks;
> >>>> + if (sector_div(sectors_per_dev, raid10_copies)) {
> >>>> + rs->ti->error = "Target length not divisible by number of data devices";
> >>>> + return -EINVAL;
> >>>> + }
> >>>
> >>> I'm not entirely sure what you are trying to do here, but I don't think it
> >>> works.
> >>>
> >>> At the very least you would need to convert the "sectors_per_dev" number to a
> >>> 'chunks_per_dev' before multiplying by raid_disks and dividing by copies.
> >>>
> >>> But even that isn't necessary. If you have a 3-device near=2 array with an
> >>> odd number of chunks per device, that will still work. The last chunk won't
> >>> be mirrored, so won't be used.
> >>> Until a couple of weeks ago a recovery of the last device would trigger an
> >>> error when we try to recover that last chunk, but that is fixed now.
> >>>
> >>> So if you want to impose this limitation (and there is some merit in making
> >>> sure people don't confuse themselves), I suggest it get imposed in the
> >>> user-space tools which create the RAID10.
> >>
> >> I have seen what you do in calc_sectors(), I suppose I could bring that in. However, I'm simply trying to find the value that will be preliminarily assigned to mddev->dev_sectors. It is an enforced (perhaps unnecessary) constraint that the array length be divisible by the number of data devices. I'm not accounting for chunk boundaries - I leave that to userspace to configure and MD to catch.
> >
> > In that case I suggest you keep it out of dmraid.
> >
> > It might make sense to check that the resulting array size matches what
> > user-space said to expect - or is at-least-as-big-as. However I don't see
> > much point in other size changes there.
>
> I'm not changing any sizes. I'm finding a value for mddev->dev_sectors (what should I set it to if not the above?), which is the size of each device as computed from the total array size. If the value can't be computed evenly, then an error is given. I'm not sure what you are suggesting my alternative is when you say, "keep it out of dm-raid"... I think this is the least I can do in dm-raid.

Sorry, I mean 'checks', not 'changes'.

I was confused a bit by the code. It might be clearer to have

sectors_per_dev = ti->len * rs->md.raid_disks;
remainder = sector_div(sectors_per_dev, raid10_copies);

In almost all cases raid10_copies will be 2 and ti->len will be a multiple of
8 (as we don't support chunk sizes less than 4K), so remainder will be 0.
However this doesn't really prove anything. It is still entirely possible
that the resulting 'sectors_per_dev' will not result in a RAID10 which has
the full ti->len sectors required.
So your test is not sufficient.

Also your calculation is incorrect as it doesn't allow for the possibility of
unused space in the array.
A 3-drive 2-copy RAID10 with an odd number of chunks will not use the last
chunk on the last device. In that case the above calculation will result in
a sectors_per_dev that is too small.

I think your sectors_per_dev calculation should round up to a whole number of
chunks.

sectors_per_dev = DIV_ROUND_UP_SECTOR_T(ti->len * rs->md.raid_disks,
raid10_copies);
sectors_per_dev = round_up(sectors_per_dev, rs->md.chunk_sectors);

And then after calling md_run(), you should check that
pers->size(mddev, 0, 0) == ti->len
and abort if not.
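
To put numbers on the unused-space point above (throwaway userspace arithmetic, not a patch):

#include <stdio.h>

/* 3 drives, 2 copies, counting in chunks rather than sectors. */
int main(void)
{
	int disks = 3, copies = 2;

	/* An odd number of chunks per device leaves one chunk slot unused. */
	int chunks_per_dev = 5;
	int raw  = disks * chunks_per_dev;   /* 15 chunk slots           */
	int data = raw / copies;             /* only 7 data chunks fit   */
	printf("%d chunks/dev -> %d data chunks, %d slot unused\n",
	       chunks_per_dev, data, raw % copies);

	/* Going the other way: to hold 8 data chunks, round the per-device
	 * size up to a whole number of chunks or the array comes out short. */
	int want = 8;
	int per_dev = (want * copies + disks - 1) / disks;   /* ceil -> 6 */
	printf("%d data chunks need %d chunks per device\n", want, per_dev);
	return 0;
}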

Note that getting the sectors_per_dev right is really really important for
RAID10-far. For other layouts and levels a smaller sectors_per_dev will just
result in less data being available.
For raid10-far, sectors_per_dev is included in the layout calculations so
changing it will change the location of data and could result in corruption.

NeilBrown
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
