
Hans de Goede 11-26-2009 08:31 AM

ANNOUNCE: mdadm 3.1.1 - A tool for managing Soft RAID under Linux
 
Hi Doug,

That is a lot of information in there; let me try to summarize it,
and please let me know if I've missed anything:

1) The default chunk size for raid4/5/6 is changing; this should
not be a problem, as we do not specify a chunk size when creating
new arrays.

2) The default bitmap chunk size changed; again, not a problem, as
we don't use bitmaps in anaconda at the moment.

3) We need to change our current practice of not using a bitmap: we should
use a bitmap by default, except when the array will be used for /boot or swap.

Questions:
1) What command-line option should we pass to "mdadm --create" to
achieve this?

4) We need to start specifying a superblock version, and preferably
version 1.1

5) Specifying a superblock version of 1.1 will render systems non-bootable.
I assume this only applies to systems which have a raid1 /boot, so I guess
that we need to specify a superblock version of 1.1, except when the raid
set will be used for /boot, where we should keep using 0.90.

Questions:
1) Is the above correct?

6) When creating 1.1 superblock sets we need to pass in:
--homehost=<hostname>
--name=<devicename>
-e{1.0,1.1,1.2}

Questions
1) Currently, when creating a set, we do, for example:
mdadm --create /dev/md0 --run --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

What would this look like with the new mdadm? In particular, what would
happen to the /dev/md0 argument?

If we can still specify which minor to use when creating a new array, even though
that minor may change after the first reboot, then the changes needed to the
installer are minimal, and we can likely do this without problems for RHEL-6.

Regards,

Hans






On 11/26/2009 03:59 AM, Doug Ledford wrote:

Please keep me on the Cc: as I'm not on this list.

Upstream recently released mdadm-3.1.1, which I intend to include in
Fedora soon. It finally updates three default settings that should have
been updated a long time ago.

The default chunk size for raid4/5/6 is now 512K. Anaconda needs to be
updated to either leave the default alone or use 512K itself. In the
past it has passed in 256K, but extensive performance testing shows that
512K is indeed the sweet spot on pretty much any SATA device, and since
SATA makes up the overwhelming majority of the disks we run on today, its
sweet spot should be our default.

It updates the default bitmap chunk to be at least 65536K when using an
internal bitmap. Performance tests showed as much as a 10% performance
penalty for the old default bitmap chunk (8192K). The new bitmap chunk
reduces that performance penalty (although we don't have solid numbers
on how much...I'll work on that). However, we've never used a bitmap by
default on any arrays we create. That needs to change. The simple
logic is this: no bitmap on /boot or any swap partitions; use a bitmap
on anything else. If we need a bitmap chunk other than the default,
I'll follow up here.

It updates the default superblock format from the old, antiquated,
deprecated version 0.90 superblock that we should have quit using years
ago to version 1.1. This is the real kicker. Since anaconda has never
actively set the superblock metadata version (even though we should have
been using 1.1 long ago), it's now going to have to start. The reason
is that unless you upgrade machines to use an md raid aware boot loader
(such as grub2 for x86, although I have no idea what would work on non-x86
arches), version 1.1 superblocks will render all installs unbootable.
More importantly though, unless the anaconda team decides to blindly set
all superblocks back to the old version 0.90 format, this change
necessitates more than just a change to controlling which version of 1.x
superblock we use on any given array, but also a change to how we create
and name arrays in general. Version 0.90 superblocks are from back in
the day when we thought it was smart/reasonable to name arrays by number
and to mount scsi devices in fstab by their /dev/ entry. That day is
long gone, dead and buried. We switched filesystems to mount
by label so they are immune to device number changes, and similarly,
version 1.x superblocks do away entirely with the preferred-minor field
in the superblock. Instead, they have a homehost and name field that
are used to control device *naming*, not numbering, and in a properly
running version 1.x superblock system, the device numbers are not
guaranteed to be static from boot to boot (although they usually are).
This doesn't appear to be much of a problem for dracut, but as an example,
I'm attaching the mkinitrd patch I have to apply to an F11 system after
every mkinitrd update in order to get initrd images that mount by name
properly.

So, those are the major differences. Switching to any of the version
1.x superblocks necessitates that anaconda pass a few arguments that it
hasn't in the past. Right now, these are the things anaconda is going
to need to start passing in on any mdadm create commands (that I don't
currently believe it does, but I haven't checked and could be wrong):

--homehost=<hostname>
--name=<devicename>
-e{1.0,1.1,1.2}

In addition, we should start passing the bitmap option as I outlined above.
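
For example, purely as an illustration (the device names, hostname, and
array name here are just placeholders, not anything agreed in this thread),
such a create command would look something like:

mdadm --create /dev/md/home --run --level=1 --raid-devices=2 \
    -e 1.1 --homehost=myhost.example.com --name=home \
    --bitmap=internal /dev/sda3 /dev/sdb3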

We will also likely need to set the HOMEHOST entry in mdadm.conf and
possibly the AUTO entry in mdadm.conf as well.
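
For illustration only (the hostname is a placeholder, and the exact AUTO
policy is something that would still need to be decided), those mdadm.conf
entries might look roughly like:

HOMEHOST myhost.example.com
AUTO +1.x homehost -all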

And this brings me to a different point. Hans asked me to comment on
bz537329. I would suggest people look at my comments there for some
additional explanation of why ideas like trying to make things work
without mdadm.conf are probably a bad idea.

So here are a few additional things that I think are worth taking into
consideration.

If an array is listed in mdadm.conf, then *every* item on the array line
must match the array or else it will fail to start. This means that
ARRAY lines that list things that can change by using mdadm --grow to
change aspects of the array can result in the array failing to be found
on the next reboot. Therefore, it would be best if each new ARRAY line
we write includes nothing besides the name of the array, the metadata
version, and the UUID.
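
For example (the array name and UUID below are just placeholders), such a
minimal ARRAY line would be along the lines of:

ARRAY /dev/md/root metadata=1.1 UUID=01234567:89abcdef:01234567:89abcdef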

If an array is listed in mdadm.conf, then both the --homehost and --name
settings will be overridden by the name in the mdadm.conf file, so do
not depend on either having an effect for arrays listed in mdadm.conf.

However, homehost and name are both used heavily any time the array is
not listed in mdadm.conf so setting them correctly is still important.
There are a number of common scenarios that make this important: when you
are carrying an array from machine to machine (like an external drive tower
or a raid1 usb flash drive), when an array is visible to multiple
hosts (like arrays built over SAN devices), or when you've built a
machine to replace an existing machine and temporarily install the
drives from the machine being replaced in the new machine to copy data
across, in which case you are starting both your new array and the old
array on the same machine. They are also relied upon heavily in order
to attempt to satisfy those people that think the md raid stack should
work without any mdadm.conf file at all. And there is a special case
exception in the name field that is used to attempt to preserve backward
compatibility. The intersection of all these attempts to satisfy
various needs is tricky. Here's how names are determined:

1) If the array is identified in mdadm.conf, the name from the ARRAY
line is used.
2) If HOMEHOST has been set in the config:
   a) If the array uses a version 0.90 superblock, check to see if the
      HOMEHOST has been encoded in the UUID via hash. If not, treat as
      foreign; if so, treat as local.
   b) For version 1.x superblocks, check the homehost in the superblock
      against the set homehost. If they match, treat as local; else, if the
      homehost in the superblock is not empty, treat as named foreign; else
      treat as foreign.
3) Otherwise:
   a) For version 0.90 superblocks, treat the array as foreign.
   b) For 1.x, if homehost is set, then named foreign; else foreign.

In case #1, the name as it appears in the file is used. In the remainder of the
cases, local means to attempt to create the array with the requested
number (in the case of 0.90 superblocks) or requested name (in the case
of version 1.x superblocks). Foreign means that the array will be
started with the requested name + a suffix. For example, version 0.90
superblock with preferred-minor of 0 would get created with a random
*actual* minor number and the name /dev/md0_0 or md0_1 if md0_0 already
exists, etc. A version 1.x superblock with the name root would get
created as /dev/md/root_0. Named foreign is used whenever a version 1.x
superblock can't be identified as local but it has a valid homehost
entry in the superblock. The format attempt is /dev/md/homehost:name so
that if you were to mount an array from workstation2:root on
workstation1, it would be /dev/md/workstation2:root.

There is a special exception for version 1.x superblock arrays. If the
name field of the superblock contains a specially formatted name, then
it will be treated as a request to create the device with a given minor
number and name identical to an old version 0.90 superblock array.
Those special case names are:
a) a bare number (aka, 0)
b) a bare name using standard number format (aka, md0 or md_d0)
c) a full name using standard number format (aka, /dev/md0 or /dev/md_d0)

If an array uses a name instead of a number, then the named entry
created in /dev/md/ will be a symlink to a random numeric md device in
/dev/. For example, /dev/md/root, since it's the first device started
and since we start grabbing md devices at 127 and counting backwards
when starting named devices, will almost always point to /dev/md127.
The /dev/md127 file will be the real device file while the entries in
/dev/md/ are always symlinks. This is in order to be consistent with
the fact that our /sys/block entry will be md127 and our entry in
/proc/mdstat will also be md127. This is because the current /sys/block
setup does not allow /sys/block/md/root, only md<number>.
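
To illustrate (output abbreviated and only indicative, not captured from a
real system), listing the named entry on such a system would show something
like:

ls -l /dev/md/root
lrwxrwxrwx 1 root root 8 ... /dev/md/root -> ../md127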




_______________________________________________
Anaconda-devel-list mailing list
Anaconda-devel-list@redhat.com
https://www.redhat.com/mailman/listinfo/anaconda-devel-list



Doug Ledford 11-28-2009 12:02 AM

ANNOUNCE: mdadm 3.1.1 - A tool for managing Soft RAID under Linux
 
On 11/26/2009 04:31 AM, Hans de Goede wrote:
> Hi Doug,
>
> That is a lot of information in there, let me try to summarize it
> and please let me know if I've missed anything:
>
> 1) The default chunksize for raid4/5/6 is changing, this should
> not be a problem as we do not specify a chunksize when creating
> new arrays

I thought we did specify a chunksize. Oh well, that just means our
default raid array performance will improve dramatically. The old
default of 64k was horrible for performance relative to the new 512k
default.

              4 disks on MB        5 disks on MB        4 disks on PM
              write     read       write     read       write     read
64K chunk     509.373   388.870    403.947   370.963    103.743    61.127
512K chunk    502.123   498.510    460.817   487.720    113.897   111.980

MB = Motherboard ports
PM = single eSATA port to a port multiplier
Note: going from 4 disks to 5 disks on this one machine resulted in a
performance drop, which is a likely indicator that there were bus
saturation issues between the memory subsystem and the southbridge and
that 5 disks simply oversaturated the southbridge's capacity.

> 2) The default bitmap chunk size changed, again not a problem as
> we don't use bitmaps in anaconda atm
>
> 3) We need to change the not using of a bitmap, we should use a bitmap
> by default except when the array will be used for /boot or swap.

Correct. The typical /boot array is too small to worry about; it can
usually be resynced in its entirety in a matter of seconds. Swap
partitions shouldn't use a bitmap because we don't want the overhead of
sync operations on the swap subsystem, especially since its data is
generally speaking transient. Other filesystems, especially once you
get to 10GB or larger, can benefit from the bitmap in the event of an
improper shutdown.

> Questions:
> 1) What commandline option should we pass to "mdadm --create" to
> achieve this?

--bitmap={none,internal}

In the future if we opt for something other than the default bitmap
chunk, then when the above is internal, we would also pass:

--bitmap-chunk=<chunksize in KB, default is 65536>
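
Purely as an illustration (device names are placeholders, and 131072 is just
an arbitrary non-default value), a create command that also overrides the
bitmap chunk would look something like:

mdadm --create /dev/md/data --run --level=5 --raid-devices=4 \
    --bitmap=internal --bitmap-chunk=131072 /dev/sd[b-e]1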

> 4) We need to start specifying a superblock version, and preferably
> version 1.1

No, we *must* start specifying a superblock version or else we will no
longer be able to boot our machines after a clean install. The new
default is 1.1, and I'm perfectly happy to use that as the default, but
as far as I'm aware, the only boot loader that can use a 1.1 superblock
based raid1 /boot partition is grub2, so all the other arches would not
be able to boot and we would have to forcibly upgrade all systems using
grub to grub2.

> 5) Specifying a superblock version of 1.1 will render systems non
> bootable, I assume this only applies to systems which have
> a raid1 /boot, so I guess that we need to specify a superblock
> version of 1.1, except when the raid set will be used for /boot,
> where we should keep using 0.9
>
> Questions:
> 1) Is the above correct ?

No, not quite. You can use superblock version 1.0 on /boot and grub
will then work. Both version 0.90 and version 1.0 superblocks are at
the end of the device and do not confuse boot loaders. Here's a summary
of superblock format differences:

Version 0.90:
Stored at end of device
Has no homehost field in the superblock but most recent versions of
mdadm would hash the name of the machine and use that for half of the
UUID, which provided a pseudo homehost entry
Limited to 27 constituent devices
Has no name field in the superblock
Has a preferred-minor field in the superblock
Does not contain sufficient information to distinguish between a
superblock at the end of a whole device and a superblock at the end of a
single partition on the whole device (for example, create a single
partition on a drive that uses the whole drive and place a version 0.90
superblock on it; you will then be able to pass either the whole disk or
the partition to an mdadm assemble command, and mdadm can't tell from the
information in the superblock whether you have passed in the right device).
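
As a concrete illustration of that ambiguity (device names are placeholders;
this assumes the single partition runs to the end of the disk), both of these
can find and report the very same 0.90 superblock:

mdadm --examine /dev/sdb
mdadm --examine /dev/sdb1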

Common to all version 1.x superblocks:
Has homehost and name fields (actually, one field with a max length of
32 chars)
Full UUID is generated, none hashed, so more bits of randomness on UUID
No limit to number of constituent devices
Has no preferred-minor field in the superblock, but can be emulated by
use of appropriate entry in name field

Version 1.0:
Located at end of device where version 0.90 superblocks are also located
Contains sufficient information to differentiate between being a
superblock for the whole device or just a partition on the device

Version 1.1:
Located at very beginning of device. If placed on a whole disk device,
occupies the same space as the MBR and partition table and does not
leave room for them. Data is offset after superblock, and as such the
normal device cannot be used to access the data, only the md device.

Version 1.2:
Located at beginning of device + 4K. This offset allows for the MBR
and partition table to have the first 4K. This can, however, cause
confusing situations when used on whole disk devices as you are able to
partition the device, but the entire device is the raid device, so the
partition is meaningless even if present. It does, however, allow for
booting off of these devices (theoretically, I don't think anyone is
doing so and I suspect even grub2 would need more work to make this
operational).

> 6) When creating 1.1 superblock sets we need to pass in:
> --homehost=<hostname>
> --name=<devicename>
> -e{1.0,1.1,1.2}
>
> Questions
> 1) Currently when creating a set, we do for example:
> mdadm --create /dev/md0 --run --level=1 --raid-devices=2 /dev/sda1
> /dev/sdb1
>
> What would this look like with the new mdadm, esp, what would
> happen to the
> /dev/md0 argument ?

The /dev/md0 argument is arbitrary. It could be /dev/md0, it could be
/dev/md/foobar. However, if we insist on sticking with the old numbered
device files, then we should also make sure that the --name field we pass
in is in the special format needed to get mdadm to automatically assume we
want numbered devices. In this case, --name=0 would be appropriate.
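
In other words, as a rough sketch (device names, hostname, and the "foobar"
name are placeholders), the two styles would be something like:

mdadm --create /dev/md0 --run --level=1 --raid-devices=2 \
    -e 1.1 --homehost=myhost.example.com --name=0 /dev/sda1 /dev/sdb1

mdadm --create /dev/md/foobar --run --level=1 --raid-devices=2 \
    -e 1.1 --homehost=myhost.example.com --name=foobar /dev/sda1 /dev/sdb1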

But this actually ignores a real situation that some of us use to get
around the brokenness of anaconda for many releases now. I typically
start any install by first burning the install image to CD, then booting
into rescue mode, then hand running fdisk on all my disks to get the
layout I want, then hand creating md raid arrays with the options I
want, then hand creating filesystems on those arrays or swap spaces on
those arrays with the options I want. Then I reboot in the install mode
on the same CD, and when it gets to the disk layout, I specify custom
layout and then I simply use all the filesystems and md raid devices I
created previously. However, even if I use version 1.x superblocks, and
even if I use named md raid arrays, anaconda always insists on ignoring
the names I've given them and assigning them numbers. Of course, the
numbers don't necessarily match up to the order in which I created them,
so I have to guess at which numbered array corresponds to which named
array (unless there are obvious hints like different sizes, but in the
last instance I was doing this I had 7 arrays that were all the same
size, each intended to be a root filesystem for a different version of
either RHEL or Fedora). Then, once the install is all complete, I have
to go back into rescue mode, remount the root filesystem, hand edit the
mdadm.conf to use names instead of numbers, remake the initrd images
(now dracut images), change any fstab entries, then I can finally use
the names. Really, it's *very* annoying that this minor number
dependence in anaconda has gone on so long. It was outdated 7 or 8
Fedora releases ago.

> If we can still specify which minor to use when creating a new array,
> even though
> that minor may change after the first reboot, then the amount of
> changes needed
> to the installer are minimal and we can likely do this without
> problems for RHEL-6.

I don't understand. Please enlighten me as to these requirements on
minor numbers in the installer. After all, it's not like there isn't a
simple means of naming these things:

If md raid device used for lvm pv, name it /dev/md/pv-#
If md raid device used for swap, name it /dev/md/swap-#
If md raid device used for /, name it /dev/md/root
If md raid device used for any other data partition, name it
/dev/md/<basename of mount point>

And it's not like anaconda doesn't already have that information
available when it's creating filesystem labels, so I'm curious why it's
so hard to use names instead of numbers for arrays in anaconda?
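
Following that scheme, for example (illustrative device names only; the
metadata, homehost, and bitmap options discussed earlier would be added as
appropriate), the create commands would simply be things like:

mdadm --create /dev/md/root --run --level=1 --raid-devices=2 --name=root /dev/sda2 /dev/sdb2
mdadm --create /dev/md/swap-0 --run --level=1 --raid-devices=2 --name=swap-0 /dev/sda3 /dev/sdb3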

> Regards,
>
> Hans
>
>
>
>
>
>
> On 11/26/2009 03:59 AM, Doug Ledford wrote:
>> Please keep me on the Cc: as I'm not on this list.
>>
>> Upstream recently released mdadm-3.1.1, which I intend to include in
>> Fedora soon. It finally updates three default settings that should have
>> been updated a long time ago.
>>
>> The default chunk size for raid4/5/6 is now 512K. Anaconda needs to be
>> updated to either leave the default alone or use 512K itself. In the
>> past it has passed in 256K, but extensive performance testing shows that
>> 512K is indeed the sweet spot on pretty much any SATA device, which
>> simply due to SATA being the overwhelming majority of disks we run on
>> today, it's sweet spot should be our default.
>>
>> It updates the default bitmap chunk to be at least 65536K when using an
>> internal bitmap. Performance tests showed as much as a 10% performance
>> penalty for the old default bitmap chunk (8192K). The new bitmap chunk
>> reduces that performance penalty (although we don't have solid numbers
>> on how much...I'll work on that). However, we've never used a bitmap by
>> default on any arrays we create. That needs to change. The simple
>> logic is this: no bitmap on /boot or any swap partitions, use a bitmap
>> on anything else. If we need a bitmap chunk other than the default,
>> I'll follow up here.
>>
>> It updates the default superblock format from the old, antiquated,
>> deprecated version 0.90 superblock that we should have quit using years
>> ago to version 1.1. This is the real kicker. Since anaconda has never
>> actively set the superblock metadata version (even though we should have
>> been using 1.1 long ago), it's now going to have to start. The reason
>> is that unless you upgrade machines to use an md raid aware boot loader,
>> such as grub2 for x86 although I have no idea what would work on non-x86
>> arches, version 1.1 superblocks will render all installs unbootable.
>> More importantly though, unless the anaconda team decides to blindly set
>> all superblocks back to the old version 0.90 format, this change
>> necessitates more than just a change to controlling which version of 1.x
>> superblock we use on any given array, but also a change to how we create
>> and name arrays in general. Version 0.90 superblocks are from back in
>> the day when we thought it was smart/reasonable to name arrays by number
>> and to mount scsi devices in fstab by their /dev/ entry. That day has
>> long since been gone, dead and buried. We switched filesystems to mount
>> by label so they are immune to device number changes and similarly
>> version 1.x superblocks totally do away with the preferred-minor field
>> in the superblock. Instead, they have a homehost and name field that
>> are used to control device *naming*, not numbering, and in a properly
>> running version 1.x superblock system, the device numbers are not
>> guaranteed to be static from boot to boot (although they usually are).
>> This doesn't appear to be much problem for dracut, but as an example,
>> I'm attaching the mkinitrd patch I have to apply to an F11 system after
>> every mkinitrd update in order to get initrd images that mount by name
>> properly.
>>
>> So, those are the major differences. Switching to any of the version
>> 1.x superblocks necessitates that anaconda pass a few arguments that it
>> hasn't in the past. Right now, these are the things anaconda is going
>> to need to start passing in on any mdadm create commands (that I don't
>> currently believe it does, but I haven't checked and could be wrong):
>>
>> --homehost=<hostname>
>> --name=<devicename>
>> -e{1.0,1.1,1.2}
>>
>> In addition, we should start passing the bitmap option as I outlined
>> above.
>>
>> We will also likely need to set the HOMEHOST entry in mdadm.conf and
>> possibly the AUTO entry in mdadm.conf as well.
>>
>> And this brings me to a different point. Hans asked me to comment on
>> bz537329. I would suggest people look at my comments there for some
>> additional explanation of why ideas like trying to make things work
>> without mdadm.conf are probably a bad idea.
>>
>> So here are a few additional things that I think are worth taking into
>> consideration.
>>
>> If an array is listed in mdadm.conf, then *every* item on the array line
>> must match the array or else it will fail to start. This means that
>> ARRAY lines that list things that can change by using mdadm --grow to
>> change aspects of the array can result in the array failing to be found
>> on the next reboot. Therefore, it would be best if each new ARRAY line
>> we write includes nothing besides the name of the array, the metadata
>> version, and the UUID.
>>
>> If an array is listed in mdadm.conf, then both the --homehost and --name
>> settings will be overridden by the name in the mdadm.conf file, so do
>> not depend on either having an effect for arrays listed in mdadm.conf.
>>
>> However, homehost and name are both used heavily any time the array is
>> not listed in mdadm.conf so setting them correctly is still important.
>> There are a number of common scenarios that make this important: you are
>> carrying an array from machine to machine (like an external drive tower,
>> or raid1 usb flash drive, etc.), when an array is visible to multiple
>> hosts (like arrays built over SAN devices), or when you've built a
>> machine to replace an existing machine and you temporarily install the
>> drives from the machine being replaced in the new machine to copy data
>> across in which case you are starting both your new array and the old
>> array on the same machine. They are also relied upon heavily in order
>> to attempt to satisfy those people that think the md raid stack should
>> work without any mdadm.conf file at all. And there is a special case
>> exception in the name field that is used to attempt to preserve back
>> compatibility. The intersection of all these attempts to satisfy
>> various needs is tricky. Here's how names are determined:
>>
>> 1) If the array is identified in mdadm.conf, the name from the ARRAY
>> line is used.
>> 2) If HOMEHOST has been set in the config
>> a) If the array uses a version 0.90 superblock, check to see if the
>> HOMEHOST has been encoded in the UUID via hash. If not, treat as
>> foreign, if so, treat as local.
>> b) For version 1.x superblocks check the homehost in the superblock
>> against the set homehost. If they match, treat as local, else if the
>> homehost in the superblock is not empty treat as named foreign else
>> treat as foreign.
>> 3) else
>> a) for version 0.90 superblocks treat the array as foreign.
>> b) for 1.x if homehost is set then named foreign else foreign.
>>
>> In case #1, the name as it's in the file is used. If the remainder of
>> cases, local means to attempt to create the array with the requested
>> number (in the case of 0.90 superblocks) or requested name (in the case
>> of version 1.x superblocks). Foreign means that the array will be
>> started with the requested name + a suffix. For example, version 0.90
>> superblock with preferred-minor of 0 would get created with a random
>> *actual* minor number and the name /dev/md0_0 or md0_1 if md0_0 already
>> exists, etc. A version 1.x superblock with the name root would get
>> created as /dev/md/root_0. Named foreign is used whenever a version 1.x
>> superblock can't be identified as local but it has a valid homehost
>> entry in the superblock. The format attempt is /dev/md/homehost:name so
>> that if you were to mount an array from workstation2:root on
>> workstation1, it would be /dev/md/workstation2:root.
>>
>> There is a special exception for version 1.x superblock arrays. If the
>> name field of the superblock contains a specially formatted name, then
>> it will be treated as a request to create the device with a given minor
>> number and name identical to an old version 0.90 superblock array.
>> Those special case names are:
>> a) a bare number (aka, 0)
>> b) a bare name using standard number format (aka, md0 or md_d0)
>> c) a full name using standard number format (aka, /dev/md0 or
>> /dev/md_d0)
>>
>> If an array uses a name instead of a number, then the named entry
>> created in /dev/md/ will be a symlink to a random numeric md device in
>> /dev/. For example, /dev/md/root, since it's the first device started
>> and since we start grabbing md devices at 127 and counting backwards
>> when starting named devices, will almost always point to /dev/md127.
>> The /dev/md127 file will be the real device file while the entries in
>> /dev/md/ are always symlinks. This is in order to be consistent with
>> the fact that our /sys/block entry will be md127 and our entry in
>> /proc/mdstat will also be md127. This is because the current /sys/block
>> setup does not allow /sys/block/md/root, only md<number>.
>>
>>
>>
>>
>> _______________________________________________
>> Anaconda-devel-list mailing list
>> Anaconda-devel-list@redhat.com
>> https://www.redhat.com/mailman/listinfo/anaconda-devel-list


--
Doug Ledford <dledford@redhat.com>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford

Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband

_______________________________________________
Anaconda-devel-list mailing list
Anaconda-devel-list@redhat.com
https://www.redhat.com/mailman/listinfo/anaconda-devel-list

Hans de Goede 12-01-2009 12:48 PM

ANNOUNCE: mdadm 3.1.1 - A tool for managing Soft RAID under Linux
 
Hi,

On 11/28/2009 02:02 AM, Doug Ledford wrote:

On 11/26/2009 04:31 AM, Hans de Goede wrote:

Hi Doug,

That is a lot of information in there, let me try to summarize it
and please let me know if I've missed anything:

1) The default chunksize for raid4/5/6 is changing, this should
not be a problem as we do not specify a chunksize when creating
new arrays


I thought we did specify a chunksize. Oh well, that just means our
default raid array performance will improve dramatically. The old
default of 64k was horrible for performance relative to the new 512k
default.

              4 disks on MB        5 disks on MB        4 disks on PM
              write     read       write     read       write     read
64K chunk     509.373   388.870    403.947   370.963    103.743    61.127
512K chunk    502.123   498.510    460.817   487.720    113.897   111.980

MB = Motherboard ports
PM = single eSATA port to a port multiplier
Note: going from 4 disks to 5 disks on this one machine resulted in a
performance drop which is a likely indicator that there were bus
saturation issues between the memory subsystem and the southbridge and
that 5 disks simply over saturated the southbridge's capacity.


2) The default bitmap chunk size changed, again not a problem as
we don't use bitmaps in anaconda atm

3) We need to change the not using of a bitmap, we should use a bitmap
by default except when the array will be used for /boot or swap.


Correct. The typical /boot array is too small to worry about, it can
usually be resynced in its entirety in a matter of seconds. Swap
partitions shouldn't use a bitmap because we don't want the overhead of
sync operations on the swap subsystem, especially since its data is
generally speaking transient. Other filesystems, especially once you
get to 10GB or larger, can benefit from the bitmap in the event of an
improper shutdown.


Questions:
1) What commandline option should we pass to "mdadm --create" to
achieve this?


--bitmap={none,internal}

In the future if we opt for something other than the default bitmap
chunk, then when the above is internal, we would also pass:

--bitmap-chunk=<chunksize in KB, default is 65536>



Ok, I'll try to write a patch for this next week (this week I have some
parted stuff that needs doing).


4) We need to start specifying a superblock version, and preferably
version 1.1


No, we *must* start specifying a superblock version or else we will no
longer be able to boot our machines after a clean install. The new
default is 1.1, and I'm perfectly happy to use that as the default, but
as far as I'm aware, the only boot loader that can use a 1.1 superblock
based raid1 /boot partition is grub2, so all the other arches would not
be able to boot and we would have to forcibly upgrade all systems using
grub to grub2.


5) Specifying a superblock version of 1.1 will render systems non
bootable, I assume this only applies to systems which have
a raid1 /boot, so I guess that we need to specify a superblock
version of 1.1, except when the raid set will be used for /boot,
where we should keep using 0.9

Questions:
1) Is the above correct ?


No, not quite. You can use superblock version 1.0 on /boot and grub
will then work. Both version 0.90 and version 1.0 superblocks are at
the end of the device and do not confuse boot loaders. Here's a summary
of superblock format differences:



Ok, so for /boot we must specify a superblock version; should we use 1.0 or
0.90? (I assume 1.0, but confirmation of that would be good.)

<snip>




6) When creating 1.1 superblock sets we need to pass in:
--homehost=<hostname>
--name=<devicename>
-e{1.0,1.1,1.2}

Questions
1) Currently when creating a set, we do for example:
mdadm --create /dev/md0 --run --level=1 --raid-devices=2 /dev/sda1
/dev/sdb1

What would this look like with the new mdadm, esp, what would
happen to the
/dev/md0 argument ?


The /dev/md0 argument is arbitrary. It could be /dev/md0, it could be
/dev/md/foobar. However, if we insist on sticking with the old numbered
device files, then it is certain that we should also do our best to make
sure that the --name field we pass in is in the special format needed to
get mdadm to automatically assume we want numbered devices. In this
case, --name=0 would be appropriate.

But this actually ignores a real situation that some of us use to get
around the brokenness of anaconda for many releases now. I typically
start any install by first burning the install image to CD, then booting
into rescue mode, then hand running fdisk on all my disks to get the
layout I want, then hand creating md raid arrays with the options I
want, then hand creating filesystems on those arrays or swap spaces on
those arrays with the options I want. Then I reboot in the install mode
on the same CD, and when it gets to the disk layout, I specify custom
layout and then I simply use all the filesystems and md raid devices I
created previously. However, even if I use version 1.x superblocks, and
even if I use named md raid arrays, anaconda always insists on ignoring
the names I've given them and assigning them numbers. Of course, the
numbers don't necessarily match up to the order in which I created them,
so I have to guess at which numbered array corresponds to which named
array (unless there are obvious hints like different sizes, but in the
last instance I was doing this I had 7 arrays that were all the same
size, each intended to be a root filesystem for a different version of
either RHEL or Fedora). Then, once the install is all complete, I have
to go back into rescue mode, remount the root filesystem, hand edit the
mdadm.conf to use names instead of numbers, remake the initrd images
(now dracut images), change any fstab entries, then I can finally use
the names. Really, it's *very* annoying that this minor number
dependence in anaconda has gone on so long. It was outdated 7 or 8
Fedora releases ago.



Then you should have asked us to change this 7 or 8 releases ago; changing
it this close to RHEL-6 is just not going to happen.


If we can still specify which minor to use when creating a new array,
even though
that minor may change after the first reboot, then the amount of
changes needed
to the installer are minimal and we can likely do this without
problems for RHEL-6.


I don't understand. Please enlighten me as to these requirements on
minor numbers in the installer. After all, it's not like there isn't a
simple means of naming these things:

If md raid device used for lvm pv, name it /dev/md/pv-#
If md raid device used for swap, name it /dev/md/swap-#
If md raid device used for /, name it /dev/md/root
If md raid device used for any other data partition, name it
/dev/md/<basename of mount point>

And it's not like anaconda doesn't already have that information
available when its creating filesystem labels, so I'm curious why it's
so hard to use names instead of numbers for arrays in anaconda?



It is not that hard, but currently all mdraid code inside anaconda is
based on the assumption that arrays are identified by their minor; changing
this takes time, time we do not have before RHEL-6.

So fixing this will have to wait until Fedora 14, I'm afraid.

Regards,

Hans

p.s.

Can you please reply to bug 537329 one more time? I've tried to explain
why I think that we can simplify mdraid activation in the proposed way
despite your objections. If you insist on keeping things as is, that is fine
too; in that case I'll come up with a separate solution for the Intel BIOS
RAID problems the current activation setup causes.

_______________________________________________
Anaconda-devel-list mailing list
Anaconda-devel-list@redhat.com
https://www.redhat.com/mailman/listinfo/anaconda-devel-list

Bill Nottingham 12-01-2009 02:19 PM

ANNOUNCE: mdadm 3.1.1 - A tool for managing Soft RAID under Linux
 
Hans de Goede (hdegoede@redhat.com) said:
> >No, not quite. You can use superblock version 1.0 on /boot and grub
> >will then work. Both version 0.90 and version 1.0 superblocks are at
> >the end of the device and do not confuse boot loaders. Here's a summary
> >of superblock format differences:
> >
>
> Ok, so for /boot we must specify a superblock version, should we use 1.0 or
> 0.9 (I assume 1.0, but confirmation of that would be good).

Knowing little about the details, couldn't the possibility of fixing grub
be investigated?

Bill

_______________________________________________
Anaconda-devel-list mailing list
Anaconda-devel-list@redhat.com
https://www.redhat.com/mailman/listinfo/anaconda-devel-list

Doug Ledford 12-01-2009 02:47 PM

ANNOUNCE: mdadm 3.1.1 - A tool for managing Soft RAID under Linux
 
On 12/01/2009 08:48 AM, Hans de Goede wrote:
> Hi,
>
> On 11/28/2009 02:02 AM, Doug Ledford wrote:
>> On 11/26/2009 04:31 AM, Hans de Goede wrote:

>>> 3) We need to change the not using of a bitmap, we should use a bitmap
>>> by default except when the array will be used for /boot or swap.
>>
>> Correct. The typical /boot array is too small to worry about, it can
>> usually be resynced in its entirety in a matter of seconds. Swap
>> partitions shouldn't use a bitmap because we don't want the overhead of
>> sync operations on the swap subsystem, especially since its data is
>> generally speaking transient. Other filesystems, especially once you
>> get to 10GB or larger, can benefit from the bitmap in the event of an
>> improper shutdown.
>>
>>> Questions:
>>> 1) What commandline option should we pass to "mdadm --create" to
>>> achieve this?
>>
>> --bitmap={none,internal}

Follow-up on this: --bitmap=internal when we want it, nothing when we
don't. The --bitmap=none option is only valid when changing an array
from having a bitmap to not having one using grow mode; it is not a
valid option to --create.
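
For reference (the device name is a placeholder), adding or removing a
bitmap on an existing array is done in grow mode, along these lines:

mdadm --grow --bitmap=internal /dev/md/data
mdadm --grow --bitmap=none /dev/md/data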


> Ok, so for /boot we must specify a superblock version, should we use 1.0 or
> 0.9 (I assume 1.0, but confirmation of that would be good).

Yes, 1.0 would be best.
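
So for a /boot array, as a sketch (device names and hostname are
placeholders), something along these lines:

mdadm --create /dev/md/boot --run --level=1 --raid-devices=2 \
    -e 1.0 --homehost=myhost.example.com --name=boot /dev/sda1 /dev/sdb1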

> Then you should have asked us to change this 7 or 8 releases ago, changing
> this so close to RHEL-6 is just not going to happen.

I've been asking to be included in anaconda/md raid planning for a
*LONG* time now. Just because you are not the person I asked, do not
assume the request was not made. It has been made, multiple times, in
person, face to face, and by the time it was time to actually plan
things out, was always forgotten or prioritized to the bottom of the
ladder where nothing happened.

>>> If we can still specify which minor to use when creating a new
>>> array,
>>> even though
>>> that minor may change after the first reboot, then the amount of
>>> changes needed
>>> to the installer are minimal and we can likely do this without
>>> problems for RHEL-6.
>>
>> I don't understand. Please enlighten me as to these requirements on
>> minor numbers in the installer. After all, it's not like there isn't a
>> simple means of naming these things:
>>
>> If md raid device used for lvm pv, name it /dev/md/pv-#
>> If md raid device used for swap, name it /dev/md/swap-#
>> If md raid device used for /, name it /dev/md/root
>> If md raid device used for any other data partition, name it
>> /dev/md/<basename of mount point>
>>
>> And it's not like anaconda doesn't already have that information
>> available when its creating filesystem labels, so I'm curious why it's
>> so hard to use names instead of numbers for arrays in anaconda?
>>
>
> It is not that hard, but currently all mdraid code inside anaconda is
> based on the assumption that they are identified by their minor, changing
> this takes time, time we do not have before RHEL-6.
>
> So fixing this will have to wait till Fedora 14 I'm afraid.

If it even happens then...

> Regards,
>
> Hans
>
> p.s.
>
> Can you please reply to bug 537329 one more time, I've tried to explain
> why I think that we can simplify mdraid activation in the proposed way
> despite your objections. If you insist on keeping things as is, that is
> fine
> too, then I'll come up with a separate solution for the Intel BIOS RAID
> problems the current activation setup causes.

The Intel BIOS RAID problems in that bug report are not unique. In
fact, if you created a new, normal md raid array after installation and
did not enter the array's info in mdadm.conf, it too would fail to be
assembled (unless you hot-plugged it after rc.sysinit was done running).
Running mdadm -Eb <constituent device> >> mdadm.conf is simply part of
the normal post-install array creation process for md raid arrays. In
any case, more comments in the bug.
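
Spelled out in long option form (the device path and the /etc/mdadm.conf
location here are just assumptions for the sake of the example), that step
is simply:

mdadm --examine --brief /dev/sdc1 >> /etc/mdadm.conf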

--
Doug Ledford <dledford@redhat.com>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford

Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband

_______________________________________________
Anaconda-devel-list mailing list
Anaconda-devel-list@redhat.com
https://www.redhat.com/mailman/listinfo/anaconda-devel-list

Doug Ledford 12-01-2009 02:50 PM

ANNOUNCE: mdadm 3.1.1 - A tool for managing Soft RAID under Linux
 
On 12/01/2009 10:19 AM, Bill Nottingham wrote:
> Hans de Goede (hdegoede@redhat.com) said:
>>> No, not quite. You can use superblock version 1.0 on /boot and grub
>>> will then work. Both version 0.90 and version 1.0 superblocks are at
>>> the end of the device and do not confuse boot loaders. Here's a summary
>>> of superblock format differences:
>>>
>>
>> Ok, so for /boot we must specify a superblock version, should we use 1.0 or
>> 0.9 (I assume 1.0, but confirmation of that would be good).
>
> Knowing little about the details, couldn't the possibility of fixing grub
> be investigated?

Grub *has* been fixed; it's called grub2. The amount of work necessary
to make grub do what grub2 already does isn't worth it; just upgrade to
grub2. However, that only solves the x86 arches, not anything else, and
I don't know of any other boot loaders that work with anything other
than 1.0/0.90 superblocks, so we still wouldn't have a universal
solution. Hence, sticking with 1.0 superblocks even if you have grub2
simply makes the most sense.

--
Doug Ledford <dledford@redhat.com>
GPG KeyID: CFBFF194
http://people.redhat.com/dledford

Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband

_______________________________________________
Anaconda-devel-list mailing list
Anaconda-devel-list@redhat.com
https://www.redhat.com/mailman/listinfo/anaconda-devel-list

Hans de Goede 12-02-2009 05:48 PM

ANNOUNCE: mdadm 3.1.1 - A tool for managing Soft RAID under Linux
 
Hi,

On 12/02/2009 06:22 PM, David Lehman wrote:

On Thu, 2009-11-26 at 10:31 +0100, Hans de Goede wrote:

Hi Doug,

That is a lot of information in there, let me try to summarize it
and please let me know if I've missed anything:

1) The default chunksize for raid4/5/6 is changing, this should
not be a problem as we do not specify a chunksize when creating
new arrays


We do have the 64k chunk size stored in MDRaidArrayDevice.chunkSize,
which we use to calculate array size, so we should update it there.



Hmm, we really should not store this but somehow query it. Doug,
is there a way to ask what the chunk size will become? (We will need
this before we actually create the array.)

Regards,

Hans

_______________________________________________
Anaconda-devel-list mailing list
Anaconda-devel-list@redhat.com
https://www.redhat.com/mailman/listinfo/anaconda-devel-list

