04-02-2008, 09:23 PM
Desynchronizing dm-raid1
Hi
Unfortunately, the bug with desynchronizing raid1 that someone pointed
out on Monday is real. The bug happens when you modify a page while it is
being written to a raid1 device --- the old version can be written to one
mirror leg and the new version to the other. The raid1 code does not notice
this, marks the region clean after the writes finish, and the volume stays
desynchronized.
There are several ways data can be modified while it is being written:
1. an application does O_DIRECT IO and modifies the memory while the IO is
in flight.
--- this is a problem of the application and we don't have to care about
it.
2. an application maps a file for writing. The pdflush or kswapd daemon
writes the page in the background while the application is modifying it.
3. an application writes to a page with the write() syscall. This syscall
can race with pdflush or kswapd as well.
4. a filesystem modifies the buffer while it's being written by the pdflush
or kswapd daemons.
The pdflush and kswapd daemons run in the background and periodically write
out modified data. pdflush is triggered regularly and writes data at a
specified interval (about 30 seconds), so that in case of a crash, the image
on disk is not too old. kswapd is triggered when free memory runs low
--- it writes file pages and filesystem buffers too.
In cases 2, 3 and 4 the data may be modified while it is being written,
but the kernel writes it again later. The sequence is something like:
clear dirty bit
submit IO
--- if the data is modified while the IO is in progress, the dirty bit is
turned on again, the data will be written again later, and the possible
data corruption is corrected. --- so as long as the system does not crash,
the mirror can't stay desynchronized.
But if the system crashes before the data is written a second time, the
blocks may stay desynchronized.
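A toy user-space model of that sequence, written in Python rather than kernel code, may make the window clearer. All names here (Page, Mirror, writeback) are hypothetical, and the two legs are simply assumed to capture the page contents at different instants:

```python
class Page:
    def __init__(self, data):
        self.data = data
        self.dirty = True

    def modify(self, data):
        # Any modification turns the dirty bit on again.
        self.data = data
        self.dirty = True


class Mirror:
    def __init__(self):
        self.legs = [None, None]


def writeback(page, mirror, racing_write=None):
    """Clear the dirty bit, then copy the page to both legs.

    If racing_write is given, the page is modified after leg 0 has
    captured its copy but before leg 1 has (assumed timing).
    """
    page.dirty = False              # clear dirty bit
    mirror.legs[0] = page.data      # submit IO: leg 0 captures the old data
    if racing_write is not None:
        page.modify(racing_write)   # modification while the IO is in flight
    mirror.legs[1] = page.data      # leg 1 captures whatever is there now


page, mirror = Page("old"), Mirror()
writeback(page, mirror, racing_write="new")
legs_after_first_write = list(mirror.legs)   # the legs now differ
dirty_after_first_write = page.dirty         # a second write is pending

# No crash: the later writeback repairs the mirror.
writeback(page, mirror)
legs_after_second_write = list(mirror.legs)
```

A crash between the two writeback() calls is the case that leaves the legs different with nothing recording that fact.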
An example of data corruption on ext2:
We have a dirty bitmap buffer
Pdflush clears the dirty flag and starts writing the buffer
The write is submitted to dm-raid1, it makes two requests and submits them
to two mirror devices
This operation races with another thread allocating a block on ext2 and
doing:
ext2_new_blocks
calling read_block_bitmap
calling sb_getblk
calling bh_uptodate_or_lock --- sees that the buffer is uptodate
(even if it's under write), so it returns.
calling ext2_try_to_allocate_with_rsv
calling ext2_try_to_allocate
calling ext2_set_bit_atomic --- this modifies the bitmap
*** now suppose that the 2nd mirror device has already finished
its write and doesn't get the updated bit, while the 1st mirror
device writes the updated bit to disk.
calling mark_buffer_dirty --- this schedules new update of the buffer
(after several seconds)
Both writes finished, dm-raid1 driver turns off the dirty bit for the
region.
Before pdflush writes the buffer a second time, we get a
***CRASH***
After the next boot, dm-raid1 doesn't resynchronize the region, because the
region's bit is off. fsck scans the device. It reads the bitmap from the
first device, sees that the bit is correctly set and doesn't write the bitmap.
Some time later, the administrator removes the 1st disk and the kernel
starts reading from the 2nd mirror. Ext2 allocates another file, reads the
bitmap from the 2nd device, sees the bit is off and allocates that
block again. Now there is data corruption => two files pointing to the
same block.
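The double allocation can be replayed with a small Python sketch. The bitmap is modelled as a plain list and allocate() is a hypothetical stand-in for the ext2 allocator, not the real code path:

```python
def allocate(bitmap):
    """Return the index of the first free block and mark it used."""
    block = bitmap.index(0)
    bitmap[block] = 1
    return block


# In-memory block bitmap: blocks 0 and 1 are already in use.
bitmap = [1, 1, 0, 0]

# Writeback starts; leg 2 finishes with the old contents.
leg2 = list(bitmap)

# A racing allocation sets a bit while leg 1 is still being written.
file_a_block = allocate(bitmap)
leg1 = list(bitmap)

# Both writes finish, the region is marked clean, and the machine crashes
# before the re-dirtied buffer is written again. No resync happens on boot.

# Later the first disk is removed and the bitmap is read from leg 2.
bitmap_after_failover = list(leg2)
file_b_block = allocate(bitmap_after_failover)
```

Both files end up owning the same block number, which is the corruption described above.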
Ideas how to fix it:
1. lock the buffers and unmap the pages while they are being written.
--- upstream developers would likely reject it. No driver other than
dm-raid1 has problems with this, and they wouldn't hurt performance because
of one driver.
2. never turn the region dirty bit off until the filesystem is unmounted.
--- simplest fix. If the computer crashes after a long time, it
resynchronizes the whole device. md-raid resynchronizes the whole device
after a crash too.
3. turn off the bit if the block wasn't written in one pdflush period
--- requires an interaction with pdflush, rather complex, I wouldn't
recommend it.
4. make more region states.
--- If the region is in RH_DIRTY state and all writes drain, the state is
changed to RH_MAYBE_DIRTY. (we don't know if the region is synchronized or
not). The disk dirty flag is kept.
--- periodically (once every few minutes, so that it doesn't affect
performance much), change all regions in RH_MAYBE_DIRTY state to
RH_CLEAN_CANDIDATE, then issue sync() on all filesystems. If, after the
sync(), the region is still in RH_CLEAN_CANDIDATE (i.e. it hasn't been
written during the sync()), it is moved to RH_CLEAN state and the on-disk
bit for the region is turned off.
If one of the above scenarios 2, 3 or 4 happened (modifying a buffer while
it's under the disk write), the sync() would have written the buffer
again and kicked the region out of RH_CLEAN_CANDIDATE state. If the sync()
didn't touch the buffer, then we are sure that both on-disk copies are
synchronized.
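A minimal Python sketch of the state machine in idea 4 follows. The RH_* state names come from the proposal; the Region class, periodic_clean() and the way sync() is passed in as a callback are assumptions made for illustration only:

```python
from enum import Enum


class RH(Enum):
    CLEAN = "RH_CLEAN"
    DIRTY = "RH_DIRTY"
    MAYBE_DIRTY = "RH_MAYBE_DIRTY"
    CLEAN_CANDIDATE = "RH_CLEAN_CANDIDATE"


class Region:
    def __init__(self):
        self.state = RH.CLEAN
        self.pending = 0            # writes in flight
        self.on_disk_dirty = False

    def write_start(self):
        self.pending += 1
        self.state = RH.DIRTY
        self.on_disk_dirty = True

    def write_end(self):
        self.pending -= 1
        if self.pending == 0:
            # Writes drained, but the legs may differ: keep the disk flag.
            self.state = RH.MAYBE_DIRTY


def periodic_clean(regions, sync):
    """Mark regions clean only if sync() did not write to them."""
    for r in regions:
        if r.state is RH.MAYBE_DIRTY:
            r.state = RH.CLEAN_CANDIDATE
    sync()                          # stands in for sync() on all filesystems
    for r in regions:
        if r.state is RH.CLEAN_CANDIDATE:
            r.state = RH.CLEAN
            r.on_disk_dirty = False


quiet, racy = Region(), Region()
for r in (quiet, racy):
    r.write_start()
    r.write_end()


def sync():
    # The racy region's buffer was re-dirtied, so sync() rewrites it.
    racy.write_start()
    racy.write_end()


periodic_clean([quiet, racy], sync)
```

In this run the quiet region ends in RH_CLEAN with its on-disk bit cleared, while the region that sync() rewrote falls back to RH_MAYBE_DIRTY and keeps its on-disk bit.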
Do you have any other ideas on this?
Mikulas
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

04-02-2008, 11:13 PM
Desynchronizing dm-raid1
> No other driver than dm-raid1
> has problems with this and they wouldn't damp performance because of one
> driver.
--- so I found that md-raid-[156] recently (2.6.13 or so) added a bitmap
mode, and when this is used (argument --bitmap to mdadm), it is vulnerable
to this desynchronization bug too. Before that, it resynced the whole device
after a crash without a clean shutdown, so it was OK.
Mikulas

04-03-2008, 02:40 AM
Desynchronizing dm-raid1
Mikulas Patocka [mpatocka@redhat.com] wrote:
> Ideas how to fix it:
>
> 1. lock the buffers and unmap the pages while they are being written.
> --- upstream developers would likely reject it. No other driver than
> dm-raid1 has problems with this and they wouldn't damp performance because
> of one driver.
Very few drivers require it, so how about making an interface for locking
the pages of an I/O available to drivers? Only the RAID drivers that need it
would lock the I/O while it is in progress, and only they would pay the
performance penalty. mmap pages are a bit tricky. They need to go into
read-only mode while an I/O is in progress. I know this would likely be
rejected too!!!
> 4. make more region states.
> --- If the region is in RH_DIRTY state and all writes drain, the state is
> changed to RH_MAYBE_DIRTY. (we don't know if the region is synchronized or
> not). The disk dirty flag is kept.
> --- periodically (once in few minutes, so that it doesn't affect
> performance much), change all regions in RH_MAYBE_DIRTY state to
> RH_CLEAN_CANDIDATE, then issue sync() on all filesystems. If, after the
> sync(), the region is still in RH_CLEAN_CANDIDATE (i.e. it hasn't been
> written during the sync()), it is moved to RH_CLEAN state and the on-disk
> bit for the region is turned off.
Sounds good except that it uses sync()! Is there a way to sync only
the pages related to a certain block device? How hard is it to implement
such an interface?
--Malahal.

04-03-2008, 10:19 AM
Desynchronizing dm-raid1
See below [HM].
On Wed, Apr 02, 2008 at 04:23:41PM -0400, Mikulas Patocka wrote:
> Hi
>
> Unfortunately, the bug with desynchronizing raid1 that someone pointed out
> on Monday, is real. The bug happens when you modify the page while it's
> being written to raid1 device --- old version can be written to one mirror
> leg, the new versions to the other mirror leg. Raid1 code does not notice
> this, marks the region clean after the writes finish, and the volume stays
> desynchronized.
>
> The possibilities, how data can be modified while they are being written.
>
> 1. an application does O_DIRECT IO and modifies the memory underway.
>
> --- this is a problem of the application and we don't have to care about
> it.
>
> 2. an application maps file for writing. pdflush or kswapd daemon writes
> the page on background while the application is modifying it.
>
> 3. an application writes to a page with write() syscall. This syscall can
> race with pdflush or kswapd as well.
>
> 4. a filesystem modifies the buffer while it's being written by pdflush or
> kswapd daemons.
>
>
> The pdflush and kswapd daemons run in background and do periodic writes of
> the modified data. pdflush is triggered regularly and writes data in
> specified interval (about 30 seconds), so that in case of crash, the image
> on disk is not too old. kswapd is triggered when the free memory goes low
> --- it writes file pages and filesystem buffers too.
>
> In cases 2,3,4 the data may be modified while they are being written, but
> the kernel writes them later again. The sequence is something like:
> clear dirty bit
> submit IO
> --- if the data are modified while the IO is in progress, the dirty bit is
> turned on again and the data will be written later and possible data
> corruption is corrected. --- so as long as the system does not crash, there
> can't be desynchronized mirror.
>
> But if the system crashes before the data are written second time, the
> blocks may stay desynchronized.
>
> An example of data corruption on ext2:
>
> We have a dirty bitmap buffer
> Pdflush clears the dirty flag and starts writing the buffer
> The write is submitted to dm-raid1, it makes two requests and submits them
> to two mirror devices
>
> This operation races with another thread allocating a block on ext2 and
> doing:
[HM] And taking an unlocked copy in the RAID driver won't help
application data integrity, because the application could still change the
data while the RAID driver is copying, leading to a coherent RAID set that
holds incorrect application data.
One can argue that this is OK in case of a crash, because the application
failed to flush its page changes and hence has to be able to recover
from this.
We will always end up with consistent mirrors (either through duplicated
successful writes to all legs or after resynchronization of the mirror)
at the cost of internal caching of pages.
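As a rough illustration of that trade-off, here is a toy Python model of a driver that copies the page into a private buffer before writing. The function name and the half-by-half copy are assumptions chosen to make the race visible, not how any real driver copies pages:

```python
def mirrored_write_with_copy(page, legs, modify_during_copy=None):
    """Copy the page into a private buffer, then write that buffer to all legs."""
    private = bytearray(len(page))
    half = len(page) // 2
    private[:half] = page[:half]        # first half copied
    if modify_during_copy is not None:
        modify_during_copy(page)        # application races with the copy
    private[half:] = page[half:]        # second half copied
    for i in range(len(legs)):
        legs[i] = bytes(private)        # every leg gets the same bytes


page = bytearray(b"AAAAAAAA")
legs = [None, None]


def overwrite(p):
    p[:] = b"BBBBBBBB"


mirrored_write_with_copy(page, legs, modify_during_copy=overwrite)
```

Both legs receive identical bytes, so the mirror is coherent, but the content is a mix of the old and new page that the application never intended to write.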
> ext2_new_blocks
> calling read_block_bitmap
> calling sb_getblk
> calling bh_uptodate_or_lock --- sees that the buffer is uptodate (even if
> it's under write), so it returns.
> calling ext2_try_to_allocate_with_rsv
> calling ext2_try_to_allocate
> calling ext2_set_bit_atomic --- this modifies the bitmap
> *** now suppose that the 2nd mirror device has already finished
> its write and doesn't get the updated bit, while the 1st mirror
> device writes the updated bit to disk.
> calling mark_buffer_dirty --- this schedules new update of the buffer
> (after several seconds)
>
> Both writes finished, dm-raid1 driver turns off the dirty bit for the
> region.
>
> Before pdflush writes the buffer second time, we get a
> ***CRASH***
>
> After new boot, dm-raid1 doesn't update the region, because the region's
> bit is off. fsck scans the device. It reads the bitmap from the first
> device, sees that the bit is correctly set and doesn't write the bitmap.
>
> Some time later, the administrator removes the 1st disk, the kernel starts
> reading from 2nd mirror. Ext2 allocates another file, it reads the bitmap
> from the 2nd device, sees the bit is off and allocates another block there.
> Now there is data corruption => two files pointing to the same block.
>
>
> Ideas how to fix it:
>
> 1. lock the buffers and unmap the pages while they are being written.
> --- upstream developers would likely reject it. No other driver than
> dm-raid1 has problems with this and they wouldn't damp performance because
> of one driver.
[HM] md RAID456 and dm RAID45 don't have the raid1 problem, because
they use stripe caches and hence take page copies. Application pages
can nonetheless change relative to the stripe cache pages.
>
> 2. never turn the region dirty bit off until the filesystem is unmounted.
> --- simplest fix. If the computer crashes after a long time, it
> resynchronizes the whole device. md-raid resynchronizes the whole device
> after a crash too.
[HM] We wouldn't resync the whole device, just the dirty regions.
Of course, the whole device would be the worst case with a huge
write data set.
For obvious reasons this is not what we want performance-wise...
>
> 3. turn off the bit if the block wasn't written in one pdflush period
> --- requires an interaction with pdflush, rather complex, I wouldn't
> recommend it.
>
> 4. make more region states.
> --- If the region is in RH_DIRTY state and all writes drain, the state is
> changed to RH_MAYBE_DIRTY. (we don't know if the region is synchronized or
> not). The disk dirty flag is kept.
> --- periodically (once in few minutes, so that it doesn't affect
> performance much), change all regions in RH_MAYBE_DIRTY state to
> RH_CLEAN_CANDIDATE, then issue sync() on all filesystems. If, after the
> sync(), the region is still in RH_CLEAN_CANDIDATE (i.e. it hasn't been
> written during the sync()), it is moved to RH_CLEAN state and the on-disk
> bit for the region is turned off.
[HM] This is essentially one technical approach for my comment on 2. above.
RH_MAYBE_DIRTY sounds superfluous at first glance, because once all writes
to a region have drained, we can set RH_CLEAN_CANDIDATE, run the sync() and
check whether that state persists in order to trigger the dirty log update.
Heinz
>
> If one of the above scenarios 2,3,4 happened (modifying a buffer while it's
> under the disk write), the sync() would have written the buffer again
> and kicked the region out of RH_CLEAN_CANDIDATE state. If the sync() didn't
> touch the buffer, then we are sure that both on-disk copies are
> synchronized.
>
>
> Do you have any other ideas on this?
>
> Mikulas
>

04-03-2008, 03:21 PM
Desynchronizing dm-raid1
Heinz Mauelshagen [mauelshagen@redhat.com] wrote:
>
> [HM] md RAID456 and dm RAID45 don't have the raid1 problem, because
> they utilize stripe caches, hence taking page copies. Application pages
> can change nonetheless vs. stripe cache pages.
For the sake of performance, I wish they didn't make copies of data pages!
But if they do make copies for all of their I/O, they don't
have this problem.
> > 4. make more region states.
> > --- If the region is in RH_DIRTY state and all writes drain, the state is
> > changed to RH_MAYBE_DIRTY. (we don't know if the region is synchronized or
> > not). The disk dirty flag is kept.
> > --- periodically (once in few minutes, so that it doesn't affect
> > performance much), change all regions in RH_MAYBE_DIRTY state to
> > RH_CLEAN_CANDIDATE, then issue sync() on all filesystems. If, after the
> > sync(), the region is still in RH_CLEAN_CANDIDATE (i.e. it hasn't been
> > written during the sync()), it is moved to RH_CLEAN state and the on-disk
> > bit for the region is turned off.
>
> [HM] This is essentially one technical approach for my comment on 2. above.
> RH_MAYBE_DIRTY sounds superfluous at first glance, because when all writes
> to a region drained, we can set RH_CLEAN_CANDIDATE, run the sync() and check
> if that state persists in order to trigger the dirty log update.
I don't think the state RH_MAYBE_DIRTY is superfluous. If the region
state is RH_CLEAN_CANDIDATE after the sync(), that means no 'write'
happened since we set RH_CLEAN_CANDIDATE. If there was any write, the
region state would be 'RH_DIRTY' or 'RH_MAYBE_DIRTY'.
--Malahal.

04-03-2008, 03:49 PM
Desynchronizing dm-raid1
>>>>> "Malahal" == malahal <malahal@us.ibm.com> writes:
>> 1. lock the buffers and unmap the pages while they are being
>> written. --- upstream developers would likely reject it. No other
>> driver than dm-raid1 has problems with this and they wouldn't damp
>> performance because of one driver.
Malahal> Very few drivers require it, so how about an interface to
Malahal> lock the pages of an I/O available to drivers. Only needed
Malahal> RAID drivers would lock the I/O while it is in progress and
Malahal> they only pay the performance penalty. mmap pages are a bit
Malahal> tricky. They need to go into read-only mode when an I/O is in
Malahal> progress. I know this would likely be rejected too!!!
I have exactly the same problem with the data integrity stuff I'm
working on.
Essentially a checksum gets generated when a bio is submitted and both
the I/O controller and the disk drive verify the checksum.
With ext2 in particular I often experience that the page (usually
containing directory metadata) has been modified before the controller
does the DMA. And the I/O will then be rejected by the controller or
drive because the checksum doesn't match the data.
So this problem is not specific to DM/MD...
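The failure mode can be shown with a short Python sketch. zlib.crc32 is only a stand-in for whatever checksum the real controller and drive verify, and submit() and dma_and_verify() are hypothetical names:

```python
import zlib


class IntegrityError(Exception):
    pass


def submit(page):
    """Compute the guard checksum when the request is submitted."""
    return zlib.crc32(bytes(page))


def dma_and_verify(page, guard):
    """Controller side: read the page at DMA time and verify the checksum."""
    data = bytes(page)
    if zlib.crc32(data) != guard:
        raise IntegrityError("checksum does not match data")
    return data


page = bytearray(b"directory block v1")
guard = submit(page)
page[-1:] = b"2"                 # filesystem modifies the page before the DMA

try:
    dma_and_verify(page, guard)
    rejected = False
except IntegrityError:
    rejected = True
```

Here the modified page no longer matches the guard computed at submission, so the simulated controller rejects the I/O.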
--
Martin K. Petersen Oracle Linux Engineering

04-07-2008, 03:25 PM
Desynchronizing dm-raid1
On Thu, Apr 03, 2008 at 07:21:54AM -0700, malahal@us.ibm.com wrote:
> Heinz Mauelshagen [mauelshagen@redhat.com] wrote:
> >
> > [HM] md RAID456 and dm RAID45 don't have the raid1 problem, because
> > they utilize stripe caches, hence taking page copies. Application pages
> > can change nonetheless vs. stripe cache pages.
>
> I wish they didn't make copies of data pages for the sake of
> performance! If they did make copies for all of their I/O, they don't
> have this problem.
Me too, but it's mandatory in order to be able to calculate parity chunks.
>
> > > 4. make more region states.
> > > --- If the region is in RH_DIRTY state and all writes drain, the state is
> > > changed to RH_MAYBE_DIRTY. (we don't know if the region is synchronized or
> > > not). The disk dirty flag is kept.
> > > --- periodically (once in few minutes, so that it doesn't affect
> > > performance much), change all regions in RH_MAYBE_DIRTY state to
> > > RH_CLEAN_CANDIDATE, then issue sync() on all filesystems. If, after the
> > > sync(), the region is still in RH_CLEAN_CANDIDATE (i.e. it hasn't been
> > > written during the sync()), it is moved to RH_CLEAN state and the on-disk
> > > bit for the region is turned off.
> >
> > [HM] This is essentially one technical approach for my comment on 2. above.
> > RH_MAYBE_DIRTY sounds superfluous at first glance, because when all writes
> > to a region drained, we can set RH_CLEAN_CANDIDATE, run the sync() and check
> > if that state persists in order to trigger the dirty log update.
>
> I don't think the state RH_MAYBE_DIRTY is superfluous. If the region
> state is RH_CLEAN_CANDIDATE after the sync(), that means no 'write'
> happened since we set RH_CLEAN_CANDIDATE. If there was any write, the
> region state would be 'RH_DIRTY' or 'RH_MAYBE_DIRTY'.
Hrm, that sounds like a contradiction in your statement.
Either it stays RH_CLEAN_CANDIDATE because of no writes *or*
it changes state to RH_DIRTY, no?
Heinz
>
> --Malahal.
>

04-07-2008, 04:41 PM
Desynchronizing dm-raid1
Heinz Mauelshagen [mauelshagen@redhat.com] wrote:
> > > RH_MAYBE_DIRTY sounds superfluous at first glance, because when all writes
> > > to a region drained, we can set RH_CLEAN_CANDIDATE, run the sync() and check
> > > if that state persists in order to trigger the dirty log update.
If I understand your above description: a region's state is set to
RH_DIRTY when an I/O is scheduled in the region and is set to
RH_CLEAN_CANDIDATE when all I/O is completed. In other words, a region's
state is RH_CLEAN_CANDIDATE when there is no pending I/O to that region.
Did I get it right so far?
Then we invoke sync(). Now, if the region's state is RH_CLEAN_CANDIDATE,
you set the region's state to RH_CLEAN. If the region's state is
anything other than RH_CLEAN_CANDIDATE, you don't do anything. Am I
correct?
> > I don't think the state RH_MAYBE_DIRTY is superfluous. If the region
> > state is RH_CLEAN_CANDIDATE after the sync(), that means no 'write'
> > happened since we set RH_CLEAN_CANDIDATE. If there was any write, the
> > region state would be 'RH_DIRTY' or 'RH_MAYBE_DIRTY'.
>
> Hrm, sound like a contradiction in your statement.
> Either it stays RH_CLEAN_CANDIDATE because of no writes *or*
> it's state-changing to RH_DIRTY, no ?
The state would be RH_CLEAN_CANDIDATE if there were ***NO*** writes as
part of sync(). The next statement only describes what would happen if
there were any writes as part of sync().
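One way to see the difference is a toy Python trace with a hypothetical run() helper. It assumes that the write issued by sync() has already completed by the time the region state is checked:

```python
def run(drained_state):
    """Return the region state seen after sync() when sync() rewrote the region.

    drained_state is the state a region enters when its writes drain.
    """
    state = "RH_CLEAN_CANDIDATE"    # set just before sync()
    # sync() writes a re-dirtied buffer that lives in this region:
    state = "RH_DIRTY"              # write scheduled
    state = drained_state           # write completed before the check
    return state


# Drained regions go straight back to RH_CLEAN_CANDIDATE.
without_maybe_dirty = run("RH_CLEAN_CANDIDATE")

# Drained regions go to RH_MAYBE_DIRTY instead.
with_maybe_dirty = run("RH_MAYBE_DIRTY")
```

Under that assumption, the variant without RH_MAYBE_DIRTY ends in the same state as a region that saw no write at all, while the variant with it stays distinguishable.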
--Malahal.
PS: Any comments from the original submitter on whether the state
is really superfluous?

04-07-2008, 06:05 PM
Desynchronizing dm-raid1
>>>>> "Malahal" == malahal <malahal@us.ibm.com> writes:
[I sent this last week but it never made it to the list]
Malahal> Very few drivers require it, so how about an interface to
Malahal> lock the pages of an I/O available to drivers. Only needed
Malahal> RAID drivers would lock the I/O while it is in progress and
Malahal> they only pay the performance penalty. mmap pages are a bit
Malahal> tricky. They need to go into read-only mode when an I/O is in
Malahal> progress. I know this would likely be rejected too!!!
I have exactly the same problem with the data integrity stuff I'm
working on.
Essentially a checksum gets generated when a bio is submitted, and
both the I/O controller and the disk drive verify the checksum.
With ext2 in particular I often experience that the page (usually
containing directory metadata) has been modified before the controller
does the DMA. And the I/O will then be rejected by the controller or
drive because the checksum doesn't match the data.
So this problem is not specific to DM/MD...
--
Martin K. Petersen Oracle Linux Engineering

04-07-2008, 06:22 PM
Desynchronizing dm-raid1
Martin K. Petersen [mkp@mkp.net] wrote:
> >>>>> "Malahal" == malahal <malahal@us.ibm.com> writes:
>
> [I sent this last week but it never made it to the list]
>
> Malahal> Very few drivers require it, so how about an interface to
> Malahal> lock the pages of an I/O available to drivers. Only needed
> Malahal> RAID drivers would lock the I/O while it is in progress and
> Malahal> they only pay the performance penalty. mmap pages are a bit
> Malahal> tricky. They need to go into read-only mode when an I/O is in
> Malahal> progress. I know this would likely be rejected too!!!
>
> I have exactly the same problem with the data integrity stuff I'm
> working on.
>
> Essentially a checksum gets generated when a bio is submitted, and
> both the I/O controller and the disk drive verify the checksum.
>
> With ext2 in particular I often experience that the page (usually
> containing directory metadata) has been modified before the controller
> does the DMA. And the I/O will then be rejected by the controller or
> drive because the checksum doesn't match the data.
Your problem is very similar to an iSCSI problem submitted here:
http://now.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=137902
Fortunately, you can detect the problem and the I/O can be retried if
possible. In the RAID case, it goes undetected until you hit the
eventual corruption!