FAQ Search Today's Posts Mark Forums Read
» Video Reviews

» Linux Archive

Linux-archive is a website aiming to archive linux email lists and to make them easily accessible for linux users/developers.


» Sponsor

» Partners

» Sponsor

Go Back   Linux Archive > Redhat > Device-mapper Development

 
 
LinkBack Thread Tools
 
Old 10-28-2010, 01:16 AM
Mike Snitzer
 
Default dm: use revalidate_disk to update device size after set_capacity

Hi Jun'ichi,

Thanks for your note. It took me a while to set aside some time to look
at this closer. Please see my response inlined below.

On Wed, Oct 20 2010 at 1:42am -0400,
Jun'ichi Nomura <j-nomura@ce.jp.nec.com> wrote:

> Hi Mike,
>
> (10/20/10 07:07), Mike Snitzer wrote:
> > Avoid taking md->bdev->bd_inode->i_mutex to update the DM block device's
> > size. Doing so eliminates the potential for deadlock if an fsync is
> > racing with a device resize.
> >
> > Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> > ---
> > drivers/md/dm.c | 5 +----
> > 1 files changed, 1 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index f934e98..fd315a7 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -1995,10 +1995,7 @@ static void event_callback(void *context)
> > static void __set_size(struct mapped_device *md, sector_t size)
> > {
> > set_capacity(md->disk, size);
> > -
> > - mutex_lock(&md->bdev->bd_inode->i_mutex);
> > - i_size_write(md->bdev->bd_inode, (loff_t)size << SECTOR_SHIFT);
> > - mutex_unlock(&md->bdev->bd_inode->i_mutex);
> > + revalidate_disk(md->disk);
> > }
>
> Some concerns/questions:
>
> - revalidate_disk() calls bdget() inside it.
> Does bdget() no longer block on I/O?
> Sometime ago, bdget() has been blocked by I_LOCK,
> where process holding I_LOCK blocked by resize.
> Since I_LOCK has disappeared, I suspect this might not
> be a valid concern anymore.
> FYI, past discussion on this topic:
> http://www.redhat.com/archives/dm-devel/2007-October/msg00134.html
> (it's not my intention to push the patch in the above URL)

OK, thanks for the pointer.

Yes, I agree with you, seems there is no longer any obvious potential
for bdget to block on IO.

> - revalidate_disk() ends up with get_super():
> revalidate_disk
> check_disk_size_change
> flush_disk
> __invalidate_device
> get_super
> and get_super() takes sb->s_umount.
> OTOH, there are codes which wait on I/O holding s_umount
> exclusively. E.g. freeze_super() calls sync_filesystem().
> So it seems there is a possible deadlock if you use
> revalidate_disk from dm_swap_table.

Doesn't seem like a freeze_super of a particular DM device would
conflict with revalidate_disk relative to sb->s_umount. But it is
concerning that revalidate_disk is calling flush_disk while the DM
device is suspended (though in practice this doesn't cause a problem):

dm_suspend() - flushes any outstanding IO.
lock_fs
freeze_bdev
freeze_super
down_write(&sb->s_umount);
sync_filesystem
up_write(&sb->s_umount);

dm_swap_table()
__bind
__set_size
revalidate_disk
check_disk_size_change
flush_disk
__invalidate_device
get_super
down_read(&sb->s_umount);
up_read(&sb->s_umount);

dm_resume()
unlock_fs
thaw_bdev
thaw_super
down_write(&sb->s_umount);
deactivate_locked_super
up_write(&s->s_umount);

> I've been away from that part of the code for a while.
> So it's nice if the above is just a false alarm...

Clearly warrants further analysis.

check_disk_size_change() takes bdev->bd_mutex while changing
bdev->bd_inode->i_size (rather than bdev->bd_inode->i_mutex). This is
what attracted me to revalidate_disk (in addition to the rest of the
block drivers using it for resize -- thanks to Neil Brown for pointing
it out).

But in my limited testing of the proposed patch (above), using linear DM
target over DM mpath, I haven't seen any problems. I was doing IO in
parallel to the resize. Notice with the patch we now see the following
messages (dm-0 is the mpath device, dm-1 is the linear):

dm-0: detected capacity change from 0 to 10737418240
dm-1: detected capacity change from 0 to 8589934592
...
dm-1: detected capacity change from 8589934592 to 9663676416
dm-1: detected capacity change from 9663676416 to 9667870720

But I haven't yet fully understood why check_disk_size_change's use of
bdev->bd_mutex sufficiently protects access to bdev->bd_inode->i_size
(unless all access to bdev->bd_inode->i_size takes bdev->bd_mutex; DM
being an exception?).

Given how naive I am on these core block paths there is more analysis
needed to verify/determine the proper fix for DM device resize (while
the device is suspended).

Could be the following patch be sufficient? (avoids potential for IO
while device is suspended -- final patch would need comments explaining
why revalidate_disk was avoided)

Mike

---
drivers/md/dm.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 7cb1352..248794a 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1996,9 +1996,9 @@ static void __set_size(struct mapped_device *md, sector_t size)
{
set_capacity(md->disk, size);

- mutex_lock(&md->bdev->bd_inode->i_mutex);
+ mutex_lock(&md->bdev->bd_mutex);
i_size_write(md->bdev->bd_inode, (loff_t)size << SECTOR_SHIFT);
- mutex_unlock(&md->bdev->bd_inode->i_mutex);
+ mutex_unlock(&md->bdev->bd_mutex);
}

/*

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 10-28-2010, 12:15 PM
"Jun'ichi Nomura"
 
Default dm: use revalidate_disk to update device size after set_capacity

Hi Mike,

(10/28/10 10:16), Mike Snitzer wrote:
> But in my limited testing of the proposed patch (above), using linear DM
> target over DM mpath, I haven't seen any problems. I was doing IO in
> parallel to the resize. Notice with the patch we now see the following
> messages (dm-0 is the mpath device, dm-1 is the linear):

There is FIFREEZE ioctl, which calls freeze_super.
So if you mix a process doing FIFREEZE (xfs_freeze?) in your test,
I think you hit the deadlock like this:

process A process B
-----------------------------------------------
suspend dm dev
ioctl(FIFREEZE)
freeze_super()
hold s_umount
sync_filesystems()
wait for I/O flowing..

resume dm dev
__set_size
revalidate_disk()
hold bd_mutex
flush_disk()
wait for s_umount

> But I haven't yet fully understood why check_disk_size_change's use of
> bdev->bd_mutex sufficiently protects access to bdev->bd_inode->i_size
> (unless all access to bdev->bd_inode->i_size takes bdev->bd_mutex; DM
> being an exception?).

i_size_read/write uses seqcount to protect the reads from
accessing incomplete write.
But the seqcount itself needs protection. Otherwise concurrent
writes will break the seqcount scheme.
So i_size_write()s need mutual exclusion, but not all accesses do.
That's my understanding from the comments in include/linux/fs.h.

> Given how naive I am on these core block paths there is more analysis
> needed to verify/determine the proper fix for DM device resize (while
> the device is suspended).
>
> Could be the following patch be sufficient? (avoids potential for IO
> while device is suspended -- final patch would need comments explaining
> why revalidate_disk was avoided)

Though I can't point out actual problem,
I think it's deadlock-prone to take bd_mutex in dm_swap_table.

There are already codes which do I/O while holding bd_mutex,
e.g. block/ioctl.c, though the code is not called for dm,
so we can' just set a general rule "Don't do I/O while holding bd_mutex".

Also, even if I/O is not done under bd_mutex, it might be blocked by
other. For example, though currently nobody can call revalidate_disk for dm,

process A process B process C
----------------------------------------------------------
suspend dm dev
freeze_super()
hold s_umount
sync_filesystems()
wait for I/O flowing..

revalidate_disk()
hold bd_mutex
flush_disk()
wait for s_umount

resume dm dev
__set_size
wait for bd_mutex

If __set_size() could be done in later stage of do_resume(),
we can use revalidate_disk() for dm, too.
What do you think?

Thanks,
--
Jun'ichi Nomura, NEC Corporation

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 10-28-2010, 07:54 PM
Mike Snitzer
 
Default dm: use revalidate_disk to update device size after set_capacity

On Thu, Oct 28 2010 at 8:15am -0400,
Jun'ichi Nomura <j-nomura@ce.jp.nec.com> wrote:

> Hi Mike,
>
> (10/28/10 10:16), Mike Snitzer wrote:
> > But in my limited testing of the proposed patch (above), using linear DM
> > target over DM mpath, I haven't seen any problems. I was doing IO in
> > parallel to the resize. Notice with the patch we now see the following
> > messages (dm-0 is the mpath device, dm-1 is the linear):
>
> There is FIFREEZE ioctl, which calls freeze_super.
> So if you mix a process doing FIFREEZE (xfs_freeze?) in your test,
> I think you hit the deadlock like this:
>
> process A process B
> -----------------------------------------------
> suspend dm dev
> ioctl(FIFREEZE)
> freeze_super()
> hold s_umount
> sync_filesystems()
> wait for I/O flowing..
>
> resume dm dev
> __set_size
> revalidate_disk()
> hold bd_mutex
> flush_disk()
> wait for s_umount

OK, I didn't consider FIFREEZE ioctl..

But regardless, I think it is now clear that we must avoid using
revalidate_disk() while a DM device is suspended.

> > But I haven't yet fully understood why check_disk_size_change's use of
> > bdev->bd_mutex sufficiently protects access to bdev->bd_inode->i_size
> > (unless all access to bdev->bd_inode->i_size takes bdev->bd_mutex; DM
> > being an exception?).
>
> i_size_read/write uses seqcount to protect the reads from
> accessing incomplete write.
> But the seqcount itself needs protection. Otherwise concurrent
> writes will break the seqcount scheme.
> So i_size_write()s need mutual exclusion, but not all accesses do.
> That's my understanding from the comments in include/linux/fs.h.

Correct, but my point was that revalidate_disk protects the
bdev->bd_inode->i_size i_size_write() with bdev->bd_mutex.

Whereas quite a few non-block driver i_size_write() callers use the
inode's i_mutex; as DM currently does with md->bdev->bd_inode->i_mutex.

As we now know, using md->bdev->bd_inode->i_mutex is prone to deadlock
against fsync.

> > Given how naive I am on these core block paths there is more analysis
> > needed to verify/determine the proper fix for DM device resize (while
> > the device is suspended).
> >
> > Could be the following patch be sufficient? (avoids potential for IO
> > while device is suspended -- final patch would need comments explaining
> > why revalidate_disk was avoided)
>
> Though I can't point out actual problem,
> I think it's deadlock-prone to take bd_mutex in dm_swap_table.
>
> There are already codes which do I/O while holding bd_mutex,
> e.g. block/ioctl.c, though the code is not called for dm,
> so we can' just set a general rule "Don't do I/O while holding bd_mutex".

Right, it would work provided we had that understanding within DM.

> Also, even if I/O is not done under bd_mutex, it might be blocked by
> other. For example, though currently nobody can call revalidate_disk for dm,

Hmm, that was a strange example, as you pointed out, considering the
fictional revalidate_disk caller. I was proposing using bd_mutex
directly -- independent of revalidate_disk and its associated flushing.

> If __set_size() could be done in later stage of do_resume(),
> we can use revalidate_disk() for dm, too.
> What do you think?

I think it would be incorrect to make a new DM table live without first
adjusting the size of the DM device to reflect the new table. Could
result in racing IO to the end of a device which would cause IO errors
like "access beyond end of device".

So our only option would be to change the device's size _before_
do_resume's dm_suspend(). But then if dm_swap_table() fails we'd need
to revert back to the old size -- which is clearly awkward.

I think updating the device's size within dm_swap_table() really is the
most logical place.

I'll cleanup the patch from my previous mail to include some in-code
comments and re-post it. Please ack it if you agree that direct use of
bd_mutex in __set_size() is a reasonable fix given the current code.

Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 10-28-2010, 10:18 PM
Mike Snitzer
 
Default dm: use revalidate_disk to update device size after set_capacity

On Thu, Oct 28 2010 at 3:54pm -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Thu, Oct 28 2010 at 8:15am -0400,
> Jun'ichi Nomura <j-nomura@ce.jp.nec.com> wrote:

> > Though I can't point out actual problem,
> > I think it's deadlock-prone to take bd_mutex in dm_swap_table.
> >
> > There are already codes which do I/O while holding bd_mutex,
> > e.g. block/ioctl.c, though the code is not called for dm,
> > so we can' just set a general rule "Don't do I/O while holding bd_mutex".
>
> Right, it would work provided we had that understanding within DM.

I think I misunderstood what you were saying as: no code does I/O to DM
while holding bd_mutex so we _can_ just set the general rule....

But re-reading your mail made me realize you were likely saying the
opposite? e.g. your can' was meant to be can't.

So even though we currently can't see a real deadlock associated with
using ->bd_mutex directly you're contending there is potential for one
(may already be one in hiding or one could be introduced in the future).

That is fine, best to be conservative.

> I'll cleanup the patch from my previous mail to include some in-code
> comments and re-post it. Please ack it if you agree that direct use of
> bd_mutex in __set_size() is a reasonable fix given the current code.

Taking a step back, don't we already have adequate locking for
__set_size's i_size_write() via dm_swap_table's md->suspend_lock?

So something like the following?

Mike
---
drivers/md/dm.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 7cb1352..e71b8a1 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1996,9 +1996,8 @@ static void __set_size(struct mapped_device *md, sector_t size)
{
set_capacity(md->disk, size);

- mutex_lock(&md->bdev->bd_inode->i_mutex);
+ /* i_size_write() is protected by md->suspend_lock */
i_size_write(md->bdev->bd_inode, (loff_t)size << SECTOR_SHIFT);
- mutex_unlock(&md->bdev->bd_inode->i_mutex);
}

/*

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 10-29-2010, 03:00 AM
"Jun'ichi Nomura"
 
Default dm: use revalidate_disk to update device size after set_capacity

Hi Mike,

(10/29/10 04:54), Mike Snitzer wrote:
>> If __set_size() could be done in later stage of do_resume(),
>> we can use revalidate_disk() for dm, too.
>> What do you think?
>
> I think it would be incorrect to make a new DM table live without first
> adjusting the size of the DM device to reflect the new table. Could
> result in racing IO to the end of a device which would cause IO errors
> like "access beyond end of device".

Yes, but at that point, resume ioctl is still in progress.
Doesn't that mean "the resize is still in progress"?
If so, who cares that I/O while the resize is not yet completed?

> I'll cleanup the patch from my previous mail to include some in-code
> comments and re-post it. Please ack it if you agree that direct use of
> bd_mutex in __set_size() is a reasonable fix given the current code.

My point is that it's better to avoid having a code with
special assumption on generic code, either "any codes don't do
I/O while holding bd_mutex" or "dm_resume is the only code
which calls i_size_write() for dm device".
I think it's prone for someone who don't care dm specifics
to introduce a new code that breaks such an assumption.

Anyway, I think your bd_mutex patch should be fine for now and
is better than the current code with i_mutex, which has a real deadlock issue.

--
Jun'ichi Nomura, NEC Corporation

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 

Thread Tools




All times are GMT. The time now is 05:10 AM.

VBulletin, Copyright ©2000 - 2014, Jelsoft Enterprises Ltd.
Content Relevant URLs by vBSEO ©2007, Crawlability, Inc.
Copyright 2007 - 2008, www.linux-archive.org