Linux Archive

Linux Archive (http://www.linux-archive.org/)
-   Device-mapper Development (http://www.linux-archive.org/device-mapper-development/)
-   -   Fix over-zealous flush_disk when changing device size. (http://www.linux-archive.org/device-mapper-development/490719-fix-over-zealous-flush_disk-when-changing-device-size.html)

Jeff Moyer 02-17-2011 04:03 PM

Fix over-zealous flush_disk when changing device size.
 
NeilBrown <neilb@suse.de> writes:

> Hi Andrew (and others)
> I wonder if you would review the following for me and comment.
>
> Thanks,
> NeilBrown
>
>
>
> From e7f75c2a757108cdd83ce8c808a16bf27686c95f Mon Sep 17 00:00:00 2001
> From: NeilBrown <neilb@suse.de>
> Date: Thu, 17 Feb 2011 16:37:30 +1100
> Subject: [PATCH] Fix over-zealous flush_disk when changing device size.
>
> There are two cases when we call flush_disk.
> In one, the device has disappeared (check_disk_change) so any
> data will hold becomes irrelevant.
> In the oter, the device has changed size (check_disk_size_change)
> so data we hold may be irrelevant.
>
> In both cases it makes sense to discard any 'clean' buffers,
> so they will be read back from the device if needed.
>
> In the former case it makes sense to discard 'dirty' buffers
> as there will never be anywhere safe to write the data. In the
> second case it *does*not* make sense to discard dirty buffers
> as that will lead to file system corruption when you simply enlarge
> the containing devices.
>
> flush_disk calls __invalidate_devices.
> __invalidate_device calls both invalidate_inodes and invalidate_bdev.
>
> invalidate_inodes *does* discard I_DIRTY inodes and this does lead
> to fs corruption.
>
> invalidate_bev *does*not* discard dirty pages, but I don't really care
> about that at present.
>
> So this patch adds a flag to __invalidate_device (calling it
> __invalidate_device2) to indicate whether dirty buffers should be
> killed, and this is passed to invalidate_inodes which can choose to
> skip dirty inodes.
>
> flusk_disk then passes true from check_disk_change and false from
> check_disk_size_change.
>
> dm avoids tripping over this problem by calling i_size_write directly
> rathher than using check_disk_size_change.
>
> md does use check_disk_size_change and so is affected.
>
> This regression was introduced by commit 608aeef17a
> which causes check_disk_size_change to call
> flush_disk.

This makes sense to me. Nice write-up, Neil.

Acked-by: Jeff Moyer <jmoyer@redhat.com>

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Jeff Moyer 02-21-2011 06:36 PM

Fix over-zealous flush_disk when changing device size.
 
NeilBrown <neilb@suse.de> writes:

> -int __invalidate_device(struct block_device *bdev)
> +int __invalidate_device2(struct block_device *bdev, bool kill_dirty)
> {
> struct super_block *sb = get_super(bdev);
> int res = 0;
> @@ -1614,7 +1614,7 @@ int __invalidate_device(struct block_device *bdev)
> * hold).
> */
> shrink_dcache_sb(sb);
> - res = invalidate_inodes(sb);
> + res = invalidate_inodes(sb, kill_dirty);
> drop_super(sb);
> }
> invalidate_bdev(bdev);

Neil, I think you also need to change the EXPORT_SYMBOL from
__invalidate_device to __invalidate_device2. The floppy module won't
build without that change.

Cheers,
Jeff

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

NeilBrown 02-21-2011 08:14 PM

Fix over-zealous flush_disk when changing device size.
 
On Mon, 21 Feb 2011 14:36:01 -0500 Jeff Moyer <jmoyer@redhat.com> wrote:

> NeilBrown <neilb@suse.de> writes:
>
> > -int __invalidate_device(struct block_device *bdev)
> > +int __invalidate_device2(struct block_device *bdev, bool kill_dirty)
> > {
> > struct super_block *sb = get_super(bdev);
> > int res = 0;
> > @@ -1614,7 +1614,7 @@ int __invalidate_device(struct block_device *bdev)
> > * hold).
> > */
> > shrink_dcache_sb(sb);
> > - res = invalidate_inodes(sb);
> > + res = invalidate_inodes(sb, kill_dirty);
> > drop_super(sb);
> > }
> > invalidate_bdev(bdev);
>
> Neil, I think you also need to change the EXPORT_SYMBOL from
> __invalidate_device to __invalidate_device2. The floppy module won't
> build without that change.

Good point, thank.
I've just made this change in mt for-next tree so it should appear in -next
today?

NeilBrown

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Christoph Hellwig 03-03-2011 01:31 PM

Fix over-zealous flush_disk when changing device size.
 
On Thu, Feb 17, 2011 at 04:50:57PM +1100, NeilBrown wrote:
>
> Hi Andrew (and others)
> I wonder if you would review the following for me and comment.

Please send think in this area through -fsdevel next time, thanks!

> There are two cases when we call flush_disk.
> In one, the device has disappeared (check_disk_change) so any
> data will hold becomes irrelevant.
> In the oter, the device has changed size (check_disk_size_change)
> so data we hold may be irrelevant.
>
> In both cases it makes sense to discard any 'clean' buffers,
> so they will be read back from the device if needed.

Does it? If the device has disappeared we can't read them back anyway.
If the device has resized to a smaller size the same is true about
those buffers that have gone away, and if it has resized to a larger
size invalidating anything doesn't make sense at all. I think this
area needs more love than a quick kill_dirty hackjob.

> In the former case it makes sense to discard 'dirty' buffers
> as there will never be anywhere safe to write the data. In the
> second case it *does*not* make sense to discard dirty buffers
> as that will lead to file system corruption when you simply enlarge
> the containing devices.

Doing anything like this at the buffer cache layer or inode cache layer
doesn't make any sense. If a device goes away or shrinks below the
filesystem size the filesystem simply needs to be shut down and in te
former size the admin needs to start a manual repair. Trying to do
any botch jobs in lower layer never works in practice.

For now I think the best short term fix is to simply revert commit
608aeef17a91747d6303de4df5e2c2e6899a95e8

"Call flush_disk() after detecting an online resize."

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

NeilBrown 03-03-2011 11:16 PM

Fix over-zealous flush_disk when changing device size.
 
On Thu, 3 Mar 2011 09:31:20 -0500 Christoph Hellwig <hch@infradead.org> wrote:

> On Thu, Feb 17, 2011 at 04:50:57PM +1100, NeilBrown wrote:
> >
> > Hi Andrew (and others)
> > I wonder if you would review the following for me and comment.
>
> Please send think in this area through -fsdevel next time, thanks!

Will try to remember - it is sometimes hard to get this sort of patch before
the right audience ... I thought "block layer" rather than "file systems" :-(

Thanks for finding it anyway.

>
> > There are two cases when we call flush_disk.
> > In one, the device has disappeared (check_disk_change) so any
> > data will hold becomes irrelevant.
> > In the oter, the device has changed size (check_disk_size_change)
> > so data we hold may be irrelevant.
> >
> > In both cases it makes sense to discard any 'clean' buffers,
> > so they will be read back from the device if needed.
>
> Does it? If the device has disappeared we can't read them back anyway.

I think that is the point - return an error rather than stale data.

> If the device has resized to a smaller size the same is true about
> those buffers that have gone away, and if it has resized to a larger
> size invalidating anything doesn't make sense at all. I think this
> area needs more love than a quick kill_dirty hackjob.

I tend to agree. I wasn't entirely convinced by the changelog comments on
the original offending patch, but I couldn't convince myself there was no
justification either, and I wanted to fix the corruption I saw - while close
to the end of a release cycle - without introducing any new regressions.

>
> > In the former case it makes sense to discard 'dirty' buffers
> > as there will never be anywhere safe to write the data. In the
> > second case it *does*not* make sense to discard dirty buffers
> > as that will lead to file system corruption when you simply enlarge
> > the containing devices.
>
> Doing anything like this at the buffer cache layer or inode cache layer
> doesn't make any sense. If a device goes away or shrinks below the
> filesystem size the filesystem simply needs to be shut down and in te
> former size the admin needs to start a manual repair. Trying to do
> any botch jobs in lower layer never works in practice.

Amen.
What I personally would really like to see is an interface for the block
device to say to the filesystem (or more specifically: whatever has bdclaimed
it) "I am about to resize to $X - is that OK?" and also "I have resized -
deal with it".

>
> For now I think the best short term fix is to simply revert commit
> 608aeef17a91747d6303de4df5e2c2e6899a95e8
>
> "Call flush_disk() after detecting an online resize."

You may be right, but I suspect that Andrew Patterson had a real issue to
solve which lead to submitting it, and I'd really like to understand that
issue before I would feel confident just reverting it.

Andrew: are you out there? Can you provide some background for your patch?

Thanks,
NeilBrown

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Andrew Patterson 03-04-2011 04:25 PM

Fix over-zealous flush_disk when changing device size.
 
On Fri, 2011-03-04 at 11:16 +1100, NeilBrown wrote:
> On Thu, 3 Mar 2011 09:31:20 -0500 Christoph Hellwig <hch@infradead.org> wrote:
>
> > On Thu, Feb 17, 2011 at 04:50:57PM +1100, NeilBrown wrote:
> > >
> > > Hi Andrew (and others)
> > > I wonder if you would review the following for me and comment.
> >
> > Please send think in this area through -fsdevel next time, thanks!
>
> Will try to remember - it is sometimes hard to get this sort of patch before
> the right audience ... I thought "block layer" rather than "file systems" :-(
>
> Thanks for finding it anyway.
>
> >
> > > There are two cases when we call flush_disk.
> > > In one, the device has disappeared (check_disk_change) so any
> > > data will hold becomes irrelevant.
> > > In the oter, the device has changed size (check_disk_size_change)
> > > so data we hold may be irrelevant.
> > >
> > > In both cases it makes sense to discard any 'clean' buffers,
> > > so they will be read back from the device if needed.
> >
> > Does it? If the device has disappeared we can't read them back anyway.
>
> I think that is the point - return an error rather than stale data.
>
> > If the device has resized to a smaller size the same is true about
> > those buffers that have gone away, and if it has resized to a larger
> > size invalidating anything doesn't make sense at all. I think this
> > area needs more love than a quick kill_dirty hackjob.
>
> I tend to agree. I wasn't entirely convinced by the changelog comments on
> the original offending patch, but I couldn't convince myself there was no
> justification either, and I wanted to fix the corruption I saw - while close
> to the end of a release cycle - without introducing any new regressions.
>
> >
> > > In the former case it makes sense to discard 'dirty' buffers
> > > as there will never be anywhere safe to write the data. In the
> > > second case it *does*not* make sense to discard dirty buffers
> > > as that will lead to file system corruption when you simply enlarge
> > > the containing devices.
> >
> > Doing anything like this at the buffer cache layer or inode cache layer
> > doesn't make any sense. If a device goes away or shrinks below the
> > filesystem size the filesystem simply needs to be shut down and in te
> > former size the admin needs to start a manual repair. Trying to do
> > any botch jobs in lower layer never works in practice.
>
> Amen.
> What I personally would really like to see is an interface for the block
> device to say to the filesystem (or more specifically: whatever has bdclaimed
> it) "I am about to resize to $X - is that OK?" and also "I have resized -
> deal with it".
>
> >
> > For now I think the best short term fix is to simply revert commit
> > 608aeef17a91747d6303de4df5e2c2e6899a95e8
> >
> > "Call flush_disk() after detecting an online resize."
>
> You may be right, but I suspect that Andrew Patterson had a real issue to
> solve which lead to submitting it, and I'd really like to understand that
> issue before I would feel confident just reverting it.
>
> Andrew: are you out there? Can you provide some background for your patch?

I put in the flush disk stuff at the suggestion of James Bottomley. In
fact the text for the justification in 608aeef17a91747d6303 is mostly
his. The idea is to get errors reported immediately rather than waiting
around for them to eventually get flushed and to make sure stale data is
not kept around. Certainly, at a minimum, not keeping stale data around
seems valuable to me.

What parts of the original justification did you think are unconvincing?
Note that the flush for growing the device is really only there for the
degenerate case where someone might shrink then grow a device
(admittedly, the user probably deserves to get data corruption/security
holes in such a case).

Andrew



--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

NeilBrown 03-06-2011 05:47 AM

Fix over-zealous flush_disk when changing device size.
 
On Fri, 04 Mar 2011 10:25:06 -0700 Andrew Patterson <andrew.patterson@hp.com>
wrote:

> On Fri, 2011-03-04 at 11:16 +1100, NeilBrown wrote:
> > On Thu, 3 Mar 2011 09:31:20 -0500 Christoph Hellwig <hch@infradead.org> wrote:
> >
> > > On Thu, Feb 17, 2011 at 04:50:57PM +1100, NeilBrown wrote:
> > > >
> > > > Hi Andrew (and others)
> > > > I wonder if you would review the following for me and comment.
> > >
> > > Please send think in this area through -fsdevel next time, thanks!
> >
> > Will try to remember - it is sometimes hard to get this sort of patch before
> > the right audience ... I thought "block layer" rather than "file systems" :-(
> >
> > Thanks for finding it anyway.
> >
> > >
> > > > There are two cases when we call flush_disk.
> > > > In one, the device has disappeared (check_disk_change) so any
> > > > data will hold becomes irrelevant.
> > > > In the oter, the device has changed size (check_disk_size_change)
> > > > so data we hold may be irrelevant.
> > > >
> > > > In both cases it makes sense to discard any 'clean' buffers,
> > > > so they will be read back from the device if needed.
> > >
> > > Does it? If the device has disappeared we can't read them back anyway.
> >
> > I think that is the point - return an error rather than stale data.
> >
> > > If the device has resized to a smaller size the same is true about
> > > those buffers that have gone away, and if it has resized to a larger
> > > size invalidating anything doesn't make sense at all. I think this
> > > area needs more love than a quick kill_dirty hackjob.
> >
> > I tend to agree. I wasn't entirely convinced by the changelog comments on
> > the original offending patch, but I couldn't convince myself there was no
> > justification either, and I wanted to fix the corruption I saw - while close
> > to the end of a release cycle - without introducing any new regressions.
> >
> > >
> > > > In the former case it makes sense to discard 'dirty' buffers
> > > > as there will never be anywhere safe to write the data. In the
> > > > second case it *does*not* make sense to discard dirty buffers
> > > > as that will lead to file system corruption when you simply enlarge
> > > > the containing devices.
> > >
> > > Doing anything like this at the buffer cache layer or inode cache layer
> > > doesn't make any sense. If a device goes away or shrinks below the
> > > filesystem size the filesystem simply needs to be shut down and in te
> > > former size the admin needs to start a manual repair. Trying to do
> > > any botch jobs in lower layer never works in practice.
> >
> > Amen.
> > What I personally would really like to see is an interface for the block
> > device to say to the filesystem (or more specifically: whatever has bdclaimed
> > it) "I am about to resize to $X - is that OK?" and also "I have resized -
> > deal with it".
> >
> > >
> > > For now I think the best short term fix is to simply revert commit
> > > 608aeef17a91747d6303de4df5e2c2e6899a95e8
> > >
> > > "Call flush_disk() after detecting an online resize."
> >
> > You may be right, but I suspect that Andrew Patterson had a real issue to
> > solve which lead to submitting it, and I'd really like to understand that
> > issue before I would feel confident just reverting it.
> >
> > Andrew: are you out there? Can you provide some background for your patch?
>
> I put in the flush disk stuff at the suggestion of James Bottomley. In
> fact the text for the justification in 608aeef17a91747d6303 is mostly
> his. The idea is to get errors reported immediately rather than waiting
> around for them to eventually get flushed and to make sure stale data is
> not kept around. Certainly, at a minimum, not keeping stale data around
> seems valuable to me.
>
> What parts of the original justification did you think are unconvincing?
> Note that the flush for growing the device is really only there for the
> degenerate case where someone might shrink then grow a device
> (admittedly, the user probably deserves to get data corruption/security
> holes in such a case).

hi Andrew.

One of the things that I didn't like about the change log is that it didn't
give clear context - what exactly is the problem it is trying to fix?
When you talk about "disks" changing size (reduced radius:-?) I think first
of 'dm' and 'md' - yet it clearly isn't dm related as dm doesn't even use
that code, and if it was 'md' related I would have thought I would have
heard about it....
So presumably these are some SCSI-attached devices that do internal volume
management and can change the size of .... targets? LUNs? something like
that.
So can these things change size without the SCSI layer immediately knowing
about it?? Don't you get some sort of "unit attention" or something?

The idea that the device might reduce in size and then grow again seems just
plain wrong. If that is possible it could return to it's original size and
you would never notice? And if there are cases that you know you will never
notice, then it seems like it is the wrong solution.
It would have been more credible if it *only* tried to flush when the size
was reduced ... which you do suggest as a possibility above I think.


The "potential security hole" seems completely bogus.
If you give me permission to read something, and I do, then you remove that
permission, the fact that I still remember what I read is not a security
hole. It is a natural fact of life

As for the two cases: I would describe them:
1. planned. The fs is already shrunk to within the new boundary so there
is nothing to be gained by flushing, and we could have to reload a
pile of data from storage which would be pointless
2. unplanned. The fs is probably toast, so whether we invalidate or not
isn't going to make a whole lot of difference.

So I would vote for not flushing.

The "not keeping stale data around" argument ... the only data that is
clearly stale is data that is in the block-device cache and beyond the new
EOF. Most filesystems keep most data in the page cache rather than the
block device cache so this would just be some metadata - maybe inode tables
or block usage bitmaps or similar. And caches regularly keep data around
that isn't actually going to be used - so keeping a bit more on the rare
occasion of a block-device-resize doesn't seem like an important cost.

Getting errors a little bit earlier in the case of an unplanned shrink is
possibly a credible argument. This would be read errors only of course -
write errors would still arrive at the same time. I'm not sure it would
really be very much earlier...maybe a bit in some cases.

On the whole, the arguments both for and against this change - in principle
- seem rather weak: "maybe" and "possibly" rather than something concrete.

So given that (due to an oversight) it actually causes filesystem
corruption, I tend to agree with Christoph that the best approach at this
stage is the revert the original patch, and then review all the related code
and come up with a "correct" approach.


And just to clarify: when I first found that description "unconvincing" I
hadn't thought through all of these issues - I think it was just that the
justification seems vague rather than concrete, so it was hard to reason
about it.


Would you be uncomfortable if I asked Linus to revert both my fix and your
original patch??

Thanks,
NeilBrown

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Andrew Patterson 03-07-2011 03:22 AM

Fix over-zealous flush_disk when changing device size.
 
On Sun, 2011-03-06 at 17:47 +1100, NeilBrown wrote:
> On Fri, 04 Mar 2011 10:25:06 -0700 Andrew Patterson <andrew.patterson@hp.com>
> wrote:
>
> > On Fri, 2011-03-04 at 11:16 +1100, NeilBrown wrote:
> > > On Thu, 3 Mar 2011 09:31:20 -0500 Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > > On Thu, Feb 17, 2011 at 04:50:57PM +1100, NeilBrown wrote:
> > > > >
> > > > > Hi Andrew (and others)
> > > > > I wonder if you would review the following for me and comment.
> > > >
> > > > Please send think in this area through -fsdevel next time, thanks!
> > >
> > > Will try to remember - it is sometimes hard to get this sort of patch before
> > > the right audience ... I thought "block layer" rather than "file systems" :-(
> > >
> > > Thanks for finding it anyway.
> > >
> > > >
> > > > > There are two cases when we call flush_disk.
> > > > > In one, the device has disappeared (check_disk_change) so any
> > > > > data will hold becomes irrelevant.
> > > > > In the oter, the device has changed size (check_disk_size_change)
> > > > > so data we hold may be irrelevant.
> > > > >
> > > > > In both cases it makes sense to discard any 'clean' buffers,
> > > > > so they will be read back from the device if needed.
> > > >
> > > > Does it? If the device has disappeared we can't read them back anyway.
> > >
> > > I think that is the point - return an error rather than stale data.
> > >
> > > > If the device has resized to a smaller size the same is true about
> > > > those buffers that have gone away, and if it has resized to a larger
> > > > size invalidating anything doesn't make sense at all. I think this
> > > > area needs more love than a quick kill_dirty hackjob.
> > >
> > > I tend to agree. I wasn't entirely convinced by the changelog comments on
> > > the original offending patch, but I couldn't convince myself there was no
> > > justification either, and I wanted to fix the corruption I saw - while close
> > > to the end of a release cycle - without introducing any new regressions.
> > >
> > > >
> > > > > In the former case it makes sense to discard 'dirty' buffers
> > > > > as there will never be anywhere safe to write the data. In the
> > > > > second case it *does*not* make sense to discard dirty buffers
> > > > > as that will lead to file system corruption when you simply enlarge
> > > > > the containing devices.
> > > >
> > > > Doing anything like this at the buffer cache layer or inode cache layer
> > > > doesn't make any sense. If a device goes away or shrinks below the
> > > > filesystem size the filesystem simply needs to be shut down and in te
> > > > former size the admin needs to start a manual repair. Trying to do
> > > > any botch jobs in lower layer never works in practice.
> > >
> > > Amen.
> > > What I personally would really like to see is an interface for the block
> > > device to say to the filesystem (or more specifically: whatever has bdclaimed
> > > it) "I am about to resize to $X - is that OK?" and also "I have resized -
> > > deal with it".
> > >
> > > >
> > > > For now I think the best short term fix is to simply revert commit
> > > > 608aeef17a91747d6303de4df5e2c2e6899a95e8
> > > >
> > > > "Call flush_disk() after detecting an online resize."
> > >
> > > You may be right, but I suspect that Andrew Patterson had a real issue to
> > > solve which lead to submitting it, and I'd really like to understand that
> > > issue before I would feel confident just reverting it.
> > >
> > > Andrew: are you out there? Can you provide some background for your patch?
> >
> > I put in the flush disk stuff at the suggestion of James Bottomley. In
> > fact the text for the justification in 608aeef17a91747d6303 is mostly
> > his. The idea is to get errors reported immediately rather than waiting
> > around for them to eventually get flushed and to make sure stale data is
> > not kept around. Certainly, at a minimum, not keeping stale data around
> > seems valuable to me.
> >
> > What parts of the original justification did you think are unconvincing?
> > Note that the flush for growing the device is really only there for the
> > degenerate case where someone might shrink then grow a device
> > (admittedly, the user probably deserves to get data corruption/security
> > holes in such a case).
>
> hi Andrew.
>
> One of the things that I didn't like about the change log is that it didn't
> give clear context - what exactly is the problem it is trying to fix?

Well true, but the rest of the patch set might have given the context.

> When you talk about "disks" changing size (reduced radius:-?) I think first
> of 'dm' and 'md' - yet it clearly isn't dm related as dm doesn't even use
> that code, and if it was 'md' related I would have thought I would have
> heard about it....
> So presumably these are some SCSI-attached devices that do internal volume
> management and can change the size of .... targets? LUNs? something like
> that.

Yes. The idea is that you want to increase/decrease the size of the
underlying block device (usually a FC or iSCSI LUN). Short of a reboot,
the OS will not detect the size changed. I added some code that rescans
the LUN and gets the new size and report the new size to the block
layer.

> So can these things change size without the SCSI layer immediately knowing
> about it?? Don't you get some sort of "unit attention" or something?
>

In some cases with some hardware yes. In most, no.

> The idea that the device might reduce in size and then grow again seems just
> plain wrong.

This would usually be due to some user error. The user may have shrunk
the device too much, realized there error and then corrected it. I admit
that it is pretty far-fetched.

> If that is possible it could return to it's original size and
> you would never notice?

Yes.

> And if there are cases that you know you will never
> notice, then it seems like it is the wrong solution.
> It would have been more credible if it *only* tried to flush when the size
> was reduced ... which you do suggest as a possibility above I think.
>
>
> The "potential security hole" seems completely bogus.
> If you give me permission to read something, and I do, then you remove that
> permission, the fact that I still remember what I read is not a security
> hole. It is a natural fact of life

I think the case here might be where you add a LUN onto the space where
the previous LUN space existed. Again, not likely.

>
> As for the two cases: I would describe them:
> 1. planned. The fs is already shrunk to within the new boundary so there
> is nothing to be gained by flushing, and we could have to reload a
> pile of data from storage which would be pointless

This is the security case.

> 2. unplanned. The fs is probably toast, so whether we invalidate or not
> isn't going to make a whole lot of difference.

Agreed.

>
> So I would vote for not flushing.
>
> The "not keeping stale data around" argument ... the only data that is
> clearly stale is data that is in the block-device cache and beyond the new
> EOF. Most filesystems keep most data in the page cache rather than the
> block device cache so this would just be some metadata - maybe inode tables
> or block usage bitmaps or similar. And caches regularly keep data around
> that isn't actually going to be used - so keeping a bit more on the rare
> occasion of a block-device-resize doesn't seem like an important cost.
>
> Getting errors a little bit earlier in the case of an unplanned shrink is
> possibly a credible argument. This would be read errors only of course -
> write errors would still arrive at the same time. I'm not sure it would
> really be very much earlier...maybe a bit in some cases.
>
> On the whole, the arguments both for and against this change - in principle
> - seem rather weak: "maybe" and "possibly" rather than something concrete.
>
> So given that (due to an oversight) it actually causes filesystem
> corruption, I tend to agree with Christoph that the best approach at this
> stage is the revert the original patch, and then review all the related code
> and come up with a "correct" approach.
>
>
> And just to clarify: when I first found that description "unconvincing" I
> hadn't thought through all of these issues - I think it was just that the
> justification seems vague rather than concrete, so it was hard to reason
> about it.
>
>
> Would you be uncomfortable if I asked Linus to revert both my fix and your
> original patch??

James Bottomley wanted me to put this functionality in. I have no
problem with reverting it myself, especially if it is causing other
problems. I would have to say that you need to ask him (or rather, I am
not qualified to render an opinion here).

Andrew


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Jeff Moyer 03-16-2011 07:30 PM

Fix over-zealous flush_disk when changing device size.
 
NeilBrown <neilb@suse.de> writes:

>> Synchronous notification of errors. If we don't try to write everything
>> back immediately after the size change, we don't see dirty pages in
>> zapped regions until the writeout/page cache management takes it into
>> its head to try to clean the pages.
>>
>
> So if you just want synchronous errors, I think you want:
> fsync_bdev()
>
> which calls sync_filesystem() if it can find a filesystem, else
> sync_blockdev(); (sync_filesystem itself calls sync_blockdev too).

... which deadlocks md. ;-) writeback_inodes_sb_nr is waiting for the
flusher thread to write back the dirty data. The flusher thread is
stuck in md_write_start, here:

wait_event(mddev->sb_wait,
!test_bit(MD_CHANGE_PENDING, &mddev->flags));

This is after reverting your change, and replacing the flush_disk call
in check_disk_size_change with a call to fsync_bdev. I'm not familiar
enough with md to really suggest a way forward. Neil?

Cheers,
Jeff

md127: detected capacity change from 267386880 to 401080320
INFO: task md127_raid5:2255 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
md127_raid5 D ffff88011d010690 5416 2255 2 0x00000080
ffff88010fcc7990 0000000000000046 ffff880100000000 ffffffff812070c9
0000000000014d00 ffff88011d010100 ffff88011d010690 ffff88010fcc7fd8
ffff88011d010698 0000000000014d00 ffff88010fcc6010 0000000000014d00
Call Trace:
[<ffffffff812070c9>] ? cpumask_next_and+0x29/0x50
[<ffffffff81493df5>] schedule_timeout+0x265/0x2d0
[<ffffffff8104b341>] ? enqueue_task+0x61/0x80
[<ffffffff81493a25>] wait_for_common+0x115/0x180
[<ffffffff81057850>] ? default_wake_function+0x0/0x10
[<ffffffff81493b38>] wait_for_completion+0x18/0x20
[<ffffffff8115cce2>] writeback_inodes_sb_nr+0x72/0xa0
[<ffffffff8115cfad>] writeback_inodes_sb+0x4d/0x60
[<ffffffff81162499>] __sync_filesystem+0x49/0x90
[<ffffffff81162592>] sync_filesystem+0x32/0x60
[<ffffffff8116bc59>] fsync_bdev+0x29/0x70
[<ffffffff8116bcea>] check_disk_size_change+0x4a/0xb0
[<ffffffff81208e27>] ? kobject_put+0x27/0x60
[<ffffffff8116bdaf>] revalidate_disk+0x5f/0x90
[<ffffffffa031155a>] raid5_finish_reshape+0x9a/0x1e0 [raid456]
[<ffffffff8138a933>] reap_sync_thread+0x63/0x130
[<ffffffff8138c8a6>] md_check_recovery+0x1f6/0x6f0
[<ffffffffa03150ab>] raid5d+0x3b/0x610 [raid456]
[<ffffffff810804c9>] ? prepare_to_wait+0x59/0x90
[<ffffffff81387ee9>] md_thread+0x119/0x150
[<ffffffff810801d0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81387dd0>] ? md_thread+0x0/0x150
[<ffffffff8107fb56>] kthread+0x96/0xa0
[<ffffffff8100cc04>] kernel_thread_helper+0x4/0x10
[<ffffffff8107fac0>] ? kthread+0x0/0xa0
[<ffffffff8100cc00>] ? kernel_thread_helper+0x0/0x10
INFO: task flush-9:127:2288 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
flush-9:127 D ffff88011cee30a0 4664 2288 2 0x00000080
ffff88011b0af6a0 0000000000000046 0000000000000000 0000000000000000
0000000000014d00 ffff88011cee2b10 ffff88011cee30a0 ffff88011b0affd8
ffff88011cee30a8 0000000000014d00 ffff88011b0ae010 0000000000014d00
Call Trace:
[<ffffffff8138bbb5>] md_write_start+0xa5/0x1c0
[<ffffffff810801d0>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa0316435>] make_request+0x45/0x6c0 [raid456]
[<ffffffff811fbfcb>] ? blkiocg_update_dispatch_stats+0x8b/0xd0
[<ffffffff81385ca3>] md_make_request+0xd3/0x210
[<ffffffff811ee9da>] generic_make_request+0x2ea/0x5d0
[<ffffffff810e9cde>] ? mempool_alloc+0x5e/0x140
[<ffffffff811eed41>] submit_bio+0x81/0x110
[<ffffffff811699c6>] ? bio_alloc_bioset+0x56/0xf0
[<ffffffff81163ef6>] submit_bh+0xe6/0x140
[<ffffffff81165ad0>] __block_write_full_page+0x200/0x390
[<ffffffff811655a0>] ? end_buffer_async_write+0x0/0x1a0
[<ffffffff8116667e>] block_write_full_page_endio+0xde/0x110
[<ffffffffa037d3b0>] ? buffer_unmapped+0x0/0x20 [ext3]
[<ffffffff811666c0>] block_write_full_page+0x10/0x20
[<ffffffffa037de6d>] ext3_writeback_writepage+0x11d/0x170 [ext3]
[<ffffffff810f0152>] __writepage+0x12/0x40
[<ffffffff810f12b4>] write_cache_pages+0x1a4/0x490
[<ffffffff810f0140>] ? __writepage+0x0/0x40
[<ffffffff810f15bf>] generic_writepages+0x1f/0x30
[<ffffffff810f15f5>] do_writepages+0x25/0x30
[<ffffffff8115d5f0>] writeback_single_inode+0x90/0x220
[<ffffffff8115d9b6>] writeback_sb_inodes+0xc6/0x170
[<ffffffff8115dd3f>] wb_writeback+0x17f/0x430
[<ffffffff8106e217>] ? lock_timer_base+0x37/0x70
[<ffffffff8115e08d>] wb_do_writeback+0x9d/0x270
[<ffffffff8106e330>] ? process_timeout+0x0/0x10
[<ffffffff8115e302>] bdi_writeback_thread+0xa2/0x280
[<ffffffff8115e260>] ? bdi_writeback_thread+0x0/0x280
[<ffffffff8115e260>] ? bdi_writeback_thread+0x0/0x280
[<ffffffff8107fb56>] kthread+0x96/0xa0
[<ffffffff8100cc04>] kernel_thread_helper+0x4/0x10
[<ffffffff8107fac0>] ? kthread+0x0/0xa0
[<ffffffff8100cc00>] ? kernel_thread_helper+0x0/0x10
INFO: task updatedb:2342 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
updatedb D ffff88011bd77af0 5136 2342 2323 0x00000080
ffff88011c877cb8 0000000000000086 0000000000000000 ffff88011c1829a8
0000000000014d00 ffff88011bd77560 ffff88011bd77af0 ffff88011c877fd8
ffff88011bd77af8 0000000000014d00 ffff88011c876010 0000000000014d00
Call Trace:
[<ffffffff81165130>] ? sync_buffer+0x0/0x50
[<ffffffff8149382b>] io_schedule+0x6b/0xb0
[<ffffffff8116516b>] sync_buffer+0x3b/0x50
[<ffffffff81494057>] __wait_on_bit+0x57/0x80
[<ffffffff811699c6>] ? bio_alloc_bioset+0x56/0xf0
[<ffffffff81165130>] ? sync_buffer+0x0/0x50
[<ffffffff814940f3>] out_of_line_wait_on_bit+0x73/0x90
[<ffffffff81080210>] ? wake_bit_function+0x0/0x40
[<ffffffff81165126>] __wait_on_buffer+0x26/0x30
[<ffffffffa038006c>] ext3_bread+0x5c/0x80 [ext3]
[<ffffffffa037ba63>] ext3_readdir+0x1f3/0x600 [ext3]
[<ffffffff8114a650>] ? filldir+0x0/0xe0
[<ffffffff8114a650>] ? filldir+0x0/0xe0
[<ffffffff8114a7e0>] vfs_readdir+0xb0/0xd0
[<ffffffff8114a964>] sys_getdents+0x84/0xf0
[<ffffffff8100bdd2>] system_call_fastpath+0x16/0x1b


diff --git a/block/genhd.c b/block/genhd.c
index cbf1112..6a5b772 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1355,7 +1355,7 @@ int invalidate_partition(struct gendisk *disk, int partno)
struct block_device *bdev = bdget_disk(disk, partno);
if (bdev) {
fsync_bdev(bdev);
- res = __invalidate_device(bdev, true);
+ res = __invalidate_device(bdev);
bdput(bdev);
}
return res;
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index 77fc76f..b9ba04f 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -3281,7 +3281,7 @@ static int set_geometry(unsigned int cmd, struct floppy_struct *g,
struct block_device *bdev = opened_bdev[cnt];
if (!bdev || ITYPE(drive_state[cnt].fd_device) != type)
continue;
- __invalidate_device(bdev, true);
+ __invalidate_device(bdev);
}
mutex_unlock(&open_lock);
} else {
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 8892870..5aae241 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -934,9 +934,9 @@ EXPORT_SYMBOL_GPL(bd_unlink_disk_holder);
* when a disk has been changed -- either by a media change or online
* resize.
*/
-static void flush_disk(struct block_device *bdev, bool kill_dirty)
+static void flush_disk(struct block_device *bdev)
{
- if (__invalidate_device(bdev, kill_dirty)) {
+ if (__invalidate_device(bdev)) {
char name[BDEVNAME_SIZE] = "";

if (bdev->bd_disk)
@@ -973,7 +973,7 @@ void check_disk_size_change(struct gendisk *disk, struct block_device *bdev)
"%s: detected capacity change from %lld to %lld
",
name, bdev_size, disk_size);
i_size_write(bdev->bd_inode, disk_size);
- flush_disk(bdev, false);
+ fsync_bdev(bdev);
}
}
EXPORT_SYMBOL(check_disk_size_change);
@@ -1026,7 +1026,7 @@ int check_disk_change(struct block_device *bdev)
if (!(events & DISK_EVENT_MEDIA_CHANGE))
return 0;

- flush_disk(bdev, true);
+ flush_disk(bdev);
if (bdops->revalidate_disk)
bdops->revalidate_disk(bdev->bd_disk);
return 1;
@@ -1607,7 +1607,7 @@ fail:
}
EXPORT_SYMBOL(lookup_bdev);

-int __invalidate_device(struct block_device *bdev, bool kill_dirty)
+int __invalidate_device(struct block_device *bdev)
{
struct super_block *sb = get_super(bdev);
int res = 0;
@@ -1620,7 +1620,7 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty)
* hold).
*/
shrink_dcache_sb(sb);
- res = invalidate_inodes(sb, kill_dirty);
+ res = invalidate_inodes(sb);
drop_super(sb);
}
invalidate_bdev(bdev);
diff --git a/fs/inode.c b/fs/inode.c
index 0647d80..9c2b795 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -548,14 +548,11 @@ void evict_inodes(struct super_block *sb)
/**
* invalidate_inodes - attempt to free all inodes on a superblock
* @sb: superblock to operate on
- * @kill_dirty: flag to guide handling of dirty inodes
*
* Attempts to free all inodes for a given superblock. If there were any
* busy inodes return a non-zero value, else zero.
- * If @kill_dirty is set, discard dirty inodes too, otherwise treat
- * them as busy.
*/
-int invalidate_inodes(struct super_block *sb, bool kill_dirty)
+int invalidate_inodes(struct super_block *sb)
{
int busy = 0;
struct inode *inode, *next;
@@ -567,10 +564,6 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))
continue;
- if (inode->i_state & I_DIRTY && !kill_dirty) {
- busy = 1;
- continue;
- }
if (atomic_read(&inode->i_count)) {
busy = 1;
continue;
diff --git a/fs/internal.h b/fs/internal.h
index f3d15de..bee95ea 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -125,4 +125,4 @@ extern long do_handle_open(int mountdirfd,
*/
extern int get_nr_dirty_inodes(void);
extern void evict_inodes(struct super_block *);
-extern int invalidate_inodes(struct super_block *, bool);
+extern int invalidate_inodes(struct super_block *);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 13df14e..ff9a159 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2155,7 +2155,7 @@ extern void check_disk_size_change(struct gendisk *disk,
struct block_device *bdev);
extern int revalidate_disk(struct gendisk *);
extern int check_disk_change(struct block_device *);
-extern int __invalidate_device(struct block_device *, bool);
+extern int __invalidate_device(struct block_device *);
extern int invalidate_partition(struct gendisk *, int);
#endif
unsigned long invalidate_mapping_pages(struct address_space *mapping,

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Jeff Moyer 03-17-2011 04:33 PM

Fix over-zealous flush_disk when changing device size.
 
NeilBrown <neilb@suse.de> writes:

> On Wed, 16 Mar 2011 16:30:22 -0400 Jeff Moyer <jmoyer@redhat.com> wrote:
>
>> NeilBrown <neilb@suse.de> writes:
>>
>> >> Synchronous notification of errors. If we don't try to write everything
>> >> back immediately after the size change, we don't see dirty pages in
>> >> zapped regions until the writeout/page cache management takes it into
>> >> its head to try to clean the pages.
>> >>
>> >
>> > So if you just want synchronous errors, I think you want:
>> > fsync_bdev()
>> >
>> > which calls sync_filesystem() if it can find a filesystem, else
>> > sync_blockdev(); (sync_filesystem itself calls sync_blockdev too).
>>
>> ... which deadlocks md. ;-) writeback_inodes_sb_nr is waiting for the
>> flusher thread to write back the dirty data. The flusher thread is
>> stuck in md_write_start, here:
>>
>> wait_event(mddev->sb_wait,
>> !test_bit(MD_CHANGE_PENDING, &mddev->flags));
>>
>> This is after reverting your change, and replacing the flush_disk call
>> in check_disk_size_change with a call to fsync_bdev. I'm not familiar
>> enough with md to really suggest a way forward. Neil?
>
> That would be quite easy to avoid.
> Just call
> md_write_start()
> before revalidate_disk, and
> md_write_end()
> afterwards.

That does not avoid the problem (if I understood your suggestion). You
instead end up with the following:

INFO: task md127_raid5:2282 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
md127_raid5 D ffff88011c72d0a0 5688 2282 2 0x00000080
ffff880118997c20 0000000000000046 ffff880100000000 0000000000000246
0000000000014d00 ffff88011c72cb10 ffff88011c72d0a0 ffff880118997fd8
ffff88011c72d0a8 0000000000014d00 ffff880118996010 0000000000014d00
Call Trace:
[<ffffffff8138bbbd>] md_write_start+0xad/0x1d0
[<ffffffff810801d0>] ? autoremove_wake_function+0x0/0x40
[<ffffffffa0311558>] raid5_finish_reshape+0x98/0x1e0 [raid456]
[<ffffffff8138a933>] reap_sync_thread+0x63/0x130
[<ffffffff8138c8b6>] md_check_recovery+0x1f6/0x6f0
[<ffffffffa03150ab>] raid5d+0x3b/0x610 [raid456]
[<ffffffff810804c9>] ? prepare_to_wait+0x59/0x90
[<ffffffff81387ee9>] md_thread+0x119/0x150
[<ffffffff810801d0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff81387dd0>] ? md_thread+0x0/0x150
[<ffffffff8107fb56>] kthread+0x96/0xa0
[<ffffffff8100cc04>] kernel_thread_helper+0x4/0x10
[<ffffffff8107fac0>] ? kthread+0x0/0xa0
[<ffffffff8100cc00>] ? kernel_thread_helper+0x0/0x10

I'll leave this to you to work out when you have time.

Cheers,
Jeff

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


All times are GMT. The time now is 01:11 PM.

VBulletin, Copyright ©2000 - 2014, Jelsoft Enterprises Ltd.
Content Relevant URLs by vBSEO ©2007, Crawlability, Inc.