FAQ Search Today's Posts Mark Forums Read
» Video Reviews

» Linux Archive

Linux-archive is a website aiming to archive linux email lists and to make them easily accessible for linux users/developers.


» Sponsor

» Partners

» Sponsor

Go Back   Linux Archive > Redhat > Device-mapper Development

 
 
LinkBack Thread Tools
 
Old 11-25-2011, 07:25 PM
Mikulas Patocka
 
Default deadlock with suspend and quotas

This script causes a kernel deadlock:
#!/bin/sh
set -e
DEVICE=/dev/vg1/linear
lvchange -ay $DEVICE
mkfs.ext3 $DEVICE
mount -t ext3 -o usrquota,grpquota $DEVICE /mnt/test
quotacheck -gu /mnt/test
umount /mnt/test
mount -t ext3 -o usrquota,grpquota $DEVICE /mnt/test
quotaon /mnt/test
dmsetup suspend $DEVICE
setquota -u root 1 2 3 4 /mnt/test &
sleep 1
dmsetup resume $DEVICE

setquota acquired semaphore s_umount for read and then tried to perform
a transaction (and waits because the device is suspended).
dmsetup resume tries to acquire s_umount for write before resuming the device
(and waits for setquota).

Here are stacktraces:
setquota:
[ 67.524456] [<ffffffff810aa84e>] ? get_page_from_freelist+0x31e/0x790
[ 67.524529] [<ffffffffa0250265>] ? start_this_handle.isra.9+0x265/0x3b0 [jbd]
[ 67.524604] [<ffffffff8105bc00>] ? add_wait_queue+0x60/0x60
[ 67.524675] [<ffffffffa02505a1>] ? journal_start+0xc1/0x100 [jbd]
[ 67.524742] [<ffffffff810e62d6>] ? kmem_cache_alloc+0xf6/0x1b0
[ 67.524808] [<ffffffffa028018d>] ? ext3_acquire_dquot+0x3d/0x80 [ext3]
[ 67.524872] [<ffffffff81143749>] ? dqget+0x359/0x3b0
[ 67.524916] [<ffffffff81143ad4>] ? dquot_get_dqblk+0x14/0x1b0
[ 67.524985] [<ffffffff81147c34>] ? quota_getquota+0x24/0xd0
[ 67.525048] [<ffffffff810ff3db>] ? do_path_lookup+0x2b/0x90
[ 67.525082] [<ffffffff810ff8cd>] ? kern_path+0x1d/0x40
[ 67.525134] [<ffffffff81148311>] ? do_quotactl+0x421/0x540
[ 67.525191] [<ffffffff811073f0>] ? dput+0x20/0x230
[ 67.525234] [<ffffffff81148507>] ? sys_quotactl+0xd7/0x1a0
[ 67.525304] [<ffffffff8130a03b>] ? system_call_fastpath+0x16/0x1b

dmsetup resume:
[ 67.525887] [<ffffffffa0238280>] ? dev_wait+0xc0/0xc0 [dm_mod]
[ 67.525948] [<ffffffff81309225>] ? rwsem_down_failed_common+0xc5/0x160
[ 67.526013] [<ffffffff81198a43>] ? call_rwsem_down_write_failed+0x13/0x20
[ 67.526058] [<ffffffff81308adc>] ? down_write+0x1c/0x1d
[ 67.526103] [<ffffffff810f3a91>] ? thaw_super+0x21/0xc0
[ 67.526166] [<ffffffff81124d4d>] ? thaw_bdev+0x6d/0x90
[ 67.526223] [<ffffffff8105583e>] ? queue_work+0x4e/0x60
[ 67.526269] [<ffffffffa0230e63>] ? unlock_fs+0x23/0x40 [dm_mod]
[ 67.526341] [<ffffffffa02336d0>] ? dm_resume+0xb0/0xd0 [dm_mod]
[ 67.526388] [<ffffffffa0238420>] ? dev_suspend+0x1a0/0x230 [dm_mod]
[ 67.526441] [<ffffffffa0238a59>] ? ctl_ioctl+0x159/0x2a0 [dm_mod]
[ 67.526510] [<ffffffff8116c4ee>] ? ipc_addid+0x4e/0xd0
[ 67.526555] [<ffffffffa0238bae>] ? dm_ctl_ioctl+0xe/0x20 [dm_mod]
[ 67.526620] [<ffffffff811025de>] ? do_vfs_ioctl+0x8e/0x4e0
[ 67.526670] [<ffffffff811073f0>] ? dput+0x20/0x230
[ 67.526737] [<ffffffff810f3112>] ? fput+0x162/0x220
[ 67.526783] [<ffffffff81102a79>] ? sys_ioctl+0x49/0x90
[ 67.526838] [<ffffffff8130a03b>] ? system_call_fastpath+0x16/0x1b

The following patch fixes the deadlock. When the quota subsystem takes s_umount,
it checks if the filesystem is frozen. If it is, we drop s_umount, wait for
the filesystem to resume and retry.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
CC: stable@kernel.org

---
fs/quota/quota.c | 12 ++++++++++++
include/linux/fs.h | 1 +
2 files changed, 13 insertions(+)

Index: linux-3.1-fast/fs/quota/quota.c
================================================== =================
--- linux-3.1-fast.orig/fs/quota/quota.c 2011-11-25 20:19:14.000000000 +0100
+++ linux-3.1-fast/fs/quota/quota.c 2011-11-25 21:12:32.000000000 +0100
@@ -310,7 +310,19 @@ static struct super_block *quotactl_bloc
putname(tmp);
if (IS_ERR(bdev))
return ERR_CAST(bdev);
+retry_super:
sb = get_super(bdev);
+ if (sb && sb->s_frozen != SB_UNFROZEN) {
+ /*
+ * When resuming a frozen device, s_umount is taken for write.
+ * To avoid deadlock, we drop s_umount if the filesystem
+ * is frozen and wait for it to thaw.
+ */
+ up_read(&sb->s_umount);
+ vfs_check_frozen(sb, SB_FREEZE_WRITE);
+ put_super(sb);
+ goto retry_super;
+ }
bdput(bdev);
if (!sb)
return ERR_PTR(-ENODEV);
Index: linux-3.1-fast/include/linux/fs.h
================================================== =================
--- linux-3.1-fast.orig/include/linux/fs.h 2011-11-25 21:13:56.000000000 +0100
+++ linux-3.1-fast/include/linux/fs.h 2011-11-25 21:14:03.000000000 +0100
@@ -2496,6 +2496,7 @@ extern struct file_system_type *get_fs_t
extern struct super_block *get_super(struct block_device *);
extern struct super_block *get_active_super(struct block_device *bdev);
extern struct super_block *user_get_super(dev_t);
+extern void put_super(struct super_block *sb);
extern void drop_super(struct super_block *sb);
extern void iterate_supers(void (*)(struct super_block *, void *), void *);
extern void iterate_supers_type(struct file_system_type *,

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 11-28-2011, 02:04 PM
Jan Kara
 
Default deadlock with suspend and quotas

Hello,

On Fri 25-11-11 15:25:16, Mikulas Patocka wrote:
> This script causes a kernel deadlock:
> #!/bin/sh
> set -e
> DEVICE=/dev/vg1/linear
> lvchange -ay $DEVICE
> mkfs.ext3 $DEVICE
> mount -t ext3 -o usrquota,grpquota $DEVICE /mnt/test
> quotacheck -gu /mnt/test
> umount /mnt/test
> mount -t ext3 -o usrquota,grpquota $DEVICE /mnt/test
> quotaon /mnt/test
> dmsetup suspend $DEVICE
> setquota -u root 1 2 3 4 /mnt/test &
> sleep 1
> dmsetup resume $DEVICE
>
> setquota acquired semaphore s_umount for read and then tried to perform
> a transaction (and waits because the device is suspended).
> dmsetup resume tries to acquire s_umount for write before resuming the device
> (and waits for setquota).
>
> Here are stacktraces:
> setquota:
> [ 67.524456] [<ffffffff810aa84e>] ? get_page_from_freelist+0x31e/0x790
> [ 67.524529] [<ffffffffa0250265>] ? start_this_handle.isra.9+0x265/0x3b0 [jbd]
> [ 67.524604] [<ffffffff8105bc00>] ? add_wait_queue+0x60/0x60
> [ 67.524675] [<ffffffffa02505a1>] ? journal_start+0xc1/0x100 [jbd]
> [ 67.524742] [<ffffffff810e62d6>] ? kmem_cache_alloc+0xf6/0x1b0
> [ 67.524808] [<ffffffffa028018d>] ? ext3_acquire_dquot+0x3d/0x80 [ext3]
> [ 67.524872] [<ffffffff81143749>] ? dqget+0x359/0x3b0
> [ 67.524916] [<ffffffff81143ad4>] ? dquot_get_dqblk+0x14/0x1b0
> [ 67.524985] [<ffffffff81147c34>] ? quota_getquota+0x24/0xd0
> [ 67.525048] [<ffffffff810ff3db>] ? do_path_lookup+0x2b/0x90
> [ 67.525082] [<ffffffff810ff8cd>] ? kern_path+0x1d/0x40
> [ 67.525134] [<ffffffff81148311>] ? do_quotactl+0x421/0x540
> [ 67.525191] [<ffffffff811073f0>] ? dput+0x20/0x230
> [ 67.525234] [<ffffffff81148507>] ? sys_quotactl+0xd7/0x1a0
> [ 67.525304] [<ffffffff8130a03b>] ? system_call_fastpath+0x16/0x1b
>
> dmsetup resume:
> [ 67.525887] [<ffffffffa0238280>] ? dev_wait+0xc0/0xc0 [dm_mod]
> [ 67.525948] [<ffffffff81309225>] ? rwsem_down_failed_common+0xc5/0x160
> [ 67.526013] [<ffffffff81198a43>] ? call_rwsem_down_write_failed+0x13/0x20
> [ 67.526058] [<ffffffff81308adc>] ? down_write+0x1c/0x1d
> [ 67.526103] [<ffffffff810f3a91>] ? thaw_super+0x21/0xc0
> [ 67.526166] [<ffffffff81124d4d>] ? thaw_bdev+0x6d/0x90
> [ 67.526223] [<ffffffff8105583e>] ? queue_work+0x4e/0x60
> [ 67.526269] [<ffffffffa0230e63>] ? unlock_fs+0x23/0x40 [dm_mod]
> [ 67.526341] [<ffffffffa02336d0>] ? dm_resume+0xb0/0xd0 [dm_mod]
> [ 67.526388] [<ffffffffa0238420>] ? dev_suspend+0x1a0/0x230 [dm_mod]
> [ 67.526441] [<ffffffffa0238a59>] ? ctl_ioctl+0x159/0x2a0 [dm_mod]
> [ 67.526510] [<ffffffff8116c4ee>] ? ipc_addid+0x4e/0xd0
> [ 67.526555] [<ffffffffa0238bae>] ? dm_ctl_ioctl+0xe/0x20 [dm_mod]
> [ 67.526620] [<ffffffff811025de>] ? do_vfs_ioctl+0x8e/0x4e0
> [ 67.526670] [<ffffffff811073f0>] ? dput+0x20/0x230
> [ 67.526737] [<ffffffff810f3112>] ? fput+0x162/0x220
> [ 67.526783] [<ffffffff81102a79>] ? sys_ioctl+0x49/0x90
> [ 67.526838] [<ffffffff8130a03b>] ? system_call_fastpath+0x16/0x1b
>
> The following patch fixes the deadlock. When the quota subsystem takes s_umount,
> it checks if the filesystem is frozen. If it is, we drop s_umount, wait for
> the filesystem to resume and retry.
Thanks for the patch. I'm aware of the deadlock and Val Henson is working
on resolving these types of deadlocks more systematically. But since I
haven't heard from her for a while, I guess I'll merge your fix and she'll
update her series to reflect your change since those patches are going to
go in at earliest in the next merge window.
To sum up: I've merged your patch.

Honza
>
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> CC: stable@kernel.org
>
> ---
> fs/quota/quota.c | 12 ++++++++++++
> include/linux/fs.h | 1 +
> 2 files changed, 13 insertions(+)
>
> Index: linux-3.1-fast/fs/quota/quota.c
> ================================================== =================
> --- linux-3.1-fast.orig/fs/quota/quota.c 2011-11-25 20:19:14.000000000 +0100
> +++ linux-3.1-fast/fs/quota/quota.c 2011-11-25 21:12:32.000000000 +0100
> @@ -310,7 +310,19 @@ static struct super_block *quotactl_bloc
> putname(tmp);
> if (IS_ERR(bdev))
> return ERR_CAST(bdev);
> +retry_super:
> sb = get_super(bdev);
> + if (sb && sb->s_frozen != SB_UNFROZEN) {
> + /*
> + * When resuming a frozen device, s_umount is taken for write.
> + * To avoid deadlock, we drop s_umount if the filesystem
> + * is frozen and wait for it to thaw.
> + */
> + up_read(&sb->s_umount);
> + vfs_check_frozen(sb, SB_FREEZE_WRITE);
> + put_super(sb);
> + goto retry_super;
> + }
> bdput(bdev);
> if (!sb)
> return ERR_PTR(-ENODEV);
> Index: linux-3.1-fast/include/linux/fs.h
> ================================================== =================
> --- linux-3.1-fast.orig/include/linux/fs.h 2011-11-25 21:13:56.000000000 +0100
> +++ linux-3.1-fast/include/linux/fs.h 2011-11-25 21:14:03.000000000 +0100
> @@ -2496,6 +2496,7 @@ extern struct file_system_type *get_fs_t
> extern struct super_block *get_super(struct block_device *);
> extern struct super_block *get_active_super(struct block_device *bdev);
> extern struct super_block *user_get_super(dev_t);
> +extern void put_super(struct super_block *sb);
> extern void drop_super(struct super_block *sb);
> extern void iterate_supers(void (*)(struct super_block *, void *), void *);
> extern void iterate_supers_type(struct file_system_type *,
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 11-28-2011, 08:00 PM
Valerie Aurora
 
Default deadlock with suspend and quotas

On Mon, Nov 28, 2011 at 8:04 AM, Jan Kara <jack@suse.cz> wrote:
> *Hello,
>
> On Fri 25-11-11 15:25:16, Mikulas Patocka wrote:
>> This script causes a kernel deadlock:
>> #!/bin/sh
>> set -e
>> DEVICE=/dev/vg1/linear
>> lvchange -ay $DEVICE
>> mkfs.ext3 $DEVICE
>> mount -t ext3 -o usrquota,grpquota $DEVICE /mnt/test
>> quotacheck -gu /mnt/test
>> umount /mnt/test
>> mount -t ext3 -o usrquota,grpquota $DEVICE /mnt/test
>> quotaon /mnt/test
>> dmsetup suspend $DEVICE
>> setquota -u root 1 2 3 4 /mnt/test &
>> sleep 1
>> dmsetup resume $DEVICE
>>
>> setquota acquired semaphore s_umount for read and then tried to perform
>> a transaction (and waits because the device is suspended).
>> dmsetup resume tries to acquire s_umount for write before resuming the device
>> (and waits for setquota).
>>
>> Here are stacktraces:
>> setquota:
>> [ * 67.524456] *[<ffffffff810aa84e>] ? get_page_from_freelist+0x31e/0x790
>> [ * 67.524529] *[<ffffffffa0250265>] ? start_this_handle.isra.9+0x265/0x3b0 [jbd]
>> [ * 67.524604] *[<ffffffff8105bc00>] ? add_wait_queue+0x60/0x60
>> [ * 67.524675] *[<ffffffffa02505a1>] ? journal_start+0xc1/0x100 [jbd]
>> [ * 67.524742] *[<ffffffff810e62d6>] ? kmem_cache_alloc+0xf6/0x1b0
>> [ * 67.524808] *[<ffffffffa028018d>] ? ext3_acquire_dquot+0x3d/0x80 [ext3]
>> [ * 67.524872] *[<ffffffff81143749>] ? dqget+0x359/0x3b0
>> [ * 67.524916] *[<ffffffff81143ad4>] ? dquot_get_dqblk+0x14/0x1b0
>> [ * 67.524985] *[<ffffffff81147c34>] ? quota_getquota+0x24/0xd0
>> [ * 67.525048] *[<ffffffff810ff3db>] ? do_path_lookup+0x2b/0x90
>> [ * 67.525082] *[<ffffffff810ff8cd>] ? kern_path+0x1d/0x40
>> [ * 67.525134] *[<ffffffff81148311>] ? do_quotactl+0x421/0x540
>> [ * 67.525191] *[<ffffffff811073f0>] ? dput+0x20/0x230
>> [ * 67.525234] *[<ffffffff81148507>] ? sys_quotactl+0xd7/0x1a0
>> [ * 67.525304] *[<ffffffff8130a03b>] ? system_call_fastpath+0x16/0x1b
>>
>> dmsetup resume:
>> [ * 67.525887] *[<ffffffffa0238280>] ? dev_wait+0xc0/0xc0 [dm_mod]
>> [ * 67.525948] *[<ffffffff81309225>] ? rwsem_down_failed_common+0xc5/0x160
>> [ * 67.526013] *[<ffffffff81198a43>] ? call_rwsem_down_write_failed+0x13/0x20
>> [ * 67.526058] *[<ffffffff81308adc>] ? down_write+0x1c/0x1d
>> [ * 67.526103] *[<ffffffff810f3a91>] ? thaw_super+0x21/0xc0
>> [ * 67.526166] *[<ffffffff81124d4d>] ? thaw_bdev+0x6d/0x90
>> [ * 67.526223] *[<ffffffff8105583e>] ? queue_work+0x4e/0x60
>> [ * 67.526269] *[<ffffffffa0230e63>] ? unlock_fs+0x23/0x40 [dm_mod]
>> [ * 67.526341] *[<ffffffffa02336d0>] ? dm_resume+0xb0/0xd0 [dm_mod]
>> [ * 67.526388] *[<ffffffffa0238420>] ? dev_suspend+0x1a0/0x230 [dm_mod]
>> [ * 67.526441] *[<ffffffffa0238a59>] ? ctl_ioctl+0x159/0x2a0 [dm_mod]
>> [ * 67.526510] *[<ffffffff8116c4ee>] ? ipc_addid+0x4e/0xd0
>> [ * 67.526555] *[<ffffffffa0238bae>] ? dm_ctl_ioctl+0xe/0x20 [dm_mod]
>> [ * 67.526620] *[<ffffffff811025de>] ? do_vfs_ioctl+0x8e/0x4e0
>> [ * 67.526670] *[<ffffffff811073f0>] ? dput+0x20/0x230
>> [ * 67.526737] *[<ffffffff810f3112>] ? fput+0x162/0x220
>> [ * 67.526783] *[<ffffffff81102a79>] ? sys_ioctl+0x49/0x90
>> [ * 67.526838] *[<ffffffff8130a03b>] ? system_call_fastpath+0x16/0x1b
>>
>> The following patch fixes the deadlock. When the quota subsystem takes s_umount,
>> it checks if the filesystem is frozen. If it is, we drop s_umount, wait for
>> the filesystem to resume and retry.
> *Thanks for the patch. I'm aware of the deadlock and Val Henson is working
> on resolving these types of deadlocks more systematically. But since I
> haven't heard from her for a while, I guess I'll merge your fix and she'll
> update her series to reflect your change since those patches are going to
> go in at earliest in the next merge window.
> *To sum up: I've merged your patch.

FYI, I no longer have time to consult in addition to my full-time job.
Canonical is taking over this patch set.

-VAL

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 11-28-2011, 08:14 PM
Mikulas Patocka
 
Default deadlock with suspend and quotas

> >> The following patch fixes the deadlock. When the quota subsystem takes s_umount,
> >> it checks if the filesystem is frozen. If it is, we drop s_umount, wait for
> >> the filesystem to resume and retry.
> > *Thanks for the patch. I'm aware of the deadlock and Val Henson is working
> > on resolving these types of deadlocks more systematically. But since I
> > haven't heard from her for a while, I guess I'll merge your fix and she'll
> > update her series to reflect your change since those patches are going to
> > go in at earliest in the next merge window.
> > *To sum up: I've merged your patch.
>
> FYI, I no longer have time to consult in addition to my full-time job.
> Canonical is taking over this patch set.
>
> -VAL

Hi

Where can I get that patch set?

We are experiencing other similar deadlocks on RHEL-6, caused by sync or
background writeback (these code paths take s_umount and wait trying to do
I/O), but I wasn't able to reproduce these deadlocks on upstream kernel?
Are there other known deadlock possibilities?

Mikulas--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 11-28-2011, 10:32 PM
Mikulas Patocka
 
Default deadlock with suspend and quotas

> Hi
>
> Where can I get that patch set?
>
> We are experiencing other similar deadlocks on RHEL-6, caused by sync or
> background writeback (these code paths take s_umount and wait trying to do
> I/O), but I wasn't able to reproduce these deadlocks on upstream kernel?
> Are there other known deadlock possibilities?
>
> Mikulas

I found some patch named "[RFC PATCH 1/3] VFS: Fix s_umount thaw/write
deadlock" (I couldn't find the next two parts of the patch in the
archives). And the patch looks wrong:

- down_read_trylock(&sb->s_umount) doesn't fix anything. The lock is not
held when the filesystem is frozen and it is taken for write when thawing.
Consequently, any task can succeed with down_read_trylock(&sb->s_umount)
on a frozen filesystem and if this tasks attempts to do an I/O that is
waiting for thaw, it may still deadlock.

- skipping sync on frozen filesystem violates sync semantics.
Applications, such as databases, assume that when sync finishes, data were
written to stable storage. If we skip sync when the filesystem is frozen,
we can cause data corruption in these applications (if the system crashes
after we skipped a sync).

- I'm not sure what userspace quota subsystem will do if we start
returning -EBUSY spuriously.

There is another thing --- I wasn't able to reproduce these sync-related
deadlocks at all. Did anyone succeeded in reproducing them? Are there any
reported deadlocks? When freezing the ext3 or ext4 filesystem, the kernel
prevents creating new dirty data, syncs all data, and freezes the
filesystem. Consequently, the sync function never finds any dirty data and
so it doesn't block (sync doesn't writeback ATIME change, I don't know
why).

I made this patch that fixes the quota issue and possible sync issues. It
introduces a function down_read_s_umount(sb); --- this function takes
s_umount lock for read, but it waits if the filesystem is suspended.

There are three more potentially unsafe users: fs/cachefiles/interface.c,
fs/nilfs2/ioctl.c, fs/ubifs/budget.c - I didn't fix them because I can't
test them.

Mikulas

---

Fix a s_umount deadlock

The lock s_umount is taken for write and dropped when we freeze and resume
a block device.

Consequently, if some code holds s_umount and performs an I/O, a deadlock may
happen if the device is suspended.

Unfortunatelly, it seems that developers are not aware of this restriction
that I/O must not be peformed with s_umount held.

These deadlocks were observed:
* lvcreate process: sys_ioctl -> do_vfs_ioctl -> vfs_ioctl -> dm_ctl_ioctl ->
ctl_ioctl -> dev_suspend -> dm_resume -> unlock_fs -> thaw_bdev ->
call_rwsem_down_write_failed (the process is trying to resume frozen device,
but it is waiting trying to acquire s_umount)
* quota: sys_quotactl -> do_quotactl -> vfs_get_dqblk -> dqget ->
ext4_acquire_dquot -> ext4_journal_start_sb -> jbd2_journal_start ->
start_this_handle (the process is holding s_umount and trying to perform
a transaction, waiting for the device being unfrozen)

* iozone process: sys_sync -> sync_filesystems -> __sync_filesystem ->
writeback_inodes_sb -> writeback_inodes_sb_nr -> wait_for_completion
(the process is holding s_umount for read and trying to perform some I/O)
* flush-253:3: kthread -> bdi_start_fn -> bdi_writeback_task ->
wb_do_writeback -> wb_writeback -> writeback_sb_inodes ->
writeback_single_inode -> do_writepages -> ext4_da_writepages ->
ext4_journal_start_sb -> jbd2_journal_start -> start_this_handle
(holding s_umount for read too, acquired in writeback_inodes_wb,
and trying to perform I/O)
* lvcreate process: sys_ioctl -> do_vfs_ioctl -> vfs_ioctl -> dm_ctl_ioctl ->
ctl_ioctl -> dev_suspend -> dm_resume -> unlock_fs -> thaw_bdev ->
call_rwsem_down_write_failed (just like in a previous case: trying to
resume frozen device, waiting on s_umount)

This patch fixes these observed code paths:
* introduce a function to safely take s_unlock - down_read_s_umount. It takes
the lock, check if a filesystem is frozen, if it is, drops the lock and
waits for unfreeze.
* get_super function has a parameter, if the parameter is true, it waits for
the device to unfreeze (it fixes quota deadlock and a possible deadlock in
fsync_bdev and __invalidate_device)
* the same for iterate_supers (it fixes the sync deadlock and a possible
deadlock in drop_caches_sysctl_handler)
* grab_super_passive fails if the device is suspended (fixes the inode writeback
deadlock and a possible deadlock in prune_super)

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
CC: stable@kernel.org

---
fs/block_dev.c | 6 +++---
fs/buffer.c | 2 +-
fs/drop_caches.c | 2 +-
fs/fs-writeback.c | 8 ++++++++
fs/quota/quota.c | 4 ++--
fs/super.c | 29 ++++++++++++++++++++---------
fs/sync.c | 4 ++--
include/linux/fs.h | 28 +++++++++++++++++++++++++---
security/selinux/hooks.c | 2 +-
9 files changed, 63 insertions(+), 22 deletions(-)

Index: linux-3.2-rc3-fast/fs/quota/quota.c
================================================== =================
--- linux-3.2-rc3-fast.orig/fs/quota/quota.c 2011-11-25 05:51:13.000000000 +0100
+++ linux-3.2-rc3-fast/fs/quota/quota.c 2011-11-25 05:51:36.000000000 +0100
@@ -59,7 +59,7 @@ static int quota_sync_all(int type)
return -EINVAL;
ret = security_quotactl(Q_SYNC, type, 0, NULL);
if (!ret)
- iterate_supers(quota_sync_one, &type);
+ iterate_supers(quota_sync_one, &type, true);
return ret;
}

@@ -310,7 +310,7 @@ static struct super_block *quotactl_bloc
putname(tmp);
if (IS_ERR(bdev))
return ERR_CAST(bdev);
- sb = get_super(bdev);
+ sb = get_super(bdev, true);
bdput(bdev);
if (!sb)
return ERR_PTR(-ENODEV);
Index: linux-3.2-rc3-fast/include/linux/fs.h
================================================== =================
--- linux-3.2-rc3-fast.orig/include/linux/fs.h 2011-11-25 05:51:13.000000000 +0100
+++ linux-3.2-rc3-fast/include/linux/fs.h 2011-11-25 05:56:03.000000000 +0100
@@ -10,6 +10,7 @@
#include <linux/ioctl.h>
#include <linux/blk_types.h>
#include <linux/types.h>
+#include <linux/sched.h>

/*
* It's silly to have NR_OPEN bigger than NR_FILE, but you can change
@@ -1502,6 +1503,26 @@ enum {
wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)))

/*
+ * Take s_umount and make sure that the filesystem is not frozen.
+ * This function must be used if we intend to perform any I/O while
+ * holding s_umount.
+ *
+ * s_umount is taken for write when resuming a frozen filesystem,
+ * thus if someone calls down_read(&s->s_umount) and performs any I/O,
+ * it may deadlock.
+ */
+static inline void down_read_s_umount(struct super_block *sb)
+{
+retry:
+ down_read(&sb->s_umount);
+ if (unlikely(sb->s_frozen != SB_UNFROZEN)) {
+ up_write(&sb->s_umount);
+ vfs_check_frozen(sb, SB_FREEZE_WRITE);
+ goto retry;
+ }
+}
+
+/*
* until VFS tracks user namespaces for inodes, just make all files
* belong to init_user_ns
*/
@@ -2528,13 +2549,14 @@ extern int generic_block_fiemap(struct i
extern void get_filesystem(struct file_system_type *fs);
extern void put_filesystem(struct file_system_type *fs);
extern struct file_system_type *get_fs_type(const char *name);
-extern struct super_block *get_super(struct block_device *);
+extern struct super_block *get_super(struct block_device *, bool);
extern struct super_block *get_active_super(struct block_device *bdev);
extern struct super_block *user_get_super(dev_t);
extern void drop_super(struct super_block *sb);
-extern void iterate_supers(void (*)(struct super_block *, void *), void *);
+extern void iterate_supers(void (*)(struct super_block *, void *), void *, bool);
extern void iterate_supers_type(struct file_system_type *,
- void (*)(struct super_block *, void *), void *);
+ void (*)(struct super_block *, void *), void *,
+ bool);

extern int dcache_dir_open(struct inode *, struct file *);
extern int dcache_dir_close(struct inode *, struct file *);
Index: linux-3.2-rc3-fast/fs/buffer.c
================================================== =================
--- linux-3.2-rc3-fast.orig/fs/buffer.c 2011-11-25 05:51:13.000000000 +0100
+++ linux-3.2-rc3-fast/fs/buffer.c 2011-11-25 05:51:36.000000000 +0100
@@ -571,7 +571,7 @@ static void do_thaw_one(struct super_blo

static void do_thaw_all(struct work_struct *work)
{
- iterate_supers(do_thaw_one, NULL);
+ iterate_supers(do_thaw_one, NULL, false);
kfree(work);
printk(KERN_WARNING "Emergency Thaw complete
");
}
Index: linux-3.2-rc3-fast/fs/super.c
================================================== =================
--- linux-3.2-rc3-fast.orig/fs/super.c 2011-11-25 05:51:13.000000000 +0100
+++ linux-3.2-rc3-fast/fs/super.c 2011-11-29 00:16:01.000000000 +0100
@@ -337,7 +337,7 @@ bool grab_super_passive(struct super_blo
spin_unlock(&sb_lock);

if (down_read_trylock(&sb->s_umount)) {
- if (sb->s_root)
+ if (sb->s_frozen == SB_UNFROZEN && sb->s_root)
return true;
up_read(&sb->s_umount);
}
@@ -503,7 +503,7 @@ void sync_supers(void)
sb->s_count++;
spin_unlock(&sb_lock);

- down_read(&sb->s_umount);
+ down_read_s_umount(sb);
if (sb->s_root && sb->s_dirt)
sb->s_op->write_super(sb);
up_read(&sb->s_umount);
@@ -527,7 +527,8 @@ void sync_supers(void)
* Scans the superblock list and calls given function, passing it
* locked superblock and given argument.
*/
-void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
+void iterate_supers(void (*f)(struct super_block *, void *), void *arg,
+ bool wait_for_unfreeze)
{
struct super_block *sb, *p = NULL;

@@ -538,7 +539,10 @@ void iterate_supers(void (*f)(struct sup
sb->s_count++;
spin_unlock(&sb_lock);

- down_read(&sb->s_umount);
+ if (!wait_for_unfreeze)
+ down_read(&sb->s_umount);
+ else
+ down_read_s_umount(sb);
if (sb->s_root)
f(sb, arg);
up_read(&sb->s_umount);
@@ -563,7 +567,8 @@ void iterate_supers(void (*f)(struct sup
* locked superblock and given argument.
*/
void iterate_supers_type(struct file_system_type *type,
- void (*f)(struct super_block *, void *), void *arg)
+ void (*f)(struct super_block *, void *), void *arg,
+ bool wait_for_unfreeze)
{
struct super_block *sb, *p = NULL;

@@ -572,7 +577,10 @@ void iterate_supers_type(struct file_sys
sb->s_count++;
spin_unlock(&sb_lock);

- down_read(&sb->s_umount);
+ if (!wait_for_unfreeze)
+ down_read(&sb->s_umount);
+ else
+ down_read_s_umount(sb);
if (sb->s_root)
f(sb, arg);
up_read(&sb->s_umount);
@@ -597,7 +605,7 @@ EXPORT_SYMBOL(iterate_supers_type);
* mounted on the device given. %NULL is returned if no match is found.
*/

-struct super_block *get_super(struct block_device *bdev)
+struct super_block *get_super(struct block_device *bdev, bool wait_for_unfreeze)
{
struct super_block *sb;

@@ -612,7 +620,10 @@ rescan:
if (sb->s_bdev == bdev) {
sb->s_count++;
spin_unlock(&sb_lock);
- down_read(&sb->s_umount);
+ if (wait_for_unfreeze)
+ down_read_s_umount(sb);
+ else
+ down_read(&sb->s_umount);
/* still alive? */
if (sb->s_root)
return sb;
@@ -672,7 +683,7 @@ rescan:
if (sb->s_dev == dev) {
sb->s_count++;
spin_unlock(&sb_lock);
- down_read(&sb->s_umount);
+ down_read_s_umount(sb);
/* still alive? */
if (sb->s_root)
return sb;
Index: linux-3.2-rc3-fast/fs/sync.c
================================================== =================
--- linux-3.2-rc3-fast.orig/fs/sync.c 2011-11-25 05:51:13.000000000 +0100
+++ linux-3.2-rc3-fast/fs/sync.c 2011-11-25 05:51:36.000000000 +0100
@@ -89,7 +89,7 @@ static void sync_one_sb(struct super_blo
*/
static void sync_filesystems(int wait)
{
- iterate_supers(sync_one_sb, &wait);
+ iterate_supers(sync_one_sb, &wait, true);
}

/*
@@ -144,7 +144,7 @@ SYSCALL_DEFINE1(syncfs, int, fd)
return -EBADF;
sb = file->f_dentry->d_sb;

- down_read(&sb->s_umount);
+ down_read_s_umount(sb);
ret = sync_filesystem(sb);
up_read(&sb->s_umount);

Index: linux-3.2-rc3-fast/fs/block_dev.c
================================================== =================
--- linux-3.2-rc3-fast.orig/fs/block_dev.c 2011-11-25 05:51:13.000000000 +0100
+++ linux-3.2-rc3-fast/fs/block_dev.c 2011-11-25 05:51:36.000000000 +0100
@@ -222,7 +222,7 @@ EXPORT_SYMBOL(sync_blockdev);
*/
int fsync_bdev(struct block_device *bdev)
{
- struct super_block *sb = get_super(bdev);
+ struct super_block *sb = get_super(bdev, true);
if (sb) {
int res = sync_filesystem(sb);
drop_super(sb);
@@ -256,7 +256,7 @@ struct super_block *freeze_bdev(struct b
* to freeze_bdev grab an active reference and only the last
* thaw_bdev drops it.
*/
- sb = get_super(bdev);
+ sb = get_super(bdev, false);
drop_super(sb);
mutex_unlock(&bdev->bd_fsfreeze_mutex);
return sb;
@@ -1658,7 +1658,7 @@ EXPORT_SYMBOL(lookup_bdev);

int __invalidate_device(struct block_device *bdev, bool kill_dirty)
{
- struct super_block *sb = get_super(bdev);
+ struct super_block *sb = get_super(bdev, true);
int res = 0;

if (sb) {
Index: linux-3.2-rc3-fast/fs/drop_caches.c
================================================== =================
--- linux-3.2-rc3-fast.orig/fs/drop_caches.c 2011-11-25 05:51:13.000000000 +0100
+++ linux-3.2-rc3-fast/fs/drop_caches.c 2011-11-25 05:51:36.000000000 +0100
@@ -59,7 +59,7 @@ int drop_caches_sysctl_handler(ctl_table
return ret;
if (write) {
if (sysctl_drop_caches & 1)
- iterate_supers(drop_pagecache_sb, NULL);
+ iterate_supers(drop_pagecache_sb, NULL, true);
if (sysctl_drop_caches & 2)
drop_slab();
}
Index: linux-3.2-rc3-fast/security/selinux/hooks.c
================================================== =================
--- linux-3.2-rc3-fast.orig/security/selinux/hooks.c 2011-11-25 05:51:13.000000000 +0100
+++ linux-3.2-rc3-fast/security/selinux/hooks.c 2011-11-25 05:51:36.000000000 +0100
@@ -5693,7 +5693,7 @@ void selinux_complete_init(void)

/* Set up any superblocks initialized prior to the policy load. */
printk(KERN_DEBUG "SELinux: Setting up existing superblocks.
");
- iterate_supers(delayed_superblock_init, NULL);
+ iterate_supers(delayed_superblock_init, NULL, true);
}

/* SELinux requires early initialization in order to label
Index: linux-3.2-rc3-fast/fs/fs-writeback.c
================================================== =================
--- linux-3.2-rc3-fast.orig/fs/fs-writeback.c 2011-11-29 00:09:30.000000000 +0100
+++ linux-3.2-rc3-fast/fs/fs-writeback.c 2011-11-29 00:12:49.000000000 +0100
@@ -1273,6 +1273,10 @@ int writeback_inodes_sb_if_idle(struct s
{
if (!writeback_in_progress(sb->s_bdi)) {
down_read(&sb->s_umount);
+ if (unlikely(sb->s_frozen != SB_UNFROZEN)) {
+ up_read(&sb->s_umount);
+ return 0;
+ }
writeback_inodes_sb(sb, reason);
up_read(&sb->s_umount);
return 1;
@@ -1295,6 +1299,10 @@ int writeback_inodes_sb_nr_if_idle(struc
{
if (!writeback_in_progress(sb->s_bdi)) {
down_read(&sb->s_umount);
+ if (unlikely(sb->s_frozen != SB_UNFROZEN)) {
+ up_read(&sb->s_umount);
+ return 0;
+ }
writeback_inodes_sb_nr(sb, nr, reason);
up_read(&sb->s_umount);
return 1;

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 11-29-2011, 09:19 AM
Jan Kara
 
Default deadlock with suspend and quotas

Hi,

On Mon 28-11-11 18:32:18, Mikulas Patocka wrote:
> > Hi
> >
> > Where can I get that patch set?
> >
> > We are experiencing other similar deadlocks on RHEL-6, caused by sync or
> > background writeback (these code paths take s_umount and wait trying to do
> > I/O), but I wasn't able to reproduce these deadlocks on upstream kernel?
> > Are there other known deadlock possibilities?
>
> I found some patch named "[RFC PATCH 1/3] VFS: Fix s_umount thaw/write
> deadlock" (I couldn't find the next two parts of the patch in the
> archives). And the patch looks wrong:
Yes, that seems to be the series. I generally agree with you that the
last iteration still had some problems and some changes were requested.
That's why it's not merged yet after all...

> - down_read_trylock(&sb->s_umount) doesn't fix anything. The lock is not
> held when the filesystem is frozen and it is taken for write when thawing.
> Consequently, any task can succeed with down_read_trylock(&sb->s_umount)
> on a frozen filesystem and if this tasks attempts to do an I/O that is
> waiting for thaw, it may still deadlock.
Agreed.

> - skipping sync on frozen filesystem violates sync semantics.
> Applications, such as databases, assume that when sync finishes, data were
> written to stable storage. If we skip sync when the filesystem is frozen,
> we can cause data corruption in these applications (if the system crashes
> after we skipped a sync).
Here I don't agree. Filesystem must guarantee there are no dirty data on
a frozen filesystem. Ext4 and XFS do this, ext3 would need proper
page_mkwrite() implementation for this but that's the problem of ext3, not
freezing code in general. If there are no dirty data, sync code (and also
flusher thread) is free to return without doing anything.

That being said, it is hard to implement freeze handling in page_mkwrite()
in such a way that there would be no dirty pages (but we know there cannot
be dirty data in such pages). Currently we mark the page dirty during page
fault and wait for frozen filesystem only after that so that we are
guaranteed that either freezing code will wait for page fault to finish and
will write the page or page fault code notices that freezing is in progress
and blocks (see fs/buffer.c:block_page_mkwrite()).

So I believe the consensus was that we should not block sync or flusher
thread on frozen filesystem. Firstly, it's kind of ugly from user
perspective (you cannot sync filesystems on your system while one
filesystem is frozen???), secondly, in case of flusher thread it has some
serious implications if there are more filesystems on the same device - you
would effectively stop any writeback to the device possibly hanging the
whole system due to dirty limit being exceeded. So at least in these two
cases we should just ignore frozen filesystem.

> - I'm not sure what userspace quota subsystem will do if we start
> returning -EBUSY spuriously.
Quota tools will complain to the user which would be fine I think. But
blocking is fine as well. In this particular case I don't care much but it
should be consistent with what happens to sync. So probably Q_SYNC command
should ignore frozen filesystem, Q_SETQUOTA or Q_SETINFO should block.

> There is another thing --- I wasn't able to reproduce these sync-related
> deadlocks at all. Did anyone succeeded in reproducing them? Are there any
> reported deadlocks? When freezing the ext3 or ext4 filesystem, the kernel
> prevents creating new dirty data, syncs all data, and freezes the
> filesystem. Consequently, the sync function never finds any dirty data and
> so it doesn't block (sync doesn't writeback ATIME change, I don't know
> why).
See above why sync can actually spot some dirty inodes/page (although
there is not any dirty data). Surbhi (added to CC) from Canonical could
actually trigger these races and consequent deadlocks (and I belive some of
their customers as well). Also some RH customers were hitting these
deadlocks (Eric Sandeen was handling those bugs AFAIK) but those were made
happy by my changes to block_page_mkwrite() which made the race window much
narrower.

> I made this patch that fixes the quota issue and possible sync issues. It
> introduces a function down_read_s_umount(sb); --- this function takes
> s_umount lock for read, but it waits if the filesystem is suspended.
>
> There are three more potentially unsafe users: fs/cachefiles/interface.c,
> fs/nilfs2/ioctl.c, fs/ubifs/budget.c - I didn't fix them because I can't
> test them.
>
> Mikulas
>
> ---
>
> Fix a s_umount deadlock
>
> The lock s_umount is taken for write and dropped when we freeze and resume
> a block device.
>
> Consequently, if some code holds s_umount and performs an I/O, a deadlock may
> happen if the device is suspended.
>
> Unfortunatelly, it seems that developers are not aware of this restriction
> that I/O must not be peformed with s_umount held.
>
> These deadlocks were observed:
> * lvcreate process: sys_ioctl -> do_vfs_ioctl -> vfs_ioctl -> dm_ctl_ioctl ->
> ctl_ioctl -> dev_suspend -> dm_resume -> unlock_fs -> thaw_bdev ->
> call_rwsem_down_write_failed (the process is trying to resume frozen device,
> but it is waiting trying to acquire s_umount)
> * quota: sys_quotactl -> do_quotactl -> vfs_get_dqblk -> dqget ->
> ext4_acquire_dquot -> ext4_journal_start_sb -> jbd2_journal_start ->
> start_this_handle (the process is holding s_umount and trying to perform
> a transaction, waiting for the device being unfrozen)
>
> * iozone process: sys_sync -> sync_filesystems -> __sync_filesystem ->
> writeback_inodes_sb -> writeback_inodes_sb_nr -> wait_for_completion
> (the process is holding s_umount for read and trying to perform some I/O)
> * flush-253:3: kthread -> bdi_start_fn -> bdi_writeback_task ->
> wb_do_writeback -> wb_writeback -> writeback_sb_inodes ->
> writeback_single_inode -> do_writepages -> ext4_da_writepages ->
> ext4_journal_start_sb -> jbd2_journal_start -> start_this_handle
> (holding s_umount for read too, acquired in writeback_inodes_wb,
> and trying to perform I/O)
> * lvcreate process: sys_ioctl -> do_vfs_ioctl -> vfs_ioctl -> dm_ctl_ioctl ->
> ctl_ioctl -> dev_suspend -> dm_resume -> unlock_fs -> thaw_bdev ->
> call_rwsem_down_write_failed (just like in a previous case: trying to
> resume frozen device, waiting on s_umount)
>
> This patch fixes these observed code paths:
> * introduce a function to safely take s_unlock - down_read_s_umount. It takes
> the lock, check if a filesystem is frozen, if it is, drops the lock and
> waits for unfreeze.
> * get_super function has a parameter, if the parameter is true, it waits for
> the device to unfreeze (it fixes quota deadlock and a possible deadlock in
> fsync_bdev and __invalidate_device)
> * the same for iterate_supers (it fixes the sync deadlock and a possible
> deadlock in drop_caches_sysctl_handler)
> * grab_super_passive fails if the device is suspended (fixes the inode writeback
> deadlock and a possible deadlock in prune_super)
>
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> CC: stable@kernel.org
>
> ---
> fs/block_dev.c | 6 +++---
> fs/buffer.c | 2 +-
> fs/drop_caches.c | 2 +-
> fs/fs-writeback.c | 8 ++++++++
> fs/quota/quota.c | 4 ++--
> fs/super.c | 29 ++++++++++++++++++++---------
> fs/sync.c | 4 ++--
> include/linux/fs.h | 28 +++++++++++++++++++++++++---
> security/selinux/hooks.c | 2 +-
> 9 files changed, 63 insertions(+), 22 deletions(-)
>
> Index: linux-3.2-rc3-fast/fs/quota/quota.c
> ================================================== =================
> --- linux-3.2-rc3-fast.orig/fs/quota/quota.c 2011-11-25 05:51:13.000000000 +0100
> +++ linux-3.2-rc3-fast/fs/quota/quota.c 2011-11-25 05:51:36.000000000 +0100
> @@ -59,7 +59,7 @@ static int quota_sync_all(int type)
> return -EINVAL;
> ret = security_quotactl(Q_SYNC, type, 0, NULL);
> if (!ret)
> - iterate_supers(quota_sync_one, &type);
> + iterate_supers(quota_sync_one, &type, true);
> return ret;
> }
>
> @@ -310,7 +310,7 @@ static struct super_block *quotactl_bloc
> putname(tmp);
> if (IS_ERR(bdev))
> return ERR_CAST(bdev);
> - sb = get_super(bdev);
> + sb = get_super(bdev, true);
> bdput(bdev);
> if (!sb)
> return ERR_PTR(-ENODEV);
> Index: linux-3.2-rc3-fast/include/linux/fs.h
> ================================================== =================
> --- linux-3.2-rc3-fast.orig/include/linux/fs.h 2011-11-25 05:51:13.000000000 +0100
> +++ linux-3.2-rc3-fast/include/linux/fs.h 2011-11-25 05:56:03.000000000 +0100
> @@ -10,6 +10,7 @@
> #include <linux/ioctl.h>
> #include <linux/blk_types.h>
> #include <linux/types.h>
> +#include <linux/sched.h>
>
> /*
> * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
> @@ -1502,6 +1503,26 @@ enum {
> wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)))
>
> /*
> + * Take s_umount and make sure that the filesystem is not frozen.
> + * This function must be used if we intend to perform any I/O while
> + * holding s_umount.
> + *
> + * s_umount is taken for write when resuming a frozen filesystem,
> + * thus if someone calls down_read(&s->s_umount) and performs any I/O,
> + * it may deadlock.
> + */
> +static inline void down_read_s_umount(struct super_block *sb)
> +{
> +retry:
> + down_read(&sb->s_umount);
> + if (unlikely(sb->s_frozen != SB_UNFROZEN)) {
> + up_write(&sb->s_umount);
> + vfs_check_frozen(sb, SB_FREEZE_WRITE);
> + goto retry;
> + }
> +}
> +
> +/*
> * until VFS tracks user namespaces for inodes, just make all files
> * belong to init_user_ns
> */
> @@ -2528,13 +2549,14 @@ extern int generic_block_fiemap(struct i
> extern void get_filesystem(struct file_system_type *fs);
> extern void put_filesystem(struct file_system_type *fs);
> extern struct file_system_type *get_fs_type(const char *name);
> -extern struct super_block *get_super(struct block_device *);
> +extern struct super_block *get_super(struct block_device *, bool);
> extern struct super_block *get_active_super(struct block_device *bdev);
> extern struct super_block *user_get_super(dev_t);
> extern void drop_super(struct super_block *sb);
> -extern void iterate_supers(void (*)(struct super_block *, void *), void *);
> +extern void iterate_supers(void (*)(struct super_block *, void *), void *, bool);
> extern void iterate_supers_type(struct file_system_type *,
> - void (*)(struct super_block *, void *), void *);
> + void (*)(struct super_block *, void *), void *,
> + bool);
>
> extern int dcache_dir_open(struct inode *, struct file *);
> extern int dcache_dir_close(struct inode *, struct file *);
> Index: linux-3.2-rc3-fast/fs/buffer.c
> ================================================== =================
> --- linux-3.2-rc3-fast.orig/fs/buffer.c 2011-11-25 05:51:13.000000000 +0100
> +++ linux-3.2-rc3-fast/fs/buffer.c 2011-11-25 05:51:36.000000000 +0100
> @@ -571,7 +571,7 @@ static void do_thaw_one(struct super_blo
>
> static void do_thaw_all(struct work_struct *work)
> {
> - iterate_supers(do_thaw_one, NULL);
> + iterate_supers(do_thaw_one, NULL, false);
> kfree(work);
> printk(KERN_WARNING "Emergency Thaw complete
");
> }
> Index: linux-3.2-rc3-fast/fs/super.c
> ================================================== =================
> --- linux-3.2-rc3-fast.orig/fs/super.c 2011-11-25 05:51:13.000000000 +0100
> +++ linux-3.2-rc3-fast/fs/super.c 2011-11-29 00:16:01.000000000 +0100
> @@ -337,7 +337,7 @@ bool grab_super_passive(struct super_blo
> spin_unlock(&sb_lock);
>
> if (down_read_trylock(&sb->s_umount)) {
> - if (sb->s_root)
> + if (sb->s_frozen == SB_UNFROZEN && sb->s_root)
> return true;
> up_read(&sb->s_umount);
> }
> @@ -503,7 +503,7 @@ void sync_supers(void)
> sb->s_count++;
> spin_unlock(&sb_lock);
>
> - down_read(&sb->s_umount);
> + down_read_s_umount(sb);
> if (sb->s_root && sb->s_dirt)
> sb->s_op->write_super(sb);
> up_read(&sb->s_umount);
> @@ -527,7 +527,8 @@ void sync_supers(void)
> * Scans the superblock list and calls given function, passing it
> * locked superblock and given argument.
> */
> -void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
> +void iterate_supers(void (*f)(struct super_block *, void *), void *arg,
> + bool wait_for_unfreeze)
> {
> struct super_block *sb, *p = NULL;
>
> @@ -538,7 +539,10 @@ void iterate_supers(void (*f)(struct sup
> sb->s_count++;
> spin_unlock(&sb_lock);
>
> - down_read(&sb->s_umount);
> + if (!wait_for_unfreeze)
> + down_read(&sb->s_umount);
> + else
> + down_read_s_umount(sb);
> if (sb->s_root)
> f(sb, arg);
> up_read(&sb->s_umount);
> @@ -563,7 +567,8 @@ void iterate_supers(void (*f)(struct sup
> * locked superblock and given argument.
> */
> void iterate_supers_type(struct file_system_type *type,
> - void (*f)(struct super_block *, void *), void *arg)
> + void (*f)(struct super_block *, void *), void *arg,
> + bool wait_for_unfreeze)
> {
> struct super_block *sb, *p = NULL;
>
> @@ -572,7 +577,10 @@ void iterate_supers_type(struct file_sys
> sb->s_count++;
> spin_unlock(&sb_lock);
>
> - down_read(&sb->s_umount);
> + if (!wait_for_unfreeze)
> + down_read(&sb->s_umount);
> + else
> + down_read_s_umount(sb);
> if (sb->s_root)
> f(sb, arg);
> up_read(&sb->s_umount);
> @@ -597,7 +605,7 @@ EXPORT_SYMBOL(iterate_supers_type);
> * mounted on the device given. %NULL is returned if no match is found.
> */
>
> -struct super_block *get_super(struct block_device *bdev)
> +struct super_block *get_super(struct block_device *bdev, bool wait_for_unfreeze)
> {
> struct super_block *sb;
>
> @@ -612,7 +620,10 @@ rescan:
> if (sb->s_bdev == bdev) {
> sb->s_count++;
> spin_unlock(&sb_lock);
> - down_read(&sb->s_umount);
> + if (wait_for_unfreeze)
> + down_read_s_umount(sb);
> + else
> + down_read(&sb->s_umount);
> /* still alive? */
> if (sb->s_root)
> return sb;
> @@ -672,7 +683,7 @@ rescan:
> if (sb->s_dev == dev) {
> sb->s_count++;
> spin_unlock(&sb_lock);
> - down_read(&sb->s_umount);
> + down_read_s_umount(sb);
> /* still alive? */
> if (sb->s_root)
> return sb;
> Index: linux-3.2-rc3-fast/fs/sync.c
> ================================================== =================
> --- linux-3.2-rc3-fast.orig/fs/sync.c 2011-11-25 05:51:13.000000000 +0100
> +++ linux-3.2-rc3-fast/fs/sync.c 2011-11-25 05:51:36.000000000 +0100
> @@ -89,7 +89,7 @@ static void sync_one_sb(struct super_blo
> */
> static void sync_filesystems(int wait)
> {
> - iterate_supers(sync_one_sb, &wait);
> + iterate_supers(sync_one_sb, &wait, true);
> }
>
> /*
> @@ -144,7 +144,7 @@ SYSCALL_DEFINE1(syncfs, int, fd)
> return -EBADF;
> sb = file->f_dentry->d_sb;
>
> - down_read(&sb->s_umount);
> + down_read_s_umount(sb);
> ret = sync_filesystem(sb);
> up_read(&sb->s_umount);
>
> Index: linux-3.2-rc3-fast/fs/block_dev.c
> ================================================== =================
> --- linux-3.2-rc3-fast.orig/fs/block_dev.c 2011-11-25 05:51:13.000000000 +0100
> +++ linux-3.2-rc3-fast/fs/block_dev.c 2011-11-25 05:51:36.000000000 +0100
> @@ -222,7 +222,7 @@ EXPORT_SYMBOL(sync_blockdev);
> */
> int fsync_bdev(struct block_device *bdev)
> {
> - struct super_block *sb = get_super(bdev);
> + struct super_block *sb = get_super(bdev, true);
> if (sb) {
> int res = sync_filesystem(sb);
> drop_super(sb);
> @@ -256,7 +256,7 @@ struct super_block *freeze_bdev(struct b
> * to freeze_bdev grab an active reference and only the last
> * thaw_bdev drops it.
> */
> - sb = get_super(bdev);
> + sb = get_super(bdev, false);
> drop_super(sb);
> mutex_unlock(&bdev->bd_fsfreeze_mutex);
> return sb;
> @@ -1658,7 +1658,7 @@ EXPORT_SYMBOL(lookup_bdev);
>
> int __invalidate_device(struct block_device *bdev, bool kill_dirty)
> {
> - struct super_block *sb = get_super(bdev);
> + struct super_block *sb = get_super(bdev, true);
> int res = 0;
>
> if (sb) {
> Index: linux-3.2-rc3-fast/fs/drop_caches.c
> ================================================== =================
> --- linux-3.2-rc3-fast.orig/fs/drop_caches.c 2011-11-25 05:51:13.000000000 +0100
> +++ linux-3.2-rc3-fast/fs/drop_caches.c 2011-11-25 05:51:36.000000000 +0100
> @@ -59,7 +59,7 @@ int drop_caches_sysctl_handler(ctl_table
> return ret;
> if (write) {
> if (sysctl_drop_caches & 1)
> - iterate_supers(drop_pagecache_sb, NULL);
> + iterate_supers(drop_pagecache_sb, NULL, true);
> if (sysctl_drop_caches & 2)
> drop_slab();
> }
> Index: linux-3.2-rc3-fast/security/selinux/hooks.c
> ================================================== =================
> --- linux-3.2-rc3-fast.orig/security/selinux/hooks.c 2011-11-25 05:51:13.000000000 +0100
> +++ linux-3.2-rc3-fast/security/selinux/hooks.c 2011-11-25 05:51:36.000000000 +0100
> @@ -5693,7 +5693,7 @@ void selinux_complete_init(void)
>
> /* Set up any superblocks initialized prior to the policy load. */
> printk(KERN_DEBUG "SELinux: Setting up existing superblocks.
");
> - iterate_supers(delayed_superblock_init, NULL);
> + iterate_supers(delayed_superblock_init, NULL, true);
> }
>
> /* SELinux requires early initialization in order to label
> Index: linux-3.2-rc3-fast/fs/fs-writeback.c
> ================================================== =================
> --- linux-3.2-rc3-fast.orig/fs/fs-writeback.c 2011-11-29 00:09:30.000000000 +0100
> +++ linux-3.2-rc3-fast/fs/fs-writeback.c 2011-11-29 00:12:49.000000000 +0100
> @@ -1273,6 +1273,10 @@ int writeback_inodes_sb_if_idle(struct s
> {
> if (!writeback_in_progress(sb->s_bdi)) {
> down_read(&sb->s_umount);
> + if (unlikely(sb->s_frozen != SB_UNFROZEN)) {
> + up_read(&sb->s_umount);
> + return 0;
> + }
> writeback_inodes_sb(sb, reason);
> up_read(&sb->s_umount);
> return 1;
> @@ -1295,6 +1299,10 @@ int writeback_inodes_sb_nr_if_idle(struc
> {
> if (!writeback_in_progress(sb->s_bdi)) {
> down_read(&sb->s_umount);
> + if (unlikely(sb->s_frozen != SB_UNFROZEN)) {
> + up_read(&sb->s_umount);
> + return 0;
> + }
> writeback_inodes_sb_nr(sb, nr, reason);
> up_read(&sb->s_umount);
> return 1;
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 11-29-2011, 09:21 AM
Jan Kara
 
Default deadlock with suspend and quotas

Sorry for replying once more - forgot to add those CCs - so please reply
to this email...

On Tue 29-11-11 11:19:01, Jan Kara wrote:
> Hi,
>
> On Mon 28-11-11 18:32:18, Mikulas Patocka wrote:
> > > Hi
> > >
> > > Where can I get that patch set?
> > >
> > > We are experiencing other similar deadlocks on RHEL-6, caused by sync or
> > > background writeback (these code paths take s_umount and wait trying to do
> > > I/O), but I wasn't able to reproduce these deadlocks on upstream kernel?
> > > Are there other known deadlock possibilities?
> >
> > I found some patch named "[RFC PATCH 1/3] VFS: Fix s_umount thaw/write
> > deadlock" (I couldn't find the next two parts of the patch in the
> > archives). And the patch looks wrong:
> Yes, that seems to be the series. I generally agree with you that the
> last iteration still had some problems and some changes were requested.
> That's why it's not merged yet after all...
>
> > - down_read_trylock(&sb->s_umount) doesn't fix anything. The lock is not
> > held when the filesystem is frozen and it is taken for write when thawing.
> > Consequently, any task can succeed with down_read_trylock(&sb->s_umount)
> > on a frozen filesystem and if this tasks attempts to do an I/O that is
> > waiting for thaw, it may still deadlock.
> Agreed.
>
> > - skipping sync on frozen filesystem violates sync semantics.
> > Applications, such as databases, assume that when sync finishes, data were
> > written to stable storage. If we skip sync when the filesystem is frozen,
> > we can cause data corruption in these applications (if the system crashes
> > after we skipped a sync).
> Here I don't agree. Filesystem must guarantee there are no dirty data on
> a frozen filesystem. Ext4 and XFS do this, ext3 would need proper
> page_mkwrite() implementation for this but that's the problem of ext3, not
> freezing code in general. If there are no dirty data, sync code (and also
> flusher thread) is free to return without doing anything.
>
> That being said, it is hard to implement freeze handling in page_mkwrite()
> in such a way that there would be no dirty pages (but we know there cannot
> be dirty data in such pages). Currently we mark the page dirty during page
> fault and wait for frozen filesystem only after that so that we are
> guaranteed that either freezing code will wait for page fault to finish and
> will write the page or page fault code notices that freezing is in progress
> and blocks (see fs/buffer.c:block_page_mkwrite()).
>
> So I believe the consensus was that we should not block sync or flusher
> thread on frozen filesystem. Firstly, it's kind of ugly from user
> perspective (you cannot sync filesystems on your system while one
> filesystem is frozen???), secondly, in case of flusher thread it has some
> serious implications if there are more filesystems on the same device - you
> would effectively stop any writeback to the device possibly hanging the
> whole system due to dirty limit being exceeded. So at least in these two
> cases we should just ignore frozen filesystem.
>
> > - I'm not sure what userspace quota subsystem will do if we start
> > returning -EBUSY spuriously.
> Quota tools will complain to the user which would be fine I think. But
> blocking is fine as well. In this particular case I don't care much but it
> should be consistent with what happens to sync. So probably Q_SYNC command
> should ignore frozen filesystem, Q_SETQUOTA or Q_SETINFO should block.
>
> > There is another thing --- I wasn't able to reproduce these sync-related
> > deadlocks at all. Did anyone succeeded in reproducing them? Are there any
> > reported deadlocks? When freezing the ext3 or ext4 filesystem, the kernel
> > prevents creating new dirty data, syncs all data, and freezes the
> > filesystem. Consequently, the sync function never finds any dirty data and
> > so it doesn't block (sync doesn't writeback ATIME change, I don't know
> > why).
> See above why sync can actually spot some dirty inodes/page (although
> there is not any dirty data). Surbhi (added to CC) from Canonical could
> actually trigger these races and consequent deadlocks (and I belive some of
> their customers as well). Also some RH customers were hitting these
> deadlocks (Eric Sandeen was handling those bugs AFAIK) but those were made
> happy by my changes to block_page_mkwrite() which made the race window much
> narrower.
>
> > I made this patch that fixes the quota issue and possible sync issues. It
> > introduces a function down_read_s_umount(sb); --- this function takes
> > s_umount lock for read, but it waits if the filesystem is suspended.
> >
> > There are three more potentially unsafe users: fs/cachefiles/interface.c,
> > fs/nilfs2/ioctl.c, fs/ubifs/budget.c - I didn't fix them because I can't
> > test them.
> >
> > Mikulas
> >
> > ---
> >
> > Fix a s_umount deadlock
> >
> > The lock s_umount is taken for write and dropped when we freeze and resume
> > a block device.
> >
> > Consequently, if some code holds s_umount and performs an I/O, a deadlock may
> > happen if the device is suspended.
> >
> > Unfortunatelly, it seems that developers are not aware of this restriction
> > that I/O must not be peformed with s_umount held.
> >
> > These deadlocks were observed:
> > * lvcreate process: sys_ioctl -> do_vfs_ioctl -> vfs_ioctl -> dm_ctl_ioctl ->
> > ctl_ioctl -> dev_suspend -> dm_resume -> unlock_fs -> thaw_bdev ->
> > call_rwsem_down_write_failed (the process is trying to resume frozen device,
> > but it is waiting trying to acquire s_umount)
> > * quota: sys_quotactl -> do_quotactl -> vfs_get_dqblk -> dqget ->
> > ext4_acquire_dquot -> ext4_journal_start_sb -> jbd2_journal_start ->
> > start_this_handle (the process is holding s_umount and trying to perform
> > a transaction, waiting for the device being unfrozen)
> >
> > * iozone process: sys_sync -> sync_filesystems -> __sync_filesystem ->
> > writeback_inodes_sb -> writeback_inodes_sb_nr -> wait_for_completion
> > (the process is holding s_umount for read and trying to perform some I/O)
> > * flush-253:3: kthread -> bdi_start_fn -> bdi_writeback_task ->
> > wb_do_writeback -> wb_writeback -> writeback_sb_inodes ->
> > writeback_single_inode -> do_writepages -> ext4_da_writepages ->
> > ext4_journal_start_sb -> jbd2_journal_start -> start_this_handle
> > (holding s_umount for read too, acquired in writeback_inodes_wb,
> > and trying to perform I/O)
> > * lvcreate process: sys_ioctl -> do_vfs_ioctl -> vfs_ioctl -> dm_ctl_ioctl ->
> > ctl_ioctl -> dev_suspend -> dm_resume -> unlock_fs -> thaw_bdev ->
> > call_rwsem_down_write_failed (just like in a previous case: trying to
> > resume frozen device, waiting on s_umount)
> >
> > This patch fixes these observed code paths:
> > * introduce a function to safely take s_unlock - down_read_s_umount. It takes
> > the lock, check if a filesystem is frozen, if it is, drops the lock and
> > waits for unfreeze.
> > * get_super function has a parameter, if the parameter is true, it waits for
> > the device to unfreeze (it fixes quota deadlock and a possible deadlock in
> > fsync_bdev and __invalidate_device)
> > * the same for iterate_supers (it fixes the sync deadlock and a possible
> > deadlock in drop_caches_sysctl_handler)
> > * grab_super_passive fails if the device is suspended (fixes the inode writeback
> > deadlock and a possible deadlock in prune_super)
> >
> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > CC: stable@kernel.org
> >
> > ---
> > fs/block_dev.c | 6 +++---
> > fs/buffer.c | 2 +-
> > fs/drop_caches.c | 2 +-
> > fs/fs-writeback.c | 8 ++++++++
> > fs/quota/quota.c | 4 ++--
> > fs/super.c | 29 ++++++++++++++++++++---------
> > fs/sync.c | 4 ++--
> > include/linux/fs.h | 28 +++++++++++++++++++++++++---
> > security/selinux/hooks.c | 2 +-
> > 9 files changed, 63 insertions(+), 22 deletions(-)
> >
> > Index: linux-3.2-rc3-fast/fs/quota/quota.c
> > ================================================== =================
> > --- linux-3.2-rc3-fast.orig/fs/quota/quota.c 2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/quota/quota.c 2011-11-25 05:51:36.000000000 +0100
> > @@ -59,7 +59,7 @@ static int quota_sync_all(int type)
> > return -EINVAL;
> > ret = security_quotactl(Q_SYNC, type, 0, NULL);
> > if (!ret)
> > - iterate_supers(quota_sync_one, &type);
> > + iterate_supers(quota_sync_one, &type, true);
> > return ret;
> > }
> >
> > @@ -310,7 +310,7 @@ static struct super_block *quotactl_bloc
> > putname(tmp);
> > if (IS_ERR(bdev))
> > return ERR_CAST(bdev);
> > - sb = get_super(bdev);
> > + sb = get_super(bdev, true);
> > bdput(bdev);
> > if (!sb)
> > return ERR_PTR(-ENODEV);
> > Index: linux-3.2-rc3-fast/include/linux/fs.h
> > ================================================== =================
> > --- linux-3.2-rc3-fast.orig/include/linux/fs.h 2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/include/linux/fs.h 2011-11-25 05:56:03.000000000 +0100
> > @@ -10,6 +10,7 @@
> > #include <linux/ioctl.h>
> > #include <linux/blk_types.h>
> > #include <linux/types.h>
> > +#include <linux/sched.h>
> >
> > /*
> > * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
> > @@ -1502,6 +1503,26 @@ enum {
> > wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)))
> >
> > /*
> > + * Take s_umount and make sure that the filesystem is not frozen.
> > + * This function must be used if we intend to perform any I/O while
> > + * holding s_umount.
> > + *
> > + * s_umount is taken for write when resuming a frozen filesystem,
> > + * thus if someone calls down_read(&s->s_umount) and performs any I/O,
> > + * it may deadlock.
> > + */
> > +static inline void down_read_s_umount(struct super_block *sb)
> > +{
> > +retry:
> > + down_read(&sb->s_umount);
> > + if (unlikely(sb->s_frozen != SB_UNFROZEN)) {
> > + up_write(&sb->s_umount);
> > + vfs_check_frozen(sb, SB_FREEZE_WRITE);
> > + goto retry;
> > + }
> > +}
> > +
> > +/*
> > * until VFS tracks user namespaces for inodes, just make all files
> > * belong to init_user_ns
> > */
> > @@ -2528,13 +2549,14 @@ extern int generic_block_fiemap(struct i
> > extern void get_filesystem(struct file_system_type *fs);
> > extern void put_filesystem(struct file_system_type *fs);
> > extern struct file_system_type *get_fs_type(const char *name);
> > -extern struct super_block *get_super(struct block_device *);
> > +extern struct super_block *get_super(struct block_device *, bool);
> > extern struct super_block *get_active_super(struct block_device *bdev);
> > extern struct super_block *user_get_super(dev_t);
> > extern void drop_super(struct super_block *sb);
> > -extern void iterate_supers(void (*)(struct super_block *, void *), void *);
> > +extern void iterate_supers(void (*)(struct super_block *, void *), void *, bool);
> > extern void iterate_supers_type(struct file_system_type *,
> > - void (*)(struct super_block *, void *), void *);
> > + void (*)(struct super_block *, void *), void *,
> > + bool);
> >
> > extern int dcache_dir_open(struct inode *, struct file *);
> > extern int dcache_dir_close(struct inode *, struct file *);
> > Index: linux-3.2-rc3-fast/fs/buffer.c
> > ================================================== =================
> > --- linux-3.2-rc3-fast.orig/fs/buffer.c 2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/buffer.c 2011-11-25 05:51:36.000000000 +0100
> > @@ -571,7 +571,7 @@ static void do_thaw_one(struct super_blo
> >
> > static void do_thaw_all(struct work_struct *work)
> > {
> > - iterate_supers(do_thaw_one, NULL);
> > + iterate_supers(do_thaw_one, NULL, false);
> > kfree(work);
> > printk(KERN_WARNING "Emergency Thaw complete
");
> > }
> > Index: linux-3.2-rc3-fast/fs/super.c
> > ================================================== =================
> > --- linux-3.2-rc3-fast.orig/fs/super.c 2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/super.c 2011-11-29 00:16:01.000000000 +0100
> > @@ -337,7 +337,7 @@ bool grab_super_passive(struct super_blo
> > spin_unlock(&sb_lock);
> >
> > if (down_read_trylock(&sb->s_umount)) {
> > - if (sb->s_root)
> > + if (sb->s_frozen == SB_UNFROZEN && sb->s_root)
> > return true;
> > up_read(&sb->s_umount);
> > }
> > @@ -503,7 +503,7 @@ void sync_supers(void)
> > sb->s_count++;
> > spin_unlock(&sb_lock);
> >
> > - down_read(&sb->s_umount);
> > + down_read_s_umount(sb);
> > if (sb->s_root && sb->s_dirt)
> > sb->s_op->write_super(sb);
> > up_read(&sb->s_umount);
> > @@ -527,7 +527,8 @@ void sync_supers(void)
> > * Scans the superblock list and calls given function, passing it
> > * locked superblock and given argument.
> > */
> > -void iterate_supers(void (*f)(struct super_block *, void *), void *arg)
> > +void iterate_supers(void (*f)(struct super_block *, void *), void *arg,
> > + bool wait_for_unfreeze)
> > {
> > struct super_block *sb, *p = NULL;
> >
> > @@ -538,7 +539,10 @@ void iterate_supers(void (*f)(struct sup
> > sb->s_count++;
> > spin_unlock(&sb_lock);
> >
> > - down_read(&sb->s_umount);
> > + if (!wait_for_unfreeze)
> > + down_read(&sb->s_umount);
> > + else
> > + down_read_s_umount(sb);
> > if (sb->s_root)
> > f(sb, arg);
> > up_read(&sb->s_umount);
> > @@ -563,7 +567,8 @@ void iterate_supers(void (*f)(struct sup
> > * locked superblock and given argument.
> > */
> > void iterate_supers_type(struct file_system_type *type,
> > - void (*f)(struct super_block *, void *), void *arg)
> > + void (*f)(struct super_block *, void *), void *arg,
> > + bool wait_for_unfreeze)
> > {
> > struct super_block *sb, *p = NULL;
> >
> > @@ -572,7 +577,10 @@ void iterate_supers_type(struct file_sys
> > sb->s_count++;
> > spin_unlock(&sb_lock);
> >
> > - down_read(&sb->s_umount);
> > + if (!wait_for_unfreeze)
> > + down_read(&sb->s_umount);
> > + else
> > + down_read_s_umount(sb);
> > if (sb->s_root)
> > f(sb, arg);
> > up_read(&sb->s_umount);
> > @@ -597,7 +605,7 @@ EXPORT_SYMBOL(iterate_supers_type);
> > * mounted on the device given. %NULL is returned if no match is found.
> > */
> >
> > -struct super_block *get_super(struct block_device *bdev)
> > +struct super_block *get_super(struct block_device *bdev, bool wait_for_unfreeze)
> > {
> > struct super_block *sb;
> >
> > @@ -612,7 +620,10 @@ rescan:
> > if (sb->s_bdev == bdev) {
> > sb->s_count++;
> > spin_unlock(&sb_lock);
> > - down_read(&sb->s_umount);
> > + if (wait_for_unfreeze)
> > + down_read_s_umount(sb);
> > + else
> > + down_read(&sb->s_umount);
> > /* still alive? */
> > if (sb->s_root)
> > return sb;
> > @@ -672,7 +683,7 @@ rescan:
> > if (sb->s_dev == dev) {
> > sb->s_count++;
> > spin_unlock(&sb_lock);
> > - down_read(&sb->s_umount);
> > + down_read_s_umount(sb);
> > /* still alive? */
> > if (sb->s_root)
> > return sb;
> > Index: linux-3.2-rc3-fast/fs/sync.c
> > ================================================== =================
> > --- linux-3.2-rc3-fast.orig/fs/sync.c 2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/sync.c 2011-11-25 05:51:36.000000000 +0100
> > @@ -89,7 +89,7 @@ static void sync_one_sb(struct super_blo
> > */
> > static void sync_filesystems(int wait)
> > {
> > - iterate_supers(sync_one_sb, &wait);
> > + iterate_supers(sync_one_sb, &wait, true);
> > }
> >
> > /*
> > @@ -144,7 +144,7 @@ SYSCALL_DEFINE1(syncfs, int, fd)
> > return -EBADF;
> > sb = file->f_dentry->d_sb;
> >
> > - down_read(&sb->s_umount);
> > + down_read_s_umount(sb);
> > ret = sync_filesystem(sb);
> > up_read(&sb->s_umount);
> >
> > Index: linux-3.2-rc3-fast/fs/block_dev.c
> > ================================================== =================
> > --- linux-3.2-rc3-fast.orig/fs/block_dev.c 2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/block_dev.c 2011-11-25 05:51:36.000000000 +0100
> > @@ -222,7 +222,7 @@ EXPORT_SYMBOL(sync_blockdev);
> > */
> > int fsync_bdev(struct block_device *bdev)
> > {
> > - struct super_block *sb = get_super(bdev);
> > + struct super_block *sb = get_super(bdev, true);
> > if (sb) {
> > int res = sync_filesystem(sb);
> > drop_super(sb);
> > @@ -256,7 +256,7 @@ struct super_block *freeze_bdev(struct b
> > * to freeze_bdev grab an active reference and only the last
> > * thaw_bdev drops it.
> > */
> > - sb = get_super(bdev);
> > + sb = get_super(bdev, false);
> > drop_super(sb);
> > mutex_unlock(&bdev->bd_fsfreeze_mutex);
> > return sb;
> > @@ -1658,7 +1658,7 @@ EXPORT_SYMBOL(lookup_bdev);
> >
> > int __invalidate_device(struct block_device *bdev, bool kill_dirty)
> > {
> > - struct super_block *sb = get_super(bdev);
> > + struct super_block *sb = get_super(bdev, true);
> > int res = 0;
> >
> > if (sb) {
> > Index: linux-3.2-rc3-fast/fs/drop_caches.c
> > ================================================== =================
> > --- linux-3.2-rc3-fast.orig/fs/drop_caches.c 2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/drop_caches.c 2011-11-25 05:51:36.000000000 +0100
> > @@ -59,7 +59,7 @@ int drop_caches_sysctl_handler(ctl_table
> > return ret;
> > if (write) {
> > if (sysctl_drop_caches & 1)
> > - iterate_supers(drop_pagecache_sb, NULL);
> > + iterate_supers(drop_pagecache_sb, NULL, true);
> > if (sysctl_drop_caches & 2)
> > drop_slab();
> > }
> > Index: linux-3.2-rc3-fast/security/selinux/hooks.c
> > ================================================== =================
> > --- linux-3.2-rc3-fast.orig/security/selinux/hooks.c 2011-11-25 05:51:13.000000000 +0100
> > +++ linux-3.2-rc3-fast/security/selinux/hooks.c 2011-11-25 05:51:36.000000000 +0100
> > @@ -5693,7 +5693,7 @@ void selinux_complete_init(void)
> >
> > /* Set up any superblocks initialized prior to the policy load. */
> > printk(KERN_DEBUG "SELinux: Setting up existing superblocks.
");
> > - iterate_supers(delayed_superblock_init, NULL);
> > + iterate_supers(delayed_superblock_init, NULL, true);
> > }
> >
> > /* SELinux requires early initialization in order to label
> > Index: linux-3.2-rc3-fast/fs/fs-writeback.c
> > ================================================== =================
> > --- linux-3.2-rc3-fast.orig/fs/fs-writeback.c 2011-11-29 00:09:30.000000000 +0100
> > +++ linux-3.2-rc3-fast/fs/fs-writeback.c 2011-11-29 00:12:49.000000000 +0100
> > @@ -1273,6 +1273,10 @@ int writeback_inodes_sb_if_idle(struct s
> > {
> > if (!writeback_in_progress(sb->s_bdi)) {
> > down_read(&sb->s_umount);
> > + if (unlikely(sb->s_frozen != SB_UNFROZEN)) {
> > + up_read(&sb->s_umount);
> > + return 0;
> > + }
> > writeback_inodes_sb(sb, reason);
> > up_read(&sb->s_umount);
> > return 1;
> > @@ -1295,6 +1299,10 @@ int writeback_inodes_sb_nr_if_idle(struc
> > {
> > if (!writeback_in_progress(sb->s_bdi)) {
> > down_read(&sb->s_umount);
> > + if (unlikely(sb->s_frozen != SB_UNFROZEN)) {
> > + up_read(&sb->s_umount);
> > + return 0;
> > + }
> > writeback_inodes_sb_nr(sb, nr, reason);
> > up_read(&sb->s_umount);
> > return 1;
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 11-29-2011, 10:06 AM
Mikulas Patocka
 
Default deadlock with suspend and quotas

On Tue 29-11-11 11:19:01, Jan Kara wrote:
> Hi,
>
> On Mon 28-11-11 18:32:18, Mikulas Patocka wrote:
> > > Hi
> > >
> > > Where can I get that patch set?
> > >
> > > We are experiencing other similar deadlocks on RHEL-6, caused by sync or
> > > background writeback (these code paths take s_umount and wait trying to do
> > > I/O), but I wasn't able to reproduce these deadlocks on upstream kernel?
> > > Are there other known deadlock possibilities?
> >
> > I found some patch named "[RFC PATCH 1/3] VFS: Fix s_umount thaw/write
> > deadlock" (I couldn't find the next two parts of the patch in the
> > archives). And the patch looks wrong:
> Yes, that seems to be the series. I generally agree with you that the
> last iteration still had some problems and some changes were requested.
> That's why it's not merged yet after all...
>
> > - down_read_trylock(&sb->s_umount) doesn't fix anything. The lock is not
> > held when the filesystem is frozen and it is taken for write when thawing.
> > Consequently, any task can succeed with down_read_trylock(&sb->s_umount)
> > on a frozen filesystem and if this tasks attempts to do an I/O that is
> > waiting for thaw, it may still deadlock.
> Agreed.
>
> > - skipping sync on frozen filesystem violates sync semantics.
> > Applications, such as databases, assume that when sync finishes, data were
> > written to stable storage. If we skip sync when the filesystem is frozen,
> > we can cause data corruption in these applications (if the system crashes
> > after we skipped a sync).
> Here I don't agree. Filesystem must guarantee there are no dirty data on
> a frozen filesystem.

This is technically impossible to achieve on ext2, fat or other
non-transactional filesystems. These filesystems have no locks around code
paths that set data or inodes dirty. And you still need working sync for
ext2. So the best thing to do in sync is to wait until the filesystem is
unfrozen.

> Ext4 and XFS do this, ext3 would need proper
> page_mkwrite() implementation for this but that's the problem of ext3, not
> freezing code in general. If there are no dirty data, sync code (and also
> flusher thread) is free to return without doing anything.
>
> That being said, it is hard to implement freeze handling in page_mkwrite()
> in such a way that there would be no dirty pages (but we know there cannot
> be dirty data in such pages). Currently we mark the page dirty during page
> fault and wait for frozen filesystem only after that so that we are
> guaranteed that either freezing code will wait for page fault to finish and
> will write the page or page fault code notices that freezing is in progress
> and blocks (see fs/buffer.c:block_page_mkwrite()).
>
> So I believe the consensus was that we should not block sync or flusher
> thread on frozen filesystem. Firstly, it's kind of ugly from user
> perspective (you cannot sync filesystems on your system while one
> filesystem is frozen???), secondly, in case of flusher thread it has some
> serious implications if there are more filesystems on the same device - you
> would effectively stop any writeback to the device possibly hanging the
> whole system due to dirty limit being exceeded. So at least in these two
> cases we should just ignore frozen filesystem.

For background writes I agree, that they should skip frozen filesystems.
But for synchronous writes (sync) must wait, at least on non-transactional
filesystems.

> > - I'm not sure what userspace quota subsystem will do if we start
> > returning -EBUSY spuriously.
> Quota tools will complain to the user which would be fine I think. But
> blocking is fine as well. In this particular case I don't care much but it
> should be consistent with what happens to sync. So probably Q_SYNC command
> should ignore frozen filesystem, Q_SETQUOTA or Q_SETINFO should block.
>
> > There is another thing --- I wasn't able to reproduce these sync-related
> > deadlocks at all. Did anyone succeeded in reproducing them? Are there any
> > reported deadlocks? When freezing the ext3 or ext4 filesystem, the kernel
> > prevents creating new dirty data, syncs all data, and freezes the
> > filesystem. Consequently, the sync function never finds any dirty data and
> > so it doesn't block (sync doesn't writeback ATIME change, I don't know
> > why).
> See above why sync can actually spot some dirty inodes/page (although
> there is not any dirty data). Surbhi (added to CC) from Canonical could
> actually trigger these races and consequent deadlocks (and I belive some of
> their customers as well). Also some RH customers were hitting these
> deadlocks (Eric Sandeen was handling those bugs AFAIK) but those were made
> happy by my changes to block_page_mkwrite() which made the race window much
> narrower.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 11-29-2011, 10:11 AM
Jan Kara
 
Default deadlock with suspend and quotas

On Tue 29-11-11 06:06:21, Mikulas Patocka wrote:
> On Tue 29-11-11 11:19:01, Jan Kara wrote:
> > Hi,
> >
> > On Mon 28-11-11 18:32:18, Mikulas Patocka wrote:
> > > > Hi
> > > >
> > > > Where can I get that patch set?
> > > >
> > > > We are experiencing other similar deadlocks on RHEL-6, caused by sync or
> > > > background writeback (these code paths take s_umount and wait trying to do
> > > > I/O), but I wasn't able to reproduce these deadlocks on upstream kernel?
> > > > Are there other known deadlock possibilities?
> > >
> > > I found some patch named "[RFC PATCH 1/3] VFS: Fix s_umount thaw/write
> > > deadlock" (I couldn't find the next two parts of the patch in the
> > > archives). And the patch looks wrong:
> > Yes, that seems to be the series. I generally agree with you that the
> > last iteration still had some problems and some changes were requested.
> > That's why it's not merged yet after all...
> >
> > > - down_read_trylock(&sb->s_umount) doesn't fix anything. The lock is not
> > > held when the filesystem is frozen and it is taken for write when thawing.
> > > Consequently, any task can succeed with down_read_trylock(&sb->s_umount)
> > > on a frozen filesystem and if this tasks attempts to do an I/O that is
> > > waiting for thaw, it may still deadlock.
> > Agreed.
> >
> > > - skipping sync on frozen filesystem violates sync semantics.
> > > Applications, such as databases, assume that when sync finishes, data were
> > > written to stable storage. If we skip sync when the filesystem is frozen,
> > > we can cause data corruption in these applications (if the system crashes
> > > after we skipped a sync).
> > Here I don't agree. Filesystem must guarantee there are no dirty data on
> > a frozen filesystem.
>
> This is technically impossible to achieve on ext2, fat or other
> non-transactional filesystems. These filesystems have no locks around code
> paths that set data or inodes dirty. And you still need working sync for
> ext2. So the best thing to do in sync is to wait until the filesystem is
> unfrozen.
Then suspend is effectively unsupported on the filesystem and should
return EOPNOTSUPP? At least that's what I'd expect...

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 11-29-2011, 11:54 AM
Mikulas Patocka
 
Default deadlock with suspend and quotas

> > This is technically impossible to achieve on ext2, fat or other
> > non-transactional filesystems. These filesystems have no locks around code
> > paths that set data or inodes dirty. And you still need working sync for
> > ext2. So the best thing to do in sync is to wait until the filesystem is
> > unfrozen.
> Then suspend is effectively unsupported on the filesystem and should
> return EOPNOTSUPP? At least that's what I'd expect...
>
> Honza
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

LVM uses suspend every time it changes layout of the logical volume. For
example when it converts to/from mirrored format, extends/shrinks the
volume, moves the volume to a different disk, takes a snapshots, merges a
snapshot back, on mirror or multipath failover.

For most of these actions (except taking a snapshot) it is irrelevant if
there are dirty data in the filesystem cache or not while it is suspended.
So there is no point in banning suspend on ext2. If you banned it, you
couldn't use ext2 on LVM at all.

Mikulas

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 

Thread Tools




All times are GMT. The time now is 07:18 PM.

VBulletin, Copyright ©2000 - 2014, Jelsoft Enterprises Ltd.
Content Relevant URLs by vBSEO ©2007, Crawlability, Inc.
Copyright 2007 - 2008, www.linux-archive.org