FAQ Search Today's Posts Mark Forums Read
» Video Reviews

» Linux Archive

Linux-archive is a website aiming to archive linux email lists and to make them easily accessible for linux users/developers.


» Sponsor

» Partners

» Sponsor

Go Back   Linux Archive > Redhat > Device-mapper Development

 
 
LinkBack Thread Tools
 
Old 07-29-2010, 09:54 PM
Chandra Seetharaman
 
Default scsi_dh : Couple of fixes for scsi device handlers

Babu,

Your main object is to protect scsi_dh_data across scsi_dh_activate() by
way of getting kref around scsi_dh_activate(), right ?

Wouldn't doing what Shyam suggested (doing kref_put() and put_device())
in scsi_activate() make it simpler and code still be readable ? (it
would make all the patches except 2/6 not needed).

Did you hit with any problems doing it that way ?

Also the snippet (in 2/6)
---------------
@@ -228,7 +228,8 @@ store_dh_state(struct device *dev, struc
* Activate a device handler
*/
if (scsi_dh->activate)
- err = scsi_dh->activate(sdev, NULL, NULL);
+ err = scsi_dh_activate(sdev->request_queue,
+ NULL, NULL);
else
err = 0;
}
------------------
can be made as a patch in itself.

Thanks for fixing the problem.

regards,

chandra
On Wed, 2010-07-28 at 16:58 -0600, Moger, Babu wrote:
> These patches fix the following two cases.
> 1. Devices going away while scsi device hander's activate is still in progress.
>
> 2. Removal of scsi_dh_data(calling detach handler) when scsi device hander's activate is still in progress.
>
> We have been seeing these problems while running multipath failover tests on LSI storage. These patches fix the problem. We have verified it.
>
> Here is the panic we have been seeing while running failover failback tests.
>
> > 00:40:42:869 COM1 >------------[ cut here ]------------
> > 00:40:42:869 COM1 >kernel BUG at /usr/src/packages/BUILD/lsi-
> > scsi_dh_rdac-01.00/obj/default/scsi_dh_rdac.c:232!
> > 00:40:42:869 COM1 >invalid opcode: 0000 [1] SMP
> > 00:40:42:885 COM1 >last sysfs file: /sys/kernel/uevent_seqnum
> > 00:40:42:885 COM1 >CPU 3
> > 00:40:42:885 COM1 >Modules linked in: dm_round_robin dm_multipath
> > nls_utf8 cifs(X) microcode af_packet ipv6 fuse loop dm_mod iTCO_wdt
> > iTCO_vendor_support dcdbas(X) pcspkr rtc_cmos rtc_core serio_raw
> > rtc_lib i5000_edac edac_core bnx2 shpchp sg pci_hotplug button
> > mptctl usbhid hid ff_memless uhci_hcd ehci_hcd usbcore sd_mod
> > crc_t10dif mpt2sas(N) raid_class edd ext3 mbcache jbd fan
> > ide_pci_generic piix ide_core ata_generic ata_piix libata dock
> > mptsas mptscsih mptbase scsi_transport_sas thermal processor
> > thermal_sys hwmon scsi_dh_rdac(X) scsi_dh scsi_mod
> > 00:40:42:932 COM1 >Supported: No
> > 00:40:42:932 COM1 >Pid: 14044, comm: kmpath_handlerd Tainted: G
> > 2.6.27.39-0.3-default #1
> > 00:40:42:932 COM1 >RIP: 0010:
> > rdac_activate+0x257/0x387 [scsi_dh_rdac]
> > 00:40:42:947 COM1 >RSP: 0018:ffff880127109dc0 EFLAGS: 00010246
> > 00:40:42:947 COM1 >RAX: ffff8800ae02f000 RBX: 0000000000000001 RCX:
> > 0000000000000018
> > 00:40:42:963 COM1 >RDX: 0000000000001bbc RSI: 0000000000000282 RDI:
> > ffff8800c2ccd918
> > 00:40:42:963 COM1 >RBP: 00000000fffffffb R08: ffffffff806eaf78 R09:
> > ffff880028087720
> > 00:40:42:963 COM1 >R10: 0000000000000000 R11: ffffffff80284ebe R12:
> > ffffffffa0030fbe
> > 00:40:42:978 COM1 >R13: 0000000000000000 R14: 0000000000000282 R15:
> > 0000000000000000
> > 00:40:42:978 COM1 >FS: 0000000000000000(0000) GS:ffff88012fb81ec0
> > (0000) knlGS:0000000000000000
> > 00:40:42:978 COM1 >CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> > 00:40:42:994 COM1 >CR2: 00000000f7701630 CR3: 0000000101d5e000 CR4:
> > 00000000000006e0
> > 00:40:42:994 COM1 >DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > 0000000000000000
> > 00:40:43:010 COM1 >DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> > 0000000000000400
> > 00:40:43:010 COM1 >Process kmpath_handlerd (pid: 14044, threadinfo
> > ffff880127108000, task ffff88012710e680)
> > 00:40:43:010 COM1 >Stack: ffff880127109ec0 ffffffff8049c431
> > 0000000000000000 ffff880127109e50
> > 00:40:43:025 COM1 > ffff8800ae02f000 ffff8800b5032208
> > ffff8800b5032200 ffff8800ae02f250
> > 00:40:43:025 COM1 > ffff8800b5032216 0000000580a33680
> > ffff8800c2ccd6b0 ffff8800ae02f120
> > 00:40:43:041 COM1 >Call Trace:
> > 00:40:43:041 COM1 > scsi_dh_activate+0x81/0x9b[scsi_dh]
> > 00:40:43:041 COM1 > activate_path+0x22/0x46
> > [dm_multipath]
> > 00:40:43:041 COM1 > run_workqueue+0x7a/0x100
> > 00:40:43:057 COM1 > worker_thread+0xd8/0xe7
> > 00:40:43:057 COM1 > kthread+0x47/0x73
> > 00:40:43:057 COM1 > child_rip+0xa/0x11
> > 00:40:43:057 COM1 >
> > 00:40:43:057 COM1 >
> > 00:40:43:057 COM1 >Code: 4c 89 ea e8 78 dd 30 e0 4c 89 ef 89 c5 e8
> > db a8 30 e0 85 ed 0f 84 da 00 00 00 48 8b 44 24 20 4c 8b a8 d0 05 00
> > 00 4d 85 ed 75 04 <0f> 0b eb fe 48 8b 7c 24 40 48 8d 54 24 60 be 60
> > 00 00 00 e8 ae
> > 00:40:43:072 COM1 >RIP rdac_activate+0x257/
> > 0x387 [scsi_dh_rdac]
> > 00:40:43:088 COM1 > RSP <ffff880127109dc0>
> > 00:40:43:088 COM1 >---[ end trace 00e89c598c82483b ]---
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 07-29-2010, 10:35 PM
"Moger, Babu"
 
Default scsi_dh : Couple of fixes for scsi device handlers

Chandra, Shyam,
Thanks for your comments.. Please see my response.

> -----Original Message-----
> From: Chandra Seetharaman [mailto:sekharan@us.ibm.com]
> Sent: Thursday, July 29, 2010 4:54 PM
> To: Moger, Babu
> Cc: device-mapper development; linux-scsi@vger.kernel.org; Qi, Yanling;
> Chauhan, Vijay; Dachepalli, Sudhir; Stankey, Robert
> Subject: Re: [PATCH 0/6] scsi_dh : Couple of fixes for scsi device
> handlers
>
> Babu,
>
> Your main object is to protect scsi_dh_data across scsi_dh_activate()
> by
> way of getting kref around scsi_dh_activate(), right ?
>

Yes, That is correct.

> Wouldn't doing what Shyam suggested (doing kref_put() and put_device())
> in scsi_activate() make it simpler and code still be readable ? (it
> would make all the patches except 2/6 not needed).
>
> Did you hit with any problems doing it that way ?
>

Yes, We can do that. Problem is I am hitting the issue with BUG_ON in get_rdac_data which is there in the beginning of rdac_activate.
If I have to go this way, then I need to remove a call get_rdac_data and just validate pointers. Report error(SCSI_DH_IO) if pointer is not valid.
Then hold the reference counts and continue if everything is alright. I will send the new patches as soon as I can.

> Also the snippet (in 2/6)
> ---------------
> @@ -228,7 +228,8 @@ store_dh_state(struct device *dev, struc
> * Activate a device handler
> */
> if (scsi_dh->activate)
> - err = scsi_dh->activate(sdev, NULL,
> NULL);
> + err = scsi_dh_activate(sdev-
> >request_queue,
> + NULL, NULL);
> else
> err = 0;
> }
> ------------------
> can be made as a patch in itself.
>
> Thanks for fixing the problem.

Ok.. I will make this as separate patch..

>
> regards,
>
> chandra
> On Wed, 2010-07-28 at 16:58 -0600, Moger, Babu wrote:
> > These patches fix the following two cases.
> > 1. Devices going away while scsi device hander's activate is still in
> progress.
> >
> > 2. Removal of scsi_dh_data(calling detach handler) when scsi device
> hander's activate is still in progress.
> >
> > We have been seeing these problems while running multipath failover
> tests on LSI storage. These patches fix the problem. We have verified
> it.
> >
> > Here is the panic we have been seeing while running failover failback
> tests.
> >
> > > 00:40:42:869 COM1 >------------[ cut here ]------------
> > > 00:40:42:869 COM1 >kernel BUG at /usr/src/packages/BUILD/lsi-
> > > scsi_dh_rdac-01.00/obj/default/scsi_dh_rdac.c:232!
> > > 00:40:42:869 COM1 >invalid opcode: 0000 [1] SMP
> > > 00:40:42:885 COM1 >last sysfs file: /sys/kernel/uevent_seqnum
> > > 00:40:42:885 COM1 >CPU 3
> > > 00:40:42:885 COM1 >Modules linked in: dm_round_robin dm_multipath
> > > nls_utf8 cifs(X) microcode af_packet ipv6 fuse loop dm_mod iTCO_wdt
> > > iTCO_vendor_support dcdbas(X) pcspkr rtc_cmos rtc_core serio_raw
> > > rtc_lib i5000_edac edac_core bnx2 shpchp sg pci_hotplug button
> > > mptctl usbhid hid ff_memless uhci_hcd ehci_hcd usbcore sd_mod
> > > crc_t10dif mpt2sas(N) raid_class edd ext3 mbcache jbd fan
> > > ide_pci_generic piix ide_core ata_generic ata_piix libata dock
> > > mptsas mptscsih mptbase scsi_transport_sas thermal processor
> > > thermal_sys hwmon scsi_dh_rdac(X) scsi_dh scsi_mod
> > > 00:40:42:932 COM1 >Supported: No
> > > 00:40:42:932 COM1 >Pid: 14044, comm: kmpath_handlerd Tainted: G
> > > 2.6.27.39-0.3-default #1
> > > 00:40:42:932 COM1 >RIP: 0010:
> > > rdac_activate+0x257/0x387 [scsi_dh_rdac]
> > > 00:40:42:947 COM1 >RSP: 0018:ffff880127109dc0 EFLAGS: 00010246
> > > 00:40:42:947 COM1 >RAX: ffff8800ae02f000 RBX: 0000000000000001
> RCX:
> > > 0000000000000018
> > > 00:40:42:963 COM1 >RDX: 0000000000001bbc RSI: 0000000000000282
> RDI:
> > > ffff8800c2ccd918
> > > 00:40:42:963 COM1 >RBP: 00000000fffffffb R08: ffffffff806eaf78
> R09:
> > > ffff880028087720
> > > 00:40:42:963 COM1 >R10: 0000000000000000 R11: ffffffff80284ebe
> R12:
> > > ffffffffa0030fbe
> > > 00:40:42:978 COM1 >R13: 0000000000000000 R14: 0000000000000282
> R15:
> > > 0000000000000000
> > > 00:40:42:978 COM1 >FS: 0000000000000000(0000) GS:ffff88012fb81ec0
> > > (0000) knlGS:0000000000000000
> > > 00:40:42:978 COM1 >CS: 0010 DS: 0018 ES: 0018 CR0:
> 000000008005003b
> > > 00:40:42:994 COM1 >CR2: 00000000f7701630 CR3: 0000000101d5e000
> CR4:
> > > 00000000000006e0
> > > 00:40:42:994 COM1 >DR0: 0000000000000000 DR1: 0000000000000000
> DR2:
> > > 0000000000000000
> > > 00:40:43:010 COM1 >DR3: 0000000000000000 DR6: 00000000ffff0ff0
> DR7:
> > > 0000000000000400
> > > 00:40:43:010 COM1 >Process kmpath_handlerd (pid: 14044, threadinfo
> > > ffff880127108000, task ffff88012710e680)
> > > 00:40:43:010 COM1 >Stack: ffff880127109ec0 ffffffff8049c431
> > > 0000000000000000 ffff880127109e50
> > > 00:40:43:025 COM1 > ffff8800ae02f000 ffff8800b5032208
> > > ffff8800b5032200 ffff8800ae02f250
> > > 00:40:43:025 COM1 > ffff8800b5032216 0000000580a33680
> > > ffff8800c2ccd6b0 ffff8800ae02f120
> > > 00:40:43:041 COM1 >Call Trace:
> > > 00:40:43:041 COM1 > scsi_dh_activate+0x81/0x9b[scsi_dh]
> > > 00:40:43:041 COM1 > activate_path+0x22/0x46
> > > [dm_multipath]
> > > 00:40:43:041 COM1 > run_workqueue+0x7a/0x100
> > > 00:40:43:057 COM1 > worker_thread+0xd8/0xe7
> > > 00:40:43:057 COM1 > kthread+0x47/0x73
> > > 00:40:43:057 COM1 > child_rip+0xa/0x11
> > > 00:40:43:057 COM1 >
> > > 00:40:43:057 COM1 >
> > > 00:40:43:057 COM1 >Code: 4c 89 ea e8 78 dd 30 e0 4c 89 ef 89 c5 e8
> > > db a8 30 e0 85 ed 0f 84 da 00 00 00 48 8b 44 24 20 4c 8b a8 d0 05
> 00
> > > 00 4d 85 ed 75 04 <0f> 0b eb fe 48 8b 7c 24 40 48 8d 54 24 60 be 60
> > > 00 00 00 e8 ae
> > > 00:40:43:072 COM1 >RIP rdac_activate+0x257/
> > > 0x387 [scsi_dh_rdac]
> > > 00:40:43:088 COM1 > RSP <ffff880127109dc0>
> > > 00:40:43:088 COM1 >---[ end trace 00e89c598c82483b ]---
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-scsi"
> in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 07-30-2010, 12:04 AM
Chandra Seetharaman
 
Default scsi_dh : Couple of fixes for scsi device handlers

See comment below

On Thu, 2010-07-29 at 16:35 -0600, Moger, Babu wrote:
> Chandra, Shyam,
> Thanks for your comments.. Please see my response.
>
> > -----Original Message-----
> > From: Chandra Seetharaman [mailto:sekharan@us.ibm.com]
> > Sent: Thursday, July 29, 2010 4:54 PM
> > To: Moger, Babu
> > Cc: device-mapper development; linux-scsi@vger.kernel.org; Qi, Yanling;
> > Chauhan, Vijay; Dachepalli, Sudhir; Stankey, Robert
> > Subject: Re: [PATCH 0/6] scsi_dh : Couple of fixes for scsi device
> > handlers
> >
> > Babu,
> >
> > Your main object is to protect scsi_dh_data across scsi_dh_activate()
> > by
> > way of getting kref around scsi_dh_activate(), right ?
> >
>
> Yes, That is correct.
>
> > Wouldn't doing what Shyam suggested (doing kref_put() and put_device())
> > in scsi_activate() make it simpler and code still be readable ? (it
> > would make all the patches except 2/6 not needed).
> >
> > Did you hit with any problems doing it that way ?
> >
>
> Yes, We can do that. Problem is I am hitting the issue with BUG_ON
> in get_rdac_data which is there in the beginning of rdac_activate.

I do not understand the problem.

If the BUG_ON on get_rdac_data() is being triggered, that means
sdev->scsi_dh_data is NULL, if that is the case, how can you access
sdev->scsi_dh_data->kref in scsi_dh_activate (in patch 2/6) ? Wouldn't
it trip a oops ?


> If I have to go this way, then I need to remove a call get_rdac_data
> and just validate pointers. Report error(SCSI_DH_IO) if pointer is not valid.
> Then hold the reference counts and continue if everything is alright.
> I will send the new patches as soon as I can.


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 07-30-2010, 06:12 PM
"Moger, Babu"
 
Default scsi_dh : Couple of fixes for scsi device handlers

See my comments below..

> -----Original Message-----
> From: Chandra Seetharaman [mailto:sekharan@us.ibm.com]
> Sent: Thursday, July 29, 2010 7:05 PM
> To: Moger, Babu
> Cc: device-mapper development; linux-scsi@vger.kernel.org;
> Shyam_Iyer@Dell.com; Qi, Yanling; Chauhan, Vijay; Dachepalli, Sudhir;
> Stankey, Robert
> Subject: RE: [PATCH 0/6] scsi_dh : Couple of fixes for scsi device
> handlers
>
> See comment below
>
> On Thu, 2010-07-29 at 16:35 -0600, Moger, Babu wrote:
> > Chandra, Shyam,
> > Thanks for your comments.. Please see my response.
> >
> > > -----Original Message-----
> > > From: Chandra Seetharaman [mailto:sekharan@us.ibm.com]
> > > Sent: Thursday, July 29, 2010 4:54 PM
> > > To: Moger, Babu
> > > Cc: device-mapper development; linux-scsi@vger.kernel.org; Qi,
> Yanling;
> > > Chauhan, Vijay; Dachepalli, Sudhir; Stankey, Robert
> > > Subject: Re: [PATCH 0/6] scsi_dh : Couple of fixes for scsi device
> > > handlers
> > >
> > > Babu,
> > >
> > > Your main object is to protect scsi_dh_data across
> scsi_dh_activate()
> > > by
> > > way of getting kref around scsi_dh_activate(), right ?
> > >
> >
> > Yes, That is correct.
> >
> > > Wouldn't doing what Shyam suggested (doing kref_put() and
> put_device())
> > > in scsi_activate() make it simpler and code still be readable ? (it
> > > would make all the patches except 2/6 not needed).
> > >
> > > Did you hit with any problems doing it that way ?
> > >
> >
> > Yes, We can do that. Problem is I am hitting the issue with BUG_ON
> > in get_rdac_data which is there in the beginning of rdac_activate.
>
> I do not understand the problem.
>
> If the BUG_ON on get_rdac_data() is being triggered, that means
> sdev->scsi_dh_data is NULL, if that is the case, how can you access
> sdev->scsi_dh_data->kref in scsi_dh_activate (in patch 2/6) ? Wouldn't
> it trip a oops ?

Test case is deleting both active and passive paths almost together during the multipath testing. Looks like DM picked up the active path failure first. Then failed the active path and started scheduling activate_path to failover to passive path. Passive path is also about to go down pretty soon. When the control was in scsi_dh_activate, Looks like scsi_dh_data was still valid because I did not see panic here. But scsi_dh_data became NULL when control went to rdac_activate. That is when I hit the bug on.

kernel BUG at /usr/src/packages/BUILD/lsi-scsi_dh_rdac-01.00/obj/default/scsi_dh_rdac.c:232!
RIP: 0010: rdac_activate+0x257/0x387 [scsi_dh_rdac]

My understanding is someone triggered scsi_dh->detach for passive path during this small window. Only way I could see problem go away is holding reference counts between these calls. Did i miss anything here? See the code snippet below..

int scsi_dh_activate(struct request_queue *q, activate_complete fn, void *data)
{
int err = 0;
unsigned long flags;
struct scsi_device *sdev;
struct scsi_device_handler *scsi_dh = NULL;

spin_lock_irqsave(q->queue_lock, flags);
sdev = q->queuedata;
if (sdev && sdev->scsi_dh_data)
scsi_dh = sdev->scsi_dh_data->scsi_dh;
if (!scsi_dh || !get_device(&sdev->sdev_gendev))
err = SCSI_DH_NOSYS;
spin_unlock_irqrestore(q->queue_lock, flags);

if (err)
return err;

if (scsi_dh->activate)
err = scsi_dh->activate(sdev, fn, data);
put_device(&sdev->sdev_gendev);
return err;
}
EXPORT_SYMBOL_GPL(scsi_dh_activate);


static int rdac_activate(struct scsi_device *sdev,
activate_complete fn, void *data)
{
struct rdac_dh_data *h = get_rdac_data(sdev); ******* I hit BUG_ON here *********
int err = SCSI_DH_OK;

err = check_ownership(sdev, h);
if (err != SCSI_DH_OK)
goto done;

if (h->lun_state == RDAC_LUN_UNOWNED) {
err = queue_mode_select(sdev, fn, data);
if (err == SCSI_DH_OK)
return 0;
}
done:
if (fn)
fn(data, err);
return 0;
}


>
>
> > If I have to go this way, then I need to remove a call
> get_rdac_data
> > and just validate pointers. Report error(SCSI_DH_IO) if pointer is
> not valid.
> > Then hold the reference counts and continue if everything is
> alright.
> > I will send the new patches as soon as I can.
>


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 07-30-2010, 11:56 PM
Chandra Seetharaman
 
Default scsi_dh : Couple of fixes for scsi device handlers

On Fri, 2010-07-30 at 12:12 -0600, Moger, Babu wrote:

> > >
> > > Yes, We can do that. Problem is I am hitting the issue with BUG_ON
> > > in get_rdac_data which is there in the beginning of rdac_activate.
> >
> > I do not understand the problem.
> >
> > If the BUG_ON on get_rdac_data() is being triggered, that means
> > sdev->scsi_dh_data is NULL, if that is the case, how can you access
> > sdev->scsi_dh_data->kref in scsi_dh_activate (in patch 2/6) ? Wouldn't
> > it trip a oops ?
>
> Test case is deleting both active and passive paths almost together during
> the multipath testing. Looks like DM picked up the active path failure
> first. Then failed the active path and started scheduling activate_path to
> failover to passive path. Passive path is also about to go down pretty soon.
> When the control was in scsi_dh_activate, Looks like scsi_dh_data was still
> valid because I did not see panic here. But scsi_dh_data became NULL when
> control went to rdac_activate. That is when I hit the bug on.
>
> kernel BUG at /usr/src/packages/BUILD/lsi-scsi_dh_rdac-01.00/obj/default/scsi_dh_rdac.c:232!
> RIP: 0010: rdac_activate+0x257/0x387 [scsi_dh_rdac]
>
> My understanding is someone triggered scsi_dh->detach for passive path during
> this small window. Only way I could see problem go away is holding reference
> counts between these calls. Did i miss anything here? See the code snippet below..

So, basically, the new patch just reduces the window of race. IOW, if it
just spins for few seconds (or preempted) just before calling the
kref_get() in the new patch, it will generate an oops as scsi_dh_data
would be NULL.

Looks like you have to create a lock to protect scsi_dh_data and then
call kref_get() with the protection of lock.

>
> int scsi_dh_activate(struct request_queue *q, activate_complete fn, void *data)
> {
> int err = 0;
> unsigned long flags;
> struct scsi_device *sdev;
> struct scsi_device_handler *scsi_dh = NULL;
>
> spin_lock_irqsave(q->queue_lock, flags);
> sdev = q->queuedata;
> if (sdev && sdev->scsi_dh_data)
> scsi_dh = sdev->scsi_dh_data->scsi_dh;
> if (!scsi_dh || !get_device(&sdev->sdev_gendev))
> err = SCSI_DH_NOSYS;
> spin_unlock_irqrestore(q->queue_lock, flags);
>
> if (err)
> return err;
>
> if (scsi_dh->activate)
> err = scsi_dh->activate(sdev, fn, data);
> put_device(&sdev->sdev_gendev);
> return err;
> }
> EXPORT_SYMBOL_GPL(scsi_dh_activate);
>
>
> static int rdac_activate(struct scsi_device *sdev,
> activate_complete fn, void *data)
> {
> struct rdac_dh_data *h = get_rdac_data(sdev); ******* I hit BUG_ON here *********
> int err = SCSI_DH_OK;
>
> err = check_ownership(sdev, h);
> if (err != SCSI_DH_OK)
> goto done;
>
> if (h->lun_state == RDAC_LUN_UNOWNED) {
> err = queue_mode_select(sdev, fn, data);
> if (err == SCSI_DH_OK)
> return 0;
> }
> done:
> if (fn)
> fn(data, err);
> return 0;
> }
>
>
> >
> >
> > > If I have to go this way, then I need to remove a call
> > get_rdac_data
> > > and just validate pointers. Report error(SCSI_DH_IO) if pointer is
> > not valid.
> > > Then hold the reference counts and continue if everything is
> > alright.
> > > I will send the new patches as soon as I can.
> >
>
> NrybXǧv^)޺{.n+{"{ayʇڙ,jfhzw j:+vwjmzZ+ݢj"!


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 07-31-2010, 12:22 AM
Shyam Iyer
 
Default scsi_dh : Couple of fixes for scsi device handlers

On 07/30/2010 07:56 PM, Chandra Seetharaman wrote:

On Fri, 2010-07-30 at 12:12 -0600, Moger, Babu wrote:



Yes, We can do that. Problem is I am hitting the issue with BUG_ON
in get_rdac_data which is there in the beginning of rdac_activate.


I do not understand the problem.

If the BUG_ON on get_rdac_data() is being triggered, that means
sdev->scsi_dh_data is NULL, if that is the case, how can you access
sdev->scsi_dh_data->kref in scsi_dh_activate (in patch 2/6) ? Wouldn't
it trip a oops ?


Test case is deleting both active and passive paths almost together during
the multipath testing. Looks like DM picked up the active path failure
first. Then failed the active path and started scheduling activate_path to
failover to passive path. Passive path is also about to go down pretty soon.
When the control was in scsi_dh_activate, Looks like scsi_dh_data was still
valid because I did not see panic here. But scsi_dh_data became NULL when
control went to rdac_activate. That is when I hit the bug on.

kernel BUG at /usr/src/packages/BUILD/lsi-scsi_dh_rdac-01.00/obj/default/scsi_dh_rdac.c:232!
RIP: 0010: rdac_activate+0x257/0x387 [scsi_dh_rdac]

My understanding is someone triggered scsi_dh->detach for passive path during
this small window. Only way I could see problem go away is holding reference
counts between these calls. Did i miss anything here? See the code snippet below..


So, basically, the new patch just reduces the window of race. IOW, if it
just spins for few seconds (or preempted) just before calling the
kref_get() in the new patch, it will generate an oops as scsi_dh_data
would be NULL.

Looks like you have to create a lock to protect scsi_dh_data and then
call kref_get() with the protection of lock.



Second that.. Sounds like each kref_get/kref_put .. needs a lock protection not just this scenario.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 
Old 07-31-2010, 05:02 AM
"Moger, Babu"
 
Default scsi_dh : Couple of fixes for scsi device handlers

> -----Original Message-----
> From: Shyam Iyer [mailto:shyam_iyer@dell.com]
> Sent: Friday, July 30, 2010 7:22 PM
> To: sekharan@us.ibm.com
> Cc: Moger, Babu; device-mapper development; linux-scsi@vger.kernel.org;
> Qi, Yanling; Chauhan, Vijay; Dachepalli, Sudhir; Stankey, Robert
> Subject: Re: [PATCH 0/6] scsi_dh : Couple of fixes for scsi device
> handlers
>
> On 07/30/2010 07:56 PM, Chandra Seetharaman wrote:
> > On Fri, 2010-07-30 at 12:12 -0600, Moger, Babu wrote:
> >
> >
> >>>> Yes, We can do that. Problem is I am hitting the issue with
> BUG_ON
> >>>> in get_rdac_data which is there in the beginning of rdac_activate.
> >>>>
> >>> I do not understand the problem.
> >>>
> >>> If the BUG_ON on get_rdac_data() is being triggered, that means
> >>> sdev->scsi_dh_data is NULL, if that is the case, how can you access
> >>> sdev->scsi_dh_data->kref in scsi_dh_activate (in patch 2/6) ? Wouldn't
> >>> it trip a oops ?
> >>>
> >> Test case is deleting both active and passive paths almost together
> during
> >> the multipath testing. Looks like DM picked up the active path
> failure
> >> first. Then failed the active path and started scheduling activate_path
> to
> >> failover to passive path. Passive path is also about to go down pretty
> soon.
> >> When the control was in scsi_dh_activate, Looks like scsi_dh_data was
> still
> >> valid because I did not see panic here. But scsi_dh_data became NULL
> when
> >> control went to rdac_activate. That is when I hit the bug on.
> >>
> >> kernel BUG at /usr/src/packages/BUILD/lsi-scsi_dh_rdac-
> 01.00/obj/default/scsi_dh_rdac.c:232!
> >> RIP: 0010: rdac_activate+0x257/0x387 [scsi_dh_rdac]
> >>
> >> My understanding is someone triggered scsi_dh->detach for passive path
> during
> >> this small window. Only way I could see problem go away is holding
> reference
> >> counts between these calls. Did i miss anything here? See the code
> snippet below..
> >>
> > So, basically, the new patch just reduces the window of race. IOW, if it
> > just spins for few seconds (or preempted) just before calling the
> > kref_get() in the new patch, it will generate an oops as scsi_dh_data
> > would be NULL.

That is correct.

> >
> > Looks like you have to create a lock to protect scsi_dh_data and then
> > call kref_get() with the protection of lock.
> >
> >
> Second that.. Sounds like each kref_get/kref_put .. needs a lock
> protection not just this scenario.

I agree. I will attempt one more set of patches. This time I will try to match get and put sequence in a one file. Please feel free to comment.


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
 

Thread Tools




All times are GMT. The time now is 12:59 PM.

VBulletin, Copyright ©2000 - 2014, Jelsoft Enterprises Ltd.
Content Relevant URLs by vBSEO ©2007, Crawlability, Inc.
Copyright 2007 - 2008, www.linux-archive.org