 
03-22-2012, 12:59 AM
dann frazier

GPF in dlm_lowcomms_stop

I observed a GPF in dlm_lowcomms_stop() in a crash dump from a
user. I'll include a traceback at the bottom. I have a theory as to
what is going on, but could use some help proving it.

 1 void dlm_lowcomms_stop(void)
 2 {
 3 	/* Set all the flags to prevent any
 4 	   socket activity.
 5 	*/
 6 	mutex_lock(&connections_lock);
 7 	foreach_conn(stop_conn);
 8 	mutex_unlock(&connections_lock);
 9
10 	work_stop();
11
12 	mutex_lock(&connections_lock);
13 	clean_writequeues();
14
15 	foreach_conn(free_conn);
16
17 	mutex_unlock(&connections_lock);
18 	kmem_cache_destroy(con_cache);
19 }

Lines 6-8 should stop any traffic coming in on existing connections,
and the connections_lock prevents any new connections from being
added.

Line 10 shuts down the work queues that process incoming data on
existing connections. Our connections are supposed to be dead, so no
new work should be added to these queues.

However... we've dropped the connections_lock, so it's possible that a
new connection gets created on line 9. This connection structure would
have pointers to the workqueues that we're about to destroy. Sometime
later we get data on this new connection, try to add work to the
now-obliterated workqueues, and things blow up.
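
To make that interleaving concrete, here is a toy userspace model of
the suspected race (my own sketch, not the kernel code; build with
gcc -pthread -fsanitize=address and ASan reports the use-after-free):

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static pthread_mutex_t connections_lock = PTHREAD_MUTEX_INITIALIZER;
static int *workqueue;			/* stands in for recv_workqueue */

/* models a connection accepted in the window between lines 8 and 12 */
static void *new_connection(void *arg)
{
	int *cached;
	(void)arg;

	usleep(1000);			/* land in the unlocked gap */
	pthread_mutex_lock(&connections_lock);
	cached = workqueue;		/* conn caches the workqueue pointer */
	pthread_mutex_unlock(&connections_lock);

	usleep(20000);			/* data arrives later... */
	*cached += 1;			/* ...queue work on a dead queue */
	return NULL;
}

int main(void)
{
	pthread_t t;

	workqueue = malloc(sizeof(*workqueue));
	*workqueue = 0;
	pthread_create(&t, NULL, new_connection, NULL);

	/* dlm_lowcomms_stop() analogue */
	pthread_mutex_lock(&connections_lock);
	/* foreach_conn(stop_conn): existing connections stopped */
	pthread_mutex_unlock(&connections_lock);
					/* <-- the race window ("line 9") */
	usleep(10000);			/* widen the window for the demo */
	free(workqueue);		/* work_stop(): queue destroyed */

	pthread_join(t, NULL);		/* ASan flags the stale access */
	return 0;
}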

This is a 2.6.32 kernel, but this code looks to be about the
same in 3.3. Normally I'd like to instrument the code to widen the
race window and see if I can trigger it, but I'm pretty clueless about
the stack here. Is there an easy way to mimic this activity from
userspace?
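
(One untested idea for provoking it from userspace: flood the DLM
listening port with TCP connects while lockspaces are stopped and
restarted, e.g. dlm_tool join/leave in a loop on another terminal.
Port 21064 is the DLM default; the node address below is a
placeholder.)

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in sa;

	memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET;
	sa.sin_port = htons(21064);		/* assumed default DLM port */
	inet_pton(AF_INET, "10.0.0.2", &sa.sin_addr);	/* a cluster node */

	for (;;) {
		int fd = socket(AF_INET, SOCK_STREAM, 0);
		if (fd < 0)
			continue;
		connect(fd, (struct sockaddr *)&sa, sizeof(sa));
		close(fd);			/* ignore result; just churn */
	}
}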

[80061.292085] Call Trace:
[80061.312762] <IRQ>
[80061.332554] [<ffffffff810387b9>] default_spin_lock_flags+0x9/0x10
[80061.356963] [<ffffffff8155c44f>] _spin_lock_irqsave+0x2f/0x40
[80061.380717] [<ffffffff81080504>] __queue_work+0x24/0x50
[80061.403591] [<ffffffff81080632>] queue_work_on+0x42/0x60
[80061.426314] [<ffffffff810807ff>] queue_work+0x1f/0x30
[80061.448516] [<ffffffffa027326b>] lowcomms_data_ready+0x3b/0x40 [dlm]
[80061.472149] [<ffffffff814ba318>] tcp_data_queue+0x308/0xaf0
[80061.494912] [<ffffffff814bd021>] tcp_rcv_established+0x371/0x730
[80061.517903] [<ffffffff814c4bb3>] tcp_v4_do_rcv+0xf3/0x160
[80061.539872] [<ffffffff814c6305>] tcp_v4_rcv+0x5b5/0x7e0
[80061.561284] [<ffffffff814a4110>] ? ip_local_deliver_finish+0x0/0x2d0
[80061.583714] [<ffffffff8149bb84>] ? nf_hook_slow+0x74/0x100
[80061.604714] [<ffffffff814a4110>] ? ip_local_deliver_finish+0x0/0x2d0
[80061.626449] [<ffffffff814a41ed>] ip_local_deliver_finish+0xdd/0x2d0
[80061.647767] [<ffffffff814a4470>] ip_local_deliver+0x90/0xa0
[80061.667743] [<ffffffff814a392d>] ip_rcv_finish+0x12d/0x440
[80061.687278] [<ffffffff814a3eb5>] ip_rcv+0x275/0x360
[80061.705783] [<ffffffff8147450a>] netif_receive_skb+0x38a/0x5d0
[80061.725345] [<ffffffffa0257178>] br_handle_frame_finish+0x198/0x1a0 [bridge]
[80061.746206] [<ffffffffa025cc78>] br_nf_pre_routing_finish+0x238/0x350 [bridge]
[80061.779534] [<ffffffff8149bb84>] ? nf_hook_slow+0x74/0x100
[80061.798437] [<ffffffffa025ca40>] ? br_nf_pre_routing_finish+0x0/0x350 [bridge]
[80061.831598] [<ffffffffa025d15a>] br_nf_pre_routing+0x3ca/0x468 [bridge]
[80061.851561] [<ffffffff81561765>] ? do_IRQ+0x75/0xf0
[80061.869287] [<ffffffff812b3a63>] ? cpumask_next_and+0x23/0x40
[80061.887717] [<ffffffff8149bacc>] nf_iterate+0x6c/0xb0
[80061.905491] [<ffffffffa0256fe0>] ? br_handle_frame_finish+0x0/0x1a0 [bridge]
[80061.925829] [<ffffffff8149bb84>] nf_hook_slow+0x74/0x100
[80061.944215] [<ffffffffa0256fe0>] ? br_handle_frame_finish+0x0/0x1a0 [bridge]
[80061.964735] [<ffffffff8146b216>] ? __netdev_alloc_skb+0x36/0x60
[80061.984128] [<ffffffffa025730c>] br_handle_frame+0x18c/0x250 [bridge]
[80062.004294] [<ffffffff81537138>] ? vlan_hwaccel_do_receive+0x38/0xf0
[80062.024707] [<ffffffff8147437a>] netif_receive_skb+0x1fa/0x5d0
[80062.044500] [<ffffffff81474920>] napi_skb_finish+0x50/0x70
[80062.063911] [<ffffffff81537794>] vlan_gro_receive+0x84/0x90
[80062.083512] [<ffffffffa0015fd9>] igb_clean_rx_irq_adv+0x259/0x6d0 [igb]
[80062.104516] [<ffffffffa00164c2>] igb_poll+0x72/0x380 [igb]
[80062.124363] [<ffffffffa0168236>] ? ioat2_issue_pending+0x86/0xa0 [ioatdma]
[80062.145937] [<ffffffff8147500f>] net_rx_action+0x10f/0x250
[80062.165915] [<ffffffff8109331e>] ? tick_handle_oneshot_broadcast+0x10e/0x120
[80062.187812] [<ffffffff8106d9e7>] __do_softirq+0xb7/0x1f0
[80062.207733] [<ffffffff810c47b0>] ? handle_IRQ_event+0x60/0x170
[80062.228411] [<ffffffff810132ec>] call_softirq+0x1c/0x30
[80062.248332] [<ffffffff81014cb5>] do_softirq+0x65/0xa0
[80062.268058] [<ffffffff8106d7e5>] irq_exit+0x85/0x90
[80062.287578] [<ffffffff81561765>] do_IRQ+0x75/0xf0
[80062.306763] [<ffffffff81012b13>] ret_from_intr+0x0/0x11
[80062.326418] <EOI>
[80062.342191] [<ffffffff8130f59f>] ? acpi_idle_enter_c1+0xa3/0xc1
[80062.362503] [<ffffffff8130f57e>] ? acpi_idle_enter_c1+0x82/0xc1
[80062.382525] [<ffffffff8144f1e7>] ? cpuidle_idle_call+0xa7/0x140
[80062.402409] [<ffffffff81010e63>] ? cpu_idle+0xb3/0x110
[80062.421258] [<ffffffff81544137>] ? rest_init+0x77/0x80
[80062.439930] [<ffffffff81882dd8>] ? start_kernel+0x368/0x371
[80062.459099] [<ffffffff8188233a>] ? x86_64_start_reservations+0x125/0x129
[80062.479533] [<ffffffff81882438>] ? x86_64_start_kernel+0xfa/0x109
[80062.499226] Code: 00 00 48 c7 c2 2e 85 03 81 48 c7 c1 31 85 03 81 e9 dd fe ff ff 90 90 90 90 90 90 90 90 90 90 90 90 90 55 b8 00 01 00 00 48 89 e5 <f0> 66 0f c1 07 38 e0 74 06 f3 90 8a 07 eb f6 c9 c3 66 0f 1f 44
[80062.563580] RIP [<ffffffff81038729>] __ticket_spin_lock+0x9/0x20
[80062.584129] RSP <ffff880016603740>
 
03-30-2012, 03:42 PM
David Teigland

GPF in dlm_lowcomms_stop

On Wed, Mar 21, 2012 at 07:59:13PM -0600, dann frazier wrote:
> However... we've dropped the connections_lock, so it's possible that a
> new connection gets created on line 9. This connection structure would
> have pointers to the workqueues that we're about to destroy. Sometime
> later we get data on this new connection, try to add work to the
> now-obliterated workqueues, and things blow up.

Hi Dan, I'm not very familiar with this code either, but I've talked with
Chrissie and she suggested we try something like this:

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 133ef6d..a3c431e 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -142,6 +142,7 @@ struct writequeue_entry {
 
 static struct sockaddr_storage *dlm_local_addr[DLM_MAX_ADDR_COUNT];
 static int dlm_local_count;
+static int dlm_allow_conn;
 
 /* Work queues */
 static struct workqueue_struct *recv_workqueue;
@@ -710,6 +711,13 @@ static int tcp_accept_from_sock(struct connection *con)
 	struct connection *newcon;
 	struct connection *addcon;
 
+	mutex_lock(&connections_lock);
+	if (!dlm_allow_conn) {
+		mutex_unlock(&connections_lock);
+		return -1;
+	}
+	mutex_unlock(&connections_lock);
+
 	memset(&peeraddr, 0, sizeof(peeraddr));
 	result = sock_create_kern(dlm_local_addr[0]->ss_family, SOCK_STREAM,
 				  IPPROTO_TCP, &newsock);
@@ -1503,6 +1511,7 @@ void dlm_lowcomms_stop(void)
 	   socket activity.
 	*/
 	mutex_lock(&connections_lock);
+	dlm_allow_conn = 0;
 	foreach_conn(stop_conn);
 	mutex_unlock(&connections_lock);
 
@@ -1540,6 +1549,8 @@ int dlm_lowcomms_start(void)
 	if (!con_cache)
 		goto out;
 
+	dlm_allow_conn = 1;
+
 	/* Start listening */
 	if (dlm_config.ci_protocol == 0)
 		error = tcp_listen_for_all();
@@ -1555,6 +1566,7 @@ int dlm_lowcomms_start(void)
 	return 0;
 
 fail_unlisten:
+	dlm_allow_conn = 0;
 	con = nodeid2con(0,0);
 	if (con) {
 		close_connection(con, false);
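
For reference, the guard boils down to this pattern (a userspace
rendering with illustrative names, not the kernel code): because both
sides take connections_lock, a racing accept either observes the
cleared flag and bails out, or completes before stop_conn runs and is
torn down with the rest.

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t connections_lock = PTHREAD_MUTEX_INITIALIZER;
static bool allow_conn;

/* accept path: refuse new connections once stop has begun */
static int accept_guarded(void)
{
	pthread_mutex_lock(&connections_lock);
	if (!allow_conn) {
		pthread_mutex_unlock(&connections_lock);
		return -1;		/* caller just drops the socket */
	}
	pthread_mutex_unlock(&connections_lock);
	/* ... create the connection as before ... */
	return 0;
}

/* stop path: close the gate before the workqueues go away */
static void stop_guarded(void)
{
	pthread_mutex_lock(&connections_lock);
	allow_conn = false;		/* no new connections after this */
	pthread_mutex_unlock(&connections_lock);
	/* ... stop existing connections, then destroy workqueues ... */
}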
 
03-30-2012, 04:42 PM
David Teigland

GPF in dlm_lowcomms_stop

On Fri, Mar 30, 2012 at 11:42:56AM -0400, David Teigland wrote:
> Hi Dan, I'm not very familiar with this code either, but I've talked with
> Chrissie and she suggested we try something like this:

A second version that addresses a potentially similar problem in start.

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 133ef6d..5c1b0e3 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -142,6 +142,7 @@ struct writequeue_entry {
 
 static struct sockaddr_storage *dlm_local_addr[DLM_MAX_ADDR_COUNT];
 static int dlm_local_count;
+static int dlm_allow_conn;
 
 /* Work queues */
 static struct workqueue_struct *recv_workqueue;
@@ -710,6 +711,13 @@ static int tcp_accept_from_sock(struct connection *con)
 	struct connection *newcon;
 	struct connection *addcon;
 
+	mutex_lock(&connections_lock);
+	if (!dlm_allow_conn) {
+		mutex_unlock(&connections_lock);
+		return -1;
+	}
+	mutex_unlock(&connections_lock);
+
 	memset(&peeraddr, 0, sizeof(peeraddr));
 	result = sock_create_kern(dlm_local_addr[0]->ss_family, SOCK_STREAM,
 				  IPPROTO_TCP, &newsock);
@@ -1503,6 +1511,7 @@ void dlm_lowcomms_stop(void)
 	   socket activity.
 	*/
 	mutex_lock(&connections_lock);
+	dlm_allow_conn = 0;
 	foreach_conn(stop_conn);
 	mutex_unlock(&connections_lock);
 
@@ -1530,7 +1539,7 @@ int dlm_lowcomms_start(void)
 	if (!dlm_local_count) {
 		error = -ENOTCONN;
 		log_print("no local IP address has been set");
-		goto out;
+		goto fail;
 	}
 
 	error = -ENOMEM;
@@ -1538,7 +1547,13 @@ int dlm_lowcomms_start(void)
 				      __alignof__(struct connection), 0,
 				      NULL);
 	if (!con_cache)
-		goto out;
+		goto fail;
+
+	error = work_start();
+	if (error)
+		goto fail_destroy;
+
+	dlm_allow_conn = 1;
 
 	/* Start listening */
 	if (dlm_config.ci_protocol == 0)
@@ -1548,20 +1563,17 @@ int dlm_lowcomms_start(void)
 	if (error)
 		goto fail_unlisten;
 
-	error = work_start();
-	if (error)
-		goto fail_unlisten;
-
 	return 0;
 
 fail_unlisten:
+	dlm_allow_conn = 0;
 	con = nodeid2con(0,0);
 	if (con) {
 		close_connection(con, false);
 		kmem_cache_free(con_cache, con);
 	}
+fail_destroy:
 	kmem_cache_destroy(con_cache);
-
-out:
+fail:
 	return error;
 }
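
Condensed, the reordered start path now has the shape below (my
paraphrase with stub helpers, not the real code): the workqueues are
created before the listening socket can hand us a connection, the
gate opens only after that, and each failure label unwinds exactly
what was set up before it.

static int allow_conn;

/* stubs standing in for the real setup/teardown steps */
static int init_local_addrs(void)	{ return 0; }
static int create_con_cache(void)	{ return 0; }	/* kmem_cache_create() */
static int work_start_stub(void)	{ return 0; }
static int listen_for_all(void)		{ return 0; }
static void close_listening_con(void)	{ }
static void destroy_con_cache(void)	{ }		/* kmem_cache_destroy() */

int lowcomms_start_shape(void)
{
	int error;

	error = init_local_addrs();
	if (error)
		goto fail;

	error = create_con_cache();
	if (error)
		goto fail;

	error = work_start_stub();	/* workqueues exist first... */
	if (error)
		goto fail_destroy;

	allow_conn = 1;			/* ...then open the gate... */

	error = listen_for_all();	/* ...then start listening */
	if (error)
		goto fail_unlisten;

	return 0;

fail_unlisten:
	allow_conn = 0;			/* re-close the gate */
	close_listening_con();
fail_destroy:
	destroy_con_cache();
fail:
	return error;
}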
 
03-30-2012, 05:17 PM
dann frazier

GPF in dlm_lowcomms_stop

On Fri, Mar 30, 2012 at 12:42:40PM -0400, David Teigland wrote:
> On Fri, Mar 30, 2012 at 11:42:56AM -0400, David Teigland wrote:
> > Hi Dan, I'm not very familiar with this code either, but I've talked with
> > Chrissie and she suggested we try something like this:

Yeah, that's the mechanism I was thinking of as well.

> A second version that addresses a potentially similar problem in
> start.

Oh, good catch! I hadn't considered that path.

Reviewed-by: dann frazier <dann.frazier@canonical.com>

 
05-04-2012, 05:33 PM
dann frazier

GPF in dlm_lowcomms_stop

On Fri, Mar 30, 2012 at 11:17:56AM -0600, dann frazier wrote:
> On Fri, Mar 30, 2012 at 12:42:40PM -0400, David Teigland wrote:
> > On Fri, Mar 30, 2012 at 11:42:56AM -0400, David Teigland wrote:
> > > Hi Dan, I'm not very familiar with this code either, but I've talked with
> > > Chrissie and she suggested we try something like this:
>
> Yeah, that's the mechanism I was thinking of as well.
>
> > A second version that addresses a potentially similar problem in
> > start.
>
> Oh, good catch! I hadn't considered that path.
>
> Reviewed-by: dann frazier <dann.frazier@canonical.com>

fyi, we haven't uncovered any problems when testing with this
patch. Are there plans to submit it for the next merge cycle?

-dann

 
05-04-2012, 05:51 PM
David Teigland

GPF in dlm_lowcomms_stop

On Fri, May 04, 2012 at 11:33:17AM -0600, dann frazier wrote:
> On Fri, Mar 30, 2012 at 11:17:56AM -0600, dann frazier wrote:
> > On Fri, Mar 30, 2012 at 12:42:40PM -0400, David Teigland wrote:
> > > On Fri, Mar 30, 2012 at 11:42:56AM -0400, David Teigland wrote:
> > > > Hi Dan, I'm not very familiar with this code either, but I've talked with
> > > > Chrissie and she suggested we try something like this:
> >
> > Yeah, that's the mechanism I was thinking of as well.
> >
> > > A second version that addresses a potentially similar problem in
> > > start.
> >
> > Oh, good catch! I hadn't considered that path.
> >
> > Reviewed-by: dann frazier <dann.frazier@canonical.com>
>
> fyi, we haven't uncovered any problems when testing with this
> patch.

Great, thanks.

> Are there plans to submit it for the next merge cycle?

Yes, it's here
http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/teigland/linux-dlm.git;a=shortlog;h=refs/heads/next
 
