
"Edward Z. Yang" 10-13-2010 11:04 PM

Strange wedging
 
Hey all,

I'm too tired right now to write up a proper report, but would
the following behavior be something y'all would be interested in
debugging?

* Outgoing incremental GSSAPI-authed MMR replications wedge
indefinitely, in Kerberos code.
* It's impossible to do a full update without disabling all
incoming and outgoing replication agreements, because as
soon as another replication goes and gets stuck, everything
else fails.

Basically, dirsrv+GSSAPI can get into some sort of wedged
state that persists across restarts, which means:

* You can't restart the server without kill -9'ing it
* You can't do a full update

And the only way to fix it is to reinitialize all replication
agreements.

Edward
--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users

Rich Megginson 10-14-2010 07:35 PM

Strange wedging
 
Edward Z. Yang wrote:
> Hey all,
>
> I'm too tired right now to write up a proper report, but would
> the following behavior be something y'all would be interested in
> debugging?
>
We have tested server to server SASL/GSSAPI with replication on RHEL5,
but we have not seen this happen. Do you have more than one replication
agreement? Would it be possible for you to provide a stacktrace
obtained with thread apply all bt in gdb?
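
For anyone following along, here is a rough sketch of how such a trace might be captured non-interactively. The helper names and the output path are illustrative assumptions and not part of any 389 tooling; the only gdb pieces relied on are the standard -p, -batch, and -ex options together with the "thread apply all bt" command mentioned above.

```python
import subprocess
from typing import List


def build_gdb_backtrace_command(pid: int) -> List[str]:
    """Return a gdb argv that attaches to `pid` and dumps every thread's stack.

    Hypothetical helper for illustration only.
    """
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    return [
        "gdb",
        "-p", str(pid),                # attach to the already-running process
        "-batch",                      # run the given commands, then exit
        "-ex", "thread apply all bt",  # backtrace of every thread
    ]


def capture_backtrace(pid: int, out_path: str) -> int:
    """Run gdb against `pid`, write its output to `out_path`, return gdb's exit code."""
    with open(out_path, "w") as out:
        result = subprocess.run(
            build_gdb_backtrace_command(pid),
            stdout=out,
            stderr=subprocess.STDOUT,  # keep gdb warnings alongside the trace
            check=False,
        )
    return result.returncode
```

Calling capture_backtrace() with the pid of the wedged ns-slapd process and a file name of your choosing should leave a trace that can be posted to the list. Attaching typically requires root, and the frames are far more readable when the matching debuginfo packages are installed.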
> * Outgoing incremental GSSAPI-authed MMR replications wedge
> indefinitely, in Kerberos code.
> * It's impossible to do a full update without disabling all
> incoming and outgoing replication agreements, because as
> soon as another replication goes and gets stuck, everything
> else fails.
>
> Basically, dirsrv+GSSAPI can get into some sort of wedged
> state that persists across restarts, which means:
>
> * You can't restart the server without kill -9'ing it
> * You can't do a full update
>
> And the only way to fix it is to reinitialize all replication
> agreements.
>
> Edward
>


"Edward Z. Yang" 10-14-2010 09:14 PM

Strange wedging
 
Excerpts from Rich Megginson's message of Thu Oct 14 15:35:38 -0400 2010:
> We have tested server to server SASL/GSSAPI with replication on RHEL5,
> but we have not seen this happen. Do you have more than one replication
> agreement?

Yes; we're doing full multimaster, so every master has a replication agreement
with every other master.

> Would it be possible for you to provide a stacktrace
> obtained with thread apply all bt in gdb?

Sure. See:

http://web.mit.edu/~ezyang/Public/wedged-ldap.txt

Edward

Rich Megginson 10-14-2010 09:41 PM

Strange wedging
 
Edward Z. Yang wrote:
> Excerpts from Rich Megginson's message of Thu Oct 14 15:35:38 -0400 2010:
>
>> We have tested server to server SASL/GSSAPI with replication on RHEL5,
>> but we have not seen this happen. Do you have more than one replication
>> agreement?
>>
>
> Yes; we're doing full multimaster, so every master has a replication agreement
> with every other master.
>
>
>> Would it be possible for you to provide a stacktrace
>> obtained with thread apply all bt in gdb?
>>
>
> Sure. See:
>
> http://web.mit.edu/~ezyang/Public/wedged-ldap.txt
>
> Edward
>
Thanks. Looks like this stack trace is from a 389-ds-base-1.2.5 server:

Thread 36 (Thread 0x7f29ff5fe910 (LWP 24382)):
#0 pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:220
#1 0x0000003facc22ff9 in ?? () from /lib64/libnspr4.so
#2 0x0000003facc23bdc in PR_WaitCondVar () from /lib64/libnspr4.so
#3 0x00007f2a1898ecfc in protocol_sleep (prp=0x2723a50, duration=300000)
at ldap/servers/plugins/replication/repl5_inc_protocol.c:1309
#4 0x00007f2a1898fedc in repl5_inc_run (prp=0x2723a50) at ldap/servers/plugins/replication/repl5_inc_protocol.c:796
#5 0x00007f2a18994119 in prot_thread_main (arg=<value optimized out>)
at ldap/servers/plugins/replication/repl5_protocol.c:313
#6 0x0000003facc29773 in ?? () from /lib64/libnspr4.so
#7 0x000000300b80685a in start_thread (arg=<value optimized out>) at pthread_create.c:297
#8 0x000000300acde22d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#9 0x0000000000000000 in ?? ()

This corresponds to:
http://git.fedorahosted.org/git/?p=389/ds.git;a=blob;f=ldap/servers/plugins/replication/repl5_inc_protocol.c;h=4e733dec208e3426d13c2ed2b4239300d955e232;hb=389-ds-base-1.2.5

795         wait_change_timer_set = 1;
796         protocol_sleep(prp, MAX_WAIT_BETWEEN_SESSIONS);
797     }

But not to 1.2.6:
http://git.fedorahosted.org/git/?p=389/ds.git;a=blob;f=ldap/servers/plugins/replication/repl5_inc_protocol.c;h=6475eb89ba168b30a8cb38cd5a78f8dc1d8b4796;hb=389-ds-base-1.2.6

795     else
796     {
797         if (wait_change_timer_set)

Although I can't say for sure whether the bug you are encountering
exists in 1.2.6, it's much easier for us to support the latest version.
Can you try to reproduce with 1.2.6? If you would rather use 1.2.6.1,
it has been pushed to Fedora/EPEL Stable and should be available from
the mirrors within the next 48 hours. If you don't want to wait you can
install from Fedora updates-testing or EPEL epel-testing.

"Edward Z. Yang" 10-14-2010 09:47 PM

Strange wedging
 
We've not observed any of our 1.2.6 servers wedging in this fashion.
However, we need to preserve our 1.2.5 servers because if we axe them
we can't do full updates yet (as per https://bugzilla.redhat.com/show_bug.cgi?id=637852).
With any luck the upcoming update will fix our issue; this patch is
slated for 1.2.6.1?

Edward

Rich Megginson 10-14-2010 10:57 PM

Strange wedging
 
Edward Z. Yang wrote:
> We've not observed any of our 1.2.6 servers wedging in this fashion.
> However, we need to preserve our 1.2.5 servers because if we axe them
> we can't do full updates yet (as per https://bugzilla.redhat.com/show_bug.cgi?id=637852).
> With any luck the upcoming update will fix our issue; this patch is
> slated for 1.2.6.1?
>
1.2.6.1 is already released. There is a slight chance we could do a
1.2.6.2, but otherwise we were targeting this for 1.2.7.
> Edward
>


"Edward Z. Yang" 10-14-2010 11:12 PM

Strange wedging
 
Excerpts from Rich Megginson's message of Thu Oct 14 18:57:54 -0400 2010:
> 1.2.6.1 is already released. There is a slight chance we could do a
> 1.2.6.2, but otherwise we were targeting this for 1.2.7.

I wonder if Fedora 13 is going to pick up 1.2.7. Might be a bit annoying
if they don't. I'll try to work around the bug for now, but it's kind of
painful.

Cheers,
Edward

Rich Megginson 10-14-2010 11:15 PM

Strange wedging
 
Edward Z. Yang wrote:
> Excerpts from Rich Megginson's message of Thu Oct 14 18:57:54 -0400 2010:
>
>> 1.2.6.1 is already released. There is a slight chance we could do a
>> 1.2.6.2, but otherwise we were targeting this for 1.2.7.
>>
>
> I wonder if Fedora 13 is going to pick up 1.2.7.
Yes. We will push 1.2.7 to Fedora 13.
> Might be a bit annoying
> if they don't. I'll try to work around the bug for now, but it's kind of
> painful.
>
> Cheers,
> Edward
>


"Edward Z. Yang" 10-14-2010 11:20 PM

Strange wedging
 
Excerpts from Rich Megginson's message of Thu Oct 14 19:15:33 -0400 2010:
> Yes. We will push 1.2.7 to Fedora 13.

Cool, that'll be great. I wait eagerly for the release.

Edward

"Edward Z. Yang" 10-16-2010 07:29 PM

Strange wedging
 
Some of our 1.2.6.1 servers wedge too; dunno if it's the same bug:

http://web.mit.edu/~ezyang/Public/hung-terminating-dirsrv-1.2.6.log

[root@whole-enchilada ~]# ns-slapd --version
389 Project
389-Directory/1.2.6.1 B2010.272.237

Cheers,
Edward

