It's easy to tell if you've hit this bug, because a message like this will
always appear in /var/log/messages:
SM: 02000378 ignoring service callback id=2000144 event=1324
If you look at /proc/cluster/lock_dlm/debug on this node at this point,
you'll see something like this at the end, which shows what the problem
is:
others_may_mount start_done 1322 b
The event_id that others_may_mount uses when calling kcl_start_done()
is incorrect; it's using 1322 when it should be 1324.
I believe the fix is for others_may_mount() to read the event_id only
after taking the umount_lock semaphore, which serializes
others_may_mount() with a start callback from the lock_dlm thread.
In this case, I believe the start callback is changing the event_id
after others_may_mount() reads it, and before others_may_mount() gets
the umount_lock semaphore.
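
To make the suspected ordering problem concrete, here is a minimal
userspace sketch in C. It is not the lock_dlm code: struct lockspace,
start_callback(), start_done_id_buggy() and start_done_id_fixed() are
hypothetical names, a pthread mutex stands in for the umount_lock
semaphore, and the start callback is invoked inline so the interleaving
is deterministic rather than a real race between threads. The returned
value stands for the id that would be passed to kcl_start_done().

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for the per-lockspace state; not the real struct. */
struct lockspace {
    pthread_mutex_t umount_lock; /* stands in for the umount_lock semaphore */
    int event_id;                /* id of the most recent start event */
};

/* Stand-in for the start callback from the lock_dlm thread:
 * it updates event_id while holding umount_lock. */
static void start_callback(struct lockspace *ls, int new_id)
{
    pthread_mutex_lock(&ls->umount_lock);
    ls->event_id = new_id;
    pthread_mutex_unlock(&ls->umount_lock);
}

/* Suspected buggy order: event_id is read BEFORE the lock is taken.
 * The inline start_callback() call simulates the callback running in
 * the window between the read and the lock acquisition. */
static int start_done_id_buggy(struct lockspace *ls, int racing_id)
{
    int id = ls->event_id;          /* read too early */

    start_callback(ls, racing_id);  /* simulated: callback wins the lock first */

    pthread_mutex_lock(&ls->umount_lock);
    /* ... the real code would call kcl_start_done() with id here ... */
    pthread_mutex_unlock(&ls->umount_lock);
    return id;                      /* stale id */
}

/* Proposed order: take the lock first, then read event_id. Any callback
 * that ran earlier has already finished, and none can run while the
 * lock is held, so the id read here is current. */
static int start_done_id_fixed(struct lockspace *ls, int racing_id)
{
    int id;

    start_callback(ls, racing_id);  /* simulated: callback completes first */

    pthread_mutex_lock(&ls->umount_lock);
    id = ls->event_id;              /* read under the lock */
    assert(id == ls->event_id);     /* cannot change while we hold the lock */
    pthread_mutex_unlock(&ls->umount_lock);
    return id;
}
```

Starting from event_id 1322 with a simulated callback that sets 1324,
start_done_id_buggy() returns 1322 and start_done_id_fixed() returns
1324, which matches the mismatch seen in the debug output above.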