Old 01-17-2008, 03:31 PM
Wendy Cheng
 
NLM failover unlock commands

J. Bruce Fields wrote:
> On Thu, Jan 17, 2008 at 10:48:56AM -0500, Wendy Cheng wrote:
>> J. Bruce Fields wrote:
>>> Remind me: why do we need both per-ip and per-filesystem methods? In
>>> practice, I assume that we'll always do *both*?
>>
>> Failover is normally done via a virtual IP address, so the per-ip
>> method should be the core routine. However, for a non-cluster
>> filesystem such as ext3/4, changing servers also implies an unmount.
>> If there are clients that don't follow the rules and obtain locks via
>> different ip interfaces, the unmount will fail, which ends up aborting
>> the failover process. That's where we need the per-filesystem method.
>>
>> ServerA:
>> 1. Tear down the IP address
>> 2. Unexport the path
>> 3. Write the IP to /proc/fs/nfsd/unlock_ip to unlock files
>> 4. If an unmount is required,
>>    write the path name to /proc/fs/nfsd/unlock_filesystem, then unmount.
>> 5. Signal the peer to begin take-over.
>>
>> Some time ago we were looking at "export name" as the core method (so
>> the per-filesystem method would be a subset of that). Unfortunately,
>> the prototype efforts showed the code would be too intrusive (when a
>> filesystem sub-tree is exported).
>>
>>> We're migrating clients by moving a server ip address from one node to
>>> another. And I assume we're permitting at most one node to export each
>>> filesystem at a time. So it *should* be the case that the set of locks
>>> held on the filesystem(s) that are moving are the same as the set of
>>> locks held by the virtual ip that is moving.
>>
>> This is true for a non-cluster filesystem. But a cluster filesystem can
>> be exported from multiple servers.
>
> But that last sentence:
>
>     it *should* be the case that the set of locks held on the
>     filesystem(s) that are moving are the same as the set of locks
>     held by the virtual ip that is moving.
>
> is still true in the cluster filesystem case, right?
>
> --b.

Yes .... Wendy
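For concreteness, the five ServerA steps quoted above might look roughly
like this as a shell sketch. The address 10.1.1.2 and path /mnt/sfs1 are
the examples used elsewhere in this thread; the interface name, export
options, and peer-notification command are made-up placeholders:

  # Illustrative failover sequence on ServerA (non-cluster filesystem).
  ip addr del 10.1.1.2/24 dev eth0                   # 1. tear down the virtual IP
  exportfs -u '*:/mnt/sfs1'                          # 2. unexport the path
  echo 10.1.1.2 > /proc/fs/nfsd/unlock_ip            # 3. drop locks held via that IP
  echo /mnt/sfs1 > /proc/fs/nfsd/unlock_filesystem   # 4. drop any remaining locks,
  umount /mnt/sfs1                                   #    then unmount
  ssh serverB /etc/init.d/takeover start             # 5. signal the peer (placeholder)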
 
Old 01-17-2008, 03:40 PM
"J. Bruce Fields"
 
NLM failover unlock commands

On Thu, Jan 17, 2008 at 11:31:22AM -0500, Wendy Cheng wrote:
> J. Bruce Fields wrote:
>> But that last sentence:
>>
>>     it *should* be the case that the set of locks held on the
>>     filesystem(s) that are moving are the same as the set of locks
>>     held by the virtual ip that is moving.
>>
>> is still true in the cluster filesystem case, right?
> Yes .... Wendy

In what situations (buggy client? weird network failure?) could that
fail to be the case?

Would there be any advantage to enforcing that requirement in the
server? (For example, teaching nlm to reject any locking request for a
certain filesystem that wasn't sent to a certain server IP.)

--b.
 
Old 01-17-2008, 04:59 PM
Wendy Cheng
 
NLM failover unlock commands

Frank Filz wrote:
> I assume the intent here with this implementation is that the node
> taking over will start lock recovery for the ip address? With that
> perspective, I guess it would be important that each filesystem only be
> accessed via a single ip address; otherwise lock recovery will not work
> correctly, since another node/ip could accept locks for that filesystem,
> possibly "stealing" a lock that is in recovery. As I recall, our
> implementation put the entire filesystem cluster-wide into recovery
> during fail-over.

We have two more patches on their way to this mailing list that:

* Set a per-ip grace period
* Notify only the relevant clients about reclaim events

-- Wendy
 
Old 01-17-2008, 05:07 PM
Wendy Cheng
 
NLM failover unlock commands

J. Bruce Fields wrote:
> In what situations (buggy client? weird network failure?) could that
> fail to be the case?
>
> Would there be any advantage to enforcing that requirement in the
> server? (For example, teaching nlm to reject any locking request for a
> certain filesystem that wasn't sent to a certain server IP.)
>
> --b.

It is doable... It could be added into the "resume" patch that is
currently being tested (since the logic is so similar to the per-ip
grace period); that should be out for review no later than next Monday.

However, as with any new code added into the system, there are
trade-offs. I'm not sure we want to keep enhancing this too much,
though. Remember, locking is about latency. Adding more checking will
hurt latency.

-- Wendy
 
Old 01-17-2008, 07:23 PM
"J. Bruce Fields"
 
NLM failover unlock commands

To summarize a phone conversation from today:

On Thu, Jan 17, 2008 at 01:07:02PM -0500, Wendy Cheng wrote:
> J. Bruce Fields wrote:
>> Would there be any advantage to enforcing that requirement in the
>> server? (For example, teaching nlm to reject any locking request for a
>> certain filesystem that wasn't sent to a certain server IP.)
>>
>> --b.
>>
> It is doable... It could be added into the "resume" patch that is
> currently being tested (since the logic is so similar to the per-ip
> grace period); that should be out for review no later than next Monday.
>
> However, as with any new code added into the system, there are
> trade-offs. I'm not sure we want to keep enhancing this too much,
> though.

Sure. And I don't want to make this terribly complicated. The patch
looks good, and solves a clear problem. That said, there are a few
related problems we'd like to solve:

	- We want to be able to move an export to a node with an already
	  active nfs server. Currently that requires restarting all of
	  nfsd on the target node. This is what I understand your next
	  patch fixes.
	- In the case of a filesystem that may be mounted from multiple
	  nodes at once, we need to make sure we're not leaving a window
	  allowing other applications to claim locks that nfs clients
	  haven't recovered yet.
	- Ideally we'd like this to be possible without making the
	  filesystem block all lock requests during a 90-second grace
	  period; instead it should only have to block those requests
	  that conflict with to-be-recovered locks.
	- All this should work for nfsv4, where we want to eventually
	  also allow migration of individual clients, and
	  client-initiated failover.

I absolutely don't want to delay solving this particular problem until
all the above is figured out, but I would like to be reasonably
confident that the new user-interface can be extended naturally to
handle the above cases; or at least that it won't unnecessarily
complicate their implementation.

I'll try to sketch an implementation of most of the above in the next
week.

Anyway, that, together with the fact that the 2.6.25 merge window is
opening up soon (in a week or so?), inclines me toward delaying
submitting this until 2.6.26.

> Remember,
> locking is about latency. Adding more checking will hurt latency.

Do you have any latency tests that we could use, or latency-sensitive
workloads that you use as benchmarks?

My suspicion is that checks such as these would be dwarfed by the posix
deadlock detection checks, not to mention the roundtrip to the server
for the nlm rpc and (in the gfs2 case) the communication with gfs2's
posix lock manager.

But I'd love any chance to demonstrate lock latency problems; I'm sure
there's good work to be done there.

--b.
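As a rough illustration of the kind of measurement being asked about,
one could time repeated lock/unlock cycles over an NFS mount from a
client. This is only a crude sketch: the mount point is an assumption,
flock(1) forks a shell per iteration, and it relies on flock() over NFS
being emulated as a whole-file POSIX lock (and hence going through NLM
on NFSv3), which should hold on 2.6.12 and later:

  # Crude NLM lock-latency probe from an NFS client.
  touch /mnt/nfs/lockfile
  time ( for i in $(seq 1 1000); do
           flock /mnt/nfs/lockfile -c true    # one lock/unlock round trip
         done )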
 
Old 01-18-2008, 09:03 AM
Frank van Maarseveen
 
NLM failover unlock commands

On Thu, Jan 17, 2008 at 03:23:42PM -0500, J. Bruce Fields wrote:
> To summarize a phone conversation from today:
> [...]
> 	- We want to be able to move an export to a node with an already
> 	  active nfs server. Currently that requires restarting all of
> 	  nfsd on the target node. This is what I understand your next
> 	  patch fixes.

Maybe a silly question, but what about using "exportfs -r" for this?

--
Frank
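For reference, a sketch of what that suggestion amounts to on the
take-over node; the export entry and options are illustrative:

  # Add the migrated filesystem's entry to /etc/exports, then re-sync
  # the kernel export table without restarting nfsd:
  echo '/mnt/sfs1 *(rw,sync)' >> /etc/exports
  exportfs -r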
 
Old 01-18-2008, 09:21 AM
Frank van Maarseveen
 
NLM failover unlock commands

> shell> echo 10.1.1.2 > /proc/fs/nfsd/unlock_ip
> shell> echo /mnt/sfs1 > /proc/fs/nfsd/unlock_filesystem
>
> The expected sequence of events can be:
> 1. Tear down the IP address

You might consider using iptables at this point for dropping outgoing
packets with that source IP address to catch any packet still in
flight. It fixed ESTALE problems for me IIRC (NFSv3, UDP).
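A minimal sketch of that rule, reusing the floating address from the
example above; the chain choice and cleanup timing are assumptions:

  # Just before (or right after) tearing down the virtual IP:
  iptables -I OUTPUT -s 10.1.1.2 -j DROP
  # ...and once the peer has completed take-over:
  iptables -D OUTPUT -s 10.1.1.2 -j DROP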

> 2. Unexport the path
> 3. Write IP to /proc/fs/nfsd/unlock_ip to unlock files
> 4. If unmount required, write path name to
> /proc/fs/nfsd/unlock_filesystem, then unmount.
> 5. Signal peer to begin take-over.
>

--
Frank
 
Old 01-18-2008, 01:56 PM
Wendy Cheng
 
NLM failover unlock commands

Frank van Maarseveen wrote:
> J. Bruce Fields wrote:
>> 	- We want to be able to move an export to a node with an already
>> 	  active nfs server. Currently that requires restarting all of
>> 	  nfsd on the target node. This is what I understand your next
>> 	  patch fixes.
>
> Maybe a silly question, but what about using "exportfs -r" for this?

/me prays we won't go back to our *old* export failover proposal (from
about two years ago) ...

Anyway, re-exporting is part of the required steps the take-over server
needs to do. It doesn't, however, handle the lockd grace period, so we
would have the possibility that other clients could steal the locks
away. That's why Bruce and I are working on the second "resume" patch
at this moment.

-- Wendy
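To make that gap concrete, a sketch of the take-over side using the
same illustrative names as before; the device name is made up, and the
final comment marks what the upcoming "resume" patch is meant to
address:

  # Illustrative take-over sequence on ServerB (non-cluster filesystem).
  mount /dev/shared1 /mnt/sfs1          # mount the shared storage
  exportfs '*:/mnt/sfs1'                # re-export the path
  ip addr add 10.1.1.2/24 dev eth0      # bring up the floating IP
  # Gap: lockd on this node grants new locks immediately, so another
  # client can grab a lock before the original holders reclaim theirs;
  # the per-ip grace period ("resume") patch is meant to close this.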
 
Old 01-18-2008, 02:00 PM
Wendy Cheng
 
NLM failover unlock commands

Frank van Maarseveen wrote:
>> shell> echo 10.1.1.2 > /proc/fs/nfsd/unlock_ip
>> shell> echo /mnt/sfs1 > /proc/fs/nfsd/unlock_filesystem
>>
>> The expected sequence of events can be:
>> 1. Tear down the IP address
>
> You might consider using iptables at this point for dropping outgoing
> packets with that source IP address to catch any packet still in
> flight. It fixed ESTALE problems for me IIRC (NFSv3, UDP).

ok, thanks ... Wendy

>> 2. Unexport the path
>> 3. Write IP to /proc/fs/nfsd/unlock_ip to unlock files
>> 4. If unmount required, write path name to
>>    /proc/fs/nfsd/unlock_filesystem, then unmount.
>> 5. Signal peer to begin take-over.
 
