Ronnie Sahlberg [Wed, 10 Oct 2007 20:16:36 +0000 (06:16 +1000)]
simplify election handling
make sure we read and update the flags from all remote nodes before we
reach the first codepath that can call do_recovery()
since during do_recovery() we need to know what the flags are.
Ronnie Sahlberg [Tue, 9 Oct 2007 23:42:32 +0000 (09:42 +1000)]
add a --single-public-ip argument to ctdbd to specify the ip address
used in single public ip address mode.
when using this argument, --public-interface must also be used.
add a vnn structure to the ctdb context to describe the single public ip
address
update the killtcp control in the daemon so that if a socketpair that is to
be killed does not match a normal public address, it checks whether the
destination address matches the single public ip address and, if so, uses
that vnn structure from the ctdb context.
this allows killtcp to also kill connections to the single public ip
instead of only connections to normal public addresses
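a minimal sketch of the lookup order described above, assuming illustrative structure and helper names (ctdb_context, single_ip_vnn and killtcp_find_vnn are stand-ins here, not the daemon's exact symbols):

    #include <stdbool.h>
    #include <stddef.h>
    #include <netinet/in.h>

    /* minimal stand-ins for the daemon's structures; names are assumptions */
    struct ctdb_vnn { struct sockaddr_in public_address; };
    struct ctdb_context {
        struct ctdb_vnn *vnns;          /* normal public addresses */
        size_t num_vnns;
        struct ctdb_vnn *single_ip_vnn; /* NULL unless --single-public-ip was given */
    };

    static bool same_ipv4(const struct sockaddr_in *a, const struct sockaddr_in *b)
    {
        return a->sin_addr.s_addr == b->sin_addr.s_addr;
    }

    /* pick the vnn a killtcp request should use for this destination address */
    static struct ctdb_vnn *killtcp_find_vnn(struct ctdb_context *ctdb,
                                             const struct sockaddr_in *dst)
    {
        for (size_t i = 0; i < ctdb->num_vnns; i++) {
            if (same_ipv4(&ctdb->vnns[i].public_address, dst)) {
                return &ctdb->vnns[i];
            }
        }
        /* no normal public address matched: fall back to the single public ip */
        if (ctdb->single_ip_vnn != NULL &&
            same_ipv4(&ctdb->single_ip_vnn->public_address, dst)) {
            return ctdb->single_ip_vnn;
        }
        return NULL; /* not an address this node manages, nothing to kill */
    }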
Ronnie Sahlberg [Mon, 8 Oct 2007 04:05:22 +0000 (14:05 +1000)]
add an initial test version of an ip multiplex tool that allows us
to have one single public ip address for the entire cluster.
this ip address is attached to lo on all nodes but only the recmaster
will respond to arp requests for this address.
the recmaster then runs an ipmux process that will pass any incoming
packets to this ip address on to the other nodes in the cluster based on
the ip address of the client host
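a minimal sketch of one way the ipmux process could pick a destination node from the client's source address; the modulo hash here is an illustrative assumption, not necessarily what the tool implements:

    #include <stdint.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>

    /* choose which cluster node should handle a packet, based only on the
     * client's source ip; a stable hash keeps one client on one node */
    static uint32_t ipmux_pick_node(const struct sockaddr_in *client,
                                    uint32_t num_nodes)
    {
        uint32_t ip = ntohl(client->sin_addr.s_addr);
        return ip % num_nodes; /* node index in the range 0..num_nodes-1 */
    }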
to use this feature one must
1, have one fixed ip address in the customer's network permanently
attached to an interface
2, set CTDB_PUBLIC_INTERFACE=
to specify on which interface the clients attach to the node
3, set CTDB_SINGLE_PUBLIC_IP=ip-address
to specify which ip address should be the "single public ip address"
to test with only a single client, attach several ip addresses to
the client and ping the public address from the client with different -I
options. look in a network trace to see which node the packet is
passed on to.
Ronnie Sahlberg [Sun, 7 Oct 2007 23:47:20 +0000 (09:47 +1000)]
add a function in the ctdb tool to determine whether the local node is
the recmaster or not.
return 0 if the node is the recmaster and 1 (true) if it is not or if
we could not communicate with the ctdb daemon.
call it 'isnotrecmaster' to cope with the case where the tool could not bind to
the socket to talk to the daemon; the tool will then automatically return an
error and exit code 1.
thus the tool will only return 0 if it could talk successfully to the
local daemon and the local daemon confirms this node is the recmaster
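a sketch of that exit-code convention, with placeholder helpers standing in for the real controls that ask the local ctdbd for the node's pnn and the current recmaster:

    #include <stdio.h>

    /* placeholders: the real tool queries the local daemon over its socket;
     * a negative value here models "could not talk to the daemon" */
    static int get_local_pnn(void)  { return -1; }
    static int get_recmaster(void)  { return -1; }

    /* exit 0 only when the daemon is reachable AND confirms we are the
     * recmaster; every failure path collapses to the same exit code 1 */
    int main(void)
    {
        int pnn = get_local_pnn();
        int recmaster = get_recmaster();

        if (pnn < 0 || recmaster < 0 || pnn != recmaster) {
            return 1; /* not recmaster, or daemon unreachable */
        }
        return 0;     /* this node is the recmaster */
    }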
Ronnie Sahlberg [Mon, 24 Sep 2007 00:52:26 +0000 (10:52 +1000)]
when we have a public ip address mismatch (i.e. we hold addresses we
shouldn't or we are not holding addresses we should)
we must first freeze the local node before we set the recovery mode
Andrew Tridgell [Mon, 24 Sep 2007 00:00:14 +0000 (10:00 +1000)]
no longer wait at startup for services to become available, instead
set the node initially unhealthy and let the status monitoring bring the node online.
This fixes a problem with winbindd, where it refused to start because secrets.tdb was not populated
but we could not populate ctdbd, because the net command would not run while ctdbd was still doing startup
and thus frozen
(This used to be ctdb commit 3a001b793dd76fb96addf1e2ccb74da326fbcfbc)
Ronnie Sahlberg [Fri, 21 Sep 2007 05:19:33 +0000 (15:19 +1000)]
in ctdb_control_persistent_store() we must talloc_steal() the pointer to
c to prevent it from being immediately freed (and our persistent store
state with it) if we need to wait asynchronously for other nodes before
we can reply back to the client
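a minimal sketch of that ownership move, assuming placeholder shapes for the control and state structures; only talloc_steal()/talloc_zero() are the real talloc API:

    #include <talloc.h>

    /* placeholder shapes for the sketch; the real structures live in ctdbd */
    struct ctdb_req_control { int dummy; };
    struct persistent_state {
        struct ctdb_req_control *c; /* the request we must answer later */
    };

    static void defer_reply(TALLOC_CTX *mem_ctx, struct ctdb_req_control *c)
    {
        struct persistent_state *state = talloc_zero(mem_ctx, struct persistent_state);
        if (state == NULL) {
            return;
        }
        /* without this, the caller frees c as soon as we return, taking our
         * pending-reply state down with it */
        state->c = talloc_steal(state, c);
        /* ... register the async callback that later replies via state->c ... */
    }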
Ronnie Sahlberg [Fri, 14 Sep 2007 00:16:36 +0000 (10:16 +1000)]
let each node verify that it has a correct assignment of public ip
addresses (i.e. it holds those it should hold and it doesn't hold
any of those it shouldn't hold)
if an inconsistency is found, mark the local node as recovery mode
active
and wait for the recovery master to trigger a full blown recovery
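a sketch of the consistency check, with an illustrative flat list standing in for ctdb's vnn structures and ctdb_sys_have_ip():

    #include <stdbool.h>
    #include <stddef.h>

    struct public_ip { bool should_hold; bool held; };

    enum recovery_mode { RECOVERY_NORMAL, RECOVERY_ACTIVE };

    /* any mismatch (holding an address we should not, or missing one we
     * should hold) puts the local node into recovery mode active so the
     * recmaster can trigger a full recovery */
    static enum recovery_mode verify_ip_assignment(const struct public_ip *ips,
                                                   size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (ips[i].held != ips[i].should_hold) {
                return RECOVERY_ACTIVE;
            }
        }
        return RECOVERY_NORMAL;
    }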
Ronnie Sahlberg [Thu, 13 Sep 2007 04:51:37 +0000 (14:51 +1000)]
when a ctdb_takeover_run has failed we must make sure that
need_takeover_run is set to true, or else we might forget to rerun it
during the next recovery.
otherwise, need_takeover_run is only set to true when the node flags for
a remote node and the local node differ.
it is possible that a takeover run fails, leaving the reassignment of
ip addresses incomplete, but that by the time we get back to the test in
monitor_cluster() the node flags of all nodes have converged
and match each other again, causing
monitor_cluster() to fail to realize that a takeover run is needed.
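a tiny sketch of the rule, with a placeholder for the takeover call; the point is only that a failure must leave the flag set regardless of whether the node flags later converge:

    #include <stdbool.h>

    static bool need_takeover_run;

    static int ctdb_takeover_run_stub(void) { return -1; } /* placeholder failure */

    static void attempt_takeover(void)
    {
        if (ctdb_takeover_run_stub() != 0) {
            /* keep the flag set so the next recovery retries; do not rely on
             * flag mismatches between nodes to retrigger it */
            need_takeover_run = true;
        } else {
            need_takeover_run = false;
        }
    }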
Andrew Tridgell [Wed, 12 Sep 2007 03:23:36 +0000 (13:23 +1000)]
- set arp_ignore to prevent replying to arp requests for addresses on loopback
- put removed IPs on loopback with scope host
- check for nul strings in ethtool call
Andrew Tridgell [Wed, 12 Sep 2007 03:22:31 +0000 (13:22 +1000)]
- don't allow the registration of clients with IPs we don't hold
- change some debug levels to make tracking of IP release problems easier
(This used to be ctdb commit 5f9aed62adaf87750f953412c55b29c58e4bb6c0)
Ronnie Sahlberg [Sun, 9 Sep 2007 21:20:44 +0000 (07:20 +1000)]
change the signature of ctdb_sys_have_ip() to also return:
a bool that specifies whether the ip was held by a loopback adaptor or
not
the name of the interface where the ip was held
when we release an ip address from an interface, move the ip address
over to the loopback interface
when we release an ip address after we have moved it onto loopback,
use 60.nfs to kill off the server side (the local part) of the tcp
connection so that the tcp connections don't survive a
failover/failback
61.nfstickle: since we kill the tcp connections when we release an ip
address, we no longer need to restart the nfs service in 61.nfstickle
update ctdb_takeover to use the new signature for ctdb_sys_have_ip
when we add a tcp connection to kill in ctdb_killtcp_add_connection(),
check if either the source or destination address matches a known public
address
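a sketch of what the extended ctdb_sys_have_ip() shape could look like, and of the source-or-destination check in ctdb_killtcp_add_connection(); the parameter layout and the stub body are assumptions based on the description above:

    #include <stdbool.h>
    #include <stddef.h>
    #include <netinet/in.h>

    /* besides "do we hold this ip", also report whether it sits on a loopback
     * adaptor and on which interface it was found (assumed out-parameters) */
    static bool ctdb_sys_have_ip(struct sockaddr_in addr,
                                 bool *is_loopback, char **iface_name)
    {
        (void)addr;              /* stub: the real code inspects the interfaces */
        *is_loopback = false;
        *iface_name = NULL;
        return false;
    }

    /* only queue a tcp connection for killing if one endpoint is an address
     * this node actually manages */
    static bool connection_is_ours(struct sockaddr_in src, struct sockaddr_in dst)
    {
        bool lo;
        char *iface;
        return ctdb_sys_have_ip(src, &lo, &iface) ||
               ctdb_sys_have_ip(dst, &lo, &iface);
    }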
Ronnie Sahlberg [Fri, 7 Sep 2007 22:09:02 +0000 (08:09 +1000)]
set /proc/sys/net/ipv4/conf/all/arp_filter to 1 by default when
10.interfaces starts up
this setting makes the system respond to ARP requests only on the NIC
to which the ip address is tied, and adds to the
"principle of least surprise" when using multihomed servers
Ronnie Sahlberg [Fri, 7 Sep 2007 06:45:19 +0000 (16:45 +1000)]
ctdb ip must loop over all connected nodes to pull the public ip list
and merge the lists into one big list, since with the deassociation between a node
and a public ip address the /etc/ctdb/public_addresses files can
differ between nodes and no single node knows about all public addresses that a
cluster can use
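a sketch of the union the tool has to build, with a placeholder fetch standing in for the per-node control that pulls each node's public ip list:

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    #define MAX_IPS 256

    struct ip_list { const char *addr[MAX_IPS]; size_t count; };

    /* placeholder for the per-node query; returns how many addresses it wrote */
    static size_t fetch_node_ips(int node, const char **out) { (void)node; (void)out; return 0; }

    /* merge every connected node's list into one deduplicated list, since no
     * single node's public_addresses file covers the whole cluster */
    static void merge_all_public_ips(const int *nodes, size_t num_nodes,
                                     struct ip_list *merged)
    {
        merged->count = 0;
        for (size_t n = 0; n < num_nodes; n++) {
            const char *ips[MAX_IPS];
            size_t cnt = fetch_node_ips(nodes[n], ips);
            for (size_t i = 0; i < cnt; i++) {
                bool seen = false;
                for (size_t j = 0; j < merged->count; j++) {
                    if (strcmp(merged->addr[j], ips[i]) == 0) {
                        seen = true; /* already contributed by another node */
                        break;
                    }
                }
                if (!seen && merged->count < MAX_IPS) {
                    merged->addr[merged->count++] = ips[i];
                }
            }
        }
    }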
Ronnie Sahlberg [Thu, 6 Sep 2007 22:52:56 +0000 (08:52 +1000)]
60.nfs:
we must always restart the lockmanager when the cluster has been
reconfigured and ip addresses have changed. this is to make sure we get a
clusterwide grace period for nfs locking.
if we don't do this and only restart locking on the nodes that were
directly affected, a different client can take out a conflicting lock
from a different node before the affected clients have had a chance to
reclaim all the locks lost during the reconfigure.
the grace period on the rhel5 kernel has been increased to 90 seconds!
statd-callout:
we must restart lockmanager to ensure a clusterwide grace period for
nfs. this makes locking "more correct" for nfs clients and prevents
other clients/nodes from taking out a conflicting lock while a different
client/node tries to reclaim lost locks.
This makes it "almost consistent" for NFS clients but there is still
the possibility that a cifs client can take out a conflicting lock
before an nfs client has had a chance to reclaim an existing lock.
This can not be solved with anything less than making the kernel nfs
lock manager "samba aware" and making samba aware of the internal state
of the kernel lock manager so that they can cooperate.
we can not just stop/start the lockmanager back to back on rhel5 since
if they are stopped/started too close to each other, then when the new
lockmanager sends out statd notifications upon starting up, two things
can happen:
1, the new lockmanager sends out notification BEFORE it has registered with
the portmapper, leading to:
lockmanager starts
lockmanager sends notification to the client
client tries to recover the lock and tries to portmap the lockmanager
port on the server.
server is not (yet) registered with the portmapper and responds
"no such program" to the client's request to discover where the lockmanager
is.
client then just completely gives up reclaiming the lock and doesn't
even reattempt the portmapper call after some timeout.
==> lock reclaim failed.
2, if they are started back to back and a client tries to reclaim the
lock, the lockmanager sometimes sends two responses back to back
to the client: one with status NLM_GRANTED (== you got the lock
reclaimed) and one with status NLM_DENIED (== you could not get the lock
reclaimed).
this confuses the client and leads to the server thinking that the
client does have the lock and the client thinking it has not got the
lock, and orphaned locks result.
We also send out additional notification messages of different formats
to allow more legacy clients to interoperate with locking.