[Spread-users] Questions about network disconnections
John Lane Schultz
jschultz at spreadconcepts.com
Thu Mar 1 11:33:37 EST 2007
Scenarios (1) and (3) are both proper behavior for Spread.
In Spread, previously partitioned daemons detect each other either through
traffic sent on their segment address (i.e., the broadcast/multicast address) or
through periodic unicast probes of remote daemons.
Within a LAN, the periodic probing is very slow (e.g., once every 5 minutes)
and the daemons rely on (re)discovering one another primarily through hearing
traffic on their segment address. Spread control traffic, however, doesn't
usually go on the segment address. Most commonly, only user data traffic goes
on the segment address. This is why traffic from a spmonitor or, more commonly,
a user application can "wake up" daemons to the fact that they have been
reconnected at the lower level.
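As a rough illustration of that "wake up" effect, a client could send a small data message at a fixed interval so that some user traffic always reaches the segment address. The sketch below is only the timing loop: `send` is a hypothetical callable that would wrap whatever multicast call your Spread client binding provides (SP_multicast in the C API), and the 5-second default is an arbitrary choice, not a Spread recommendation. It also assumes that a message multicast to a group actually produces traffic on the segment address.

```python
import threading
from typing import Callable


class Keepalive:
    """Call `send` every `interval` seconds on a background thread.

    `send` is a placeholder (an assumption, not part of Spread): it should
    multicast a tiny data message through your Spread client connection.
    """

    def __init__(self, send: Callable[[], None], interval: float = 5.0) -> None:
        self._send = send
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        # Wake the loop immediately and wait for the thread to finish.
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        # Event.wait() returns False on timeout and True once stop() is called,
        # so the loop sends once per interval until it is stopped.
        while not self._stop.wait(self._interval):
            try:
                self._send()
            except Exception:
                # A failed send (e.g. during a membership change) should not
                # end the keepalive; the next interval simply tries again.
                pass
```

A client would create one `Keepalive` after connecting to its local daemon, call `start()`, and call `stop()` before disconnecting.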
Scenarios (2) and (4) are obviously improper behavior. Most likely though, such
behavior points to a problem in your network as Spread has been heavily tested
in exactly the scenario you are describing.
Blocking does occur while the daemons are reconfiguring and synchronizing. If
you have a "flaky" daemon that keeps disconnecting and reconnecting, the
membership algorithm can "churn," which can freeze the configuration for
periods of time. However, this shouldn't happen in properly configured and
functioning LAN environments.
Answers to your questions:
(1) Spread was built to allow distributed applications to cleanly handle network
partitions and merges. It provides strong semantic guarantees and a simple
interface for such events.
(2) Yes, it is.
(3) Not that I can see. However, if you don't have any client traffic flowing,
then the daemons may remain partitioned from their point of view. The aberrant
behavior you are observing (cases 2 and 4) is most likely due to a flaky
switch/router or NIC(s) in your network. Also, you might want to try a
broadcast address to see if you get better behavior as not all switches/routers
do multicast properly.
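A broadcast variant of the configuration quoted below might look like the following sketch. The address 10.0.1.255 assumes that all four hosts share a /24 subnet, which the original message does not state; substitute the actual broadcast address of your LAN.

```
# Assumption: 10.0.1.0/24 subnet, so the broadcast address is 10.0.1.255
Spread_Segment 10.0.1.255:4848 {
    metaxa 10.0.1.48
    kebab 10.0.1.46
    wasabi 10.0.1.72
    muffin 10.0.1.47
}
```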
If you would like Spread to reform the daemons even when no client traffic is
flowing, then you could alter the daemon to send some control traffic on the
segment address periodically, which would trigger the membership algorithm to
reform (much like a user application's traffic does). If you are interested,
Spread Concepts offers consulting services for such projects and you can contact
us at info at spreadconcepts.com
Cheers!
JL TRESSET wrote:
> Hi,
>
> we are currently trying to use spread to build some redundancy features
> and we encounter some strange behaviors of spread daemons (it seems to
> be strange from our own point of view, but perhaps there is nothing
> strange, and perhaps it's due to some misunderstanding from us).
>
> We use the following simple configuration file with Spread 4.0 (the
> precompiled GNU/Linux version):
>
> Spread_Segment 239.16.0.1:4848 {
> metaxa 10.0.1.48
> kebab 10.0.1.46
> wasabi 10.0.1.72
> muffin 10.0.1.47
> }
> SocketPortReuse = ON
>
>
> and we launch the spread daemon on each workstation (using ./spread -n
> <my host name>). After a few seconds, everything seems to work on each
> workstation and something like:
> Configuration at kebab is:
> Num Segments 1
> 4 239.16.0.1 4848
> metaxa 10.0.1.48
> kebab 10.0.1.46
> wasabi 10.0.1.72
> muffin 10.0.1.47
> ====================
>
>
> appears on each console.
>
>
> Then we unplug the network from one of the stations. After a few
> seconds, the unplugged one detects that its daemon is alone in its
> segment (only the local workstation is listed on the console) and the
> three others do the same (displaying a list with only three
> workstations). After a few seconds or a few minutes, we plug the unplugged
> station back in. Then we see one of the following behaviors, occurring
> randomly (from our point of view):
>
> 1) the four workstations are in the same segment again, after very few
> seconds
> 2) the four workstations are in the same segment again, after very few
> seconds, but after another short time period the originally unplugged
> workstation drops out of the segment and seems to create its "own" segment.
> 3) the four workstations are not automatically in the same segment
> again, but using the spmonitor tool or sptuser sample seems to "excite"
> them (?!?!) and the original four-stations segment is recreated.
> 4) the four workstations are never in the same segment again, even using
> one of the spread tools
>
> Note: sometimes the 3 remaining workstations seem to be blocked, and
> the return of the fourth one seems to unblock them....
>
> So, a few questions about these behaviors:
> 1) should our software expect to have to handle "hard" network
> disconnections when using spread? (we are currently building a
> daemon client using the Spread library API).
> 2) Is this case a "nominal" usage of spread?
> 3) Is there something we don't understand or something we are doing wrong
> with spread?
>
> Thanks in advance for your answers,
>
> best regards,
> JLT
>
>
> _______________________________________________
> Spread-users mailing list
> Spread-users at lists.spread.org
> http://lists.spread.org/mailman/listinfo/spread-users
>
--
John Schultz
Spread Concepts LLC
Phn: 443 838 2200
Fax: 301 560 8875