[Spread-users] Questions about network disconnections

Thu Mar 1 11:33:37 EST 2007

Scenarios (1) and (3) are both proper behavior for Spread.

In Spread, previously partitioned daemons detect each other either through 
traffic sent on their segment address (i.e., the broadcast/multicast address) 
or through periodic unicast probes of remote daemons.

Within a LAN, the periodic probing is very slow (e.g., once every 5 minutes), 
and the daemons rely on (re)discovering one another primarily through hearing 
traffic on their segment address.  Spread control traffic, however, doesn't 
usually go on the segment address.  Most commonly, only user data traffic goes 
on the segment address.  This is why traffic from a spmonitor or, more commonly, 
a user application can "wake up" daemons to the fact that they have been 
reconnected at the lower level.
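
Since client traffic is what wakes the daemons up, one workaround on the application side is a small keepalive loop. The following is a minimal sketch only: `run_keepalive` and its parameters are hypothetical names, and `send` is assumed to wrap whatever Spread client call the application already uses to multicast a message to a group. No Spread API is called here.

```python
import time
from typing import Callable


def run_keepalive(send: Callable[[bytes], None],
                  interval_s: float,
                  rounds: int,
                  sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send() `rounds` times, pausing `interval_s` seconds between calls.

    `send` is an assumption: it should hand the payload to the Spread
    client library so that user data traffic appears on the segment address.
    Returns the number of keepalives handed to send().
    """
    sent = 0
    for seq in range(rounds):
        send(f"keepalive {seq}".encode("ascii"))
        sent += 1
        if seq + 1 < rounds:  # no pause needed after the last message
            sleep(interval_s)
    return sent
```

The interval is a trade-off: it bounds how long a reconnected daemon may stay partitioned when no other client traffic is flowing, at the cost of a small amount of extra traffic.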

Scenarios (2) and (4) are obviously improper behavior.  Most likely though, such 
behavior points to a problem in your network as Spread has been heavily tested 
in exactly the scenario you are describing.
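
One way to check the network independently of Spread is to send plain UDP datagrams to the multicast group across an unplug/replug cycle. The sketch below uses only the Python standard library. The group is the segment address from the configuration in this thread; `PROBE_PORT` and the function names are hypothetical, and the port is chosen only to stay clear of the ports Spread itself uses.

```python
import socket
import struct

GROUP = "239.16.0.1"  # segment address from the configuration in this thread
PROBE_PORT = 5000     # hypothetical port, picked to stay clear of Spread's 4848


def membership_request(group: str, iface: str = "0.0.0.0") -> bytes:
    """Build the 8-byte ip_mreq structure passed to IP_ADD_MEMBERSHIP."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(iface))


def open_receiver(group: str = GROUP, port: int = PROBE_PORT) -> socket.socket:
    """Return a UDP socket that has joined `group` and is bound to `port`."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group))
    return sock


def send_probe(payload: bytes, group: str = GROUP, port: int = PROBE_PORT) -> int:
    """Send one datagram to the group with TTL 1 (stays on the local LAN)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP) as sock:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        return sock.sendto(payload, (group, port))
```

Run `open_receiver()` with a `recvfrom` loop on the three stations that stay connected, then call `send_probe(b"hello")` from the re-plugged station. If probes go missing while unicast ping still works, the switch's multicast handling is a likely suspect.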

Blocking does occur while the daemons are reconfiguring and synchronizing.  If 
you have a "flaky" daemon that keeps disconnecting and reconnecting, the 
membership algorithm can "churn," which can freeze the configuration for 
periods of time.  However, this shouldn't happen in a properly configured and 
functioning LAN environment.

Answers to your questions:

(1) Spread was built to allow distributed applications to cleanly handle network 
partitions and merges.  It provides strong semantic guarantees and a simple 
interface for such events.

(2) Yes, it is.

(3) Not that I can see.  However, if you don't have any client traffic flowing, 
then the daemons may remain partitioned from their point of view.  The aberrant 
behavior you are observing (cases 2 and 4) is most likely due to a flaky 
switch/router or NIC(s) in your network.  Also, you might want to try a 
broadcast address to see if you get better behavior as not all switches/routers 
do multicast properly.
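
To switch to a broadcast address, the segment line needs the directed broadcast address of the LAN. The sketch below derives it with Python's standard `ipaddress` module. The `/24` network used in the usage note is an assumption, since the thread does not state the netmask, and the function names are hypothetical.

```python
import ipaddress


def describe_segment_address(addr: str) -> str:
    """Report whether a segment address lies in the IPv4 multicast range."""
    return "multicast" if ipaddress.ip_address(addr).is_multicast else "non-multicast"


def broadcast_segment_header(network_cidr: str, port: int) -> str:
    """Format a Spread_Segment opening line using the network's broadcast address."""
    net = ipaddress.ip_network(network_cidr, strict=False)
    return f"Spread_Segment  {net.broadcast_address}:{port} {{"
```

For example, `describe_segment_address("239.16.0.1")` reports the current address as multicast. Assuming the four hosts sit on `10.0.1.0/24`, `broadcast_segment_header("10.0.1.0/24", 4848)` yields `Spread_Segment  10.0.1.255:4848 {`. The host entries inside the braces stay unchanged.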

If you would like the daemons to re-form their membership even when no client 
traffic is flowing, you could alter the daemon to periodically send some 
control traffic on the segment address, which would trigger the membership 
algorithm (much as a user application's traffic does).  If you are interested, 
Spread Concepts offers consulting services for such projects and you can contact 
us at info at spreadconcepts.com

Cheers!

JL TRESSET wrote:
> Hi,
> 
> we are currently trying to use Spread to build some redundancy features,
> and we are encountering some strange behavior from the Spread daemons (it
> seems strange from our point of view, but perhaps there is nothing strange
> and it is just a misunderstanding on our part).
> 
> We use the following simple configuration file with Spread 4.0 (the
> precompiled GNU/Linux version):
> 
> Spread_Segment  239.16.0.1:4848 {
>        metaxa          10.0.1.48
>        kebab           10.0.1.46
>        wasabi          10.0.1.72
>        muffin          10.0.1.47
> }
> SocketPortReuse = ON
> 
> 
> and we launch the spread daemon on each workstation (using ./spread -n
> <my host name>). After a few seconds, everything seems to work on each
> workstation and something like:
> Configuration at kebab is:
> Num Segments 1
>        4       239.16.0.1        4848
>                metaxa                  10.0.1.48
>                kebab                   10.0.1.46
>                wasabi                  10.0.1.72
>                muffin                  10.0.1.47
> ====================
> 
> 
> appears on each console.
> 
> 
> Then we unplug the network from one of the stations. After a few
> seconds, the unplugged one detects that its daemon is alone on its
> segment (only the local workstation is listed on the console) and the
> three others do the same (displaying a list with only three
> workstations). After a few seconds or a few minutes, we plug the unplugged
> station back in. Then we see one of the following behaviors, occurring
> randomly (from our point of view):
> 
> 1) the four workstations are in the same segment again within a few
> seconds
> 2) the four workstations are in the same segment again within a few
> seconds, but after another short period the originally unplugged
> workstation drops out of the segment and seems to create its "own" segment.
> 3) the four workstations are not automatically in the same segment
> again, but using the spmonitor tool or the sptuser sample seems to "excite"
> them (?!?!) and the original four-station segment is re-created.
> 4) the four workstations are never in the same segment again, even when
> using one of the Spread tools
> 
> Note: sometimes the three remaining workstations seem to block, and
> the return of the fourth one seems to unblock them....
> 
> So, a few questions about these behaviors:
> 1) should we expect to have to handle "hard" network disconnections in our
> own software when using Spread? (we are currently building a
> daemon client using the Spread library API).
> 2) Is this case a "nominal" usage of Spread?
> 3) Is there something we don't understand, or something we are doing wrong
> with Spread?
> 
> Thanks in advance for your answers,
> 
> best regards,
> JLT

-- 
John Schultz
Spread Concepts LLC
Phn: 443 838 2200
Fax: 301 560 8875