[Spread-users] Partition problem

Ben Laurie ben at algroup.co.uk
Thu Apr 3 05:09:33 EST 2003

Aswin Almeida wrote:
> Hello folks.
> BBN Technologies is conducting experiments on Spread and Secure
> Spread (layered architecture) for DARPA.
> Recently, we experienced problems with a "partitioning issue" which
> can affect the measurement of join times. Yair is aware of these
> problems via our BBN-JHU-SRI experiment mailing list. I wanted to
> appeal to a _wider audience_ as well to see if anyone else has ideas.
> The Network
> -----------
> The network contains enclaves (Spread segments) in several locations
> around the United States. We will refer to these segments by letter
> and dub them Site [A..F].
>   Site A is in Virginia
>   Site B is in Cambridge
>   Site C is in New York
>   Site D is in California
>   Site E is in Hawaii
>   Site F is in New Mexico
> Background
> ----------
> Four users from site A join group "test".  (The "initial set of
> users"). Four users from Site B join group "test".  (The "joining set
> of users").
> Thus, only membership messages are involved, not regular messages
> between users.
> What was observed?
> The four users from site B will form their own group.  A partition
> occurs.  Approximately 30 seconds later a merge will take place after
> the groups discover each other.  With a larger initial set of users
> (e.g. 64 users) and a larger joining set of users (e.g. 64 users)
> this initially partitioned and then merged behavior does NOT occur.
> Similar behavior was observed when the users are drawn from sites A
> and C.
> More recently, we observed this same behavior when running between
> sites B and C.  The link between sites B and C was thought to be good
> enough for Spread's purposes, according to the "send" and "receive"
> utilities.  Use of s.c and r.c is discussed in the next section.
> Why is this of concern?
> -----------------------
> Beyond being a mild curiosity, it could potentially affect data
> collection.
> Yair sat down with Sara and me in Virginia and suggested the use of
> the s.c and r.c (send and receive) utilities. Deploying "send" at
> Site A and "receive" at Site B, the three of us observed
> approximately 30% loss. Using these utilities at Site B and Site C,
> the loss was a lot lower (5-7%) and Yair said this was more
> acceptable.
> However, we are still seeing the partitioning behavior even between
> sites B and C, where the loss reported by s.c and r.c is markedly
> lower than 30%.
> Investigation to Date
> ---------------------
> Initially it was thought that Site A, its hardware (a Sidewinder NIC
> that had shown perhaps stale errors on the console), or its
> connectivity to the outside world could have been the culprit for
> this partitioning issue.
> Since Yair visited Sara and me at Site A weeks ago, no errors
> have appeared on the console. This site has a fractional T3 as its
> connection to the outside world. Other sites, such as B and C, which
> look OK with the s.c and r.c utilities, have exhibited this
> partitioning behavior as well. In fact, site pairs which have high
> loss rates reported by s.c and r.c produce somewhat consistent join
> times across runs.
> This supports the theory that the problem is not with any particular
> enclave, but perhaps with the underlying network itself. Pairwise
> pings and traceroutes using external IPs for our VPN don't show a
> problem between the sites.
> Maybe we are missing something and could use spmonitor and then
> tweak flow control.  Or maybe there is something else we can try with
> Spread.
> What have we already tried?
> Yair had suggested we tune the parameters in flow_control.c, so Sara
> set Window = 15 and Personal_window = 3 in FC_init.  Unfortunately
> this did not help.  We are still seeing this partition/merge
> behavior, which would skew join time measurements.
> What we need
> ------------
> When users at two different geographic locations join a single
> group, we'd like not to see a *partition* (and then, many moments
> later, a merge), as this will skew our measured data for join times.
> Avoiding the slow, eventual merge would help data collection, but it
> would be optimal to avoid this partitioning behavior altogether.
> Thoughts, comments appreciated,

I'm curious to know why you don't think this is a bug in Spread?
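
Until the cause is found, one way to keep the affected runs out of the join-time data might be to classify each run from the membership views a joining client receives. What follows is only a rough sketch in Python, not code against the Spread API: the view format (a timestamp plus a set of member names), the member names, and the function name are all hypothetical. A run counts as partitioned if, before the complete view arrives, the joiner sees a view containing none of the initial set of users.

```python
from typing import FrozenSet, List, Optional, Tuple

# Hypothetical record of one membership view as seen by a joining client:
# (seconds since the join request, names of the members in that view).
View = Tuple[float, FrozenSet[str]]


def classify_run(
    views: List[View],
    initial: FrozenSet[str],
    joining: FrozenSet[str],
) -> Tuple[Optional[float], bool]:
    """Return (join_time, partitioned) for one experiment run.

    join_time is the timestamp of the first view that contains every
    expected member, or None if no such view was ever delivered.
    partitioned is True if an earlier view contained no member of the
    initial set, i.e. the joiners had formed a group of their own.
    """
    expected = initial | joining
    partitioned = False
    # Sort on the timestamp only, so the member sets are never compared.
    for timestamp, members in sorted(views, key=lambda v: v[0]):
        if members >= expected:
            return timestamp, partitioned
        if not (members & initial):
            partitioned = True
    return None, partitioned


if __name__ == "__main__":
    initial = frozenset({"a1", "a2", "a3", "a4"})   # made-up names, Site A
    joining = frozenset({"b1", "b2", "b3", "b4"})   # made-up names, Site B

    # Made-up run in which the joiners merge straight into the group.
    clean = [(0.4, initial | {"b1"}), (0.9, initial | joining)]
    # Made-up run in which the joiners first form their own group.
    split = [(0.3, frozenset({"b1", "b2"})), (0.5, joining),
             (30.5, initial | joining)]

    print(classify_run(clean, initial, joining))
    print(classify_run(split, initial, joining))
```

Runs flagged as partitioned could then be reported separately rather than averaged in with the others, so a 30 second merge does not dominate the join-time figures.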



http://www.apache-ssl.org/ben.html       http://www.thebunker.net/

"There is no limit to what a man can do or how far he can go if he
doesn't mind who gets the credit." - Robert Woodruff
