[Spread-users] Partition problem
Ben Laurie
ben at algroup.co.uk
Thu Apr 3 05:09:33 EST 2003
Aswin Almeida wrote:
> Hello folks.
>
> BBN Technologies is conducting experiments on the Spread and Secure
> Spread (layered architecture) for DARPA.
>
> Recently, we experienced problems with a "partitioning issue" which
> can affect the measurement of join times. Yair is aware of these
> problems via our BBN-JHU-SRI experiment mailing list. I wanted to
> appeal to a _wider audience_ as well to see if anyone else has ideas.
>
>
> The Network
> -----------
>
> Network contains enclaves (Spread Segments) in several locations
> around the United States. We will refer to these segments by letter
> and dub them Site [A..F].
>
> Site A is in Virginia
> Site B is in Cambridge
> Site C is in New York
> Site D is in California
> Site E is in Hawaii
> Site F is in New Mexico
>
> Background
> ----------
>
> Four users from Site A join group "test" (the "initial set of
> users"). Four users from Site B join group "test" (the "joining set
> of users").
>
> Thus, only membership messages are involved, not regular messages
> between users.
>
> What was observed?
>
> The four users from site B will form their own group. A partition
> occurs. Approximately 30 seconds later a merge will take place after
> the groups discover each other. With a larger initial set of users
> (e.g. 64 users) and a larger joining set of users (e.g. 64 users)
> this initially partitioned and then merged behavior does NOT occur.
> Similar behavior was observed when the users are drawn from sites A
> and C.
>
> More recently, we observed this same behavior when executing between
> sites B and C. According to the "send" and "receive" utilities, the
> link between sites B and C was thought to be good for Spread's
> purposes. Use of s.c and r.c is discussed in the next section.
>
> Why is this of concern?
> -----------------------
>
> Beyond being a mild curiosity, it could potentially affect data
> collection.
>
> Yair sat down with Sara and me in Virginia and suggested the use of
> the s.c and r.c (send and receive) utilities. Deploying "send" at
> Site A and "receive" at Site B, the three of us observed
> approximately 30% loss. Using these utilities at Site B and Site C,
> the loss was a lot lower (5-7%) and Yair said this was more
> acceptable.
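
For reference, the loss figure that such a send/receive pair reports can be
sketched as below. This is a minimal illustration in Python, not the actual
s.c/r.c code: the function name loss_percent and the assumption that every
packet carries a sequence number are made up for the example.

```python
def loss_percent(sent_count, received_seqs):
    """Return the percentage of packets that never arrived.

    sent_count    -- number of packets the sender emitted (hypothetical input)
    received_seqs -- sequence numbers seen by the receiver; duplicates and
                     out-of-range values are ignored
    """
    if sent_count <= 0:
        raise ValueError("sent_count must be positive")
    # Count each sequence number once, however many times it was delivered.
    unique = {s for s in received_seqs if 0 <= s < sent_count}
    lost = sent_count - len(unique)
    return 100.0 * lost / sent_count
```

With 100 packets sent and 70 distinct sequence numbers received, this gives
30.0, the sort of figure quoted for Site A to Site B.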
>
> However, we are still seeing the partitioning behavior even between
> sites B and C, where the reported loss by s and r is markedly lower
> than 30%.
>
> Investigation to Date
> ---------------------
>
> Initially we suspected that Site A, its hardware (a Sidewinder NIC
> which had perhaps stale errors on the console), or its connectivity
> to the outside world could have been the culprit for this
> partitioning issue.
>
> Since Yair visited Sara and me at Site A weeks ago, no errors have
> appeared on the console. This site has a fractional T3 as its
> connection to the outside world. Other sites, such as B and C, which
> look OK with the s.c and r.c utilities, have exhibited this
> partitioning behavior as well. In fact, site pairs which have high
> loss rates reported by s.c and r.c produce somewhat consistent join
> times across runs.
>
> This supports the theory that the problem is not with any particular
> enclave, but perhaps with the underlying network itself. Pairwise
> pings and traceroutes using external IPs for our VPN don't show a
> problem between the sites.
>
> Maybe we are missing something and can use spmonitor and subsequently
> tweak flow control. Or maybe there is something else we can try with
> Spread.
>
> What have we already tried? Yair had suggested we try to tune the
> parameters in flow_control.c, so Sara set Window = 15 and
> Personal_window = 3 in FC_init. Unfortunately this did not help. We
> are still seeing this partition/merge behavior, which would skew join
> time measurements.
>
> What we need
> ------------
>
> When users at two different geographic locations join a single
> group, we'd like not to see a *partition* and then, many moments
> later, a merge, as this will skew our measured join times. Avoiding
> the slow, eventual merge would help data collection, but it would be
> optimal to avoid this partitioning behavior altogether.
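
To keep partitioned runs from skewing the join-time data, each run could be
checked after the fact. The following is only a sketch in Python under
stated assumptions: the function name analyse_join, the member names, and
the log format (a list of timestamped membership views as seen by one
joining client) are hypothetical and are not part of Spread.

```python
def analyse_join(views, initial_members):
    """Summarise one join run from a single joining client's point of view.

    views           -- list of (timestamp_seconds, members) tuples in the
                       order the client received them (hypothetical format)
    initial_members -- names of the users already in the group

    Returns (join_time, partitioned). join_time is the time from the first
    view to the first view containing every initial member, or None if no
    such view was seen. partitioned is True if an earlier view contained
    none of the initial members.
    """
    initial = set(initial_members)
    if not views:
        return None, False
    start = views[0][0]
    partitioned = False
    for timestamp, members in views:
        present = set(members)
        if initial <= present:
            # Full view reached: the joiner now sees the whole initial set.
            return timestamp - start, partitioned
        if not (initial & present):
            # A view with no initial member means the joiners formed
            # their own group first.
            partitioned = True
    return None, partitioned
```

A run whose views are [(0.0, {"b1"}), (0.2, {"b1", "b2"}),
(30.5, {"a1", "a2", "b1", "b2"})] with initial members a1 and a2 yields
(30.5, True), so it can be flagged or excluded rather than averaged in.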
>
> Thoughts, comments appreciated,
I'm curious to know why you don't think this is a bug in Spread?
Cheers,
Ben.
--
http://www.apache-ssl.org/ben.html http://www.thebunker.net/
"There is no limit to what a man can do or how far he can go if he
doesn't mind who gets the credit." - Robert Woodruff