[Spread-users] Problems with more than 2 hosts in a segment

Hans Juergen von Lengerke lengerkeh at sixt.de
Tue Dec 12 12:53:43 EST 2000


Hello George, Hello Yair,

(jeez, I go for a coffee and get two replies... open source sucks
 sometimes, ya know... you can't just relax and tell your manager you're
 waiting for support ;-)


I haven't tried _all_ permutations, but I have tried these (two daemons
running, then the third one joins):

        hermes, took    ... gamgee joins
        baggins, gamgee ... hermes joins
        hermes, gamgee  ... baggins joins

All spread.conf files are exactly the same (distributed via FTP).
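Since the configs are identical, a per-host config mismatch is ruled out, but it is still worth confirming that every host address actually lies inside the subnet implied by the segment's broadcast address 172.21.1.255 (the point Yair raises below). The following Python sketch parses the Spread_Segment block quoted in this thread and flags hosts outside that subnet. It assumes a /24 netmask and a simple one-host-per-line layout; `parse_segment` and `hosts_outside_broadcast` are hypothetical helpers, not part of Spread.

```python
import ipaddress
import re

# The segment definition quoted in this thread.
SAMPLE_CONF = """\
Spread_Segment  172.21.1.255:3333 {
        hermes  172.21.1.10
        baggins 172.21.1.25
        gamgee  172.21.1.26
        took    172.21.1.27
}
"""


def parse_segment(text):
    """Return (broadcast_ip, port, {name: ip}) for the first Spread_Segment block.

    Assumes one "name ip" pair per line; '#' starts a comment.
    """
    match = re.search(r"Spread_Segment\s+([\d.]+):(\d+)\s*\{(.*?)\}", text, re.S)
    if match is None:
        raise ValueError("no Spread_Segment block found")
    bcast, port, body = match.group(1), int(match.group(2)), match.group(3)
    hosts = {}
    for line in body.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2:
            hosts[fields[0]] = fields[1]
    return bcast, port, hosts


def hosts_outside_broadcast(text, prefixlen=24):
    """Names of hosts whose IP is outside the subnet of the broadcast address.

    prefixlen=24 is an assumption; pass the real netmask length for your LAN.
    """
    bcast, _port, hosts = parse_segment(text)
    net = ipaddress.ip_network(f"{bcast}/{prefixlen}", strict=False)
    if net.broadcast_address != ipaddress.ip_address(bcast):
        raise ValueError(f"{bcast} is not the broadcast address of {net}")
    return [name for name, ip in hosts.items() if ipaddress.ip_address(ip) not in net]
```

Running `hosts_outside_broadcast(open("/usr/local/bin/spread.conf").read())` on each machine should return an empty list; a ValueError means the broadcast address does not match the assumed netmask.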

I have used 'r' and 's' as suggested, calling them like this:

        172.21.1.10 # r -d
        172.21.1.25 # s -a 172.21.1.10

I've tried 'r -d' on every machine and run 's' against it from all the
other machines (including localhost). The only odd thing I could see is
that the first 's' sent to a freshly started 'r -d' always had one packet
missed. I suspect that this isn't the problem, though.
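The 'r'/'s' test is essentially a broadcast send/receive check. As a rough illustration of the idea (not the actual 'r'/'s' source, whose packet format is not assumed here), this Python sketch numbers each UDP broadcast datagram and counts gaps on the receiving side. A receiver that expects to start at sequence 0 reports exactly one missing datagram if the very first one is lost, which is the same shape as the symptom above. The helper names, payload size and test port are hypothetical; pick a port Spread is not using.

```python
import socket
import struct


def encode(seq, payload=b""):
    """Prefix the payload with a 4-byte big-endian sequence number."""
    return struct.pack("!I", seq) + payload


def decode(datagram):
    """Split a datagram back into (sequence number, payload)."""
    (seq,) = struct.unpack("!I", datagram[:4])
    return seq, datagram[4:]


def count_missing(seqs, first=0):
    """Count datagrams that never arrived, given sequence numbers in arrival order.

    Late or duplicate datagrams (seq below the next expected number) are ignored.
    """
    expected = first
    missing = 0
    for seq in seqs:
        if seq > expected:
            missing += seq - expected  # numbers expected..seq-1 were skipped
        if seq >= expected:
            expected = seq + 1
    return missing


def send_burst(bcast_addr, port, count, payload=b"x" * 100):
    """Broadcast `count` numbered datagrams to bcast_addr:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        for seq in range(count):
            sock.sendto(encode(seq, payload), (bcast_addr, port))


def receive(port, timeout=5.0):
    """Collect sequence numbers until no datagram arrives for `timeout` seconds."""
    seqs = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("", port))  # all interfaces, so broadcasts are delivered too
        sock.settimeout(timeout)
        try:
            while True:
                data, _addr = sock.recvfrom(65535)
                seqs.append(decode(data)[0])
        except socket.timeout:
            pass
    return seqs
```

To use it, start `receive(port)` on one host, run `send_burst("172.21.1.255", port, 100)` from another, and pass the returned list to `count_missing`.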

I have done more testing on how long it takes machines to join the
segment. When I start spread at the same time on two machines, it takes
about 1 minute until the two machines see each other in the segment.
After I start spread on the third machine (which breaks things), it again
takes about 1 minute until the third machine joins the segment.

This smells like it might have something to do with the resolver. All
machines use the same nameserver. However, that nameserver doesn't know
anything about these machines, so names have to be resolved via
/etc/hosts (which is identical on all machines in the segment and only
contains the IPs/names of the machines in the spread segment). Could
this be the cause of the problem?
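One cheap way to test the resolver theory is to check that every name in spread.conf resolves, on every machine, to the address listed in the config. This Python sketch does that two ways: against /etc/hosts-style text (so the files can be compared offline) and against the live resolver via `socket.gethostbyname`. It is only a consistency check; whether the Spread daemon consults the resolver at all during membership is not established in this thread. The "first entry wins" rule in `parse_hosts` is a simplifying assumption, and all helper names are hypothetical.

```python
import socket

# Host names and addresses from the spread.conf quoted in this thread.
CONF_HOSTS = {
    "hermes": "172.21.1.10",
    "baggins": "172.21.1.25",
    "gamgee": "172.21.1.26",
    "took": "172.21.1.27",
}


def parse_hosts(text):
    """Map each name/alias in /etc/hosts-format text to its IP.

    Assumption: the first line that mentions a name wins.
    """
    mapping = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue
        ip, names = fields[0], fields[1:]
        for name in names:
            mapping.setdefault(name, ip)
    return mapping


def mismatches(conf_hosts, hosts_map):
    """{name: resolved_ip_or_None} for every name that does not match spread.conf."""
    return {
        name: hosts_map.get(name)
        for name, ip in conf_hosts.items()
        if hosts_map.get(name) != ip
    }


def live_mismatches(conf_hosts):
    """Same check, but through the local resolver of the machine it runs on."""
    out = {}
    for name, ip in conf_hosts.items():
        try:
            resolved = socket.gethostbyname(name)
        except OSError:  # lookup failed (gaierror/herror are OSError subclasses)
            resolved = None
        if resolved != ip:
            out[name] = resolved
    return out
```

An empty dict from `live_mismatches(CONF_HOSTS)` on all four machines would make a plain name/address mix-up unlikely.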

Thx, Hans


Yair Amir <yairamir at cnds.jhu.edu> on Dec 12, 2000:

> It is strange that it takes 30 seconds to complete the membership (as opposed to 5 or so).
> In addition to George's question, I would like to check whether the broadcast address 172.21.1.255 works properly on all of the machines (send and receive). That can be checked using "s" and "r". You might need to build them, but they are in the Makefile.
> Run "r usage" and "s usage" to see how they work.
> 
>     Cheers,
> 
>     :) Yair.
> 
> George Schlossnagle wrote:
> 
> > Any two hosts can work, or is it always a particular 3rd one that causes
> > the problems?  Do they all have _identical_ spread.conf files (they
> > must.)
> >
> > George
> >
> > Hans Juergen von Lengerke wrote:
> > >
> > > I am experiencing problems in a spread segment which looks like this:
> > >
> > >   hermes:/usr/local/bin # cat spread.conf
> > >   Spread_Segment  172.21.1.255:3333 {
> > >           hermes  172.21.1.10
> > >           baggins 172.21.1.25
> > >           gamgee  172.21.1.26
> > >           took    172.21.1.27
> > >   }
> > >
> > > The config is exactly the same on all of those hosts. All hosts run SuSE
> > > Linux with kernel 2.2.16-SMP apart from hermes which runs a 2.2.14
> > > kernel.
> > >
> > > When spread runs on any two of the machines everything works as
> > > expected. As soon as a third machine joins the segment things go wrong.
> > > 'user' sessions don't work anymore. For example:
> > >
> > >   gamgee:~ # user -s 3333
> > >   Spread library version is 3.14
> > >   recv_nointr_timeout: Timed out
> > >   SP_error: (-8) Connection closed by spread
> > >
> > >   Bye.
> > >
> > > Also, existing 'user' sessions no longer receive messages sent
> > > from themselves or other group members. For example, I have spread and
> > > 'user' sessions running on baggins and gamgee. Everything works fine:
> > >
> > > [on gamgee]
> > >   User> j test
> > >
> > >   User> s test
> > >   enter message: foo
> > >
> > >   User>
> > >   ============================
> > >   received SAFE message from #user#gamgee, of type 1, (endian 0) to 1 groups
> > >   (4 bytes): foo
> > >
> > > Now I start spread on hermes. For some reason, it takes a fair amount
> > > of time (probably ~30 seconds) until all spread daemons report that hermes
> > > has joined. Once this has been reported, we continue with the 'user' session:
> > >
> > > [still on gamgee]
> > >   User> s test
> > >   enter message: bar
> > >
> > >   User>
> > >
> > > Nothing happens. Nobody receives the message although nothing has
> > > changed apart from hermes joining the spread ring.
> > >
> > > Can anyone help?
> > >
> > > Thx, Hans
