[Spread-users] RESOLVED - Issue with Spread going silent

Luke Marsden luke-lists at hybrid-logic.co.uk
Sun Nov 7 16:42:27 EST 2010


Hi Mel,

Thank you!! Changing the header file as per
http://lists.spread.org/pipermail/spread-users/2010-July/004316.html
seems to have done the trick. Clusters of two now join together with a
third even while our Python daemon is running and pumping data through
Spread.

I can't express how grateful I am for this; it was a real show-stopper
for the launch of our beta programme, which is this week!

Yair, I think this might warrant a point release: you've probably got a
lot of 64-bit users who could get bitten by this bug. Also, thank you
for your kind assistance today :-)
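
For other 64-bit users wondering whether they might be affected, here is
a minimal sketch of the general kind of mismatch involved. It does not
reproduce the header change from the linked post; it only assumes that
the trouble comes from size_t being wider than int on amd64. The function
names type_widths and misread_length are made up for illustration.

```python
import ctypes
import struct


def type_widths():
    """Return the byte widths of C int and size_t for this Python build."""
    return {
        "int": ctypes.sizeof(ctypes.c_int),
        "size_t": ctypes.sizeof(ctypes.c_size_t),
    }


def misread_length(length):
    """Illustrate a width mismatch (hypothetical, not Spread's wire format).

    The writer packs one 8-byte little-endian length. The reader instead
    treats the same 8 bytes as two consecutive 4-byte fields.
    """
    raw = struct.pack("<Q", length)
    low, high = struct.unpack("<II", raw)
    return low, high
```

On a typical amd64 build type_widths() reports a 4-byte int and an 8-byte
size_t, whereas on i386 both are 4 bytes. In misread_length the reader
finds a second field where the writer never wrote one, which is the sort
of thing that can cause odd send/receive behaviour.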

-- 
Best Regards,
Luke Marsden
CTO, Hybrid Logic Ltd.

Web: http://www.hybrid-cluster.com/
Hybrid Web Cluster - cloud web hosting

Mobile: +447791750420


On Sun, 2010-11-07 at 18:25 +0000, Melissa Jenkins wrote:
> Hi Luke,
> 
> Out of interest are you using i386 or amd64 kernel?
> 
> If it's amd64 there is a size_t problem...
> 
> http://lists.spread.org/pipermail/spread-users/2010-July/004316.html
> 
> Might help - causes strange problems with sending/receiving messages if it's wrong
> Mel
> 
> 
> On 7 Nov 2010, at 17:54, Luke Marsden wrote:
> 
> > Hi Yair,
> > 
> > Thank you for this. I have now recompiled the Spread Python bindings
> > against the version of the library. The size of the spread.so file
> > changed, so I was hopeful. No luck though, I still get the same problem.
> > Spread works fine when the Python daemon is disconnected, but fails to
> > accept a new Spread node to a group of two when the Python daemon is
> > connected and sending messages.
> > 
> > Can you think of any reason why having clients connected to Spread could
> > cause it to behave in this way?
> > 
> > My next step is to try and reproduce the problem with the smallest
> > possible Python script which just sends a few bytes of heartbeat data
> > every second.
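
A smallest-possible heartbeat sender along the lines described above
might look roughly like this. It is only a sketch: the group name
"hb-test", the private name "hbtest" and the daemon address are
placeholders, and run_against_spread assumes the Python bindings expose
connect(), join(), multicast() and RELIABLE_MESS the way the commonly
used spread module does. The send loop is kept separate so it can be
exercised without a running daemon.

```python
import time


def send_heartbeats(send, count, interval=1.0, payload=b"hb", sleep=time.sleep):
    """Call send(payload) `count` times, sleeping `interval` seconds between sends.

    Returns the number of heartbeats sent.
    """
    sent = 0
    for i in range(count):
        send(payload)
        sent += 1
        if i + 1 < count:  # no pause after the final heartbeat
            sleep(interval)
    return sent


def run_against_spread(daemon="4803@localhost", group="hb-test", count=60):
    """Send heartbeats through a real Spread daemon (names are placeholders)."""
    import spread  # Spread Python bindings; assumed installed, not stdlib

    mbox = spread.connect(daemon, "hbtest", 0, 0)
    mbox.join(group)
    try:
        return send_heartbeats(
            lambda data: mbox.multicast(spread.RELIABLE_MESS, group, data),
            count,
        )
    finally:
        mbox.disconnect()
```

Running run_against_spread() on one node while a third daemon joins
would show whether a client that does nothing but send a few bytes per
second is enough to trigger the problem.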
> > 
> > I do have some real hardware, some old PowerEdge 1850s, in my basement
> > which I can upgrade to FreeBSD 8.1 -- it is possible that the problem is
> > triggered by both having a Python client connected *and* being on a
> > virtualised platform.
> > 
> > Getting the servers up and running will take a bit of time though. First
> > I'll see if I can reproduce the issue with a simplest-possible Python
> > test case.
> > 
> > I'll be in touch with my findings as soon as possible.
> > 
> > Thank you again.
> > 
> > -- 
> > Best Regards,
> > Luke Marsden
> > CTO, Hybrid Logic Ltd.
> > 
> > Web: http://www.hybrid-cluster.com/
> > Hybrid Web Cluster - cloud web hosting
> > 
> > Mobile: +447791750420
> > 
> > 
> > On Sun, 2010-11-07 at 12:29 -0500, Yair Amir wrote:
> >> Dear Luke,
> >> 
> >> Thanks - this is very helpful. This confirmed my analysis from before.
> >> 
> >> The network membership looks good, so the form2 token should be sent
> >> to the correct address, but unfortunately, that specific message is never received
> >> (actually the message is sent twice by 147 but none of the copies is
> >>  received by 102).
> >> 
> >> I don't see an easy way to diagnose this without digging into the network level
> >> because, from Spread's perspective, it seems it is doing its job correctly and
> >> just a specific message is never making it even though it is sent several times.
> >> And the same thing repeats in exactly the same way.
> >> 
> >> So all in all, I don't think it is a higher level bug. The next step would be
> >> to turn on NETWORK level debug messages and to see what the network layer of
> >> Spread is doing with that specific message. You do this similarly to the
> >> way we turned on MEMBERSHIP debug messages - just add the word NETWORK
> >> before (or after) MEMBERSHIP in the spread.conf file.
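
The change described above can be scripted so that it is applied the same
way on all three nodes. This is a sketch only: it assumes spread.conf
holds its debug settings on a single line of the form
"DebugFlags = { PRINT EXIT MEMBERSHIP }", as in the sample configuration,
and add_debug_flag is a made-up helper name.

```python
import re

# Matches an uncommented line such as: DebugFlags = { PRINT EXIT MEMBERSHIP }
_DEBUG_FLAGS = re.compile(r"^([ \t]*DebugFlags[ \t]*=[ \t]*\{)([^}]*)(\})", re.MULTILINE)


def add_debug_flag(conf_text, flag):
    """Return conf_text with `flag` present in the first DebugFlags line.

    Existing flags keep their order; a flag that is already present is
    not added twice. Raises ValueError if no DebugFlags line is found.
    """
    def repl(match):
        flags = match.group(2).split()
        if flag not in flags:
            flags.append(flag)
        return "%s %s %s" % (match.group(1), " ".join(flags), match.group(3))

    new_text, replaced = _DEBUG_FLAGS.subn(repl, conf_text, count=1)
    if replaced == 0:
        raise ValueError("no DebugFlags line found")
    return new_text
```

Reading spread.conf, passing its contents through
add_debug_flag(text, "NETWORK") and writing the result back, followed by
a daemon restart, should be equivalent to editing the file by hand.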
> >> 
> >> If the network level of Spread does its job as I expect, it will go to the
> >> data link level of Spread, and beyond that become an operating system / network
> >> card issue.
> >> 
> >> Before we dive into this - is there a way to natively have 3 computers running
> >> the exact same operating system but without virtualization?
> >> I know many people use Spread with virtualization successfully, as I do in some
> >> testing, but not with FreeBSD (I have Mac, Linux and Windows).
> >> It is ironic - Spread was originally developed on NetBSD.
> >> 
> >> Cheers,
> >> 
> >> 	:) Yair.
> >> 
> >> On 11/7/10 10:46 AM, Luke Marsden wrote:
> >>> Hi Yair,
> >>> 
> >>> Thank you so much for your time on this.
> >>> 
> >>> Here is the diff so you can check it:
> >>> https://github.com/hybridlogic/Spread-Yair-fix/commit/cc456dcaa073629634ce0019673324b54af71b4f
> >>> Also I had to do this to get it to compile:
> >>> https://github.com/hybridlogic/Spread-Yair-fix/commit/15649ddc00bc728204b324f63c13fe77fb15a33a
> >>> 
> >>> And here is the output for the first few seconds after starting the
> >>> third daemon:
> >>> 
> >>> http://lukemarsden.net/yair-debug/Screenshot-1.png
> >>> http://lukemarsden.net/yair-debug/Screenshot-2.png
> >>> http://lukemarsden.net/yair-debug/Screenshot-3.png
> >>> http://lukemarsden.net/yair-debug/Screenshot-4.png
> >>> 
> >>> (Ignore the *** GOT HERE ***, that was me.)
> >>> 
> >>> If you wish to make any code changes, you can fork the repo at
> >>> https://github.com/hybridlogic/Spread-Yair-fix to your own GitHub
> >>> account, commit the changes and issue a pull request, then I can merge
> >>> and test very quickly.
> >>> 
> >>> Alternatively just send me line numbers and code and I'll apply the
> >>> changes manually, whatever's quicker for you :-)
> >>> 
> >> 
> >> _______________________________________________
> >> Spread-users mailing list
> >> Spread-users at lists.spread.org
> >> http://lists.spread.org/mailman/listinfo/spread-users
> 