[Spread-users] Issue with Spread going silent

Luke Marsden luke-lists at hybrid-logic.co.uk
Sun Nov 7 12:25:19 EST 2010


Hi again,

Very strange: I have found a way to stop the problem from happening,
which is to turn off the program that is trying to use Spread.

This program is a Python daemon which sends heartbeats every second
(amongst other things). When it is running on each node, trying to
broadcast over Spread, Spread loses the ability to accept a third member
into an existing group of two. But when I disable the Python daemon,
Spread stops breaking in this way.

One hypothesis is that the Python Spread bindings we're using are linked
against an older version of the Spread library.
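
One way to check the hypothesis above is to list the shared libraries
that the compiled Python extension resolves at load time. What follows
is only a minimal sketch under stated assumptions: the function names
are hypothetical, an `ldd` binary is assumed to be available on the
node, and the bindings are assumed to link the Spread library
dynamically (a statically linked binding would show no match at all).

```python
import subprocess


def linked_libraries(ldd_output, needle):
    """Return (soname, resolved_path) pairs whose soname contains `needle`.

    resolved_path is None when ldd could not resolve the library.
    """
    matches = []
    for line in ldd_output.splitlines():
        parts = line.split()
        # Dynamic dependencies look like:
        #   libfoo.so.1 => /usr/lib/libfoo.so.1 (0x...)
        if len(parts) >= 3 and parts[1] == "=>" and needle in parts[0]:
            resolved = parts[2] if parts[2].startswith("/") else None
            matches.append((parts[0], resolved))
    return matches


def spread_libs_for(extension_path):
    """Run ldd on a compiled extension module and pick out Spread libraries."""
    result = subprocess.run(
        ["ldd", extension_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return linked_libraries(result.stdout, "spread")
```

If the bindings import as a module named `spread` (an assumption; use
whatever name is actually installed), calling
`spread_libs_for(spread.__file__)` on each node and comparing the
resolved path with the library the daemon was built from should confirm
or rule out a version mismatch.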

I will investigate and report back. Thanks for your input today,
Yair :-)

-- 
Best Regards,
Luke Marsden
CTO, Hybrid Logic Ltd.

Web: http://www.hybrid-cluster.com/
Hybrid Web Cluster - cloud web hosting

Mobile: +447791750420



On Sun, 2010-11-07 at 10:08 -0500, Yair Amir wrote:
> Hi Luke,
> 
> It sheds light in the sense that I see what is happening:
> 
> - 147 and 102 are together, with 147 as the representative, and they work well.
> - 48 comes along. It finds the others and I think they correctly
>    discover 147 and 48 as the representatives. As 48 does not have a ring
>    it creates a form1 token and sends it to 147.
> - 147 then sends the form1 token to 102, which is good.
> - 102 then creates a form2 token and sends it to 147, which will be
>    the representative of the new ring. The form2 token now contains all
>    the information needed to form the new ring.
> - 147 gets that form2 token and processes it, which is good.
> - 147 is supposed to send the form2 token to 102, which will be the daemon after
>    147 in the new ring.
> - I do not see 102 getting that form2 token from 147, which is strange
>    (as it did get the form1 token from it). This is what causes the
>    ring to dissolve.
> 
> I do not understand why this happens, though, or why that particular
> message is lost. It does not look random, as it probably happens
> over and over again.
> 
> If you like - you can make 2 slight code changes in Spread in the file
> membership.c, rebuild Spread and re-run EXACTLY THE SAME SCENARIO.
> 
> The code changes are to add the unnumbered lines below at the positions
> shown in the membership.c file. The numbered lines are existing code,
> included only for context.
> 
> Code change 1:
> 
> 1935         Net_set_membership( Future_membership );
>               printf("Yair: Installing new network membership ----------->\n");
>               Conf_print( Future_membership);
>               printf("Yair: <-------------------------------------------->\n");
> 1936         FC_new_configuration( );
> 
> 
> Code change 2:
> 
> 2013         if( Conf_last( &Future_membership ) != My.id )
> 2014         {
> 2015                 Net_send_token( &send_scat );
> 2016                 Net_send_token( &send_scat );
>               printf("Yair: Sent form2 from Read_form2 ------------------>\n");
> 2017                 Token_rounds = 0;
> 2018
> 2019         }else{
> 2020                 /* build first regular token */
> 2021                 send_scat.num_elements = 1;
> 2022
> 2023                 form_token->type = 0;
> 2024                 form_token->seq = 0;
> 2025                 form_token->aru = Last_seq;
> 2026                 form_token->flow_control = 0;
> 2027                 form_token->rtr_len = 0;
> 2028
> 2029                 Net_send_token( &send_scat );
>               printf("Yair: Sent regular token from Read_form2 ---------->\n");
> 2030                 Token_rounds = 1;
> 2031         }
> 
> 
> Cheers,
> 
> 	:) Yair.
> 
> On 11/7/10 9:13 AM, Luke Marsden wrote:
> > Hi Yair,
> > 
> > Thank you. I agree 4% packet loss is high. I get quite a bit of packet
> > loss when saturating the network interfaces (spsend/recv or ping -f),
> > but none at all when transmitting just a small amount of traffic. Since
> > in normal operation Spread shouldn't go near saturating the network
> > interfaces, I agree that this is unlikely to be the cause of the
> > problem. An interesting artefact of the virtualisation though.
> > 
> > I have rearranged the machines in the spread.conf. They are using their
> > public IPs for this test, not the 10.0.0.* addresses (although they
> > exhibit the same behaviour either way):
> > 
> > Spread_Segment 178.22.66.147:4803 {
> >     2f20196c853548e7 178.22.66.147
> > }
> > Spread_Segment 178.22.67.102:4803 {
> >     27edda570dce48bb 178.22.67.102
> > }
> > Spread_Segment 178.22.67.48:4803 {
> >     fff0bbd5e0da4103 178.22.67.48
> > }
> > 
> > I've added the MEMBERSHIP debug flag, and this is the output. I started
> > the spread daemons from left-to-right, which now corresponds to
> > top-to-bottom :-)
> > 
> > http://lukemarsden.net/debugging.png
> > 
> > Does this shed any light?
> > 




