[Spread-users] Spread 2.5

Wed Sep 3 15:14:01 EDT 2014

State 4 means that the membership algorithm is in its GATHER state, i.e. it is in the process of forming a new ring.

From what else you said, though, it sounds like the previous ring may still be alive but unable to fill a hole, which blocks further progress.

This could happen if traffic from the retransmitter cannot reach the requester, which is my best guess at what is happening here: for example, if a firewall somewhere blocks traffic between the pair on the ports in question.  The requester can send the token to the next daemon but, for some reason, retransmissions from the retransmitter (likely the next machine) cannot reach the requester on those ports.

Assuming that loss is very rare in your environment, this problem might not present itself for an extended period of time: retransmissions would be rare and everything would work fine.  Is that what you generally see?

The way to test my diagnosis is to figure out which machine is asking for retransmissions and which machine is retransmitting, and then test whether traffic can flow in both directions between these machines on the ports in question.

Preferably, you would test communication between all pairs of your daemons on the ports in question.  Spread comes with two programs, spsend and sprecv, that can help with this.  You can build them in the daemon directory by running “make spsend sprecv” there.  Their usage is simple.  You need to test whether you can send and receive UDP in each direction on the ports that Spread uses.

Spread uses three ports: its base port and the two following it.  If you don’t specify a base port, as in your configuration below, it defaults to 4803.  So you want to test ports 4803, 4804 and 4805 for bi-directional UDP connectivity between all your machines, and multicast on port 4803.
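
If spsend and sprecv are not at hand, the same check can be approximated with a few lines of Python.  This is only a rough sketch and not part of Spread: the names udp_listen and udp_probe are made up for illustration, and I am assuming the Spread daemon is stopped on the listening machine while you test, since it would otherwise be holding these ports.

```python
import socket

SPREAD_BASE_PORT = 4803  # default base port mentioned above
SPREAD_PORTS = [SPREAD_BASE_PORT + i for i in range(3)]  # 4803, 4804, 4805


def udp_listen(bind_addr, port, timeout=5.0):
    """Wait for one UDP datagram on (bind_addr, port).

    Returns (payload, sender_address), or None if nothing arrives
    before the timeout expires.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((bind_addr, port))
        sock.settimeout(timeout)
        try:
            return sock.recvfrom(2048)
        except socket.timeout:
            return None


def udp_probe(dest_addr, port, payload=b"spread-port-test"):
    """Send one UDP datagram to (dest_addr, port); returns bytes sent."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (dest_addr, port))
```

On the receiving machine you would call something like udp_listen("0.0.0.0", 4803), and on the sending machine udp_probe("172.23.2.1", 4803), then swap roles and repeat for each of the three ports.  A None result on the listener means the datagram never arrived in that direction.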

We’ve run into this problem often enough that I would like to automate this kind of testing somehow, perhaps by running Spread on all your machines in a special mode that tests all the pair-wise connections and spits out a report.  I’ll have to think about this more.
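
To give an idea of the bookkeeping involved, here is a small Python sketch.  Nothing like this exists in Spread today; pairwise_plan and format_report are hypothetical names, the actual probing is left out, and the addresses in the usage note are simply the ones from the configuration quoted below.

```python
from itertools import permutations


def pairwise_plan(hosts, base_port=4803, port_count=3):
    """List every (sender, receiver, port) combination that needs checking.

    permutations() yields ordered pairs, so A->B and B->A both appear,
    which is what a bi-directional test requires.
    """
    ports = range(base_port, base_port + port_count)
    return [(src, dst, port)
            for src, dst in permutations(hosts, 2)
            for port in ports]


def format_report(results):
    """Render {(src, dst, port): bool} as one line per check, failures first."""
    lines = []
    # False sorts before True, so blocked paths come out at the top.
    for (src, dst, port), ok in sorted(results.items(), key=lambda kv: kv[1]):
        status = "ok" if ok else "BLOCKED"
        lines.append(f"{src} -> {dst} udp/{port}: {status}")
    return "\n".join(lines)
```

For the six daemons at 172.23.1.1 through 172.23.6.1, pairwise_plan returns 6 x 5 x 3 = 90 checks.  Each one would be filled in with True or False by whatever does the actual sending and receiving, and format_report then puts the blocked paths first.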

Let me know if you have further questions about this or if I can help you more.

Cheers!

-----
John Lane Schultz
Spread Concepts LLC
Cell: 443 838 2200

On Sep 3, 2014, at 4:21 AM, Göran Hasse <gorhas at gmail.com> wrote:

Could you change the logging a bit? My logs are filled with
"Send_join: State is 4"
It would be nice to understand if this is a state transition. What is
state 4? Does it mean
everything is ok, or something else?

Send_join: State is 4
Send_join: State is 4
Send_join: State is 4

2014-09-03 5:23 GMT+02:00 John Lane Schultz <jschultz at spreadconcepts.com>:
> From the logs, can you tell us what state the daemons were in when this was occurring?  It could be a bug.
> 
> Cheers!
> 
> -----
> John Lane Schultz
> Spread Concepts LLC
> Cell: 443 838 2200
> 
> On Sep 2, 2014, at 10:43 PM, Yair Amir <yairamir at cs.jhu.edu> wrote:
> 
> I have a setup with 6 devices (running Spread 4.4.0 on Linux), each one configured as a separate segment because they are located on separate subnets. Multicast and broadcast are not available, so I configured the segments as follows:
> 
> Spread_Segment  172.23.1.1 {
> node1   172.23.1.1
> }
> 
> Spread_Segment  172.23.2.1 {
> node2   172.23.2.1
> }
> 
> Spread_Segment  172.23.3.1 {
> node3   172.23.3.1
> }
> 
> Spread_Segment  172.23.4.1 {
> node4   172.23.4.1
> }
> 
> Spread_Segment  172.23.5.1 {
> node5   172.23.5.1
> }
> 
> Spread_Segment  172.23.6.1 {
> node6   172.23.6.1
> }
> 
> Everything works perfectly 99.999% of the time, but a few times we had a situation where all communication between the nodes was stalled and, looking at spmonitor, we discovered that some daemons were constantly retransmitting.  There was no way to get out of this mode besides restarting the daemon. During that time, communication between the nodes worked fine on all other ports (ping, 22, http, and some other UDP ports that we use).
> 
> My questions are:
> - Why would that happen?
> - Is there a way to detect it and to resolve it without restarting the daemon?
> 
> Thanks
> 
> Claude
> _______________________________________________
> Spread-users mailing list
> Spread-users at lists.spread.org
> http://lists.spread.org/mailman/listinfo/spread-users

-- 
gorhas at gmail.com
Göran Hasse
Boo 229
715 91  ODENSBACKEN
Mob: 070-5530148