[Spread-users] Re-Send: Spread issue when starting a 50 node cluster
jnoller at archivas.com
Mon Aug 23 16:59:08 EDT 2004
Hey everyone - I'm resending this in the hope of getting some insight - any help
would be appreciated.
I'm currently trying to get a spread cluster with 50 nodes up
and running - I noticed that if I spawn all 50 spread processes at once
(all are using the same configuration file, all are on different
machines) and then attempt to use spuser to join the same group across
the machines, none of the cluster members receive membership messages
back. If I attempt to send a message to that group, the message is never
sent or received. I am currently using spread 3.17.2.
I have tested the following scenarios:
Scenario 1: I started spread on 20 nodes simultaneously, and then attempted
to connect to the spread daemon via spuser. This worked - all 20 nodes
got the membership message from the "test" group I told them to join,
and I was able to send a message via the group that was received.
Scenario 2: I started spread on 25 nodes simultaneously - I then tried
to connect to spread via spuser and it seemed to succeed. When I
attempted to join the spread group "test" - no membership messages were
received, and trying to send a message to the group failed (no error
seen, but no message sent/received).
Scenario 3: I started spread on 24 nodes simultaneously and repeated the
steps above, with the same results. However, I noticed that spread
crashed on at least 1 node. From the logs:
[Wed 28 Jul 2004 09:48:36] E_handle_events: poll select
[Wed 28 Jul 2004 09:48:36] E_handle_events: select with timeout (4,
[Wed 28 Jul 2004 09:48:37] E_handle_events: exec handler for fd 5,
fd_type 0, priority 1
[Wed 28 Jul 2004 09:48:37] DL_recv: received 1472 bytes on channel 5
[Wed 28 Jul 2004 09:48:37] Received Token
[Wed 28 Jul 2004 09:48:37] it is a Form Token.
[Wed 28 Jul 2004 09:48:37] Memb_handle_token: handling form2 token
[Wed 28 Jul 2004 09:48:37] Handle_form2 in FORM
[Wed 28 Jul 2004 09:48:37] Net_set_membership: Token_address :
[Wed 28 Jul 2004 09:48:37] Read_form2: num_rings = 24, num_bytes = 7248,
Memb_id = (167838047 -1)
Exit caused by Alarm(EXIT)
Scenario 4: I started spread on 23 nodes simultaneously, ran through
the same "join group test" and "send message" steps, and it worked across
all 23 nodes. Spread did not error or roll.
Scenario 5: Working from scenario 4 - I added node 24 and then node 25
(one at a time). I told them to join the existing "test" group, and it
worked. I could send a message to the group, and all 25 nodes received the
message.
Scenario 6: Building from scenario 5 - I added another 10 nodes, simultaneously.
I told spuser to connect and join the "test" group - this worked. All 35
nodes received the membership messages, and the message sent to the
group was received.
Scenario 7: Building from scenario 6 - I added the remaining 15 nodes at once. I
told them to join the "test" group, and it worked. The test message sent
to the group was received.
Overall, if I start the nodes at a staggered rate - 10, then 10, then 10
and so on, it seems to work fine, and the nodes can join the group and
send messages fine. The problem only surfaces when you try to start that
many nodes simultaneously.
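
For anyone who wants to reproduce the staggered start, it could be scripted
roughly along these lines. This is only a sketch in Python: the node names,
the pause length, and the start_node callback (whatever actually launches
spread on a machine) are placeholders of mine and not part of Spread itself.
The batch size of 10 matches the staggered runs described above.

```python
import time
from typing import Callable, List, Sequence


def make_batches(hosts: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split hosts into consecutive groups of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(hosts[i:i + batch_size])
            for i in range(0, len(hosts), batch_size)]


def staggered_start(
    hosts: Sequence[str],
    batch_size: int,
    pause_seconds: float,
    start_node: Callable[[str], None],
    sleep: Callable[[float], None] = time.sleep,
) -> List[List[str]]:
    """Start the nodes one batch at a time, pausing between batches.

    start_node is a hypothetical callback that launches the daemon on one
    host (for example over a remote shell); it is an assumption of this
    sketch, not something Spread provides.
    """
    batches = make_batches(hosts, batch_size)
    for index, batch in enumerate(batches):
        for host in batch:
            start_node(host)
        # Pause after every batch except the last one.
        if index < len(batches) - 1:
            sleep(pause_seconds)
    return batches


if __name__ == "__main__":
    # Dry run: print what would happen instead of starting or sleeping.
    nodes = [f"node{n:02d}" for n in range(1, 51)]  # placeholder names
    staggered_start(
        nodes,
        batch_size=10,
        pause_seconds=5.0,
        start_node=lambda host: print("would start spread on", host),
        sleep=lambda seconds: print(f"would pause {seconds}s"),
    )
```

The pause value here is arbitrary; I have not measured how long the
membership needs to settle between batches.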
I can provide more logs as needed. Attached is my spread.conf file used
for the sessions.