[Spread-users] main concern on High CPU utilization

John Lane Schultz jschultz at spreadconcepts.com
Thu Mar 5 11:44:11 EST 2009


First off, if those are supposed to be Spread configuration files, then I think you've got them pretty wrong.  In fact, if you are running Spread version 4, then I'd be surprised if your daemons are communicating with one another at all.

In general, you want all the daemons in a configuration to run from the same configuration file*.  More specifically, the segment portions of the configurations should be exactly the same: the same number of segments, listed in the same order, with matching segment addresses, and with each segment containing the same nodes listed in the same order.

So, the first problem that jumps out at me is that your two different daemons are running from slightly different configuration files.  In particular, the address of the segment is different:

Server 1: Gcs_Segment 192.168.4.72:8800 ...
Server 2: Gcs_Segment 192.168.4.70:8800 ...
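
To make the mismatch concrete, here is a minimal sketch in C of the rule above: two segment definitions agree only if the address, the number of nodes, and the node order all match.  The struct segment layout and the segments_match name are made up for illustration; this is not how Spread itself parses or hashes its configuration.

```c
#include <string.h>

#define MAX_NODES 8

/* Hypothetical in-memory form of one segment from a configuration file. */
struct segment {
    const char *address;            /* e.g. "192.168.4.72:8800" */
    int         num_nodes;
    const char *nodes[MAX_NODES];   /* node IPs, in the order listed */
};

/* Returns 1 if both segments have the same address and the same nodes
 * listed in the same order, 0 otherwise. */
int segments_match(const struct segment *a, const struct segment *b)
{
    int i;

    if (strcmp(a->address, b->address) != 0) return 0;
    if (a->num_nodes != b->num_nodes) return 0;
    for (i = 0; i < a->num_nodes; i++) {
        if (strcmp(a->nodes[i], b->nodes[i]) != 0) return 0;
    }
    return 1;
}
```

Fed the two segments from your email, this check fails on the very first comparison, because the segment addresses differ.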

Normally, with Spread v4, I would expect the two daemons to compute different configuration hashes and, therefore, refuse to talk to one another.  I wouldn't expect lots of extra CPU overhead in such a case.  So, some initial questions are:  What version of Spread are you running?  Are the daemons reporting that they are connecting with one another on the command line?

Now, if you are running Spread v3 or if you are running Spread v4 and you gave us a slightly incorrect listing of your configuration files in your email and they both had the same segment address, such as:

Server 1: Gcs_Segment 192.168.4.72:8800 ...
Server 2: Gcs_Segment 192.168.4.72:8800 ...

Then that might explain why you are seeing high CPU usage.  When you place multiple nodes in the same segment, you are telling Spread that if a node in the segment sends a message to the Segment address, then all of the nodes in the segment will likely receive it (i.e., the Segment address is a multicast or broadcast address).

In your file(s) you put one of the machines' IPs as the Segment address.  Bizarrely, because you put the opposite machine's IP address there and you only have two nodes in the segment, that might actually work, but that is certainly not the designed intention of the Segment address and I can't rightly say what should or would happen with the configuration you gave it.

Normally, when someone puts in an incorrect Segment address, such as one of the machines' IP addresses, the message is sent there and the other daemons never receive it.  So, they all request re-sends of the packet, which often simply sends it to the segment address again.  There is a fallback mechanism for when the segment address doesn't seem to be working or only one other daemon requests a resend (as would be the case for you).  In that case, the daemon will eventually resend the message using unicast, which will work in your configuration.

So, right now, I would say fix your configuration first and see if the CPU issue remains.  I would recommend running one of the following configurations on both of your daemons, listed in order of preference:

If you have multicast working between the servers then:

Spread_Segment 227.227.227.227:8800 {
	Node1	192.168.4.70
	Node2	192.168.4.72
}

If you have broadcast working between the servers and it is a /24 network:

Spread_Segment 192.168.4.255:8800 {
	Node1	192.168.4.70
	Node2	192.168.4.72
}

Otherwise:

Spread_Segment 0.0.0.0:8800 {
	Node1	192.168.4.70
}

Spread_Segment 0.0.0.0:8800 {
	Node2	192.168.4.72
}
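
For the broadcast option, the segment address is the directed broadcast address of the subnet, i.e. the host's address with all of the host bits set.  Here is a small sketch of that arithmetic in C, assuming IPv4 and the POSIX inet_pton/inet_ntop calls; broadcast_address is a hypothetical helper and not part of Spread.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>

/* Writes the directed broadcast address of ip/prefix_len into out.
 * Returns 0 on success, -1 on bad input. */
int broadcast_address(const char *ip, int prefix_len, char *out, size_t out_len)
{
    struct in_addr addr;
    uint32_t host, mask;

    if (prefix_len < 0 || prefix_len > 32) return -1;
    if (inet_pton(AF_INET, ip, &addr) != 1) return -1;

    host = ntohl(addr.s_addr);
    /* Network mask: the top prefix_len bits set (a shift by 32 is avoided). */
    mask = (prefix_len == 0) ? 0 : (0xFFFFFFFFu << (32 - prefix_len));
    /* Setting every host bit gives the broadcast address. */
    addr.s_addr = htonl(host | ~mask);

    return inet_ntop(AF_INET, &addr, out, (socklen_t)out_len) ? 0 : -1;
}
```

With 192.168.4.70 and a prefix length of 24 this yields 192.168.4.255, the segment address used in the broadcast configuration above.  On a network with a different prefix length the address would differ.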

Cheers!
John

---
John Lane Schultz
Spread Concepts LLC
Phn: 443 838 2200 
Fax: 301 560 8875

Thursday, March 5, 2009, 8:12:32 AM, you wrote:


Hi 
 
Server 1 (IP 192.168.4.70)
Gcs_Segment 192.168.4.72:8800{
               Node1 192.168.4.70
               Node2  192.168.4.72
         }
Server 2 (IP 192.168.4.72)
Gcs_Segment 192.168.4.70:8800{
               Node1 192.168.4.70
               Node2  192.168.4.72
         }
Above are the two config files for the individual Spread daemons.
 
I am using a PowerPC platform (with a Motorola processor).
I have tried everything suggested in the Spread documentation available on the site, but I have not been able to reduce the CPU utilization; it stays at around 30%.
I am also attaching the monitor snapshots for analysis.
Could you please send me some documentation (a low-level design doc or a doc explaining the code flow) so that I can better understand the Spread process?
Or could you please suggest some way to reduce the CPU utilization?
Waiting for your early response.
Sandeep jeevan 
Member Technical(Stack) 
Mob:9717892153 
VNL  | 246, Phase IV, Udyog Vihar, Gurgaon, Haryana 122 015, INDIA | +91-124-4311600-609 | F +91-124-4104766 | www.vnl.in
  

From: John Lane Schultz [mailto:jschultz at spreadconcepts.com] 
Sent: Wednesday, March 04, 2009 9:31 PM
To: Sandeep Jeevan
Cc: spread-users at lists.spread.org
Subject: RE: [Spread-users] Problem with token sending module (main concern on High CPU utilization)
 
As you can see, the only difference is in the test on the Token_counter.  This function determines whether or not a ring leader should stop circulating the token.  When true, this test basically puts the system into a “dormant” mode, compared to an “active” mode that keeps the token circulating as fast as possible.
 
In the 4.x.x version, it will stop circulating the token after all daemons have acknowledged receiving all traffic and the token makes 1 additional circulation.  In the 3.x.x version, it does the same but only after the token makes 100 additional circulations.
 
This change was made because we were getting complaints from people that the token continued to circulate for some time even when no user traffic was flowing.  So, this way there is likely less token traffic.  The drawback to doing this is that if a daemon wanted to send after the token stopped circulating but before the token_hurry timeout, then it would need to send a request for the token to the ring leader who would then begin circulating it rather than the token just coming to it automatically if it had continued circulating.  In other words, if your system has low amounts of activity, then the optimization that reduces the token traffic will likely increase the latency of sending and delivering messages.
 
You can change the number any way you like without harming the overall functioning of the protocol.  It will just raise or lower how aggressive Spread is in trying to optimize for low latency.  The higher the number, the longer the token will continue circulating after no new traffic is injected.
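
As a rough illustration of that trade-off, here is a simplified model in C of the Token_counter test.  It assumes the counter starts at 0 once all traffic has been acknowledged and goes up by one per idle circulation; where Spread really resets the counter is not modeled, so the absolute counts may be off by one or so, and only the comparison between thresholds is meaningful.  Both function names are hypothetical.

```c
/* Simplified stand-in for the Token_counter part of To_hold_token(). */
int hold_token(int token_counter, int threshold)
{
    return token_counter > threshold;
}

/* Counts how many times an idle token is forwarded before the leader
 * holds it, under the assumptions stated above (counter starts at 0 and
 * is incremented once per idle circulation). */
int idle_forwards_before_hold(int threshold)
{
    int token_counter = 0;
    int forwarded = 0;

    while (!hold_token(token_counter, threshold)) {
        forwarded++;
        token_counter++;
    }
    return forwarded;
}
```

In this model a threshold of 100 forwards the idle token 99 more times than a threshold of 1 before the leader holds it, which is the extra token traffic (and the lower send latency) that the larger number buys.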
Cheers!

---
John Lane Schultz
Spread Concepts LLC
Phn: 443 838 2200 

From: spread-users-bounces at lists.spread.org [mailto:spread-users-bounces at lists.spread.org] On Behalf Of Sandeep Jeevan
Sent: Wednesday, March 04, 2009 12:26 AM
To: John Lane Schultz
Cc: spread-users at lists.spread.org
Subject: Re: [Spread-users] Problem with token sending module (main concern on High CPU utilization)
 
 
 
Dear John
 
Could you please guide me on this? In version 4.x.x, protocol.c has:
 
static  int To_hold_token()
{
  if( ( Memb_state() == OP ||
        ( Memb_state() == GATHER && Memb_token_alive() ) )&&
      Get_retrans(Last_token->type) <= 1      &&
      Aru == Highest_seq && Token_counter > 1 ) return ( 1 );
  else return( 0 );
}
 
while in version 3.x.x, protocol.c has:
 
static  int To_hold_token()
{
  if( ( Memb_state() == OP ||
        ( Memb_state() == GATHER && Memb_token_alive() ) )&&
      Get_retrans(Last_token->type) <= 1      &&
      Aru == Highest_seq && Token_counter > 100 ) return ( 1 );
  else return( 0 );
}
 
If I change 100 to 1 in 3.x.x, will it impact my system in any way?
 
 
Sandeep jeevan 
Member Technical(Stack) 
Mob:9717892153 
VNL  | 246, Phase IV, Udyog Vihar, Gurgaon, Haryana 122 015, INDIA | +91-124-4311600-609 | F +91-124-4104766 | www.vnl.in
  

From: John Lane Schultz [mailto:jschultz at spreadconcepts.com] 
Sent: Tuesday, March 03, 2009 8:36 PM
To: Sandeep Jeevan; spread-users at lists.spread.org
Subject: RE: [Spread-users] Problem with token sending module (main concern on High CPU utilization)
 
It looks like for the first send he is retransmitting it due to a suspected token loss or another daemon’s request at line 2.  On lines 7 and 27, however, this process had just received the token before he sent it on again.
 
The current token ring algorithm will forward the token as fast as possible so long as any of the daemons has user traffic to send, to try to minimize the latency of messages.  Only if all the daemons have no more user data to send will the ring leader then hold the token and stop passing it around.  In that case, he will hold it for the token_hurry timeout before passing it anyway for failure detection.
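
The forwarding rule can be sketched as a single decision in C.  This is a simplified model with hypothetical names, not the code in protocol.c, and the timeout is passed in as a parameter rather than taken from Spread's sources.

```c
/* Returns how long (in microseconds) a daemon waits before forwarding
 * the token in this simplified model:
 *  - while any daemon still has user traffic, forward immediately;
 *  - once the ring is idle, only the ring leader holds the token, and
 *    only for the token_hurry timeout, so that failures are still
 *    detected. */
long token_forward_delay_usec(int is_ring_leader,
                              int ring_has_pending_traffic,
                              long token_hurry_usec)
{
    if (!ring_has_pending_traffic && is_ring_leader) {
        return token_hurry_usec;
    }
    return 0;
}
```

So a steady stream of token sends with no delay between them, as in the log, is what this model predicts whenever some daemon still has traffic to send.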
Cheers!

---
John Lane Schultz
Spread Concepts LLC
Phn: 443 838 2200 

From: spread-users-bounces at lists.spread.org [mailto:spread-users-bounces at lists.spread.org] On Behalf Of Sandeep Jeevan
Sent: Tuesday, March 03, 2009 1:42 AM
To: spread-users at lists.spread.org
Subject: [Spread-users] Problem with token sending module (main concern on High CPU utilization)
 
Below are the logs that I got after enabling logging.
I am facing an unusual problem: after the line numbered 11, the daemon keeps sending the token about every 1 millisecond, while the token timing in membership.c is 200000. Could anybody let me know why this happens?
 
 
1 [Tue 03 Mar 2009 10:42:03] DL_send: sent a message of 24 bytes to (192.168.4.70,8801) on channel 5
2 [Tue 03 Mar 2009 10:42:03] Prot_token_hurry: retransmiting token 13 1
3 [Tue 03 Mar 2009 10:42:03] E_handle_events: next event
4 [Tue 03 Mar 2009 10:42:03] E_handle_events: poll select
5 [Tue 03 Mar 2009 10:42:03] E_handle_events: exec handler for fd 5, fd_type 0, priority 1
6 [Tue 03 Mar 2009 10:42:03] DL_recv: received 24 bytes on channel 5
7 [Tue 03 Mar 2009 10:42:03] Received Token
8 [Tue 03 Mar 2009 10:42:03] dispose: disposing pointer 0x1013a0d8 to object type 20 named scatter
9 [Tue 03 Mar 2009 10:42:03] dispose: disposing pointer 0x101389a0 to object type 27 named down_link
10 [Tue 03 Mar 2009 10:42:03] Send_new_packets: packet 292 sent and inserted
11 [Tue 03 Mar 2009 10:42:03] Net_flush_bcast: Flushing with Queued_bytes = 896; num_elements in scat = 2; size of scat0,1 = 32 864
12 [Tue 03 Mar 2009 10:42:03] Net_flush_bcast Num_send_needed =0
13 [Tue 03 Mar 2009 10:42:03] Net_send_token:before milli 400:Tue Mar  3 10:42:03 2009
 
[Tue 03 Mar 2009 10:42:03] ifndef ARCH_SCATTER_NONE ::::$$$ DL_send:sendmsg called ret =24 num_try 0
[Tue 03 Mar 2009 10:42:03] DL_send: sent a message of 24 bytes to (192.168.4.70,8801) on channel 5
[Tue 03 Mar 2009 10:42:03] dispose: disposing pointer 0x1010f3e0 to object type 35 named time_event
[Tue 03 Mar 2009 10:42:03] E_queue: dequeued a (first) simillar event
[Tue 03 Mar 2009 10:42:03] E_queue: (first) event queued func 0x1001e574 code 0 data 0x0 in future (0:200000)
[Tue 03 Mar 2009 10:42:03] dispose: disposing pointer 0x10138098 to object type 35 named time_event
[Tue 03 Mar 2009 10:42:03] E_queue: dequeued a simillar event
[Tue 03 Mar 2009 10:42:03] E_queue: event queued for func 0x10010958 code 0 data 0x0 in future (0:500000)
[Tue 03 Mar 2009 10:42:03] dispose: disposing pointer 0x1010f360 to object type 8 named token_head_obj
[Tue 03 Mar 2009 10:42:03] E_handle_events: next event
[Tue 03 Mar 2009 10:42:03] E_handle_events: poll select
[Tue 03 Mar 2009 10:42:03] E_handle_events: exec handler for fd 5, fd_type 0, priority 1
[Tue 03 Mar 2009 10:42:03] DL_recv: received 24 bytes on channel 5
[Tue 03 Mar 2009 10:42:03] Received Token
[Tue 03 Mar 2009 10:42:03] Net_send_token:before milli 400:Tue Mar  3 10:42:03 2009
 
[Tue 03 Mar 2009 10:42:03] ifndef ARCH_SCATTER_NONE ::::$$$ DL_send:sendmsg called ret =24 num_try 0
[Tue 03 Mar 2009 10:42:03] DL_send: sent a message of 24 bytes to (192.168.4.70,8801) on channel 5
[Tue 03 Mar 2009 10:42:03] dispose: disposing pointer 0x1010f460 to object type 35 named time_event
[Tue 03 Mar 2009 10:42:03] E_queue: dequeued a (first) simillar event
[Tue 03 Mar 2009 10:42:03] E_queue: (first) event queued func 0x1001e574 code 0 data 0x0 in future (0:200000)
[Tue 03 Mar 2009 10:42:03] dispose: disposing pointer 0x1010f3e0 to object type 35 named time_event
[Tue 03 Mar 2009 10:42:03] E_queue: dequeued a simillar event
 
 
Sandeep jeevan
Member Technical(Stack)
Mob:9717892153
VNL  | 246, Phase IV, Udyog Vihar, Gurgaon, Haryana 122 015, INDIA | +91-124-4311600-609 | F +91-124-4104766 | www.vnl.in
 
 
 
 
The information contained in this e-mail is private & confidential and may also be legally privileged. If you are not the intended recipient, please notify us, preferably by e-mail, and do not read, copy or disclose the contents of this message to anyone.