[Spread-users] sporadic latencies with SP_receive

Ed Holyat Ed.Holyat at olf.com
Fri Aug 5 15:39:26 EDT 2011


I have run into this issue before.  Spread sends information about every member still joined to a group with every message.  Eventually the member list exceeds what the message can hold, and Spread crashes because its code doesn't handle this case.  I believe there was a fix for this at some point that simply truncates the MembersList.

I believe this was the fix in my local code to prevent the crash, but you will still get the large messages.  I don't know whether this change made it into the open source.  I hit this bug because internally we use long session names, between 60 and 128 characters.

In daemon/groups.c, in G_mess_to_groups(), add this condition to the "creating members" for loop: ( num_bytes < ( sizeof(Temp_buf) - MAX_GROUP_NAME - 1 ) )

...
                        /* creating members */
                        for( j = 0; j < num_memb && ( num_bytes < ( sizeof(Temp_buf) - MAX_GROUP_NAME - 1 ) ); ++j )
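To illustrate how that guard truncates the list instead of overrunning the buffer, here is a minimal standalone sketch. It is not Spread code: pack_members() and the flat layout of fixed-width names are hypothetical, and the value of MAX_GROUP_NAME is assumed (check sp.h in your tree).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_GROUP_NAME 32   /* assumed value; check sp.h in your tree */

/*
 * Hypothetical helper, not part of Spread.
 * names points at num_memb fixed-width records of MAX_GROUP_NAME bytes each.
 * Records are copied into buf until the next one might not fit.
 * Returns the number of members actually packed.
 */
int pack_members(char *buf, size_t buf_size, const char *names, int num_memb)
{
    size_t num_bytes = 0;
    int j;

    /* The guard below assumes the buffer can hold at least one record. */
    assert(buf_size > MAX_GROUP_NAME + 1);

    /* Same condition as the patched loop: keep room for one full name. */
    for (j = 0; j < num_memb && num_bytes < buf_size - MAX_GROUP_NAME - 1; ++j) {
        memcpy(&buf[num_bytes], names + (size_t)j * MAX_GROUP_NAME, MAX_GROUP_NAME);
        num_bytes += MAX_GROUP_NAME;
    }
    return j;
}
```

With a 100-byte buffer and five members, only the first three records are copied and the rest are silently dropped, which matches the "no crash, but a truncated list" behaviour described above.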


To get rid of the latency, the ultimate solution for me was to stop forwarding the full group membership in every message.  I don't want to know who is currently subscribed to a group; I only care about the members that are leaving it.  So I have a code change that limits the MembersList to just the members leaving the group, and I hope it becomes part of the open source in the next release.  It allowed my product to support ~300 users per daemon, versus ~30 users per daemon without it.


The basic change, without using the configuration file, is this.  The line numbers may not line up with the open source version.


daemon/configuration.c,v
configuration.c
100c100
< 
---
> static  bool    OnlyLeaveMemberships = FALSE; // just set to TRUE if you don't have the configuration file setting available to turn this on.
856a857
> 
871a873,890
> bool    Conf_get_only_leave_memberships(void)
> {
>         return(OnlyLeaveMemberships);
> }
> 
> void    Conf_set_only_leave_memberships(bool new_state)
> {
>         if (new_state == FALSE) {
>             Alarmp(SPLOG_PRINT, CONF, "Conf_set_only_leave_memberships: Delivering ALL types of membership events!\n");
>         } else if (new_state == TRUE) {
>             Alarmp(SPLOG_PRINT, CONF, "Conf_set_only_leave_memberships: Delivering ONLY group leave events and network membership events!\n");
>         } else {
>                 /* invalid setting */
>                 return;
>         }
>         OnlyLeaveMemberships = new_state;
> }
>
configuration.h
137a138,139
> bool            Conf_get_only_leave_memberships(void);
> void            Conf_set_only_leave_memberships(bool new_state);

/daemon/groups.c,v
1204c1204,1205
<                 if( Is_memb_session( Sessions[ ses ].status ) )
---
>                 if( Is_memb_session( Sessions[ ses ].status ) &&
>                     Is_allowed_membership( caused ) ) {
1206a1208
>         }
1237c1239,1240
<         if( Is_memb_session( Sessions[ ses ].status ) )
---
>         if( Is_memb_session( Sessions[ ses ].status ) &&
>             Is_allowed_membership( CAUSED_BY_LEAVE ) )
1287c1290,1291
<                 if( Is_memb_session( Sessions[ ses ].status ) )
---
>                 if( Is_memb_session( Sessions[ ses ].status ) &&
>                     Is_allowed_membership( CAUSED_BY_NETWORK ) )
1311c1315,1316
<                 if( Is_memb_session( Sessions[ ses ].status ) )
---
>                 if( Is_memb_session( Sessions[ ses ].status )  && 
>                     Is_allowed_membership( CAUSED_BY_NETWORK) )
1338c1343,1344
<                 if( Is_memb_session( Sessions[ ses ].status ) )
---
>                 if( Is_memb_session( Sessions[ ses ].status ) &&
>                     Is_allowed_membership( CAUSED_BY_NETWORK) )
1659a1666,1667
> 
> 	if (Conf_get_only_leave_memberships() == FALSE ) {
1660a1669,1672
> 	} else {
> 	   head_ptr->num_groups = 0;
> 	}
> 
1670c1682
< 	for (stdskl_begin(&grp->DaemonsList, &it); !stdskl_is_end(&grp->DaemonsList, &it); ) 
---
> 	for (stdskl_begin(&grp->DaemonsList, &it); (Conf_get_only_leave_memberships() == FALSE) && !stdskl_is_end(&grp->DaemonsList, &it); ) 

daemon/sess_body.h,v
sess_body.h
85a86,87
> #define         Is_allowed_membership( caused ) ( (caused) == CAUSED_BY_LEAVE || (caused) == CAUSED_BY_DISCONNECT || (caused) == CAUSED_BY_NETWORK || ( (caused) == CAUSED_BY_JOIN && Conf_get_only_leave_memberships() == FALSE ) )
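For readers who want to see what the filter does in isolation, here is a small standalone sketch of the same decision. It is only an illustration: the lowercase function names are hypothetical stand-ins for the patched ones, the CAUSED_BY_* values are placeholders rather than the definitions from Spread's headers, and it uses C99 bool instead of Spread's own type.

```c
#include <stdbool.h>

/* Placeholder values for illustration; the real constants come from Spread's headers. */
enum {
    CAUSED_BY_JOIN       = 0x0100,
    CAUSED_BY_LEAVE      = 0x0200,
    CAUSED_BY_DISCONNECT = 0x0400,
    CAUSED_BY_NETWORK    = 0x0800
};

/* Stand-in for the OnlyLeaveMemberships flag in configuration.c. */
static bool only_leave_memberships = false;

bool conf_get_only_leave_memberships(void)
{
    return only_leave_memberships;
}

void conf_set_only_leave_memberships(bool new_state)
{
    only_leave_memberships = new_state;
}

/*
 * Function form of the Is_allowed_membership macro: leave, disconnect and
 * network events are always delivered; join events are delivered only
 * while the flag is off.
 */
bool is_allowed_membership(int caused)
{
    if (caused == CAUSED_BY_JOIN)
        return !conf_get_only_leave_memberships();

    return caused == CAUSED_BY_LEAVE ||
           caused == CAUSED_BY_DISCONNECT ||
           caused == CAUSED_BY_NETWORK;
}
```

With the flag off, all four causes pass the filter; with it on, only joins are suppressed. Writing it as a function rather than a macro avoids evaluating the argument several times, but that is a choice made for this sketch, not something the patch does.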




-----Original Message-----
From: Johannes Wienke [mailto:jwienke at techfak.uni-bielefeld.de] 
Sent: Thursday, August 04, 2011 6:44 AM
To: spread-users at lists.spread.org
Subject: Re: [Spread-users] sporadic latencies with SP_receive

Hey again,

sorry for bumping, but are there any ideas how and why this happens? It really decreases our performance right now.

Regards,
Johannes

On 07/25/2011 01:47 PM, Johannes Wienke wrote:
> Dear all,
> 
> we encountered some latency issues in our applications using spread.
> Today we tried to isolate the problem and came up with a test program 
> that demonstrates the behavior.
> 
> Generally, the observation is that in a threaded setup, using local 
> communication, and a small sleep between calls to SP_receive, these 
> receive calls sometimes take up to 100 ms, e.g. generating this log:
> 
> receive took 55 us
> receive took 68 us
> receive took 97071 us
> receive took 54 us
> receive took 67 us
> receive took 97060 us
> receive took 54 us
> receive took 68 us
> receive took 97086 us
> receive took 56 us
> receive took 69 us
> receive took 97091 us
> receive took 56 us
> receive took 67 us
> receive took 97071 us
> 
> The attached program exactly produces this output. Please note that 
> this only happens if the sleep call is present in line 108.
> 
> We have also verified that this is not related to the architecture we 
> are running on (Linux 32 and 64 bit), nevertheless we got a stack 
> corruption on 32 bit in the sender thread with a privateGroup array 
> only MAX_PRIVATE_NAME characters long. Thus the increased size. Is 
> this also a known problem?
> 
> We would be happy to get some insights or fixes in how to prevent this 
> issue. In a real application the sleep is usually not required to 
> trigger the problem as the receiving thread is still doing other 
> things in its loop.
> 
> Regards,
> Johannes
> 
> 
> 
> _______________________________________________
> Spread-users mailing list
> Spread-users at lists.spread.org
> http://lists.spread.org/mailman/listinfo/spread-users




