[Spread-users] sporadic latencies with SP_receive
Ed Holyat
Ed.Holyat at olf.com
Fri Aug 5 15:39:26 EDT 2011
I have run into this issue before. For every member still joined to a group, Spread sends that member's information with every message. Eventually the list exceeds what the message buffer can hold, and the daemon crashes because the Spread code doesn't handle that case. I believe there was a fix for this at some point that simply truncates the MembersList.
I believe this was the fix in my local code to prevent the crash, but you will still get the large messages. I don't know whether this change made it into the open source. The reason I hit this bug is that, internally, we use long session names of between 60 and 128 characters.
In daemon/groups.c, in G_mess_to_groups(), add this condition to the "creating members" for loop: ( num_bytes < ( sizeof(Temp_buf) - MAX_GROUP_NAME - 1 ) )
...
/* creating members */
for( j = 0; j < num_memb && ( num_bytes < ( sizeof(Temp_buf) - MAX_GROUP_NAME - 1 ) ); ++j )
To get rid of the latency, the ultimate solution for me was to stop forwarding the full membership of every group in every message: I don't need to know who is currently subscribed to a group, only which members are leaving it. So I have a code change that limits the MembersList to just the members leaving the group; I hope it becomes part of the open source in the next release. It allowed my product to support roughly 300 users per daemon versus roughly 30 without it.
The basic change, without the configuration-file support, is below. The line numbers may not line up with the open-source version.
daemon/configuration.c,v
configuration.c
100c100
<
---
> static bool OnlyLeaveMemberships = FALSE; // just set to TRUE if you don't have the configuration file setting available to turn this on.
856a857
>
871a873,890
> bool Conf_get_only_leave_memberships(void)
> {
> return(OnlyLeaveMemberships);
> }
>
> void Conf_set_only_leave_memberships(bool new_state)
> {
> if (new_state == FALSE) {
> Alarmp(SPLOG_PRINT, CONF, "Conf_set_only_leave_memberships: Delivering ALL types of membership events!\n");
> } else if (new_state == TRUE) {
> Alarmp(SPLOG_PRINT, CONF, "Conf_set_only_leave_memberships: Delivering ONLY group leave events and network membership events!\n");
> } else {
> /* invalid setting */
> return;
> }
> OnlyLeaveMemberships = new_state;
> }
>
configuration.h
137a138,139
> bool Conf_get_only_leave_memberships(void);
> void Conf_set_only_leave_memberships(bool new_state);
daemon/groups.c,v
groups.c
1204c1204,1205
< if( Is_memb_session( Sessions[ ses ].status ) )
---
> if( Is_memb_session( Sessions[ ses ].status ) &&
> Is_allowed_membership( caused ) ) {
1206a1208
> }
1237c1239,1240
< if( Is_memb_session( Sessions[ ses ].status ) )
---
> if( Is_memb_session( Sessions[ ses ].status ) &&
> Is_allowed_membership( CAUSED_BY_LEAVE ) )
1287c1290,1291
< if( Is_memb_session( Sessions[ ses ].status ) )
---
> if( Is_memb_session( Sessions[ ses ].status ) &&
> Is_allowed_membership( CAUSED_BY_NETWORK ) )
1311c1315,1316
< if( Is_memb_session( Sessions[ ses ].status ) )
---
> if( Is_memb_session( Sessions[ ses ].status ) &&
> Is_allowed_membership( CAUSED_BY_NETWORK) )
1338c1343,1344
< if( Is_memb_session( Sessions[ ses ].status ) )
---
> if( Is_memb_session( Sessions[ ses ].status ) &&
> Is_allowed_membership( CAUSED_BY_NETWORK) )
1659a1666,1667
>
> if (Conf_get_only_leave_memberships() == FALSE ) {
1660a1669,1672
> } else {
> head_ptr->num_groups = 0;
> }
>
1670c1682
< for (stdskl_begin(&grp->DaemonsList, &it); !stdskl_is_end(&grp->DaemonsList, &it); )
---
> for (stdskl_begin(&grp->DaemonsList, &it); (Conf_get_only_leave_memberships() == FALSE) && !stdskl_is_end(&grp->DaemonsList, &it); )
daemon/sess_body.h,v
sess_body.h
85a86,87
> #define Is_allowed_membership( caused ) ( caused == CAUSED_BY_LEAVE || caused == CAUSED_BY_DISCONNECT || caused == CAUSED_BY_NETWORK || ( caused == CAUSED_BY_JOIN && Conf_get_only_leave_memberships() == FALSE) )
-----Original Message-----
From: Johannes Wienke [mailto:jwienke at techfak.uni-bielefeld.de]
Sent: Thursday, August 04, 2011 6:44 AM
To: spread-users at lists.spread.org
Subject: Re: [Spread-users] sporadic latencies with SP_receive
Hey again,
sorry for bumping, but are there any ideas as to how and why this happens? It is really hurting our performance right now.
Regards,
Johannes
On 07/25/2011 01:47 PM, Johannes Wienke wrote:
> Dear all,
>
> we encountered some latency issues in our applications using Spread.
> Today we tried to isolate the problem and came up with a test program
> that demonstrates the behavior.
>
> Generally, the observation is that in a threaded setup with local
> communication and a small sleep between calls to SP_receive, these
> receive calls sometimes take up to 100 ms, e.g. generating this log:
>
> receive took 55 us
> receive took 68 us
> receive took 97071 us
> receive took 54 us
> receive took 67 us
> receive took 97060 us
> receive took 54 us
> receive took 68 us
> receive took 97086 us
> receive took 56 us
> receive took 69 us
> receive took 97091 us
> receive took 56 us
> receive took 67 us
> receive took 97071 us
>
> The attached program produces exactly this output. Please note that
> this only happens if the sleep call is present in line 108.
>
> We have also verified that this is not related to the architecture we
> are running on (Linux, 32- and 64-bit). Nevertheless, on 32 bit we got
> a stack corruption in the sender thread with a privateGroup array only
> MAX_PRIVATE_NAME characters long; hence the increased array size. Is
> this also a known problem?
>
> We would be happy to get some insight or a fix for how to prevent this
> issue. In a real application the sleep is usually not required to
> trigger the problem, as the receiving thread is still doing other
> things in its loop.
>
> Regards,
> Johannes
>
>
>
> _______________________________________________
> Spread-users mailing list
> Spread-users at lists.spread.org
> http://lists.spread.org/mailman/listinfo/spread-users