[Spread-users] Spread Daemon stops sending and receiving messages while showing increasing number of retransmission

Wed Sep 3 15:23:20 EDT 2014

This code was changed in the most recent version; it used to be much more onerous.  In 4.3 and all earlier versions, if a send failed at the OS level, it would be retried up to ten times with a 10ms wait between each attempt ...

This code was originally introduced a long time ago, to handle intermittent transmission issues where the OS would briefly return an error code but allow a send a bit later to go just fine.

I significantly reduced the number of attempts and the wait so that such an error does not impact processing much, while still giving the send one quick shot at overcoming an intermittent error.
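
To put rough numbers on that change, here is a minimal sketch of the worst-case time a single failing send can spend sleeping. The helper name worst_case_delay_usec is hypothetical and not part of Spread; the inputs are the figures given in this thread (ten retries at 10ms before, DL_MAX_NUM_SEND_RETRIES 1 at 100 usec now), and the result ignores the time spent inside the send calls themselves.

```c
#include <assert.h>

/* Worst-case time spent sleeping between attempts when every attempt
   fails: one wait per retry. Hypothetical helper, not part of Spread. */
static long worst_case_delay_usec( long max_retries, long wait_usec )
{
    assert( max_retries >= 0 && wait_usec >= 0 );
    return max_retries * wait_usec;
}
```

Under those assumptions, worst_case_delay_usec( 10, 10000 ) gives 100000 usec (about 100ms) for the old behaviour, and worst_case_delay_usec( 1, 100 ) gives 100 usec for the current one.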

Personally, I think DL_MAX_NUM_SEND_RETRIES should be zero.
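
As an illustration of what zero would mean, the sketch below mirrors the control flow of the loop quoted further down in this thread, with the retry limit and wait passed in as parameters. The names send_with_retries, send_fn and fail_n_times are hypothetical and exist only for this example; the callback stands in for the real sendmsg()/sendto() call.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical callback type standing in for sendmsg()/sendto(). */
typedef int (*send_fn)( void *ctx );

/* Same control flow as the quoted loop: attempt a send, then stop on
   success or once num_try exceeds max_retries. Returns the last send
   result and stores the number of attempts made in *attempts. */
static int send_with_retries( send_fn do_send, void *ctx, int max_retries,
                              long wait_usec, int *attempts )
{
    struct timeval delay;
    int            ret;
    int            num_try;

    assert( do_send != NULL && attempts != NULL );

    for( num_try = 1;; ++num_try ) {
        ret = do_send( ctx );

        if( ret >= 0 || num_try > max_retries ) { break; }  /* success or give up */

        /* sleep briefly before the next attempt */
        delay.tv_sec  = 0;
        delay.tv_usec = wait_usec;
        select( 0, NULL, NULL, NULL, &delay );
    }

    *attempts = num_try;
    return ret;
}

/* Example callback: fails the first *ctx calls, then succeeds. */
static int fail_n_times( void *ctx )
{
    int *remaining = ctx;

    if( *remaining > 0 ) { --*remaining; return -1; }
    return 0;
}
```

With max_retries set to 0, a failed send returns after a single attempt and never sleeps; with 1, a send that fails once is retried after the wait and can still succeed.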

Cheers!

-----
John Lane Schultz
Spread Concepts LLC
Cell: 443 838 2200

On Sep 2, 2014, at 11:23 PM, Göran Hasse <gorhas at gmail.com> wrote:

There is some strange code when retransmitting...
The file is data_link.c

Why is DL_MAX_NUM_SEND_RETRIES 1 and not some other number?

Why is the delay time 100 usec and not some other number?
Were those numbers found by "trial and error"? Then you have probably
discovered this error. If those numbers were found by analysis, it would be
nice to know. Is the delay dictated by network concerns, or because the
daemon needs some slack? I am worried that the delay should be 117 usec. ;-)

(And by the way... When you hand over a packet to the send queue, you
have *no clue* about when it will leave the interface.  Even the device
driver for the interface could have an impact here.
http://www.coverfire.com/articles/queueing-in-the-linux-network-stack/
Therefore, introducing delays in protocol handling is a risky business.)

---
#define DL_MAX_NUM_SEND_RETRIES 1
---

        for( i = 0, total_len = 0; i < (int) scat->num_elements; ++i ) {

#ifdef ARCH_SCATTER_NONE
                memcpy( &pseudo_scat[total_len], scat->elements[i].buf, scat->elements[i].len );
#endif
                total_len += scat->elements[i].len;
        }

        for( num_try = 1;; ++num_try ) {

#ifndef ARCH_SCATTER_NONE
                ret = sendmsg( chan, &msg, 0 );
#else
                ret = sendto( chan, pseudo_scat, total_len, 0, (struct sockaddr *) &soc_addr, sizeof( soc_addr ) );
#endif

                if( ret >= 0 || num_try > DL_MAX_NUM_SEND_RETRIES ) { break; }  /* success or give up */

                /* delay for a short while */

                select_delay.tv_sec  = 0;
                select_delay.tv_usec = 100;

                Alarmp( SPLOG_WARNING, DATA_LINK, "DL_send: delaying for %ld.%06lds after failed send to (" IPF ":%d): %d %d '%s'\n",
                        select_delay.tv_sec, select_delay.tv_usec, IP(address), port, ret, sock_errno, sock_strerror(sock_errno) );

                select( 0, NULL, NULL, NULL, &select_delay );

                Alarmp( SPLOG_WARNING, DATA_LINK, "DL_send: woke up; about to attempt send retry #%d\n", num_try );
        }

//

2014-09-02 23:07 GMT+02:00 Claude Chausse <claude.chausse at gmail.com>:
> I have a setup with 6 devices (running spread 4.4.0 on linux), each one configured on a separate segment because they are located on separate subnets. Multicast and broadcast are not available, so I configured the segments as follows:
> 
> Spread_Segment  172.23.1.1 {
>  node1   172.23.1.1
> }
> 
> Spread_Segment  172.23.2.1 {
>  node2   172.23.2.1
> }
> 
> Spread_Segment  172.23.3.1 {
>  node3   172.23.3.1
> }
> 
> Spread_Segment  172.23.4.1 {
>  node4   172.23.4.1
> }
> 
> Spread_Segment  172.23.5.1 {
>  node5   172.23.5.1
> }
> 
> Spread_Segment  172.23.6.1 {
>  node6   172.23.6.1
> }
> 
> Everything works perfectly 99.999% of the time, but a few times we have had a situation where all communication between the nodes stalled, and looking at spmonitor we discovered that some daemons were constantly retransmitting.  There was no way to get out of this mode besides restarting the daemon. During that time, all communication between the nodes worked fine on all other ports (ping, 22, http, and some other UDP ports that we use).
> 
> My questions are:
> - Why would that happen?
> - Is there a way to detect it and to resolve it without restarting the daemon?
> 
> Thanks
> 
> Claude
> _______________________________________________
> Spread-users mailing list
> Spread-users at lists.spread.org
> http://lists.spread.org/mailman/listinfo/spread-users

-- 
gorhas at gmail.com
Göran Hasse
Boo 229
715 91  ODENSBACKEN
Mob: 070-5530148
