[Spread-users] high cpu load

Thu Aug 30 22:20:42 EDT 2001

Hi,

Dirk Vleugels wrote:
> 
> Hello,
> 
> On Mon, Aug 27, 2001 at 10:53:50AM -0400, Yair Amir wrote:
> > Something seems wrong with your network. I think Spread should use less than 10% cpu.
> > You have way too many retransmissions. You have only about 30 packets/sec which is really
> > low. Usually in speeds below thousands of packets/sec, there should almost be no retransmissions.
> > I don't know why this happens on your network.
> 
> I have still not found an explanation. There is no network
> problem on a lower layer i think.

There is definitely a problem: the number of retransmissions is far too high.

> 
> strace shows:
> 
> [.....]
> recvmsg(4, {msg_name(16)={sin_family=AF_INET, sin_port=htons(4804),
> sin_addr=inet_addr("192.168.100.127")}},
> msg_iov(2)=[{"\200\0\17\200\177d\250\300\203E\353\1\177d\250\300\203"...,
> 24}, {"\1d\250\300\233\t\206;\177d\250\300\0\0\2\0\200E\353\1"...,
> 1448}], msg_controllen=0, msg_flags=0}, 0) = 24
> sendmsg(4, {msg_name(16)={sin_family=AF_INET, sin_port=htons(4804),
> sin_addr=inet_addr("192.168.100.2")}},
> msg_iov(2)=[{"\200\0\0\200\1d\250\300\203E\353\1\1d\250\300\203E\353"...,
> 24}, {"", 0}], msg_controllen=0, msg_flags=0}, 0) = 24
> gettimeofday({999204195, 343903}, {4294967176, 0}) = 0
> gettimeofday({999204195, 343940}, {4294967176, 0}) = 0
> gettimeofday({999204195, 343977}, {4294967176, 0}) = 0
> gettimeofday({999204195, 344012}, {4294967176, 0}) = 0
> select(1024, [3 4 6 7 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
> 26 27 28 29 30 31 33 34 35 36 37 38 39 42 43 44 45 46 47 48 49 50 51 52
> 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
> 77 78 79 80 81 82 83 84 85 86 88 89 90 91 92 93 94 95 96 97 98 99 100
> 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 118 119
> 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137
> 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153], [], [6
> 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 33
> 34 35 36 37 38 39 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
> 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
> 84 85 86 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106
> 107 108 109 110 111 112 113 114 115 116 118 119 120 121 122 123 124 125
> 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
> 144 145 146 147 148 149 150 151 152 153], {0, 0}) = 0 (Timeout)
> select(1024, [3 4 6 7 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
> 26 27 28 29 30 31 33 34 35 36 37 38 39 42 43 44 45 46 47 48 49 50 51 52
> 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
> 77 78 79 80 81 82 83 84 85 86 88 89 90 91 92 93 94 95 96 97 98 99 100
> 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 118 119
> 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137
> 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153], [], [6
> 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 33
> 34 35 36 37 38 39 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
> 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83
> 84 85 86 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106
> 107 108 109 110 111 112 113 114 115 116 118 119 120 121 122 123 124 125
> 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
> 144 145 146 147 148 149 150 151 152 153], {1, 999891} <unfinished ...>
> [.....]
> 
> Could the select on a large number of fd's slow the daemon down (it
> shouldn't afaik)? In peak hours even more httpd's would be launched
> (SoftLimit 1024 right now). Is the number of retransmits an RELIABLE
> message issue? Assuming this is _no_ ether (100Mbit btw.) problem.
> 
> > > sent pack: 5706975      recv pack : 17722855    retrans    : 5540933
> > > u retrans: 5386778      s retrans :  154155     b retrans  :       0

From this I understand that you probably have only one segment in your file, which is fine.
u retrans counts packets that only one receiver has lost so far, so the retransmission is unicast to that receiver.
s retrans counts packets that at least two receivers on the same segment have lost (again, it seems
you have only one segment in the system). BUT these numbers are extremely high.
Something is really wrong. The distribution includes two programs
called "s" and "r" that can show what is happening at a low level on
your network. s and r are not really part of the Spread toolkit, but they are
useful to us when we try to see whether there is a problem at the UDP level.
Maybe you can poke around a bit there.
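
To put the counters quoted above in proportion, here is a small Python sketch that only does arithmetic on those numbers. It assumes (this is not stated in the status output) that "retrans" can be compared directly against "sent pack", so treat the ratio as a rough indication only.

```python
# Counters copied from the status output quoted above.
sent_pack = 5706975
retrans = 5540933
u_retrans = 5386778
s_retrans = 154155
b_retrans = 0

# The three per-type counters add up to the total retrans counter.
assert u_retrans + s_retrans + b_retrans == retrans

# Assumption: "retrans" and "sent pack" are directly comparable.
retrans_per_sent = retrans / sent_pack
unicast_share = u_retrans / retrans

print(f"retransmissions per sent packet: {retrans_per_sent:.1%}")
print(f"unicast share of retransmissions: {unicast_share:.1%}")
```

Both figures come out at roughly 97%, which matches the impression that nearly every packet needs a retransmit.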

I know that select with 1000 file descriptors consumes a relatively high amount of CPU.
Personally I have not tried it, but I know Jonathan and Theo have tried these things. Probably we
should move to poll (not to a polling architecture, but to the poll system call, which can replace
select and is more efficient) at some point in the future.
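
For readers who have not used it, here is a minimal sketch of the poll call through Python's standard select module on a Unix system. It is not Spread code; the socket pair simply stands in for one client connection. Unlike select, poll takes an explicit list of descriptors, so it is not tied to FD_SETSIZE or to the highest descriptor number.

```python
import select
import socket

# A connected socket pair stands in for one client connection.
a, b = socket.socketpair()
a_fd = a.fileno()

poller = select.poll()
poller.register(a_fd, select.POLLIN)  # watch `a` for readability

b.send(b"x")  # make `a` readable

# poll() returns (fd, event mask) pairs only for descriptors with
# pending events; the timeout is given in milliseconds.
events = poller.poll(1000)
ready = [fd for fd, mask in events if mask & select.POLLIN]

data = a.recv(1)

poller.unregister(a_fd)
a.close()
b.close()
```

With many mostly idle connections, the returned list stays short, so the caller does not have to scan every watched descriptor after each call.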

But I see only 100-150 fds in your trace. The trace looks ok to me. 

I have no idea what your problem is, and I probably will not be able to
contribute further without being on your system myself. My first recommendation would
be to run s and r for a bit and see what is going on at the UDP level.
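
As a rough illustration of what a UDP-level loss check involves, here is a Python sketch that numbers datagrams and reports which numbers never arrived. This is not how s and r work internally; the function names are hypothetical, and the probe runs over loopback only to show the mechanism. A real test would put the sender and the receiver on two different machines.

```python
import socket
import struct


def missing_sequence_numbers(received, sent_count):
    """Return the sequence numbers in range(sent_count) that never arrived."""
    seen = set(received)
    return [seq for seq in range(sent_count) if seq not in seen]


def loopback_probe(count=50):
    """Send `count` numbered datagrams to ourselves; return what came back."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    rx.settimeout(0.5)
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    received = []
    try:
        for seq in range(count):
            # Each datagram carries its sequence number as 4 bytes.
            tx.sendto(struct.pack("!I", seq), rx.getsockname())
        while len(received) < count:
            try:
                payload, _ = rx.recvfrom(4)
            except socket.timeout:
                break  # nothing more is arriving; the rest counts as lost
            received.append(struct.unpack("!I", payload)[0])
    finally:
        tx.close()
        rx.close()
    return received
```

Calling missing_sequence_numbers(loopback_probe(50), 50) then lists the datagrams that were lost, if any.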

Maybe some other person can contribute here.

> 
> What means 'u retrans', 's retrans' and 'b retrans'. I find no
> explanation in the user manual. It seems nearly every packet needs a
> retransmit?
> 
> Puzzled,
> Dirk

	Cheers,

	:) Yair.