[Spread-users] Corrupt packets

Jonathan Stanton jonathan at cnds.jhu.edu
Mon Dec 4 16:11:34 EST 2006


Very interesting. 

I saw in the patch that you are checksumming both the daemon-to-daemon 
traffic (UDP) and the client-server traffic (message contents only), 
which goes over TCP/Unix domain sockets. This is really strange, as both 
UDP and TCP have checksums and should not deliver corrupted data to the 
application (Spread).

Were the UDP/TCP checksums valid on the 'corrupt' data? I'd guess they 
had to be, for the packets not to be dropped. Were you able to capture 
an example packet that had a valid checksum but was corrupt? 
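
For reference, a minimal sketch of the 16-bit ones' complement sum that 
UDP and TCP use (in the style of RFC 1071), to show one way a packet can 
be altered and still carry a valid checksum. The function name and the 
sample bytes are illustrative only; the real UDP/TCP checksum also covers 
a pseudo-header, which is left out here. Because the sum is commutative, 
exchanging two aligned 16-bit words goes undetected. This does not show 
that this is what happened on the x4100 machines.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        # treat each pair of bytes as one big-endian 16-bit word
        total += (data[i] << 8) | data[i + 1]
    # fold any carries back into the low 16 bits
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Hypothetical payloads: same 16-bit words, first two exchanged.
original = b"\x12\x34\xab\xcd\x00\x01"
swapped = b"\xab\xcd\x12\x34\x00\x01"

if __name__ == "__main__":
    print(hex(internet_checksum(original)))
    print(hex(internet_checksum(swapped)))
```

Both calls give the same value even though the payloads differ, so a 
capture with a valid checksum would not by itself rule out corruption.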

This kind of checksum is something I'd like to avoid if possible, as it 
complicates the code and adds overhead to every packet -- but if corrupt 
data can be delivered and it isn't just a particular OS bug, then 
it's worth considering. 
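
To make the cost concrete, here is a rough sketch of what an 
application-level checksum involves. It is not taken from the patch: the 
`frame` and `unframe` helpers, the use of CRC-32, and the 4-byte prefix 
layout are all assumptions chosen for illustration. The sender pays for 
one pass over the payload plus 4 extra bytes; the receiver pays for a 
second pass and has to handle the mismatch case.

```python
import struct
import zlib


def frame(payload: bytes) -> bytes:
    """Prefix the payload with its CRC-32 (hypothetical wire layout)."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return struct.pack("!I", crc) + payload


def unframe(packet: bytes) -> bytes:
    """Verify the CRC-32 prefix and return the payload, or raise ValueError."""
    if len(packet) < 4:
        raise ValueError("packet too short to hold a checksum")
    (expected,) = struct.unpack("!I", packet[:4])
    payload = packet[4:]
    if zlib.crc32(payload) & 0xFFFFFFFF != expected:
        raise ValueError("checksum mismatch: payload was altered")
    return payload


if __name__ == "__main__":
    packet = frame(b"membership update")
    print(unframe(packet))
```

Changing the packet layout like this is also why a checksummed build 
cannot interoperate with an unmodified one.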

If the data is corrupted in the kernel or in memory before being sent 
but after "spread" has finished with it, that would explain the 
situation -- but it would point to an OS bug. 

Jonathan

On Mon, Dec 04, 2006 at 09:01:51AM -0800, Alec H. Peterson wrote:
> Hi all,
> 
> So a few days ago I e-mailed about getting ring lockups.  We tracked  
> this problem down to corrupt packets getting delivered to Spread  
> (both over the session and data link layers).  I've attached a patch  
> that seems to address the problems by adding a checksum to the  
> appropriate data structures, and we feel this could potentially be  
> useful to others.  If there are reasons why this shouldn't be  
> included in Spread we would love to know, because those may well be  
> reasons why we shouldn't use it.  Clearly it changes the network  
> protocol, so it won't be compatible with other builds of Spread.   
> However, this does solve our lockup and corrupt data problems.
> 
> We're also curious if anybody else has seen 'odd' Spread behavior  
> (like ring lockups and/or corrupt data delivered to the client  
> library).  The configuration we have seen this on is very  
> straightforward:
> 
> Sun x4100 Server
> Solaris 10
> Spread 3.17.3 (both stock and with some local patches)
> 
> We have some very similar servers deployed in-house that do not  
> experience these problems at all.
> 
> Thanks!
> 
> Alec
> 


-- 
-------------------------------------------------------
Jonathan R. Stanton         jonathan at cs.jhu.edu
Dept. of Computer Science   
Johns Hopkins University    
-------------------------------------------------------



