[Spread-users] odd error message - bug?

Jonathan Stanton jonathan at cnds.jhu.edu
Thu Feb 14 01:30:19 EST 2008


If anyone can duplicate this or has also seen it, please email me or the list with 
details.

I've had one other report of this exact error (in a similar configuration -- lots of empty
segments), which came in in Dec 2000 with version 3.14. After a bit of follow-up, the
person who reported it wasn't able to duplicate it or provide any additional information.

Since then I don't have any records of other reports until now.

I can tell you what's happening, but not the root cause.

The Smallest_member function looks through the list of current daemon members and finds 
the one with the smallest id (pretty straightforward). This error happens when the list of 
daemons includes one whose proc_id field is "0" instead of an IP address. Whether this is 
caused by some rare memory corruption, or a bug when creating the membership list for 
configurations with certain characteristics is unclear -- whatever it is doesn't happen 
very often.
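
As a rough illustration of the check described above (this is not the actual
Spread source; the proc_id_t type, the smallest_member signature, and the
bad_index out-parameter are assumptions made for the sketch), a scan for the
smallest id that flags a zero entry could look like this:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a daemon's proc_id: an IPv4 address packed
 * into a 32-bit integer. A value of 0 is treated as invalid. */
typedef uint32_t proc_id_t;

/* Returns the smallest id in members[0..n-1].
 * Returns 0 if the list is empty or contains a zero entry; in the latter
 * case *bad_index is set to the index of the first zero entry.
 * *bad_index is -1 when no zero entry was seen. */
proc_id_t smallest_member(const proc_id_t *members, int n, int *bad_index)
{
    proc_id_t smallest = 0;

    assert(bad_index != NULL);
    *bad_index = -1;

    for (int i = 0; i < n; i++) {
        if (members[i] == 0) {
            /* Mirrors the shape of the logged message; the real daemon
             * exits at this point. */
            fprintf(stderr, "smallest_member: bad entry i: %d, proc 0\n", i);
            *bad_index = i;
            return 0;
        }
        if (smallest == 0 || members[i] < smallest)
            smallest = members[i];
    }
    return smallest;
}
```

With a list whose second entry is 0, this sketch reports index 1, which is the
same shape as the "i: 1, proc 0" line in the log below.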

I'm not sure if that helps any, but if you have it occur again, please send the log output 
like you did below and if possible, send a core dump.

To generate the core you will need to enable core dumps for your OS (using ulimit on Unix 
systems, or a sysctl), change all of the calls to "exit( 0 )" in the source file 
daemon/alarm.c to calls to "abort()", and then recompile.
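
For readers who want to see the two pieces in code, here is a minimal sketch,
assuming a POSIX system. The function names enable_core_dumps and fatal_exit
are hypothetical and do not come from the Spread source. The first raises the
core-file size limit from inside the process (the same limit "ulimit -c"
adjusts from a shell); the second shows a fatal path that calls abort()
rather than exit( 0 ). Whether a core file is actually written can also
depend on other system settings.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

/* Raise the soft core-file size limit to the hard limit.
 * Returns 0 on success, -1 on failure. */
int enable_core_dumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return -1;
    lim.rlim_cur = lim.rlim_max;
    return setrlimit(RLIMIT_CORE, &lim) == 0 ? 0 : -1;
}

/* Hypothetical fatal-error path. abort() raises SIGABRT, whose default
 * action terminates the process abnormally and may produce a core dump;
 * exit(0) would terminate normally and leave no core. */
void fatal_exit(const char *msg)
{
    assert(msg != NULL);
    fprintf(stderr, "%s\n", msg);
    abort();
}
```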

I certainly understand if you can't do the core dumps, but they would be really helpful in 
determining the cause. 

Cheers,

Jonathan


On Wed, Feb 13, 2008 at 09:48:58AM -0800, Nolan Johnson wrote:
> Spread failed to start on a server, with an odd error message logged:
> [Wed 13 Feb 2008 17:28:57] --------------------
> [Wed 13 Feb 2008 17:28:57] Configuration at i-147c0000 is:
> [Wed 13 Feb 2008 17:28:57] Num Segments 8
> [Wed 13 Feb 2008 17:28:57]      0       10.255.xx.yy    4803
> [Wed 13 Feb 2008 17:28:57]      0       10.255.xx.yy      4803
> [Wed 13 Feb 2008 17:28:57]      0       10.255.xx.yy     4803
> [Wed 13 Feb 2008 17:28:57]      0       10.255.xx.yy     4803
> [Wed 13 Feb 2008 17:28:57]      0       10.255.xx.yy       4803
> [Wed 13 Feb 2008 17:28:57]      0       10.255.xx.yy      4803
> [Wed 13 Feb 2008 17:28:57]      0       10.255.xx.yy     4803
> [Wed 13 Feb 2008 17:28:57]      1       10.255.xx.yy     4803
> [Wed 13 Feb 2008 17:28:57]              i-147c0000              10.255.xx.yy
> [Wed 13 Feb 2008 17:28:57] ====================
> [Wed 13 Feb 2008 17:29:08] Smallest_member: Bug! i: 1, proc 0
> Exit caused by Alarm(EXIT)
> 
> This is running in Amazon's ec2 environment, which doesn't support multicast.  So every instance is in its own segment.
> 
> Each of the other instances started up properly and joined the spread network, and when I started this instance again, everything worked properly.  I can't reliably repeat this.
> 
> Anyone know what's going on?
> 
> _______________________________________________
> Spread-users mailing list
> Spread-users at lists.spread.org
> http://lists.spread.org/mailman/listinfo/spread-users


-- 
-------------------------------------------------------
Jonathan R. Stanton         jonathan at cs.jhu.edu
Dept. of Computer Science   
Johns Hopkins University    
-------------------------------------------------------



