[Spread-users] Crash during dynamic reconfiguration

Mon Dec 17 10:53:06 EST 2007

Hello,

We tried to add an extra segment to our spread network.

Namely, we went from:

Spread_Segment  10.100.255.255:4803 {
	bigpimms	10.100.1.200
	pfe1		10.100.1.101
	pfe2		10.100.1.102
	pfe3		10.100.1.103
	pfe4		10.100.1.104
	pimms-mgmt	10.100.1.201
}

to:

Spread_Segment  10.100.255.255:4803 {
	bigpimms	10.100.1.200
	pfe1		10.100.1.101
	pfe2		10.100.1.102
	pfe3		10.100.1.103
	pfe4		10.100.1.104
	pimms-mgmt	10.100.1.201
}

Spread_Segment  10.200.255.255:4803 {
	bigguinness	10.200.1.200
	gfe1		10.200.1.101
	gfe2		10.200.1.102
	gfe3		10.200.1.103
	gfe4		10.200.1.104
	guinness-mgmt	10.200.1.201
}

(rest of config file available if needed).
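
For anyone comparing the two versions, the difference can be checked
mechanically. This is a minimal Python sketch, not part of Spread:
parse_segments and added_segments are hypothetical helper names, and
the parser assumes every host line is just a name and an address, as in
the excerpts above. It does not handle any other spread.conf syntax.

```python
import re


def parse_segments(conf_text):
    """Map "<broadcast>:<port>" to {host_name: address} for each segment.

    Assumes flat Spread_Segment blocks whose lines are "name address";
    nested braces inside a segment are not supported by this sketch.
    """
    segments = {}
    pattern = r"Spread_Segment\s+(\S+)\s*\{([^}]*)\}"
    for match in re.finditer(pattern, conf_text):
        header, body = match.group(1), match.group(2)
        hosts = {}
        for line in body.splitlines():
            # Drop trailing comments and surrounding whitespace.
            line = line.split("#", 1)[0].strip()
            parts = line.split()
            if len(parts) == 2:
                hosts[parts[0]] = parts[1]
        segments[header] = hosts
    return segments


def added_segments(old_text, new_text):
    """Return the segments present in new_text but not in old_text."""
    old = parse_segments(old_text)
    new = parse_segments(new_text)
    return {key: hosts for key, hosts in new.items() if key not in old}
```

Calling added_segments with the contents of the old and new files
should return only the 10.200.255.255:4803 segment if nothing else
changed.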

1) Started spmonitor
2) Updated the config files on all nodes
3) Issued 'r' in spmonitor to reload the configuration

...and then spmonitor went to 95% CPU usage and sat there, and every
single daemon died.

A sample of the spread log output from one of the nodes (pimms-mgmt) reads:

====================
Membership id is ( 174326216, 1196860032)
--------------------
Configuration at pimms-mgmt is:
Num Segments 1
         6       10.100.255.255    4803
                 bigpimms                10.100.1.200
                 pfe1                    10.100.1.101
                 pfe2                    10.100.1.102
                 pfe3                    10.100.1.103
                 pfe4                    10.100.1.104
                 pimms-mgmt              10.100.1.201
====================
Conf_load_conf_file: error opening config file /opt/data-store/etc/spread.conf
Exit caused by Alarm(EXIT)
Setting SO_REUSEADDR to always on -- make sure Spread daemon host is secured!
Set user name to 'nobody'
Set group name to 'nogroup'
Finished configuration file.
Hash value for this configuration is: 3616445461
Conf_load_conf_file: My name: pimms-mgmt, id: 10.100.1.201, port: 4803
Membership id is ( 174326217, 1197904878)
--------------------
Configuration at pimms-mgmt is:
Num Segments 2
         1       10.100.255.255    4803
                 pimms-mgmt              10.100.1.201
         0       10.200.255.255    4803
====================
Membership id is ( 174326217, 1197904889)
--------------------
Configuration at pimms-mgmt is:
Num Segments 2
         1       10.100.255.255    4803
                 pimms-mgmt              10.100.1.201
         3       10.200.255.255    4803
                 gfe1                    10.200.1.101
                 gfe2                    10.200.1.102
                 gfe3                    10.200.1.103
====================
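
The membership views in the log can be summarised per segment, which
makes the change across the reload easier to see. This is a minimal
Python sketch that assumes the log layout shown above: a "Num Segments"
line followed by indented "<count> <broadcast address> <port>" lines.
segment_counts is a hypothetical helper, not a Spread tool.

```python
import re

# An indented segment line: member count, broadcast address, port.
SEGMENT_LINE = re.compile(r"^\s+(\d+)\s+(\d+\.\d+\.\d+\.\d+)\s+(\d+)\s*$")


def segment_counts(log_text):
    """Return one list per configuration dump: [(broadcast, members), ...]."""
    views = []
    current = None
    for line in log_text.splitlines():
        if line.startswith("Num Segments"):
            # A new configuration dump begins here.
            current = []
            views.append(current)
            continue
        match = SEGMENT_LINE.match(line)
        if match and current is not None:
            current.append((match.group(2), int(match.group(1))))
    return views
```

Applied to the excerpt above, this gives one segment with 6 members
before the reload, then two segments with 1 and 0 members, then two
segments with 1 and 3 members.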

I.e., it looks like the daemon died with 'error opening config file'.
The sysadmin says he changed nothing between that failure and the
subsequent forced restart of spread everywhere, after which the same
file loaded fine.
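
Since the reported error was a failure to open the file, one cheap
check before asking for a reload is whether the config path is readable
on each node. This is a minimal Python sketch; check_config is a
hypothetical helper. Note the assumption: os.access tests the
permissions of the user running the script, so the result only says
something about the daemon if the script runs as the same user the
daemon runs as (the log above shows 'nobody'). Whether that is what
went wrong here is not confirmed.

```python
import os


def check_config(path):
    """Return None if path is a readable regular file, else a short reason."""
    if not os.path.exists(path):
        return "missing"
    if not os.path.isfile(path):
        return "not a regular file"
    if not os.access(path, os.R_OK):
        return "not readable by current user"
    return None
```

For example, check_config("/opt/data-store/etc/spread.conf") uses the
path from the log; a return value of None means the file could be
opened for reading by the calling user.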

Any ideas? Every spread daemon dying in synch didn't do wonders for
our application :-)

ABS

--
Alaric Snell-Pym
Work: http://www.snell-systems.co.uk/
Play: http://www.snell-pym.org.uk/alaric/
Blog: http://www.snell-pym.org.uk/?author=4