[Spread-users] Crash during dynamic reconfiguration
Alaric Snell-Pym
alaric at snell-pym.org.uk
Mon Dec 17 10:53:06 EST 2007
Hello,
We tried to add an extra segment to our spread network.
Namely, we went from:
Spread_Segment 10.100.255.255:4803 {
bigpimms 10.100.1.200
pfe1 10.100.1.101
pfe2 10.100.1.102
pfe3 10.100.1.103
pfe4 10.100.1.104
pimms-mgmt 10.100.1.201
}
to:
Spread_Segment 10.100.255.255:4803 {
bigpimms 10.100.1.200
pfe1 10.100.1.101
pfe2 10.100.1.102
pfe3 10.100.1.103
pfe4 10.100.1.104
pimms-mgmt 10.100.1.201
}
Spread_Segment 10.200.255.255:4803 {
bigguinness 10.200.1.200
gfe1 10.200.1.101
gfe2 10.200.1.102
gfe3 10.200.1.103
gfe4 10.200.1.104
guinness-mgmt 10.200.1.201
}
(rest of config file available if needed).
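For anyone wanting to confirm exactly what changed between the two files, here is a minimal sketch (not part of Spread) that pulls the Spread_Segment blocks out of a config file's text and reports which segments are new. The function names are made up for illustration, and the parser assumes only the simple "name address" layout shown above; it ignores comments and every other directive.

```python
import re

def parse_segments(text):
    """Return {segment_address: {host_name: ip}} for each Spread_Segment block."""
    segments = {}
    # Matches 'Spread_Segment <addr:port> { ... }' and captures the block body.
    for match in re.finditer(r"Spread_Segment\s+(\S+)\s*\{([^}]*)\}", text):
        address, body = match.group(1), match.group(2)
        hosts = {}
        for line in body.splitlines():
            parts = line.split()
            if len(parts) == 2:  # only plain 'name ip' lines are understood
                name, ip = parts
                hosts[name] = ip
        segments[address] = hosts
    return segments

def added_segments(old_text, new_text):
    """Return the sorted segment addresses present in new_text but not old_text."""
    old, new = parse_segments(old_text), parse_segments(new_text)
    return sorted(set(new) - set(old))
```

Run against the two configurations above, added_segments should report only the 10.200.255.255:4803 segment.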
1) Started spmonitor
2) Updated config files on all nodes
3) Did a 'r' to reload configuration
...and then spmonitor went to 95% CPU usage and sat there, and every
single daemon died.
A sample spread log file output from one of the nodes reads:
====================
Membership id is ( 174326216, 1196860032)
--------------------
Configuration at pimms-mgmt is:
Num Segments 1
6 10.100.255.255 4803
bigpimms 10.100.1.200
pfe1 10.100.1.101
pfe2 10.100.1.102
pfe3 10.100.1.103
pfe4 10.100.1.104
pimms-mgmt 10.100.1.201
====================
Conf_load_conf_file: error opening config file /opt/data-store/etc/spread.conf
Exit caused by Alarm(EXIT)
Setting SO_REUSEADDR to always on -- make sure Spread daemon host is secured!
Set user name to 'nobody'
Set group name to 'nogroup'
Finished configuration file.
Hash value for this configuration is: 3616445461
Conf_load_conf_file: My name: pimms-mgmt, id: 10.100.1.201, port: 4803
Membership id is ( 174326217, 1197904878)
--------------------
Configuration at pimms-mgmt is:
Num Segments 2
1 10.100.255.255 4803
pimms-mgmt 10.100.1.201
0 10.200.255.255 4803
====================
Membership id is ( 174326217, 1197904889)
--------------------
Configuration at pimms-mgmt is:
Num Segments 2
1 10.100.255.255 4803
pimms-mgmt 10.100.1.201
3 10.200.255.255 4803
gfe1 10.200.1.101
gfe2 10.200.1.102
gfe3 10.200.1.103
====================
That is, it looks like the daemon died of 'error opening config file'.
The sysadmin says he changed nothing between that crash and the
subsequent forced restart of Spread everywhere, at which point the new
configuration loaded fine.
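One check that might narrow this down is whether the config file can actually be opened by the account the daemon ends up running as. The sketch below is a guess at a useful test, not a diagnosis: the path comes from the log message above, and running it as 'nobody' mirrors the 'Set user name' line, but whether the daemon has already switched user by the time it reloads is an assumption. The function name is hypothetical.

```python
def can_open_for_reading(path):
    """Try to open path for reading; return (ok, reason)."""
    try:
        with open(path, "r"):
            return True, "ok"
    except OSError as exc:
        # Covers a missing file, permission denied, and similar failures.
        return False, f"{type(exc).__name__}: {exc.strerror}"

if __name__ == "__main__":
    # Path taken from the log; run this as the same user the daemon uses.
    print(can_open_for_reading("/opt/data-store/etc/spread.conf"))
```

Running it once as root and once as 'nobody' on one of the affected nodes would show whether the two accounts see the file differently.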
Any ideas? Every spread daemon dying in synch didn't do wonders for
our application :-)
ABS
--
Alaric Snell-Pym
Work: http://www.snell-systems.co.uk/
Play: http://www.snell-pym.org.uk/alaric/
Blog: http://www.snell-pym.org.uk/?author=4