[Spread-users] Clock skew and spread
Nilo Rivera
nrivera at cs.jhu.edu
Sun May 13 11:41:12 EDT 2007
Hi,
It is not the OS event scheduling. The events are kept in a queue in events.c.
Each queue item holds the time at which its callback should fire.
That time is computed by calling gettimeofday() and adding the delay
you want (e.g. +5 seconds). The event loop then uses select() to wait for
socket events or to sleep until the next timed event; the timeout is
supplied in the call (e.g. 5 seconds). When it wakes up, it checks the
queue of events and the times at which they should fire. To check, it calls
gettimeofday() again and compares the current time to the time in the queue
item. So the main issue is that the first gettimeofday() call returned a
time that was not accurate. Even if you never sleep in select(), you still
have the same problem, because the timestamp in the event says that it
should fire at time X. This event may be, for example, the one that sends
messages to the client.
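
A minimal sketch of the logic described above, written in Python for brevity rather than taken from events.c. The names EventQueue, FakeClock and demo_jump_forward_then_back are made up for illustration, and the injectable clock stands in for gettimeofday(); the real code may differ in many details.

```python
import heapq
import itertools
import time


class EventQueue:
    """Toy model of a timed-event queue keyed on absolute deadlines."""

    def __init__(self, clock=time.time):
        # `clock` stands in for gettimeofday(); it is injectable so that
        # a clock jump can be simulated (an assumption of this sketch).
        self._clock = clock
        self._heap = []
        self._tie = itertools.count()  # tie-breaker so callbacks are never compared

    def schedule(self, delay, callback):
        # The deadline is an absolute time: "now" plus the requested delay.
        deadline = self._clock() + delay
        heapq.heappush(self._heap, (deadline, next(self._tie), callback))

    def next_timeout(self):
        # What the loop would pass to select() as its timeout.
        if not self._heap:
            return None
        return max(0.0, self._heap[0][0] - self._clock())

    def run_expired(self):
        # Re-read the clock and fire every event whose deadline has passed.
        now = self._clock()
        fired = 0
        while self._heap and self._heap[0][0] <= now:
            _, _, callback = heapq.heappop(self._heap)
            callback()
            fired += 1
        return fired


class FakeClock:
    """Settable clock used to simulate NTP-style jumps."""

    def __init__(self, now=1000.0):
        self.now = now

    def __call__(self):
        return self.now


def demo_jump_forward_then_back():
    clock = FakeClock(1000.0)
    queue = EventQueue(clock)
    log = []

    clock.now = 4600.0                             # clock jumps one hour ahead
    queue.schedule(5, lambda: log.append("send"))  # deadline becomes 4605
    clock.now = 1000.0                             # clock is corrected back

    clock.now += 5                                 # 5 real seconds pass
    fired_after_5s = queue.run_expired()           # nothing fires: 1005 < 4605
    stall = queue.next_timeout()                   # seconds still to wait
    return fired_after_5s, stall, log
```

In this model the event that was meant to fire after 5 seconds is still an hour away once the clock has been corrected. One possible way around it, which this thread does not say Spread uses, is to pass a clock that never goes backwards, for example EventQueue(clock=time.monotonic).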
Note that neither the events nor the system time have anything to do with
message order. The order (even in agreed/total order) is determined by the
underlying protocols and is based on a global sequence number.
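
To illustrate why ordering does not depend on the clock, here is a small Python sketch of delivery driven purely by sequence numbers. It is not the Spread protocol; the class name InOrderDelivery and the choice to start numbering at 1 are assumptions made for the example.

```python
class InOrderDelivery:
    """Deliver messages strictly by sequence number, ignoring timestamps."""

    def __init__(self):
        self._next_seq = 1      # next sequence number to deliver (assumed start)
        self._pending = {}      # messages that arrived ahead of their turn
        self.delivered = []

    def receive(self, seq, payload):
        # Buffer the message, then deliver every consecutive one available.
        self._pending[seq] = payload
        while self._next_seq in self._pending:
            self.delivered.append(self._pending.pop(self._next_seq))
            self._next_seq += 1
```

A message that arrives early simply waits in the buffer until its predecessors show up, so no timestamp is ever consulted.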
Nilo
Vsevolod Vlaskin wrote:
> Hi,
>
> Good considerations. Did I understand correctly: you say
> that it was not Spread, but the OS-specific event
> scheduling?
>
> (I have not checked the Spread source code, but it
> would be strange to use absolute time or time
> durations to schedule events, if only FIFO ordering
> (or even virtual synchrony) is used, as only the order
> of the messages counts, not their timing.
>
> I guess NTP makes the time adjustments when the
> computer time goes out of sync with the time on the
> LAN. These jumps are probably of the order of milliseconds
> or seconds, but in our case that proved to be enough to
> make Spread stop delivering messages a couple of
> times. Our system runs every day, so
> probabilistically there was enough chance for us to
> hit the problem. And the effect was BAD.)
>
> Thank you very much,
>
> Vsevolod Vlaskine
>
>
>
> --- Nilo Rivera <nrivera at cs.jhu.edu> wrote:
>
>
>> Hi,
>>
>> I had similar problems with another piece of software that
>> uses the same event system.
>>
>> In general, when you need to schedule a callback function 5 seconds in
>> the future, you schedule the event at the current time + 5 seconds (look
>> at E_queue in events.c). You may have a lot of events in between, but
>> when you reach that time (based on the system time), the event system
>> will call that event before any other that was scheduled for a later time.
>>
>> If your clock jumps to some future time and stays there, the
>> problem is that a lot of events will look expired and will start
>> firing. The protocols may not behave properly.
>>
>> But the bad case happens when the clock jumps into the future and then
>> comes back to the past. In this case, you may schedule events at the
>> future time + 5 seconds. When the clock comes back to the current time,
>> it will not hit the event until it reaches the expected time. In that
>> case, your program may be stuck for quite a while.
>>
>> I avoided the problem by blocking the NTP port on my network, and allowing
>> NTP to set clocks only when I knew it was safe (when my program was not
>> running). Then again: (1) why are some NTP daemons making clock jumps
>> (my pure guess at the time was that it was setting the system time to
>> GMT and then back to EST, but I never looked at the NTP code), and (2) is
>> there any easy/pretty solution to avoid this problem in an event system?
>>
>>
>> Cheers,
>> Nilo
>>
>>
>>
>> Vsevolod Vlaskin wrote:
>>
>>> Hi,
>>>
>>> A while ago, we seemed to consistently see a similar
>>> problem in our configuration, which was only one
>>> Spread daemon with a number of clients all on the same
>>> LAN on Linux. We used just FIFO ordering for all our
>>> Spread clients.
>>>
>>> A few times, the Spread communication failed
>>> altogether: messages stopped being delivered (which
>>> was quite tragic) and we noticed that our NTP service
>>> made noticeable clock jumps at the time of failure.
>>>
>>> We posted the question on this list, but there was no
>>> reply. Maybe now there will be more response.
>>>
>>> Best regards,
>>>
>>> Vsevolod Vlaskine
>>>
>>>
>>>
>>> --- John Robinson <jr at vertica.com> wrote:
>>>
>>>
>>>
>>>> We lost our T1 connection to the world for a while
>>>> today, and I think some of our servers' clocks may
>>>> have drifted (no internal NTP source...).
>>>>
>>>> Can this cause oddities among a subnet of spread
>>>> daemons? Do they have to drop connections to their
>>>> clients for reasons related to clock drift amongst
>>>> the host machines? If so, is there some logging I
>>>> can enable to track this?
>>>>
>>>> I think I have seen similar things happen when we
>>>> try to run spread daemons on a "cluster" under
>>>> VMWare, which is known to introduce clock issues.
>>>>
>>>> thanks,
>>>> /jr
>>>>
>>>>
>>>> _______________________________________________
>>>> Spread-users mailing list
>>>> Spread-users at lists.spread.org
>>>> http://lists.spread.org/mailman/listinfo/spread-users