Basic VJ algorithm ------------------ (1) rtt = (1-alpha)*rtt + alpha*m (2) rttvar = (1-beta)*rttvar + beta*|m-rtt| (3) rto = rtt + gamma*rttvar Constants alpha,beta and gamma were chosen as 1/8,1/4 and 4. SIGCOMM paper, where this algo was formulated, is pretty obscure. Its analysis is not only unintelligible, but when applied to current situation is wrong in fact. The first (and the only) sanity test ------------------------------------ rto MUST NOT be less than the last sample. It means (after simple airthmetics) that: (4) alpha+gamma*beta > 1 See? alpha=1/8, beta=1/4 and gamma=4 satisfy this criteria. It was reason why original choice gamma=2 was deemed to fail, rationale behind choice gamma=4 by VJ is not quite right. At first sight this does not mean that small values of beta (and smoother estimator, hence) inevitably result in huge gamma. If (4) is violated, formulae (3) could be updated to: (3wrong) rto = max(rtt + gamma*rttvar, m) However this is wrong, because this formulae remembers only gamma*beta of rtt fluctuation. So, formulae (4) is MUST. The algo taken "as is", is wrong -------------------------------- It is evidently incomplete without additional adjustments. F.e. imagine that connection was perfectly smooth for some time, so that rttvar converged to zero. In this case we have rto equal to rtt and even minor fluctuation will result in false timeout. BSD does not see this effect _occasionally_, because it rounds times to 500 msec, but it also will fail f.e. when rtt=499msec. Coarse timer does _NOT_ repair the algorithm, it just hides the problem moving it to area of some selected timeout values and not very large fluctuations. We need some additional fixup or bounding rttvar from below to repair it. Minimal possible value for this bound is 50msec, which takes into account inevitable delays at receiver _host_ up (rather then in network) to 200 msec. Maximal value of bound can be pretty large, f.e. it should be at least mss/bandwidth to survive after new packet is injected to network due to cwnd growing during congestion avoidance phase. Slow start phase ---------------- The sanity test described in the section "The first (and the only) sanity test" is very weak. Actually, to adapt to cwnd increases rto must grow _faster_ than rtt does. At least twice faster. In the worst situation, when rttvar=0, we have to use: (4') alpha+gamma*beta > 2 I.e. gamma should be at least 8. It is overkill in steady state, so that I would propose to use a bit different approach, described below in the section "Timer restart". Sampling rate problem --------------------- Values of alpha and beta are frequently considered as mostly not important. It is another mistake. VJ formulated the algo for sample rate of rtt. However, Linux (and even BSD, when RTTM is used) sample each segment i.e. with rate of rtt/cwnd. It means that effectively we use alpha = 1 - (1-alpha)^cwnd. This number is _very_ large. Particularly at window of 256K and 1.5K mtu it is almost 1 i.e. _no_ smoothing is made at all. One consequence of this is that rttvar tends to become zero too fastly. This effect is rarely visible in BSD. In fact, with coarse timer all the fuss around rto is ridiculous. "Smoothing" variable, which takes ONE (yes!) value on 99% of really existing links is something, which allows to say that BSD experience gives no useful information and may be ignored as whole. 8) Timer restart ------------- Another feature of BSD timer is wrong timer restart, which effectively adds one rtt to rto. In fact, BSD uses formulae: (3bsd) rto = 2*rtt + gamma*rttvar which is surely strong overestimate for long delay links. Linux-2.2 restarts timer correctly, but problem, mentioned above, was solved by explicit multiplying rto with factor 5/4, which is better than BSD on long delay links, but it is not prefect by two reasons: A. It fails, when the result is less than 200msec. [ "lbl-cern" incident ] B. It adds rtt/4, which can harm long delay links. I propose to restart time correctly, but adjusting rto only for this case by additional gamma*rttvar term to predict slow start increase, which is equivalent to using gamma=8. Decision -------- Correct algorithm must look like: (3correct) rto = rtt + fixup where "fixup" satisfies the following conditions: a. it must not depend on rtt to avoid penalizing long delay links. b. it must adapt to slow start on bandwidth limited links. c. it must avoid failures while congestion avoidance phase on bandwidth limited links, due to injection new segment each rtt. d. it must avoid failures after bursts, generated by big acks. e. it should avoid failures after bursts, generated by application limited connections. f. it must be >200msec to avoid failures due to host latencies. g. it must be >gamma*rttvar to adapt to network fluctuations, where alpha+gamma*beta > 1. Property (a) prohibits both fixup rto/4 used by Linux earlier and BSD way, forcing rto timer to restart after each ACK. Property (b) means that alpha+gamma*beta > 2 or, alternatively, alpha+gamma*beta > 1, but additional gamma*beta term is added while timer restart. Property (c) means that "fixup" must use rtt variance with memory not less than one rtt: hence with sampling rate rtt/cwnd we must use ewma constants proportional to cwnd. In fact, Eifel algorithm satisies this property. But it looks too complicated. The way proposed by me uses another feature from Eifel: namely, different ewma constants for rttvar in the cases, when rtt grows and when it drops. Namely, we use VJ constants for the case of growth (i.e. it grows very fastly, much faster than in Eifel), but additionally smooth rttvar over rtt period. It is not very cheap, three new state variables are added, but the issue seems to be enough important to accept such penalty. Property (d) should follow from (b), because the first segment of burst will adapt rto to wait for the next one. Property (e) is partially similar to (d), except for the case, when big ack follows sequence of short ones. We cannot predict this fault of receiver, but hope that rttvar will sense erratic AVKing and to adapt in some extent. Property (f) and (g) do not require something clever. So, resulting algo is supposed to be robust like, like... I even do not know. 8)