setdiscovery: randomly pick between heads and sample when taking full sample

author | Pierre-Yves David <pierre-yves.david@fb.com> |

Wed, 07 Jan 2015 12:09:51 -0800 | |

changeset 23810 | b681d3a2bf04ad135de5b09e2918ac0da3fdc593 |

parent 23809 | 9ca2eb881b53656bf7b763dddb12a1510c5c8a2f |

child 23811 | e2b262e2ee73f4db6867ec60e98ce766e5f3d095 |

push id | 1 |

push user | gszorc@mozilla.com |

push date | Wed, 18 Mar 2015 16:34:57 +0000 |

Before this changeset, the discovery protocol was too heads-centric. Heads of the
undiscovered set were always sent for discovery, and any room remaining in the
sample was filled with exponential samples (and random ones if any room still
remained).
This behaved extremely poorly when the number of heads exceeded the sample size,
because we kept asking only about the existence of heads, then their direct
parents, and so on. As a result, the 'O(log(len(repo)))' discovery turned into an
'O(len(repo))' one. As a solution, we now take a random sample drawn from the
heads plus the exponential samples. This ensures that some exponential sampling
is achieved, restoring some of the logarithmic convergence of the discovery.
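
The difference between the two orderings can be sketched outside Mercurial as follows. This is a minimal illustration, not the code in `setdiscovery.py`: the function names `old_full_sample` and `new_full_sample`, the plain-integer node ids, and the precomputed `exponential` set are all assumptions made for the example, and the undecided set is assumed to be larger than `size`.

```python
import random


def limit_sample(sample, desired_len):
    """Return a random subset of `sample` with at most `desired_len` items."""
    if len(sample) > desired_len:
        # sorted() turns the set into a sequence, which random.sample requires
        # on recent Python versions.
        sample = set(random.sample(sorted(sample), desired_len))
    return sample


def old_full_sample(heads, exponential, undecided, size):
    """Hypothetical model of the old order: heads always win, the rest fills leftover room."""
    room = max(size - len(heads), 0)
    sample = limit_sample(set(exponential) - heads, room)
    if len(sample) < room:
        pool = sorted(undecided - sample - heads)
        sample.update(random.sample(pool, min(room - len(sample), len(pool))))
    sample.update(heads)  # heads are added last and are never trimmed
    return sample


def new_full_sample(heads, exponential, undecided, size):
    """Hypothetical model of the new order: merge heads first, then limit and top up."""
    sample = limit_sample(set(exponential) | set(heads), size)
    if len(sample) < size:
        pool = sorted(undecided - sample)
        sample.update(random.sample(pool, min(size - len(sample), len(pool))))
    return sample


# 300 heads against a sample size of 200 (made-up node ids).
heads = set(range(300))
exponential = set(range(1000, 1050))
undecided = heads | exponential | set(range(2000, 2500))

old = old_full_sample(heads, exponential, undecided, 200)
new = new_full_sample(heads, exponential, undecided, 200)
```

In this model, `old` contains nothing but heads once they outnumber the sample size, whereas `new` is capped at `size` and draws from heads and exponential samples alike.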
This patch only applies this principle in one place. More places will be updated
in future patches.
One test is impacted because the random sample happens to be different. By
chance, this helps a bit in this case.

--- a/mercurial/setdiscovery.py
+++ b/mercurial/setdiscovery.py
@@ -108,21 +108,21 @@ def _takefullsample(dag, nodes, size):
     always, sample, desiredlen = _setupsample(dag, nodes, size)
     if sample is None:
         return always
     # update from heads
     _updatesample(dag, nodes, sample, always)
     # update from roots
     _updatesample(dag.inverse(), nodes, sample, always)
     assert sample
-    sample = _limitsample(sample, desiredlen)
-    if len(sample) < desiredlen:
-        more = desiredlen - len(sample)
-        sample.update(random.sample(list(nodes - sample - always), more))
     sample.update(always)
+    sample = _limitsample(sample, size)
+    if len(sample) < size:
+        more = size - len(sample)
+        sample.update(random.sample(list(nodes - sample), more))
     return sample
 
 def _limitsample(sample, desiredlen):
     """return a random subset of sample of at most desiredlen item"""
     if len(sample) > desiredlen:
         sample = set(random.sample(sample, desiredlen))
     return sample

--- a/tests/test-setdiscovery.t
+++ b/tests/test-setdiscovery.t
@@ -321,17 +321,17 @@ One with >200 heads, which used to use u
   sampling from both directions
   searching: 5 queries
   query 5; still undecided: 740, sample size is: 200
   sampling from both directions
   searching: 6 queries
   query 6; still undecided: 540, sample size is: 200
   sampling from both directions
   searching: 7 queries
-  query 7; still undecided: 44, sample size is: 44
+  query 7; still undecided: 37, sample size is: 37
   7 total queries
   common heads: 3ee37d65064a
 
 Test actual protocol when pulling one new head in addition to common heads
 
   $ hg clone -U b c
   $ hg -R c id -ir tip
   513314ca8b3a