setdiscovery: randomly pick between heads and sample when taking full sample
author Pierre-Yves David <pierre-yves.david@fb.com>
Wed, 07 Jan 2015 12:09:51 -0800
changeset 23810 b681d3a2bf04ad135de5b09e2918ac0da3fdc593
parent 23809 9ca2eb881b53656bf7b763dddb12a1510c5c8a2f
child 23811 e2b262e2ee73f4db6867ec60e98ce766e5f3d095
push id 1
push user gszorc@mozilla.com
push date Wed, 18 Mar 2015 16:34:57 +0000
setdiscovery: randomly pick between heads and sample when taking full sample

Before this changeset, the discovery protocol was too heads-centric. Heads of the undiscovered set were always sent for discovery, and any room remaining in the sample was filled with exponential samples (and random ones if any room still remained).

This behaved extremely poorly when the number of heads exceeded the sample size, because we kept asking only about the existence of heads, then their direct parents, and so on. As a result, the 'O(log(len(repo)))' discovery turned into an 'O(len(repo))' one. As a solution, we take a random sample of the heads plus exponential samples. This ensures some exponential sampling is achieved, bringing back logarithmic convergence of the discovery.

This patch applies the principle in only one place; more places will be updated in future patches.

One test is impacted because the random sample happens to be different. By chance, it helps a bit in this case.
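The change in the patch below can be sketched as a small standalone function. This is a hypothetical illustration, not Mercurial's actual `_takefullsample` (which also walks the DAG to build exponential samples): the key move is that the combined heads-plus-exponential sample is capped at `size` by a *random* subset, rather than keeping every head unconditionally, and then topped up with random undecided nodes if room remains.

```python
import random

def take_full_sample(sample, always, nodes, size):
    """Sketch of the patched sampling strategy (hypothetical names).

    sample: heads + exponential samples gathered so far
    always: nodes that were previously kept unconditionally
    nodes:  the full undecided set
    size:   the desired sample size
    """
    sample = set(sample) | set(always)
    if len(sample) > size:
        # randomly pick between heads and exponential samples,
        # instead of letting heads crowd everything else out
        sample = set(random.sample(sorted(sample), size))
    if len(sample) < size:
        # top up with random undecided nodes
        more = size - len(sample)
        sample.update(random.sample(sorted(set(nodes) - sample), more))
    return sample
```

With more heads than `size`, the old behavior would have queried only heads each round; here every round still carries some exponential samples, which is what restores the logarithmic convergence described above.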
mercurial/setdiscovery.py
tests/test-setdiscovery.t
--- a/mercurial/setdiscovery.py
+++ b/mercurial/setdiscovery.py
@@ -108,21 +108,21 @@ def _takefullsample(dag, nodes, size):
     always, sample, desiredlen = _setupsample(dag, nodes, size)
     if sample is None:
         return always
     # update from heads
     _updatesample(dag, nodes, sample, always)
     # update from roots
     _updatesample(dag.inverse(), nodes, sample, always)
     assert sample
-    sample = _limitsample(sample, desiredlen)
-    if len(sample) < desiredlen:
-        more = desiredlen - len(sample)
-        sample.update(random.sample(list(nodes - sample - always), more))
     sample.update(always)
+    sample = _limitsample(sample, size)
+    if len(sample) < size:
+        more = size - len(sample)
+        sample.update(random.sample(list(nodes - sample), more))
     return sample
 
 def _limitsample(sample, desiredlen):
     """return a random subset of sample of at most desiredlen item"""
     if len(sample) > desiredlen:
         sample = set(random.sample(sample, desiredlen))
     return sample
 
--- a/tests/test-setdiscovery.t
+++ b/tests/test-setdiscovery.t
@@ -321,17 +321,17 @@ One with >200 heads, which used to use u
   sampling from both directions
   searching: 5 queries
   query 5; still undecided: 740, sample size is: 200
   sampling from both directions
   searching: 6 queries
   query 6; still undecided: 540, sample size is: 200
   sampling from both directions
   searching: 7 queries
-  query 7; still undecided: 44, sample size is: 44
+  query 7; still undecided: 37, sample size is: 37
   7 total queries
   common heads: 3ee37d65064a
 
 Test actual protocol when pulling one new head in addition to common heads
 
   $ hg clone -U b c
   $ hg -R c id -ir tip
   513314ca8b3a