Bug 1391476 - Add UID and GID to cache parameters; r?dustin draft
authorGregory Szorc <gps@mozilla.com>
Tue, 22 Aug 2017 16:49:26 -0700
changeset 650864 42610adf13d99029de3a81c54d96e67204c6e871
parent 650863 2e0720c9dee6a263d5610a0d88282181047f9dec
child 727517 94d7bfeef3f4dc14dcb319ef626769c8f31789d6
push id75517
push usergszorc@mozilla.com
push dateTue, 22 Aug 2017 23:53:07 +0000
reviewersdustin
bugs1391476
milestone57.0a1
Bug 1391476 - Add UID and GID to cache parameters; r?dustin The UID and GID that a task executes under is dynamic. As a result, caches need to be aware of the UID and GID that owns files otherwise subsequent tasks could run into permission denied errors. This is why `run-task --chown-recursive` exists. By recursively changing ownership of persisted files, we ensure the current task is able to read and write all existing files. When you take a step back, you realize that chowning of cached files is an expensive workaround. Yes, this results in cache hits. But the cost is you potentially have to perform hundreds of thousands of I/O system calls to mass chown. The ideal situation is that UID/GID is consistent across tasks on any given cache and potentially expensive permissions setting can be avoided. So, that's what this commit does. We add the task's UID and GID to run-task's requirements. When we first see a cache, we record a UID and GID with it and chown the empty cache directory to that UID and GID. Subsequent tasks using this cache *must* use the same UID and GID or else run-task will fail. Since run-task now guarantees that all cache consumers use the same UID and GID, we can avoid a potentially expensive recursive chown. But there is an exception. In untrusted environments (namely Try), we recursively chown existing caches if there is a uid/gid mismatch. We do this because Try is a sandbox and any random task could experiment with a non-standard uid/gid. That populated cache would "poison" the cache for the next caller. Or vice-versa. It would be annoying if caches were randomly poisoned due to Try pushes that didn't realize there was a UID/GID mismatch. We could outlaw "bad" UID and GIDs. But that makes the barrier to testing things on Try harder. So, we go with the flow and recursively chown caches in this scenario. This change will shine light on all tasks using inconsistent UID and GID values on the same cache. Bustage is anticipated. Unfortunately, we can't easily know what will break. So it will be one of those things where we will have to fix problems as they arise. Fortunately, because caches are now tied to the content of run-task, we only need to back out this change and tasks should revert to caches without UID and GID pinning requirements and everything will work again. MozReview-Commit-ID: 2ka4rOnnXIp
taskcluster/docker/recipes/run-task
taskcluster/taskgraph/transforms/task.py
--- a/taskcluster/docker/recipes/run-task
+++ b/taskcluster/docker/recipes/run-task
@@ -32,16 +32,32 @@ import urllib2
 
 FINGERPRINT_URL = 'http://taskcluster/secrets/v1/secret/project/taskcluster/gecko/hgfingerprint'
 FALLBACK_FINGERPRINT = {
     'fingerprints':
         "sha256:8e:ad:f7:6a:eb:44:06:15:ed:f3:e4:69:a6:64:60:37:2d:ff:98:88:37"
         ":bf:d7:b8:40:84:01:48:9c:26:ce:d9"}
 
 
+CACHE_UID_GID_MISMATCH = '''
+There is a UID/GID mismatch on the cache. This likely means:
+
+a) different tasks are running as a different user/group
+b) different Docker images have different UID/GID for the same user/group
+
+Our cache policy is that the UID/GID for ALL tasks must be consistent
+for the lifetime of the cache. This eliminates permissions problems due
+to file/directory user/group ownership.
+
+To make this error go away, ensure that all Docker images are use
+a consistent UID/GID and that all tasks using this cache are running as
+the same user/group.
+'''
+
+
 def print_line(prefix, m):
     now = datetime.datetime.utcnow()
     print(b'[%s %sZ] %s' % (prefix, now.isoformat(), m), end=b'')
 
 
 def run_and_prefix_output(prefix, args, extra_env=None):
     """Runs a process and prefixes its output with the time.
 
@@ -267,38 +283,49 @@ def main(args):
     # principle of "fail fast" to ensure optimal task execution.
 
     if 'TASKCLUSTER_CACHES' in os.environ:
         caches = os.environ['TASKCLUSTER_CACHES'].split(';')
         del os.environ['TASKCLUSTER_CACHES']
     else:
         caches = []
 
+    if 'TASKCLUSTER_UNTRUSTED_CACHES' in os.environ:
+        untrusted_caches = True
+        del os.environ['TASKCLUSTER_UNTRUSTED_CACHES']
+    else:
+        untrusted_caches = False
+
     our_requirements = {
         # Include a version string that we can bump whenever to trigger
         # fresh caches. The actual value is not relevant and doesn't need
         # to follow any explicit order. Since taskgraph bakes this file's
         # hash into cache names, any change to this file/version is sufficient
         # to force the use of a new cache.
         b'version=1',
+        # Include the UID and GID the task will run as to ensure that tasks
+        # with different UID and GID don't share the same cache.
+        b'uid=%d' % uid,
+        b'gid=%d' % gid,
     }
 
     for cache in caches:
         if not os.path.isdir(cache):
             print('cache %s is not a directory; this should never happen')
             return 1
 
         requires_path = os.path.join(cache, '.cacherequires')
 
-        # The cache is empty. No validation necessary. Just set up our
-        # requirements file.
+        # The cache is empty. Configure it.
         if not os.listdir(cache):
             print_line(b'cache', b'cache %s is empty; writing requirements: '
                        b'%s\n' % (cache, b' '.join(sorted(our_requirements))))
 
+            # We write a requirements file so future invocations know what the
+            # requirements are.
             with open(requires_path, 'wb') as fh:
                 fh.write(b'\n'.join(sorted(our_requirements)))
 
             # And make it read-only as a precaution against deletion.
             os.chmod(requires_path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
 
             set_dir_permissions(cache, uid, gid)
 
@@ -307,26 +334,50 @@ def main(args):
         elif os.path.exists(requires_path):
             with open(requires_path, 'rb') as fh:
                 wanted_requirements = set(fh.read().splitlines())
 
             print_line(b'cache', b'cache %s exists; requirements: %s\n' % (
                        cache, b' '.join(sorted(wanted_requirements))))
 
             missing = wanted_requirements - our_requirements
-            if missing:
+
+            # Allow requirements mismatch for uid/gid if and only if caches
+            # are untrusted. This allows cache behavior on Try to be
+            # reasonable. Otherwise, random tasks could "poison" cache
+            # usability by introducing uid/gid mismatches. For untrusted
+            # environments like Try, this is a perfectly reasonable thing to
+            # allow.
+            if missing and untrusted_caches and running_as_root and \
+                    all(s.startswith(('uid=', 'gid=')) for s in missing):
+                print_line(b'cache',
+                           b'cache %s uid/gid mismatch; this is acceptable '
+                           b'because caches for this task are untrusted; '
+                           b'changing ownership to facilitate cache use\n' %
+                           cache)
+                chown_recursive(cache, user.pw_name, group.gr_name, uid, gid)
+
+                # And write out the updated reality.
+                with open(requires_path, 'wb') as fh:
+                    fh.write(b'\n'.join(sorted(our_requirements)))
+
+            elif missing:
                 print('requirements for populated cache %s differ from '
                       'this task' % cache)
                 print('cache requirements: %s' % ' '.join(sorted(
                     wanted_requirements)))
                 print('our requirements:   %s' % ' '.join(sorted(
                     our_requirements)))
+                if any(s.startswith(('uid=', 'gid=')) for s in missing):
+                    print(CACHE_UID_GID_MISMATCH)
                 return 1
 
-            chown_recursive(cache, user.pw_name, group.gr_name, uid, gid)
+            # We don't need to adjust permissions here because the cache is
+            # associated with a uid/gid and the first task should have set
+            # a proper owner/group.
 
         # The cache has content and no requirements file. This shouldn't
         # happen because run-task should be the first thing that touches a
         # cache.
         else:
             print('cache %s is not empty and is missing a .cacherequires '
                   'file; the cache names for this task are likely '
                   'mis-configured')
--- a/taskcluster/taskgraph/transforms/task.py
+++ b/taskcluster/taskgraph/transforms/task.py
@@ -728,16 +728,19 @@ def build_docker_worker_payload(config, 
 
         payload['cache'] = caches
 
     # And send down volumes information to run-task as well.
     if run_task and worker.get('volumes'):
         payload['env']['TASKCLUSTER_VOLUMES'] = ';'.join(
             sorted(worker['volumes']))
 
+    if payload.get('cache') and skip_untrusted:
+        payload['env']['TASKCLUSTER_UNTRUSTED_CACHES'] = '1'
+
     if features:
         payload['features'] = features
     if capabilities:
         payload['capabilities'] = capabilities
 
     # coalesce / superseding
     if 'coalesce-name' in task and level > 1:
         key = COALESCE_KEY.format(