Bug 1394420 - jit-generate atomic ops to be called from c++. r=nbp, r=froydnj
☠☠ backed out by c311812c530c ☠☠
author Lars T Hansen <lhansen@mozilla.com>
Tue, 21 Aug 2018 21:13:12 +0200
changeset 514669 2f5be1913934665cf692d3e43cadbc36f5448643
parent 514622 c486d86fd49fc2079255f8c96d94ec8c1fa2950a
child 514670 b2ffeeac7326d673ff09474ce3007e84687c5a62
push id 1953
push user ffxbld-merge
push date Mon, 11 Mar 2019 12:10:20 +0000
reviewers nbp, froydnj
bugs 1394420
milestone 66.0a1
SpiderMonkey (and eventually the DOM) will sometimes access shared memory from multiple threads without synchronization; this is a natural consequence of the JS memory model and the JS/DOM specs. We have always had a hardware-specific abstraction layer for these accesses, to isolate the rest of the code from the details of how unsynchronized / racy access is handled. This layer has been written in C++ and has several problems:

- In C++, racy access is undefined behavior, so the abstraction layer is inherently unsafe, especially in the presence of inlining, PGO, and clever compilers. (And TSAN will start complaining, too.)

- Some of the compiler intrinsics used in the C++ abstraction layer are not actually the right primitives: they assume C++ (that is, non-racy) semantics and may not implement the correct barriers in all cases.

- There are few guarantees that the synchronization implemented by the C++ primitives is actually compatible with the synchronization used by jitted code.

- While x86 and ARM have 8-byte synchronized access (CMPXCHG8B and LDREXD/STREXD), some C++ compilers support their use poorly or not at all, leading to occasional hardship for porting teams.

This patch solves all of these problems by jit-generating the racy-access abstraction layer as C++-callable functions that: do not trigger UB in the C++ code; do not depend on possibly-incorrect intrinsics but instead always emit the proper barriers; are guaranteed to be compatible with jitted code; and properly support 8-byte operations on 32-bit x86.

Mostly this code is straightforward: each access function is a short, nearly prologue- and epilogue-less sequence of instructions that performs a plain load or store or an appropriately synchronized operation (CMPXCHG or similar). Safe-for-races memcpy and memmove are trickier; they are handled by combining some C++ driver code with several jit-generated functions that perform unrolled copies for various block sizes and alignments.

The performance story is not completely satisfactory. On the one hand, we regress nothing: when copying unshared-to-unshared memory we do not use the new primitives but rely on the C++ compiler's optimized memcpy and standard loads and stores. On the other hand, performance with shared memory is lower than with unshared memory. TypedArray.prototype.set() is a good test case. When the source and target arrays have the same element type, the engine uses a memcpy; copying shared memory is then 3x slower than copying unshared memory for 100,000 8K copies (Uint8). However, when the source and target arrays have slightly different element types (Uint8 vs Int8) the engine uses individual loads and stores, which for shared memory turn into two calls per byte moved; in this case shared memory is 127x slower than unshared memory. (All numbers on x64 Linux.)

Can we live with the very significant slowdown in the latter case? It depends on the applications we envision for shared memory. Primarily, shared memory will be used as wasm heap memory, in which case most applications that need to move data will use Uint8Array throughout and the slowdown is acceptable. But it is clearly a kind of performance cliff. We can reduce the overhead by jit-generating more code, specifically code that performs the load, convert, and store for common type pairs. More interestingly, and more simply, we can probably use memcpy in all cases by copying first (fairly fast) and then running a local fixup pass. A bug should be filed for this, but IMO we are OK with the current solution.
(Memcpy can also be further sped up in platform-specific ways by generating cleverer code that uses REP MOVS or SIMD or similar.)
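
For orientation, here is a minimal caller-side sketch (not part of this patch) of how engine C++ code reaches the new layer once AtomicOperations::Initialize() has installed the jit-generated entry points at startup. The helper function and the counter are hypothetical; the AtomicOperations call is one of the public entry points declared in js/src/jit/AtomicOperations.h.

#include <stdint.h>

#include "jit/AtomicOperations.h"

using js::jit::AtomicOperations;

// Hypothetical helper: bump a counter that may live in shared memory and be
// updated concurrently by other threads.  With this patch the seq-cst
// operation below dispatches to a jit-generated stub on x86, x64, ARM, and
// ARM64 rather than to a C++ compiler intrinsic.
static uint32_t BumpSharedCounter(uint32_t* counter) {
  uint32_t before = AtomicOperations::fetchAddSeqCst(counter, uint32_t(1));
  return before + 1;
}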
js/src/jit-test/tests/atomics/memcpy-fidelity.js
js/src/jit/AtomicOperations.h
js/src/jit/MacroAssembler.h
js/src/jit/arm/AtomicOperations-arm.h
js/src/jit/arm/MacroAssembler-arm.cpp
js/src/jit/arm64/AtomicOperations-arm64-gcc.h
js/src/jit/arm64/AtomicOperations-arm64-msvc.h
js/src/jit/arm64/MacroAssembler-arm64.cpp
js/src/jit/mips-shared/AtomicOperations-mips-shared.h
js/src/jit/moz.build
js/src/jit/none/AtomicOperations-feeling-lucky.h
js/src/jit/shared/AtomicOperations-shared-jit.cpp
js/src/jit/shared/AtomicOperations-shared-jit.h
js/src/jit/x64/MacroAssembler-x64.cpp
js/src/jit/x86-shared/Assembler-x86-shared.h
js/src/jit/x86-shared/AtomicOperations-x86-shared-gcc.h
js/src/jit/x86-shared/AtomicOperations-x86-shared-msvc.h
js/src/vm/Initialization.cpp
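
The diff below adds AtomicOperations::Initialize() and ShutDown() and wires them into engine startup via js/src/vm/Initialization.cpp. As a minimal sketch of the intended lifecycle (the wrapper functions here are hypothetical; only the Initialize/ShutDown signatures come from the patch):

#include "jit/AtomicOperations.h"

// Hypothetical wrappers illustrating the contract: the code segment holding
// the jit-generated atomics is created once, before any shared-memory access,
// and released once at shutdown.  On platforms that do not generate code,
// Initialize() simply returns true.
static bool StartAtomics() {
  return js::jit::AtomicOperations::Initialize();
}

static void StopAtomics() {
  js::jit::AtomicOperations::ShutDown();
}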
new file mode 100644
--- /dev/null
+++ b/js/src/jit-test/tests/atomics/memcpy-fidelity.js
@@ -0,0 +1,181 @@
+// In order not to run afoul of C++ UB we have our own non-C++ definitions of
+// operations (they are actually jitted) that can operate racily on shared
+// memory, see jit/shared/AtomicOperations-shared-jit.cpp.
+//
+// Operations on fixed-width 1, 2, 4, and 8 byte data are adequately tested
+// elsewhere.  Here we specifically test our safe-when-racy replacements of
+// memcpy and memmove.
+//
+// There are two primitives in the engine, memcpy_down and memcpy_up.  These are
+// equivalent except when data overlap, in which case memcpy_down handles
+// overlapping copies that move from higher to lower addresses and memcpy_up
+// handles ditto from lower to higher.  memcpy uses memcpy_down always while
+// memmove selects the one to use dynamically based on its arguments.
+
+// Basic memcpy algorithm to be tested:
+//
+// - if src and target have the same alignment
+//   - byte copy up to word alignment
+//   - block copy as much as possible
+//   - word copy as much as possible
+//   - byte copy any tail
+// - else if on a platform that can deal with unaligned access
+//   (ie, x86, ARM64, and ARM if the proper flag is set)
+//   - block copy as much as possible
+//   - word copy as much as possible
+//   - byte copy any tail
+// - else // on a platform that can't deal with unaligned access
+//   (ie ARM without the flag or x86 DEBUG builds with the
+//   JS_NO_UNALIGNED_MEMCPY env var)
+//   - block copy with byte copies
+//   - word copy with byte copies
+//   - byte copy any tail
+
+var target_buf = new SharedArrayBuffer(1024);
+var src_buf = new SharedArrayBuffer(1024);
+
+///////////////////////////////////////////////////////////////////////////
+//
+// Different src and target buffer, this is memcpy "move down".  The same
+// code is used in the engine for overlapping buffers when target addresses
+// are lower than source addresses.
+
+fill(src_buf);
+
+// Basic 1K perfectly aligned copy, copies blocks only.
+{
+    let target = new Uint8Array(target_buf);
+    let src = new Uint8Array(src_buf);
+    clear(target_buf);
+    target.set(src);
+    check(target_buf, 0, 1024, 0);
+}
+
+// Buffers are equally aligned but not on a word boundary and not ending on a
+// word boundary either, so this will copy first some bytes, then some blocks,
+// then some words, and then some bytes.
+{
+    let fill = 0x79;
+    clear(target_buf, fill);
+    let target = new Uint8Array(target_buf, 1, 1022);
+    let src = new Uint8Array(src_buf, 1, 1022);
+    target.set(src);
+    check_fill(target_buf, 0, 1, fill);
+    check(target_buf, 1, 1023, 1);
+    check_fill(target_buf, 1023, 1024, fill);
+}
+
+// Buffers are unequally aligned, we'll copy bytes only on some platforms and
+// unaligned blocks/words on others.
+{
+    clear(target_buf);
+    let target = new Uint8Array(target_buf, 0, 1023);
+    let src = new Uint8Array(src_buf, 1);
+    target.set(src);
+    check(target_buf, 0, 1023, 1);
+    check_zero(target_buf, 1023, 1024);
+}
+
+///////////////////////////////////////////////////////////////////////////
+//
+// Overlapping src and target buffer and the target addresses are always
+// higher than the source addresses, this is memcpy "move up"
+
+// Buffers are equally aligned but not on a word boundary and not ending on a
+// word boundary either, so this will copy first some bytes, then some blocks,
+// then some words, and then some bytes.
+{
+    fill(target_buf);
+    let target = new Uint8Array(target_buf, 9, 999);
+    let src = new Uint8Array(target_buf, 1, 999);
+    target.set(src);
+    check(target_buf, 9, 1008, 1);
+    check(target_buf, 1008, 1024, 1008 & 255);
+}
+
+// Buffers are unequally aligned, we'll copy bytes only on some platforms and
+// unaligned blocks/words on others.
+{
+    fill(target_buf);
+    let target = new Uint8Array(target_buf, 2, 1022);
+    let src = new Uint8Array(target_buf, 1, 1022);
+    target.set(src);
+    check(target_buf, 2, 1024, 1);
+}
+
+///////////////////////////////////////////////////////////////////////////
+//
+// Copy 0 to 127 bytes from and to a variety of addresses to check that we
+// handle limits properly in these edge cases.
+
+// Too slow in debug-noopt builds but we don't want to flag the test as slow,
+// since that means it'll never be run.
+
+if (this.getBuildConfiguration && !getBuildConfiguration().debug)
+{
+    let t = new Uint8Array(target_buf);
+    for (let my_src_buf of [src_buf, target_buf]) {
+        for (let size=0; size < 127; size++) {
+            for (let src_offs=0; src_offs < 8; src_offs++) {
+                for (let target_offs=0; target_offs < 8; target_offs++) {
+                    clear(target_buf, Math.random()*255);
+                    let target = new Uint8Array(target_buf, target_offs, size);
+
+                    // Zero is boring
+                    let bias = (Math.random() * 100 % 12) | 0;
+
+                    // Note src may overlap target partially
+                    let src = new Uint8Array(my_src_buf, src_offs, size);
+                    for ( let i=0; i < size; i++ )
+                        src[i] = i+bias;
+
+                    // We expect these values to be unchanged by the copy
+                    let below = target_offs > 0 ? t[target_offs - 1] : 0;
+                    let above = t[target_offs + size];
+
+                    // Copy
+                    target.set(src);
+
+                    // Verify
+                    check(target_buf, target_offs, target_offs + size, bias);
+                    if (target_offs > 0)
+                        assertEq(t[target_offs-1], below);
+                    assertEq(t[target_offs+size], above);
+                }
+            }
+        }
+    }
+}
+
+
+// Utilities
+
+function clear(buf, fill) {
+    let a = new Uint8Array(buf);
+    for ( let i=0; i < a.length; i++ )
+        a[i] = fill;
+}
+
+function fill(buf) {
+    let a = new Uint8Array(buf);
+    for ( let i=0; i < a.length; i++ )
+        a[i] = i & 255
+}
+
+function check(buf, from, to, startingWith) {
+    let a = new Uint8Array(buf);
+    for ( let i=from; i < to; i++ ) {
+        assertEq(a[i], startingWith);
+        startingWith = (startingWith + 1) & 255;
+    }
+}
+
+function check_zero(buf, from, to) {
+    check_fill(buf, from, to, 0);
+}
+
+function check_fill(buf, from, to, fill) {
+    let a = new Uint8Array(buf);
+    for ( let i=from; i < to; i++ )
+        assertEq(a[i], fill);
+}
--- a/js/src/jit/AtomicOperations.h
+++ b/js/src/jit/AtomicOperations.h
@@ -142,16 +142,23 @@ class AtomicOperations {
   static inline void memcpySafeWhenRacy(void* dest, const void* src,
                                         size_t nbytes);
 
   // Replacement for memmove().  No access-atomicity guarantees.
   static inline void memmoveSafeWhenRacy(void* dest, const void* src,
                                          size_t nbytes);
 
  public:
+  // On some platforms we generate code for the atomics at run-time; that
+  // happens here.
+  static bool Initialize();
+
+  // Deallocate the code segment for generated atomics functions.
+  static void ShutDown();
+
   // Test lock-freedom for any int32 value.  This implements the
   // Atomics::isLockFree() operation in the ECMAScript Shared Memory and
   // Atomics specification, as follows:
   //
   // 4-byte accesses are always lock free (in the spec).
   // 1- and 2-byte accesses are always lock free (in SpiderMonkey).
   //
   // Lock-freedom for 8 bytes is determined by the platform's isLockfree8().
@@ -350,31 +357,37 @@ inline bool AtomicOperations::isLockfree
 #if defined(JS_SIMULATOR_MIPS32)
 #  if defined(__clang__) || defined(__GNUC__)
 #    include "jit/mips-shared/AtomicOperations-mips-shared.h"
 #  else
 #    error "No AtomicOperations support for this platform+compiler combination"
 #  endif
 #elif defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || \
     defined(_M_IX86)
-#  if defined(__clang__) || defined(__GNUC__)
+#  if defined(JS_CODEGEN_X86) || defined(JS_CODEGEN_X64)
+#    include "jit/shared/AtomicOperations-shared-jit.h"
+#  elif defined(__clang__) || defined(__GNUC__)
 #    include "jit/x86-shared/AtomicOperations-x86-shared-gcc.h"
 #  elif defined(_MSC_VER)
 #    include "jit/x86-shared/AtomicOperations-x86-shared-msvc.h"
 #  else
 #    error "No AtomicOperations support for this platform+compiler combination"
 #  endif
 #elif defined(__arm__)
-#  if defined(__clang__) || defined(__GNUC__)
+#  if defined(JS_CODEGEN_ARM)
+#    include "jit/shared/AtomicOperations-shared-jit.h"
+#  elif defined(__clang__) || defined(__GNUC__)
 #    include "jit/arm/AtomicOperations-arm.h"
 #  else
 #    error "No AtomicOperations support for this platform+compiler combination"
 #  endif
 #elif defined(__aarch64__) || defined(_M_ARM64)
-#  if defined(__clang__) || defined(__GNUC__)
+#  if defined(JS_CODEGEN_ARM64)
+#    include "jit/shared/AtomicOperations-shared-jit.h"
+#  elif defined(__clang__) || defined(__GNUC__)
 #    include "jit/arm64/AtomicOperations-arm64-gcc.h"
 #  elif defined(_MSC_VER)
 #    include "jit/arm64/AtomicOperations-arm64-msvc.h"
 #  else
 #    error "No AtomicOperations support for this platform+compiler combination"
 #  endif
 #elif defined(__mips__)
 #  if defined(__clang__) || defined(__GNUC__)
--- a/js/src/jit/MacroAssembler.h
+++ b/js/src/jit/MacroAssembler.h
@@ -972,23 +972,23 @@ class MacroAssembler : public MacroAssem
   inline void maxFloat32(FloatRegister other, FloatRegister srcDest,
                          bool handleNaN) PER_SHARED_ARCH;
   inline void maxDouble(FloatRegister other, FloatRegister srcDest,
                         bool handleNaN) PER_SHARED_ARCH;
 
   // ===============================================================
   // Shift functions
 
-  // For shift-by-register there may be platform-specific
-  // variations, for example, x86 will perform the shift mod 32 but
-  // ARM will perform the shift mod 256.
+  // For shift-by-register there may be platform-specific variations, for
+  // example, x86 will perform the shift mod 32 but ARM will perform the shift
+  // mod 256.
   //
-  // For shift-by-immediate the platform assembler may restrict the
-  // immediate, for example, the ARM assembler requires the count
-  // for 32-bit shifts to be in the range [0,31].
+  // For shift-by-immediate the platform assembler may restrict the immediate,
+  // for example, the ARM assembler requires the count for 32-bit shifts to be
+  // in the range [0,31].
 
   inline void lshift32(Imm32 shift, Register srcDest) PER_SHARED_ARCH;
   inline void rshift32(Imm32 shift, Register srcDest) PER_SHARED_ARCH;
   inline void rshift32Arithmetic(Imm32 shift, Register srcDest) PER_SHARED_ARCH;
 
   inline void lshiftPtr(Imm32 imm, Register dest) PER_ARCH;
   inline void rshiftPtr(Imm32 imm, Register dest) PER_ARCH;
   inline void rshiftPtr(Imm32 imm, Register src, Register dest)
@@ -1942,16 +1942,24 @@ class MacroAssembler : public MacroAssem
       DEFINED_ON(mips_shared);
 
   void compareExchange(Scalar::Type type, const Synchronization& sync,
                        const BaseIndex& mem, Register expected,
                        Register replacement, Register valueTemp,
                        Register offsetTemp, Register maskTemp, Register output)
       DEFINED_ON(mips_shared);
 
+  // x64: `output` must be rax.
+  // ARM: Registers must be distinct; `replacement` and `output` must be
+  // (even,odd) pairs.
+
+  void compareExchange64(const Synchronization& sync, const Address& mem,
+                         Register64 expected, Register64 replacement,
+                         Register64 output) DEFINED_ON(arm, arm64, x64);
+
   // Exchange with memory.  Return the value initially in memory.
   // MIPS: `valueTemp`, `offsetTemp` and `maskTemp` must be defined for 8-bit
   // and 16-bit wide operations.
 
   void atomicExchange(Scalar::Type type, const Synchronization& sync,
                       const Address& mem, Register value, Register output)
       DEFINED_ON(arm, arm64, x86_shared);
 
@@ -1964,16 +1972,20 @@ class MacroAssembler : public MacroAssem
                       Register offsetTemp, Register maskTemp, Register output)
       DEFINED_ON(mips_shared);
 
   void atomicExchange(Scalar::Type type, const Synchronization& sync,
                       const BaseIndex& mem, Register value, Register valueTemp,
                       Register offsetTemp, Register maskTemp, Register output)
       DEFINED_ON(mips_shared);
 
+  void atomicExchange64(const Synchronization& sync, const Address& mem,
+                        Register64 value, Register64 output)
+      DEFINED_ON(arm64, x64);
+
   // Read-modify-write with memory.  Return the value in memory before the
   // operation.
   //
   // x86-shared:
   //   For 8-bit operations, `value` and `output` must have a byte subregister.
   //   For Add and Sub, `temp` must be invalid.
   //   For And, Or, and Xor, `output` must be eax and `temp` must have a byte
   //   subregister.
@@ -2005,16 +2017,25 @@ class MacroAssembler : public MacroAssem
                      Register valueTemp, Register offsetTemp, Register maskTemp,
                      Register output) DEFINED_ON(mips_shared);
 
   void atomicFetchOp(Scalar::Type type, const Synchronization& sync,
                      AtomicOp op, Register value, const BaseIndex& mem,
                      Register valueTemp, Register offsetTemp, Register maskTemp,
                      Register output) DEFINED_ON(mips_shared);
 
+  // x64:
+  //   For Add and Sub, `temp` must be invalid.
+  //   For And, Or, and Xor, `output` must be eax and `temp` must have a byte
+  //   subregister.
+
+  void atomicFetchOp64(const Synchronization& sync, AtomicOp op,
+                       Register64 value, const Address& mem, Register64 temp,
+                       Register64 output) DEFINED_ON(arm64, x64);
+
   // ========================================================================
   // Wasm atomic operations.
   //
   // Constraints, when omitted, are exactly as for the primitive operations
   // above.
 
   void wasmCompareExchange(const wasm::MemoryAccessDesc& access,
                            const Address& mem, Register expected,
@@ -2128,48 +2149,53 @@ class MacroAssembler : public MacroAssem
   void wasmAtomicLoad64(const wasm::MemoryAccessDesc& access,
                         const Address& mem, Register64 temp, Register64 output)
       DEFINED_ON(arm, mips32, x86);
 
   void wasmAtomicLoad64(const wasm::MemoryAccessDesc& access,
                         const BaseIndex& mem, Register64 temp,
                         Register64 output) DEFINED_ON(arm, mips32, x86);
 
-  // x86: `expected` must be the same as `output`, and must be edx:eax
-  // x86: `replacement` must be ecx:ebx
+  // x86: `expected` must be the same as `output`, and must be edx:eax.
+  // x86: `replacement` must be ecx:ebx.
   // x64: `output` must be rax.
   // ARM: Registers must be distinct; `replacement` and `output` must be
-  // (even,odd) pairs. MIPS: Registers must be distinct.
+  // (even,odd) pairs.
+  // ARM64: The base register in `mem` must not overlap `output`.
+  // MIPS: Registers must be distinct.
 
   void wasmCompareExchange64(const wasm::MemoryAccessDesc& access,
                              const Address& mem, Register64 expected,
                              Register64 replacement,
                              Register64 output) PER_ARCH;
 
   void wasmCompareExchange64(const wasm::MemoryAccessDesc& access,
                              const BaseIndex& mem, Register64 expected,
                              Register64 replacement,
                              Register64 output) PER_ARCH;
 
   // x86: `value` must be ecx:ebx; `output` must be edx:eax.
   // ARM: Registers must be distinct; `value` and `output` must be (even,odd)
-  // pairs. MIPS: Registers must be distinct.
+  // pairs.
+  // MIPS: Registers must be distinct.
 
   void wasmAtomicExchange64(const wasm::MemoryAccessDesc& access,
                             const Address& mem, Register64 value,
                             Register64 output) PER_ARCH;
 
   void wasmAtomicExchange64(const wasm::MemoryAccessDesc& access,
                             const BaseIndex& mem, Register64 value,
                             Register64 output) PER_ARCH;
 
   // x86: `output` must be edx:eax, `temp` must be ecx:ebx.
   // x64: For And, Or, and Xor `output` must be rax.
   // ARM: Registers must be distinct; `temp` and `output` must be (even,odd)
-  // pairs. MIPS: Registers must be distinct. MIPS32: `temp` should be invalid.
+  // pairs.
+  // MIPS: Registers must be distinct.
+  // MIPS32: `temp` should be invalid.
 
   void wasmAtomicFetchOp64(const wasm::MemoryAccessDesc& access, AtomicOp op,
                            Register64 value, const Address& mem,
                            Register64 temp, Register64 output)
       DEFINED_ON(arm, arm64, mips32, mips64, x64);
 
   void wasmAtomicFetchOp64(const wasm::MemoryAccessDesc& access, AtomicOp op,
                            Register64 value, const BaseIndex& mem,
--- a/js/src/jit/arm/AtomicOperations-arm.h
+++ b/js/src/jit/arm/AtomicOperations-arm.h
@@ -27,16 +27,25 @@
 // instruction reorderings (effectively those allowed by TSO) even for seq_cst
 // ordered operations, but these reorderings are not allowed by JS.  To do
 // better we will end up with inline assembler or JIT-generated code.
 
 #if !defined(__clang__) && !defined(__GNUC__)
 #  error "This file only for gcc-compatible compilers"
 #endif
 
+inline bool js::jit::AtomicOperations::Initialize() {
+  // Nothing
+  return true;
+}
+
+inline void js::jit::AtomicOperations::ShutDown() {
+  // Nothing
+}
+
 inline bool js::jit::AtomicOperations::hasAtomic8() {
   // This guard is really only for tier-2 and tier-3 systems: LDREXD and
   // STREXD have been available since ARMv6K, and only ARMv7 and later are
   // tier-1.
   return HasLDSTREXBHD();
 }
 
 inline bool js::jit::AtomicOperations::isLockfree8() {
--- a/js/src/jit/arm/MacroAssembler-arm.cpp
+++ b/js/src/jit/arm/MacroAssembler-arm.cpp
@@ -5285,69 +5285,80 @@ void MacroAssembler::wasmAtomicLoad64(co
 
 void MacroAssembler::wasmAtomicLoad64(const wasm::MemoryAccessDesc& access,
                                       const BaseIndex& mem, Register64 temp,
                                       Register64 output) {
   WasmAtomicLoad64(*this, access, mem, temp, output);
 }
 
 template <typename T>
-static void WasmCompareExchange64(MacroAssembler& masm,
-                                  const wasm::MemoryAccessDesc& access,
-                                  const T& mem, Register64 expect,
-                                  Register64 replace, Register64 output) {
+static void CompareExchange64(MacroAssembler& masm,
+                              const wasm::MemoryAccessDesc* access,
+                              const Synchronization& sync, const T& mem,
+                              Register64 expect, Register64 replace,
+                              Register64 output) {
   MOZ_ASSERT(expect != replace && replace != output && output != expect);
 
   MOZ_ASSERT((replace.low.code() & 1) == 0);
   MOZ_ASSERT(replace.low.code() + 1 == replace.high.code());
 
   MOZ_ASSERT((output.low.code() & 1) == 0);
   MOZ_ASSERT(output.low.code() + 1 == output.high.code());
 
   Label again;
   Label done;
 
   SecondScratchRegisterScope scratch2(masm);
   Register ptr = ComputePointerForAtomic(masm, mem, scratch2);
 
-  masm.memoryBarrierBefore(access.sync());
+  masm.memoryBarrierBefore(sync);
 
   masm.bind(&again);
   BufferOffset load = masm.as_ldrexd(output.low, output.high, ptr);
-  masm.append(access, load.getOffset());
+  if (access) {
+    masm.append(*access, load.getOffset());
+  }
 
   masm.as_cmp(output.low, O2Reg(expect.low));
   masm.as_cmp(output.high, O2Reg(expect.high), MacroAssembler::Equal);
   masm.as_b(&done, MacroAssembler::NotEqual);
 
   ScratchRegisterScope scratch(masm);
 
   // Rd (temp) must differ from the two other arguments to strex.
   masm.as_strexd(scratch, replace.low, replace.high, ptr);
   masm.as_cmp(scratch, Imm8(1));
   masm.as_b(&again, MacroAssembler::Equal);
   masm.bind(&done);
 
-  masm.memoryBarrierAfter(access.sync());
+  masm.memoryBarrierAfter(sync);
 }
 
 void MacroAssembler::wasmCompareExchange64(const wasm::MemoryAccessDesc& access,
                                            const Address& mem,
                                            Register64 expect,
                                            Register64 replace,
                                            Register64 output) {
-  WasmCompareExchange64(*this, access, mem, expect, replace, output);
+  CompareExchange64(*this, &access, access.sync(), mem, expect, replace,
+                    output);
 }
 
 void MacroAssembler::wasmCompareExchange64(const wasm::MemoryAccessDesc& access,
                                            const BaseIndex& mem,
                                            Register64 expect,
                                            Register64 replace,
                                            Register64 output) {
-  WasmCompareExchange64(*this, access, mem, expect, replace, output);
+  CompareExchange64(*this, &access, access.sync(), mem, expect, replace,
+                    output);
+}
+
+void MacroAssembler::compareExchange64(const Synchronization& sync,
+                                       const Address& mem, Register64 expect,
+                                       Register64 replace, Register64 output) {
+  CompareExchange64(*this, nullptr, sync, mem, expect, replace, output);
 }
 
 template <typename T>
 static void WasmAtomicExchange64(MacroAssembler& masm,
                                  const wasm::MemoryAccessDesc& access,
                                  const T& mem, Register64 value,
                                  Register64 output) {
   MOZ_ASSERT(output != value);
--- a/js/src/jit/arm64/AtomicOperations-arm64-gcc.h
+++ b/js/src/jit/arm64/AtomicOperations-arm64-gcc.h
@@ -13,16 +13,25 @@
 #include "mozilla/Types.h"
 
 #include "vm/ArrayBufferObject.h"
 
 #if !defined(__clang__) && !defined(__GNUC__)
 #  error "This file only for gcc-compatible compilers"
 #endif
 
+inline bool js::jit::AtomicOperations::Initialize() {
+  // Nothing
+  return true;
+}
+
+inline void js::jit::AtomicOperations::ShutDown() {
+  // Nothing
+}
+
 inline bool js::jit::AtomicOperations::hasAtomic8() { return true; }
 
 inline bool js::jit::AtomicOperations::isLockfree8() {
   MOZ_ASSERT(__atomic_always_lock_free(sizeof(int8_t), 0));
   MOZ_ASSERT(__atomic_always_lock_free(sizeof(int16_t), 0));
   MOZ_ASSERT(__atomic_always_lock_free(sizeof(int32_t), 0));
   MOZ_ASSERT(__atomic_always_lock_free(sizeof(int64_t), 0));
   return true;
--- a/js/src/jit/arm64/AtomicOperations-arm64-msvc.h
+++ b/js/src/jit/arm64/AtomicOperations-arm64-msvc.h
@@ -32,16 +32,25 @@
 // using those functions in many cases here (though not all).  I have not done
 // so because (a) I don't yet know how far back those functions are supported
 // and (b) I expect we'll end up dropping into assembler here eventually so as
 // to guarantee that the C++ compiler won't optimize the code.
 
 // Note, _InterlockedCompareExchange takes the *new* value as the second
 // argument and the *comparand* (expected old value) as the third argument.
 
+inline bool js::jit::AtomicOperations::Initialize() {
+  // Nothing
+  return true;
+}
+
+inline void js::jit::AtomicOperations::ShutDown() {
+  // Nothing
+}
+
 inline bool js::jit::AtomicOperations::hasAtomic8() { return true; }
 
 inline bool js::jit::AtomicOperations::isLockfree8() {
   // The MSDN docs suggest very strongly that if code is compiled for Pentium
   // or better the 64-bit primitives will be lock-free, see eg the "Remarks"
   // secion of the page for _InterlockedCompareExchange64, currently here:
   // https://msdn.microsoft.com/en-us/library/ttk2z1ws%28v=vs.85%29.aspx
   //
--- a/js/src/jit/arm64/MacroAssembler-arm64.cpp
+++ b/js/src/jit/arm64/MacroAssembler-arm64.cpp
@@ -1599,16 +1599,18 @@ static void CompareExchange(MacroAssembl
   Label again;
   Label done;
 
   vixl::UseScratchRegisterScope temps(&masm);
 
   Register scratch2 = temps.AcquireX().asUnsized();
   MemOperand ptr = ComputePointerForAtomic(masm, mem, scratch2);
 
+  MOZ_ASSERT(ptr.base().asUnsized() != output);
+
   masm.memoryBarrierBefore(sync);
 
   Register scratch = temps.AcquireX().asUnsized();
 
   masm.bind(&again);
   SignOrZeroExtend(masm, type, targetWidth, oldval, scratch);
   LoadExclusive(masm, access, type, targetWidth, ptr, output);
   masm.Cmp(R(output, targetWidth), R(scratch, targetWidth));
@@ -1702,16 +1704,37 @@ void MacroAssembler::compareExchange(Sca
 void MacroAssembler::compareExchange(Scalar::Type type,
                                      const Synchronization& sync,
                                      const BaseIndex& mem, Register oldval,
                                      Register newval, Register output) {
   CompareExchange(*this, nullptr, type, Width::_32, sync, mem, oldval, newval,
                   output);
 }
 
+void MacroAssembler::compareExchange64(const Synchronization& sync,
+                                       const Address& mem, Register64 expect,
+                                       Register64 replace, Register64 output) {
+  CompareExchange(*this, nullptr, Scalar::Int64, Width::_64, sync, mem,
+                  expect.reg, replace.reg, output.reg);
+}
+
+void MacroAssembler::atomicExchange64(const Synchronization& sync,
+                                      const Address& mem, Register64 value,
+                                      Register64 output) {
+  AtomicExchange(*this, nullptr, Scalar::Int64, Width::_64, sync, mem,
+                 value.reg, output.reg);
+}
+
+void MacroAssembler::atomicFetchOp64(const Synchronization& sync, AtomicOp op,
+                                     Register64 value, const Address& mem,
+                                     Register64 temp, Register64 output) {
+  AtomicFetchOp<true>(*this, nullptr, Scalar::Int64, Width::_64, sync, op, mem,
+                      value.reg, temp.reg, output.reg);
+}
+
 void MacroAssembler::wasmCompareExchange(const wasm::MemoryAccessDesc& access,
                                          const Address& mem, Register oldval,
                                          Register newval, Register output) {
   CompareExchange(*this, &access, access.type(), Width::_32, access.sync(), mem,
                   oldval, newval, output);
 }
 
 void MacroAssembler::wasmCompareExchange(const wasm::MemoryAccessDesc& access,
--- a/js/src/jit/mips-shared/AtomicOperations-mips-shared.h
+++ b/js/src/jit/mips-shared/AtomicOperations-mips-shared.h
@@ -56,16 +56,25 @@ struct MOZ_RAII AddressGuard {
   ~AddressGuard() { gAtomic64Lock.release(); }
 };
 
 #endif
 
 }  // namespace jit
 }  // namespace js
 
+inline bool js::jit::AtomicOperations::Initialize() {
+  // Nothing
+  return true;
+}
+
+inline void js::jit::AtomicOperations::ShutDown() {
+  // Nothing
+}
+
 inline bool js::jit::AtomicOperations::hasAtomic8() { return true; }
 
 inline bool js::jit::AtomicOperations::isLockfree8() {
   MOZ_ASSERT(__atomic_always_lock_free(sizeof(int8_t), 0));
   MOZ_ASSERT(__atomic_always_lock_free(sizeof(int16_t), 0));
   MOZ_ASSERT(__atomic_always_lock_free(sizeof(int32_t), 0));
 #if defined(JS_64BIT)
   MOZ_ASSERT(__atomic_always_lock_free(sizeof(int64_t), 0));
--- a/js/src/jit/moz.build
+++ b/js/src/jit/moz.build
@@ -107,16 +107,17 @@ UNIFIED_SOURCES += [
 if not CONFIG['ENABLE_ION']:
     LOpcodesGenerated.inputs += ['none/LIR-none.h']
     UNIFIED_SOURCES += [
         'none/Trampoline-none.cpp'
     ]
 elif CONFIG['JS_CODEGEN_X86'] or CONFIG['JS_CODEGEN_X64']:
     LOpcodesGenerated.inputs += ['x86-shared/LIR-x86-shared.h']
     UNIFIED_SOURCES += [
+        'shared/AtomicOperations-shared-jit.cpp',
         'x86-shared/Architecture-x86-shared.cpp',
         'x86-shared/Assembler-x86-shared.cpp',
         'x86-shared/AssemblerBuffer-x86-shared.cpp',
         'x86-shared/CodeGenerator-x86-shared.cpp',
         'x86-shared/Lowering-x86-shared.cpp',
         'x86-shared/MacroAssembler-x86-shared-SIMD.cpp',
         'x86-shared/MacroAssembler-x86-shared.cpp',
         'x86-shared/MoveEmitter-x86-shared.cpp',
@@ -149,16 +150,17 @@ elif CONFIG['JS_CODEGEN_ARM']:
         'arm/Bailouts-arm.cpp',
         'arm/CodeGenerator-arm.cpp',
         'arm/disasm/Constants-arm.cpp',
         'arm/disasm/Disasm-arm.cpp',
         'arm/Lowering-arm.cpp',
         'arm/MacroAssembler-arm.cpp',
         'arm/MoveEmitter-arm.cpp',
         'arm/Trampoline-arm.cpp',
+        'shared/AtomicOperations-shared-jit.cpp',
     ]
     if CONFIG['JS_SIMULATOR_ARM']:
         UNIFIED_SOURCES += [
             'arm/Simulator-arm.cpp'
         ]
     elif CONFIG['OS_ARCH'] == 'Darwin':
         SOURCES += [
             'arm/llvm-compiler-rt/arm/aeabi_idivmod.S',
@@ -180,17 +182,18 @@ elif CONFIG['JS_CODEGEN_ARM64']:
         'arm64/vixl/Decoder-vixl.cpp',
         'arm64/vixl/Disasm-vixl.cpp',
         'arm64/vixl/Instructions-vixl.cpp',
         'arm64/vixl/Instrument-vixl.cpp',
         'arm64/vixl/MacroAssembler-vixl.cpp',
         'arm64/vixl/MozAssembler-vixl.cpp',
         'arm64/vixl/MozCpu-vixl.cpp',
         'arm64/vixl/MozInstructions-vixl.cpp',
-        'arm64/vixl/Utils-vixl.cpp'
+        'arm64/vixl/Utils-vixl.cpp',
+        'shared/AtomicOperations-shared-jit.cpp',
     ]
     if CONFIG['JS_SIMULATOR_ARM64']:
         UNIFIED_SOURCES += [
             'arm64/vixl/Debugger-vixl.cpp',
             'arm64/vixl/Logic-vixl.cpp',
             'arm64/vixl/MozSimulator-vixl.cpp',
             'arm64/vixl/Simulator-vixl.cpp'
         ]
--- a/js/src/jit/none/AtomicOperations-feeling-lucky.h
+++ b/js/src/jit/none/AtomicOperations-feeling-lucky.h
@@ -95,16 +95,25 @@
 // Sanity check.
 
 #if defined(HAS_64BIT_LOCKFREE) && !defined(HAS_64BIT_ATOMICS)
 #  error "This combination of features is senseless, please fix"
 #endif
 
 // Try to avoid platform #ifdefs below this point.
 
+inline bool js::jit::AtomicOperations::Initialize() {
+  // Nothing
+  return true;
+}
+
+inline void js::jit::AtomicOperations::ShutDown() {
+  // Nothing
+}
+
 #ifdef GNUC_COMPATIBLE
 
 inline bool js::jit::AtomicOperations::hasAtomic8() {
 #  if defined(HAS_64BIT_ATOMICS)
   return true;
 #  else
   return false;
 #  endif
new file mode 100644
--- /dev/null
+++ b/js/src/jit/shared/AtomicOperations-shared-jit.cpp
@@ -0,0 +1,1018 @@
+/* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*-
+ * vim: set ts=8 sts=4 et sw=4 tw=99:
+ * This Source Code Form is subject to the terms of the Mozilla Public
+ * License, v. 2.0. If a copy of the MPL was not distributed with this
+ * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
+
+#include "mozilla/Atomics.h"
+
+#ifdef JS_CODEGEN_ARM
+#  include "jit/arm/Architecture-arm.h"
+#endif
+#include "jit/AtomicOperations.h"
+#include "jit/IonTypes.h"
+#include "jit/MacroAssembler.h"
+#include "jit/RegisterSets.h"
+
+#include "jit/MacroAssembler-inl.h"
+
+using namespace js;
+using namespace js::jit;
+
+// Assigned registers must follow these rules:
+//
+//  - if they overlap the argument registers (for arguments we use) then they
+//
+//                     M   M   U   U   SSSS  TTTTT
+//          ====\      MM MM   U   U  S        T      /====
+//          =====>     M M M   U   U   SSS     T     <=====
+//          ====/      M   M   U   U      S    T      \====
+//                     M   M    UUU   SSSS     T
+//
+//    require no register movement, even for 64-bit registers.  (If this becomes
+//    too complex to handle then we need to create an abstraction that uses the
+//    MoveResolver, see comments on bug 1394420.)
+//
+//  - they should be volatile when possible so that we don't have to save and
+//    restore them.
+//
+// Note that the functions we're generating have a very limited number of
+// signatures, and the register assignments need only work for these signatures.
+// The signatures are these:
+//
+//   ()
+//   (ptr)
+//   (ptr, val/val64)
+//   (ptr, ptr)
+//   (ptr, val/val64, val/val64)
+//
+// It would be nice to avoid saving and restoring all the nonvolatile registers
+// for all the operations, and instead save and restore only the registers used
+// by each specific operation, but the amount of protocol needed to accomplish
+// that probably does not pay for itself.
+
+#if defined(JS_CODEGEN_X64)
+
+// Selected registers match the argument registers exactly, and none of them
+// overlap the result register.
+
+static const LiveRegisterSet AtomicNonVolatileRegs;
+
+static constexpr Register AtomicPtrReg = IntArgReg0;
+static constexpr Register AtomicPtr2Reg = IntArgReg1;
+static constexpr Register AtomicValReg = IntArgReg1;
+static constexpr Register64 AtomicValReg64(IntArgReg1);
+static constexpr Register AtomicVal2Reg = IntArgReg2;
+static constexpr Register64 AtomicVal2Reg64(IntArgReg2);
+static constexpr Register AtomicTemp = IntArgReg3;
+static constexpr Register64 AtomicTemp64(IntArgReg3);
+
+#elif defined(JS_CODEGEN_ARM64)
+
+// Selected registers match the argument registers, except that the Ptr is not
+// in IntArgReg0 so as not to conflict with the result register.
+
+static const LiveRegisterSet AtomicNonVolatileRegs;
+
+static constexpr Register AtomicPtrReg = IntArgReg4;
+static constexpr Register AtomicPtr2Reg = IntArgReg1;
+static constexpr Register AtomicValReg = IntArgReg1;
+static constexpr Register64 AtomicValReg64(IntArgReg1);
+static constexpr Register AtomicVal2Reg = IntArgReg2;
+static constexpr Register64 AtomicVal2Reg64(IntArgReg2);
+static constexpr Register AtomicTemp = IntArgReg3;
+static constexpr Register64 AtomicTemp64(IntArgReg3);
+
+#elif defined(JS_CODEGEN_ARM)
+
+// Assigned registers except temp are disjoint from the argument registers,
+// since accounting for both 32-bit and 64-bit arguments and constraints on the
+// result register is much too messy.  The temp is in an argument register since
+// it won't be used until we've moved all arguments to other registers.
+
+static const LiveRegisterSet AtomicNonVolatileRegs =
+  LiveRegisterSet(GeneralRegisterSet((uint32_t(1) << Registers::r4) |
+                                     (uint32_t(1) << Registers::r5) |
+                                     (uint32_t(1) << Registers::r6) |
+                                     (uint32_t(1) << Registers::r7) |
+                                     (uint32_t(1) << Registers::r8)),
+                  FloatRegisterSet(0));
+
+static constexpr Register AtomicPtrReg = r8;
+static constexpr Register AtomicPtr2Reg = r6;
+static constexpr Register AtomicTemp = r3;
+static constexpr Register AtomicValReg = r6;
+static constexpr Register64 AtomicValReg64(r7, r6);
+static constexpr Register AtomicVal2Reg = r4;
+static constexpr Register64 AtomicVal2Reg64(r5, r4);
+
+#elif defined(JS_CODEGEN_X86)
+
+// There are no argument registers.
+
+static const LiveRegisterSet AtomicNonVolatileRegs =
+  LiveRegisterSet(GeneralRegisterSet((1 << X86Encoding::rbx) |
+                                     (1 << X86Encoding::rsi)),
+                  FloatRegisterSet(0));
+
+static constexpr Register AtomicPtrReg = esi;
+static constexpr Register AtomicPtr2Reg = ebx;
+static constexpr Register AtomicValReg = ebx;
+static constexpr Register AtomicVal2Reg = ecx;
+static constexpr Register AtomicTemp = edx;
+
+// 64-bit registers for cmpxchg8b.  ValReg/Val2Reg/Temp are not used in this
+// case.
+
+static constexpr Register64 AtomicValReg64(edx, eax);
+static constexpr Register64 AtomicVal2Reg64(ecx, ebx);
+
+#else
+#  error "Not implemented - not a tier1 platform"
+#endif
+
+// These are useful shorthands and hide the meaningless uint/int distinction.
+
+static constexpr Scalar::Type SIZE8 = Scalar::Uint8;
+static constexpr Scalar::Type SIZE16 = Scalar::Uint16;
+static constexpr Scalar::Type SIZE32 = Scalar::Uint32;
+static constexpr Scalar::Type SIZE64 = Scalar::Int64;
+#ifdef JS_64BIT
+static constexpr Scalar::Type SIZEWORD = SIZE64;
+#else
+static constexpr Scalar::Type SIZEWORD = SIZE32;
+#endif
+
+// A "block" is a sequence of bytes that is a reasonable quantum to copy to
+// amortize call overhead when implementing memcpy and memmove.  A block will
+// not fit in registers on all platforms and copying it without using
+// intermediate memory will therefore be sensitive to overlap.
+//
+// A "word" is an item that we can copy using only register intermediate storage
+// on all platforms; words can be individually copied without worrying about
+// overlap.
+//
+// Blocks and words can be aligned or unaligned; specific (generated) copying
+// functions handle this in platform-specific ways.
+
+static constexpr size_t WORDSIZE = sizeof(uintptr_t); // Also see SIZEWORD above
+static constexpr size_t BLOCKSIZE = 8 * WORDSIZE;     // Must be a power of 2
+
+static_assert(BLOCKSIZE % WORDSIZE == 0, "A block is an integral number of words");
+
+static constexpr size_t WORDMASK = WORDSIZE - 1;
+static constexpr size_t BLOCKMASK = BLOCKSIZE - 1;
+
+struct ArgIterator
+{
+    ABIArgGenerator abi;
+    unsigned argBase = 0;
+};
+
+static void GenGprArg(MacroAssembler& masm, MIRType t, ArgIterator* iter,
+                      Register reg) {
+  MOZ_ASSERT(t == MIRType::Pointer || t == MIRType::Int32);
+  ABIArg arg = iter->abi.next(t);
+  switch (arg.kind()) {
+    case ABIArg::GPR: {
+      if (arg.gpr() != reg) {
+        masm.movePtr(arg.gpr(), reg);
+      }
+      break;
+    }
+    case ABIArg::Stack: {
+      Address src(masm.getStackPointer(),
+                  iter->argBase + arg.offsetFromArgBase());
+      masm.loadPtr(src, reg);
+      break;
+    }
+    default: {
+      MOZ_CRASH("Not possible");
+    }
+  }
+}
+
+static void GenGpr64Arg(MacroAssembler& masm, ArgIterator* iter,
+                        Register64 reg) {
+  ABIArg arg = iter->abi.next(MIRType::Int64);
+  switch (arg.kind()) {
+    case ABIArg::GPR: {
+      if (arg.gpr64() != reg) {
+        masm.move64(arg.gpr64(), reg);
+      }
+      break;
+    }
+    case ABIArg::Stack: {
+      Address src(masm.getStackPointer(),
+                  iter->argBase + arg.offsetFromArgBase());
+#ifdef JS_64BIT
+      masm.load64(src, reg);
+#else
+      masm.load32(LowWord(src), reg.low);
+      masm.load32(HighWord(src), reg.high);
+#endif
+      break;
+    }
+#if defined(JS_CODEGEN_REGISTER_PAIR)
+    case ABIArg::GPR_PAIR: {
+      if (arg.gpr64() != reg) {
+        masm.move32(arg.oddGpr(), reg.high);
+        masm.move32(arg.evenGpr(), reg.low);
+      }
+      break;
+    }
+#endif
+    default: {
+      MOZ_CRASH("Not possible");
+    }
+  }
+}
+
+static uint32_t GenPrologue(MacroAssembler& masm, ArgIterator* iter) {
+  masm.assumeUnreachable("Shouldn't get here");
+  masm.flushBuffer();
+  masm.haltingAlign(CodeAlignment);
+  masm.setFramePushed(0);
+  uint32_t start = masm.currentOffset();
+  masm.PushRegsInMask(AtomicNonVolatileRegs);
+  iter->argBase = sizeof(void*) + masm.framePushed();
+  return start;
+}
+
+static void GenEpilogue(MacroAssembler& masm) {
+  masm.PopRegsInMask(AtomicNonVolatileRegs);
+  MOZ_ASSERT(masm.framePushed() == 0);
+#if defined(JS_CODEGEN_ARM64)
+  masm.Ret();
+#elif defined(JS_CODEGEN_ARM)
+  masm.mov(lr, pc);
+#else
+  masm.ret();
+#endif
+}
+
+#ifndef JS_64BIT
+static uint32_t GenNop(MacroAssembler& masm) {
+  ArgIterator iter;
+  uint32_t start = GenPrologue(masm, &iter);
+  GenEpilogue(masm);
+  return start;
+}
+#endif
+
+static uint32_t GenFenceSeqCst(MacroAssembler& masm) {
+  ArgIterator iter;
+  uint32_t start = GenPrologue(masm, &iter);
+  masm.memoryBarrier(MembarFull);
+  GenEpilogue(masm);
+  return start;
+}
+
+static uint32_t GenLoad(MacroAssembler& masm, Scalar::Type size,
+                        Synchronization sync) {
+  ArgIterator iter;
+  uint32_t start = GenPrologue(masm, &iter);
+  GenGprArg(masm, MIRType::Pointer, &iter, AtomicPtrReg);
+
+  masm.memoryBarrier(sync.barrierBefore);
+  Address addr(AtomicPtrReg, 0);
+  switch (size) {
+    case SIZE8:
+      masm.load8ZeroExtend(addr, ReturnReg);
+      break;
+    case SIZE16:
+      masm.load16ZeroExtend(addr, ReturnReg);
+      break;
+    case SIZE32:
+      masm.load32(addr, ReturnReg);
+      break;
+    case SIZE64:
+#if defined(JS_64BIT)
+      masm.load64(addr, ReturnReg64);
+      break;
+#else
+      MOZ_CRASH("64-bit atomic load not available on this platform");
+#endif
+    default:
+      MOZ_CRASH("Unknown size");
+  }
+  masm.memoryBarrier(sync.barrierAfter);
+
+  GenEpilogue(masm);
+  return start;
+}
+
+static uint32_t GenStore(MacroAssembler& masm, Scalar::Type size,
+                         Synchronization sync) {
+  ArgIterator iter;
+  uint32_t start = GenPrologue(masm, &iter);
+  GenGprArg(masm, MIRType::Pointer, &iter, AtomicPtrReg);
+
+  masm.memoryBarrier(sync.barrierBefore);
+  Address addr(AtomicPtrReg, 0);
+  switch (size) {
+    case SIZE8:
+      GenGprArg(masm, MIRType::Int32, &iter, AtomicValReg);
+      masm.store8(AtomicValReg, addr);
+      break;
+    case SIZE16:
+      GenGprArg(masm, MIRType::Int32, &iter, AtomicValReg);
+      masm.store16(AtomicValReg, addr);
+      break;
+    case SIZE32:
+      GenGprArg(masm, MIRType::Int32, &iter, AtomicValReg);
+      masm.store32(AtomicValReg, addr);
+      break;
+    case SIZE64:
+#if defined(JS_64BIT)
+      GenGpr64Arg(masm, &iter, AtomicValReg64);
+      masm.store64(AtomicValReg64, addr);
+      break;
+#else
+      MOZ_CRASH("64-bit atomic store not available on this platform");
+#endif
+    default:
+      MOZ_CRASH("Unknown size");
+  }
+  masm.memoryBarrier(sync.barrierAfter);
+
+  GenEpilogue(masm);
+  return start;
+}
+
+enum class CopyDir {
+  DOWN,                       // Move data down, ie, iterate toward higher addresses
+  UP                          // The other way
+};
+
+static uint32_t GenCopy(MacroAssembler& masm, Scalar::Type size,
+                        uint32_t unroll, CopyDir direction) {
+  ArgIterator iter;
+  uint32_t start = GenPrologue(masm, &iter);
+
+  Register dest = AtomicPtrReg;
+  Register src = AtomicPtr2Reg;
+
+  GenGprArg(masm, MIRType::Pointer, &iter, dest);
+  GenGprArg(masm, MIRType::Pointer, &iter, src);
+
+  uint32_t offset = direction == CopyDir::DOWN ? 0 : unroll-1;
+  for (uint32_t i = 0; i < unroll; i++) {
+    switch (size) {
+      case SIZE8:
+        masm.load8ZeroExtend(Address(src, offset), AtomicTemp);
+        masm.store8(AtomicTemp, Address(dest, offset));
+        break;
+      case SIZE16:
+        masm.load16ZeroExtend(Address(src, offset*2), AtomicTemp);
+        masm.store16(AtomicTemp, Address(dest, offset*2));
+        break;
+      case SIZE32:
+        masm.load32(Address(src, offset*4), AtomicTemp);
+        masm.store32(AtomicTemp, Address(dest, offset*4));
+        break;
+      case SIZE64:
+#if defined(JS_64BIT)
+        masm.load64(Address(src, offset*8), AtomicTemp64);
+        masm.store64(AtomicTemp64, Address(dest, offset*8));
+        break;
+#else
+        MOZ_CRASH("64-bit atomic load/store not available on this platform");
+#endif
+      default:
+        MOZ_CRASH("Unknown size");
+    }
+    offset += direction == CopyDir::DOWN ? 1 : -1;
+  }
+
+  GenEpilogue(masm);
+  return start;
+}
+
+static uint32_t GenCmpxchg(MacroAssembler& masm, Scalar::Type size,
+                           Synchronization sync) {
+  ArgIterator iter;
+  uint32_t start = GenPrologue(masm, &iter);
+  GenGprArg(masm, MIRType::Pointer, &iter, AtomicPtrReg);
+
+  Address addr(AtomicPtrReg, 0);
+  switch (size) {
+    case SIZE8:
+    case SIZE16:
+    case SIZE32:
+      GenGprArg(masm, MIRType::Int32, &iter, AtomicValReg);
+      GenGprArg(masm, MIRType::Int32, &iter, AtomicVal2Reg);
+      masm.compareExchange(size, sync, addr, AtomicValReg, AtomicVal2Reg, ReturnReg);
+      break;
+    case SIZE64:
+      GenGpr64Arg(masm, &iter, AtomicValReg64);
+      GenGpr64Arg(masm, &iter, AtomicVal2Reg64);
+#if defined(JS_CODEGEN_X86)
+      MOZ_ASSERT(AtomicValReg64 == Register64(edx, eax));
+      MOZ_ASSERT(AtomicVal2Reg64 == Register64(ecx, ebx));
+      masm.lock_cmpxchg8b(edx, eax, ecx, ebx, Operand(addr));
+
+      MOZ_ASSERT(ReturnReg64 == Register64(edi, eax));
+      masm.mov(edx, edi);
+#else
+      masm.compareExchange64(sync, addr, AtomicValReg64, AtomicVal2Reg64, ReturnReg64);
+#endif
+      break;
+    default:
+      MOZ_CRASH("Unknown size");
+  }
+
+  GenEpilogue(masm);
+  return start;
+}
+
+static uint32_t GenExchange(MacroAssembler& masm, Scalar::Type size,
+                            Synchronization sync) {
+  ArgIterator iter;
+  uint32_t start = GenPrologue(masm, &iter);
+  GenGprArg(masm, MIRType::Pointer, &iter, AtomicPtrReg);
+
+  Address addr(AtomicPtrReg, 0);
+  switch (size) {
+    case SIZE8:
+    case SIZE16:
+    case SIZE32:
+      GenGprArg(masm, MIRType::Int32, &iter, AtomicValReg);
+      masm.atomicExchange(size, sync, addr, AtomicValReg, ReturnReg);
+      break;
+    case SIZE64:
+#if defined(JS_64BIT)
+      GenGpr64Arg(masm, &iter, AtomicValReg64);
+      masm.atomicExchange64(sync, addr, AtomicValReg64, ReturnReg64);
+      break;
+#else
+      MOZ_CRASH("64-bit atomic exchange not available on this platform");
+#endif
+    default:
+      MOZ_CRASH("Unknown size");
+  }
+
+  GenEpilogue(masm);
+  return start;
+}
+
+static uint32_t
+GenFetchOp(MacroAssembler& masm, Scalar::Type size, AtomicOp op,
+           Synchronization sync) {
+  ArgIterator iter;
+  uint32_t start = GenPrologue(masm, &iter);
+  GenGprArg(masm, MIRType::Pointer, &iter, AtomicPtrReg);
+
+  Address addr(AtomicPtrReg, 0);
+  switch (size) {
+    case SIZE8:
+    case SIZE16:
+    case SIZE32: {
+#if defined(JS_CODEGEN_X86) || defined(JS_CODEGEN_X64)
+      Register tmp = op == AtomicFetchAddOp || op == AtomicFetchSubOp
+        ? Register::Invalid()
+        : AtomicTemp;
+#else
+      Register tmp = AtomicTemp;
+#endif
+      GenGprArg(masm, MIRType::Int32, &iter, AtomicValReg);
+      masm.atomicFetchOp(size, sync, op, AtomicValReg, addr, tmp, ReturnReg);
+      break;
+    }
+    case SIZE64: {
+#if defined(JS_64BIT)
+#  if defined(JS_CODEGEN_X64)
+      Register64 tmp = op == AtomicFetchAddOp || op == AtomicFetchSubOp
+        ? Register64::Invalid()
+        : AtomicTemp64;
+#  else
+      Register64 tmp = AtomicTemp64;
+#  endif
+      GenGpr64Arg(masm, &iter, AtomicValReg64);
+      masm.atomicFetchOp64(sync, op, AtomicValReg64, addr, tmp, ReturnReg64);
+      break;
+#else
+      MOZ_CRASH("64-bit atomic fetchOp not available on this platform");
+#endif
+    }
+    default:
+      MOZ_CRASH("Unknown size");
+  }
+
+  GenEpilogue(masm);
+  return start;
+}
+
+namespace js {
+namespace jit {
+
+void (*AtomicFenceSeqCst)();
+
+#ifndef JS_64BIT
+void (*AtomicCompilerFence)();
+#endif
+
+uint8_t (*AtomicLoad8SeqCst)(const uint8_t* addr);
+uint16_t (*AtomicLoad16SeqCst)(const uint16_t* addr);
+uint32_t (*AtomicLoad32SeqCst)(const uint32_t* addr);
+#ifdef JS_64BIT
+uint64_t (*AtomicLoad64SeqCst)(const uint64_t* addr);
+#endif
+
+uint8_t (*AtomicLoad8Unsynchronized)(const uint8_t* addr);
+uint16_t (*AtomicLoad16Unsynchronized)(const uint16_t* addr);
+uint32_t (*AtomicLoad32Unsynchronized)(const uint32_t* addr);
+#ifdef JS_64BIT
+uint64_t (*AtomicLoad64Unsynchronized)(const uint64_t* addr);
+#endif
+
+uint8_t (*AtomicStore8SeqCst)(uint8_t* addr, uint8_t val);
+uint16_t (*AtomicStore16SeqCst)(uint16_t* addr, uint16_t val);
+uint32_t (*AtomicStore32SeqCst)(uint32_t* addr, uint32_t val);
+#ifdef JS_64BIT
+uint64_t (*AtomicStore64SeqCst)(uint64_t* addr, uint64_t val);
+#endif
+
+uint8_t (*AtomicStore8Unsynchronized)(uint8_t* addr, uint8_t val);
+uint16_t (*AtomicStore16Unsynchronized)(uint16_t* addr, uint16_t val);
+uint32_t (*AtomicStore32Unsynchronized)(uint32_t* addr, uint32_t val);
+#ifdef JS_64BIT
+uint64_t (*AtomicStore64Unsynchronized)(uint64_t* addr, uint64_t val);
+#endif
+
+// See the definitions of BLOCKSIZE and WORDSIZE earlier.  The "unaligned"
+// functions perform individual byte copies (and must always be "down" or "up").
+// The others ignore alignment issues, and thus either depend on unaligned
+// accesses being OK or not being invoked on unaligned addresses.
+//
+// src and dest point to the lower addresses of the respective data areas
+// irrespective of "up" or "down".
+
+static void (*AtomicCopyUnalignedBlockDownUnsynchronized)(uint8_t* dest, const uint8_t* src);
+static void (*AtomicCopyUnalignedBlockUpUnsynchronized)(uint8_t* dest, const uint8_t* src);
+static void (*AtomicCopyUnalignedWordDownUnsynchronized)(uint8_t* dest, const uint8_t* src);
+static void (*AtomicCopyUnalignedWordUpUnsynchronized)(uint8_t* dest, const uint8_t* src);
+
+static void (*AtomicCopyBlockDownUnsynchronized)(uint8_t* dest, const uint8_t* src);
+static void (*AtomicCopyBlockUpUnsynchronized)(uint8_t* dest, const uint8_t* src);
+static void (*AtomicCopyWordUnsynchronized)(uint8_t* dest, const uint8_t* src);
+static void (*AtomicCopyByteUnsynchronized)(uint8_t* dest, const uint8_t* src);
+
+uint8_t (*AtomicCmpXchg8SeqCst)(uint8_t* addr, uint8_t oldval, uint8_t newval);
+uint16_t (*AtomicCmpXchg16SeqCst)(uint16_t* addr, uint16_t oldval, uint16_t newval);
+uint32_t (*AtomicCmpXchg32SeqCst)(uint32_t* addr, uint32_t oldval, uint32_t newval);
+uint64_t (*AtomicCmpXchg64SeqCst)(uint64_t* addr, uint64_t oldval, uint64_t newval);
+
+uint8_t (*AtomicExchange8SeqCst)(uint8_t* addr, uint8_t val);
+uint16_t (*AtomicExchange16SeqCst)(uint16_t* addr, uint16_t val);
+uint32_t (*AtomicExchange32SeqCst)(uint32_t* addr, uint32_t val);
+#ifdef JS_64BIT
+uint64_t (*AtomicExchange64SeqCst)(uint64_t* addr, uint64_t val);
+#endif
+
+uint8_t (*AtomicAdd8SeqCst)(uint8_t* addr, uint8_t val);
+uint16_t (*AtomicAdd16SeqCst)(uint16_t* addr, uint16_t val);
+uint32_t (*AtomicAdd32SeqCst)(uint32_t* addr, uint32_t val);
+#ifdef JS_64BIT
+uint64_t (*AtomicAdd64SeqCst)(uint64_t* addr, uint64_t val);
+#endif
+
+uint8_t (*AtomicAnd8SeqCst)(uint8_t* addr, uint8_t val);
+uint16_t (*AtomicAnd16SeqCst)(uint16_t* addr, uint16_t val);
+uint32_t (*AtomicAnd32SeqCst)(uint32_t* addr, uint32_t val);
+#ifdef JS_64BIT
+uint64_t (*AtomicAnd64SeqCst)(uint64_t* addr, uint64_t val);
+#endif
+
+uint8_t (*AtomicOr8SeqCst)(uint8_t* addr, uint8_t val);
+uint16_t (*AtomicOr16SeqCst)(uint16_t* addr, uint16_t val);
+uint32_t (*AtomicOr32SeqCst)(uint32_t* addr, uint32_t val);
+#ifdef JS_64BIT
+uint64_t (*AtomicOr64SeqCst)(uint64_t* addr, uint64_t val);
+#endif
+
+uint8_t (*AtomicXor8SeqCst)(uint8_t* addr, uint8_t val);
+uint16_t (*AtomicXor16SeqCst)(uint16_t* addr, uint16_t val);
+uint32_t (*AtomicXor32SeqCst)(uint32_t* addr, uint32_t val);
+#ifdef JS_64BIT
+uint64_t (*AtomicXor64SeqCst)(uint64_t* addr, uint64_t val);
+#endif
+
+static bool UnalignedAccessesAreOK() {
+#ifdef DEBUG
+  const char* flag = getenv("JS_NO_UNALIGNED_MEMCPY");
+  if (flag && *flag == '1')
+    return false;
+#endif
+#if defined(JS_CODEGEN_X86) || defined(JS_CODEGEN_X64)
+  return true;
+#elif defined(JS_CODEGEN_ARM)
+  return !HasAlignmentFault();
+#elif defined(JS_CODEGEN_ARM64)
+  // This is not necessarily true but it's the best guess right now.
+  return true;
+#else
+  return false;
+#endif
+}
+
+void AtomicMemcpyDownUnsynchronized(uint8_t* dest, const uint8_t* src,
+                                    size_t nbytes) {
+  const uint8_t* lim = src + nbytes;
+
+  // Set up bulk copying.  The cases are ordered on the assumption that
+  // aligned copying, even at the cost of a little byte-wise preprocessing,
+  // beats unaligned copying on platforms that support unaligned accesses.
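+  //
+  // The test below checks whether dest and src are mutually word-aligned,
+  // i.e. have the same offset within a word.  For example (assuming WORDSIZE
+  // is 8), addresses ending in 0x13 and 0x2B both have offset 3, so three
+  // leading byte copies align both pointers at once; if the offsets differ,
+  // no prefix of byte copies can align both.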
+
+  if (nbytes >= WORDSIZE) {
+    void (*copyBlock)(uint8_t* dest, const uint8_t* src);
+    void (*copyWord)(uint8_t* dest, const uint8_t* src);
+
+    if (((uintptr_t(dest) ^ uintptr_t(src)) & WORDMASK) == 0) {
+      const uint8_t* cutoff = (const uint8_t*)JS_ROUNDUP(uintptr_t(src),
+                                                         WORDSIZE);
+      MOZ_ASSERT(cutoff <= lim); // because nbytes >= WORDSIZE
+      while (src < cutoff) {
+        AtomicCopyByteUnsynchronized(dest++, src++);
+      }
+      copyBlock = AtomicCopyBlockDownUnsynchronized;
+      copyWord = AtomicCopyWordUnsynchronized;
+    } else if (UnalignedAccessesAreOK()) {
+      copyBlock = AtomicCopyBlockDownUnsynchronized;
+      copyWord = AtomicCopyWordUnsynchronized;
+    } else {
+      copyBlock = AtomicCopyUnalignedBlockDownUnsynchronized;
+      copyWord = AtomicCopyUnalignedWordDownUnsynchronized;
+    }
+
+    // Bulk copy, first larger blocks and then individual words.
+
+    const uint8_t* blocklim = src + ((lim - src) & ~BLOCKMASK);
+    while (src < blocklim) {
+      copyBlock(dest, src);
+      dest += BLOCKSIZE;
+      src += BLOCKSIZE;
+    }
+
+    const uint8_t* wordlim = src + ((lim - src) & ~WORDMASK);
+    while (src < wordlim) {
+      copyWord(dest, src);
+      dest += WORDSIZE;
+      src += WORDSIZE;
+    }
+  }
+
+  // Byte copy any remaining tail.
+
+  while (src < lim) {
+    AtomicCopyByteUnsynchronized(dest++, src++);
+  }
+}
+
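+// The "up" copy mirrors the "down" copy: src and dest still point to the low
+// addresses of their areas, but copying proceeds from the high addresses
+// toward the low ones, so overlapping moves with dest above src are safe.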
+void AtomicMemcpyUpUnsynchronized(uint8_t* dest, const uint8_t* src,
+                                  size_t nbytes) {
+  const uint8_t* lim = src;
+
+  src += nbytes;
+  dest += nbytes;
+
+  if (nbytes >= WORDSIZE) {
+    void (*copyBlock)(uint8_t* dest, const uint8_t* src);
+    void (*copyWord)(uint8_t* dest, const uint8_t* src);
+
+    if (((uintptr_t(dest) ^ uintptr_t(src)) & WORDMASK) == 0) {
+      const uint8_t* cutoff = (const uint8_t*)(uintptr_t(src) & ~WORDMASK);
+      MOZ_ASSERT(cutoff >= lim); // Because nbytes >= WORDSIZE
+      while (src > cutoff) {
+        AtomicCopyByteUnsynchronized(--dest, --src);
+      }
+      copyBlock = AtomicCopyBlockUpUnsynchronized;
+      copyWord = AtomicCopyWordUnsynchronized;
+    } else if (UnalignedAccessesAreOK()) {
+      copyBlock = AtomicCopyBlockUpUnsynchronized;
+      copyWord = AtomicCopyWordUnsynchronized;
+    } else {
+      copyBlock = AtomicCopyUnalignedBlockUpUnsynchronized;
+      copyWord = AtomicCopyUnalignedWordUpUnsynchronized;
+    }
+
+    const uint8_t* blocklim = src - ((src - lim) & ~BLOCKMASK);
+    while (src > blocklim) {
+      dest -= BLOCKSIZE;
+      src -= BLOCKSIZE;
+      copyBlock(dest, src);
+    }
+
+    const uint8_t* wordlim = src - ((src - lim) & ~WORDMASK);
+    while (src > wordlim) {
+      dest -= WORDSIZE;
+      src -= WORDSIZE;
+      copyWord(dest, src);
+    }
+  }
+
+  while (src > lim) {
+    AtomicCopyByteUnsynchronized(--dest, --src);
+  }
+}
+
+// These will be read and written only by the main thread during startup and
+// shutdown.
+
+static uint8_t* codeSegment;
+static uint32_t codeSegmentSize;
+
+bool InitializeJittedAtomics() {
+  // We should only initialize once.
+  MOZ_ASSERT(!codeSegment);
+
+  LifoAlloc lifo(4096);
+  TempAllocator alloc(&lifo);
+  JitContext jcx(&alloc);
+  StackMacroAssembler masm;
+
+  uint32_t fenceSeqCst = GenFenceSeqCst(masm);
+
+#ifndef JS_64BIT
+  uint32_t nop = GenNop(masm);
+#endif
+
+  Synchronization Full = Synchronization::Full();
+  Synchronization None = Synchronization::None();
+
+  uint32_t load8SeqCst = GenLoad(masm, SIZE8, Full);
+  uint32_t load16SeqCst = GenLoad(masm, SIZE16, Full);
+  uint32_t load32SeqCst = GenLoad(masm, SIZE32, Full);
+#ifdef JS_64BIT
+  uint32_t load64SeqCst = GenLoad(masm, SIZE64, Full);
+#endif
+
+  uint32_t load8Unsynchronized = GenLoad(masm, SIZE8, None);
+  uint32_t load16Unsynchronized = GenLoad(masm, SIZE16, None);
+  uint32_t load32Unsynchronized = GenLoad(masm, SIZE32, None);
+#ifdef JS_64BIT
+  uint32_t load64Unsynchronized = GenLoad(masm, SIZE64, None);
+#endif
+
+  uint32_t store8SeqCst = GenStore(masm, SIZE8, Full);
+  uint32_t store16SeqCst = GenStore(masm, SIZE16, Full);
+  uint32_t store32SeqCst = GenStore(masm, SIZE32, Full);
+#ifdef JS_64BIT
+  uint32_t store64SeqCst = GenStore(masm, SIZE64, Full);
+#endif
+
+  uint32_t store8Unsynchronized = GenStore(masm, SIZE8, None);
+  uint32_t store16Unsynchronized = GenStore(masm, SIZE16, None);
+  uint32_t store32Unsynchronized = GenStore(masm, SIZE32, None);
+#ifdef JS_64BIT
+  uint32_t store64Unsynchronized = GenStore(masm, SIZE64, None);
+#endif
+
+  uint32_t copyUnalignedBlockDownUnsynchronized =
+    GenCopy(masm, SIZE8, BLOCKSIZE, CopyDir::DOWN);
+  uint32_t copyUnalignedBlockUpUnsynchronized =
+    GenCopy(masm, SIZE8, BLOCKSIZE, CopyDir::UP);
+  uint32_t copyUnalignedWordDownUnsynchronized =
+    GenCopy(masm, SIZE8, WORDSIZE, CopyDir::DOWN);
+  uint32_t copyUnalignedWordUpUnsynchronized =
+    GenCopy(masm, SIZE8, WORDSIZE, CopyDir::UP);
+
+  uint32_t copyBlockDownUnsynchronized =
+    GenCopy(masm, SIZEWORD, BLOCKSIZE/WORDSIZE, CopyDir::DOWN);
+  uint32_t copyBlockUpUnsynchronized =
+    GenCopy(masm, SIZEWORD, BLOCKSIZE/WORDSIZE, CopyDir::UP);
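+
+  // Only one variant is generated for the single-element copies below: with a
+  // single element the copy direction is immaterial, so DOWN serves for both
+  // the down and up memcpy paths.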
+  uint32_t copyWordUnsynchronized = GenCopy(masm, SIZEWORD, 1, CopyDir::DOWN);
+  uint32_t copyByteUnsynchronized = GenCopy(masm, SIZE8, 1, CopyDir::DOWN);
+
+  uint32_t cmpxchg8SeqCst = GenCmpxchg(masm, SIZE8, Full);
+  uint32_t cmpxchg16SeqCst = GenCmpxchg(masm, SIZE16, Full);
+  uint32_t cmpxchg32SeqCst = GenCmpxchg(masm, SIZE32, Full);
+  uint32_t cmpxchg64SeqCst = GenCmpxchg(masm, SIZE64, Full);
+
+  uint32_t exchange8SeqCst = GenExchange(masm, SIZE8, Full);
+  uint32_t exchange16SeqCst = GenExchange(masm, SIZE16, Full);
+  uint32_t exchange32SeqCst = GenExchange(masm, SIZE32, Full);
+#ifdef JS_64BIT
+  uint32_t exchange64SeqCst = GenExchange(masm, SIZE64, Full);
+#endif
+
+  uint32_t add8SeqCst = GenFetchOp(masm, SIZE8, AtomicFetchAddOp, Full);
+  uint32_t add16SeqCst = GenFetchOp(masm, SIZE16, AtomicFetchAddOp, Full);
+  uint32_t add32SeqCst = GenFetchOp(masm, SIZE32, AtomicFetchAddOp, Full);
+#ifdef JS_64BIT
+  uint32_t add64SeqCst = GenFetchOp(masm, SIZE64, AtomicFetchAddOp, Full);
+#endif
+
+  uint32_t and8SeqCst = GenFetchOp(masm, SIZE8, AtomicFetchAndOp, Full);
+  uint32_t and16SeqCst = GenFetchOp(masm, SIZE16, AtomicFetchAndOp, Full);
+  uint32_t and32SeqCst = GenFetchOp(masm, SIZE32, AtomicFetchAndOp, Full);
+#ifdef JS_64BIT
+  uint32_t and64SeqCst = GenFetchOp(masm, SIZE64, AtomicFetchAndOp, Full);
+#endif
+
+  uint32_t or8SeqCst = GenFetchOp(masm, SIZE8, AtomicFetchOrOp, Full);
+  uint32_t or16SeqCst = GenFetchOp(masm, SIZE16, AtomicFetchOrOp, Full);
+  uint32_t or32SeqCst = GenFetchOp(masm, SIZE32, AtomicFetchOrOp, Full);
+#ifdef JS_64BIT
+  uint32_t or64SeqCst = GenFetchOp(masm, SIZE64, AtomicFetchOrOp, Full);
+#endif
+
+  uint32_t xor8SeqCst = GenFetchOp(masm, SIZE8, AtomicFetchXorOp, Full);
+  uint32_t xor16SeqCst = GenFetchOp(masm, SIZE16, AtomicFetchXorOp, Full);
+  uint32_t xor32SeqCst = GenFetchOp(masm, SIZE32, AtomicFetchXorOp, Full);
+#ifdef JS_64BIT
+  uint32_t xor64SeqCst = GenFetchOp(masm, SIZE64, AtomicFetchXorOp, Full);
+#endif
+
+  masm.finish();
+  if (masm.oom()) {
+    return false;
+  }
+
+  // Allocate executable memory.
+  uint32_t codeLength = masm.bytesNeeded();
+  size_t roundedCodeLength = JS_ROUNDUP(codeLength, ExecutableCodePageSize);
+  uint8_t* code =
+    (uint8_t*)AllocateExecutableMemory(roundedCodeLength,
+                                       ProtectionSetting::Writable,
+                                       MemCheckKind::MakeUndefined);
+  if (!code) {
+    return false;
+  }
+
+  // Zero the padding.
+  memset(code + codeLength, 0, roundedCodeLength - codeLength);
+
+  // Copy the code into place but do not flush, as the flush path requires a
+  // JSContext* we do not have.
+  masm.executableCopy(code, /* flushICache = */ false);
+
+  // Flush the icache using a primitive method.
+  ExecutableAllocator::cacheFlush(code, roundedCodeLength);
+
+  // Reprotect the whole region to avoid having separate RW and RX mappings.
+  if (!ExecutableAllocator::makeExecutable(code, roundedCodeLength)) {
+    DeallocateExecutableMemory(code, roundedCodeLength);
+    return false;
+  }
+
+  // Create the function pointers.
+
+  AtomicFenceSeqCst = (void(*)())(code + fenceSeqCst);
+
+#ifndef JS_64BIT
+  AtomicCompilerFence = (void(*)())(code + nop);
+#endif
+
+  AtomicLoad8SeqCst = (uint8_t(*)(const uint8_t* addr))(code + load8SeqCst);
+  AtomicLoad16SeqCst = (uint16_t(*)(const uint16_t* addr))(code + load16SeqCst);
+  AtomicLoad32SeqCst = (uint32_t(*)(const uint32_t* addr))(code + load32SeqCst);
+#ifdef JS_64BIT
+  AtomicLoad64SeqCst = (uint64_t(*)(const uint64_t* addr))(code + load64SeqCst);
+#endif
+
+  AtomicLoad8Unsynchronized =
+    (uint8_t(*)(const uint8_t* addr))(code + load8Unsynchronized);
+  AtomicLoad16Unsynchronized =
+    (uint16_t(*)(const uint16_t* addr))(code + load16Unsynchronized);
+  AtomicLoad32Unsynchronized =
+    (uint32_t(*)(const uint32_t* addr))(code + load32Unsynchronized);
+#ifdef JS_64BIT
+  AtomicLoad64Unsynchronized =
+    (uint64_t(*)(const uint64_t* addr))(code + load64Unsynchronized);
+#endif
+
+  AtomicStore8SeqCst =
+    (uint8_t(*)(uint8_t* addr, uint8_t val))(code + store8SeqCst);
+  AtomicStore16SeqCst =
+    (uint16_t(*)(uint16_t* addr, uint16_t val))(code + store16SeqCst);
+  AtomicStore32SeqCst =
+    (uint32_t(*)(uint32_t* addr, uint32_t val))(code + store32SeqCst);
+#ifdef JS_64BIT
+  AtomicStore64SeqCst =
+    (uint64_t(*)(uint64_t* addr, uint64_t val))(code + store64SeqCst);
+#endif
+
+  AtomicStore8Unsynchronized =
+    (uint8_t(*)(uint8_t* addr, uint8_t val))(code + store8Unsynchronized);
+  AtomicStore16Unsynchronized =
+    (uint16_t(*)(uint16_t* addr, uint16_t val))(code + store16Unsynchronized);
+  AtomicStore32Unsynchronized =
+    (uint32_t(*)(uint32_t* addr, uint32_t val))(code + store32Unsynchronized);
+#ifdef JS_64BIT
+  AtomicStore64Unsynchronized =
+    (uint64_t(*)(uint64_t* addr, uint64_t val))(code + store64Unsynchronized);
+#endif
+
+  AtomicCopyUnalignedBlockDownUnsynchronized =
+    (void(*)(uint8_t* dest, const uint8_t* src))(
+      code + copyUnalignedBlockDownUnsynchronized);
+  AtomicCopyUnalignedBlockUpUnsynchronized =
+    (void(*)(uint8_t* dest, const uint8_t* src))(
+      code + copyUnalignedBlockUpUnsynchronized);
+  AtomicCopyUnalignedWordDownUnsynchronized =
+    (void(*)(uint8_t* dest, const uint8_t* src))(
+      code + copyUnalignedWordDownUnsynchronized);
+  AtomicCopyUnalignedWordUpUnsynchronized =
+    (void(*)(uint8_t* dest, const uint8_t* src))(
+      code + copyUnalignedWordUpUnsynchronized);
+
+  AtomicCopyBlockDownUnsynchronized =
+    (void(*)(uint8_t* dest, const uint8_t* src))(
+      code + copyBlockDownUnsynchronized);
+  AtomicCopyBlockUpUnsynchronized =
+    (void(*)(uint8_t* dest, const uint8_t* src))(
+      code + copyBlockUpUnsynchronized);
+  AtomicCopyWordUnsynchronized =
+    (void(*)(uint8_t* dest, const uint8_t* src))(code + copyWordUnsynchronized);
+  AtomicCopyByteUnsynchronized =
+    (void(*)(uint8_t* dest, const uint8_t* src))(code + copyByteUnsynchronized);
+
+  AtomicCmpXchg8SeqCst =
+    (uint8_t(*)(uint8_t* addr, uint8_t oldval, uint8_t newval))(
+      code + cmpxchg8SeqCst);
+  AtomicCmpXchg16SeqCst =
+    (uint16_t(*)(uint16_t* addr, uint16_t oldval, uint16_t newval))(
+      code + cmpxchg16SeqCst);
+  AtomicCmpXchg32SeqCst =
+    (uint32_t(*)(uint32_t* addr, uint32_t oldval, uint32_t newval))(
+      code + cmpxchg32SeqCst);
+  AtomicCmpXchg64SeqCst =
+    (uint64_t(*)(uint64_t* addr, uint64_t oldval, uint64_t newval))(
+      code + cmpxchg64SeqCst);
+
+  AtomicExchange8SeqCst = (uint8_t(*)(uint8_t* addr, uint8_t val))(
+    code + exchange8SeqCst);
+  AtomicExchange16SeqCst = (uint16_t(*)(uint16_t* addr, uint16_t val))(
+    code + exchange16SeqCst);
+  AtomicExchange32SeqCst = (uint32_t(*)(uint32_t* addr, uint32_t val))(
+    code + exchange32SeqCst);
+#ifdef JS_64BIT
+  AtomicExchange64SeqCst = (uint64_t(*)(uint64_t* addr, uint64_t val))(
+    code + exchange64SeqCst);
+#endif
+
+  AtomicAdd8SeqCst =
+    (uint8_t(*)(uint8_t* addr, uint8_t val))(code + add8SeqCst);
+  AtomicAdd16SeqCst =
+    (uint16_t(*)(uint16_t* addr, uint16_t val))(code + add16SeqCst);
+  AtomicAdd32SeqCst =
+    (uint32_t(*)(uint32_t* addr, uint32_t val))(code + add32SeqCst);
+#ifdef JS_64BIT
+  AtomicAdd64SeqCst =
+    (uint64_t(*)(uint64_t* addr, uint64_t val))(code + add64SeqCst);
+#endif
+
+  AtomicAnd8SeqCst =
+    (uint8_t(*)(uint8_t* addr, uint8_t val))(code + and8SeqCst);
+  AtomicAnd16SeqCst =
+    (uint16_t(*)(uint16_t* addr, uint16_t val))(code + and16SeqCst);
+  AtomicAnd32SeqCst =
+    (uint32_t(*)(uint32_t* addr, uint32_t val))(code + and32SeqCst);
+#ifdef JS_64BIT
+  AtomicAnd64SeqCst =
+    (uint64_t(*)(uint64_t* addr, uint64_t val))(code + and64SeqCst);
+#endif
+
+  AtomicOr8SeqCst =
+    (uint8_t(*)(uint8_t* addr, uint8_t val))(code + or8SeqCst);
+  AtomicOr16SeqCst =
+    (uint16_t(*)(uint16_t* addr, uint16_t val))(code + or16SeqCst);
+  AtomicOr32SeqCst =
+    (uint32_t(*)(uint32_t* addr, uint32_t val))(code + or32SeqCst);
+#ifdef JS_64BIT
+  AtomicOr64SeqCst =
+    (uint64_t(*)(uint64_t* addr, uint64_t val))(code + or64SeqCst);
+#endif
+
+  AtomicXor8SeqCst =
+    (uint8_t(*)(uint8_t* addr, uint8_t val))(code + xor8SeqCst);
+  AtomicXor16SeqCst =
+    (uint16_t(*)(uint16_t* addr, uint16_t val))(code + xor16SeqCst);
+  AtomicXor32SeqCst =
+    (uint32_t(*)(uint32_t* addr, uint32_t val))(code + xor32SeqCst);
+#ifdef JS_64BIT
+  AtomicXor64SeqCst =
+    (uint64_t(*)(uint64_t* addr, uint64_t val))(code + xor64SeqCst);
+#endif
+
+  codeSegment = code;
+  codeSegmentSize = roundedCodeLength;
+
+  return true;
+}
+
+void ShutDownJittedAtomics() {
+  // Must have been initialized.
+  MOZ_ASSERT(codeSegment);
+
+  DeallocateExecutableMemory(codeSegment, codeSegmentSize);
+  codeSegment = nullptr;
+  codeSegmentSize = 0;
+}
+
+} // jit
+} // js
new file mode 100644
--- /dev/null
+++ b/js/src/jit/shared/AtomicOperations-shared-jit.h
@@ -0,0 +1,605 @@
+/* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*-
+ * vim: set ts=8 sts=2 et sw=2 tw=99:
+ * This Source Code Form is subject to the terms of the Mozilla Public
+ * License, v. 2.0. If a copy of the MPL was not distributed with this
+ * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
+
+/* For overall documentation, see jit/AtomicOperations.h.
+ *
+ * NOTE CAREFULLY: This file is only applicable when we have configured a JIT
+ * and the JIT is for the same architecture that we're compiling the shell for.
+ * Simulators must use a different mechanism.
+ *
+ * See comments before the include nest near the end of jit/AtomicOperations.h
+ * if you didn't understand that.
+ */
+
+#ifndef jit_shared_AtomicOperations_shared_jit_h
+#define jit_shared_AtomicOperations_shared_jit_h
+
+#include "mozilla/Assertions.h"
+#include "mozilla/Types.h"
+
+#include "jsapi.h"
+
+#include "vm/ArrayBufferObject.h"
+
+namespace js {
+namespace jit {
+
+// The function pointers in this section all point to jitted code.
+//
+// On 32-bit systems we assume for simplicity's sake that we don't have any
+// 64-bit atomic operations except cmpxchg (this is a concession to x86 but it's
+// not a hardship).  On 32-bit systems we therefore implement other 64-bit
+// atomic operations in terms of cmpxchg along with some C++ code and a local
+// reordering fence to prevent other loads and stores from being intermingled
+// with operations in the implementation of the atomic.
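+//
+// In sketch form, a 64-bit fetch-add on a 32-bit system therefore looks
+// like this (the real expansions are the JIT_*_CAS macros later in this
+// file; the sketch assumes only the functions declared below):
+//
+//   AtomicCompilerFence();
+//   uint64_t old = *addr;
+//   for (;;) {
+//     uint64_t prev = AtomicCmpXchg64SeqCst(addr, old, old + val);
+//     if (prev == old) break;
+//     old = prev;
+//   }
+//   AtomicCompilerFence();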
+
+// `fence` performs a full memory barrier.
+extern void (*AtomicFenceSeqCst)();
+
+#ifndef JS_64BIT
+// `compiler_fence` erects a reordering boundary for operations on the current
+// thread.  We use it to prevent the compiler from reordering loads and stores
+// inside larger primitives that are synthesized from cmpxchg.
+extern void (*AtomicCompilerFence)();
+#endif
+
+extern uint8_t (*AtomicLoad8SeqCst)(const uint8_t* addr);
+extern uint16_t (*AtomicLoad16SeqCst)(const uint16_t* addr);
+extern uint32_t (*AtomicLoad32SeqCst)(const uint32_t* addr);
+#ifdef JS_64BIT
+extern uint64_t (*AtomicLoad64SeqCst)(const uint64_t* addr);
+#endif
+
+// These are access-atomic up to sizeof(uintptr_t).
+extern uint8_t (*AtomicLoad8Unsynchronized)(const uint8_t* addr);
+extern uint16_t (*AtomicLoad16Unsynchronized)(const uint16_t* addr);
+extern uint32_t (*AtomicLoad32Unsynchronized)(const uint32_t* addr);
+#ifdef JS_64BIT
+extern uint64_t (*AtomicLoad64Unsynchronized)(const uint64_t* addr);
+#endif
+
+extern uint8_t (*AtomicStore8SeqCst)(uint8_t* addr, uint8_t val);
+extern uint16_t (*AtomicStore16SeqCst)(uint16_t* addr, uint16_t val);
+extern uint32_t (*AtomicStore32SeqCst)(uint32_t* addr, uint32_t val);
+#ifdef JS_64BIT
+extern uint64_t (*AtomicStore64SeqCst)(uint64_t* addr, uint64_t val);
+#endif
+
+// These are access-atomic up to sizeof(uintptr_t).
+extern uint8_t (*AtomicStore8Unsynchronized)(uint8_t* addr, uint8_t val);
+extern uint16_t (*AtomicStore16Unsynchronized)(uint16_t* addr, uint16_t val);
+extern uint32_t (*AtomicStore32Unsynchronized)(uint32_t* addr, uint32_t val);
+#ifdef JS_64BIT
+extern uint64_t (*AtomicStore64Unsynchronized)(uint64_t* addr, uint64_t val);
+#endif
+
+// `exchange` takes a cell address and a value.  It stores the value in the
+// cell and returns the value previously in the cell.
+extern uint8_t (*AtomicExchange8SeqCst)(uint8_t* addr, uint8_t val);
+extern uint16_t (*AtomicExchange16SeqCst)(uint16_t* addr, uint16_t val);
+extern uint32_t (*AtomicExchange32SeqCst)(uint32_t* addr, uint32_t val);
+#ifdef JS_64BIT
+extern uint64_t (*AtomicExchange64SeqCst)(uint64_t* addr, uint64_t val);
+#endif
+
+// `add` adds a value atomically to the cell and returns the old value in the
+// cell.  (There is no `sub`; just add the negated value.)
+extern uint8_t (*AtomicAdd8SeqCst)(uint8_t* addr, uint8_t val);
+extern uint16_t (*AtomicAdd16SeqCst)(uint16_t* addr, uint16_t val);
+extern uint32_t (*AtomicAdd32SeqCst)(uint32_t* addr, uint32_t val);
+#ifdef JS_64BIT
+extern uint64_t (*AtomicAdd64SeqCst)(uint64_t* addr, uint64_t val);
+#endif
+
+// `and` bitwise-ands a value atomically into the cell and returns the old value
+// in the cell.
+extern uint8_t (*AtomicAnd8SeqCst)(uint8_t* addr, uint8_t val);
+extern uint16_t (*AtomicAnd16SeqCst)(uint16_t* addr, uint16_t val);
+extern uint32_t (*AtomicAnd32SeqCst)(uint32_t* addr, uint32_t val);
+#ifdef JS_64BIT
+extern uint64_t (*AtomicAnd64SeqCst)(uint64_t* addr, uint64_t val);
+#endif
+
+// `or` bitwise-ors a value atomically into the cell and returns the old value
+// in the cell.
+extern uint8_t (*AtomicOr8SeqCst)(uint8_t* addr, uint8_t val);
+extern uint16_t (*AtomicOr16SeqCst)(uint16_t* addr, uint16_t val);
+extern uint32_t (*AtomicOr32SeqCst)(uint32_t* addr, uint32_t val);
+#ifdef JS_64BIT
+extern uint64_t (*AtomicOr64SeqCst)(uint64_t* addr, uint64_t val);
+#endif
+
+// `xor` bitwise-xors a value atomically into the cell and returns the old value
+// in the cell.
+extern uint8_t (*AtomicXor8SeqCst)(uint8_t* addr, uint8_t val);
+extern uint16_t (*AtomicXor16SeqCst)(uint16_t* addr, uint16_t val);
+extern uint32_t (*AtomicXor32SeqCst)(uint32_t* addr, uint32_t val);
+#ifdef JS_64BIT
+extern uint64_t (*AtomicXor64SeqCst)(uint64_t* addr, uint64_t val);
+#endif
+
+// `cmpxchg` takes a cell address, an expected value and a replacement value.
+// If the value in the cell equals the expected value then the replacement value
+// is stored in the cell.  It always returns the value previously in the cell.
+extern uint8_t (*AtomicCmpXchg8SeqCst)(uint8_t* addr, uint8_t oldval, uint8_t newval);
+extern uint16_t (*AtomicCmpXchg16SeqCst)(uint16_t* addr, uint16_t oldval, uint16_t newval);
+extern uint32_t (*AtomicCmpXchg32SeqCst)(uint32_t* addr, uint32_t oldval, uint32_t newval);
+extern uint64_t (*AtomicCmpXchg64SeqCst)(uint64_t* addr, uint64_t oldval, uint64_t newval);
+
+// `...MemcpyDown` moves bytes toward lower addresses in memory: dest <= src.
+// `...MemcpyUp` moves bytes toward higher addresses in memory: dest >= src.
+extern void AtomicMemcpyDownUnsynchronized(uint8_t* dest, const uint8_t* src, size_t nbytes);
+extern void AtomicMemcpyUpUnsynchronized(uint8_t* dest, const uint8_t* src, size_t nbytes);
+
+} }
+
+inline bool js::jit::AtomicOperations::hasAtomic8() {
+  return true;
+}
+
+inline bool js::jit::AtomicOperations::isLockfree8() {
+  return true;
+}
+
+inline void js::jit::AtomicOperations::fenceSeqCst() {
+  AtomicFenceSeqCst();
+}
+
+#define JIT_LOADOP(T, U, loadop)                            \
+  template<> inline T                                       \
+  AtomicOperations::loadSeqCst(T* addr) {                   \
+    JS::AutoSuppressGCAnalysis nogc;                        \
+    return (T)loadop((U*)addr);                             \
+  }
+
+#ifndef JS_64BIT
+#  define JIT_LOADOP_CAS(T)                                     \
+  template<>                                                    \
+  inline T                                                      \
+  AtomicOperations::loadSeqCst(T* addr) {                       \
+    JS::AutoSuppressGCAnalysis nogc;                            \
+    AtomicCompilerFence();                                      \
+    return (T)AtomicCmpXchg64SeqCst((uint64_t*)addr, 0, 0);     \
+  }
+#endif // !JS_64BIT
+
+namespace js {
+namespace jit {
+
+JIT_LOADOP(int8_t, uint8_t, AtomicLoad8SeqCst)
+JIT_LOADOP(uint8_t, uint8_t, AtomicLoad8SeqCst)
+JIT_LOADOP(int16_t, uint16_t, AtomicLoad16SeqCst)
+JIT_LOADOP(uint16_t, uint16_t, AtomicLoad16SeqCst)
+JIT_LOADOP(int32_t, uint32_t, AtomicLoad32SeqCst)
+JIT_LOADOP(uint32_t, uint32_t, AtomicLoad32SeqCst)
+
+#ifdef JIT_LOADOP_CAS
+JIT_LOADOP_CAS(int64_t)
+JIT_LOADOP_CAS(uint64_t)
+#else
+JIT_LOADOP(int64_t, uint64_t, AtomicLoad64SeqCst)
+JIT_LOADOP(uint64_t, uint64_t, AtomicLoad64SeqCst)
+#endif
+
+}}
+
+#undef JIT_LOADOP
+#undef JIT_LOADOP_CAS
+
+#define JIT_STOREOP(T, U, storeop)                      \
+  template<> inline void                                \
+  AtomicOperations::storeSeqCst(T* addr, T val) {       \
+    JS::AutoSuppressGCAnalysis nogc;                    \
+    storeop((U*)addr, val);                             \
+  }
+
+#ifndef JS_64BIT
+#  define JIT_STOREOP_CAS(T)                                          \
+  template<>                                                          \
+  inline void                                                         \
+  AtomicOperations::storeSeqCst(T* addr, T val) {                     \
+    JS::AutoSuppressGCAnalysis nogc;                                  \
+    AtomicCompilerFence();                                            \
+    T oldval = *addr; /* good initial approximation */                \
+    for (;;) {                                                        \
+      T nextval = (T)AtomicCmpXchg64SeqCst((uint64_t*)addr,           \
+                                           (uint64_t)oldval,          \
+                                           (uint64_t)val);            \
+      if (nextval == oldval) {                                        \
+        break;                                                        \
+      }                                                               \
+      oldval = nextval;                                               \
+    }                                                                 \
+    AtomicCompilerFence();                                            \
+  }
+#endif // !JS_64BIT
+
+namespace js {
+namespace jit {
+
+JIT_STOREOP(int8_t, uint8_t, AtomicStore8SeqCst)
+JIT_STOREOP(uint8_t, uint8_t, AtomicStore8SeqCst)
+JIT_STOREOP(int16_t, uint16_t, AtomicStore16SeqCst)
+JIT_STOREOP(uint16_t, uint16_t, AtomicStore16SeqCst)
+JIT_STOREOP(int32_t, uint32_t, AtomicStore32SeqCst)
+JIT_STOREOP(uint32_t, uint32_t, AtomicStore32SeqCst)
+
+#ifdef JIT_STOREOP_CAS
+JIT_STOREOP_CAS(int64_t)
+JIT_STOREOP_CAS(uint64_t)
+#else
+JIT_STOREOP(int64_t, uint64_t, AtomicStore64SeqCst)
+JIT_STOREOP(uint64_t, uint64_t, AtomicStore64SeqCst)
+#endif
+
+}}
+
+#undef JIT_STOREOP
+#undef JIT_STOREOP_CAS
+
+#define JIT_EXCHANGEOP(T, U, xchgop)                            \
+  template<> inline T                                           \
+  AtomicOperations::exchangeSeqCst(T* addr, T val) {            \
+    JS::AutoSuppressGCAnalysis nogc;                            \
+    return (T)xchgop((U*)addr, (U)val);                         \
+  }
+
+#ifndef JS_64BIT
+#  define JIT_EXCHANGEOP_CAS(T)                                       \
+  template<> inline T                                                 \
+  AtomicOperations::exchangeSeqCst(T* addr, T val) {                  \
+    JS::AutoSuppressGCAnalysis nogc;                                  \
+    AtomicCompilerFence();                                            \
+    T oldval = *addr;                                                 \
+    for (;;) {                                                        \
+      T nextval = (T)AtomicCmpXchg64SeqCst((uint64_t*)addr,           \
+                                           (uint64_t)oldval,          \
+                                           (uint64_t)val);            \
+      if (nextval == oldval) {                                        \
+        break;                                                        \
+      }                                                               \
+      oldval = nextval;                                               \
+    }                                                                 \
+    AtomicCompilerFence();                                            \
+    return oldval;                                                    \
+  }
+#endif // !JS_64BIT
+
+namespace js {
+namespace jit {
+
+JIT_EXCHANGEOP(int8_t, uint8_t, AtomicExchange8SeqCst)
+JIT_EXCHANGEOP(uint8_t, uint8_t, AtomicExchange8SeqCst)
+JIT_EXCHANGEOP(int16_t, uint16_t, AtomicExchange16SeqCst)
+JIT_EXCHANGEOP(uint16_t, uint16_t, AtomicExchange16SeqCst)
+JIT_EXCHANGEOP(int32_t, uint32_t, AtomicExchange32SeqCst)
+JIT_EXCHANGEOP(uint32_t, uint32_t, AtomicExchange32SeqCst)
+
+#ifdef JIT_EXCHANGEOP_CAS
+JIT_EXCHANGEOP_CAS(int64_t)
+JIT_EXCHANGEOP_CAS(uint64_t)
+#else
+JIT_EXCHANGEOP(int64_t, uint64_t, AtomicExchange64SeqCst)
+JIT_EXCHANGEOP(uint64_t, uint64_t, AtomicExchange64SeqCst)
+#endif
+
+}}
+
+#undef JIT_EXCHANGEOP
+#undef JIT_EXCHANGEOP_CAS
+
+#define JIT_CAS(T, U, cmpxchg)                                          \
+  template<> inline T                                                   \
+  AtomicOperations::compareExchangeSeqCst(T* addr, T oldval, T newval) { \
+    JS::AutoSuppressGCAnalysis nogc;                                    \
+    return (T)cmpxchg((U*)addr, (U)oldval, (U)newval);                  \
+  }
+
+namespace js {
+namespace jit {
+
+JIT_CAS(int8_t, uint8_t, AtomicCmpXchg8SeqCst)
+JIT_CAS(uint8_t, uint8_t, AtomicCmpXchg8SeqCst)
+JIT_CAS(int16_t, uint16_t, AtomicCmpXchg16SeqCst)
+JIT_CAS(uint16_t, uint16_t, AtomicCmpXchg16SeqCst)
+JIT_CAS(int32_t, uint32_t, AtomicCmpXchg32SeqCst)
+JIT_CAS(uint32_t, uint32_t, AtomicCmpXchg32SeqCst)
+JIT_CAS(int64_t, uint64_t, AtomicCmpXchg64SeqCst)
+JIT_CAS(uint64_t, uint64_t, AtomicCmpXchg64SeqCst)
+
+}}
+
+#undef JIT_CAS
+
+#define JIT_FETCHADDOP(T, U, xadd)                                   \
+  template<> inline T                                                \
+  AtomicOperations::fetchAddSeqCst(T* addr, T val) {                 \
+    JS::AutoSuppressGCAnalysis nogc;                                 \
+    return (T)xadd((U*)addr, (U)val);                                \
+  }
+
+#define JIT_FETCHSUBOP(T)                                            \
+  template<> inline T                                                \
+  AtomicOperations::fetchSubSeqCst(T* addr, T val) {                 \
+    JS::AutoSuppressGCAnalysis nogc;                                 \
+    return fetchAddSeqCst(addr, (T)(0-val));                         \
+  }
+
+#ifndef JS_64BIT
+#  define JIT_FETCHADDOP_CAS(T)                                         \
+  template<> inline T                                                   \
+  AtomicOperations::fetchAddSeqCst(T* addr, T val) {                    \
+    JS::AutoSuppressGCAnalysis nogc;                                    \
+    AtomicCompilerFence();                                              \
+    T oldval = *addr; /* Good initial approximation */                  \
+    for (;;) {                                                          \
+      T nextval = (T)AtomicCmpXchg64SeqCst((uint64_t*)addr,             \
+                                           (uint64_t)oldval,            \
+                                           (uint64_t)(oldval + val));   \
+      if (nextval == oldval) {                                          \
+        break;                                                          \
+      }                                                                 \
+      oldval = nextval;                                                 \
+    }                                                                   \
+    AtomicCompilerFence();                                              \
+    return oldval;                                                      \
+  }
+#endif // !JS_64BIT
+
+namespace js {
+namespace jit {
+
+JIT_FETCHADDOP(int8_t, uint8_t, AtomicAdd8SeqCst)
+JIT_FETCHADDOP(uint8_t, uint8_t, AtomicAdd8SeqCst)
+JIT_FETCHADDOP(int16_t, uint16_t, AtomicAdd16SeqCst)
+JIT_FETCHADDOP(uint16_t, uint16_t, AtomicAdd16SeqCst)
+JIT_FETCHADDOP(int32_t, uint32_t, AtomicAdd32SeqCst)
+JIT_FETCHADDOP(uint32_t, uint32_t, AtomicAdd32SeqCst)
+
+#ifdef JIT_FETCHADDOP_CAS
+JIT_FETCHADDOP_CAS(int64_t)
+JIT_FETCHADDOP_CAS(uint64_t)
+#else
+JIT_FETCHADDOP(int64_t,  uint64_t, AtomicAdd64SeqCst)
+JIT_FETCHADDOP(uint64_t, uint64_t, AtomicAdd64SeqCst)
+#endif
+
+JIT_FETCHSUBOP(int8_t)
+JIT_FETCHSUBOP(uint8_t)
+JIT_FETCHSUBOP(int16_t)
+JIT_FETCHSUBOP(uint16_t)
+JIT_FETCHSUBOP(int32_t)
+JIT_FETCHSUBOP(uint32_t)
+JIT_FETCHSUBOP(int64_t)
+JIT_FETCHSUBOP(uint64_t)
+
+}}
+
+#undef JIT_FETCHADDOP
+#undef JIT_FETCHADDOP_CAS
+#undef JIT_FETCHSUBOP
+
+#define JIT_FETCHBITOPX(T, U, name, op)                                 \
+  template<> inline T                                                   \
+  AtomicOperations::name(T* addr, T val) {                              \
+    JS::AutoSuppressGCAnalysis nogc;                                    \
+    return (T)op((U *)addr, (U)val);                                    \
+  }
+
+#define JIT_FETCHBITOP(T, U, andop, orop, xorop)                        \
+  JIT_FETCHBITOPX(T, U, fetchAndSeqCst, andop)                          \
+  JIT_FETCHBITOPX(T, U, fetchOrSeqCst, orop)                            \
+  JIT_FETCHBITOPX(T, U, fetchXorSeqCst, xorop)
+
+#ifndef JS_64BIT
+
+#  define AND_OP &
+#  define OR_OP  |
+#  define XOR_OP ^
+
+#  define JIT_FETCHBITOPX_CAS(T, name, OP)                              \
+  template<> inline T                                                   \
+  AtomicOperations::name(T* addr, T val) {                              \
+    JS::AutoSuppressGCAnalysis nogc;                                    \
+    AtomicCompilerFence();                                              \
+    T oldval = *addr;                                                   \
+    for (;;) {                                                          \
+      T nextval = (T)AtomicCmpXchg64SeqCst((uint64_t*)addr,             \
+                                           (uint64_t)oldval,            \
+                                           (uint64_t)(oldval OP val));  \
+      if (nextval == oldval) {                                          \
+        break;                                                          \
+      }                                                                 \
+      oldval = nextval;                                                 \
+    }                                                                   \
+    AtomicCompilerFence();                                              \
+    return oldval;                                                      \
+  }
+
+#  define JIT_FETCHBITOP_CAS(T)                                      \
+  JIT_FETCHBITOPX_CAS(T, fetchAndSeqCst, AND_OP)                     \
+  JIT_FETCHBITOPX_CAS(T, fetchOrSeqCst, OR_OP)                       \
+  JIT_FETCHBITOPX_CAS(T, fetchXorSeqCst, XOR_OP)
+
+#endif  // !JS_64BIT
+
+namespace js {
+namespace jit {
+
+JIT_FETCHBITOP(int8_t, uint8_t, AtomicAnd8SeqCst, AtomicOr8SeqCst, AtomicXor8SeqCst)
+JIT_FETCHBITOP(uint8_t, uint8_t, AtomicAnd8SeqCst, AtomicOr8SeqCst, AtomicXor8SeqCst)
+JIT_FETCHBITOP(int16_t, uint16_t, AtomicAnd16SeqCst, AtomicOr16SeqCst, AtomicXor16SeqCst)
+JIT_FETCHBITOP(uint16_t, uint16_t, AtomicAnd16SeqCst, AtomicOr16SeqCst, AtomicXor16SeqCst)
+JIT_FETCHBITOP(int32_t, uint32_t,  AtomicAnd32SeqCst, AtomicOr32SeqCst, AtomicXor32SeqCst)
+JIT_FETCHBITOP(uint32_t, uint32_t, AtomicAnd32SeqCst, AtomicOr32SeqCst, AtomicXor32SeqCst)
+
+#ifdef JIT_FETCHBITOP_CAS
+JIT_FETCHBITOP_CAS(int64_t)
+JIT_FETCHBITOP_CAS(uint64_t)
+#else
+JIT_FETCHBITOP(int64_t,  uint64_t, AtomicAnd64SeqCst, AtomicOr64SeqCst, AtomicXor64SeqCst)
+JIT_FETCHBITOP(uint64_t, uint64_t, AtomicAnd64SeqCst, AtomicOr64SeqCst, AtomicXor64SeqCst)
+#endif
+
+}}
+
+#undef JIT_FETCHBITOPX_CAS
+#undef JIT_FETCHBITOPX
+#undef JIT_FETCHBITOP_CAS
+#undef JIT_FETCHBITOP
+
+#define JIT_LOADSAFE(T, U, loadop)                              \
+  template<>                                                    \
+  inline T                                                      \
+  js::jit::AtomicOperations::loadSafeWhenRacy(T* addr) {        \
+    JS::AutoSuppressGCAnalysis nogc;                            \
+    union { U u; T t; };                                        \
+    u = loadop((U*)addr);                                       \
+    return t;                                                   \
+  }
+
+#ifndef JS_64BIT
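+// On 32-bit systems a 64-bit racy load is performed as two 32-bit loads and
+// can therefore tear; that is acceptable for safe-when-racy accesses.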
+#  define JIT_LOADSAFE_TEARING(T)                               \
+  template<>                                                    \
+  inline T                                                      \
+  js::jit::AtomicOperations::loadSafeWhenRacy(T* addr) {        \
+    JS::AutoSuppressGCAnalysis nogc;                            \
+    MOZ_ASSERT(sizeof(T) == 8);                                 \
+    union { uint32_t u[2]; T t; };                              \
+    uint32_t* ptr = (uint32_t*)addr;                            \
+    u[0] = AtomicLoad32Unsynchronized(ptr);                     \
+    u[1] = AtomicLoad32Unsynchronized(ptr + 1);                 \
+    return t;                                                   \
+  }
+#endif // !JS_64BIT
+
+namespace js {
+namespace jit {
+
+JIT_LOADSAFE(int8_t,   uint8_t, AtomicLoad8Unsynchronized)
+JIT_LOADSAFE(uint8_t,  uint8_t, AtomicLoad8Unsynchronized)
+JIT_LOADSAFE(int16_t,  uint16_t, AtomicLoad16Unsynchronized)
+JIT_LOADSAFE(uint16_t, uint16_t, AtomicLoad16Unsynchronized)
+JIT_LOADSAFE(int32_t,  uint32_t, AtomicLoad32Unsynchronized)
+JIT_LOADSAFE(uint32_t, uint32_t, AtomicLoad32Unsynchronized)
+#ifdef JIT_LOADSAFE_TEARING
+JIT_LOADSAFE_TEARING(int64_t)
+JIT_LOADSAFE_TEARING(uint64_t)
+JIT_LOADSAFE_TEARING(double)
+#else
+JIT_LOADSAFE(int64_t,  uint64_t, AtomicLoad64Unsynchronized)
+JIT_LOADSAFE(uint64_t, uint64_t, AtomicLoad64Unsynchronized)
+JIT_LOADSAFE(double,   uint64_t, AtomicLoad64Unsynchronized)
+#endif
+JIT_LOADSAFE(float,    uint32_t, AtomicLoad32Unsynchronized)
+
+// Clang requires a specialization for uint8_clamped.
+template<>
+inline uint8_clamped js::jit::AtomicOperations::loadSafeWhenRacy(
+  uint8_clamped* addr) {
+  return uint8_clamped(loadSafeWhenRacy((uint8_t*)addr));
+}
+
+}}
+
+#undef JIT_LOADSAFE
+#undef JIT_LOADSAFE_TEARING
+
+#define JIT_STORESAFE(T, U, storeop)                               \
+  template<>                                                       \
+  inline void                                                      \
+  js::jit::AtomicOperations::storeSafeWhenRacy(T* addr, T val) {   \
+    JS::AutoSuppressGCAnalysis nogc;                               \
+    union { U u; T t; };                                           \
+    t = val;                                                       \
+    storeop((U*)addr, u);                                          \
+  }
+
+#ifndef JS_64BIT
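+// Similarly, on 32-bit systems a 64-bit racy store is performed as two
+// 32-bit stores and can tear.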
+#  define JIT_STORESAFE_TEARING(T)                                    \
+  template<>                                                          \
+  inline void                                                         \
+  js::jit::AtomicOperations::storeSafeWhenRacy(T* addr, T val) {      \
+    JS::AutoSuppressGCAnalysis nogc;                                  \
+    union { uint32_t u[2]; T t; };                                    \
+    t = val;                                                          \
+    uint32_t* ptr = (uint32_t*)addr;                                  \
+    AtomicStore32Unsynchronized(ptr, u[0]);                           \
+    AtomicStore32Unsynchronized(ptr + 1, u[1]);                       \
+  }
+#endif // !JS_64BIT
+
+namespace js {
+namespace jit {
+
+JIT_STORESAFE(int8_t,   uint8_t, AtomicStore8Unsynchronized)
+JIT_STORESAFE(uint8_t,  uint8_t, AtomicStore8Unsynchronized)
+JIT_STORESAFE(int16_t,  uint16_t, AtomicStore16Unsynchronized)
+JIT_STORESAFE(uint16_t, uint16_t, AtomicStore16Unsynchronized)
+JIT_STORESAFE(int32_t,  uint32_t, AtomicStore32Unsynchronized)
+JIT_STORESAFE(uint32_t, uint32_t, AtomicStore32Unsynchronized)
+#ifdef JIT_STORESAFE_TEARING
+JIT_STORESAFE_TEARING(int64_t)
+JIT_STORESAFE_TEARING(uint64_t)
+JIT_STORESAFE_TEARING(double)
+#else
+JIT_STORESAFE(int64_t,  uint64_t, AtomicStore64Unsynchronized)
+JIT_STORESAFE(uint64_t, uint64_t, AtomicStore64Unsynchronized)
+JIT_STORESAFE(double,   uint64_t, AtomicStore64Unsynchronized)
+#endif
+JIT_STORESAFE(float,    uint32_t, AtomicStore32Unsynchronized)
+
+// Clang requires a specialization for uint8_clamped.
+template<>
+inline void js::jit::AtomicOperations::storeSafeWhenRacy(uint8_clamped* addr,
+                                                         uint8_clamped val) {
+  storeSafeWhenRacy((uint8_t*)addr, (uint8_t)val);
+}
+
+}}
+
+#undef JIT_STORESAFE
+#undef JIT_STORESAFE_TEARING
+
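+// The safe-when-racy memcpy requires non-overlapping ranges (asserted below);
+// overlapping moves must go through memmoveSafeWhenRacy, which chooses the
+// copy direction from the relative addresses of dest and src.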
+inline void js::jit::AtomicOperations::memcpySafeWhenRacy(void* dest,
+                                                           const void* src,
+                                                           size_t nbytes) {
+  JS::AutoSuppressGCAnalysis nogc;
+  MOZ_ASSERT(!((char*)dest <= (char*)src && (char*)src < (char*)dest+nbytes));
+  MOZ_ASSERT(!((char*)src <= (char*)dest && (char*)dest < (char*)src+nbytes));
+  AtomicMemcpyDownUnsynchronized((uint8_t*)dest, (const uint8_t*)src, nbytes);
+}
+
+inline void js::jit::AtomicOperations::memmoveSafeWhenRacy(void* dest,
+                                                           const void* src,
+                                                           size_t nbytes) {
+  JS::AutoSuppressGCAnalysis nogc;
+  if ((char*)dest <= (char*)src) {
+    AtomicMemcpyDownUnsynchronized((uint8_t*)dest, (const uint8_t*)src,
+                                   nbytes);
+  } else {
+    AtomicMemcpyUpUnsynchronized((uint8_t*)dest, (const uint8_t*)src, nbytes);
+  }
+}
+
+namespace js {
+namespace jit {
+
+extern bool InitializeJittedAtomics();
+extern void ShutDownJittedAtomics();
+
+}}
+
+inline bool js::jit::AtomicOperations::Initialize() {
+  return InitializeJittedAtomics();
+}
+
+inline void js::jit::AtomicOperations::ShutDown() {
+  ShutDownJittedAtomics();
+}
+
+#endif // jit_shared_AtomicOperations_shared_jit_h
--- a/js/src/jit/x64/MacroAssembler-x64.cpp
+++ b/js/src/jit/x64/MacroAssembler-x64.cpp
@@ -925,37 +925,43 @@ void MacroAssembler::wasmAtomicExchange6
   if (value != output) {
     movq(value.reg, output.reg);
   }
   append(access, masm.size());
   xchgq(output.reg, Operand(mem));
 }
 
 template <typename T>
-static void WasmAtomicFetchOp64(MacroAssembler& masm,
-                                const wasm::MemoryAccessDesc access,
-                                AtomicOp op, Register value, const T& mem,
-                                Register temp, Register output) {
+static void AtomicFetchOp64(MacroAssembler& masm,
+                            const wasm::MemoryAccessDesc* access, AtomicOp op,
+                            Register value, const T& mem, Register temp,
+                            Register output) {
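+  // `access` may be null: the non-wasm entry point atomicFetchOp64 below
+  // passes nullptr, in which case no wasm::MemoryAccessDesc is recorded.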
   if (op == AtomicFetchAddOp) {
     if (value != output) {
       masm.movq(value, output);
     }
-    masm.append(access, masm.size());
+    if (access) {
+      masm.append(*access, masm.size());
+    }
     masm.lock_xaddq(output, Operand(mem));
   } else if (op == AtomicFetchSubOp) {
     if (value != output) {
       masm.movq(value, output);
     }
     masm.negq(output);
-    masm.append(access, masm.size());
+    if (access) {
+      masm.append(*access, masm.size());
+    }
     masm.lock_xaddq(output, Operand(mem));
   } else {
     Label again;
     MOZ_ASSERT(output == rax);
-    masm.append(access, masm.size());
+    if (access) {
+      masm.append(*access, masm.size());
+    }
     masm.movq(Operand(mem), rax);
     masm.bind(&again);
     masm.movq(rax, temp);
     switch (op) {
       case AtomicFetchAndOp:
         masm.andq(value, temp);
         break;
       case AtomicFetchOrOp:
@@ -971,24 +977,24 @@ static void WasmAtomicFetchOp64(MacroAss
     masm.j(MacroAssembler::NonZero, &again);
   }
 }
 
 void MacroAssembler::wasmAtomicFetchOp64(const wasm::MemoryAccessDesc& access,
                                          AtomicOp op, Register64 value,
                                          const Address& mem, Register64 temp,
                                          Register64 output) {
-  WasmAtomicFetchOp64(*this, access, op, value.reg, mem, temp.reg, output.reg);
+  AtomicFetchOp64(*this, &access, op, value.reg, mem, temp.reg, output.reg);
 }
 
 void MacroAssembler::wasmAtomicFetchOp64(const wasm::MemoryAccessDesc& access,
                                          AtomicOp op, Register64 value,
                                          const BaseIndex& mem, Register64 temp,
                                          Register64 output) {
-  WasmAtomicFetchOp64(*this, access, op, value.reg, mem, temp.reg, output.reg);
+  AtomicFetchOp64(*this, &access, op, value.reg, mem, temp.reg, output.reg);
 }
 
 void MacroAssembler::wasmAtomicEffectOp64(const wasm::MemoryAccessDesc& access,
                                           AtomicOp op, Register64 value,
                                           const BaseIndex& mem) {
   append(access, size());
   switch (op) {
     case AtomicFetchAddOp:
@@ -1006,9 +1012,35 @@ void MacroAssembler::wasmAtomicEffectOp6
     case AtomicFetchXorOp:
       lock_xorq(value.reg, Operand(mem));
       break;
     default:
       MOZ_CRASH();
   }
 }
 
+void MacroAssembler::compareExchange64(const Synchronization&,
+                                       const Address& mem, Register64 expected,
+                                       Register64 replacement,
+                                       Register64 output) {
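+  // CMPXCHG compares rax with the memory operand and leaves the old memory
+  // value in rax, hence the requirement that output be rax.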
+  MOZ_ASSERT(output.reg == rax);
+  if (expected != output) {
+    movq(expected.reg, output.reg);
+  }
+  lock_cmpxchgq(replacement.reg, Operand(mem));
+}
+
+void MacroAssembler::atomicExchange64(const Synchronization&,
+                                      const Address& mem, Register64 value,
+                                      Register64 output) {
+  if (value != output) {
+    movq(value.reg, output.reg);
+  }
+  xchgq(output.reg, Operand(mem));
+}
+
+void MacroAssembler::atomicFetchOp64(const Synchronization& sync, AtomicOp op,
+                                     Register64 value, const Address& mem,
+                                     Register64 temp, Register64 output) {
+  AtomicFetchOp64(*this, nullptr, op, value.reg, mem, temp.reg, output.reg);
+}
+
 //}}} check_macroassembler_style
--- a/js/src/jit/x86-shared/Assembler-x86-shared.h
+++ b/js/src/jit/x86-shared/Assembler-x86-shared.h
@@ -204,16 +204,29 @@ class CPUInfo {
   static bool popcntPresent;
   static bool bmi1Present;
   static bool bmi2Present;
   static bool lzcntPresent;
   static bool needAmdBugWorkaround;
 
   static void SetSSEVersion();
 
+  // The flags can already have been set at startup, because we eagerly JIT
+  // some non-JS code then; the testing functions below therefore reset all
+  // flags before setting any flag explicitly, so that the flags end up in a
+  // consistent state.
+
+  static void reset() {
+    maxSSEVersion = UnknownSSE;
+    maxEnabledSSEVersion = UnknownSSE;
+    avxPresent = false;
+    avxEnabled = false;
+    popcntPresent = false;
+    needAmdBugWorkaround = false;
+  }
+
  public:
   static bool IsSSE2Present() {
 #ifdef JS_CODEGEN_X64
     return true;
 #else
     return GetSSEVersion() >= SSE2;
 #endif
   }
@@ -223,24 +236,29 @@ class CPUInfo {
   static bool IsSSE42Present() { return GetSSEVersion() >= SSE4_2; }
   static bool IsPOPCNTPresent() { return popcntPresent; }
   static bool IsBMI1Present() { return bmi1Present; }
   static bool IsBMI2Present() { return bmi2Present; }
   static bool IsLZCNTPresent() { return lzcntPresent; }
   static bool NeedAmdBugWorkaround() { return needAmdBugWorkaround; }
 
   static void SetSSE3Disabled() {
+    reset();
     maxEnabledSSEVersion = SSE2;
     avxEnabled = false;
   }
   static void SetSSE4Disabled() {
+    reset();
     maxEnabledSSEVersion = SSSE3;
     avxEnabled = false;
   }
-  static void SetAVXEnabled() { avxEnabled = true; }
+  static void SetAVXEnabled() {
+    reset();
+    avxEnabled = true;
+  }
 };
 
 class AssemblerX86Shared : public AssemblerShared {
  protected:
   struct RelativePatch {
     int32_t offset;
     void* target;
     RelocationKind kind;
--- a/js/src/jit/x86-shared/AtomicOperations-x86-shared-gcc.h
+++ b/js/src/jit/x86-shared/AtomicOperations-x86-shared-gcc.h
@@ -53,16 +53,25 @@
 // assumes the code is race free.  This supposedly means C++ will allow some
 // instruction reorderings (effectively those allowed by TSO) even for seq_cst
 // ordered operations, but these reorderings are not allowed by JS.  To do
 // better we will end up with inline assembler or JIT-generated code.
 
 // For now, we require that the C++ compiler's atomics are lock free, even for
 // 64-bit accesses.
 
+inline bool js::jit::AtomicOperations::Initialize() {
+  // Nothing
+  return true;
+}
+
+inline void js::jit::AtomicOperations::ShutDown() {
+  // Nothing
+}
+
 // When compiling with Clang on 32-bit linux it will be necessary to link with
 // -latomic to get the proper 64-bit intrinsics.
 
 inline bool js::jit::AtomicOperations::hasAtomic8() { return true; }
 
 inline bool js::jit::AtomicOperations::isLockfree8() {
   MOZ_ASSERT(__atomic_always_lock_free(sizeof(int8_t), 0));
   MOZ_ASSERT(__atomic_always_lock_free(sizeof(int16_t), 0));
--- a/js/src/jit/x86-shared/AtomicOperations-x86-shared-msvc.h
+++ b/js/src/jit/x86-shared/AtomicOperations-x86-shared-msvc.h
@@ -32,16 +32,25 @@
 // using those functions in many cases here (though not all).  I have not done
 // so because (a) I don't yet know how far back those functions are supported
 // and (b) I expect we'll end up dropping into assembler here eventually so as
 // to guarantee that the C++ compiler won't optimize the code.
 
 // Note, _InterlockedCompareExchange takes the *new* value as the second
 // argument and the *comparand* (expected old value) as the third argument.
 
+inline bool js::jit::AtomicOperations::Initialize() {
+  // Nothing
+  return true;
+}
+
+inline void js::jit::AtomicOperations::ShutDown() {
+  // Nothing
+}
+
 inline bool js::jit::AtomicOperations::hasAtomic8() { return true; }
 
 inline bool js::jit::AtomicOperations::isLockfree8() {
   // The MSDN docs suggest very strongly that if code is compiled for Pentium
   // or better the 64-bit primitives will be lock-free, see eg the "Remarks"
   // secion of the page for _InterlockedCompareExchange64, currently here:
   // https://msdn.microsoft.com/en-us/library/ttk2z1ws%28v=vs.85%29.aspx
   //
--- a/js/src/vm/Initialization.cpp
+++ b/js/src/vm/Initialization.cpp
@@ -12,16 +12,17 @@
 
 #include <ctype.h>
 
 #include "jstypes.h"
 
 #include "builtin/AtomicsObject.h"
 #include "ds/MemoryProtectionExceptionHandler.h"
 #include "gc/Statistics.h"
+#include "jit/AtomicOperations.h"
 #include "jit/ExecutableAllocator.h"
 #include "jit/Ion.h"
 #include "jit/JitCommon.h"
 #include "js/Utility.h"
 #if ENABLE_INTL_API
 #  include "unicode/uclean.h"
 #  include "unicode/utypes.h"
 #endif  // ENABLE_INTL_API
@@ -122,16 +123,18 @@ JS_PUBLIC_API const char* JS::detail::In
   RETURN_IF_FAIL(js::jit::InitializeIon());
 
   RETURN_IF_FAIL(js::InitDateTimeState());
 
 #ifdef MOZ_VTUNE
   RETURN_IF_FAIL(js::vtune::Initialize());
 #endif
 
+  RETURN_IF_FAIL(js::jit::AtomicOperations::Initialize());
+
 #if EXPOSE_INTL_API
   UErrorCode err = U_ZERO_ERROR;
   u_init(&err);
   if (U_FAILURE(err)) {
     return "u_init() failed";
   }
 #endif  // EXPOSE_INTL_API
 
@@ -170,16 +173,18 @@ JS_PUBLIC_API void JS_ShutDown(void) {
   FutexThread::destroy();
 
   js::DestroyHelperThreadsState();
 
 #ifdef JS_SIMULATOR
   js::jit::SimulatorProcess::destroy();
 #endif
 
+  js::jit::AtomicOperations::ShutDown();
+
 #ifdef JS_TRACE_LOGGING
   js::DestroyTraceLoggerThreadState();
   js::DestroyTraceLoggerGraphState();
 #endif
 
   js::MemoryProtectionExceptionHandler::uninstall();
 
   js::wasm::ShutDown();