Update pixman to ad3cbfb073fc325e1b3152898ca71b8255675957
author: Jeff Muizelaar <jmuizelaar@mozilla.com>
date:   Mon, 28 Mar 2011 13:52:09 -0400
changeset 64161 820512b8223e43485f30276c8e062650cb948883
parent 64160 af61c4752e53c9623a3bc7fc0c05044d37cd3d08
child 64163 a2937d08aefbd86fd8f852565f15daeadac0ad07
push id:   19319
push user: eakhgari@mozilla.com
push date: Tue, 29 Mar 2011 18:19:24 +0000
milestone: 2.2a1pre
first release with: nightly 4.2a1pre / 820512b8223e / 20110329123450 (linux32, linux64, mac, win32)
Update pixman to ad3cbfb073fc325e1b3152898ca71b8255675957

Alan Coopersmith (1):
      Sun's copyrights belong to Oracle now

Alexandros Frantzis (2):
      Add simple support for the r8g8b8a8 and r8g8b8x8 formats.
      Add support for the r8g8b8a8 and r8g8b8x8 formats to the tests.

Andrea Canciani (14):
      Improve precision of linear gradients
      Make classification consistent with rasterization
      Remove unused enum value
      Fix an overflow in the new radial gradient code
      Remove unused stop_range field
      Fix opacity check
      Improve conical gradients opacity check
      Improve handling of tangent circles
      Add a test for radial gradients
      Fix compilation on Win32
      test: Fix tests for compilation on Windows
      test: Add Makefile for Win32
      Do not include unused headers
      test: Silence MSVC warnings

Cyril Brulebois (2):
      Fix argument quoting for AC_INIT.
      Fix linking issues when HAVE_FEENABLEEXCEPT is set.

Jon TURNEY (2):
      Plug another leak in alphamap test
      Remove stray #include <fenv.h>

Rolland Dudemaine (4):
      test: Fix for mismatched 'fence_malloc' prototype/implementation
      Correct the initialization of 'max_vx'
      test: Use the right enum types instead of int to fix warnings
      Fix "variable was set but never used" warnings

Scott McCreary (1):
      Added check to find pthread on Haiku.

Siarhei Siamashka (62):
      Fixed broken configure check for __thread support
      Do CPU features detection from 'constructor' function when compiled with gcc
      ARM: fix 'vld1.8'->'vld1.32' typo in add_8888_8888 NEON fast path
      ARM: NEON: source image pixel fetcher can be overrided now
      ARM: nearest scaling support for NEON scanline compositing functions
      ARM: macro template in C code to simplify using scaled fast paths
      ARM: performance tuning of NEON nearest scaled pixel fetcher
      ARM: NEON optimization for scaled over_8888_8888 with nearest filter
      ARM: NEON optimization for scaled over_8888_0565 with nearest filter
      ARM: NEON optimization for scaled src_8888_0565 with nearest filter
      ARM: NEON optimization for scaled src_0565_8888 with nearest filter
      ARM: optimization for scaled src_0565_0565 with nearest filter
      C fast path for a1 fill operation
      ARM: added 'neon_composite_over_n_8_8' fast path
      ARM: introduced 'fetch_mask_pixblock' macro to simplify code
      ARM: better NEON instructions scheduling for over_n_8_0565
      ARM: added 'neon_composite_over_8888_n_0565' fast path
      ARM: reuse common NEON code for over_{n_8|8888_n|8888_8}_0565
      ARM: added 'neon_composite_over_0565_n_0565' fast path
      ARM: added 'neon_composite_add_8888_8_8888' fast path
      ARM: better NEON instructions scheduling for add_8888_8888_8888
      ARM: added 'neon_composite_add_n_8_8888' fast path
      ARM: added 'neon_composite_add_8888_n_8888' fast path
      ARM: added flags parameter to some asm fast path wrapper macros
      ARM: added 'neon_composite_in_n_8' fast path
      ARM: added 'neon_src_rpixbuf_8888' fast path
      Fix for potential unaligned memory accesses
      COPYING: added Nokia to the list of copyright holders
      Revert "Fix "syntax error: empty declaration" warnings."
      Fix for "syntax error: empty declaration" Solaris Studio warnings
      Workaround for a preprocessor issue in old Sun Studio
      Bugfix for a corner case in 'pixman_transform_is_inverse'
      Make 'fast_composite_scaled_nearest_*' less suspicious
      A new configure option --enable-static-testprogs
      ARM: do /proc/self/auxv based cpu features detection only in linux
      The code in 'bitmap_addrect' already assumes non-null 'reg->data'
      test: affine-test updated to stress 90/180/270 degrees rotation more
      New flags for 90/180/270 rotation
      C fast paths for a simple 90/270 degrees rotation
      Use const modifiers for source buffers in nearest scaling fast paths
      test: Extend scaling-test to support a8/solid mask and ADD operation
      Support for a8 and solid mask in nearest scaling main loop template
      Better support for NONE repeat in nearest scaling main loop template
      ARM: new macro template for using scaled fast paths with a8 mask
      ARM: NEON optimization for nearest scaled over_8888_8_0565
      ARM: NEON optimization for nearest scaled over_0565_8_0565
      SSE2 optimization for nearest scaled over_8888_n_8888
      Ensure that tests run as the last step of a build for 'make check'
      Main loop template for fast single pass bilinear scaling
      test: check correctness of 'bilinear_pad_repeat_get_scanline_bounds'
      SSE2 optimization for bilinear scaled 'src_8888_8888'
      ARM: NEON optimization for bilinear scaled 'src_8888_8888'
      ARM: use prefetch in nearest scaled 'src_0565_0565'
      ARM: common macro for nearest scaling fast paths
      ARM: assembly optimized nearest scaled 'src_8888_8888'
      ARM: new bilinear fast path template macro in 'pixman-arm-common.h'
      ARM: NEON: common macro template for bilinear scanline scalers
      ARM: use common macro template for bilinear scaled 'src_8888_8888'
      ARM: NEON optimization for bilinear scaled 'src_8888_0565'
      ARM: NEON optimization for bilinear scaled 'src_0565_x888'
      ARM: NEON optimization for bilinear scaled 'src_0565_0565'
      ARM: a bit faster NEON bilinear scaling for r5g6b5 source images

Søren Sandmann Pedersen (79):
      Remove the class field from source_image_t
      Pre-release version bump to 0.19.6
      Post-release version bump to 0.19.7
      Pre-release version bump to 0.20.0
      Post-release version bump to 0.20.1
      Version bump 0.21.1.
      COPYING: Stop saying that a modification is currently under discussion.
      Remove workaround for a bug in the 1.6 X server.
      [mmx] Mark some of the output variables as early-clobber.
      Delete the source_image_t struct.
      Generate {a,x}8r8g8b8, a8, 565 fetchers for nearest/affine images
      Pre-release version bump
      Post-release version bump to 0.21.3
      test: Make composite test use some existing macros instead of defining its own
      Add enable_fp_exceptions() function in utils.[ch]
      Extend gradient-crash-test
      test: Move palette initialization to utils.[ch]
      test/utils.c: Initialize palette->rgba to 0.
      Make the argument to fence_malloc() an int64_t
      Add a stress-test program.
      Add a test compositing with the various PDF operators.
      Fix divide-by-zero in set_lum().
      sse2: Skip src pixels that are zero in sse2_composite_over_8888_n_8888()
      Add iterators in the general implementation
      Move initialization of iterators for bits images to pixman-bits-image.c
      Eliminate the _pixman_image_store_scanline_32/64 functions
      Move iterator initialization to the respective image files
      Virtualize iterator initialization
      Use an iterator in pixman_image_get_solid()
      Move get_scanline_32/64 to the bits part of the image struct
      Allow NULL property_changed function
      Consolidate the various get_scanline_32() into get_scanline_narrow()
      Linear: Optimize for horizontal gradients
      Get rid of the classify methods
      Add direct-write optimization back
      Skip fetching pixels when possible
      Turn on testing for destination transformation
      Fix destination fetching
      Fix dangling-pointer bug in bits_image_fetch_bilinear_no_repeat_8888().
      Pre-release version bump to 0.21.4
      Post-release version bump to 0.21.5
      Print a warning when a development snapshot is being configured.
      Move fallback decisions from implementations into pixman-cpu.c.
      Add a test for over_x888_8_0565 in lowlevel_blt_bench().
      Add SSE2 fetcher for x8r8g8b8
      Add SSE2 fetcher for a8
      Improve performance of sse2_combine_over_u()
      Add SSE2 fetcher for 0565
      Add pixman-conical-gradient.c to Makefile.win32.
      Move all the GTK+ based test programs to a new subdir, "demos"
      Add @TESTPROGS_EXTRA_LDFLAGS@ to AM_LDFLAGS
      test/Makefile.am: Move all the TEST_LDADD into a new global LDADD.
      Add pixman_composite_trapezoids().
      Add a test program for pixman_composite_trapezoids().
      Add support for triangles to pixman.
      Add a test program, tri-test
      Optimize adding opaque trapezoids onto a8 destination.
      Add new public function pixman_add_triangles()
      Avoid marking images dirty when properties are reset
      In pixman_image_set_transform() allow NULL for transform
      Coding style: core_combine_in_u_pixelsse2 -> core_combine_in_u_pixel_sse2
      sse2: Convert all uses of MMX registers to use SSE2 registers instead.
      sse2: Delete unused MMX functions and constants and all _mm_empty()s
      sse2: Don't compile pixman-sse2.c with -mmmx anymore
      sse2: Remove all the core_combine_* functions
      sse2: Delete obsolete or redundant comments
      sse2: Remove pixman-x64-mmx-emulation.h
      sse2: Minor coding style cleanups.
      Delete pixman-x64-mmx-emulation.h from pixman/Makefile.am
      Minor fix to the RELEASING file
      Pre-release version bump to 0.21.6
      Post-release version bump to 0.21.7
      test: In image_endian_swap() use pixman_image_get_format() to get the bpp.
      test: Do endian swapping of the source and destination images.
      In delegate_{src,dest}_iter_init() call delegate directly.
      Fill out parts of iters in _pixman_implementation_{src,dest}_iter_init()
      Simplify the prototype for iterator initializers.
      test: Randomize some tests if PIXMAN_RANDOMIZE_TESTS is set
      test: Fix infinite loop in composite
gfx/cairo/libpixman/src/Makefile.in
gfx/cairo/libpixman/src/pixman-access.c
gfx/cairo/libpixman/src/pixman-arm-common.h
gfx/cairo/libpixman/src/pixman-arm-neon-asm.S
gfx/cairo/libpixman/src/pixman-arm-neon-asm.h
gfx/cairo/libpixman/src/pixman-arm-neon.c
gfx/cairo/libpixman/src/pixman-arm-simd-asm.S
gfx/cairo/libpixman/src/pixman-arm-simd.c
gfx/cairo/libpixman/src/pixman-bits-image.c
gfx/cairo/libpixman/src/pixman-combine.c.template
gfx/cairo/libpixman/src/pixman-combine.h.template
gfx/cairo/libpixman/src/pixman-combine32.c
gfx/cairo/libpixman/src/pixman-combine32.h
gfx/cairo/libpixman/src/pixman-combine64.c
gfx/cairo/libpixman/src/pixman-combine64.h
gfx/cairo/libpixman/src/pixman-compiler.h
gfx/cairo/libpixman/src/pixman-conical-gradient.c
gfx/cairo/libpixman/src/pixman-cpu.c
gfx/cairo/libpixman/src/pixman-fast-path.c
gfx/cairo/libpixman/src/pixman-fast-path.h
gfx/cairo/libpixman/src/pixman-general.c
gfx/cairo/libpixman/src/pixman-image.c
gfx/cairo/libpixman/src/pixman-implementation.c
gfx/cairo/libpixman/src/pixman-linear-gradient.c
gfx/cairo/libpixman/src/pixman-matrix.c
gfx/cairo/libpixman/src/pixman-mmx.c
gfx/cairo/libpixman/src/pixman-private.h
gfx/cairo/libpixman/src/pixman-radial-gradient.c
gfx/cairo/libpixman/src/pixman-region.c
gfx/cairo/libpixman/src/pixman-solid-fill.c
gfx/cairo/libpixman/src/pixman-sse2.c
gfx/cairo/libpixman/src/pixman-trap.c
gfx/cairo/libpixman/src/pixman-utils.c
gfx/cairo/libpixman/src/pixman-version.h
gfx/cairo/libpixman/src/pixman-vmx.c
gfx/cairo/libpixman/src/pixman.c
gfx/cairo/libpixman/src/pixman.h
layout/reftests/css-gradients/reftest.list
--- a/gfx/cairo/libpixman/src/Makefile.in
+++ b/gfx/cairo/libpixman/src/Makefile.in
@@ -68,18 +68,25 @@ DEFINES += -DPIXMAN_NO_TLS
 # Build MMX code either with VC or with gcc-on-x86
 ifdef _MSC_VER
 ifeq (86,$(findstring 86,$(OS_TEST)))
 ifneq (64,$(findstring 64,$(OS_TEST)))
 USE_MMX=1
 endif
 USE_SSE2=1
 MMX_CFLAGS=
+ifneq (,$(filter 1400 1500, $(_MSC_VER)))
+# MSVC 2005 and 2008 generate code that breaks alignment
+# restrictions in debug mode so always optimize.
+# See bug 640250 for more info.
+SSE2_CFLAGS=-O2
+else
 SSE2_CFLAGS=
 endif
+endif
 ifeq (arm,$(findstring arm,$(OS_TEST)))
 USE_ARM_SIMD_MSVC=1
 endif
 endif
 
 ifdef GNU_CC
 ifeq (86,$(findstring 86,$(OS_TEST)))
 USE_MMX=1
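
Annotation, not part of the patch: the -O2 workaround above exists because
aligned SSE2 memory accesses fault at runtime when the address is not
16-byte aligned, and MSVC 2005/2008 debug code generation could break that
alignment guarantee (bug 640250). A minimal sketch of the constraint, with
a hypothetical add_four_pixels_sse2 helper:

#include <emmintrin.h>
#include <stdint.h>

/* 'dst', 'a' and 'b' must be 16-byte aligned: _mm_load_si128 and
 * _mm_store_si128 trap on unaligned addresses.  _mm_loadu_si128 /
 * _mm_storeu_si128 are the safe-but-slower unaligned variants. */
static void
add_four_pixels_sse2 (uint32_t *dst, const uint32_t *a, const uint32_t *b)
{
    __m128i va = _mm_load_si128 ((const __m128i *) a);
    __m128i vb = _mm_load_si128 ((const __m128i *) b);

    _mm_store_si128 ((__m128i *) dst, _mm_add_epi32 (va, vb));
}
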
--- a/gfx/cairo/libpixman/src/pixman-access.c
+++ b/gfx/cairo/libpixman/src/pixman-access.c
@@ -206,16 +206,56 @@ fetch_scanline_b8g8r8x8 (pixman_image_t 
 	*buffer++ = (0xff000000 |
 	             ((p & 0xff000000) >> 24)	|
 	             ((p & 0x00ff0000) >> 8)	|
 	             ((p & 0x0000ff00) << 8));
     }
 }
 
 static void
+fetch_scanline_r8g8b8a8 (pixman_image_t *image,
+                         int             x,
+                         int             y,
+                         int             width,
+                         uint32_t *      buffer,
+                         const uint32_t *mask)
+{
+    const uint32_t *bits = image->bits.bits + y * image->bits.rowstride;
+    const uint32_t *pixel = (uint32_t *)bits + x;
+    const uint32_t *end = pixel + width;
+
+    while (pixel < end)
+    {
+	uint32_t p = READ (image, pixel++);
+
+	*buffer++ = (((p & 0x000000ff) << 24) | (p >> 8));
+    }
+}
+
+static void
+fetch_scanline_r8g8b8x8 (pixman_image_t *image,
+                         int             x,
+                         int             y,
+                         int             width,
+                         uint32_t *      buffer,
+                         const uint32_t *mask)
+{
+    const uint32_t *bits = image->bits.bits + y * image->bits.rowstride;
+    const uint32_t *pixel = (uint32_t *)bits + x;
+    const uint32_t *end = pixel + width;
+    
+    while (pixel < end)
+    {
+	uint32_t p = READ (image, pixel++);
+	
+	*buffer++ = (0xff000000 | (p >> 8));
+    }
+}
+
+static void
 fetch_scanline_x14r6g6b6 (pixman_image_t *image,
                           int             x,
                           int             y,
                           int             width,
                           uint32_t *      buffer,
                           const uint32_t *mask)
 {
     const uint32_t *bits = image->bits.bits + y * image->bits.rowstride;
@@ -1287,16 +1327,38 @@ fetch_pixel_b8g8r8x8 (bits_image_t *imag
     
     return ((0xff000000) |
 	    (pixel & 0xff000000) >> 24 |
 	    (pixel & 0x00ff0000) >> 8 |
 	    (pixel & 0x0000ff00) << 8);
 }
 
 static uint32_t
+fetch_pixel_r8g8b8a8 (bits_image_t *image,
+		      int           offset,
+		      int           line)
+{
+    uint32_t *bits = image->bits + line * image->rowstride;
+    uint32_t pixel = READ (image, (uint32_t *)bits + offset);
+    
+    return (((pixel & 0x000000ff) << 24) | (pixel >> 8));
+}
+
+static uint32_t
+fetch_pixel_r8g8b8x8 (bits_image_t *image,
+		      int           offset,
+		      int           line)
+{
+    uint32_t *bits = image->bits + line * image->rowstride;
+    uint32_t pixel = READ (image, (uint32_t *)bits + offset);
+    
+    return (0xff000000 | (pixel >> 8));
+}
+
+static uint32_t
 fetch_pixel_x14r6g6b6 (bits_image_t *image,
                        int           offset,
                        int           line)
 {
     uint32_t *bits = image->bits + line * image->rowstride;
     uint32_t pixel = READ (image, (uint32_t *) bits + offset);
     uint32_t r, g, b;
 
@@ -2023,16 +2085,49 @@ store_scanline_b8g8r8x8 (bits_image_t * 
 	WRITE (image, pixel++,
 	       ((values[i] >>  8) & 0x0000ff00) |
 	       ((values[i] <<  8) & 0x00ff0000) |
 	       ((values[i] << 24) & 0xff000000));
     }
 }
 
 static void
+store_scanline_r8g8b8a8 (bits_image_t *  image,
+                         int             x,
+                         int             y,
+                         int             width,
+                         const uint32_t *values)
+{
+    uint32_t *bits = image->bits + image->rowstride * y;
+    uint32_t *pixel = (uint32_t *)bits + x;
+    int i;
+    
+    for (i = 0; i < width; ++i)
+    {
+	WRITE (image, pixel++,
+	       ((values[i] >> 24) & 0x000000ff) | (values[i] << 8));
+    }
+}
+
+static void
+store_scanline_r8g8b8x8 (bits_image_t *  image,
+                         int             x,
+                         int             y,
+                         int             width,
+                         const uint32_t *values)
+{
+    uint32_t *bits = image->bits + image->rowstride * y;
+    uint32_t *pixel = (uint32_t *)bits + x;
+    int i;
+    
+    for (i = 0; i < width; ++i)
+	WRITE (image, pixel++, (values[i] << 8));
+}
+
+static void
 store_scanline_x14r6g6b6 (bits_image_t *  image,
                           int             x,
                           int             y,
                           int             width,
                           const uint32_t *values)
 {
     uint32_t *bits = image->bits + image->rowstride * y;
     uint32_t *pixel = ((uint32_t *) bits) + x;
@@ -2840,16 +2935,18 @@ static const format_info_t accessors[] =
 {
 /* 32 bpp formats */
     FORMAT_INFO (a8r8g8b8),
     FORMAT_INFO (x8r8g8b8),
     FORMAT_INFO (a8b8g8r8),
     FORMAT_INFO (x8b8g8r8),
     FORMAT_INFO (b8g8r8a8),
     FORMAT_INFO (b8g8r8x8),
+    FORMAT_INFO (r8g8b8a8),
+    FORMAT_INFO (r8g8b8x8),
     FORMAT_INFO (x14r6g6b6),
 
 /* 24bpp formats */
     FORMAT_INFO (r8g8b8),
     FORMAT_INFO (b8g8r8),
     
 /* 16bpp formats */
     FORMAT_INFO (r5g6b5),
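
Annotation, not part of the patch: the new r8g8b8a8/r8g8b8x8 accessors
above are byte rotations between the r8g8b8a8 memory layout, which keeps
alpha in the low byte of the 32-bit word, and pixman's internal a8r8g8b8
layout, which keeps alpha in the high byte (the x8 variants simply force
alpha to 0xff on fetch). A minimal self-contained check of that round trip:

#include <assert.h>
#include <stdint.h>

static uint32_t
fetch_r8g8b8a8 (uint32_t p)            /* 0xRRGGBBAA -> 0xAARRGGBB */
{
    return ((p & 0x000000ff) << 24) | (p >> 8);
}

static uint32_t
store_r8g8b8a8 (uint32_t v)            /* 0xAARRGGBB -> 0xRRGGBBAA */
{
    return ((v >> 24) & 0x000000ff) | (v << 8);
}

int
main (void)
{
    uint32_t p = 0x11223380;           /* R=0x11 G=0x22 B=0x33 A=0x80 */

    assert (fetch_r8g8b8a8 (p) == 0x80112233);
    assert (store_r8g8b8a8 (fetch_r8g8b8a8 (p)) == p);
    return 0;
}
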
--- a/gfx/cairo/libpixman/src/pixman-arm-common.h
+++ b/gfx/cairo/libpixman/src/pixman-arm-common.h
@@ -21,16 +21,18 @@
  * DEALINGS IN THE SOFTWARE.
  *
  * Author:  Siarhei Siamashka (siarhei.siamashka@nokia.com)
  */
 
 #ifndef PIXMAN_ARM_COMMON_H
 #define PIXMAN_ARM_COMMON_H
 
+#include "pixman-fast-path.h"
+
 /* Define some macros which can expand into proxy functions between
  * ARM assembly optimized functions and the rest of pixman fast path API.
  *
  * All the low level ARM assembly functions have to use ARM EABI
  * calling convention and take up to 8 arguments:
  *    width, height, dst, dst_stride, src, src_stride, mask, mask_stride
  *
  * The arguments are ordered with the most important coming first (the
@@ -40,16 +42,19 @@
  * omitted when doing a function call.
  *
  * Arguments 'src' and 'mask' contain either a pointer to the top left
  * pixel of the composited rectangle or a pixel color value depending
  * on the function type. In the case of just a color value (solid source
  * or mask), the corresponding stride argument is unused.
  */
 
+#define SKIP_ZERO_SRC  1
+#define SKIP_ZERO_MASK 2
+
 #define PIXMAN_ARM_BIND_FAST_PATH_SRC_DST(cputype, name,                \
                                           src_type, src_cnt,            \
                                           dst_type, dst_cnt)            \
 void                                                                    \
 pixman_composite_##name##_asm_##cputype (int32_t   w,                   \
                                          int32_t   h,                   \
                                          dst_type *dst,                 \
                                          int32_t   dst_stride,          \
@@ -80,17 +85,17 @@ cputype##_composite_##name (pixman_imple
     PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, dst_type,         \
                            dst_stride, dst_line, dst_cnt);              \
                                                                         \
     pixman_composite_##name##_asm_##cputype (width, height,             \
                                              dst_line, dst_stride,      \
                                              src_line, src_stride);     \
 }
 
-#define PIXMAN_ARM_BIND_FAST_PATH_N_DST(cputype, name,                  \
+#define PIXMAN_ARM_BIND_FAST_PATH_N_DST(flags, cputype, name,           \
                                         dst_type, dst_cnt)              \
 void                                                                    \
 pixman_composite_##name##_asm_##cputype (int32_t    w,                  \
                                          int32_t    h,                  \
                                          dst_type  *dst,                \
                                          int32_t    dst_stride,         \
                                          uint32_t   src);               \
                                                                         \
@@ -108,30 +113,31 @@ cputype##_composite_##name (pixman_imple
                             int32_t                  dest_y,            \
                             int32_t                  width,             \
                             int32_t                  height)            \
 {                                                                       \
     dst_type  *dst_line;                                                \
     int32_t    dst_stride;                                              \
     uint32_t   src;                                                     \
                                                                         \
-    src = _pixman_image_get_solid (src_image, dst_image->bits.format);  \
+    src = _pixman_image_get_solid (					\
+	imp, src_image, dst_image->bits.format);			\
                                                                         \
-    if (src == 0)                                                       \
+    if ((flags & SKIP_ZERO_SRC) && src == 0)                            \
 	return;                                                         \
                                                                         \
     PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, dst_type,         \
                            dst_stride, dst_line, dst_cnt);              \
                                                                         \
     pixman_composite_##name##_asm_##cputype (width, height,             \
                                              dst_line, dst_stride,      \
                                              src);                      \
 }
 
-#define PIXMAN_ARM_BIND_FAST_PATH_N_MASK_DST(cputype, name,             \
+#define PIXMAN_ARM_BIND_FAST_PATH_N_MASK_DST(flags, cputype, name,      \
                                              mask_type, mask_cnt,       \
                                              dst_type, dst_cnt)         \
 void                                                                    \
 pixman_composite_##name##_asm_##cputype (int32_t    w,                  \
                                          int32_t    h,                  \
                                          dst_type  *dst,                \
                                          int32_t    dst_stride,         \
                                          uint32_t   src,                \
@@ -154,33 +160,34 @@ cputype##_composite_##name (pixman_imple
                             int32_t                  width,             \
                             int32_t                  height)            \
 {                                                                       \
     dst_type  *dst_line;                                                \
     mask_type *mask_line;                                               \
     int32_t    dst_stride, mask_stride;                                 \
     uint32_t   src;                                                     \
                                                                         \
-    src = _pixman_image_get_solid (src_image, dst_image->bits.format);  \
+    src = _pixman_image_get_solid (					\
+	imp, src_image, dst_image->bits.format);			\
                                                                         \
-    if (src == 0)                                                       \
+    if ((flags & SKIP_ZERO_SRC) && src == 0)                            \
 	return;                                                         \
                                                                         \
     PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, dst_type,         \
                            dst_stride, dst_line, dst_cnt);              \
     PIXMAN_IMAGE_GET_LINE (mask_image, mask_x, mask_y, mask_type,       \
                            mask_stride, mask_line, mask_cnt);           \
                                                                         \
     pixman_composite_##name##_asm_##cputype (width, height,             \
                                              dst_line, dst_stride,      \
                                              src, 0,                    \
                                              mask_line, mask_stride);   \
 }
 
-#define PIXMAN_ARM_BIND_FAST_PATH_SRC_N_DST(cputype, name,              \
+#define PIXMAN_ARM_BIND_FAST_PATH_SRC_N_DST(flags, cputype, name,       \
                                             src_type, src_cnt,          \
                                             dst_type, dst_cnt)          \
 void                                                                    \
 pixman_composite_##name##_asm_##cputype (int32_t    w,                  \
                                          int32_t    h,                  \
                                          dst_type  *dst,                \
                                          int32_t    dst_stride,         \
                                          src_type  *src,                \
@@ -202,19 +209,20 @@ cputype##_composite_##name (pixman_imple
                             int32_t                  width,             \
                             int32_t                  height)            \
 {                                                                       \
     dst_type  *dst_line;                                                \
     src_type  *src_line;                                                \
     int32_t    dst_stride, src_stride;                                  \
     uint32_t   mask;                                                    \
                                                                         \
-    mask = _pixman_image_get_solid (mask_image, dst_image->bits.format);\
+    mask = _pixman_image_get_solid (					\
+	imp, mask_image, dst_image->bits.format);			\
                                                                         \
-    if (mask == 0)                                                      \
+    if ((flags & SKIP_ZERO_MASK) && mask == 0)                          \
 	return;                                                         \
                                                                         \
     PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, dst_type,         \
                            dst_stride, dst_line, dst_cnt);              \
     PIXMAN_IMAGE_GET_LINE (src_image, src_x, src_y, src_type,           \
                            src_stride, src_line, src_cnt);              \
                                                                         \
     pixman_composite_##name##_asm_##cputype (width, height,             \
@@ -265,9 +273,137 @@ cputype##_composite_##name (pixman_imple
                            mask_stride, mask_line, mask_cnt);           \
                                                                         \
     pixman_composite_##name##_asm_##cputype (width, height,             \
                                              dst_line, dst_stride,      \
                                              src_line, src_stride,      \
                                              mask_line, mask_stride);   \
 }
 
+#define PIXMAN_ARM_BIND_SCALED_NEAREST_SRC_DST(cputype, name, op,             \
+                                               src_type, dst_type)            \
+void                                                                          \
+pixman_scaled_nearest_scanline_##name##_##op##_asm_##cputype (                \
+                                                   int32_t          w,        \
+                                                   dst_type *       dst,      \
+                                                   const src_type * src,      \
+                                                   pixman_fixed_t   vx,       \
+                                                   pixman_fixed_t   unit_x);  \
+                                                                              \
+static force_inline void                                                      \
+scaled_nearest_scanline_##cputype##_##name##_##op (dst_type *       pd,       \
+                                                   const src_type * ps,       \
+                                                   int32_t          w,        \
+                                                   pixman_fixed_t   vx,       \
+                                                   pixman_fixed_t   unit_x,   \
+                                                   pixman_fixed_t   max_vx,   \
+                                                   pixman_bool_t    zero_src) \
+{                                                                             \
+    pixman_scaled_nearest_scanline_##name##_##op##_asm_##cputype (w, pd, ps,  \
+                                                                  vx, unit_x);\
+}                                                                             \
+                                                                              \
+FAST_NEAREST_MAINLOOP (cputype##_##name##_cover_##op,                         \
+                       scaled_nearest_scanline_##cputype##_##name##_##op,     \
+                       src_type, dst_type, COVER)                             \
+FAST_NEAREST_MAINLOOP (cputype##_##name##_none_##op,                          \
+                       scaled_nearest_scanline_##cputype##_##name##_##op,     \
+                       src_type, dst_type, NONE)                              \
+FAST_NEAREST_MAINLOOP (cputype##_##name##_pad_##op,                           \
+                       scaled_nearest_scanline_##cputype##_##name##_##op,     \
+                       src_type, dst_type, PAD)
+
+/* Provide entries for the fast path table */
+#define PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH(op,s,d,func)                      \
+    SIMPLE_NEAREST_FAST_PATH_COVER (op,s,d,func),                             \
+    SIMPLE_NEAREST_FAST_PATH_NONE (op,s,d,func),                              \
+    SIMPLE_NEAREST_FAST_PATH_PAD (op,s,d,func)
+
+#define PIXMAN_ARM_BIND_SCALED_NEAREST_SRC_A8_DST(flags, cputype, name, op,   \
+                                                  src_type, dst_type)         \
+void                                                                          \
+pixman_scaled_nearest_scanline_##name##_##op##_asm_##cputype (                \
+                                                   int32_t          w,        \
+                                                   dst_type *       dst,      \
+                                                   const src_type * src,      \
+                                                   pixman_fixed_t   vx,       \
+                                                   pixman_fixed_t   unit_x,   \
+                                                   const uint8_t *  mask);    \
+                                                                              \
+static force_inline void                                                      \
+scaled_nearest_scanline_##cputype##_##name##_##op (const uint8_t *  mask,     \
+                                                   dst_type *       pd,       \
+                                                   const src_type * ps,       \
+                                                   int32_t          w,        \
+                                                   pixman_fixed_t   vx,       \
+                                                   pixman_fixed_t   unit_x,   \
+                                                   pixman_fixed_t   max_vx,   \
+                                                   pixman_bool_t    zero_src) \
+{                                                                             \
+    if ((flags & SKIP_ZERO_SRC) && zero_src)                                  \
+	return;                                                               \
+    pixman_scaled_nearest_scanline_##name##_##op##_asm_##cputype (w, pd, ps,  \
+                                                                  vx, unit_x, \
+                                                                  mask);      \
+}                                                                             \
+                                                                              \
+FAST_NEAREST_MAINLOOP_COMMON (cputype##_##name##_cover_##op,                  \
+                              scaled_nearest_scanline_##cputype##_##name##_##op,\
+                              src_type, uint8_t, dst_type, COVER, TRUE, FALSE)\
+FAST_NEAREST_MAINLOOP_COMMON (cputype##_##name##_none_##op,                   \
+                              scaled_nearest_scanline_##cputype##_##name##_##op,\
+                              src_type, uint8_t, dst_type, NONE, TRUE, FALSE) \
+FAST_NEAREST_MAINLOOP_COMMON (cputype##_##name##_pad_##op,                    \
+                              scaled_nearest_scanline_##cputype##_##name##_##op,\
+                              src_type, uint8_t, dst_type, PAD, TRUE, FALSE)
+
+/* Provide entries for the fast path table */
+#define PIXMAN_ARM_SIMPLE_NEAREST_A8_MASK_FAST_PATH(op,s,d,func)              \
+    SIMPLE_NEAREST_A8_MASK_FAST_PATH_COVER (op,s,d,func),                     \
+    SIMPLE_NEAREST_A8_MASK_FAST_PATH_NONE (op,s,d,func),                      \
+    SIMPLE_NEAREST_A8_MASK_FAST_PATH_PAD (op,s,d,func)
+
+/*****************************************************************************/
+
+#define PIXMAN_ARM_BIND_SCALED_BILINEAR_SRC_DST(flags, cputype, name, op,     \
+                                                src_type, dst_type)           \
+void                                                                          \
+pixman_scaled_bilinear_scanline_##name##_##op##_asm_##cputype (               \
+                                                dst_type *       dst,         \
+                                                const src_type * top,         \
+                                                const src_type * bottom,      \
+                                                int              wt,          \
+                                                int              wb,          \
+                                                pixman_fixed_t   x,           \
+                                                pixman_fixed_t   ux,          \
+                                                int              width);      \
+                                                                              \
+static force_inline void                                                      \
+scaled_bilinear_scanline_##cputype##_##name##_##op (                          \
+                                                dst_type *       dst,         \
+                                                const uint32_t * mask,        \
+                                                const src_type * src_top,     \
+                                                const src_type * src_bottom,  \
+                                                int32_t          w,           \
+                                                int              wt,          \
+                                                int              wb,          \
+                                                pixman_fixed_t   vx,          \
+                                                pixman_fixed_t   unit_x,      \
+                                                pixman_fixed_t   max_vx,      \
+                                                pixman_bool_t    zero_src)    \
+{                                                                             \
+    if ((flags & SKIP_ZERO_SRC) && zero_src)                                  \
+	return;                                                               \
+    pixman_scaled_bilinear_scanline_##name##_##op##_asm_##cputype (           \
+                            dst, src_top, src_bottom, wt, wb, vx, unit_x, w); \
+}                                                                             \
+                                                                              \
+FAST_BILINEAR_MAINLOOP_COMMON (cputype##_##name##_cover_##op,                 \
+                       scaled_bilinear_scanline_##cputype##_##name##_##op,    \
+                       src_type, uint32_t, dst_type, COVER, FALSE, FALSE)     \
+FAST_BILINEAR_MAINLOOP_COMMON (cputype##_##name##_none_##op,                  \
+                       scaled_bilinear_scanline_##cputype##_##name##_##op,    \
+                       src_type, uint32_t, dst_type, NONE, FALSE, FALSE)      \
+FAST_BILINEAR_MAINLOOP_COMMON (cputype##_##name##_pad_##op,                   \
+                       scaled_bilinear_scanline_##cputype##_##name##_##op,    \
+                       src_type, uint32_t, dst_type, PAD, FALSE, FALSE)
+
 #endif
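
Annotation, not part of the patch: the new 'flags' parameter lets each
instantiation of the binding macros decide whether a zero solid color is a
no-op. A rough sketch of what PIXMAN_ARM_BIND_FAST_PATH_N_DST
(SKIP_ZERO_SRC, neon, over_n_0565, uint16_t, 1) would expand to; the
instantiation name is illustrative and the argument list is abbreviated
(the real wrapper takes the full fast-path argument set shown in the macro
above):

static void
neon_composite_over_n_0565 (pixman_implementation_t *imp,
                            pixman_op_t              op,
                            pixman_image_t          *src_image,
                            pixman_image_t          *dst_image,
                            int32_t dest_x, int32_t dest_y,
                            int32_t width, int32_t height)
{
    uint16_t *dst_line;
    int32_t   dst_stride;
    uint32_t  src;

    src = _pixman_image_get_solid (imp, src_image, dst_image->bits.format);

    /* Emitted because flags contained SKIP_ZERO_SRC: compositing OVER
     * with a fully transparent solid source leaves the destination
     * unchanged.  Operators for which src == 0 is not a no-op pass
     * flags == 0 and the test disappears at compile time. */
    if (src == 0)
        return;

    PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, uint16_t,
                           dst_stride, dst_line, 1);

    pixman_composite_over_n_0565_asm_neon (width, height,
                                           dst_line, dst_stride, src);
}
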
--- a/gfx/cairo/libpixman/src/pixman-arm-neon-asm.S
+++ b/gfx/cairo/libpixman/src/pixman-arm-neon-asm.S
@@ -248,17 +248,17 @@
 
 #if 1
 
 .macro pixman_composite_over_8888_0565_process_pixblock_tail_head
         vqadd.u8    d16, d2, d20
     vld1.16     {d4, d5}, [DST_R, :128]!
         vqadd.u8    q9, q0, q11
     vshrn.u16   d6, q2, #8
-    vld4.8      {d0, d1, d2, d3}, [SRC]!
+    fetch_src_pixblock
     vshrn.u16   d7, q2, #3
     vsli.u16    q2, q2, #5
         vshll.u8    q14, d16, #8
                                     PF add PF_X, PF_X, #8
         vshll.u8    q8, d19, #8
                                     PF tst PF_CTL, #0xF
     vsri.u8     d6, d6, #5
                                     PF addne PF_X, PF_X, #8
@@ -290,17 +290,17 @@
 
 #else
 
 /* If we did not care much about the performance, we would just use this... */
 .macro pixman_composite_over_8888_0565_process_pixblock_tail_head
     pixman_composite_over_8888_0565_process_pixblock_tail
     vst1.16     {d28, d29}, [DST_W, :128]!
     vld1.16     {d4, d5}, [DST_R, :128]!
-    vld4.32     {d0, d1, d2, d3}, [SRC]!
+    fetch_src_pixblock
     pixman_composite_over_8888_0565_process_pixblock_head
     cache_preload 8, 8
 .endm
 
 #endif
 
 /*
  * And now the final part. We are using 'generate_composite_function' macro
@@ -428,17 +428,17 @@ generate_composite_function \
     vsri.u16    q14, q8, #5
     vsri.u16    q14, q9, #11
 .endm
 
 .macro pixman_composite_src_8888_0565_process_pixblock_tail_head
         vsri.u16    q14, q8, #5
                                     PF add PF_X, PF_X, #8
                                     PF tst PF_CTL, #0xF
-    vld4.8      {d0, d1, d2, d3}, [SRC]!
+    fetch_src_pixblock
                                     PF addne PF_X, PF_X, #8
                                     PF subne PF_CTL, PF_CTL, #1
         vsri.u16    q14, q9, #11
                                     PF cmp PF_X, ORIG_W
                                     PF pld, [PF_SRC, PF_X, lsl #src_bpp_shift]
     vshll.u8    q8, d1, #8
         vst1.16     {d28, d29}, [DST_W, :128]!
                                     PF subge PF_X, PF_X, ORIG_W
@@ -473,17 +473,17 @@ generate_composite_function \
 
 .macro pixman_composite_src_0565_8888_process_pixblock_tail
 .endm
 
 /* TODO: expand macros and do better instructions scheduling */
 .macro pixman_composite_src_0565_8888_process_pixblock_tail_head
     pixman_composite_src_0565_8888_process_pixblock_tail
     vst4.8     {d28, d29, d30, d31}, [DST_W, :128]!
-    vld1.16    {d0, d1}, [SRC]!
+    fetch_src_pixblock
     pixman_composite_src_0565_8888_process_pixblock_head
     cache_preload 8, 8
 .endm
 
 generate_composite_function \
     pixman_composite_src_0565_8888_asm_neon, 16, 0, 32, \
     FLAG_DST_WRITEONLY | FLAG_DEINTERLEAVE_32BPP, \
     8, /* number of pixels, processed in a single block */ \
@@ -500,17 +500,17 @@ generate_composite_function \
     vqadd.u8    q14, q0, q2
     vqadd.u8    q15, q1, q3
 .endm
 
 .macro pixman_composite_add_8_8_process_pixblock_tail
 .endm
 
 .macro pixman_composite_add_8_8_process_pixblock_tail_head
-    vld1.8      {d0, d1, d2, d3}, [SRC]!
+    fetch_src_pixblock
                                     PF add PF_X, PF_X, #32
                                     PF tst PF_CTL, #0xF
     vld1.8      {d4, d5, d6, d7}, [DST_R, :128]!
                                     PF addne PF_X, PF_X, #32
                                     PF subne PF_CTL, PF_CTL, #1
         vst1.8      {d28, d29, d30, d31}, [DST_W, :128]!
                                     PF cmp PF_X, ORIG_W
                                     PF pld, [PF_SRC, PF_X, lsl #src_bpp_shift]
@@ -532,23 +532,23 @@ generate_composite_function \
     default_cleanup, \
     pixman_composite_add_8_8_process_pixblock_head, \
     pixman_composite_add_8_8_process_pixblock_tail, \
     pixman_composite_add_8_8_process_pixblock_tail_head
 
 /******************************************************************************/
 
 .macro pixman_composite_add_8888_8888_process_pixblock_tail_head
-    vld1.8      {d0, d1, d2, d3}, [SRC]!
+    fetch_src_pixblock
                                     PF add PF_X, PF_X, #8
                                     PF tst PF_CTL, #0xF
-    vld1.8      {d4, d5, d6, d7}, [DST_R, :128]!
+    vld1.32     {d4, d5, d6, d7}, [DST_R, :128]!
                                     PF addne PF_X, PF_X, #8
                                     PF subne PF_CTL, PF_CTL, #1
-        vst1.8      {d28, d29, d30, d31}, [DST_W, :128]!
+        vst1.32     {d28, d29, d30, d31}, [DST_W, :128]!
                                     PF cmp PF_X, ORIG_W
                                     PF pld, [PF_SRC, PF_X, lsl #src_bpp_shift]
                                     PF pld, [PF_DST, PF_X, lsl #dst_bpp_shift]
                                     PF subge PF_X, PF_X, ORIG_W
                                     PF subges PF_CTL, PF_CTL, #0x10
     vqadd.u8    q14, q0, q2
                                     PF ldrgeb DUMMY, [PF_SRC, SRC_STRIDE, lsl #src_bpp_shift]!
                                     PF ldrgeb DUMMY, [PF_DST, DST_STRIDE, lsl #dst_bpp_shift]!
@@ -608,17 +608,17 @@ generate_composite_function_single_scanl
         vrshr.u16   q13, q11, #8
                                     PF addne PF_X, PF_X, #8
                                     PF subne PF_CTL, PF_CTL, #1
         vraddhn.u16 d28, q14, q8
         vraddhn.u16 d29, q15, q9
                                     PF cmp PF_X, ORIG_W
         vraddhn.u16 d30, q12, q10
         vraddhn.u16 d31, q13, q11
-    vld4.8      {d0, d1, d2, d3}, [SRC]!
+    fetch_src_pixblock
                                     PF pld, [PF_SRC, PF_X, lsl #src_bpp_shift]
     vmvn.8      d22, d3
                                     PF pld, [PF_DST, PF_X, lsl #dst_bpp_shift]
         vst4.8      {d28, d29, d30, d31}, [DST_W, :128]!
                                     PF subge PF_X, PF_X, ORIG_W
     vmull.u8    q8, d22, d4
                                     PF subges PF_CTL, PF_CTL, #0x10
     vmull.u8    q9, d22, d5
@@ -662,17 +662,17 @@ generate_composite_function_single_scanl
                                     PF subne PF_CTL, PF_CTL, #1
         vraddhn.u16 d28, q14, q8
         vraddhn.u16 d29, q15, q9
                                     PF cmp PF_X, ORIG_W
         vraddhn.u16 d30, q12, q10
         vraddhn.u16 d31, q13, q11
         vqadd.u8    q14, q0, q14
         vqadd.u8    q15, q1, q15
-    vld4.8      {d0, d1, d2, d3}, [SRC]!
+    fetch_src_pixblock
                                     PF pld, [PF_SRC, PF_X, lsl #src_bpp_shift]
     vmvn.8      d22, d3
                                     PF pld, [PF_DST, PF_X, lsl #dst_bpp_shift]
         vst4.8      {d28, d29, d30, d31}, [DST_W, :128]!
                                     PF subge PF_X, PF_X, ORIG_W
     vmull.u8    q8, d22, d4
                                     PF subges PF_CTL, PF_CTL, #0x10
     vmull.u8    q9, d22, d5
@@ -786,70 +786,121 @@ generate_composite_function \
     pixman_composite_over_reverse_n_8888_process_pixblock_tail_head, \
     28, /* dst_w_basereg */ \
     0,  /* dst_r_basereg */ \
     4,  /* src_basereg   */ \
     24  /* mask_basereg  */
 
 /******************************************************************************/
 
-.macro pixman_composite_over_n_8_0565_process_pixblock_head
-    /* in */
-    vmull.u8    q0, d24, d8
-    vmull.u8    q1, d24, d9
-    vmull.u8    q6, d24, d10
-    vmull.u8    q7, d24, d11
-    vrshr.u16   q10, q0, #8
-    vrshr.u16   q11, q1, #8
-    vrshr.u16   q12, q6, #8
-    vrshr.u16   q13, q7, #8
-    vraddhn.u16 d0, q0, q10
-    vraddhn.u16 d1, q1, q11
-    vraddhn.u16 d2, q6, q12
-    vraddhn.u16 d3, q7, q13
+.macro pixman_composite_over_8888_8_0565_process_pixblock_head
+    vmull.u8    q0,  d24, d8    /* IN for SRC pixels (part1) */
+    vmull.u8    q1,  d24, d9
+    vmull.u8    q6,  d24, d10
+    vmull.u8    q7,  d24, d11
+        vshrn.u16   d6,  q2, #8 /* convert DST_R data to 32-bpp (part1) */
+        vshrn.u16   d7,  q2, #3
+        vsli.u16    q2,  q2, #5
+    vrshr.u16   q8,  q0,  #8    /* IN for SRC pixels (part2) */
+    vrshr.u16   q9,  q1,  #8
+    vrshr.u16   q10, q6,  #8
+    vrshr.u16   q11, q7,  #8
+    vraddhn.u16 d0,  q0,  q8
+    vraddhn.u16 d1,  q1,  q9
+    vraddhn.u16 d2,  q6,  q10
+    vraddhn.u16 d3,  q7,  q11
+        vsri.u8     d6,  d6, #5 /* convert DST_R data to 32-bpp (part2) */
+        vsri.u8     d7,  d7, #6
+    vmvn.8      d3,  d3
+        vshrn.u16   d30, q2, #2
+    vmull.u8    q8,  d3, d6     /* now do alpha blending */
+    vmull.u8    q9,  d3, d7
+    vmull.u8    q10, d3, d30
+.endm
 
-    vshrn.u16   d6, q2, #8
-    vshrn.u16   d7, q2, #3
-    vsli.u16    q2, q2, #5
-    vsri.u8     d6, d6, #5
-    vmvn.8      d3, d3
-    vsri.u8     d7, d7, #6
-    vshrn.u16   d30, q2, #2
-    /* now do alpha blending */
-    vmull.u8    q10, d3, d6
-    vmull.u8    q11, d3, d7
-    vmull.u8    q12, d3, d30
-    vrshr.u16   q13, q10, #8
-    vrshr.u16   q3, q11, #8
-    vrshr.u16   q15, q12, #8
-    vraddhn.u16 d20, q10, q13
-    vraddhn.u16 d23, q11, q3
-    vraddhn.u16 d22, q12, q15
+.macro pixman_composite_over_8888_8_0565_process_pixblock_tail
+    /* 3 cycle bubble (after vmull.u8) */
+    vrshr.u16   q13, q8,  #8
+    vrshr.u16   q11, q9,  #8
+    vrshr.u16   q15, q10, #8
+    vraddhn.u16 d16, q8,  q13
+    vraddhn.u16 d27, q9,  q11
+    vraddhn.u16 d26, q10, q15
+    vqadd.u8    d16, d2,  d16
+    /* 1 cycle bubble */
+    vqadd.u8    q9,  q0,  q13
+    vshll.u8    q14, d16, #8    /* convert to 16bpp */
+    vshll.u8    q8,  d19, #8
+    vshll.u8    q9,  d18, #8
+    vsri.u16    q14, q8,  #5
+    /* 1 cycle bubble */
+    vsri.u16    q14, q9,  #11
 .endm
 
-.macro pixman_composite_over_n_8_0565_process_pixblock_tail
-    vqadd.u8    d16, d2, d20
-    vqadd.u8    q9, q0, q11
-    /* convert to r5g6b5 */
-    vshll.u8    q14, d16, #8
-    vshll.u8    q8, d19, #8
-    vshll.u8    q9, d18, #8
-    vsri.u16    q14, q8, #5
-    vsri.u16    q14, q9, #11
+.macro pixman_composite_over_8888_8_0565_process_pixblock_tail_head
+    vld1.16     {d4, d5}, [DST_R, :128]!
+    vshrn.u16   d6,  q2,  #8
+    fetch_mask_pixblock
+    vshrn.u16   d7,  q2,  #3
+    fetch_src_pixblock
+    vmull.u8    q6,  d24, d10
+        vrshr.u16   q13, q8,  #8
+        vrshr.u16   q11, q9,  #8
+        vrshr.u16   q15, q10, #8
+        vraddhn.u16 d16, q8,  q13
+        vraddhn.u16 d27, q9,  q11
+        vraddhn.u16 d26, q10, q15
+        vqadd.u8    d16, d2,  d16
+    vmull.u8    q1,  d24, d9
+        vqadd.u8    q9,  q0,  q13
+        vshll.u8    q14, d16, #8
+    vmull.u8    q0,  d24, d8
+        vshll.u8    q8,  d19, #8
+        vshll.u8    q9,  d18, #8
+        vsri.u16    q14, q8,  #5
+    vmull.u8    q7,  d24, d11
+        vsri.u16    q14, q9,  #11
+
+    cache_preload 8, 8
+
+    vsli.u16    q2,  q2,  #5
+    vrshr.u16   q8,  q0,  #8
+    vrshr.u16   q9,  q1,  #8
+    vrshr.u16   q10, q6,  #8
+    vrshr.u16   q11, q7,  #8
+    vraddhn.u16 d0,  q0,  q8
+    vraddhn.u16 d1,  q1,  q9
+    vraddhn.u16 d2,  q6,  q10
+    vraddhn.u16 d3,  q7,  q11
+    vsri.u8     d6,  d6,  #5
+    vsri.u8     d7,  d7,  #6
+    vmvn.8      d3,  d3
+    vshrn.u16   d30, q2,  #2
+    vst1.16     {d28, d29}, [DST_W, :128]!
+    vmull.u8    q8,  d3,  d6
+    vmull.u8    q9,  d3,  d7
+    vmull.u8    q10, d3,  d30
 .endm
 
-/* TODO: expand macros and do better instructions scheduling */
-.macro pixman_composite_over_n_8_0565_process_pixblock_tail_head
-    pixman_composite_over_n_8_0565_process_pixblock_tail
-    vst1.16     {d28, d29}, [DST_W, :128]!
-    vld1.16     {d4, d5}, [DST_R, :128]!
-    vld1.8      {d24}, [MASK]!
-    cache_preload 8, 8
-    pixman_composite_over_n_8_0565_process_pixblock_head
-.endm
+generate_composite_function \
+    pixman_composite_over_8888_8_0565_asm_neon, 32, 8, 16, \
+    FLAG_DST_READWRITE | FLAG_DEINTERLEAVE_32BPP, \
+    8, /* number of pixels, processed in a single block */ \
+    5, /* prefetch distance */ \
+    default_init_need_all_regs, \
+    default_cleanup_need_all_regs, \
+    pixman_composite_over_8888_8_0565_process_pixblock_head, \
+    pixman_composite_over_8888_8_0565_process_pixblock_tail, \
+    pixman_composite_over_8888_8_0565_process_pixblock_tail_head, \
+    28, /* dst_w_basereg */ \
+    4,  /* dst_r_basereg */ \
+    8,  /* src_basereg   */ \
+    24  /* mask_basereg  */
+
+/******************************************************************************/
 
 /*
  * This function needs a special initialization of solid mask.
  * Solid source pixel data is fetched from stack at ARGS_STACK_OFFSET
  * offset, split into color components and replicated in d8-d11
  * registers. Additionally, this function needs all the NEON registers,
  * so it has to save d8-d15 registers which are callee saved according
  * to ABI. These registers are restored from 'cleanup' macro. All the
@@ -872,59 +923,59 @@ generate_composite_function \
 
 generate_composite_function \
     pixman_composite_over_n_8_0565_asm_neon, 0, 8, 16, \
     FLAG_DST_READWRITE, \
     8, /* number of pixels, processed in a single block */ \
     5, /* prefetch distance */ \
     pixman_composite_over_n_8_0565_init, \
     pixman_composite_over_n_8_0565_cleanup, \
-    pixman_composite_over_n_8_0565_process_pixblock_head, \
-    pixman_composite_over_n_8_0565_process_pixblock_tail, \
-    pixman_composite_over_n_8_0565_process_pixblock_tail_head
+    pixman_composite_over_8888_8_0565_process_pixblock_head, \
+    pixman_composite_over_8888_8_0565_process_pixblock_tail, \
+    pixman_composite_over_8888_8_0565_process_pixblock_tail_head
 
 /******************************************************************************/
 
-/* TODO: expand macros and do better instructions scheduling */
-.macro pixman_composite_over_8888_8_0565_process_pixblock_tail_head
-    vld1.16     {d4, d5}, [DST_R, :128]!
-    pixman_composite_over_n_8_0565_process_pixblock_tail
-    vld4.8      {d8, d9, d10, d11}, [SRC]!
-    cache_preload 8, 8
-    vld1.8      {d24}, [MASK]!
-    pixman_composite_over_n_8_0565_process_pixblock_head
-    vst1.16     {d28, d29}, [DST_W, :128]!
+.macro pixman_composite_over_8888_n_0565_init
+    add         DUMMY, sp, #(ARGS_STACK_OFFSET + 8)
+    vpush       {d8-d15}
+    vld1.32     {d24[0]}, [DUMMY]
+    vdup.8      d24, d24[3]
+.endm
+
+.macro pixman_composite_over_8888_n_0565_cleanup
+    vpop        {d8-d15}
 .endm
 
 generate_composite_function \
-    pixman_composite_over_8888_8_0565_asm_neon, 32, 8, 16, \
+    pixman_composite_over_8888_n_0565_asm_neon, 32, 0, 16, \
     FLAG_DST_READWRITE | FLAG_DEINTERLEAVE_32BPP, \
     8, /* number of pixels, processed in a single block */ \
     5, /* prefetch distance */ \
-    default_init_need_all_regs, \
-    default_cleanup_need_all_regs, \
-    pixman_composite_over_n_8_0565_process_pixblock_head, \
-    pixman_composite_over_n_8_0565_process_pixblock_tail, \
+    pixman_composite_over_8888_n_0565_init, \
+    pixman_composite_over_8888_n_0565_cleanup, \
+    pixman_composite_over_8888_8_0565_process_pixblock_head, \
+    pixman_composite_over_8888_8_0565_process_pixblock_tail, \
     pixman_composite_over_8888_8_0565_process_pixblock_tail_head, \
     28, /* dst_w_basereg */ \
     4,  /* dst_r_basereg */ \
     8,  /* src_basereg   */ \
     24  /* mask_basereg  */
 
 /******************************************************************************/
 
 .macro pixman_composite_src_0565_0565_process_pixblock_head
 .endm
 
 .macro pixman_composite_src_0565_0565_process_pixblock_tail
 .endm
 
 .macro pixman_composite_src_0565_0565_process_pixblock_tail_head
     vst1.16 {d0, d1, d2, d3}, [DST_W, :128]!
-    vld1.16 {d0, d1, d2, d3}, [SRC]!
+    fetch_src_pixblock
     cache_preload 16, 16
 .endm
 
 generate_composite_function \
     pixman_composite_src_0565_0565_asm_neon, 16, 0, 16, \
     FLAG_DST_WRITEONLY, \
     16, /* number of pixels, processed in a single block */ \
     10, /* prefetch distance */ \
@@ -1060,17 +1111,17 @@ generate_composite_function \
 .macro pixman_composite_src_8888_8888_process_pixblock_head
 .endm
 
 .macro pixman_composite_src_8888_8888_process_pixblock_tail
 .endm
 
 .macro pixman_composite_src_8888_8888_process_pixblock_tail_head
     vst1.32 {d0, d1, d2, d3}, [DST_W, :128]!
-    vld1.32 {d0, d1, d2, d3}, [SRC]!
+    fetch_src_pixblock
     cache_preload 8, 8
 .endm
 
 generate_composite_function \
     pixman_composite_src_8888_8888_asm_neon, 32, 0, 32, \
     FLAG_DST_WRITEONLY, \
     8, /* number of pixels, processed in a single block */ \
     10, /* prefetch distance */ \
@@ -1091,17 +1142,17 @@ generate_composite_function \
     vorr     q1, q1, q2
 .endm
 
 .macro pixman_composite_src_x888_8888_process_pixblock_tail
 .endm
 
 .macro pixman_composite_src_x888_8888_process_pixblock_tail_head
     vst1.32 {d0, d1, d2, d3}, [DST_W, :128]!
-    vld1.32 {d0, d1, d2, d3}, [SRC]!
+    fetch_src_pixblock
     vorr     q0, q0, q2
     vorr     q1, q1, q2
     cache_preload 8, 8
 .endm
 
 .macro pixman_composite_src_x888_8888_init
     vmov.u8  q2, #0xFF
     vshl.u32 q2, q2, #24
@@ -1166,17 +1217,17 @@ generate_composite_function \
     vqadd.u8    q15, q1, q15
 .endm
 
 /* TODO: expand macros and do better instructions scheduling */
 .macro pixman_composite_over_n_8_8888_process_pixblock_tail_head
     pixman_composite_over_n_8_8888_process_pixblock_tail
     vst4.8      {d28, d29, d30, d31}, [DST_W, :128]!
     vld4.8      {d4, d5, d6, d7}, [DST_R, :128]!
-    vld1.8      {d24}, [MASK]!
+    fetch_mask_pixblock
     cache_preload 8, 8
     pixman_composite_over_n_8_8888_process_pixblock_head
 .endm
 
 .macro pixman_composite_over_n_8_8888_init
     add         DUMMY, sp, #ARGS_STACK_OFFSET
     vpush       {d8-d15}
     vld1.32     {d11[0]}, [DUMMY]
@@ -1198,16 +1249,84 @@ generate_composite_function \
     pixman_composite_over_n_8_8888_init, \
     pixman_composite_over_n_8_8888_cleanup, \
     pixman_composite_over_n_8_8888_process_pixblock_head, \
     pixman_composite_over_n_8_8888_process_pixblock_tail, \
     pixman_composite_over_n_8_8888_process_pixblock_tail_head
 
 /******************************************************************************/
 
+.macro pixman_composite_over_n_8_8_process_pixblock_head
+    vmull.u8    q0,  d24, d8
+    vmull.u8    q1,  d25, d8
+    vmull.u8    q6,  d26, d8
+    vmull.u8    q7,  d27, d8
+    vrshr.u16   q10, q0,  #8
+    vrshr.u16   q11, q1,  #8
+    vrshr.u16   q12, q6,  #8
+    vrshr.u16   q13, q7,  #8
+    vraddhn.u16 d0,  q0,  q10
+    vraddhn.u16 d1,  q1,  q11
+    vraddhn.u16 d2,  q6,  q12
+    vraddhn.u16 d3,  q7,  q13
+    vmvn.8      q12, q0
+    vmvn.8      q13, q1
+    vmull.u8    q8,  d24, d4
+    vmull.u8    q9,  d25, d5
+    vmull.u8    q10, d26, d6
+    vmull.u8    q11, d27, d7
+.endm
+
+.macro pixman_composite_over_n_8_8_process_pixblock_tail
+    vrshr.u16   q14, q8,  #8
+    vrshr.u16   q15, q9,  #8
+    vrshr.u16   q12, q10, #8
+    vrshr.u16   q13, q11, #8
+    vraddhn.u16 d28, q14, q8
+    vraddhn.u16 d29, q15, q9
+    vraddhn.u16 d30, q12, q10
+    vraddhn.u16 d31, q13, q11
+    vqadd.u8    q14, q0,  q14
+    vqadd.u8    q15, q1,  q15
+.endm
+
+/* TODO: expand macros and do better instructions scheduling */
+.macro pixman_composite_over_n_8_8_process_pixblock_tail_head
+    vld1.8      {d4, d5, d6, d7}, [DST_R, :128]!
+    pixman_composite_over_n_8_8_process_pixblock_tail
+    fetch_mask_pixblock
+    cache_preload 32, 32
+    vst1.8      {d28, d29, d30, d31}, [DST_W, :128]!
+    pixman_composite_over_n_8_8_process_pixblock_head
+.endm
+
+.macro pixman_composite_over_n_8_8_init
+    add         DUMMY, sp, #ARGS_STACK_OFFSET
+    vpush       {d8-d15}
+    vld1.32     {d8[0]}, [DUMMY]
+    vdup.8      d8, d8[3]
+.endm
+
+.macro pixman_composite_over_n_8_8_cleanup
+    vpop        {d8-d15}
+.endm
+
+generate_composite_function \
+    pixman_composite_over_n_8_8_asm_neon, 0, 8, 8, \
+    FLAG_DST_READWRITE, \
+    32, /* number of pixels, processed in a single block */ \
+    5, /* prefetch distance */ \
+    pixman_composite_over_n_8_8_init, \
+    pixman_composite_over_n_8_8_cleanup, \
+    pixman_composite_over_n_8_8_process_pixblock_head, \
+    pixman_composite_over_n_8_8_process_pixblock_tail, \
+    pixman_composite_over_n_8_8_process_pixblock_tail_head
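For reference, the function generated above implements the OVER operator with a
solid source and an a8 mask, one byte per pixel. A minimal scalar sketch of the
per-pixel math in C (the vrshr.u16/vraddhn.u16 pairs perform the rounded
division by 255, and the final vqadd.u8 is a saturating add):

#include <stdint.h>

/* Rounded division by 255, as done by the vrshr.u16/vraddhn.u16 pairs. */
static inline uint8_t div255 (uint16_t x)
{
    return (uint8_t) ((x + 128 + ((x + 128) >> 8)) >> 8);
}

/* One over_n_8_8 pixel: 'srca' is the solid source alpha (d8),
 * 'm' the a8 mask value, 'd' the a8 destination value. */
static inline uint8_t over_n_8_8_pixel (uint8_t srca, uint8_t m, uint8_t d)
{
    uint8_t in = div255 ((uint16_t) (srca * m));          /* src IN mask */
    return (uint8_t) (in + div255 ((uint16_t) ((255 - in) * d))); /* OVER */
}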
+
+/******************************************************************************/
+
 .macro pixman_composite_over_n_8888_8888_ca_process_pixblock_head
     /*
      * 'combine_mask_ca' replacement
      *
      * input:  solid src (n) in {d8,  d9,  d10, d11}
      *         dest in          {d4,  d5,  d6,  d7 }
      *         mask in          {d24, d25, d26, d27}
      * output: updated src in   {d0,  d1,  d2,  d3 }
@@ -1268,17 +1387,17 @@ generate_composite_function \
         vrshr.u16   q15, q9, #8
     vld4.8      {d4, d5, d6, d7}, [DST_R, :128]!
         vrshr.u16   q6, q10, #8
         vrshr.u16   q7, q11, #8
         vraddhn.u16 d28, q14, q8
         vraddhn.u16 d29, q15, q9
         vraddhn.u16 d30, q6, q10
         vraddhn.u16 d31, q7, q11
-    vld4.8      {d24, d25, d26, d27}, [MASK]!
+    fetch_mask_pixblock
         vqadd.u8    q14, q0, q14
         vqadd.u8    q15, q1, q15
     cache_preload 8, 8
     pixman_composite_over_n_8888_8888_ca_process_pixblock_head
     vst4.8      {d28, d29, d30, d31}, [DST_W, :128]!
 .endm
 
 .macro pixman_composite_over_n_8888_8888_ca_init
@@ -1303,16 +1422,68 @@ generate_composite_function \
     pixman_composite_over_n_8888_8888_ca_init, \
     pixman_composite_over_n_8888_8888_ca_cleanup, \
     pixman_composite_over_n_8888_8888_ca_process_pixblock_head, \
     pixman_composite_over_n_8888_8888_ca_process_pixblock_tail, \
     pixman_composite_over_n_8888_8888_ca_process_pixblock_tail_head
 
 /******************************************************************************/
 
+.macro pixman_composite_in_n_8_process_pixblock_head
+    /* expecting source data in {d0, d1, d2, d3} */
+    /* and destination data in {d4, d5, d6, d7} */
+    vmull.u8    q8,  d4,  d3
+    vmull.u8    q9,  d5,  d3
+    vmull.u8    q10, d6,  d3
+    vmull.u8    q11, d7,  d3
+.endm
+
+.macro pixman_composite_in_n_8_process_pixblock_tail
+    vrshr.u16   q14, q8,  #8
+    vrshr.u16   q15, q9,  #8
+    vrshr.u16   q12, q10, #8
+    vrshr.u16   q13, q11, #8
+    vraddhn.u16 d28, q8,  q14
+    vraddhn.u16 d29, q9,  q15
+    vraddhn.u16 d30, q10, q12
+    vraddhn.u16 d31, q11, q13
+.endm
+
+.macro pixman_composite_in_n_8_process_pixblock_tail_head
+    pixman_composite_in_n_8_process_pixblock_tail
+    vld1.8      {d4, d5, d6, d7}, [DST_R, :128]!
+    cache_preload 32, 32
+    pixman_composite_in_n_8_process_pixblock_head
+    vst1.8      {d28, d29, d30, d31}, [DST_W, :128]!
+.endm
+
+.macro pixman_composite_in_n_8_init
+    add         DUMMY, sp, #ARGS_STACK_OFFSET
+    vld1.32     {d3[0]}, [DUMMY]
+    vdup.8      d3, d3[3]
+.endm
+
+.macro pixman_composite_in_n_8_cleanup
+.endm
+
+generate_composite_function \
+    pixman_composite_in_n_8_asm_neon, 0, 0, 8, \
+    FLAG_DST_READWRITE, \
+    32, /* number of pixels, processed in a single block */ \
+    5, /* prefetch distance */ \
+    pixman_composite_in_n_8_init, \
+    pixman_composite_in_n_8_cleanup, \
+    pixman_composite_in_n_8_process_pixblock_head, \
+    pixman_composite_in_n_8_process_pixblock_tail, \
+    pixman_composite_in_n_8_process_pixblock_tail_head, \
+    28, /* dst_w_basereg */ \
+    4,  /* dst_r_basereg */ \
+    0,  /* src_basereg   */ \
+    24  /* mask_basereg  */
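The per-pixel math here is the IN operator: the a8 destination is scaled by the
alpha of the solid source, using the same rounded division by 255 as in the
over_n_8_8 sketch above. In C:

#include <stdint.h>

/* One in_n_8 pixel: scale the a8 destination by the solid source alpha. */
static inline uint8_t in_n_8_pixel (uint8_t srca, uint8_t d)
{
    uint16_t t = (uint16_t) (srca * d) + 128;
    return (uint8_t) ((t + (t >> 8)) >> 8);
}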
+
 .macro pixman_composite_add_n_8_8_process_pixblock_head
     /* expecting source data in {d8, d9, d10, d11} */
     /* d8 - blue, d9 - green, d10 - red, d11 - alpha */
     /* and destination data in {d4, d5, d6, d7} */
     /* mask is in d24, d25, d26, d27 */
     vmull.u8    q0, d24, d11
     vmull.u8    q1, d25, d11
     vmull.u8    q6, d26, d11
@@ -1332,17 +1503,17 @@ generate_composite_function \
 .macro pixman_composite_add_n_8_8_process_pixblock_tail
 .endm
 
 /* TODO: expand macros and do better instructions scheduling */
 .macro pixman_composite_add_n_8_8_process_pixblock_tail_head
     pixman_composite_add_n_8_8_process_pixblock_tail
     vst1.8      {d28, d29, d30, d31}, [DST_W, :128]!
     vld1.8      {d4, d5, d6, d7}, [DST_R, :128]!
-    vld1.8      {d24, d25, d26, d27}, [MASK]!
+    fetch_mask_pixblock
     cache_preload 32, 32
     pixman_composite_add_n_8_8_process_pixblock_head
 .endm
 
 .macro pixman_composite_add_n_8_8_init
     add         DUMMY, sp, #ARGS_STACK_OFFSET
     vpush       {d8-d15}
     vld1.32     {d11[0]}, [DUMMY]
@@ -1389,18 +1560,18 @@ generate_composite_function \
 .macro pixman_composite_add_8_8_8_process_pixblock_tail
 .endm
 
 /* TODO: expand macros and do better instructions scheduling */
 .macro pixman_composite_add_8_8_8_process_pixblock_tail_head
     pixman_composite_add_8_8_8_process_pixblock_tail
     vst1.8      {d28, d29, d30, d31}, [DST_W, :128]!
     vld1.8      {d4, d5, d6, d7}, [DST_R, :128]!
-    vld1.8      {d24, d25, d26, d27}, [MASK]!
-    vld1.8      {d0, d1, d2, d3}, [SRC]!
+    fetch_mask_pixblock
+    fetch_src_pixblock
     cache_preload 32, 32
     pixman_composite_add_8_8_8_process_pixblock_head
 .endm
 
 .macro pixman_composite_add_8_8_8_init
 .endm
 
 .macro pixman_composite_add_8_8_8_cleanup
@@ -1418,44 +1589,60 @@ generate_composite_function \
     pixman_composite_add_8_8_8_process_pixblock_tail_head
 
 /******************************************************************************/
 
 .macro pixman_composite_add_8888_8888_8888_process_pixblock_head
     /* expecting source data in {d0, d1, d2, d3} */
     /* destination data in {d4, d5, d6, d7} */
     /* mask in {d24, d25, d26, d27} */
-    vmull.u8    q8, d27, d0
-    vmull.u8    q9, d27, d1
+    vmull.u8    q8,  d27, d0
+    vmull.u8    q9,  d27, d1
     vmull.u8    q10, d27, d2
     vmull.u8    q11, d27, d3
-    vrshr.u16   q0, q8, #8
-    vrshr.u16   q1, q9, #8
-    vrshr.u16   q12, q10, #8
-    vrshr.u16   q13, q11, #8
-    vraddhn.u16 d0, q0, q8
-    vraddhn.u16 d1, q1, q9
-    vraddhn.u16 d2, q12, q10
-    vraddhn.u16 d3, q13, q11
-    vqadd.u8    q14, q0, q2
-    vqadd.u8    q15, q1, q3
+    /* 1 cycle bubble */
+    vrsra.u16   q8,  q8,  #8
+    vrsra.u16   q9,  q9,  #8
+    vrsra.u16   q10, q10, #8
+    vrsra.u16   q11, q11, #8
 .endm
 
 .macro pixman_composite_add_8888_8888_8888_process_pixblock_tail
+    /* 2 cycle bubble */
+    vrshrn.u16  d28, q8,  #8
+    vrshrn.u16  d29, q9,  #8
+    vrshrn.u16  d30, q10, #8
+    vrshrn.u16  d31, q11, #8
+    vqadd.u8    q14, q2,  q14
+    /* 1 cycle bubble */
+    vqadd.u8    q15, q3,  q15
 .endm
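The rewritten head/tail pair above replaces the earlier vrshr/vraddhn sequence
with vrsra.u16 (fold the rounded high byte back into the product) followed by
vrshrn.u16 in the tail; both routes compute the same rounded division by 255
of the 16-bit mask_alpha * src product. A scalar sketch of the two steps:

#include <stdint.h>

static inline uint8_t mul_div255 (uint8_t ma, uint8_t s)
{
    uint16_t t = (uint16_t) (ma * s);
    t = (uint16_t) (t + ((t + 128) >> 8));  /* vrsra.u16  qN, qN, #8 */
    return (uint8_t) ((t + 128) >> 8);      /* vrshrn.u16 dN, qN, #8 */
}

The tail's vqadd.u8 then adds this to the destination with saturation, which is
exactly the ADD operator.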
 
-/* TODO: expand macros and do better instructions scheduling */
 .macro pixman_composite_add_8888_8888_8888_process_pixblock_tail_head
-    pixman_composite_add_8888_8888_8888_process_pixblock_tail
-    vst4.8      {d28, d29, d30, d31}, [DST_W, :128]!
+    fetch_src_pixblock
+        vrshrn.u16  d28, q8,  #8
+    fetch_mask_pixblock
+        vrshrn.u16  d29, q9,  #8
+    vmull.u8    q8,  d27, d0
+        vrshrn.u16  d30, q10, #8
+    vmull.u8    q9,  d27, d1
+        vrshrn.u16  d31, q11, #8
+    vmull.u8    q10, d27, d2
+        vqadd.u8    q14, q2,  q14
+    vmull.u8    q11, d27, d3
+        vqadd.u8    q15, q3,  q15
+    vrsra.u16   q8,  q8,  #8
     vld4.8      {d4, d5, d6, d7}, [DST_R, :128]!
-    vld4.8      {d24, d25, d26, d27}, [MASK]!
-    vld4.8      {d0, d1, d2, d3}, [SRC]!
+    vrsra.u16   q9,  q9,  #8
+        vst4.8      {d28, d29, d30, d31}, [DST_W, :128]!
+    vrsra.u16   q10, q10, #8
+
     cache_preload 8, 8
-    pixman_composite_add_8888_8888_8888_process_pixblock_head
+
+    vrsra.u16   q11, q11, #8
 .endm
 
 generate_composite_function \
     pixman_composite_add_8888_8888_8888_asm_neon, 32, 32, 32, \
     FLAG_DST_READWRITE | FLAG_DEINTERLEAVE_32BPP, \
     8, /* number of pixels, processed in a single block */ \
     10, /* prefetch distance */ \
     default_init, \
@@ -1471,16 +1658,88 @@ generate_composite_function_single_scanl
     default_init, \
     default_cleanup, \
     pixman_composite_add_8888_8888_8888_process_pixblock_head, \
     pixman_composite_add_8888_8888_8888_process_pixblock_tail, \
     pixman_composite_add_8888_8888_8888_process_pixblock_tail_head
 
 /******************************************************************************/
 
+generate_composite_function \
+    pixman_composite_add_8888_8_8888_asm_neon, 32, 8, 32, \
+    FLAG_DST_READWRITE | FLAG_DEINTERLEAVE_32BPP, \
+    8, /* number of pixels, processed in a single block */ \
+    5, /* prefetch distance */ \
+    default_init, \
+    default_cleanup, \
+    pixman_composite_add_8888_8888_8888_process_pixblock_head, \
+    pixman_composite_add_8888_8888_8888_process_pixblock_tail, \
+    pixman_composite_add_8888_8888_8888_process_pixblock_tail_head, \
+    28, /* dst_w_basereg */ \
+    4,  /* dst_r_basereg */ \
+    0,  /* src_basereg   */ \
+    27  /* mask_basereg  */
+
+/******************************************************************************/
+
+.macro pixman_composite_add_n_8_8888_init
+    add         DUMMY, sp, #ARGS_STACK_OFFSET
+    vld1.32     {d3[0]}, [DUMMY]
+    vdup.8      d0, d3[0]
+    vdup.8      d1, d3[1]
+    vdup.8      d2, d3[2]
+    vdup.8      d3, d3[3]
+.endm
+
+.macro pixman_composite_add_n_8_8888_cleanup
+.endm
+
+generate_composite_function \
+    pixman_composite_add_n_8_8888_asm_neon, 0, 8, 32, \
+    FLAG_DST_READWRITE | FLAG_DEINTERLEAVE_32BPP, \
+    8, /* number of pixels, processed in a single block */ \
+    5, /* prefetch distance */ \
+    pixman_composite_add_n_8_8888_init, \
+    pixman_composite_add_n_8_8888_cleanup, \
+    pixman_composite_add_8888_8888_8888_process_pixblock_head, \
+    pixman_composite_add_8888_8888_8888_process_pixblock_tail, \
+    pixman_composite_add_8888_8888_8888_process_pixblock_tail_head, \
+    28, /* dst_w_basereg */ \
+    4,  /* dst_r_basereg */ \
+    0,  /* src_basereg   */ \
+    27  /* mask_basereg  */
+
+/******************************************************************************/
+
+.macro pixman_composite_add_8888_n_8888_init
+    add         DUMMY, sp, #(ARGS_STACK_OFFSET + 8)
+    vld1.32     {d27[0]}, [DUMMY]
+    vdup.8      d27, d27[3]
+.endm
+
+.macro pixman_composite_add_8888_n_8888_cleanup
+.endm
+
+generate_composite_function \
+    pixman_composite_add_8888_n_8888_asm_neon, 32, 0, 32, \
+    FLAG_DST_READWRITE | FLAG_DEINTERLEAVE_32BPP, \
+    8, /* number of pixels, processed in a single block */ \
+    5, /* prefetch distance */ \
+    pixman_composite_add_8888_n_8888_init, \
+    pixman_composite_add_8888_n_8888_cleanup, \
+    pixman_composite_add_8888_8888_8888_process_pixblock_head, \
+    pixman_composite_add_8888_8888_8888_process_pixblock_tail, \
+    pixman_composite_add_8888_8888_8888_process_pixblock_tail_head, \
+    28, /* dst_w_basereg */ \
+    4,  /* dst_r_basereg */ \
+    0,  /* src_basereg   */ \
+    27  /* mask_basereg  */
+
+/******************************************************************************/
+
 .macro pixman_composite_out_reverse_8888_n_8888_process_pixblock_head
     /* expecting source data in {d0, d1, d2, d3} */
     /* destination data in {d4, d5, d6, d7} */
     /* solid mask is in d15 */
 
     /* 'in' */
     vmull.u8    q8, d15, d3
     vmull.u8    q6, d15, d2
@@ -1512,19 +1771,19 @@ generate_composite_function_single_scanl
     vraddhn.u16 d30, q12, q10
     vraddhn.u16 d31, q13, q11
 .endm
 
 /* TODO: expand macros and do better instructions scheduling */
 .macro pixman_composite_out_reverse_8888_8888_8888_process_pixblock_tail_head
     vld4.8     {d4, d5, d6, d7}, [DST_R, :128]!
     pixman_composite_out_reverse_8888_n_8888_process_pixblock_tail
-    vld4.8     {d0, d1, d2, d3}, [SRC]!
+    fetch_src_pixblock
     cache_preload 8, 8
-    vld4.8     {d12, d13, d14, d15}, [MASK]!
+    fetch_mask_pixblock
     pixman_composite_out_reverse_8888_n_8888_process_pixblock_head
     vst4.8     {d28, d29, d30, d31}, [DST_W, :128]!
 .endm
 
 generate_composite_function_single_scanline \
     pixman_composite_scanline_out_reverse_mask_asm_neon, 32, 32, 32, \
     FLAG_DST_READWRITE | FLAG_DEINTERLEAVE_32BPP, \
     8, /* number of pixels, processed in a single block */ \
@@ -1549,17 +1808,17 @@ generate_composite_function_single_scanl
     vqadd.u8    q14, q0, q14
     vqadd.u8    q15, q1, q15
 .endm
 
 /* TODO: expand macros and do better instructions scheduling */
 .macro pixman_composite_over_8888_n_8888_process_pixblock_tail_head
     vld4.8     {d4, d5, d6, d7}, [DST_R, :128]!
     pixman_composite_over_8888_n_8888_process_pixblock_tail
-    vld4.8     {d0, d1, d2, d3}, [SRC]!
+    fetch_src_pixblock
     cache_preload 8, 8
     pixman_composite_over_8888_n_8888_process_pixblock_head
     vst4.8     {d28, d29, d30, d31}, [DST_W, :128]!
 .endm
 
 .macro pixman_composite_over_8888_n_8888_init
     add         DUMMY, sp, #48
     vpush       {d8-d15}
@@ -1583,19 +1842,19 @@ generate_composite_function \
     pixman_composite_over_8888_n_8888_process_pixblock_tail_head
 
 /******************************************************************************/
 
 /* TODO: expand macros and do better instructions scheduling */
 .macro pixman_composite_over_8888_8888_8888_process_pixblock_tail_head
     vld4.8     {d4, d5, d6, d7}, [DST_R, :128]!
     pixman_composite_over_8888_n_8888_process_pixblock_tail
-    vld4.8     {d0, d1, d2, d3}, [SRC]!
+    fetch_src_pixblock
     cache_preload 8, 8
-    vld4.8     {d12, d13, d14, d15}, [MASK]!
+    fetch_mask_pixblock
     pixman_composite_over_8888_n_8888_process_pixblock_head
     vst4.8     {d28, d29, d30, d31}, [DST_W, :128]!
 .endm
 
 generate_composite_function \
     pixman_composite_over_8888_8888_8888_asm_neon, 32, 32, 32, \
     FLAG_DST_READWRITE | FLAG_DEINTERLEAVE_32BPP, \
     8, /* number of pixels, processed in a single block */ \
@@ -1625,19 +1884,19 @@ generate_composite_function_single_scanl
     12  /* mask_basereg  */
 
 /******************************************************************************/
 
 /* TODO: expand macros and do better instructions scheduling */
 .macro pixman_composite_over_8888_8_8888_process_pixblock_tail_head
     vld4.8     {d4, d5, d6, d7}, [DST_R, :128]!
     pixman_composite_over_8888_n_8888_process_pixblock_tail
-    vld4.8     {d0, d1, d2, d3}, [SRC]!
+    fetch_src_pixblock
     cache_preload 8, 8
-    vld1.8     {d15}, [MASK]!
+    fetch_mask_pixblock
     pixman_composite_over_8888_n_8888_process_pixblock_head
     vst4.8     {d28, d29, d30, d31}, [DST_W, :128]!
 .endm
 
 generate_composite_function \
     pixman_composite_over_8888_8_8888_asm_neon, 32, 8, 32, \
     FLAG_DST_READWRITE | FLAG_DEINTERLEAVE_32BPP, \
     8, /* number of pixels, processed in a single block */ \
@@ -1657,17 +1916,17 @@ generate_composite_function \
 .macro pixman_composite_src_0888_0888_process_pixblock_head
 .endm
 
 .macro pixman_composite_src_0888_0888_process_pixblock_tail
 .endm
 
 .macro pixman_composite_src_0888_0888_process_pixblock_tail_head
     vst3.8 {d0, d1, d2}, [DST_W]!
-    vld3.8 {d0, d1, d2}, [SRC]!
+    fetch_src_pixblock
     cache_preload 8, 8
 .endm
 
 generate_composite_function \
     pixman_composite_src_0888_0888_asm_neon, 24, 0, 24, \
     FLAG_DST_WRITEONLY, \
     8, /* number of pixels, processed in a single block */ \
     10, /* prefetch distance */ \
@@ -1687,17 +1946,17 @@ generate_composite_function \
     vswp   d0, d2
 .endm
 
 .macro pixman_composite_src_0888_8888_rev_process_pixblock_tail
 .endm
 
 .macro pixman_composite_src_0888_8888_rev_process_pixblock_tail_head
     vst4.8 {d0, d1, d2, d3}, [DST_W]!
-    vld3.8 {d0, d1, d2}, [SRC]!
+    fetch_src_pixblock
     vswp   d0, d2
     cache_preload 8, 8
 .endm
 
 .macro pixman_composite_src_0888_8888_rev_init
     veor   d3, d3, d3
 .endm
 
@@ -1726,17 +1985,17 @@ generate_composite_function \
 .macro pixman_composite_src_0888_0565_rev_process_pixblock_tail
     vshll.u8    q14, d0, #8
     vsri.u16    q14, q8, #5
     vsri.u16    q14, q9, #11
 .endm
 
 .macro pixman_composite_src_0888_0565_rev_process_pixblock_tail_head
         vshll.u8    q14, d0, #8
-    vld3.8 {d0, d1, d2}, [SRC]!
+    fetch_src_pixblock
         vsri.u16    q14, q8, #5
         vsri.u16    q14, q9, #11
     vshll.u8    q8, d1, #8
         vst1.16 {d28, d29}, [DST_W, :128]!
     vshll.u8    q9, d2, #8
 .endm
 
 generate_composite_function \
@@ -1772,17 +2031,17 @@ generate_composite_function \
     vraddhn.u16 d28, q13, q10
 .endm
 
 .macro pixman_composite_src_pixbuf_8888_process_pixblock_tail_head
         vrshr.u16   q11, q8, #8
         vswp        d3, d31
         vrshr.u16   q12, q9, #8
         vrshr.u16   q13, q10, #8
-    vld4.8 {d0, d1, d2, d3}, [SRC]!
+    fetch_src_pixblock
         vraddhn.u16 d30, q11, q8
                                     PF add PF_X, PF_X, #8
                                     PF tst PF_CTL, #0xF
                                     PF addne PF_X, PF_X, #8
                                     PF subne PF_CTL, PF_CTL, #1
         vraddhn.u16 d29, q12, q9
         vraddhn.u16 d28, q13, q10
     vmull.u8    q8, d3, d0
@@ -1808,16 +2067,73 @@ generate_composite_function \
     pixman_composite_src_pixbuf_8888_process_pixblock_tail_head, \
     28, /* dst_w_basereg */ \
     0, /* dst_r_basereg */ \
     0, /* src_basereg   */ \
     0  /* mask_basereg  */
 
 /******************************************************************************/
 
+.macro pixman_composite_src_rpixbuf_8888_process_pixblock_head
+    vmull.u8    q8, d3, d0
+    vmull.u8    q9, d3, d1
+    vmull.u8    q10, d3, d2
+.endm
+
+.macro pixman_composite_src_rpixbuf_8888_process_pixblock_tail
+    vrshr.u16   q11, q8, #8
+    vswp        d3, d31
+    vrshr.u16   q12, q9, #8
+    vrshr.u16   q13, q10, #8
+    vraddhn.u16 d28, q11, q8
+    vraddhn.u16 d29, q12, q9
+    vraddhn.u16 d30, q13, q10
+.endm
+
+.macro pixman_composite_src_rpixbuf_8888_process_pixblock_tail_head
+        vrshr.u16   q11, q8, #8
+        vswp        d3, d31
+        vrshr.u16   q12, q9, #8
+        vrshr.u16   q13, q10, #8
+    fetch_src_pixblock
+        vraddhn.u16 d28, q11, q8
+                                    PF add PF_X, PF_X, #8
+                                    PF tst PF_CTL, #0xF
+                                    PF addne PF_X, PF_X, #8
+                                    PF subne PF_CTL, PF_CTL, #1
+        vraddhn.u16 d29, q12, q9
+        vraddhn.u16 d30, q13, q10
+    vmull.u8    q8, d3, d0
+    vmull.u8    q9, d3, d1
+    vmull.u8    q10, d3, d2
+        vst4.8 {d28, d29, d30, d31}, [DST_W, :128]!
+                                    PF cmp PF_X, ORIG_W
+                                    PF pld, [PF_SRC, PF_X, lsl #src_bpp_shift]
+                                    PF subge PF_X, PF_X, ORIG_W
+                                    PF subges PF_CTL, PF_CTL, #0x10
+                                    PF ldrgeb DUMMY, [PF_SRC, SRC_STRIDE, lsl #src_bpp_shift]!
+.endm
+
+generate_composite_function \
+    pixman_composite_src_rpixbuf_8888_asm_neon, 32, 0, 32, \
+    FLAG_DST_WRITEONLY | FLAG_DEINTERLEAVE_32BPP, \
+    8, /* number of pixels, processed in a single block */ \
+    10, /* prefetch distance */ \
+    default_init, \
+    default_cleanup, \
+    pixman_composite_src_rpixbuf_8888_process_pixblock_head, \
+    pixman_composite_src_rpixbuf_8888_process_pixblock_tail, \
+    pixman_composite_src_rpixbuf_8888_process_pixblock_tail_head, \
+    28, /* dst_w_basereg */ \
+    0, /* dst_r_basereg */ \
+    0, /* src_basereg   */ \
+    0  /* mask_basereg  */
+
+/******************************************************************************/
+
 .macro pixman_composite_over_0565_8_0565_process_pixblock_head
     /* mask is in d15 */
     convert_0565_to_x888 q4, d2, d1, d0
     convert_0565_to_x888 q5, d6, d5, d4
     /* source pixel data is in      {d0, d1, d2, XX} */
     /* destination pixel data is in {d4, d5, d6, XX} */
     vmvn.8      d7,  d15
     vmull.u8    q6,  d15, d2
@@ -1844,19 +2160,19 @@ generate_composite_function \
     vqadd.u8    q0,  q0,  q14
     vqadd.u8    q1,  q1,  q15
     /* 32bpp result is in {d0, d1, d2, XX} */
     convert_8888_to_0565 d2, d1, d0, q14, q15, q3
 .endm
 
 /* TODO: expand macros and do better instructions scheduling */
 .macro pixman_composite_over_0565_8_0565_process_pixblock_tail_head
-    vld1.8     {d15}, [MASK]!
+    fetch_mask_pixblock
     pixman_composite_over_0565_8_0565_process_pixblock_tail
-    vld1.16    {d8, d9}, [SRC]!
+    fetch_src_pixblock
     vld1.16    {d10, d11}, [DST_R, :128]!
     cache_preload 8, 8
     pixman_composite_over_0565_8_0565_process_pixblock_head
     vst1.16    {d28, d29}, [DST_W, :128]!
 .endm
 
 generate_composite_function \
     pixman_composite_over_0565_8_0565_asm_neon, 16, 8, 16, \
@@ -1870,16 +2186,44 @@ generate_composite_function \
     pixman_composite_over_0565_8_0565_process_pixblock_tail_head, \
     28, /* dst_w_basereg */ \
     10,  /* dst_r_basereg */ \
     8,  /* src_basereg   */ \
     15  /* mask_basereg  */
 
 /******************************************************************************/
 
+.macro pixman_composite_over_0565_n_0565_init
+    add         DUMMY, sp, #(ARGS_STACK_OFFSET + 8)
+    vpush       {d8-d15}
+    vld1.32     {d15[0]}, [DUMMY]
+    vdup.8      d15, d15[3]
+.endm
+
+.macro pixman_composite_over_0565_n_0565_cleanup
+    vpop        {d8-d15}
+.endm
+
+generate_composite_function \
+    pixman_composite_over_0565_n_0565_asm_neon, 16, 0, 16, \
+    FLAG_DST_READWRITE, \
+    8, /* number of pixels, processed in a single block */ \
+    5, /* prefetch distance */ \
+    pixman_composite_over_0565_n_0565_init, \
+    pixman_composite_over_0565_n_0565_cleanup, \
+    pixman_composite_over_0565_8_0565_process_pixblock_head, \
+    pixman_composite_over_0565_8_0565_process_pixblock_tail, \
+    pixman_composite_over_0565_8_0565_process_pixblock_tail_head, \
+    28, /* dst_w_basereg */ \
+    10, /* dst_r_basereg */ \
+    8,  /* src_basereg   */ \
+    15  /* mask_basereg  */
+
+/******************************************************************************/
+
 .macro pixman_composite_add_0565_8_0565_process_pixblock_head
     /* mask is in d15 */
     convert_0565_to_x888 q4, d2, d1, d0
     convert_0565_to_x888 q5, d6, d5, d4
     /* source pixel data is in      {d0, d1, d2, XX} */
     /* destination pixel data is in {d4, d5, d6, XX} */
     vmull.u8    q6,  d15, d2
     vmull.u8    q5,  d15, d1
@@ -1896,19 +2240,19 @@ generate_composite_function \
     vqadd.u8    q0,  q0,  q2
     vqadd.u8    q1,  q1,  q3
     /* 32bpp result is in {d0, d1, d2, XX} */
     convert_8888_to_0565 d2, d1, d0, q14, q15, q3
 .endm
 
 /* TODO: expand macros and do better instructions scheduling */
 .macro pixman_composite_add_0565_8_0565_process_pixblock_tail_head
-    vld1.8     {d15}, [MASK]!
+    fetch_mask_pixblock
     pixman_composite_add_0565_8_0565_process_pixblock_tail
-    vld1.16    {d8, d9}, [SRC]!
+    fetch_src_pixblock
     vld1.16    {d10, d11}, [DST_R, :128]!
     cache_preload 8, 8
     pixman_composite_add_0565_8_0565_process_pixblock_head
     vst1.16    {d28, d29}, [DST_W, :128]!
 .endm
 
 generate_composite_function \
     pixman_composite_add_0565_8_0565_asm_neon, 16, 8, 16, \
@@ -1946,17 +2290,17 @@ generate_composite_function \
     vraddhn.u16 d1, q15, q9
     vraddhn.u16 d2, q12, q10
     /* 32bpp result is in {d0, d1, d2, XX} */
     convert_8888_to_0565 d2, d1, d0, q14, q15, q3
 .endm
 
 /* TODO: expand macros and do better instructions scheduling */
 .macro pixman_composite_out_reverse_8_0565_process_pixblock_tail_head
-    vld1.8     {d15}, [SRC]!
+    fetch_src_pixblock
     pixman_composite_out_reverse_8_0565_process_pixblock_tail
     vld1.16    {d10, d11}, [DST_R, :128]!
     cache_preload 8, 8
     pixman_composite_out_reverse_8_0565_process_pixblock_head
     vst1.16    {d28, d29}, [DST_W, :128]!
 .endm
 
 generate_composite_function \
@@ -1968,8 +2312,407 @@ generate_composite_function \
     default_cleanup_need_all_regs, \
     pixman_composite_out_reverse_8_0565_process_pixblock_head, \
     pixman_composite_out_reverse_8_0565_process_pixblock_tail, \
     pixman_composite_out_reverse_8_0565_process_pixblock_tail_head, \
     28, /* dst_w_basereg */ \
     10, /* dst_r_basereg */ \
     15, /* src_basereg   */ \
     0   /* mask_basereg  */
+
+/******************************************************************************/
+
+generate_composite_function_nearest_scanline \
+    pixman_scaled_nearest_scanline_8888_8888_OVER_asm_neon, 32, 0, 32, \
+    FLAG_DST_READWRITE | FLAG_DEINTERLEAVE_32BPP, \
+    8, /* number of pixels, processed in a single block */ \
+    default_init, \
+    default_cleanup, \
+    pixman_composite_over_8888_8888_process_pixblock_head, \
+    pixman_composite_over_8888_8888_process_pixblock_tail, \
+    pixman_composite_over_8888_8888_process_pixblock_tail_head
+
+generate_composite_function_nearest_scanline \
+    pixman_scaled_nearest_scanline_8888_0565_OVER_asm_neon, 32, 0, 16, \
+    FLAG_DST_READWRITE | FLAG_DEINTERLEAVE_32BPP, \
+    8, /* number of pixels, processed in a single block */ \
+    default_init, \
+    default_cleanup, \
+    pixman_composite_over_8888_0565_process_pixblock_head, \
+    pixman_composite_over_8888_0565_process_pixblock_tail, \
+    pixman_composite_over_8888_0565_process_pixblock_tail_head, \
+    28, /* dst_w_basereg */ \
+    4,  /* dst_r_basereg */ \
+    0,  /* src_basereg   */ \
+    24  /* mask_basereg  */
+
+generate_composite_function_nearest_scanline \
+    pixman_scaled_nearest_scanline_8888_0565_SRC_asm_neon, 32, 0, 16, \
+    FLAG_DST_WRITEONLY | FLAG_DEINTERLEAVE_32BPP, \
+    8, /* number of pixels, processed in a single block */ \
+    default_init, \
+    default_cleanup, \
+    pixman_composite_src_8888_0565_process_pixblock_head, \
+    pixman_composite_src_8888_0565_process_pixblock_tail, \
+    pixman_composite_src_8888_0565_process_pixblock_tail_head
+
+generate_composite_function_nearest_scanline \
+    pixman_scaled_nearest_scanline_0565_8888_SRC_asm_neon, 16, 0, 32, \
+    FLAG_DST_WRITEONLY | FLAG_DEINTERLEAVE_32BPP, \
+    8, /* number of pixels, processed in a single block */ \
+    default_init, \
+    default_cleanup, \
+    pixman_composite_src_0565_8888_process_pixblock_head, \
+    pixman_composite_src_0565_8888_process_pixblock_tail, \
+    pixman_composite_src_0565_8888_process_pixblock_tail_head
+
+generate_composite_function_nearest_scanline \
+    pixman_scaled_nearest_scanline_8888_8_0565_OVER_asm_neon, 32, 8, 16, \
+    FLAG_DST_READWRITE | FLAG_DEINTERLEAVE_32BPP, \
+    8, /* number of pixels, processed in a single block */ \
+    default_init_need_all_regs, \
+    default_cleanup_need_all_regs, \
+    pixman_composite_over_8888_8_0565_process_pixblock_head, \
+    pixman_composite_over_8888_8_0565_process_pixblock_tail, \
+    pixman_composite_over_8888_8_0565_process_pixblock_tail_head, \
+    28, /* dst_w_basereg */ \
+    4,  /* dst_r_basereg */ \
+    8,  /* src_basereg   */ \
+    24  /* mask_basereg  */
+
+generate_composite_function_nearest_scanline \
+    pixman_scaled_nearest_scanline_0565_8_0565_OVER_asm_neon, 16, 8, 16, \
+    FLAG_DST_READWRITE, \
+    8, /* number of pixels, processed in a single block */ \
+    default_init_need_all_regs, \
+    default_cleanup_need_all_regs, \
+    pixman_composite_over_0565_8_0565_process_pixblock_head, \
+    pixman_composite_over_0565_8_0565_process_pixblock_tail, \
+    pixman_composite_over_0565_8_0565_process_pixblock_tail_head, \
+    28, /* dst_w_basereg */ \
+    10,  /* dst_r_basereg */ \
+    8,  /* src_basereg   */ \
+    15  /* mask_basereg  */
+
+/******************************************************************************/
+
+/* Supplementary macro for setting function attributes */
+.macro pixman_asm_function fname
+    .func fname
+    .global fname
+#ifdef __ELF__
+    .hidden fname
+    .type fname, %function
+#endif
+fname:
+.endm
+
+/*
+ * Bilinear scaling support code, providing pixel fetching, color format
+ * conversion, and interpolation as separate macros that can be used as
+ * the basic building blocks for constructing bilinear scanline functions.
+ */
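A rough scalar model of what these building blocks compute per channel,
assuming 16.16 fixed-point x coordinates and vertical weights satisfying
wt + wb == 256 (the convention assumed by this code; one 8-bit channel shown):

#include <stdint.h>

static inline uint8_t
bilinear_channel (uint8_t tl, uint8_t tr,   /* top row: left, right      */
                  uint8_t bl, uint8_t br,   /* bottom row: left, right   */
                  uint32_t wt, uint32_t wb, /* vertical weights, sum 256 */
                  uint32_t x)               /* 16.16 fixed-point         */
{
    uint32_t wx    = (x >> 8) & 0xff;       /* horizontal weight         */
    uint32_t left  = tl * wt + bl * wb;     /* vmull.u8 + vmlal.u8       */
    uint32_t right = tr * wt + br * wb;

    /* vshll/vmlsl/vmlal followed by the >> 16 narrowing: */
    return (uint8_t) ((left * (256 - wx) + right * wx) >> 16);
}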
+
+.macro bilinear_load_8888 reg1, reg2, tmp
+    mov       TMP2, X, asr #16
+    add       X, X, UX
+    add       TMP1, TOP, TMP2, asl #2
+    add       TMP2, BOTTOM, TMP2, asl #2
+    vld1.32   {reg1}, [TMP1]
+    vld1.32   {reg2}, [TMP2]
+.endm
+
+.macro bilinear_load_0565 reg1, reg2, tmp
+    mov       TMP2, X, asr #16
+    add       X, X, UX
+    add       TMP1, TOP, TMP2, asl #1
+    add       TMP2, BOTTOM, TMP2, asl #1
+    vld1.32   {reg2[0]}, [TMP1]
+    vld1.32   {reg2[1]}, [TMP2]
+    convert_four_0565_to_x888_packed reg2, reg1, reg2, tmp
+.endm
+
+.macro bilinear_load_and_vertical_interpolate_two_8888 \
+                    acc1, acc2, reg1, reg2, reg3, reg4, tmp1, tmp2
+
+    bilinear_load_8888 reg1, reg2, tmp1
+    vmull.u8  acc1, reg1, d28
+    vmlal.u8  acc1, reg2, d29
+    bilinear_load_8888 reg3, reg4, tmp2
+    vmull.u8  acc2, reg3, d28
+    vmlal.u8  acc2, reg4, d29
+.endm
+
+.macro bilinear_load_and_vertical_interpolate_four_8888 \
+                xacc1, xacc2, xreg1, xreg2, xreg3, xreg4, xacc2lo, xacc2hi \
+                yacc1, yacc2, yreg1, yreg2, yreg3, yreg4, yacc2lo, yacc2hi
+
+    bilinear_load_and_vertical_interpolate_two_8888 \
+                xacc1, xacc2, xreg1, xreg2, xreg3, xreg4, xacc2lo, xacc2hi
+    bilinear_load_and_vertical_interpolate_two_8888 \
+                yacc1, yacc2, yreg1, yreg2, yreg3, yreg4, yacc2lo, yacc2hi
+.endm
+
+.macro bilinear_load_and_vertical_interpolate_two_0565 \
+                acc1, acc2, reg1, reg2, reg3, reg4, acc2lo, acc2hi
+
+    mov       TMP2, X, asr #16
+    add       X, X, UX
+    mov       TMP4, X, asr #16
+    add       X, X, UX
+    add       TMP1, TOP, TMP2, asl #1
+    add       TMP2, BOTTOM, TMP2, asl #1
+    add       TMP3, TOP, TMP4, asl #1
+    add       TMP4, BOTTOM, TMP4, asl #1
+    vld1.32   {acc2lo[0]}, [TMP1]
+    vld1.32   {acc2hi[0]}, [TMP3]
+    vld1.32   {acc2lo[1]}, [TMP2]
+    vld1.32   {acc2hi[1]}, [TMP4]
+    convert_0565_to_x888 acc2, reg3, reg2, reg1
+    vzip.u8   reg1, reg3
+    vzip.u8   reg2, reg4
+    vzip.u8   reg3, reg4
+    vzip.u8   reg1, reg2
+    vmull.u8  acc1, reg1, d28
+    vmlal.u8  acc1, reg2, d29
+    vmull.u8  acc2, reg3, d28
+    vmlal.u8  acc2, reg4, d29
+.endm
+
+.macro bilinear_load_and_vertical_interpolate_four_0565 \
+                xacc1, xacc2, xreg1, xreg2, xreg3, xreg4, xacc2lo, xacc2hi \
+                yacc1, yacc2, yreg1, yreg2, yreg3, yreg4, yacc2lo, yacc2hi
+
+    mov       TMP2, X, asr #16
+    add       X, X, UX
+    mov       TMP4, X, asr #16
+    add       X, X, UX
+    add       TMP1, TOP, TMP2, asl #1
+    add       TMP2, BOTTOM, TMP2, asl #1
+    add       TMP3, TOP, TMP4, asl #1
+    add       TMP4, BOTTOM, TMP4, asl #1
+    vld1.32   {xacc2lo[0]}, [TMP1]
+    vld1.32   {xacc2hi[0]}, [TMP3]
+    vld1.32   {xacc2lo[1]}, [TMP2]
+    vld1.32   {xacc2hi[1]}, [TMP4]
+    convert_0565_to_x888 xacc2, xreg3, xreg2, xreg1
+    mov       TMP2, X, asr #16
+    add       X, X, UX
+    mov       TMP4, X, asr #16
+    add       X, X, UX
+    add       TMP1, TOP, TMP2, asl #1
+    add       TMP2, BOTTOM, TMP2, asl #1
+    add       TMP3, TOP, TMP4, asl #1
+    add       TMP4, BOTTOM, TMP4, asl #1
+    vld1.32   {yacc2lo[0]}, [TMP1]
+    vzip.u8   xreg1, xreg3
+    vld1.32   {yacc2hi[0]}, [TMP3]
+    vzip.u8   xreg2, xreg4
+    vld1.32   {yacc2lo[1]}, [TMP2]
+    vzip.u8   xreg3, xreg4
+    vld1.32   {yacc2hi[1]}, [TMP4]
+    vzip.u8   xreg1, xreg2
+    convert_0565_to_x888 yacc2, yreg3, yreg2, yreg1
+    vmull.u8  xacc1, xreg1, d28
+    vzip.u8   yreg1, yreg3
+    vmlal.u8  xacc1, xreg2, d29
+    vzip.u8   yreg2, yreg4
+    vmull.u8  xacc2, xreg3, d28
+    vzip.u8   yreg3, yreg4
+    vmlal.u8  xacc2, xreg4, d29
+    vzip.u8   yreg1, yreg2
+    vmull.u8  yacc1, yreg1, d28
+    vmlal.u8  yacc1, yreg2, d29
+    vmull.u8  yacc2, yreg3, d28
+    vmlal.u8  yacc2, yreg4, d29
+.endm
+
+.macro bilinear_store_8888 numpix, tmp1, tmp2
+.if numpix == 4
+    vst1.32   {d0, d1}, [OUT]!
+.elseif numpix == 2
+    vst1.32   {d0}, [OUT]!
+.elseif numpix == 1
+    vst1.32   {d0[0]}, [OUT, :32]!
+.else
+    .error bilinear_store_8888 numpix is unsupported
+.endif
+.endm
+
+.macro bilinear_store_0565 numpix, tmp1, tmp2
+    vuzp.u8 d0, d1
+    vuzp.u8 d2, d3
+    vuzp.u8 d1, d3
+    vuzp.u8 d0, d2
+    convert_8888_to_0565 d2, d1, d0, q1, tmp1, tmp2
+.if numpix == 4
+    vst1.16   {d2}, [OUT]!
+.elseif numpix == 2
+    vst1.32   {d2[0]}, [OUT]!
+.elseif numpix == 1
+    vst1.16   {d2[0]}, [OUT]!
+.else
+    .error bilinear_store_0565 numpix is unsupported
+.endif
+.endm
+
+.macro bilinear_interpolate_last_pixel src_fmt, dst_fmt
+    bilinear_load_&src_fmt d0, d1, d2
+    vmull.u8  q1, d0, d28
+    vmlal.u8  q1, d1, d29
+    vshr.u16  d30, d24, #8
+    /* 4 cycles bubble */
+    vshll.u16 q0, d2, #8
+    vmlsl.u16 q0, d2, d30
+    vmlal.u16 q0, d3, d30
+    /* 5 cycles bubble */
+    vshrn.u32 d0, q0, #16
+    /* 3 cycles bubble */
+    vmovn.u16 d0, q0
+    /* 1 cycle bubble */
+    bilinear_store_&dst_fmt 1, q2, q3
+.endm
+
+.macro bilinear_interpolate_two_pixels src_fmt, dst_fmt
+    bilinear_load_and_vertical_interpolate_two_&src_fmt \
+                q1, q11, d0, d1, d20, d21, d22, d23
+    vshr.u16  q15, q12, #8
+    vadd.u16  q12, q12, q13
+    vshll.u16 q0, d2, #8
+    vmlsl.u16 q0, d2, d30
+    vmlal.u16 q0, d3, d30
+    vshll.u16 q10, d22, #8
+    vmlsl.u16 q10, d22, d31
+    vmlal.u16 q10, d23, d31
+    vshrn.u32 d30, q0, #16
+    vshrn.u32 d31, q10, #16
+    vmovn.u16 d0, q15
+    bilinear_store_&dst_fmt 2, q2, q3
+.endm
+
+.macro bilinear_interpolate_four_pixels src_fmt, dst_fmt
+    bilinear_load_and_vertical_interpolate_four_&src_fmt \
+                q1, q11, d0, d1, d20, d21, d22, d23 \
+                q3, q9,  d4, d5, d16, d17, d18, d19
+    pld       [TMP1, PF_OFFS]
+    vshr.u16  q15, q12, #8
+    vadd.u16  q12, q12, q13
+    vshll.u16 q0, d2, #8
+    vmlsl.u16 q0, d2, d30
+    vmlal.u16 q0, d3, d30
+    vshll.u16 q10, d22, #8
+    vmlsl.u16 q10, d22, d31
+    vmlal.u16 q10, d23, d31
+    vshr.u16  q15, q12, #8
+    vshll.u16 q2, d6, #8
+    vmlsl.u16 q2, d6, d30
+    vmlal.u16 q2, d7, d30
+    vshll.u16 q8, d18, #8
+    pld       [TMP2, PF_OFFS]
+    vmlsl.u16 q8, d18, d31
+    vmlal.u16 q8, d19, d31
+    vadd.u16  q12, q12, q13
+    vshrn.u32 d0, q0, #16
+    vshrn.u32 d1, q10, #16
+    vshrn.u32 d4, q2, #16
+    vshrn.u32 d5, q8, #16
+    vmovn.u16 d0, q0
+    vmovn.u16 d1, q2
+    bilinear_store_&dst_fmt 4, q2, q3
+.endm
+
+/*
+ * Main template macro for generating NEON-optimized bilinear scanline
+ * functions.
+ *
+ * TODO: use software pipelining and aligned writes to the destination buffer
+ *       in order to improve performance
+ *
+ * The bilinear scanline scaler macro template takes the following arguments:
+ *  fname             - name of the function to generate
+ *  src_fmt           - source color format (8888 or 0565)
+ *  dst_fmt           - destination color format (8888 or 0565)
+ *  bpp_shift         - (1 << bpp_shift) is the size of a source pixel in bytes
+ *  prefetch_distance - number of pixels to prefetch ahead in the
+ *                      source image
+ */
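Judging from the register and stack assignments below, a function emitted by
this template has roughly the following C-level signature (a sketch; the
argument types are inferred, and the pointers narrow to uint16_t for the 0565
formats):

#include <stdint.h>

void
pixman_scaled_bilinear_scanline_8888_8888_SRC_asm_neon (
    uint32_t       *out,     /* r0: destination scanline             */
    const uint32_t *top,     /* r1: top source scanline              */
    const uint32_t *bottom,  /* r2: bottom source scanline           */
    int32_t         wt,      /* r3: weight of the top scanline       */
    int32_t         wb,      /* stack: weight of the bottom scanline */
    uint32_t        x,       /* stack: 16.16 fixed-point x           */
    uint32_t        ux,      /* stack: 16.16 fixed-point x increment */
    int32_t         width);  /* stack: number of pixels to produce   */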
+
+.macro generate_bilinear_scanline_func fname, src_fmt, dst_fmt, \
+                                       bpp_shift, prefetch_distance
+
+pixman_asm_function fname
+    OUT       .req      r0
+    TOP       .req      r1
+    BOTTOM    .req      r2
+    WT        .req      r3
+    WB        .req      r4
+    X         .req      r5
+    UX        .req      r6
+    WIDTH     .req      ip
+    TMP1      .req      r3
+    TMP2      .req      r4
+    PF_OFFS   .req      r7
+    TMP3      .req      r8
+    TMP4      .req      r9
+
+    mov       ip, sp
+    push      {r4, r5, r6, r7, r8, r9}
+    mov       PF_OFFS, #prefetch_distance
+    ldmia     ip, {WB, X, UX, WIDTH}
+    mul       PF_OFFS, PF_OFFS, UX
+
+    cmp       WIDTH, #0
+    ble       3f
+
+    vdup.u16  q12, X
+    vdup.u16  q13, UX
+    vdup.u8   d28, WT
+    vdup.u8   d29, WB
+    vadd.u16  d25, d25, d26
+    vadd.u16  q13, q13, q13
+
+    subs      WIDTH, WIDTH, #4
+    blt       1f
+    mov       PF_OFFS, PF_OFFS, asr #(16 - bpp_shift)
+0:
+    bilinear_interpolate_four_pixels src_fmt, dst_fmt
+    subs      WIDTH, WIDTH, #4
+    bge       0b
+1:
+    tst       WIDTH, #2
+    beq       2f
+    bilinear_interpolate_two_pixels src_fmt, dst_fmt
+2:
+    tst       WIDTH, #1
+    beq       3f
+    bilinear_interpolate_last_pixel src_fmt, dst_fmt
+3:
+    pop       {r4, r5, r6, r7, r8, r9}
+    bx        lr
+
+    .unreq    OUT
+    .unreq    TOP
+    .unreq    BOTTOM
+    .unreq    WT
+    .unreq    WB
+    .unreq    X
+    .unreq    UX
+    .unreq    WIDTH
+    .unreq    TMP1
+    .unreq    TMP2
+    .unreq    PF_OFFS
+    .unreq    TMP3
+    .unreq    TMP4
+.endfunc
+
+.endm
+
+generate_bilinear_scanline_func \
+    pixman_scaled_bilinear_scanline_8888_8888_SRC_asm_neon, 8888, 8888, 2, 28
+
+generate_bilinear_scanline_func \
+    pixman_scaled_bilinear_scanline_8888_0565_SRC_asm_neon, 8888, 0565, 2, 28
+
+generate_bilinear_scanline_func \
+    pixman_scaled_bilinear_scanline_0565_x888_SRC_asm_neon, 0565, 8888, 1, 28
+
+generate_bilinear_scanline_func \
+    pixman_scaled_bilinear_scanline_0565_0565_SRC_asm_neon, 0565, 0565, 1, 28
--- a/gfx/cairo/libpixman/src/pixman-arm-neon-asm.h
+++ b/gfx/cairo/libpixman/src/pixman-arm-neon-asm.h
@@ -200,16 +200,131 @@
 .macro pixst_a numpix, bpp, basereg, mem_operand
 .if (bpp * numpix) <= 128
     pixst numpix, bpp, basereg, mem_operand, %(bpp * numpix)
 .else
     pixst numpix, bpp, basereg, mem_operand, 128
 .endif
 .endm
 
+/*
+ * Pixel fetcher for nearest scaling (needs TMP1, TMP2, VX, UNIT_X register
+ * aliases to be defined)
+ */
+.macro pixld1_s elem_size, reg1, mem_operand
+.if elem_size == 16
+    mov     TMP1, VX, asr #16
+    add     VX, VX, UNIT_X
+    add     TMP1, mem_operand, TMP1, asl #1
+    mov     TMP2, VX, asr #16
+    add     VX, VX, UNIT_X
+    add     TMP2, mem_operand, TMP2, asl #1
+    vld1.16 {d&reg1&[0]}, [TMP1, :16]
+    mov     TMP1, VX, asr #16
+    add     VX, VX, UNIT_X
+    add     TMP1, mem_operand, TMP1, asl #1
+    vld1.16 {d&reg1&[1]}, [TMP2, :16]
+    mov     TMP2, VX, asr #16
+    add     VX, VX, UNIT_X
+    add     TMP2, mem_operand, TMP2, asl #1
+    vld1.16 {d&reg1&[2]}, [TMP1, :16]
+    vld1.16 {d&reg1&[3]}, [TMP2, :16]
+.elseif elem_size == 32
+    mov     TMP1, VX, asr #16
+    add     VX, VX, UNIT_X
+    add     TMP1, mem_operand, TMP1, asl #2
+    mov     TMP2, VX, asr #16
+    add     VX, VX, UNIT_X
+    add     TMP2, mem_operand, TMP2, asl #2
+    vld1.32 {d&reg1&[0]}, [TMP1, :32]
+    vld1.32 {d&reg1&[1]}, [TMP2, :32]
+.else
+    .error "unsupported"
+.endif
+.endm
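Each element fetched by these macros comes from source position VX >> 16, with
VX advanced by UNIT_X afterwards; the interleaved mov/add/vld pattern only
hides address-generation latency. A scalar sketch of the 32bpp case:

#include <stdint.h>

/* Nearest-neighbor fetch of one 32bpp pixel. VX and UNIT_X are 16.16
 * fixed-point; 'asr #16' in the asm is an arithmetic shift, hence the
 * signed coordinate here. */
static inline uint32_t
fetch_nearest_32 (const uint32_t *src, int32_t *vx, int32_t unit_x)
{
    uint32_t px = src[*vx >> 16];
    *vx += unit_x;
    return px;
}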
+
+.macro pixld2_s elem_size, reg1, reg2, mem_operand
+.if elem_size == 32
+    mov     TMP1, VX, asr #16
+    add     VX, VX, UNIT_X, asl #1
+    add     TMP1, mem_operand, TMP1, asl #2
+    mov     TMP2, VX, asr #16
+    sub     VX, VX, UNIT_X
+    add     TMP2, mem_operand, TMP2, asl #2
+    vld1.32 {d&reg1&[0]}, [TMP1, :32]
+    mov     TMP1, VX, asr #16
+    add     VX, VX, UNIT_X, asl #1
+    add     TMP1, mem_operand, TMP1, asl #2
+    vld1.32 {d&reg2&[0]}, [TMP2, :32]
+    mov     TMP2, VX, asr #16
+    add     VX, VX, UNIT_X
+    add     TMP2, mem_operand, TMP2, asl #2
+    vld1.32 {d&reg1&[1]}, [TMP1, :32]
+    vld1.32 {d&reg2&[1]}, [TMP2, :32]
+.else
+    pixld1_s elem_size, reg1, mem_operand
+    pixld1_s elem_size, reg2, mem_operand
+.endif
+.endm
+
+.macro pixld0_s elem_size, reg1, idx, mem_operand
+.if elem_size == 16
+    mov     TMP1, VX, asr #16
+    add     VX, VX, UNIT_X
+    add     TMP1, mem_operand, TMP1, asl #1
+    vld1.16 {d&reg1&[idx]}, [TMP1, :16]
+.elseif elem_size == 32
+    mov     TMP1, VX, asr #16
+    add     VX, VX, UNIT_X
+    add     TMP1, mem_operand, TMP1, asl #2
+    vld1.32 {d&reg1&[idx]}, [TMP1, :32]
+.endif
+.endm
+
+.macro pixld_s_internal numbytes, elem_size, basereg, mem_operand
+.if numbytes == 32
+    pixld2_s elem_size, %(basereg+4), %(basereg+5), mem_operand
+    pixld2_s elem_size, %(basereg+6), %(basereg+7), mem_operand
+    pixdeinterleave elem_size, %(basereg+4)
+.elseif numbytes == 16
+    pixld2_s elem_size, %(basereg+2), %(basereg+3), mem_operand
+.elseif numbytes == 8
+    pixld1_s elem_size, %(basereg+1), mem_operand
+.elseif numbytes == 4
+    .if elem_size == 32
+        pixld0_s elem_size, %(basereg+0), 1, mem_operand
+    .elseif elem_size == 16
+        pixld0_s elem_size, %(basereg+0), 2, mem_operand
+        pixld0_s elem_size, %(basereg+0), 3, mem_operand
+    .else
+        pixld0_s elem_size, %(basereg+0), 4, mem_operand
+        pixld0_s elem_size, %(basereg+0), 5, mem_operand
+        pixld0_s elem_size, %(basereg+0), 6, mem_operand
+        pixld0_s elem_size, %(basereg+0), 7, mem_operand
+    .endif
+.elseif numbytes == 2
+    .if elem_size == 16
+        pixld0_s elem_size, %(basereg+0), 1, mem_operand
+    .else
+        pixld0_s elem_size, %(basereg+0), 2, mem_operand
+        pixld0_s elem_size, %(basereg+0), 3, mem_operand
+    .endif
+.elseif numbytes == 1
+    pixld0_s elem_size, %(basereg+0), 1, mem_operand
+.else
+    .error "unsupported size: numbytes"
+.endif
+.endm
+
+.macro pixld_s numpix, bpp, basereg, mem_operand
+.if bpp > 0
+    pixld_s_internal %(numpix * bpp / 8), %(bpp), basereg, mem_operand
+.endif
+.endm
+
 .macro vuzp8 reg1, reg2
     vuzp.8 d&reg1, d&reg2
 .endm
 
 .macro vzip8 reg1, reg2
     vzip.8 d&reg1, d&reg2
 .endm
 
@@ -311,16 +426,21 @@
     pld [DST_R, #(PREFETCH_DISTANCE_SIMPLE * dst_r_bpp / 8)]
 .endif
 .if mask_bpp > 0
     pld [MASK, #(PREFETCH_DISTANCE_SIMPLE * mask_bpp / 8)]
 .endif
 .endif
 .endm
 
+.macro fetch_mask_pixblock
+    pixld       pixblock_size, mask_bpp, \
+                (mask_basereg - pixblock_size * mask_bpp / 64), MASK
+.endm
+
 /*
  * Macro used to process leading pixels until the destination pointer
  * is properly aligned (to a 16-byte boundary). When the destination
  * buffer uses a 16bpp format, this is unnecessary, or even pointless.
  */
 .macro ensure_destination_ptr_alignment process_pixblock_head, \
                                         process_pixblock_tail, \
                                         process_pixblock_tail_head
@@ -330,17 +450,17 @@
 
 .irp lowbit, 1, 2, 4, 8, 16
 local skip1
 .if (dst_w_bpp <= (lowbit * 8)) && ((lowbit * 8) < (pixblock_size * dst_w_bpp))
 .if lowbit < 16 /* we don't need more than 16-byte alignment */
     tst         DST_R, #lowbit
     beq         1f
 .endif
-    pixld       (lowbit * 8 / dst_w_bpp), src_bpp, src_basereg, SRC
+    pixld_src   (lowbit * 8 / dst_w_bpp), src_bpp, src_basereg, SRC
     pixld       (lowbit * 8 / dst_w_bpp), mask_bpp, mask_basereg, MASK
 .if dst_r_bpp > 0
     pixld_a     (lowbit * 8 / dst_r_bpp), dst_r_bpp, dst_r_basereg, DST_R
 .else
     add         DST_R, DST_R, #lowbit
 .endif
     PF add      PF_X, PF_X, #(lowbit * 8 / dst_w_bpp)
     sub         W, W, #(lowbit * 8 / dst_w_bpp)
@@ -392,17 +512,17 @@ 2:
                                process_pixblock_tail, \
                                process_pixblock_tail_head
     tst         W, #(pixblock_size - 1)
     beq         2f
 .irp chunk_size, 16, 8, 4, 2, 1
 .if pixblock_size > chunk_size
     tst         W, #chunk_size
     beq         1f
-    pixld       chunk_size, src_bpp, src_basereg, SRC
+    pixld_src   chunk_size, src_bpp, src_basereg, SRC
     pixld       chunk_size, mask_bpp, mask_basereg, MASK
 .if dst_aligned_flag != 0
     pixld_a     chunk_size, dst_r_bpp, dst_r_basereg, DST_R
 .else
     pixld       chunk_size, dst_r_bpp, dst_r_basereg, DST_R
 .endif
 .if cache_preload_flag != 0
     PF add      PF_X, PF_X, #chunk_size
@@ -526,16 +646,23 @@ fname:
     .set mask_bpp, mask_bpp_
     .set dst_w_bpp, dst_w_bpp_
     .set pixblock_size, pixblock_size_
     .set dst_w_basereg, dst_w_basereg_
     .set dst_r_basereg, dst_r_basereg_
     .set src_basereg, src_basereg_
     .set mask_basereg, mask_basereg_
 
+    .macro pixld_src x:vararg
+        pixld x
+    .endm
+    .macro fetch_src_pixblock
+        pixld_src   pixblock_size, src_bpp, \
+                    (src_basereg - pixblock_size * src_bpp / 64), SRC
+    .endm
 /*
  * Assign symbolic names to registers
  */
     W           .req        r0      /* width (is updated during processing) */
     H           .req        r1      /* height (is updated during processing) */
     DST_W       .req        r2      /* destination buffer pointer for writes */
     DST_STRIDE  .req        r3      /* destination image stride */
     SRC         .req        r4      /* source buffer pointer */
@@ -691,18 +818,17 @@ fname:
 0:
     ensure_destination_ptr_alignment process_pixblock_head, \
                                      process_pixblock_tail, \
                                      process_pixblock_tail_head
 
     /* Implement "head (tail_head) ... (tail_head) tail" loop pattern */
     pixld_a     pixblock_size, dst_r_bpp, \
                 (dst_r_basereg - pixblock_size * dst_r_bpp / 64), DST_R
-    pixld       pixblock_size, src_bpp, \
-                (src_basereg - pixblock_size * src_bpp / 64), SRC
+    fetch_src_pixblock
     pixld       pixblock_size, mask_bpp, \
                 (mask_basereg - pixblock_size * mask_bpp / 64), MASK
     PF add      PF_X, PF_X, #pixblock_size
     process_pixblock_head
     cache_preload 0, pixblock_size
     cache_preload_simple
     subs        W, W, #(pixblock_size * 2)
     blt         2f
@@ -734,18 +860,17 @@ 2:
  * nor prefetch are used.
  */
 8:
     /* Process exactly pixblock_size pixels if needed */
     tst         W, #pixblock_size
     beq         1f
     pixld       pixblock_size, dst_r_bpp, \
                 (dst_r_basereg - pixblock_size * dst_r_bpp / 64), DST_R
-    pixld       pixblock_size, src_bpp, \
-                (src_basereg - pixblock_size * src_bpp / 64), SRC
+    fetch_src_pixblock
     pixld       pixblock_size, mask_bpp, \
                 (mask_basereg - pixblock_size * mask_bpp / 64), MASK
     process_pixblock_head
     process_pixblock_tail
     pixst       pixblock_size, dst_w_bpp, \
                 (dst_w_basereg - pixblock_size * dst_w_bpp / 64), DST_W
 1:
     /* Process the remaining trailing pixels in the scanline */
@@ -756,16 +881,19 @@ 1:
     advance_to_next_scanline 8b
 9:
 .if regs_shortage
     pop         {r0, r1}
 .endif
     cleanup
     pop         {r4-r12, pc}  /* exit */
 
+    .purgem     fetch_src_pixblock
+    .purgem     pixld_src
+
     .unreq      SRC
     .unreq      MASK
     .unreq      DST_R
     .unreq      DST_W
     .unreq      ORIG_W
     .unreq      W
     .unreq      H
     .unreq      SRC_STRIDE
@@ -779,17 +907,18 @@ 9:
     .unreq      DUMMY
     .endfunc
 .endm
 
 /*
  * A simplified variant of the function generation template, for single
  * scanline processing (used to implement pixman combine functions)
  */
-.macro generate_composite_function_single_scanline fname, \
+.macro generate_composite_function_scanline        use_nearest_scaling, \
+                                                   fname, \
                                                    src_bpp_, \
                                                    mask_bpp_, \
                                                    dst_w_bpp_, \
                                                    flags, \
                                                    pixblock_size_, \
                                                    init, \
                                                    cleanup, \
                                                    process_pixblock_head, \
@@ -816,54 +945,88 @@ fname:
     .set src_bpp, src_bpp_
     .set mask_bpp, mask_bpp_
     .set dst_w_bpp, dst_w_bpp_
     .set pixblock_size, pixblock_size_
     .set dst_w_basereg, dst_w_basereg_
     .set dst_r_basereg, dst_r_basereg_
     .set src_basereg, src_basereg_
     .set mask_basereg, mask_basereg_
-/*
- * Assign symbolic names to registers
- */
+
+.if use_nearest_scaling != 0
+    /*
+     * Assign symbolic names to registers for nearest scaling
+     */
+    W           .req        r0
+    DST_W       .req        r1
+    SRC         .req        r2
+    VX          .req        r3
+    UNIT_X      .req        ip
+    MASK        .req        lr
+    TMP1        .req        r4
+    TMP2        .req        r5
+    DST_R       .req        r6
+
+    .macro pixld_src x:vararg
+        pixld_s x
+    .endm
+
+    ldr         UNIT_X, [sp]
+    push        {r4-r6, lr}
+    .if mask_bpp != 0
+    ldr         MASK, [sp, #(16 + 4)]
+    .endif
+.else
+    /*
+     * Assign symbolic names to registers
+     */
     W           .req        r0      /* width (is updated during processing) */
     DST_W       .req        r1      /* destination buffer pointer for writes */
     SRC         .req        r2      /* source buffer pointer */
     DST_R       .req        ip      /* destination buffer pointer for reads */
     MASK        .req        r3      /* mask pointer */
 
+    .macro pixld_src x:vararg
+        pixld x
+    .endm
+.endif
+
 .if (((flags) & FLAG_DST_READWRITE) != 0)
     .set dst_r_bpp, dst_w_bpp
 .else
     .set dst_r_bpp, 0
 .endif
 .if (((flags) & FLAG_DEINTERLEAVE_32BPP) != 0)
     .set DEINTERLEAVE_32BPP_ENABLED, 1
 .else
     .set DEINTERLEAVE_32BPP_ENABLED, 0
 .endif
 
+    .macro fetch_src_pixblock
+        pixld_src   pixblock_size, src_bpp, \
+                    (src_basereg - pixblock_size * src_bpp / 64), SRC
+    .endm
+
     init
     mov         DST_R, DST_W
 
     cmp         W, #pixblock_size
     blt         8f
 
     ensure_destination_ptr_alignment process_pixblock_head, \
                                      process_pixblock_tail, \
                                      process_pixblock_tail_head
 
     subs        W, W, #pixblock_size
     blt         7f
 
     /* Implement "head (tail_head) ... (tail_head) tail" loop pattern */
     pixld_a     pixblock_size, dst_r_bpp, \
                 (dst_r_basereg - pixblock_size * dst_r_bpp / 64), DST_R
-    pixld       pixblock_size, src_bpp, \
-                (src_basereg - pixblock_size * src_bpp / 64), SRC
+    fetch_src_pixblock
     pixld       pixblock_size, mask_bpp, \
                 (mask_basereg - pixblock_size * mask_bpp / 64), MASK
     process_pixblock_head
     subs        W, W, #pixblock_size
     blt         2f
 1:
     process_pixblock_tail_head
     subs        W, W, #pixblock_size
@@ -875,35 +1038,67 @@ 2:
 7:
     /* Process the remaining trailing pixels in the scanline (dst aligned) */
     process_trailing_pixels 0, 1, \
                             process_pixblock_head, \
                             process_pixblock_tail, \
                             process_pixblock_tail_head
 
     cleanup
-    bx         lr  /* exit */
+.if use_nearest_scaling != 0
+    pop         {r4-r6, pc}  /* exit */
+.else
+    bx          lr  /* exit */
+.endif
 8:
     /* Process the remaining trailing pixels in the scanline (dst unaligned) */
     process_trailing_pixels 0, 0, \
                             process_pixblock_head, \
                             process_pixblock_tail, \
                             process_pixblock_tail_head
 
     cleanup
+
+.if use_nearest_scaling != 0
+    pop         {r4-r6, pc}  /* exit */
+
+    .unreq      DST_R
+    .unreq      SRC
+    .unreq      W
+    .unreq      VX
+    .unreq      UNIT_X
+    .unreq      TMP1
+    .unreq      TMP2
+    .unreq      DST_W
+    .unreq      MASK
+
+.else
     bx          lr  /* exit */
 
     .unreq      SRC
     .unreq      MASK
     .unreq      DST_R
     .unreq      DST_W
     .unreq      W
+.endif
+
+    .purgem     fetch_src_pixblock
+    .purgem     pixld_src
+
     .endfunc
 .endm
 
+.macro generate_composite_function_single_scanline x:vararg
+    generate_composite_function_scanline 0, x
+.endm
+
+.macro generate_composite_function_nearest_scanline x:vararg
+    generate_composite_function_scanline 1, x
+.endm
+
 /* Default prologue/epilogue, nothing special needs to be done */
 
 .macro default_init
 .endm
 
 .macro default_cleanup
 .endm
 
@@ -958,8 +1153,25 @@ 8:
  */
 .macro convert_8888_to_0565 in_r, in_g, in_b, out, tmp1, tmp2
     vshll.u8    tmp1, in_g, #8
     vshll.u8    out, in_r, #8
     vshll.u8    tmp2, in_b, #8
     vsri.u16    out, tmp1, #5
     vsri.u16    out, tmp2, #11
 .endm
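A scalar equivalent of convert_8888_to_0565 for one pixel (the vshll/vsri
sequence above simply truncates each channel to its 0565 width and packs):

#include <stdint.h>

static inline uint16_t
convert_8888_to_0565_px (uint32_t p)
{
    uint32_t r = (p >> 16) & 0xff;
    uint32_t g = (p >>  8) & 0xff;
    uint32_t b =  p        & 0xff;

    return (uint16_t) (((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}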
+
+/*
+ * Conversion of four r5g6b5 pixels (in) to four x8r8g8b8 pixels, returned
+ * in the (out0, out1) register pair. Requires one temporary 64-bit
+ * register (tmp). 'out1' and 'in' may overlap; the original value of
+ * 'in' is lost.
+ */
+.macro convert_four_0565_to_x888_packed in, out0, out1, tmp
+    vshl.u16    out0, in,   #5  /* G top 6 bits */
+    vshl.u16    tmp,  in,   #11 /* B top 5 bits */
+    vsri.u16    in,   in,   #5  /* R is ready in top bits */
+    vsri.u16    out0, out0, #6  /* G is ready in top bits */
+    vsri.u16    tmp,  tmp,  #5  /* B is ready in top bits */
+    vshr.u16    out1, in,   #8  /* R is in place */
+    vsri.u16    out0, tmp,  #8  /* G & B is in place */
+    vzip.u16    out0, out1      /* everything is in place */
+.endm
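A scalar equivalent of this expansion for one pixel: each channel is widened to
8 bits by replicating its top bits into the newly opened low bits, so full
intensity (0x1f or 0x3f) maps to 0xff and 0 stays 0.

#include <stdint.h>

static inline uint32_t
convert_0565_to_x888_px (uint16_t p)
{
    uint32_t r = (p >> 11) & 0x1f;
    uint32_t g = (p >>  5) & 0x3f;
    uint32_t b =  p        & 0x1f;

    r = (r << 3) | (r >> 2);
    g = (g << 2) | (g >> 4);
    b = (b << 3) | (b >> 2);

    return (r << 16) | (g << 8) | b;  /* the 'x' byte is left zero */
}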
--- a/gfx/cairo/libpixman/src/pixman-arm-neon.c
+++ b/gfx/cairo/libpixman/src/pixman-arm-neon.c
@@ -47,61 +47,100 @@ PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (neon,
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (neon, src_0565_8888,
                                    uint16_t, 1, uint32_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (neon, src_0888_8888_rev,
                                    uint8_t, 3, uint32_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (neon, src_0888_0565_rev,
                                    uint8_t, 3, uint16_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (neon, src_pixbuf_8888,
                                    uint32_t, 1, uint32_t, 1)
+PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (neon, src_rpixbuf_8888,
+                                   uint32_t, 1, uint32_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (neon, add_8_8,
                                    uint8_t, 1, uint8_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (neon, add_8888_8888,
                                    uint32_t, 1, uint32_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (neon, over_8888_0565,
                                    uint32_t, 1, uint16_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (neon, over_8888_8888,
                                    uint32_t, 1, uint32_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (neon, out_reverse_8_0565,
                                    uint8_t, 1, uint16_t, 1)
 
-PIXMAN_ARM_BIND_FAST_PATH_N_DST (neon, over_n_0565,
+PIXMAN_ARM_BIND_FAST_PATH_N_DST (SKIP_ZERO_SRC, neon, over_n_0565,
                                  uint16_t, 1)
-PIXMAN_ARM_BIND_FAST_PATH_N_DST (neon, over_n_8888,
+PIXMAN_ARM_BIND_FAST_PATH_N_DST (SKIP_ZERO_SRC, neon, over_n_8888,
                                  uint32_t, 1)
-PIXMAN_ARM_BIND_FAST_PATH_N_DST (neon, over_reverse_n_8888,
+PIXMAN_ARM_BIND_FAST_PATH_N_DST (SKIP_ZERO_SRC, neon, over_reverse_n_8888,
                                  uint32_t, 1)
+PIXMAN_ARM_BIND_FAST_PATH_N_DST (0, neon, in_n_8,
+                                 uint8_t, 1)
 
-PIXMAN_ARM_BIND_FAST_PATH_N_MASK_DST (neon, over_n_8_0565,
+PIXMAN_ARM_BIND_FAST_PATH_N_MASK_DST (SKIP_ZERO_SRC, neon, over_n_8_0565,
                                       uint8_t, 1, uint16_t, 1)
-PIXMAN_ARM_BIND_FAST_PATH_N_MASK_DST (neon, over_n_8_8888,
+PIXMAN_ARM_BIND_FAST_PATH_N_MASK_DST (SKIP_ZERO_SRC, neon, over_n_8_8888,
                                       uint8_t, 1, uint32_t, 1)
-PIXMAN_ARM_BIND_FAST_PATH_N_MASK_DST (neon, over_n_8888_8888_ca,
+PIXMAN_ARM_BIND_FAST_PATH_N_MASK_DST (SKIP_ZERO_SRC, neon, over_n_8888_8888_ca,
                                       uint32_t, 1, uint32_t, 1)
-PIXMAN_ARM_BIND_FAST_PATH_N_MASK_DST (neon, add_n_8_8,
+PIXMAN_ARM_BIND_FAST_PATH_N_MASK_DST (SKIP_ZERO_SRC, neon, over_n_8_8,
+                                      uint8_t, 1, uint8_t, 1)
+PIXMAN_ARM_BIND_FAST_PATH_N_MASK_DST (SKIP_ZERO_SRC, neon, add_n_8_8,
                                       uint8_t, 1, uint8_t, 1)
+PIXMAN_ARM_BIND_FAST_PATH_N_MASK_DST (SKIP_ZERO_SRC, neon, add_n_8_8888,
+                                      uint8_t, 1, uint32_t, 1)
 
-PIXMAN_ARM_BIND_FAST_PATH_SRC_N_DST (neon, over_8888_n_8888,
+PIXMAN_ARM_BIND_FAST_PATH_SRC_N_DST (SKIP_ZERO_MASK, neon, over_8888_n_8888,
+                                     uint32_t, 1, uint32_t, 1)
+PIXMAN_ARM_BIND_FAST_PATH_SRC_N_DST (SKIP_ZERO_MASK, neon, over_8888_n_0565,
+                                     uint32_t, 1, uint16_t, 1)
+PIXMAN_ARM_BIND_FAST_PATH_SRC_N_DST (SKIP_ZERO_MASK, neon, over_0565_n_0565,
+                                     uint16_t, 1, uint16_t, 1)
+PIXMAN_ARM_BIND_FAST_PATH_SRC_N_DST (SKIP_ZERO_MASK, neon, add_8888_n_8888,
                                      uint32_t, 1, uint32_t, 1)
 
 PIXMAN_ARM_BIND_FAST_PATH_SRC_MASK_DST (neon, add_8_8_8,
                                         uint8_t, 1, uint8_t, 1, uint8_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_MASK_DST (neon, add_0565_8_0565,
                                         uint16_t, 1, uint8_t, 1, uint16_t, 1)
+PIXMAN_ARM_BIND_FAST_PATH_SRC_MASK_DST (neon, add_8888_8_8888,
+                                        uint32_t, 1, uint8_t, 1, uint32_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_MASK_DST (neon, add_8888_8888_8888,
                                         uint32_t, 1, uint32_t, 1, uint32_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_MASK_DST (neon, over_8888_8_8888,
                                         uint32_t, 1, uint8_t, 1, uint32_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_MASK_DST (neon, over_8888_8888_8888,
                                         uint32_t, 1, uint32_t, 1, uint32_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_MASK_DST (neon, over_8888_8_0565,
                                         uint32_t, 1, uint8_t, 1, uint16_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_MASK_DST (neon, over_0565_8_0565,
                                         uint16_t, 1, uint8_t, 1, uint16_t, 1)
 
+PIXMAN_ARM_BIND_SCALED_NEAREST_SRC_DST (neon, 8888_8888, OVER,
+                                        uint32_t, uint32_t)
+PIXMAN_ARM_BIND_SCALED_NEAREST_SRC_DST (neon, 8888_0565, OVER,
+                                        uint32_t, uint16_t)
+PIXMAN_ARM_BIND_SCALED_NEAREST_SRC_DST (neon, 8888_0565, SRC,
+                                        uint32_t, uint16_t)
+PIXMAN_ARM_BIND_SCALED_NEAREST_SRC_DST (neon, 0565_8888, SRC,
+                                        uint16_t, uint32_t)
+
+PIXMAN_ARM_BIND_SCALED_NEAREST_SRC_A8_DST (SKIP_ZERO_SRC, neon, 8888_8_0565,
+                                           OVER, uint32_t, uint16_t)
+PIXMAN_ARM_BIND_SCALED_NEAREST_SRC_A8_DST (SKIP_ZERO_SRC, neon, 0565_8_0565,
+                                           OVER, uint16_t, uint16_t)
+
+PIXMAN_ARM_BIND_SCALED_BILINEAR_SRC_DST (0, neon, 8888_8888, SRC,
+                                         uint32_t, uint32_t)
+PIXMAN_ARM_BIND_SCALED_BILINEAR_SRC_DST (0, neon, 8888_0565, SRC,
+                                         uint32_t, uint16_t)
+PIXMAN_ARM_BIND_SCALED_BILINEAR_SRC_DST (0, neon, 0565_x888, SRC,
+                                         uint16_t, uint32_t)
+PIXMAN_ARM_BIND_SCALED_BILINEAR_SRC_DST (0, neon, 0565_0565, SRC,
+                                         uint16_t, uint16_t)
+
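The new first argument to these binding macros tells the generated wrapper whether it may return early when a solid source or mask is fully transparent: OVER and ADD with a zero source are no-ops, whereas IN is not, hence the 0 flag on in_n_8. A rough sketch of what the flag buys, assuming the usual SKIP_ZERO_* bit definitions in pixman-arm-common.h (simplified; the real wrapper is generated by the macro):

    static void
    example_composite_n_dst (int flags, uint32_t src,
                             uint8_t *dst, int32_t dst_stride,
                             int32_t width, int32_t height)
    {
        if ((flags & SKIP_ZERO_SRC) && src == 0)
            return; /* transparent solid source: nothing to do for OVER/ADD */

        /* ... otherwise call the assembly fast path ... */
    }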
 void
 pixman_composite_src_n_8_asm_neon (int32_t   w,
                                    int32_t   h,
                                    uint8_t  *dst,
                                    int32_t   dst_stride,
                                    uint8_t   src);
 
 void
@@ -221,31 +260,39 @@ static const pixman_fast_path_t arm_neon
     PIXMAN_STD_FAST_PATH (SRC,  a8r8g8b8, null,     a8r8g8b8, neon_composite_src_8888_8888),
     PIXMAN_STD_FAST_PATH (SRC,  a8b8g8r8, null,     a8b8g8r8, neon_composite_src_8888_8888),
     PIXMAN_STD_FAST_PATH (SRC,  x8r8g8b8, null,     a8r8g8b8, neon_composite_src_x888_8888),
     PIXMAN_STD_FAST_PATH (SRC,  x8b8g8r8, null,     a8b8g8r8, neon_composite_src_x888_8888),
     PIXMAN_STD_FAST_PATH (SRC,  r8g8b8,   null,     r8g8b8,   neon_composite_src_0888_0888),
     PIXMAN_STD_FAST_PATH (SRC,  b8g8r8,   null,     x8r8g8b8, neon_composite_src_0888_8888_rev),
     PIXMAN_STD_FAST_PATH (SRC,  b8g8r8,   null,     r5g6b5,   neon_composite_src_0888_0565_rev),
     PIXMAN_STD_FAST_PATH (SRC,  pixbuf,   pixbuf,   a8r8g8b8, neon_composite_src_pixbuf_8888),
+    PIXMAN_STD_FAST_PATH (SRC,  pixbuf,   pixbuf,   a8b8g8r8, neon_composite_src_rpixbuf_8888),
+    PIXMAN_STD_FAST_PATH (SRC,  rpixbuf,  rpixbuf,  a8r8g8b8, neon_composite_src_rpixbuf_8888),
+    PIXMAN_STD_FAST_PATH (SRC,  rpixbuf,  rpixbuf,  a8b8g8r8, neon_composite_src_pixbuf_8888),
+    PIXMAN_STD_FAST_PATH (OVER, solid,    a8,       a8,       neon_composite_over_n_8_8),
     PIXMAN_STD_FAST_PATH (OVER, solid,    a8,       r5g6b5,   neon_composite_over_n_8_0565),
     PIXMAN_STD_FAST_PATH (OVER, solid,    a8,       b5g6r5,   neon_composite_over_n_8_0565),
     PIXMAN_STD_FAST_PATH (OVER, solid,    a8,       a8r8g8b8, neon_composite_over_n_8_8888),
     PIXMAN_STD_FAST_PATH (OVER, solid,    a8,       x8r8g8b8, neon_composite_over_n_8_8888),
     PIXMAN_STD_FAST_PATH (OVER, solid,    a8,       a8b8g8r8, neon_composite_over_n_8_8888),
     PIXMAN_STD_FAST_PATH (OVER, solid,    a8,       x8b8g8r8, neon_composite_over_n_8_8888),
     PIXMAN_STD_FAST_PATH (OVER, solid,    null,     r5g6b5,   neon_composite_over_n_0565),
     PIXMAN_STD_FAST_PATH (OVER, solid,    null,     a8r8g8b8, neon_composite_over_n_8888),
     PIXMAN_STD_FAST_PATH (OVER, solid,    null,     x8r8g8b8, neon_composite_over_n_8888),
     PIXMAN_STD_FAST_PATH_CA (OVER, solid, a8r8g8b8, a8r8g8b8, neon_composite_over_n_8888_8888_ca),
     PIXMAN_STD_FAST_PATH_CA (OVER, solid, a8r8g8b8, x8r8g8b8, neon_composite_over_n_8888_8888_ca),
     PIXMAN_STD_FAST_PATH_CA (OVER, solid, a8b8g8r8, a8b8g8r8, neon_composite_over_n_8888_8888_ca),
     PIXMAN_STD_FAST_PATH_CA (OVER, solid, a8b8g8r8, x8b8g8r8, neon_composite_over_n_8888_8888_ca),
     PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, solid,    a8r8g8b8, neon_composite_over_8888_n_8888),
     PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, solid,    x8r8g8b8, neon_composite_over_8888_n_8888),
+    PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, solid,    r5g6b5,   neon_composite_over_8888_n_0565),
+    PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, solid,    b5g6r5,   neon_composite_over_8888_n_0565),
+    PIXMAN_STD_FAST_PATH (OVER, r5g6b5,   solid,    r5g6b5,   neon_composite_over_0565_n_0565),
+    PIXMAN_STD_FAST_PATH (OVER, b5g6r5,   solid,    b5g6r5,   neon_composite_over_0565_n_0565),
     PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, a8,       a8r8g8b8, neon_composite_over_8888_8_8888),
     PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, a8,       x8r8g8b8, neon_composite_over_8888_8_8888),
     PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, a8,       a8b8g8r8, neon_composite_over_8888_8_8888),
     PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, a8,       x8b8g8r8, neon_composite_over_8888_8_8888),
     PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, a8,       r5g6b5,   neon_composite_over_8888_8_0565),
     PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, a8,       b5g6r5,   neon_composite_over_8888_8_0565),
     PIXMAN_STD_FAST_PATH (OVER, r5g6b5,   a8,       r5g6b5,   neon_composite_over_0565_8_0565),
     PIXMAN_STD_FAST_PATH (OVER, b5g6r5,   a8,       b5g6r5,   neon_composite_over_0565_8_0565),
@@ -254,28 +301,72 @@ static const pixman_fast_path_t arm_neon
     PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null,     b5g6r5,   neon_composite_over_8888_0565),
     PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null,     a8r8g8b8, neon_composite_over_8888_8888),
     PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null,     x8r8g8b8, neon_composite_over_8888_8888),
     PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null,     a8b8g8r8, neon_composite_over_8888_8888),
     PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null,     x8b8g8r8, neon_composite_over_8888_8888),
     PIXMAN_STD_FAST_PATH (OVER, x8r8g8b8, null,     a8r8g8b8, neon_composite_src_x888_8888),
     PIXMAN_STD_FAST_PATH (OVER, x8b8g8r8, null,     a8b8g8r8, neon_composite_src_x888_8888),
     PIXMAN_STD_FAST_PATH (ADD,  solid,    a8,       a8,       neon_composite_add_n_8_8),
+    PIXMAN_STD_FAST_PATH (ADD,  solid,    a8,       a8r8g8b8, neon_composite_add_n_8_8888),
+    PIXMAN_STD_FAST_PATH (ADD,  solid,    a8,       a8b8g8r8, neon_composite_add_n_8_8888),
     PIXMAN_STD_FAST_PATH (ADD,  a8,       a8,       a8,       neon_composite_add_8_8_8),
     PIXMAN_STD_FAST_PATH (ADD,  r5g6b5,   a8,       r5g6b5,   neon_composite_add_0565_8_0565),
     PIXMAN_STD_FAST_PATH (ADD,  b5g6r5,   a8,       b5g6r5,   neon_composite_add_0565_8_0565),
+    PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, a8,       a8r8g8b8, neon_composite_add_8888_8_8888),
+    PIXMAN_STD_FAST_PATH (ADD,  a8b8g8r8, a8,       a8b8g8r8, neon_composite_add_8888_8_8888),
     PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, a8r8g8b8, a8r8g8b8, neon_composite_add_8888_8888_8888),
+    PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, solid,    a8r8g8b8, neon_composite_add_8888_n_8888),
+    PIXMAN_STD_FAST_PATH (ADD,  a8b8g8r8, solid,    a8b8g8r8, neon_composite_add_8888_n_8888),
     PIXMAN_STD_FAST_PATH (ADD,  a8,       null,     a8,       neon_composite_add_8_8),
     PIXMAN_STD_FAST_PATH (ADD,  a8r8g8b8, null,     a8r8g8b8, neon_composite_add_8888_8888),
     PIXMAN_STD_FAST_PATH (ADD,  a8b8g8r8, null,     a8b8g8r8, neon_composite_add_8888_8888),
+    PIXMAN_STD_FAST_PATH (IN,   solid,    null,     a8,       neon_composite_in_n_8),
     PIXMAN_STD_FAST_PATH (OVER_REVERSE, solid, null, a8r8g8b8, neon_composite_over_reverse_n_8888),
     PIXMAN_STD_FAST_PATH (OVER_REVERSE, solid, null, a8b8g8r8, neon_composite_over_reverse_n_8888),
     PIXMAN_STD_FAST_PATH (OUT_REVERSE,  a8,    null, r5g6b5,   neon_composite_out_reverse_8_0565),
     PIXMAN_STD_FAST_PATH (OUT_REVERSE,  a8,    null, b5g6r5,   neon_composite_out_reverse_8_0565),
 
+    PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (OVER, a8r8g8b8, a8r8g8b8, neon_8888_8888),
+    PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (OVER, a8b8g8r8, a8b8g8r8, neon_8888_8888),
+    PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (OVER, a8r8g8b8, x8r8g8b8, neon_8888_8888),
+    PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (OVER, a8b8g8r8, x8b8g8r8, neon_8888_8888),
+
+    PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (OVER, a8r8g8b8, r5g6b5, neon_8888_0565),
+    PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (OVER, a8b8g8r8, b5g6r5, neon_8888_0565),
+
+    PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (SRC, a8r8g8b8, r5g6b5, neon_8888_0565),
+    PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (SRC, x8r8g8b8, r5g6b5, neon_8888_0565),
+    PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (SRC, a8b8g8r8, b5g6r5, neon_8888_0565),
+    PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (SRC, x8b8g8r8, b5g6r5, neon_8888_0565),
+
+    PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (SRC, b5g6r5, x8b8g8r8, neon_0565_8888),
+    PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (SRC, r5g6b5, x8r8g8b8, neon_0565_8888),
+    /* Note: NONE repeat is not supported yet */
+    SIMPLE_NEAREST_FAST_PATH_COVER (SRC, r5g6b5, a8r8g8b8, neon_0565_8888),
+    SIMPLE_NEAREST_FAST_PATH_COVER (SRC, b5g6r5, a8b8g8r8, neon_0565_8888),
+    SIMPLE_NEAREST_FAST_PATH_PAD (SRC, r5g6b5, a8r8g8b8, neon_0565_8888),
+    SIMPLE_NEAREST_FAST_PATH_PAD (SRC, b5g6r5, a8b8g8r8, neon_0565_8888),
+
+    PIXMAN_ARM_SIMPLE_NEAREST_A8_MASK_FAST_PATH (OVER, a8r8g8b8, r5g6b5, neon_8888_8_0565),
+    PIXMAN_ARM_SIMPLE_NEAREST_A8_MASK_FAST_PATH (OVER, a8b8g8r8, b5g6r5, neon_8888_8_0565),
+
+    PIXMAN_ARM_SIMPLE_NEAREST_A8_MASK_FAST_PATH (OVER, r5g6b5, r5g6b5, neon_0565_8_0565),
+    PIXMAN_ARM_SIMPLE_NEAREST_A8_MASK_FAST_PATH (OVER, b5g6r5, b5g6r5, neon_0565_8_0565),
+
+    SIMPLE_BILINEAR_FAST_PATH (SRC, a8r8g8b8, a8r8g8b8, neon_8888_8888),
+    SIMPLE_BILINEAR_FAST_PATH (SRC, a8r8g8b8, x8r8g8b8, neon_8888_8888),
+    SIMPLE_BILINEAR_FAST_PATH (SRC, x8r8g8b8, x8r8g8b8, neon_8888_8888),
+
+    SIMPLE_BILINEAR_FAST_PATH (SRC, a8r8g8b8, r5g6b5, neon_8888_0565),
+    SIMPLE_BILINEAR_FAST_PATH (SRC, x8r8g8b8, r5g6b5, neon_8888_0565),
+
+    SIMPLE_BILINEAR_FAST_PATH (SRC, r5g6b5, x8r8g8b8, neon_0565_x888),
+    SIMPLE_BILINEAR_FAST_PATH (SRC, r5g6b5, r5g6b5, neon_0565_0565),
+
     { PIXMAN_OP_NONE },
 };
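Each PIXMAN_STD_FAST_PATH entry expands to an (operator, source, mask, destination, flags, function) tuple, and pixman walks this array top to bottom, taking the first entry whose operator and formats match the request; that is why the more specific entries sit above the generic ones. A simplified sketch of the lookup (the real matcher also compares per-image flag bits):

    static const pixman_fast_path_t *
    lookup_fast_path (const pixman_fast_path_t *paths, pixman_op_t op,
                      pixman_format_code_t src, pixman_format_code_t mask,
                      pixman_format_code_t dest)
    {
        const pixman_fast_path_t *p;

        for (p = paths; p->op != PIXMAN_OP_NONE; ++p)
        {
            if (p->op == op && p->src_format == src &&
                p->mask_format == mask && p->dest_format == dest)
                return p;
        }
        return NULL; /* delegate to the fallback implementation */
    }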
 
 static pixman_bool_t
 arm_neon_blt (pixman_implementation_t *imp,
               uint32_t *               src_bits,
               uint32_t *               dst_bits,
               int                      src_stride,
@@ -348,23 +439,18 @@ neon_combine_##name##_u (pixman_implemen
 	pixman_composite_scanline_##name##_asm_neon (width, dest, src);  \
 }
 
 BIND_COMBINE_U (over)
 BIND_COMBINE_U (add)
 BIND_COMBINE_U (out_reverse)
 
 pixman_implementation_t *
-_pixman_implementation_create_arm_neon (void)
+_pixman_implementation_create_arm_neon (pixman_implementation_t *fallback)
 {
-#ifdef USE_ARM_SIMD
-    pixman_implementation_t *fallback = _pixman_implementation_create_arm_simd ();
-#else
-    pixman_implementation_t *fallback = _pixman_implementation_create_fast_path ();
-#endif
     pixman_implementation_t *imp =
 	_pixman_implementation_create (fallback, arm_neon_fast_paths);
 
     imp->combine_32[PIXMAN_OP_OVER] = neon_combine_over_u;
     imp->combine_32[PIXMAN_OP_ADD] = neon_combine_add_u;
     imp->combine_32[PIXMAN_OP_OUT_REVERSE] = neon_combine_out_reverse_u;
 
     imp->blt = arm_neon_blt;
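Since _pixman_implementation_create_arm_neon no longer hard-codes its own fallback, the delegation chain is assembled by the caller, which pairs naturally with the new constructor-time CPU detection. A plausible shape for that caller (the signatures of the other _create functions and the feature flags are assumptions and may differ in this snapshot):

    pixman_implementation_t *imp;

    imp = _pixman_implementation_create_general ();
    imp = _pixman_implementation_create_fast_path (imp);
    #ifdef USE_ARM_SIMD
    if (have_arm_simd)  /* detected once, at constructor time */
        imp = _pixman_implementation_create_arm_simd (imp);
    #endif
    #ifdef USE_ARM_NEON
    if (have_arm_neon)
        imp = _pixman_implementation_create_arm_neon (imp);
    #endif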
--- a/gfx/cairo/libpixman/src/pixman-arm-simd-asm.S
+++ b/gfx/cairo/libpixman/src/pixman-arm-simd-asm.S
@@ -1,10 +1,11 @@
 /*
  * Copyright © 2008 Mozilla Corporation
+ * Copyright © 2010 Nokia Corporation
  *
  * Permission to use, copy, modify, distribute, and sell this software and its
  * documentation for any purpose is hereby granted without fee, provided that
  * the above copyright notice appear in all copies and that both that
  * copyright notice and this permission notice appear in supporting
  * documentation, and that the name of Mozilla Corporation not be used in
  * advertising or publicity pertaining to distribution of the software without
  * specific, written prior permission.  Mozilla Corporation makes no
@@ -323,8 +324,115 @@ 3:	ldr	r4, [sp, #12]
 	cmp	r0, r4
 	ldr	r12, [sp, #8]
 	ldr	r2, [sp]
 	bne	5b
 0:	add	sp, sp, #28
 	pop	{r4, r5, r6, r7, r8, r9, r10, r11}
 	bx	lr
 .endfunc
+
+/*
+ * Note: This code uses only armv5te instructions (not even armv6),
+ *       but is scheduled for the ARM Cortex-A8 pipeline. So it might need
+ *       to be split into a few variants, tuned for each microarchitecture.
+ *
+ * TODO: In order to get good performance on ARM9/ARM11 cores (which don't
+ * have efficient write combining), this needs to be changed to do 16-byte
+ * aligned writes using the STM instruction.
+ *
+ * The nearest scanline scaler macro template uses the following arguments:
+ *  fname                     - name of the function to generate
+ *  bpp_shift                 - (1 << bpp_shift) is the size of a pixel in bytes
+ *  t                         - type suffix for LDR/STR instructions
+ *  prefetch_distance         - prefetch that many pixels ahead in the
+ *                              source image
+ *  prefetch_braking_distance - stop prefetching when that many pixels
+ *                              remain before the end of the scanline
+ */
+
+.macro generate_nearest_scanline_func fname, bpp_shift, t,      \
+                                      prefetch_distance,        \
+                                      prefetch_braking_distance
+
+pixman_asm_function fname
+	W	.req	r0
+	DST	.req	r1
+	SRC	.req	r2
+	VX	.req	r3
+	UNIT_X	.req	ip
+	TMP1	.req	r4
+	TMP2	.req	r5
+	VXMASK	.req	r6
+	PF_OFFS	.req	r7
+
+	ldr	UNIT_X, [sp]
+	push	{r4, r5, r6, r7}
+	mvn	VXMASK, #((1 << bpp_shift) - 1)
+
+	/* define helper macro */
+	.macro	scale_2_pixels
+		ldr&t	TMP1, [SRC, TMP1]
+		and	TMP2, VXMASK, VX, lsr #(16 - bpp_shift)
+		add	VX, VX, UNIT_X
+		str&t	TMP1, [DST], #(1 << bpp_shift)
+
+		ldr&t	TMP2, [SRC, TMP2]
+		and	TMP1, VXMASK, VX, lsr #(16 - bpp_shift)
+		add	VX, VX, UNIT_X
+		str&t	TMP2, [DST], #(1 << bpp_shift)
+	.endm
+
+	/* now do the scaling */
+	and	TMP1, VXMASK, VX, lsr #(16 - bpp_shift)
+	add	VX, VX, UNIT_X
+	subs	W, W, #(8 + prefetch_braking_distance)
+	blt	2f
+	/* calculate prefetch offset */
+	mov	PF_OFFS, #prefetch_distance
+	mla	PF_OFFS, UNIT_X, PF_OFFS, VX
+1:	/* main loop, process 8 pixels per iteration with prefetch */
+	subs	W, W, #8
+	add	PF_OFFS, UNIT_X, lsl #3
+	scale_2_pixels
+	scale_2_pixels
+	scale_2_pixels
+	scale_2_pixels
+	pld	[SRC, PF_OFFS, lsr #(16 - bpp_shift)]
+	bge	1b
+2:
+	subs	W, W, #(4 - 8 - prefetch_braking_distance)
+	blt	2f
+1:	/* process the remaining pixels */
+	scale_2_pixels
+	scale_2_pixels
+	subs	W, W, #4
+	bge	1b
+2:
+	tst	W, #2
+	beq	2f
+	scale_2_pixels
+2:
+	tst	W, #1
+	ldrne&t	TMP1, [SRC, TMP1]
+	strne&t	TMP1, [DST]
+	/* cleanup helper macro */
+	.purgem	scale_2_pixels
+	.unreq	DST
+	.unreq	SRC
+	.unreq	W
+	.unreq	VX
+	.unreq	UNIT_X
+	.unreq	TMP1
+	.unreq	TMP2
+	.unreq	VXMASK
+	.unreq	PF_OFFS
+	/* return */
+	pop	{r4, r5, r6, r7}
+	bx	lr
+.endfunc
+.endm
+
+generate_nearest_scanline_func \
+    pixman_scaled_nearest_scanline_0565_0565_SRC_asm_armv6, 1, h, 80, 32
+
+generate_nearest_scanline_func \
+    pixman_scaled_nearest_scanline_8888_8888_SRC_asm_armv6, 2,  , 48, 32
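Throughout the scaler, VX is a 16.16 fixed-point source x coordinate advanced by UNIT_X per output pixel; the byte offset into the source row is (VX >> 16) << bpp_shift, which the code computes in one instruction as VXMASK & (VX >> (16 - bpp_shift)). A plain C rendering of the same loop, without the unrolling and prefetch (illustrative only):

    static void
    scale_scanline_nearest_0565 (int w, uint16_t *dst, const uint16_t *src,
                                 uint32_t vx, uint32_t unit_x)
    {
        while (w--)
        {
            *dst++ = src[vx >> 16]; /* nearest source pixel */
            vx += unit_x;           /* advance in 16.16 fixed point */
        }
    }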
--- a/gfx/cairo/libpixman/src/pixman-arm-simd.c
+++ b/gfx/cairo/libpixman/src/pixman-arm-simd.c
@@ -24,16 +24,17 @@
  *
  */
 #ifdef HAVE_CONFIG_H
 #include <config.h>
 #endif
 
 #include "pixman-private.h"
 #include "pixman-arm-common.h"
+#include "pixman-fast-path.h"
 
 #if 0 /* This code was moved to 'pixman-arm-simd-asm.S' */
 
 void
 pixman_composite_add_8_8_asm_armv6 (int32_t  width,
 				    int32_t  height,
 				    uint8_t *dst_line,
 				    int32_t  dst_stride,
@@ -375,22 +376,27 @@ pixman_composite_over_n_8_8888_asm_armv6
 
 #endif
 
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, add_8_8,
                                    uint8_t, 1, uint8_t, 1)
 PIXMAN_ARM_BIND_FAST_PATH_SRC_DST (armv6, over_8888_8888,
                                    uint32_t, 1, uint32_t, 1)
 
-PIXMAN_ARM_BIND_FAST_PATH_SRC_N_DST (armv6, over_8888_n_8888,
+PIXMAN_ARM_BIND_FAST_PATH_SRC_N_DST (SKIP_ZERO_MASK, armv6, over_8888_n_8888,
                                      uint32_t, 1, uint32_t, 1)
 
-PIXMAN_ARM_BIND_FAST_PATH_N_MASK_DST (armv6, over_n_8_8888,
+PIXMAN_ARM_BIND_FAST_PATH_N_MASK_DST (SKIP_ZERO_SRC, armv6, over_n_8_8888,
                                       uint8_t, 1, uint32_t, 1)
 
+PIXMAN_ARM_BIND_SCALED_NEAREST_SRC_DST (armv6, 0565_0565, SRC,
+                                        uint16_t, uint16_t)
+PIXMAN_ARM_BIND_SCALED_NEAREST_SRC_DST (armv6, 8888_8888, SRC,
+                                        uint32_t, uint32_t)
+
 static const pixman_fast_path_t arm_simd_fast_paths[] =
 {
     PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, a8r8g8b8, armv6_composite_over_8888_8888),
     PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, x8r8g8b8, armv6_composite_over_8888_8888),
     PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, a8b8g8r8, armv6_composite_over_8888_8888),
     PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, x8b8g8r8, armv6_composite_over_8888_8888),
     PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, solid, a8r8g8b8, armv6_composite_over_8888_n_8888),
     PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, solid, x8r8g8b8, armv6_composite_over_8888_n_8888),
@@ -399,19 +405,28 @@ static const pixman_fast_path_t arm_simd
 
     PIXMAN_STD_FAST_PATH (ADD, a8, null, a8, armv6_composite_add_8_8),
 
     PIXMAN_STD_FAST_PATH (OVER, solid, a8, a8r8g8b8, armv6_composite_over_n_8_8888),
     PIXMAN_STD_FAST_PATH (OVER, solid, a8, x8r8g8b8, armv6_composite_over_n_8_8888),
     PIXMAN_STD_FAST_PATH (OVER, solid, a8, a8b8g8r8, armv6_composite_over_n_8_8888),
     PIXMAN_STD_FAST_PATH (OVER, solid, a8, x8b8g8r8, armv6_composite_over_n_8_8888),
 
+    PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (SRC, r5g6b5, r5g6b5, armv6_0565_0565),
+    PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (SRC, b5g6r5, b5g6r5, armv6_0565_0565),
+
+    PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (SRC, a8r8g8b8, a8r8g8b8, armv6_8888_8888),
+    PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (SRC, a8r8g8b8, x8r8g8b8, armv6_8888_8888),
+    PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (SRC, x8r8g8b8, x8r8g8b8, armv6_8888_8888),
+    PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (SRC, a8b8g8r8, a8b8g8r8, armv6_8888_8888),
+    PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (SRC, a8b8g8r8, x8b8g8r8, armv6_8888_8888),
+    PIXMAN_ARM_SIMPLE_NEAREST_FAST_PATH (SRC, x8b8g8r8, x8b8g8r8, armv6_8888_8888),
+
     { PIXMAN_OP_NONE },
 };
 
 pixman_implementation_t *
-_pixman_implementation_create_arm_simd (void)
+_pixman_implementation_create_arm_simd (pixman_implementation_t *fallback)
 {
-    pixman_implementation_t *general = _pixman_implementation_create_fast_path ();
-    pixman_implementation_t *imp = _pixman_implementation_create (general, arm_simd_fast_paths);
+    pixman_implementation_t *imp = _pixman_implementation_create (fallback, arm_simd_fast_paths);
 
     return imp;
 }
--- a/gfx/cairo/libpixman/src/pixman-bits-image.c
+++ b/gfx/cairo/libpixman/src/pixman-bits-image.c
@@ -30,53 +30,51 @@
 #include <config.h>
 #endif
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include "pixman-private.h"
 #include "pixman-combine32.h"
 
-/* Store functions */
-void
-_pixman_image_store_scanline_32 (bits_image_t *  image,
-                                 int             x,
-                                 int             y,
-                                 int             width,
-                                 const uint32_t *buffer)
+/*
+ * By default, just evaluate the image at 32bpp and expand.  Individual image
+ * types can plug in a better scanline getter if they want to. For example
+ * we could produce smoother gradients by evaluating them at a higher color
+ * depth, but that's a project for the future.
+ */
+static void
+_pixman_image_get_scanline_generic_64 (pixman_image_t * image,
+                                       int              x,
+                                       int              y,
+                                       int              width,
+                                       uint32_t *       buffer,
+                                       const uint32_t * mask)
 {
-    image->store_scanline_32 (image, x, y, width, buffer);
+    uint32_t *mask8 = NULL;
 
-    if (image->common.alpha_map)
+    /* Contract the mask image, if one exists, so that the 32-bit fetch
+     * function can use it.
+     */
+    if (mask)
     {
-	x -= image->common.alpha_origin_x;
-	y -= image->common.alpha_origin_y;
+	mask8 = pixman_malloc_ab (width, sizeof(uint32_t));
+	if (!mask8)
+	    return;
 
-	image->common.alpha_map->store_scanline_32 (
-	    image->common.alpha_map, x, y, width, buffer);
+	pixman_contract (mask8, (uint64_t *)mask, width);
     }
-}
+
+    /* Fetch the source image into the first half of buffer. */
+    image->bits.get_scanline_32 (image, x, y, width, (uint32_t*)buffer, mask8);
 
-void
-_pixman_image_store_scanline_64 (bits_image_t *  image,
-                                 int             x,
-                                 int             y,
-                                 int             width,
-                                 const uint32_t *buffer)
-{
-    image->store_scanline_64 (image, x, y, width, buffer);
+    /* Expand from 32bpp to 64bpp in place. */
+    pixman_expand ((uint64_t *)buffer, buffer, PIXMAN_a8r8g8b8, width);
 
-    if (image->common.alpha_map)
-    {
-	x -= image->common.alpha_origin_x;
-	y -= image->common.alpha_origin_y;
-
-	image->common.alpha_map->store_scanline_64 (
-	    image->common.alpha_map, x, y, width, buffer);
-    }
+    free (mask8);
 }
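pixman_expand widens an a8r8g8b8 scanline to 16 bits per channel and pixman_contract narrows it back; expansion replicates each 8-bit value into both bytes, so 0xff widens to exactly 0xffff. A per-channel sketch of the conversions (the real functions operate on whole scanlines, and the exact rounding is an assumption here):

    static uint16_t
    expand_channel (uint8_t c8)
    {
        return (uint16_t) ((c8 << 8) | c8); /* 0x00 -> 0x0000, 0xff -> 0xffff */
    }

    static uint8_t
    contract_channel (uint16_t c16)
    {
        return (uint8_t) (c16 >> 8);        /* keep the high byte */
    }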
 
 /* Fetch functions */
 
 static force_inline uint32_t
 fetch_pixel_no_alpha (bits_image_t *image,
 		      int x, int y, pixman_bool_t check_bounds)
 {
@@ -292,16 +290,17 @@ bits_image_fetch_bilinear_no_repeat_8888
     pixman_fixed_t x_top, x_bottom, x;
     pixman_fixed_t ux_top, ux_bottom, ux;
     pixman_vector_t v;
     uint32_t top_mask, bottom_mask;
     uint32_t *top_row;
     uint32_t *bottom_row;
     uint32_t *end;
     uint32_t zero[2] = { 0, 0 };
+    uint32_t one = 1;
     int y, y1, y2;
     int disty;
     int mask_inc;
     int w;
 
     /* reference point is the center of the pixel */
     v.vector[0] = pixman_int_to_fixed (offset) + pixman_fixed_1 / 2;
     v.vector[1] = pixman_int_to_fixed (line) + pixman_fixed_1 / 2;
@@ -357,20 +356,18 @@ bits_image_fetch_bilinear_no_repeat_8888
     }
 
     /* Instead of checking whether the operation uses the mask in
      * each loop iteration, verify this only once and prepare the
      * variables to make the code smaller inside the loop.
      */
     if (!mask)
     {
-	uint32_t mask_bits = 1;
-
         mask_inc = 0;
-        mask = &mask_bits;
+        mask = &one;
     }
     else
     {
        /* If we have a mask, prepare the variables to check it */
         mask_inc = 1;
     }
 
     /* If both are zero, then the whole thing is zero */
@@ -902,16 +899,87 @@ bits_image_fetch_bilinear_affine (pixman
 	    tl, tr, bl, br, distx, disty);
 
     next:
 	x += ux;
 	y += uy;
     }
 }
 
+static force_inline void
+bits_image_fetch_nearest_affine (pixman_image_t * image,
+				 int              offset,
+				 int              line,
+				 int              width,
+				 uint32_t *       buffer,
+				 const uint32_t * mask,
+				 
+				 convert_pixel_t	convert_pixel,
+				 pixman_format_code_t	format,
+				 pixman_repeat_t	repeat_mode)
+{
+    pixman_fixed_t x, y;
+    pixman_fixed_t ux, uy;
+    pixman_vector_t v;
+    bits_image_t *bits = &image->bits;
+    int i;
+
+    /* reference point is the center of the pixel */
+    v.vector[0] = pixman_int_to_fixed (offset) + pixman_fixed_1 / 2;
+    v.vector[1] = pixman_int_to_fixed (line) + pixman_fixed_1 / 2;
+    v.vector[2] = pixman_fixed_1;
+
+    if (!pixman_transform_point_3d (image->common.transform, &v))
+	return;
+
+    ux = image->common.transform->matrix[0][0];
+    uy = image->common.transform->matrix[1][0];
+
+    x = v.vector[0];
+    y = v.vector[1];
+
+    for (i = 0; i < width; ++i)
+    {
+	int width, height, x0, y0;
+	const uint8_t *row;
+
+	if (mask && !mask[i])
+	    goto next;
+	
+	width = image->bits.width;
+	height = image->bits.height;
+	x0 = pixman_fixed_to_int (x - pixman_fixed_e);
+	y0 = pixman_fixed_to_int (y - pixman_fixed_e);
+
+	if (repeat_mode == PIXMAN_REPEAT_NONE &&
+	    (y0 < 0 || y0 >= height || x0 < 0 || x0 >= width))
+	{
+	    buffer[i] = 0;
+	}
+	else
+	{
+	    /* renamed from 'mask' to avoid shadowing the function argument */
+	    uint32_t opaque_mask = PIXMAN_FORMAT_A (format) ? 0 : 0xff000000;
+
+	    if (repeat_mode != PIXMAN_REPEAT_NONE)
+	    {
+		repeat (repeat_mode, width, &x0);
+		repeat (repeat_mode, height, &y0);
+	    }
+
+	    row = (uint8_t *)bits->bits + bits->rowstride * 4 * y0;
+
+	    buffer[i] = convert_pixel (row, x0) | opaque_mask;
+	}
+
+    next:
+	x += ux;
+	y += uy;
+    }
+}
+
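The pixman_fixed_e bias above implements the nearest-sampling rounding convention: pixman_fixed_e is the smallest 16.16 increment, so subtracting it before truncation makes coordinates that fall exactly on a pixel boundary round down to the pixel on the left. A worked example (pixman_fixed_1 == 0x10000, pixman_fixed_e == 1):

    /*
     *   x = 0x18000 (1.5):  (x - e) >> 16 == 1   -> pixel 1
     *   x = 0x10000 (1.0):  (x - e) >> 16 == 0   -> pixel 0, not 1
     *
     * consistent with sampling relative to pixel centers.
     */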
 static force_inline uint32_t
 convert_a8r8g8b8 (const uint8_t *row, int x)
 {
     return *(((uint32_t *)row) + x);
 }
 
 static force_inline uint32_t
 convert_x8r8g8b8 (const uint8_t *row, int x)
@@ -935,39 +1003,59 @@ convert_r5g6b5 (const uint8_t *row, int 
     static void								\
     bits_image_fetch_bilinear_affine_ ## name (pixman_image_t *image,	\
 					       int              offset,	\
 					       int              line,	\
 					       int              width,	\
 					       uint32_t *       buffer,	\
 					       const uint32_t * mask)	\
     {									\
-	bits_image_fetch_bilinear_affine (image, offset, line, width, buffer, mask, \
+	bits_image_fetch_bilinear_affine (image, offset, line,		\
+					  width, buffer, mask,		\
 					  convert_ ## format,		\
 					  PIXMAN_ ## format,		\
 					  repeat_mode);			\
-    }									\
-    extern int no_such_variable
+    }
+
+#define MAKE_NEAREST_FETCHER(name, format, repeat_mode)			\
+    static void								\
+    bits_image_fetch_nearest_affine_ ## name (pixman_image_t *image,	\
+					      int              offset,	\
+					      int              line,	\
+					      int              width,	\
+					      uint32_t *       buffer,	\
+					      const uint32_t * mask)	\
+    {									\
+	bits_image_fetch_nearest_affine (image, offset, line,		\
+					 width, buffer, mask,		\
+					 convert_ ## format,		\
+					 PIXMAN_ ## format,		\
+					 repeat_mode);			\
+    }
 
-MAKE_BILINEAR_FETCHER (pad_a8r8g8b8,     a8r8g8b8, PIXMAN_REPEAT_PAD);
-MAKE_BILINEAR_FETCHER (none_a8r8g8b8,    a8r8g8b8, PIXMAN_REPEAT_NONE);
-MAKE_BILINEAR_FETCHER (reflect_a8r8g8b8, a8r8g8b8, PIXMAN_REPEAT_REFLECT);
-MAKE_BILINEAR_FETCHER (normal_a8r8g8b8,  a8r8g8b8, PIXMAN_REPEAT_NORMAL);
-MAKE_BILINEAR_FETCHER (pad_x8r8g8b8,     x8r8g8b8, PIXMAN_REPEAT_PAD);
-MAKE_BILINEAR_FETCHER (none_x8r8g8b8,    x8r8g8b8, PIXMAN_REPEAT_NONE);
-MAKE_BILINEAR_FETCHER (reflect_x8r8g8b8, x8r8g8b8, PIXMAN_REPEAT_REFLECT);
-MAKE_BILINEAR_FETCHER (normal_x8r8g8b8,  x8r8g8b8, PIXMAN_REPEAT_NORMAL);
-MAKE_BILINEAR_FETCHER (pad_a8,           a8,       PIXMAN_REPEAT_PAD);
-MAKE_BILINEAR_FETCHER (none_a8,          a8,       PIXMAN_REPEAT_NONE);
-MAKE_BILINEAR_FETCHER (reflect_a8,	 a8,       PIXMAN_REPEAT_REFLECT);
-MAKE_BILINEAR_FETCHER (normal_a8,	 a8,       PIXMAN_REPEAT_NORMAL);
-MAKE_BILINEAR_FETCHER (pad_r5g6b5,       r5g6b5,   PIXMAN_REPEAT_PAD);
-MAKE_BILINEAR_FETCHER (none_r5g6b5,      r5g6b5,   PIXMAN_REPEAT_NONE);
-MAKE_BILINEAR_FETCHER (reflect_r5g6b5,   r5g6b5,   PIXMAN_REPEAT_REFLECT);
-MAKE_BILINEAR_FETCHER (normal_r5g6b5,    r5g6b5,   PIXMAN_REPEAT_NORMAL);
+#define MAKE_FETCHERS(name, format, repeat_mode)			\
+    MAKE_NEAREST_FETCHER (name, format, repeat_mode)			\
+    MAKE_BILINEAR_FETCHER (name, format, repeat_mode)
+
+MAKE_FETCHERS (pad_a8r8g8b8,     a8r8g8b8, PIXMAN_REPEAT_PAD)
+MAKE_FETCHERS (none_a8r8g8b8,    a8r8g8b8, PIXMAN_REPEAT_NONE)
+MAKE_FETCHERS (reflect_a8r8g8b8, a8r8g8b8, PIXMAN_REPEAT_REFLECT)
+MAKE_FETCHERS (normal_a8r8g8b8,  a8r8g8b8, PIXMAN_REPEAT_NORMAL)
+MAKE_FETCHERS (pad_x8r8g8b8,     x8r8g8b8, PIXMAN_REPEAT_PAD)
+MAKE_FETCHERS (none_x8r8g8b8,    x8r8g8b8, PIXMAN_REPEAT_NONE)
+MAKE_FETCHERS (reflect_x8r8g8b8, x8r8g8b8, PIXMAN_REPEAT_REFLECT)
+MAKE_FETCHERS (normal_x8r8g8b8,  x8r8g8b8, PIXMAN_REPEAT_NORMAL)
+MAKE_FETCHERS (pad_a8,           a8,       PIXMAN_REPEAT_PAD)
+MAKE_FETCHERS (none_a8,          a8,       PIXMAN_REPEAT_NONE)
+MAKE_FETCHERS (reflect_a8,	 a8,       PIXMAN_REPEAT_REFLECT)
+MAKE_FETCHERS (normal_a8,	 a8,       PIXMAN_REPEAT_NORMAL)
+MAKE_FETCHERS (pad_r5g6b5,       r5g6b5,   PIXMAN_REPEAT_PAD)
+MAKE_FETCHERS (none_r5g6b5,      r5g6b5,   PIXMAN_REPEAT_NONE)
+MAKE_FETCHERS (reflect_r5g6b5,   r5g6b5,   PIXMAN_REPEAT_REFLECT)
+MAKE_FETCHERS (normal_r5g6b5,    r5g6b5,   PIXMAN_REPEAT_NORMAL)
 
 static void
 bits_image_fetch_solid_32 (pixman_image_t * image,
                            int              x,
                            int              y,
                            int              width,
                            uint32_t *       buffer,
                            const uint32_t * mask)
@@ -1171,39 +1259,57 @@ static const fetcher_info_t fetcher_info
 
 #define GENERAL_BILINEAR_FLAGS						\
     (FAST_PATH_NO_ALPHA_MAP		|				\
      FAST_PATH_NO_ACCESSORS		|				\
      FAST_PATH_HAS_TRANSFORM		|				\
      FAST_PATH_AFFINE_TRANSFORM		|				\
      FAST_PATH_BILINEAR_FILTER)
 
+#define GENERAL_NEAREST_FLAGS						\
+    (FAST_PATH_NO_ALPHA_MAP		|				\
+     FAST_PATH_NO_ACCESSORS		|				\
+     FAST_PATH_HAS_TRANSFORM		|				\
+     FAST_PATH_AFFINE_TRANSFORM		|				\
+     FAST_PATH_NEAREST_FILTER)
+
 #define BILINEAR_AFFINE_FAST_PATH(name, format, repeat)			\
     { PIXMAN_ ## format,						\
       GENERAL_BILINEAR_FLAGS | FAST_PATH_ ## repeat ## _REPEAT,		\
       bits_image_fetch_bilinear_affine_ ## name,			\
       _pixman_image_get_scanline_generic_64				\
     },
 
-    BILINEAR_AFFINE_FAST_PATH (pad_a8r8g8b8, a8r8g8b8, PAD)
-    BILINEAR_AFFINE_FAST_PATH (none_a8r8g8b8, a8r8g8b8, NONE)
-    BILINEAR_AFFINE_FAST_PATH (reflect_a8r8g8b8, a8r8g8b8, REFLECT)
-    BILINEAR_AFFINE_FAST_PATH (normal_a8r8g8b8, a8r8g8b8, NORMAL)
-    BILINEAR_AFFINE_FAST_PATH (pad_x8r8g8b8, x8r8g8b8, PAD)
-    BILINEAR_AFFINE_FAST_PATH (none_x8r8g8b8, x8r8g8b8, NONE)
-    BILINEAR_AFFINE_FAST_PATH (reflect_x8r8g8b8, x8r8g8b8, REFLECT)
-    BILINEAR_AFFINE_FAST_PATH (normal_x8r8g8b8, x8r8g8b8, NORMAL)
-    BILINEAR_AFFINE_FAST_PATH (pad_a8, a8, PAD)
-    BILINEAR_AFFINE_FAST_PATH (none_a8, a8, NONE)
-    BILINEAR_AFFINE_FAST_PATH (reflect_a8, a8, REFLECT)
-    BILINEAR_AFFINE_FAST_PATH (normal_a8, a8, NORMAL)
-    BILINEAR_AFFINE_FAST_PATH (pad_r5g6b5, r5g6b5, PAD)
-    BILINEAR_AFFINE_FAST_PATH (none_r5g6b5, r5g6b5, NONE)
-    BILINEAR_AFFINE_FAST_PATH (reflect_r5g6b5, r5g6b5, REFLECT)
-    BILINEAR_AFFINE_FAST_PATH (normal_r5g6b5, r5g6b5, NORMAL)
+#define NEAREST_AFFINE_FAST_PATH(name, format, repeat)			\
+    { PIXMAN_ ## format,						\
+      GENERAL_NEAREST_FLAGS | FAST_PATH_ ## repeat ## _REPEAT,		\
+      bits_image_fetch_nearest_affine_ ## name,			\
+      _pixman_image_get_scanline_generic_64				\
+    },
+
+#define AFFINE_FAST_PATHS(name, format, repeat)				\
+    BILINEAR_AFFINE_FAST_PATH(name, format, repeat)			\
+    NEAREST_AFFINE_FAST_PATH(name, format, repeat)
+    
+    AFFINE_FAST_PATHS (pad_a8r8g8b8, a8r8g8b8, PAD)
+    AFFINE_FAST_PATHS (none_a8r8g8b8, a8r8g8b8, NONE)
+    AFFINE_FAST_PATHS (reflect_a8r8g8b8, a8r8g8b8, REFLECT)
+    AFFINE_FAST_PATHS (normal_a8r8g8b8, a8r8g8b8, NORMAL)
+    AFFINE_FAST_PATHS (pad_x8r8g8b8, x8r8g8b8, PAD)
+    AFFINE_FAST_PATHS (none_x8r8g8b8, x8r8g8b8, NONE)
+    AFFINE_FAST_PATHS (reflect_x8r8g8b8, x8r8g8b8, REFLECT)
+    AFFINE_FAST_PATHS (normal_x8r8g8b8, x8r8g8b8, NORMAL)
+    AFFINE_FAST_PATHS (pad_a8, a8, PAD)
+    AFFINE_FAST_PATHS (none_a8, a8, NONE)
+    AFFINE_FAST_PATHS (reflect_a8, a8, REFLECT)
+    AFFINE_FAST_PATHS (normal_a8, a8, NORMAL)
+    AFFINE_FAST_PATHS (pad_r5g6b5, r5g6b5, PAD)
+    AFFINE_FAST_PATHS (none_r5g6b5, r5g6b5, NONE)
+    AFFINE_FAST_PATHS (reflect_r5g6b5, r5g6b5, REFLECT)
+    AFFINE_FAST_PATHS (normal_r5g6b5, r5g6b5, NORMAL)
 
     /* Affine, no alpha */
     { PIXMAN_any,
       (FAST_PATH_NO_ALPHA_MAP | FAST_PATH_HAS_TRANSFORM | FAST_PATH_AFFINE_TRANSFORM),
       bits_image_fetch_affine_no_alpha,
       _pixman_image_get_scanline_generic_64
     },
 
@@ -1223,26 +1329,190 @@ bits_image_property_changed (pixman_imag
     _pixman_bits_image_setup_accessors (&image->bits);
 
     info = fetcher_info;
     while (info->format != PIXMAN_null)
     {
 	if ((info->format == format || info->format == PIXMAN_any)	&&
 	    (info->flags & flags) == info->flags)
 	{
-	    image->common.get_scanline_32 = info->fetch_32;
-	    image->common.get_scanline_64 = info->fetch_64;
+	    image->bits.get_scanline_32 = info->fetch_32;
+	    image->bits.get_scanline_64 = info->fetch_64;
 	    break;
 	}
 
 	info++;
     }
 }
 
 static uint32_t *
+src_get_scanline_narrow (pixman_iter_t *iter, const uint32_t *mask)
+{
+    iter->image->bits.get_scanline_32 (
+	iter->image, iter->x, iter->y++, iter->width, iter->buffer, mask);
+
+    return iter->buffer;
+}
+
+static uint32_t *
+src_get_scanline_wide (pixman_iter_t *iter, const uint32_t *mask)
+{
+    iter->image->bits.get_scanline_64 (
+	iter->image, iter->x, iter->y++, iter->width, iter->buffer, mask);
+
+    return iter->buffer;
+}
+
+void
+_pixman_bits_image_src_iter_init (pixman_image_t *image, pixman_iter_t *iter)
+{
+    if (iter->flags & ITER_NARROW)
+	iter->get_scanline = src_get_scanline_narrow;
+    else
+	iter->get_scanline = src_get_scanline_wide;
+}
+
+static uint32_t *
+dest_get_scanline_narrow (pixman_iter_t *iter, const uint32_t *mask)
+{
+    pixman_image_t *image  = iter->image;
+    int             x      = iter->x;
+    int             y      = iter->y;
+    int             width  = iter->width;
+    uint32_t *	    buffer = iter->buffer;
+
+    image->bits.fetch_scanline_32 (image, x, y, width, buffer, mask);
+    if (image->common.alpha_map)
+    {
+	x -= image->common.alpha_origin_x;
+	y -= image->common.alpha_origin_y;
+
+	image->common.alpha_map->fetch_scanline_32 (
+	    (pixman_image_t *)image->common.alpha_map,
+	    x, y, width, buffer, mask);
+    }
+
+    return iter->buffer;
+}
+
+static uint32_t *
+dest_get_scanline_wide (pixman_iter_t *iter, const uint32_t *mask)
+{
+    bits_image_t *  image  = &iter->image->bits;
+    int             x      = iter->x;
+    int             y      = iter->y;
+    int             width  = iter->width;
+    uint32_t *	    buffer = iter->buffer;
+
+    image->fetch_scanline_64 (
+	(pixman_image_t *)image, x, y, width, buffer, mask);
+    if (image->common.alpha_map)
+    {
+	x -= image->common.alpha_origin_x;
+	y -= image->common.alpha_origin_y;
+
+	image->common.alpha_map->fetch_scanline_64 (
+	    (pixman_image_t *)image->common.alpha_map, x, y, width, buffer, mask);
+    }
+
+    return iter->buffer;
+}
+
+static void
+dest_write_back_narrow (pixman_iter_t *iter)
+{
+    bits_image_t *  image  = &iter->image->bits;
+    int             x      = iter->x;
+    int             y      = iter->y;
+    int             width  = iter->width;
+    const uint32_t *buffer = iter->buffer;
+
+    image->store_scanline_32 (image, x, y, width, buffer);
+
+    if (image->common.alpha_map)
+    {
+	x -= image->common.alpha_origin_x;
+	y -= image->common.alpha_origin_y;
+
+	image->common.alpha_map->store_scanline_32 (
+	    image->common.alpha_map, x, y, width, buffer);
+    }
+
+    iter->y++;
+}
+
+static void
+dest_write_back_wide (pixman_iter_t *iter)
+{
+    bits_image_t *  image  = &iter->image->bits;
+    int             x      = iter->x;
+    int             y      = iter->y;
+    int             width  = iter->width;
+    const uint32_t *buffer = iter->buffer;
+
+    image->store_scanline_64 (image, x, y, width, buffer);
+
+    if (image->common.alpha_map)
+    {
+	x -= image->common.alpha_origin_x;
+	y -= image->common.alpha_origin_y;
+
+	image->common.alpha_map->store_scanline_64 (
+	    image->common.alpha_map, x, y, width, buffer);
+    }
+
+    iter->y++;
+}
+
+static void
+dest_write_back_direct (pixman_iter_t *iter)
+{
+    iter->buffer += iter->image->bits.rowstride;
+}
+
+void
+_pixman_bits_image_dest_iter_init (pixman_image_t *image, pixman_iter_t *iter)
+{
+    if (iter->flags & ITER_NARROW)
+    {
+	if (((image->common.flags &
+	      (FAST_PATH_NO_ALPHA_MAP | FAST_PATH_NO_ACCESSORS)) ==
+	     (FAST_PATH_NO_ALPHA_MAP | FAST_PATH_NO_ACCESSORS)) &&
+	    (image->bits.format == PIXMAN_a8r8g8b8	||
+	     (image->bits.format == PIXMAN_x8r8g8b8	&&
+	      (iter->flags & ITER_LOCALIZED_ALPHA))))
+	{
+	    iter->buffer = image->bits.bits + iter->y * image->bits.rowstride + iter->x;
+
+	    iter->get_scanline = _pixman_iter_get_scanline_noop;
+	    iter->write_back = dest_write_back_direct;
+	}
+	else
+	{
+	    if ((iter->flags & (ITER_IGNORE_RGB | ITER_IGNORE_ALPHA)) ==
+		(ITER_IGNORE_RGB | ITER_IGNORE_ALPHA))
+	    {
+		iter->get_scanline = _pixman_iter_get_scanline_noop;
+	    }
+	    else
+	    {
+		iter->get_scanline = dest_get_scanline_narrow;
+	    }
+
+	    iter->write_back = dest_write_back_narrow;
+	}
+    }
+    else
+    {
+	iter->get_scanline = dest_get_scanline_wide;
+	iter->write_back = dest_write_back_wide;
+    }
+}
+
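Together these form the new destination iterator protocol: the compositing core fetches (or, for the direct and ignore-everything cases, skips fetching) one destination line into iter->buffer, combines into it, then calls write_back, which either stores the line or, in the direct case, simply advances the buffer pointer by one row. The driving loop is roughly (simplified sketch, not the exact core):

    for (i = 0; i < height; ++i)
    {
        uint32_t *s = src_iter.get_scanline (&src_iter, NULL);
        uint32_t *m = mask_iter.get_scanline (&mask_iter, NULL);
        uint32_t *d = dest_iter.get_scanline (&dest_iter, m);

        combine (imp, op, d, s, m, width);  /* composite into the buffer */

        dest_iter.write_back (&dest_iter);  /* store (or just advance)   */
    }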
+static uint32_t *
 create_bits (pixman_format_code_t format,
              int                  width,
              int                  height,
              int *                rowstride_bytes)
 {
     int stride;
     int buf_size;
     int bpp;
--- a/gfx/cairo/libpixman/src/pixman-combine.c.template
+++ b/gfx/cairo/libpixman/src/pixman-combine.c.template
@@ -128,16 +128,27 @@ combine_clear (pixman_implementation_t *
                const comp4_t *          src,
                const comp4_t *          mask,
                int                      width)
 {
     memset (dest, 0, width * sizeof(comp4_t));
 }
 
 static void
+combine_dst (pixman_implementation_t *imp,
+	     pixman_op_t	      op,
+	     comp4_t *		      dest,
+	     const comp4_t *	      src,
+	     const comp4_t *          mask,
+	     int		      width)
+{
+    return;
+}
+
+static void
 combine_src_u (pixman_implementation_t *imp,
                pixman_op_t              op,
                comp4_t *                dest,
                const comp4_t *          src,
                const comp4_t *          mask,
                int                      width)
 {
     int i;
@@ -943,25 +954,43 @@ set_lum (comp4_t dest[3], comp4_t src[3]
 
     /* clip_color */
     l = LUM (tmp);
     min = CH_MIN (tmp);
     max = CH_MAX (tmp);
 
     if (min < 0)
     {
-	tmp[0] = l + (tmp[0] - l) * l / (l - min);
-	tmp[1] = l + (tmp[1] - l) * l / (l - min);
-	tmp[2] = l + (tmp[2] - l) * l / (l - min);
+	if (l - min == 0.0)
+	{
+	    tmp[0] = 0;
+	    tmp[1] = 0;
+	    tmp[2] = 0;
+	}
+	else
+	{
+	    tmp[0] = l + (tmp[0] - l) * l / (l - min);
+	    tmp[1] = l + (tmp[1] - l) * l / (l - min);
+	    tmp[2] = l + (tmp[2] - l) * l / (l - min);
+	}
     }
     if (max > a)
     {
-	tmp[0] = l + (tmp[0] - l) * (a - l) / (max - l);
-	tmp[1] = l + (tmp[1] - l) * (a - l) / (max - l);
-	tmp[2] = l + (tmp[2] - l) * (a - l) / (max - l);
+	if (max - l == 0.0)
+	{
+	    tmp[0] = a;
+	    tmp[1] = a;
+	    tmp[2] = a;
+	}
+	else
+	{
+	    tmp[0] = l + (tmp[0] - l) * (a - l) / (max - l);
+	    tmp[1] = l + (tmp[1] - l) * (a - l) / (max - l);
+	    tmp[2] = l + (tmp[2] - l) * (a - l) / (max - l);
+	}
     }
 
     dest[0] = tmp[0] * MASK + 0.5;
     dest[1] = tmp[1] * MASK + 0.5;
     dest[2] = tmp[2] * MASK + 0.5;
 }
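The new guards matter because min == l (or max == l) occurs whenever every channel equals the luminosity, i.e. a uniform pixel, and the old expressions then divided by zero. A worked instance:

    /*
     * For tmp = (-0.1, -0.1, -0.1):  l = LUM (tmp) = -0.1 = min,
     * so the old  l + (tmp_c - l) * l / (l - min)  was l + 0/0.
     * Because l is a weighted average of the channels, min == l
     * forces all channels equal, and the limit of the expression
     * is 0 (resp. a in the max > a branch) -- exactly what the
     * new branches assign.
     */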
 
 static void
@@ -1291,27 +1320,23 @@ combine_disjoint_over_u (pixman_implemen
 {
     int i;
 
     for (i = 0; i < width; ++i)
     {
 	comp4_t s = combine_mask (src, mask, i);
 	comp2_t a = s >> A_SHIFT;
 
-	if (a != 0x00)
+	if (s != 0x00)
 	{
-	    if (a != MASK)
-	    {
-		comp4_t d = *(dest + i);
-		a = combine_disjoint_out_part (d >> A_SHIFT, a);
-		UNcx4_MUL_UNc_ADD_UNcx4 (d, a, s);
-		s = d;
-	    }
+	    comp4_t d = *(dest + i);
+	    a = combine_disjoint_out_part (d >> A_SHIFT, a);
+	    UNcx4_MUL_UNc_ADD_UNcx4 (d, a, s);
 
-	    *(dest + i) = s;
+	    *(dest + i) = d;
 	}
     }
 }
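Both removed shortcuts were wrong for the disjoint operator. With F = combine_disjoint_out_part (da, sa), roughly min (1, (1 - da) / sa), the function computes dest = dest * F + src, and neither extreme value of sa reduces to the old special cases:

    /*
     *   sa == MASK, da < MASK:  F ~= 1 - da, so the destination still
     *                           contributes; the old "store s" shortcut
     *                           dropped that term.
     *   sa == 0,    s != 0:     F clamps to MASK, so the pixel degenerates
     *                           to a saturating ADD; the old "a != 0" test
     *                           skipped such superluminous pixels entirely.
     */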
 
 static void
 combine_disjoint_in_u (pixman_implementation_t *imp,
                        pixman_op_t              op,
                        comp4_t *                dest,
@@ -2309,47 +2334,47 @@ combine_conjoint_xor_ca (pixman_implemen
 }
 
 void
 _pixman_setup_combiner_functions_width (pixman_implementation_t *imp)
 {
     /* Unified alpha */
     imp->combine_width[PIXMAN_OP_CLEAR] = combine_clear;
     imp->combine_width[PIXMAN_OP_SRC] = combine_src_u;
-    /* dest */
+    imp->combine_width[PIXMAN_OP_DST] = combine_dst;
     imp->combine_width[PIXMAN_OP_OVER] = combine_over_u;
     imp->combine_width[PIXMAN_OP_OVER_REVERSE] = combine_over_reverse_u;
     imp->combine_width[PIXMAN_OP_IN] = combine_in_u;
     imp->combine_width[PIXMAN_OP_IN_REVERSE] = combine_in_reverse_u;
     imp->combine_width[PIXMAN_OP_OUT] = combine_out_u;
     imp->combine_width[PIXMAN_OP_OUT_REVERSE] = combine_out_reverse_u;
     imp->combine_width[PIXMAN_OP_ATOP] = combine_atop_u;
     imp->combine_width[PIXMAN_OP_ATOP_REVERSE] = combine_atop_reverse_u;
     imp->combine_width[PIXMAN_OP_XOR] = combine_xor_u;
     imp->combine_width[PIXMAN_OP_ADD] = combine_add_u;
     imp->combine_width[PIXMAN_OP_SATURATE] = combine_saturate_u;
 
     /* Disjoint, unified */
     imp->combine_width[PIXMAN_OP_DISJOINT_CLEAR] = combine_clear;
     imp->combine_width[PIXMAN_OP_DISJOINT_SRC] = combine_src_u;
-    /* dest */
+    imp->combine_width[PIXMAN_OP_DISJOINT_DST] = combine_dst;
     imp->combine_width[PIXMAN_OP_DISJOINT_OVER] = combine_disjoint_over_u;
     imp->combine_width[PIXMAN_OP_DISJOINT_OVER_REVERSE] = combine_saturate_u;
     imp->combine_width[PIXMAN_OP_DISJOINT_IN] = combine_disjoint_in_u;
     imp->combine_width[PIXMAN_OP_DISJOINT_IN_REVERSE] = combine_disjoint_in_reverse_u;
     imp->combine_width[PIXMAN_OP_DISJOINT_OUT] = combine_disjoint_out_u;
     imp->combine_width[PIXMAN_OP_DISJOINT_OUT_REVERSE] = combine_disjoint_out_reverse_u;
     imp->combine_width[PIXMAN_OP_DISJOINT_ATOP] = combine_disjoint_atop_u;
     imp->combine_width[PIXMAN_OP_DISJOINT_ATOP_REVERSE] = combine_disjoint_atop_reverse_u;
     imp->combine_width[PIXMAN_OP_DISJOINT_XOR] = combine_disjoint_xor_u;
 
     /* Conjoint, unified */
     imp->combine_width[PIXMAN_OP_CONJOINT_CLEAR] = combine_clear;
     imp->combine_width[PIXMAN_OP_CONJOINT_SRC] = combine_src_u;
-    /* dest */
+    imp->combine_width[PIXMAN_OP_CONJOINT_DST] = combine_dst;
     imp->combine_width[PIXMAN_OP_CONJOINT_OVER] = combine_conjoint_over_u;
     imp->combine_width[PIXMAN_OP_CONJOINT_OVER_REVERSE] = combine_conjoint_over_reverse_u;
     imp->combine_width[PIXMAN_OP_CONJOINT_IN] = combine_conjoint_in_u;
     imp->combine_width[PIXMAN_OP_CONJOINT_IN_REVERSE] = combine_conjoint_in_reverse_u;
     imp->combine_width[PIXMAN_OP_CONJOINT_OUT] = combine_conjoint_out_u;
     imp->combine_width[PIXMAN_OP_CONJOINT_OUT_REVERSE] = combine_conjoint_out_reverse_u;
     imp->combine_width[PIXMAN_OP_CONJOINT_ATOP] = combine_conjoint_atop_u;
     imp->combine_width[PIXMAN_OP_CONJOINT_ATOP_REVERSE] = combine_conjoint_atop_reverse_u;
@@ -2385,31 +2410,31 @@ void
     imp->combine_width_ca[PIXMAN_OP_ATOP_REVERSE] = combine_atop_reverse_ca;
     imp->combine_width_ca[PIXMAN_OP_XOR] = combine_xor_ca;
     imp->combine_width_ca[PIXMAN_OP_ADD] = combine_add_ca;
     imp->combine_width_ca[PIXMAN_OP_SATURATE] = combine_saturate_ca;
 
     /* Disjoint CA */
     imp->combine_width_ca[PIXMAN_OP_DISJOINT_CLEAR] = combine_clear_ca;
     imp->combine_width_ca[PIXMAN_OP_DISJOINT_SRC] = combine_src_ca;
-    /* dest */
+    imp->combine_width_ca[PIXMAN_OP_DISJOINT_DST] = combine_dst;
     imp->combine_width_ca[PIXMAN_OP_DISJOINT_OVER] = combine_disjoint_over_ca;
     imp->combine_width_ca[PIXMAN_OP_DISJOINT_OVER_REVERSE] = combine_saturate_ca;
     imp->combine_width_ca[PIXMAN_OP_DISJOINT_IN] = combine_disjoint_in_ca;
     imp->combine_width_ca[PIXMAN_OP_DISJOINT_IN_REVERSE] = combine_disjoint_in_reverse_ca;
     imp->combine_width_ca[PIXMAN_OP_DISJOINT_OUT] = combine_disjoint_out_ca;
     imp->combine_width_ca[PIXMAN_OP_DISJOINT_OUT_REVERSE] = combine_disjoint_out_reverse_ca;
     imp->combine_width_ca[PIXMAN_OP_DISJOINT_ATOP] = combine_disjoint_atop_ca;
     imp->combine_width_ca[PIXMAN_OP_DISJOINT_ATOP_REVERSE] = combine_disjoint_atop_reverse_ca;
     imp->combine_width_ca[PIXMAN_OP_DISJOINT_XOR] = combine_disjoint_xor_ca;
 
     /* Conjoint CA */
     imp->combine_width_ca[PIXMAN_OP_CONJOINT_CLEAR] = combine_clear_ca;
     imp->combine_width_ca[PIXMAN_OP_CONJOINT_SRC] = combine_src_ca;
-    /* dest */
+    imp->combine_width_ca[PIXMAN_OP_CONJOINT_DST] = combine_dst;
     imp->combine_width_ca[PIXMAN_OP_CONJOINT_OVER] = combine_conjoint_over_ca;
     imp->combine_width_ca[PIXMAN_OP_CONJOINT_OVER_REVERSE] = combine_conjoint_over_reverse_ca;
     imp->combine_width_ca[PIXMAN_OP_CONJOINT_IN] = combine_conjoint_in_ca;
     imp->combine_width_ca[PIXMAN_OP_CONJOINT_IN_REVERSE] = combine_conjoint_in_reverse_ca;
     imp->combine_width_ca[PIXMAN_OP_CONJOINT_OUT] = combine_conjoint_out_ca;
     imp->combine_width_ca[PIXMAN_OP_CONJOINT_OUT_REVERSE] = combine_conjoint_out_reverse_ca;
     imp->combine_width_ca[PIXMAN_OP_CONJOINT_ATOP] = combine_conjoint_atop_ca;
     imp->combine_width_ca[PIXMAN_OP_CONJOINT_ATOP_REVERSE] = combine_conjoint_atop_reverse_ca;
@@ -2422,15 +2447,15 @@ void
     imp->combine_width_ca[PIXMAN_OP_LIGHTEN] = combine_lighten_ca;
     imp->combine_width_ca[PIXMAN_OP_COLOR_DODGE] = combine_color_dodge_ca;
     imp->combine_width_ca[PIXMAN_OP_COLOR_BURN] = combine_color_burn_ca;
     imp->combine_width_ca[PIXMAN_OP_HARD_LIGHT] = combine_hard_light_ca;
     imp->combine_width_ca[PIXMAN_OP_SOFT_LIGHT] = combine_soft_light_ca;
     imp->combine_width_ca[PIXMAN_OP_DIFFERENCE] = combine_difference_ca;
     imp->combine_width_ca[PIXMAN_OP_EXCLUSION] = combine_exclusion_ca;
 
-    /* It is not clear that these make sense, so leave them out for now */
-    imp->combine_width_ca[PIXMAN_OP_HSL_HUE] = NULL;
-    imp->combine_width_ca[PIXMAN_OP_HSL_SATURATION] = NULL;
-    imp->combine_width_ca[PIXMAN_OP_HSL_COLOR] = NULL;
-    imp->combine_width_ca[PIXMAN_OP_HSL_LUMINOSITY] = NULL;
+    /* It is not clear that these make sense, so make them no-ops for now */
+    imp->combine_width_ca[PIXMAN_OP_HSL_HUE] = combine_dst;
+    imp->combine_width_ca[PIXMAN_OP_HSL_SATURATION] = combine_dst;
+    imp->combine_width_ca[PIXMAN_OP_HSL_COLOR] = combine_dst;
+    imp->combine_width_ca[PIXMAN_OP_HSL_LUMINOSITY] = combine_dst;
 }
 
--- a/gfx/cairo/libpixman/src/pixman-combine.h.template
+++ b/gfx/cairo/libpixman/src/pixman-combine.h.template
@@ -26,17 +26,17 @@
 
 #define MUL_UNc(a, b, t)						\
     ((t) = (a) * (b) + ONE_HALF, ((((t) >> G_SHIFT ) + (t) ) >> G_SHIFT ))
 
 #define DIV_UNc(a, b)							\
     (((comp2_t) (a) * MASK) / (b))
 
 #define ADD_UNc(x, y, t)				     \
-    ((t) = x + y,					     \
+    ((t) = (x) + (y),					     \
      (comp4_t) (comp1_t) ((t) | (0 - ((t) >> G_SHIFT))))
 
 #define DIV_ONE_UNc(x)							\
     (((x) + ONE_HALF + (((x) + ONE_HALF) >> G_SHIFT)) >> G_SHIFT)
 
 /*
  * The methods below use some tricks to be able to do two color
  * components at the same time.
@@ -79,148 +79,148 @@
     } while (0)
 
 /*
  * x_c = (x_c * a) / 255
  */
 #define UNcx4_MUL_UNc(x, a)						\
     do									\
     {									\
-	comp4_t r1, r2, t;						\
+	comp4_t r1__, r2__, t__;					\
 									\
-	r1 = (x);							\
-	UNc_rb_MUL_UNc (r1, a, t);					\
+	r1__ = (x);							\
+	UNc_rb_MUL_UNc (r1__, (a), t__);				\
 									\
-	r2 = (x) >> G_SHIFT;						\
-	UNc_rb_MUL_UNc (r2, a, t);					\
+	r2__ = (x) >> G_SHIFT;						\
+	UNc_rb_MUL_UNc (r2__, (a), t__);				\
 									\
-	x = r1 | (r2 << G_SHIFT);					\
+	(x) = r1__ | (r2__ << G_SHIFT);					\
     } while (0)
 
 /*
  * x_c = (x_c * a) / 255 + y_c
  */
 #define UNcx4_MUL_UNc_ADD_UNcx4(x, a, y)				\
     do									\
     {									\
-	comp4_t r1, r2, r3, t;						\
+	comp4_t r1__, r2__, r3__, t__;					\
 									\
-	r1 = (x);							\
-	r2 = (y) & RB_MASK;						\
-	UNc_rb_MUL_UNc (r1, a, t);					\
-	UNc_rb_ADD_UNc_rb (r1, r2, t);					\
+	r1__ = (x);							\
+	r2__ = (y) & RB_MASK;						\
+	UNc_rb_MUL_UNc (r1__, (a), t__);				\
+	UNc_rb_ADD_UNc_rb (r1__, r2__, t__);				\
 									\
-	r2 = (x) >> G_SHIFT;						\
-	r3 = ((y) >> G_SHIFT) & RB_MASK;				\
-	UNc_rb_MUL_UNc (r2, a, t);					\
-	UNc_rb_ADD_UNc_rb (r2, r3, t);					\
+	r2__ = (x) >> G_SHIFT;						\
+	r3__ = ((y) >> G_SHIFT) & RB_MASK;				\
+	UNc_rb_MUL_UNc (r2__, (a), t__);				\
+	UNc_rb_ADD_UNc_rb (r2__, r3__, t__);				\
 									\
-	x = r1 | (r2 << G_SHIFT);					\
+	(x) = r1__ | (r2__ << G_SHIFT);					\
     } while (0)
 
 /*
  * x_c = (x_c * a + y_c * b) / 255
  */
 #define UNcx4_MUL_UNc_ADD_UNcx4_MUL_UNc(x, a, y, b)			\
     do									\
     {									\
-	comp4_t r1, r2, r3, t;						\
+	comp4_t r1__, r2__, r3__, t__;					\
 									\
-	r1 = x;								\
-	r2 = y;								\
-	UNc_rb_MUL_UNc (r1, a, t);					\
-	UNc_rb_MUL_UNc (r2, b, t);					\
-	UNc_rb_ADD_UNc_rb (r1, r2, t);					\
+	r1__ = (x);							\
+	r2__ = (y);							\
+	UNc_rb_MUL_UNc (r1__, (a), t__);				\
+	UNc_rb_MUL_UNc (r2__, (b), t__);				\
+	UNc_rb_ADD_UNc_rb (r1__, r2__, t__);				\
 									\
-	r2 = (x >> G_SHIFT);						\
-	r3 = (y >> G_SHIFT);						\
-	UNc_rb_MUL_UNc (r2, a, t);					\
-	UNc_rb_MUL_UNc (r3, b, t);					\
-	UNc_rb_ADD_UNc_rb (r2, r3, t);					\
+	r2__ = ((x) >> G_SHIFT);					\
+	r3__ = ((y) >> G_SHIFT);					\
+	UNc_rb_MUL_UNc (r2__, (a), t__);				\
+	UNc_rb_MUL_UNc (r3__, (b), t__);				\
+	UNc_rb_ADD_UNc_rb (r2__, r3__, t__);				\
 									\
-	x = r1 | (r2 << G_SHIFT);					\
+	(x) = r1__ | (r2__ << G_SHIFT);					\
     } while (0)
 
 /*
  * x_c = (x_c * a_c) / 255
  */
 #define UNcx4_MUL_UNcx4(x, a)						\
     do									\
     {									\
-	comp4_t r1, r2, r3, t;						\
+	comp4_t r1__, r2__, r3__, t__;					\
 									\
-	r1 = x;								\
-	r2 = a;								\
-	UNc_rb_MUL_UNc_rb (r1, r2, t);					\
+	r1__ = (x);							\
+	r2__ = (a);							\
+	UNc_rb_MUL_UNc_rb (r1__, r2__, t__);				\
 									\
-	r2 = x >> G_SHIFT;						\
-	r3 = a >> G_SHIFT;						\
-	UNc_rb_MUL_UNc_rb (r2, r3, t);					\
+	r2__ = (x) >> G_SHIFT;						\
+	r3__ = (a) >> G_SHIFT;						\
+	UNc_rb_MUL_UNc_rb (r2__, r3__, t__);				\
 									\
-	x = r1 | (r2 << G_SHIFT);					\
+	(x) = r1__ | (r2__ << G_SHIFT);					\
     } while (0)
 
 /*
  * x_c = (x_c * a_c) / 255 + y_c
  */
 #define UNcx4_MUL_UNcx4_ADD_UNcx4(x, a, y)				\
     do									\
     {									\
-	comp4_t r1, r2, r3, t;						\
+	comp4_t r1__, r2__, r3__, t__;					\
 									\
-	r1 = x;								\
-	r2 = a;								\
-	UNc_rb_MUL_UNc_rb (r1, r2, t);					\
-	r2 = y & RB_MASK;						\
-	UNc_rb_ADD_UNc_rb (r1, r2, t);					\
+	r1__ = (x);							\
+	r2__ = (a);							\
+	UNc_rb_MUL_UNc_rb (r1__, r2__, t__);				\
+	r2__ = (y) & RB_MASK;						\
+	UNc_rb_ADD_UNc_rb (r1__, r2__, t__);				\
 									\
-	r2 = (x >> G_SHIFT);						\
-	r3 = (a >> G_SHIFT);						\
-	UNc_rb_MUL_UNc_rb (r2, r3, t);					\
-	r3 = (y >> G_SHIFT) & RB_MASK;					\
-	UNc_rb_ADD_UNc_rb (r2, r3, t);					\
+	r2__ = ((x) >> G_SHIFT);					\
+	r3__ = ((a) >> G_SHIFT);					\
+	UNc_rb_MUL_UNc_rb (r2__, r3__, t__);				\
+	r3__ = ((y) >> G_SHIFT) & RB_MASK;				\
+	UNc_rb_ADD_UNc_rb (r2__, r3__, t__);				\
 									\
-	x = r1 | (r2 << G_SHIFT);					\
+	(x) = r1__ | (r2__ << G_SHIFT);					\
     } while (0)
 
 /*
  * x_c = (x_c * a_c + y_c * b) / 255
  */
 #define UNcx4_MUL_UNcx4_ADD_UNcx4_MUL_UNc(x, a, y, b)			\
     do									\
     {									\
-	comp4_t r1, r2, r3, t;						\
+	comp4_t r1__, r2__, r3__, t__;					\
 									\
-	r1 = x;								\
-	r2 = a;								\
-	UNc_rb_MUL_UNc_rb (r1, r2, t);					\
-	r2 = y;								\
-	UNc_rb_MUL_UNc (r2, b, t);					\
-	UNc_rb_ADD_UNc_rb (r1, r2, t);					\
+	r1__ = (x);							\
+	r2__ = (a);							\
+	UNc_rb_MUL_UNc_rb (r1__, r2__, t__);				\
+	r2__ = (y);							\
+	UNc_rb_MUL_UNc (r2__, (b), t__);				\
+	UNc_rb_ADD_UNc_rb (r1__, r2__, t__);				\
 									\
-	r2 = x >> G_SHIFT;						\
-	r3 = a >> G_SHIFT;						\
-	UNc_rb_MUL_UNc_rb (r2, r3, t);					\
-	r3 = y >> G_SHIFT;						\
-	UNc_rb_MUL_UNc (r3, b, t);					\
-	UNc_rb_ADD_UNc_rb (r2, r3, t);					\
+	r2__ = (x) >> G_SHIFT;						\
+	r3__ = (a) >> G_SHIFT;						\
+	UNc_rb_MUL_UNc_rb (r2__, r3__, t__);				\
+	r3__ = (y) >> G_SHIFT;						\
+	UNc_rb_MUL_UNc (r3__, (b), t__);				\
+	UNc_rb_ADD_UNc_rb (r2__, r3__, t__);				\
 									\
-	x = r1 | (r2 << G_SHIFT);					\
+	(x) = r1__ | (r2__ << G_SHIFT);					\
     } while (0)
 
 /*
-   x_c = min(x_c + y_c, 255)
- */
+  x_c = min(x_c + y_c, 255)
+*/
 #define UNcx4_ADD_UNcx4(x, y)						\
     do									\
     {									\
-	comp4_t r1, r2, r3, t;						\
+	comp4_t r1__, r2__, r3__, t__;					\
 									\
-	r1 = x & RB_MASK;						\
-	r2 = y & RB_MASK;						\
-	UNc_rb_ADD_UNc_rb (r1, r2, t);					\
+	r1__ = (x) & RB_MASK;						\
+	r2__ = (y) & RB_MASK;						\
+	UNc_rb_ADD_UNc_rb (r1__, r2__, t__);				\
 									\
-	r2 = (x >> G_SHIFT) & RB_MASK;					\
-	r3 = (y >> G_SHIFT) & RB_MASK;					\
-	UNc_rb_ADD_UNc_rb (r2, r3, t);					\
+	r2__ = ((x) >> G_SHIFT) & RB_MASK;				\
+	r3__ = ((y) >> G_SHIFT) & RB_MASK;				\
+	UNc_rb_ADD_UNc_rb (r2__, r3__, t__);				\
 									\
-	x = r1 | (r2 << G_SHIFT);					\
+	(x) = r1__ | (r2__ << G_SHIFT);					\
     } while (0)
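Renaming the temporaries with trailing underscores and parenthesizing every argument closes two classic macro holes: a caller variable colliding with a macro temporary, and an unparenthesized argument binding to the wrong operator inside the expansion. Two illustrative misuses that the old bodies mishandled (variable names are hypothetical):

    comp4_t r1 = 0x10203040;
    UNcx4_MUL_UNc (r1, 128);      /* old body declared its own 'r1',
                                   * shadowing the argument and reading
                                   * an uninitialized temporary */

    UNcx4_ADD_UNcx4 (d, p | q);   /* old body expanded 'x & RB_MASK' and
                                   * 'y >> G_SHIFT' without parens, so
                                   * 'p | q' regrouped as p | (q & RB_MASK)
                                   * and p | (q >> G_SHIFT) */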
--- a/gfx/cairo/libpixman/src/pixman-combine32.c
+++ b/gfx/cairo/libpixman/src/pixman-combine32.c
@@ -132,16 +132,27 @@ combine_clear (pixman_implementation_t *
                const uint32_t *          src,
                const uint32_t *          mask,
                int                      width)
 {
     memset (dest, 0, width * sizeof(uint32_t));
 }
 
 static void
+combine_dst (pixman_implementation_t *imp,
+	     pixman_op_t	      op,
+	     uint32_t *		      dest,
+	     const uint32_t *	      src,
+	     const uint32_t *          mask,
+	     int		      width)
+{
+    return;
+}
+
+static void
 combine_src_u (pixman_implementation_t *imp,
                pixman_op_t              op,
                uint32_t *                dest,
                const uint32_t *          src,
                const uint32_t *          mask,
                int                      width)
 {
     int i;
@@ -1295,27 +1306,23 @@ combine_disjoint_over_u (pixman_implemen
 {
     int i;
 
     for (i = 0; i < width; ++i)
     {
 	uint32_t s = combine_mask (src, mask, i);
 	uint16_t a = s >> A_SHIFT;
 
-	if (a != 0x00)
+	if (s != 0x00)
 	{
-	    if (a != MASK)
-	    {
-		uint32_t d = *(dest + i);
-		a = combine_disjoint_out_part (d >> A_SHIFT, a);
-		UN8x4_MUL_UN8_ADD_UN8x4 (d, a, s);
-		s = d;
-	    }
+	    uint32_t d = *(dest + i);
+	    a = combine_disjoint_out_part (d >> A_SHIFT, a);
+	    UN8x4_MUL_UN8_ADD_UN8x4 (d, a, s);
 
-	    *(dest + i) = s;
+	    *(dest + i) = d;
 	}
     }
 }
 
 static void
 combine_disjoint_in_u (pixman_implementation_t *imp,
                        pixman_op_t              op,
                        uint32_t *                dest,
@@ -2313,47 +2320,47 @@ combine_conjoint_xor_ca (pixman_implemen
 }
 
 void
 _pixman_setup_combiner_functions_32 (pixman_implementation_t *imp)
 {
     /* Unified alpha */
     imp->combine_32[PIXMAN_OP_CLEAR] = combine_clear;
     imp->combine_32[PIXMAN_OP_SRC] = combine_src_u;
-    /* dest */
+    imp->combine_32[PIXMAN_OP_DST] = combine_dst;
     imp->combine_32[PIXMAN_OP_OVER] = combine_over_u;
     imp->combine_32[PIXMAN_OP_OVER_REVERSE] = combine_over_reverse_u;
     imp->combine_32[PIXMAN_OP_IN] = combine_in_u;
     imp->combine_32[PIXMAN_OP_IN_REVERSE] = combine_in_reverse_u;
     imp->combine_32[PIXMAN_OP_OUT] = combine_out_u;
     imp->combine_32[PIXMAN_OP_OUT_REVERSE] = combine_out_reverse_u;
     imp->combine_32[PIXMAN_OP_ATOP] = combine_atop_u;
     imp->combine_32[PIXMAN_OP_ATOP_REVERSE] = combine_atop_reverse_u;
     imp->combine_32[PIXMAN_OP_XOR] = combine_xor_u;
     imp->combine_32[PIXMAN_OP_ADD] = combine_add_u;
     imp->combine_32[PIXMAN_OP_SATURATE] = combine_saturate_u;
 
     /* Disjoint, unified */
     imp->combine_32[PIXMAN_OP_DISJOINT_CLEAR] = combine_clear;
     imp->combine_32[PIXMAN_OP_DISJOINT_SRC] = combine_src_u;
-    /* dest */
+    imp->combine_32[PIXMAN_OP_DISJOINT_DST] = combine_dst;
     imp->combine_32[PIXMAN_OP_DISJOINT_OVER] = combine_disjoint_over_u;
     imp->combine_32[PIXMAN_OP_DISJOINT_OVER_REVERSE] = combine_saturate_u;
     imp->combine_32[PIXMAN_OP_DISJOINT_IN] = combine_disjoint_in_u;
     imp->combine_32[PIXMAN_OP_DISJOINT_IN_REVERSE] = combine_disjoint_in_reverse_u;
     imp->combine_32[PIXMAN_OP_DISJOINT_OUT] = combine_disjoint_out_u;
     imp->combine_32[PIXMAN_OP_DISJOINT_OUT_REVERSE] = combine_disjoint_out_reverse_u;
     imp->combine_32[PIXMAN_OP_DISJOINT_ATOP] = combine_disjoint_atop_u;
     imp->combine_32[PIXMAN_OP_DISJOINT_ATOP_REVERSE] = combine_disjoint_atop_reverse_u;
     imp->combine_32[PIXMAN_OP_DISJOINT_XOR] = combine_disjoint_xor_u;
 
     /* Conjoint, unified */
     imp->combine_32[PIXMAN_OP_CONJOINT_CLEAR] = combine_clear;
     imp->combine_32[PIXMAN_OP_CONJOINT_SRC] = combine_src_u;
-    /* dest */
+    imp->combine_32[PIXMAN_OP_CONJOINT_DST] = combine_dst;
     imp->combine_32[PIXMAN_OP_CONJOINT_OVER] = combine_conjoint_over_u;
     imp->combine_32[PIXMAN_OP_CONJOINT_OVER_REVERSE] = combine_conjoint_over_reverse_u;
     imp->combine_32[PIXMAN_OP_CONJOINT_IN] = combine_conjoint_in_u;
     imp->combine_32[PIXMAN_OP_CONJOINT_IN_REVERSE] = combine_conjoint_in_reverse_u;
     imp->combine_32[PIXMAN_OP_CONJOINT_OUT] = combine_conjoint_out_u;
     imp->combine_32[PIXMAN_OP_CONJOINT_OUT_REVERSE] = combine_conjoint_out_reverse_u;
     imp->combine_32[PIXMAN_OP_CONJOINT_ATOP] = combine_conjoint_atop_u;
     imp->combine_32[PIXMAN_OP_CONJOINT_ATOP_REVERSE] = combine_conjoint_atop_reverse_u;
@@ -2389,31 +2396,31 @@ void
     imp->combine_32_ca[PIXMAN_OP_ATOP_REVERSE] = combine_atop_reverse_ca;
     imp->combine_32_ca[PIXMAN_OP_XOR] = combine_xor_ca;
     imp->combine_32_ca[PIXMAN_OP_ADD] = combine_add_ca;
     imp->combine_32_ca[PIXMAN_OP_SATURATE] = combine_saturate_ca;
 
     /* Disjoint CA */
     imp->combine_32_ca[PIXMAN_OP_DISJOINT_CLEAR] = combine_clear_ca;
     imp->combine_32_ca[PIXMAN_OP_DISJOINT_SRC] = combine_src_ca;
-    /* dest */
+    imp->combine_32_ca[PIXMAN_OP_DISJOINT_DST] = combine_dst;
     imp->combine_32_ca[PIXMAN_OP_DISJOINT_OVER] = combine_disjoint_over_ca;
     imp->combine_32_ca[PIXMAN_OP_DISJOINT_OVER_REVERSE] = combine_saturate_ca;
     imp->combine_32_ca[PIXMAN_OP_DISJOINT_IN] = combine_disjoint_in_ca;
     imp->combine_32_ca[PIXMAN_OP_DISJOINT_IN_REVERSE] = combine_disjoint_in_reverse_ca;
     imp->combine_32_ca[PIXMAN_OP_DISJOINT_OUT] = combine_disjoint_out_ca;
     imp->combine_32_ca[PIXMAN_OP_DISJOINT_OUT_REVERSE] = combine_disjoint_out_reverse_ca;
     imp->combine_32_ca[PIXMAN_OP_DISJOINT_ATOP] = combine_disjoint_atop_ca;
     imp->combine_32_ca[PIXMAN_OP_DISJOINT_ATOP_REVERSE] = combine_disjoint_atop_reverse_ca;
     imp->combine_32_ca[PIXMAN_OP_DISJOINT_XOR] = combine_disjoint_xor_ca;
 
     /* Conjoint CA */
     imp->combine_32_ca[PIXMAN_OP_CONJOINT_CLEAR] = combine_clear_ca;
     imp->combine_32_ca[PIXMAN_OP_CONJOINT_SRC] = combine_src_ca;
-    /* dest */
+    imp->combine_32_ca[PIXMAN_OP_CONJOINT_DST] = combine_dst;
     imp->combine_32_ca[PIXMAN_OP_CONJOINT_OVER] = combine_conjoint_over_ca;
     imp->combine_32_ca[PIXMAN_OP_CONJOINT_OVER_REVERSE] = combine_conjoint_over_reverse_ca;
     imp->combine_32_ca[PIXMAN_OP_CONJOINT_IN] = combine_conjoint_in_ca;
     imp->combine_32_ca[PIXMAN_OP_CONJOINT_IN_REVERSE] = combine_conjoint_in_reverse_ca;
     imp->combine_32_ca[PIXMAN_OP_CONJOINT_OUT] = combine_conjoint_out_ca;
     imp->combine_32_ca[PIXMAN_OP_CONJOINT_OUT_REVERSE] = combine_conjoint_out_reverse_ca;
     imp->combine_32_ca[PIXMAN_OP_CONJOINT_ATOP] = combine_conjoint_atop_ca;
     imp->combine_32_ca[PIXMAN_OP_CONJOINT_ATOP_REVERSE] = combine_conjoint_atop_reverse_ca;
@@ -2426,15 +2433,15 @@ void
     imp->combine_32_ca[PIXMAN_OP_LIGHTEN] = combine_lighten_ca;
     imp->combine_32_ca[PIXMAN_OP_COLOR_DODGE] = combine_color_dodge_ca;
     imp->combine_32_ca[PIXMAN_OP_COLOR_BURN] = combine_color_burn_ca;
     imp->combine_32_ca[PIXMAN_OP_HARD_LIGHT] = combine_hard_light_ca;
     imp->combine_32_ca[PIXMAN_OP_SOFT_LIGHT] = combine_soft_light_ca;
     imp->combine_32_ca[PIXMAN_OP_DIFFERENCE] = combine_difference_ca;
     imp->combine_32_ca[PIXMAN_OP_EXCLUSION] = combine_exclusion_ca;
 
-    /* It is not clear that these make sense, so leave them out for now */
-    imp->combine_32_ca[PIXMAN_OP_HSL_HUE] = NULL;
-    imp->combine_32_ca[PIXMAN_OP_HSL_SATURATION] = NULL;
-    imp->combine_32_ca[PIXMAN_OP_HSL_COLOR] = NULL;
-    imp->combine_32_ca[PIXMAN_OP_HSL_LUMINOSITY] = NULL;
+    /* It is not clear that these make sense, so make them no-ops for now */
+    imp->combine_32_ca[PIXMAN_OP_HSL_HUE] = combine_dst;
+    imp->combine_32_ca[PIXMAN_OP_HSL_SATURATION] = combine_dst;
+    imp->combine_32_ca[PIXMAN_OP_HSL_COLOR] = combine_dst;
+    imp->combine_32_ca[PIXMAN_OP_HSL_LUMINOSITY] = combine_dst;
 }
 
--- a/gfx/cairo/libpixman/src/pixman-combine32.h
+++ b/gfx/cairo/libpixman/src/pixman-combine32.h
@@ -30,17 +30,17 @@
 
 #define MUL_UN8(a, b, t)						\
     ((t) = (a) * (b) + ONE_HALF, ((((t) >> G_SHIFT ) + (t) ) >> G_SHIFT ))
 
 #define DIV_UN8(a, b)							\
     (((uint16_t) (a) * MASK) / (b))
 
 #define ADD_UN8(x, y, t)				     \
-    ((t) = x + y,					     \
+    ((t) = (x) + (y),					     \
      (uint32_t) (uint8_t) ((t) | (0 - ((t) >> G_SHIFT))))
 
 #define DIV_ONE_UN8(x)							\
     (((x) + ONE_HALF + (((x) + ONE_HALF) >> G_SHIFT)) >> G_SHIFT)
 
 /*
  * The methods below use some tricks to be able to do two color
  * components at the same time.
@@ -83,148 +83,148 @@
     } while (0)
 
 /*
  * x_c = (x_c * a) / 255
  */
 #define UN8x4_MUL_UN8(x, a)						\
     do									\
     {									\
-	uint32_t r1, r2, t;						\
+	uint32_t r1__, r2__, t__;					\
 									\
-	r1 = (x);							\
-	UN8_rb_MUL_UN8 (r1, a, t);					\
+	r1__ = (x);							\
+	UN8_rb_MUL_UN8 (r1__, (a), t__);				\
 									\
-	r2 = (x) >> G_SHIFT;						\
-	UN8_rb_MUL_UN8 (r2, a, t);					\
+	r2__ = (x) >> G_SHIFT;						\
+	UN8_rb_MUL_UN8 (r2__, (a), t__);				\
 									\
-	x = r1 | (r2 << G_SHIFT);					\
+	(x) = r1__ | (r2__ << G_SHIFT);					\
     } while (0)
 
 /*
  * x_c = (x_c * a) / 255 + y_c
  */
 #define UN8x4_MUL_UN8_ADD_UN8x4(x, a, y)				\
     do									\
     {									\
-	uint32_t r1, r2, r3, t;						\
+	uint32_t r1__, r2__, r3__, t__;					\
 									\
-	r1 = (x);							\
-	r2 = (y) & RB_MASK;						\
-	UN8_rb_MUL_UN8 (r1, a, t);					\
-	UN8_rb_ADD_UN8_rb (r1, r2, t);					\
+	r1__ = (x);							\
+	r2__ = (y) & RB_MASK;						\
+	UN8_rb_MUL_UN8 (r1__, (a), t__);				\
+	UN8_rb_ADD_UN8_rb (r1__, r2__, t__);				\
 									\
-	r2 = (x) >> G_SHIFT;						\
-	r3 = ((y) >> G_SHIFT) & RB_MASK;				\
-	UN8_rb_MUL_UN8 (r2, a, t);					\
-	UN8_rb_ADD_UN8_rb (r2, r3, t);					\
+	r2__ = (x) >> G_SHIFT;						\
+	r3__ = ((y) >> G_SHIFT) & RB_MASK;				\
+	UN8_rb_MUL_UN8 (r2__, (a), t__);				\
+	UN8_rb_ADD_UN8_rb (r2__, r3__, t__);				\
 									\
-	x = r1 | (r2 << G_SHIFT);					\
+	(x) = r1__ | (r2__ << G_SHIFT);					\
     } while (0)
 
 /*
  * x_c = (x_c * a + y_c * b) / 255
  */
 #define UN8x4_MUL_UN8_ADD_UN8x4_MUL_UN8(x, a, y, b)			\
     do									\
     {									\
-	uint32_t r1, r2, r3, t;						\
+	uint32_t r1__, r2__, r3__, t__;					\
 									\
-	r1 = x;								\
-	r2 = y;								\
-	UN8_rb_MUL_UN8 (r1, a, t);					\
-	UN8_rb_MUL_UN8 (r2, b, t);					\
-	UN8_rb_ADD_UN8_rb (r1, r2, t);					\
+	r1__ = (x);							\
+	r2__ = (y);							\
+	UN8_rb_MUL_UN8 (r1__, (a), t__);				\
+	UN8_rb_MUL_UN8 (r2__, (b), t__);				\
+	UN8_rb_ADD_UN8_rb (r1__, r2__, t__);				\
 									\
-	r2 = (x >> G_SHIFT);						\
-	r3 = (y >> G_SHIFT);						\
-	UN8_rb_MUL_UN8 (r2, a, t);					\
-	UN8_rb_MUL_UN8 (r3, b, t);					\
-	UN8_rb_ADD_UN8_rb (r2, r3, t);					\
+	r2__ = ((x) >> G_SHIFT);					\
+	r3__ = ((y) >> G_SHIFT);					\
+	UN8_rb_MUL_UN8 (r2__, (a), t__);				\
+	UN8_rb_MUL_UN8 (r3__, (b), t__);				\
+	UN8_rb_ADD_UN8_rb (r2__, r3__, t__);				\
 									\
-	x = r1 | (r2 << G_SHIFT);					\
+	(x) = r1__ | (r2__ << G_SHIFT);					\
     } while (0)
 
 /*
  * x_c = (x_c * a_c) / 255
  */
 #define UN8x4_MUL_UN8x4(x, a)						\
     do									\
     {									\
-	uint32_t r1, r2, r3, t;						\
+	uint32_t r1__, r2__, r3__, t__;					\
 									\
-	r1 = x;								\
-	r2 = a;								\
-	UN8_rb_MUL_UN8_rb (r1, r2, t);					\
+	r1__ = (x);							\
+	r2__ = (a);							\
+	UN8_rb_MUL_UN8_rb (r1__, r2__, t__);				\
 									\
-	r2 = x >> G_SHIFT;						\
-	r3 = a >> G_SHIFT;						\
-	UN8_rb_MUL_UN8_rb (r2, r3, t);					\
+	r2__ = (x) >> G_SHIFT;						\
+	r3__ = (a) >> G_SHIFT;						\
+	UN8_rb_MUL_UN8_rb (r2__, r3__, t__);				\
 									\
-	x = r1 | (r2 << G_SHIFT);					\
+	(x) = r1__ | (r2__ << G_SHIFT);					\
     } while (0)
 
 /*
  * x_c = (x_c * a_c) / 255 + y_c
  */
 #define UN8x4_MUL_UN8x4_ADD_UN8x4(x, a, y)				\
     do									\
     {									\
-	uint32_t r1, r2, r3, t;						\
+	uint32_t r1__, r2__, r3__, t__;					\
 									\
-	r1 = x;								\
-	r2 = a;								\
-	UN8_rb_MUL_UN8_rb (r1, r2, t);					\
-	r2 = y & RB_MASK;						\
-	UN8_rb_ADD_UN8_rb (r1, r2, t);					\
+	r1__ = (x);							\
+	r2__ = (a);							\
+	UN8_rb_MUL_UN8_rb (r1__, r2__, t__);				\
+	r2__ = (y) & RB_MASK;						\
+	UN8_rb_ADD_UN8_rb (r1__, r2__, t__);				\
 									\
-	r2 = (x >> G_SHIFT);						\
-	r3 = (a >> G_SHIFT);						\
-	UN8_rb_MUL_UN8_rb (r2, r3, t);					\
-	r3 = (y >> G_SHIFT) & RB_MASK;					\
-	UN8_rb_ADD_UN8_rb (r2, r3, t);					\
+	r2__ = ((x) >> G_SHIFT);					\
+	r3__ = ((a) >> G_SHIFT);					\
+	UN8_rb_MUL_UN8_rb (r2__, r3__, t__);				\
+	r3__ = ((y) >> G_SHIFT) & RB_MASK;				\
+	UN8_rb_ADD_UN8_rb (r2__, r3__, t__);				\
 									\
-	x = r1 | (r2 << G_SHIFT);					\
+	(x) = r1__ | (r2__ << G_SHIFT);					\
     } while (0)
 
 /*
  * x_c = (x_c * a_c + y_c * b) / 255
  */
 #define UN8x4_MUL_UN8x4_ADD_UN8x4_MUL_UN8(x, a, y, b)			\
     do									\
     {									\
-	uint32_t r1, r2, r3, t;						\
+	uint32_t r1__, r2__, r3__, t__;					\
 									\
-	r1 = x;								\
-	r2 = a;								\
-	UN8_rb_MUL_UN8_rb (r1, r2, t);					\
-	r2 = y;								\
-	UN8_rb_MUL_UN8 (r2, b, t);					\
-	UN8_rb_ADD_UN8_rb (r1, r2, t);					\
+	r1__ = (x);							\
+	r2__ = (a);							\
+	UN8_rb_MUL_UN8_rb (r1__, r2__, t__);				\
+	r2__ = (y);							\
+	UN8_rb_MUL_UN8 (r2__, (b), t__);				\
+	UN8_rb_ADD_UN8_rb (r1__, r2__, t__);				\
 									\
-	r2 = x >> G_SHIFT;						\
-	r3 = a >> G_SHIFT;						\
-	UN8_rb_MUL_UN8_rb (r2, r3, t);					\
-	r3 = y >> G_SHIFT;						\
-	UN8_rb_MUL_UN8 (r3, b, t);					\
-	UN8_rb_ADD_UN8_rb (r2, r3, t);					\
+	r2__ = (x) >> G_SHIFT;						\
+	r3__ = (a) >> G_SHIFT;						\
+	UN8_rb_MUL_UN8_rb (r2__, r3__, t__);				\
+	r3__ = (y) >> G_SHIFT;						\
+	UN8_rb_MUL_UN8 (r3__, (b), t__);				\
+	UN8_rb_ADD_UN8_rb (r2__, r3__, t__);				\
 									\
-	x = r1 | (r2 << G_SHIFT);					\
+	x = r1__ | (r2__ << G_SHIFT);					\
     } while (0)
 
 /*
-   x_c = min(x_c + y_c, 255)
- */
+  x_c = min(x_c + y_c, 255)
+*/
 #define UN8x4_ADD_UN8x4(x, y)						\
     do									\
     {									\
-	uint32_t r1, r2, r3, t;						\
+	uint32_t r1__, r2__, r3__, t__;					\
 									\
-	r1 = x & RB_MASK;						\
-	r2 = y & RB_MASK;						\
-	UN8_rb_ADD_UN8_rb (r1, r2, t);					\
+	r1__ = (x) & RB_MASK;						\
+	r2__ = (y) & RB_MASK;						\
+	UN8_rb_ADD_UN8_rb (r1__, r2__, t__);				\
 									\
-	r2 = (x >> G_SHIFT) & RB_MASK;					\
-	r3 = (y >> G_SHIFT) & RB_MASK;					\
-	UN8_rb_ADD_UN8_rb (r2, r3, t);					\
+	r2__ = ((x) >> G_SHIFT) & RB_MASK;				\
+	r3__ = ((y) >> G_SHIFT) & RB_MASK;				\
+	UN8_rb_ADD_UN8_rb (r2__, r3__, t__);				\
 									\
-	x = r1 | (r2 << G_SHIFT);					\
+	x = r1__ | (r2__ << G_SHIFT);					\
     } while (0)
--- a/gfx/cairo/libpixman/src/pixman-combine64.c
+++ b/gfx/cairo/libpixman/src/pixman-combine64.c
@@ -132,16 +132,27 @@ combine_clear (pixman_implementation_t *
                const uint64_t *          src,
                const uint64_t *          mask,
                int                      width)
 {
     memset (dest, 0, width * sizeof(uint64_t));
 }
 
 static void
+combine_dst (pixman_implementation_t *imp,
+	     pixman_op_t	      op,
+	     uint64_t *		      dest,
+	     const uint64_t *	      src,
+	     const uint64_t *          mask,
+	     int		      width)
+{
+    return;
+}
+
+static void
 combine_src_u (pixman_implementation_t *imp,
                pixman_op_t              op,
                uint64_t *                dest,
                const uint64_t *          src,
                const uint64_t *          mask,
                int                      width)
 {
     int i;
@@ -1295,27 +1306,23 @@ combine_disjoint_over_u (pixman_implemen
 {
     int i;
 
     for (i = 0; i < width; ++i)
     {
 	uint64_t s = combine_mask (src, mask, i);
 	uint32_t a = s >> A_SHIFT;
 
-	if (a != 0x00)
+	if (s != 0x00)
 	{
-	    if (a != MASK)
-	    {
-		uint64_t d = *(dest + i);
-		a = combine_disjoint_out_part (d >> A_SHIFT, a);
-		UN16x4_MUL_UN16_ADD_UN16x4 (d, a, s);
-		s = d;
-	    }
+	    uint64_t d = *(dest + i);
+	    a = combine_disjoint_out_part (d >> A_SHIFT, a);
+	    UN16x4_MUL_UN16_ADD_UN16x4 (d, a, s);
 
-	    *(dest + i) = s;
+	    *(dest + i) = d;
 	}
     }
 }
 
 static void
 combine_disjoint_in_u (pixman_implementation_t *imp,
                        pixman_op_t              op,
                        uint64_t *                dest,
@@ -2313,47 +2320,47 @@ combine_conjoint_xor_ca (pixman_implemen
 }
 
 void
 _pixman_setup_combiner_functions_64 (pixman_implementation_t *imp)
 {
     /* Unified alpha */
     imp->combine_64[PIXMAN_OP_CLEAR] = combine_clear;
     imp->combine_64[PIXMAN_OP_SRC] = combine_src_u;
-    /* dest */
+    imp->combine_64[PIXMAN_OP_DST] = combine_dst;
     imp->combine_64[PIXMAN_OP_OVER] = combine_over_u;
     imp->combine_64[PIXMAN_OP_OVER_REVERSE] = combine_over_reverse_u;
     imp->combine_64[PIXMAN_OP_IN] = combine_in_u;
     imp->combine_64[PIXMAN_OP_IN_REVERSE] = combine_in_reverse_u;
     imp->combine_64[PIXMAN_OP_OUT] = combine_out_u;
     imp->combine_64[PIXMAN_OP_OUT_REVERSE] = combine_out_reverse_u;
     imp->combine_64[PIXMAN_OP_ATOP] = combine_atop_u;
     imp->combine_64[PIXMAN_OP_ATOP_REVERSE] = combine_atop_reverse_u;
     imp->combine_64[PIXMAN_OP_XOR] = combine_xor_u;
     imp->combine_64[PIXMAN_OP_ADD] = combine_add_u;
     imp->combine_64[PIXMAN_OP_SATURATE] = combine_saturate_u;
 
     /* Disjoint, unified */
     imp->combine_64[PIXMAN_OP_DISJOINT_CLEAR] = combine_clear;
     imp->combine_64[PIXMAN_OP_DISJOINT_SRC] = combine_src_u;
-    /* dest */
+    imp->combine_64[PIXMAN_OP_DISJOINT_DST] = combine_dst;
     imp->combine_64[PIXMAN_OP_DISJOINT_OVER] = combine_disjoint_over_u;
     imp->combine_64[PIXMAN_OP_DISJOINT_OVER_REVERSE] = combine_saturate_u;
     imp->combine_64[PIXMAN_OP_DISJOINT_IN] = combine_disjoint_in_u;
     imp->combine_64[PIXMAN_OP_DISJOINT_IN_REVERSE] = combine_disjoint_in_reverse_u;
     imp->combine_64[PIXMAN_OP_DISJOINT_OUT] = combine_disjoint_out_u;
     imp->combine_64[PIXMAN_OP_DISJOINT_OUT_REVERSE] = combine_disjoint_out_reverse_u;
     imp->combine_64[PIXMAN_OP_DISJOINT_ATOP] = combine_disjoint_atop_u;
     imp->combine_64[PIXMAN_OP_DISJOINT_ATOP_REVERSE] = combine_disjoint_atop_reverse_u;
     imp->combine_64[PIXMAN_OP_DISJOINT_XOR] = combine_disjoint_xor_u;
 
     /* Conjoint, unified */
     imp->combine_64[PIXMAN_OP_CONJOINT_CLEAR] = combine_clear;
     imp->combine_64[PIXMAN_OP_CONJOINT_SRC] = combine_src_u;
-    /* dest */
+    imp->combine_64[PIXMAN_OP_CONJOINT_DST] = combine_dst;
     imp->combine_64[PIXMAN_OP_CONJOINT_OVER] = combine_conjoint_over_u;
     imp->combine_64[PIXMAN_OP_CONJOINT_OVER_REVERSE] = combine_conjoint_over_reverse_u;
     imp->combine_64[PIXMAN_OP_CONJOINT_IN] = combine_conjoint_in_u;
     imp->combine_64[PIXMAN_OP_CONJOINT_IN_REVERSE] = combine_conjoint_in_reverse_u;
     imp->combine_64[PIXMAN_OP_CONJOINT_OUT] = combine_conjoint_out_u;
     imp->combine_64[PIXMAN_OP_CONJOINT_OUT_REVERSE] = combine_conjoint_out_reverse_u;
     imp->combine_64[PIXMAN_OP_CONJOINT_ATOP] = combine_conjoint_atop_u;
     imp->combine_64[PIXMAN_OP_CONJOINT_ATOP_REVERSE] = combine_conjoint_atop_reverse_u;
@@ -2389,31 +2396,31 @@ void
     imp->combine_64_ca[PIXMAN_OP_ATOP_REVERSE] = combine_atop_reverse_ca;
     imp->combine_64_ca[PIXMAN_OP_XOR] = combine_xor_ca;
     imp->combine_64_ca[PIXMAN_OP_ADD] = combine_add_ca;
     imp->combine_64_ca[PIXMAN_OP_SATURATE] = combine_saturate_ca;
 
     /* Disjoint CA */
     imp->combine_64_ca[PIXMAN_OP_DISJOINT_CLEAR] = combine_clear_ca;
     imp->combine_64_ca[PIXMAN_OP_DISJOINT_SRC] = combine_src_ca;
-    /* dest */
+    imp->combine_64_ca[PIXMAN_OP_DISJOINT_DST] = combine_dst;
     imp->combine_64_ca[PIXMAN_OP_DISJOINT_OVER] = combine_disjoint_over_ca;
     imp->combine_64_ca[PIXMAN_OP_DISJOINT_OVER_REVERSE] = combine_saturate_ca;
     imp->combine_64_ca[PIXMAN_OP_DISJOINT_IN] = combine_disjoint_in_ca;
     imp->combine_64_ca[PIXMAN_OP_DISJOINT_IN_REVERSE] = combine_disjoint_in_reverse_ca;
     imp->combine_64_ca[PIXMAN_OP_DISJOINT_OUT] = combine_disjoint_out_ca;
     imp->combine_64_ca[PIXMAN_OP_DISJOINT_OUT_REVERSE] = combine_disjoint_out_reverse_ca;
     imp->combine_64_ca[PIXMAN_OP_DISJOINT_ATOP] = combine_disjoint_atop_ca;
     imp->combine_64_ca[PIXMAN_OP_DISJOINT_ATOP_REVERSE] = combine_disjoint_atop_reverse_ca;
     imp->combine_64_ca[PIXMAN_OP_DISJOINT_XOR] = combine_disjoint_xor_ca;
 
     /* Conjoint CA */
     imp->combine_64_ca[PIXMAN_OP_CONJOINT_CLEAR] = combine_clear_ca;
     imp->combine_64_ca[PIXMAN_OP_CONJOINT_SRC] = combine_src_ca;
-    /* dest */
+    imp->combine_64_ca[PIXMAN_OP_CONJOINT_DST] = combine_dst;
     imp->combine_64_ca[PIXMAN_OP_CONJOINT_OVER] = combine_conjoint_over_ca;
     imp->combine_64_ca[PIXMAN_OP_CONJOINT_OVER_REVERSE] = combine_conjoint_over_reverse_ca;
     imp->combine_64_ca[PIXMAN_OP_CONJOINT_IN] = combine_conjoint_in_ca;
     imp->combine_64_ca[PIXMAN_OP_CONJOINT_IN_REVERSE] = combine_conjoint_in_reverse_ca;
     imp->combine_64_ca[PIXMAN_OP_CONJOINT_OUT] = combine_conjoint_out_ca;
     imp->combine_64_ca[PIXMAN_OP_CONJOINT_OUT_REVERSE] = combine_conjoint_out_reverse_ca;
     imp->combine_64_ca[PIXMAN_OP_CONJOINT_ATOP] = combine_conjoint_atop_ca;
     imp->combine_64_ca[PIXMAN_OP_CONJOINT_ATOP_REVERSE] = combine_conjoint_atop_reverse_ca;
@@ -2426,15 +2433,15 @@ void
     imp->combine_64_ca[PIXMAN_OP_LIGHTEN] = combine_lighten_ca;
     imp->combine_64_ca[PIXMAN_OP_COLOR_DODGE] = combine_color_dodge_ca;
     imp->combine_64_ca[PIXMAN_OP_COLOR_BURN] = combine_color_burn_ca;
     imp->combine_64_ca[PIXMAN_OP_HARD_LIGHT] = combine_hard_light_ca;
     imp->combine_64_ca[PIXMAN_OP_SOFT_LIGHT] = combine_soft_light_ca;
     imp->combine_64_ca[PIXMAN_OP_DIFFERENCE] = combine_difference_ca;
     imp->combine_64_ca[PIXMAN_OP_EXCLUSION] = combine_exclusion_ca;
 
-    /* It is not clear that these make sense, so leave them out for now */
-    imp->combine_64_ca[PIXMAN_OP_HSL_HUE] = NULL;
-    imp->combine_64_ca[PIXMAN_OP_HSL_SATURATION] = NULL;
-    imp->combine_64_ca[PIXMAN_OP_HSL_COLOR] = NULL;
-    imp->combine_64_ca[PIXMAN_OP_HSL_LUMINOSITY] = NULL;
+    /* It is not clear that these make sense, so make them no-ops for now */
+    imp->combine_64_ca[PIXMAN_OP_HSL_HUE] = combine_dst;
+    imp->combine_64_ca[PIXMAN_OP_HSL_SATURATION] = combine_dst;
+    imp->combine_64_ca[PIXMAN_OP_HSL_COLOR] = combine_dst;
+    imp->combine_64_ca[PIXMAN_OP_HSL_LUMINOSITY] = combine_dst;
 }
 
--- a/gfx/cairo/libpixman/src/pixman-combine64.h
+++ b/gfx/cairo/libpixman/src/pixman-combine64.h
@@ -30,17 +30,17 @@
 
 #define MUL_UN16(a, b, t)						\
     ((t) = (a) * (b) + ONE_HALF, ((((t) >> G_SHIFT ) + (t) ) >> G_SHIFT ))
 
 #define DIV_UN16(a, b)							\
     (((uint32_t) (a) * MASK) / (b))
 
 #define ADD_UN16(x, y, t)				     \
-    ((t) = x + y,					     \
+    ((t) = (x) + (y),					     \
      (uint64_t) (uint16_t) ((t) | (0 - ((t) >> G_SHIFT))))
 
 #define DIV_ONE_UN16(x)							\
     (((x) + ONE_HALF + (((x) + ONE_HALF) >> G_SHIFT)) >> G_SHIFT)
 
 /*
  * The methods below use some tricks to be able to do two color
  * components at the same time.
@@ -83,148 +83,148 @@
     } while (0)
 
 /*
  * x_c = (x_c * a) / 255
  */
 #define UN16x4_MUL_UN16(x, a)						\
     do									\
     {									\
-	uint64_t r1, r2, t;						\
+	uint64_t r1__, r2__, t__;					\
 									\
-	r1 = (x);							\
-	UN16_rb_MUL_UN16 (r1, a, t);					\
+	r1__ = (x);							\
+	UN16_rb_MUL_UN16 (r1__, (a), t__);				\
 									\
-	r2 = (x) >> G_SHIFT;						\
-	UN16_rb_MUL_UN16 (r2, a, t);					\
+	r2__ = (x) >> G_SHIFT;						\
+	UN16_rb_MUL_UN16 (r2__, (a), t__);				\
 									\
-	x = r1 | (r2 << G_SHIFT);					\
+	(x) = r1__ | (r2__ << G_SHIFT);					\
     } while (0)
 
 /*
  * x_c = (x_c * a) / 255 + y_c
  */
 #define UN16x4_MUL_UN16_ADD_UN16x4(x, a, y)				\
     do									\
     {									\
-	uint64_t r1, r2, r3, t;						\
+	uint64_t r1__, r2__, r3__, t__;					\
 									\
-	r1 = (x);							\
-	r2 = (y) & RB_MASK;						\
-	UN16_rb_MUL_UN16 (r1, a, t);					\
-	UN16_rb_ADD_UN16_rb (r1, r2, t);					\
+	r1__ = (x);							\
+	r2__ = (y) & RB_MASK;						\
+	UN16_rb_MUL_UN16 (r1__, (a), t__);				\
+	UN16_rb_ADD_UN16_rb (r1__, r2__, t__);				\
 									\
-	r2 = (x) >> G_SHIFT;						\
-	r3 = ((y) >> G_SHIFT) & RB_MASK;				\
-	UN16_rb_MUL_UN16 (r2, a, t);					\
-	UN16_rb_ADD_UN16_rb (r2, r3, t);					\
+	r2__ = (x) >> G_SHIFT;						\
+	r3__ = ((y) >> G_SHIFT) & RB_MASK;				\
+	UN16_rb_MUL_UN16 (r2__, (a), t__);				\
+	UN16_rb_ADD_UN16_rb (r2__, r3__, t__);				\
 									\
-	x = r1 | (r2 << G_SHIFT);					\
+	(x) = r1__ | (r2__ << G_SHIFT);					\
     } while (0)
 
 /*
  * x_c = (x_c * a + y_c * b) / 255
  */
 #define UN16x4_MUL_UN16_ADD_UN16x4_MUL_UN16(x, a, y, b)			\
     do									\
     {									\
-	uint64_t r1, r2, r3, t;						\
+	uint64_t r1__, r2__, r3__, t__;					\
 									\
-	r1 = x;								\
-	r2 = y;								\
-	UN16_rb_MUL_UN16 (r1, a, t);					\
-	UN16_rb_MUL_UN16 (r2, b, t);					\
-	UN16_rb_ADD_UN16_rb (r1, r2, t);					\
+	r1__ = (x);							\
+	r2__ = (y);							\
+	UN16_rb_MUL_UN16 (r1__, (a), t__);				\
+	UN16_rb_MUL_UN16 (r2__, (b), t__);				\
+	UN16_rb_ADD_UN16_rb (r1__, r2__, t__);				\
 									\
-	r2 = (x >> G_SHIFT);						\
-	r3 = (y >> G_SHIFT);						\
-	UN16_rb_MUL_UN16 (r2, a, t);					\
-	UN16_rb_MUL_UN16 (r3, b, t);					\
-	UN16_rb_ADD_UN16_rb (r2, r3, t);					\
+	r2__ = ((x) >> G_SHIFT);					\
+	r3__ = ((y) >> G_SHIFT);					\
+	UN16_rb_MUL_UN16 (r2__, (a), t__);				\
+	UN16_rb_MUL_UN16 (r3__, (b), t__);				\
+	UN16_rb_ADD_UN16_rb (r2__, r3__, t__);				\
 									\
-	x = r1 | (r2 << G_SHIFT);					\
+	(x) = r1__ | (r2__ << G_SHIFT);					\
     } while (0)
 
 /*
  * x_c = (x_c * a_c) / 255
  */
 #define UN16x4_MUL_UN16x4(x, a)						\
     do									\
     {									\
-	uint64_t r1, r2, r3, t;						\
+	uint64_t r1__, r2__, r3__, t__;					\
 									\
-	r1 = x;								\
-	r2 = a;								\
-	UN16_rb_MUL_UN16_rb (r1, r2, t);					\
+	r1__ = (x);							\
+	r2__ = (a);							\
+	UN16_rb_MUL_UN16_rb (r1__, r2__, t__);				\
 									\
-	r2 = x >> G_SHIFT;						\
-	r3 = a >> G_SHIFT;						\
-	UN16_rb_MUL_UN16_rb (r2, r3, t);					\
+	r2__ = (x) >> G_SHIFT;						\
+	r3__ = (a) >> G_SHIFT;						\
+	UN16_rb_MUL_UN16_rb (r2__, r3__, t__);				\
 									\
-	x = r1 | (r2 << G_SHIFT);					\
+	(x) = r1__ | (r2__ << G_SHIFT);					\
     } while (0)
 
 /*
  * x_c = (x_c * a_c) / 255 + y_c
  */
 #define UN16x4_MUL_UN16x4_ADD_UN16x4(x, a, y)				\
     do									\
     {									\
-	uint64_t r1, r2, r3, t;						\
+	uint64_t r1__, r2__, r3__, t__;					\
 									\
-	r1 = x;								\
-	r2 = a;								\
-	UN16_rb_MUL_UN16_rb (r1, r2, t);					\
-	r2 = y & RB_MASK;						\
-	UN16_rb_ADD_UN16_rb (r1, r2, t);					\
+	r1__ = (x);							\
+	r2__ = (a);							\
+	UN16_rb_MUL_UN16_rb (r1__, r2__, t__);				\
+	r2__ = (y) & RB_MASK;						\
+	UN16_rb_ADD_UN16_rb (r1__, r2__, t__);				\
 									\
-	r2 = (x >> G_SHIFT);						\
-	r3 = (a >> G_SHIFT);						\
-	UN16_rb_MUL_UN16_rb (r2, r3, t);					\
-	r3 = (y >> G_SHIFT) & RB_MASK;					\
-	UN16_rb_ADD_UN16_rb (r2, r3, t);					\
+	r2__ = ((x) >> G_SHIFT);					\
+	r3__ = ((a) >> G_SHIFT);					\
+	UN16_rb_MUL_UN16_rb (r2__, r3__, t__);				\
+	r3__ = ((y) >> G_SHIFT) & RB_MASK;				\
+	UN16_rb_ADD_UN16_rb (r2__, r3__, t__);				\
 									\
-	x = r1 | (r2 << G_SHIFT);					\
+	(x) = r1__ | (r2__ << G_SHIFT);					\
     } while (0)
 
 /*
  * x_c = (x_c * a_c + y_c * b) / 255
  */
 #define UN16x4_MUL_UN16x4_ADD_UN16x4_MUL_UN16(x, a, y, b)			\
     do									\
     {									\
-	uint64_t r1, r2, r3, t;						\
+	uint64_t r1__, r2__, r3__, t__;					\
 									\
-	r1 = x;								\
-	r2 = a;								\
-	UN16_rb_MUL_UN16_rb (r1, r2, t);					\
-	r2 = y;								\
-	UN16_rb_MUL_UN16 (r2, b, t);					\
-	UN16_rb_ADD_UN16_rb (r1, r2, t);					\
+	r1__ = (x);							\
+	r2__ = (a);							\
+	UN16_rb_MUL_UN16_rb (r1__, r2__, t__);				\
+	r2__ = (y);							\
+	UN16_rb_MUL_UN16 (r2__, (b), t__);				\
+	UN16_rb_ADD_UN16_rb (r1__, r2__, t__);				\
 									\
-	r2 = x >> G_SHIFT;						\
-	r3 = a >> G_SHIFT;						\
-	UN16_rb_MUL_UN16_rb (r2, r3, t);					\
-	r3 = y >> G_SHIFT;						\
-	UN16_rb_MUL_UN16 (r3, b, t);					\
-	UN16_rb_ADD_UN16_rb (r2, r3, t);					\
+	r2__ = (x) >> G_SHIFT;						\
+	r3__ = (a) >> G_SHIFT;						\
+	UN16_rb_MUL_UN16_rb (r2__, r3__, t__);				\
+	r3__ = (y) >> G_SHIFT;						\
+	UN16_rb_MUL_UN16 (r3__, (b), t__);				\
+	UN16_rb_ADD_UN16_rb (r2__, r3__, t__);				\
 									\
-	x = r1 | (r2 << G_SHIFT);					\
+	x = r1__ | (r2__ << G_SHIFT);					\
     } while (0)
 
 /*
-   x_c = min(x_c + y_c, 255)
- */
+  x_c = min(x_c + y_c, 255)
+*/
 #define UN16x4_ADD_UN16x4(x, y)						\
     do									\
     {									\
-	uint64_t r1, r2, r3, t;						\
+	uint64_t r1__, r2__, r3__, t__;					\
 									\
-	r1 = x & RB_MASK;						\
-	r2 = y & RB_MASK;						\
-	UN16_rb_ADD_UN16_rb (r1, r2, t);					\
+	r1__ = (x) & RB_MASK;						\
+	r2__ = (y) & RB_MASK;						\
+	UN16_rb_ADD_UN16_rb (r1__, r2__, t__);				\
 									\
-	r2 = (x >> G_SHIFT) & RB_MASK;					\
-	r3 = (y >> G_SHIFT) & RB_MASK;					\
-	UN16_rb_ADD_UN16_rb (r2, r3, t);					\
+	r2__ = ((x) >> G_SHIFT) & RB_MASK;				\
+	r3__ = ((y) >> G_SHIFT) & RB_MASK;				\
+	UN16_rb_ADD_UN16_rb (r2__, r3__, t__);				\
 									\
-	x = r1 | (r2 << G_SHIFT);					\
+	x = r1__ | (r2__ << G_SHIFT);					\
     } while (0)
--- a/gfx/cairo/libpixman/src/pixman-compiler.h
+++ b/gfx/cairo/libpixman/src/pixman-compiler.h
@@ -193,18 +193,17 @@
 	if (pthread_once (&tls_ ## name ## _once_control,		\
 			  tls_ ## name ## _make_key) == 0)		\
 	{								\
 	    value = pthread_getspecific (tls_ ## name ## _key);		\
 	    if (!value)							\
 		value = tls_ ## name ## _alloc ();			\
 	}								\
 	return value;							\
-    }									\
-    extern int no_such_variable						
+    }
 
 #   define PIXMAN_GET_THREAD_LOCAL(name)				\
     tls_ ## name ## _get ()
 
 #else
 
 #    error "Unknown thread local support for this system. Pixman will not work with multiple threads. Define PIXMAN_NO_TLS to acknowledge and accept this limitation and compile pixman without thread-safety support."
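
The deleted tail is a small macro idiom worth naming: ending the expansion with "extern int no_such_variable" makes the macro incomplete without a trailing semicolon, so PIXMAN_DEFINE_THREAD_LOCAL (type, name); reads like an ordinary declaration, and repeated uses merely re-declare the same extern. With it removed, the expansion ends at the function's closing brace; this update also reverts the related "empty declaration" warning fix, per the change list. A reduced sketch of the idiom, with hypothetical names:

    /* Hypothetical reduction of the idiom being removed. */
    #define DEFINE_GETTER(name)                          \
        static int get_ ## name (void) { return name; }  \
        extern int no_such_variable  /* consumes the caller's ';' */

    static int counter;
    DEFINE_GETTER (counter);  /* the ';' completes the extern declaration */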
 
--- a/gfx/cairo/libpixman/src/pixman-conical-gradient.c
+++ b/gfx/cairo/libpixman/src/pixman-conical-gradient.c
@@ -45,61 +45,61 @@ coordinates_to_parameter (double x, doub
     while (t >= 2 * M_PI)
 	t -= 2 * M_PI;
 
     return 1 - t * (1 / (2 * M_PI)); /* Scale t to [0, 1] and
 				      * make rotation CCW
 				      */
 }
 
-static void
-conical_gradient_get_scanline_32 (pixman_image_t *image,
-                                  int             x,
-                                  int             y,
-                                  int             width,
-                                  uint32_t *      buffer,
-                                  const uint32_t *mask)
+static uint32_t *
+conical_get_scanline_narrow (pixman_iter_t *iter, const uint32_t *mask)
 {
-    source_image_t *source = (source_image_t *)image;
-    gradient_t *gradient = (gradient_t *)source;
+    pixman_image_t *image = iter->image;
+    int x = iter->x;
+    int y = iter->y;
+    int width = iter->width;
+    uint32_t *buffer = iter->buffer;
+
+    gradient_t *gradient = (gradient_t *)image;
     conical_gradient_t *conical = (conical_gradient_t *)image;
     uint32_t       *end = buffer + width;
     pixman_gradient_walker_t walker;
     pixman_bool_t affine = TRUE;
     double cx = 1.;
     double cy = 0.;
     double cz = 0.;
     double rx = x + 0.5;
     double ry = y + 0.5;
     double rz = 1.;
 
-    _pixman_gradient_walker_init (&walker, gradient, source->common.repeat);
+    _pixman_gradient_walker_init (&walker, gradient, image->common.repeat);
 
-    if (source->common.transform)
+    if (image->common.transform)
     {
 	pixman_vector_t v;
 
 	/* reference point is the center of the pixel */
 	v.vector[0] = pixman_int_to_fixed (x) + pixman_fixed_1 / 2;
 	v.vector[1] = pixman_int_to_fixed (y) + pixman_fixed_1 / 2;
 	v.vector[2] = pixman_fixed_1;
 
-	if (!pixman_transform_point_3d (source->common.transform, &v))
-	    return;
+	if (!pixman_transform_point_3d (image->common.transform, &v))
+	    return iter->buffer;
 
-	cx = source->common.transform->matrix[0][0] / 65536.;
-	cy = source->common.transform->matrix[1][0] / 65536.;
-	cz = source->common.transform->matrix[2][0] / 65536.;
+	cx = image->common.transform->matrix[0][0] / 65536.;
+	cy = image->common.transform->matrix[1][0] / 65536.;
+	cz = image->common.transform->matrix[2][0] / 65536.;
 
 	rx = v.vector[0] / 65536.;
 	ry = v.vector[1] / 65536.;
 	rz = v.vector[2] / 65536.;
 
 	affine =
-	    source->common.transform->matrix[2][0] == 0 &&
+	    image->common.transform->matrix[2][0] == 0 &&
 	    v.vector[2] == pixman_fixed_1;
     }
 
     if (affine)
     {
 	rx -= conical->center.x / 65536.;
 	ry -= conical->center.y / 65536.;
 
@@ -150,23 +150,38 @@ conical_gradient_get_scanline_32 (pixman
 
 	    ++buffer;
 
 	    rx += cx;
 	    ry += cy;
 	    rz += cz;
 	}
     }
+
+    iter->y++;
+    return iter->buffer;
 }
 
-static void
-conical_gradient_property_changed (pixman_image_t *image)
+static uint32_t *
+conical_get_scanline_wide (pixman_iter_t *iter, const uint32_t *mask)
 {
-    image->common.get_scanline_32 = conical_gradient_get_scanline_32;
-    image->common.get_scanline_64 = _pixman_image_get_scanline_generic_64;
+    uint32_t *buffer = conical_get_scanline_narrow (iter, NULL);
+
+    pixman_expand ((uint64_t *)buffer, buffer, PIXMAN_a8r8g8b8, iter->width);
+
+    return buffer;
+}
+
+void
+_pixman_conical_gradient_iter_init (pixman_image_t *image, pixman_iter_t *iter)
+{
+    if (iter->flags & ITER_NARROW)
+	iter->get_scanline = conical_get_scanline_narrow;
+    else
+	iter->get_scanline = conical_get_scanline_wide;
 }
 
 PIXMAN_EXPORT pixman_image_t *
 pixman_image_create_conical_gradient (pixman_point_fixed_t *        center,
                                       pixman_fixed_t                angle,
                                       const pixman_gradient_stop_t *stops,
                                       int                           n_stops)
 {
@@ -186,13 +201,11 @@ pixman_image_create_conical_gradient (pi
 
     angle = MOD (angle, pixman_int_to_fixed (360));
 
     image->type = CONICAL;
 
     conical->center = *center;
     conical->angle = (pixman_fixed_to_double (angle) / 180.0) * M_PI;
 
-    image->common.property_changed = conical_gradient_property_changed;
-
     return image;
 }
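
The conical gradient moves from the per-image get_scanline_32/get_scanline_64 callbacks (installed via the property_changed hook, now deleted) to the pixman_iter_t model: the iterator carries x, y, width and a scratch buffer, its get_scanline hook renders one scanline into that buffer and advances y, and the wide variant is layered on the narrow one by expanding the a8r8g8b8 buffer to 64 bits with pixman_expand. A minimal sketch of the consuming loop, assuming the pixman_iter_t fields visible above (render_rows is a hypothetical consumer, not pixman's):

    #include <stdio.h>

    /* Hypothetical consumer: pull 'height' scanlines through an iterator. */
    static void
    render_rows (pixman_iter_t *iter, int height, const uint32_t *mask)
    {
        int i;

        for (i = 0; i < height; i++)
        {
            /* Fills iter->buffer for the current row and bumps iter->y
             * (see conical_get_scanline_narrow above). */
            uint32_t *row = iter->get_scanline (iter, mask);

            fwrite (row, sizeof (uint32_t), iter->width, stdout);
        }
    }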
 
--- a/gfx/cairo/libpixman/src/pixman-cpu.c
+++ b/gfx/cairo/libpixman/src/pixman-cpu.c
@@ -239,17 +239,17 @@ pixman_have_arm_neon (void)
 	initialized = TRUE;
     }
 
     return have_arm_neon;
 }
 
 #endif /* USE_ARM_NEON */
 
-#else /* linux ELF */
+#elif defined (__linux__) /* linux ELF */
 
 #include <stdlib.h>
 #include <unistd.h>
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <sys/mman.h>
 #include <fcntl.h>
 #include <string.h>
@@ -265,28 +265,28 @@ static pixman_bool_t arm_tests_initializ
 #ifdef ANDROID
 
 /* on Android, we can't reliably access /proc/self/auxv,
  * so instead read the text version in /proc/cpuinfo and
  * parse that instead.
  */
 
 static void
-pixman_arm_read_auxv()
+pixman_arm_detect_cpu_features (void)
 {
     char buf[1024];
     char* pos;
     const char* ver_token = "CPU architecture: ";
     FILE* f = fopen("/proc/cpuinfo", "r");
     if (!f) {
 	arm_tests_initialized = TRUE;
 	return;
     }
 
-    fread(buf, sizeof(char), 1024, f);
+    fread(buf, sizeof(char), sizeof(buf), f);
     fclose(f);
     pos = strstr(buf, ver_token);
     if (pos) {
 	char vchar = *(pos + strlen(ver_token));
 	if (vchar >= '0' && vchar <= '9') {
 	    int ver = vchar - '0';
 	    arm_has_v7 = ver >= 7;
 	    arm_has_v6 = ver >= 6;
@@ -296,17 +296,17 @@ pixman_arm_read_auxv()
     arm_has_vfp = strstr(buf, "vfp") != NULL;
     arm_has_iwmmxt = strstr(buf, "iwmmxt") != NULL;
     arm_tests_initialized = TRUE;
 }
 
 #else
 
 static void
-pixman_arm_read_auxv ()
+pixman_arm_detect_cpu_features (void)
 {
     int fd;
     Elf32_auxv_t aux;
 
     fd = open ("/proc/self/auxv", O_RDONLY);
     if (fd >= 0)
     {
 	while (read (fd, &aux, sizeof(Elf32_auxv_t)) == sizeof(Elf32_auxv_t))
@@ -343,36 +343,41 @@ pixman_arm_read_auxv ()
 }
 #endif
 
 #if defined(USE_ARM_SIMD)
 pixman_bool_t
 pixman_have_arm_simd (void)
 {
     if (!arm_tests_initialized)
-	pixman_arm_read_auxv ();
+	pixman_arm_detect_cpu_features ();
 
     return arm_has_v6;
 }
 
 #endif /* USE_ARM_SIMD */
 
 #if defined(USE_ARM_NEON)
 pixman_bool_t
 pixman_have_arm_neon (void)
 {
     if (!arm_tests_initialized)
-	pixman_arm_read_auxv ();
+	pixman_arm_detect_cpu_features ();
 
     return arm_has_neon;
 }
 
 #endif /* USE_ARM_NEON */
 
-#endif /* linux */
+#else /* linux ELF */
+
+#define pixman_have_arm_simd() FALSE
+#define pixman_have_arm_neon() FALSE
+
+#endif
 
 #endif /* USE_ARM_SIMD || USE_ARM_NEON */
 
 #if defined(USE_MMX) || defined(USE_SSE2)
 /* The CPU detection code needs to be in a file not compiled with
  * "-mmmx -msse", as gcc would generate CMOV instructions otherwise
  * that would lead to SIGILL instructions on old CPUs that don't have
  * it.
@@ -605,33 +610,41 @@ pixman_have_sse2 (void)
 #define pixman_have_sse2() TRUE
 #endif
 #endif /* __amd64__ */
 #endif
 
 pixman_implementation_t *
 _pixman_choose_implementation (void)
 {
+    pixman_implementation_t *imp;
+
+    imp = _pixman_implementation_create_general();
+    imp = _pixman_implementation_create_fast_path (imp);
+    
+#ifdef USE_MMX
+    if (pixman_have_mmx ())
+	imp = _pixman_implementation_create_mmx (imp);
+#endif
+
 #ifdef USE_SSE2
     if (pixman_have_sse2 ())
-	return _pixman_implementation_create_sse2 ();
+	imp = _pixman_implementation_create_sse2 (imp);
 #endif
-#ifdef USE_MMX
-    if (pixman_have_mmx ())
-	return _pixman_implementation_create_mmx ();
+
+#ifdef USE_ARM_SIMD
+    if (pixman_have_arm_simd ())
+	imp = _pixman_implementation_create_arm_simd (imp);
 #endif
 
 #ifdef USE_ARM_NEON
     if (pixman_have_arm_neon ())
-	return _pixman_implementation_create_arm_neon ();
+	imp = _pixman_implementation_create_arm_neon (imp);
 #endif
-#ifdef USE_ARM_SIMD
-    if (pixman_have_arm_simd ())
-	return _pixman_implementation_create_arm_simd ();
-#endif
+    
 #ifdef USE_VMX
     if (pixman_have_vmx ())
-	return _pixman_implementation_create_vmx ();
+	imp = _pixman_implementation_create_vmx (imp);
 #endif
 
-    return _pixman_implementation_create_fast_path ();
+    return imp;
 }
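
_pixman_choose_implementation no longer returns the first backend whose CPU test passes; it builds a delegation chain instead, starting from the general implementation, wrapping it in the fast-path layer, and stacking each available SIMD backend on top (MMX, then SSE2, ARM SIMD, NEON, VMX). Each _pixman_implementation_create_* call now takes the previous implementation as its fallback, so an operation the top layer cannot handle falls through to the next one instead of being lost. Note also that where the /proc-based ARM detection does not apply, pixman_have_arm_simd and pixman_have_arm_neon are stubbed to FALSE above, so those layers are simply never stacked. A minimal sketch of the delegation pattern, with hypothetical types and names:

    #include <stdbool.h>

    typedef struct impl impl_t;

    struct impl
    {
        bool  (*can_handle) (impl_t *imp, int op); /* capability test */
        impl_t *delegate;                          /* next fallback layer */
    };

    /* Walk down the chain until some layer accepts the operation; the
     * bottom (general) layer must accept everything, so this terminates. */
    static impl_t *
    choose_layer (impl_t *imp, int op)
    {
        while (!imp->can_handle (imp, op))
            imp = imp->delegate;
        return imp;
    }
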
 
--- a/gfx/cairo/libpixman/src/pixman-fast-path.c
+++ b/gfx/cairo/libpixman/src/pixman-fast-path.c
@@ -183,17 +183,17 @@ fast_composite_in_n_8_8 (pixman_implemen
 {
     uint32_t src, srca;
     uint8_t     *dst_line, *dst;
     uint8_t     *mask_line, *mask, m;
     int dst_stride, mask_stride;
     int32_t w;
     uint16_t t;
 
-    src = _pixman_image_get_solid (src_image, dest_image->bits.format);
+    src = _pixman_image_get_solid (imp, src_image, dest_image->bits.format);
 
     srca = src >> 24;
 
     PIXMAN_IMAGE_GET_LINE (dest_image, dest_x, dest_y, uint8_t, dst_stride, dst_line, 1);
     PIXMAN_IMAGE_GET_LINE (mask_image, mask_x, mask_y, uint8_t, mask_stride, mask_line, 1);
 
     if (srca == 0xff)
     {
@@ -307,17 +307,17 @@ fast_composite_over_n_8_8888 (pixman_imp
                               int32_t                  height)
 {
     uint32_t src, srca;
     uint32_t    *dst_line, *dst, d;
     uint8_t     *mask_line, *mask, m;
     int dst_stride, mask_stride;
     int32_t w;
 
-    src = _pixman_image_get_solid (src_image, dst_image->bits.format);
+    src = _pixman_image_get_solid (imp, src_image, dst_image->bits.format);
 
     srca = src >> 24;
     if (src == 0)
 	return;
 
     PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
     PIXMAN_IMAGE_GET_LINE (mask_image, mask_x, mask_y, uint8_t, mask_stride, mask_line, 1);
 
@@ -359,25 +359,24 @@ fast_composite_add_n_8888_8888_ca (pixma
 				   int32_t                  src_y,
 				   int32_t                  mask_x,
 				   int32_t                  mask_y,
 				   int32_t                  dest_x,
 				   int32_t                  dest_y,
 				   int32_t                  width,
 				   int32_t                  height)
 {
-    uint32_t src, srca, s;
+    uint32_t src, s;
     uint32_t    *dst_line, *dst, d;
     uint32_t    *mask_line, *mask, ma;
     int dst_stride, mask_stride;
     int32_t w;
 
-    src = _pixman_image_get_solid (src_image, dst_image->bits.format);
+    src = _pixman_image_get_solid (imp, src_image, dst_image->bits.format);
 
-    srca = src >> 24;
     if (src == 0)
 	return;
 
     PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
     PIXMAN_IMAGE_GET_LINE (mask_image, mask_x, mask_y, uint32_t, mask_stride, mask_line, 1);
 
     while (height--)
     {
@@ -422,17 +421,17 @@ fast_composite_over_n_8888_8888_ca (pixm
                                     int32_t                  height)
 {
     uint32_t src, srca, s;
     uint32_t    *dst_line, *dst, d;
     uint32_t    *mask_line, *mask, ma;
     int dst_stride, mask_stride;
     int32_t w;
 
-    src = _pixman_image_get_solid (src_image, dst_image->bits.format);
+    src = _pixman_image_get_solid (imp, src_image, dst_image->bits.format);
 
     srca = src >> 24;
     if (src == 0)
 	return;
 
     PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
     PIXMAN_IMAGE_GET_LINE (mask_image, mask_x, mask_y, uint32_t, mask_stride, mask_line, 1);
 
@@ -489,17 +488,17 @@ fast_composite_over_n_8_0888 (pixman_imp
 {
     uint32_t src, srca;
     uint8_t     *dst_line, *dst;
     uint32_t d;
     uint8_t     *mask_line, *mask, m;
     int dst_stride, mask_stride;
     int32_t w;
 
-    src = _pixman_image_get_solid (src_image, dst_image->bits.format);
+    src = _pixman_image_get_solid (imp, src_image, dst_image->bits.format);
 
     srca = src >> 24;
     if (src == 0)
 	return;
 
     PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, uint8_t, dst_stride, dst_line, 3);
     PIXMAN_IMAGE_GET_LINE (mask_image, mask_x, mask_y, uint8_t, mask_stride, mask_line, 1);
 
@@ -554,17 +553,17 @@ fast_composite_over_n_8_0565 (pixman_imp
 {
     uint32_t src, srca;
     uint16_t    *dst_line, *dst;
     uint32_t d;
     uint8_t     *mask_line, *mask, m;
     int dst_stride, mask_stride;
     int32_t w;
 
-    src = _pixman_image_get_solid (src_image, dst_image->bits.format);
+    src = _pixman_image_get_solid (imp, src_image, dst_image->bits.format);
 
     srca = src >> 24;
     if (src == 0)
 	return;
 
     PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, uint16_t, dst_stride, dst_line, 1);
     PIXMAN_IMAGE_GET_LINE (mask_image, mask_x, mask_y, uint8_t, mask_stride, mask_line, 1);
 
@@ -621,17 +620,17 @@ fast_composite_over_n_8888_0565_ca (pixm
     uint32_t  src, srca, s;
     uint16_t  src16;
     uint16_t *dst_line, *dst;
     uint32_t  d;
     uint32_t *mask_line, *mask, ma;
     int dst_stride, mask_stride;
     int32_t w;
 
-    src = _pixman_image_get_solid (src_image, dst_image->bits.format);
+    src = _pixman_image_get_solid (imp, src_image, dst_image->bits.format);
 
     srca = src >> 24;
     if (src == 0)
 	return;
 
     src16 = CONVERT_8888_TO_0565 (src);
 
     PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, uint16_t, dst_stride, dst_line, 1);
@@ -1029,17 +1028,17 @@ fast_composite_add_n_8_8 (pixman_impleme
     uint8_t     *mask_line, *mask;
     int dst_stride, mask_stride;
     int32_t w;
     uint32_t src;
     uint8_t sa;
 
     PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, uint8_t, dst_stride, dst_line, 1);
     PIXMAN_IMAGE_GET_LINE (mask_image, mask_x, mask_y, uint8_t, mask_stride, mask_line, 1);
-    src = _pixman_image_get_solid (src_image, dst_image->bits.format);
+    src = _pixman_image_get_solid (imp, src_image, dst_image->bits.format);
     sa = (src >> 24);
 
     while (height--)
     {
 	dst = dst_line;
 	dst_line += dst_stride;
 	mask = mask_line;
 	mask_line += mask_stride;
@@ -1141,17 +1140,17 @@ fast_composite_over_n_1_8888 (pixman_imp
     uint32_t    *mask, *mask_line;
     int          mask_stride, dst_stride;
     uint32_t     bitcache, bitmask;
     int32_t      w;
 
     if (width <= 0)
 	return;
 
-    src = _pixman_image_get_solid (src_image, dst_image->bits.format);
+    src = _pixman_image_get_solid (imp, src_image, dst_image->bits.format);
     srca = src >> 24;
     if (src == 0)
 	return;
 
     PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, uint32_t,
                            dst_stride, dst_line, 1);
     PIXMAN_IMAGE_GET_LINE (mask_image, 0, mask_y, uint32_t,
                            mask_stride, mask_line, 1);
@@ -1235,17 +1234,17 @@ fast_composite_over_n_1_0565 (pixman_imp
     uint32_t     bitcache, bitmask;
     int32_t      w;
     uint32_t     d;
     uint16_t     src565;
 
     if (width <= 0)
 	return;
 
-    src = _pixman_image_get_solid (src_image, dst_image->bits.format);
+    src = _pixman_image_get_solid (imp, src_image, dst_image->bits.format);
     srca = src >> 24;
     if (src == 0)
 	return;
 
     PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, uint16_t,
                            dst_stride, dst_line, 1);
     PIXMAN_IMAGE_GET_LINE (mask_image, 0, mask_y, uint32_t,
                            mask_stride, mask_line, 1);
@@ -1327,19 +1326,23 @@ fast_composite_solid_fill (pixman_implem
                            int32_t                  mask_y,
                            int32_t                  dest_x,
                            int32_t                  dest_y,
                            int32_t                  width,
                            int32_t                  height)
 {
     uint32_t src;
 
-    src = _pixman_image_get_solid (src_image, dst_image->bits.format);
+    src = _pixman_image_get_solid (imp, src_image, dst_image->bits.format);
 
-    if (dst_image->bits.format == PIXMAN_a8)
+    if (dst_image->bits.format == PIXMAN_a1)
+    {
+	src = src >> 31;
+    }
+    else if (dst_image->bits.format == PIXMAN_a8)
     {
 	src = src >> 24;
     }
     else if (dst_image->bits.format == PIXMAN_r5g6b5 ||
              dst_image->bits.format == PIXMAN_b5g6r5)
     {
 	src = CONVERT_8888_TO_0565 (src);
     }
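
Two separate changes meet in this hunk. Throughout the file, _pixman_image_get_solid gains the implementation as its first argument, so fetching the solid color can route through the implementation chain built in pixman-cpu.c. And the new PIXMAN_a1 branch is the "C fast path for a1 fill operation" from the change list: a 1 bpp destination keeps only the top bit of the 32-bit ARGB color, so src >> 31 yields 1 when the alpha byte is 0x80 or more and 0 otherwise (for example, opaque 0xff000000 >> 31 == 1, while 0x7fffffff >> 31 == 0).
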
@@ -1382,42 +1385,43 @@ fast_composite_src_memcpy (pixman_implem
     {
 	memcpy (dst, src, n_bytes);
 
 	dst += dst_stride;
 	src += src_stride;
     }
 }
 
-FAST_NEAREST (8888_8888_cover, 8888, 8888, uint32_t, uint32_t, SRC, COVER);
-FAST_NEAREST (8888_8888_none, 8888, 8888, uint32_t, uint32_t, SRC, NONE);
-FAST_NEAREST (8888_8888_pad, 8888, 8888, uint32_t, uint32_t, SRC, PAD);
-FAST_NEAREST (8888_8888_normal, 8888, 8888, uint32_t, uint32_t, SRC, NORMAL);
-FAST_NEAREST (8888_8888_cover, 8888, 8888, uint32_t, uint32_t, OVER, COVER);
-FAST_NEAREST (8888_8888_none, 8888, 8888, uint32_t, uint32_t, OVER, NONE);
-FAST_NEAREST (8888_8888_pad, 8888, 8888, uint32_t, uint32_t, OVER, PAD);
-FAST_NEAREST (8888_8888_normal, 8888, 8888, uint32_t, uint32_t, OVER, NORMAL);
-FAST_NEAREST (8888_565_cover, 8888, 0565, uint32_t, uint16_t, SRC, COVER);
-FAST_NEAREST (8888_565_none, 8888, 0565, uint32_t, uint16_t, SRC, NONE);
-FAST_NEAREST (8888_565_pad, 8888, 0565, uint32_t, uint16_t, SRC, PAD);
-FAST_NEAREST (8888_565_normal, 8888, 0565, uint32_t, uint16_t, SRC, NORMAL);
-FAST_NEAREST (565_565_normal, 0565, 0565, uint16_t, uint16_t, SRC, NORMAL);
-FAST_NEAREST (8888_565_cover, 8888, 0565, uint32_t, uint16_t, OVER, COVER);
-FAST_NEAREST (8888_565_none, 8888, 0565, uint32_t, uint16_t, OVER, NONE);
-FAST_NEAREST (8888_565_pad, 8888, 0565, uint32_t, uint16_t, OVER, PAD);
-FAST_NEAREST (8888_565_normal, 8888, 0565, uint32_t, uint16_t, OVER, NORMAL);
+FAST_NEAREST (8888_8888_cover, 8888, 8888, uint32_t, uint32_t, SRC, COVER)
+FAST_NEAREST (8888_8888_none, 8888, 8888, uint32_t, uint32_t, SRC, NONE)
+FAST_NEAREST (8888_8888_pad, 8888, 8888, uint32_t, uint32_t, SRC, PAD)
+FAST_NEAREST (8888_8888_normal, 8888, 8888, uint32_t, uint32_t, SRC, NORMAL)
+FAST_NEAREST (8888_8888_cover, 8888, 8888, uint32_t, uint32_t, OVER, COVER)
+FAST_NEAREST (8888_8888_none, 8888, 8888, uint32_t, uint32_t, OVER, NONE)
+FAST_NEAREST (8888_8888_pad, 8888, 8888, uint32_t, uint32_t, OVER, PAD)
+FAST_NEAREST (8888_8888_normal, 8888, 8888, uint32_t, uint32_t, OVER, NORMAL)
+FAST_NEAREST (8888_565_cover, 8888, 0565, uint32_t, uint16_t, SRC, COVER)
+FAST_NEAREST (8888_565_none, 8888, 0565, uint32_t, uint16_t, SRC, NONE)
+FAST_NEAREST (8888_565_pad, 8888, 0565, uint32_t, uint16_t, SRC, PAD)
+FAST_NEAREST (8888_565_normal, 8888, 0565, uint32_t, uint16_t, SRC, NORMAL)
+FAST_NEAREST (565_565_normal, 0565, 0565, uint16_t, uint16_t, SRC, NORMAL)
+FAST_NEAREST (8888_565_cover, 8888, 0565, uint32_t, uint16_t, OVER, COVER)
+FAST_NEAREST (8888_565_none, 8888, 0565, uint32_t, uint16_t, OVER, NONE)
+FAST_NEAREST (8888_565_pad, 8888, 0565, uint32_t, uint16_t, OVER, PAD)
+FAST_NEAREST (8888_565_normal, 8888, 0565, uint32_t, uint16_t, OVER, NORMAL)
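
FAST_NEAREST and (below) FAST_NEAREST_MAINLOOP expand to complete function definitions, so a semicolon after the invocation leaves a stray ';' at file scope, an empty declaration that MSVC and pedantic GCC diagnose; this is the flip side of the no_such_variable idiom removed from pixman-compiler.h above. The scanline helper that follows also gains a const-qualified source pointer and a fully_transparent_src parameter to match the updated main-loop template. A minimal illustration, with a hypothetical macro:

    #define DEFINE_NOP(name) static void name (void) { }

    DEFINE_NOP (nop_a)   /* correct: the definition already ends at '}' */
    DEFINE_NOP (nop_b);  /* stray ';' at file scope: empty declaration */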
 
 /* Use more unrolling for src_0565_0565 because it is typically CPU bound */
 static force_inline void
-scaled_nearest_scanline_565_565_SRC (uint16_t *      dst,
-				     uint16_t *      src,
-				     int32_t         w,
-				     pixman_fixed_t  vx,
-				     pixman_fixed_t  unit_x,
-				     pixman_fixed_t  max_vx)
+scaled_nearest_scanline_565_565_SRC (uint16_t *       dst,
+				     const uint16_t * src,
+				     int32_t          w,
+				     pixman_fixed_t   vx,
+				     pixman_fixed_t   unit_x,
+				     pixman_fixed_t   max_vx,
+				     pixman_bool_t    fully_transparent_src)
 {
     uint16_t tmp1, tmp2, tmp3, tmp4;
     while ((w -= 4) >= 0)
     {
 	tmp1 = src[pixman_fixed_to_int (vx)];
 	vx += unit_x;
 	tmp2 = src[pixman_fixed_to_int (vx)];
 	vx += unit_x;
@@ -1440,23 +1444,23 @@ scaled_nearest_scanline_565_565_SRC (uin
 	*dst++ = tmp2;
     }
     if (w & 1)
 	*dst++ = src[pixman_fixed_to_int (vx)];
 }
 
 FAST_NEAREST_MAINLOOP (565_565_cover_SRC,
 		       scaled_nearest_scanline_565_565_SRC,
-		       uint16_t, uint16_t, COVER);
+		       uint16_t, uint16_t, COVER)
 FAST_NEAREST_MAINLOOP (565_565_none_SRC,
 		       scaled_nearest_scanline_565_565_SRC,
-		       uint16_t, uint16_t, NONE);
+		       uint16_t, uint16_t, NONE)
 FAST_NEAREST_MAINLOOP (565_565_pad_SRC,
 		       scaled_nearest_scanline_565_565_SRC,
-		       uint16_t, uint16_t, PAD);
+		       uint16_t, uint16_t, PAD)
 
 static force_inline uint32_t
 fetch_nearest (pixman_repeat_t src_repeat,
 	       pixman_format_code_t format,
 	       uint32_t *src, int x, int src_width)
 {
     if (repeat (src_repeat, &x, src_width))
     {
@@ -1608,16 +1612,282 @@ fast_composite_scaled_nearest (pixman_im
 		    combine_over (s, dst++);
 		else
 		    combine_src (s, dst++);
 	    }
 	}
     }
 }
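
The FAST_SIMPLE_ROTATE paths added next implement 90 and 270 degree rotation, where one of the two images is necessarily walked against its stride. To keep the destination writes cache friendly, the work is split into TILE_SIZE-wide vertical stripes (TILE_SIZE pixels per 64-byte cache line), with unaligned leading and trailing columns peeled off so the middle stripes start on cache-line boundaries; as the comment in the macro notes, this is an optimistic win that assumes the destination stride is a multiple of the cache line. A sketch of the peeling arithmetic, using the same constant (leading_pixels is a hypothetical helper mirroring the macro-local computation):

    #include <stddef.h>
    #include <stdint.h>

    #define CACHE_LINE_SIZE 64

    /* How many pixels to peel so that 'dst' advances to the next
     * cache-line boundary (0 if it is already aligned). */
    static int
    leading_pixels (const void *dst, size_t pixel_size)
    {
        uintptr_t mis = (uintptr_t) dst & (CACHE_LINE_SIZE - 1);

        if (mis == 0)
            return 0;

        return (int) ((CACHE_LINE_SIZE - mis) / pixel_size);
    }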
 
+#define CACHE_LINE_SIZE 64
+
+#define FAST_SIMPLE_ROTATE(suffix, pix_type)                                  \
+                                                                              \
+static void                                                                   \
+blt_rotated_90_trivial_##suffix (pix_type       *dst,                         \
+				 int             dst_stride,                  \
+				 const pix_type *src,                         \
+				 int             src_stride,                  \
+				 int             w,                           \
+				 int             h)                           \
+{                                                                             \
+    int x, y;                                                                 \
+    for (y = 0; y < h; y++)                                                   \
+    {                                                                         \
+	const pix_type *s = src + (h - y - 1);                                \
+	pix_type *d = dst + dst_stride * y;                                   \
+	for (x = 0; x < w; x++)                                               \
+	{                                                                     \
+	    *d++ = *s;                                                        \
+	    s += src_stride;                                                  \
+	}                                                                     \
+    }                                                                         \
+}                                                                             \
+                                                                              \
+static void                                                                   \
+blt_rotated_270_trivial_##suffix (pix_type       *dst,                        \
+				  int             dst_stride,                 \
+				  const pix_type *src,                        \
+				  int             src_stride,                 \
+				  int             w,                          \
+				  int             h)                          \
+{                                                                             \
+    int x, y;                                                                 \
+    for (y = 0; y < h; y++)                                                   \
+    {                                                                         \
+	const pix_type *s = src + src_stride * (w - 1) + y;                   \
+	pix_type *d = dst + dst_stride * y;                                   \
+	for (x = 0; x < w; x++)                                               \
+	{                                                                     \
+	    *d++ = *s;                                                        \
+	    s -= src_stride;                                                  \
+	}                                                                     \
+    }                                                                         \
+}                                                                             \
+                                                                              \
+static void                                                                   \
+blt_rotated_90_##suffix (pix_type       *dst,                                 \
+			 int             dst_stride,                          \
+			 const pix_type *src,                                 \
+			 int             src_stride,                          \
+			 int             W,                                   \
+			 int             H)                                   \
+{                                                                             \
+    int x;                                                                    \
+    int leading_pixels = 0, trailing_pixels = 0;                              \
+    const int TILE_SIZE = CACHE_LINE_SIZE / sizeof(pix_type);                 \
+                                                                              \
+    /*                                                                        \
+     * split processing into handling destination as TILE_SIZExH cache line   \
+     * aligned vertical stripes (optimistically assuming that destination     \
+     * stride is a multiple of cache line, if not - it will be just a bit     \
+     * slower)                                                                \
+     */                                                                       \
+                                                                              \
+    if ((uintptr_t)dst & (CACHE_LINE_SIZE - 1))                               \
+    {                                                                         \
+	leading_pixels = TILE_SIZE - (((uintptr_t)dst &                       \
+			    (CACHE_LINE_SIZE - 1)) / sizeof(pix_type));       \
+	if (leading_pixels > W)                                               \
+	    leading_pixels = W;                                               \
+                                                                              \
+	/* unaligned leading part NxH (where N < TILE_SIZE) */                \
+	blt_rotated_90_trivial_##suffix (                                     \
+	    dst,                                                              \
+	    dst_stride,                                                       \
+	    src,                                                              \
+	    src_stride,                                                       \
+	    leading_pixels,                                                   \
+	    H);                                                               \
+	                                                                      \
+	dst += leading_pixels;                                                \
+	src += leading_pixels * src_stride;                                   \
+	W -= leading_pixels;                                                  \
+    }                                                                         \
+                                                                              \
+    if ((uintptr_t)(dst + W) & (CACHE_LINE_SIZE - 1))                         \
+    {                                                                         \
+	trailing_pixels = (((uintptr_t)(dst + W) &                            \
+			    (CACHE_LINE_SIZE - 1)) / sizeof(pix_type));       \
+	if (trailing_pixels > W)                                              \
+	    trailing_pixels = W;                                              \
+	W -= trailing_pixels;                                                 \
+    }                                                                         \
+                                                                              \
+    for (x = 0; x < W; x += TILE_SIZE)                                        \
+    {                                                                         \
+	/* aligned middle part TILE_SIZExH */                                 \
+	blt_rotated_90_trivial_##suffix (                                     \
+	    dst + x,                                                          \
+	    dst_stride,                                                       \
+	    src + src_stride * x,                                             \
+	    src_stride,                                                       \
+	    TILE_SIZE,                                                        \
+	    H);                                                               \
+    }                                                                         \
+                                                                              \
+    if (trailing_pixels)                                                      \
+    {                                                                         \
+	/* unaligned trailing part NxH (where N < TILE_SIZE) */               \
+	blt_rotated_90_trivial_##suffix (                                     \
+	    dst + W,                                                          \
+	    dst_stride,                                                       \
+	    src + W * src_stride,                                             \
+	    src_stride,                                                       \
+	    trailing_pixels,                                                  \
+	    H);                                                               \
+    }                                                                         \
+}                                                                             \
+                                                                              \
+static void                                                                   \
+blt_rotated_270_##suffix (pix_type       *dst,                                \
+			  int             dst_stride,                         \
+			  const pix_type *src,                                \
+			  int             src_stride,                         \
+			  int             W,                                  \
+			  int             H)                                  \
+{                                                                             \
+    int x;                                                                    \
+    int leading_pixels = 0, trailing_pixels = 0;                              \
+    const int TILE_SIZE = CACHE_LINE_SIZE / sizeof(pix_type);                 \
+                                                                              \
+    /*                                                                        \
+     * Split the processing of the destination into TILE_SIZExH cache-line    \
+     * aligned vertical stripes (optimistically assuming that the             \
+     * destination stride is a multiple of the cache line size; if it is      \
+     * not, the code is just slightly slower)                                 \
+     */                                                                       \
+                                                                              \
+    if ((uintptr_t)dst & (CACHE_LINE_SIZE - 1))                               \
+    {                                                                         \
+	leading_pixels = TILE_SIZE - (((uintptr_t)dst &                       \
+			    (CACHE_LINE_SIZE - 1)) / sizeof(pix_type));       \
+	if (leading_pixels > W)                                               \
+	    leading_pixels = W;                                               \
+                                                                              \
+	/* unaligned leading part NxH (where N < TILE_SIZE) */                \
+	blt_rotated_270_trivial_##suffix (                                    \
+	    dst,                                                              \
+	    dst_stride,                                                       \
+	    src + src_stride * (W - leading_pixels),                          \
+	    src_stride,                                                       \
+	    leading_pixels,                                                   \
+	    H);                                                               \
+	                                                                      \
+	dst += leading_pixels;                                                \
+	W -= leading_pixels;                                                  \
+    }                                                                         \
+                                                                              \
+    if ((uintptr_t)(dst + W) & (CACHE_LINE_SIZE - 1))                         \
+    {                                                                         \
+	trailing_pixels = (((uintptr_t)(dst + W) &                            \
+			    (CACHE_LINE_SIZE - 1)) / sizeof(pix_type));       \
+	if (trailing_pixels > W)                                              \
+	    trailing_pixels = W;                                              \
+	W -= trailing_pixels;                                                 \
+	src += trailing_pixels * src_stride;                                  \
+    }                                                                         \
+                                                                              \
+    for (x = 0; x < W; x += TILE_SIZE)                                        \
+    {                                                                         \
+	/* aligned middle part TILE_SIZExH */                                 \
+	blt_rotated_270_trivial_##suffix (                                    \
+	    dst + x,                                                          \
+	    dst_stride,                                                       \
+	    src + src_stride * (W - x - TILE_SIZE),                           \
+	    src_stride,                                                       \
+	    TILE_SIZE,                                                        \
+	    H);                                                               \
+    }                                                                         \
+                                                                              \
+    if (trailing_pixels)                                                      \
+    {                                                                         \
+	/* unaligned trailing part NxH (where N < TILE_SIZE) */               \
+	blt_rotated_270_trivial_##suffix (                                    \
+	    dst + W,                                                          \
+	    dst_stride,                                                       \
+	    src - trailing_pixels * src_stride,                               \
+	    src_stride,                                                       \
+	    trailing_pixels,                                                  \
+	    H);                                                               \
+    }                                                                         \
+}                                                                             \
+                                                                              \
+static void                                                                   \
+fast_composite_rotate_90_##suffix (pixman_implementation_t *imp,              \
+				   pixman_op_t              op,               \
+				   pixman_image_t *         src_image,        \
+				   pixman_image_t *         mask_image,       \
+				   pixman_image_t *         dst_image,        \
+				   int32_t                  src_x,            \
+				   int32_t                  src_y,            \
+				   int32_t                  mask_x,           \
+				   int32_t                  mask_y,           \
+				   int32_t                  dest_x,           \
+				   int32_t                  dest_y,           \
+				   int32_t                  width,            \
+				   int32_t                  height)           \
+{                                                                             \
+    pix_type       *dst_line;                                                 \
+    pix_type       *src_line;                                                 \
+    int             dst_stride, src_stride;                                   \
+    int             src_x_t, src_y_t;                                         \
+                                                                              \
+    PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, pix_type,               \
+			   dst_stride, dst_line, 1);                          \
+    src_x_t = -src_y + pixman_fixed_to_int (                                  \
+				src_image->common.transform->matrix[0][2] +   \
+				pixman_fixed_1 / 2 - pixman_fixed_e) - height;\
+    src_y_t = src_x + pixman_fixed_to_int (                                   \
+				src_image->common.transform->matrix[1][2] +   \
+				pixman_fixed_1 / 2 - pixman_fixed_e);         \
+    PIXMAN_IMAGE_GET_LINE (src_image, src_x_t, src_y_t, pix_type,             \
+			   src_stride, src_line, 1);                          \
+    blt_rotated_90_##suffix (dst_line, dst_stride, src_line, src_stride,      \
+			     width, height);                                  \
+}                                                                             \
+                                                                              \
+static void                                                                   \
+fast_composite_rotate_270_##suffix (pixman_implementation_t *imp,             \
+				    pixman_op_t              op,              \
+				    pixman_image_t *         src_image,       \
+				    pixman_image_t *         mask_image,      \
+				    pixman_image_t *         dst_image,       \
+				    int32_t                  src_x,           \
+				    int32_t                  src_y,           \
+				    int32_t                  mask_x,          \
+				    int32_t                  mask_y,          \
+				    int32_t                  dest_x,          \
+				    int32_t                  dest_y,          \
+				    int32_t                  width,           \
+				    int32_t                  height)          \
+{                                                                             \
+    pix_type       *dst_line;                                                 \
+    pix_type       *src_line;                                                 \
+    int             dst_stride, src_stride;                                   \
+    int             src_x_t, src_y_t;                                         \
+                                                                              \
+    PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, pix_type,               \
+			   dst_stride, dst_line, 1);                          \
+    src_x_t = src_y + pixman_fixed_to_int (                                   \
+				src_image->common.transform->matrix[0][2] +   \
+				pixman_fixed_1 / 2 - pixman_fixed_e);         \
+    src_y_t = -src_x + pixman_fixed_to_int (                                  \
+				src_image->common.transform->matrix[1][2] +   \
+				pixman_fixed_1 / 2 - pixman_fixed_e) - width; \
+    PIXMAN_IMAGE_GET_LINE (src_image, src_x_t, src_y_t, pix_type,             \
+			   src_stride, src_line, 1);                          \
+    blt_rotated_270_##suffix (dst_line, dst_stride, src_line, src_stride,     \
+			      width, height);                                 \
+}
+
+FAST_SIMPLE_ROTATE (8, uint8_t)
+FAST_SIMPLE_ROTATE (565, uint16_t)
+FAST_SIMPLE_ROTATE (8888, uint32_t)
+
 static const pixman_fast_path_t c_fast_paths[] =
 {
     PIXMAN_STD_FAST_PATH (OVER, solid, a8, r5g6b5, fast_composite_over_n_8_0565),
     PIXMAN_STD_FAST_PATH (OVER, solid, a8, b5g6r5, fast_composite_over_n_8_0565),
     PIXMAN_STD_FAST_PATH (OVER, solid, a8, r8g8b8, fast_composite_over_n_8_0888),
     PIXMAN_STD_FAST_PATH (OVER, solid, a8, b8g8r8, fast_composite_over_n_8_0888),
     PIXMAN_STD_FAST_PATH (OVER, solid, a8, a8r8g8b8, fast_composite_over_n_8_8888),
     PIXMAN_STD_FAST_PATH (OVER, solid, a8, x8r8g8b8, fast_composite_over_n_8_8888),
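The fast_composite_rotate_90/270 functions defined above only match when the
source transform is an exact 90- or 270-degree rotation with an integer
translation and a nearest filter. A minimal sketch of client code that can hit
them ('src_img' and 'h' are hypothetical; which of the two paths fires depends
on the sign convention of the rotation):

    pixman_transform_t t = { { {  0,              pixman_fixed_1, 0                       },
                               { -pixman_fixed_1, 0,              pixman_int_to_fixed (h) },
                               {  0,              0,              pixman_fixed_1          } } };

    pixman_image_set_transform (src_img, &t);
    pixman_image_set_filter (src_img, PIXMAN_FILTER_NEAREST, NULL, 0);
    /* a subsequent pixman_image_composite (PIXMAN_OP_SRC, ...) can now be
     * dispatched to fast_composite_rotate_90_8888 or _270_8888, provided
     * the other flags (samples covering the clip, etc.) also hold */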
@@ -1650,16 +1920,17 @@ static const pixman_fast_path_t c_fast_p
     PIXMAN_STD_FAST_PATH (ADD, a8, null, a8, fast_composite_add_8_8),
     PIXMAN_STD_FAST_PATH (ADD, a1, null, a1, fast_composite_add_1000_1000),
     PIXMAN_STD_FAST_PATH_CA (ADD, solid, a8r8g8b8, a8r8g8b8, fast_composite_add_n_8888_8888_ca),
     PIXMAN_STD_FAST_PATH (ADD, solid, a8, a8, fast_composite_add_n_8_8),
     PIXMAN_STD_FAST_PATH (SRC, solid, null, a8r8g8b8, fast_composite_solid_fill),
     PIXMAN_STD_FAST_PATH (SRC, solid, null, x8r8g8b8, fast_composite_solid_fill),
     PIXMAN_STD_FAST_PATH (SRC, solid, null, a8b8g8r8, fast_composite_solid_fill),
     PIXMAN_STD_FAST_PATH (SRC, solid, null, x8b8g8r8, fast_composite_solid_fill),
+    PIXMAN_STD_FAST_PATH (SRC, solid, null, a1, fast_composite_solid_fill),
     PIXMAN_STD_FAST_PATH (SRC, solid, null, a8, fast_composite_solid_fill),
     PIXMAN_STD_FAST_PATH (SRC, solid, null, r5g6b5, fast_composite_solid_fill),
     PIXMAN_STD_FAST_PATH (SRC, x8r8g8b8, null, a8r8g8b8, fast_composite_src_x888_8888),
     PIXMAN_STD_FAST_PATH (SRC, x8b8g8r8, null, a8b8g8r8, fast_composite_src_x888_8888),
     PIXMAN_STD_FAST_PATH (SRC, a8r8g8b8, null, x8r8g8b8, fast_composite_src_memcpy),
     PIXMAN_STD_FAST_PATH (SRC, a8r8g8b8, null, a8r8g8b8, fast_composite_src_memcpy),
     PIXMAN_STD_FAST_PATH (SRC, x8r8g8b8, null, x8r8g8b8, fast_composite_src_memcpy),
     PIXMAN_STD_FAST_PATH (SRC, a8b8g8r8, null, x8b8g8r8, fast_composite_src_memcpy),
@@ -1725,19 +1996,121 @@ static const pixman_fast_path_t c_fast_p
     NEAREST_FAST_PATH (OVER, x8b8g8r8, x8b8g8r8),
     NEAREST_FAST_PATH (OVER, a8b8g8r8, x8b8g8r8),
 
     NEAREST_FAST_PATH (OVER, x8r8g8b8, a8r8g8b8),
     NEAREST_FAST_PATH (OVER, a8r8g8b8, a8r8g8b8),
     NEAREST_FAST_PATH (OVER, x8b8g8r8, a8b8g8r8),
     NEAREST_FAST_PATH (OVER, a8b8g8r8, a8b8g8r8),
 
+#define SIMPLE_ROTATE_FLAGS(angle)					  \
+    (FAST_PATH_ROTATE_ ## angle ## _TRANSFORM	|			  \
+     FAST_PATH_NEAREST_FILTER			|			  \
+     FAST_PATH_SAMPLES_COVER_CLIP		|			  \
+     FAST_PATH_STANDARD_FLAGS)
+
+#define SIMPLE_ROTATE_FAST_PATH(op,s,d,suffix)				  \
+    {   PIXMAN_OP_ ## op,						  \
+	PIXMAN_ ## s, SIMPLE_ROTATE_FLAGS (90),				  \
+	PIXMAN_null, 0,							  \
+	PIXMAN_ ## d, FAST_PATH_STD_DEST_FLAGS,				  \
+	fast_composite_rotate_90_##suffix,				  \
+    },									  \
+    {   PIXMAN_OP_ ## op,						  \
+	PIXMAN_ ## s, SIMPLE_ROTATE_FLAGS (270),			  \
+	PIXMAN_null, 0,							  \
+	PIXMAN_ ## d, FAST_PATH_STD_DEST_FLAGS,				  \
+	fast_composite_rotate_270_##suffix,				  \
+    }
+
+    SIMPLE_ROTATE_FAST_PATH (SRC, a8r8g8b8, a8r8g8b8, 8888),
+    SIMPLE_ROTATE_FAST_PATH (SRC, a8r8g8b8, x8r8g8b8, 8888),
+    SIMPLE_ROTATE_FAST_PATH (SRC, x8r8g8b8, x8r8g8b8, 8888),
+    SIMPLE_ROTATE_FAST_PATH (SRC, r5g6b5, r5g6b5, 565),
+    SIMPLE_ROTATE_FAST_PATH (SRC, a8, a8, 8),
+
     {   PIXMAN_OP_NONE	},
 };
 
+#ifdef WORDS_BIGENDIAN
+#define A1_FILL_MASK(n, offs) (((1 << (n)) - 1) << (32 - (offs) - (n)))
+#else
+#define A1_FILL_MASK(n, offs) (((1 << (n)) - 1) << (offs))
+#endif
+
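+/*
+ * Worked example: pixman_fill1_line (dst, 13, 40, 1) first ORs in the
+ * 19 bit mask A1_FILL_MASK (19, 13) to finish the initial uint32_t,
+ * leaving width = 21; no full 32 bit words remain, so the last 21 bits
+ * are set in the next word with A1_FILL_MASK (21, 0).
+ */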
+static force_inline void
+pixman_fill1_line (uint32_t *dst, int offs, int width, int v)
+{
+    if (offs)
+    {
+	int leading_pixels = 32 - offs;
+	if (leading_pixels >= width)
+	{
+	    if (v)
+		*dst |= A1_FILL_MASK (width, offs);
+	    else
+		*dst &= ~A1_FILL_MASK (width, offs);
+	    return;
+	}
+	else
+	{
+	    if (v)
+		*dst++ |= A1_FILL_MASK (leading_pixels, offs);
+	    else
+		*dst++ &= ~A1_FILL_MASK (leading_pixels, offs);
+	    width -= leading_pixels;
+	}
+    }
+    while (width >= 32)
+    {
+	if (v)
+	    *dst++ = 0xFFFFFFFF;
+	else
+	    *dst++ = 0;
+	width -= 32;
+    }
+    if (width > 0)
+    {
+	if (v)
+	    *dst |= A1_FILL_MASK (width, 0);
+	else
+	    *dst &= ~A1_FILL_MASK (width, 0);
+    }
+}
+
+static void
+pixman_fill1 (uint32_t *bits,
+              int       stride,
+              int       x,
+              int       y,
+              int       width,
+              int       height,
+              uint32_t  xor)
+{
+    uint32_t *dst = bits + y * stride + (x >> 5);
+    int offs = x & 31;
+
+    if (xor & 1)
+    {
+	while (height--)
+	{
+	    pixman_fill1_line (dst, offs, width, 1);
+	    dst += stride;
+	}
+    }
+    else
+    {
+	while (height--)
+	{
+	    pixman_fill1_line (dst, offs, width, 0);
+	    dst += stride;
+	}
+    }
+}
+
 static void
 pixman_fill8 (uint32_t *bits,
               int       stride,
               int       x,
               int       y,
               int       width,
               int       height,
               uint32_t xor)
@@ -1814,16 +2187,20 @@ fast_path_fill (pixman_implementation_t 
                 int                      x,
                 int                      y,
                 int                      width,
                 int                      height,
                 uint32_t		 xor)
 {
     switch (bpp)
     {
+    case 1:
+	pixman_fill1 (bits, stride, x, y, width, height, xor);
+	break;
+
     case 8:
 	pixman_fill8 (bits, stride, x, y, width, height, xor);
 	break;
 
     case 16:
 	pixman_fill16 (bits, stride, x, y, width, height, xor);
 	break;
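With the new 1 bpp case above, a fill through the public API is now handled by
the C fast path. A hypothetical call (stride is in uint32_t units):

    uint32_t bits[8 * 16] = { 0 };   /* a 256x16 a1 bitmap, stride 8 */

    /* set a 40x16 rectangle of bits starting at x = 13, y = 0; this is
     * routed through fast_path_fill () to pixman_fill1 () */
    pixman_fill (bits, 8, 1, 13, 0, 40, 16, 1);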
 
@@ -1836,17 +2213,16 @@ fast_path_fill (pixman_implementation_t 
 	    imp->delegate, bits, stride, bpp, x, y, width, height, xor);
 	break;
     }
 
     return TRUE;
 }
 
 pixman_implementation_t *
-_pixman_implementation_create_fast_path (void)
+_pixman_implementation_create_fast_path (pixman_implementation_t *fallback)
 {
-    pixman_implementation_t *general = _pixman_implementation_create_general ();
-    pixman_implementation_t *imp = _pixman_implementation_create (general, c_fast_paths);
+    pixman_implementation_t *imp = _pixman_implementation_create (fallback, c_fast_paths);
 
     imp->fill = fast_path_fill;
 
     return imp;
 }
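The fast path implementation no longer creates its own general fallback; the
caller now assembles the delegation chain. A sketch of that assembly (the real
chain is built in pixman-cpu.c):

    pixman_implementation_t *
    _pixman_choose_implementation (void)
    {
        pixman_implementation_t *imp;

        imp = _pixman_implementation_create_general ();
        imp = _pixman_implementation_create_fast_path (imp);

        /* CPU specific implementations (MMX, SSE2, NEON, ...) are layered
         * on top the same way, each taking the previous one as fallback */

        return imp;
    }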
--- a/gfx/cairo/libpixman/src/pixman-fast-path.h
+++ b/gfx/cairo/libpixman/src/pixman-fast-path.h
@@ -133,28 +133,32 @@ pad_repeat_get_scanline_bounds (int32_t 
 #define GET_8888_ALPHA(s) ((s) >> 24)
  /* This is not actually used since we don't have an OVER with
     565 source, but it is needed to build. */
 #define GET_0565_ALPHA(s) 0xff
 
 #define FAST_NEAREST_SCANLINE(scanline_func_name, SRC_FORMAT, DST_FORMAT,			\
 			      src_type_t, dst_type_t, OP, repeat_mode)				\
 static force_inline void									\
-scanline_func_name (dst_type_t     *dst,							\
-		    src_type_t     *src,							\
-		    int32_t         w,								\
-		    pixman_fixed_t  vx,								\
-		    pixman_fixed_t  unit_x,							\
-		    pixman_fixed_t  max_vx)							\
+scanline_func_name (dst_type_t       *dst,							\
+		    const src_type_t *src,							\
+		    int32_t           w,							\
+		    pixman_fixed_t    vx,							\
+		    pixman_fixed_t    unit_x,							\
+		    pixman_fixed_t    max_vx,							\
+		    pixman_bool_t     fully_transparent_src)					\
 {												\
 	uint32_t   d;										\
 	src_type_t s1, s2;									\
 	uint8_t    a1, a2;									\
 	int        x1, x2;									\
 												\
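+	/* OVER leaves dst unchanged when the source is fully transparent	\
+	 * (dst = src + (1 - src.alpha) * dst with src == 0), so the whole	\
+	 * scanline can be skipped */							\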
+	if (PIXMAN_OP_ ## OP == PIXMAN_OP_OVER && fully_transparent_src)			\
+	    return;										\
+												\
 	if (PIXMAN_OP_ ## OP != PIXMAN_OP_SRC && PIXMAN_OP_ ## OP != PIXMAN_OP_OVER)		\
 	    abort();										\
 												\
 	while ((w -= 2) >= 0)									\
 	{											\
 	    x1 = vx >> 16;									\
 	    vx += unit_x;									\
 	    if (PIXMAN_REPEAT_ ## repeat_mode == PIXMAN_REPEAT_NORMAL)				\
@@ -240,48 +244,59 @@ scanline_func_name (dst_type_t     *dst,
 	    }											\
 	    else /* PIXMAN_OP_SRC */								\
 	    {											\
 		*dst++ = CONVERT_ ## SRC_FORMAT ## _TO_ ## DST_FORMAT (s1);			\
 	    }											\
 	}											\
 }
 
-#define FAST_NEAREST_MAINLOOP(scale_func_name, scanline_func, src_type_t, dst_type_t,		\
-			      repeat_mode)							\
+#define FAST_NEAREST_MAINLOOP_INT(scale_func_name, scanline_func, src_type_t, mask_type_t,	\
+				  dst_type_t, repeat_mode, have_mask, mask_is_solid)		\
 static void											\
-fast_composite_scaled_nearest_ ## scale_func_name (pixman_implementation_t *imp,		\
+fast_composite_scaled_nearest  ## scale_func_name (pixman_implementation_t *imp,		\
 						   pixman_op_t              op,			\
 						   pixman_image_t *         src_image,		\
 						   pixman_image_t *         mask_image,		\
 						   pixman_image_t *         dst_image,		\
 						   int32_t                  src_x,		\
 						   int32_t                  src_y,		\
 						   int32_t                  mask_x,		\
 						   int32_t                  mask_y,		\
 						   int32_t                  dst_x,		\
 						   int32_t                  dst_y,		\
 						   int32_t                  width,		\
 						   int32_t                  height)		\
 {												\
     dst_type_t *dst_line;									\
+    mask_type_t *mask_line;									\
     src_type_t *src_first_line;									\
     int       y;										\
-    pixman_fixed_t max_vx = max_vx; /* suppress uninitialized variable warning */		\
+    pixman_fixed_t max_vx = INT32_MAX; /* suppress uninitialized variable warning */		\
     pixman_fixed_t max_vy;									\
     pixman_vector_t v;										\
     pixman_fixed_t vx, vy;									\
     pixman_fixed_t unit_x, unit_y;								\
     int32_t left_pad, right_pad;								\
 												\
     src_type_t *src;										\
     dst_type_t *dst;										\
-    int       src_stride, dst_stride;								\
+    mask_type_t solid_mask;									\
+    const mask_type_t *mask = &solid_mask;							\
+    int src_stride, mask_stride, dst_stride;							\
 												\
     PIXMAN_IMAGE_GET_LINE (dst_image, dst_x, dst_y, dst_type_t, dst_stride, dst_line, 1);	\
+    if (have_mask)										\
+    {												\
+	if (mask_is_solid)									\
+	    solid_mask = _pixman_image_get_solid (imp, mask_image, dst_image->bits.format);	\
+	else											\
+	    PIXMAN_IMAGE_GET_LINE (mask_image, mask_x, mask_y, mask_type_t,			\
+				   mask_stride, mask_line, 1);					\
+    }												\
     /* pass in 0 instead of src_x and src_y because src_x and src_y need to be			\
      * transformed from destination space to source space */					\
     PIXMAN_IMAGE_GET_LINE (src_image, 0, 0, src_type_t, src_stride, src_first_line, 1);		\
 												\
     /* reference point is the center of the pixel */						\
     v.vector[0] = pixman_int_to_fixed (src_x) + pixman_fixed_1 / 2;				\
     v.vector[1] = pixman_int_to_fixed (src_y) + pixman_fixed_1 / 2;				\
     v.vector[2] = pixman_fixed_1;								\
@@ -316,79 +331,115 @@ fast_composite_scaled_nearest_ ## scale_
 					&width, &left_pad, &right_pad);				\
 	vx += left_pad * unit_x;								\
     }												\
 												\
     while (--height >= 0)									\
     {												\
 	dst = dst_line;										\
 	dst_line += dst_stride;									\
+	if (have_mask && !mask_is_solid)							\
+	{											\
+	    mask = mask_line;									\
+	    mask_line += mask_stride;								\
+	}											\
 												\
 	y = vy >> 16;										\
 	vy += unit_y;										\
 	if (PIXMAN_REPEAT_ ## repeat_mode == PIXMAN_REPEAT_NORMAL)				\
 	    repeat (PIXMAN_REPEAT_NORMAL, &vy, max_vy);						\
 	if (PIXMAN_REPEAT_ ## repeat_mode == PIXMAN_REPEAT_PAD)					\
 	{											\
 	    repeat (PIXMAN_REPEAT_PAD, &y, src_image->bits.height);				\
 	    src = src_first_line + src_stride * y;						\
 	    if (left_pad > 0)									\
 	    {											\
-		scanline_func (dst, src, left_pad, 0, 0, 0);					\
+		scanline_func (mask, dst, src, left_pad, 0, 0, 0, FALSE);			\
 	    }											\
 	    if (width > 0)									\
 	    {											\
-		scanline_func (dst + left_pad, src, width, vx, unit_x, 0);			\
+		scanline_func (mask + (mask_is_solid ? 0 : left_pad),				\
+			       dst + left_pad, src, width, vx, unit_x, 0, FALSE);		\
 	    }											\
 	    if (right_pad > 0)									\
 	    {											\
-		scanline_func (dst + left_pad + width, src + src_image->bits.width - 1,		\
-			        right_pad, 0, 0, 0);						\
+		scanline_func (mask + (mask_is_solid ? 0 : left_pad + width),			\
+			       dst + left_pad + width, src + src_image->bits.width - 1,		\
+			       right_pad, 0, 0, 0, FALSE);					\
 	    }											\
 	}											\
 	else if (PIXMAN_REPEAT_ ## repeat_mode == PIXMAN_REPEAT_NONE)				\
 	{											\
-	    static src_type_t zero = 0;								\
+	    static const src_type_t zero[1] = { 0 };						\
 	    if (y < 0 || y >= src_image->bits.height)						\
 	    {											\
-		scanline_func (dst, &zero, left_pad + width + right_pad, 0, 0, 0);		\
+		scanline_func (mask, dst, zero, left_pad + width + right_pad, 0, 0, 0, TRUE);	\
 		continue;									\
 	    }											\
 	    src = src_first_line + src_stride * y;						\
 	    if (left_pad > 0)									\
 	    {											\
-		scanline_func (dst, &zero, left_pad, 0, 0, 0);					\
+		scanline_func (mask, dst, zero, left_pad, 0, 0, 0, TRUE);			\
 	    }											\
 	    if (width > 0)									\
 	    {											\
-		scanline_func (dst + left_pad, src, width, vx, unit_x, 0);			\
+		scanline_func (mask + (mask_is_solid ? 0 : left_pad),				\
+			       dst + left_pad, src, width, vx, unit_x, 0, FALSE);		\
 	    }											\
 	    if (right_pad > 0)									\
 	    {											\
-		scanline_func (dst + left_pad + width, &zero, right_pad, 0, 0, 0);		\
+		scanline_func (mask + (mask_is_solid ? 0 : left_pad + width),			\
+			       dst + left_pad + width, zero, right_pad, 0, 0, 0, TRUE);		\
 	    }											\
 	}											\
 	else											\
 	{											\
 	    src = src_first_line + src_stride * y;						\
-	    scanline_func (dst, src, width, vx, unit_x, max_vx);				\
+	    scanline_func (mask, dst, src, width, vx, unit_x, max_vx, FALSE);			\
 	}											\
     }												\
 }
 
+/* A workaround for old Sun Studio, see: https://bugs.freedesktop.org/show_bug.cgi?id=32764 */
+#define FAST_NEAREST_MAINLOOP_COMMON(scale_func_name, scanline_func, src_type_t, mask_type_t,	\
+				  dst_type_t, repeat_mode, have_mask, mask_is_solid)		\
+	FAST_NEAREST_MAINLOOP_INT(_ ## scale_func_name, scanline_func, src_type_t, mask_type_t,	\
+				  dst_type_t, repeat_mode, have_mask, mask_is_solid)
+
+#define FAST_NEAREST_MAINLOOP_NOMASK(scale_func_name, scanline_func, src_type_t, dst_type_t,	\
+			      repeat_mode)							\
+    static force_inline void									\
+    scanline_func##scale_func_name##_wrapper (							\
+		    const uint8_t    *mask,							\
+		    dst_type_t       *dst,							\
+		    const src_type_t *src,							\
+		    int32_t          w,								\
+		    pixman_fixed_t   vx,							\
+		    pixman_fixed_t   unit_x,							\
+		    pixman_fixed_t   max_vx,							\
+		    pixman_bool_t    fully_transparent_src)					\
+    {												\
+	scanline_func (dst, src, w, vx, unit_x, max_vx, fully_transparent_src);			\
+    }												\
+    FAST_NEAREST_MAINLOOP_INT (scale_func_name, scanline_func##scale_func_name##_wrapper,	\
+			       src_type_t, uint8_t, dst_type_t, repeat_mode, FALSE, FALSE)
+
+#define FAST_NEAREST_MAINLOOP(scale_func_name, scanline_func, src_type_t, dst_type_t,		\
+			      repeat_mode)							\
+	FAST_NEAREST_MAINLOOP_NOMASK(_ ## scale_func_name, scanline_func, src_type_t,		\
+			      dst_type_t, repeat_mode)
+
 #define FAST_NEAREST(scale_func_name, SRC_FORMAT, DST_FORMAT,				\
 		     src_type_t, dst_type_t, OP, repeat_mode)				\
     FAST_NEAREST_SCANLINE(scaled_nearest_scanline_ ## scale_func_name ## _ ## OP,	\
 			  SRC_FORMAT, DST_FORMAT, src_type_t, dst_type_t,		\
 			  OP, repeat_mode)						\
-    FAST_NEAREST_MAINLOOP(scale_func_name##_##OP,					\
+    FAST_NEAREST_MAINLOOP_NOMASK(_ ## scale_func_name ## _ ## OP,			\
 			  scaled_nearest_scanline_ ## scale_func_name ## _ ## OP,	\
-			  src_type_t, dst_type_t, repeat_mode)				\
-											\
-    extern int no_such_variable
+			  src_type_t, dst_type_t, repeat_mode)
 
 
 #define SCALED_NEAREST_FLAGS						\
     (FAST_PATH_SCALE_TRANSFORM	|					\
      FAST_PATH_NO_ALPHA_MAP	|					\
      FAST_PATH_NEAREST_FILTER	|					\
      FAST_PATH_NO_ACCESSORS	|					\
      FAST_PATH_NARROW_FORMAT)
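For reference, pixman-fast-path.c instantiates these templates roughly as
follows (a sketch; the real file covers many format/operator combinations):

    FAST_NEAREST (8888_8888_cover,  8888, 8888, uint32_t, uint32_t, SRC, COVER)
    FAST_NEAREST (8888_8888_none,   8888, 8888, uint32_t, uint32_t, SRC, NONE)
    FAST_NEAREST (8888_8888_pad,    8888, 8888, uint32_t, uint32_t, SRC, PAD)
    FAST_NEAREST (8888_8888_normal, 8888, 8888, uint32_t, uint32_t, SRC, NORMAL)

and then registers all four repeat variants in the fast path table with a
single entry macro:

    SIMPLE_NEAREST_FAST_PATH (SRC, a8r8g8b8, x8r8g8b8, 8888_8888),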
@@ -430,16 +481,542 @@ fast_composite_scaled_nearest_ ## scale_
     {   PIXMAN_OP_ ## op,						\
 	PIXMAN_ ## s,							\
 	SCALED_NEAREST_FLAGS | FAST_PATH_SAMPLES_COVER_CLIP,		\
 	PIXMAN_null, 0,							\
 	PIXMAN_ ## d, FAST_PATH_STD_DEST_FLAGS,				\
 	fast_composite_scaled_nearest_ ## func ## _cover ## _ ## op,	\
     }
 
+#define SIMPLE_NEAREST_A8_MASK_FAST_PATH_NORMAL(op,s,d,func)		\
+    {   PIXMAN_OP_ ## op,						\
+	PIXMAN_ ## s,							\
+	(SCALED_NEAREST_FLAGS		|				\
+	 FAST_PATH_NORMAL_REPEAT	|				\
+	 FAST_PATH_X_UNIT_POSITIVE),					\
+	PIXMAN_a8, MASK_FLAGS (a8, FAST_PATH_UNIFIED_ALPHA),		\
+	PIXMAN_ ## d, FAST_PATH_STD_DEST_FLAGS,				\
+	fast_composite_scaled_nearest_ ## func ## _normal ## _ ## op,	\
+    }
+
+#define SIMPLE_NEAREST_A8_MASK_FAST_PATH_PAD(op,s,d,func)		\
+    {   PIXMAN_OP_ ## op,						\
+	PIXMAN_ ## s,							\
+	(SCALED_NEAREST_FLAGS		|				\
+	 FAST_PATH_PAD_REPEAT		|				\
+	 FAST_PATH_X_UNIT_POSITIVE),					\
+	PIXMAN_a8, MASK_FLAGS (a8, FAST_PATH_UNIFIED_ALPHA),		\
+	PIXMAN_ ## d, FAST_PATH_STD_DEST_FLAGS,				\
+	fast_composite_scaled_nearest_ ## func ## _pad ## _ ## op,	\
+    }
+
+#define SIMPLE_NEAREST_A8_MASK_FAST_PATH_NONE(op,s,d,func)		\
+    {   PIXMAN_OP_ ## op,						\
+	PIXMAN_ ## s,							\
+	(SCALED_NEAREST_FLAGS		|				\
+	 FAST_PATH_NONE_REPEAT		|				\
+	 FAST_PATH_X_UNIT_POSITIVE),					\
+	PIXMAN_a8, MASK_FLAGS (a8, FAST_PATH_UNIFIED_ALPHA),		\
+	PIXMAN_ ## d, FAST_PATH_STD_DEST_FLAGS,				\
+	fast_composite_scaled_nearest_ ## func ## _none ## _ ## op,	\
+    }
+
+#define SIMPLE_NEAREST_A8_MASK_FAST_PATH_COVER(op,s,d,func)		\
+    {   PIXMAN_OP_ ## op,						\
+	PIXMAN_ ## s,							\
+	SCALED_NEAREST_FLAGS | FAST_PATH_SAMPLES_COVER_CLIP,		\
+	PIXMAN_a8, MASK_FLAGS (a8, FAST_PATH_UNIFIED_ALPHA),		\
+	PIXMAN_ ## d, FAST_PATH_STD_DEST_FLAGS,				\
+	fast_composite_scaled_nearest_ ## func ## _cover ## _ ## op,	\
+    }
+
+#define SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_NORMAL(op,s,d,func)		\
+    {   PIXMAN_OP_ ## op,						\
+	PIXMAN_ ## s,							\
+	(SCALED_NEAREST_FLAGS		|				\
+	 FAST_PATH_NORMAL_REPEAT	|				\
+	 FAST_PATH_X_UNIT_POSITIVE),					\
+	PIXMAN_solid, MASK_FLAGS (solid, FAST_PATH_UNIFIED_ALPHA),	\
+	PIXMAN_ ## d, FAST_PATH_STD_DEST_FLAGS,				\
+	fast_composite_scaled_nearest_ ## func ## _normal ## _ ## op,	\
+    }
+
+#define SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_PAD(op,s,d,func)		\
+    {   PIXMAN_OP_ ## op,						\
+	PIXMAN_ ## s,							\
+	(SCALED_NEAREST_FLAGS		|				\
+	 FAST_PATH_PAD_REPEAT		|				\
+	 FAST_PATH_X_UNIT_POSITIVE),					\
+	PIXMAN_solid, MASK_FLAGS (solid, FAST_PATH_UNIFIED_ALPHA),	\
+	PIXMAN_ ## d, FAST_PATH_STD_DEST_FLAGS,				\
+	fast_composite_scaled_nearest_ ## func ## _pad ## _ ## op,	\
+    }
+
+#define SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_NONE(op,s,d,func)		\
+    {   PIXMAN_OP_ ## op,						\
+	PIXMAN_ ## s,							\
+	(SCALED_NEAREST_FLAGS		|				\
+	 FAST_PATH_NONE_REPEAT		|				\
+	 FAST_PATH_X_UNIT_POSITIVE),					\
+	PIXMAN_solid, MASK_FLAGS (solid, FAST_PATH_UNIFIED_ALPHA),	\
+	PIXMAN_ ## d, FAST_PATH_STD_DEST_FLAGS,				\
+	fast_composite_scaled_nearest_ ## func ## _none ## _ ## op,	\
+    }
+
+#define SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_COVER(op,s,d,func)		\
+    {   PIXMAN_OP_ ## op,						\
+	PIXMAN_ ## s,							\
+	SCALED_NEAREST_FLAGS | FAST_PATH_SAMPLES_COVER_CLIP,		\
+	PIXMAN_solid, MASK_FLAGS (solid, FAST_PATH_UNIFIED_ALPHA),	\
+	PIXMAN_ ## d, FAST_PATH_STD_DEST_FLAGS,				\
+	fast_composite_scaled_nearest_ ## func ## _cover ## _ ## op,	\
+    }
+
 /* Prefer the 'cover' variant because it is faster */
 #define SIMPLE_NEAREST_FAST_PATH(op,s,d,func)				\
     SIMPLE_NEAREST_FAST_PATH_COVER (op,s,d,func),			\
     SIMPLE_NEAREST_FAST_PATH_NONE (op,s,d,func),			\
     SIMPLE_NEAREST_FAST_PATH_PAD (op,s,d,func),				\
     SIMPLE_NEAREST_FAST_PATH_NORMAL (op,s,d,func)
 
+#define SIMPLE_NEAREST_A8_MASK_FAST_PATH(op,s,d,func)			\
+    SIMPLE_NEAREST_A8_MASK_FAST_PATH_COVER (op,s,d,func),		\
+    SIMPLE_NEAREST_A8_MASK_FAST_PATH_NONE (op,s,d,func),		\
+    SIMPLE_NEAREST_A8_MASK_FAST_PATH_PAD (op,s,d,func)
+
+#define SIMPLE_NEAREST_SOLID_MASK_FAST_PATH(op,s,d,func)		\
+    SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_COVER (op,s,d,func),		\
+    SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_NONE (op,s,d,func),		\
+    SIMPLE_NEAREST_SOLID_MASK_FAST_PATH_PAD (op,s,d,func)
+
+/*****************************************************************************/
+
+/*
+ * Identify 5 zones in each scanline for bilinear scaling: depending on
+ * whether the 2 pixels to be interpolated are fetched from the image
+ * itself, from the padding area around it, or from both.
+ */
+static force_inline void
+bilinear_pad_repeat_get_scanline_bounds (int32_t         source_image_width,
+					 pixman_fixed_t  vx,
+					 pixman_fixed_t  unit_x,
+					 int32_t *       left_pad,
+					 int32_t *       left_tz,
+					 int32_t *       width,
+					 int32_t *       right_tz,
+					 int32_t *       right_pad)
+{
+	int width1 = *width, left_pad1, right_pad1;
+	int width2 = *width, left_pad2, right_pad2;
+
+	pad_repeat_get_scanline_bounds (source_image_width, vx, unit_x,
+					&width1, &left_pad1, &right_pad1);
+	pad_repeat_get_scanline_bounds (source_image_width, vx + pixman_fixed_1,
+					unit_x, &width2, &left_pad2, &right_pad2);
+
+	*left_pad = left_pad2;
+	*left_tz = left_pad1 - left_pad2;
+	*right_tz = right_pad2 - right_pad1;
+	*right_pad = right_pad1;
+	*width -= *left_pad + *left_tz + *right_tz + *right_pad;
+}
+
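+/*
+ * Worked example (for illustration): with source_image_width = 8,
+ * unit_x = pixman_fixed_1 and vx = -5 * pixman_fixed_1 / 2 for a 16
+ * pixel wide scanline, the zones come out as left_pad = 2, left_tz = 1,
+ * width = 7, right_tz = 1, right_pad = 5: the first two destination
+ * pixels interpolate padding only, the third mixes padding with src[0],
+ * the middle seven sample the image only, and the right edge mirrors
+ * the left.
+ */
+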
+/*
+ * Main loop template for single-pass bilinear scaling. It must be provided
+ * with a 'scanline_func' that performs the compositing operation. The
+ * required function has the following prototype:
+ *
+ *	scanline_func (dst_type_t *       dst,
+ *		       const mask_type_t * mask,
+ *		       const src_type_t * src_top,
+ *		       const src_type_t * src_bottom,
+ *		       int32_t            width,
+ *		       int                weight_top,
+ *		       int                weight_bottom,
+ *		       pixman_fixed_t     vx,
+ *		       pixman_fixed_t     unit_x,
+ *		       pixman_fixed_t     max_vx,
+ *		       pixman_bool_t      zero_src)
+ *
+ * Where:
+ *  dst                 - destination scanline buffer for storing results
+ *  mask                - mask buffer (or single value for solid mask)
+ *  src_top, src_bottom - two source scanlines
+ *  width               - number of pixels to process
+ *  weight_top          - weight of the top row for interpolation
+ *  weight_bottom       - weight of the bottom row for interpolation
+ *  vx                  - initial position for fetching the first pair of
+ *                        pixels from the source buffer
+ *  unit_x              - position increment needed to move to the next pair
+ *                        of pixels
+ *  max_vx              - image size as a fixed point value, can be used for
+ *                        implementing NORMAL repeat (when it is supported)
+ *  zero_src            - boolean hint variable, which is set to TRUE when
+ *                        all source pixels are fetched from zero padding
+ *                        zone for NONE repeat
+ *
+ * Note: normally the sum of 'weight_top' and 'weight_bottom' is 256, but
+ *       it may be less than that for NONE repeat when handling fuzzy
+ *       antialiased top or bottom image edges. Both weight variables are
+ *       guaranteed to be in the 0-255 range, so each fits into an unsigned
+ *       byte and can be used with 8-bit SIMD multiplication instructions.
+ */
+#define FAST_BILINEAR_MAINLOOP_INT(scale_func_name, scanline_func, src_type_t, mask_type_t,	\
+				  dst_type_t, repeat_mode, have_mask, mask_is_solid)		\
+static void											\
+fast_composite_scaled_bilinear ## scale_func_name (pixman_implementation_t *imp,		\
+						   pixman_op_t              op,			\
+						   pixman_image_t *         src_image,		\
+						   pixman_image_t *         mask_image,		\
+						   pixman_image_t *         dst_image,		\
+						   int32_t                  src_x,		\
+						   int32_t                  src_y,		\
+						   int32_t                  mask_x,		\
+						   int32_t                  mask_y,		\
+						   int32_t                  dst_x,		\
+						   int32_t                  dst_y,		\
+						   int32_t                  width,		\
+						   int32_t                  height)		\
+{												\
+    dst_type_t *dst_line;									\
+    mask_type_t *mask_line;									\
+    src_type_t *src_first_line;									\
+    int       y1, y2;										\
+    pixman_fixed_t max_vx = INT32_MAX; /* suppress uninitialized variable warning */		\
+    pixman_vector_t v;										\
+    pixman_fixed_t vx, vy;									\
+    pixman_fixed_t unit_x, unit_y;								\
+    int32_t left_pad, left_tz, right_tz, right_pad;						\
+												\
+    dst_type_t *dst;										\
+    mask_type_t solid_mask;									\
+    const mask_type_t *mask = &solid_mask;							\
+    int src_stride, mask_stride, dst_stride;							\
+												\
+    PIXMAN_IMAGE_GET_LINE (dst_image, dst_x, dst_y, dst_type_t, dst_stride, dst_line, 1);	\
+    if (have_mask)										\
+    {												\
+	if (mask_is_solid)									\
+	{											\
+	    solid_mask = _pixman_image_get_solid (imp, mask_image, dst_image->bits.format);	\
+	    mask_stride = 0;									\
+	}											\
+	else											\
+	{											\
+	    PIXMAN_IMAGE_GET_LINE (mask_image, mask_x, mask_y, mask_type_t,			\
+				   mask_stride, mask_line, 1);					\
+	}											\
+    }												\
+    /* pass in 0 instead of src_x and src_y because src_x and src_y need to be			\
+     * transformed from destination space to source space */					\
+    PIXMAN_IMAGE_GET_LINE (src_image, 0, 0, src_type_t, src_stride, src_first_line, 1);		\
+												\
+    /* reference point is the center of the pixel */						\
+    v.vector[0] = pixman_int_to_fixed (src_x) + pixman_fixed_1 / 2;				\
+    v.vector[1] = pixman_int_to_fixed (src_y) + pixman_fixed_1 / 2;				\
+    v.vector[2] = pixman_fixed_1;								\
+												\
+    if (!pixman_transform_point_3d (src_image->common.transform, &v))				\
+	return;											\
+												\
+    unit_x = src_image->common.transform->matrix[0][0];						\
+    unit_y = src_image->common.transform->matrix[1][1];						\
+												\
+    v.vector[0] -= pixman_fixed_1 / 2;								\
+    v.vector[1] -= pixman_fixed_1 / 2;								\
+												\
+    vy = v.vector[1];										\
+												\
+    if (PIXMAN_REPEAT_ ## repeat_mode == PIXMAN_REPEAT_PAD ||					\
+	PIXMAN_REPEAT_ ## repeat_mode == PIXMAN_REPEAT_NONE)					\
+    {												\
+	bilinear_pad_repeat_get_scanline_bounds (src_image->bits.width, v.vector[0], unit_x,	\
+					&left_pad, &left_tz, &width, &right_tz, &right_pad);	\
+	if (PIXMAN_REPEAT_ ## repeat_mode == PIXMAN_REPEAT_PAD)					\
+	{											\
+	    /* PAD repeat does not need special handling for 'transition zones' and */		\
+	    /* they can be combined with 'padding zones' safely */				\
+	    left_pad += left_tz;								\
+	    right_pad += right_tz;								\
+	    left_tz = right_tz = 0;								\
+	}											\
+	v.vector[0] += left_pad * unit_x;							\
+    }												\
+												\
+    while (--height >= 0)									\
+    {												\
+	int weight1, weight2;									\
+	dst = dst_line;										\
+	dst_line += dst_stride;									\
+	vx = v.vector[0];									\
+	if (have_mask && !mask_is_solid)							\
+	{											\
+	    mask = mask_line;									\
+	    mask_line += mask_stride;								\
+	}											\
+												\
+	y1 = pixman_fixed_to_int (vy);								\
+	weight2 = (vy >> 8) & 0xff;								\
+	if (weight2)										\
+	{											\
+	    /* normal case, both row weights are in 0-255 range and fit unsigned byte */	\
+	    y2 = y1 + 1;									\
+	    weight1 = 256 - weight2;								\
+	}											\
+	else											\
+	{											\
+	    /* set both top and bottom row to the same scanline, and weights to 128+128 */	\
+	    y2 = y1;										\
+	    weight1 = weight2 = 128;								\
+	}											\
+	vy += unit_y;										\
+	if (PIXMAN_REPEAT_ ## repeat_mode == PIXMAN_REPEAT_PAD)					\
+	{											\
+	    src_type_t *src1, *src2;								\
+	    src_type_t buf1[2];									\
+	    src_type_t buf2[2];									\
+	    repeat (PIXMAN_REPEAT_PAD, &y1, src_image->bits.height);				\
+	    repeat (PIXMAN_REPEAT_PAD, &y2, src_image->bits.height);				\
+	    src1 = src_first_line + src_stride * y1;						\
+	    src2 = src_first_line + src_stride * y2;						\
+												\
+	    if (left_pad > 0)									\
+	    {											\
+		buf1[0] = buf1[1] = src1[0];							\
+		buf2[0] = buf2[1] = src2[0];							\
+		scanline_func (dst, mask,							\
+			       buf1, buf2, left_pad, weight1, weight2, 0, 0, 0, FALSE);		\
+		dst += left_pad;								\
+		if (have_mask && !mask_is_solid)						\
+		    mask += left_pad;								\
+	    }											\
+	    if (width > 0)									\
+	    {											\
+		scanline_func (dst, mask,							\
+			       src1, src2, width, weight1, weight2, vx, unit_x, 0, FALSE);	\
+		dst += width;									\
+		if (have_mask && !mask_is_solid)						\
+		    mask += width;								\
+	    }											\
+	    if (right_pad > 0)									\
+	    {											\
+		buf1[0] = buf1[1] = src1[src_image->bits.width - 1];				\
+		buf2[0] = buf2[1] = src2[src_image->bits.width - 1];				\
+		scanline_func (dst, mask,							\
+			       buf1, buf2, right_pad, weight1, weight2, 0, 0, 0, FALSE);	\
+	    }											\
+	}											\
+	else if (PIXMAN_REPEAT_ ## repeat_mode == PIXMAN_REPEAT_NONE)				\
+	{											\
+	    src_type_t *src1, *src2;								\
+	    src_type_t buf1[2];									\
+	    src_type_t buf2[2];									\
+	    /* handle top/bottom zero padding by just setting weights to 0 if needed */		\
+	    if (y1 < 0)										\
+	    {											\
+		weight1 = 0;									\
+		y1 = 0;										\
+	    }											\
+	    if (y1 >= src_image->bits.height)							\
+	    {											\
+		weight1 = 0;									\
+		y1 = src_image->bits.height - 1;						\
+	    }											\
+	    if (y2 < 0)										\
+	    {											\
+		weight2 = 0;									\
+		y2 = 0;										\
+	    }											\
+	    if (y2 >= src_image->bits.height)							\
+	    {											\
+		weight2 = 0;									\
+		y2 = src_image->bits.height - 1;						\
+	    }											\
+	    src1 = src_first_line + src_stride * y1;						\
+	    src2 = src_first_line + src_stride * y2;						\
+												\
+	    if (left_pad > 0)									\
+	    {											\
+		buf1[0] = buf1[1] = 0;								\
+		buf2[0] = buf2[1] = 0;								\
+		scanline_func (dst, mask,							\
+			       buf1, buf2, left_pad, weight1, weight2, 0, 0, 0, TRUE);		\
+		dst += left_pad;								\
+		if (have_mask && !mask_is_solid)						\
+		    mask += left_pad;								\
+	    }											\
+	    if (left_tz > 0)									\
+	    {											\
+		buf1[0] = 0;									\
+		buf1[1] = src1[0];								\
+		buf2[0] = 0;									\
+		buf2[1] = src2[0];								\
+		scanline_func (dst, mask,							\
+			       buf1, buf2, left_tz, weight1, weight2,				\
+			       pixman_fixed_frac (vx), unit_x, 0, FALSE);			\
+		dst += left_tz;									\
+		if (have_mask && !mask_is_solid)						\
+		    mask += left_tz;								\
+		vx += left_tz * unit_x;								\
+	    }											\
+	    if (width > 0)									\
+	    {											\
+		scanline_func (dst, mask,							\
+			       src1, src2, width, weight1, weight2, vx, unit_x, 0, FALSE);	\
+		dst += width;									\
+		if (have_mask && !mask_is_solid)						\
+		    mask += width;								\
+		vx += width * unit_x;								\
+	    }											\
+	    if (right_tz > 0)									\
+	    {											\
+		buf1[0] = src1[src_image->bits.width - 1];					\
+		buf1[1] = 0;									\
+		buf2[0] = src2[src_image->bits.width - 1];					\
+		buf2[1] = 0;									\
+		scanline_func (dst, mask,							\
+			       buf1, buf2, right_tz, weight1, weight2,				\
+			       pixman_fixed_frac (vx), unit_x, 0, FALSE);			\
+		dst += right_tz;								\
+		if (have_mask && !mask_is_solid)						\
+		    mask += right_tz;								\
+	    }											\
+	    if (right_pad > 0)									\
+	    {											\
+		buf1[0] = buf1[1] = 0;								\
+		buf2[0] = buf2[1] = 0;								\
+		scanline_func (dst, mask,							\
+			       buf1, buf2, right_pad, weight1, weight2, 0, 0, 0, TRUE);		\
+	    }											\
+	}											\
+	else											\
+	{											\
+	    scanline_func (dst, mask, src_first_line + src_stride * y1,				\
+			   src_first_line + src_stride * y2, width,				\
+			   weight1, weight2, vx, unit_x, max_vx, FALSE);			\
+	}											\
+    }												\
+}
+
+/* A workaround for old Sun Studio, see: https://bugs.freedesktop.org/show_bug.cgi?id=32764 */
+#define FAST_BILINEAR_MAINLOOP_COMMON(scale_func_name, scanline_func, src_type_t, mask_type_t,	\
+				  dst_type_t, repeat_mode, have_mask, mask_is_solid)		\
+	FAST_BILINEAR_MAINLOOP_INT(_ ## scale_func_name, scanline_func, src_type_t, mask_type_t,\
+				  dst_type_t, repeat_mode, have_mask, mask_is_solid)
+
+#define SCALED_BILINEAR_FLAGS						\
+    (FAST_PATH_SCALE_TRANSFORM	|					\
+     FAST_PATH_NO_ALPHA_MAP	|					\
+     FAST_PATH_BILINEAR_FILTER	|					\
+     FAST_PATH_NO_ACCESSORS	|					\
+     FAST_PATH_NARROW_FORMAT)
+
+#define SIMPLE_BILINEAR_FAST_PATH_PAD(op,s,d,func)			\
+    {   PIXMAN_OP_ ## op,						\
+	PIXMAN_ ## s,							\
+	(SCALED_BILINEAR_FLAGS		|				\
+	 FAST_PATH_PAD_REPEAT		|				\
+	 FAST_PATH_X_UNIT_POSITIVE),					\
+	PIXMAN_null, 0,							\
+	PIXMAN_ ## d, FAST_PATH_STD_DEST_FLAGS,				\
+	fast_composite_scaled_bilinear_ ## func ## _pad ## _ ## op,	\
+    }
+
+#define SIMPLE_BILINEAR_FAST_PATH_NONE(op,s,d,func)			\
+    {   PIXMAN_OP_ ## op,						\
+	PIXMAN_ ## s,							\
+	(SCALED_BILINEAR_FLAGS		|				\
+	 FAST_PATH_NONE_REPEAT		|				\
+	 FAST_PATH_X_UNIT_POSITIVE),					\
+	PIXMAN_null, 0,							\
+	PIXMAN_ ## d, FAST_PATH_STD_DEST_FLAGS,				\
+	fast_composite_scaled_bilinear_ ## func ## _none ## _ ## op,	\
+    }
+
+#define SIMPLE_BILINEAR_FAST_PATH_COVER(op,s,d,func)			\
+    {   PIXMAN_OP_ ## op,						\
+	PIXMAN_ ## s,							\
+	SCALED_BILINEAR_FLAGS | FAST_PATH_SAMPLES_COVER_CLIP,		\
+	PIXMAN_null, 0,							\
+	PIXMAN_ ## d, FAST_PATH_STD_DEST_FLAGS,				\
+	fast_composite_scaled_bilinear_ ## func ## _cover ## _ ## op,	\
+    }
+
+#define SIMPLE_BILINEAR_A8_MASK_FAST_PATH_PAD(op,s,d,func)		\
+    {   PIXMAN_OP_ ## op,						\
+	PIXMAN_ ## s,							\
+	(SCALED_BILINEAR_FLAGS		|				\
+	 FAST_PATH_PAD_REPEAT		|				\
+	 FAST_PATH_X_UNIT_POSITIVE),					\
+	PIXMAN_a8, MASK_FLAGS (a8, FAST_PATH_UNIFIED_ALPHA),		\
+	PIXMAN_ ## d, FAST_PATH_STD_DEST_FLAGS,				\
+	fast_composite_scaled_bilinear_ ## func ## _pad ## _ ## op,	\
+    }
+
+#define SIMPLE_BILINEAR_A8_MASK_FAST_PATH_NONE(op,s,d,func)		\
+    {   PIXMAN_OP_ ## op,						\
+	PIXMAN_ ## s,							\
+	(SCALED_BILINEAR_FLAGS		|				\
+	 FAST_PATH_NONE_REPEAT		|				\
+	 FAST_PATH_X_UNIT_POSITIVE),					\
+	PIXMAN_a8, MASK_FLAGS (a8, FAST_PATH_UNIFIED_ALPHA),		\
+	PIXMAN_ ## d, FAST_PATH_STD_DEST_FLAGS,				\
+	fast_composite_scaled_bilinear_ ## func ## _none ## _ ## op,	\
+    }
+
+#define SIMPLE_BILINEAR_A8_MASK_FAST_PATH_COVER(op,s,d,func)		\
+    {   PIXMAN_OP_ ## op,						\
+	PIXMAN_ ## s,							\
+	SCALED_BILINEAR_FLAGS | FAST_PATH_SAMPLES_COVER_CLIP,		\
+	PIXMAN_a8, MASK_FLAGS (a8, FAST_PATH_UNIFIED_ALPHA),		\
+	PIXMAN_ ## d, FAST_PATH_STD_DEST_FLAGS,				\
+	fast_composite_scaled_bilinear_ ## func ## _cover ## _ ## op,	\
+    }
+
+#define SIMPLE_BILINEAR_SOLID_MASK_FAST_PATH_PAD(op,s,d,func)		\
+    {   PIXMAN_OP_ ## op,						\
+	PIXMAN_ ## s,							\
+	(SCALED_BILINEAR_FLAGS		|				\
+	 FAST_PATH_PAD_REPEAT		|				\
+	 FAST_PATH_X_UNIT_POSITIVE),					\
+	PIXMAN_solid, MASK_FLAGS (solid, FAST_PATH_UNIFIED_ALPHA),	\
+	PIXMAN_ ## d, FAST_PATH_STD_DEST_FLAGS,				\
+	fast_composite_scaled_bilinear_ ## func ## _pad ## _ ## op,	\
+    }
+
+#define SIMPLE_BILINEAR_SOLID_MASK_FAST_PATH_NONE(op,s,d,func)		\
+    {   PIXMAN_OP_ ## op,						\
+	PIXMAN_ ## s,							\
+	(SCALED_BILINEAR_FLAGS		|				\
+	 FAST_PATH_NONE_REPEAT		|				\
+	 FAST_PATH_X_UNIT_POSITIVE),					\
+	PIXMAN_solid, MASK_FLAGS (solid, FAST_PATH_UNIFIED_ALPHA),	\
+	PIXMAN_ ## d, FAST_PATH_STD_DEST_FLAGS,				\
+	fast_composite_scaled_bilinear_ ## func ## _none ## _ ## op,	\
+    }
+
+#define SIMPLE_BILINEAR_SOLID_MASK_FAST_PATH_COVER(op,s,d,func)		\
+    {   PIXMAN_OP_ ## op,						\
+	PIXMAN_ ## s,							\
+	SCALED_BILINEAR_FLAGS | FAST_PATH_SAMPLES_COVER_CLIP,		\
+	PIXMAN_solid, MASK_FLAGS (solid, FAST_PATH_UNIFIED_ALPHA),	\
+	PIXMAN_ ## d, FAST_PATH_STD_DEST_FLAGS,				\
+	fast_composite_scaled_bilinear_ ## func ## _cover ## _ ## op,	\
+    }
+
+/* Prefer the 'cover' variant because it is faster */
+#define SIMPLE_BILINEAR_FAST_PATH(op,s,d,func)				\
+    SIMPLE_BILINEAR_FAST_PATH_COVER (op,s,d,func),			\
+    SIMPLE_BILINEAR_FAST_PATH_NONE (op,s,d,func),			\
+    SIMPLE_BILINEAR_FAST_PATH_PAD (op,s,d,func)
+
+#define SIMPLE_BILINEAR_A8_MASK_FAST_PATH(op,s,d,func)			\
+    SIMPLE_BILINEAR_A8_MASK_FAST_PATH_COVER (op,s,d,func),		\
+    SIMPLE_BILINEAR_A8_MASK_FAST_PATH_NONE (op,s,d,func),		\
+    SIMPLE_BILINEAR_A8_MASK_FAST_PATH_PAD (op,s,d,func)
+
+#define SIMPLE_BILINEAR_SOLID_MASK_FAST_PATH(op,s,d,func)		\
+    SIMPLE_BILINEAR_SOLID_MASK_FAST_PATH_COVER (op,s,d,func),		\
+    SIMPLE_BILINEAR_SOLID_MASK_FAST_PATH_NONE (op,s,d,func),		\
+    SIMPLE_BILINEAR_SOLID_MASK_FAST_PATH_PAD (op,s,d,func)
+
 #endif
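To make the scanline contract concrete, here is a hypothetical plain-C
bilinear scanline function for SRC on a8r8g8b8, with its instantiation and
table entry (a sketch only: it assumes the pixman-private.h environment for
force_inline, and SIMD backends would supply optimized equivalents):

    static force_inline uint32_t
    bilinear_interpolate (uint32_t tl, uint32_t tr, uint32_t bl, uint32_t br,
                          int wl, int wr, int wt, int wb)
    {
        uint32_t r = 0;
        int shift;

        /* interpolate the four 8-bit channels independently; wl + wr == 256
         * and wt + wb <= 256, so (t + b) fits in 32 bits and the value
         * after >> 16 is again an 8-bit channel */
        for (shift = 0; shift < 32; shift += 8)
        {
            uint32_t t = ((tl >> shift & 0xff) * wl + (tr >> shift & 0xff) * wr) * wt;
            uint32_t b = ((bl >> shift & 0xff) * wl + (br >> shift & 0xff) * wr) * wb;

            r |= (((t + b) >> 16) & 0xff) << shift;
        }
        return r;
    }

    static void
    scaled_bilinear_scanline_c_8888_8888_SRC (uint32_t *       dst,
                                              const uint32_t * mask,
                                              const uint32_t * src_top,
                                              const uint32_t * src_bottom,
                                              int32_t          w,
                                              int              wt,
                                              int              wb,
                                              pixman_fixed_t   vx,
                                              pixman_fixed_t   unit_x,
                                              pixman_fixed_t   max_vx,
                                              pixman_bool_t    zero_src)
    {
        while (w--)
        {
            int x  = pixman_fixed_to_int (vx);
            int wr = (vx >> 8) & 0xff;  /* horizontal weight of the right pixel */

            *dst++ = bilinear_interpolate (src_top[x], src_top[x + 1],
                                           src_bottom[x], src_bottom[x + 1],
                                           256 - wr, wr, wt, wb);
            vx += unit_x;
        }
    }

    FAST_BILINEAR_MAINLOOP_COMMON (c_8888_8888_cover_SRC,
                                   scaled_bilinear_scanline_c_8888_8888_SRC,
                                   uint32_t, uint32_t, uint32_t, COVER, FALSE, FALSE)

and, in the fast path table:

    SIMPLE_BILINEAR_FAST_PATH_COVER (SRC, a8r8g8b8, a8r8g8b8, c_8888_8888),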
--- a/gfx/cairo/libpixman/src/pixman-general.c
+++ b/gfx/cairo/libpixman/src/pixman-general.c
@@ -31,18 +31,76 @@
 #include <stdlib.h>
 #include <string.h>
 #include <math.h>
 #include <limits.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include "pixman-private.h"
-#include "pixman-combine32.h"
-#include "pixman-private.h"
+
+static void
+general_src_iter_init (pixman_implementation_t *imp, pixman_iter_t *iter)
+{
+    pixman_image_t *image = iter->image;
+
+    if (image->type == SOLID)
+	_pixman_solid_fill_iter_init (image, iter);
+    else if (image->type == LINEAR)
+	_pixman_linear_gradient_iter_init (image, iter);
+    else if (image->type == RADIAL)
+	_pixman_radial_gradient_iter_init (image, iter);
+    else if (image->type == CONICAL)
+	_pixman_conical_gradient_iter_init (image, iter);
+    else if (image->type == BITS)
+	_pixman_bits_image_src_iter_init (image, iter);
+    else
+	_pixman_log_error (FUNC, "Pixman bug: unknown image type\n");
+}
+
+static void
+general_dest_iter_init (pixman_implementation_t *imp, pixman_iter_t *iter)
+{
+    if (iter->image->type == BITS)
+    {
+	_pixman_bits_image_dest_iter_init (iter->image, iter);
+    }
+    else
+    {
+	_pixman_log_error (FUNC, "Trying to write to a non-writable image");
+    }
+}
+
+typedef struct op_info_t op_info_t;
+struct op_info_t
+{
+    uint8_t src, dst;
+};
+
+#define ITER_IGNORE_BOTH						\
+    (ITER_IGNORE_ALPHA | ITER_IGNORE_RGB | ITER_LOCALIZED_ALPHA)
+
+static const op_info_t op_flags[PIXMAN_N_OPERATORS] =
+{
+    /* Src                   Dst                   */
+    { ITER_IGNORE_BOTH,      ITER_IGNORE_BOTH      }, /* CLEAR */
+    { ITER_LOCALIZED_ALPHA,  ITER_IGNORE_BOTH      }, /* SRC */
+    { ITER_IGNORE_BOTH,      ITER_LOCALIZED_ALPHA  }, /* DST */
+    { 0,                     ITER_LOCALIZED_ALPHA  }, /* OVER */
+    { ITER_LOCALIZED_ALPHA,  0                     }, /* OVER_REVERSE */
+    { ITER_LOCALIZED_ALPHA,  ITER_IGNORE_RGB       }, /* IN */
+    { ITER_IGNORE_RGB,       ITER_LOCALIZED_ALPHA  }, /* IN_REVERSE */
+    { ITER_LOCALIZED_ALPHA,  ITER_IGNORE_RGB       }, /* OUT */
+    { ITER_IGNORE_RGB,       ITER_LOCALIZED_ALPHA  }, /* OUT_REVERSE */
+    { 0,                     0                     }, /* ATOP */
+    { 0,                     0                     }, /* ATOP_REVERSE */
+    { 0,                     0                     }, /* XOR */
+    { ITER_LOCALIZED_ALPHA,  ITER_LOCALIZED_ALPHA  }, /* ADD */
+    { 0,                     0                     }, /* SATURATE */
+};
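+
+/* general_composite_rect () consults this table to relax its iterators;
+ * e.g. when op_flags[op].src contains both ITER_IGNORE_ALPHA and
+ * ITER_IGNORE_RGB (CLEAR, DST), the source pixels are never read, so the
+ * mask image can be dropped as well. */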
 
 #define SCANLINE_BUFFER_LENGTH 8192
 
 static void
 general_composite_rect  (pixman_implementation_t *imp,
                          pixman_op_t              op,
                          pixman_image_t *         src,
                          pixman_image_t *         mask,
@@ -51,131 +109,84 @@ general_composite_rect  (pixman_implemen
                          int32_t                  src_y,
                          int32_t                  mask_x,
                          int32_t                  mask_y,
                          int32_t                  dest_x,
                          int32_t                  dest_y,
                          int32_t                  width,
                          int32_t                  height)
 {
-    uint8_t stack_scanline_buffer[SCANLINE_BUFFER_LENGTH * 3];
-    uint8_t *scanline_buffer = stack_scanline_buffer;
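+    /* uint64_t elements keep the stack buffer 8-byte aligned for the wide
+     * (64 bits per pixel) path */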
+    uint64_t stack_scanline_buffer[(SCANLINE_BUFFER_LENGTH * 3 + 7) / 8];
+    uint8_t *scanline_buffer = (uint8_t *) stack_scanline_buffer;
     uint8_t *src_buffer, *mask_buffer, *dest_buffer;
-    fetch_scanline_t fetch_src = NULL, fetch_mask = NULL, fetch_dest = NULL;
+    pixman_iter_t src_iter, mask_iter, dest_iter;
     pixman_combine_32_func_t compose;
-    store_scanline_t store;
-    source_image_class_t src_class, mask_class;
     pixman_bool_t component_alpha;
-    uint32_t *bits;
-    int32_t stride;
-    int narrow, Bpp;
+    iter_flags_t narrow, src_flags;
+    int Bpp;
     int i;
 
-    narrow =
-	(src->common.flags & FAST_PATH_NARROW_FORMAT)		&&
+    if ((src->common.flags & FAST_PATH_NARROW_FORMAT)		&&
 	(!mask || mask->common.flags & FAST_PATH_NARROW_FORMAT)	&&
-	(dest->common.flags & FAST_PATH_NARROW_FORMAT);
-    Bpp = narrow ? 4 : 8;
+	(dest->common.flags & FAST_PATH_NARROW_FORMAT))
+    {
+	narrow = ITER_NARROW;
+	Bpp = 4;
+    }
+    else
+    {
+	narrow = 0;
+	Bpp = 8;
+    }
 
     if (width * Bpp > SCANLINE_BUFFER_LENGTH)
     {
 	scanline_buffer = pixman_malloc_abc (width, 3, Bpp);
 
 	if (!scanline_buffer)
 	    return;
     }
 
     src_buffer = scanline_buffer;
     mask_buffer = src_buffer + width * Bpp;
     dest_buffer = mask_buffer + width * Bpp;
 
-    src_class = _pixman_image_classify (src,
-                                        src_x, src_y,
-                                        width, height);
-
-    mask_class = SOURCE_IMAGE_CLASS_UNKNOWN;
-
-    if (mask)
-    {
-	mask_class = _pixman_image_classify (mask,
-	                                     src_x, src_y,
-	                                     width, height);
-    }
+    /* src iter */
+    src_flags = narrow | op_flags[op].src;
 
-    if (op == PIXMAN_OP_CLEAR)
-	fetch_src = NULL;
-    else if (narrow)
-	fetch_src = _pixman_image_get_scanline_32;
-    else
-	fetch_src = _pixman_image_get_scanline_64;
-
-    if (!mask || op == PIXMAN_OP_CLEAR)
-	fetch_mask = NULL;
-    else if (narrow)
-	fetch_mask = _pixman_image_get_scanline_32;
-    else
-	fetch_mask = _pixman_image_get_scanline_64;
-
-    if (op == PIXMAN_OP_CLEAR || op == PIXMAN_OP_SRC)
-	fetch_dest = NULL;
-    else if (narrow)
-	fetch_dest = _pixman_image_get_scanline_32;
-    else
-	fetch_dest = _pixman_image_get_scanline_64;
+    _pixman_implementation_src_iter_init (imp->toplevel, &src_iter, src,
+					  src_x, src_y, width, height,
+					  src_buffer, src_flags);
 
-    if (narrow)
-	store = _pixman_image_store_scanline_32;
-    else
-	store = _pixman_image_store_scanline_64;
-
-    /* Skip the store step and composite directly into the
-     * destination if the output format of the compose func matches
-     * the destination format.
-     *
-     * If the destination format is a8r8g8b8 then we can always do
-     * this. If it is x8r8g8b8, then we can only do it if the
-     * operator doesn't make use of destination alpha.
-     */
-    if ((dest->bits.format == PIXMAN_a8r8g8b8)	||
-	(dest->bits.format == PIXMAN_x8r8g8b8	&&
-	 (op == PIXMAN_OP_OVER		||
-	  op == PIXMAN_OP_ADD		||
-	  op == PIXMAN_OP_SRC		||
-	  op == PIXMAN_OP_CLEAR		||
-	  op == PIXMAN_OP_IN_REVERSE	||
-	  op == PIXMAN_OP_OUT_REVERSE	||
-	  op == PIXMAN_OP_DST)))
+    /* mask iter */
+    if ((src_flags & (ITER_IGNORE_ALPHA | ITER_IGNORE_RGB)) ==
+	(ITER_IGNORE_ALPHA | ITER_IGNORE_RGB))
     {
-	if (narrow &&
-	    !dest->common.alpha_map &&
-	    !dest->bits.write_func)
-	{
-	    store = NULL;
-	}
-    }
-
-    if (!store)
-    {
-	bits = dest->bits.bits;
-	stride = dest->bits.rowstride;
-    }
-    else
-    {
-	bits = NULL;
-	stride = 0;
+	/* If it doesn't matter what the source is, then it doesn't matter
+	 * what the mask is
+	 */
+	mask = NULL;
     }
 
     component_alpha =
-        fetch_src                       &&
-        fetch_mask                      &&
         mask                            &&
         mask->common.type == BITS       &&
         mask->common.component_alpha    &&
         PIXMAN_FORMAT_RGB (mask->bits.format);
 
+    _pixman_implementation_src_iter_init (
+	imp->toplevel, &mask_iter, mask, mask_x, mask_y, width, height,
+	mask_buffer, narrow | (component_alpha? 0 : ITER_IGNORE_RGB));
+
+    /* dest iter */
+    _pixman_implementation_dest_iter_init (imp->toplevel, &dest_iter, dest,
+					   dest_x, dest_y, width, height,
+					   dest_buffer,
+					   narrow | op_flags[op].dst);
+
     if (narrow)
     {
 	if (component_alpha)
 	    compose = _pixman_implementation_combine_32_ca;
 	else
 	    compose = _pixman_implementation_combine_32;
     }
     else
@@ -184,83 +195,30 @@ general_composite_rect  (pixman_implemen
 	    compose = (pixman_combine_32_func_t)_pixman_implementation_combine_64_ca;
 	else
 	    compose = (pixman_combine_32_func_t)_pixman_implementation_combine_64;
     }
 
     if (!compose)
 	return;
 
-    if (!fetch_mask)
-	mask_buffer = NULL;
-
     for (i = 0; i < height; ++i)
     {
-	/* fill first half of scanline with source */
-	if (fetch_src)
-	{
-	    if (fetch_mask)
-	    {
-		/* fetch mask before source so that fetching of
-		   source can be optimized */
-		fetch_mask (mask, mask_x, mask_y + i,
-		            width, (void *)mask_buffer, 0);
-
-		if (mask_class == SOURCE_IMAGE_CLASS_HORIZONTAL)
-		    fetch_mask = NULL;
-	    }
+	uint32_t *s, *m, *d;
 
-	    if (src_class == SOURCE_IMAGE_CLASS_HORIZONTAL)
-	    {
-		fetch_src (src, src_x, src_y + i,
-		           width, (void *)src_buffer, 0);
-		fetch_src = NULL;
-	    }
-	    else
-	    {
-		fetch_src (src, src_x, src_y + i,
-		           width, (void *)src_buffer, (void *)mask_buffer);
-	    }
-	}
-	else if (fetch_mask)
-	{
-	    fetch_mask (mask, mask_x, mask_y + i,
-	                width, (void *)mask_buffer, 0);
-	}
+	m = mask_iter.get_scanline (&mask_iter, NULL);
+	s = src_iter.get_scanline (&src_iter, m);
+	d = dest_iter.get_scanline (&dest_iter, NULL);
 
-	if (store)
-	{
-	    /* fill dest into second half of scanline */
-	    if (fetch_dest)
-	    {
-		fetch_dest (dest, dest_x, dest_y + i,
-		            width, (void *)dest_buffer, 0);
-	    }
+	compose (imp->toplevel, op, d, s, m, width);
 
-	    /* blend */
-	    compose (imp->toplevel, op,
-		     (void *)dest_buffer,
-		     (void *)src_buffer,
-		     (void *)mask_buffer,
-		     width);
-
-	    /* write back */
-	    store (&(dest->bits), dest_x, dest_y + i, width,
-	           (void *)dest_buffer);
-	}
-	else
-	{
-	    /* blend */
-	    compose (imp->toplevel, op,
-		     bits + (dest_y + i) * stride + dest_x,
-	             (void *)src_buffer, (void *)mask_buffer, width);
-	}
+	dest_iter.write_back (&dest_iter);
     }
 
-    if (scanline_buffer != stack_scanline_buffer)
+    if (scanline_buffer != (uint8_t *) stack_scanline_buffer)
 	free (scanline_buffer);
 }
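
(Sketch, not part of the patch: the op_flags table consulted at the top of general_composite_rect is defined earlier in pixman-general.c and is not visible in this hunk. Its shape is a per-operator pair of iterator flags; the entries below are illustrative guesses, not the patch's actual values.)

    static const struct
    {
        iter_flags_t src;
        iter_flags_t dst;
    } op_flags[PIXMAN_N_OPERATORS] =
    {
        /* illustrative entries only; the real table covers every operator */
        { ITER_IGNORE_ALPHA | ITER_IGNORE_RGB, ITER_IGNORE_ALPHA | ITER_IGNORE_RGB }, /* CLEAR */
        { ITER_LOCALIZED_ALPHA,                ITER_IGNORE_ALPHA | ITER_IGNORE_RGB }, /* SRC */
        /* ... */
    };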
 
 static const pixman_fast_path_t general_fast_path[] =
 {
     { PIXMAN_OP_any, PIXMAN_any, 0, PIXMAN_any,	0, PIXMAN_any, 0, general_composite_rect },
     { PIXMAN_OP_NONE }
 };
@@ -304,12 +262,14 @@ pixman_implementation_t *
 {
     pixman_implementation_t *imp = _pixman_implementation_create (NULL, general_fast_path);
 
     _pixman_setup_combiner_functions_32 (imp);
     _pixman_setup_combiner_functions_64 (imp);
 
     imp->blt = general_blt;
     imp->fill = general_fill;
+    imp->src_iter_init = general_src_iter_init;
+    imp->dest_iter_init = general_dest_iter_init;
 
     return imp;
 }
 
--- a/gfx/cairo/libpixman/src/pixman-image.c
+++ b/gfx/cairo/libpixman/src/pixman-image.c
@@ -25,76 +25,35 @@
 #endif
 
 #include <stdlib.h>
 #include <stdio.h>
 #include <string.h>
 #include <assert.h>
 
 #include "pixman-private.h"
-#include "pixman-combine32.h"
 
 pixman_bool_t
 _pixman_init_gradient (gradient_t *                  gradient,
                        const pixman_gradient_stop_t *stops,
                        int                           n_stops)
 {
     return_val_if_fail (n_stops > 0, FALSE);
 
     gradient->stops = pixman_malloc_ab (n_stops, sizeof (pixman_gradient_stop_t));
     if (!gradient->stops)
 	return FALSE;
 
     memcpy (gradient->stops, stops, n_stops * sizeof (pixman_gradient_stop_t));
 
     gradient->n_stops = n_stops;
 
-    gradient->stop_range = 0xffff;
-    gradient->common.class = SOURCE_IMAGE_CLASS_UNKNOWN;
-
     return TRUE;
 }
 
-/*
- * By default, just evaluate the image at 32bpp and expand.  Individual image
- * types can plug in a better scanline getter if they want to. For example
- * we  could produce smoother gradients by evaluating them at higher color
- * depth, but that's a project for the future.
- */
-void
-_pixman_image_get_scanline_generic_64 (pixman_image_t * image,
-                                       int              x,
-                                       int              y,
-                                       int              width,
-                                       uint32_t *       buffer,
-                                       const uint32_t * mask)
-{
-    uint32_t *mask8 = NULL;
-
-    /* Contract the mask image, if one exists, so that the 32-bit fetch
-     * function can use it.
-     */
-    if (mask)
-    {
-	mask8 = pixman_malloc_ab (width, sizeof(uint32_t));
-	if (!mask8)
-	    return;
-
-	pixman_contract (mask8, (uint64_t *)mask, width);
-    }
-
-    /* Fetch the source image into the first half of buffer. */
-    _pixman_image_get_scanline_32 (image, x, y, width, (uint32_t*)buffer, mask8);
-
-    /* Expand from 32bpp to 64bpp in place. */
-    pixman_expand ((uint64_t *)buffer, buffer, PIXMAN_a8r8g8b8, width);
-
-    free (mask8);
-}
-
 pixman_image_t *
 _pixman_image_allocate (void)
 {
     pixman_image_t *image = malloc (sizeof (pixman_image_t));
 
     if (image)
     {
 	image_common_t *common = &image->common;
@@ -107,64 +66,26 @@ pixman_image_t *
 	common->transform = NULL;
 	common->repeat = PIXMAN_REPEAT_NONE;
 	common->filter = PIXMAN_FILTER_NEAREST;
 	common->filter_params = NULL;
 	common->n_filter_params = 0;
 	common->alpha_map = NULL;
 	common->component_alpha = FALSE;
 	common->ref_count = 1;
-	common->classify = NULL;
+	common->property_changed = NULL;
 	common->client_clip = FALSE;
 	common->destroy_func = NULL;
 	common->destroy_data = NULL;
 	common->dirty = TRUE;
     }
 
     return image;
 }
 
-source_image_class_t
-_pixman_image_classify (pixman_image_t *image,
-                        int             x,
-                        int             y,
-                        int             width,
-                        int             height)
-{
-    if (image->common.classify)
-	return image->common.classify (image, x, y, width, height);
-    else
-	return SOURCE_IMAGE_CLASS_UNKNOWN;
-}
-
-void
-_pixman_image_get_scanline_32 (pixman_image_t *image,
-                               int             x,
-                               int             y,
-                               int             width,
-                               uint32_t *      buffer,
-                               const uint32_t *mask)
-{
-    image->common.get_scanline_32 (image, x, y, width, buffer, mask);
-}
-
-/* Even thought the type of buffer is uint32_t *, the function actually expects
- * a uint64_t *buffer.
- */
-void
-_pixman_image_get_scanline_64 (pixman_image_t *image,
-                               int             x,
-                               int             y,
-                               int             width,
-                               uint32_t *      buffer,
-                               const uint32_t *unused)
-{
-    image->common.get_scanline_64 (image, x, y, width, buffer, unused);
-}
-
 static void
 image_property_changed (pixman_image_t *image)
 {
     image->common.dirty = TRUE;
 }
 
 /* Ref Counting */
 PIXMAN_EXPORT pixman_image_t *
@@ -234,64 +155,37 @@ pixman_image_get_destroy_data (pixman_im
 }
 
 void
 _pixman_image_reset_clip_region (pixman_image_t *image)
 {
     image->common.have_clip_region = FALSE;
 }
 
-static pixman_bool_t out_of_bounds_workaround = TRUE;
-
-/* Old X servers rely on out-of-bounds accesses when they are asked
- * to composite with a window as the source. They create a pixman image
- * pointing to some bogus position in memory, but then they set a clip
- * region to the position where the actual bits are.
+/* Executive Summary: This function is a no-op that only exists
+ * for historical reasons.
+ *
+ * There used to be a bug in the X server where it would rely on
+ * out-of-bounds accesses when it was asked to composite with a
+ * window as the source. It would create a pixman image pointing
+ * to some bogus position in memory, but then set a clip region
+ * to the position where the actual bits were.
  *
  * Due to a bug in old versions of pixman, where it would not clip
  * against the image bounds when a clip region was set, this would
- * actually work. So by default we allow certain out-of-bound access
- * to happen unless explicitly disabled.
+ * actually work. So when the pixman bug was fixed, a workaround was
+ * added to allow certain out-of-bound accesses. This function disabled
+ * those workarounds.
  *
- * Fixed X servers should call this function to disable the workaround.
+ * Since 0.21.2, pixman doesn't do these workarounds anymore, so now
+ * this function is a no-op.
  */
 PIXMAN_EXPORT void
 pixman_disable_out_of_bounds_workaround (void)
 {
-    out_of_bounds_workaround = FALSE;
-}
-
-static pixman_bool_t
-source_image_needs_out_of_bounds_workaround (bits_image_t *image)
-{
-    if (image->common.clip_sources                      &&
-        image->common.repeat == PIXMAN_REPEAT_NONE      &&
-	image->common.have_clip_region			&&
-        out_of_bounds_workaround)
-    {
-	if (!image->common.client_clip)
-	{
-	    /* There is no client clip, so if the clip region extends beyond the
-	     * drawable geometry, it must be because the X server generated the
-	     * bogus clip region.
-	     */
-	    const pixman_box32_t *extents =
-		pixman_region32_extents (&image->common.clip_region);
-
-	    if (extents->x1 >= 0 && extents->x2 <= image->width &&
-		extents->y1 >= 0 && extents->y2 <= image->height)
-	    {
-		return FALSE;
-	    }
-	}
-
-	return TRUE;
-    }
-
-    return FALSE;
 }
 
 static void
 compute_image_info (pixman_image_t *image)
 {
     pixman_format_code_t code;
     uint32_t flags = 0;
 
@@ -311,18 +205,35 @@ compute_image_info (pixman_image_t *imag
 	    image->common.transform->matrix[2][1] == 0			&&
 	    image->common.transform->matrix[2][2] == pixman_fixed_1)
 	{
 	    flags |= FAST_PATH_AFFINE_TRANSFORM;
 
 	    if (image->common.transform->matrix[0][1] == 0 &&
 		image->common.transform->matrix[1][0] == 0)
 	    {
+		if (image->common.transform->matrix[0][0] == -pixman_fixed_1 &&
+		    image->common.transform->matrix[1][1] == -pixman_fixed_1)
+		{
+		    flags |= FAST_PATH_ROTATE_180_TRANSFORM;
+		}
 		flags |= FAST_PATH_SCALE_TRANSFORM;
 	    }
+	    else if (image->common.transform->matrix[0][0] == 0 &&
+	             image->common.transform->matrix[1][1] == 0)
+	    {
+		pixman_fixed_t m01 = image->common.transform->matrix[0][1];
+		if (m01 == -image->common.transform->matrix[1][0])
+		{
+			if (m01 == -pixman_fixed_1)
+			    flags |= FAST_PATH_ROTATE_90_TRANSFORM;
+			else if (m01 == pixman_fixed_1)
+			    flags |= FAST_PATH_ROTATE_270_TRANSFORM;
+		}
+	    }
 	}
 
 	if (image->common.transform->matrix[0][0] > 0)
 	    flags |= FAST_PATH_X_UNIT_POSITIVE;
 
 	if (image->common.transform->matrix[1][0] == 0)
 	    flags |= FAST_PATH_Y_UNIT_ZERO;
     }
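
(Illustration, assuming a pixman_image_t *image in scope: pixman's own rotation helper produces exactly the matrix shape tested above. For a 90-degree rotation, cos = 0 and sin = 1, so matrix[0][0] = matrix[1][1] = 0, matrix[0][1] = -pixman_fixed_1 and matrix[1][0] = pixman_fixed_1, which the new code now flags as FAST_PATH_ROTATE_90_TRANSFORM.)

    pixman_transform_t t;

    /* 90-degree rotation: cos = 0, sin = 1, both in 16.16 fixed point */
    pixman_transform_init_rotate (&t, pixman_int_to_fixed (0),
                                  pixman_int_to_fixed (1));
    pixman_image_set_transform (image, &t);  /* 'image' is assumed to exist */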
@@ -404,40 +315,58 @@ compute_image_info (pixman_image_t *imag
 	    image->bits.height == 1	&&
 	    image->common.repeat != PIXMAN_REPEAT_NONE)
 	{
 	    code = PIXMAN_solid;
 	}
 	else
 	{
 	    code = image->bits.format;
+
+	    if (!image->common.transform &&
+		image->common.repeat == PIXMAN_REPEAT_NORMAL)
+	    {
+		flags |= FAST_PATH_SIMPLE_REPEAT | FAST_PATH_SAMPLES_COVER_CLIP;
+	    }
 	}
 
 	if (!PIXMAN_FORMAT_A (image->bits.format)				&&
 	    PIXMAN_FORMAT_TYPE (image->bits.format) != PIXMAN_TYPE_GRAY		&&
 	    PIXMAN_FORMAT_TYPE (image->bits.format) != PIXMAN_TYPE_COLOR)
 	{
 	    flags |= FAST_PATH_SAMPLES_OPAQUE;
 
 	    if (image->common.repeat != PIXMAN_REPEAT_NONE)
 		flags |= FAST_PATH_IS_OPAQUE;
 	}
 
-	if (source_image_needs_out_of_bounds_workaround (&image->bits))
-	    flags |= FAST_PATH_NEEDS_WORKAROUND;
-
 	if (image->bits.read_func || image->bits.write_func)
 	    flags &= ~FAST_PATH_NO_ACCESSORS;
 
 	if (PIXMAN_FORMAT_IS_WIDE (image->bits.format))
 	    flags &= ~FAST_PATH_NARROW_FORMAT;
 	break;
 
+    case RADIAL:
+	code = PIXMAN_unknown;
+
+	/*
+	 * As explained in pixman-radial-gradient.c, every point of
+	 * the plane has a valid associated radius (and thus will be
+	 * colored) if and only if a is negative (i.e. one of the two
+	 * circles contains the other one).
+	 */
+
+        if (image->radial.a >= 0)
+	    break;
+
+	/* Fall through */
+
+    case CONICAL:
     case LINEAR:
-    case RADIAL:
 	code = PIXMAN_unknown;
 
 	if (image->common.repeat != PIXMAN_REPEAT_NONE)
 	{
 	    int i;
 
 	    flags |= FAST_PATH_IS_OPAQUE;
 	    for (i = 0; i < image->gradient.n_stops; ++i)
@@ -491,17 +420,18 @@ void
     {
 	compute_image_info (image);
 
 	/* It is important that property_changed is
 	 * called *after* compute_image_info() because
 	 * property_changed() can make use of the flags
 	 * to set up accessors etc.
 	 */
-	image->common.property_changed (image);
+	if (image->common.property_changed)
+	    image->common.property_changed (image);
 
 	image->common.dirty = FALSE;
     }
 
     if (image->common.alpha_map)
 	_pixman_image_validate ((pixman_image_t *)image->common.alpha_map);
 }
 
@@ -572,25 +502,31 @@ pixman_image_set_transform (pixman_image
     };
 
     image_common_t *common = (image_common_t *)image;
     pixman_bool_t result;
 
     if (common->transform == transform)
 	return TRUE;
 
-    if (memcmp (&id, transform, sizeof (pixman_transform_t)) == 0)
+    if (!transform || memcmp (&id, transform, sizeof (pixman_transform_t)) == 0)
     {
 	free (common->transform);
 	common->transform = NULL;
 	result = TRUE;
 
 	goto out;
     }
 
+    if (common->transform &&
+	memcmp (common->transform, transform, sizeof (pixman_transform_t)) == 0)
+    {
+	return TRUE;
+    }
+
     if (common->transform == NULL)
 	common->transform = malloc (sizeof (pixman_transform_t));
 
     if (common->transform == NULL)
     {
 	result = FALSE;
 
 	goto out;
@@ -605,16 +541,19 @@ out:
 
     return result;
 }
 
 PIXMAN_EXPORT void
 pixman_image_set_repeat (pixman_image_t *image,
                          pixman_repeat_t repeat)
 {
+    if (image->common.repeat == repeat)
+	return;
+
     image->common.repeat = repeat;
 
     image_property_changed (image);
 }
 
 PIXMAN_EXPORT pixman_bool_t
 pixman_image_set_filter (pixman_image_t *      image,
                          pixman_filter_t       filter,
@@ -649,31 +588,37 @@ pixman_image_set_filter (pixman_image_t 
     image_property_changed (image);
     return TRUE;
 }
 
 PIXMAN_EXPORT void
 pixman_image_set_source_clipping (pixman_image_t *image,
                                   pixman_bool_t   clip_sources)
 {
+    if (image->common.clip_sources == clip_sources)
+	return;
+
     image->common.clip_sources = clip_sources;
 
     image_property_changed (image);
 }
 
 /* Unlike all the other property setters, this function does not
  * copy the content of indexed. Doing this copying is simply
  * way, way too expensive.
  */
 PIXMAN_EXPORT void
 pixman_image_set_indexed (pixman_image_t *        image,
                           const pixman_indexed_t *indexed)
 {
     bits_image_t *bits = (bits_image_t *)image;
 
+    if (bits->indexed == indexed)
+	return;
+
     bits->indexed = indexed;
 
     image_property_changed (image);
 }
 
 PIXMAN_EXPORT void
 pixman_image_set_alpha_map (pixman_image_t *image,
                             pixman_image_t *alpha_map,
@@ -726,16 +671,19 @@ pixman_image_set_alpha_map (pixman_image
 
     image_property_changed (image);
 }
 
 PIXMAN_EXPORT void
 pixman_image_set_component_alpha   (pixman_image_t *image,
                                     pixman_bool_t   component_alpha)
 {
+    if (image->common.component_alpha == component_alpha)
+	return;
+
     image->common.component_alpha = component_alpha;
 
     image_property_changed (image);
 }
 
 PIXMAN_EXPORT pixman_bool_t
 pixman_image_get_component_alpha   (pixman_image_t       *image)
 {
@@ -808,22 +756,28 @@ pixman_image_get_format (pixman_image_t 
 {
     if (image->type == BITS)
 	return image->bits.format;
 
     return 0;
 }
 
 uint32_t
-_pixman_image_get_solid (pixman_image_t *     image,
-                         pixman_format_code_t format)
+_pixman_image_get_solid (pixman_implementation_t *imp,
+			 pixman_image_t *         image,
+                         pixman_format_code_t     format)
 {
     uint32_t result;
+    pixman_iter_t iter;
 
-    _pixman_image_get_scanline_32 (image, 0, 0, 1, &result, NULL);
+    _pixman_implementation_src_iter_init (
+	imp, &iter, image, 0, 0, 1, 1,
+	(uint8_t *)&result, ITER_NARROW);
+
+    result = *iter.get_scanline (&iter, NULL);
 
     /* If necessary, convert RGB <--> BGR. */
     if (PIXMAN_FORMAT_TYPE (format) != PIXMAN_TYPE_ARGB)
     {
 	result = (((result & 0xff000000) >>  0) |
 	          ((result & 0x00ff0000) >> 16) |
 	          ((result & 0x0000ff00) >>  0) |
 	          ((result & 0x000000ff) << 16));
--- a/gfx/cairo/libpixman/src/pixman-implementation.c
+++ b/gfx/cairo/libpixman/src/pixman-implementation.c
@@ -106,16 +106,30 @@ delegate_fill (pixman_implementation_t *
                int                      width,
                int                      height,
                uint32_t                 xor)
 {
     return _pixman_implementation_fill (
 	imp->delegate, bits, stride, bpp, x, y, width, height, xor);
 }
 
+static void
+delegate_src_iter_init (pixman_implementation_t *imp,
+			pixman_iter_t *	         iter)
+{
+    imp->delegate->src_iter_init (imp->delegate, iter);
+}
+
+static void
+delegate_dest_iter_init (pixman_implementation_t *imp,
+			 pixman_iter_t *	  iter)
+{
+    imp->delegate->dest_iter_init (imp->delegate, iter);
+}
+
 pixman_implementation_t *
 _pixman_implementation_create (pixman_implementation_t *delegate,
 			       const pixman_fast_path_t *fast_paths)
 {
     pixman_implementation_t *imp = malloc (sizeof (pixman_implementation_t));
     pixman_implementation_t *d;
     int i;
 
@@ -128,27 +142,29 @@ pixman_implementation_t *
     imp->delegate = delegate;
     for (d = imp; d != NULL; d = d->delegate)
 	d->toplevel = imp;
 
     /* Fill out function pointers with ones that just delegate
      */
     imp->blt = delegate_blt;
     imp->fill = delegate_fill;
+    imp->src_iter_init = delegate_src_iter_init;
+    imp->dest_iter_init = delegate_dest_iter_init;
 
     for (i = 0; i < PIXMAN_N_OPERATORS; ++i)
     {
 	imp->combine_32[i] = delegate_combine_32;
 	imp->combine_64[i] = delegate_combine_64;
 	imp->combine_32_ca[i] = delegate_combine_32_ca;
 	imp->combine_64_ca[i] = delegate_combine_64_ca;
     }
 
     imp->fast_paths = fast_paths;
-    
+
     return imp;
 }
 
 void
 _pixman_implementation_combine_32 (pixman_implementation_t * imp,
                                    pixman_op_t               op,
                                    uint32_t *                dest,
                                    const uint32_t *          src,
@@ -220,8 +236,69 @@ pixman_bool_t
                              int                      y,
                              int                      width,
                              int                      height,
                              uint32_t                 xor)
 {
     return (*imp->fill) (imp, bits, stride, bpp, x, y, width, height, xor);
 }
 
+static uint32_t *
+get_scanline_null (pixman_iter_t *iter, const uint32_t *mask)
+{
+    return NULL;
+}
+
+void
+_pixman_implementation_src_iter_init (pixman_implementation_t	*imp,
+				      pixman_iter_t             *iter,
+				      pixman_image_t		*image,
+				      int			 x,
+				      int			 y,
+				      int			 width,
+				      int			 height,
+				      uint8_t			*buffer,
+				      iter_flags_t		 flags)
+{
+    iter->image = image;
+    iter->buffer = (uint32_t *)buffer;
+    iter->x = x;
+    iter->y = y;
+    iter->width = width;
+    iter->height = height;
+    iter->flags = flags;
+
+    if (!image)
+    {
+	iter->get_scanline = get_scanline_null;
+    }
+    else if ((flags & (ITER_IGNORE_ALPHA | ITER_IGNORE_RGB)) ==
+	     (ITER_IGNORE_ALPHA | ITER_IGNORE_RGB))
+    {
+	iter->get_scanline = _pixman_iter_get_scanline_noop;
+    }
+    else
+    {
+	(*imp->src_iter_init) (imp, iter);
+    }
+}
+
+void
+_pixman_implementation_dest_iter_init (pixman_implementation_t	*imp,
+				       pixman_iter_t            *iter,
+				       pixman_image_t		*image,
+				       int			 x,
+				       int			 y,
+				       int			 width,
+				       int			 height,
+				       uint8_t			*buffer,
+				       iter_flags_t		 flags)
+{
+    iter->image = image;
+    iter->buffer = (uint32_t *)buffer;
+    iter->x = x;
+    iter->y = y;
+    iter->width = width;
+    iter->height = height;
+    iter->flags = flags;
+
+    (*imp->dest_iter_init) (imp, iter);
+}
--- a/gfx/cairo/libpixman/src/pixman-linear-gradient.c
+++ b/gfx/cairo/libpixman/src/pixman-linear-gradient.c
@@ -1,8 +1,9 @@
+/* -*- Mode: c; c-basic-offset: 4; tab-width: 8; indent-tabs-mode: t; -*- */
 /*
  * Copyright © 2000 SuSE, Inc.
  * Copyright © 2007 Red Hat, Inc.
  * Copyright © 2000 Keith Packard, member of The XFree86 Project, Inc.
  *             2005 Lars Knoll & Zack Rusin, Trolltech
  *
  * Permission to use, copy, modify, distribute, and sell this software and its
  * documentation for any purpose is hereby granted without fee, provided that
@@ -25,240 +26,237 @@
  */
 
 #ifdef HAVE_CONFIG_H
 #include <config.h>
 #endif
 #include <stdlib.h>
 #include "pixman-private.h"
 
-static source_image_class_t
-linear_gradient_classify (pixman_image_t *image,
-                          int             x,
-                          int             y,
-                          int             width,
-                          int             height)
+static pixman_bool_t
+linear_gradient_is_horizontal (pixman_image_t *image,
+			       int             x,
+			       int             y,
+			       int             width,
+			       int             height)
 {
     linear_gradient_t *linear = (linear_gradient_t *)image;
     pixman_vector_t v;
     pixman_fixed_32_32_t l;
-    pixman_fixed_48_16_t dx, dy, a, b, off;
-    pixman_fixed_48_16_t factors[4];
-    int i;
+    pixman_fixed_48_16_t dx, dy;
+    double inc;
 
-    image->source.class = SOURCE_IMAGE_CLASS_UNKNOWN;
+    if (image->common.transform)
+    {
+	/* projective transformation */
+	if (image->common.transform->matrix[2][0] != 0 ||
+	    image->common.transform->matrix[2][1] != 0 ||
+	    image->common.transform->matrix[2][2] == 0)
+	{
+	    return FALSE;
+	}
+
+	v.vector[0] = image->common.transform->matrix[0][1];
+	v.vector[1] = image->common.transform->matrix[1][1];
+	v.vector[2] = image->common.transform->matrix[2][2];
+    }
+    else
+    {
+	v.vector[0] = 0;
+	v.vector[1] = pixman_fixed_1;
+	v.vector[2] = pixman_fixed_1;
+    }
 
     dx = linear->p2.x - linear->p1.x;
     dy = linear->p2.y - linear->p1.y;
 
     l = dx * dx + dy * dy;
 
-    if (l)
-    {
-	a = (dx << 32) / l;
-	b = (dy << 32) / l;
-    }
-    else
-    {
-	a = b = 0;
-    }
-
-    off = (-a * linear->p1.x
-           -b * linear->p1.y) >> 16;
-
-    for (i = 0; i < 3; i++)
-    {
-	v.vector[0] = pixman_int_to_fixed ((i % 2) * (width  - 1) + x);
-	v.vector[1] = pixman_int_to_fixed ((i / 2) * (height - 1) + y);
-	v.vector[2] = pixman_fixed_1;
+    if (l == 0)
+	return FALSE;
 
-	if (image->common.transform)
-	{
-	    if (!pixman_transform_point_3d (image->common.transform, &v))
-	    {
-		image->source.class = SOURCE_IMAGE_CLASS_UNKNOWN;
-
-		return image->source.class;
-	    }
-	}
+    /*
+     * compute how much the input of the gradient walker changes
+     * when moving vertically through the whole image
+     */
+    inc = height * (double) pixman_fixed_1 * pixman_fixed_1 *
+	(dx * v.vector[0] + dy * v.vector[1]) /
+	(v.vector[2] * (double) l);
 
-	factors[i] = ((a * v.vector[0] + b * v.vector[1]) >> 16) + off;
-    }
+    /* check that casting to integer would result in 0 */
+    if (-1 < inc && inc < 1)
+	return TRUE;
 
-    if (factors[2] == factors[0])
-	image->source.class = SOURCE_IMAGE_CLASS_HORIZONTAL;
-    else if (factors[1] == factors[0])
-	image->source.class = SOURCE_IMAGE_CLASS_VERTICAL;
-
-    return image->source.class;
+    return FALSE;
 }
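
(Worked derivation of the check above: the gradient parameter at a point p is t(p) = (p - p1) . (p2 - p1) / l, with l = |p2 - p1|^2. Stepping one row down the image advances the homogeneous point by the matrix column (v[0], v[1], v[2]) used above, so across the whole image t changes by height * (dx*v[0] + dy*v[1]) / (v[2] * l); the two pixman_fixed_1 factors only undo the 16.16 scaling of the fixed-point inputs. If that total change truncates to zero, every scanline evaluates to the same t values and the gradient is effectively horizontal, so a single scanline can be computed once and reused.)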
 
-static void
-linear_gradient_get_scanline_32 (pixman_image_t *image,
-                                 int             x,
-                                 int             y,
-                                 int             width,
-                                 uint32_t *      buffer,
-                                 const uint32_t *mask)
+static uint32_t *
+linear_get_scanline_narrow (pixman_iter_t  *iter,
+			    const uint32_t *mask)
 {
+    pixman_image_t *image  = iter->image;
+    int             x      = iter->x;
+    int             y      = iter->y;
+    int             width  = iter->width;
+    uint32_t *      buffer = iter->buffer;
+
     pixman_vector_t v, unit;
     pixman_fixed_32_32_t l;
-    pixman_fixed_48_16_t dx, dy, a, b, off;
+    pixman_fixed_48_16_t dx, dy;
     gradient_t *gradient = (gradient_t *)image;
-    source_image_t *source = (source_image_t *)image;
     linear_gradient_t *linear = (linear_gradient_t *)image;
     uint32_t *end = buffer + width;
     pixman_gradient_walker_t walker;
 
-    _pixman_gradient_walker_init (&walker, gradient, source->common.repeat);
+    _pixman_gradient_walker_init (&walker, gradient, image->common.repeat);
 
     /* reference point is the center of the pixel */
     v.vector[0] = pixman_int_to_fixed (x) + pixman_fixed_1 / 2;
     v.vector[1] = pixman_int_to_fixed (y) + pixman_fixed_1 / 2;
     v.vector[2] = pixman_fixed_1;
 
-    if (source->common.transform)
+    if (image->common.transform)
     {
-	if (!pixman_transform_point_3d (source->common.transform, &v))
-	    return;
-	
-	unit.vector[0] = source->common.transform->matrix[0][0];
-	unit.vector[1] = source->common.transform->matrix[1][0];
-	unit.vector[2] = source->common.transform->matrix[2][0];
+	if (!pixman_transform_point_3d (image->common.transform, &v))
+	    return iter->buffer;
+
+	unit.vector[0] = image->common.transform->matrix[0][0];
+	unit.vector[1] = image->common.transform->matrix[1][0];
+	unit.vector[2] = image->common.transform->matrix[2][0];
     }
     else
     {
 	unit.vector[0] = pixman_fixed_1;
 	unit.vector[1] = 0;
 	unit.vector[2] = 0;
     }
 
     dx = linear->p2.x - linear->p1.x;
     dy = linear->p2.y - linear->p1.y;
 
     l = dx * dx + dy * dy;
 
-    if (l != 0)
+    if (l == 0 || unit.vector[2] == 0)
     {
-	a = (dx << 32) / l;
-	b = (dy << 32) / l;
-	off = (-a * linear->p1.x
-	       -b * linear->p1.y) >> 16;
-    }
+	/* affine transformation only */
+        pixman_fixed_32_32_t t, next_inc;
+	double inc;
 
-    if (l == 0 || (unit.vector[2] == 0 && v.vector[2] == pixman_fixed_1))
-    {
-	pixman_fixed_48_16_t inc, t;
-
-	/* affine transformation only */
-	if (l == 0)
+	if (l == 0 || v.vector[2] == 0)
 	{
 	    t = 0;
 	    inc = 0;
 	}
 	else
 	{
-	    t = ((a * v.vector[0] + b * v.vector[1]) >> 16) + off;
-	    inc = (a * unit.vector[0] + b * unit.vector[1]) >> 16;
+	    double invden, v2;
+
+	    invden = pixman_fixed_1 * (double) pixman_fixed_1 /
+		(l * (double) v.vector[2]);
+	    v2 = v.vector[2] * (1. / pixman_fixed_1);
+	    t = ((dx * v.vector[0] + dy * v.vector[1]) - 
+		 (dx * linear->p1.x + dy * linear->p1.y) * v2) * invden;
+	    inc = (dx * unit.vector[0] + dy * unit.vector[1]) * invden;
 	}
+	next_inc = 0;
 
-	if (source->class == SOURCE_IMAGE_CLASS_VERTICAL)
+	if ((pixman_fixed_32_32_t) (inc * width) == 0)
 	{
 	    register uint32_t color;
 
 	    color = _pixman_gradient_walker_pixel (&walker, t);
 	    while (buffer < end)
 		*buffer++ = color;
 	}
 	else
 	{
-	    if (!mask)
+	    int i;
+
+	    i = 0;
+	    while (buffer < end)
 	    {
-		while (buffer < end)
+		if (!mask || *mask++)
 		{
-		    *buffer++ = _pixman_gradient_walker_pixel (&walker, t);
-		    
-		    t += inc;
+		    *buffer = _pixman_gradient_walker_pixel (&walker,
+							     t + next_inc);
 		}
-	    }
-	    else
-	    {
-		while (buffer < end)
-		{
-		    if (*mask++)
-			*buffer = _pixman_gradient_walker_pixel (&walker, t);
-
-		    buffer++;
-		    t += inc;
-		}
+		i++;
+		next_inc = inc * i;
+		buffer++;
 	    }
 	}
     }
     else
     {
 	/* projective transformation */
-	pixman_fixed_48_16_t t;
+        double t;
 
-	if (source->class == SOURCE_IMAGE_CLASS_VERTICAL)
+	t = 0;
+
+	while (buffer < end)
 	{
-	    register uint32_t color;
-
-	    if (v.vector[2] == 0)
+	    if (!mask || *mask++)
 	    {
-		t = 0;
-	    }
-	    else
-	    {
-		pixman_fixed_48_16_t x, y;
+	        if (v.vector[2] != 0)
+		{
+		    double invden, v2;
 
-		x = ((pixman_fixed_48_16_t) v.vector[0] << 16) / v.vector[2];
-		y = ((pixman_fixed_48_16_t) v.vector[1] << 16) / v.vector[2];
-		t = ((a * x + b * y) >> 16) + off;
+		    invden = pixman_fixed_1 * (double) pixman_fixed_1 /
+			(l * (double) v.vector[2]);
+		    v2 = v.vector[2] * (1. / pixman_fixed_1);
+		    t = ((dx * v.vector[0] + dy * v.vector[1]) - 
+			 (dx * linear->p1.x + dy * linear->p1.y) * v2) * invden;
+		}
+
+		*buffer = _pixman_gradient_walker_pixel (&walker, t);
 	    }
 
-	    color = _pixman_gradient_walker_pixel (&walker, t);
-	    while (buffer < end)
-		*buffer++ = color;
-	}
-	else
-	{
-	    while (buffer < end)
-	    {
-		if (!mask || *mask++)
-		{
-		    if (v.vector[2] == 0)
-		    {
-			t = 0;
-		    }
-		    else
-		    {
-			pixman_fixed_48_16_t x, y;
-			x = ((pixman_fixed_48_16_t)v.vector[0] << 16) / v.vector[2];
-			y = ((pixman_fixed_48_16_t)v.vector[1] << 16) / v.vector[2];
-			t = ((a * x + b * y) >> 16) + off;
-		    }
+	    ++buffer;
 
-		    *buffer = _pixman_gradient_walker_pixel (&walker, t);
-		}
-
-		++buffer;
-
-		v.vector[0] += unit.vector[0];
-		v.vector[1] += unit.vector[1];
-		v.vector[2] += unit.vector[2];
-	    }
+	    v.vector[0] += unit.vector[0];
+	    v.vector[1] += unit.vector[1];
+	    v.vector[2] += unit.vector[2];
 	}
     }
+
+    iter->y++;
+
+    return iter->buffer;
+}
+
+static uint32_t *
+linear_get_scanline_wide (pixman_iter_t *iter, const uint32_t *mask)
+{
+    uint32_t *buffer = linear_get_scanline_narrow (iter, NULL);
+
+    pixman_expand ((uint64_t *)buffer, buffer, PIXMAN_a8r8g8b8, iter->width);
+
+    return buffer;
 }
 
-static void
-linear_gradient_property_changed (pixman_image_t *image)
+void
+_pixman_linear_gradient_iter_init (pixman_image_t *image, pixman_iter_t  *iter)
 {
-    image->common.get_scanline_32 = linear_gradient_get_scanline_32;
-    image->common.get_scanline_64 = _pixman_image_get_scanline_generic_64;
+    if (linear_gradient_is_horizontal (
+	    iter->image, iter->x, iter->y, iter->width, iter->height))
+    {
+	if (iter->flags & ITER_NARROW)
+	    linear_get_scanline_narrow (iter, NULL);
+	else
+	    linear_get_scanline_wide (iter, NULL);
+
+	iter->get_scanline = _pixman_iter_get_scanline_noop;
+    }
+    else
+    {
+	if (iter->flags & ITER_NARROW)
+	    iter->get_scanline = linear_get_scanline_narrow;
+	else
+	    iter->get_scanline = linear_get_scanline_wide;
+    }
 }
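
(The no-op getter used above just returns the buffer that was filled once at init time; a minimal sketch consistent with its use here, assuming the actual definition lives in pixman-utils.c:)

    uint32_t *
    _pixman_iter_get_scanline_noop (pixman_iter_t *iter, const uint32_t *mask)
    {
        /* the scanline was already computed during iter init */
        return iter->buffer;
    }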
 
 PIXMAN_EXPORT pixman_image_t *
 pixman_image_create_linear_gradient (pixman_point_fixed_t *        p1,
                                      pixman_point_fixed_t *        p2,
                                      const pixman_gradient_stop_t *stops,
                                      int                           n_stops)
 {
@@ -277,15 +275,12 @@ pixman_image_create_linear_gradient (pix
 	free (image);
 	return NULL;
     }
 
     linear->p1 = *p1;
     linear->p2 = *p2;
 
     image->type = LINEAR;
-    image->source.class = SOURCE_IMAGE_CLASS_UNKNOWN;
-    image->common.classify = linear_gradient_classify;
-    image->common.property_changed = linear_gradient_property_changed;
 
     return image;
 }
 
--- a/gfx/cairo/libpixman/src/pixman-matrix.c
+++ b/gfx/cairo/libpixman/src/pixman-matrix.c
@@ -420,17 +420,18 @@ pixman_transform_is_int_translate (const
 }
 
 PIXMAN_EXPORT pixman_bool_t
 pixman_transform_is_inverse (const struct pixman_transform *a,
                              const struct pixman_transform *b)
 {
     struct pixman_transform t;
 
-    pixman_transform_multiply (&t, a, b);
+    if (!pixman_transform_multiply (&t, a, b))
+	return FALSE;
 
     return pixman_transform_is_identity (&t);
 }
 
 PIXMAN_EXPORT void
 pixman_f_transform_from_pixman_transform (struct pixman_f_transform *    ft,
                                           const struct pixman_transform *t)
 {
@@ -459,19 +460,16 @@ pixman_transform_from_pixman_f_transform
 	    d = d * 65536.0 + 0.5;
 	    t->matrix[j][i] = (pixman_fixed_t) floor (d);
 	}
     }
     
     return TRUE;
 }
 
-static const int a[3] = { 3, 3, 2 };
-static const int b[3] = { 2, 1, 1 };
-
 PIXMAN_EXPORT pixman_bool_t
 pixman_f_transform_invert (struct pixman_f_transform *      dst,
                            const struct pixman_f_transform *src)
 {
     double det;
     int i, j;
     static int a[3] = { 2, 2, 1 };
     static int b[3] = { 1, 0, 0 };
--- a/gfx/cairo/libpixman/src/pixman-mmx.c
+++ b/gfx/cairo/libpixman/src/pixman-mmx.c
@@ -1103,17 +1103,17 @@ mmx_composite_over_n_8888 (pixman_implem
     uint32_t src;
     uint32_t    *dst_line, *dst;
     int32_t w;
     int dst_stride;
     __m64 vsrc, vsrca;
 
     CHECKPOINT ();
 
-    src = _pixman_image_get_solid (src_image, dst_image->bits.format);
+    src = _pixman_image_get_solid (imp, src_image, dst_image->bits.format);
 
     if (src == 0)
 	return;
 
     PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
 
     vsrc = load8888 (src);
     vsrca = expand_alpha (vsrc);
@@ -1182,17 +1182,17 @@ mmx_composite_over_n_0565 (pixman_implem
     uint32_t src;
     uint16_t    *dst_line, *dst;
     int32_t w;
     int dst_stride;
     __m64 vsrc, vsrca;
 
     CHECKPOINT ();
 
-    src = _pixman_image_get_solid (src_image, dst_image->bits.format);
+    src = _pixman_image_get_solid (imp, src_image, dst_image->bits.format);
 
     if (src == 0)
 	return;
 
     PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, uint16_t, dst_stride, dst_line, 1);
 
     vsrc = load8888 (src);
     vsrca = expand_alpha (vsrc);
@@ -1270,17 +1270,17 @@ mmx_composite_over_n_8888_8888_ca (pixma
     uint32_t src, srca;
     uint32_t    *dst_line;
     uint32_t    *mask_line;
     int dst_stride, mask_stride;
     __m64 vsrc, vsrca;
 
     CHECKPOINT ();
 
-    src = _pixman_image_get_solid (src_image, dst_image->bits.format);
+    src = _pixman_image_get_solid (imp, src_image, dst_image->bits.format);
 
     srca = src >> 24;
     if (src == 0)
 	return;
 
     PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
     PIXMAN_IMAGE_GET_LINE (mask_image, mask_x, mask_y, uint32_t, mask_stride, mask_line, 1);
 
@@ -1379,17 +1379,17 @@ mmx_composite_over_8888_n_8888 (pixman_i
     int32_t w;
     __m64 srca;
 
     CHECKPOINT ();
 
     PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
     PIXMAN_IMAGE_GET_LINE (src_image, src_x, src_y, uint32_t, src_stride, src_line, 1);
 
-    mask = _pixman_image_get_solid (mask_image, dst_image->bits.format);
+    mask = _pixman_image_get_solid (imp, mask_image, dst_image->bits.format);
     mask &= 0xff000000;
     mask = mask | mask >> 8 | mask >> 16 | mask >> 24;
     vmask = load8888 (mask);
     srca = MC (4x00ff);
 
     while (height--)
     {
 	dst = dst_line;
@@ -1464,17 +1464,17 @@ mmx_composite_over_x888_n_8888 (pixman_i
     int dst_stride, src_stride;
     int32_t w;
     __m64 srca;
 
     CHECKPOINT ();
 
     PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
     PIXMAN_IMAGE_GET_LINE (src_image, src_x, src_y, uint32_t, src_stride, src_line, 1);
-    mask = _pixman_image_get_solid (mask_image, dst_image->bits.format);
+    mask = _pixman_image_get_solid (imp, mask_image, dst_image->bits.format);
 
     mask &= 0xff000000;
     mask = mask | mask >> 8 | mask >> 16 | mask >> 24;
     vmask = load8888 (mask);
     srca = MC (4x00ff);
 
     while (height--)
     {
@@ -1759,17 +1759,17 @@ mmx_composite_over_n_8_8888 (pixman_impl
     uint8_t *mask_line, *mask;
     int dst_stride, mask_stride;
     int32_t w;
     __m64 vsrc, vsrca;
     uint64_t srcsrc;
 
     CHECKPOINT ();
 
-    src = _pixman_image_get_solid (src_image, dst_image->bits.format);
+    src = _pixman_image_get_solid (imp, src_image, dst_image->bits.format);
 
     srca = src >> 24;
     if (src == 0)
 	return;
 
     srcsrc = (uint64_t)src << 32 | src;
 
     PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
@@ -1916,18 +1916,18 @@ pixman_fill_mmx (uint32_t *bits,
     __asm__ (
         "movq		%7,	%0\n"
         "movq		%7,	%1\n"
         "movq		%7,	%2\n"
         "movq		%7,	%3\n"
         "movq		%7,	%4\n"
         "movq		%7,	%5\n"
         "movq		%7,	%6\n"
-	: "=y" (v1), "=y" (v2), "=y" (v3),
-	  "=y" (v4), "=y" (v5), "=y" (v6), "=y" (v7)
+	: "=&y" (v1), "=&y" (v2), "=&y" (v3),
+	  "=&y" (v4), "=&y" (v5), "=&y" (v6), "=y" (v7)
 	: "y" (vfill));
 #endif
 
     while (height--)
     {
 	int w;
 	uint8_t *d = byte_line;
 
@@ -2033,17 +2033,17 @@ mmx_composite_src_n_8_8888 (pixman_imple
     uint8_t     *mask_line, *mask;
     int dst_stride, mask_stride;
     int32_t w;
     __m64 vsrc, vsrca;
     uint64_t srcsrc;
 
     CHECKPOINT ();
 
-    src = _pixman_image_get_solid (src_image, dst_image->bits.format);
+    src = _pixman_image_get_solid (imp, src_image, dst_image->bits.format);
 
     srca = src >> 24;
     if (src == 0)
     {
 	pixman_fill_mmx (dst_image->bits.bits, dst_image->bits.rowstride,
 			 PIXMAN_FORMAT_BPP (dst_image->bits.format),
 	                 dest_x, dest_y, width, height, 0);
 	return;
@@ -2168,17 +2168,17 @@ mmx_composite_over_n_8_0565 (pixman_impl
     uint8_t *mask_line, *mask;
     int dst_stride, mask_stride;
     int32_t w;
     __m64 vsrc, vsrca, tmp;
     uint64_t srcsrcsrcsrc, src16;
 
     CHECKPOINT ();
 
-    src = _pixman_image_get_solid (src_image, dst_image->bits.format);
+    src = _pixman_image_get_solid (imp, src_image, dst_image->bits.format);
 
     srca = src >> 24;
     if (src == 0)
 	return;
 
     PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, uint16_t, dst_stride, dst_line, 1);
     PIXMAN_IMAGE_GET_LINE (mask_image, mask_x, mask_y, uint8_t, mask_stride, mask_line, 1);
 
@@ -2527,17 +2527,17 @@ mmx_composite_over_n_8888_0565_ca (pixma
     uint32_t src, srca;
     uint16_t    *dst_line;
     uint32_t    *mask_line;
     int dst_stride, mask_stride;
     __m64 vsrc, vsrca;
 
     CHECKPOINT ();
 
-    src = _pixman_image_get_solid (src_image, dst_image->bits.format);
+    src = _pixman_image_get_solid (imp, src_image, dst_image->bits.format);
 
     srca = src >> 24;
     if (src == 0)
 	return;
 
     PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, uint16_t, dst_stride, dst_line, 1);
     PIXMAN_IMAGE_GET_LINE (mask_image, mask_x, mask_y, uint32_t, mask_stride, mask_line, 1);
 
@@ -2638,17 +2638,17 @@ mmx_composite_in_n_8_8 (pixman_implement
     int32_t w;
     uint32_t src;
     uint8_t sa;
     __m64 vsrc, vsrca;
 
     PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, uint8_t, dst_stride, dst_line, 1);
     PIXMAN_IMAGE_GET_LINE (mask_image, mask_x, mask_y, uint8_t, mask_stride, mask_line, 1);
 
-    src = _pixman_image_get_solid (src_image, dst_image->bits.format);
+    src = _pixman_image_get_solid (imp, src_image, dst_image->bits.format);
 
     sa = src >> 24;
 
     vsrc = load8888 (src);
     vsrca = expand_alpha (vsrc);
 
     while (height--)
     {
@@ -2785,17 +2785,17 @@ mmx_composite_add_n_8_8 (pixman_implemen
     int32_t w;
     uint32_t src;
     uint8_t sa;
     __m64 vsrc, vsrca;
 
     PIXMAN_IMAGE_GET_LINE (dst_image, dest_x, dest_y, uint8_t, dst_stride, dst_line, 1);
     PIXMAN_IMAGE_GET_LINE (mask_image, mask_x, mask_y, uint8_t, mask_stride, mask_line, 1);
 
-    src = _pixman_image_get_solid (src_image, dst_image->bits.format);
+    src = _pixman_image_get_solid (imp, src_image, dst_image->bits.format);
 
     sa = src >> 24;
 
     if (src == 0)
 	return;
 
     vsrc = load8888 (src);
     vsrca = expand_alpha (vsrc);
@@ -3335,20 +3335,19 @@ mmx_fill (pixman_implementation_t *imp,
 	return _pixman_implementation_fill (
 	    imp->delegate, bits, stride, bpp, x, y, width, height, xor);
     }
 
     return TRUE;
 }
 
 pixman_implementation_t *
-_pixman_implementation_create_mmx (void)
+_pixman_implementation_create_mmx (pixman_implementation_t *fallback)
 {
-    pixman_implementation_t *general = _pixman_implementation_create_fast_path ();
-    pixman_implementation_t *imp = _pixman_implementation_create (general, mmx_fast_paths);
+    pixman_implementation_t *imp = _pixman_implementation_create (fallback, mmx_fast_paths);
 
     imp->combine_32[PIXMAN_OP_OVER] = mmx_combine_over_u;
     imp->combine_32[PIXMAN_OP_OVER_REVERSE] = mmx_combine_over_reverse_u;
     imp->combine_32[PIXMAN_OP_IN] = mmx_combine_in_u;
     imp->combine_32[PIXMAN_OP_IN_REVERSE] = mmx_combine_in_reverse_u;
     imp->combine_32[PIXMAN_OP_OUT] = mmx_combine_out_u;
     imp->combine_32[PIXMAN_OP_OUT_REVERSE] = mmx_combine_out_reverse_u;
     imp->combine_32[PIXMAN_OP_ATOP] = mmx_combine_atop_u;
--- a/gfx/cairo/libpixman/src/pixman-private.h
+++ b/gfx/cairo/libpixman/src/pixman-private.h
@@ -15,17 +15,16 @@
 #include <string.h>
 
 #include "pixman-compiler.h"
 
 /*
  * Images
  */
 typedef struct image_common image_common_t;
-typedef struct source_image source_image_t;
 typedef struct solid_fill solid_fill_t;
 typedef struct gradient gradient_t;
 typedef struct linear_gradient linear_gradient_t;
 typedef struct horizontal_gradient horizontal_gradient_t;
 typedef struct vertical_gradient vertical_gradient_t;
 typedef struct conical_gradient conical_gradient_t;
 typedef struct radial_gradient radial_gradient_t;
 typedef struct bits_image bits_image_t;
@@ -56,28 +55,16 @@ typedef enum
 {
     BITS,
     LINEAR,
     CONICAL,
     RADIAL,
     SOLID
 } image_type_t;
 
-typedef enum
-{
-    SOURCE_IMAGE_CLASS_UNKNOWN,
-    SOURCE_IMAGE_CLASS_HORIZONTAL,
-    SOURCE_IMAGE_CLASS_VERTICAL,
-} source_image_class_t;
-
-typedef source_image_class_t (*classify_func_t) (pixman_image_t *image,
-						int             x,
-						int             y,
-						int             width,
-						int             height);
 typedef void (*property_changed_func_t) (pixman_image_t *image);
 
 struct image_common
 {
     image_type_t                type;
     int32_t                     ref_count;
     pixman_region32_t           clip_region;
     int32_t			alpha_count;	    /* How many times this image is being used as an alpha map */
@@ -92,49 +79,39 @@ struct image_common
     pixman_repeat_t             repeat;
     pixman_filter_t             filter;
     pixman_fixed_t *            filter_params;
     int                         n_filter_params;
     bits_image_t *              alpha_map;
     int                         alpha_origin_x;
     int                         alpha_origin_y;
     pixman_bool_t               component_alpha;
-    classify_func_t             classify;
     property_changed_func_t     property_changed;
-    fetch_scanline_t            get_scanline_32;
-    fetch_scanline_t            get_scanline_64;
 
     pixman_image_destroy_func_t destroy_func;
     void *                      destroy_data;
 
     uint32_t			flags;
     pixman_format_code_t	extended_format_code;
 };
 
-struct source_image
+struct solid_fill
 {
     image_common_t common;
-    source_image_class_t class;
-};
-
-struct solid_fill
-{
-    source_image_t common;
     pixman_color_t color;
     
     uint32_t	   color_32;
     uint64_t	   color_64;
 };
 
 struct gradient
 {
-    source_image_t          common;
+    image_common_t	    common;
     int                     n_stops;
     pixman_gradient_stop_t *stops;
-    int                     stop_range;
 };
 
 struct linear_gradient
 {
     gradient_t           common;
     pixman_point_fixed_t p1;
     pixman_point_fixed_t p2;
 };
@@ -172,16 +149,19 @@ struct bits_image
     pixman_format_code_t       format;
     const pixman_indexed_t *   indexed;
     int                        width;
     int                        height;
     uint32_t *                 bits;
     uint32_t *                 free_me;
     int                        rowstride;  /* in number of uint32_t's */
 
+    fetch_scanline_t           get_scanline_32;
+    fetch_scanline_t           get_scanline_64;
+
     fetch_scanline_t           fetch_scanline_32;
     fetch_pixel_32_t	       fetch_pixel_32;
     store_scanline_t           store_scanline_32;
 
     fetch_scanline_t           fetch_scanline_64;
     fetch_pixel_64_t	       fetch_pixel_64;
     store_scanline_t           store_scanline_64;
 
@@ -190,95 +170,104 @@ struct bits_image
     pixman_write_memory_func_t write_func;
 };
 
 union pixman_image
 {
     image_type_t       type;
     image_common_t     common;
     bits_image_t       bits;
-    source_image_t     source;
     gradient_t         gradient;
     linear_gradient_t  linear;
     conical_gradient_t conical;
     radial_gradient_t  radial;
     solid_fill_t       solid;
 };
 
+typedef struct pixman_iter_t pixman_iter_t;
+typedef uint32_t *(* pixman_iter_get_scanline_t) (pixman_iter_t *iter, const uint32_t *mask);
+typedef void      (* pixman_iter_write_back_t)   (pixman_iter_t *iter);
+
+typedef enum
+{
+    ITER_NARROW =		(1 << 0),
+
+    /* "Localized alpha" is when the alpha channel is used only to compute
+     * the alpha value of the destination. This means that the computation
+     * of the RGB values of the result is independent of the alpha value.
+     *
+     * For example, the OVER operator has localized alpha for the
+     * destination, because the RGB values of the result can be computed
+     * without knowing the destination alpha. Similarly, ADD has localized
+     * alpha for both source and destination because the RGB values of the
+     * result can be computed without knowing the alpha value of source or
+     * destination.
+     *
+     * When the destination is xRGB, this is useful knowledge, because then
+     * we can treat it as if it were ARGB, which means in some cases we can
+     * avoid copying it to a temporary buffer.
+     */
+    ITER_LOCALIZED_ALPHA =	(1 << 1),
+    ITER_IGNORE_ALPHA =		(1 << 2),
+    ITER_IGNORE_RGB =		(1 << 3)
+} iter_flags_t;
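
(Worked example of "localized alpha": for OVER, each destination channel is computed as dest' = src + (1 - src.alpha) * dest, so the RGB channels of the result never read dest.alpha. That is why an x8r8g8b8 destination can be iterated as if it were a8r8g8b8 when the operator has localized alpha for the destination.)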
+
+struct pixman_iter_t
+{
+    /* These are initialized by _pixman_implementation_{src,dest}_init */
+    pixman_image_t *		image;
+    uint32_t *			buffer;
+    int				x, y;
+    int				width;
+    int				height;
+    iter_flags_t		flags;
+
+    /* These function pointers are initialized by the implementation */
+    pixman_iter_get_scanline_t	get_scanline;
+    pixman_iter_write_back_t	write_back;
+
+    /* These fields are scratch data that implementations can use */
+    uint8_t *			bits;
+    int				stride;
+};
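
(Taken together, these fields imply the per-scanline protocol that general_composite_rect follows; a sketch, assuming src_iter, mask_iter and dest_iter were set up by the init functions declared below, and that the usual composite-function locals i, height, width, op and imp are in scope:)

    for (i = 0; i < height; ++i)
    {
        uint32_t *m = mask_iter.get_scanline (&mask_iter, NULL);
        uint32_t *s = src_iter.get_scanline (&src_iter, m);
        uint32_t *d = dest_iter.get_scanline (&dest_iter, NULL);

        compose (imp->toplevel, op, d, s, m, width);

        dest_iter.write_back (&dest_iter);
    }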
+
 void
 _pixman_bits_image_setup_accessors (bits_image_t *image);
 
 void
-_pixman_image_get_scanline_generic_64  (pixman_image_t *image,
-                                        int             x,
-                                        int             y,
-                                        int             width,
-                                        uint32_t *      buffer,
-                                        const uint32_t *mask);
+_pixman_bits_image_src_iter_init (pixman_image_t *image, pixman_iter_t *iter);
 
-source_image_class_t
-_pixman_image_classify (pixman_image_t *image,
-                        int             x,
-                        int             y,
-                        int             width,
-                        int             height);
+void
+_pixman_bits_image_dest_iter_init (pixman_image_t *image, pixman_iter_t *iter);
 
 void
-_pixman_image_get_scanline_32 (pixman_image_t *image,
-                               int             x,
-                               int             y,
-                               int             width,
-                               uint32_t *      buffer,
-                               const uint32_t *mask);
-
-/* Even thought the type of buffer is uint32_t *, the function actually expects
- * a uint64_t *buffer.
- */
-void
-_pixman_image_get_scanline_64 (pixman_image_t *image,
-                               int             x,
-                               int             y,
-                               int             width,
-                               uint32_t *      buffer,
-                               const uint32_t *unused);
+_pixman_solid_fill_iter_init (pixman_image_t *image, pixman_iter_t  *iter);
 
 void
-_pixman_image_store_scanline_32 (bits_image_t *  image,
-                                 int             x,
-                                 int             y,
-                                 int             width,
-                                 const uint32_t *buffer);
+_pixman_linear_gradient_iter_init (pixman_image_t *image, pixman_iter_t  *iter);
 
-/* Even though the type of buffer is uint32_t *, the function
- * actually expects a uint64_t *buffer.
- */
 void
-_pixman_image_store_scanline_64 (bits_image_t *  image,
-                                 int             x,
-                                 int             y,
-                                 int             width,
-                                 const uint32_t *buffer);
+_pixman_radial_gradient_iter_init (pixman_image_t *image, pixman_iter_t *iter);
+
+void
+_pixman_conical_gradient_iter_init (pixman_image_t *image, pixman_iter_t *iter);
 
 pixman_image_t *
 _pixman_image_allocate (void);
 
 pixman_bool_t
 _pixman_init_gradient (gradient_t *                  gradient,
                        const pixman_gradient_stop_t *stops,
                        int                           n_stops);
 void
 _pixman_image_reset_clip_region (pixman_image_t *image);
 
 void
 _pixman_image_validate (pixman_image_t *image);
 
-uint32_t
-_pixman_image_get_solid (pixman_image_t *     image,
-                         pixman_format_code_t format);
-
 #define PIXMAN_IMAGE_GET_LINE(image, x, y, type, out_stride, line, mul)	\
     do									\
     {									\
 	uint32_t *__bits__;						\
 	int       __stride__;						\
         								\
 	__bits__ = image->bits.bits;					\
 	__stride__ = image->bits.rowstride;				\
@@ -401,16 +390,18 @@ typedef pixman_bool_t (*pixman_fill_func
 					     uint32_t *               bits,
 					     int                      stride,
 					     int                      bpp,
 					     int                      x,
 					     int                      y,
 					     int                      width,
 					     int                      height,
 					     uint32_t                 xor);
+typedef void (*pixman_iter_init_func_t) (pixman_implementation_t *imp,
+                                         pixman_iter_t           *iter);
 
 void _pixman_setup_combiner_functions_32 (pixman_implementation_t *imp);
 void _pixman_setup_combiner_functions_64 (pixman_implementation_t *imp);
 
 typedef struct
 {
     pixman_op_t             op;
     pixman_format_code_t    src_format;
@@ -422,26 +413,33 @@ typedef struct
     pixman_composite_func_t func;
 } pixman_fast_path_t;
 
 struct pixman_implementation_t
 {
     pixman_implementation_t *	toplevel;
     pixman_implementation_t *	delegate;
     const pixman_fast_path_t *	fast_paths;
-    
+
     pixman_blt_func_t		blt;
     pixman_fill_func_t		fill;
+    pixman_iter_init_func_t     src_iter_init;
+    pixman_iter_init_func_t     dest_iter_init;
 
     pixman_combine_32_func_t	combine_32[PIXMAN_N_OPERATORS];
     pixman_combine_32_func_t	combine_32_ca[PIXMAN_N_OPERATORS];
     pixman_combine_64_func_t	combine_64[PIXMAN_N_OPERATORS];
     pixman_combine_64_func_t	combine_64_ca[PIXMAN_N_OPERATORS];
 };
 
+uint32_t
+_pixman_image_get_solid (pixman_implementation_t *imp,
+			 pixman_image_t *         image,
+                         pixman_format_code_t     format);
+
 pixman_implementation_t *
 _pixman_implementation_create (pixman_implementation_t *delegate,
 			       const pixman_fast_path_t *fast_paths);
 
 void
 _pixman_implementation_combine_32 (pixman_implementation_t *imp,
                                    pixman_op_t              op,
                                    uint32_t *               dest,
@@ -491,56 +489,80 @@ pixman_bool_t
                              int                      stride,
                              int                      bpp,
                              int                      x,
                              int                      y,
                              int                      width,
                              int                      height,
                              uint32_t                 xor);
 
+void
+_pixman_implementation_src_iter_init (pixman_implementation_t       *imp,
+				      pixman_iter_t                 *iter,
+				      pixman_image_t                *image,
+				      int                            x,
+				      int                            y,
+				      int                            width,
+				      int                            height,
+				      uint8_t                       *buffer,
+				      iter_flags_t                   flags);
+
+void
+_pixman_implementation_dest_iter_init (pixman_implementation_t       *imp,
+				       pixman_iter_t                 *iter,
+				       pixman_image_t                *image,
+				       int                            x,
+				       int                            y,
+				       int                            width,
+				       int                            height,
+				       uint8_t                       *buffer,
+				       iter_flags_t                   flags);
+
 /* Specific implementations */
 pixman_implementation_t *
 _pixman_implementation_create_general (void);
 
 pixman_implementation_t *
-_pixman_implementation_create_fast_path (void);
+_pixman_implementation_create_fast_path (pixman_implementation_t *fallback);
 
 #ifdef USE_MMX
 pixman_implementation_t *
-_pixman_implementation_create_mmx (void);
+_pixman_implementation_create_mmx (pixman_implementation_t *fallback);
 #endif
 
 #ifdef USE_SSE2
 pixman_implementation_t *
-_pixman_implementation_create_sse2 (void);
+_pixman_implementation_create_sse2 (pixman_implementation_t *fallback);
 #endif
 
 #ifdef USE_ARM_SIMD
 pixman_implementation_t *
-_pixman_implementation_create_arm_simd (void);
+_pixman_implementation_create_arm_simd (pixman_implementation_t *fallback);
 #endif
 
 #ifdef USE_ARM_NEON
 pixman_implementation_t *
-_pixman_implementation_create_arm_neon (void);
+_pixman_implementation_create_arm_neon (pixman_implementation_t *fallback);
 #endif
 
 #ifdef USE_VMX
 pixman_implementation_t *
-_pixman_implementation_create_vmx (void);
+_pixman_implementation_create_vmx (pixman_implementation_t *fallback);
 #endif
 
 pixman_implementation_t *
 _pixman_choose_implementation (void);
 
 
 
 /*
  * Utilities
  */
+uint32_t *
+_pixman_iter_get_scanline_noop (pixman_iter_t *iter, const uint32_t *mask);
 
 /* These "formats" all have depth 0, so they
  * will never clash with any real ones
  */
 #define PIXMAN_null             PIXMAN_FORMAT (0, 0, 0, 0, 0, 0)
 #define PIXMAN_solid            PIXMAN_FORMAT (0, 1, 0, 0, 0, 0)
 #define PIXMAN_pixbuf		PIXMAN_FORMAT (0, 2, 0, 0, 0, 0)
 #define PIXMAN_rpixbuf		PIXMAN_FORMAT (0, 3, 0, 0, 0, 0)
@@ -558,24 +580,27 @@ pixman_implementation_t *
 #define FAST_PATH_NARROW_FORMAT			(1 <<  6)
 #define FAST_PATH_COMPONENT_ALPHA		(1 <<  8)
 #define FAST_PATH_SAMPLES_OPAQUE		(1 <<  7)
 #define FAST_PATH_UNIFIED_ALPHA			(1 <<  9)
 #define FAST_PATH_SCALE_TRANSFORM		(1 << 10)
 #define FAST_PATH_NEAREST_FILTER		(1 << 11)
 #define FAST_PATH_HAS_TRANSFORM			(1 << 12)
 #define FAST_PATH_IS_OPAQUE			(1 << 13)
-#define FAST_PATH_NEEDS_WORKAROUND		(1 << 14)
+#define FAST_PATH_NO_NORMAL_REPEAT		(1 << 14)
 #define FAST_PATH_NO_NONE_REPEAT		(1 << 15)
 #define FAST_PATH_SAMPLES_COVER_CLIP		(1 << 16)
 #define FAST_PATH_X_UNIT_POSITIVE		(1 << 17)
 #define FAST_PATH_AFFINE_TRANSFORM		(1 << 18)
 #define FAST_PATH_Y_UNIT_ZERO			(1 << 19)
 #define FAST_PATH_BILINEAR_FILTER		(1 << 20)
-#define FAST_PATH_NO_NORMAL_REPEAT		(1 << 21)
+#define FAST_PATH_ROTATE_90_TRANSFORM		(1 << 21)
+#define FAST_PATH_ROTATE_180_TRANSFORM		(1 << 22)
+#define FAST_PATH_ROTATE_270_TRANSFORM		(1 << 23)
+#define FAST_PATH_SIMPLE_REPEAT			(1 << 24)
 
 #define FAST_PATH_PAD_REPEAT						\
     (FAST_PATH_NO_NONE_REPEAT		|				\
      FAST_PATH_NO_NORMAL_REPEAT		|				\
      FAST_PATH_NO_REFLECT_REPEAT)
 
 #define FAST_PATH_NORMAL_REPEAT						\
     (FAST_PATH_NO_NONE_REPEAT		|				\
--- a/gfx/cairo/libpixman/src/pixman-radial-gradient.c
+++ b/gfx/cairo/libpixman/src/pixman-radial-gradient.c
@@ -91,18 +91,34 @@ radial_compute_color (double            
      *    this can be a problem if a*c is much smaller than b*b
      *
      *  - the above problems are worse if a is small (as inva becomes bigger)
      */
     double det;
 
     if (a == 0)
     {
-	return _pixman_gradient_walker_pixel (walker,
-					      pixman_fixed_1 / 2 * c / b);
+	double t;
+
+	if (b == 0)
+	    return 0;
+
+	t = pixman_fixed_1 / 2 * c / b;
+	if (repeat == PIXMAN_REPEAT_NONE)
+	{
+	    if (0 <= t && t <= pixman_fixed_1)
+		return _pixman_gradient_walker_pixel (walker, t);
+	}
+	else
+	{
+	    if (t * dr > mindr)
+		return _pixman_gradient_walker_pixel (walker, t);
+	}
+
+	return 0;
     }
 
     det = fdot (b, a, 0, b, -c, 0);
     if (det >= 0)
     {
 	double sqrtdet, t0, t1;
 
 	sqrtdet = sqrt (det);
@@ -123,23 +139,18 @@ radial_compute_color (double            
 	    else if (t1 * dr > mindr)
 		return _pixman_gradient_walker_pixel (walker, t1);
 	}
     }
 
     return 0;
 }
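
When the quadratic coefficient a vanishes, a·t² − 2·b·t + c = 0 degenerates to the linear equation 2·b·t = c, so t = c / (2·b); pixman_fixed_1 / 2 * c / b is exactly that value in 16.16 fixed point. The rewritten branch also rejects b == 0 (no solution) and applies the same range/repeat gating as the two-root case. A small standalone check of the arithmetic:

    #include <assert.h>

    #define FIXED_1 (1 << 16)   /* pixman_fixed_1 */

    static double
    degenerate_t (double b, double c)
    {
        /* caller must reject b == 0, as the patched code now does */
        return (double) FIXED_1 / 2 * c / b;
    }

    int
    main (void)
    {
        /* c/(2b) = 0.25, i.e. 0x4000 in 16.16 fixed point */
        assert (degenerate_t (2.0, 1.0) == (double) FIXED_1 / 4);
        return 0;
    }
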
 
-static void
-radial_gradient_get_scanline_32 (pixman_image_t *image,
-                                 int             x,
-                                 int             y,
-                                 int             width,
-                                 uint32_t *      buffer,
-                                 const uint32_t *mask)
+static uint32_t *
+radial_get_scanline_narrow (pixman_iter_t *iter, const uint32_t *mask)
 {
     /*
      * Implementation of radial gradients following the PDF specification.
      * See section 8.7.4.5.4 Type 3 (Radial) Shadings of the PDF Reference
      * Manual (PDF 32000-1:2008 at the time of this writing).
      * 
      * In the radial gradient problem we are given two circles (c₁,r₁) and
      * (c₂,r₂) that define the gradient itself.
@@ -212,39 +223,43 @@ radial_gradient_get_scanline_32 (pixman_
      *
      * Additional observations (useful for optimizations):
      * A does not depend on p
      *
      * A < 0 <=> one of the two circles completely contains the other one
      *   <=> for every p, the radiuses associated with the two t solutions
      *       have opposite sign
      */
+    pixman_image_t *image = iter->image;
+    int x = iter->x;
+    int y = iter->y;
+    int width = iter->width;
+    uint32_t *buffer = iter->buffer;
 
     gradient_t *gradient = (gradient_t *)image;
-    source_image_t *source = (source_image_t *)image;
     radial_gradient_t *radial = (radial_gradient_t *)image;
     uint32_t *end = buffer + width;
     pixman_gradient_walker_t walker;
     pixman_vector_t v, unit;
 
     /* reference point is the center of the pixel */
     v.vector[0] = pixman_int_to_fixed (x) + pixman_fixed_1 / 2;
     v.vector[1] = pixman_int_to_fixed (y) + pixman_fixed_1 / 2;
     v.vector[2] = pixman_fixed_1;
 
-    _pixman_gradient_walker_init (&walker, gradient, source->common.repeat);
+    _pixman_gradient_walker_init (&walker, gradient, image->common.repeat);
 
-    if (source->common.transform)
+    if (image->common.transform)
     {
-	if (!pixman_transform_point_3d (source->common.transform, &v))
-	    return;
+	if (!pixman_transform_point_3d (image->common.transform, &v))
+	    return iter->buffer;
 	
-	unit.vector[0] = source->common.transform->matrix[0][0];
-	unit.vector[1] = source->common.transform->matrix[1][0];
-	unit.vector[2] = source->common.transform->matrix[2][0];
+	unit.vector[0] = image->common.transform->matrix[0][0];
+	unit.vector[1] = image->common.transform->matrix[1][0];
+	unit.vector[2] = image->common.transform->matrix[2][0];
     }
     else
     {
 	unit.vector[0] = pixman_fixed_1;
 	unit.vector[1] = 0;
 	unit.vector[2] = 0;
     }
 
@@ -285,35 +300,36 @@ radial_gradient_get_scanline_32 (pixman_
	 * This would mean a worst case of unbounded relative error or
 	 * about 2^10 absolute error
 	 */
 	b = dot (v.vector[0], v.vector[1], radial->c1.radius,
 		 radial->delta.x, radial->delta.y, radial->delta.radius);
 	db = dot (unit.vector[0], unit.vector[1], 0,
 		  radial->delta.x, radial->delta.y, 0);
 
-	c = dot (v.vector[0], v.vector[1], -radial->c1.radius,
+	c = dot (v.vector[0], v.vector[1],
+		 -((pixman_fixed_48_16_t) radial->c1.radius),
 		 v.vector[0], v.vector[1], radial->c1.radius);
-	dc = dot (2 * v.vector[0] + unit.vector[0],
-		  2 * v.vector[1] + unit.vector[1],
+	dc = dot (2 * (pixman_fixed_48_16_t) v.vector[0] + unit.vector[0],
+		  2 * (pixman_fixed_48_16_t) v.vector[1] + unit.vector[1],
 		  0,
 		  unit.vector[0], unit.vector[1], 0);
 	ddc = 2 * dot (unit.vector[0], unit.vector[1], 0,
 		       unit.vector[0], unit.vector[1], 0);
 
 	while (buffer < end)
 	{
 	    if (!mask || *mask++)
 	    {
 		*buffer = radial_compute_color (radial->a, b, c,
 						radial->inva,
 						radial->delta.radius,
 						radial->mindr,
 						&walker,
-						source->common.repeat);
+						image->common.repeat);
 	    }
 
 	    b += db;
 	    c += dc;
 	    dc += ddc;
 	    ++buffer;
 	}
     }
@@ -348,38 +364,53 @@ radial_gradient_get_scanline_32 (pixman_
 			      pdx, pdy, radial->c1.radius);
 		    /*  / pixman_fixed_1 / pixman_fixed_1 */
 
 		    *buffer = radial_compute_color (radial->a, b, c,
 						    radial->inva,
 						    radial->delta.radius,
 						    radial->mindr,
 						    &walker,
-						    source->common.repeat);
+						    image->common.repeat);
 		}
 		else
 		{
 		    *buffer = 0;
 		}
 	    }
-	    
+
 	    ++buffer;
 
 	    v.vector[0] += unit.vector[0];
 	    v.vector[1] += unit.vector[1];
 	    v.vector[2] += unit.vector[2];
 	}
     }
+
+    iter->y++;
+    return iter->buffer;
 }
 
-static void
-radial_gradient_property_changed (pixman_image_t *image)
+static uint32_t *
+radial_get_scanline_wide (pixman_iter_t *iter, const uint32_t *mask)
 {
-    image->common.get_scanline_32 = radial_gradient_get_scanline_32;
-    image->common.get_scanline_64 = _pixman_image_get_scanline_generic_64;
+    uint32_t *buffer = radial_get_scanline_narrow (iter, NULL);
+
+    pixman_expand ((uint64_t *)buffer, buffer, PIXMAN_a8r8g8b8, iter->width);
+
+    return buffer;
+}
+
+void
+_pixman_radial_gradient_iter_init (pixman_image_t *image, pixman_iter_t *iter)
+{
+    if (iter->flags & ITER_NARROW)
+	iter->get_scanline = radial_get_scanline_narrow;
+    else
+	iter->get_scanline = radial_get_scanline_wide;
 }
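
The get_scanline_32/get_scanline_64 image hooks are replaced by iterators: the init function picks a narrow (8 bits per channel) or wide (16 bits per channel, via pixman_expand) fetcher, and each get_scanline call renders one row and advances iter->y. A hedged sketch of the driving loop, assuming the generic iterator setup has already filled in iter->x, y, width and buffer:

    static void
    fetch_all_rows (pixman_image_t *image, pixman_iter_t *iter, int height)
    {
        int i;

        _pixman_radial_gradient_iter_init (image, iter);

        for (i = 0; i < height; i++)
        {
            /* renders one scanline into iter->buffer, bumps iter->y */
            uint32_t *row = iter->get_scanline (iter, NULL);

            /* ... combine 'row' into the destination here ... */
            (void) row;
        }
    }
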
 
 PIXMAN_EXPORT pixman_image_t *
 pixman_image_create_radial_gradient (pixman_point_fixed_t *        inner,
                                      pixman_point_fixed_t *        outer,
                                      pixman_fixed_t                inner_radius,
                                      pixman_fixed_t                outer_radius,
                                      const pixman_gradient_stop_t *stops,
@@ -419,13 +450,11 @@ pixman_image_create_radial_gradient (pix
        representation is correct (53 bits) */
     radial->a = dot (radial->delta.x, radial->delta.y, -radial->delta.radius,
 		     radial->delta.x, radial->delta.y, radial->delta.radius);
     if (radial->a != 0)
 	radial->inva = 1. * pixman_fixed_1 / radial->a;
 
     radial->mindr = -1. * pixman_fixed_1 * radial->c1.radius;
 
-    image->common.property_changed = radial_gradient_property_changed;
-
     return image;
 }
 
--- a/gfx/cairo/libpixman/src/pixman-region.c
+++ b/gfx/cairo/libpixman/src/pixman-region.c
@@ -2540,18 +2540,17 @@ bitmap_addrect (region_type_t *reg,
                 int rx1, int ry1,
                 int rx2, int ry2)
 {
     if ((rx1 < rx2) && (ry1 < ry2) &&
 	(!(reg->data->numRects &&
 	   ((r-1)->y1 == ry1) && ((r-1)->y2 == ry2) &&
 	   ((r-1)->x1 <= rx1) && ((r-1)->x2 >= rx2))))
     {
-	if (!reg->data ||
-	    reg->data->numRects == reg->data->size)
+	if (reg->data->numRects == reg->data->size)
 	{
 	    if (!pixman_rect_alloc (reg, 1))
 		return NULL;
 	    *first_rect = PIXREGION_BOXPTR(reg);
 	    r = *first_rect + reg->data->numRects;
 	}
 	r->x1 = rx1;
 	r->y1 = ry1;
@@ -2585,16 +2584,18 @@ PREFIX (_init_from_image) (region_type_t
     int	irect_prev_start, irect_line_start;
     int	h, base, rx1 = 0, crects;
     int	ib;
     pixman_bool_t in_box, same;
     int width, height, stride;
 
     PREFIX(_init) (region);
 
+    critical_if_fail (region->data);
+
     return_if_fail (image->type == BITS);
     return_if_fail (image->bits.format == PIXMAN_a1);
 
     pw_line = pixman_image_get_data (image);
     width = pixman_image_get_width (image);
     height = pixman_image_get_height (image);
     stride = pixman_image_get_stride (image) / 4;
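
The null check dropped from bitmap_addrect() and the critical_if_fail() added here rely on the same invariant: after PREFIX(_init) the region's data pointer refers to the shared empty-data sentinel (size 0, numRects 0), never NULL, so numRects == size already forces the first allocation. A toy model of that invariant (names are illustrative, not pixman's):

    #include <assert.h>
    #include <stddef.h>

    typedef struct { size_t size, num_rects; } region_data_t;
    typedef struct { region_data_t *data; } region_t;

    static region_data_t empty_data = { 0, 0 };   /* shared sentinel */

    static void
    region_init (region_t *r)
    {
        r->data = &empty_data;   /* never NULL after init */
    }

    int
    main (void)
    {
        region_t r;

        region_init (&r);
        /* 0 == 0 triggers the allocation path on the first rectangle */
        assert (r.data != NULL && r.data->num_rects == r.data->size);
        return 0;
    }
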
 
--- a/gfx/cairo/libpixman/src/pixman-solid-fill.c
+++ b/gfx/cairo/libpixman/src/pixman-solid-fill.c
@@ -21,64 +21,39 @@
  * CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
  */
 
 #ifdef HAVE_CONFIG_H
 #include <config.h>
 #endif
 #include "pixman-private.h"
 
-static void
-solid_fill_get_scanline_32 (pixman_image_t *image,
-                            int             x,
-                            int             y,
-                            int             width,
-                            uint32_t *      buffer,
-                            const uint32_t *mask)
+void
+_pixman_solid_fill_iter_init (pixman_image_t *image, pixman_iter_t  *iter)
 {
-    uint32_t *end = buffer + width;
-    uint32_t color = image->solid.color_32;
-
-    while (buffer < end)
-	*(buffer++) = color;
-
-    return;
-}
+    if (iter->flags & ITER_NARROW)
+    {
+	uint32_t *b = (uint32_t *)iter->buffer;
+	uint32_t *e = b + iter->width;
+	uint32_t color = iter->image->solid.color_32;
 
-static void
-solid_fill_get_scanline_64 (pixman_image_t *image,
-			    int             x,
-			    int             y,
-			    int             width,
-			    uint32_t *      buffer,
-			    const uint32_t *mask)
-{
-    uint64_t *b = (uint64_t *)buffer;
-    uint64_t *e = b + width;
-    uint64_t color = image->solid.color_64;
-
-    while (b < e)
-	*(b++) = color;
-}
+	while (b < e)
+	    *(b++) = color;
+    }
+    else
+    {
+	uint64_t *b = (uint64_t *)iter->buffer;
+	uint64_t *e = b + iter->width;
+	uint64_t color = image->solid.color_64;
 
-static source_image_class_t
-solid_fill_classify (pixman_image_t *image,
-                     int             x,
-                     int             y,
-                     int             width,
-                     int             height)
-{
-    return (image->source.class = SOURCE_IMAGE_CLASS_HORIZONTAL);
-}
+	while (b < e)
+	    *(b++) = color;
+    }
 
-static void
-solid_fill_property_changed (pixman_image_t *image)
-{
-    image->common.get_scanline_32 = solid_fill_get_scanline_32;
-    image->common.get_scanline_64 = solid_fill_get_scanline_64;
+    iter->get_scanline = _pixman_iter_get_scanline_noop;
 }
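
Since a solid fill never changes from row to row, the buffer is filled once at init time and the iterator is pointed at the shared no-op fetcher declared in pixman-private.h above. That function presumably just hands the prefilled buffer back; a sketch of the idea:

    static uint32_t *
    get_scanline_noop_sketch (pixman_iter_t *iter, const uint32_t *mask)
    {
        /* the solid color was written at init time; every row is the
         * same buffer */
        (void) mask;
        return iter->buffer;
    }
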
 
 static uint32_t
 color_to_uint32 (const pixman_color_t *color)
 {
     return
         (color->alpha >> 8 << 24) |
         (color->red >> 8 << 16) |
@@ -104,15 +79,11 @@ pixman_image_create_solid_fill (pixman_c
     if (!img)
 	return NULL;
 
     img->type = SOLID;
     img->solid.color = *color;
     img->solid.color_32 = color_to_uint32 (color);
     img->solid.color_64 = color_to_uint64 (color);
 
-    img->source.class = SOURCE_IMAGE_CLASS_UNKNOWN;
-    img->common.classify = solid_fill_classify;
-    img->common.property_changed = solid_fill_property_changed;
-
     return img;
 }
 
--- a/gfx/cairo/libpixman/src/pixman-sse2.c
+++ b/gfx/cairo/libpixman/src/pixman-sse2.c
@@ -25,46 +25,22 @@
  *          André Tupinambá (andrelrt@gmail.com)
  *
  * Based on work by Owen Taylor and Søren Sandmann
  */
 #ifdef HAVE_CONFIG_H
 #include <config.h>
 #endif
 
-#include <mmintrin.h>
 #include <xmmintrin.h> /* for _mm_shuffle_pi16 and _MM_SHUFFLE */
 #include <emmintrin.h> /* for SSE2 intrinsics */
 #include "pixman-private.h"
 #include "pixman-combine32.h"
 #include "pixman-fast-path.h"
 
-#if defined(_MSC_VER) && defined(_M_AMD64)
-/* Windows 64 doesn't allow MMX to be used, so
- * the pixman-x64-mmx-emulation.h file contains
- * implementations of those MMX intrinsics that
- * are used in the SSE2 implementation.
- */
-#   include "pixman-x64-mmx-emulation.h"
-#endif
-
-#ifdef USE_SSE2
-
-/* --------------------------------------------------------------------
- * Locals
- */
-
-static __m64 mask_x0080;
-static __m64 mask_x00ff;
-static __m64 mask_x0101;
-static __m64 mask_x_alpha;
-
-static __m64 mask_x565_rgb;
-static __m64 mask_x565_unpack;
-
 static __m128i mask_0080;
 static __m128i mask_00ff;
 static __m128i mask_0101;
 static __m128i mask_ffff;
 static __m128i mask_ff000000;
 static __m128i mask_alpha;
 
 static __m128i mask_565_r;
@@ -72,19 +48,16 @@ static __m128i mask_565_g1, mask_565_g2;
 static __m128i mask_565_b;
 static __m128i mask_red;
 static __m128i mask_green;
 static __m128i mask_blue;
 
 static __m128i mask_565_fix_rb;
 static __m128i mask_565_fix_g;
 
-/* ----------------------------------------------------------------------
- * SSE2 Inlines
- */
 static force_inline __m128i
 unpack_32_1x128 (uint32_t data)
 {
     return _mm_unpacklo_epi8 (_mm_cvtsi32_si128 (data), _mm_setzero_si128 ());
 }
 
 static force_inline void
 unpack_128_2x128 (__m128i data, __m128i* data_lo, __m128i* data_hi)
@@ -392,189 +365,148 @@ save_128_aligned (__m128i* dst,
 /* save 4 pixels on an unaligned address */
 static force_inline void
 save_128_unaligned (__m128i* dst,
                     __m128i  data)
 {
     _mm_storeu_si128 (dst, data);
 }
 
-/* ------------------------------------------------------------------
- * MMX inlines
- */
-
-static force_inline __m64
-load_32_1x64 (uint32_t data)
+static force_inline __m128i
+load_32_1x128 (uint32_t data)
 {
-    return _mm_cvtsi32_si64 (data);
+    return _mm_cvtsi32_si128 (data);
 }
 
-static force_inline __m64
-unpack_32_1x64 (uint32_t data)
+static force_inline __m128i
+expand_alpha_rev_1x128 (__m128i data)
 {
-    return _mm_unpacklo_pi8 (load_32_1x64 (data), _mm_setzero_si64 ());
-}
-
-static force_inline __m64
-expand_alpha_1x64 (__m64 data)
-{
-    return _mm_shuffle_pi16 (data, _MM_SHUFFLE (3, 3, 3, 3));
+    return _mm_shufflelo_epi16 (data, _MM_SHUFFLE (0, 0, 0, 0));
 }
 
-static force_inline __m64
-expand_alpha_rev_1x64 (__m64 data)
-{
-    return _mm_shuffle_pi16 (data, _MM_SHUFFLE (0, 0, 0, 0));
-}
-
-static force_inline __m64
-expand_pixel_8_1x64 (uint8_t data)
+static force_inline __m128i
+expand_pixel_8_1x128 (uint8_t data)
 {
-    return _mm_shuffle_pi16 (
-	unpack_32_1x64 ((uint32_t)data), _MM_SHUFFLE (0, 0, 0, 0));
+    return _mm_shufflelo_epi16 (
+	unpack_32_1x128 ((uint32_t)data), _MM_SHUFFLE (0, 0, 0, 0));
 }
 
-static force_inline __m64
-pix_multiply_1x64 (__m64 data,
-                   __m64 alpha)
+static force_inline __m128i
+pix_multiply_1x128 (__m128i data,
+		    __m128i alpha)
 {
-    return _mm_mulhi_pu16 (_mm_adds_pu16 (_mm_mullo_pi16 (data, alpha),
-                                          mask_x0080),
-                           mask_x0101);
+    return _mm_mulhi_epu16 (_mm_adds_epu16 (_mm_mullo_epi16 (data, alpha),
+					    mask_0080),
+			    mask_0101);
 }
 
-static force_inline __m64
-pix_add_multiply_1x64 (__m64* src,
-                       __m64* alpha_dst,
-                       __m64* dst,
-                       __m64* alpha_src)
+static force_inline __m128i
+pix_add_multiply_1x128 (__m128i* src,
+			__m128i* alpha_dst,
+			__m128i* dst,
+			__m128i* alpha_src)
 {
-    __m64 t1 = pix_multiply_1x64 (*src, *alpha_dst);
-    __m64 t2 = pix_multiply_1x64 (*dst, *alpha_src);
-
-    return _mm_adds_pu8 (t1, t2);
+    __m128i t1 = pix_multiply_1x128 (*src, *alpha_dst);
+    __m128i t2 = pix_multiply_1x128 (*dst, *alpha_src);
+
+    return _mm_adds_epu8 (t1, t2);
 }
 
-static force_inline __m64
-negate_1x64 (__m64 data)
+static force_inline __m128i
+negate_1x128 (__m128i data)
 {
-    return _mm_xor_si64 (data, mask_x00ff);
+    return _mm_xor_si128 (data, mask_00ff);
 }
 
-static force_inline __m64
-invert_colors_1x64 (__m64 data)
+static force_inline __m128i
+invert_colors_1x128 (__m128i data)
 {
-    return _mm_shuffle_pi16 (data, _MM_SHUFFLE (3, 0, 1, 2));
+    return _mm_shufflelo_epi16 (data, _MM_SHUFFLE (3, 0, 1, 2));
 }
 
-static force_inline __m64
-over_1x64 (__m64 src, __m64 alpha, __m64 dst)
+static force_inline __m128i
+over_1x128 (__m128i src, __m128i alpha, __m128i dst)
 {
-    return _mm_adds_pu8 (src, pix_multiply_1x64 (dst, negate_1x64 (alpha)));
+    return _mm_adds_epu8 (src, pix_multiply_1x128 (dst, negate_1x128 (alpha)));
 }
 
-static force_inline __m64
-in_over_1x64 (__m64* src, __m64* alpha, __m64* mask, __m64* dst)
+static force_inline __m128i
+in_over_1x128 (__m128i* src, __m128i* alpha, __m128i* mask, __m128i* dst)
 {
-    return over_1x64 (pix_multiply_1x64 (*src, *mask),
-                      pix_multiply_1x64 (*alpha, *mask),
-                      *dst);
+    return over_1x128 (pix_multiply_1x128 (*src, *mask),
+		       pix_multiply_1x128 (*alpha, *mask),
+		       *dst);
 }
 
-static force_inline __m64
-over_rev_non_pre_1x64 (__m64 src, __m64 dst)
+static force_inline __m128i
+over_rev_non_pre_1x128 (__m128i src, __m128i dst)
 {
-    __m64 alpha = expand_alpha_1x64 (src);
-
-    return over_1x64 (pix_multiply_1x64 (invert_colors_1x64 (src),
-                                         _mm_or_si64 (alpha, mask_x_alpha)),
-                      alpha,
-                      dst);
+    __m128i alpha = expand_alpha_1x128 (src);
+
+    return over_1x128 (pix_multiply_1x128 (invert_colors_1x128 (src),
+					   _mm_or_si128 (alpha, mask_alpha)),
+		       alpha,
+		       dst);
 }
 
 static force_inline uint32_t
-pack_1x64_32 (__m64 data)
+pack_1x128_32 (__m128i data)
 {
-    return _mm_cvtsi64_si32 (_mm_packs_pu16 (data, _mm_setzero_si64 ()));
+    return _mm_cvtsi128_si32 (_mm_packus_epi16 (data, _mm_setzero_si128 ()));
 }
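
The MMX __m64 helpers are replaced one-for-one by SSE2 helpers that keep a single pixel in the low 64 bits of an XMM register; unpack widens each 8-bit channel to 16 bits, and pack_1x128_32() saturates back down. The pair is an identity on any a8r8g8b8 pixel:

    /* round trip through the helpers above: == pixel for any input,
     * since all channel values fit in 8 bits before the saturating pack */
    static force_inline uint32_t
    unpack_pack_identity (uint32_t pixel)
    {
        return pack_1x128_32 (unpack_32_1x128 (pixel));
    }
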
 
-/* Expand 16 bits positioned at @pos (0-3) of a mmx register into
- *
- *    00RR00GG00BB
- *
- * --- Expanding 565 in the low word ---
- *
- * m = (m << (32 - 3)) | (m << (16 - 5)) | m;
- * m = m & (01f0003f001f);
- * m = m * (008404100840);
- * m = m >> 8;
- *
- * Note the trick here - the top word is shifted by another nibble to
- * avoid it bumping into the middle word
- */
-static force_inline __m64
-expand565_16_1x64 (uint16_t pixel)
+static force_inline __m128i
+expand565_16_1x128 (uint16_t pixel)
 {
-    __m64 p;
-    __m64 t1, t2;
-
-    p = _mm_cvtsi32_si64 ((uint32_t) pixel);
-
-    t1 = _mm_slli_si64 (p, 36 - 11);
-    t2 = _mm_slli_si64 (p, 16 - 5);
-
-    p = _mm_or_si64 (t1, p);
-    p = _mm_or_si64 (t2, p);
-    p = _mm_and_si64 (p, mask_x565_rgb);
-    p = _mm_mullo_pi16 (p, mask_x565_unpack);
-
-    return _mm_srli_pi16 (p, 8);
+    __m128i m = _mm_cvtsi32_si128 (pixel);
+
+    m = unpack_565_to_8888 (m);
+
+    return _mm_unpacklo_epi8 (m, _mm_setzero_si128 ());
 }
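
The open-coded shift/mask/multiply trick of the MMX version gives way to unpack_565_to_8888(), defined earlier in this file. Its per-pixel effect, written out in scalar form: each field is widened and its top bits are replicated into the low bits, so full intensity 0x1f maps to 0xff rather than 0xf8:

    #include <stdint.h>

    static uint32_t
    expand565_scalar (uint16_t p)
    {
        uint32_t r = (p >> 11) & 0x1f;
        uint32_t g = (p >>  5) & 0x3f;
        uint32_t b =  p        & 0x1f;

        r = (r << 3) | (r >> 2);   /* replicate high bits */
        g = (g << 2) | (g >> 4);
        b = (b << 3) | (b >> 2);

        return (r << 16) | (g << 8) | b;
    }
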
 
-/* ----------------------------------------------------------------------------
- * Compose Core transformations
- */
 static force_inline uint32_t
 core_combine_over_u_pixel_sse2 (uint32_t src, uint32_t dst)
 {
     uint8_t a;
-    __m64 ms;
+    __m128i xmms;
 
     a = src >> 24;
 
     if (a == 0xff)
     {
 	return src;
     }
     else if (src)
     {
-	ms = unpack_32_1x64 (src);
-	return pack_1x64_32 (
-	    over_1x64 (ms, expand_alpha_1x64 (ms), unpack_32_1x64 (dst)));
+	xmms = unpack_32_1x128 (src);
+	return pack_1x128_32 (
+	    over_1x128 (xmms, expand_alpha_1x128 (xmms),
+			unpack_32_1x128 (dst)));
     }
 
     return dst;
 }
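
Scalar reference for the per-channel OVER that over_1x128() and core_combine_over_u_pixel_sse2() implement: result = s + d·(1 − sa), with pixman's usual rounded 8-bit multiply. The a == 0xff and src == 0 branches above are just shortcuts around this arithmetic:

    #include <stdint.h>

    /* rounded (a * b) / 255, the MUL_UN8 idiom from pixman-combine32.h */
    static inline uint8_t
    mul_un8 (uint8_t a, uint8_t b)
    {
        uint16_t t = (uint16_t) a * b + 0x80;
        return (uint8_t) ((t + (t >> 8)) >> 8);
    }

    static inline uint8_t
    over_un8 (uint8_t s, uint8_t sa, uint8_t d)
    {
        /* the SIMD version uses a saturating add; premultiplied
         * inputs keep the sum in range anyway */
        return (uint8_t) (s + mul_un8 (255 - sa, d));
    }
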
 
 static force_inline uint32_t
 combine1 (const uint32_t *ps, const uint32_t *pm)
 {
     uint32_t s = *ps;
 
     if (pm)
     {
-	__m64 ms, mm;
-
-	mm = unpack_32_1x64 (*pm);
-	mm = expand_alpha_1x64 (mm);
-
-	ms = unpack_32_1x64 (s);
-	ms = pix_multiply_1x64 (ms, mm);
-
-	s = pack_1x64_32 (ms);
+	__m128i ms, mm;
+
+	mm = unpack_32_1x128 (*pm);
+	mm = expand_alpha_1x128 (mm);
+
+	ms = unpack_32_1x128 (s);
+	ms = pix_multiply_1x128 (ms, mm);
+
+	s = pack_1x128_32 (ms);
     }
 
     return s;
 }
 
 static force_inline __m128i
 combine4 (const __m128i *ps, const __m128i *pm)
 {
@@ -605,96 +537,192 @@ combine4 (const __m128i *ps, const __m12
 
 	s = pack_2x128_128 (xmm_src_lo, xmm_src_hi);
     }
 
     return s;
 }
 
 static force_inline void
-core_combine_over_u_sse2 (uint32_t*       pd,
-                          const uint32_t* ps,
-                          const uint32_t* pm,
-                          int             w)
+core_combine_over_u_sse2_mask (uint32_t *	  pd,
+			       const uint32_t*    ps,
+			       const uint32_t*    pm,
+			       int                w)
 {
     uint32_t s, d;
 
-    __m128i xmm_dst_lo, xmm_dst_hi;
-    __m128i xmm_src_lo, xmm_src_hi;
-    __m128i xmm_alpha_lo, xmm_alpha_hi;
-
     /* Align dst on a 16-byte boundary */
     while (w && ((unsigned long)pd & 15))
     {
 	d = *pd;
 	s = combine1 (ps, pm);
 
-	*pd++ = core_combine_over_u_pixel_sse2 (s, d);
+	if (s)
+	    *pd = core_combine_over_u_pixel_sse2 (s, d);
+	pd++;
 	ps++;
-	if (pm)
-	    pm++;
+	pm++;
 	w--;
     }
 
     while (w >= 4)
     {
-	/* I'm loading unaligned because I'm not sure about
-	 * the address alignment.
-	 */
-	xmm_src_hi = combine4 ((__m128i*)ps, (__m128i*)pm);
-
-	if (is_opaque (xmm_src_hi))
-	{
-	    save_128_aligned ((__m128i*)pd, xmm_src_hi);
-	}
-	else if (!is_zero (xmm_src_hi))
+	__m128i mask = load_128_unaligned ((__m128i *)pm);
+
+	if (!is_zero (mask))
 	{
-	    xmm_dst_hi = load_128_aligned ((__m128i*) pd);
-
-	    unpack_128_2x128 (xmm_src_hi, &xmm_src_lo, &xmm_src_hi);
-	    unpack_128_2x128 (xmm_dst_hi, &xmm_dst_lo, &xmm_dst_hi);
-
-	    expand_alpha_2x128 (
-		xmm_src_lo, xmm_src_hi, &xmm_alpha_lo, &xmm_alpha_hi);
-
-	    over_2x128 (&xmm_src_lo, &xmm_src_hi,
-			&xmm_alpha_lo, &xmm_alpha_hi,
-			&xmm_dst_lo, &xmm_dst_hi);
-
-	    /* rebuid the 4 pixel data and save*/
-	    save_128_aligned ((__m128i*)pd,
-			      pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
+	    __m128i src;
+	    __m128i src_hi, src_lo;
+	    __m128i mask_hi, mask_lo;
+	    __m128i alpha_hi, alpha_lo;
+
+	    src = load_128_unaligned ((__m128i *)ps);
+
+	    if (is_opaque (_mm_and_si128 (src, mask)))
+	    {
+		save_128_aligned ((__m128i *)pd, src);
+	    }
+	    else
+	    {
+		__m128i dst = load_128_aligned ((__m128i *)pd);
+		__m128i dst_hi, dst_lo;
+
+		unpack_128_2x128 (mask, &mask_lo, &mask_hi);
+		unpack_128_2x128 (src, &src_lo, &src_hi);
+
+		expand_alpha_2x128 (mask_lo, mask_hi, &mask_lo, &mask_hi);
+		pix_multiply_2x128 (&src_lo, &src_hi,
+				    &mask_lo, &mask_hi,
+				    &src_lo, &src_hi);
+
+		unpack_128_2x128 (dst, &dst_lo, &dst_hi);
+
+		expand_alpha_2x128 (src_lo, src_hi,
+				    &alpha_lo, &alpha_hi);
+
+		over_2x128 (&src_lo, &src_hi, &alpha_lo, &alpha_hi,
+			    &dst_lo, &dst_hi);
+
+		save_128_aligned (
+		    (__m128i *)pd,
+		    pack_2x128_128 (dst_lo, dst_hi));
+	    }
 	}
 
-	w -= 4;
+	pm += 4;
 	ps += 4;
 	pd += 4;
-	if (pm)
-	    pm += 4;
+	w -= 4;
     }
-
     while (w)
     {
 	d = *pd;
 	s = combine1 (ps, pm);
 
-	*pd++ = core_combine_over_u_pixel_sse2 (s, d);
+	if (s)
+	    *pd = core_combine_over_u_pixel_sse2 (s, d);
+	pd++;
 	ps++;
-	if (pm)
-	    pm++;
+	pm++;
 
 	w--;
     }
 }
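
The masked loop now inspects four pixels at a time: if the mask block is all zero nothing is written, and if src AND mask is fully opaque the source block is stored without blending. is_zero() and is_opaque() are defined earlier in pixman-sse2.c; sketches of the usual idiom they follow (alpha bytes occupy lanes 3, 7, 11 and 15, hence 0x8888):

    static force_inline int
    is_zero_sketch (__m128i v)
    {
        return _mm_movemask_epi8 (
            _mm_cmpeq_epi8 (v, _mm_setzero_si128 ())) == 0xffff;
    }

    static force_inline int
    is_opaque_sketch (__m128i v)
    {
        __m128i ffs = _mm_cmpeq_epi8 (v, v);   /* all bytes 0xff */

        return (_mm_movemask_epi8 (
                    _mm_cmpeq_epi8 (v, ffs)) & 0x8888) == 0x8888;
    }
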
 
 static force_inline void
-core_combine_over_reverse_u_sse2 (uint32_t*       pd,
-                                  const uint32_t* ps,
-                                  const uint32_t* pm,
-                                  int             w)
+core_combine_over_u_sse2_no_mask (uint32_t *	  pd,
+				  const uint32_t*    ps,
+				  int                w)
+{
+    uint32_t s, d;
+
+    /* Align dst on a 16-byte boundary */
+    while (w && ((unsigned long)pd & 15))
+    {
+	d = *pd;
+	s = *ps;
+
+	if (s)
+	    *pd = core_combine_over_u_pixel_sse2 (s, d);
+	pd++;
+	ps++;
+	w--;
+    }
+
+    while (w >= 4)
+    {
+	__m128i src;
+	__m128i src_hi, src_lo, dst_hi, dst_lo;
+	__m128i alpha_hi, alpha_lo;
+
+	src = load_128_unaligned ((__m128i *)ps);
+
+	if (!is_zero (src))
+	{
+	    if (is_opaque (src))
+	    {
+		save_128_aligned ((__m128i *)pd, src);
+	    }
+	    else
+	    {
+		__m128i dst = load_128_aligned ((__m128i *)pd);
+
+		unpack_128_2x128 (src, &src_lo, &src_hi);
+		unpack_128_2x128 (dst, &dst_lo, &dst_hi);
+
+		expand_alpha_2x128 (src_lo, src_hi,
+				    &alpha_lo, &alpha_hi);
+		over_2x128 (&src_lo, &src_hi, &alpha_lo, &alpha_hi,
+			    &dst_lo, &dst_hi);
+
+		save_128_aligned (
+		    (__m128i *)pd,
+		    pack_2x128_128 (dst_lo, dst_hi));
+	    }
+	}
+
+	ps += 4;
+	pd += 4;
+	w -= 4;
+    }
+    while (w)
+    {
+	d = *pd;
+	s = *ps;
+
+	if (s)
+	    *pd = core_combine_over_u_pixel_sse2 (s, d);
+	pd++;
+	ps++;
+
+	w--;
+    }
+}
+
+static force_inline void
+sse2_combine_over_u (pixman_implementation_t *imp,
+                     pixman_op_t              op,
+                     uint32_t *               pd,
+                     const uint32_t *         ps,
+                     const uint32_t *         pm,
+                     int                      w)
+{
+    if (pm)
+	core_combine_over_u_sse2_mask (pd, ps, pm, w);
+    else
+	core_combine_over_u_sse2_no_mask (pd, ps, w);
+}
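
The wrappers now carry the (imp, op, dest, src, mask, width) signature of pixman's combiner function pointers, so they can be installed directly in the implementation's combiner table. A hedged sketch of the registration, which would happen in _pixman_implementation_create_sse2() outside this hunk:

    imp->combine_32[PIXMAN_OP_OVER] = sse2_combine_over_u;
    imp->combine_32[PIXMAN_OP_OVER_REVERSE] = sse2_combine_over_reverse_u;
    imp->combine_32[PIXMAN_OP_IN] = sse2_combine_in_u;
    imp->combine_32[PIXMAN_OP_IN_REVERSE] = sse2_combine_in_reverse_u;
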
+
+static void
+sse2_combine_over_reverse_u (pixman_implementation_t *imp,
+                             pixman_op_t              op,
+                             uint32_t *               pd,
+                             const uint32_t *         ps,
+                             const uint32_t *         pm,
+                             int                      w)
 {
     uint32_t s, d;
 
     __m128i xmm_dst_lo, xmm_dst_hi;
     __m128i xmm_src_lo, xmm_src_hi;
     __m128i xmm_alpha_lo, xmm_alpha_hi;
 
     /* Align dst on a 16-byte boundary */
@@ -750,51 +778,53 @@ core_combine_over_reverse_u_sse2 (uint32
 	ps++;
 	w--;
 	if (pm)
 	    pm++;
     }
 }
 
 static force_inline uint32_t
-core_combine_in_u_pixelsse2 (uint32_t src, uint32_t dst)
+core_combine_in_u_pixel_sse2 (uint32_t src, uint32_t dst)
 {
     uint32_t maska = src >> 24;
 
     if (maska == 0)
     {
 	return 0;
     }
     else if (maska != 0xff)
     {
-	return pack_1x64_32 (
-	    pix_multiply_1x64 (unpack_32_1x64 (dst),
-			       expand_alpha_1x64 (unpack_32_1x64 (src))));
+	return pack_1x128_32 (
+	    pix_multiply_1x128 (unpack_32_1x128 (dst),
+				expand_alpha_1x128 (unpack_32_1x128 (src))));
     }
 
     return dst;
 }
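
Scalar reference for IN: the result is one operand scaled by the other's alpha. The wrapper below passes (d, s), so the surviving pixel is the source weighted by the destination's alpha; the reverse variant swaps the roles. The maska == 0 / 0xff branches above merely skip the multiply:

    /* mul_un8 as in the OVER sketch earlier */
    static inline uint8_t
    in_un8 (uint8_t x, uint8_t a)
    {
        return mul_un8 (x, a);
    }
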
 
-static force_inline void
-core_combine_in_u_sse2 (uint32_t*       pd,
-                        const uint32_t* ps,
-                        const uint32_t* pm,
-                        int             w)
+static void
+sse2_combine_in_u (pixman_implementation_t *imp,
+                   pixman_op_t              op,
+                   uint32_t *               pd,
+                   const uint32_t *         ps,
+                   const uint32_t *         pm,
+                   int                      w)
 {
     uint32_t s, d;
 
     __m128i xmm_src_lo, xmm_src_hi;
     __m128i xmm_dst_lo, xmm_dst_hi;
 
     while (w && ((unsigned long) pd & 15))
     {
 	s = combine1 (ps, pm);
 	d = *pd;
 
-	*pd++ = core_combine_in_u_pixelsse2 (d, s);
+	*pd++ = core_combine_in_u_pixel_sse2 (d, s);
 	w--;
 	ps++;
 	if (pm)
 	    pm++;
     }
 
     while (w >= 4)
     {
@@ -819,41 +849,43 @@ core_combine_in_u_sse2 (uint32_t*       
 	    pm += 4;
     }
 
     while (w)
     {
 	s = combine1 (ps, pm);
 	d = *pd;
 
-	*pd++ = core_combine_in_u_pixelsse2 (d, s);
+	*pd++ = core_combine_in_u_pixel_sse2 (d, s);
 	w--;
 	ps++;
 	if (pm)
 	    pm++;
     }
 }
 
-static force_inline void
-core_combine_reverse_in_u_sse2 (uint32_t*       pd,
-                                const uint32_t* ps,
-                                const uint32_t *pm,
-                                int             w)
+static void
+sse2_combine_in_reverse_u (pixman_implementation_t *imp,
+                           pixman_op_t              op,
+                           uint32_t *               pd,
+                           const uint32_t *         ps,
+                           const uint32_t *         pm,
+                           int                      w)
 {
     uint32_t s, d;
 
     __m128i xmm_src_lo, xmm_src_hi;
     __m128i xmm_dst_lo, xmm_dst_hi;
 
     while (w && ((unsigned long) pd & 15))
     {
 	s = combine1 (ps, pm);
 	d = *pd;
 
-	*pd++ = core_combine_in_u_pixelsse2 (s, d);
+	*pd++ = core_combine_in_u_pixel_sse2 (s, d);
 	ps++;
 	w--;
 	if (pm)
 	    pm++;
     }
 
     while (w >= 4)
     {
@@ -878,39 +910,41 @@ core_combine_reverse_in_u_sse2 (uint32_t
 	    pm += 4;
     }
 
     while (w)
     {
 	s = combine1 (ps, pm);
 	d = *pd;
 
-	*pd++ = core_combine_in_u_pixelsse2 (s, d);
+	*pd++ = core_combine_in_u_pixel_sse2 (s, d);
 	w--;
 	ps++;
 	if (pm)
 	    pm++;
     }
 }
 
-static force_inline void
-core_combine_reverse_out_u_sse2 (uint32_t*       pd,
-                                 const uint32_t* ps,
-                                 const uint32_t* pm,
-                                 int             w)
+static void
+sse2_combine_out_reverse_u (pixman_implementation_t *imp,
+                            pixman_op_t              op,
+                            uint32_t *               pd,
+                            const uint32_t *         ps,
+                            const uint32_t *         pm,
+                            int                      w)
 {
     while (w && ((unsigned long) pd & 15))
     {
 	uint32_t s = combine1 (ps, pm);
 	uint32_t d = *pd;
 
-	*pd++ = pack_1x64_32 (
-	    pix_multiply_1x64 (
-		unpack_32_1x64 (d), negate_1x64 (
-		    expand_alpha_1x64 (unpack_32_1x64 (s)))));
+	*pd++ = pack_1x128_32 (
+	    pix_multiply_1x128 (
+		unpack_32_1x128 (d), negate_1x128 (
+		    expand_alpha_1x128 (unpack_32_1x128 (s)))));
 
 	if (pm)
 	    pm++;
 	ps++;
 	w--;
     }
 
     while (w >= 4)
@@ -942,42 +976,44 @@ core_combine_reverse_out_u_sse2 (uint32_
 	w -= 4;
     }
 
     while (w)
     {
 	uint32_t s = combine1 (ps, pm);
 	uint32_t d = *pd;
 
-	*pd++ = pack_1x64_32 (
-	    pix_multiply_1x64 (
-		unpack_32_1x64 (d), negate_1x64 (
-		    expand_alpha_1x64 (unpack_32_1x64 (s)))));
+	*pd++ = pack_1x128_32 (
+	    pix_multiply_1x128 (
+		unpack_32_1x128 (d), negate_1x128 (
+		    expand_alpha_1x128 (unpack_32_1x128 (s)))));
 	ps++;
 	if (pm)
 	    pm++;
 	w--;
     }
 }
 
-static force_inline void
-core_combine_out_u_sse2 (uint32_t*       pd,
-                         const uint32_t* ps,
-                         const uint32_t* pm,
-                         int             w)
+static void
+sse2_combine_out_u (pixman_implementation_t *imp,
+                    pixman_op_t              op,
+                    uint32_t *               pd,
+                    const uint32_t *         ps,
+                    const uint32_t *         pm,
+                    int                      w)
 {
     while (w && ((unsigned long) pd & 15))
     {
 	uint32_t s = combine1 (ps, pm);
 	uint32_t d = *pd;
 
-	*pd++ = pack_1x64_32 (
-	    pix_multiply_1x64 (
-		unpack_32_1x64 (s), negate_1x64 (
-		    expand_alpha_1x64 (unpack_32_1x64 (d)))));
+	*pd++ = pack_1x128_32 (
+	    pix_multiply_1x128 (
+		unpack_32_1x128 (s), negate_1x128 (
+		    expand_alpha_1x128 (unpack_32_1x128 (d)))));
 	w--;
 	ps++;
 	if (pm)
 	    pm++;
     }
 
     while (w >= 4)
     {
@@ -1007,45 +1043,47 @@ core_combine_out_u_sse2 (uint32_t*      
 	    pm += 4;
     }
 
     while (w)
     {
 	uint32_t s = combine1 (ps, pm);
 	uint32_t d = *pd;
 
-	*pd++ = pack_1x64_32 (
-	    pix_multiply_1x64 (
-		unpack_32_1x64 (s), negate_1x64 (
-		    expand_alpha_1x64 (unpack_32_1x64 (d)))));
+	*pd++ = pack_1x128_32 (
+	    pix_multiply_1x128 (
+		unpack_32_1x128 (s), negate_1x128 (
+		    expand_alpha_1x128 (unpack_32_1x128 (d)))));
 	w--;
 	ps++;
 	if (pm)
 	    pm++;
     }
 }
 
 static force_inline uint32_t
 core_combine_atop_u_pixel_sse2 (uint32_t src,
                                 uint32_t dst)
 {
-    __m64 s = unpack_32_1x64 (src);
-    __m64 d = unpack_32_1x64 (dst);
-
-    __m64 sa = negate_1x64 (expand_alpha_1x64 (s));
-    __m64 da = expand_alpha_1x64 (d);
-
-    return pack_1x64_32 (pix_add_multiply_1x64 (&s, &da, &d, &sa));
+    __m128i s = unpack_32_1x128 (src);
+    __m128i d = unpack_32_1x128 (dst);
+
+    __m128i sa = negate_1x128 (expand_alpha_1x128 (s));
+    __m128i da = expand_alpha_1x128 (d);
+
+    return pack_1x128_32 (pix_add_multiply_1x128 (&s, &da, &d, &sa));
 }
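
Scalar reference for ATOP as combined above: per channel s·da + d·(1 − sa), which is what pix_add_multiply_1x128() computes for the two products. The reverse-atop and xor helpers that follow only change which of the two alphas are negated:

    /* mul_un8 as in the OVER sketch earlier */
    static inline uint8_t
    atop_un8 (uint8_t s, uint8_t sa, uint8_t d, uint8_t da)
    {
        return (uint8_t) (mul_un8 (s, da) + mul_un8 (d, 255 - sa));
    }
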
 
-static force_inline void
-core_combine_atop_u_sse2 (uint32_t*       pd,
-                          const uint32_t* ps,
-                          const uint32_t* pm,
-                          int             w)
+static void
+sse2_combine_atop_u (pixman_implementation_t *imp,
+                     pixman_op_t              op,
+                     uint32_t *               pd,
+                     const uint32_t *         ps,
+                     const uint32_t *         pm,
+                     int                      w)
 {
     uint32_t s, d;
 
     __m128i xmm_src_lo, xmm_src_hi;
     __m128i xmm_dst_lo, xmm_dst_hi;
     __m128i xmm_alpha_src_lo, xmm_alpha_src_hi;
     __m128i xmm_alpha_dst_lo, xmm_alpha_dst_hi;
 
@@ -1104,30 +1142,32 @@ core_combine_atop_u_sse2 (uint32_t*     
 	    pm++;
     }
 }
 
 static force_inline uint32_t
 core_combine_reverse_atop_u_pixel_sse2 (uint32_t src,
                                         uint32_t dst)
 {
-    __m64 s = unpack_32_1x64 (src);
-    __m64 d = unpack_32_1x64 (dst);
-
-    __m64 sa = expand_alpha_1x64 (s);
-    __m64 da = negate_1x64 (expand_alpha_1x64 (d));
-
-    return pack_1x64_32 (pix_add_multiply_1x64 (&s, &da, &d, &sa));
+    __m128i s = unpack_32_1x128 (src);
+    __m128i d = unpack_32_1x128 (dst);
+
+    __m128i sa = expand_alpha_1x128 (s);
+    __m128i da = negate_1x128 (expand_alpha_1x128 (d));
+
+    return pack_1x128_32 (pix_add_multiply_1x128 (&s, &da, &d, &sa));
 }
 
-static force_inline void
-core_combine_reverse_atop_u_sse2 (uint32_t*       pd,
-                                  const uint32_t* ps,
-                                  const uint32_t* pm,
-                                  int             w)
+static void
+sse2_combine_atop_reverse_u (pixman_implementation_t *imp,
+                             pixman_op_t              op,
+                             uint32_t *               pd,
+                             const uint32_t *         ps,
+                             const uint32_t *         pm,
+                             int                      w)
 {
     uint32_t s, d;
 
     __m128i xmm_src_lo, xmm_src_hi;
     __m128i xmm_dst_lo, xmm_dst_hi;
     __m128i xmm_alpha_src_lo, xmm_alpha_src_hi;
     __m128i xmm_alpha_dst_lo, xmm_alpha_dst_hi;
 
@@ -1186,30 +1226,32 @@ core_combine_reverse_atop_u_sse2 (uint32
 	    pm++;
     }
 }
 
 static force_inline uint32_t
 core_combine_xor_u_pixel_sse2 (uint32_t src,
                                uint32_t dst)
 {
-    __m64 s = unpack_32_1x64 (src);
-    __m64 d = unpack_32_1x64 (dst);
-
-    __m64 neg_d = negate_1x64 (expand_alpha_1x64 (d));
-    __m64 neg_s = negate_1x64 (expand_alpha_1x64 (s));
-
-    return pack_1x64_32 (pix_add_multiply_1x64 (&s, &neg_d, &d, &neg_s));
+    __m128i s = unpack_32_1x128 (src);
+    __m128i d = unpack_32_1x128 (dst);
+
+    __m128i neg_d = negate_1x128 (expand_alpha_1x128 (d));
+    __m128i neg_s = negate_1x128 (expand_alpha_1x128 (s));
+
+    return pack_1x128_32 (pix_add_multiply_1x128 (&s, &neg_d, &d, &neg_s));
 }
 
-static force_inline void
-core_combine_xor_u_sse2 (uint32_t*       dst,
-                         const uint32_t* src,
-                         const uint32_t *mask,
-                         int             width)
+static void
+sse2_combine_xor_u (pixman_implementation_t *imp,
+                    pixman_op_t              op,
+                    uint32_t *               dst,
+                    const uint32_t *         src,
+                    const uint32_t *         mask,
+                    int                      width)
 {
     int w = width;
     uint32_t s, d;
     uint32_t* pd = dst;
     const uint32_t* ps = src;
     const uint32_t* pm = mask;
 
     __m128i xmm_src, xmm_src_lo, xmm_src_hi;
@@ -1271,37 +1313,39 @@ core_combine_xor_u_sse2 (uint32_t*      
 	w--;
 	ps++;
 	if (pm)
 	    pm++;
     }
 }
 
 static force_inline void
-core_combine_add_u_sse2 (uint32_t*       dst,
-                         const uint32_t* src,
-                         const uint32_t* mask,
-                         int             width)
+sse2_combine_add_u (pixman_implementation_t *imp,
+                    pixman_op_t              op,
+                    uint32_t *               dst,
+                    const uint32_t *         src,
+                    const uint32_t *         mask,
+                    int                      width)
 {
     int w = width;
     uint32_t s, d;
     uint32_t* pd = dst;
     const uint32_t* ps = src;
     const uint32_t* pm = mask;
 
     while (w && (unsigned long)pd & 15)
     {
 	s = combine1 (ps, pm);
 	d = *pd;
 
 	ps++;
 	if (pm)
 	    pm++;
-	*pd++ = _mm_cvtsi64_si32 (
-	    _mm_adds_pu8 (_mm_cvtsi32_si64 (s), _mm_cvtsi32_si64 (d)));