Monday, May 23, 2011

Efficiently splitting the CbCr plane with ARM NEON intrinsics

We've been doing some work with the live video feed on iOS devices, and noticed that some of the frame processing was not very efficient. This post describes how, with a little ARM NEON intrinsics code, we were able to significantly speed up that processing.

The live video feed provided by AVFoundation is available in two pixel formats -- BGRA and YCbCr. (If you want help getting started with video feeds, check out Erica Sadun's useful sample project.)

Suppose, however, you wanted to work on both a grayscale and a color version of an inbound frame.

One option is to request BGRA and convert the color representation to grayscale as needed; Computer Vision Talks has already written an efficient implementation of that.

If you primarily need the grayscale image, and only sometimes need the color image, it is better to request the YCbCr format and to work directly with the Y plane. When the time comes to also use the color-difference planes Cb and Cr, you need to de-interleave them.

The YCbCr pixel layout looks roughly like this (the Cb and Cr bytes alternate within a single plane):

    Cb Cr Cb Cr Cb Cr Cb Cr ...

However, to operate effectively (or with a library such as OpenCV) on each color plane independently, you frequently need the color planes to be laid out more like this, as two separate planes:

    Cb Cb Cb Cb ...
    Cr Cr Cr Cr ...

You could write some simple C code that will loop over each pixel and put it into its new home. However, it is dramatically more efficient to do this using ARM NEON intrinsics.
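Such a scalar version might look like this (a sketch -- the function and parameter names are my own, not from any Apple API):

```c
#include <stddef.h>
#include <stdint.h>

// Scalar de-interleave: bytes 0, 2, 4, ... of the interleaved CbCr plane
// go to the Cb plane, and bytes 1, 3, 5, ... go to the Cr plane.
void split_cbcr_scalar(const uint8_t *cbcr, uint8_t *cb, uint8_t *cr,
                       size_t pairCount)
{
    for (size_t i = 0; i < pairCount; i++) {
        cb[i] = cbcr[2 * i];
        cr[i] = cbcr[2 * i + 1];
    }
}
```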

Before using ARM NEON intrinsics, you need to make slight modifications to your Xcode build and project settings. In short: set Architectures to include both armv6 and armv7, and add -mfloat-abi=softfp -mfpu=neon to Other C Flags in Xcode. With these in hand, we can now write a very simple, very efficient de-interleaver, using NEON's load and store instructions.

With no further ado, here's the code:

#ifdef _ARM_ARCH_7
#include <arm_neon.h>
#endif

// imageBuffer is the CVImageBufferRef we're working with
size_t cbcrPlane = 1; // the Y plane is plane 0

size_t width = CVPixelBufferGetWidthOfPlane(imageBuffer, cbcrPlane);
size_t height = CVPixelBufferGetHeightOfPlane(imageBuffer, cbcrPlane);
size_t bytesPerRow = CVPixelBufferGetBytesPerRowOfPlane(imageBuffer, cbcrPlane);
assert(width == bytesPerRow); // keep things simple for illustration purposes; this also seems always to be true, in practice

uint8_t *planeBaseAddress = (uint8_t *)CVPixelBufferGetBaseAddressOfPlane(imageBuffer, cbcrPlane);

size_t planeSize = width * height / 2;
uint8_t *bluePlane = (uint8_t *)malloc(planeSize);
uint8_t *redPlane = (uint8_t *)malloc(planeSize);

#ifdef _ARM_ARCH_7
for (uint32_t i = 0; i < width * height / 16; i++) {
    uint8_t *cbcrSrc = &planeBaseAddress[i * 16];
    uint8_t *blueDest = &bluePlane[i * 8];
    uint8_t *redDest = &redPlane[i * 8];
    uint8x8x2_t loaded = vld2_u8(cbcrSrc); // load 16 interleaved source bytes into two 8-byte registers, de-interleaving along the way
    vst1_u8(blueDest, loaded.val[0]); // write the 8 Cb bytes to the blue-difference destination...
    vst1_u8(redDest, loaded.val[1]); // ...and the 8 Cr bytes to the red-difference destination
}
#else
// use a non-vectorized C implementation
for (uint32_t i = 0; i < planeSize; i++) {
    bluePlane[i] = planeBaseAddress[i * 2];
    redPlane[i] = planeBaseAddress[i * 2 + 1];
}
#endif

Note that, in order to keep it short, this code is missing error checking and graceful recovery, and doesn't handle the case in which width * height doesn't divide evenly by 16. (In that case, use the vectorized loop for as many full 16-byte chunks as there are, and then use scalar code for the remainder.)
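One way to combine the two is a sketch like the following (the function name is mine; the NEON path is guarded here with the compiler's __ARM_NEON__ macro, so the scalar loop also serves non-ARM builds):

```c
#include <stddef.h>
#include <stdint.h>
#ifdef __ARM_NEON__
#include <arm_neon.h>
#endif

// De-interleave pairCount Cb/Cr byte pairs; pairCount need not be a
// multiple of 8.
void split_cbcr(const uint8_t *cbcr, uint8_t *cb, uint8_t *cr,
                size_t pairCount)
{
    size_t i = 0;
#ifdef __ARM_NEON__
    // Vectorized bulk: each step consumes 16 interleaved bytes (8 pairs).
    for (; i + 8 <= pairCount; i += 8) {
        uint8x8x2_t v = vld2_u8(&cbcr[2 * i]);
        vst1_u8(&cb[i], v.val[0]);
        vst1_u8(&cr[i], v.val[1]);
    }
#endif
    // Scalar remainder (and the whole job on non-NEON builds).
    for (; i < pairCount; i++) {
        cb[i] = cbcr[2 * i];
        cr[i] = cbcr[2 * i + 1];
    }
}
```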

Each of the intrinsics vld2_u8 and vst1_u8 is backed by a single vectorized instruction, making this very fast to execute. You could squeeze out still more performance by writing the entire thing in assembly, but with the changes above, we were able to reduce de-interleaving from 25% to 2% of total frame processing time -- a significant improvement.


  1. Hi there, thanks for posting neon code, it's really difficult to find examples or code in this subject. Do you know how can I divide a 4 floats neon register by another 4 floats register? I can't find any documentation at all about divisions with neon... i hardly believe that operation is not supported. I need to divide! xD

    Thank you in advance!!

  2. @PaulaCL Believe it. :) There is no division. (See e.g. if you are still in disbelief.) In some circumstances, you can get away with multiplication instead, or find other tricks...
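To expand on that reply: NEON's usual route to division is a reciprocal estimate refined by Newton-Raphson steps -- vrecpe_f32 produces a coarse estimate e of 1/b, and each vrecps_f32 call computes the factor (2 - b*e) used to refine it, so after a couple of rounds you can multiply by e instead of dividing. Here is a scalar sketch of the same scheme, for positive b only (the crude seed here needs more rounds than vrecpe's would):

```c
#include <math.h>

// Newton-Raphson reciprocal: e' = e * (2 - b * e).
// In NEON terms, vrecps_f32 computes the (2 - b * e) factor on a whole
// vector at once, and vmulq_f32 applies it.
float nr_reciprocal(float b)   /* b must be > 0 */
{
    int exp;
    frexpf(b, &exp);               // b = m * 2^exp, with m in [0.5, 1)
    float e = ldexpf(1.0f, -exp);  // seed: b * e lands in [0.5, 1), so the iteration converges
    for (int round = 0; round < 5; round++)
        e = e * (2.0f - b * e);    // each round roughly squares the relative error
    return e;
}
```

A division a / b then becomes a * nr_reciprocal(b).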

  3. This comment has been removed by the author.

  4. I tried to use this code to replace a plain-C implementation of the same de-interleaving task. Unfortunately, it does not work. I believe that there is an error in the loop guard condition, which as written copies twice as much memory as it should. Your terminating condition is the (original) buffer size divided by four, but it walks 8x the loop variable. I get a crash if I don't divide the buffer size by 8.

    Does vld2_u8() really swizzle the bytes as they are being copied from memory into the registers? Because it isn't working for me. I get images composed of fine vertical stripes.

    This is my plain C code to do roughly the same thing, except it works in 32-bit chunks, not 64. If I can figure out the neon functions, I may try to find a half-way solution, although cutting the copy time in half again won't make a big impact on my code, since the encoder I'm using is now the biggest time sink.

    _rawImage is a vpx_image_t from the vpx library.

    unsigned char * bAddress = CVPixelBufferGetBaseAddressOfPlane(imageBuffer, 1);
    size_t bRowbytes = CVPixelBufferGetBytesPerRowOfPlane(imageBuffer, 1);
    size_t bHeight = CVPixelBufferGetHeightOfPlane(imageBuffer, 1);

    NSUInteger length = bRowbytes*bHeight*0.5f;

    static float ratio = (float)sizeof(char)/sizeof(uint32_t); // should be 0.25

    const uint32_t *const p = (uint32_t *)bAddress;
    uint32_t *p1 = (uint32_t *)_rawImage->planes[1];
    uint32_t *p2 = (uint32_t *)_rawImage->planes[2];
    NSUInteger count = (NSUInteger)(ratio*length);
    uint32_t d1, d2;

    for (NSUInteger i=0; i<count; i++) {
        d1 = p[i*2];
        d2 = p[i*2+1];

        // Take the first of each pair of bytes in each 4-byte block
        p1[i] = (d1 & 0x00ff0000)>>8|(d1 & 0x000000ff)|(d2 & 0x00ff0000)<<8|(d2 & 0x000000ff)<<16;

        // Take the second of each pair of bytes in each 4-byte block
        // Pack the 1st and 3rd bytes from p1 into the 3rd and 4th bytes
        // Pack the 1st and 3rd bytes from p2 into the 1st and 2nd bytes
        p2[i] = (d1 & 0xff000000)>>16|(d1 & 0x0000ff00)>>8|(d2 & 0xff000000)|(d2 & 0x0000ff00)<<8;
    }
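(For what it's worth, the word-packing in that comment can be sanity-checked in plain C. The helper below is hypothetical, with the source words interpreted little-endian as on iOS devices:)

```c
#include <stdint.h>

// Given two little-endian words covering 4 interleaved Cb/Cr byte pairs,
// pack the Cb bytes into *cbWord and the Cr bytes into *crWord, using the
// same mask-and-shift scheme as the comment above.
void pack_pair(uint32_t d1, uint32_t d2, uint32_t *cbWord, uint32_t *crWord)
{
    // first byte of each pair (Cb) -> bytes 0..3 of *cbWord
    *cbWord = (d1 & 0x00ff0000)>>8 | (d1 & 0x000000ff)
            | (d2 & 0x00ff0000)<<8 | (d2 & 0x000000ff)<<16;
    // second byte of each pair (Cr) -> bytes 0..3 of *crWord
    *crWord = (d1 & 0xff000000)>>16 | (d1 & 0x0000ff00)>>8
            | (d2 & 0xff000000)     | (d2 & 0x0000ff00)<<8;
}
```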

  5. I got your code to work. Embarrassingly, I had changed the dimensions coming from my capture source without changing the dimensions being fed to the encoder. The loop counter error had to be fixed, though. And it didn't improve my frame rate on my 3GS, but let's try my iPad 2! Thank you.
