Efficient decoding of binary data in numpy

In many experimental sciences, we acquire data from one or more instruments (cameras, photodetectors, other sensors, etc.). Oftentimes the data is encoded as binary numeric types, structures of numeric types, and bit fields.

Here I assume we are dealing with “medium data” which, in Wes McKinney's words, is:

Medium Data (n): Not too big for a single machine, but too big to be dumb about.

We can easily and efficiently read this kind of data using Numpy, the foundational library for numerical computing in Python. In this post I show the powerful tools numpy offers to “interpret” (or decode) binary data.

An example of binary data

Recently, I had to read data from a new instrument that sends 48-bit “words” to the PC.

The bit layout of the word is:

[placeholder]

Within each word, the byte order goes from byte1 (the first) to byte6 (the last). The word comprises three numeric fields, MT, mT and CH (unsigned integers of 29, 14 and 3 bits respectively), and a few flags (#B, OV).

For the numeric fields, the bit order in parentheses shows that the most-significant bit (MSB) is encountered first when reading the word in order from the first to the last bit. This byte order is called big-endian (often indicated by ‘>’).
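As a minimal illustration of the difference (a toy example, not the instrument's format): the same two bytes decode to different integers depending on the endianness given in the dtype string:

```python
import numpy as np

# Two raw bytes: 0x01 followed by 0x02.
raw = b'\x01\x02'

# Big-endian ('>'): the first byte is the most significant -> 0x0102 = 258.
big = np.frombuffer(raw, dtype='>u2')[0]

# Little-endian ('<'): the first byte is the least significant -> 0x0201 = 513.
little = np.frombuffer(raw, dtype='<u2')[0]

print(big, little)   # 258 513
```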

This example may seem fairly convoluted, but it is quite common in data from scientific instruments.

Let’s see how we can decode this data, easily and efficiently, using numpy.

Numpy magic: the dtype system

This section is a brief digression to introduce numpy’s dtype system and array “views” of binary buffers.

One of the most powerful features of numpy is providing a flexible memory model to access numeric data in arbitrary layouts.

At the core of the numpy library there is the ndarray object, i.e. a multi-dimensional array of homogeneous data type.

The crucial point is: what does numpy accept as homogeneous data type?

We can certainly use all the standard numeric types, like integers (signed or unsigned, with 8, 16, 32 or 64 bits), floats (16, 32 or 64 bits) or even complex numbers (internally a pair of floats).

But what if we want to define a more complex structure?

Numpy allows defining structures to be used as array elements: we just need to define a dtype describing such a structure. For example, to define a structure of an int16 followed by a float16 we write:

In [2]:
import numpy as np
In [5]:
custom_dtype = np.dtype([('time', np.int16), ('humidity', np.float16)])
custom_dtype
Out[5]:
dtype([('time', '<i2'), ('humidity', '<f2')])

Note that we assigned a name to each field of the structure, in this case “time” (a 16-bit integer) and “humidity” (a 16-bit float). The ‘<’ sign in the representation indicates little-endian byte order. For alternative syntaxes to define a new dtype see the comprehensive dtype documentation.
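A dtype built this way can also be inspected programmatically, which is handy when checking that a layout matches a spec. A quick sketch using the same field names as above:

```python
import numpy as np

custom_dtype = np.dtype([('time', np.int16), ('humidity', np.float16)])

# Total size of one element: 2 bytes (int16) + 2 bytes (float16) = 4 bytes.
print(custom_dtype.itemsize)            # 4

# Field names, and the (dtype, byte-offset) pair of each field.
print(custom_dtype.names)               # ('time', 'humidity')
print(custom_dtype.fields['humidity'])  # (dtype('float16'), 2)
```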

We can build an array of such structures by specifying the dtype argument (accepted by the majority of numpy functions returning arrays) like this:

In [6]:
data = np.ones(6, dtype=custom_dtype)
data
Out[6]:
array([(1, 1.0), (1, 1.0), (1, 1.0), (1, 1.0), (1, 1.0), (1, 1.0)], 
      dtype=[('time', '<i2'), ('humidity', '<f2')])

The previous command defined a 6-element array. Each element of the array is the previously defined structure. Arrays with a custom dtype are called structured arrays in numpy.

Let’s see a few examples of what we can do with such an array.

1. Scalar to array broadcast

Add 10 to all the time fields:

In [63]:
data['time'] += 10
data
Out[63]:
array([(11, 1.0), (11, 1.0), (11, 1.0), (11, 1.0), (11, 1.0), (11, 1.0)], 
      dtype=[('time', '<i2'), ('humidity', '<f2')])

2. Assign data

Assign data to time and humidity fields:

In [64]:
data['time'] = np.arange(6)*10
data['humidity'] = (np.random.randn(6) + 5)*10
data
Out[64]:
array([(0, 50.75), (10, 46.40625), (20, 62.1875), (30, 48.78125),
       (40, 35.53125), (50, 48.78125)], 
      dtype=[('time', '<i2'), ('humidity', '<f2')])

3. Indexing and slicing

Take element number 3 (the 4th element) of the array:

In [72]:
data[3]
Out[72]:
(30, 48.78125)

A single element contains the values of the two fields.
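A single element behaves like a record whose fields can be accessed by name. A small self-contained sketch (same dtype as above):

```python
import numpy as np

custom_dtype = np.dtype([('time', np.int16), ('humidity', np.float16)])
data = np.zeros(6, dtype=custom_dtype)
data['time'] = np.arange(6) * 10

record = data[3]        # a single structured scalar
print(record['time'])   # 30 -- same value as data['time'][3]
```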

Next, let’s take a slice starting at element 1 and taking every second element:

In [65]:
data[1::2]
Out[65]:
array([(10, 46.40625), (30, 48.78125), (50, 48.78125)], 
      dtype=[('time', '<i2'), ('humidity', '<f2')])

The result is a view on the original structured array containing only the selected elements.
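Note that basic slicing returns a view, not a copy: writing through the slice modifies the original array. A minimal demonstration:

```python
import numpy as np

custom_dtype = np.dtype([('time', np.int16), ('humidity', np.float16)])
data = np.zeros(6, dtype=custom_dtype)

view = data[1::2]    # every second element, starting at 1
view['time'] = 99    # write through the view...

print(data['time'])  # [ 0 99  0 99  0 99] -- the original changed
```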

4. View with different dtype

In [70]:
custom_dtype_b = np.dtype([('byte1', 'u1'), ('byte2', 'u1'), ('humidity', '<f2')])
In [71]:
data_b = data.view(custom_dtype_b)
data_b
Out[71]:
array([(0, 0, 50.75), (10, 0, 46.40625), (20, 0, 62.1875),
       (30, 0, 48.78125), (40, 0, 35.53125), (50, 0, 48.78125)], 
      dtype=[('byte1', 'u1'), ('byte2', 'u1'), ('humidity', '<f2')])

Here data_b points to the same memory as data, but it reinterprets each array element with a different structure (the 16-bit integer is split into 2 bytes).
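Because both arrays share the same memory, a write through one view is visible through the other. A tiny sketch with field names of my own choosing (little-endian, so the second byte is the high byte):

```python
import numpy as np

dt_num   = np.dtype([('x', '<u2')])                # one little-endian uint16
dt_bytes = np.dtype([('lo', 'u1'), ('hi', 'u1')])  # its two constituent bytes

a = np.zeros(3, dtype=dt_num)
b = a.view(dt_bytes)   # same memory, different structure

b['hi'] = 1            # set the high byte of every element...
print(a['x'])          # [256 256 256] -- ...and the numeric view sees 0x0100
```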

Let’s decode!

Before starting, let’s define a dummy binary buffer representing 4 words of 6 bytes each:

In [103]:
buff = b'\xFF' * (6*4)
In [104]:
len(buff)
Out[104]:
24

In a real-world situation this buffer would be read from disk using something like:

with open('datafile.dat', 'rb') as f:
    buff = f.read()
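For files too large to load at once, np.memmap offers the same dtype-based access without reading the whole file into memory; the data is paged in on demand. A sketch (the file name and its contents are invented for illustration):

```python
import numpy as np
import tempfile, os

# Create a small demo file of raw bytes (a stand-in for 'datafile.dat').
path = os.path.join(tempfile.mkdtemp(), 'datafile.dat')
with open(path, 'wb') as f:
    f.write(b'\xFF' * 24)

# Map it read-only as 16-bit big-endian unsigned integers.
mm = np.memmap(path, dtype='>u2', mode='r')
print(mm.shape)    # (12,) -- 24 bytes / 2 bytes per element
print(int(mm[0]))  # 65535
```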

Coming back to the instrument example, we recognize that MT and mT are aligned to 32-bit and 16-bit boundaries, with big-endian representation.

To access those fields we can simply interpret the binary data with the following dtype:

In [105]:
dtype_fields = np.dtype([('timestamps', '>u4'), ('nanotimes', '>u2')])
dtype_fields
Out[105]:
dtype([('timestamps', '>u4'), ('nanotimes', '>u2')])

We only need to copy the bits CH, OV, #B first, and then set them to 0 in the original buffer. This operation is easier if we have direct access to the byte containing the 3 bits to save. Conveniently, this byte can be accessed by defining a second dtype:

In [106]:
dtype_bytes = np.dtype([('byte%d' % b, 'u1') for b in range(6)])
dtype_bytes
Out[106]:
dtype([('byte0', 'u1'), ('byte1', 'u1'), ('byte2', 'u1'), ('byte3', 'u1'), ('byte4', 'u1'), ('byte5', 'u1')])
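A useful sanity check before viewing the same buffer with two different dtypes is that their element sizes match; both must describe the same 48-bit (6-byte) word:

```python
import numpy as np

dtype_fields = np.dtype([('timestamps', '>u4'), ('nanotimes', '>u2')])
dtype_bytes  = np.dtype([('byte%d' % b, 'u1') for b in range(6)])

# Both dtypes pack into 6 bytes with no padding.
print(dtype_fields.itemsize, dtype_bytes.itemsize)  # 6 6
```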

Now we can “view” this binary buffer according to the two different dtypes we previously defined:

In [107]:
data = np.frombuffer(buff, dtype=dtype_fields)
data
Out[107]:
array([(4294967295, 65535), (4294967295, 65535), (4294967295, 65535),
       (4294967295, 65535)], 
      dtype=[('timestamps', '>u4'), ('nanotimes', '>u2')])
In [108]:
datab = np.frombuffer(buff, dtype=dtype_bytes)
datab
Out[108]:
array([(255, 255, 255, 255, 255, 255), (255, 255, 255, 255, 255, 255),
       (255, 255, 255, 255, 255, 255), (255, 255, 255, 255, 255, 255)], 
      dtype=[('byte0', 'u1'), ('byte1', 'u1'), ('byte2', 'u1'), ('byte3', 'u1'), ('byte4', 'u1'), ('byte5', 'u1')])

When creating an array from an already existing buffer, the data is not copied; it is just interpreted according to the passed dtype. In this example data and datab point to the same binary content in memory: they simply “view” the data differently, i.e. they provide different interfaces to the same data.
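We can verify that the two arrays really overlap in memory with np.shares_memory:

```python
import numpy as np

buff = b'\xFF' * 24
dtype_fields = np.dtype([('timestamps', '>u4'), ('nanotimes', '>u2')])
dtype_bytes  = np.dtype([('byte%d' % b, 'u1') for b in range(6)])

data  = np.frombuffer(buff, dtype=dtype_fields)
datab = np.frombuffer(buff, dtype=dtype_bytes)

# Two views, one underlying buffer: no bytes were copied.
print(np.shares_memory(data, datab))  # True
```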

We can use datab['byte0'] as a normal array containing uint8 elements:

In [109]:
datab['byte0']
Out[109]:
array([255, 255, 255, 255], dtype=uint8)

Let’s start by copying the 3 bits into another array:

In [110]:
bitfields = np.zeros_like(data, dtype='u1')
bitfields
Out[110]:
array([0, 0, 0, 0], dtype=uint8)
In [111]:
bitfields = datab['byte0'].copy()
np.right_shift(bitfields, 5, out=bitfields)
Out[111]:
array([7, 7, 7, 7], dtype=uint8)
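The shift by 5 is just the usual way of isolating the top 3 bits of a byte; conversely, ANDing with the mask 0b00011111 keeps the low 5 bits. A worked example on a single byte (the value is invented for illustration):

```python
import numpy as np

byte = np.uint8(0b10110101)          # example byte

top3 = byte >> 5                     # 0b101   = 5:  the 3 most-significant bits
low5 = byte & np.uint8(0b00011111)   # 0b10101 = 21: the remaining 5 bits

print(top3, low5)                    # 5 21
# The original byte can be reassembled from the two pieces:
assert (int(top3) << 5) | int(low5) == 0b10110101
```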

Then we can set those 3 bits to 0 in the original buffer:

In [112]:
datab.flags.writeable = True
datab['byte0'] &= 0b00011111

Note that we need to set the array’s writeable flag to True in order to be able to modify the buffer (recent numpy versions refuse this for arrays backed by an immutable bytes object, in which case the buffer should be a bytearray to begin with). Having modified the buffer, data also sees the new values:

In [113]:
data
Out[113]:
array([(536870911, 65535), (536870911, 65535), (536870911, 65535),
       (536870911, 65535)], 
      dtype=[('timestamps', '>u4'), ('nanotimes', '>u2')])
In [114]:
data['timestamps']
Out[114]:
array([536870911, 536870911, 536870911, 536870911], dtype=uint32)
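If in-place modification is needed, a safer pattern (a sketch, assuming a recent numpy that refuses to make views of immutable bytes objects writeable) is to load the data into a bytearray, whose frombuffer view is writeable from the start:

```python
import numpy as np

# A mutable buffer: frombuffer on a bytearray yields a writeable array.
buff = bytearray(b'\xFF' * 24)
datab = np.frombuffer(buff, dtype='u1')

print(datab.flags.writeable)       # True: no flag fiddling required
datab[::6] &= 0b00011111           # zero the top 3 bits of every byte0
print(hex(buff[0]), hex(buff[1]))  # 0x1f 0xff -- the change reaches the buffer
```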
