Run Length Encoding

Our first assignment in our Learning Machines class is to implement a run length encoder and decoder. This is a simple data compression algorithm that benefits from repeated patterns.

It happens that I previously had an idea for an Arduino project that requires a light-weight data decompression algorithm to decode audio data. I was going to use run length encoding because it is simple to implement and the code itself won't take up much of the Arduino's precious memory. I'll also need to encode the audio files in Python, and I'll use the below code to do it.

The example given in class is for a string, such as:

"WWWWBB111WWB22WWWW22BBW111"

being converted to somethign like this:

[(4, 'W'), (2, 'B'), (3, '1'), (2, 'W'), (1, 'B'), (2, '2'), (4, 'W'), (2, '2'), (2, 'B'), (1, 'W'), (3, '1')]

Each Python tuple stores a character and a number indicating how many times that character appears at the given location in the sequence. The original sequence can be reconstructed from the encoded sequence. Ideally this new representation of the data will take up less memory than the original.

The simplest and most straightforward way to implement this encoding is with regular expressions.

In [1]:
import re

def encode1(sequence):
    return [(len(a), b) for a, b in re.findall(r'((\w)\2*)', sequence)]

test1 = "WWWWBB111WWB22WWWW22BBW111"

encoded1 = encode1(test1)

print(encoded1)
[(4, 'W'), (2, 'B'), (3, '1'), (2, 'W'), (1, 'B'), (2, '2'), (4, 'W'), (2, '2'), (2, 'B'), (1, 'W'), (3, '1')]

This can be easily decoded, giving the same sequence as the original.

In [2]:
def decode1(compressed):
    return ''.join([''.join([c] * n) for n, c in compressed])

print(decode1(encoded1))
WWWWBB111WWB22WWWW22BBW111

Regular expressions only work if the input is a string. That won't work for my Arduino project. What if instead the input is a sequence of numbers? That's easy to do with Python's itertools package.

The intent of the groupby function is to operate on sorted data, but when the data isn't sorted, its behavior is exactly what I need it to be for this problem.

In [3]:
from itertools import groupby

test2 = [3, 3, 4, 5, 6, 6, 6, 6, 3, 3, 2, 2, 2, 2, 2]

def encode2(sequence):
    return [(len(list(b)), a) for a, b in groupby(sequence)]

encoded2 = encode2(test2)

print(encoded2)
[(2, 3), (1, 4), (1, 5), (4, 6), (2, 3), (5, 2)]

Decoding this is similar to decode1:

In [4]:
def decode2(compressed):
    return [x for y in [[c] * n for n, c in compressed] for x in y]

print(decode2(encoded2))
[3, 3, 4, 5, 6, 6, 6, 6, 3, 3, 2, 2, 2, 2, 2]

It happens this new encode2 function can encode strings just as well as a sequence of numbers.

In [5]:
print(encode2(test1))
[(4, 'W'), (2, 'B'), (3, '1'), (2, 'W'), (1, 'B'), (2, '2'), (4, 'W'), (2, '2'), (2, 'B'), (1, 'W'), (3, '1')]

Since the second solution is more versatile, it is the better solution.

Comments