Recursive Linear Prefix (RLP) Encoding
I was recently tasked with developing a new encoding library for Hyperledger Burrow to further interoperability with the broader Ethereum ecosystem. Recursive Linear Prefix (RLP) is a data format used to store state in Ethereum, more precisely it is an algorithm for representing arbitrary data structures in binary form. Unlike other serialization techniques however, the output is position dependant in that recovering the original object requires knowledge of the input structure.
History
The first mention of RLP encoding was in the Ethereum yellow paper which logically defines the encoding procedure adopted by Vitalik. It’s not entirely clear why this method was chosen, but it is likely to have been devised for space efficiency. To quote the design rationale:
RLP is intended to be a highly minimalistic serialization format; its sole purpose is to store nested arrays of bytes. Unlike protobuf, BSON and other existing solutions, RLP does not attempt to define any specific data types such as booleans, floats, doubles or even integers; instead, it simply exists to store structure, in the form of nested arrays, and leaves it up to the protocol to determine the meaning of the arrays.
Simply put, this means that we can not always expect the input of our decode function to conform to our target representation.
Internals
As described on the wiki, the input to the encoding function is an ‘item’ which may be a string (byte array) or a list of items. This is evaluated against the following set of conditions in which a list is recursively solved for nested strings.
Prefix | Condition |
---|---|
- | [0x00, 0x7f] |
0x80 | string - 0-55 bytes |
0xb7 | string - > 55 bytes |
0xc0 | list - 0-55 bytes |
0xf7 | list - > 55 bytes |
The length of each item is also added to its prefix.
Examples
Guided by the above rule set, we can begin to encode / decode some example items. Let’s assume we pass the string
foo bar
to our encoding function. After evaluating the length (including whitespace) as 7
, which is distinctly
less that 55
bytes, we should add the byte with value 0x80
to the length to give us our prefix 0x87
. We can then
append the raw bytes of our input string.
f o o b a r
87 66 6f 6f 20 62 61 72
To decode this, our function should first check the prefix. As the value is between [0x80,0xb7)
we can subtract 0x80
to get the length and return the string at offset 1:8
.
Given the list [foo, bar]
, we’ll start the encoding by taking the item’s length plus the byte value 0xc0
. We can
then linearly append each string following the process above; each word is of length 3
, so the prefix is the value 0x83
.
[] {f o o} {b a r}
c8 83 66 6f 6f 83 62 61 72
Let’s follow this procedure in reverse to decode the output. The first prefix is between [0xc0,0xf7)
so we know to construct
a list object, but at this point it is still unclear what the sub-items may be - we could have arbitrary lists or strings.
Therefore the simplest approach is to enumerate over the remaining bytes, taking the next prefix (0x83
) we can extract the
first string, leaving us with 4 bytes (including prefix) as specified by the outer prefix. So our recursive function should
know to continue decoding our second string.