Hashing shows up everywhere in computer science, from quick lookups in hash tables to verifying data with checksums and fingerprints. A hash table works by turning each key into an integer so it can be retrieved fast—ideally with each distinct key producing its own unique value. When two different inputs produce the same output, you get a collision.
While hash tables usually handle collisions through strategies like chaining or open addressing, it is still best to avoid them wherever possible, especially for use cases such as calculating checksums or fingerprinting.
Hashing raw byte arrays is simple enough, but real‑world data structures often contain fields of different lengths, and that’s where collisions become surprisingly easy to introduce.
The Real-Time Analytics lab of Dynatrace Research has developed an open-source library called Hash4j that implements some best practices for hashing, including our preferred approach to variable-length fields.
This article walks through why variable‑length fields make hashing tricky and how to handle them cleanly.
A common pitfall: Concatenating variable-length fields
Let’s consider a simple example of a data structure with two variable-length string fields: firstName and lastName.
record Person(String firstName, String lastName) {}
A naive approach to hashing this structure might involve concatenating the two strings and then applying a hash function to the resulting byte sequence. However, this method can lead to collisions. For instance, the records ("ANNA", "BELL") and ("ANN", "ABELL") would both produce the same concatenated string "ANNABELL". This occurs because the boundaries between the variable-length fields are lost, making it impossible to distinguish between different combinations of field values.
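The collision is easy to reproduce in a few lines of plain Java (a minimal sketch; the class and method names are hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch: naive serialization by plain concatenation.
class NaiveConcat {
    // Concatenates both fields and returns the resulting UTF-8 bytes.
    static byte[] serialize(String firstName, String lastName) {
        return (firstName + lastName).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] a = serialize("ANNA", "BELL");
        byte[] b = serialize("ANN", "ABELL");
        // Distinct records, identical byte sequences -> guaranteed hash collision
        System.out.println(Arrays.equals(a, b)); // prints "true"
    }
}
```

Because the two distinct records map to identical byte sequences, every hash function, no matter how good, must produce the same hash for both.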
When serializing data structures for hashing, good practice is to ensure that the serialized byte sequence allows unambiguous reconstruction of the original structure. The naive concatenation above fails this test: without additional information, the field boundaries cannot be recovered.
Let’s explore different strategies to handle variable-length fields properly by adding extra information when mapping the data structure to a byte sequence.
Strategy 1: Delimiters
One effective strategy to avoid this issue is to use delimiters between variable-length fields. For example, we could serialize the data structure as:
firstName ‖ ";" ‖ lastName
where ‖ denotes concatenation and ";" is a chosen delimiter. This way, the records ("ANNA", "BELL") and ("ANN", "ABELL") would serialize to "ANNA;BELL" and "ANN;ABELL", respectively, thus producing distinct byte sequences and avoiding collisions.
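The delimiter approach can be sketched in plain Java as follows (an illustrative sketch that assumes the delimiter never occurs in the data; the names are hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch: delimiter-based serialization, assuming ';' never occurs in the data.
class DelimiterConcat {
    static byte[] serialize(String firstName, String lastName) {
        return (firstName + ";" + lastName).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] a = serialize("ANNA", "BELL"); // "ANNA;BELL"
        byte[] b = serialize("ANN", "ABELL"); // "ANN;ABELL"
        System.out.println(Arrays.equals(a, b)); // prints "false"
    }
}
```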
The limitation of this approach is the need to ensure that the chosen delimiter does not appear in the actual data. For strings like names, this might be manageable as certain byte patterns are not used. For example, the null character is often used as a delimiter in C-style strings as it does not appear in regular text.
If the delimiter can appear in the actual data, escape sequences are needed to tell delimiters apart from real data, which means scanning the entire input and replacing conflicting bytes and adds overhead. Nested structures like List&lt;List&lt;String&gt;&gt; make this even harder, since each nesting level needs its own delimiter rules, quickly becoming complex and error-prone.
Strategy 2: Length prefixes
A better approach is to use length prefixes for each variable-length field. This method involves prefixing each field with its length before concatenation. For example, we could serialize the data structure as:
len(firstName) ‖ firstName ‖ len(lastName) ‖ lastName
where len() represents the length of the string. This way, the records ("ANNA", "BELL") and ("ANN", "ABELL") would serialize to 4 ‖ "ANNA" ‖ 4 ‖ "BELL" and 3 ‖ "ANN" ‖ 5 ‖ "ABELL", respectively, resulting in distinct byte sequences. This approach eliminates the need for delimiters and escaping. Additionally, it scales better for nested data structures, as each level can independently use length prefixes without concern for delimiter conflicts.
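The prefix scheme can be sketched in plain Java using fixed 4-byte lengths (an illustrative sketch; the class and method names are hypothetical):

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch: length-prefix serialization with fixed 4-byte lengths.
class LengthPrefix {
    static void putPrefixed(ByteArrayOutputStream out, String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        out.writeBytes(ByteBuffer.allocate(4).putInt(bytes.length).array()); // length first
        out.writeBytes(bytes);                                               // then the data
    }

    static byte[] serialize(String firstName, String lastName) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        putPrefixed(out, firstName);
        putPrefixed(out, lastName);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // len("ANNA") = 4 vs. len("ANN") = 3 makes the sequences differ immediately
        System.out.println(Arrays.equals(
                serialize("ANNA", "BELL"), serialize("ANN", "ABELL"))); // prints "false"
    }
}
```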
This method is the standard practice in many serialization protocols, such as Protocol Buffers and Thrift. Length prefixes make serialized data easy to parse because the parser can read the length first and then extract exactly that many bytes. But for hashing, this doesn’t matter because you only need a one‑way conversion to a byte sequence, not full reconstruction.
Strategy 3: Length postfixes
Another approach is to use length postfixes, where the length of each variable-length field is appended after the field value itself. For example, we could serialize the data structure as:
firstName ‖ len(firstName) ‖ lastName ‖ len(lastName)
This method also ensures that different combinations of variable-length fields produce distinct byte sequences. It may be less intuitive than using length prefixes, as the length information appears after the data. It is also less common in serialization, because reconstructing the original data structure would require parsing from the end of the byte sequence. For hashing purposes, however, this method is equally effective in preventing collisions, because for a high-quality hash function the hash quality does not change when the input bytes are reordered.
The method is even better suited for streaming scenarios or data structures where the length of the data is not known in advance. Consider null-terminated strings, where the end of the string is indicated by a special null character, or linked lists, where the end of the list is indicated by a null reference. With length prefixes, such data would have to be scanned twice: once to determine the length and once to write the values. The postfix approach allows writing the data first and then appending the length, which can be counted during the same pass.
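The single-pass property can be sketched in plain Java, counting elements while writing and appending the count at the end (an illustrative sketch; the names are hypothetical):

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.List;

// Illustrative sketch: length-postfix serialization of a sequence whose size
// may not be known up front; the count is tracked while writing and appended last.
class LengthPostfix {
    static byte[] serialize(Iterable<Integer> values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int count = 0;
        for (int v : values) { // single pass over the data
            out.writeBytes(ByteBuffer.allocate(4).putInt(v).array());
            count++;
        }
        out.writeBytes(ByteBuffer.allocate(4).putInt(count).array()); // length appended last
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = serialize(List.of(5, 1, 4));
        System.out.println(bytes.length); // prints "16" (3 values + 1 count, 4 bytes each)
    }
}
```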
What about mixing strategies?
Though not recommended, strategies can also be mixed, provided that this is done with careful consideration. For example, one could use length prefixes for some fields and delimiters for others. However, this approach can introduce complexity and potential pitfalls if not managed carefully. It is generally advisable to stick to a single consistent strategy throughout the data structure to maintain clarity and reduce the risk of errors that would increase the collision probability.
To demonstrate this, consider a record with two variable-length integer arrays:
record Record(int[] v1, int[] v2) {}
Assume that we serialize the first array using length prefixes and the second array using a postfix length. The serialization would look like this:
len(v1) ‖ v1 ‖ v2 ‖ len(v2)
This serialization is unambiguous: assuming the lengths are encoded with a fixed number of bytes, the record can be reconstructed by following these steps:

- Read the length of v1 from the beginning of the byte sequence
- Read the corresponding number of integer values for the first array
- Read the length of v2 from the end of the byte sequence
- Read the corresponding number of integer values for the second array from the remaining bytes in the middle
Now, let’s see what happens if we serialize the first array with a length postfix and the second array with a length prefix. The serialization would look like this:
v1 ‖ len(v1) ‖ len(v2) ‖ v2
In this case, the serialization can be ambiguous. Consider the records ([5,1,4], [6,7]) and ([5], [3,2,6,7]). Both records would produce the same serialized byte sequence:
5 ‖ 1 ‖ 4 ‖ 3 ‖ 2 ‖ 6 ‖ 7
This example shows that mixing strategies can lead to ambiguities and potential hash collisions if not handled carefully. Applying a single strategy consistently across all variable-length fields in a data structure is a safer and more reliable approach.
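The ambiguity above can be reproduced with a short plain-Java sketch (illustrative only; the names are hypothetical):

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustrative sketch: mixed strategies, v1 ‖ len(v1) ‖ len(v2) ‖ v2,
// with fixed 4-byte values and lengths.
class MixedStrategies {
    static void putInt(ByteArrayOutputStream out, int v) {
        out.writeBytes(ByteBuffer.allocate(4).putInt(v).array());
    }

    static byte[] serialize(int[] v1, int[] v2) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int v : v1) putInt(out, v); // v1
        putInt(out, v1.length);          // len(v1) as postfix
        putInt(out, v2.length);          // len(v2) as prefix
        for (int v : v2) putInt(out, v); // v2
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] a = serialize(new int[] {5, 1, 4}, new int[] {6, 7});
        byte[] b = serialize(new int[] {5}, new int[] {3, 2, 6, 7});
        // Both records serialize to 5 ‖ 1 ‖ 4 ‖ 3 ‖ 2 ‖ 6 ‖ 7
        System.out.println(Arrays.equals(a, b)); // prints "true"
    }
}
```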
Is the length information needed for all variable-length fields?
You might have noticed that we could skip the length of one field in the example above. When using length-postfixes, the length of the first variable-length field can be omitted because it can be inferred from the total length of the byte sequence minus the lengths of the other fields.
v1 ‖ v2 ‖ len(v2)
Similarly, when using length-prefixes, the length of the last variable-length field can be omitted for the same reason.
It’s safe to omit certain length values if you can still determine the size of every variable‑length field without ambiguity, whether by reading from the beginning, the end, or by deriving it from other fields.
For example, a data structure consisting only of a single variable-length field does not require any length information, as the length can be derived from the total length of the byte sequence minus the lengths of all fixed-length fields. Also, in a data structure with multiple variable-length fields, whose lengths are known to be equal or can be derived from other fields, some length information can be theoretically omitted.
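As a plain-Java sketch (illustrative only; the names are hypothetical), omitting len(v1) under the postfix scheme, i.e. serializing v1 ‖ v2 ‖ len(v2), still yields distinct byte sequences for the two example records:

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustrative sketch: v1 ‖ v2 ‖ len(v2), where len(v1) is omitted because it can be
// derived from the total length minus len(v2) (fixed 4-byte values and lengths assumed).
class OmittedLength {
    static void putInt(ByteArrayOutputStream out, int v) {
        out.writeBytes(ByteBuffer.allocate(4).putInt(v).array());
    }

    static byte[] serialize(int[] v1, int[] v2) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int v : v1) putInt(out, v); // v1
        for (int v : v2) putInt(out, v); // v2
        putInt(out, v2.length);          // only len(v2) is appended
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] a = serialize(new int[] {5, 1, 4}, new int[] {6, 7});
        byte[] b = serialize(new int[] {5}, new int[] {3, 2, 6, 7});
        System.out.println(Arrays.equals(a, b)); // prints "false"
    }
}
```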
However, these performance optimizations are not recommended as a general practice as they introduce complexity and risk for errors, in particular, when dealing with nested and more complex data structures.
How to serialize the length information
When serializing the length information for variable-length fields, we recommend using fixed-size integers that are appropriate for the expected maximum length of the fields. For example, if you expect that the length of a string will not exceed 65,535 characters, you can use a 2-byte unsigned integer to represent the length. If the lengths can be larger, consider using 4-byte or 8-byte integers accordingly. In Java, the size of Arrays and Collections cannot exceed Integer.MAX_VALUE, so a 4-byte integer is usually sufficient there.
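Encoding a length as a fixed-size 2-byte unsigned integer can be sketched as follows (illustrative only; the names are hypothetical):

```java
import java.nio.ByteBuffer;

// Illustrative sketch: encoding a length as a fixed-size 2-byte unsigned integer,
// valid for lengths up to 65,535.
class FixedSizeLength {
    static byte[] encodeLength16(int length) {
        if (length < 0 || length > 0xFFFF) {
            throw new IllegalArgumentException("length out of range for 2 bytes: " + length);
        }
        return ByteBuffer.allocate(2).putShort((short) length).array();
    }

    public static void main(String[] args) {
        byte[] encoded = encodeLength16(65535);
        System.out.println(encoded.length); // prints "2"
    }
}
```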
Variable‑length integer encodings like VarInt are great for saving space during serialization, but they’re not recommended for hashing. They require branching logic to encode, which is slower than handling fixed‑size integers. Since modern hash functions can process each byte in just a few CPU cycles, the extra overhead from variable‑length encodings simply isn’t worth it.
Moreover, using variable-length encodings for the length information also introduces additional complexity and potential for errors. For example, VarInt is not compatible with the length-postfix strategy without reversing the byte order, because the byte sequence cannot be unambiguously deserialized when reading from the end to the beginning, which is the underlying idea behind length-postfixes.
Hash4j’s approach
Hash4j, the Dynatrace hash library for Java, provides a comprehensive set of hashing utilities, including support for hashing data structures with variable-length fields. Due to the advantages discussed above, Hash4j implements the length-postfix strategy for serializing variable-length fields. This allows hashing of collections, arrays, optionals, and other composed data structures with variable-length fields effectively while minimizing the risk of collisions. Let’s consider the following data structure to store information about a personal contact:
record Contact(
String firstName,
String lastName,
OptionalInt age,
byte[] utf8Email,
List<String> sortedSocialMediaHandles
) {}
In this example, Contact has five variable-length fields:
- firstName – a variable-length string representing the person’s first name.
- lastName – a variable-length string representing the person’s last name.
- age – an optional integer representing the person’s age if known. Optionals can be treated like lists whose length is restricted to zero or one, so a single byte is sufficient to represent the length information.
- utf8Email – a variable-length byte array containing the person’s email address encoded in UTF-8. (Arrays in Java records are discouraged, but this is just to demonstrate Hash4j’s capability of handling arrays.)
- sortedSocialMediaHandles – a variable-length list of strings representing the person’s social media handles. This field nests variable-length strings inside a variable-length list. The list is sorted to impose a deterministic order before hashing so that the same set of handles yields the same hash regardless of insertion order. Unordered collections like sets would require special handling to yield order-independent hash values, which is beyond the scope of this blog post.
With Hash4j, Contact instances can be hashed simply by adding a member method to the record that describes how the individual fields contribute to the hash computation:
void contribute(HashSink hashSink) {
hashSink
.putString(firstName)
.putString(lastName)
.putOptionalInt(age)
.putByteArray(utf8Email)
.putOrderedIterable(sortedSocialMediaHandles,
(handle, hs) -> hs.putString(handle));
}
The putString, putOptionalInt, putByteArray, and putOrderedIterable methods all handle the serialization of variable-length fields using the length-postfix strategy internally, ensuring that the hash computation minimizes the risk of collisions. Each method appends fixed-size length information after the field’s data to the hash sink. For strings, byte arrays, and iterables, the length is represented as a 4-byte integer; for optionals, as a single byte.
After defining the contribute method, the hash of a Contact instance c can then finally be computed as follows:
long hash = Hashing.komihash5_0().hashToLong(c, Contact::contribute);
Here komihash5_0() is only one of the high-quality hash functions provided by Hash4j. The library also offers other hash functions like XXH3 and MurmurHash3, all of which can be used interchangeably in the example above.
Effective serialization strategies for hashing variable-length fields
When hashing data structures with variable-length fields, it is crucial to use serialization strategies that preserve the boundaries between fields to minimize the risk of hash collisions. Applying the length‑postfix strategy the way Hash4j does gives you a robust approach to hashing variable‑length fields. It avoids the ambiguity problem with naive concatenation and is simpler and better suited for nested data structures than delimiter-based methods. Unlike prefixes, it does not require a second pass over fields whose length is not known in advance.