
index4j: Open-source FM-Index for fast queries on compressed logs

If you’ve ever tried to track down a weird bug or security issue in a mountain of log files, you know how important it is to be able to search for any string no matter how random or unexpected. Log search engines need to be flexible, letting you look for anything from error codes to user IDs, or even fragments of stack traces. But with modern systems generating massive amounts of logs every day, storing all that data quickly becomes a problem.

That’s where compression comes in. By squeezing down the size of your logs, you save on storage costs and make it easier to keep more history around. The trick is to compress your data without making searches painfully slow. Enter the FM-Index: a clever data structure that combines powerful compression with fast search.

With the FM-Index, you get a practical balance between compact storage and the freedom to search for anything you need. And if you are working in Java, we’ve got you covered with the open source index4j, created by the Real-Time Analytics lab of Dynatrace Research, an internal research group at Dynatrace where we research methods for real-time processing and analysis of massive data streams.


When should you use the FM-Index?

The FM-Index is a relatively complex data structure that shines in specific scenarios:

  • Your data is immutable or changes infrequently. The FM-Index is built for static datasets, so if your logs are constantly changing, you might want to look at other options.
  • You need to perform a lot of queries. If you only need to search your logs occasionally, the overhead of building the FM-Index might not be worth it.
  • Your queries are likely to be arbitrary substrings or they don’t fit a predefined pattern. The FM-Index does not care about the structure of your queries, making it ideal for searching through unstructured log data.
  • You care about storage efficiency: It might not be your top priority, but you still want to save space without sacrificing search speed.
  • You like the idea of combining compression and indexing into a single data structure, simplifying your architecture.

If any of the above sounds about right for your use case, then the FM-Index could be a strong fit.

In the next sections, we look at how to get started with the Dynatrace Open Source FM-Index library and how to apply it to your own log data!

What is the FM-Index?

So, what exactly is this FM-Index thing? In simple terms, the FM-Index is a supercharged search tool that lets you look for any substring inside your data, even if you never split your logs into words or tokens. It’s built on some clever computer science tricks: suffix arrays, bit vectors, wavelet trees, and the Burrows–Wheeler transform (BWT). Don’t worry if you are not familiar with them, the important part is that they work together to make searching and compressing possible at the same time.
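
To give a flavor of one of those building blocks, here is a tiny, deliberately naive Burrows–Wheeler transform: sort all rotations of the text and take the last column. This is purely illustrative (real FM-Index implementations, including index4j's, derive the BWT from a suffix array rather than sorting rotations):

```java
import java.util.Arrays;

public class NaiveBwt {
    // Naive Burrows–Wheeler transform: sort all rotations of the input,
    // then read off the last character of each sorted rotation.
    public static String bwt(String s) {
        int n = s.length();
        String[] rotations = new String[n];
        for (int i = 0; i < n; i++) {
            rotations[i] = s.substring(i) + s.substring(0, i);
        }
        Arrays.sort(rotations);
        StringBuilder lastColumn = new StringBuilder();
        for (String rotation : rotations) {
            lastColumn.append(rotation.charAt(rotation.length() - 1));
        }
        return lastColumn.toString();
    }

    public static void main(String[] args) {
        // '$' is the end-of-text sentinel, lexicographically smaller than any letter.
        System.out.println(bwt("banana$")); // prints "annb$aa"
    }
}
```

The transformed string groups similar characters together (note the runs in "annb$aa"), which is exactly what makes it compress well while remaining searchable.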

With the FM-Index, you can:

  • Count how many times a pattern appears in your logs, even if it’s just a random string of characters.
  • Jump straight to the position of each match, so you can see exactly where something happened.
  • Retrieve (or decompress) the matching lines or the context around them, all without having to decompress the entire log file upfront.

What’s really cool is that the FM-Index does all this while keeping storage and search times relatively low in practice, even as your logs grow huge. The magic is in its sublinear complexity: as your data gets bigger, the time and space needed for searches don’t grow nearly as fast. That’s why the FM-Index is a favorite in fields like bioinformatics, where people need to search through massive DNA sequences. But it’s just as handy for anyone dealing with mountains of log data and looking for fast, flexible search without the storage bloat.

And the coolest part? You can trade off between how much space you use and how fast your searches are by adjusting only a single parameter when building the index. A higher sample rate means better compression but slower queries, while a lower sample rate gives you faster searches at the cost of more storage space. You can choose what works best for your system!

Let’s get into it

Imagine you have collected ~1.5 million logs from an Android framework. Here’s a sample data set you can use to follow along (around 180 MB of raw text).

Let’s say you suspect one app is doing some questionable activity. Notice that:

  1. The logs are static
  2. You want to keep them around for a long time (maybe you want to do some forensics later)
  3. You need to run many queries to find the root cause

The above constraints make these logs a strong candidate for an FM-Index. By using index4j, you can build an FM-Index over these logs with just a few lines of Java code. The index4j library is available in the Maven Central repository, so you can add it to your Java project as follows.

For maven projects, use:

<dependency> 
    <groupId>com.dynatrace.index4j</groupId> 
    <artifactId>indices</artifactId> 
    <version>0.3.1</version> 
</dependency>

For gradle projects, use:

implementation group: 'com.dynatrace.index4j', name: 'indices', version: '0.3.1'

Then, building your index is as simple as:

char[] text = Files.readString(Path.of("Android.log"), StandardCharsets.UTF_8) 
        .toCharArray(); // catch the exception as needed 
FmIndex fmi = new FmIndexBuilder() 
        .setSampleRate(64) 
        .setEnableExtraction(true) 
        .build(text); 

This will build an index in about one minute in our test environment while compressing the data from ~180 MB down to roughly 70 MB, a reduction of more than 60%! And recall that this figure already includes the actual index: no additional storage is needed for searching.

Now that you have your index, searching is a breeze. Let’s say you want to find out whether your logs contain the string Permission Denial:, which indicates that an app attempted to access resources for which it does not have the required permissions and may point to suspicious behavior. Since we only care about whether this string exists and how many times it appears, we can use the count method:

char[] pattern = "Permission Denial:".toCharArray();
int counts = fmi.count(pattern);

This query will return the number of occurrences of the pattern in your logs, which should be 283 for the Android.log. Running it on a desktop processor (Intel Xeon W-10885M with the clock fixed at 2.40GHz) takes:

  • If cold (no caching, no optimization done by the compiler yet), around 200 microseconds (or 0.2 milliseconds and ~5,000 queries per second)
  • If warm (with caching and optimizations), around 30 microseconds (or 0.03 milliseconds and ~33,000 queries per second)

These results are based on a single benchmark setup; actual performance may vary depending on hardware, dataset, and workload. For comparison, answering the same query naively requires a brute-force scan of the entire log file, which takes around 100 milliseconds (~10 queries per second). With index4j, that represents a substantial performance improvement over the naïve scan in this benchmark.
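
If you want to reproduce rough timings on your own hardware, a minimal harness along these lines is enough for ballpark numbers. The workload below is a stand-in so the sketch runs on its own; in your project you would pass `() -> fmi.count(pattern)` instead. For rigorous measurements, a dedicated framework such as JMH is the better tool:

```java
import java.util.function.IntSupplier;

public class QueryTimer {
    // Runs the query a few times to warm up the JIT, then returns
    // the average latency over 'runs' executions, in microseconds.
    public static double averageMicros(IntSupplier query, int warmup, int runs) {
        int sink = 0;
        for (int i = 0; i < warmup; i++) {
            sink += query.getAsInt();
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            sink += query.getAsInt(); // accumulate so the JIT cannot eliminate the call
        }
        long elapsedNanos = System.nanoTime() - start;
        if (sink == Integer.MIN_VALUE) {
            System.out.println(sink); // keep 'sink' observably live
        }
        return elapsedNanos / 1_000.0 / runs;
    }

    public static void main(String[] args) {
        // Stand-in workload; replace with () -> fmi.count(pattern) in your project.
        double micros = averageMicros(() -> "Permission Denial:".length(), 1_000, 10_000);
        System.out.println("average latency: " + micros + " µs");
    }
}
```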

You are probably wondering now: “What if I want to see the actual log lines that contain this pattern?” Counting is useful, but most likely you want to see the context around the matches. For that you can use the extract methods that index4j provides. Note that for the index there is no notion of “lines” anymore. The FM-Index just sees a big array of characters that is shuffled in some obscure way (that’s also where the compression comes from). However, since we enabled extraction when building the index (recall the line .setEnableExtraction(true)), index4j can still retrieve the original log lines. There are two steps to do this:

First, we need to find the positions of all occurrences of the pattern. We can do this using the locate method:

int[] locations = new int[counts]; 
int found = fmi.locate(pattern, 0, pattern.length, locations, counts); // find all matches 

Where counts is the number of matches we obtained in the previous count query. Note that we can also restrict how many positions we want to find by changing the last parameter of locate.

Second, we can extract the actual log lines, either in their entirety (using extractUntilBoundary) or partially using one of its variants. In this case, we want to see what lies to the right of each match (i.e., the context after the pattern), so we can use extractUntilBoundaryRight:

char[] destination = new char[1024]; // maximum length per extraction 
for (int i = 0; i < found; i++) { 
    // extract from each previously located position until the end of the line 
    int length = fmi.extractUntilBoundaryRight(locations[i] - 1, destination, 0, '\n');  
    System.out.println(new String(destination, 0, length)); // print the extracted string 
} 

This will print each match and the remainder of the log line for all of the 283 lines containing Permission Denial:. For example:

Permission Denial: reading com.android.providers.media.MediaProvider uri content://media/external/audio/media from pid=12127, uid=10146 requires android.permission.READ_EXTERNAL_STORAGE, or grantUriPermission() 
 
Permission Denial: receiving Intent { act=android.intent.action.PHONE_STATE flg=0x10 (has extras) } to ProcessRecord{9ef8336 4492:com.tencent.mm/u0a105} (pid=4492, uid=10105) requires android.permission.READ_PRIVILEGED_PHONE_STATE due to sender android (uid 1000) 
 
Permission Denial: broadcasting Intent { act=android.media.SCO_AUDIO_STATE_CHANGED flg=0x10000010 (has extras) } from null (pid=-1, uid=-1) requires com.huawei.permission due to registered receiver BroadcastFilter{14b6ff9 u0 ReceiverList{c9f82e3 633 com.huawei.espacev2/10102/u0 remote:497ab12}} 
 
... 

Note that we can also stop extracting at a certain character (for example, a space) if we know there is some structure to the log line itself. The runtime to extract depends only on the number of characters and is roughly 10 microseconds per character, regardless of how arbitrary the pattern may be or how frequently it occurs. That means extracting 100 log lines of average length 100 characters takes around 100 * 100 * 10 microseconds = 100 milliseconds. And all of this while your data and index take only ~30% to ~40% of the original log size.
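
The back-of-the-envelope estimate above, using exactly the numbers from the text, can be written out as a few lines of plain arithmetic:

```java
public class ExtractionEstimate {
    public static void main(String[] args) {
        int lines = 100;              // matching log lines to extract
        int avgChars = 100;           // average characters per line
        double microsPerChar = 10.0;  // rough per-character extraction cost from the text

        double totalMicros = lines * avgChars * microsPerChar;
        System.out.println(totalMicros / 1_000.0 + " ms"); // prints "100.0 ms"
    }
}
```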

How do I store my index?

Imagine you have already run a few queries and found some interesting patterns in your logs, but you wish to save the index to disk so you can load it later without having to rebuild it from scratch. That is easy because the whole index is serializable. You can simply write it to a byte array like this:

// Save 'serialized' to disk as needed 
byte[] serialized = Serialization.writeToByteArray(FmIndex::write, fmi); 

And read it back later like this:

// read from disk into 'serialized' byte array as needed 
FmIndex deserialized = Serialization.readFromByteArray(FmIndex::read, serialized); 

And that’s it! You can now save and load your FM-Index as needed, making it easy to persist your work and pick up where you left off.
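
Because Serialization hands you a plain byte array, persisting the index is ordinary file I/O. The sketch below uses stand-in bytes so it runs on its own; in your project, the array would come from Serialization.writeToByteArray(FmIndex::write, fmi), and the file name is a hypothetical choice:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class IndexPersistence {
    public static void main(String[] args) throws IOException {
        // Stand-in bytes; in your project this would be
        // Serialization.writeToByteArray(FmIndex::write, fmi).
        byte[] serialized = {1, 2, 3, 4};

        Path file = Files.createTempFile("android-log", ".fmi"); // hypothetical file name
        Files.write(file, serialized);

        byte[] restored = Files.readAllBytes(file);
        // In your project: FmIndex fmi = Serialization.readFromByteArray(FmIndex::read, restored);
        System.out.println("round-trip intact: " + Arrays.equals(serialized, restored));
        Files.delete(file);
    }
}
```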

Conclusion

As we saw, the FM-Index is a powerful tool for anyone dealing with large amounts of log data. It combines competitive compression with fast, arbitrary search capabilities (typically in microseconds), making it ideal for scenarios where logs are static but queries vary. While still a relatively niche data structure, state-of-the-art research articles on log search engines have already demonstrated the capabilities of the FM-Index.

By using the index4j library, you can easily build an FM-Index over your logs in your customized log processing pipeline. Whether you’re hunting for specific error messages or performing recurrent queries, the FM-Index provides a flexible and efficient solution that can be tailored to your use case. If you are interested in other cool projects that we work on at Dynatrace Research, make sure to check out our research homepage and our open source GitHub repository!