Implementing an LSM-tree in C

March 15, 2024

Introduction

I first encountered LSM-trees (Log-Structured Merge-trees) while reading through Designing Data-Intensive Applications. The book described a rather simple approach to a key-value storage engine that utilized SSTables for disk storage and an in-memory tree structure to accompany it. This approach seemed like a fun project to attempt, but I wanted to do it in a language that I have been avoiding since graduating... C.

Please note that this is not a guide on how to implement an LSM-Tree. Rather, it's simply an overview of my implementation of one.

What is an LSM-Tree?

For a better overview of LSM-trees, I strongly recommend chapter 3 of Designing Data-Intensive Applications. However, I'll provide a brief overview here. LSM-trees are a storage system made of up various data-structures. There is an in-memory component, called a memtable, that stores a limited amount of key-value pairs, typically in a tree structure. When the memtable reaches a certain size (~4MB), its contents are written to the disk in the form of SSTables. SSTables (Sorted Strings Tables) are immutable files that store key value pairs in sorted order.

When reading from an LSM-tree, the search begins in the memtable. If the key is not found there, the search progresses to the SSTables, starting with the most recent SSTable and moving backwards chronologically.

Full LSM-trees implementations usually incorporate write-ahead logging to prevent losing memtable data on crashes, and use bloom filters to expedite SSTable read times. To keep my project manageable within a limited timeframe, I did not implement the write-ahead log or the bloom filter.

The Memtable

As described above, the memtable is used to store key value pairs in memory until a certain capacity threshold is met. Since it is the initial point of interaction with the storage system, it was the natural starting point for my implementation.

I've often seen these memtables described as self-balancing trees (red-black trees and AVL trees), but I have also come across examples of memtables being HashSkipLists. In my project, I opted for a tree-based approach, specifically a Binary Search Tree (BST). While a self-balancing tree would be a more ideal data-structure for this project, a simple BST allows for a far simpler and quicker memtree implementation.

BST Implementation

1// memtable.h
2typedef struct Node {
3  char *key;
4  char *value;
5  struct Node *left;
6  struct Node *right;
7} Node;

My Binary Search Tree implementation, while straightforward, required several specific adaptations to suit the LSM-tree.

Firstly, an LSM-tree typically utilizes a singular memtable, which, in this case, is a BST. By defining extern Node *memtableRoot; as a global pointer and ensuring that all related functions interact with the memtable through this pointer, we have fulfilled this requirement.

Another crucial requirement is a way to track memtable memory usage. This is necessary for determining when to dump the memtable to an SSTable, but it also aids in detecting memory leaks within the BST. We need another global variable extern int globalMemoryUsage; that will track memory usage in bytes.

1// memtable.h
2#define NODE_MEMORY_USAGE(key, value) (sizeof(Node) + strlen(key) + 1 + strlen(value) + 1)
3
4#define INCREASE_MEMORY_USAGE(key, value) (globalMemoryUsage += NODE_MEMORY_USAGE(key, value))
5
6#define DECREASE_MEMORY_USAGE(key, value) (globalMemoryUsage -= NODE_MEMORY_USAGE(key, value))

Memory tracking can be managed by using two macros. By using INCREASE_MEMORY_USAGE when creating a node, and DECREASE_MEMORY_USAGE when removing a node, an estimate of the memtable's memory is generated.

Writing and Reading SSTables

With our memtable in place, we now need to write it to an SSTable when exceeds the memory threshold. The SSTable entries in this project will be key value pairs separated by a space, like entry_key entry_value. Each entry will be separated by a line break.

Writing SSTables

Writing our SSTable starts with creating and naming the SSTable file. We will need a way to read our SSTable files in order of newest to oldest, and including a timestamp in the filename is a simple way to handle this. When reading, we can sort the filenames and look through them in sorted order.

1// Example filenames
2// Note that time(), which returns in seconds, is not precise enough
3sstable_1705789865408671000.dat
4sstable_1705789865433449000.dat
5sstable_1705789865450905000.dat
6sstable_1705789865465489000.dat

With our file named and created, we can begin writing to it. The process of converting our memtable into an SSTable is relatively simple. We need to perform an in-order transversal of the tree, writing each node as an entry in our SStable file.

1// sstable.c
2static void serializeMemtableToFile(Node *root, FILE *file) {
3  if (root == NULL) {
4    return;
5  }
6
7  serializeMemtableToFile(root->left, file);
8  fprintf(file, "%s %s\n", root->key, root->value);
9  serializeMemtableToFile(root->right, file);
10}

Reading SSTables

When reading, we first need to load all SSTable filenames in memory and sort them in descending chronological order. We then go through the files individually, reading each line until we find our key.

Deleting LSM-Tree Keys

Having implemented the memtable and SSTable, the LSM-tree now supports basic read and write operations. With timestamped SSTable files, it is also possible to update entries. The next step is to add a way to delete entries that have been written to an SSTable.

The deletion process first begins with an attempt to delete the key from the memtable. If it is not found there, we will then assume that the key is in the SSTable.

1// sstable.c
2void delete(char *key) {
3  // Try to delete from the memtable
4  if (!deleteMemtableKey(key)) {
5    // Key was not in the memtable, create a tombstone
6    writeTombstone(key);
7  } else {
8    printf("Key deleted from memtable: %s\n", key);
9  }
10}

To delete keys from the SSTables, I utilized tombstones, which are used to flag keys for deletion during the compaction process. Tombstones can live within the SSTable entries themselves, but I decided to use a dedicated tombstone file primarily for simplicity. A dedicated tombstone file allows the SSTables entries to maintain a key value format, and also provides a dedicated tombstone lookup location, which is helpful in reads and compaction. (Note: I address this implementation choice in the Shortcomings of my Implementation section)

During the read process, the system checks the tombstone file first and checks if the key being read is found there. If it is, the key is treated as deleted and the read operation can be cancelled. The actual removal of the tombstoned entries is done during the compaction process.

Compacting the LSM-Tree

When a large number of keys have been deleted, and the tombstone file grows, the read performance will begin to suffer as the tombstone file is searched on every read. The compaction process aims to apply the tombstones to all SSTable files, and also merge those files if they become too small.

The initial phase of the compaction process is to apply tombstones to all SSTables. For this, the tombstones are loaded into memory as an array. Each SSTable entry is then checked against this array to determine if it matches a tombstone. This approach is discussed more in Shortcomings and Potential Improvements.

1// sstable.c - compactSSTables()
2// Loop through all files
3while ((entry = readdir(dir)) != NULL) {
4  if (entry->d_type == DT_REG) {
5    snprintf(filepath, sizeof(filepath), "%s/%s", DIR_NAME, entry->d_name);
6
7    // Apply tombstones to the SSTable file
8    applyTombstonesToFile(filepath, &tombstones);
9      
10    // If the file is below the threshold after applying tombstones, add it to smallFilesList
11    if (isFileBelowThreshold(filepath)) {
12      addToList(&smallFilesList, filepath);
13    }
14  }
15}

While going through all files and applying tombstones, the size of these files can also be checked. If these files are below a certain size threshold SMALL_FILE_THRESHOLD after tombstones have been applied, we can flag them for merging. These small SSTables will be merged until the new file reaches a size limit UPPER_MERGE_THRESHOLD.

Shortcomings of my Implementation

Having discussed the major components of my implementation, I want to mention some things that I would approach differently if I were to implement this again.

Design Choices

In my opinion, the most obvious shortcoming of this implementation is how I handle tombstones. While I don't entirely oppose the idea of keeping them separate from the SSTables, there are still problems with the current approach. Instead of storing the separated tombstone list on the disk, it would likely be worth it to store them in memory as a tree or hash table. This approach would at least eliminate the need for a full scan of the tombstone file during each SSTable read. The tombstone tree could then have its own memory tracker that would trigger the compaction process when it reaches a certain size.

There are also some inefficiencies caused by my choice of data structures in certain places. During compaction, loading the tombstone file's contents into a hash table instead of an array would change lookup time from an average case of O(n) to an average case of O(1). In addition to that, making the memtable a self-balancing tree would also have a notable performance improvement, particularly in maintaining an O(log n) time complexity for both insertions and lookups. These design decisions were conscious trade-offs, prioritizing development time.

Missing Features and Functionalities

Finally, my limited LSM-Tree implementation, while functional, does lack some advanced features commonly found in more complete implementations. These features, if implemented, could significantly enhance the stability and efficiency of the system.

Key missing features include:

Duplicate Handling - Duplicates are not identified during compaction, meaning old entries can exist in limbo.
Concurrency - In advanced LSM-Tree systems, compaction will often run concurrently with other operations like reads, writes, and flushes, enabling the system to maintain high throughput and efficiency.
Write-Ahead Log - In order to save memtable data in case of a crash, a write-ahead log is used to track memtable insertion, updates, and deletions. The memtable can then be reconstructed from these logs on restart.

Source

You can view the source code here. Note that this implementation will not work on Windows.

Kellen

Parker