Building a Database in Rust — Part I
I’ve always been curious about the inner workings of a database. Recently, I stumbled upon the CMU Database course, which uses Bustub — a database system written in C++ — for hands-on learning. Inspired by this, I’ve started to reimplement Bustub in Rust, completing the course’s assignments along the way. Through this blog series, I hope to share my journey and insights. Posts will switch between database internals, Rust code, and explanations. The work-in-progress code can be found at abustus-rs.
The Anatomy of a Database System
A disk-based relational database system can be divided into the following layers (bottom up):
- Disk Manager
- Buffer Pool Manager
- Access Method
- Operator Execution
- Query Planning
We will build the database bottom up, outlining the role of each layer and looking at which Rust features can be used along the way. This post covers the concepts of the Disk Scheduler, LRU-K cache eviction, and the Buffer Pool Manager (BPM).
Disk Scheduler
Modern databases use diverse storage strategies. Some, like Redis, are entirely in-memory; others use disk storage; and others, like ClickHouse, can use cloud-based storage. For this series, we’ll stick to disk-based storage.
The lowest layer of the database is responsible for reading and writing data from persistent storage. Access to this layer is page aware (more on pages later), but it is not aware of what is stored in these pages.
We can start with the following definitions for read and write:
type PageId = usize;
trait PageAccessor {
    // Write data_buf to the storage location of page_id.
    fn write_page(&mut self, page_id: PageId, data_buf: Box<[u8; PAGE_SIZE]>) -> io::Result<()>;
    // Read from the storage location of page_id into the data buffer.
    fn read_page(&mut self, page_id: PageId, data_buf: Box<[u8; PAGE_SIZE]>) -> io::Result<()>;
}
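To make this concrete, here is a minimal file-backed sketch of the trait. Note two deliberate deviations for simplicity: the buffers are passed as references rather than Box, and `FilePageAccessor` is a hypothetical name of mine, storing each page at offset `page_id * PAGE_SIZE`.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Read, Seek, SeekFrom, Write};

const PAGE_SIZE: usize = 4096;
type PageId = usize;

trait PageAccessor {
    fn write_page(&mut self, page_id: PageId, data_buf: &[u8; PAGE_SIZE]) -> io::Result<()>;
    fn read_page(&mut self, page_id: PageId, data_buf: &mut [u8; PAGE_SIZE]) -> io::Result<()>;
}

struct FilePageAccessor {
    file: File,
}

impl FilePageAccessor {
    fn open(path: &str) -> io::Result<Self> {
        let file = OpenOptions::new().read(true).write(true).create(true).open(path)?;
        Ok(Self { file })
    }
}

impl PageAccessor for FilePageAccessor {
    // Each page lives at a fixed offset in the file: page_id * PAGE_SIZE.
    fn write_page(&mut self, page_id: PageId, data_buf: &[u8; PAGE_SIZE]) -> io::Result<()> {
        self.file.seek(SeekFrom::Start((page_id * PAGE_SIZE) as u64))?;
        self.file.write_all(data_buf)
    }

    fn read_page(&mut self, page_id: PageId, data_buf: &mut [u8; PAGE_SIZE]) -> io::Result<()> {
        self.file.seek(SeekFrom::Start((page_id * PAGE_SIZE) as u64))?;
        self.file.read_exact(data_buf)
    }
}
```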
The functionality of these methods is clear from their names. However, let’s talk about the Rust side of things. We use a trait, which is akin to an interface in other languages. This provides the opportunity to have multiple implementations for a set of functionality: say, a database where data lives in memory, on disk, or in cloud storage.
We also see Result, which represents that a given method can succeed or fail. Rust uses algebraic data types, aka enums, to represent the different possible states of a value. Result has two variants, Ok and Err: the result is Ok if the operation succeeds and Err if a failure occurs.
There is usually a thin layer on top, called the DiskScheduler, which is responsible for invoking the read and write calls above. This is the layer the Buffer Pool Manager interacts with. In Bustub it is implemented as a background thread that uses channels to communicate task completion. Callers can use it in a blocking or non-blocking manner.
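As a sketch of that design, here is how a channel-based scheduler might look with std’s mpsc channels. The worker below is a stub that returns zeroed pages instead of calling the disk layer, and `DiskRequest` / `read_blocking` are illustrative names of mine, not Bustub’s actual API.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;

const PAGE_SIZE: usize = 4096;
type PageId = usize;

// A scheduled request: which page, an optional buffer, and a channel
// that the worker uses to signal completion back to the caller.
enum DiskRequest {
    Read { page_id: PageId, done: Sender<Box<[u8; PAGE_SIZE]>> },
    Write { page_id: PageId, data: Box<[u8; PAGE_SIZE]>, done: Sender<()> },
}

struct DiskScheduler {
    queue: Sender<DiskRequest>,
    _worker: thread::JoinHandle<()>,
}

impl DiskScheduler {
    fn new() -> Self {
        let (tx, rx) = mpsc::channel::<DiskRequest>();
        // Background worker: drains the queue until all senders are dropped.
        let worker = thread::spawn(move || {
            for req in rx {
                match req {
                    DiskRequest::Read { page_id, done } => {
                        // A real implementation would call the disk layer here.
                        let _ = page_id;
                        let _ = done.send(Box::new([0u8; PAGE_SIZE]));
                    }
                    DiskRequest::Write { done, .. } => {
                        // A real implementation would persist `data` here.
                        let _ = done.send(());
                    }
                }
            }
        });
        Self { queue: tx, _worker: worker }
    }

    // Blocking use: enqueue the request, then wait on the completion channel.
    // A non-blocking caller would keep the receiver and poll it later.
    fn read_blocking(&self, page_id: PageId) -> Box<[u8; PAGE_SIZE]> {
        let (done_tx, done_rx) = mpsc::channel();
        self.queue.send(DiskRequest::Read { page_id, done: done_tx }).unwrap();
        done_rx.recv().unwrap()
    }
}
```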
The Role of the Operating System and Database Memory Management
A database system runs on top of a general-purpose operating system (OS). However, the generality of the OS often clashes with the specific needs of databases, especially in performance-critical areas. For example, operating systems provide abstractions like virtual memory, which create the illusion of unlimited memory.
While useful, relying on OS-level virtual memory management would mean relinquishing control over critical aspects like eviction policies, prioritizing pages, and minimizing cache misses. To address these challenges, databases implement their own memory management through a component called the Buffer Pool Manager (BPM).
The BPM handles the task of managing memory within the database system, ensuring data from disk is efficiently cached in memory. As with any limited resource, caching introduces the need for eviction. Among various cache eviction policies, the CMU course emphasizes LRU-K. This post will delve into how this policy is implemented within a database system.
Frames and Pages: Understanding Eviction
Before diving into eviction policies, let’s clarify two key terms: Frames and Pages.
A Page is a fixed-size region of disk storage where the database stores its data for long-term persistence. Choosing the right page size is part science, part art; for this project, we’ll pick a size of 4 KB. The Buffer Pool Manager (BPM), which handles data in memory, also divides memory into fixed-size regions called Frames. When a page is loaded into memory, the BPM assigns it to an available frame.
If there are no free frames available in the free pool, the BPM needs to evict a page that is mapped to a frame to make room for a new one. This is where eviction algorithms come into play. Common algorithms include:
- First-In-First-Out (FIFO): Evicts the oldest frame first.
- Most-Recently-Used (MRU): Evicts the most recently accessed frame.
- Least-Recently-Used (LRU): Evicts the least recently accessed frame.
While LRU is generally a good choice, it has a notable drawback: it focuses solely on recent access patterns without considering the broader history. For instance, during a sequential scan, the database might touch many pages that it won’t revisit anytime soon. In contrast, an index page, which hasn’t been accessed recently, is much more likely to be reused.
To address this limitation, LRU-K improves upon LRU by keeping track of the last K accesses for each entry. If an entry doesn’t yet have K recorded accesses, the missing entries are treated as infinitely old. This makes such entries more likely to be evicted, ensuring that more critical pages remain in memory.
The main operations can be defined as:
pub type FrameId = usize;
/// Records that the frame was accessed.
pub fn record_access(&self, frame_id: FrameId);
/// If the cache holds the frame, sets its evictable flag as requested.
pub fn set_evictable(&self, frame_id: FrameId, is_evictable: bool);
/// Returns a frame that was evicted from the cache.
/// If no frame can be evicted, None is returned.
pub fn evict(&self) -> Option<FrameId>;
How not to make the billion-dollar blunder
Let’s look at the evict() method in the LRU-K cache, which returns an Option type. Option is Rust’s way of indicating that a method might or might not return a value. In this case, it signifies that the method might not find a frame_id to evict, returning None in such scenarios. In many languages a Null value is returned instead, which results in runtime errors and is referred to as the billion-dollar blunder.
Interior Mutability in Rust
All the methods in the LRU-K cache operate on a shared reference (&self), yet they modify the internal state of the cache. This behavior is a hallmark of interior mutability, a design pattern in Rust that allows changes to be made through a shared reference.
In concurrent Rust, interior mutability is typically achieved using types like Arc (Atomic Reference Counting) combined with synchronization primitives such as Mutex. This ensures that the shared state remains thread-safe while allowing modifications.
In the context of the LRU-K cache, the underlying data structure might look something like this:
type History = Vec<Timestamp>;
// For my implementation, I used elapsed time since cache initialization.
cache: Arc<Mutex<HashMap<FrameId, History>>>
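To tie the pieces together, here is a minimal single-lock sketch of the three operations on top of that data structure. Assumptions of mine, for illustration only: K = 2, a logical clock instead of elapsed time, and an `Entry` struct that bundles the access history with the evictable flag (`LruKCache` and `Entry` are names I made up).

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

pub type FrameId = usize;
type Timestamp = u64; // logical clock tick; a real impl would use elapsed time

const K: usize = 2;

struct Entry {
    history: Vec<Timestamp>, // up to K most recent accesses, oldest first
    evictable: bool,
}

pub struct LruKCache {
    entries: Arc<Mutex<HashMap<FrameId, Entry>>>,
    clock: Mutex<Timestamp>,
}

impl LruKCache {
    pub fn new() -> Self {
        Self { entries: Arc::new(Mutex::new(HashMap::new())), clock: Mutex::new(0) }
    }

    fn tick(&self) -> Timestamp {
        let mut t = self.clock.lock().unwrap();
        *t += 1;
        *t
    }

    // Interior mutability in action: &self, yet the state changes.
    pub fn record_access(&self, frame_id: FrameId) {
        let now = self.tick();
        let mut map = self.entries.lock().unwrap();
        let entry = map.entry(frame_id).or_insert(Entry { history: Vec::new(), evictable: false });
        entry.history.push(now);
        if entry.history.len() > K {
            entry.history.remove(0); // keep only the last K accesses
        }
    }

    pub fn set_evictable(&self, frame_id: FrameId, is_evictable: bool) {
        if let Some(e) = self.entries.lock().unwrap().get_mut(&frame_id) {
            e.evictable = is_evictable;
        }
    }

    /// Evicts the evictable frame whose K-th most recent access is oldest.
    /// Frames with fewer than K accesses count as infinitely old and are
    /// evicted first (ties broken by earliest recorded access).
    pub fn evict(&self) -> Option<FrameId> {
        let mut map = self.entries.lock().unwrap();
        let victim = map
            .iter()
            .filter(|(_, e)| e.evictable)
            .min_by_key(|(_, e)| {
                if e.history.len() < K {
                    (0, e.history.first().copied().unwrap_or(0))
                } else {
                    (1, e.history[0])
                }
            })
            .map(|(&id, _)| id)?;
        map.remove(&victim);
        Some(victim)
    }
}
```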
Buffer Pool Manager
The task of the Buffer Pool Manager (BPM) is straightforward: maintain a pool of memory frames ready to hold pages from disk. When a database operation (e.g., a query or transaction) brings a page into a frame, it either reads and/or writes that frame. For read operations, since they don’t modify the data, multiple operations can safely access the frame concurrently. However, for write operations, the BPM must ensure exclusivity to avoid conflicts or data inconsistencies.
There are many ways to achieve shared vs. exclusive access in Rust. I thought of many, but tried the following four.
Try 1
Return a reference type, aka & — the issue is that once you have this, lifetimes come into play, which leads to some tough battles when concurrency is involved.
The battle becomes even more challenging for the mutable reference type, aka &mut.
const PAGE_SIZE: usize = 4096;
struct FrameData {
    buffer: Box<[u8; PAGE_SIZE]>,
}
frame_holder: Arc<Mutex<Vec<FrameData>>>;
fn read_page(&self, page_id: PageId) -> &[u8; PAGE_SIZE];
fn write_page(&self, page_id: PageId) -> &mut [u8; PAGE_SIZE];
// For both methods, what is the lifetime of the returned buffer?
// In a concurrent environment, the value needs to cross thread boundaries
// and thus must have a 'static lifetime. 'static is Rust's way of saying
// the memory must exist for as long as the program needs it. This is how
// Rust avoids the issue of referring to a dropped memory region.
Try 2
In the original CMU database, written in C++, a shared_mutex is used. Following best practices of resource allocation, an RAII guard is used, which relinquishes the lock once the variable goes out of scope.
In Rust this is not quite possible in a concurrent scenario. For a value to move between threads, it must be a Send type. But the guards returned by the read() and write() methods of RwLock are marked !Send, aka not Send. The reason is that the thread that locks a RwLock must be the thread that relinquishes the lock.
frame_holder: Arc<Mutex<Vec<RwLock<FrameData>>>>;
fn read_page(&self, page_id: PageId) -> RwLockReadGuard<'_, FrameData>;
fn write_page(&self, page_id: PageId) -> RwLockWriteGuard<'_, FrameData>;
// This would have worked, but both RwLockReadGuard and RwLockWriteGuard
// are !Send and thus cannot be passed across threads.
// '_ is the anonymous lifetime in Rust.
Try 3
Why not return the RwLock itself? This would work, but to me it is semantically wrong. When I get a response from write_page, I expect to have exclusive access for as long as I want. Returning the RwLock would require the caller to invoke read() or write(), which locks the frame. This may or may not be terribly bad, but let’s try something different that does not require the caller to lock for access.
Try 4
Wouldn’t it be nice if the frame_holder lost its content once a database activity acquired exclusive access? For a frame to be exclusive, even the frame_holder should not hold it. This suggests that an index in frame_holder may or may not hold FrameData. This presence (Some) and absence (None) of a value is represented by Option in Rust. So we can store frames in the frame_holder like this:
frame_holder: Arc<Mutex<Vec<Option<Arc<FrameData>>>>>
/*
Let's break down the layers in the type.
Arc - so that frame_holder can be passed across threads, and shared if needed.
      It makes frame_holder Send, but it needs to be Sync to be truly shared.
Mutex - Adds synchronization, enabling safe concurrent access.
Vec - Stores a contiguous block of memory for managing frames.
Option - Indicates that a frame can be either Some(FrameData) or None,
      useful for exclusive access scenarios.
Arc - Again, for multiple reading DB activities, we need the type to be Send.
      As the shared data is non-mutating, a Mutex is not needed.
FrameData - The actual data holding the frame buffer and other helper values.
      Some useful ones are:
      frame_id: FrameId - which frame it represents.
      page_id: PageId - which page it is assigned to. Helps during eviction.
      pin_count: AtomicU32 - how many DB activities are looking at this frame.
            To be evicted, this count must be zero.
      is_dirty: AtomicBool - whether the frame has been updated. If so, it must
            be persisted to disk.
*/
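The key move this type enables is Option::take: pulling the value out of the slot leaves None behind, which is exactly the "frame_holder loses its content" behavior described above. A sketch, where `try_take_exclusive` is a helper name of mine rather than part of the actual API, and `FrameData` is reduced to just its buffer:

```rust
use std::sync::{Arc, Mutex};

const PAGE_SIZE: usize = 4096;
type FrameId = usize;

struct FrameData {
    buffer: Box<[u8; PAGE_SIZE]>,
}

type FrameHolder = Arc<Mutex<Vec<Option<Arc<FrameData>>>>>;

/// Tries to take exclusive ownership of a frame. `Option::take` moves the
/// value out and leaves `None` behind, marking the slot as exclusively held.
fn try_take_exclusive(holder: &FrameHolder, frame_id: FrameId) -> Option<FrameData> {
    let mut frames = holder.lock().unwrap();
    let slot = frames.get_mut(frame_id)?;
    let arc = slot.take()?; // the slot is now None
    match Arc::try_unwrap(arc) {
        // Strong count was 1: no readers left, we own the frame outright.
        Ok(frame) => Some(frame),
        // Readers still hold clones: put the Arc back and report failure.
        Err(arc) => {
            *slot = Some(arc);
            None
        }
    }
}
```

Arc::try_unwrap doubles as the strong_count check mentioned earlier: it only succeeds when no reader holds a clone.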
After a successful return of the write operation, frame_holder at the selected frame_id will hold a None value, thus marking exclusive access.
We can then define our read/write operations as below. Note we still use the RAII pattern and return a guard. This allows us to restore frame_holder with Some(Arc<FrameData>) in the drop of WriteGuard, relinquishing exclusive access. Similarly, during the drop of ReadGuard, we can decrement the page reference counter; though in our case, Arc::strong_count can be used for the same purpose.
struct ReadGuard(Arc<Box<[u8; PAGE_SIZE]>>);
fn read_page(&self, page_id: PageId) -> Option<ReadGuard>;
// Returning an Arc type suggests that multiple activities in the DB can hold
// a shared copy of the data for the same page_id.
// Box is Rust's way of representing data on the heap.
struct WriteGuard(Box<[u8; PAGE_SIZE]>);
fn write_page(&self, page_id: PageId) -> Option<WriteGuard>;
// Returning a plain Box<[u8; PAGE_SIZE]> signifies that write has exclusive access.
// Option in both represents the possibility that no frame could be freed up.
// Time to increase the size of memory, I guess. Or just wait it out.
// Both ReadGuard and WriteGuard will implement the Drop trait for RAII.
// The Drop trait method drop(&mut self) is executed when a value goes out of scope.
// This can be thought of as a destructor in C++.
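To illustrate the RAII side, here is one way the Drop of a WriteGuard might restore the frame. This is a sketch under my own simplifications: the guard keeps a handle back to the frame_holder, which I pair here with a Condvar so that waiting threads can be woken when the frame returns.

```rust
use std::sync::{Arc, Condvar, Mutex};

const PAGE_SIZE: usize = 4096;
type FrameId = usize;

struct FrameData {
    buffer: Box<[u8; PAGE_SIZE]>,
}

// The frame pool, bundled with a condvar for notifying waiters.
type FrameHolder = Arc<(Mutex<Vec<Option<Arc<FrameData>>>>, Condvar)>;

struct WriteGuard {
    frame: Option<FrameData>, // Option so drop() can move the frame out
    frame_id: FrameId,
    holder: FrameHolder,
}

impl Drop for WriteGuard {
    // RAII: when the guard goes out of scope, return the frame to the
    // holder (the slot flips from None back to Some) and wake any waiters.
    fn drop(&mut self) {
        let (lock, cond) = &*self.holder;
        let mut frames = lock.lock().unwrap();
        frames[self.frame_id] = Some(Arc::new(self.frame.take().unwrap()));
        cond.notify_all();
    }
}
```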
I have mentioned quite a few types in this post. Let me talk about two more.
Condition Variables
When a thread holds a mutex while waiting for some condition to become true, it can either busy-wait, burning CPU, or ask the OS to inform it when the condition is met, thereby relinquishing the CPU.
In the context of the Buffer Pool Manager (BPM), conditional variables are useful in scenarios such as:
- A writer thread waiting for all active reader threads to finish before obtaining exclusive access to a frame.
- A reader thread waiting for a frame being updated by a writer thread to become available again.
while frame_lock.as_ref().is_none()
    || (is_write && Arc::strong_count(frame_lock.as_ref().unwrap()) != 1)
{
    frame_lock = cond
        .wait_timeout(frame_lock, Duration::from_micros(100))
        .unwrap()
        .0;
    // No CPU is held; the line above waits for a notification on the condvar.
    // In this example, the drop() of a WriteGuard in another thread can notify
    // that FrameData has been returned to the frame_holder.
}
// Taken directly from my implementation. You can see I have used wait with a
// timeout. This helped me break out of situations where a thread missed a
// notification from the condvar elsewhere.
Deref: Simplifying Access to Nested Types in Rust
In many programming languages, accessing deeply nested types often involves a chain of access operators (.). For example, if you want to reach the innermost [u8; PAGE_SIZE] value wrapped in multiple layers (e.g., ReadGuard), you might write something like:
readGuard.as_ref().as_ptr(); // dummy code, does not work
This approach works, but as the nesting increases, the chain becomes unwieldy. Rust solves this problem elegantly using the Deref trait.
The Deref Trait
The Deref trait allows a type to define how it should be dereferenced. When implemented, it simplifies access by automatically "unwrapping" nested layers. With Deref, you can access the innermost value seamlessly, as if the layers in between don’t exist.
impl Deref for ReadGuard {
    type Target = [u8; PAGE_SIZE];

    fn deref(&self) -> &Self::Target {
        self.frame.get_readable()
    }
}
/*
let guard: ReadGuard = ...;
guard. -> here, `.` will refer directly to [u8; PAGE_SIZE]
*/
// Most wrapper types, like Box and Arc (and lock guards like MutexGuard),
// implement this trait along with its mutable counterpart DerefMut (when it
// makes sense). Thus accessing nested types in Rust is not that taxing,
// if done right.
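As a self-contained illustration (this ReadGuard wraps a Box directly, unlike the frame-backed one earlier), deref coercion lets callers index and measure the guard as if it were the array itself:

```rust
use std::ops::Deref;

const PAGE_SIZE: usize = 4096;

struct ReadGuard {
    data: Box<[u8; PAGE_SIZE]>,
}

impl Deref for ReadGuard {
    type Target = [u8; PAGE_SIZE];

    fn deref(&self) -> &Self::Target {
        // &Box<[u8; PAGE_SIZE]> coerces to &[u8; PAGE_SIZE] automatically.
        &self.data
    }
}

fn first_byte(guard: &ReadGuard) -> u8 {
    // Deref coercion: indexing the guard reaches through to the array.
    guard[0]
}
```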
And that is all for Post 1. We’ve explored the foundational concepts of frames, pages, and the inner workings of the Buffer Pool Manager, including the LRU-K cache eviction policy. These building blocks form the backbone of a database’s memory management system. In the next post, I’ll dive into Indexes with a focus on the B+ Tree data structure — an essential component for efficiently querying data in relational databases.
I’d love to hear your thoughts, suggestions, or feedback — feel free to share them publicly or privately. As always, keep learning and stay curious!