Berkeley DB notes
Internals
These notes mostly describe the btree back end for our special db-1.85/1.86 variant, as used for the "db2" KDB module. Some of this material should go into the official documentation to help operators understand possible database corruption characteristics.
Berkeley DB files are made up of fixed-size pages. Each back end (hash, btree) has its own way of organizing data into pages. Page zero of a database file is a metadata page that includes persistent parameters of the whole database.
mpool
Berkeley DB has a user-space page cache called mpool. Pages in the cache can be in-use/pinned (MPOOL_PINNED) and/or dirty (MPOOL_DIRTY). There is a hash table for pages in the cache, and an LRU queue to help release memory used by unreferenced pages.
mpool_open opens a memory pool from an already-open file descriptor. It takes a pagesize and a maxcache parameter. pagesize is typically a persistent parameter of an upper layer (e.g., stored in the btree metadata page), while maxcache can be tuned for performance without changing persistent database parameters.
mpool_get retrieves a page. If the requested page is already in the cache, it returns the cached copy. (If built with -DDEBUG, it will abort if the page is already pinned, unless the caller passes the MPOOL_IGNOREPIN flag.) If the page is not in the cache, mpool_get reads it in from disk. Regardless of how it obtained the page, mpool_get sets MPOOL_PINNED unless the caller passes the MPOOL_IGNOREPIN flag. If the current cache size is at least maxcache, mpool_get tries to reuse an unreferenced cache entry when reading in a previously uncached page (through the static helper mpool_bkt). It does this by taking the first unpinned page from the LRU queue, flushing it to disk first if it is dirty. If this fails to free up a cache entry, mpool_bkt will allocate a new entry anyway, growing the cache beyond maxcache.
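The cache behavior described above can be sketched with a toy model. This is a simplified Python rendering of the described logic, not the actual C implementation; the MPoolModel class, the backing dict standing in for the disk, and the flushed list are all invented for illustration.

```python
from collections import OrderedDict

MPOOL_PINNED = 0x01
MPOOL_DIRTY = 0x02

class MPoolModel:
    """Toy model of mpool's page cache: an LRU queue of pages,
    some pinned, some dirty, with a soft maxcache limit."""

    def __init__(self, backing, maxcache):
        self.backing = backing      # simulated disk: pgno -> bytes
        self.maxcache = maxcache
        self.lru = OrderedDict()    # pgno -> {"data": ..., "flags": ...}
        self.flushed = []           # record of writebacks, for inspection

    def _bkt(self):
        """Like mpool_bkt: if the cache is full, reuse the first
        unpinned entry at the head of the LRU queue, flushing it
        first if dirty.  If every entry is pinned, let the cache
        grow beyond maxcache."""
        if len(self.lru) < self.maxcache:
            return
        for pgno, ent in self.lru.items():
            if not ent["flags"] & MPOOL_PINNED:
                if ent["flags"] & MPOOL_DIRTY:
                    self.backing[pgno] = ent["data"]
                    self.flushed.append(pgno)
                del self.lru[pgno]
                return
        # All pages pinned: fall through and allocate a new entry anyway.

    def get(self, pgno):
        """Like mpool_get: return the page pinned, reading it from
        the simulated disk if uncached, and move it to the LRU tail."""
        if pgno not in self.lru:
            self._bkt()
            self.lru[pgno] = {"data": self.backing[pgno], "flags": 0}
        self.lru.move_to_end(pgno)  # only get() reorders the queue
        self.lru[pgno]["flags"] |= MPOOL_PINNED
        return self.lru[pgno]["data"]

    def put(self, pgno, dirty=False):
        """Like mpool_put: unpin, optionally mark dirty; the LRU
        ordering is deliberately left alone."""
        ent = self.lru[pgno]
        ent["flags"] &= ~MPOOL_PINNED
        if dirty:
            ent["flags"] |= MPOOL_DIRTY

# Two cache slots over three "disk" pages: fetching a third page
# evicts the least-recently-used unpinned page (1), flushing it
# because it was put back dirty.
backing = {1: b"a", 2: b"b", 3: b"c"}
pool = MPoolModel(backing, maxcache=2)
pool.get(1); pool.put(1, dirty=True)
pool.get(2); pool.put(2)
pool.get(3)
assert pool.flushed == [1]
```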
mpool_put releases a pinned page, clearing MPOOL_PINNED. The caller can pass the MPOOL_DIRTY flag to indicate that it modified the page and the page therefore needs to be written back to disk. Notably, mpool_put does not change the order of the LRU queue; only mpool_get does that, which means that the ordering of mpool_put calls doesn't determine the order in which mpool flushes dirty pages to disk. If a subsequent mpool_get fetches a dirty page before it is flushed to disk, the page moves to the tail of the LRU queue, possibly further delaying its being written to disk.
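A minimal model of that flush-ordering consequence: because put() leaves the queue alone, the order in which dirty pages would be evicted and written back is driven by get() order, not put() order. The queue representation below is invented for illustration.

```python
from collections import OrderedDict

def simulate(ops):
    """Replay (op, pgno) pairs against a toy LRU queue.
    'get' moves the page to the tail (as mpool_get does);
    'put' marks it dirty but leaves the ordering alone
    (as mpool_put does).  Returns the head-to-tail order in
    which the dirty pages would be flushed by eviction."""
    lru = OrderedDict()
    dirty = set()
    for op, pgno in ops:
        if op == "get":
            lru[pgno] = None
            lru.move_to_end(pgno)
        elif op == "put":
            dirty.add(pgno)       # note: no reordering here
    return [p for p in lru if p in dirty]

# Page 1 is put back dirty before page 2, but a later get() moves
# it to the tail, so it would be flushed *after* page 2:
order = simulate([("get", 1), ("put", 1),
                  ("get", 2), ("put", 2),
                  ("get", 1)])
assert order == [2, 1]
```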
mpool_sync writes dirty pages out to disk and calls fsync. The only part of the btree back end that calls it is __bt_sync (usually by way of __bt_close).
btree
The btree back end is actually a B+tree. Pages in the tree are either internal pages or leaf pages. There is also a free list (with its head pointer in the metadata page). Leaf pages contain key-value pairs in key-sorted order; internal pages contain keys (also sorted) paired with pointers to child pages. Pages have sibling links to the pages to their left and right on the same level of the tree. (The free list also uses sibling links.) These sibling links help speed up some tree update operations and sequential traversal.

Updates that split a page can touch multiple pages, especially if parent pages must split (possibly up to the root), creating opportunities for corruption if some of the writes fail (due to crashes or power failures). Often, corruption will result in some items being inaccessible to sequential access, random access, or both. Observed instances of corruption appear to be inconsistencies between parent-child links and sibling links. Rarely, loops will form in a sibling chain, causing infinite loops (at least until disk space exhaustion) during dumps. Sibling link corruption can lead to worse corruption if a page with a corrupted sibling link splits or is deleted.
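The sibling-chain loop failure mode can be guarded against when walking a level sequentially. This is a sketch only: the pages mapping (pgno to right-sibling pgno, with 0 meaning no sibling) is an invented stand-in for db's actual page format.

```python
def walk_siblings(pages, start):
    """Follow right-sibling links from `start`, as a sequential dump
    would, and raise if the chain revisits a page -- the loop
    corruption mode described above.  `pages` maps pgno -> right
    sibling pgno, with 0 meaning no right sibling; this layout is
    invented for illustration, not db's actual page format."""
    seen = set()
    chain = []
    pgno = start
    while pgno != 0:
        if pgno in seen:
            raise ValueError("sibling-chain loop at page %d" % pgno)
        seen.add(pgno)
        chain.append(pgno)
        pgno = pages[pgno]
    return chain

# A healthy leaf level: 2 -> 3 -> 4 -> (end).
healthy = {2: 3, 3: 4, 4: 0}
assert walk_siblings(healthy, 2) == [2, 3, 4]

# Corrupted links: page 4's sibling points back to page 3, so a
# naive dump would loop forever (or until disk space ran out).
corrupt = {2: 3, 3: 4, 4: 3}
```

Walking `corrupt` with walk_siblings raises ValueError instead of looping.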
Page splits of non-root pages always preserve the page being split as the left-hand page. Splitting the root creates two new pages and converts the root to an internal page if it started out as a leaf page.
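A simplified sketch of a non-root leaf split under that rule: the page being split keeps its page number and remains the left-hand page, a new right sibling takes the upper half of its items, and sibling links on up to three pages get rewritten (one reason a crash mid-split can leave parent-child and sibling links inconsistent). The page dictionary layout here is invented for illustration.

```python
def split_leaf(pages, pgno, new_pgno):
    """Split page `pgno` in place: it keeps the lower half of its
    items and remains the left-hand page; `new_pgno` receives the
    upper half and is spliced in as its right sibling.  `pages` maps
    pgno -> {"items": [...], "prev": pgno, "next": pgno}, with 0
    meaning no sibling; this layout is a toy stand-in for db's
    actual page format."""
    old = pages[pgno]
    mid = len(old["items"]) // 2
    pages[new_pgno] = {"items": old["items"][mid:],
                       "prev": pgno, "next": old["next"]}
    old["items"] = old["items"][:mid]
    if old["next"]:                       # third page touched, if present
        pages[old["next"]]["prev"] = new_pgno
    old["next"] = new_pgno

# Splitting page 2 allocates page 4 as its new right sibling;
# page 2 stays the left-hand page, and page 3's prev link must
# also be rewritten.
pages = {2: {"items": ["a", "b", "c", "d"], "prev": 0, "next": 3},
         3: {"items": ["e", "f"], "prev": 2, "next": 0}}
split_leaf(pages, 2, 4)
assert pages[2]["items"] == ["a", "b"] and pages[2]["next"] == 4
```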
Even though every page in a tree should have at least two children or items, some pages can end up with only one due to deletions. Only when a deletion would completely empty a page is that page deleted.
Compatibility
Our bundled "db2" is most likely db-1.86 with patches. db-1.85 was the version included by most BSD derivatives. db-1.86 made backward-incompatible changes in the hash back end (but probably not in the btree back end, though we haven't confirmed that).
There's some issue that prevents krb5_db2_promote_db (and thus kdb5_util load) from working properly with db-5.3, and possibly earlier releases.
Bugs
Byteswapping in bt_conv.c might not work correctly for leaf records with small keys but big (overflow) data. This probably can't be caught by the current test suite, which can only check that the byteswapping code is internally consistent. (A proper test for this would need a database file created on an opposite-endian platform.)
Overflow page pointers for big keys (and sometimes big data) are unaligned; this can cause problems with byte swapping on platforms that enforce strict alignment.
BSD "upstreams"
The major open-source BSD derivatives NetBSD, FreeBSD, and OpenBSD have db-1.85/1.86 in their libc. They have shared various patches back and forth, but there are still some divergences among them. These can become relevant when trying to minimize diffs.
NetBSD seems to have converted from BSD fixed-width types (e.g., u_int32_t) to C99/POSIX fixed-width types (e.g., uint32_t), but FreeBSD and OpenBSD have not. FreeBSD and NetBSD seem to have converted function declarations to prototypes, but OpenBSD has not. NetBSD and OpenBSD have eliminated BSD type names such as u_long, but FreeBSD has not.