Unix Incompatibility Notes:
DBM Hash Libraries

Jan Wolter

There are several different DBM libraries, including two open source packages: Gnu's GDBM package and Sleepycat's Berkeley DB package. GDBM and DB can each emulate the interfaces of the traditional packages, Old DBM and NDBM. It's important to keep straight which interface to which version of which library you are talking about.

Virtually all Unix systems have some form of Old DBM and NDBM library installed. GDBM and DB can be installed on nearly all of them.

Generally the disk files produced by any two libraries are going to be incompatible with each other. The formats of many packages are not architecture independent either, and they change fairly regularly from version to version. In some cases, using different interfaces to the same version of the same package will create slightly incompatible databases on disk (e.g., the files will be named differently depending on which interface to GDBM you use). Ideally, all programs that access the same database file should be running on the same architecture using the same interface to the same version of the same library.

Note that full installs of Redhat 6.1 include three different DBM libraries: GDBM, Berkeley DB Version 1, and Berkeley DB Version 2. Each of these three has a different native interface and an incompatible file format. All three can emulate Old DBM and NDBM, and Berkeley DB Version 2 can emulate Version 1. With the introduction of Berkeley DB Version 3, this is likely to get more complex. Figuring out which header file goes with which library is enough to stun an ox.

Using hash files for data can be just a bit perilous. I've had hashed databases suddenly become unreadable when an ISP upgraded the operating system on a server, inadvertently bringing in a new, incompatible version of a dbm package. This kind of thing can make maintaining a dbm file something of a headache in the long term.

Because of this you should be very careful about using DBM files for storing important data. They just aren't very maintainable. One workaround for this is to use them only for indexing, and provide a tool that can regenerate the indexes if they are lost. Another workaround for databases that are read often and written rarely is to do something like what is done with the Unix password file, where the master file is a plain text file and the hashed file is generated from that. If all else fails, you should at least provide a tool to dump the hash file into an easily readable plain text file, and a matching tool to read the plain text file to recreate the hash file. A pair of such tools allows you to manually edit the database, or port it to a new system, by dumping it on the old system and loading it on the new.
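The text side of such a dump/load pair can be quite small. The following is only a sketch under stated assumptions: the function names encode_field() and decode_field() and the \xHH escaping scheme are invented for this illustration and are not a format used by any DBM package. A dump tool could write each key and its content as two encoded fields separated by a tab, one pair per line, and a load tool could decode them and store the pair again.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: write len bytes of src into dst as printable text.
 * Backslash and non-printable bytes (including tab and newline) become
 * \xHH, so tab and newline stay free for use as separators.
 * dst must hold at least 4*len + 1 bytes. Returns the encoded length. */
size_t encode_field(const char *src, size_t len, char *dst)
{
    static const char hex[] = "0123456789abcdef";
    size_t i, n = 0;

    assert(src != NULL && dst != NULL);
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)src[i];
        if (isprint(c) && c != '\\') {
            dst[n++] = (char)c;
        } else {
            dst[n++] = '\\';
            dst[n++] = 'x';
            dst[n++] = hex[c >> 4];
            dst[n++] = hex[c & 15];
        }
    }
    dst[n] = '\0';
    return n;
}

/* Value of one hex digit; the caller has already checked isxdigit(). */
static int hex_value(int c)
{
    if (isdigit(c))
        return c - '0';
    return tolower(c) - 'a' + 10;
}

/* Hypothetical inverse of encode_field(). dst must hold at least
 * strlen(src) bytes. Returns the number of decoded bytes, or
 * (size_t)-1 if an escape sequence is malformed. */
size_t decode_field(const char *src, char *dst)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        if (*src != '\\') {
            dst[n++] = *src++;
            continue;
        }
        if (src[1] != 'x' ||
            !isxdigit((unsigned char)src[2]) ||
            !isxdigit((unsigned char)src[3]))
            return (size_t)-1;
        dst[n++] = (char)(hex_value((unsigned char)src[2]) * 16 +
                          hex_value((unsigned char)src[3]));
        src += 4;
    }
    return n;
}
```

Because the encoded form contains only printable characters, the resulting file can be edited by hand and moved between architectures without regard to byte order or library version.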

Old DBM Library

Old DBM is often called just ``DBM.'' We call it Old DBM to distinguish it from NDBM (which is also sometimes just called ``DBM,'' as in Apache's mod_auth_dbm). It was developed at UC Berkeley.

Old DBM should probably be considered obsolete, and you should probably avoid using it.

Old DBM allows only one DBM database to be open at a time. Each DBM database consists of two files: one has a .dir suffix and contains an index, the other has a .pag suffix and holds all the data. Data files are sparse and should not be copied carelessly (there are also problems putting such databases on filesystems that don't support sparse files).

There are restrictions on the total size of a key/content pair (the maximum size is typically around 1018 bytes, but ranges from 512 to 4096 in different installations). The total size of all keys that hash to the same value must also be within that limit.

There is no automatic locking, so concurrently updating and reading is risky.
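One common way to reduce that risk, which is not part of Old DBM itself, is to have every cooperating program take an advisory lock on a separate lock file before touching the database. The sketch below assumes POSIX fcntl() locking is available. The names lockfile_acquire() and lockfile_release() are hypothetical, and the scheme only helps if all programs that use the database follow it.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: take an advisory lock on lockpath, blocking until
 * it is available. Pass F_WRLCK before updating the database and F_RDLCK
 * before reading it. Returns a descriptor for lockfile_release(), or -1. */
int lockfile_acquire(const char *lockpath, short type)
{
    struct flock fl;
    int fd;

    assert(lockpath != NULL);
    fd = open(lockpath, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    memset(&fl, 0, sizeof fl);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                 /* zero length means the whole file */

    if (fcntl(fd, F_SETLKW, &fl) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Closing the descriptor drops the lock. */
int lockfile_release(int fd)
{
    return close(fd);
}
```

A program would call lockfile_acquire() before dbminit(), do its fetches and stores, call dbmclose(), and only then call lockfile_release().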

The header file dbm.h should be included.

The dbminit() call is used to open a database. It will not create a new database if none exists. You must manually create empty .dir and .pag files.
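Creating the two empty files needs nothing beyond open(). A minimal sketch, assuming a POSIX system; the helper names and the choice to pass the file mode as a parameter are made up for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: make sure path exists, creating it empty if needed.
 * An existing file is left untouched (no O_TRUNC). Returns 0 or -1. */
static int touch_file(const char *path, mode_t mode)
{
    int fd = open(path, O_WRONLY | O_CREAT, mode);

    if (fd < 0)
        return -1;
    return close(fd);
}

/* Hypothetical helper: create base.dir and base.pag so that a later
 * dbminit(base) has files to open. Returns 0 on success, -1 on error. */
int create_empty_dbm_files(const char *base, mode_t mode)
{
    static const char *suffix[] = { ".dir", ".pag" };
    char path[1024];
    size_t i;

    assert(base != NULL);
    for (i = 0; i < 2; i++) {
        int n = snprintf(path, sizeof path, "%s%s", base, suffix[i]);

        if (n < 0 || (size_t)n >= sizeof path)
            return -1;            /* name too long for the buffer */
        if (touch_file(path, mode) != 0)
            return -1;
    }
    return 0;
}
```

The base name passed here is the same name, without suffix, that would later be passed to dbminit().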

Data is saved with store() and retrieved with fetch(). The fetch() call returns a pointer to static storage which is overwritten with each successive call. The dptr fields in the returned datum types do not necessarily point to word-aligned addresses, so it is unsafe to cast them to arbitrary data types and dereference them.
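The usual way around the alignment problem is to copy the bytes into a properly typed variable with memcpy() instead of casting dptr. In the sketch below, struct blob is a stand-in with the same dptr and dsize members as datum, declared only so the example compiles without dbm.h; the helper name blob_to_u32() is likewise hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the library's datum type. Real code would use datum
 * from the DBM header instead of declaring its own structure. */
struct blob {
    char *dptr;
    int   dsize;
};

/* Hypothetical helper: read a 32-bit value out of a fetched item.
 * memcpy() works even when dptr is not word-aligned, whereas
 * *(uint32_t *)d.dptr might fault or misbehave on some machines.
 * Returns 0 on success, -1 if the item is missing or the wrong size. */
int blob_to_u32(struct blob d, uint32_t *out)
{
    assert(out != NULL);
    if (d.dptr == NULL || d.dsize != (int)sizeof *out)
        return -1;
    memcpy(out, d.dptr, sizeof *out);
    return 0;
}
```

The same pattern applies to structures stored as content: check dsize against sizeof, then memcpy() into a local variable.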

Some of the manual pages for Old DBM omit to mention the existence of a dbmclose() call. However, it always seems to exist.

Under IRIX, there is an alternate version with function names like dbminit64() that allows you to reference databases bigger than 2 gigabytes (only possible on machines with more than 32 bits).

To compile, you usually need to link with -ldbm. I don't know if this is always needed.

Old DBM Library (GDBM Emulation)

The Gnu GDBM library provides a historic interface emulating Old DBM.

Old DBM's key and content size limitations disappear in GDBM's emulation.

The on-disk file format is incompatible with genuine Old DBM files. To initialize a database, you still need to manually create the .pag file. When an empty .pag file is opened it is initialized, any .dir file is removed, and a new .dir file is created which is just a link to the .pag. The .dir file is thereafter ignored, and the .pag file is a normal GDBM file. This trickery makes the files look similar on disk, and ensures that the time-stamp on the .dir file is kept current. If you want programs using the Old DBM interface to the GDBM library to share files with programs using the native interface to GDBM, then the programs using the native interface should pass gdbm_open() the filename including the .pag suffix, while the programs using the Old DBM interface should omit the .pag suffix.

The Old DBM interface to GDBM will always try to open files as a writer, meaning that they will be locked so no two processes can simultaneously access the same file. The exception is that if some process already has the file open as a reader (through a different interface) then processes using the Old DBM interface will open it as readers too (and be unable to write!).

Data returned by fetch() will actually be in dynamically allocated storage, but later fetch() calls will take care of freeing it for you, so you should mostly just pretend it is in static storage. The last fetched value will not be freed.

To use GDBM's Old DBM interface, you need GDBM's version of the dbm.h header file. Gnu's install scripts don't install this header file by default, so it is missing from many sites. By default the header is installed as dbm.h, but many glibc-2.1 systems, including Redhat 6.1, install it as gdbm/dbm.h.

To compile, you need to link with -lgdbm and sometimes -lgdbm_compat.

Old DBM Library (DB Emulation)

All versions of Berkeley DB include an interface emulating Old DBM.

Old DBM's key and content size limitations disappear in DB's emulation.

The on-disk file format is completely different from genuine Old DBM files and from GDBM format. The dbminit() call will create the database if it doesn't exist. It will be a single file with a .db suffix added to the name, and it will be created with mode 600 (modified by umask).

The dbmclose() call seems to deallocate the ``static'' memory that values returned by fetch() are saved in, so you need to be sure to save those elsewhere before closing the database.
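Saving a value elsewhere just means copying it into memory you own. A minimal sketch; the helper name save_value() is hypothetical, and the added terminating NUL is a convenience for values that are text, not something the library requires.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a malloc()ed copy of the dsize bytes at
 * dptr, with a NUL byte appended. The caller must free() the result.
 * Returns NULL if there is nothing to copy or memory runs out. */
char *save_value(const char *dptr, int dsize)
{
    char *copy;

    if (dptr == NULL || dsize < 0)
        return NULL;
    copy = malloc((size_t)dsize + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, dptr, (size_t)dsize);
    copy[dsize] = '\0';
    return copy;
}
```

With a datum d returned by fetch(), the call would be save_value(d.dptr, d.dsize), made before the next fetch() and before dbmclose().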

The DB emulation of Old DBM buffers writes to the database. You must call dbmclose() to ensure that all data is written out.

In Berkeley DB Versions 2 and 3, the dbm.h file is eliminated. Instead you do

    #define DB_DBM_HSEARCH 1
    #include <db.h>

When compiling, you need to link with the DB library, usually -ldb.

In Berkeley DB Version 1, there is a dbm.h header file, just like the real thing, but in glibc-2.1 systems, including Redhat 6.1, it is installed as db1/dbm.h. It used to be normal to install Berkeley DB so you'd link with -ldb, but in glibc-2.1 systems, where both Version 1 and Version 2 are installed, you need to link with -ldb1 to get the Version 1 library.

NDBM Library

To maximize confusion with Old DBM, NDBM is sometimes referred to as ``DBM''. Apache's mod_auth_dbm module is an NDBM interface, not an Old DBM interface. I think NDBM first appeared in 2.10 BSD.

NDBM allows multiple databases to be open simultaneously, and handles very large database files respectably. However, there is still a limit on the total size of the key/content pairs that can be stored (this ranges from 1018 bytes to 4096 bytes). Like Old DBM, each database consists of two files, with .dir and .pag suffixes. Data files are sparse and should not be copied carelessly.

Under IRIX, there is an alternate version with function names like dbm_open64() that allows you to reference databases bigger than 2 gigabytes.

There is no automatic locking, so concurrently updating and reading is risky.

The header file ndbm.h should be included.

The dbm_open() call is used to open a database. Unlike Old DBM, NDBM can be told to automatically create the files if they don't already exist. On Solaris (at least) opening the database write-only is not supported. If the O_WRONLY flag is given, it actually opens it read-write. I recommend never using O_WRONLY (see below for other weirdnesses with this).

Data is saved with dbm_store() and retrieved with dbm_fetch(). The dbm_fetch() call returns a pointer to static storage which is overwritten with each successive call. The dptr fields in the returned datum types do not necessarily point to word-aligned addresses, so it is unsafe to cast them to arbitrary data types and dereference them.

To compile, you usually need to link with -lndbm.

NDBM Library (GDBM Emulation)

The Gnu GDBM library provides a historic interface emulating NDBM. NDBM's size limitations disappear in GDBM's emulation.

The on-disk file format is incompatible with genuine NDBM files, but the files have the same suffixes. When an empty .pag file is opened it is initialized, any .dir file is removed, and a new .dir file is created which is just a link to the .pag. The .dir file is thereafter ignored, and the .pag file is a normal GDBM file.

When opening a database, the mode flags passed to dbm_open() are mapped to gdbm_open() flags as follows:

    dbm_open()             gdbm_open()
    O_RDONLY               GDBM_READER
    O_RDWR | O_CREAT       GDBM_WRCREAT
    anything | O_TRUNC     GDBM_NEWDB
    anything else          GDBM_WRITER

Warning: calling GDBM's emulation of dbm_open() with O_WRONLY|O_CREAT falls into the "anything else" category and thus will not create the database. You need to do O_RDWR|O_CREAT to create a database.
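The mapping can be written out as a small function, which makes the O_WRONLY|O_CREAT trap easy to see. This is only an illustration of the table, not GDBM's source: it assumes the rows are tried from top to bottom and that the first two rows are exact matches. The enum is a stand-in for the GDBM_READER, GDBM_WRITER, GDBM_WRCREAT and GDBM_NEWDB constants from gdbm.h, with arbitrary values, so that the sketch compiles without GDBM installed; the function name is hypothetical.

```c
#include <fcntl.h>

/* Stand-ins for the GDBM open modes named in the table. */
enum emu_mode { EMU_READER, EMU_WRITER, EMU_WRCREAT, EMU_NEWDB };

/* Hypothetical function: apply the table's rows from top to bottom. */
enum emu_mode map_open_flags(int flags)
{
    if (flags == O_RDONLY)
        return EMU_READER;
    if (flags == (O_RDWR | O_CREAT))
        return EMU_WRCREAT;
    if (flags & O_TRUNC)
        return EMU_NEWDB;
    return EMU_WRITER;            /* includes O_WRONLY | O_CREAT */
}
```

Under this reading, O_WRONLY|O_CREAT matches neither of the first two rows and carries no O_TRUNC, so it ends up as a plain writer, which cannot create a missing database.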

Data returned by dbm_fetch() will actually be in dynamically allocated storage, but later dbm_fetch() calls will take care of freeing it for you, so you should mostly just pretend it is in static storage. The last fetched value will not be freed.

To use GDBM's NDBM interface, you need GDBM's version of the ndbm.h header file. Gnu's install scripts don't install this header file by default, so it is missing from many sites. By default the header is installed as ndbm.h, but many glibc-2.1 systems, including Redhat 6.1, install it as gdbm/ndbm.h.

To compile, you need to link with -lgdbm and sometimes -lgdbm_compat.

NDBM Library (DB Emulation)

All versions of Berkeley DB include an interface emulating NDBM. NDBM's size limitations disappear in DB's emulation.

The on-disk file format is completely different from genuine NDBM files. It will be a single file with a .db suffix added to the name, instead of two files with .dir and .pag suffixes.

The dbm_close() call seems to deallocate the ``static'' memory that values returned by dbm_fetch() are saved in, so you need to be sure to save those elsewhere before closing the database.

The DB emulation of NDBM buffers writes to the database. You must call dbm_close() to ensure that all data is written out.

In Berkeley DB Version 2 and later, the ndbm.h header file is eliminated. Instead you do

    #define DB_DBM_HSEARCH 1
    #include <db.h>

When compiling, you need to link with the DB library, usually -ldb.

In Berkeley DB Version 1, there is an ndbm.h header file, but in glibc-2.1 systems, including Redhat 6.1, it is installed as db1/ndbm.h. It used to be normal to install Berkeley DB so you'd link with -ldb, but in glibc-2.1 systems, where both Version 1 and Version 2 are installed, you need to link with -ldb1 to get the Version 1 library.

GDBM Library

Gnu's GDBM can emulate Old DBM or NDBM, but it is best used through its own native interface.

GDBM allows multiple databases to be open simultaneously, and handles very large database files respectably. There are no size limits on keys or content. By default GDBM does locking, so that either one writer has the database open, or any number of readers have it open. It also buffers writes by default.

The header file gdbm.h should be included.

Databases are opened or created with the gdbm_open() call. No suffixes are automatically appended to the filename.

Data is saved with gdbm_store() and retrieved with gdbm_fetch(). The gdbm_fetch() call returns a pointer to dynamically allocated storage. It is the caller's responsibility to free() this when it is no longer needed.

To compile, you need to link with -lgdbm.

Berkeley DB Library

Sleepycat's ``Berkeley DB'' has been steadily evolving into an extremely sophisticated package that supports many kinds of database files beyond mere hash files, and lots of fancy locking and transaction control. It is probably wonderful, but I've used it only for simple hash files and that is all that is discussed here.

It has gone through four major versions, the first three with substantially different APIs and file formats. Small changes to the C API appear in most releases. It is normal to have more than one version installed on your system, so make sure the header files you include match the libraries you link with. All versions can emulate the Old DBM and NDBM interfaces, and Version 2 and later can emulate the Version 1.85 interface, but the file formats are all incompatible.

Apache's mod_auth_db uses any version of Berkeley DB.

The header file db.h should be included and you should link with -ldb. If that header defines DB_VERSION_MAJOR, DB_VERSION_MINOR and DB_VERSION_PATCH then they give the version number. If these are not defined by db.h then you can assume that the major version number is one.
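That version check can be done at compile time. The sketch below is hedged in two ways. First, it includes db.h only if the compiler can find it, using __has_include (a compiler extension that became standard in C23), so that the example still compiles on a machine without Berkeley DB; real code would simply include db.h. Second, the helper name db_header_major() and the SKETCH_HAVE_DB_H macro are hypothetical.

```c
#if defined(__has_include)
#  if __has_include(<db.h>)
#    include <db.h>
#    define SKETCH_HAVE_DB_H 1
#  endif
#endif

/* Hypothetical helper: major version of the Berkeley DB header seen at
 * compile time. Returns DB_VERSION_MAJOR if db.h defines it, 1 if db.h
 * was found but defines no version macros, and 0 if db.h was not found. */
int db_header_major(void)
{
#if defined(DB_VERSION_MAJOR)
    return DB_VERSION_MAJOR;
#elif defined(SKETCH_HAVE_DB_H)
    return 1;
#else
    return 0;
#endif
}
```

The same #if defined(DB_VERSION_MAJOR) test can be used to choose between version-specific calls in the rest of a program.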

If you have a glibc-2.1 system (like Redhat 6.1), which installs both Version 1 and Version 2, then the header file and library described above are for Version 2. The header file for Version 1 is db1/db.h and the library is -ldb1.

Databases are always single files, with a .db suffix on their names. How to open them is different for each major release. Version 1 uses dbopen(), Version 2 uses db_open(), and Version 3 first uses db_create() to make a database handle, and then handle->open() to open the database.

Version 1 is known to have some bugs, and certain kinds of operations should be avoided (if you can't avoid version 1 entirely). See http://www.sleepycat.com/historic.html.

Data is fetched and stored with db->get() and db->put() functions. It is important (and non-intuitive) that before doing any of these calls, you must zero out the key and content data structures that you will be passing in, like:

   memset(&key, 0, sizeof(key));
   memset(&content, 0, sizeof(content));

You even have to zero out the content structure before you fetch into it.

The syntax of the cursor() command, which creates a cursor for a sequential walk through the database, changed with version 2.6. Version 2.6 added a fourth flag argument (which can be set to integer zero to get the same behavior as older versions).

The syntax of the db->open() command changed in version 4.1, adding a new transaction handle argument (which may be NULL) before the filename argument.


Jan Wolter
Sun Apr 2 17:34:00 EDT 2000 - Original Release.
Fri Mar 7 10:45:21 EST 2003 - Incomplete Update.
Wed Jun 30 13:31:08 EDT 2004 - Minimal update to Berkeley DB 4 info.