simplehashis a hash datastructure library in C. Because, you know, there can never be enough of those! In my defense it's simple, fast, and easy to use, which is more than I can say for many other hashing libraries.
So why reinvent the wheel? The
READMEfile explains my motivation in bullet-point form but the real reason is that coming from a dynamic language background I often structure algorithms around hash (aka "map" or "dictionary") data structures. But this isn't available in the C standard library, and the external libraries that provide them aren't minimalistic enough. For example, the X/Open h*() functions that you find in POSIX systems require you to initialize the hash data structure, aren't reentrant, and the documentation has non-standard terminology.
In my mind I imagined something as simple as:
struct hash_entry * treasure_map = NULL; void * treasure_ptr, my_treasure; hash_set(treasure_map, "X_marks_the_spot", treasure_ptr); // sometime later... my_treasure = hash_get(treasure_map, "X_marks_the_spot");
Declaring pointers to your data and the hash itself is regrettable codenoise, but necessary in C. Everything else is as simple as it gets: "Take this string and this pointer and associate them in this data structure so that I can quickly and easily retrieve it later." I have a few more things in the todo list for this little library to make it even easier to use, tighter, and more flexible. But it's ready for use right now.
You can browse the code here, or download a tarball here. Or you're using git you can clone it with
git clone http://www.bartgrantham.com/_projects/simplehash/.git. It's license is MIT, so feel free to use it however you'd like. The README file is below. Enjoy!
simplehash ========== Description ----------- Super simple hashing library. Maps from strings to void pointers: hash_set(treasure_map, "X_marks_the_spot", treasure_ptr); ... my_treasure = hash_get(treasure_map, "X_marks_the_spot"); The itch that this scratches is: I just want a very simple way to record memory locations based on strings. Details ------- Why not just use the X/Open functions hcreate(), hsearch(), and hdestroy()?: - The API is needlessly gross. - The library shouldn't require you to create the hash - The naming scheme/terminology is confusing and non-standard - POSIX version non-reentrant, only one hash at a time - I prefered the system to be smaller/tighter - It's easier to use. Important things to note: - simplehash will allocate or free all the memory it uses, _including keeping track of its keys_. That is, it will malloc space and copy your string key so that there's no confusion about who's responsible for that memory. Managing memory for the data your value pointers refer to is up to you, though. - hash_get() will return NULL if it fails to find a key. This is only a problem if you expect to be storing NULLs as the values to keys. If it's an issue, I suggest storing a pointer to a sentinel as the value in this situation. - hash_set(myhash, "foo", NULL) doesn't clear the key/value pair from the hash, you should use hash_clear(myhash, "foo") for that instead. The difference is that hash_clear() will cascade delete empty hash tables whereas hash_set() does not. This is because setting a key/value pair where the value is NULL is semantially different than clearing a key/value pair. - The hash algorithm is pretty good, but not cryptographically sound. This means that it is vulnerable to memory exhaustion where an attacker crafts large keys that conflict, causing nearly empty tables (ie. a "vertical" table arrangement). This is not unique to this library of course, and the solution would defeat the goal of simplicity and speed in this library. The likelihood of this kind of attack is very, very low, but full disclosure... - The algorithm doesn't rebalance the hash tree on hash_clear(), so a heavily churned hash will not be optimally organized. This is almost never a problem, but if you require that your datastructures be perfectly formed in memory at all times at the expense of the occasional rebalancing, this library will not suffice. - Unless you set HASH_KEYS_PER_TABLE very low, the hash tends to be very sparse, usually 90%+. That's usually ok, the tables don't take up THAT much space. How to Use ---------- Declare your hash pointer, _making sure to initialize it to NULL_: struct hash_entry * myhash = NULL; Then you can add key/value pairs to the hash...: hash_set(myhash, "my_key_foo", value_bar_ptr); // don't free the memory that your values point to! we // only make a copy of the key string, not the value data ...retrieve the values from the keys...: tmpval = hash_get(myhash, "my_key_baz"); ...and clear the values: tmpval = hash_clear(myhash, "my_key_quux"); // "tmp_val" points to the value the key "my_key_quux" used to // refer to, and "my_key_quux" is no longer a key in the hash If necessary, you can also dump the hash to stdout: hash_dump(myhash); There's also a set of functions that will provide stats on a hash: i = j = k = 0; max_ptr = NULL; hash_stats(myhash, &i, &j, &k, &max_ptr); printf("tablest: %dnentriest: %dnnullst: %dnmax pointert:%pn", i, j, k, max_ptr); printf("depth:t: %dn", hash_depth(myhash)); printf("sparsenesst:%fn", hash_sparseness(myhash)); Finally, if you need to, there are some constants in hash.h that can be altered to your needs: HASH_KEYS_PER_TABLE - This is a trade-off between sparseness of the hashtable and depth for a given entry. HASH_MAX_KEYSIZE - This is to prevent runaway strcmp()'s, increase as necessary if you find that your keys aren't unique within this many characters. Installation ------------ All you really need is hash.h and hash.c. I've included my test program ("hash_test.c") and a quick program that shows how to use these functions ("hash_example.c"). Also included is a quick Makefile. If you want to see it come to life just get all the files and: # make # ./hash_test # ./hash_example Todo ---- Audit: - is my code cache-friendly? (for example, good: hash_stats always compares against hash_next_magic, which keeps it in L1) - rigorously test mem performance - rigorously test hash_clear and hash_get... make big test cases... - graphs showing: - statistical randomness of hashing algo with random keys/random rounds - effect of hash entries on sparseness, mem utilization, and speed - test suite should automatically test millions of variable-length random strings, churning millions of times - time of inserts/lookups/deletes New features: - hash_map(): takes a function and runs it against all the elements in the hash (depth first, breadth first too?) - see l_mapt from http://www.pasteit4me.com/46003,http://www.pasteit4me.com/46002,http://www.pasteit4me.com/45003 - this could provide hash_copy() - hash_value(): given a void *, find the first string that is the key for it (requires a full search) - hash_map_key(), hash_map_value(): combination of the two above ideas: depth first search for a value that causes the map fcn to return true - hash_copy(): this would be a simple solution to the "sparseness" problem. - Advanced feature: vararg hash_set and hash_get: hash_set_varg(myhash, &user, "New York", "New York City", "Grantham", "Bart", ...) Outstanding questions: - Would an sdbm-derived hash be better? Bugs, etc. ---------- Please let me know if you find any, or if you have license-friendly enhancements to add. License ------- MIT License. See ./LICENSE or http://www.opensource.org/licenses/mit-license.php Few things are more enjoyable than the knowledge that you've helped another person. If you do use these functions for anything, I'd love to hear about it: email@example.com Enjoy! -Bart