Apart from the linked list, the Pintos kernel library offers another handy structure: the hash table.

The hash table in Pintos is not much different from any ordinary hash table. It resolves hash collisions with chaining, the most classic solution.

(Figure: collision resolution by chaining)

struct hash_elem *
hash_insert (struct hash *h, struct hash_elem *new)
{
  struct list *bucket = find_bucket (h, new);
  struct hash_elem *old = find_elem (h, bucket, new);

  if (old == NULL) 
    insert_elem (h, bucket, new);

  rehash (h);

  return old; 
}

/* Inserts E into BUCKET (in hash table H). */
static void
insert_elem (struct hash *h, struct list *bucket, struct hash_elem *e) 
{
  h->elem_cnt++;
  /* The new element is pushed onto the front of the list. */
  list_push_front (bucket, &e->list_elem);
}

This is the code used to insert a new element into the table. Note that each bucket is implemented as a linked list, so it can grow and shrink easily.
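To make the flow above concrete, here is a standalone sketch of chained insertion: hash the key to pick a bucket, scan that bucket's chain for a duplicate, and push the new node onto the front only if none exists, returning the old element on conflict just as hash_insert does. The types and names here are simplified stand-ins, not the Pintos ones.

```c
#include <stddef.h>

#define BUCKET_CNT 4            /* Must be a power of 2. */

struct node { int key; struct node *next; };

static struct node *buckets[BUCKET_CNT];

static size_t
bucket_index (int key)
{
  /* Because the bucket count is a power of 2, masking the hash
     value replaces a modulo operation. */
  return (unsigned) key & (BUCKET_CNT - 1);
}

/* Returns NULL on success, or the existing node with the same key. */
static struct node *
chain_insert (struct node *n)
{
  struct node **bucket = &buckets[bucket_index (n->key)];
  for (struct node *e = *bucket; e != NULL; e = e->next)
    if (e->key == n->key)
      return e;                 /* Duplicate: leave the table unchanged. */
  n->next = *bucket;            /* Push onto the front of the chain. */
  *bucket = n;
  return NULL;
}
```

Pushing to the front keeps insertion O(1); duplicate detection is what makes the preceding linear scan necessary.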

Create A Hash Table

To create a table, we first need to declare a struct hash variable, which contains the following:

/* Hash table. */
struct hash 
  {
    size_t elem_cnt;            /* Number of elements in table. */
    size_t bucket_cnt;          /* Number of buckets, a power of 2. */
    struct list *buckets;       /* Array of `bucket_cnt' lists. */
    hash_hash_func *hash;       /* Hash function. */
    hash_less_func *less;       /* Comparison function. */
    void *aux;                  /* Auxiliary data for `hash' and `less'. */
  };

This is basically the same as creating a hash table in any other language: it requires a hash function and a comparison function, and it stores a pointer to the array of buckets.
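As an illustration of this pattern, here is a standalone sketch of table initialization. The field layout follows the struct hash above, but struct list, the callback typedefs, and hash_init_sketch are simplified stand-ins, not the real Pintos declarations:

```c
#include <stdbool.h>
#include <stdlib.h>

/* Simplified stand-ins for Pintos types (illustrative only). */
struct hash_elem { struct hash_elem *next; };
struct list { struct hash_elem *head; };

typedef unsigned hash_hash_func (const struct hash_elem *e, void *aux);
typedef bool hash_less_func (const struct hash_elem *a,
                             const struct hash_elem *b, void *aux);

struct hash
  {
    size_t elem_cnt;            /* Number of elements in table. */
    size_t bucket_cnt;          /* Number of buckets, a power of 2. */
    struct list *buckets;       /* Array of bucket_cnt lists. */
    hash_hash_func *hash;       /* Hash function. */
    hash_less_func *less;       /* Comparison function. */
    void *aux;                  /* Auxiliary data for hash and less. */
  };

/* Initializes H with the caller's HASH and LESS callbacks and a
   small power-of-2 number of empty buckets. */
static bool
hash_init_sketch (struct hash *h, hash_hash_func *hash,
                  hash_less_func *less, void *aux)
{
  h->elem_cnt = 0;
  h->bucket_cnt = 4;            /* Matches rehash's minimum of four. */
  h->buckets = calloc (h->bucket_cnt, sizeof *h->buckets);
  h->hash = hash;
  h->less = less;
  h->aux = aux;
  return h->buckets != NULL;
}
```

The key point is that the table itself only stores the two callbacks; all key-specific knowledge lives in the functions the caller passes in.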

Put An Element Into the Table

And for the same reason described in the linked list post, to be able to put a struct into the table, the developer has to add a struct hash_elem member to it. Note that struct hash_elem is just a wrapper around struct list_elem: when inserting, the final operation is to put the element into the bucket's chaining linked list.

struct page
  {
    struct hash_elem hash_elem; /* Hash table element. */
    void *addr;                 /* Virtual address. */
    /* ...other members... */
  };

Hash Functions

Since C itself has no built-in hash table support, it is up to developers to decide how to hash the data they want to put into the table. Luckily, the Pintos authors have provided hash functions for the most commonly used data types:

// hash.h

/* Sample hash functions. */
unsigned hash_bytes (const void *, size_t);
unsigned hash_string (const char *);
unsigned hash_int (int);
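These functions are variants of the Fowler–Noll–Vo (FNV) hash. For illustration, here is a standalone sketch of 32-bit FNV-1a over arbitrary bytes; treat the exact constants and mixing order as illustrative of the idea rather than a copy of the kernel's code:

```c
#include <stddef.h>

#define FNV_32_BASIS 2166136261u
#define FNV_32_PRIME 16777619u

/* 32-bit FNV-1a: fold each byte into the running hash, then
   scramble with a multiply by a carefully chosen prime. */
static unsigned
fnv1a_bytes (const void *buf_, size_t size)
{
  const unsigned char *buf = buf_;
  unsigned hash = FNV_32_BASIS;
  while (size-- > 0)
    {
      hash ^= *buf++;           /* Mix in one byte... */
      hash *= FNV_32_PRIME;     /* ...then diffuse it across all bits. */
    }
  return hash;
}
```

Hashing raw bytes this way is what lets hash_bytes serve as the building block for hashing strings, integers, or any fixed-size struct field.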

Comparing Functions

Since we use chaining to store elements that fall into the same bucket, any operation must, after locating the bucket, iterate through the linked list to check whether an equal element already exists there. Thus we also need to define a comparison function ourselves to decide whether two elements are different or not.

/* Returns true if page a precedes page b. */
bool
page_less (const struct hash_elem *a_, const struct hash_elem *b_,
           void *aux UNUSED)
{
  const struct page *a = hash_entry (a_, struct page, hash_elem);
  const struct page *b = hash_entry (b_, struct page, hash_elem);

  return a->addr < b->addr;
}

A customized comparison function for the hash table. Here addr is the key of the page hash table, so when we compare two pages, we compare their addr values.
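The hash_entry macro used above recovers a pointer to the enclosing struct from a pointer to its embedded hash_elem; the underlying trick is pointer arithmetic with offsetof. A standalone sketch, with a container_of-style macro and simplified types standing in for the Pintos ones:

```c
#include <stddef.h>

struct hash_elem { void *dummy; };      /* Simplified stand-in. */

struct page
  {
    struct hash_elem hash_elem;
    void *addr;
  };

/* Recover the enclosing TYPE from a pointer to its MEMBER by
   subtracting the member's offset within the struct. */
#define ENTRY_OF(ELEM, TYPE, MEMBER) \
  ((TYPE *) ((char *) (ELEM) - offsetof (TYPE, MEMBER)))
```

This is why the table can store bare hash_elem pointers yet still let the comparison function reach the surrounding struct's key fields.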

Rehashing

Since each bucket is implemented as a linked list, a lookup within a chain takes $O(N)$ time, so lookups slow down significantly if hash collisions happen too frequently, i.e. if some chains grow too long. We need to add more buckets from time to time so that more bits of the hash value are used for bucket selection and the chains (hopefully) stay short.

In terms of how to grow the number of buckets, the policy in Pintos is quite simple:

static void
rehash (struct hash *h) {
  // ...
  /* Calculate the number of buckets to use now.
     We want one bucket for about every BEST_ELEMS_PER_BUCKET.
     We must have at least four buckets, and the number of
     buckets must be a power of 2. */
  new_bucket_cnt = h->elem_cnt / BEST_ELEMS_PER_BUCKET;
  if (new_bucket_cnt < 4)
    new_bucket_cnt = 4;
  while (!is_power_of_2 (new_bucket_cnt))
    new_bucket_cnt = turn_off_least_1bit (new_bucket_cnt); 

  // ... 
}

Ideally, the average number of elements in each bucket is about 2 and no more than 4. So after inserting or deleting, the table automatically recalculates the required bucket count and rounds it down to the nearest power of two.
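The rounding step above can be sketched in a self-contained form. The helper names follow the snippet from rehash, but their bodies here are plausible reconstructions (assuming BEST_ELEMS_PER_BUCKET is 2, as the text suggests), not the kernel source:

```c
#include <stddef.h>

#define BEST_ELEMS_PER_BUCKET 2

static int
is_power_of_2 (size_t x)
{
  /* A power of 2 has exactly one bit set. */
  return x != 0 && (x & (x - 1)) == 0;
}

static size_t
turn_off_least_1bit (size_t x)
{
  /* Clears the least-significant set bit. */
  return x & (x - 1);
}

/* One bucket per BEST_ELEMS_PER_BUCKET elements, at least four
   buckets, rounded down to a power of 2 by repeatedly clearing
   low-order bits until only the top bit remains. */
static size_t
ideal_bucket_cnt (size_t elem_cnt)
{
  size_t n = elem_cnt / BEST_ELEMS_PER_BUCKET;
  if (n < 4)
    n = 4;
  while (!is_power_of_2 (n))
    n = turn_off_least_1bit (n);
  return n;
}
```

Repeatedly clearing the lowest set bit leaves only the most significant one, which is exactly the largest power of 2 not exceeding the starting value.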

After allocating memory for the new buckets, it takes each element from the old buckets, rehashes it based on the new bucket count, and puts it into the corresponding new bucket.