Before you begin the well-deserved flaming, please understand that I do not intend on submitting this to CPAN. This is purely an academic exercise to demonstrate the algorithm (poorly), and to help me practice some stuff. I understand that this isn’t terribly efficient, and I am quite confident that one could do this much faster (and probably has many times over). I wrote this for fun, and for boredom, and I’m posting it because I thought it was cool and because I promised you I’d post stuff.

This algorithm is pretty simple. The premise is that you take a key value of some sort, you hash it using any algorithm, and then you take the mod of that hash value by the capacity of your hash table, and that gives you an index! It’s pretty simple, and even in the implementation I have here I believe it’s not a terrible look-up time.

The basic idea is this: $array_index = hashval($key) % $array_capacity

Challenges

The two biggest challenges in this implementation were index duplicates and growing the hash table.

The unfortunate fact is that sometimes with many different keys you can take the mod of those keys’ hash values by the capacity you get duplicate index values. The best way to handle that is to then have an array at that index wherein you store those values, and then upon look-up you look in that array for the values.

For growing the hash, you never really want to hit capacity because that will result in more duplicates. For this very simple implementation I just said that if you reach a sum of elements which is within 80% of the capacity, grow the hash table by 20%. There are probably much better strategies, and if I was really trying to make this perform better I would work on improving this.

The Code

I’m not going to go too into the code since it really took me a while to write the code. I went through the trouble of using perltidy and perlcritic, so enjoy (and thanks to Damian Conway and the authors of Perl::Critic and the authors of perltidy for making life easier for Perlers the world over).

Ways to Improve the Implementation

There are several things I think we could do to improve this implementation, but I couldn’t be bothered to do so (remember, academic exercise!).

  1. We could detect large duplicate arrays and use binary search to find those items. Remember, I am keeping the list of dups sorted!
  2. We could probably vastly improve the hashing algorithm, the pack I’m using currently is just a sum of the ASCII values. Something which would result in a wider distribution would be better.
  3. We could make it so that we only grow the hash table when we start seeing large duplicate lists rather than just on the count of items in the table.
  4. We could make the _grow() method not duplicate the hash table and then re-populate it, but instead shuffle it around internally.
  5. We could probably modify the way we’re storing the items in the list so that sorting happens inline rather than making a copy and sorting the copy before replacing the prior list.
  6. There are likely a bunch more different ways we could improve upon this implementation, but this is what I’ve got right now. :)

The Conclusion

Overall I’m rather pleased with this brief experiment. I think this is the longest it’s taken me to do one of these, and it was a lot of fun. As always I welcome your comments and your suggestions. I see this as a form of inviting peer review, so if you see something I could improve and you feel so inclined then please do.

Please also note that I developed this entire program using koding.com. If you think you’d like to use koding.com, check it out. If you click here and sign up, it gives me more disk space to write code with, which enables me to do more of this. I apologize in advance, but there is no way to stop me.