Anagrams Predicate

January 08, 2009 #

Recently, I spent some time thinking about a simple problem. How do you test if two strings are anagrams of each other?

There are many ways, but the naive solution is to simply sort both strings, character-wise, and then compare the results. In Python, you might do that like so (We'll leave out the fact that anagrams are actually real words and phrases. We also work in a case-sensitive manor [e.g. "JimMorrison" and "MrMojoRisin" isn't truthy, though a simple s.lower() goes a long way.]):

def isAnagram(str1, str2): if len(str1) != len(str2): return False return sorted(str1) == sorted(str2)

Here, I'm using sorted, a Python __builtin__, that takes an iterable and produces a sorted list of that iterable. I'm then taking advantage of the fact that Python lists can be compared element-wise with the == operator. Doing this produces a function that will work on any string, and it's simple to see why. Sort the string "parental" and you get "aaelnprt." Sort the string "paternal" and you still get "aaelnprt." Obviously "aaelnprt" and "aaelnprt" are equivalent. This of course returns False for the strings "snowman" and "iceman" since they don't compare sorted equally (or non-sorted for that matter).

However, this solution isn't the most efficient use of resources. For one, most sorting algorithms are only O(n log n), which means in the best case isAnagram is too. It also needs to allocate two lists to store the results returned by sorted.

There is of course a way to do better. You just have to think about the problem for a little longer than a minute:

def isAnagramN(str1, str2): if len(str1) != len(str2): return False counts = defaultdict(lambda: [0, 0]) for c1, c2 in izip(str1, str2): counts[c1][0] += 1 counts[c2][1] += 1 for k, v in counts.iteritems(): if v[0] != v[1]: return False return True

This code does not allocate proportionally to the size of the strings, but instead on the diversity of the strings. In other words, isAnagramN("aaaaaaa", "bbbbbbb") allocates 1 defaultdict, and 2 lists of size 2. Why? Because, the algorithm simply counts up how many times each letter occurs in each string. Of course Python also has to allocate the generators to use for izip and counts.iteritems(), but that isn't significant. The big win here of course is that given strings of any length, the algorithm uses only as much space as the diversity of the contents contained in the strings!

As if that wasn't a win enough, this algorithm runs in O(n) on the length of the strings!

But, does it actually make a difference? The answer of course is yes. For strings of significant length, isAnagramN runs almost 2x as fast as isAnagram.

The proof is in the bacon, so let's take a look at some numbers. Using Python's timeit module, I tested strings of length 1 through 100,001, incrementing by 10,000 (I've tested other lengths as well, and reach a similar conclusion). At each length, the test was repeated 50 times. The results are below:

Length Time isAnagram Time isAnagramN
10.000435113906860.000530958175659
100010.6573090553280.38395690918
200011.254372835160.793761968613
300011.914312839511.15374517441
400012.551819086071.53560996056
500013.146157979972.07976388931
600013.767454862592.30041193962
700014.489137887952.81828999519
800015.1551489833.09482097626
900015.671855926513.477850914
1000016.336145877843.94285678864

Download the code: anagram.py

Tagged: algorithms , python