SIGUSR2

Recently, I spent some time thinking about a simple problem. How do you test if two strings are anagrams of each other?

There are many ways, but the naive solution is to simply sort both strings, character-wise, and then compare the results. In Python, you might do that like so (We’ll leave out the fact that anagrams are actually real words and phrases. We also work in a case-sensitive manor [e.g. “JimMorrison” and “MrMojoRisin” isn’t truthy, though a simple s.lower() goes a long way.]):

def isAnagram(str1, str2): if len(str1) != len(str2): return False return sorted(str1) == sorted(str2)

Here, I’m using sorted, a Python __builtin__, that takes an iterable and produces a sorted list of that iterable. I’m then taking advantage of the fact that Python lists can be compared element-wise with the == operator. Doing this produces a function that will work on any string, and it’s simple to see why. Sort the string “parental” and you get “aaelnprt.” Sort the string “paternal” and you still get “aaelnprt.” Obviously “aaelnprt” and “aaelnprt” are equivalent. This of course returns False for the strings “snowman” and “iceman” since they don’t compare sorted equally (or non-sorted for that matter).

However, this solution isn’t the most efficient use of resources. For one, most sorting algorithms are only O(n log n), which means in the best case isAnagram is too. It also needs to allocate two lists to store the results returned by sorted.

There is of course a way to do better. You just have to think about the problem for a little longer than a minute:

def isAnagramN(str1, str2): if len(str1) != len(str2): return False counts = defaultdict(lambda: [0, 0]) for c1, c2 in izip(str1, str2): counts[c1][0] += 1 counts[c2][1] += 1 for k, v in counts.iteritems(): if v[0] != v[1]: return False return True

This code does not allocate proportionally to the size of the strings, but instead on the diversity of the strings. In other words, isAnagramN("aaaaaaa", "bbbbbbb") allocates 1 defaultdict, and 2 lists of size 2. Why? Because, the algorithm simply counts up how many times each letter occurs in each string. Of course Python also has to allocate the generators to use for izip and counts.iteritems(), but that isn’t significant. The big win here of course is that given strings of any length, the algorithm uses only as much space as the diversity of the contents contained in the strings!

As if that wasn’t a win enough, this algorithm runs in O(n) on the length of the strings!

But, does it actually make a difference? The answer of course is yes. For strings of significant length, isAnagramN runs almost 2x as fast as isAnagram.

The proof is in the bacon, so let’s take a look at some numbers. Using Python’s timeit module, I tested strings of length 1 through 100,001, incrementing by 10,000 (I’ve tested other lengths as well, and reach a similar conclusion). At each length, the test was repeated 50 times. The results are below:

Length Time isAnagram Time isAnagramN

0.00043511390686

0.000530958175659

10001

0.657309055328

0.38395690918

20001

1.25437283516

0.793761968613

30001

1.91431283951

1.15374517441

40001

2.55181908607

1.53560996056

50001

3.14615797997

2.07976388931

60001

3.76745486259

2.30041193962

70001

4.48913788795

2.81828999519

80001

5.155148983

3.09482097626

90001

5.67185592651

3.477850914

100001

6.33614587784

3.94285678864

Download the code: anagram.py