Anagrams Predicate
Recently, I spent some time thinking about a simple problem. How do you test if two strings are anagrams of each other?
There are many ways, but the naive solution is to simply sort both strings, character-wise, and then compare the results. In Python, you might do that like so (We‘ll leave out the fact that anagrams are actually real words and phrases. We also work in a case-sensitive manor [e.g. “JimMorrison” and “MrMojoRisin” isn’t truthy, though a simple s.lower()
goes a long way.]):
def isAnagram(str1, str2): if len(str1) != len(str2): return False return sorted(str1) == sorted(str2)
Here, I‘m using sorted
, a Python __builtin__
, that takes an iterable and produces a sorted list of that iterable. I’m then taking advantage of the fact that Python lists can be compared element-wise with the ==
operator. Doing this produces a function that will work on any string, and it‘s simple to see why. Sort the string “parental” and you get “aaelnprt.” Sort the string “paternal” and you still get “aaelnprt.” Obviously “aaelnprt” and “aaelnprt” are equivalent. This of course returns False
for the strings “snowman” and “iceman” since they don’t compare sorted equally (or non-sorted for that matter).
However, this solution isn‘t the most efficient use of resources. For one, most sorting algorithms are only O(n log n), which means in the best case isAnagram
is too. It also needs to allocate two lists to store the results returned by sorted
.
There is of course a way to do better. You just have to think about the problem for a little longer than a minute:
def isAnagramN(str1, str2): if len(str1) != len(str2): return False counts = defaultdict(lambda: [0, 0]) for c1, c2 in izip(str1, str2): counts[c1][0] += 1 counts[c2](http://en.wikipedia.org/wiki/Anagram) += 1 for k, v in counts.iteritems(): if v[0] != v[1]: return False return True
This code does not allocate proportionally to the size of the strings, but instead on the diversity of the strings. In other words, isAnagramN("aaaaaaa", "bbbbbbb")
allocates 1 defaultdict, and 2 lists of size 2. Why? Because, the algorithm simply counts up how many times each letter occurs in each string. Of course Python also has to allocate the generators to use for izip
and counts.iteritems()
, but that isn’t significant. The big win here of course is that given strings of any length, the algorithm uses only as much space as the diversity of the contents contained in the strings!
As if that wasn‘t a win enough, this algorithm runs in O(n) on the length of the strings!
But, does it actually make a difference? The answer of course is yes. For strings of significant length, isAnagramN
runs almost 2x as fast as isAnagram
.
The proof is in the bacon, so let’s take a look at some numbers. Using Python‘s timeit
module, I tested strings of length 1 through 100,001, incrementing by 10,000 (I’ve tested other lengths as well, and reach a similar conclusion). At each length, the test was repeated 50 times. The results are below:
Length Time isAnagram Time isAnagramN
1
0.00043511390686
0.000530958175659
10001
0.657309055328
0.38395690918
20001
1.25437283516
0.793761968613
30001
1.91431283951
1.15374517441
40001
2.55181908607
1.53560996056
50001
3.14615797997
2.07976388931
60001
3.76745486259
2.30041193962
70001
4.48913788795
2.81828999519
80001
5.155148983
3.09482097626
90001
5.67185592651
3.477850914
100001
6.33614587784
3.94285678864
Download the code: anagram.py
—2009-01-08