Recently, I spent some time thinking about a simple problem. How do you test if two strings are anagrams of each other?
There are many ways, but the naive solution is to simply sort both strings,
character-wise, and then compare the results. In Python, you might do
that like so (We’ll leave out the fact that anagrams are actually real words
and phrases. We also work in a case-sensitive manor [e.g. “JimMorrison” and
“MrMojoRisin” isn’t truthy, though a simple s.lower()
goes a long way.]):
def isAnagram(str1, str2): if len(str1) != len(str2): return False return
sorted(str1) == sorted(str2)
Here, I’m using sorted
, a Python __builtin__
, that takes an iterable and
produces a sorted list of that iterable. I’m then taking advantage of the fact
that Python lists can be compared element-wise with the ==
operator. Doing
this produces a function that will work on any string, and it’s simple to see
why. Sort the string “parental” and you get “aaelnprt.” Sort the string
“paternal” and you still get “aaelnprt.” Obviously “aaelnprt” and “aaelnprt”
are equivalent. This of course returns False
for the strings “snowman” and
“iceman” since they don’t compare sorted equally (or non-sorted for that
matter).
However, this solution isn’t the most efficient use of resources. For one,
most sorting algorithms are only O(n log n), which means in the best case
isAnagram
is too. It also needs to allocate two lists to store the results
returned by sorted
.
There is of course a way to do better. You just have to think about the problem for a little longer than a minute:
def isAnagramN(str1, str2): if len(str1) != len(str2): return False counts =
defaultdict(lambda: [0, 0]) for c1, c2 in izip(str1, str2): counts[c1][0] += 1
counts[c2][1] += 1 for k, v in counts.iteritems(): if v[0] != v[1]: return
False return True
This code does not allocate proportionally to the size of the strings, but
instead on the diversity of the strings. In other words,
isAnagramN("aaaaaaa", "bbbbbbb")
allocates 1 defaultdict, and 2 lists of
size 2. Why? Because, the algorithm simply counts up how many times each
letter occurs in each string. Of course Python also has to allocate the
generators to use for izip
and counts.iteritems()
, but that isn’t
significant. The big win here of course is that given strings of any length,
the algorithm uses only as much space as the diversity of the contents
contained in the strings!
As if that wasn’t a win enough, this algorithm runs in O(n) on the length of the strings!
But, does it actually make a difference? The answer of course is yes. For
strings of significant length, isAnagramN
runs almost 2x as fast as
isAnagram
.
The proof is in the bacon, so let’s take a look at some numbers. Using
Python’s timeit
module, I tested strings of length 1 through 100,001,
incrementing by 10,000 (I’ve tested other lengths as well, and reach a similar
conclusion). At each length, the test was repeated 50 times. The results are
below:
Length Time isAnagram Time isAnagramN
1
0.00043511390686
0.000530958175659
10001
0.657309055328
0.38395690918
20001
1.25437283516
0.793761968613
30001
1.91431283951
1.15374517441
40001
2.55181908607
1.53560996056
50001
3.14615797997
2.07976388931
60001
3.76745486259
2.30041193962
70001
4.48913788795
2.81828999519
80001
5.155148983
3.09482097626
90001
5.67185592651
3.477850914
100001
6.33614587784
3.94285678864
Download the code: anagram.py