How to diagram and code a recursive solution with memoization for finding distinct subsequences
Given a string S
and a string T
, count the number of distinct subsequences of S
which equals T
.
A subsequence of a string is a new string which is formed from the original string by deleting some (can be none) of the characters without disturbing the relative positions of the remaining characters. (ie, “ACE” is a subsequence of “ABCDE” while “AEC” is not).
Visualizing a solution for two sample strings
Consider s = ‘ccat’
and t=’cat’
We’ll compare characters across both strings to find out if string s
contains a subsequence of string t
.
In our example, there are 2 possible substrings:
Diagramming the solution
To diagram the solution, we’ll consider the index of each character. We’ll represent this with the index at (s,t)
.
- We’ll start at the 0 index for each string.
- If things look good we will do two things:
- Move forward one index on each string
- Move forward one index on string
s
and see if there’s another match to stringt
.
If things don’t look good we’ll move forward one index on string s
and see if there’s a match yet to string t
.
As we move to the end of the strings, if we reach the end of each string and the characters still match, we return 1. If we reach the end of one string but some things still don’t match, we return 0.
The solution
There’s a lot going on here, but let’s focus on the recursive aspects. We set our base case:
While we increment i and j, if i or j become equal to the length of their input strings, we determine if that substring is a match or not. We return 0 or 1 depending on the result. We must always reach the length of j to return a 1.
if i == n:
if j == m:
return 1
else:
return 0
if j == m:
return 1
We initialize a variable match to accumulate all matching subsequences.
We use match to accumulate the number of matches we find, which we return as 1s — Remember that 1s mean we have a subsequence match.
match = 0
return match
Finally, to find our matches, we use an if statement to check for equivalency. If they are equal we call the helper function again, twice. The first call will increment i and j. The second call will only increment j.
If they don’t match we only increment i and keep looking for a match. Each recursive branch will run until the length of string s, t, or both are reached.
if s[i] == t[j]:
match += helper(i+1, j+1) + helper(i+1, j)
else:
match += helper(i+1, j)
return match
We run the code to see how many subsequences exist:
$ python3 distinct_subsequences.py
2
2! We already knew this answer, so no surprise there. But what if we have a more complicated string?
Consider strings s=”ababbabaaabbbaaaaaabababababababbbaa” and t=”ababbababababbababbbaa”.
$ python3 distinct_subsequences.py
6872
6872 subsequences — that’s a whole lot! And it takes a long time to run- 3.2227540016174316 seconds. Let’s reduce the complexity.
Figuring out run time
import time
.
.
.
.
t = time.time()
print(distinct_subsequences("ababbabaaabbbaaaaaabababababababbbaa", "ababbababababbababbbaa"))
print(time.time()-t)
At the top of your code, import time so we can keep track of how long it’s taking for distinct_subsequences to run on our very long strings. Initialize time.time() before we call the function and check the time again right after the function call. This will check about how long the call took to run. For our simple s=’ccat’ and t=’cat’ strings the function call was almost instantaneous.
What is taking so long?!
Remember in our original tree we had some repetition? Look at the two nodes circled in red. Once we saw it the first time, we should have already know the answer the second time.
If we add a print statement for (i,j) for each iteration, we can see the tree moves from the top to the left. Right now it is repeating a lot of (i,j) combinations.
(0, 0)
(1, 1)
(2, 2)
(3, 2)
(2, 1)
(3, 2)
(3, 1)
(1, 0)
(2, 0)
(3, 0)
Memoization to the rescue
Memoization keeps track of everything we already did so we don’t have to do it again. We’ll add a dictionary of tuples to track all possible solutions. See below how we add the cache:
import timedef distinct_subsequences(s, t):n, m = len(s), len(t)
cache = {}
def helper(i, j):if i == n:
if j == m:
return 1
else:
return 0
if j == m:
return 1**if (i,j) in cache:
return cache[(i,j)]**
print((i,j))
match = 0
if s[i] == t[j]:
match += helper(i+1, j+1) + helper(i+1, j)
else:
match += helper(i+1, j) **cache[(i,j)] = match**
return match
return helper(0, 0)t = time.time()
print(distinct_subsequences("ccat", "cat"))
print(time.time()-t)
We initialize a dictionary
cache = {}
We see if the tuple (i,j) is already in the cache before we move on
We add the match to the cache
Let’s check the runtime…
Remember our crazy long strings? Let’s see how they perform with the cache in place.
$ python3 distinct_subsequences.py
6872
0.001993894577026367
We still have 6872 answers, but this time it only takes fractions of a second to compute. That sounds pretty good to me!
For more coding tutorials or to see what I’m working on, follow me on Medium or fork this code on repl.it.