How to diagram and code a recursive solution with memoization for finding distinct subsequences

Given a string S and a string T, count the number of distinct subsequences of S which equals T.

A subsequence of a string is a new string which is formed from the original string by deleting some (can be none) of the characters without disturbing the relative positions of the remaining characters. (ie, “ACE” is a subsequence of “ABCDE” while “AEC” is not).

Visualizing a solution for two sample strings

Consider s = ‘ccat’ and t=’cat’

We’ll compare characters across both strings to find out if string s contains a subsequence of string t.

In our example, there are 2 possible substrings:

Diagramming the solution

To diagram the solution, we’ll consider the index of each character. We’ll represent this with the index at (s,t).

  1. We’ll start at the 0 index for each string.
  2. If things look good we will do two things:
  • Move forward one index on each string
  • Move forward one index on string s and see if there’s another match to string t.

If things don’t look good we’ll move forward one index on string s and see if there’s a match yet to string t.

As we move to the end of the strings, if we reach the end of each string and the characters still match, we return 1. If we reach the end of one string but some things still don’t match, we return 0.

The solution

There’s a lot going on here, but let’s focus on the recursive aspects. We set our base case:

While we increment i and j, if i or j become equal to the length of their input strings, we determine if that substring is a match or not. We return 0 or 1 depending on the result. We must always reach the length of j to return a 1.

if i == n:
if j == m:
return 1
else:
return 0
if j == m:
return 1

We initialize a variable match to accumulate all matching subsequences.

We use match to accumulate the number of matches we find, which we return as 1s — Remember that 1s mean we have a subsequence match.

match = 0
return match

Finally, to find our matches, we use an if statement to check for equivalency. If they are equal we call the helper function again, twice. The first call will increment i and j. The second call will only increment j.

If they don’t match we only increment i and keep looking for a match. Each recursive branch will run until the length of string s, t, or both are reached.

if s[i] == t[j]:
match += helper(i+1, j+1) + helper(i+1, j)
else:
match += helper(i+1, j)
return match

We run the code to see how many subsequences exist:

$ python3 distinct_subsequences.py
2

2! We already knew this answer, so no surprise there. But what if we have a more complicated string?

Consider strings s=”ababbabaaabbbaaaaaabababababababbbaa” and t=”ababbababababbababbbaa”.

$ python3 distinct_subsequences.py
6872

6872 subsequences — that’s a whole lot! And it takes a long time to run- 3.2227540016174316 seconds. Let’s reduce the complexity.

Figuring out run time

import time
.
.
.
.
t = time.time()
print(distinct_subsequences("ababbabaaabbbaaaaaabababababababbbaa", "ababbababababbababbbaa"))
print(time.time()-t)

At the top of your code, import time so we can keep track of how long it’s taking for distinct_subsequences to run on our very long strings. Initialize time.time() before we call the function and check the time again right after the function call. This will check about how long the call took to run. For our simple s=’ccat’ and t=’cat’ strings the function call was almost instantaneous.

What is taking so long?!

Remember in our original tree we had some repetition? Look at the two nodes circled in red. Once we saw it the first time, we should have already know the answer the second time.

If we add a print statement for (i,j) for each iteration, we can see the tree moves from the top to the left. Right now it is repeating a lot of (i,j) combinations.

(0, 0)
(1, 1)
(2, 2)
(3, 2)
(2, 1)
(3, 2)
(3, 1)
(1, 0)
(2, 0)
(3, 0)

Memoization to the rescue

Memoization keeps track of everything we already did so we don’t have to do it again. We’ll add a dictionary of tuples to track all possible solutions. See below how we add the cache:

import timedef distinct_subsequences(s, t):n, m = len(s), len(t)
cache = {}
def helper(i, j):
if i == n:
if j == m:
return 1
else:
return 0
if j == m:
return 1
**if (i,j) in cache:
return cache[(i,j)]**
print((i,j))
match = 0
if s[i] == t[j]:
match += helper(i+1, j+1) + helper(i+1, j)
else:
match += helper(i+1, j)
**cache[(i,j)] = match**
return match
return helper(0, 0)
t = time.time()
print(distinct_subsequences("ccat", "cat"))
print(time.time()-t)

We initialize a dictionary

cache = {}

We see if the tuple (i,j) is already in the cache before we move on

We add the match to the cache

Let’s check the runtime…

Remember our crazy long strings? Let’s see how they perform with the cache in place.

$ python3 distinct_subsequences.py
6872
0.001993894577026367

We still have 6872 answers, but this time it only takes fractions of a second to compute. That sounds pretty good to me!

For more coding tutorials or to see what I’m working on, follow me on Medium or fork this code on repl.it.

Journalist and endlessly curious person. One half of @hatchbeat.