Understanding the longest common subsequence problem

Many times when building software, the problem we're really solving is comparison. A diff tool shows a developer what changed between two versions of a file, a spell checker figures out the closest match to a misspelt word, and a plagiarism detector finds overlapping passages between two documents.

In all these cases, the core challenge is the same: given two sequences, find out what they share. A fundamental problem in this space is the longest common subsequence problem, where the goal is to find the longest sequence of elements that appears in the same relative order in both input sequences.

An example of a common subsequence between two strings.

The longest common subsequence problem is a foundational problem in dynamic programming and appears frequently in applications such as version control systems for detecting changes in files, computational biology for DNA sequence alignment, plagiarism detection, and data synchronization.

In this lesson, we will learn about the longest common subsequence problem and how it can be solved efficiently using a dynamic programming solution.

The longest common subsequence problem

Consider we are given two strings s1 and s2 of lengths m and n respectively. A subsequence of a string is a sequence of characters that appears in the same relative order as in that string, but not necessarily contiguously. Unlike in a substring, the characters in a subsequence do not need to be adjacent in the original strings.

We need to find the length of the longest subsequence that is common to both s1 and s2.