Edit Distance is quite a interesting and popular problem. Here I present an efficient bottom up C++ program to solve it.
Problem – We are given 2 strings. We have to find the “edit distance” or the cost of converting one string to other. We are allowed 3 operations – Insert, Delete, Replace. All 3 have equal cost that is 1 unit.
Example – Edit distance between “abc” and “abd” is 1. We replaced “c” with “d”. You can also see this as “Delete c”, then “Insert d” which would give a edit distance of 2. But since we try to minimize edit distance, we consider it as single operation that is Replace. Few more examples-
EditDistance(“ab”, “bc”) = 2
EditDistance(“man”, “woman”) = 2
EditDistance(“c”, “java”) = 4
EditDistance(“abcd”, “acd”) = 2
One of the above 4 examples has wrong answer! Tell me in comments which one is wrong, and what is the correct answer 😉
We will use bottom up version of dynamic programming, as it is much cleaner and efficient (than top down). If the strings have length N and M, we will need a table of size (N+1)*(M+1). First we need to initialize the top row i.e TABLE[0][i] = i and first column i.e TABLE[i][0] = i. These are the base cases in which one of the string is empty. Now filling the rest of the TABLE column wise is simple-
TABLE[i][j] = min(
TABLE[i][j-1]+1,
TABLE[i-1][j]+1,
TABLE[i-1][j-1]+cost,
)
Here cost = 0 if STR1[i] = STR2[j], 1 if unequal
And the C++ function-
int EditDistance(string word1, string word2) { int i, j, l1, l2, m; l1 = word1.length(); l2 = word2.length(); vector< vector<int> > t(l1 + 1, vector<int>(l2 + 1)); for (i = 0; i <= l1; i++) t[i][0] = i; for (i = 1; i <= l2; i++) t[0][i] = i; for (i = 1; i <= l1; i++) { for (j = 1; j <= l2; j++) { m = min(t[i-1][j], t[i][j-1]) + 1; t[i][j] = min(m, t[i-1][j-1] + (word1[i-1] == word2[j-1] ? 0 : 1)); } } return t[l1][l2]; }
The TABLE for “money” and “monkey” looks like->
1 is the final answer.
EditDistance(“abcd”, “acd”) = 1
Have you seen the C++ implementation on Rosetta Code?
http://rosettacode.org/wiki/Levenshtein_distance#C.2B.2B
I have measured it as four times faster than your implementation. So I’m not sure where your claim of “very efficient” comes from. 🙂
Thanks for pointing this.
It would be useful if you also provide the code for comarison along with input. And do mention your OS, compiler and flags passed.
I just used a very simple regime for testing and a handful of strings up to about length 10. I’ve also made some changes to the Rosetta Code implementation to make it even faster, which I will try to get uploaded to Rosetta Code.
Linux, x86_64, gcc-4.7.2, -O2 -march=native -std=c++0x
This is the guts of it, I’ve left out the window dressing for brevity:
high_resolution_clock::time_point const t0(high_resolution_clock::now());
for (size_t i(0); i < 10000000; ++i)
edit_distance_three(s1, s2); // I have several edit_distance_xxxxx functions.
high_resolution_clock::time_point const t1(high_resolution_clock::now());
cout << "Total time: " << duration_cast(t1 – t0).count() << " milliseconds.\n";
cout << "Average time: " << duration_cast(t1 – t0).count() / 10000000.0 << " nanoseconds.\n";
Could you explain what line 18 of your code does, please?
Line 18 considers the cost of replacement. So if word1[i-1] == word2[j-1] then the cost of replacement is 0 otherwise it is 1.
Thank you for your quick response!
WOW very clearly explained 😀 Thanks a lot 😀
great one.!
please explain this: vector< vector > t(l1 + 1, vector(l2 + 1));
This is used to initialize a 2D vector of size (l1+1)*(l2+1).
One of the (many) vector’s constructor takes first argument as count of elements, and the second (optional) argument as default element value.
So if I say `vector(5)`, that gives a vector of size 5, with default value of ‘0’. However if I say `vector(5, 42), that gives vector of size 5, where each value is set to 42`.
Now what happens if the vector’s type itself is vector i.e vector of vectors. The constructor would be called something like vector(5, vector(2)); This simply gives me a vector of size 5, where each element is vector of size 2. Notices that the values in this 2D vector will be all 0. You can imagine this construction as matrix with 5 rows and 2 columns.
HTH