Pagini recente » linegraph | Rolling hash, Rabin Karp, palindromes, rsync and others | Diferente pentru utilizator/lasm_coders intre reviziile 5 si 3 | Diferente pentru blog/rolling-hash intre reviziile 3 si 2 | Rolling hash, Rabin Karp, palindromes, rsync and others

Atenţie! Aceasta este o versiune veche a paginii, scrisă la 2012-10-30 09:54:39.
Revizia anterioară Revizia următoare

Rolling hash, Rabin Karp, palindromes, rsync and others

Cosmin Negruseri
30 octombrie 2012

Rolling hash is a neat idea found in the Rabin-Karp string matching algorithm which is easy to grasp and useful in quite a few different contexts.

Now let's go through a few applications.

The Rabin-Karp algorithm

This algorithm is a textbook example of the rolling hash trick. Rabin-Karp solves the following problem:

Given a string P of length n and a string S of length m find out all the occurrences of P within S

The algorithm works by building a fingerprint for each subsequence of S of length n. It checks if any of them match the fingerprint of P. Implementing this directly leads to a O(n*m) solution which isn't faster than the naive matching algorithm (in fact it might be even slower).

The insight is that you can efficiently compute the hash value of a subsequence using the hash value of previous subsequence. One can use the first character of the previous sequence, the new character of the current sequence and the previous fingerprint without needing to go through the other characters.

A polynomial with the string characters as coefficients works well.
Lets chose $H(c)=c[0]a^{n-1} + c[1]a^{n-2}+...+c[n-1]$ as the hash function for the string c.
All the operations are done modulo a prime number so that we don’t have to deal with large integers.

Let’s see what happens when we go from H(S[i..i+n-1]) to H(S[i+1..i+n]):

$H(S[i+1..i+n])= S[i+1]a^{n-1}+S[i+2]a^{n-2}+.. + S[i+n]$

Adding and substracting $S[i]a^n$ we get

$a(S[i]a^{n-1}+S[i+1]a^{n-2}+ S[i+2]a^{n-3}+ .. + S[i+n -1])+S[i+n]-S[i]a^{n} =$
$aH(S[i..i+n-1])-S[i]a^{n}+S[i+n]$

Deleting a character and appending another one corresponds to adding a number and subtracting a number in our hashing algorithm. The complexity of the algorithm is O(n).

Let's go through an example. If we S and P to be strings of digits and chose a to be 10, our problem maps to finding numbers in a string of digits.

Let P = 53424, S = 3249753424234837 and a = 10
3249753424234837
32497
 24975 = 10 * 32497 - 3 * 10000 + 5  
  49753 = 10 * 24975 - 2 * 10000 + 3
   97534 = 10 * 49753 - 4 * 10000 + 4 
     ...
     53424 // match
      ...

Here's some code:

an = 1;
  rolling_hash = 0;
  for (int i = 0; i < n; i++) {
    rolling_hash = (rolling_hash * a + S[i]) % MOD;
    an = (an * a) % MOD,
  }
  
  for (int i = 1; i <= n - m; i++)
    rolling_hash = (rolling_hash * a + S[i + n - 1] - an * S[i - 1]) % MOD,
    if (rolling_hash == hash_p)
        print "match"
  } 
}

Substrings of variable length

The polynomial function can be used to compute hashes for substrings of varying length

Let’s choose $H(S[i..j]) = S[i]+S[i+1]a+S[i+2]a{2}+...+S[j]a^{j-i}$ Then we compute the array h which contains cummulative hashes: $h[k]=S[0]+S[1]a+S[2]a^{2}+...+S[k]a^{k}$ . It’s easy to see that $H(S[i..j])=(h[j]-h[i-1])/a^i$

Longest common substring

Given three strings, compute their longest common substring.

We can find a common subsequence of fixed length in linear time using the rolling hash technique. Basically we hash all fixed length subseqences of the three strings and check if the intersection of the three respective hash sets is not empty. Now to find the largest length we can use a binary search algorithm.
The complexity of this algorithm is O(n log n).

rsync

The linux utility is used to keep two copies of the same file in sync by copying around deltas. Instead of replicating the whole changed file, rsync segments the file into fixed size chunks, computes a rolling hash for each chunk. On the client side computes rolling hashes for all chunks of the target file including overlapping ones. Then it checks which chunks are already there, which chunks need to be deleted and which chunks need to be copied over. This improves the transfer speed as you only copy over deltas and some chunk fingerprints.

Largest palindrome

Given a string of length n, find out it's largest palindromic subsequence.

The naive solution is O(n^3). For every possible subsequence it tests in linear time if it's a palindrome.

A smarter solution tries every position in the original string as the center of a palindrome and extends to the left and to the right as long as the ... This solution takes O(n^2) time.

There's a O(n log n) solution that involves rolling hash. For a fixed length ... then we do a binary search.

Caveats

Hashing collisions are a problem. One way to deal with them is to check for a match when two fingerprints are equal, but this makes the solution inefficient if there are lots of matches. Another way is to use more hashing functions to decrease the collision probability.

Alternatives

Suffix trees or suffix arrays are very good tools to tackle string matching problems. I prefer hashing as the code is usually simpler.

Additional problems

Given a string of length n, count the number of palindromic subsequences of S. Solve better than O(n^2).
Given a string of length n, find the longest substring that is repeated at least k times. Solve in O(n log n)
Given a bitmap, find out the largest square that’s repeated in the bitmap.
Given a string S figure out if the string is periodic. (there is a p such that for any i we have that s[i] == s[i + p]) For example abcabcab is periodic with p = 3. O(n log n)
Given two equal length strings, figure out if one is a rotation of the other. O(n)
Given two polygons find out if they are similar polygons. O(n)
Given a string, find it's longest periodic prefix. O(n log n) For aaabaaabcde the answer is aaabaaab
Given a tree T with character labeled nodes and a given string P count how many downward a paths matching P. O(n)

Categorii:

infoarena informatica de performanta

Ultimele insemnari

Categorii

Blogroll