Pagini recente » Diferente pentru automate-finite-si-kmp intre reviziile 2 si 3 | Profil LASM_Coders | Diferente pentru problema/bitmap intre reviziile 5 si 4 | Atasamentele paginii sumtree | Diferente pentru blog/rolling-hash intre reviziile 18 si 24

Diferente pentru blog/rolling-hash intre reviziile #18 si #24

Diferente intre #17 si #18

Nu exista diferente intre titluri.

Diferente intre continut:

Rolling hash is a neat idea found in the Rabin-Karp string matching algorithm which is easy to grasp and useful in quite a few different contexts.

As before, the most interesting part of the article is the problem section. You can discuss them in the comment section, but first let's go through a few applications.

As before, the most interesting part of the article is the problem section. Discuss them in the comment section, but first let's go through a few applications.

h2. The Rabin-Karp algorithm
bq. Given a string P of length n and a string S of length m find out all the occurrences of P within S

The algorithm works by building a fingerprint for each subsequence of S of length n. It checks if any of them match the fingerprint of P. Implementing this directly leads to a $O(n*m)$ solution which isn't faster than the naive matching algorithm (in fact it may be slower).

The algorithm works by building a fingerprint for each substring of S of length n. It checks if any of them match the fingerprint of P. Implementing this directly leads to a $O(n*m)$ solution which isn't faster than the naive matching algorithm (in fact it may be slower).

*The insight* is that you can efficiently compute the hash value of a subsequence using the hash value of previous subsequence if you use the right hash function. A polynomial with the string characters as coefficients works well. Let <tex>H(c)=c[0]a^{n-1} + c[1]a^{n-2}+...+c[n-1]</tex> be the hash function for the string c.

*The insight* is that you can efficiently compute the hash value of a substring using the hash value of previous substring if you use the right hash function. A polynomial with the string characters as coefficients works well. Let <tex>H(c)=c[0]a^{n-1} + c[1]a^{n-2}+...+c[n-1]</tex> be the hash function for the string c.

All the operations are done modulo a prime number so that we don’t have to deal with large integers.
Let’s see what happens when we go from $H(S[i..i+n-1])$ to $H(S[i+1..i+n])$:

h2. rsync

The linux utility is used to keep two copies of the same file in sync by copying around deltas. Instead of replicating the whole changed file, rsync segments the file into fixed size chunks, computes a rolling hash for each chunk. On the client side computes rolling hashes for all chunks of the target file including overlapping ones. Then it checks which chunks are already there, which chunks need to be deleted and which chunks need to be copied over. This improves the transfer speed as you only copy over deltas and some chunk fingerprints.

The linux utility is used to keep two copies of the same file in sync by copying around deltas. Instead of replicating the whole changed file, rsync segments the file into fixed size chunks, computes a rolling hash for each chunk. On the client side computes rolling hashes for all chunks of the target file including overlapping ones. Then it checks which chunks are already there, which chunks need to be deleted and which chunks need to be copied over. This improves the transfer speed as you only copy over deltas and some chunk fingerprints. Here's a 'better description':http://en.wikipedia.org/wiki/Rsync#Algorithm from wikipedia.

h2. Largest palindrome

bq. Given a string of length n, find out it's largest palindromic subsequence.

bq. Given a string of length n, find out its largest palindromic substring.

The naive solution is O(n^3^). For every possible subsequence it tests in linear time if it's a palindrome.

The naive solution is O(n^3^). For every possible substring it tests in linear time if it's a palindrome.

A smarter solution tries every position in the original string as the center of a palindrome and extends to the left and to the right as long as the corresponding characters match. This solution takes O(n^2^) time.

Let’s choose <tex>H(S[i..j]) = S[i]+S[i+1]a+S[i+2]a{2}+...+S[j]a^{j-i} </tex> Then we compute the array h which contains cummulative hashes: <tex>h[k]=S[0]+S[1]a+S[2]a^{2}+...+S[k]a^{k}</tex>. It’s easy to see that <tex>H(S[i..j])=(h[j]-h[i-1])/a^i</tex>

After spending O(n) time on preprocessing h, we have an efficient way to compute the hash code for any subsequence of S.

After spending O(n) time on preprocessing h, we have an efficient way to compute the hash code for any substring of S.

The division operation modulo a prime needs some number theory insight. You can do it using Fermat's little theorem or using the Extended Euclidean Algorithm. If you need to process a lot of sub strings it's useful to pre process all the values for <tex>a^{-i} \bmod{p}</tex>.

h2. Additional problems

# Given a string of length n, count the number of palindromic subsequences of S. Solve better than O(n^2^).

# Given a string of length n, count the number of palindromic substring of S. Solve better than O(n^2^).

# Given a string of length n, find the longest substring that is repeated at least k times. Solve in O(n log n)
# Given a bitmap, find out the largest square that’s repeated in the bitmap.
# Given a string S figure out if the string is periodic. (there is a p such that for any i we have that s[i] == s[i + p]) For example abcabcab is periodic with p = 3. O(n log n)

Diferente intre securitate:

public

protected

infoarena informatica de performanta

Diferente pentru blog/rolling-hash intre reviziile #18 si #24

Nu exista diferente intre titluri.

Diferente intre continut:

Diferente intre securitate:

Topicul de forum nu a fost schimbat.