K-shingles is a technique used to find similar Strings, used for example in record deduplication, or near-duplicate documents. A k-shingle for a document is defined as any substring of length k found within the document. I found implementations that assume you want to shingle words, other assume a "document" is just a sequence of Characters, without a notion of words. For convenience, I will cover both although the difference is very subtle:
- k is always a positive integer.
- Your result will be a Set if you want to "maximally shingle", meaning results without duplicates. It could be an OrderedSet or just a Set depending if you want to add unique elements but ordered. Otherwise it will be an arrayed collection.
- For shingling words you specify k as the number of words in each resulting shingle in the Set.
- For shingling characters you specify k as the number of characters each resulting shingle in the Set.
- "k should be picked large enough that the probability of any given shingle appearing in any given document is low". From Jeffrey Ullman's book.
- The Jaccard similarity coefficient (a.k.a Tanimoto Coefficient, a token based edit distance) uses k-shingles.
| k s | k := 2. s := 'a rose is a rose is a rose' findTokens: ' '. (1 to: s size - k + 1) collect: [ : i | (s copyFrom: i to: i + k - 1) asArray ]
For different values of k we will have:
k = 2 -> #(#('a' 'rose') #('rose' 'is') #('is' 'a') #('a' 'rose') #('rose' 'is') #('is' 'a') #('a' 'rose')) k = 3 -> #(#('a' 'rose' 'is') #('rose' 'is' 'a') #('is' 'a' 'rose') #('a' 'rose' 'is') #('rose' 'is' 'a') #('is' 'a' 'rose')) k = 4 -> #(#('a' 'rose' 'is' 'a') #('rose' 'is' 'a' 'rose') #('is' 'a' 'rose' 'is') #('a' 'rose' 'is' 'a') #('rose' 'is' 'a' 'rose'))
For K = 4, the first two of these shingles each occur twice in the text, it is not "maximally shingled". To shingle sequence of Characters, is pretty much the same implementation:
| k s | k := 2. s := 'abcdabd'. (1 to: s size - k + 1) collect: [ : i | s copyFrom: i to: i + k - 1 ] as: OrderedSet.
And in this case we have:
k = 2 -> "an OrderedSet('ab' 'bc' 'cd' 'da' 'bd')" k = 3 -> "an OrderedSet('abc' 'bcd' 'cda' 'dab' 'abd')" k = 4 -> "an OrderedSet('abcd' 'bcda' 'cdab' 'dabd')"You can find this implemented in the StringExtensions package. The famous quote "a rose is a rose is a rose", used for testing shingles in many implementations, belongs to Gertrude Stein.
0 comentarios:
Publicar un comentario