domingo, 7 de octubre de 2018

Pharo Script of the Day: k-shingles implementation

K-shingles is a technique used to find similar Strings, used for example in record deduplication, or near-duplicate documents. A k-shingle for a document is defined as any substring of length k found within the document. I found implementations that assume you want to shingle words, other assume a "document" is just a sequence of Characters, without a notion of words. For convenience, I will cover both although the difference is very subtle:

  • k is always a positive integer.
  • Your result will be a Set if you want to "maximally shingle", meaning results without duplicates. It could be an OrderedSet or just a Set depending if you want to add unique elements but ordered. Otherwise it will be an arrayed collection.
  • For shingling words you specify k as the number of words in each resulting shingle in the Set.
  • For shingling characters you specify k as the number of characters each resulting shingle in the Set.
  • "k should be picked large enough that the probability of any given shingle appearing in any given document is low". From Jeffrey Ullman's book.
  • The Jaccard similarity coefficient (a.k.a Tanimoto Coefficient, a token based edit distance) uses k-shingles.
So for word shingling:

| k s |
k := 2.
s := 'a rose is a rose is a rose' findTokens: ' '.
(1 to: s size - k + 1) collect: [ : i | (s copyFrom: i to: i + k - 1) asArray ]

For different values of k we will have:

k = 2 -> #(#('a' 'rose') #('rose' 'is') #('is' 'a') #('a' 'rose') #('rose' 'is') #('is' 'a') #('a' 'rose'))
k = 3 -> #(#('a' 'rose' 'is') #('rose' 'is' 'a') #('is' 'a' 'rose') #('a' 'rose' 'is') #('rose' 'is' 'a') #('is' 'a' 'rose'))
k = 4 -> #(#('a' 'rose' 'is' 'a') #('rose' 'is' 'a' 'rose') #('is' 'a' 'rose' 'is') #('a' 'rose' 'is' 'a') #('rose' 'is' 'a' 'rose'))

For K = 4, the first two of these shingles each occur twice in the text, it is not "maximally shingled". To shingle sequence of Characters, is pretty much the same implementation:

| k s |
k := 2.
s := 'abcdabd'.
(1 to: s size - k + 1) 
 collect: [ : i | s copyFrom: i to: i + k - 1 ]
 as: OrderedSet.

And in this case we have:

k = 2 -> "an OrderedSet('ab' 'bc' 'cd' 'da' 'bd')"
k = 3 -> "an OrderedSet('abc' 'bcd' 'cda' 'dab' 'abd')" 
k = 4 -> "an OrderedSet('abcd' 'bcda' 'cdab' 'dabd')"
You can find this implemented in the StringExtensions package. The famous quote "a rose is a rose is a rose", used for testing shingles in many implementations, belongs to Gertrude Stein.

0 comentarios:

Publicar un comentario