Thursday, October 25, 2018

Pharo Script of the Day: Text analysis using tf-idf

Today's snippet takes a natural language text as input (a.k.a. the corpus), where each line is treated as a separate document, and outputs a term-document matrix with word mappings and frequencies for the given documents. The weighting involved is known as tf-idf (term frequency-inverse document frequency), a term-weighting scheme widely used in information retrieval that gives the relevance, or weight, of each term in a document.

Why isn't this just simple word counting?

If relevance grew in proportion to raw word count alone, words like "the" would come out as the most relevant terms in the whole set of documents (or even in a single document), simply because they are very common. So you need to lower the weight of these common words, or raise the weight of "rare" words, to get a useful relevance score. This is where IDF (inverse document frequency) comes into play. With IDF you count documents rather than occurrences: a term that appears in many documents gets a larger divisor, and therefore a lower relevance.
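
As a reference point, one common textbook formulation of this weighting is shown below. It is an assumption for illustration: the post does not say which tf-idf variant Hapax actually applies, and several variants exist (different log bases, smoothing, normalized term frequencies).

```latex
\[
\operatorname{tf-idf}(t, d) \;=\; \operatorname{tf}(t, d) \cdot \log\frac{N}{\operatorname{df}(t)}
\]
```

Here tf(t, d) is the number of times term t occurs in document d, N is the total number of documents, and df(t) is the number of documents that contain t. A term that occurs in every document has df(t) = N, so its weight is log 1 = 0 no matter how often it appears.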

Finally, stop words are removed and stemming is performed, so that words sharing the same root are reduced to a single term.

First of all, you can install Moose-Algos (along with some required Hapax classes) in a clean Pharo image by evaluating:


Metacello new
  configuration: 'MooseAlgos';
  smalltalkhubUser: 'Moose' project: 'MooseAlgos';
  version: #development;
  load.
Gofer it
  smalltalkhubUser: 'GustavoSantos' project: 'Hapax';
  package: 'Hapax';
  package: 'Moose-Hapax-VectorSpace';
  load.

Then you can execute the script:

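| corpus tdm documents |
corpus := MalCorpus new.
"Each line of this string is treated as one document."
documents := 'Julie loves me more than Linda loves me
Jane likes me more than Julie loves me'.
"Tokenize each line and add it to the corpus, named by its line number."
documents lines doWithIndex: [ :doc :index |
  corpus
   addDocument: index asString
   with: (MalTerms new
      addString: doc
      using: MalCamelcaseScanner;
      yourself) ].
"Drop stop words, then reduce the remaining words to their stems."
corpus removeStopwords.
corpus stemAll.
"Build the term-document matrix; inspect the result to explore it."
tdm := HapTermDocumentMatrix on: corpus.
tdm.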
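
To make the weights more concrete, here is a minimal sketch that computes a plain tf-idf by hand for the same two documents, using only core Pharo collections. It is an illustration under assumptions, not what Hapax does internally: it uses raw counts and a natural-log idf, skips stop-word removal and stemming, and the names bags, idf and tfidf are made up for this example.

```smalltalk
| documents bags docCount idf tfidf |
documents := 'Julie loves me more than Linda loves me
Jane likes me more than Julie loves me'.
"One Bag of lowercase words per line (document)."
bags := documents lines collect: [ :line |
	(line substrings collect: [ :word | word asLowercase ]) asBag ].
docCount := bags size.
"Assumed variant: idf(t) = ln(N / number of documents containing t)."
idf := [ :term |
	(docCount / (bags count: [ :bag | bag includes: term ])) ln ].
"tf-idf(t, d) = raw count of t in d, times idf(t)."
tfidf := [ :term :bag | (bag occurrencesOf: term) * (idf value: term) ].
tfidf value: 'loves' value: bags first.  "2 * ln(2/2) = 0.0"
tfidf value: 'linda' value: bags first.  "1 * ln(2/1), about 0.693"
```

"loves" occurs twice in the first document but appears in both documents, so its weight drops to zero, while "linda" occurs only once but in a single document, so it scores higher. Note that the idf block divides by zero for a term that appears in no document, so only ask about words that occur in the corpus.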
