Class: RIR::Document
In: lib/rir/document.rb
Parent: Object
A Document is a bag of words and is constructed from a string.
doc_content [R]
words [R]
Returns a Hash containing the words and their associated counts in the current Document.
count_words #=> { "guitar"=>1, "bass"=>3, "album"=>20, ... }
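The counting described above can be sketched as follows. This is a minimal illustration, not the RIR source: `count_words_sketch` is a hypothetical helper, and it assumes the document's words are already available as an Array of Strings.

```ruby
# Hypothetical helper (not part of RIR): builds a word => count Hash
# from an Array of words, in the spirit of count_words.
def count_words_sketch(words)
  counts = Hash.new(0)                 # missing keys start at 0
  words.each { |w| counts[w] += 1 }    # one increment per occurrence
  counts
end

count_words_sketch(%w[bass guitar bass album bass])
#=> {"bass"=>3, "guitar"=>1, "album"=>1}
```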
Computes the entropy of a given string s inside the document.
If s consists of several words (i.e. tokens separated by whitespace), it is treated as an n-gram.
entropy("guitar") #=> 0.00432114812727959
entropy("dillinger escape plan") #=> 0.265862076325102
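This page does not state the formula behind entropy. One plausible reading, given here only as an assumption, is the Shannon entropy term -p * log2(p), where p is the relative frequency of s among the document's n-grams of the same length. The sketch below illustrates that reading; `entropy_sketch` is a hypothetical helper and is not expected to reproduce the values shown above.

```ruby
# Hypothetical helper (not part of RIR). ASSUMPTION: entropy of s is the
# Shannon term -p * log2(p), with p = occurrences of s / number of n-grams
# of the same length in the document.
def entropy_sketch(words, s)
  tokens = s.split                      # split on whitespace
  return 0.0 if tokens.empty?

  grams = words.each_cons(tokens.size).map { |g| g.join(" ") }
  return 0.0 if grams.empty?

  prob = grams.count(tokens.join(" ")).to_f / grams.size
  return 0.0 if prob.zero?              # s never occurs in the document

  -prob * Math.log2(prob)
end

entropy_sketch(%w[a b a b], "a")        #=> 0.5
```

A multi-word argument such as `"a b"` is handled the same way, with the document first cut into n-grams of matching length.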
Returns an Array containing the n-grams (words) from the current Document.
ngrams(2) #=> ["the free", "free encyclopedia", "encyclopedia var", "var skin", ...]
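The n-gram extraction can be illustrated with `Enumerable#each_cons` from the Ruby standard library. This is a sketch under the assumption that the words are held in an Array in document order; `ngrams_sketch` is a hypothetical name, not the RIR method.

```ruby
# Hypothetical helper (not part of RIR): slides a window of n consecutive
# words over the Array and joins each window with a single space.
def ngrams_sketch(words, n)
  words.each_cons(n).map { |gram| gram.join(" ") }
end

ngrams_sketch(%w[the free encyclopedia var skin], 2)
#=> ["the free", "free encyclopedia", "encyclopedia var", "var skin"]
```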
Any non-word characters are removed from the words (see perldoc.perl.org/perlre.html and the \W special escape).
Protected function, only meant to be called at initialization.
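The effect of the \W escape can be sketched with `String#gsub`. The helper name `strip_non_word_sketch` is hypothetical, and dropping words that end up empty is an assumption made for this example rather than documented RIR behaviour.

```ruby
# Hypothetical helper (not part of RIR): removes every character matched
# by \W (anything other than a letter, digit or underscore) from each word.
def strip_non_word_sketch(words)
  words
    .map { |w| w.gsub(/\W/, "") }   # delete non-word characters
    .reject(&:empty?)               # ASSUMPTION: discard words left empty
end

strip_non_word_sketch(["guitar,", "(bass)", "--", "rock'n'roll"])
#=> ["guitar", "bass", "rocknroll"]
```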