Class RIR::Document
In: lib/rir/document.rb
Parent: Object

A Document is a bag of words and is constructed from a string.

Methods

count_words   entropy   format_words   new   ngrams   tf  

Attributes

doc_content  [R] 
words  [R] 

Public Class methods

Public Instance methods

Returns a Hash containing the words and their associated counts in the current Document.

  count_words #=> { "guitar"=>1, "bass"=>3, "album"=>20, ... }
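A minimal sketch of how such a tally could be built from an array of words. The helper name sketch_count_words is hypothetical and this is not the RIR source; it only illustrates the shape of the returned Hash.

```ruby
# Hypothetical helper (not part of RIR): tally how often each word occurs.
def sketch_count_words(words)
  counts = Hash.new(0)               # missing keys start at 0
  words.each { |w| counts[w] += 1 }  # one increment per occurrence
  counts
end

sample = %w[bass guitar bass album bass]
p sketch_count_words(sample)
```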

Computes the entropy of a given string s inside the document.

If the string contains several words (i.e. tokens separated by whitespace), it is treated as an n-gram.

  entropy("guitar") #=> 0.00432114812727959
  entropy("dillinger escape plan") #=> 0.265862076325102
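The single-word values documented on this page are consistent with -tf(w) * log2(tf(w)): with tf("guitar") at about 0.00038, that expression gives about 0.00432. The sketch below assumes that form and also assumes that the per-word terms are summed for an n-gram, which this page does not state. The names sketch_tf and sketch_entropy are hypothetical, and this is not the RIR source.

```ruby
# Hypothetical helpers (not part of RIR).

# Relative frequency of term within words (assumed definition of tf).
def sketch_tf(words, term)
  words.count(term).to_f / words.size
end

# Assumed form: sum of -p * log2(p) over the whitespace-separated tokens of s.
def sketch_entropy(words, s)
  s.split.sum(0.0) do |w|
    p_w = sketch_tf(words, w)
    # A word absent from the document contributes nothing (avoids log2(0)).
    p_w.zero? ? 0.0 : -p_w * Math.log2(p_w)
  end
end

words = %w[a b a b]
puts sketch_entropy(words, "a")
puts sketch_entropy(words, "a b")
```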

Returns an Array containing the n-grams (words) from the current Document.

  ngrams(2) #=> ["the free", "free encyclopedia", "encyclopedia var", "var skin", ...]
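One way to produce such a list is a sliding window over the word array, joining each window with a space. The helper name sketch_ngrams is hypothetical and this is a sketch, not the RIR source.

```ruby
# Hypothetical helper (not part of RIR): sliding window of n consecutive words.
def sketch_ngrams(words, n)
  # each_cons(n) yields every run of n adjacent elements, in order.
  words.each_cons(n).map { |gram| gram.join(" ") }
end

words = %w[the free encyclopedia var skin]
p sketch_ngrams(words, 2)
```

With fewer than n words the window never fills, so the result is an empty Array.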

Computes the term frequency of a given word s.

  tf("guitar") #=> 0.000380372765310004
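The value above equals 1/2629, which fits a word that occurs once (as "guitar" does in the count_words example) in a document of 2629 words. The sketch below assumes that definition, occurrences divided by total word count. The helper name sketch_tf is hypothetical and this is not the RIR source.

```ruby
# Hypothetical helper (not part of RIR): occurrences of term / total words.
def sketch_tf(words, term)
  return 0.0 if words.empty?           # avoid dividing by zero
  words.count(term).to_f / words.size
end

# One occurrence among 2629 words, mirroring the documented example.
words = ["guitar"] + Array.new(2628, "filler")
puts sketch_tf(words, "guitar")
```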

Protected Instance methods

Any non-word characters are removed from the words (see perldoc.perl.org/perlre.html and the \W special escape).

Protected function, only meant to be called at initialization.
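The cleanup described above (presumably format_words, the only listed method left undocumented) can be sketched with the \W character class, which matches anything other than letters, digits, and underscore. The helper name sketch_format_words is hypothetical, and dropping tokens that end up empty is an assumption, not documented behaviour.

```ruby
# Hypothetical helper (not part of RIR): strip non-word characters from tokens.
def sketch_format_words(raw_words)
  raw_words
    .map { |w| w.gsub(/\W/, "") }  # remove everything matched by \W
    .reject(&:empty?)              # assumption: discard tokens left empty
end

p sketch_format_words(["guitar,", "(bass)", "--", "album_1"])
```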
