Deveaud Romain / mirimiri

Commit b0ffa2ad49638e2a223fff528de1a4ad336acb72

Authored by Romain Deveaud 2013-04-18 14:14:25 +0200

1 parent b3c0213975

Exists in master

finally committing some recent changes

Showing 7 changed files with 146 additions and 16 deletions

README.markdown
lib/mirimiri/document.rb
lib/mirimiri/index.rb
lib/mirimiri/query.rb
lib/mirimiri/result.rb
lib/mirimiri/string.rb
main.rb

README.markdown

 # mirimiri
-Copyright (C) 2010-2011 Romain Deveaud <romain.deveaud@gmail.com>
+The various tools of this project were developed for research purposes during
+my Ph.D. and heavily rely on the use of Indri (<http://lemurproject.org/indri.php>).
+Setting up Ruby is not as painful as it used to be thanks to RVM (<https://rvm.io/>);
+visit at least these two websites before trying to use `mirimiri`.
+Copyright (C) 2010-2013 Romain Deveaud <romain.deveaud@gmail.com>
 > The Fijian monkey-faced bat (Mirimiri acrodonta), also called the Fiji
 > Flying Fox, is an Old World fruit bat endemic to Fiji. It was discovered
 > in the hills of Taveuni by Bill Beckon in 1977 and is Fiji's only endemic
 > mammal. It is listed as a critically endangered species due to habitat
 > loss. It has recently been transferred from Pteralopex to its own
 > monotypic genus Mirimiri.
 >
 > #####Wikipedia
 License
 =======
 This program is free software: you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
 the Free Software Foundation, either version 3 of the License, or
 (at your option) any later version.
 This program is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 GNU General Public License for more details.
 You should have received a copy of the GNU General Public License
 along with this program.  If not, see <http://www.gnu.org/licenses/>.

lib/mirimiri/document.rb

 #!/usr/bin/env ruby
 #--
 # This file is a part of the mirimiri library
 #
 # Copyright (C) 2010-2011 Romain Deveaud <romain.deveaud@gmail.com>
 #
 # This program is free software: you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
 # the Free Software Foundation, either version 3 of the License, or
 # (at your option) any later version.
 #
 # This program is distributed in the hope that it will be useful,
 # but WITHOUT ANY WARRANTY; without even the implied warranty of
 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 # GNU General Public License for more details.
 #
 # You should have received a copy of the GNU General Public License
 # along with this program.  If not, see <http://www.gnu.org/licenses/>.
 #++
 # General module
 module Mirimiri
   # A Document is a bag of words and is constructed from a string.
   class Document
-    attr_reader :words, :doc_content, :count_words
+    attr_reader :words, :doc_content, :xcount
     # Any non-word characters are removed from the words (see http://perldoc.perl.org/perlre.html
     # and the \\W special escape).
     #
     # Protected function, only meant to be called at initialization.
     def format_words
       wo = []
       @doc_content.split.each do |w|
         w.split(/\W/).each do |sw|
-          wo.push(sw.downcase) if sw =~ /[a-zA-Z]/
+          wo.push(sw.downcase) if sw =~ /[[:alpha:]]/
         end
       end
       wo
     end
     # Returns an Array containing the +n+-grams (words) from the current Document.
     #
     #   ngrams(2) #=> ["the free", "free encyclopedia", "encyclopedia var", "var skin", ...]
     def ngrams(n)
       window       = []
       ngrams_array = []
       if @ngrams[n].nil?
         @words.each do |w|
           window.push(w)
           if window.size == n
             ngrams_array.push window.join(" ")
             window.delete_at(0)
           end
         end
         @ngrams[n] = ngrams_array
       end
       @ngrams[n]
     end
     # Returns a Hash containing the words and their associated counts in the current Document.
     #
     #   count_words #=> { "guitar"=>1, "bass"=>3, "album"=>20, ... }
     def count_words
       counts = Hash.new { |h,k| h[k] = 0 }
       @words.each { |w| counts[w] += 1 }
       counts
     end
     # Old entropy function.
     # TODO: remove.
     def entropy0(s)
       en = 0.0
       s.split.each do |w|
-        p_wi = @count_words[w].to_f/@words.count.to_f
+        p_wi = @xcount[w].to_f/@words.count.to_f
         en += p_wi*Math.log2(p_wi)
       end
       en *= -1
       en
     end
     # Computes the entropy of a given string +s+ inside the document.
     #
     # If the string parameter is composed of many words (i.e. tokens separated
     # by whitespace(s)), it is considered as an ngram.
     #
     #   entropy("guitar") #=> 0.014348983965324762
     #   entropy("dillinger escape plan") #=> 0.054976093116768154
     def entropy(s)
       en = 0.0
       size = s.split.size
       if size == 1
-        p_wi = @count_words[s].to_f/@words.count.to_f
+        p_wi = @xcount[s].to_f/@words.count.to_f
         en += p_wi*Math.log(p_wi)
       elsif size > 1
         ng_size = ngrams(size)
         p_wi = ng_size.count(s).to_f/ng_size.count.to_f
         en += p_wi*Math.log(p_wi)
       end
       en *= -1
       en
     end
     # Computes the term frequency of a given *word* +s+.
     #
     #   tf("guitar") #=> 0.000380372765310004
     def tf(s)
-      @count_words[s].to_f/@words.size.to_f
+      @xcount[s].to_f/@words.size.to_f
     end
+    # Computes the KL divergence between the language model of +self+
+    # and the language model of +doc+.
+    #
+    # KL is not symmetric, see http://en.wikipedia.org/wiki/Kullback-Leibler_divergence
+    # for more information.
+    #
+    #   d1.kl(d2) #=> 0.2971808085725761
+    def kl(doc)
+      raise ArgumentError, 'Argument is not a Mirimiri::Document' unless doc.is_a? Mirimiri::Document
+      vocab = self.words & doc.words
+      vocab.inject(0.0) { |res,w| res + self.tf(w)*Math.log(self.tf(w)/doc.tf(w)) }
+    end
     def initialize(content="")
       @doc_content = content
       @words = format_words
-      @count_words = count_words
+      @xcount = count_words
       @ngrams = {}
     end
     protected :format_words, :count_words
   end
   # A WebDocument is a Document with a +url+.
   class WebDocument < Document
     attr_reader :url
     # Returns the HTML text from the page of a given +url+.
     def self.get_content(url)
       require 'net/http'
       Net::HTTP.get(URI.parse(url))
     end
     # WebDocument constructor, the content of the Document is the HTML page
     # without the tags.
     def initialize(url,only_tags=nil)
       require 'sanitize'
       @url = url
       content = only_tags.nil? ? WebDocument.get_content(url) : WebDocument.get_content(url).extract_xmltags_values(only_tags).join("")
-      super Sanitize.clean(content.unaccent.toutf8.force_encoding("UTF-8"), :remove_contents => ['script'])
+      super Sanitize.clean(content, :remove_contents => ['script','style'])
     end
   end
   # A WikipediaPage is a WebDocument.
   class WikipediaPage < WebDocument
     require 'rexml/document'
     require 'net/http'
     require 'kconv'
     def self.search_wikipedia_titles(name)
-      raise ArgumentError, "Bad encoding", name unless name.isutf8
+#      raise ArgumentError, "Bad encoding", name unless name.isutf8
-      res = REXML::Document.new(Net::HTTP.get( URI.parse "http://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=#{URI.escape name}&format=xml" ).unaccent.toutf8).elements['api/query/search']
+      res = REXML::Document.new(Net::HTTP.get( URI.parse "http://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=#{URI.escape name}&srlimit=20&format=xml" ).force_encoding("ISO-8859-1").encode("UTF-8")).elements['api/query/search']
      res.collect { |e| e.attributes['title'] } unless res.nil?
     end
     def self.get_url(name)
       raise ArgumentError, "Bad encoding", name unless name.isutf8
       atts = REXML::Document.new(Net::HTTP.get( URI.parse "http://en.wikipedia.org/w/api.php?action=query&titles=#{URI.escape name}&inprop=url&prop=info&format=xml" ).unaccent.toutf8).elements['api/query/pages/page'].attributes
       atts['fullurl'] if atts['missing'].nil?
     end
     def self.search_homepage(name)
       title = WikipediaPage.search_wikipedia_titles name
       WikipediaPage.get_url(title[0]) unless title.nil? || title.empty?
     end
     def self.extract_anchors(url)
       self.get_content(url).extract_xmltags_values('p').join(' ').scan(/<a href="(.+?)" title=.*?>(.+?)<\/a>/).delete_if { |a| a[0] =~ /^\/wiki\/.*$/.negated }
     end
   end
   class FreebasePage < WebDocument
     require 'net/http'
     require 'kconv'
     require 'json'
     def self.search_article_ids query,limit
       raise ArgumentError, "Bad encoding", query unless query.isutf8
       JSON.parse(Net::HTTP.get( URI.parse "http://api.freebase.com/api/service/search?query=#{query.gsub(" ","+")}&limit=#{limit}" ))['result'].collect { |a| a['article']['id'] unless a['article'].nil? }.compact
     end
     def self.get_url id
       "http://api.freebase.com/api/trans/raw#{id}"
     end
   end
 end
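
Note on the new `kl` method: it sums only over the words the two documents share (`self.words & doc.words`) and applies no smoothing. The sketch below reproduces that computation on plain token arrays so the behaviour can be checked without fetching any page. The helper names `term_frequencies` and `shared_vocab_kl` and the toy token lists are assumptions made for illustration; they are not part of mirimiri.

```ruby
# Hypothetical helper: relative frequency of each token in a token list.
def term_frequencies(tokens)
  counts = Hash.new(0)
  tokens.each { |t| counts[t] += 1 }
  total = tokens.size.to_f
  counts.each_with_object({}) { |(t, c), freqs| freqs[t] = c / total }
end

# Hypothetical helper: KL-style sum restricted to the shared vocabulary,
# mirroring the structure of the `kl` method in the diff.
def shared_vocab_kl(tokens_p, tokens_q)
  pf = term_frequencies(tokens_p)
  qf = term_frequencies(tokens_q)
  (pf.keys & qf.keys).inject(0.0) { |sum, w| sum + pf[w] * Math.log(pf[w] / qf[w]) }
end

# Toy token lists (made up for this example).
d1 = %w[guitar bass guitar drums]
d2 = %w[guitar guitar guitar bass]

forward  = shared_vocab_kl(d1, d2)
backward = shared_vocab_kl(d2, d1)
puts forward, backward
```

On this toy input the two directions differ, which illustrates the asymmetry mentioned in the method comment. The forward value is also negative, which a true KL divergence over full, smoothed distributions would never be; this is a side effect of dropping the words that appear in only one document.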

lib/mirimiri/index.rb

lib/mirimiri/query.rb

 #!/usr/bin/env ruby
 #--
 # This file is a part of the mirimiri library
 #
 # Copyright (C) 2010-2011 Romain Deveaud <romain.deveaud@gmail.com>
 #
 # This program is free software: you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
 # the Free Software Foundation, either version 3 of the License, or
 # (at your option) any later version.
 #
 # This program is distributed in the hope that it will be useful,
 # but WITHOUT ANY WARRANTY; without even the implied warranty of
 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 # GNU General Public License for more details.
 #
 # You should have received a copy of the GNU General Public License
 # along with this program.  If not, see <http://www.gnu.org/licenses/>.
 #++
 class Query
+  attr_accessor :query
 end
 module Indri
   class Parameters
     attr_accessor :index_path, :memory, :count, :offset, :run_id, :print_query, :print_docs, :rule, :baseline
-    def initialize(corpus,count="1000",mem="1g",threads="1",offset="1",run_id="default",print_query=false,print_docs=false)
+    def initialize(corpus,count="1000",mem="1g",threads="1",offset="1",run_id="default",print_passages=false,print_query=false,print_docs=false)
       @index_path  = corpus
       @memory      = mem
       @count       = count
       @threads     = threads
       @offset      = offset
       @run_id      = run_id
       @print_query = print_query ? "true" : "false"
       @print_docs  = print_docs  ? "true" : "false"
+      @print_passages  = print_passages  ? "true" : "false"
+      @indexes     = [corpus]
     end
     def to_s
       h = "<memory>#{@memory}</memory>\n"
-      h += "<index>#{@index_path}</index>\n"
+      @indexes.each do |i|
+        h += "<index>#{i}</index>\n"
+      end
       h += "<count>#{@count}</count>\n"
       h += "<threads>#{@threads}</threads>\n"
       unless @baseline.nil?
         h += "<baseline>#{@baseline}</baseline>\n"
       else
         h += "<rule>#{@rule}</rule>\n"
       end
       h += "<trecFormat>true</trecFormat>\n"
       h += "<queryOffset>#{@offset}</queryOffset>\n"
       h += "<runID>#{@run_id}</runID>\n"
+      h += "<printPassages>#{@print_passages}</printPassages>\n"
       h += "<printQuery>#{@print_query}</printQuery>\n"
       h += "<printDocuments>#{@print_docs}</printDocuments>\n"
       h
     end
+    def add_index path
+      @indexes << path
+    end
   end
   class IndriQueryOld < Query
     attr_accessor :id, :query, :rule
     def initialize(id,query)
       @id     = id
       @query  = query
     end
     def to_s
       h = "<query>\n"
       h += "<number>#{@id}</number>\n"
       h += "<text>#{@query}</text>\n"
       h += "</query>\n"
       h
     end
     def exec params
       `IndriRunQuery -query='#{@query}' -index=#{params.index_path} -count=#{params.count} -rule=method:dirichlet,mu:2500 -trecFormat`
     end
   end
   class IndriQuery < Query
     attr_accessor :query, :count, :sm_method, :sm_param, :sm_value, :args
     def initialize atts={},args=nil
       raise ArgumentError, 'Argument 1 must be a Hash' unless atts.is_a? Hash
       atts.each do |k,v|
         instance_variable_set("@#{k}", v) unless v.nil?
       end
       raise ArgumentError, 'Argument 2 must be a String' unless (args.is_a?(String) || args.nil?)
       @args = args
+    end
+    def clarity index_path,terms=10,documents=5
+      `clarity -index=#{index_path} -documents=#{documents} -terms=#{terms} -smoothing=\"method:#{@sm_method},#{@sm_param}:#{@sm_value}\" -query=\"#{query}\"`.split("=").last.strip
     end
   end
   class IndriQueries
     attr_accessor :params, :queries
     def initialize params
 #      @queries = queries
       @params = params
       @queries = {}
       # Here we set the default retrieval model as Language Modeling
       # with a Dirichlet smoothing at 2500.
       # TODO: maybe a Rule class...
       @params.rule  = 'method:dirichlet,mu:2500' if @params.rule.nil?
     end
     def push id,query
       @queries[id.to_i] = query
     end
     def to_s
       h = "<parameters>\n"
       h += @params.to_s
       h += @queries.sort { |a,b| a[0] <=> b[0] }.collect do |q|
             "<query>\n" +
             "<number>#{q[0]}</number>\n" +
             "<text>#{q[1]}</text>\n" +
             "</query>\n"
       end.join ""
 #      h += @queries.collect { |q| q.to_s }.join ""
       h += "</parameters>"
       h
     end
   end
 end
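
Note on `Indri::Parameters`: the commit replaces the single `<index>` line with one line per entry of `@indexes`, and `add_index` appends to that list. A minimal sketch of that behaviour follows, using a stripped-down stand-in class. `MiniParameters` and the index paths are hypothetical and only cover the memory and index fields.

```ruby
# Hypothetical, stripped-down stand-in for Indri::Parameters.
class MiniParameters
  def initialize(corpus, memory = "1g")
    @memory  = memory
    @indexes = [corpus] # the first index is always the constructor argument
  end

  # Registers an additional index, as add_index does in the diff.
  def add_index(path)
    @indexes << path
  end

  # Emits one <index> element per registered index.
  def to_s
    lines = ["<memory>#{@memory}</memory>"]
    @indexes.each { |i| lines << "<index>#{i}</index>" }
    lines.join("\n") + "\n"
  end
end

params = MiniParameters.new("/path/to/index_a") # placeholder paths
params.add_index("/path/to/index_b")
puts params
```

One thing worth checking in callers: `print_passages` is inserted before `print_query` in the positional argument list of `initialize`, so existing calls that pass `print_query` or `print_docs` positionally would now bind those values to different parameters.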

lib/mirimiri/result.rb

File was created
+#!/usr/bin/env ruby
+
+#--
+# This file is a part of the mirimiri library
+#
+# Copyright (C) 2010-2012 Romain Deveaud <romain.deveaud@gmail.com>
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+#++
+
+module Mirimiri
+
+  # This class represents one line of a TREC-formatted retrieval
+  # result. Typical output of Indri or Terrier.
+  class TrecResult
+    attr_accessor :topic, :doc, :rank, :score, :run
+
+    def initialize arg
+      t = arg.split
+      @topic = t[0]
+      @doc = t[2]
+      @rank = t[3]
+      @score = t[4]
+      @run = t[5]
+    end
+  end
+
+  # This class represents the output of trec_eval, when
+  # -q option is given.
+  class TrecEval
+    attr_accessor :metric, :run, :queries
+
+    def initialize arg
+      @queries = {}
+
+      arg.each_line do |line|
+        t = line.split
+        @metric = t[0] if @metric.nil?
+        @queries[t[1]] = t[2].to_f if t[1].is_integer?
+      end
+    end
+  end
+
+  # An array of TrecResult, or a run.
+  class TrecResults < Array
+
+    def initialize args
+      super args.collect { |res| TrecResult.new res }
+    end
+  end
+end
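
Note on `TrecResult`: it splits a result line on whitespace and reads fields 0, 2, 3, 4 and 5, skipping the literal `Q0` in position 1. The sketch below does the same with a `Struct`. Unlike the class in the diff, which keeps every field as a string, it converts the rank and score to numbers. `TrecLine`, `parse_trec_line` and the sample line are made up for illustration.

```ruby
# Hypothetical value object for one TREC result line.
TrecLine = Struct.new(:topic, :doc, :rank, :score, :run)

# Splits "topic Q0 docno rank score run" on whitespace.
# The second field (the literal "Q0") is ignored, as in TrecResult.
def parse_trec_line(line)
  topic, _q0, doc, rank, score, run = line.split
  TrecLine.new(topic, doc, rank.to_i, score.to_f, run)
end

# Sample line invented for this example.
result = parse_trec_line("101 Q0 doc-0042 1 -5.4321 default")
puts result.doc
puts result.score
```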

lib/mirimiri/string.rb

 #!/usr/bin/env ruby
 #--
 # This file is a part of the mirimiri library
 #
 # Copyright (C) 2010-2011 Romain Deveaud <romain.deveaud@gmail.com>
 #
 # This program is free software: you can redistribute it and/or modify
 # it under the terms of the GNU General Public License as published by
 # the Free Software Foundation, either version 3 of the License, or
 # (at your option) any later version.
 #
 # This program is distributed in the hope that it will be useful,
 # but WITHOUT ANY WARRANTY; without even the implied warranty of
 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 # GNU General Public License for more details.
 #
 # You should have received a copy of the GNU General Public License
 # along with this program.  If not, see <http://www.gnu.org/licenses/>.
 #++
 module Mirimiri
   # These are the default stopwords provided by Lemur.
   Stoplist = [
 "a","about","above","according","across","after","afterwards","again","against",
 "albeit","all","almost","alone","along","already","also","although","always","am",
 "among","amongst","an","and","another","any","anybody","anyhow","anyone","anything",
 "anyway","anywhere","apart","are","around","as","at","av","be","became","because",
 "become","becomes","becoming","been","before","beforehand","behind","being","below",
 "beside","besides","between","beyond","both","but","by","can","cannot","canst",
 "certain","cf","choose","contrariwise","cos","could","cu","day","do","does","doesn't",
 "doing","dost","doth","double","down","dual","during","each","either","else",
 "elsewhere","enough","et","etc","even","ever","every","everybody","everyone",
 "everything","everywhere","except","excepted","excepting","exception","exclude",
 "excluding","exclusive","far","farther","farthest","few","ff","first","for",
 "formerly","forth","forward","from","front","further","furthermore","furthest","get",
 "go","had","halves","hardly","has","hast","hath","have","he","hence","henceforth",
 "her","here","hereabouts","hereafter","hereby","herein","hereto","hereupon","hers",
 "herself","him","himself","hindmost","his","hither","hitherto","how","however",
 "howsoever","i","ie","if","in","inasmuch","inc","include","included","including",
 "indeed","indoors","inside","insomuch","instead","into","inward","inwards","is",
 "it","its","itself","just","kind","kg","km","last","latter","latterly","less","lest",
 "let","like","little","ltd","many","may","maybe","me","meantime","meanwhile","might",
 "moreover","most","mostly","more","mr","mrs","ms","much","must","my","myself",
 "namely","need","neither","never","nevertheless","next","no","nobody","none",
 "nonetheless","noone","nope","nor","not","nothing","notwithstanding","now","nowadays",
 "nowhere","of","off","often","ok","on","once","one","only","onto","or","other",
 "others","otherwise","ought","our","ours","ourselves","out","outside","over","own",
 "per","perhaps","plenty","provide","quite","rather","really","round","said","sake",
 "same","sang","save","saw","see","seeing","seem","seemed","seeming","seems","seen",
 "seldom","selves","sent","several","shalt","she","should","shown","sideways","since",
 "slept","slew","slung","slunk","smote","so","some","somebody","somehow","someone",
 "something","sometime","sometimes","somewhat","somewhere","spake","spat","spoke",
 "spoken","sprang","sprung","stave","staves","still","such","supposing","than","that",
 "the","thee","their","them","themselves","then","thence","thenceforth","there",
 "thereabout","thereabouts","thereafter","thereby","therefore","therein","thereof",
 "thereon","thereto","thereupon","these","they","this","those","thou","though",
 "thrice","through","throughout","thru","thus","thy","thyself","till","to","together",
 "too","toward","towards","ugh","unable","under","underneath","unless","unlike",
 "until","up","upon","upward","upwards","us","use","used","using","very","via","vs",
 "want","was","we","week","well","were","what","whatever","whatsoever","when","whence",
 "whenever","whensoever","where","whereabouts","whereafter","whereas","whereat",
 "whereby","wherefore","wherefrom","wherein","whereinto","whereof","whereon",
 "wheresoever","whereto","whereunto","whereupon","wherever","wherewith","whether",
 "whew","which","whichever","whichsoever","while","whilst","whither","who","whoa",
 "whoever","whole","whom","whomever","whomsoever","whose","whosoever","why","will",
 "wilt","with","within","without","worse","worst","would","wow","ye","yet","year",
 "yippee","you","your","yours","yourself","yourselves",
-  "edit", "new", "page", "article", "http", "www", "com", "org", "wikipedia", "en","html"
+  "edit", "new", "page", "article", "http", "www", "com", "org", "wikipedia", "en","html",
+  "amp","nbsp","quot"
   ]
   Transmap = {
   "\xC3\x80" => "A", "\xC3\x81" => "A", "\xC3\x82" => "A", "\xC3\x83" => "A",
   "\xC3\x84" => "A", "\xC3\x85" => "A", "\xC3\x86" => "AE","\xC3\x87" => "C",
   "\xC3\x88" => "E", "\xC3\x89" => "E", "\xC3\x8A" => "E", "\xC3\x8B" => "E",
   "\xC3\x8C" => "I", "\xC3\x8D" => "I", "\xC3\x8E" => "I", "\xC3\x8F" => "I",
   "\xC3\x90" => "D", "\xC3\x91" => "N", "\xC3\x92" => "O", "\xC3\x93" => "O",
   "\xC3\x94" => "O", "\xC3\x95" => "O", "\xC3\x96" => "O", "\xC3\x98" => "O",
   "\xC3\x99" => "U", "\xC3\x9A" => "U", "\xC3\x9B" => "U", "\xC3\x9C" => "U",
   "\xC3\x9D" => "Y", "\xC3\x9E" => "P", "\xC3\x9F" => "ss",
   "\xC3\xA0" => "a", "\xC3\xA1" => "a", "\xC3\xA2" => "a", "\xC3\xA3" => "a",
   "\xC3\xA4" => "a", "\xC3\xA5" => "a", "\xC3\xA6" => "ae","\xC3\xA7" => "c",
   "\xC3\xA8" => "e", "\xC3\xA9" => "e", "\xC3\xAA" => "e", "\xC3\xAB" => "e",
   "\xC3\xAC" => "i", "\xC3\xAD" => "i", "\xC3\xAE" => "i", "\xC3\xAF" => "i",
   "\xC3\xB0" => "o", "\xC3\xB1" => "n", "\xC3\xB2" => "o", "\xC3\xB3" => "o",
   "\xC3\xB4" => "o", "\xC3\xB5" => "o", "\xC3\xB6" => "o", "\xC3\xB8" => "o",
   "\xC3\xB9" => "u", "\xC3\xBA" => "u", "\xC3\xBB" => "u", "\xC3\xBC" => "u",
   "\xC3\xBD" => "y", "\xC3\xBE" => "p", "\xC3\xBF" => "y",
   "\xC4\x80" => "A", "\xC4\x81" => "a", "\xC4\x82" => "A", "\xC4\x83" => "a",
   "\xC4\x84" => "A", "\xC4\x85" => "a", "\xC4\x86" => "C", "\xC4\x87" => "c",
   "\xC4\x88" => "C", "\xC4\x89" => "c", "\xC4\x8A" => "C", "\xC4\x8B" => "c",
   "\xC4\x8C" => "C", "\xC4\x8D" => "c", "\xC4\x8E" => "D", "\xC4\x8F" => "d",
   "\xC4\x90" => "D", "\xC4\x91" => "d", "\xC4\x92" => "E", "\xC4\x93" => "e",
   "\xC4\x94" => "E", "\xC4\x95" => "e", "\xC4\x96" => "E", "\xC4\x97" => "e",
   "\xC4\x98" => "E", "\xC4\x99" => "e", "\xC4\x9A" => "E", "\xC4\x9B" => "e",
   "\xC4\x9C" => "G", "\xC4\x9D" => "g", "\xC4\x9E" => "G", "\xC4\x9F" => "g",
   "\xC4\xA0" => "G", "\xC4\xA1" => "g", "\xC4\xA2" => "G", "\xC4\xA3" => "g",
   "\xC4\xA4" => "H", "\xC4\xA5" => "h", "\xC4\xA6" => "H", "\xC4\xA7" => "h",
   "\xC4\xA8" => "I", "\xC4\xA9" => "i", "\xC4\xAA" => "I", "\xC4\xAB" => "i",
   "\xC4\xAC" => "I", "\xC4\xAD" => "i", "\xC4\xAE" => "I", "\xC4\xAF" => "i",
   "\xC4\xB0" => "I", "\xC4\xB1" => "i", "\xC4\xB2" => "IJ","\xC4\xB3" => "ij",
   "\xC4\xB4" => "J", "\xC4\xB5" => "j", "\xC4\xB6" => "K", "\xC4\xB7" => "k",
   "\xC4\xB8" => "k", "\xC4\xB9" => "L", "\xC4\xBA" => "l", "\xC4\xBB" => "L",
   "\xC4\xBC" => "l", "\xC4\xBD" => "L", "\xC4\xBE" => "l", "\xC4\xBF" => "L",
   "\xC5\x80" => "l", "\xC5\x81" => "L", "\xC5\x82" => "l", "\xC5\x83" => "N",
   "\xC5\x84" => "n", "\xC5\x85" => "N", "\xC5\x86" => "n", "\xC5\x87" => "N",
   "\xC5\x88" => "n", "\xC5\x89" => "n", "\xC5\x8A" => "N", "\xC5\x8B" => "n",
   "\xC5\x8C" => "O", "\xC5\x8D" => "o", "\xC5\x8E" => "O", "\xC5\x8F" => "o",
   "\xC5\x90" => "O", "\xC5\x91" => "o", "\xC5\x92" => "CE","\xC5\x93" => "ce",
   "\xC5\x94" => "R", "\xC5\x95" => "r", "\xC5\x96" => "R", "\xC5\x97" => "r",
   "\xC5\x98" => "R", "\xC5\x99" => "r", "\xC5\x9A" => "S", "\xC5\x9B" => "s",
   "\xC5\x9C" => "S", "\xC5\x9D" => "s", "\xC5\x9E" => "S", "\xC5\x9F" => "s",
   "\xC5\xA0" => "S", "\xC5\xA1" => "s", "\xC5\xA2" => "T", "\xC5\xA3" => "t",
   "\xC5\xA4" => "T", "\xC5\xA5" => "t", "\xC5\xA6" => "T", "\xC5\xA7" => "t",
   "\xC5\xA8" => "U", "\xC5\xA9" => "u", "\xC5\xAA" => "U", "\xC5\xAB" => "u",
   "\xC5\xAC" => "U", "\xC5\xAD" => "u", "\xC5\xAE" => "U", "\xC5\xAF" => "u",
   "\xC5\xB0" => "U", "\xC5\xB1" => "u", "\xC5\xB2" => "U", "\xC5\xB3" => "u",
   "\xC5\xB4" => "W", "\xC5\xB5" => "w", "\xC5\xB6" => "Y", "\xC5\xB7" => "y",
   "\xC5\xB8" => "Y", "\xC5\xB9" => "Z", "\xC5\xBA" => "z", "\xC5\xBB" => "Z",
   "\xC5\xBC" => "z", "\xC5\xBD" => "Z", "\xC5\xBE" => "z", "\xC6\x8F" => "E",
   "\xC6\xA0" => "O", "\xC6\xA1" => "o", "\xC6\xAF" => "U", "\xC6\xB0" => "u",
   "\xC7\x8D" => "A", "\xC7\x8E" => "a", "\xC7\x8F" => "I",
   "\xC7\x90" => "i", "\xC7\x91" => "O", "\xC7\x92" => "o", "\xC7\x93" => "U",
   "\xC7\x94" => "u", "\xC7\x95" => "U", "\xC7\x96" => "u", "\xC7\x97" => "U",
   "\xC7\x98" => "u", "\xC7\x99" => "U", "\xC7\x9A" => "u", "\xC7\x9B" => "U",
   "\xC7\x9C" => "u",
   "\xC7\xBA" => "A", "\xC7\xBB" => "a", "\xC7\xBC" => "AE","\xC7\xBD" => "ae",
   "\xC7\xBE" => "O", "\xC7\xBF" => "o",
   "\xC9\x99" => "e",
   "\xC2\x82" => ",",        # High code comma
   "\xC2\x84" => ",,",       # High code double comma
   "\xC2\x85" => "...",      # Triple dot
   "\xC2\x88" => "^",        # High caret
   "\xC2\x91" => "\x27",     # Forward single quote
   "\xC2\x92" => "\x27",     # Reverse single quote
   "\xC2\x93" => "\x22",     # Forward double quote
   "\xC2\x94" => "\x22",     # Reverse double quote
   "\xC2\x96" => "-",        # High hyphen
   "\xC2\x97" => "--",       # Double hyphen
   "\xC2\xA6" => "|",        # Split vertical bar
   "\xC2\xAB" => "<<",       # Double less than
   "\xC2\xBB" => ">>",       # Double greater than
   "\xC2\xBC" => "1/4",      # one quarter
   "\xC2\xBD" => "1/2",      # one half
   "\xC2\xBE" => "3/4",      # three quarters
   "\xCA\xBF" => "\x27",     # c-single quote
   "\xCC\xA8" => "",         # modifier - under curve
   "\xCC\xB1" => "",         # modifier - under line
 #  /\W/ => ""
   }
 end
 # Extension of the standard String class with useful functions.
 class String
   include Mirimiri
   def unaccent
     # force_encoding is needed with ruby1.9
+#    Transmap.inject(self) { |str, (utf8, asc)| str.gsub(utf8, asc) }
     Transmap.inject(self.force_encoding("ASCII-8BIT")) { |str, (utf8, asc)| str.gsub(utf8, asc) }
   end
   # Returns +true+ if +self+ belongs to Mirimiri::Stoplist, +false+ otherwise.
   def is_stopword?
     self.split.all? { |e| Stoplist.include?(e.downcase) }
   end
-  def sequential_dependence_model t=0.85,o=0.10,u=0.05,field=nil
+  def is_integer?
+    !self.empty? && self =~ /\A\d+\Z/
+  end
+  def numeric?
+    Float(self) != nil rescue false
+  end
+  def sequential_dependence_model field=nil,t=0.85,o=0.10,u=0.05
     d = Mirimiri::Document.new self
     if field.nil?
       ematch = d.ngrams(2).collect { |ng| "#1(#{ng})" }
       pmatch = d.ngrams(2).collect { |ng| "#uw8(#{ng})" }
     else
       ematch = d.ngrams(2).collect { |ng| "#1(#{ng}).(#{field})" }
       pmatch = d.ngrams(2).collect { |ng| "#uw8(#{ng}).(#{field})" }
     end
     if ematch.empty?
       if field.nil?
         ematch = d.words.collect { |ng| "#1(#{ng})" }
         pmatch = d.words.collect { |ng| "#uw8(#{ng})" }
       else
         ematch = d.words.collect { |ng| "#1(#{ng}).(#{field})" }
         pmatch = d.words.collect { |ng| "#uw8(#{ng}).(#{field})" }
       end
     end
     "#weight ( #{t} #combine( #{d.words.join(" ")} ) #{o} #combine ( #{ematch.join(" ")} ) #{u} #combine ( #{pmatch.join(" ")} ) )"
   end
   # Do not use.
   # TODO: revamp. Find out why this function is here.
   def remove_special_characters
     self.split.collect { |w| w.gsub(/\W/,' ').split.collect { |w| w.gsub(/\W/,' ').strip.sub(/\A.\z/, '')}.join(' ').strip.sub(/\A.\z/, '')}.join(' ')
   end
   # Removes all XML-like tags from +self+.
   #
   #   s = "<html><body>test</body></html>"
   #   s.strip_xml_tags!
   #   s                                     #=> "test"
   def strip_xml_tags!
     replace strip_with_pattern /<\/?[^>]*>/
   end
   # Removes all XML-like tags from +self+.
   #
   #   s = "<html><body>test</body></html>"
   #   s.strip_xml_tags                      #=> "test"
   #   s                                     #=> "<html><body>test</body></html>"
   def strip_xml_tags
     dup.strip_xml_tags!
   end
   # Removes all Javascript sources from +self+.
   #
   #   s = "<script type='text/javascript'>
   #         var skin='vector',
   #         stylepath='http://bits.wikimedia.org/skins-1.5'
   #        </script>
   #
   #        test"
   #   s.strip_javascripts!
   #   s                                     #=> "test"
   def strip_javascripts!
     replace strip_with_pattern /<script type="text\/javascript">(.+?)<\/script>/m
   end
   # Removes all Javascript sources from +self+.
   #
   #   s = "<script type='text/javascript'>
   #         var skin='vector',
   #         stylepath='http://bits.wikimedia.org/skins-1.5'
   #        </script>
   #
   #        test"
   #   s.strip_javascripts                   #=> "test"
   def strip_javascripts
     dup.strip_javascripts!
   end
   def strip_stylesheets!
   # TODO: revamp. Unclear what this is for.
     replace strip_with_pattern /<style type="text\/css">(.+?)<\/style>/m
   end
   def strip_stylesheets
     dup.strip_stylesheets!
   end
   # Removes punctuation from +self+.
   #
   #   s = "hello, world. how are you?!"
   #   s.strip_punctuation!
   #   s                                 # => "hello world how are you"
   def strip_punctuation!
     replace strip_with_pattern /[^a-zA-Z0-9\-\s]/
   end
   # Removes punctuation from +self+.
   #
   #   s = "hello, world. how are you?!"
   #   s.strip_punctuation               # => "hello world how are you"
   def strip_punctuation
     dup.strip_punctuation!
   end
   # Returns the text values inside all occurrences of an XML tag in +self+
   #
   #   s = "four-piece in <a href='#'>Indianapolis</a>, <a href='#'>Indiana</a> at the Murat Theatre"
   #   s.extract_xmltags_values 'a' #=> ["Indianapolis", "Indiana"]
   def extract_xmltags_values(tag_name)
     self.scan(/<#{tag_name}.*?>(.+?)<\/#{tag_name}>/).flatten
   end
   def strip_with_pattern(pattern)
     require 'cgi'
     CGI::unescapeHTML(self.gsub(pattern,"")).unaccent.encode("UTF-8", {:invalid => :replace, :undef => :replace, :replace => " "})
   end
   private :strip_with_pattern
 end
 module Indri
   class IndriPrintedDocuments < String
     def extract_docs
       self.split(/\d+ Q0 .+ \d+ -\d+.\d+ .+/).delete_if{ |x| x.empty? }
+    end
+    def extract_docs_score
+      score = self.scan(/\d+ Q0 .+ \d+ (-\d+.\d+) .+/).flatten
+      name  = self.scan(/\d+ Q0 (.+) \d+ -\d+.\d+ .+/).collect { |n| n.first.scan(/(\d+).xml/).first }
+      return self.split(/\d+ Q0 .+ \d+ -\d+.\d+ .+/).delete_if{ |x| x.empty? },score,name
     end
   end
 end
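
Note on `sequential_dependence_model`: it builds an Indri `#weight` query from three parts: the individual words, exact bigrams (`#1`) and unordered windows of width 8 (`#uw8`), falling back to single words when the string has no bigram. The sketch below rebuilds the same string without the `field` option and without going through `Mirimiri::Document`, so it assumes the input is already plain lowercase words. `sdm_query` is a hypothetical name, not part of the library.

```ruby
# Hypothetical standalone version of the query builder (no field support).
def sdm_query(text, t = 0.85, o = 0.10, u = 0.05)
  words   = text.downcase.split
  bigrams = words.each_cons(2).map { |pair| pair.join(" ") }
  bigrams = words if bigrams.empty? # single-word fallback, as in the diff

  exact     = bigrams.map { |ng| "#1(#{ng})" }   # ordered, adjacent terms
  unordered = bigrams.map { |ng| "#uw8(#{ng})" } # unordered window of 8 terms

  "#weight ( #{t} #combine( #{words.join(' ')} ) " \
    "#{o} #combine ( #{exact.join(' ')} ) " \
    "#{u} #combine ( #{unordered.join(' ')} ) )"
end

puts sdm_query("dillinger escape plan")
```

Because the commit moves `field` to the first position of the real method, a caller that previously passed the three weights positionally would now bind the first weight to `field`; such call sites may need updating.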

main.rb

 $LOAD_PATH.unshift File.expand_path(File.join(File.dirname(__FILE__), "lib"))

 require 'mirimiri'
 require "benchmark"

+# Fetch the text content of two Wikipedia pages using their URLs
 w = Mirimiri::WikipediaPage.new("http://en.wikipedia.org/wiki/The_Dillinger_Escape_Plan")
+u = Mirimiri::WikipediaPage.new("http://en.wikipedia.org/wiki/Pantera")
+
+# Compute the entropy of a word sequence, using `w` as context
 p w.entropy("dillinger escape plan")
 p w.tf("guitar")

+# Compute the KL-Divergence between the two pages
+p w.kl u
+
+
+# Mirimiri also comprises Indri-related classes
+
+# Building an Indri query
 query = Indri::IndriQuery.new({:query => "dillinger escape plan".sequential_dependence_model, :count => 10}, "-trecFormat=true -printDocuments=true")
+
+# Initializing the index on which the query will be executed
+# Must have been previously built using `IndriBuildIndex`
 index = Indri::IndriIndex.new "/mnt/disk1/ClueWeb09_English_1noSpam"
+
+# Run the query on the index and fetch the text of the documents
 s = Indri::IndriPrintedDocuments.new(index.runquery(query).force_encoding("ISO-8859-1").encode("UTF-8"))
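
Note on the last step of `main.rb`: the string wrapped in `IndriPrintedDocuments` interleaves TREC header lines with document text, and `extract_docs_score` in `string.rb` separates the two with regular expressions. The sketch below shows one way to do that split on a made-up sample. It is not the library's implementation: the pattern here is anchored to whole lines and escapes the decimal point, and the names `split_printed_documents`, `HEADER` and the sample output are assumptions.

```ruby
# Header line: "<topic> Q0 <docno> <rank> <score> <run>".
# Capturing version for scan, non-capturing version for split.
HEADER       = /^\d+ Q0 (\S+) \d+ (-?\d+\.\d+) \S+$/
HEADER_SPLIT = /^\d+ Q0 \S+ \d+ -?\d+\.\d+ \S+$/

# Pairs each header (document name, score) with the text that follows it.
# Assumes every header is followed by non-empty text; otherwise the pairs
# would be misaligned.
def split_printed_documents(output)
  headers = output.scan(HEADER)
  texts   = output.split(HEADER_SPLIT).map(&:strip).reject(&:empty?)
  headers.zip(texts).map do |(doc, score), text|
    { doc: doc, score: score.to_f, text: text }
  end
end

# Sample output invented for this example.
sample = <<~OUT
  1 Q0 doc-a 1 -4.25 default
  First document text.
  1 Q0 doc-b 2 -5.5 default
  Second document text.
OUT

docs = split_printed_documents(sample)
docs.each { |d| puts "#{d[:doc]} #{d[:score]}" }
```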