Recipe 13.4. Indexing Unstructured Text with SimpleSearch

Problem

You want to index a number of texts and do quick keyword searches on them.

Solution

Use the SimpleSearch library, available in the SimpleSearch gem.

Here's how to create and save an index:

	require 'rubygems'
	require 'search/simple'

	# Each Content holds a piece of text, a unique name, and a modification time.
	contents = Search::Simple::Contents.new
	contents << Search::Simple::Content.new('In the beginning God created the heavens…',
	                                        'Genesis.txt', Time.now)
	contents << Search::Simple::Content.new('Call me Ishmael…',
	                                        'MobyDick.txt', Time.now)
	contents << Search::Simple::Content.new('Marley was dead to begin with…',
	                                        'AChristmasCarol.txt', Time.now)

	# Build the index from the contents and serialize it to 'index_file'.
	searcher = Search::Simple::Searcher.load(contents, 'index_file')

Here's how to load and search an existing index:

	require 'rubygems'
	require 'search/simple'

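	# The index file holds two Marshal dumps back to back, so read it in
	# binary mode and load them in the order they were written.
	searcher = nil
	File.open('index_file', 'rb') do |f|
	  searcher = Search::Simple::Searcher.new(Marshal.load(f), Marshal.load(f),
	                                          'index_file')
	end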

	searcher.find_words(['begin']).results.collect { |result| result.name }
	# => ["AChristmasCarol.txt", "Genesis.txt"]

Discussion

SimpleSearch is a library that makes it easy to do fast keyword searching on unstructured text documents. The index itself is represented by a Searcher object, and each document you feed it is a Content object.

To create an index, you must first construct a number of Content objects and a Contents object to contain them. A Content object contains a piece of text, a unique identifier for that text (often a filename, though it could also be a database ID or a URL), and the time at which the text was last modified. Searcher.load transforms a Contents object into a searchable index that gets serialized to disk with Marshal.
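
The loading code in the Solution calls Marshal.load twice on the same file handle, which suggests the index file is simply two serialized objects written one after the other. The following sketch shows that pattern using only the standard library. The `dict` and `vectors` hashes are hypothetical stand-ins, not SimpleSearch's real internal structures.

```ruby
require 'tmpdir'

# Hypothetical stand-ins for the two structures stored in an index file.
dict    = { 'begin' => 0, 'spectre' => 1 }
vectors = { 'Genesis.txt' => [0], 'TheCommunistManifesto.txt' => [1] }

loaded_dict = loaded_vectors = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, 'index_file')

  # Write both objects to one file, one Marshal dump after the other.
  File.open(path, 'wb') do |f|
    Marshal.dump(dict, f)
    Marshal.dump(vectors, f)
  end

  # Each Marshal.load call consumes exactly one dumped object, so two
  # calls on the same handle return the objects in the order written.
  File.open(path, 'rb') do |f|
    loaded_dict    = Marshal.load(f)
    loaded_vectors = Marshal.load(f)
  end
end
```

Opening the file in binary mode matters on platforms that translate line endings, because Marshal output is binary data.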

The indexer analyzes the text you give it, removes stop words (like "a"), truncates words to their roots (so "beginning" becomes "begin"), and puts every word of the text into binary data structures. Given a set of words to find and a set of words to exclude, SimpleSearch uses these structures to quickly find a set of documents.
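
To make the idea concrete, here is a toy index in plain Ruby. It is not SimpleSearch's implementation: the `ToyIndex` class, its stop-word list, and its crude stemmer are all assumptions made for illustration. It stores one integer bitmask per stemmed word (bit i is set when document i contains the word) and answers a query by combining those bitmasks.

```ruby
require 'set'

# A toy index for illustration only; SimpleSearch's internals may differ.
class ToyIndex
  STOP_WORDS = Set.new(%w[a an and in of the to was with]) # hypothetical list

  def initialize
    @names   = []           # document position -> document name
    @vectors = Hash.new(0)  # stemmed word -> bitmask of documents containing it
  end

  # Crude stemmer: strips a trailing "ing" from longer words and collapses
  # the doubled consonant left behind, so "beginning" becomes "begin".
  def stem(word)
    word = word.downcase
    return word unless word.end_with?('ing') && word.size > 5
    word[0...-3].sub(/([^aeiou])\1\z/, '\1')
  end

  def add(name, text)
    bit = 1 << @names.size
    @names << name
    text.scan(/[A-Za-z]+/).each do |word|
      next if STOP_WORDS.include?(word.downcase)
      @vectors[stem(word)] |= bit
    end
  end

  # Returns names of documents containing every word in include_words
  # and none of the words in exclude_words.
  def find(include_words, exclude_words = [])
    hits = (1 << @names.size) - 1   # start with every document selected
    include_words.each { |w| hits &= @vectors.fetch(stem(w), 0) }
    exclude_words.each { |w| hits &= ~@vectors.fetch(stem(w), 0) }
    @names.each_index.select { |i| hits[i] == 1 }.map { |i| @names[i] }
  end
end

index = ToyIndex.new
index.add('Genesis.txt', 'In the beginning God created the heavens')
index.add('MobyDick.txt', 'Call me Ishmael')
index.add('AChristmasCarol.txt', 'Marley was dead to begin with')

index.find(['begin'])           # => ["Genesis.txt", "AChristmasCarol.txt"]
index.find(['begin'], ['god'])  # => ["AChristmasCarol.txt"]
```

Because each query reduces to a few bitwise operations, lookup cost depends on the number of query words rather than on the size of the documents.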

Here's how to add some new documents to an existing index:

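	# Reopen Searcher to add a method that indexes more documents into the
	# searcher's existing dictionary and document vectors.
	class Search::Simple::Searcher
	  def add_contents(contents)
	    Search::Simple::Searcher.create_indices(contents, @dict,
	                                            @document_vectors)
	    dump                             # Re-serialize the file
	  end
	end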

	contents = Search::Simple::Contents.new
	contents << Search::Simple::Content.new('A spectre is haunting Europe…',
	                                        'TheCommunistManifesto.txt', Time.now)
	searcher.add_contents(contents)
	searcher.find_words(['spectre']).results[0].name
	# => "TheCommunistManifesto.txt"

Apart from appending new documents this way, SimpleSearch doesn't support incremental indexing. If you update or delete a document, you must re-create the entire index from scratch.
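
One way to decide between appending and rebuilding is to compare the names and modification times you indexed against the current ones. The helper below is a hypothetical sketch, not part of SimpleSearch. It assumes you keep your own record of what was indexed as a hash mapping each document name to its modification time.

```ruby
# Hypothetical helper: decides what to do with an existing index.
#   indexed - { name => Time } for documents already in the index
#   current - { name => Time } for documents as they exist now
# Returns :rebuild, :append, or :up_to_date.
def index_action(indexed, current)
  deleted = indexed.keys - current.keys
  updated = indexed.keys.select do |name|
    current.key?(name) && current[name] > indexed[name]
  end

  # Deleted or modified documents cannot be patched in, so start over.
  return :rebuild unless deleted.empty? && updated.empty?

  # Only brand-new documents can be appended to the existing index.
  (current.keys - indexed.keys).empty? ? :up_to_date : :append
end

t0 = Time.at(0)
t1 = Time.at(60)

indexed = { 'Genesis.txt' => t0, 'MobyDick.txt' => t0 }

index_action(indexed, indexed)                                      # => :up_to_date
index_action(indexed, indexed.merge('TheCommunistManifesto.txt' => t1)) # => :append
index_action(indexed, indexed.merge('MobyDick.txt' => t1))          # => :rebuild
index_action(indexed, { 'Genesis.txt' => t0 })                      # => :rebuild
```

When the result is :append, the add_contents method shown above can index just the new documents.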
