Recipe 13.4. Indexing Unstructured Text with SimpleSearch
Problem
You want to index a number of texts and do quick keyword searches on them.
Solution
Use the SimpleSearch library, available in the SimpleSearch gem.
Here's how to create and save an index:
require 'rubygems'
require 'search/simple'
contents = Search::Simple::Contents.new
contents << Search::Simple::Content.new('In the beginning God created the heavens…',
                                        'Genesis.txt', Time.now)
contents << Search::Simple::Content.new('Call me Ishmael…',
                                        'MobyDick.txt', Time.now)
contents << Search::Simple::Content.new('Marley was dead to begin with…',
                                        'AChristmasCarol.txt', Time.now)
searcher = Search::Simple::Searcher.load(contents, 'index_file')
Here's how to load and search an existing index:
require 'rubygems'
require 'search/simple'
searcher = nil
open('index_file') do |f|
  searcher = Search::Simple::Searcher.new(Marshal.load(f), Marshal.load(f),
                                          'index_file')
end
searcher.find_words(['begin']).results.collect { |result| result.name }
# => ["AChristmasCarol.txt", "Genesis.txt"]
Discussion
SimpleSearch is a library that makes it easy to do fast keyword searching on unstructured text documents. The index itself is represented by a Searcher object, and each document you feed it is a Content object.
To create an index, you must first construct a number of Content objects and a Contents object to contain them. A Content object contains a piece of text, a unique identifier for that text (often a filename, though it could also be a database ID or a URL), and the time at which the text was last modified. Searcher.load transforms a Contents object into a searchable index that gets serialized to disk with Marshal.
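The loading code in the Solution calls Marshal.load twice on the same file handle, which works because several marshaled objects can sit back to back in one file. Here is a minimal sketch of that pattern on its own, without SimpleSearch. The helper names save_index and load_index, and the contents of the two hashes, are hypothetical stand-ins chosen for illustration; they are not part of the library, and the real index structures may look quite different.

```ruby
require 'tmpdir'

# Write two objects to one file, back to back.
def save_index(path, dict, document_vectors)
  File.open(path, 'wb') do |f|
    Marshal.dump(dict, f)
    Marshal.dump(document_vectors, f)
  end
end

# Each Marshal.load call reads exactly one object from the stream,
# so two calls return the objects in the order they were written.
def load_index(path)
  File.open(path, 'rb') do |f|
    [Marshal.load(f), Marshal.load(f)]
  end
end

# Hypothetical stand-ins for the real index structures.
dict = { 'begin' => 0, 'god' => 1 }
document_vectors = { 'Genesis.txt' => [0, 1], 'AChristmasCarol.txt' => [0] }

Dir.mktmpdir do |dir|
  path = File.join(dir, 'index_file')
  save_index(path, dict, document_vectors)
  loaded_dict, loaded_vectors = load_index(path)
  puts loaded_dict == dict                 # prints "true"
  puts loaded_vectors == document_vectors  # prints "true"
end
```

The order of the loads matters: the objects come back in the order they were dumped, so the reader has to know the layout of the file in advance.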
The indexer analyzes the text you give it, removes stop words (like "a"), truncates words to their roots (so "beginning" becomes "begin"), and puts every word of the text into binary data structures. Given a set of words to find and a set of words to exclude, SimpleSearch uses these structures to quickly find a set of documents.
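To make the indexing steps concrete, here is a toy inverted index in plain Ruby. It is only a sketch of the general technique, not SimpleSearch's actual implementation: the stop-word list, the crude suffix-stripping stem method, and the names terms, build_index, and find are all assumptions made for this example.

```ruby
require 'set'

# A tiny, hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = Set.new(%w[a an the in to of was with is me])

# Crude suffix stripping, standing in for a real stemming algorithm.
def stem(word)
  word.sub(/(ning|ing|s)\z/, '')
end

# Lowercase the text, split it into words, drop stop words, stem the rest.
def terms(text)
  text.downcase.scan(/[a-z]+/).reject { |w| STOP_WORDS.include?(w) }.map { |w| stem(w) }
end

# Map each stemmed term to the set of document names that contain it.
def build_index(documents)
  index = Hash.new { |hash, term| hash[term] = Set.new }
  documents.each do |name, text|
    terms(text).each { |term| index[term] << name }
  end
  index
end

# Return the sorted names of documents containing every included word
# and none of the excluded words.
def find(index, include_words, exclude_words = [])
  matches = include_words.map { |w| index.fetch(stem(w.downcase), Set.new) }
  hits = matches.reduce { |a, b| a & b } || Set.new
  exclude_words.each { |w| hits -= index.fetch(stem(w.downcase), Set.new) }
  hits.sort
end

DOCUMENTS = {
  'Genesis.txt'         => 'In the beginning God created the heavens',
  'MobyDick.txt'        => 'Call me Ishmael',
  'AChristmasCarol.txt' => 'Marley was dead to begin with'
}

index = build_index(DOCUMENTS)
p find(index, ['begin'])           # => ["AChristmasCarol.txt", "Genesis.txt"]
p find(index, ['begin'], ['god'])  # => ["AChristmasCarol.txt"]
```

Because "beginning" and "begin" reduce to the same root, a search for "begin" finds both Genesis.txt and AChristmasCarol.txt, which mirrors the result shown in the Solution.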
Here's how to add some new documents to an existing index:
class Search::Simple::Searcher
  def add_contents(contents)
    Search::Simple::Searcher.create_indices(contents, @dict,
                                            @document_vectors)
    dump # Re-serialize the file
  end
end
contents = Search::Simple::Contents.new
contents << Search::Simple::Content.new('A spectre is haunting Europe…',
                                        'TheCommunistManifesto.txt', Time.now)
searcher.add_contents(contents)
searcher.find_words(['spectre']).results[0].name
# => "TheCommunistManifesto.txt"
Apart from appending new documents this way, SimpleSearch doesn't support incremental indexing: if you update or delete a document, you must recreate the entire index from scratch.
See Also