
Daniel Naber
(380) Apache Lucene: Searching the Web and Everything Else
Peer-Refereed Talk
Tuesday, 2007-06-26, 15:50 - 16:30, Arena 9
Daniel Naber - Mindquarry GmbH (speaker)
Abstract
Apache Lucene is a collection of search-related software at the Apache Software
Foundation, most notably Lucene Java (often just called "Lucene"),
Solr, and Nutch.
Considering its active community and the number of high-profile deployments,
Lucene Java is by far the most successful Open Source fulltext search library.
It is used, amongst many others, to power the search at Wikipedia, monster.com,
and the desktop search tool Beagle.
Technically, Lucene is a pure Java library that requires Java 1.4 and has no
external dependencies. Solr and Nutch are Java-based applications that are built
on Lucene Java and that can be used almost without programming knowledge.
The talk will introduce fulltext indexing and searching with Lucene, Solr, and
Nutch. The important steps in fulltext information retrieval will be described:
file format conversion, metadata extraction, text normalization, and the
indexing step itself. Examples will show how easy it
is to use Lucene Java and when it may be more sensible to use Solr or Nutch --
or even a standard relational database.
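To make the normalization and indexing steps more concrete, the following is a minimal sketch of an inverted index written against the Java standard library only. It is not Lucene's API: the class `TinyIndex` and its methods `normalize`, `add`, and `search` are hypothetical names chosen for illustration, and it uses current Java syntax rather than Java 1.4. Real analyzers do considerably more (stemming, stop words, scoring), so this should be read as a rough picture of the idea, not of how Lucene works internally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index: a stdlib-only illustration, NOT Lucene's API. */
final class TinyIndex {
    // term -> sorted set of ids of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    /** Text normalization: lowercase, then split on anything that is not a letter or digit. */
    static List<String> normalize(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    /** Indexing step: record the document id under every normalized term. */
    void add(int docId, String text) {
        for (String term : normalize(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    /** Search: ids of the documents that contain every query term (boolean AND). */
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : normalize(query)) {
            Set<Integer> docs = postings.getOrDefault(term, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);   // first term: start from its postings
            } else {
                result.retainAll(docs);         // later terms: intersect
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add(1, "Apache Lucene is a full-text search library.");
        index.add(2, "Nutch is a web search engine built on Lucene.");
        index.add(3, "Solr is a search server.");
        System.out.println(index.search("lucene search")); // prints [1, 2]
        System.out.println(index.search("SOLR"));          // prints [3]
    }
}
```

Because the query passes through the same `normalize` method as the documents, "SOLR" matches "Solr". Applying identical normalization at index time and at query time is the design point this sketch is meant to show.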
Lucene Java will be explained using Java code examples, showing how the
important classes fit together. Solr is a Lucene-based search server with HTTP
interfaces that expect and return XML documents. It has some higher-level
features, such as a web frontend, replication, and caching, which make it an
interesting alternative even for software developers who are willing to learn
Lucene Java. Solr's configuration files will be explained and the XML format
will be shown.
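As a rough illustration of the XML format, the sketch below builds the kind of `<add><doc><field name="...">` message that Solr's XML update interface accepts. It uses only the Java standard library and does not send anything over HTTP. The class `SolrAddMessage` and its methods are hypothetical helpers, and the field names `id` and `title` are assumptions: which fields are valid depends on the schema configured for the Solr instance.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Builds a Solr-style XML "add" message as a string (illustrative helper, not part of Solr). */
final class SolrAddMessage {

    /** Escapes the characters that are special in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Wraps the given fields in a single document inside an add element. */
    static String build(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            xml.append("<field name=\"").append(escape(field.getKey())).append("\">")
               .append(escape(field.getValue()))
               .append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        // Field names are assumptions; they must match the fields defined in the Solr schema.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", "talk-380");
        fields.put("title", "Lucene & Solr");
        System.out.println(build(fields));
    }
}
```

In a real setup the resulting string would be sent to Solr's update handler in an HTTP POST, followed by a commit message so that the new document becomes searchable.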
Unlike Lucene Java, Nutch is not a library but a complete web search engine.
Technically, Nutch is Lucene plus a web crawler, a plug-in system, document
converters, and a web search front end. A short demonstration will show how to
get the crawler started.







