Daniel Naber

(380) Apache Lucene: Searching the Web and Everything Else

Peer-Refereed Talk

Tuesday, 2007-06-26, 15:50 - 16:30, Arena 9

Daniel Naber - Mindquarry GmbH (speaker)

Abstract

Apache Lucene is a collection of search-related software at the Apache Software
Foundation, most notably Lucene Java (often just called "Lucene"),
Solr, and Nutch.

Considering its active community and the number of high-profile deployments,
Lucene Java is by far the most successful open-source fulltext search library.
It powers, among many others, the search at Wikipedia and monster.com as well
as the desktop search tool Beagle.

Technically, Lucene is a pure Java library that requires Java 1.4 and has no 
external dependencies. Solr and Nutch are Java-based applications that are built 
on Lucene Java and that can be used almost without programming knowledge.

The talk will introduce fulltext indexing and searching with Lucene, Solr, and
Nutch. The important steps in fulltext information retrieval will be described:
file format conversion, metadata extraction, text normalization, and the
indexing step itself. Examples will show how easy it is to use Lucene Java and
when it may be more sensible to use Solr or Nutch -- or even a standard
relational database.
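
The normalization and indexing steps named above can be illustrated with a
toy inverted index. This is only a sketch in plain Java: it does not use
Lucene's API, the class and method names (TinyIndex, normalize, add, search)
are made up for illustration, and it uses newer language features than
Java 1.4 offers. Format conversion and metadata extraction are assumed to
have already produced plain text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index; illustrates the idea only and is NOT Lucene's API.
class TinyIndex {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Text normalization: lower-case, then split on every run of
    // characters that are neither letters nor decimal digits.
    static List<String> normalize(String text) {
        List<String> terms = new ArrayList<>();
        String lower = text.toLowerCase(Locale.ROOT);
        for (String token : lower.split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Indexing step: record the document id under each normalized term.
    void add(int docId, String text) {
        for (String term : normalize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Search: ids of documents that contain every query term (AND).
    Set<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : normalize(query)) {
            TreeSet<Integer> docs = postings.get(term);
            if (docs == null) {
                return Collections.emptySet();
            }
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        if (result == null) {
            return Collections.emptySet();
        }
        return result;
    }
}
```

Because the query passes through the same normalize method as the documents,
a search for "LUCENE" finds documents that were indexed with "Lucene". A real
search library adds relevance ranking, persistent index files, and
language-aware analysis on top of this basic structure.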

Lucene Java will be explained using Java code examples, showing how the
important classes fit together. Solr is a Lucene-based search server with HTTP
interfaces that expect and return XML documents. It has some higher-level
features such as a web frontend, replication, and caching, which make it an
interesting alternative even for software developers who are willing to learn
Lucene Java. Solr's configuration files will be explained and the XML format
will be shown.
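
As a rough idea of that XML format: Solr's update interface accepts an "add"
message that wraps each document's fields in add, doc, and field elements.
The sketch below only builds such a message as a string. The class
SolrAddXml and its methods are hypothetical helpers, and the field names
used in the usage note ("id", "title") are assumptions that would have to
match the fields defined in the Solr schema of the actual installation.

```java
import java.util.Map;

// Hypothetical helper that builds a Solr "add" update message as text.
// It performs no HTTP request; sending the message is left to the caller.
class SolrAddXml {
    // Escape the characters that are special in XML text and attributes.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // One <field name="...">value</field> element per map entry,
    // in the iteration order of the map.
    static String addMessage(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            xml.append("<field name=\"")
               .append(escape(field.getKey()))
               .append("\">")
               .append(escape(field.getValue()))
               .append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }
}
```

Called with a LinkedHashMap that holds "id" and "title" entries, addMessage
returns a single add element containing one doc with two field elements. The
resulting text would then be sent in an HTTP POST to Solr's update URL, and a
separate commit message makes the added documents visible to searches.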

Unlike Lucene Java, Nutch is not a library but a complete web search engine.
Technically, Nutch is Lucene plus a web crawler, a plug-in system, document
converters, and a web search front end. A short demonstration will show how to
get the crawler started.