Advas -- A Python Search Engine Module

Werbung
Advas – A Python Search Engine Module
Dipl.-Inf. Frank Hofmann
Potsdam
11. Oktober 2007
Dipl.-Inf. Frank Hofmann (Potsdam)
Advas – A Python Search Engine Module
11. Oktober 2007
1 / 15
Contents
1
Project Overview
2
Functional Overview
3
Package Details
4
Future
5
References
Dipl.-Inf. Frank Hofmann (Potsdam)
Advas – A Python Search Engine Module
11. Oktober 2007
2 / 15
Project Overview
AdvaS in Five Minutes (1)
a library to build a search engine
abbreviation for Advanced Search
written in Python
available as an additional module
project website
advas.sourceforge.net
Dipl.-Inf. Frank Hofmann (Potsdam)
Advas – A Python Search Engine Module
11. Oktober 2007
3 / 15
Project Overview
AdvaS in Five Minutes (2)
Goals:
free library
to extend an CMS or AI system
to study and and understand
information retrieval/search
to learn how does an search engine
works
to understand why you do not find the
data you are looking for
to improve formulating your search
queries
to improve and test enhanced search
algorithms
Dipl.-Inf. Frank Hofmann (Potsdam)
Advas – A Python Search Engine Module
11. Oktober 2007
4 / 15
Project Overview
Project History
development started in 2002 as a student’s project
since then: several improvements
released as rpm packages (Fedora, Mandrake)
since 2006: available as Debian package (for Etch)
test application with GTK+
Dipl.-Inf. Frank Hofmann (Potsdam)
Advas – A Python Search Engine Module
11. Oktober 2007
5 / 15
Functional Overview
Statistical Algorithms
term frequency (tf)
term frequency with stop list
inverse document frequency (idf)
retrieval status value (rsv)
language detection by keywords
k-nearest neighbour algorithm (kNN)
Dipl.-Inf. Frank Hofmann (Potsdam)
Advas – A Python Search Engine Module
11. Oktober 2007
6 / 15
Functional Overview
Linguistic Algorithms
stemming algorithms
table lookup stemmer
n-gram stemmer
successor variety stemmer using peak-and-plateau method
synonym detection with the use of the OpenThesaurus (plain text
version)
Dipl.-Inf. Frank Hofmann (Potsdam)
Advas – A Python Search Engine Module
11. Oktober 2007
7 / 15
Functional Overview
Sound-like Methods
soundex
metaphone
NYSIIS algorithm
caverphone algorithm (version 2.0)
Dipl.-Inf. Frank Hofmann (Potsdam)
Advas – A Python Search Engine Module
11. Oktober 2007
8 / 15
Functional Overview
Additional Methods
a simple descriptor-based ranking algorithm
text search algorithms (Knuth-Morris-Pratt)
Dipl.-Inf. Frank Hofmann (Potsdam)
Advas – A Python Search Engine Module
11. Oktober 2007
9 / 15
Package Details
Requirements
technical requirements
python 2.4
other facts
enthusiasm
idea or plan which advas components to combine
data to play around with
Dipl.-Inf. Frank Hofmann (Potsdam)
Advas – A Python Search Engine Module
11. Oktober 2007
10 / 15
Package Details
Package Content
AdvaS library
Documentation (plain text, HTML in German and English)
Examples
demo application using pyGTK
Dipl.-Inf. Frank Hofmann (Potsdam)
Advas – A Python Search Engine Module
11. Oktober 2007
11 / 15
Future
Wanted
code review
code improvements
extending the demo application
testing
documentation: translation into other languages
Dipl.-Inf. Frank Hofmann (Potsdam)
Advas – A Python Search Engine Module
11. Oktober 2007
12 / 15
References
Literature (selection)
G. G. Chowdhury:
Introduction to Modern Information Retrieval
Facet Publishing, London, 2004, 474 S., ISBN 1-8560-4480-7
David A. Grossman/Ophir Frieder:
Information Retrieval: Algorithms and Heuristics (The Information
Retrieval Series), Springer, 2006, 356 S., ISBN 978-1402030048
Dipl.-Inf. Frank Hofmann (Potsdam)
Advas – A Python Search Engine Module
11. Oktober 2007
13 / 15
References
Links
Semantic Web School – Zentrum für Wissenstransfer
http://www.semantic-web.at
Semantic Pool
http://www.semanticpool.de
SearchEngine Watch
http://www.searchenginewatch.com
Search Theory and Strategies
http://wiki.apache.org/nutch/Search Theory
Dipl.-Inf. Frank Hofmann (Potsdam)
Advas – A Python Search Engine Module
11. Oktober 2007
14 / 15
References
The End
Thank you for your attention :-)
Kontakt:
Dipl.-Inf. Frank Hofmann
Hofmann EDV – Linux, Layout und Satz
Dortustr. 53
14467 Potsdam / Germany
Email <[email protected]>
web www.efho.de
Dipl.-Inf. Frank Hofmann (Potsdam)
Advas – A Python Search Engine Module
11. Oktober 2007
15 / 15
Herunterladen