Pinot
Copyright 2005 Fabrice Colin <fabricecolin at users dot berlios dot de>
http://pinot.berlios.de/


1. What is Pinot
2. Upgrading from previous versions
3. Using the Google API
4. Automatic indexing and monitoring
5. Search strategy
6. Viewing cached results
7. Installing additional search plugins
8. File formats
9. More Like This
10. D-Bus service
11. Deskbar Applet plugin
12. How to reset indexes
13. Dependencies
14. FAQ


1. What is Pinot


  Pinot is :
    * a D-Bus service that crawls, indexes and monitors your documents
      ("pinot-dbus-daemon")
    * a GTK-based user interface that enables to query the index built by the
      service or your favourite Web engine, and display and analyze the
      results ("pinot")
    * other command-line tools

  It was developed and tested on GNU/Linux and should work on other Unix-like
  systems.


2. Upgrading from previous versions


  * from 0.50 or older to 0.60
  The "My Documents" index was renamed "My Web Pages". A new index, appropriately
  named "My Documents" ;-), includes all local files including mbox files and is
  populated by the new D-Bus service.

  If you used the Import facility to index specific directories on your system,
  it is recommended you manually unindex the corresponding documents from the
  "My Web Pages" index, and configure these directories to be automatically
  indexed by the D-Bus service, See section "4. Automatic indexing and monitoring".
  The "My EMail" index was dropped. Email messages will now appear in your
  "My Documents" index.


3. Using the Google API


  If the version of Pinot you are using was built with support for the Google
  SOAP API, a "Google API" engine will appear in pinot's engines list. To query
  this engine, you need to obtain a key from Google.

  Go to http://api.google.com/ and create an account. You will then receive
  your Google API key by email. Start pinot and enter the key in the corresponding
  field in the General tab of the Preferences box.


4. Automatic indexing and monitoring


  Pinot can automatically index :
    * directories and their contents. Monitoring for changes is optional.
    * mbox mail accounts. Monitoring is always enabled.
  Both are configured with the Indexing tab in the Preferences box and are
  handled by the D-Bus service.
  If you want to exclude specific files or directories from indexing, use
  patterns as described in section "8. File formats".
  See section "10. D-Bus service".


5. Search strategy


  When querying an index, Pinot adopts a multi-stepped search strategy that
  first attempts to find the documents that best match the query. If a given
  step doesn't return results, the next step alters the query slightly before
  it is run again. These steps are :
    * step 1
      obey the query's operators, don't stem terms
    * step 2
      obey operators, stem terms
    * step 3
      don't follow operators (all terms are OR'ed) and don't stem terms
    * step 4
      don't follow operators and stem terms
  Stemming is available only to Stored Queries for which a language is defined
  (Query parameters box, Indexes only tab).

  Queries can be expressed in a very natural way, using a combination of operators
  and filters. For instance "type:text/html and lang:en and (tcp near ip)" looks
  for HTML files in English that mention TCP/IP. 

  Supported filters are :
    * filters that apply to all engines
      "site" for host name, eg "site:pinot.berlios.de"
      "file" for file name, eg "file:index.html"
    * filters that only apply to Current User engines
      "title" for title, eg "title:Session"
      "url" for URL, eg "url:http://pinot.berlios.de/"
      "dir" for directory, eg "dir:/home/fabrice/test.txt"
      "lang" for ISO language code, eg "lang:en"
      "type" for MIME type, eg "type:text/html"
      "label" for label, eg "label:Important"

  Allowed language codes are "da", "nl", "en", "fi", "fr", "de", "it", "nn", "pt",
  "ru", "es" and "sv".

  For more information about the query syntax, you may want to have a look at the
  documentation for Xapian's QueryParser at http://www.xapian.org/docs/queryparser.html


6. Viewing cached results


  Results returned by search engines can be viewed "live" by selecting the View
  menuitem under Results. This opens whatever application defined for the
  result's MIME type and/or protocol scheme.
  In addition, Pinot allows to view the page as cached by Google and the Wayback
  Machine. Cache providers are actually configured in globalconfig.xml, located
  in $PREFIX/share/pinot/. For instance :
  <cache>
    <name>Google</name>
    <location>http://www.google.com/search?q=cache:%url0</location>
    <protocols>http, https</protocols>
  </cache>

  This is self-explanatory :-) Here it configures a cache provider called
  "Google" that handles both http and https. The location field supports
  two parameters that are substituted to obtain the URL to open : 
    * %url is the result's URL as displayed by the UI, eg
      http://pinot.berlios.de/download.html
    * %url0 is the result's URL without the protocol, eg
      pinot.berlios.de/download.html


7. Installing additional search plugins


  Starting with v0.42, Pinot supports both Sherlock and OpenSearch Description
  plugins. They are installed in $PREFIX/share/pinot/engines/, where PREFIX
  is usually /usr.
  Additional engines can be installed in that directory or in ~/.pinot/engines.
  Note this directory is not created automatically.
  Sherlock is what Firefox and the Mozilla suite use. Chances are that somebody
  wrote a plugin for the engine you are interested in. MyCroft at
  http://mycroft.mozdev.org/ has got plenty of plugins. Beware that a lot are
  out of date and will require some changes. Use pinot-search on the
  command-line to run a quick check on a plugin, eg
  $ pinot-search sherlock /usr/share/pinot/engines/Bozo.src "clowns" 10

  Plugins are categorized by channels. For Sherlock plugins, the routeType
  element under SEARCH specifies the name of the channel the plugin belongs to.
  As for OpenSearch, A9.com has an extensive list at
  http://a9.com/-/search/moreColumns.jsp. A Description file is not available
  for all engines though...
  Pinot should work with OpenSearch Description 1.0 and 1.1 (draft 2) plugins.
  One thing to keep in mind is that because Description doesn't describe how
  to parse the results page returned by the search engine, Pinot assumes that
  the engine will return results formatted according to the OpenSearch Response
  standard.
  In practice, this means that plugins that don't stick to the following rules
  will be ignored or won't show any result :
    * For Description 1.1 plugins, the type attribute on the Url field must be
      set to "application/atom+xml" or "application/rss+xml" (default).
      "text/html" will be rejected.
    * The search engine's results page content type must be some form of XML,
      otherwise Pinot won't attempt parsing it. 
  Pinot differs from the Description spec in that it interprets the Tags field
  as a channel name. The standard defines Tags as a "space-delimited set of
  words that are used as keywords to identify and categorize this search
  content".


8. File formats


  The following document types are supported and will be fully indexed
  * plain text
  * HTML
  * PDF (pdftotext is required)
  * RTF (unrtf is required)
  * MS Word (antiword is required)
  * XML
  * OpenDocument/StarOffice files
  * mbox
  * MP3 and Ogg Vorbis (TagLib required)
  More formats will be supported as the project matures.

  For other document types, Pinot will only index metadata such as name,
  location etc... In any case, starting with 0.62, it is possible to
  skip indexing of files with glob(3) patterns.
  These patterns are configured in the Indexing tab of the Preferences box.
  For example, "*.mp3" will skip all files with the extension "mp3", even
  though that particular document type is supported by Pinot.
  Patterns also apply to directory; for instance "*/Desktop*" will skip
  "~/Desktop" and not crawl nor monitor this directory's contents.
  If you have never run Pinot before, it will initialize the patterns list
  to common picture, archive and video types such as JPG, GIF, ZIP and MPG.


9. More Like This


   The More Like This feature is enabled when at least one of the results
   currently selected is indexed, ie if the result was previously manually
   indexed in "My Web Pages", or if it's a local file indexed in "My Documents".
   Its role is to expand queries.

   When activated, it will suggest other terms from the selected results
   and create a new Stored Query prefixed with "More Like". For instance,
   if you run a Stored Query with name "Me", the expanded query's name will be
   "More Like Me".


10. D-Bus service


   D-Bus activation makes sure the service is running whenever one of its
   methods is invoked by any consumer application. For instance, when
   directories or mbox files are configured for indexing in the UI, clicking
   OK on the Preferences box will call the service's GetStatistics method,
   thus ensuring the service is running.

   Starting with 0.63, the service is auto-started. The file that enables
   this is located at /etc/xdg/autostart/pinot-dbus-daemon.desktop.
   For older versions, if you want the service to be started with your X
   session, add pinot-dbus-daemon to your desktop environment's startup
   programs list.

   A few things to keep in mind :
     * the service doesn't reload its configuration while running, so any
       change (adding or removing a directory or mbox file) will take effect
       when the service is restarted.
     * when starting, the service will first crawl all configured locations
       and (re)index new and modified files. Starting with 0.64, the daemon's
       scheduling priority is set very low (15, can be adjusted with --priority)
       so that you can still play UT2004 while it is indexing !
     * when finished crawling, the service will monitor these locations
       for changes and should consume little resources, unless a huge
       quantity of files needs its attention.
     * any change detected by the monitor is immediately queued and acted upon
       as soon as possible, eg reindex a file that was modified. This and low
       priority should make the daemon not very intrusive.

  If you have dbus-send installed, you can start the daemon manually by calling
  the GetStatistics method with
  $ dbus-send --session --print-reply --type=method_call \
    --dest=de.berlios.Pinot /de/berlios/Pinot de.berlios.Pinot.GetStatistics
  and stop it by calling the Stop method with
  $ dbus-send --session --print-reply --type=method_call \
    --dest=de.berlios.Pinot /de/berlios/Pinot de.berlios.Pinot.Stop
  The daemon will also exit if killed.

  For a list of available D-Bus methods, see the file pinot-dbus-daemon.xml.
  Note that only the My Documents index is accessible through D-Bus for the
  time being.


11. Deskbar Applet plugin


  Deskbar Applet provides an omnipresent versatile search interface for Gnome.
  Searching documents indexed by Pinot with the applet is possible. Users
  of RPM-based systems may install the pinot-desbar RPM. Otherwise, the
  file scripts/python/pinot-live.py can simply be copied into
  $PREFIX/lib/deskbar-applet/handlers (or, on 64 bit systems,
  $PREFIX/lib64/deskbar-applet/handlers) for use by all users, or into
  ~/.gnome2/deskbar-applet/handlers for use only by the current user.
  Once enabled, the plugin will rely on the D-Bus service to query the My
  Documents index.
  Deskbar Applet appears to sort results by names, so it may not present
  results in the same order as the Pinot UI.


12. How to reset indexes


  You may wish to reset one of the index and start from scratch. There
  are several ways to do this, depending on which index it is.

  If you want to reset My Web Pages, you can either :
    * use Pinot to unindex every single document by selecting them all
      and choosing Unindex in the Index menu
    * or delete ~/.pinot/index or ~/.pinot/index/* while Pinot isn't running

  If you want to reset My Documents, special considerations apply because
  of the historical data maintained by the daemon. First, make sure the
  daemon is stopped, then delete the index with
  $ rm ~/.pinot/daemon/*
  and remove historical data as follows
  $ sqlite3 ~/.pinot/history "delete from CrawlHistory; \
    delete from CrawlSources;"


13. Dependencies


  The version numbers indicate the minimum version Pinot has been tested
  with; older versions may or may not work.

---------------------------------------------------------------
Name							Version
---------------------------------------------------------------
SQLite							3.1.2
http://www.sqlite.org/

xapian-core						0.9.6
http://www.xapian.org/

curl (1)						7.13.1
http://curl.haxx.se/
- OR -
neon (1)						0.24.7
http://www.webdav.org/neon/

Google SOAP API (2)	Search/Google/googleapi		beta2

 gSOAP							2.7.8c
 http://www.cs.fsu.edu/~engelen/soap.html

gtkmm							2.6.2
http://www.gtkmm.org/

libxml++						2.12.0
http://libxmlplusplus.sourceforge.net/

libtextcat						2.2
http://software.wise-guys.nl/libtextcat/

gmime							2.1.9
http://spruce.sourceforge.net/gmime

boost (3)						1.32.0
http://www.boost.org/

D-Bus with GLib bindings				0.61
http://www.freedesktop.org/wiki/Software/dbus

shared-mime-info					0.17
http://freedesktop.org/Software/shared-mime-info

desktop-file-utils					0.10
http://www.freedesktop.org/software/desktop-file-utils

unzip							5.52
http://www.info-zip.org/pub/infozip/UnZip.html

pdftotext						3.00
http://www.foolabs.com/xpdf/
http://poppler.freedesktop.org/

antiword						0.36.1
http://www.winfield.demon.nl/

unrtf							0.19.3
http://www.gnu.org/software/unrtf/unrtf.html

TagLib							1.4
http://ktown.kde.org/~wheeler/taglib/
---------------------------------------------------------------
Notes :
(1) enabled with "./configure --with-http=neon|curl"
(2) enabled with "./configure --with-soap=yes"
(3) for building only
---------------------------------------------------------------


14. FAQ


  * At startup, when listing an index or indexing documents, Pinot complains
    of an "index error"
    This is likely because a previous instance didn't exit properly and one
    (or more) index is still locked. Quit Pinot and look for a "db_lock" file
    in "~/.pinot/index". If it's there, delete it and restart Pinot.
    Starting with 0.60, indexes are created with the newer Flint back-end
    which is immune to this issue. The easiest way to "upgrade" is to delete
    the index, let Pinot recreate it, then reindex your data.

  * When compiling from source, you may run into linking problems related to
    libstdc++.la. Try patching xapian-config as explained in the post at
    http://lists.tartarus.org/pipermail/xapian-devel/2006-January/000293.html

  * If you experience segfaults at startup and you are on Fedora, chances are
    it's because of libxml++/libsigc++. See this Bugzilla entry :
    https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=178592
    The latest version seems to fix this issue.

  * If the daemon crashes seemingly randomly while indexing, it may be because
    SQLite wasn't built thread-safe. I have witnessed this mostly on dual CPU
    machines, but others are not immune. Try rebuilding SQLite by passing
    "--enable-threadsafe --enable-threads-override-locks" to configure.

  * If you have built Pinot from source, make sure you have done a "make install".
    Pinot will fail to start if it can't find stuff it needs, its icon for instance.

  * If "make install" fails with an error about Categories in pinot.desktop
    and you have desktop-file-utils 0.11, either downgrade to 0.10 or upgrade to 0.12
    if possible, or edit the top-level Makefile and replace the line 
     @desktop-file-install --vendor="" --dir=$(DESTDIR)$(datadir)/applications pinot.desktop
    with
     $(INSTALL_DATA) pinot.desktop $(DESTDIR)$(datadir)/applications/pinot.desktop
    and run "make install" again.

  * The directory filter doesn't work as expected if the filter starts with a
    non alpha-numeric character
    This has been fixed in Xapian 0.9.8.


