swish-e indexing software - HOWTO

Postby michaeldaly600 » Sun Mar 25, 2012 11:22 am

Hi, I want to share this HOWTO for installing swish-e on a QNAP NAS device. I have a TS219-P. Swish-e has been a document retrieval friend for years now, and was missed after migrating to the QNAP for document storage. Please refer to my earlier post viewtopic.php?f=121&t=55096, which describes the prerequisite upgrading of GCC and Perl from source, although as described below it may not be necessary to do all of this.

Here are the Steps:
Step 1. Understand what swish-e is designed for, how it is used and whether it will fit your requirements. Swish-e on its own will index .html and .txt documents on your file server and, with easy-to-install additional software, .doc, .xls and .pdf documents. Command-line configuration is required.

Main reference: http://www.swish-e.org/

Step 2. Dependencies
Refer http://swish-e.org/docs/install.html#sy ... quirements.

I installed all dependencies with the --prefix=/opt configuration directive, to be consistent with the Optware Chroot strategy.

GCC and GNU 'make' are required. But using the ipkg version of GCC, I repeatedly got libtool 'object name conflicts' failures during the 'make' phase when attempting to compile swish-e-2.4.7 :

Code: Select all
libtool: link: ERROR: object name conflicts: .libs/libswish-e.lax/libreplace.a//share/MD0_DATA/misc/swish-e-2.4.7/src/replace/.libs/libreplace.a
make[3]: *** [libswish-e.la] Error 1
make[3]: Leaving directory `/share/MD0_DATA/misc/swish-e-2.4.7/src'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/share/MD0_DATA/misc/swish-e-2.4.7/src'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/share/MD0_DATA/misc/swish-e-2.4.7/src'
make: *** [all-recursive] Error 1


After taking advice from various fora, on my qnap 219P+ the effort boiled down to *updating* gcc; refer to my other posting on this topic at:
viewtopic.php?p=248944

Note that Libtool was updated immediately prior to updating GCC. I subsequently wondered if updating only libtool might be sufficient to install swish-e, ie only doing steps 1 to 6 of the GCC update, as listed in the described posting. Any feedback on this would be useful.
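
To make feedback on this comparable, it may help to record which tool versions are in use before and after each update. A minimal sketch, assuming only that each tool accepts `--version` (the function name `report_version` is my own, not part of any package):

```shell
#!/bin/sh
# Print the first line of "<tool> --version" for each build tool,
# or a "not found" note if the tool is not on PATH.
# report_version is a hypothetical helper name used for illustration.
report_version() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s: %s\n' "$tool" "$("$tool" --version 2>&1 | head -n 1)"
        else
            printf '%s: not found\n' "$tool"
        fi
    done
}

report_version gcc make libtool
```

Posting this output along with any success or failure report would show whether a libtool-only update is enough.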

I also replaced perl (see the other posting), as I wanted indexing via the spidering (perl based) approach (see below). Updating Perl may not be essential...again, please post any relevant feedback here.
edit: I recall perl refused to install the additional modules required by swish-e; whether steps 1 to 6 are enough to overcome this, I can't say

2.1 libxml2 dependency
Reference: http://swish-e.org/docs/install.html#item_libxml2
libxml2 is an 'optional' package, in that having it will increase parsing accuracy. The ipkg version had previously been installed on my qnap, and I did not update it as it seemed to be the latest version. If it's no longer the latest version, it could be updated via compilation from source, as long as there are no conflicts with other packages that depend on libxml2. These can be listed via the command:
Code: Select all
ipkg whatdepends libxml2

If other libxml2 dependencies are found, it may still be possible to install a second version of libxml2 in a separate location to the existing version and compile swish-e with reference to this second libxml2 version during the configuration stage (I would suggest to check the swish-e ./configure options and/or take note of the advice to ensure the /bin directory containing the relevant 'xml2-config' binary is in your $PATH before building Swish-e).
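
As a rough sketch of the $PATH advice, the snippet below puts a second libxml2 ahead of the ipkg one and reports which 'xml2-config' would be picked up. The prefix /opt/libxml2-new and the function name are assumptions for illustration only; I have not built swish-e this way myself.

```shell
#!/bin/sh
# ASSUMPTION: /opt/libxml2-new is a hypothetical --prefix used when
# compiling the second libxml2; substitute your own location.
use_libxml2_prefix() {
    prefix=$1
    # Put the new bin directory first so its xml2-config is found first
    PATH="$prefix/bin:$PATH"
    export PATH
    if command -v xml2-config >/dev/null 2>&1; then
        echo "xml2-config found at: $(command -v xml2-config)"
        echo "libxml2 version: $(xml2-config --version)"
    else
        echo "xml2-config not found on PATH"
    fi
}

use_libxml2_prefix /opt/libxml2-new
```

Run this in the same shell session as the swish-e ./configure, since the changed $PATH only lasts for that session.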

2.2 Zlib Compression
Reference - http://swish-e.org/docs/install.html#item_zlib
As with libxml2, it was already installed on my qnap. It was not the latest version but I did not update it because of dependencies. A similar update process to that described in Step 2.1 above could be followed. To list the dependencies of this ipkg installed package, run:
ipkg whatdepends zlib

2.3 Perl Modules
Reference: http://swish-e.org/docs/install.html#item_perl
I installed all the recommended modules, via cpan directly, via the command:
Code: Select all
cpan

then
Code: Select all
install ...

(you can do it all on one line as indicated, ie:
Code: Select all
perl -MCPAN -e 'install Bundle::LWP'
)
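
To confirm that the modules actually installed, something like the following could be run afterwards. It is only a sketch: the module names listed are examples and should be replaced by the ones named in the swish-e install document, and `have_perl_module` is a helper name I made up.

```shell
#!/bin/sh
# Return success if perl can load the named module.
# have_perl_module is a hypothetical helper name.
have_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# ASSUMPTION: example module list only; check the install document
# for the modules your setup needs.
for mod in LWP URI HTML::Parser; do
    if have_perl_module "$mod"; then
        echo "ok      $mod"
    else
        echo "MISSING $mod"
    fi
done
```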


2.3 Dependencies for indexing the various proprietary documents
Reference - http://swish-e.org/docs/install.html#item_indexing
I followed the advice listed in the swish-e install document for indexing MS Word, pdf, and Excel documents. Note that xpdf and catdoc are available via ipkg, though I elected to compile these from source, installing in the /opt directory (use ./configure with the --prefix=/opt)

My configure script for xpdf was:
./configure --with-freetype2-library=/opt/lib --with-freetype2-includes=/opt/include/freetype2 --prefix=/opt
(nb must have compiled the 'freetype' library beforehand)

Also note that at the end of the xpdf configure (ie prior to beginning 'make'), there was an alert that 'Motif' was missing. Several attempts to compile 'Motif' failed and xpdf compiled without it, though it may be worth trying on other setups. Refer to Step 12 below.

2.4 Installing PCRE
In order to avoid this warning when configuring swish-e:
Code: Select all
Not building with perl compatible regex - use --with-pcre to enable


You may need to install pcre (Perl Compatible Regular Expressions). It can be installed via ipkg, or to get the latest version use:
Code: Select all
wget ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/pcre-8.20.tar.gz

and install it in /opt with this configure script:
Code: Select all
./configure --enable-utf8 --prefix=/opt


Run 'make' then 'make check'. With 'make check' there was no output for test 3, which is the perl test, other than the paragraph below referencing 'locale specific features...':
Code: Select all
Cannot test locale-specific features - none of the 'fr_FR', 'fr' or 'french' locales exist, or the "locale" command is not available to check for them.


Also note that compiling with UTF-8 support (--enable-utf8, as specified in the configure script above) will avoid the 'UTF-8 support is not available' warnings when running 'make check'.



Step 3. Installing Swish-e
Reference - http://swish-e.org/docs/install.htm#building_swish_e
So presuming swish-e-2.4.7 has been downloaded and unpacked (in /share/MD0_DATA/misc/ in my case), configure, make, check and install it. My configure command was:
/share/MD0_DATA/misc/swish-e-2.4.7/configure --with-pcre=/opt --prefix=/opt
(follow the swish-e installation advice)
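
For reference, the usual order of the build steps is sketched below. By default it only prints the commands (a dry run); set DRY_RUN=0 to actually run them. The `run` and `build_swish` helpers are my own naming, and the source path and configure flags are simply the ones from this HOWTO.

```shell
#!/bin/sh
# Source directory and flags as used in this HOWTO; adjust as needed.
SRC=${SRC:-/share/MD0_DATA/misc/swish-e-2.4.7}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = execute them

# Echo a command, and execute it unless this is a dry run
run() {
    echo "+ $*"
    [ "$DRY_RUN" = 1 ] || "$@"
}

# Stop at the first step that fails
build_swish() {
    run cd "$SRC" || return 1
    run ./configure --with-pcre=/opt --prefix=/opt || return 1
    run make || return 1
    run make check || return 1
    run make install
}

build_swish
```

Running 'make check' before 'make install' means a broken build is caught before anything is copied into /opt.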




Step 4. Setup and configure an index file(s)
There are two ways to index. Here in Step 4 I will describe the simpler method, which is indexing of local files. This is known as the File Access method. The other method is the 'spidering' method (Step 5). More than one index can be configured and used, of course. The swish-e install directions advise not to run either method as root, although that is what I am doing for the moment. Although the information below is available in the swish-e installation document and elsewhere, it can be confusing, so I am going to do my best to make it as clear as possible here. Also, my various .conf, .config and other files are attached.

For indexing via the File Access method, create a file named swish_1.conf and save it in, for example, /share/MD0_DATA/swish-e-files/swish-e-conf/ (the location used by the commands in this HOWTO).

All the directives below are described in detail in:
http://www.swish-e.org/docs/swish-config.html
and note that the first section applies to both indexing methods, and there is a later section that applies to specific indexing types.

There is more information in this forum posting:
Ref: http://www.swish-e.org/archive/2005-12/10202.html

So, here are the contents of /share/MD0_DATA/swish-e-files/swish-e-conf/swish_1.conf :
Code: Select all
# Tell Swish-e what to index (same as the -i command line switch)
   IndexDir /share/MD0_DATA/server_dir
 
# Index only the following type of files
   IndexOnly .htm .html .txt .doc .pdf .xls

# Tell Swish-e that .txt files are to use the text parser.
   IndexContents TXT* .txt

# Otherwise, use the HTML parser
   DefaultContents HTML*

#Other config options
   MetaNames swishdocpath swishtitle

# add the filters to enable indexing of more than .txt and .html type documents
# ref: http://swish-e.org/docs/swish-config.html#document_filter_directives
# ref: http://swish-e.org/docs/swish-faq.html#how_do_i_filter_documents_
# ref: http://www.swish-e.org/archive/2005-12/10202.html
# note that xls2csv is installed when catdoc is installed
   FileFilter .pdf pdftotext   "'%p' -"
   FileFilter .doc catdoc "-s8859-1 -d8859-1 %p"
   FileFilter .xls xls2csv "-s8859-1 -d8859-1 %p"

# Set StoreDescription for each parser to display context with search results
   StoreDescription TXT 200
   StoreDescription HTML <body> 200

# Document Source Directives - to strip off unwanted parts of the pathname
# Retaining the final slash is more intuitive ie ReplaceRules ends in ..server_dir not ...server_dir/
# but it depends on what your 'prepend' directive is set to, in swish.cgi (see below)
# ReplaceRules [replace|remove|prepend|append|regex]
   ReplaceRules remove /share/MD0_DATA/server_dir

# Ask libxml2 to report any parsing errors and warnings or any UTF-8 to 8859-1 conversion errors
# Set it to a lower level if desired, but be inquisitive about any reported errors
   ParserWarnLevel 9
   
# Set an IndexFile directive, otherwise the default index file name of index.swish-e will be used
# (per the swish-e docs, the default index file is created in the current directory of the indexing command)
# keep the index file in a separate directory to the config file
# (swish_1.index will be created automatically, ie you don't need to 'touch' or otherwise pre-create this file)
   IndexFile /share/MD0_DATA/swish-e-files/swish-e-index/swish_1.index


# Keep instructions in this .conf file,  on how to index eg:
# swish-e -c /share/MD0_DATA/swish-e-files/swish-e-conf/swish_1.conf

# To search via command line, specify the .index file and specify the search word(s) via the -w option  eg:
# swish-e -f /share/MD0_DATA/swish-e-files/swish-e-index/swish_1.index -w whatever-word or phrase
# For Web based searching from a separate workstation using this particular index file, refer to the swish.cgi section below
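
Since the FileFilter lines above rely on external programs, a quick pre-flight check before indexing can save a confusing run. A minimal sketch (the helper name `missing_tools` is my own invention):

```shell
#!/bin/sh
# Print the name of every listed program that is NOT on PATH.
# missing_tools is a hypothetical helper name.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Programs referenced by the swish_1.conf shown in this step
missing=$(missing_tools swish-e pdftotext catdoc xls2csv)
if [ -n "$missing" ]; then
    echo "Not on PATH:" $missing
else
    echo "All helper programs found"
fi
```

If anything is reported missing, either install it or remove the matching extension from IndexOnly before running the indexing command.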





Step 5 'Spidering' method of indexing
This is a more complex method, but it allows web-based indexing of any server on your network or, for that matter, external servers on the world wide web. The spider finds documents by following links, so indexing non-html documents that are not linked from a page requires directory listings on the source server (ie the server to be indexed), which in apache means directives such as 'Options +Indexes'.

So as in Step 4, set up a .conf file, for example /share/MD0_DATA/swish-e-files/swish-e-conf/spider_1.conf, with contents:
Code: Select all
# Use spider.pl for indexing (location of spider.pl set at installation time)
   IndexDir spider.pl
# Reference the spider.config file (see Step 6)
   SwishProgParameters /share/MD0_DATA/swish-e-files/swish-e-conf/spider.config

# Only the following type of files
   IndexOnly .htm .html .txt .doc .pdf .xls

# Tell Swish-e that .txt files are to use the text parser - also .xls to use it
# This seemed to result in fewer error messages when indexing .xls files
   IndexContents TXT* .txt .xls
   
# Otherwise, use the HTML parser
   DefaultContents HTML*

# FileRules is only available for the -S fs feature -->modify the test_url regex in spider.config instead

# Likewise, FileFilter ie file filtering is handled by the...
# 'my ($filter_sub, $response_sub ) = swish_filter()' command in ./spider.config


# Document Source Directives - as per Step 4
# Set up a virtual apache server on port 104 of the host server (see Step 7)
# Add additional directives for each additional source server eg ReplaceRules http://www.example.com.au
   ReplaceRules remove http://localhost:104

# Additional directives
   MetaNames swishtitle swishdocpath

# Set StoreDescription for each parser to display context with search results
   StoreDescription TXT 200
   StoreDescription HTML <body> 200

# Specify index file location, as per Step 4
   IndexFile /share/MD0_DATA/swish-e-files/swish-e-index/swish_2.index

#to run this index via command line, do:
# swish-e -S prog -c /share/MD0_DATA/swish-e-files/swish-e-conf/spider_1.conf

# to test it via command line run:
# swish-e -f /share/MD0_DATA/swish-e-files/swish-e-index/swish_2.index -w whatever-word or phrase




Step 6. Spider.config configuration
The reference here is: http://swish-e.org/docs/spider.html
As a test, and to limit the spider, try spidering just one file, eg include the following config in spider.config
Code: Select all
base_url    => 'http://localhost:104/_docs/test3.doc', # this 'spiders' and indexes everything anyway

Note also that specifying a directory does not limit spidering to its subdirectories, ie it 'spiders' everything available on the server!
Code: Select all
base_url    => 'http://localhost:104/_docs/test/', # I found that this will 'spider' and index all parent and sub directories of /test/


Note that you must have the proprietary indexing filters, eg pdftotext or catdoc, installed for indexing other than html or txt files, as per Step 2.3

Here are the contents of my spider.config file, which is referenced by the spider_1.conf file created in Step 5:
Code: Select all
my ($filter_sub, $response_sub ) = swish_filter();

@servers = (
    {
        base_url          => 'http://localhost:104/',
        email             => 'swish@user.failed.to.set.email.invalid',
        link_tags         => [qw/ a frame /],
        keep_alive        => 1,
        # test_url modified to also exclude indexing of the files listed
        test_url          => sub { $_[0]->path !~ /\.(?:gif|tif|jpeg|jpg|png|zip|gz|tar)$/i },
        test_response     => $response_sub,
        use_head_requests => 1,  # Due to the response sub
        filter_content    => $filter_sub,
        # debug could be limited to eg => 'failed',
        debug             => 'errors, failed, headers, info, links, redirect, skipped, url',
    } );




Step 7. Setting up Localhost:104 on the QNAP
This is necessary for the spidering approach (refer steps 5 and 6)
Note that selecting port 104 is entirely arbitrary...it could be any free port other than the so-called 'reserved ports'. In fact, port 104 can cause problems with Firefox, which restricts access to it. To overcome the restriction, type 'about:config' in the Firefox address bar, scroll down to:
Code: Select all
network.security.ports.banned.override

and add the new port or port range to be overridden
(if network.security.ports.banned.override preference is not present, it can be added via the 'New' command)

Ideally, find the designated apache virtual host conf file and add the information below to it.
On my qnap this was located in /mnt/HDA_ROOT/.config/apache/extra/httpd-vhosts-user.conf
After making the changes, restart apache via command line. On my qnap I used the command:
Code: Select all
/etc/init.d/Qthttpd.sh restart
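
Because a mistake in this file can stop apache from starting, it may be worth keeping a dated copy before each edit. A minimal sketch, assuming the file location given above (the `backup_file` helper is my own naming):

```shell
#!/bin/sh
# Copy a file to <name>.<timestamp>.bak and print the backup's path.
# backup_file is a hypothetical helper name.
backup_file() {
    src=$1
    [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }
    dest="$src.$(date +%Y%m%d-%H%M%S).bak"
    cp -p "$src" "$dest" && echo "$dest"
}

# Location of the vhosts file on my qnap; it may differ on other models.
VHOSTS=/mnt/HDA_ROOT/.config/apache/extra/httpd-vhosts-user.conf
backup_file "$VHOSTS" || echo "skipped backup"
```

If apache fails to restart after an edit, copy the most recent .bak file back over the original and restart again.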


Then you should be able to view the files in /server_dir from a workstation on the LAN via
Code: Select all
http://192.168.x.y:104

where 192.168.x.y is the IP address of the QNAP

You could also check this step after installing the lynx browser in the QNAP. Install it via ipkg, then via command line on the QNAP type:
Code: Select all
   lynx http://localhost:104

and see if the directory structure and files are visible. Also try to follow the tree.

Similarly the swish.cgi search form should be visible on the workstation via:
Code: Select all
http://192.168.x.y:104/swish/swish.cgi

or, via lynx from within the QNAP,
Code: Select all
lynx http://localhost:104/swish/swish.cgi
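
If lynx is not installed, a non-interactive check with wget is another option. This is only a sketch: it assumes a wget that supports the --spider, --tries and --timeout options (GNU wget does), and `url_ok` is a name I made up.

```shell
#!/bin/sh
# Return success if the URL answers without an error status.
# ASSUMPTION: wget here is GNU wget (or compatible with these options).
url_ok() {
    wget -q --spider --tries=1 --timeout=5 "$1"
}

BASE=${BASE:-http://localhost:104}
for path in / /swish/swish.cgi; do
    if url_ok "$BASE$path"; then
        echo "OK   $BASE$path"
    else
        echo "FAIL $BASE$path"
    fi
done
```

A FAIL on the first URL points at the virtual host itself; a FAIL only on swish.cgi points at the Alias or ExecCGI settings.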


Contents of httpd-vhosts-user.conf:
Code: Select all
NameVirtualHost *:104

Listen 80
Listen 104
#
<VirtualHost _default_:80>
   DocumentRoot "/share/Web"
</VirtualHost>
#

<VirtualHost *:104>
   ServerName my_server
   DocumentRoot "/share/MD0_DATA/server_dir"
#
# The AddDefaultCharset directive, below, stopped swish hanging immediately after the following Warning:
# Unknown header line:.....err: External program failed to return required headers
# It only happened with a particular html file, probably bec it was poorly formed
# Refer http://swish-e.org/archive/2007-08/11559.html for more information
#
   AddDefaultCharset utf-8
#
# We need an Alias to the directory holding the swish.cgi file (see step 8)
# Use 'Alias' (plus the ExecCGI option below) rather than 'ScriptAlias'. Both directives are provided by mod_alias...check which modules are compiled in by: /usr/local/apache/bin/apache -l
   Alias /swish "/share/MD0_DATA/.qpkg/Optware/lib/swish-e"
#
# server_dir contains the files to be indexed...and of course add the +Indexes option
   <Directory "/share/MD0_DATA/server_dir">
   Allow from all
   Options +Indexes
   </Directory>
#
# the directory holding the swish.cgi file also needs to be made available to apache, and add the ExecCGI directive
   <Directory /share/MD0_DATA/.qpkg/Optware/lib/swish-e>
   Order allow,deny
   Allow from all
   Options +Indexes +ExecCGI
   </Directory>

</VirtualHost>
# restart apache via /etc/init.d/Qthttpd.sh restart





Step 8. Swish.cgi configuration
The location and function of this cgi file has been described above. All that needs to be done is a few tweaks. Note that this example used the swish_2.index. To repeat, it is referenced from:
Code: Select all
/share/MD0_DATA/.qpkg/Optware/lib/swish-e/swish.cgi

Of course, a copy could be made and used from a different directory, or it could be referenced via symlink

I only changed two lines, but many more configurable options exist. The first option was on line 157:
Code: Select all
swish_index     => '/share/MD0_DATA/swish-e-files/swish-e-index/swish_2.index',   


And on line 167:
Code: Select all
    prepend_path    => 'http://192.168.x.y:104', 

Do not append a slash after the 104 (ie do not end it as 104/) if you have used the ReplaceRules configuration as specified above, ie:
Code: Select all
ReplaceRules remove http://localhost:104
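
To see why the trailing slash matters, the following plain-shell simulation (it does not call swish-e) mimics how the two settings combine: the prefix is removed at index time and prepend_path is glued on at search time. The address 192.168.0.10 is a made-up example and the function names are my own.

```shell
#!/bin/sh
# Simulation only: mimics the effect of the two settings in plain shell.
removed='http://localhost:104'      # argument of "ReplaceRules remove"
prepend='http://192.168.0.10:104'   # ASSUMPTION: example prepend_path value

# Path as stored in the index after the prefix is removed
stored_path() {
    printf '%s\n' "${1#"$removed"}"
}

# Link as shown in the search results
result_link() {
    printf '%s%s\n' "$prepend" "$(stored_path "$1")"
}

result_link 'http://localhost:104/_docs/test3.doc'
# prints: http://192.168.0.10:104/_docs/test3.doc
```

The stored path keeps its leading slash (/_docs/test3.doc), so a prepend_path ending in 104/ would produce a double slash in every result link.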





Step 9. Cron
Refer: http://wiki.qnap.com/wiki/Add_items_to_crontab and scroll down to the bottom of the wiki till you see:
Method 1 bis
Then follow the directions. Note that /etc/config/crontab symlinks to:
Code: Select all
/mnt/HDA_ROOT/.config/crontab

This is the entry I added to cron, to have swish-e index at 5.00am each morning:
Code: Select all
0 5 * * * /opt/bin/swish-e -S prog -c /share/MD0_DATA/swish-e-files/swish-e-conf/web_2.conf > /dev/null


Load the changes and restart cron, as described in that part of the wiki
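
Sending the output to /dev/null hides any indexing errors, so an alternative is to have cron call a small wrapper that keeps a log. A minimal sketch follows; the log location and the `run_index` name are assumptions of mine, while the swish-e command is the one from my cron entry.

```shell
#!/bin/sh
# Wrapper for the nightly index run that keeps a timestamped log.
SWISH=${SWISH:-/opt/bin/swish-e}
CONF=${CONF:-/share/MD0_DATA/swish-e-files/swish-e-conf/web_2.conf}
# ASSUMPTION: hypothetical log location; pick any writable path.
LOG=${LOG:-/share/MD0_DATA/swish-e-files/swish-index.log}

# Run the indexer, appending its output and exit status to the log
run_index() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') indexing started" >> "$LOG"
    "$SWISH" -S prog -c "$CONF" >> "$LOG" 2>&1
    status=$?
    echo "$(date '+%Y-%m-%d %H:%M:%S') indexing finished, exit status $status" >> "$LOG"
    return "$status"
}

if [ -x "$SWISH" ]; then
    run_index
else
    echo "swish-e not found at $SWISH" >&2
fi
```

The cron entry would then point at wherever this script is saved instead of calling swish-e directly.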



Step 10. Help can be obtained from the swish-e User Discussion List.
Don't underestimate this!


Step 11. Using Swish from a remote location
SSH into your qnap and use lynx to access swish.cgi as described in Step 7, or configure swish.cgi as follows:
Code: Select all
prepend_path    => 'http://your.domain.com:104',

or
Code: Select all
   prepend_path    => 'http://your.external.ip.address:104', 

(this might need some tweaking, and of course adding apache security to the virtual host directives may be needed. Port forwarding may also be needed.)



Step 12. Motif (refers back to Step 2.3)
Obtain motif:
Code: Select all
wget http://www.openmotif.org/files/public_downloads/openmotif/2.3/2.3.3/openmotif-2.3.3.tar.gz

Motif had its own dependencies, as indicated in the motif config script:
Code: Select all
cd /share/MD0_DATA/misc/openmotif-2.3.3
./configure  --with-freetype-includes=/opt/include/freetype2 --with-freetype-lib=/opt/lib --with-freetype-config=/opt/bin --with-libjpeg-includes=/opt/include --with-libjpeg-lib=/opt/lib --with-libpng-includes=/opt/include/libpng15 --with-libpng-lib=/opt/lib --without-x --prefix=/opt

(despite the '--without-x', 'make' failed due to missing X11/Xos.h at this line:
Code: Select all
makestrs.c:51:21: fatal error: X11/Xos.h: No such file or directory


and prior to that, despite the above options and attempted variations, motif ./configure would not recognise freetype:
Code: Select all
checking freetype/freetype.h usability... no
checking freetype/freetype.h presence... no
checking for freetype/freetype.h... no   )
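
For anyone retrying this, it may help to first confirm whether the headers named in the two errors exist where the configure options point. A rough sketch; the X11 location under /opt/include is purely a guess on my part, and the helper names are my own.

```shell
#!/bin/sh
# Succeed if header $2 exists relative to include directory $1.
# have_header and check are hypothetical helper names.
have_header() {
    [ -f "$1/$2" ]
}

check() {
    if have_header "$1" "$2"; then
        echo "found   $1/$2"
    else
        echo "missing $1/$2"
    fi
}

# Include directory as passed to the motif ./configure above
check /opt/include/freetype2 freetype/freetype.h
# ASSUMPTION: /opt/include is only a guess at where X11 headers would live
check /opt/include X11/Xos.h
```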

As I said, I could not compile Motif, but xpdf compiled without it and pdftotext seems to work. It may be useful to have freetype installed however, so if any users are successful in compiling it, please share this information in the forum!

So that's it...Happy Indexing!
:D
Last edited by michaeldaly600 on Thu Aug 02, 2012 9:37 am, edited 2 times in total.

Re: swish-e indexing software - HOWTO

Postby pnap888 » Mon Apr 09, 2012 7:39 pm

well done '600, this is a tool that is long overdue for qnap users.

You have gone to a lot of trouble...an easier way may have been to install the ipkg version of the full debian installation, then install swish-e on the debian (fingers crossed!). The *spidering* option could then be used to index what's on the qnap from the debian.

Also, another option for 'remote' access is of course the command line search option (but of course without the option of accessing the file)
Rgds
'pnap'

