Here are the Steps:
Step 1. Understand what swish-e is designed for, how it is used and whether it will fit your requirements. On its own, Swish-e will index .html and .txt documents on your file server and, with easy-to-install additional software, .doc, .xls and .pdf documents as well. Command line configuration is required.
Main reference: http://www.swish-e.org/
Step 2. Dependencies
Refer http://swish-e.org/docs/install.html#sy ... quirements.
I installed all dependencies with the --prefix=/opt configuration directive, to be consistent with the Optware Chroot strategy.
GCC and GNU 'make' are required. But using the ipkg version of GCC, I repeatedly got libtool 'object name conflicts' failures during the 'make' phase when attempting to compile swish-e-2.4.7 :
- Code: Select all
libtool: link: ERROR: object name conflicts: .libs/libswish-e.lax/libreplace.a//share/MD0_DATA/misc/swish-e-2.4.7/src/replace/.libs/libreplace.a
make[3]: *** [libswish-e.la] Error 1
make[3]: Leaving directory `/share/MD0_DATA/misc/swish-e-2.4.7/src'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/share/MD0_DATA/misc/swish-e-2.4.7/src'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/share/MD0_DATA/misc/swish-e-2.4.7/src'
make: *** [all-recursive] Error 1
After taking advice from various fora, on my qnap 219P+ the effort boiled down to *updating* gcc; refer to my other posting on this topic at:
viewtopic.php?p=248944
Note that Libtool was updated immediately prior to updating GCC. I subsequently wondered if updating only libtool might be sufficient to install swish-e, ie only doing steps 1 to 6 of the GCC update, as listed in the described posting. Any feedback on this would be useful.
I also replaced perl (see the other posting), as I wanted indexing via the spidering (perl based) approach (see below). Updating Perl may not be essential...again, please post any relevant feedback here.
edit: I recall perl refused to install the additional modules required by swish-e; whether steps 1 to 6 are enough to overcome this, I can't say.
2.1 libxml2 dependency
Reference: http://swish-e.org/docs/install.html#item_libxml2
libxml2 is an 'optional' package, in that having it will increase parsing accuracy. The ipkg version had previously been installed on my qnap, and I did not update it as it seemed to be the latest version. If it's no longer the latest version, it could be updated via compilation from source, as long as there are no conflicts with other packages that depend on libxml2. The latter can be determined via the command:
- Code: Select all
ipkg whatdepends libxml2
If other libxml2 dependencies are found, it may still be possible to install a second version of libxml2 in a separate location to the existing version and compile swish-e with reference to this second libxml2 version during the configuration stage (I would suggest to check the swish-e ./configure options and/or take note of the advice to ensure the /bin directory containing the relevant 'xml2-config' binary is in your $PATH before building Swish-e).
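As a rough sketch of the $PATH advice, assuming the second libxml2 copy went into a hypothetical prefix of /opt/libxml2-alt (that path is only my placeholder, not something from the swish-e docs), you could put its bin directory first in $PATH and confirm which xml2-config will be found before running the swish-e ./configure:

```shell
# ALT_XML2_PREFIX is a hypothetical location for the second libxml2 build
ALT_XML2_PREFIX="${ALT_XML2_PREFIX:-/opt/libxml2-alt}"
# Put its bin directory ahead of everything else so its xml2-config is found first
PATH="$ALT_XML2_PREFIX/bin:$PATH"
export PATH
# Show which xml2-config would now be picked up, if any
command -v xml2-config || echo "xml2-config not found in PATH"
```

The change only lasts for the current shell session, so the existing ipkg libxml2 remains the default everywhere else.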
2.2 Zlib Compression
Reference - http://swish-e.org/docs/install.html#item_zlib
As with libxml2, it was already installed on my qnap. It was not the latest version but I did not update it because of dependencies. A similar update process to that described in Step 2.1 above could be followed. To list the dependencies of this ipkg installed package, run:
ipkg whatdepends zlib
2.3 Perl Modules
Reference: http://swish-e.org/docs/install.html#item_perl
I installed all the recommended modules, via cpan directly, via the command:
- Code: Select all
cpan
then
- Code: Select all
install ...
You can also do it all on one line, eg:
- Code: Select all
perl -MCPAN -e 'install Bundle::LWP'
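To see afterwards whether the modules really did install, something along these lines may help. It is only a sketch: check_perl_module is a helper name I made up, and the module names in the loop are examples, so take the real list from the swish-e install document.

```shell
check_perl_module() {
    # Try to load the module; perl exits non-zero if it cannot be found
    if perl -M"$1" -e 1 2>/dev/null; then
        echo "ok      $1"
    else
        echo "MISSING $1"
    fi
}

# Example module names only; substitute the ones the install document lists
for mod in LWP URI HTML::Parser HTML::Tagset; do
    check_perl_module "$mod"
done
```

Anything reported as MISSING can then be retried individually from the cpan prompt.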
2.3 Dependencies for indexing the various proprietary documents
Reference - http://swish-e.org/docs/install.html#item_indexing
I followed the advice listed in the swish-e install document for indexing MS Word, pdf, and Excel documents. Note that xpdf and catdoc are available via ipkg, though I elected to compile these from source, installing in the /opt directory (use ./configure with the --prefix=/opt)
My configure script for xpdf was:
./configure --with-freetype2-library=/opt/lib --with-freetype2-includes=/opt/include/freetype2 --prefix=/opt
(nb must have compiled the 'freetype' library beforehand)
Also note that at the end of the xpdf configure (ie prior to beginning 'make'), there was an alert that 'Motif' was missing. Several attempts to compile 'Motif' failed and xpdf compiled without it, though it may be worth trying on other setups. Refer to Step 12 below.
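Before moving on, it may be worth confirming that the converter programs can actually be found in $PATH, since swish-e calls them by name when filtering. A minimal sketch (check_tool is just a helper name I chose for illustration):

```shell
check_tool() {
    # Print the full path of a helper program, or a warning if it is not in $PATH
    if command -v "$1" >/dev/null 2>&1; then
        echo "found   $1 -> $(command -v "$1")"
    else
        echo "missing $1"
    fi
}

# pdftotext comes with xpdf; catdoc and xls2csv come with catdoc
for tool in pdftotext catdoc xls2csv; do
    check_tool "$tool"
done
```

If a tool shows as missing even though it was compiled with --prefix=/opt, check that /opt/bin is in the $PATH of the user who will run the indexing.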
2.4 Installing PCRE
In order to avoid this warning when configuring swish-e:
- Code: Select all
Not building with perl compatible regex - use --with-pcre to enable
You may need to install pcre (Perl Compatible Regular Expressions). It can be installed via ipkg, or to get the latest version use:
- Code: Select all
wget ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/pcre-8.20.tar.gz
and install it in /opt with this configure script:
- Code: Select all
./configure --enable-utf8 --prefix=/opt
Run 'make' then 'make check'. With 'make check' there was no output for test 3, which is the perl test, other than the paragraph below referencing 'locale specific features...':
- Code: Select all
Cannot test locale-specific features - none of the 'fr_FR', 'fr' or 'french' locales exist, or the "locale" command is not available to check for them.
Also note that compiling with the utf support (as specified in the configure script above) will avoid the 'UTF-8 support is not available' warnings when running 'make check'
Step 3. Installing Swish-e
Reference - http://swish-e.org/docs/install.html#building_swish_e
Presuming swish-e-2.4.7 has been downloaded and unpacked to the directory shown below, configure, make, check and install it:
/share/MD0_DATA/misc/swish-e-2.4.7/configure --with-pcre=/opt --prefix=/opt
(follow the swish-e installation advice)
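The whole sequence might be wrapped up roughly as follows. This is a sketch only: run, build_swish_e and the DRY_RUN variable are names I invented, and the configure options are simply the ones I used above. With DRY_RUN set, the commands are printed rather than executed, which is a cheap way to check them first.

```shell
# SRC_DIR is where the tarball was unpacked (the path used in this post)
SRC_DIR="${SRC_DIR:-/share/MD0_DATA/misc/swish-e-2.4.7}"

run() {
    # With DRY_RUN set, only print the command instead of executing it
    if [ -n "${DRY_RUN:-}" ]; then echo "+ $*"; else "$@"; fi
}

build_swish_e() {
    # Each step runs only if the previous one succeeded
    run cd "$SRC_DIR" &&
    run ./configure --with-pcre=/opt --prefix=/opt &&
    run make &&
    run make check &&
    run make install
}

# Preview the commands:   (DRY_RUN=1; build_swish_e)
# Do the real build:      build_swish_e
```

Chaining with && means a failed 'make' or 'make check' stops things before anything is installed into /opt.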
Step 4. Setup and configure an index file(s)
There are two ways to index. Here in step 4 I will describe the simpler method, which is the indexing of local files. This is known as the File Access method. The other method is the 'spidering' method. More than one index can be configured and used, of course. Swish-e install directions advise not to execute either method as root, although that is what I am doing for the moment. Although the information below is available in the swish-e installation and other areas, it can be confusing, so I am going to do my best to make it as clear as possible here. Also, my various .conf, .config and other files are attached.
For indexing via the File Access method, create a file named swish_1.conf and save it in, for example, /share/MD0_DATA/swish-e/config/
All the directives below are described in detail in:
http://www.swish-e.org/docs/swish-config.html
and note that the first section applies to both indexing methods, and there is a later section that applies to specific indexing types.
There is more information in this forum posting:
Ref: http://www.swish-e.org/archive/2005-12/10202.html
So, here are the contents of /share/MD0_DATA/swish-e/config/swish_1.conf :
- Code: Select all
# Tell Swish-e what to index (same as -i switch above)
IndexDir /share/MD0_DATA/server_dir
# Index only the following type of files
IndexOnly .htm .html .txt .doc .pdf .xls
# Tell Swish-e that .txt files are to use the text parser.
IndexContents TXT* .txt
# Otherwise, use the HTML parser
DefaultContents HTML*
#Other config options
MetaNames swishdocpath swishtitle
# add the filters to enable indexing of more than .txt and .html type documents
# ref: http://swish-e.org/docs/swish-config.html#document_filter_directives
# ref: http://swish-e.org/docs/swish-faq.html#how_do_i_filter_documents_
# ref: http://www.swish-e.org/archive/2005-12/10202.html
# note that xls2csv is installed when catdoc is installed
FileFilter .pdf pdftotext "'%p' -"
FileFilter .doc catdoc "-s8859-1 -d8859-1 %p"
FileFilter .xls xls2csv "-s8859-1 -d8859-1 %p"
# Set StoreDescription for each parser to display context with search results
StoreDescription TXT 200
StoreDescription HTML <body> 200
# Document Source Directives - to strip off unwanted parts of the pathname
# Retaining the final slash is more intuitive ie ReplaceRules ends in ..server_dir not ...server_dir/
# but it depends on what your 'prepend' directive is set to, in swish.cgi (see below)
# ReplaceRules [replace|remove|prepend|append|regex]
ReplaceRules remove /share/MD0_DATA/server_dir
# Ask libxml2 to report any parsing errors and warnings or any UTF-8 to 8859-1 conversion errors
# Set it to a lower level if desired, but be inquisitive about any reported errors
ParserWarnLevel 9
# Set an IndexFile directive, otherwise the default index file name of index.swish-e will be used
# I presume the default index file location would be the $HOME of the user invoking the indexing command
# keep the index file in a separate directory to the config file
# (swish_1.index will be created automatically ie you don't need to 'touch' or otherwise pre-create this file)
IndexFile /share/MD0_DATA/swish-e-files/swish-e-index/swish_1.index
# Keep instructions in this .conf file, on how to index eg:
# swish-e -c /share/MD0_DATA/swish-e-files/swish-e-conf/swish_1.conf
# To search via command line, specify the .index file and specify the search word(s) via the -w option eg:
# swish-e -f /share/MD0_DATA/swish-e-files/swish-e-index/swish_1.index -w whatever-word or phrase
# For Web based searching from a separate workstation using this particular index file, refer the swish.cgi section below
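Although the index file itself does not need to be pre-created, I have not checked whether swish-e will create a missing directory for it, so making the directory beforehand is a harmless precaution. A sketch of how that could be scripted (index_file_from_conf and prepare_index_dir are hypothetical helper names, and the awk parsing assumes the path contains no spaces):

```shell
index_file_from_conf() {
    # Print the path given on the first IndexFile line of a swish-e config file
    # (commented-out lines are skipped because their first field is not IndexFile)
    awk 'tolower($1) == "indexfile" { print $2; exit }' "$1"
}

prepare_index_dir() {
    # Create the directory that will hold the index, if it is not there yet
    idx=$(index_file_from_conf "$1")
    [ -n "$idx" ] && mkdir -p "$(dirname "$idx")"
}

# Example use, followed by the indexing command from the comments above:
#   prepare_index_dir /share/MD0_DATA/swish-e-files/swish-e-conf/swish_1.conf
#   swish-e -c /share/MD0_DATA/swish-e-files/swish-e-conf/swish_1.conf
```

Reading the path out of the .conf file keeps the directory and the IndexFile directive from drifting apart if one of them is edited later.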
Step 5 'Spidering' method of indexing
This is a more complex method, but it allows web-based indexing of any network server or for that matter, extranet servers on the world wide web. The indexing of non-html documents requires certain apache directives such as 'Options +Indexes' to be set on the source server ie the server to be indexed.
So as in Step 4, set up a .conf file, for example /share/MD0_DATA/swish-e/config/spider_1.conf, with contents:
- Code: Select all
# Use spider.pl for indexing (location of spider.pl set at installation time)
IndexDir spider.pl
# Reference the spider.config file (see step below)
SwishProgParameters /share/MD0_DATA/swish-e-files/swish-e-conf/spider.config
# Only the following type of files
IndexOnly .htm .html .txt .doc .pdf .xls
# Tell Swish-e that .txt files are to use the text parser - also .xls to use it
# This seemed to result in less error messages when indexing .xls files
IndexContents TXT* .txt .xls
# Otherwise, use the HTML parser
DefaultContents HTML*
# FileRules is only available for the -S fs feature -->modify the test_url regex in spider.config instead
# Likewise, FileFilter ie file filtering is handled by the...
# 'my ($filter_sub, $response_sub ) = swish_filter()' command in ./spider.config
# Document Source Directives - as per Step 4
# Set up a virtual apache server on port 104 of the host server (see Step below)
# Add additional directives for each additional source server eg ReplaceRules http://www.example.com.au
ReplaceRules remove http://localhost:104
# Additional directives
Metanames swishtitle swishdocpath
# Set StoreDescription for each parser to display context with search results
StoreDescription TXT 200
StoreDescription HTML <body> 200
# Specify index file location, as per Step 4
IndexFile /share/MD0_DATA/swish-e-files/swish-e-index/swish_2.index
#to run this index via command line, do:
# swish-e -S prog -c /share/MD0_DATA/swish-e-files/swish-e-conf/spider_1.conf
# to test it via command line run:
# swish-e -f /share/MD0_DATA/swish-e-files/swish-e-index/swish_2.index -w whatever-word or phrase
Step 6. Spider.config configuration
The reference here is: http://swish-e.org/docs/spider.html
As a test, and to limit the spider, try spidering just one file, eg include the following config in spider.config
- Code: Select all
base_url => 'http://localhost:104/_docs/test3.doc', # this 'spiders' and indexes everything anyway
Otherwise, specifying a directory does not limit spidering to its subdirectories, ie it 'spiders' everything available on the server!
- Code: Select all
base_url => 'http://localhost:104/_docs/test/', # I found that this will 'spider' and index all parent and sub directories of /test/
Note that you must have the proprietary indexing filters, eg pdftotext or catdoc, installed for indexing other than html or txt files, as per Step 2.3
Here are the contents of my spider.config file, which is referenced by the spider_1.conf file created in Step 5:
- Code: Select all
my ($filter_sub, $response_sub ) = swish_filter();
@servers = (
{
base_url => 'http://localhost:104/',
email => 'swish@user.failed.to.set.email.invalid',
link_tags => [qw/ a frame /],
keep_alive => 1,
#test_url modified to also exclude indexing of the files listed
test_url => sub { $_[0]->path !~ /\.(?:gif|tif|jpeg|jpg|png|zip|gz|tar)$/i },
test_response => $response_sub,
use_head_requests => 1, # Due to the response sub
filter_content => $filter_sub,
#debug could be limited to eg => 'failed',
debug => 'errors, failed, headers, info, links, redirect, skipped, url',
} );
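To preview which URL paths the test_url line would let through, the same extension list can be tried from the shell. This is only an approximation: grep -E has no non-capturing (?:...) groups so a plain group is used, the function is given just the path part of the URL, and would_spider is a name I made up.

```shell
would_spider() {
    # Succeeds (exit 0) when the path does NOT end in an excluded extension;
    # -i makes the match case-insensitive, like the /i flag in spider.config
    ! printf '%s\n' "$1" | grep -Eiq '\.(gif|tif|jpeg|jpg|png|zip|gz|tar)$'
}

# A few sample paths (the last two are invented for illustration)
for p in /_docs/test3.doc /images/logo.PNG /backup/site.tar.gz; do
    if would_spider "$p"; then echo "spider $p"; else echo "skip   $p"; fi
done
```

Trying a handful of real paths from your own server this way is quicker than re-running the spider each time the extension list changes.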
Step 7. Setting up Localhost:104 on the QNAP
This is necessary for the spidering approach (refer steps 5 and 6)
Note that selecting port 104 is entirely arbitrary...it could be any port other than the so called 'reserved ports'. In fact, port 104 can cause problems with firefox in that it is restricted. To overcome the restriction, you need to type 'about:config' in the firefox address bar, scroll down to:
- Code: Select all
network.security.ports.banned.override
and add the new port or port range to be overridden
(if network.security.ports.banned.override preference is not present, it can be added via the 'New' command)
Ideally, find the designated apache virtual host conf file and add the information below to it.
On my qnap this was located in /mnt/HDA_ROOT/.config/apache/extra/httpd-vhosts-user.conf
After making the changes, restart apache via command line. On my qnap I used the command:
- Code: Select all
/etc/init.d/Qthttpd.sh restart
Then you should be able to view the files in /server_dir from a workstation on the LAN via
- Code: Select all
http://192.168.x.y:104
where 192.168.x.y is the IP address of the QNAP
You could also check this step after installing the lynx browser in the QNAP. Install it via ipkg, then via command line on the QNAP type:
- Code: Select all
lynx http://localhost:104
and see if the directory structure and files are visible. Also try to follow the tree.
Similarly the swish.cgi search form should be visible on the workstation via:
- Code: Select all
http://192.168.x.y:104/swish/swish.cgi
or, via lynx from within the QNAP,
- Code: Select all
lynx http://localhost:104/swish/swish.cgi
Contents of: ./httpd-vhosts-user.conf begin now:
- Code: Select all
NameVirtualHost *:104
Listen 80
Listen 104
#
<VirtualHost _default_:80>
DocumentRoot "/share/Web"
</VirtualHost>
#
<VirtualHost *:104>
ServerName my_server
DocumentRoot "/share/MD0_DATA/server_dir"
#
# The AddDefaultCharset directive, below, stopped swish hanging immediately after the following Warning:
# Unknown header line:.....err: External program failed to return required headers
# It only happened with a particular html file, probably bec it was poorly formed
# Refer http://swish-e.org/archive/2007-08/11559.html for more information
#
AddDefaultCharset utf-8
#
# We need an Alias to the directory holding the swish.cgi file (see step 8)
# Use 'Alias' rather than 'ScriptAlias' as the mod_alias apache module is not installed...check by: /usr/local/apache/bin/apache -l
Alias /swish "/share/MD0_DATA/.qpkg/Optware/lib/swish-e"
#
# server_dir contains the files to be indexed...and of course add the +Indexes option
<Directory "/share/MD0_DATA/server_dir">
Allow from all
Options +Indexes
</Directory>
#
# the directory holding the swish.cgi file also needs to be made available to apache, and add the ExecCGI directive
<Directory /share/MD0_DATA/.qpkg/Optware/lib/swish-e>
Order allow,deny
Allow from all
Options +Indexes +ExecCGI
</Directory>
</VirtualHost>
# restart apache via /etc/init.d/Qthttpd.sh restart
Step 8. Swish.cgi configuration
The location and function of this cgi file has been described above. All that needs to be done is a few tweaks. Note that this example used the swish_2.index. To repeat, it is referenced from:
- Code: Select all
/share/MD0_DATA/.qpkg/Optware/lib/swish-e/swish.cgi
Of course, a copy could be made and used from a different directory, or it could be referenced via symlink
I only changed two lines, but many more configurable options exist. The first option was on line 157:
- Code: Select all
swish_index => '/share/MD0_DATA/swish-e-files/swish-e-index/swish_2.index',
And on line 167:
- Code: Select all
prepend_path => 'http://192.168.x.y:104',
Do not append a slash after the 104 (ie do not end it as 104/) if you have used the ReplaceRules configuration specified above, ie:
- Code: Select all
ReplaceRules remove http://localhost:104
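The reason can be shown with plain string handling in the shell. This only mimics what I understand the two settings to do (strip the prefix at index time, then put prepend_path in front of the stored path at search time); it does not call swish-e at all.

```shell
spidered_url='http://localhost:104/_docs/test3.doc'
# What "ReplaceRules remove http://localhost:104" leaves in the index
stored_path=${spidered_url#http://localhost:104}
# The link built with prepend_path set as above (no trailing slash)
echo "http://192.168.x.y:104${stored_path}"
# With a trailing slash on prepend_path the link gains a double slash
echo "http://192.168.x.y:104/${stored_path}"
```

The stored path already begins with a slash, which is why prepend_path should not end with one.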
Step 9. Cron
Refer: http://wiki.qnap.com/wiki/Add_items_to_crontab and scroll down to the bottom of the wiki till you see:
Method 1 bis
Then follow the directions. Note that /etc/config/crontab symlinks to:
- Code: Select all
/mnt/HDA_ROOT/.config/crontab
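This is the entry I added to cron, to have swish-e index at 5.00am each morning (web_2.conf is the name of my own spidering .conf file; substitute yours, eg spider_1.conf from Step 5):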
- Code: Select all
0 5 * * * /opt/bin/swish-e -S prog -c /share/MD0_DATA/swish-e-files/swish-e-conf/web_2.conf > /dev/null
Load the changes and restart cron, as described in that part of the wiki
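If you would rather script the edit than do it by hand, a small helper like this adds the line only when it is not already there, so running it twice does no harm. It is a sketch: add_cron_entry is a hypothetical name, and it assumes your grep supports the -F, -x and -q options.

```shell
add_cron_entry() {
    # $1 = crontab file to edit, $2 = the complete crontab line to add
    # -F treats the line literally (so the * fields are not patterns),
    # -x matches whole lines only, -q suppresses output
    if grep -Fqx -- "$2" "$1" 2>/dev/null; then
        echo "entry already present in $1"
    else
        printf '%s\n' "$2" >> "$1"
        echo "entry added to $1"
    fi
}

# Example use with the entry above:
#   add_cron_entry /etc/config/crontab \
#     '0 5 * * * /opt/bin/swish-e -S prog -c /share/MD0_DATA/swish-e-files/swish-e-conf/web_2.conf > /dev/null'
```

The changes still need to be loaded and cron restarted afterwards, as the wiki describes.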
Step 10. Help can be obtained from the swish-e User Discussion List.
Don't underestimate this!
Step 11. Using Swish from a remote location
Ssh into your qnap and use lynx to access swish.cgi as described in Step 7, or configure swish.cgi as follows:
- Code: Select all
prepend_path => 'http://your.domain.com:104',
or
- Code: Select all
prepend_path => 'http://your.external.ip.address:104',
(this might need some tweaking and of course adding apache security to the virtual host directives may be needed. Ditto Port forwarding.)
Step 12. Motif (refers back to Step 2.3)
Obtain motif:
- Code: Select all
wget http://www.openmotif.org/files/public_downloads/openmotif/2.3/2.3.3/openmotif-2.3.3.tar.gz
Motif had its own dependencies, as indicated in the motif config script:
- Code: Select all
cd /share/MD0_DATA/misc/openmotif-2.3.3
./configure --with-freetype-includes=/opt/include/freetype2 --with-freetype-lib=/opt/lib --with-freetype-config=/opt/bin --with-libjpeg-includes=/opt/include --with-libjpeg-lib=/opt/lib --with-libpng-includes=/opt/include/libpng15 --with-libpng-lib=/opt/lib --without-x --prefix=/opt
(despite the '--without-x' option, 'make' failed due to a missing X11/Xos.h at this line:
- Code: Select all
makestrs.c:51:21: fatal error: X11/Xos.h: No such file or directory
and prior to that, despite the above options and attempted variations, motif ./configure would not recognise freetype:
- Code: Select all
checking freetype/freetype.h usability... no
checking freetype/freetype.h presence... no
checking for freetype/freetype.h... no )
As I said, I could not compile Motif, but xpdf compiled without it and pdftotext seems to work. It may be useful to have freetype installed however, so if any users are successful in compiling it, please share this information in the forum!
So that's it...Happy Indexing!