Phine Solutions web work notes

Using Sphinx as a site search engine

Filed under: tools — 1.618 @ 6:17 pm

A website, especially a content-oriented one, needs good search functionality. This can be implemented locally or outsourced to a search engine like Google. The former obviously requires a lot of work in database design and coding, and the latter relies on Google to guess what you have and what your users are looking for. And for a site with information behind a “walled garden”, letting Google crawl and index the protected content is both impractical and insecure.

Sphinx is a search server that I believe provides a better approach to overcoming these issues. It handles the indexing by reading from a database (or files), and provides full-text search capability via standard APIs.

Overall, Sphinx is pretty easy to set up. Installation on a Linux server requires downloading the source code and going through the usual “make install” process. If you have installed software from source before, this should be easy. After installing the software, you also need to create a Sphinx configuration file to get Sphinx to work in your environment. This is where I ran into some issues, and I’ll share some of that experience in the rest of the post.

Basically, Sphinx adds two processes to your server: the indexer and the searchd service. The indexer is a process that should be kicked off periodically, at whatever frequency you wish, to index the data (mostly) from a database. Searchd, as the name suggests, is a daemon that listens on a port and handles search requests. The indexer can be scheduled with crontab, and searchd should run as a service, configured so that it starts automatically after a server reboot.
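As a rough sketch of the crontab side, the snippet below prints a crontab entry that would rebuild all indexes every 30 minutes. The schedule, the indexer path and the configuration path are assumptions for illustration; adjust them to your own install.

```shell
#!/bin/bash
# Sketch: print a crontab entry that re-indexes periodically.
# Assumptions: indexer lives in /usr/local/bin and the config file is under
# myuser's home directory. The 30-minute schedule is a hypothetical choice.
INDEXER="/usr/local/bin/indexer"
CONF="/home/myuser/sphinx/sphinx.conf"
SCHEDULE='*/30 * * * *'

cron_line() {
    # --all rebuilds every index in the config file; --rotate asks a running
    # searchd to switch over to the freshly built index files.
    printf '%s %s --config %s --all --rotate >/dev/null 2>&1\n' \
        "$SCHEDULE" "$INDEXER" "$CONF"
}

cron_line
```

The printed line can be pasted into the site owner's crontab (for example with crontab -e while logged in as that user).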

Adding a new service startup script in a Linux environment requires creating a new shell script and putting it somewhere under the /etc directory. Here is a sample script that I have for a CentOS server (it was quickly put together, so it’s pretty rough around the edges):

#!/bin/bash
##
# chkconfig: 345 55 35
# description: Sphinx search daemon
#

case "$1" in
    start)
        echo -n "Starting Sphinx searchd:"
        # Run searchd as the site owner so it shares file ownership with the indexer.
        sudo -u myuser /usr/local/bin/searchd --config /home/myuser/sphinx/sphinx.conf >> /dev/null 2>&1
        echo
        ;;
    stop)
        echo -n "Stopping Sphinx searchd:"
        /usr/local/bin/searchd --config /home/myuser/sphinx/sphinx.conf --stop >> /dev/null 2>&1
        echo
        ;;
    restart)
        # Re-invoke this script for each step.
        "$0" stop
        "$0" start
        ;;
    *)
        echo "usage: $0 [start|stop|restart]"
        ;;
esac
exit 0

Notice that I added the sudo command so searchd runs under “myuser”. This is because having the indexer and searchd run under different users can pose some issues.
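To have the script above start at boot on a CentOS-style system, it is typically copied into /etc/init.d and registered with chkconfig, which reads the "# chkconfig: 345 55 35" header (runlevels 3, 4 and 5, start priority 55, stop priority 35). The sketch below assumes the script was saved locally as ./searchd, which is a hypothetical file name. By default it only prints the commands, so it is safe to try.

```shell
#!/bin/bash
# Sketch: register an init script with chkconfig.
# Assumption: the startup script is saved as ./searchd (hypothetical name).
# RUN defaults to "echo", so the commands are printed rather than executed.
# Run the script as root with RUN="" to actually perform the installation.
RUN="${RUN-echo}"

install_searchd_service() {
    script="$1"
    $RUN cp "$script" /etc/init.d/searchd
    $RUN chmod 755 /etc/init.d/searchd
    $RUN chkconfig --add searchd     # create the rc?.d links from the header
    $RUN chkconfig --list searchd    # show the runlevels it is enabled for
}

install_searchd_service ./searchd
```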

At this point I want to go over my setup a little bit. The server that I have Sphinx installed on hosts several websites. A couple of users have been created, with different sites deployed under their respective home directories. Since I plan to use Sphinx on only one of the sites, I want the site owner “myuser” to own the indexer process and to keep the Sphinx data and log files locally, somewhere under myuser’s home directory. In this particular setup, if the searchd service is run by root, I run into permission issues.

First, searchd creates a *.spl file, which myuser doesn’t have read permission on. The indexer produces the following error even when the rotate option IS indeed present:

indexing index 'mydatabase_search'...
FATAL: failed to open /home/myuser/sphinx/data/mydatabase.spl: Permission denied, will not index. Try --rotate option.

Another issue is the ownership of the searchd.pid file. The indexer complains again if it can’t read it:

WARNING: failed to open pid_file '…/searchd.pid'.
WARNING: indices NOT rotated.

If you wonder why the indexer needs access to these files, it is because whenever the indexer finishes a run with the rotate option, it notifies searchd by sending it a SIGHUP signal, and it needs the process ID stored in searchd.pid to do so.
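The pid file and SIGHUP handshake can be illustrated without Sphinx at all. The sketch below uses a background subshell as a stand-in for searchd and a plain kill as a stand-in for the indexer; the file names and the "reload requested" message are made up for the demo and are not Sphinx output.

```shell
#!/bin/bash
# Demo of a pid-file + SIGHUP handshake with a stand-in daemon (no Sphinx needed).
workdir="$(mktemp -d)"
pid_file="$workdir/searchd.pid"   # hypothetical location for the demo
log_file="$workdir/daemon.log"

# Stand-in for searchd: install a SIGHUP handler, record our PID, then idle.
(
    trap 'echo "reload requested" >> "$log_file"' HUP
    # A child shell reports its parent's PID, which is the PID of this subshell.
    sh -c 'echo "$PPID"' > "$pid_file"
    while :; do sleep 0.1; done
) &

# Wait until the pid file has been written.
while [ ! -s "$pid_file" ]; do sleep 0.1; done

# Stand-in for the indexer: read the PID from the file and send SIGHUP.
daemon_pid="$(cat "$pid_file")"
kill -HUP "$daemon_pid"

# Give the handler a moment to run, then shut the stand-in daemon down.
sleep 0.5
kill "$daemon_pid"
wait "$daemon_pid" 2>/dev/null
cat "$log_file"
```

If the process sending the signal cannot read the pid file, it has no way to find the daemon, which matches the warning shown above. Sending a signal also generally requires the sender to be root or the same user as the target process.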

Now, these issues can be bypassed by changing the permissions of these files in the searchd startup script. But in the end I think using the sudo command is a cleaner solution. Ultimately, all Sphinx-related files, including the configuration and the process pid file, are stored locally and can be accessed easily by the “site owner”. The only drawback I can see in this approach is that when Sphinx is added to another site owned by a different user, there needs to be a searchd process for each site owner.

Since I’m still experimenting, this particular setup may not be the best solution. But hopefully this post can shed some light on certain issues that other people might run into.

Tools to help code deployment

Filed under: tools — 1.618 @ 12:39 pm

Depending on the technology you use to build your web sites, there are different ways to put your code out there. Since a lot of sites today are developed in PHP, updating probably means uploading a bunch of scripts to the server.

One way to do this is to upload the whole directory from the site root and switch the Apache site directory using a symbolic link. But if only a handful of files have been updated, it is really not necessary to upload everything every time. And oftentimes there are user-uploaded content and log files generated by the web server in the file system, which you don’t want to lose during the process.
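For reference, the symbolic link switch can be sketched in a few lines. The layout below (a releases directory with one folder per upload and a current link that Apache would point at) is a hypothetical example built in a temporary directory, not a description of any particular server.

```shell
#!/bin/bash
# Sketch: switch a "current" symlink between uploaded releases.
# The directory layout and names are hypothetical and live in a temp directory.
site_root="$(mktemp -d)"
mkdir -p "$site_root/releases/v1" "$site_root/releases/v2"
echo "version 1" > "$site_root/releases/v1/index.php"
echo "version 2" > "$site_root/releases/v2/index.php"

switch_release() {
    # -s: symbolic link, -f: replace an existing link,
    # -n: treat an existing link to a directory as a file instead of following it.
    ln -sfn "$site_root/releases/$1" "$site_root/current"
}

switch_release v1
switch_release v2
cat "$site_root/current/index.php"
```

Rolling back is then just another call such as switch_release v1, but uploads and logs stored inside a release directory would stay behind in the old release, which is the drawback described above.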

Another way, which I prefer, is the “sync” approach. Basically, I use a tool to compare my local development drive with the remote directory, and let the tool handle the remote copying and deleting.

“rsync” is a great tool from the *nix family that can sync two locations over ssh. Although it is command-line based, you can always write a simple script to automate it. But if you are developing on a Windows PC it might be a bit difficult. I used to run rsync under Cygwin on XP, and it does very well at backing up files from the remote server. However, because Cygwin and Windows handle file permissions differently, I had too many problems to commit to rsync as a deployment tool.
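For anyone deploying from a *nix workstation, a simple rsync wrapper might look like the sketch below. The local directory, the ssh target and the excluded folders are all hypothetical; the exclusions stand in for the user uploads and log files that should not be touched on the server.

```shell
#!/bin/bash
# Sketch of an rsync-over-ssh deployment. All names below are hypothetical.
LOCAL_DIR="./site/"                            # trailing slash: copy the directory's contents
REMOTE="myuser@example.com:/home/myuser/www/"  # hypothetical ssh target

deploy() {
    mode="$1"   # "print" shows the command, "preview" runs a dry run, "run" deploys
    # -a preserves permissions and times, -v is verbose, -z compresses in transit,
    # --delete removes remote files that no longer exist locally.
    # Excluded paths are neither uploaded nor deleted on the remote side.
    set -- rsync -avz --delete -e ssh --exclude 'uploads/' --exclude 'logs/'
    if [ "$mode" = "preview" ]; then
        set -- "$@" --dry-run
    fi
    set -- "$@" "$LOCAL_DIR" "$REMOTE"
    if [ "$mode" = "print" ]; then
        echo "$*"
    else
        "$@"
    fi
}

deploy print
```

Calling deploy preview first lists what would be copied or deleted without changing anything, which is a sensible habit before a real deploy run.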

For a long time I also used a tool called “Site Publisher” from helexis.com. It is a small FTP-based tool that lets you set up different site profiles and sync code from your local drive to a remote directory. It has worked very well for me, but a couple of issues made me look for a new one:

  • No sftp support
  • Since I have a firewall installed, I have to use “Active” mode for FTP, but Site Publisher seems to have some problems with this. It would hang during a transfer session, which is not acceptable for a production release.

Recently I found “InstantSync(TM)” from sitedesigner.com and have been very happy with it. It supports sftp, so I can probably shut down the FTP server on my host. It also supports multiple site profiles, and the file transmission has been rock solid. Although it costs $99, I think it is definitely worth the money.

During the search I also evaluated TurboFTP and SynchronEX, neither of which fit my needs. In my opinion, TurboFTP is primarily an FTP tool, so it has a lot of features that are great for FTP but not necessarily useful in my case. I just want something that is simple to use and does this one thing well. SynchronEX looks promising, but its user interface for setting up a site profile is hard for me to comprehend.

Disclaimer: The tool reviews above are based solely on personal experience, and I have no affiliation with the companies mentioned.

©phinesolutions.com