A website, especially a content-oriented one, needs good search functionality. This can be implemented locally or outsourced to a search engine like Google. The former requires a lot of work in database design and coding; the latter relies on Google to guess what you have and what your users are looking for. And for a site whose information sits behind a “walled garden”, letting Google crawl and index the protected content is either impossible or insecure.
Sphinx is a search server that I believe offers a better approach to these issues. It builds its indexes by reading from a database (or files) and provides full-text search through standard APIs.
Overall, Sphinx is pretty easy to set up. Installation on a Linux server means downloading the source code and going through the usual “make install” process. If you have installed software from source before, this should be easy. After installing, you also need to create a Sphinx configuration file to get Sphinx working in your environment. This is where I ran into some issues, and I’ll share some of that experience in the rest of the post.
Basically, Sphinx adds two processes to your server: the indexer and the searchd service. The indexer should be kicked off periodically, as often as you need, to index the data, which mostly comes from a database. Searchd, as the name suggests, is a daemon that listens on a port and handles search requests. The indexer can be scheduled with crontab, while searchd should run as a service, configured so that it starts automatically after a server reboot.
Adding a new service startup script in a Linux environment means creating a shell script and putting it somewhere under /etc (on CentOS, /etc/init.d). Here is a sample script that I use on a CentOS server (it was quickly put together, so it’s pretty rough around the edges):
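To give a feel for what that configuration file contains, here is a minimal sketch, written as a small shell function that drops a starter sphinx.conf into place. The index name and the data path follow the ones that show up later in this post. The function name, the source name, the database credentials, the SQL query, the port, and the log and pid file locations are placeholders made up for illustration, so adjust them to your own setup. Note that "listen" assumes Sphinx 0.9.9 or newer; older releases use "port" instead.

```shell
# Sketch only: writes a starter sphinx.conf to the path given as $1.
# "mydatabase_src", the sql_* values, the port and the log/pid paths
# are placeholders, not values taken from a real setup.
write_sample_sphinx_conf() {
    cat > "$1" <<'EOF'
source mydatabase_src
{
    type      = mysql
    sql_host  = localhost
    sql_user  = dbuser
    sql_pass  = dbpass
    sql_db    = mydatabase
    # the first column must be a unique integer document id
    sql_query = SELECT id, title, body FROM documents
}

index mydatabase_search
{
    source = mydatabase_src
    path   = /home/myuser/sphinx/data/mydatabase
}

searchd
{
    listen   = 9312
    log      = /home/myuser/sphinx/log/searchd.log
    pid_file = /home/myuser/sphinx/searchd.pid
}
EOF
}
```

Calling write_sample_sphinx_conf /home/myuser/sphinx/sphinx.conf would create the file at the location used throughout this post.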
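As a rough sketch of the indexer side, a small wrapper like the one below could be called from cron. The config path is the one used in this post. The indexer location (assumed to sit next to searchd), the function name, the script name and the hourly schedule in the comment are only examples.

```shell
# Hypothetical cron wrapper. Example crontab entry for myuser (hourly):
#   0 * * * * /home/myuser/sphinx/reindex.sh
INDEXER="${INDEXER:-/usr/local/bin/indexer}"
SPHINX_CONF="${SPHINX_CONF:-/home/myuser/sphinx/sphinx.conf}"

reindex_all() {
    # --all:    rebuild every index defined in the config file
    # --rotate: build next to the live index, then signal searchd to swap
    "$INDEXER" --config "$SPHINX_CONF" --rotate --all
}
```

A real reindex.sh would end with a call to reindex_all. Installing the crontab entry under myuser's own crontab keeps the indexer running as the site owner.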
#!/bin/sh
# chkconfig: 345 55 35
# description: Sphinx search daemon
case "$1" in
  start)
    echo -n "Starting Sphinx searchd:"
    sudo -u myuser /usr/local/bin/searchd --config /home/myuser/sphinx/sphinx.conf >> /dev/null 2>&1 ;;
  stop)
    echo -n "Stopping Sphinx searchd:"
    /usr/local/bin/searchd --config /home/myuser/sphinx/sphinx.conf --stop >> /dev/null 2>&1 ;;
  restart)
    "$0" stop; "$0" start ;;
  *)
    echo "usage: $0 [start|stop|restart]" ;;
esac
Notice that I added the sudo command so that searchd runs as “myuser”. This is because having the indexer and searchd run under different users can pose some issues.
At this point I want to go over my setup a little. The server I installed Sphinx on hosts several websites. A couple of users have been created, each with their sites deployed under their own home directory. Since I plan to use Sphinx on only one of the sites, I want the site owner, “myuser”, to own the indexer process and to keep the Sphinx data and log files locally, somewhere under myuser’s home directory. In this particular setup, if the searchd service is run by root, I run into permission issues.
First, searchd creates a *.spl file (its lock file for the index), which myuser doesn’t have read permission on. The indexer produces the following error even though the rotate option IS indeed present:
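For completeness, registering the startup script on CentOS might look roughly like the sketch below. The service name "searchd", the function name and the two overridable variables are choices made for this example. chkconfig --add picks up the runlevels and priorities from the "# chkconfig:" header in the script, and --list simply prints the result so you can check it.

```shell
# Sketch: install an init script as the "searchd" service (run as root).
# INIT_DIR and CHKCONFIG can be overridden, which is handy for a dry run.
INIT_DIR="${INIT_DIR:-/etc/init.d}"
CHKCONFIG="${CHKCONFIG:-/sbin/chkconfig}"

install_searchd_service() {
    script="$1"                                  # path to the startup script
    cp "$script" "$INIT_DIR/searchd" || return 1
    chmod 755 "$INIT_DIR/searchd"    || return 1
    "$CHKCONFIG" --add searchd       || return 1 # uses the "# chkconfig:" header
    "$CHKCONFIG" --list searchd                  # show the resulting runlevels
}
```

After that, "service searchd start" and "service searchd stop" should work, and searchd should come back up after a reboot.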
indexing index 'mydatabase_search'...
FATAL: failed to open /home/myuser/sphinx/data/mydatabase.spl: Permission denied, will not index. Try --rotate option.
Another issue is the ownership of the searchd.pid file. The indexer complains again if it can’t read it:
WARNING: failed to open pid_file ‘…/searchd.pid’.
WARNING: indices NOT rotated.
If you wonder why the indexer needs access to these files: when it runs with the rotate option, it builds the new index alongside the live one and then notifies searchd by sending it a SIGHUP signal, and it finds searchd’s process ID by reading the pid file.
These issues can be worked around by changing the permissions of these files in the searchd startup script. But in the end I think using sudo is the cleaner solution. Ultimately, all Sphinx-related files, including the configuration and the process pid file, are stored locally and can be accessed easily by the site owner. The only drawback I can see with this approach is that when Sphinx is added to another site owned by a different user, there needs to be a separate searchd process for each site owner.
Since I’m still experimenting, this particular setup may not be the best solution. But hopefully this post sheds some light on issues that other people might run into.
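To make that notification step concrete, here is the shell equivalent of what happens. This is not the indexer's actual code, just a sketch; the function name is made up, and the pid file path is passed in as an argument (use whatever pid_file points to in your sphinx.conf).

```shell
# Sketch of the notification step: read searchd's PID and send it SIGHUP.
notify_searchd() {
    pid_file="$1"
    if [ ! -r "$pid_file" ]; then
        # an unreadable pid file is the situation behind the
        # "failed to open pid_file" warning above
        echo "cannot read $pid_file" >&2
        return 1
    fi
    kill -HUP "$(cat "$pid_file")"
}
```

If the pid file is owned by root and not readable by myuser, the first check fails, which mirrors the case where the indexer gives up and the indices are not rotated.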