As I blogged a couple of days ago, I just recently put up a new site at DougHughes.net. One thing I wanted to do was implement a search tool that used the Verity spider which comes with ColdFusion. After much research and hair-pulling I found out that the Verity spider’s not supported on Linux. D’oh! I pondered several solutions to the problem, including writing my own spider in Java, and then feeding the data into Verity somehow or other.
The problem I kept running into is that Verity, at least as it ships with ColdFusion, is really intended for indexing files on disk and not dynamic web content pulled directly from websites. This meant that to index HTML and other content, such as binary documents, I would need to use a spider to cache web documents to disk before using a bulk insert file to index them with Verity.
I tried several different approaches to the problem, but I wasn’t really very happy with any of them. In the end, I did a little more research and found out about a custom tag called Lindex which was distributed on Macromedia’s DRK 3 CD and which used a search engine I’d never heard of to index content. The search engine was Lucene.
Lucene, it seems, is a “high-performance, full-featured text search engine library written entirely in Java”. It also happened to be free, open source, and published by the Apache Software Foundation. I downloaded it right away and started learning how to use it.
I’ve now built a CFC which uses Lucene to create indexes, index content, and search indexes. Currently, I’m only indexing HTML content; however, I plan to grow that to PDF, DOC, XLS, PPT, and RTF, according to the FAQs I found. Content is fed to Lucene via a half-assed spider I wrote using cfhttp. Aside from some things which could use a little work on my side, I’ve been very happy with the performance and functionality of Lucene and my search component.
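For a rough idea of what a simple spider like this has to do, here's a minimal sketch. It is not the actual component: that one is a ColdFusion CFC built on cfhttp, and Python is used here only for illustration. The names `PageParser`, `parse_page`, and `crawl` are hypothetical, and `fetch` is assumed to be any function that takes a URL and returns that page's HTML as a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageParser(HTMLParser):
    """Collects visible text and same-site links from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links and drop any #fragment.
                url = urljoin(self.base_url, href).split("#")[0]
                same_site = urlparse(url).netloc == urlparse(self.base_url).netloc
                if same_site and url not in self.links:
                    self.links.append(url)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that a visitor would actually see.
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_page(base_url, html):
    """Return (plain_text, same_site_links) for one page."""
    parser = PageParser(base_url)
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl of one site; returns {url: plain_text}."""
    queue, seen, documents = [start_url], {start_url}, {}
    while queue and len(documents) < max_pages:
        url = queue.pop(0)
        text, links = parse_page(url, fetch(url))
        documents[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return documents
```

Each URL and text pair in the returned dictionary is the sort of thing that would then be handed to the indexer as one document. A real spider would also need error handling, timeouts, and some politeness toward the server, all of which are left out here.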
It’s just so much more powerful and easier than Verity. It’s just exactly what I need. I love the fact that I have an API I can code against. (There’s really not an API for Verity.) If you’re using (or failing to use) Verity, I strongly suggest looking into Lucene. When all is said and done, I’ll probably release the search component as an Alagad product. Maybe by 2005? It could happen!