Yes, it has been awfully quiet here since I came back from TechEd. I’ve noticed my own silence, and I’d love to blog more, but I’ve been totally consumed by some research I’m doing in the Full Text Indexing realm. I’m a latecomer to FTS, so I’m working hard at coming up to speed (as best as I can) on the subject. It started because of my current client, and the indexing of semi-structured documents that I’m doing for them. Because they are trying to index concepts and not keywords, and the work is based on text data mining, this isn’t your typical project. Basically, they have a known set of concepts that they wish to index on. In their current “system” they have people (aka indexers) read every document and add references (metadata) between the document and their concepts. This way the documents can be “searched” by concept and not by literals found within the text. What we are doing is trying to augment the process with as much automation as possible.
The first version of automation is the simplest one: trying to come up with a list of literal strings that can be found in the document text, and mapping them to the appropriate concept. Since they already have a pre-existing system (that includes a list of all the concepts), I tried my hand at generating the literal-to-concept index. The problem is that English words have morphological and inflectional endings that you either have to remove from the source text, or add onto the concept text (if you want to build an index based on the literal). I’ve found The Porter Stemming Algorithm, which is a standard algorithm used for this very purpose. But a concept can (and usually does) contain more than one word, so not all variations of a word make sense when combined with the variations of the other words. And, to make things even more difficult, the majority of text in the documents they are indexing is science oriented, and the stemming algorithm isn’t very useful for scientific words. Even worse, they also need to index chemical compounds, and for some reason there is no single standard way of naming a chemical compound.
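To make the literal-to-concept idea concrete, here is a minimal Python sketch. It is not the client's system and it is not the Porter algorithm: crude_stem is a deliberately crude, hypothetical suffix stripper standing in for a real stemmer, and the concept and its literals are made up for illustration.

```python
import re

# Hypothetical, deliberately crude suffix stripper. This is NOT the Porter
# algorithm; it only stands in for a real stemmer in this sketch.
SUFFIXES = ("ing", "es", "ed", "s")


def crude_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters of stem so short words survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_phrase(text):
    """Split text into alphabetic words and stem each one."""
    return tuple(crude_stem(w) for w in re.findall(r"[A-Za-z]+", text))


def build_index(concepts):
    """Map each stemmed literal phrase to the concept it stands for."""
    index = {}
    for concept, literals in concepts.items():
        for literal in literals:
            index[stem_phrase(literal)] = concept
    return index


def find_concepts(text, index):
    """Return every concept whose stemmed literal occurs as a word run."""
    tokens = stem_phrase(text)
    longest = max((len(key) for key in index), default=0)
    found = set()
    for start in range(len(tokens)):
        for length in range(1, longest + 1):
            concept = index.get(tokens[start:start + length])
            if concept is not None:
                found.add(concept)
    return found


# Made-up concept with two literal spellings, for illustration only.
CONCEPTS = {"protein folding": ["protein folding", "folded protein"]}
INDEX = build_index(CONCEPTS)

print(find_concepts("The proteins folded quickly", INDEX))
print(find_concepts("The folding of proteins", INDEX))
```

The first sentence matches because "proteins folded" stems to the same pair of words as "protein folding". The second one finds nothing, since the words are not adjacent, which is exactly the multi-word problem described above.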
Example: caffeine
Chemical Formula: C8H10N4O2 (or it can be expressed as C8 H10 N4 O2 or C8 N4 O2)
Chemical Names: 1,3,7-trimethylxanthine
1,3,7-Trimethyl-2,6-dioxopurine
3,7-Dihydro-1,3,7-trimethyl-1H-purine-2,6-dione
7-Methyltheophylline
1H-Purine-2,6-dione, 3,7-dihydro-1,3,7-trimethyl- (9CI)
1-methyltheobromine
1,3,7-trimethyl-Xanthine
Try mining a text document (or the internet) for the concept of “caffeine”, especially when people are purposely trying to make it hard for you to mine this information.
It is definitely not your typical Google or Lucene type system.
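As a rough illustration of the simplest possible answer to the caffeine problem, the sketch below folds several of the spellings listed above into one normalized key (lowercase, with whitespace and hyphens removed) and looks candidates up in a synonym table. The function names and the normalization rule are my own assumptions, not an established method, and stripping hyphens could easily merge compounds that are actually different.

```python
import re

# A few of the caffeine spellings listed above, used as the synonym table.
CAFFEINE_SPELLINGS = [
    "caffeine",
    "C8H10N4O2",
    "1,3,7-trimethylxanthine",
    "1,3,7-Trimethyl-2,6-dioxopurine",
    "3,7-Dihydro-1,3,7-trimethyl-1H-purine-2,6-dione",
    "7-Methyltheophylline",
    "1-methyltheobromine",
]


def normalize(name):
    """Assumed rule: drop whitespace and hyphens, then lowercase."""
    return re.sub(r"[\s\-]+", "", name).lower()


SYNONYMS = {normalize(s): "caffeine" for s in CAFFEINE_SPELLINGS}


def lookup(candidate):
    """Return the concept for a candidate string, or None if unknown."""
    # Trim sentence punctuation from the ends before normalizing.
    return SYNONYMS.get(normalize(candidate.strip(".,;:!?")))


def find_compounds(text, max_words=4):
    """Try every run of up to max_words whitespace-separated words."""
    words = text.split()
    found = set()
    for start in range(len(words)):
        for length in range(1, max_words + 1):
            concept = lookup(" ".join(words[start:start + length]))
            if concept is not None:
                found.add(concept)
    return found


print(find_compounds("Each cup holds C8 H10 N4 O2, also sold as 1,3,7-trimethyl-Xanthine."))
print(lookup("theobromine"))
```

This only catches spellings that differ by case, spacing, or hyphenation from something already in the table. A name that isn't in the table at all, or one written to dodge detection, still slips through.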
But there is hope. There are organizations like the Open Bioinformatics Foundation, which encourages open source solutions for mining biological text (things like proteins and amino acids), and the National Center for Biotechnology Information. I even found GeneWays, which is “A System for Mining Text and for Integrating Data on Molecular Pathways”. This is all doctorate-level material, which can be a bit heady for a guy who never finished his bachelor’s degree.
The paper I’ve been using the most to help develop this indexing system is Analysis of Biomedical Text for Chemical Names: A Comparison of Three Methods, which has lots of great information. The problem is that most of the technology used to accomplish the text mining is based on Natural Language Systems, which aren’t something the average developer can support. I’ve been working on building my own tokenization algorithms (for the .Net port of Lucene, dotLucene), along with studying Bayesian analysis and Support Vector Machines (libsvm is great and there is even a C# port).
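To show why tokenization needs special care here, this is a small Python sketch of a chemical-aware tokenizer. It is not the dotLucene code I'm working on; the pattern and the names are assumptions made for illustration. The idea is to keep hyphens and commas that sit between letters or digits inside the token, so a compound name survives as one piece.

```python
import re

# Assumed pattern: a run of letters/digits, optionally continued by a hyphen
# or comma that is immediately followed by more letters/digits.
CHEMICAL_TOKEN = re.compile(r"[A-Za-z0-9]+(?:[-,][A-Za-z0-9]+)*")

# A plain word tokenizer for comparison; it breaks on every hyphen and comma.
PLAIN_TOKEN = re.compile(r"[A-Za-z0-9]+")


def chemical_tokens(text):
    """Tokenize while keeping names like 1,3,7-trimethylxanthine whole."""
    return CHEMICAL_TOKEN.findall(text)


def plain_tokens(text):
    """Tokenize on letter/digit runs only."""
    return PLAIN_TOKEN.findall(text)


SAMPLE = "Caffeine (1,3,7-trimethylxanthine) is a stimulant, and so is tea."

print(chemical_tokens(SAMPLE))
print(plain_tokens(SAMPLE))
```

The trade-off is that ordinary text written without a space after a comma ("red,green") also comes out as a single token, so a real analyzer would need more context than one regular expression.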
More than enough to keep me too busy to blog. But don’t worry. I’ve got a new open source project that I’m working on as my summer project. Can’t say what it is yet, but it involves this stuff and SQL Server 2005. I’m still trying to get it all running, but once it sort of works, I’ll let everyone know.