Sunday, March 11, 2012

Datamining for ntext

Is there any way to datamine using ntext? I'm trying to run some BI on some email messages to see whether it can accurately classify each message into the proper folder. Currently, I get complaints that ntext isn't comparable.

Is this scenario supported?

You will need to use Integration Services to preprocess the messages into terms and phrases using the Term Extraction and Term Lookup transforms. From there, you can use the algorithms to perform classification. FYI, Logistic Regression tends to have good results for these types of problems.|||Thanks.
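To make the preprocessing step suggested in the answer above more concrete, here is a minimal Python sketch of the general idea: turning free text into (message, term, frequency) rows that a classifier can consume. It is not the SSIS Term Extraction transform and does not reproduce its behavior; the stop-word list, the function names, and the sample message are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "for", "on"}

def extract_terms(text):
    """Lowercase the text, split it into words, and count the non-stop-word terms."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def to_term_rows(message_id, text):
    """Return (message_id, term, frequency) rows, one per distinct term, sorted by term."""
    return [(message_id, term, freq)
            for term, freq in sorted(extract_terms(text).items())]

# Made-up sample message body.
rows = to_term_rows(1, "Quarterly report: the data mining report is attached")
```

The point of the sketch is the shape of the output: the long text column is replaced by short, comparable term values, which is what lets a mining algorithm work with it.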

Say I have extracted terms from both subject and body. Thus, I have two sets of terms. Let's also say that I have a table with data containing all the header information.

How do I feed all three of these sets (subject terms, body terms, and header data) into a datamining transform?|||

It depends on what you really want to do. You can merge them using merge transforms in Integration Services to do term extraction. On the term lookup side, you can do the same. However, you may want to output the results of the different sets to different tables so you can keep them separate as inputs to a mining algorithm.

For example, the phrase "Data Mining" in the subject may have different predictive power than "Data Mining" in the body.
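As a rough illustration of why keeping the sets separate matters, the Python sketch below tags each term with the field it came from, so the same phrase in the subject and in the body becomes two distinct features. The `tag_terms` helper, the `source:term` naming scheme, and the counts are hypothetical; in the SSIS and Analysis Services setup described here, the separation would come from writing the sets to different tables.

```python
from collections import Counter

def tag_terms(terms, source):
    """Prefix each term with its source field so identical phrases stay distinct."""
    return Counter({f"{source}:{term}": count for term, count in terms.items()})

# Made-up term counts for one message.
subject_terms = Counter({"data mining": 1})
body_terms = Counter({"data mining": 3, "invoice": 1})

# Combined feature set: subject and body occurrences are counted separately.
features = tag_terms(subject_terms, "subject") + tag_terms(body_terms, "body")
```

With this layout an algorithm can learn one weight for `subject:data mining` and another for `body:data mining`, instead of a single weight for the merged phrase.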

Note that Integration Services does not support nested tables. To implement such a process, your SSIS pipeline will have to put the data into tables that you mine using the Analysis Services project user interface.
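One possible layout for those tables is sketched below, using an in-memory SQLite database from Python purely as a stand-in for the real relational store. The assumption is one row per message in a case table, plus one term table per source, each keyed by the message ID so that it can be related to the case table as a nested table when the mining structure is defined. All table and column names, and the sample rows, are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the database SSIS would write to

conn.executescript("""
CREATE TABLE Messages     (MessageID INTEGER PRIMARY KEY, Sender TEXT, Folder TEXT);
CREATE TABLE SubjectTerms (MessageID INTEGER REFERENCES Messages(MessageID), Term TEXT);
CREATE TABLE BodyTerms    (MessageID INTEGER REFERENCES Messages(MessageID), Term TEXT);
""")

# Case table: header data plus the folder to predict.
conn.execute("INSERT INTO Messages VALUES (1, 'alice@example.com', 'Reports')")

# Child tables: one row per (message, term), kept separate by source.
conn.executemany("INSERT INTO SubjectTerms VALUES (?, ?)", [(1, "data mining")])
conn.executemany("INSERT INTO BodyTerms VALUES (?, ?)",
                 [(1, "data mining"), (1, "quarterly report")])

body_term_count = conn.execute(
    "SELECT COUNT(*) FROM BodyTerms WHERE MessageID = 1"
).fetchone()[0]
```

The pipeline only has to fill three flat tables; the nesting is expressed later, through the key relationship on `MessageID`.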