date and time diffs: Data mining of subsets question

Sunday, March 11, 2012

Have not found an active general data mining forum yet so asking here.

I'm new to data mining and have been given the task of setting up a data mining system. The problem is that our setup seems pretty nonstandard, and I'm not sure how to use data mining on it or what my expectations should be. The situation is:

We have a large and growing set of strings which we get requests for (> 100,000). The requests have many, mostly nominal (non-numeric), variables associated with them. We can only handle a subset (probably less than 10,000) of the strings at any one time. We want to use data mining to analyze historic requests so we can figure out which strings we are going to handle under a given set of variables. So, given that our variables currently have values X1, ..., Xn, and given a large database of historic string requests, which subset should we use?

Anyone know what techniques would work well for this kind of problem? This is a quick and dirty kind of project, no special-purpose hardware or expensive software on this one. I've been looking at using RapidMiner, but I'm not sure it's a great tool in this case.

thanks in advance,
max

A first idea is to use the Microsoft Clustering algorithm to cluster your requests and find similar ones; for this approach you don't need a predictable attribute.

A second way is to define an objective by selecting one (or more) columns from your requests, say Xj, and asking the question:

What are the links/influences of X1,..., Xn with Xj?

Or create another attribute, Xn+1, and ask the same question about it.

Take the Microsoft example presented in its tutorials; it has the following attributes:

CommuteDistance(X1)

Gender(X2)

HouseOwnerFlag(X3)

MaritalStatus(X4)

...

NumberCarsOwned(Xn)

and a predictable attribute:

BikeBuyer(Xj)

which indicates whether or not the customer buys a bicycle; so in this case the question/objective is formulated in the same way:

What are the links/influences of X1,..., Xn with Xj?
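As a rough illustration of that question outside any particular tool, one simple way to look at the influence of a nominal attribute on Xj is to compare the rate of the target value across that attribute's values. The sketch below is only that: the rows are invented for illustration, and the helper name conditional_rates is hypothetical, not part of the Microsoft tutorial or any library.

```python
from collections import Counter

# Toy rows; the column names follow the tutorial attributes quoted above,
# but the values are invented purely for illustration.
rows = [
    {"CommuteDistance": "0-1 Miles", "Gender": "M", "BikeBuyer": "yes"},
    {"CommuteDistance": "0-1 Miles", "Gender": "F", "BikeBuyer": "yes"},
    {"CommuteDistance": "10+ Miles", "Gender": "M", "BikeBuyer": "no"},
    {"CommuteDistance": "10+ Miles", "Gender": "F", "BikeBuyer": "no"},
    {"CommuteDistance": "0-1 Miles", "Gender": "M", "BikeBuyer": "no"},
]


def conditional_rates(rows, attribute, target, positive):
    """Return {attribute value: share of rows where target == positive}."""
    totals = Counter()
    hits = Counter()
    for row in rows:
        value = row[attribute]
        totals[value] += 1
        if row[target] == positive:
            hits[value] += 1
    return {value: hits[value] / totals[value] for value in totals}


# Attributes whose values give very different rates are candidates for
# having a strong link with the target attribute.
commute_rates = conditional_rates(rows, "CommuteDistance", "BikeBuyer", "yes")
gender_rates = conditional_rates(rows, "Gender", "BikeBuyer", "yes")
print(commute_rates)
print(gender_rates)
```

In this toy data the buyer rate differs much more across CommuteDistance values than across Gender values, which is the kind of signal a mining model would pick up on at scale.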

There is a good book presented here

|||

You could try using a classification method with the other fields as input and the target string as output. Then you could rank the likelihood of each string given some input.

Candidate algorithms are

- Naive Bayes

- Logistic Regression

- Neural Nets (slow)

- Association Rules (not traditionally a classification method, but could work in this case)

After creating the model you would do something like

SELECT FLATTENED
  (SELECT TOP 10000 TargetString
   FROM PredictHistogram(TargetString)
   ORDER BY $Probability DESC)
FROM MyModel
NATURAL PREDICTION JOIN
  (SELECT 'Factor1' AS Factor1, 'Factor2' AS Factor2) AS t
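For anyone who wants to try the same ranking idea without SQL Server, here is a minimal sketch in plain Python of a Naive Bayes ranker with add-one smoothing. It is an assumption-laden illustration, not what the DMX query does internally: the function names train and top_strings, the history layout, and the toy data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(history):
    """history: list of (factors dict, requested string). Returns count tables."""
    string_counts = Counter()
    # value_counts[(string, factor name)][factor value] -> count
    value_counts = defaultdict(Counter)
    values_seen = defaultdict(set)
    for factors, string in history:
        string_counts[string] += 1
        for name, value in factors.items():
            value_counts[(string, name)][value] += 1
            values_seen[name].add(value)
    return string_counts, value_counts, values_seen


def top_strings(model, factors, k):
    """Rank strings by Naive Bayes log-score (add-one smoothing); keep the k best."""
    string_counts, value_counts, values_seen = model
    total = sum(string_counts.values())
    scores = {}
    for string, count in string_counts.items():
        score = math.log(count / total)  # log prior of the string
        for name, value in factors.items():
            seen = value_counts[(string, name)][value]
            distinct = len(values_seen[name]) or 1
            score += math.log((seen + 1) / (count + distinct))
        scores[string] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Invented toy history: each entry is (variable values, string requested).
history = [
    ({"Factor1": "a", "Factor2": "x"}, "s1"),
    ({"Factor1": "a", "Factor2": "x"}, "s1"),
    ({"Factor1": "a", "Factor2": "y"}, "s2"),
    ({"Factor1": "b", "Factor2": "y"}, "s2"),
    ({"Factor1": "b", "Factor2": "y"}, "s3"),
]
model = train(history)
ranked = top_strings(model, {"Factor1": "a", "Factor2": "x"}, k=2)
print(ranked)
```

In the real setting k would be the number of strings you can handle at once (around 10,000), and the history would come from the request database.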

HTH

-Jamie
