#### Transcript: COMP527 Data Mining

COMP527: Data Mining
M. Sulaiman Khan ([email protected])
Dept. of Computer Science, University of Liverpool, 2009
Classification: Bayes. February 17, 2009

##### Course Outline

Introduction to the Course; Introduction to Data Mining; Introduction to Text Mining; General Data Mining Issues; Data Warehousing; Classification: Challenges, Basics; Classification: Rules; Classification: Trees; Classification: Trees 2; Classification: Bayes; Classification: Neural Networks; Classification: SVM; Classification: Evaluation; Classification: Evaluation 2; Regression, Prediction; Input Preprocessing; Attribute Selection; Association Rule Mining; ARM: A Priori and Data Structures; ARM: Improvements; ARM: Advanced Techniques; Clustering: Challenges, Basics; Clustering: Improvements; Clustering: Advanced Algorithms; Hybrid Approaches; Graph Mining, Web Mining; Text Mining: Challenges, Basics; Text Mining: Text-as-Data; Text Mining: Text-as-Language; Revision for Exam.

##### Today's Topics

- Statistical Modeling
- Bayes Rule
- Naïve Bayes
- Fixes to Naïve Bayes
- Document classification
- Bayesian Networks
- Structure Learning

##### Bayes Rule

The probability of hypothesis H, given evidence E:

Pr[H|E] = Pr[E|H] * Pr[H] / Pr[E]

Pr[H] = a priori probability of H (before the evidence is seen)
Pr[H|E] = a posteriori probability of H (after the evidence is seen)

We want to use this in a classification system, so our goal is to find the most probable hypothesis (class) given the evidence (test instance).

##### Example

Meningitis causes a stiff neck 50% of the time. Meningitis occurs in 1/50,000 patients; stiff necks occur in 1/20. The probability of meningitis, given that the patient has a stiff neck:

Pr[M|SN] = Pr[SN|M] * Pr[M] / Pr[SN] = (0.5 * 1/50000) / (1/20) = 0.0002

##### Bayes Rule

Our evidence E is made up of different attributes A1..An, so:

Pr[H|E] = Pr[A1|H] * Pr[A2|H] * ... * Pr[An|H] * Pr[H] / Pr[E]

So we need to work out the probability of each individual attribute value per class. Easy: Outlook=Sunny appears twice among the 9 play=yes instances, so Pr[sunny|yes] = 2/9. We can work these out for all of our training instances.

##### Weather Probabilities

| Outlook  | Yes     | No      |
|----------|---------|---------|
| Sunny    | 2 (2/9) | 3 (3/5) |
| Overcast | 4 (4/9) | 0 (0/5) |
| Rainy    | 3 (3/9) | 2 (2/5) |

| Temperature | Yes     | No      |
|-------------|---------|---------|
| Hot         | 2 (2/9) | 2 (2/5) |
| Mild        | 4 (4/9) | 2 (2/5) |
| Cool        | 3 (3/9) | 1 (1/5) |

| Humidity | Yes     | No      |
|----------|---------|---------|
| High     | 3 (3/9) | 4 (4/5) |
| Normal   | 6 (6/9) | 1 (1/5) |

| Windy | Yes     | No      |
|-------|---------|---------|
| False | 6 (6/9) | 2 (2/5) |
| True  | 3 (3/9) | 3 (3/5) |

| Play | Yes      | No       |
|------|----------|----------|
|      | 9 (9/14) | 5 (5/14) |

Given a test instance (sunny, cool, high, true):

play=yes: 2/9 * 3/9 * 3/9 * 3/9 * 9/14 = 0.0053
play=no: 3/5 * 1/5 * 4/5 * 3/5 * 5/14 = 0.0206

So we'd predict play=no for that particular instance.

These are likelihoods, not probabilities; we need to normalise them so they sum to 1:

Prob(yes) = 0.0053 / (0.0053 + 0.0206) = 20.5%

This is where the Pr[E] denominator of Bayes' rule cancels out. Nice. Surely there's more to it than this...?
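Before moving on, here is a minimal sketch of that calculation in Python, assuming the counts from the weather table above; the attribute and class names are just illustrative.

```python
import math

# Per-class value counts from the weather table above:
# {attribute: {value: (count in play=yes, count in play=no)}}
counts = {
    "outlook":     {"sunny": (2, 3), "overcast": (4, 0), "rainy": (3, 2)},
    "temperature": {"hot": (2, 2), "mild": (4, 2), "cool": (3, 1)},
    "humidity":    {"high": (3, 4), "normal": (6, 1)},
    "windy":       {"false": (6, 2), "true": (3, 3)},
}
class_counts = {"yes": 9, "no": 5}
total = sum(class_counts.values())

def likelihood(instance, cls):
    """Pr[A1|H] * ... * Pr[An|H] * Pr[H] for a single class H."""
    idx = 0 if cls == "yes" else 1
    factors = [counts[attr][val][idx] / class_counts[cls] for attr, val in instance.items()]
    factors.append(class_counts[cls] / total)              # the prior Pr[H]
    return math.prod(factors)

instance = {"outlook": "sunny", "temperature": "cool", "humidity": "high", "windy": "true"}
raw = {c: likelihood(instance, c) for c in class_counts}          # yes ~ 0.0053, no ~ 0.0206
posterior = {c: v / sum(raw.values()) for c, v in raw.items()}    # yes ~ 0.205, no ~ 0.795
print(raw, posterior)
```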
##### Naïve Bayes

Issue: it is only valid to multiply probabilities when the events are independent of each other. It is "naïve" to assume independence between attributes in datasets, hence the name. E.g. the probability of Liverpool winning a football match is not independent of the probabilities of each member of the team scoring a goal. But even given that, Naïve Bayes is still very effective in practice, especially if we can eliminate redundant attributes before processing.

Issue: if an attribute value never co-occurs with a class value, the probability generated for it will be 0. E.g. given outlook=overcast, the probability of play=no is 0/5. The other attributes will then be ignored, because the final result is multiplied by 0. This is bad for our 4-attribute set, but horrific for (say) a 1000-attribute set: you can easily imagine a case where the likelihood for every class is 0. E.g. 'Viagra' is always spam, 'data mining' is never spam; an email containing both scores 0 for spam=yes and 0 for spam=no, so the probability is undefined. Uh oh!

##### Laplace Estimator

The trivial solution is of course to adjust the probabilities so that you never get a 0. For outlook (three values) we add 1 to each numerator and 3 to the denominator to compensate, so we end up with 1/8 instead of 0/5. There is no reason to add exactly 1 and 3; we could add 2 and 6 instead. Nor is there any reason to split the added weight equally: we could give some values a larger share, e.g.

Pr[sunny|yes] = (a + 3) / (n + 6), Pr[overcast|yes] = (b + 2) / (n + 6), Pr[rainy|yes] = (c + 1) / (n + 6)

where a, b and c are the observed counts for the three values and n is their total. However, how to assign these weights is unclear. For reasonable training sets, simply initialise all counts to 1 rather than 0.

##### Missing Values

Naïve Bayes deals well with missing values:

Training: ignore the instance for that attribute/class combination, but still use it for the attributes that are known.
Classification: omit the attribute from the calculation; the difference will be normalised away during the final step anyway.

##### Numeric Values

Naïve Bayes does not deal well with numeric values without some help: the probability of the temperature being exactly 65 degrees is zero. We could discretize the attribute, but instead we'll calculate the mean and standard deviation and use a density function to estimate the probability.

mean: sum(values) / count(values)
variance: sum((value - mean)^2) / (count(values) - 1)
standard deviation: square root of the variance

The mean for temperature (over the play=yes instances) is 73 and the standard deviation is 6.2.

Density function, assuming a normal distribution:

f(x) = (1 / (sqrt(2 * pi) * sigma)) * e^(-(x - mu)^2 / (2 * sigma^2))

Unless you have a maths background, just plug the numbers in: for a temperature of 66 this gives a likelihood of 0.034, and we then continue with that number as before. This assumes a reasonably normal distribution, which is often not the case.
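As a minimal sketch of the density-function step, assuming a normal distribution: the mean of 73 and standard deviation of 6.2 are the temperature values quoted above, and 66 is just an example test value.

```python
import math

def mean_and_std(values):
    """Sample mean and standard deviation, as defined on the slide."""
    m = sum(values) / len(values)
    variance = sum((v - m) ** 2 for v in values) / (len(values) - 1)
    return m, math.sqrt(variance)

def gaussian_density(x, mu, sigma):
    """f(x) = 1 / (sqrt(2*pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))"""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Temperature within play=yes: mean 73, standard deviation 6.2 (from the slide).
# An example test temperature of 66 then contributes:
print(gaussian_density(66, mu=73, sigma=6.2))   # ~0.034, used in place of Pr[temperature|yes]
```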
##### Document Classification

The Bayesian model is often used to classify documents, as it copes well with a huge number of attributes simultaneously (e.g. boolean occurrence of each word in the text). But we may also know how many times each word occurs, which leads to Multinomial Naïve Bayes. Assumptions:

1. The probability of a word occurring in a document is independent of its location within the document.
2. The document length is not related to the class.

Pr[E|H] = N! * product over words i of ( p_i^n_i / n_i! )

where N is the number of words in the document, p_i is the relative frequency of word i in documents of class H, and n_i is the number of occurrences of word i in the document.

So, if word A has 75% and word B has 25% relative frequency in class H:

Pr["A A A"|H] = 3! * (0.75^3 / 3!) * (0.25^0 / 0!) = 27/64 = 0.422
Pr["A A A B B"|H] = 5! * (0.75^3 / 3!) * (0.25^2 / 2!) = 0.264

We don't need to work out all the factorials, as they normalise out at the end. We still end up with insanely small numbers, because vocabularies are much, much larger than two words, so instead of multiplying the probabilities we sum their logarithms.

##### Bayesian Networks

Back to the attribute independence assumption: can we get rid of it? Yes, with a Bayesian Network. Each attribute has a node in a Directed Acyclic Graph, and each node has a table giving the probabilities of the attribute's values for each combination of values of the attributes whose edges point at the node. Examples will hopefully be enlightening.

##### Simple Network

play:

| play | yes  | no   |
|------|------|------|
|      | .633 | .367 |

outlook (parent: play):

| play | sunny | overcast | rainy |
|------|-------|----------|-------|
| yes  | .238  | .429     | .333  |
| no   | .538  | .077     | .385  |

temperature (parent: play):

| play | hot  | mild | cold |
|------|------|------|------|
| yes  | .238 | .429 | .333 |
| no   | .385 | .385 | .231 |

humidity (parent: play):

| play | high | normal |
|------|------|--------|
| yes  | .350 | .650   |
| no   | .750 | .250   |

windy (parent: play):

| play | false | true |
|------|-------|------|
| yes  | .350  | .650 |
| no   | .583  | .417 |

##### Less Simple Network

play:

| play | yes  | no   |
|------|------|------|
|      | .633 | .367 |

outlook (parent: play):

| play | sunny | overcast | rainy |
|------|-------|----------|-------|
| yes  | .238  | .429     | .333  |
| no   | .538  | .077     | .385  |

temperature (parents: play, outlook):

| play | outlook  | hot  | mild | cold |
|------|----------|------|------|------|
| yes  | sunny    | .238 | .429 | .333 |
| yes  | overcast | .385 | .385 | .231 |
| yes  | rainy    | .111 | .556 | .333 |
| no   | sunny    | .556 | .333 | .111 |
| no   | overcast | .333 | .333 | .333 |
| no   | rainy    | .143 | .429 | .429 |

windy (parents: play, outlook):

| play | outlook  | false | true |
|------|----------|-------|------|
| yes  | sunny    | .500  | .500 |
| yes  | overcast | .500  | .500 |
| yes  | rainy    | .125  | .875 |
| no   | sunny    | .375  | .625 |
| no   | overcast | .500  | .500 |
| no   | rainy    | .833  | .167 |

humidity (parents: play, temperature):

| play | temperature | high | normal |
|------|-------------|------|--------|
| yes  | hot         | .500 | .500   |
| yes  | mild        | .500 | .500   |
| yes  | cool        | .125 | .875   |
| no   | hot         | .833 | .167   |
| no   | mild        | .833 | .167   |
| no   | cool        | .250 | .750   |

##### Bayesian Networks

To use the network, simply step through each node and multiply together the table entries that match the instance's attribute values, or, more likely, sum the logarithms as in the multinomial case. Then, as before, normalise the results so they sum to 1. This works because the edges between the nodes determine the probability distribution at each node. Using the network is straightforward, so all that remains is to find the best network structure to use, and given a large number of attributes there is a LARGE number of possible networks...
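As an illustration of stepping through the nodes, here is a minimal sketch using the Simple Network's tables above on the same (sunny, cool, high, true) test instance, with "cool" written as "cold" to match the table; it sums log-probabilities and then normalises, as described on the slide.

```python
import math

# Node tables of the Simple Network above: every attribute is conditioned on play only.
prior = {"yes": 0.633, "no": 0.367}
tables = {
    "outlook":     {"yes": {"sunny": 0.238, "overcast": 0.429, "rainy": 0.333},
                    "no":  {"sunny": 0.538, "overcast": 0.077, "rainy": 0.385}},
    "temperature": {"yes": {"hot": 0.238, "mild": 0.429, "cold": 0.333},
                    "no":  {"hot": 0.385, "mild": 0.385, "cold": 0.231}},
    "humidity":    {"yes": {"high": 0.350, "normal": 0.650},
                    "no":  {"high": 0.750, "normal": 0.250}},
    "windy":       {"yes": {"false": 0.350, "true": 0.650},
                    "no":  {"false": 0.583, "true": 0.417}},
}

def classify(instance):
    """Step through each node, summing log-probabilities, then normalise."""
    log_like = {}
    for cls in prior:
        total = math.log(prior[cls])
        for attr, value in instance.items():
            total += math.log(tables[attr][cls][value])
        log_like[cls] = total
    likelihoods = {c: math.exp(v) for c, v in log_like.items()}
    norm = sum(likelihoods.values())
    return {c: v / norm for c, v in likelihoods.items()}

# Roughly 0.44 yes vs 0.56 no for this instance, so play=no is predicted.
print(classify({"outlook": "sunny", "temperature": "cold", "humidity": "high", "windy": "true"}))
```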
##### Training Bayesian Networks

We need two components:

- A way to evaluate a network based on the data. As always, we need a measure of 'goodness' that does not overfit (overfitting here means too many edges), so we need a penalty for the complexity of the network.
- A way to search through the space of possible networks. We already know the nodes, so we need to find where the edges in the graph are: which nodes connect to which other nodes?

Following the Minimum Description Length ideal, networks with lots of edges are more complex and hence likely to overfit, so we add a penalty for each cell in the nodes' tables:

AIC score: -LL + K
MDL score: -LL + (K/2) log N

where LL is the total log-likelihood of the network and training set (i.e. the sum of the logs of the probabilities the network assigns to each instance in the data set), K is the number of cells in the tables minus one cell per row (the last cell in each row can be calculated as 1 minus the sum of the other cells in that row), and N is the number of instances in the data.

##### Network Training: K2

K2: process the nodes in some order; for each node, consider adding edges from previously processed nodes, calculating the worth of the network after each addition, and stop adding parents once the score no longer improves. (Use MDL or AIC to determine worth.) A rough sketch of this greedy search appears after the Further Reading list below.

The results of K2 depend on the initial order chosen for the nodes, so run it several times with different orders and select the best network. It can help to ensure that the class attribute comes first and links to all other nodes (though this is not a requirement).

##### Other Structures

TAN (Tree Augmented Naive Bayes): in Naive Bayes the class attribute is the only parent of each node. Start from that structure and consider adding a second parent to each node.

Bayesian Multinet: build a separate network for each class and combine the values.

##### Further Reading

- Witten, 4.2 and 6.7
- Han, 6.4
- Dunham, 4.2
- Devijver and Kittler, Pattern Recognition: A Statistical Approach, Chapter 2
- Berry and Browne, Chapter 2
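To close, a rough sketch of the K2-style greedy search described under Network Training. The node ordering and the `score(parents)` callback are assumed placeholders (e.g. an AIC or MDL score, where lower is better), not part of the slides.

```python
def k2_search(nodes, score):
    """Greedy K2-style structure search over a fixed node ordering.

    `nodes` is an ordered list of attribute names; `score(parents)` is an
    assumed callback that scores a candidate parent assignment (lower is
    better, as with the AIC/MDL penalties above).
    """
    parents = {n: [] for n in nodes}          # start with no edges
    best = score(parents)
    for i, node in enumerate(nodes):
        improved = True
        while improved:
            improved = False
            # only nodes earlier in the ordering may become parents (keeps the graph acyclic)
            for candidate in nodes[:i]:
                if candidate in parents[node]:
                    continue
                parents[node].append(candidate)
                s = score(parents)
                if s < best:                  # keep the edge only if the score improves
                    best = s
                    improved = True
                else:
                    parents[node].remove(candidate)
    return parents
```

Running this with several different node orderings and keeping the best-scoring result mirrors the advice on the K2 slide.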