[TOP]


Review Questions

Data Mining
CS 510 (DM)
Winter,2004
home | news | site map
review | project | subject | group
weka | mining | gawk | bash
modeling | reference | pods
Display: big | small

Why all the scripting?

Week9

The following questions ask you to compare and contrast two learners:

Week8

  1. What is the problem with numeric attributes? Describe a method for handling numeric attributes and say what learner uses that method.

  2. What is the problem with missing data? Describe a method for handling missing values and say what learner uses that method.

  3. What is the problem with noise? Describe a method for handling noisy data and say what learner uses that method.

  4. Explain the following. In decision tree learning. splitting on a discrete/numeric attribute exhausts/does not exhaust that attribute for that branch.

  5. How does Naive Bayes handle post-pruning?

  6. What are the tune-grow-prune-test sets? How are they built for INDUCT?

  7. A naive method for turning trees into rules is to generate one rule per branch. Why is this naive? What are C4.5rules non-naive methods?

  8. Distinguish greedy search from some non-greedy search. Give an example of greedy search and non-greedy search in data mining.

  9. In the following example, (a)describe the logic that leads to iris-Virginia via two unless branches; (b)how are mis-classified examples used in this structure? (c)suggest a pre-pruning method based on support that would eliminate the right-hand-side rules; (d)how might incremental or global pruning be used in the example?

  10. What is the MDL principle? Suggest a post-pruning method based on MDL that might eliminate the right-hand-side rules in the above diagram. Explain: for the above example, a quadratic loss function would be an inappropriate encoding of the theory with respect to the evidence. What alternate encoding might you use? Explain: MDL is less useful for pre-pruning.

Week7

  1. Define correlation in English (not mathematically). Offer interpretations a correlation of (a)0.7 (b)-0.9 (c)0.1 (d) 2

  2. Draw a ROC curve. Label the axises. Where is the ``sweet spot''? Where is the ``do nothing'' spot and what is a side effect of doing nothing? Where is the ``do everything'' spot and what is a side effect of doing nothing? Show and explain the ``zone of no information''. Show and explain the ``zone of negative information''.

  3. A dot-com company is rushing a simple single-user game to market. In terms of the ROC curve, what kind of defect detectors do they want?

  4. A mission-critical safety-critical controller is being built for supplying oxygen to astronauts. In terms of the ROC curve, what kind of defect detectors do they want?

  5. Here's some data:
                   truth
                  +---+---+
                  | 0 | 1 |
              +---+---+---+
     detector | 0 | A | B |
              | 1 | C | D |
              +---+---+---+
     A=40
     B=10
     C=20
     D=30

    Define accuracy. Compute it for this example. Define probability of false alarm. Compute it for this example. Define probability of detection. Compute it for this example. Define precision. Compute it for this example.

  6. In terms of the quantities in the last question, define a ``safe'' detector. Consider a data set where (B+D) << (A+C). What are the difficulties of making a detector safe and accurate.

  7. After a 10-way cross-way on a data set with three classes, 30 dots are added to a ROC curve. Explain why.

Week6

Here is a snip from nbc.awk that is relevant to the first three questions

 function classify(         i,temp,what,like,c) {  
   like = -100000; # smaller than any log
   for(c in Classes) {  
     temp=log(Classes[c]/Total); #uses logs to stop numeric errors
     for(i=1;i<NF;i++) {  
       if ( $i=="?" ) continue;
       temp += log((Freq[c,i,$i]+1)/(Classes[c]+Attributes[i]));
     };
     if ( temp >= like ) {like = temp; what=c}
   };
   return what;
 }
  1. What do the following data structures store? Classes,Total, Freq,Attributes. For each structure in turn, what would happen if it was just thrown away?

  2. Why does this code use logarithms, not raw numbers?. Change the code so it does not use logarithms.

  3. This code returns classes with max likelihood, not max probability. Is that a mistake? Explain. Change the code so it computes probabilities, not just likelihoods.

    End of NBC.awk questions

  4. Describe the 1-NN learning algorithm. What is k-NN and how might k-NN suffer less from over-fitting bias than 1-NN?

  5. Discuss: ``J48 offers an inductive generalization but 1-NN does not''.

  6. Discuss: order of rules matter but order of branches in a decision tree do not. Describe one simple rule-ordering method.

  7. Describe, using pseudo-code, the PRISM rule-covering algorithm. How might PRISM suffer from over-fitting bias?

  8. Show how PRISM performs on the following data. Show all calculations. Carefully document what is going on.

    Current rule:

     If astigmatics = yes and tear production rate = normal
     then recommendation = hard

    Instances covered by the current rule:

          tear   
          prod.
     lens rate   Astigmatism  prescription age
     ==== ====== ============ ============ =============
     None Normal Yes          Hypermetrope Pre-presbyopic
     Hard Normal Yes          Myope        Presbyopic
     None Normal Yes          Hypermetrope Presbyopic
     Hard Normal Yes          Myope        Pre-presbyopic
     Hard Normal Yes          Hypermetrope Young
     Hard Normal Yes          Myope        Young

  9. Define the parts of an association rule. Define support. Describe item set generation in APRIORI. Include an example in your description. Use your description to demonstrate support-based pruning. What does support-based pruning offer to the over-fitting problem?

  10. Define confidence. Assuming item sets have been generated, describe rule generation in APROPRI. Include an example in your description. Use your description to demonstrate confidence-based pruning. Include in your description how and why c+1 consequence rules can be built from c consequence rules.

Week5

  1. How is a treatment different from a decision tree ( define all terms).

  2. Define lift. For the following example, calculate the lift of outlook=sunny.
        outlook,temperature,humidity,windy,playing
      1 Sunny   85          86       False None
      2 Sunny   80          90       True  None
      3 Sunny   72          95       False None
      4 Rain    65          70       True  None
      5 Rain    71          96       True  None
      6 Rain    70          96       False Some
      7 Rain    68          80       False Some
      8 Rain    75          80       False Some
      9 Sunny   69          70       False Lots
     10 Sunny   75          70       True Lots
     11 Overcast 83         88       False Lots 
     12 Overcast 64         65       True Lots 
     13 Overcast 72         90       True Lots 
     14 Overcast 81         75       False Lots

  3. Calculate the best support for the treatment outlook=sunny (define all terms, show your working).

  4. In TAR3, what is the role of the bestClass variable? Explain how bestClass means that treatments are usually smaller than decision trees?

  5. TAR3 composes some treatments before we pausing to look for new best treatments. In that process, what is the purpose of maxNumber and randomTrials and futileTrials?

  6. What is the role of TAR3's maxSize and granularity variables?

  7. What is the role of ``minObs'' (or ``M'' in the WEKA) in decision tree learning? Explain what happens when ``M'' is made very large and very small.

  8. Discuss the following claim with respect to the following example: Small decision trees result from increasing the purity of the subsets. (In this example, we have been spying on our neighbors and now want to predict when they will play play golf, polo, or tennis or nothing):
     1. if outlook=sunny    then polo,polo,polo,polo,polo,polo,polo,polo,nothing,nothing, nothing,nothing
     2. if outlook=overcast then tennis,tennis, tennis, tennis
     3. if outlook=raining  then golf,golf,polo,polo

    Your dialogue might be able to use these formulae:

     N=a+b+...z
     info([a,b,..z]) = [-a*log(a) - b*log(b) - .... -z*log(z) + N*log(N)]/N

  9. A data set contains one ID column that is different for every instance. Discuss the fragmentation problem this creates. Informally (i.e. non-mathematically) define and describe the use of the gain ratio to reduce the fragmentation problem.

  10. Explain this code:
     BEGIN {

    Command line arguments (none).

    Internal globals:

         Total=0    # count of all instances
       # Classes    # table of class names/frequencies
       # Freg       # table of counters for values in attributes in classes
       # Seen       # table of counters for values in attributes
       # Attributes # table of number of values per attribute
       }
     Pass==1 {train()}
     Pass==2 {print $NF "|" classify()}
     function train(    i,c) { 
       Total++;
       c=$NF;
       Classes[c]++;
       for(i=1;i<=NF;i++) {
         if ($i=="?") continue;
         Freq[c,i,$i]++
         if (++Seen[i,$i]==1) Attributes[i]++}
     }
     function classify(         i,temp,what,like,c) {  
       like = -100000; # smaller than any log
       for(c in Classes) {  
         temp=log(Classes[c]/Total); #uses logs to stop numeric errors
         for(i=1;i<NF;i++) {  
           if ( $i=="?" ) continue;
           temp += log((Freq[c,i,$i]+1)/(Classes[c]+Attributes[i]));
         };
         if ( temp >= like ) {like = temp; what=c}
       };
       return what;
     }

Week4

  1. (a) What is n-way cross-validation? (b) Why does it assess the external validity of a learnt theory? (c) Here, in order are the classes of the 12 instances in a data set:
     1    2    3    4    5    6    7    8    9    10     11     12
     male,male,male,male,male,male,male,male,male,female,female,female

    Assuming a 4-way cross-val, what are the numbers of the instances that would go into the sub-samples assuming (i)stratified cross-val and (ii)non-stratified cross-validation?

  2. Bayesian learners assuming Gaussian distributions for numeric variables. Draw a Gaussian curve. Draw two curves that could defeat a Bayes learner making the Gaussian assumption.

  3. Distinguish (a) supervised discretization from (b) unsupervised discretization. Which kind of discretization does OneR use? Explain your answer.

  4. Distinguish (a) equal-interval binning from (b) bin-log discretization from (c) percentile chops (a.k.a.. equal-frequency binning or histogram equalization). Give separate examples of distributions that would defeat equal-interval and percentile chop discretization.

  5. Regarding the assume tool: (a) what is its purpose?; (b) what is an assumption in assume?; (c) What does the forget() function do?; (d) What are the assume types are restraints files? Also, (e) a data miner has learnt that some range is critical for improving the importance of some simulation. How could assume be made aware that those ranges are important?

  6. Regarding the COCOMO tool: (a) what is its purpose; (b) the COCOMO assumption is that program development time is proportional in what way to program size? (c) distinguish between scale factors from effort multipliers and give one example of each; (d) what is a tuning table?

  7. Regarding the COCOMO expert tool: (a) what is its purpose; (b) give an example of a risk table and how it might be used; (c) how is a single risk calculated? (d) how is total risk calculated?

  8. Explain how the following generates http://www.cs.pdx.edu/~timm/dm/wk4g2a.html.

    Here's some awk code (file=wk4g2.awk)

     /<pre>/,/<\/pre>/ {
       sub(/\#.*/,"<font class=comment>&</font>");
     }
     {print $0}

    Here's some shell script:

     gawk -f wk4g2.awk wk4g2.html> wk4g2a.html; chmod a+r wk4g2a.html

    Here's the header of wk4g2.awk:

     <html>
     <head>
     <LINK REL="stylesheet" HREF="wk4g2.css" TYPE="text/css">
     </head>
     <body>
     <h2>learners.sh</h2>
     <pre>
     #!/usr/bin/bash
     #  CS510,  Data  Mining,  Week  4  Report
     #  Learners.sh,  a  program  to  calculate  and  print  the  accuracy  of  several  learners
     .  config
     #  comma-delimited  output.
     sep=","
     ...

    Here's a CSS file:

     body {background-color: #FFFFFF; }
     pre  {background-color: #FFFFDD;
           color: #0000FF;
           border: .05em solid #000000;
          }
     font.comment{color:#660000}

  9. Here/s some awk code (file= select.awk}
     function any(max, n)   {return int(max*rand())+1}
     function size(group)   {return Members[group,0]}
     function who(group,n)  {return Members[group,n]}
     NR==1  {next}
     {Members[$2,++Members[$2,0]]=$1}
     {Groups[$2]}
     END {
       for(Group in Groups)
         print Group " " who(Group,any(size(Group)))}

    Here's some bash script:

     . config
     cat<<-EOF | $gawk -f select.awk
      fname         group
      george       1
      john        1
      thomas       1
      andrew      1
      martin        2
      william        2
      zachary       2
      ..
      EOF

    What does all this do? Why would students hate this code?

Week3

  1. Discuss: ``OneR offers an inductive generalization but Naive Bayes does not''.

  2. State the Pseudo code of 1R

  3. What is over-fitting? Discuss how the discretization method of 1R tries to avoid over-fitting. Give an example.

  4. State Bayes rule and discuss how it can infer future beliefs as a function of past beliefs plus new evidence.

  5. What are the draw backs of Naive Bayes method ?

  6. How are missing values handled in Naive Bayes?

  7. Apply the 1R pseudo-code to the given data set and determine the attribute and rules which are to be used for classification.
                                                    tear     recommended 
     Age            Spectacle       astigmatism     rate          lenses
     ===================================================================
     young          myope           no              reduced         none
     young          myope           yes             normal          soft
     young          myope           yes             normal          hard
     young          hypermetrope    yes             reduced         none
     Pre-presbyopic myope           no              reduced         none
     Pre-presbyopic myope           no              normal          soft
     Pre-presbyopic hypermetrope    yes             reduced         none
     presbyopic     myope           no              reduced         none
     presbyopic     hypermetrope    yes             normal          none
     presbyopic     hypermetrope    no              normal          soft

    Show all your working. Leave fractions as fractions.

  8. Suppose we have a table of data:
     Make        Size  Convertible Type
     ----        ----  ----------- ----
     Mitsubishi  small    yes      coup
     Mitsubishi  medium   no       suv
     Toyota      small    yes      coup
     Toyota      large    no       coup
     Toyota      large    no       suv
     Benz        small    yes      coup
     Benz        large    no       suv
     BMW         small    yes      coup
     BMW        medium    yes      coup
     Ford        small    yes      coup
     Ford        large    no       suv
     Honda       small    no       coup

    and we see a new example:

     Make Size    Convertible Type           
     ---- ------  ----------- ----           
     Ford medium  no           ?

    Calculate the following for the new example, given the database of old examples:

     A=Likelihood of SUV:
     B=Likelihood of Coup:
     C=Probability of SUV:
     D=Probability of coup:

    Show all your working. Leave fractions as fractions.

Week2

  1. Dumb apes get by. Why is that so surprising? What is one explanation for the surprising success of imperfectly reasoning apes like homo sapien?

  2. For what DIFFERENT tasks would you use (a)the WEKA; (b)bash and (c)gawk. For what SAME task would you use all three tasks together (give details)?

  3. Give an example of a domain with (a)continuous classes and (b)discrete classes.

  4. Distinguish classification rules from decision trees from regression trees from model trees from association rules from linear models from treatments from clusters.

  5. For each of the above types of learnt theory, name one learner that generates them and say whether or not that learner supports continuous or discrete classes.

  6. A user wants a learner to generate a perfect theory that works for all future instances. Explain to this user over-fitting avoidance strategies and how they impact the accuracy of a learner on future examples.

[TOP]


Credits

Author

Tim Menzies , tim@menzies.us, http://menzies.us

Software

This page generated by Site: see http://www.cs.pdx.edu/~timm/dm/site.html

Acknowledgements

This site is built using PerlPod.

Style sheet switching method taken from Eddie Traversa's excellent and simple-to-apply tutorial: http://dhtmlnirvana.com/content/styleswitch/styleswitch1.html.

Search engine powered by ATOMZ http://www.atomz.com/search/. Note, the indexes to this site are only updated weekly (heh, its a free service- what more ja want?).

Icons on this site come from http://www.sql-news.de/rubriken/olap.asp and http://www.ifnet.it/webif/centrodi/eng/toolbar.htm.

The JAVA machine learners used at this site come from the extensive data mining libraries found in the University of Waikato's Environment for Knowledge Analysis (the WEKA) http://www.cs.waikato.ac.nz/ml/weka/

[TOP]


Legal

Copyright

Copyright (C) Tim Menzies 2004

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 2; see http://www.gnu.org/copyleft/gpl.html. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Disclaimer

The content from or through this web page are provided 'as is' and the author makes no warranties or representations regarding the accuracy or completeness of the information. Your use of this web page and information is at your own risk. You assume full responsibility and risk of loss resulting from the use of this web page or information. If your use of materials from this page results in the need for servicing, repair or correction of equipment, you assume any costs thereof. Follow all external links at your own risk and liability.

[TOP]