Assignment 1: Introduction to the WEKA Data Mining Software

Starting WEKA in the PC room

Decision Trees

  1. Study the animals in the document (zoo.xls or zoo.txt). Without using a data mining tool, draw a decision tree three to five levels deep that classifies each animal as a mammal, bird, reptile, fish, amphibian, insect or invertebrate.
  2. Read about the ARFF format here. Construct the header for the animal file.
  3. Download datasets.zip and unzip it. Start WEKA, choose the Explorer, and open zoo.arff.
  4. Find out in WEKA how many animals this dataset contains.
  5. Go to the Classify tab and select the decision tree classifier J48. Click on the text field next to the 'Choose' button. This shows the parameters you can set and a button labelled 'More'. Which algorithm does J48 implement?
  6. Run the selected classifier. What percentage of instances does J48 classify correctly? Which families are mistaken for each other?
  7. Go to the parameter settings again by clicking on the text field next to the 'Choose' button. Now set binarySplits to true and build a new decision tree. What is the difference?
  8. Experiment with some of the other classifiers until you get a better classification performance. Write down the classifier and its performance.
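
As a cross-check for steps 2 and 4, the sketch below shows the general shape of an ARFF file and counts its instances outside WEKA. The three-attribute sample is a made-up illustration, not the header of zoo.arff, and `count_instances` is a hypothetical helper. It assumes the dense ARFF layout (one instance per line after @data, comment lines starting with %) and attribute names without spaces.

```python
# Hypothetical miniature ARFF file; the real zoo.arff has more attributes.
SAMPLE_ARFF = """\
% Comment lines start with a percent sign.
@relation animals_demo

@attribute name string
@attribute legs numeric
@attribute type {mammal, bird, fish}

@data
aardvark, 4, mammal
dove, 2, bird
carp, 0, fish
"""


def count_instances(arff_text):
    """Return (attribute names, number of instances) for dense ARFF text."""
    attributes = []
    instances = 0
    in_data = False
    for raw in arff_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            instances += 1  # every remaining line is one instance
        elif line.lower().startswith("@attribute"):
            attributes.append(line.split()[1])  # second token is the name
        elif line.lower().startswith("@data"):
            in_data = True
    return attributes, instances


attrs, n = count_instances(SAMPLE_ARFF)
print(attrs, n)
```

To check a real file, read it with `open(path).read()` and pass the text to `count_instances`; the count should agree with the number of instances WEKA reports on the Preprocess tab.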

Pima Indians, mushrooms and politicians

  1. The datasets.zip file contains several datasets, ranging from predicting diabetes in an Indian population and distinguishing edible mushrooms from poisonous ones to distinguishing Republicans from Democrats. Most datasets contain a short description in the 'header'. Choose at least one dataset and answer the following questions:
    • What needs to be predicted (i.e. which class)?
    • Build a classifier. What is the quality of its predictions?
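
The quality of a prediction is commonly reported as the percentage of correctly classified instances together with a confusion matrix, which is also what WEKA prints after a run. The following is a minimal sketch of both measures, assuming you have the actual and predicted class of each instance as two lists. The labels are invented for illustration and `evaluate` is a hypothetical helper, not part of WEKA.

```python
from collections import Counter


def evaluate(actual, predicted):
    """Return (accuracy in percent, Counter of (actual, predicted) pairs)."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty label lists of equal length")
    confusion = Counter(zip(actual, predicted))
    # Correct predictions are the pairs where both labels agree.
    correct = sum(count for (a, p), count in confusion.items() if a == p)
    return 100.0 * correct / len(actual), confusion


# Made-up labels, only to show the calculation.
actual = ["edible", "edible", "poisonous", "poisonous", "edible"]
predicted = ["edible", "poisonous", "poisonous", "poisonous", "edible"]

accuracy, confusion = evaluate(actual, predicted)
print(f"{accuracy:.1f}% correctly classified")
for (a, p), count in sorted(confusion.items()):
    if a != p:
        print(f"{count} x {a} classified as {p}")
```

The off-diagonal entries (pairs whose two labels differ) show which classes are mistaken for each other.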

Other datasets

On the internet you can find many more datasets. Not all of them are in the ARFF format. Have a look at the UCI repository, the standard repository for datasets: http://archive.ics.uci.edu/ml/. Convert the Iris dataset to the ARFF format and try different data mining techniques.
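
One possible way to do the conversion is a short script. The sketch below assumes the input is comma-separated text with four numeric values followed by a class label on each line and no header row. The attribute names in `COLUMNS` are an assumption based on the dataset description, and `csv_to_arff` is a hypothetical helper.

```python
import csv
import io

# Assumed attribute names; the input itself is expected to have no header row.
COLUMNS = ["sepallength", "sepalwidth", "petallength", "petalwidth"]


def csv_to_arff(csv_text, relation="iris"):
    """Convert rows of four numeric values plus a class label to ARFF text."""
    # Blank lines (e.g. at the end of the file) parse as empty rows; drop them.
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    classes = sorted({row[-1] for row in rows})
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} numeric" for name in COLUMNS]
    lines.append("@attribute class {" + ",".join(classes) + "}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in rows]
    return "\n".join(lines) + "\n"


# Three sample rows in the assumed layout, followed by a blank line.
sample = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
    "\n"
)
arff_text = csv_to_arff(sample)
print(arff_text)
```

Write the returned text to a file ending in .arff and open it in the WEKA Explorer. The class values are collected from the data, so the nominal declaration always matches the labels that actually occur.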