Assignment 1: Introduction to the WEKA Data Mining Software
Starting WEKA in the PC room
- Linux: Download the WEKA jar file. You can put it on the Desktop.
- Open a terminal window, e.g. GNOME Terminal.
- Go to the Desktop directory: cd Desktop
- Start the jar file: java -jar weka.jar
Decision Trees
- Study the animals in the document (zoo.xls or zoo.txt). Without using a data mining tool, draw a decision tree, three to five levels deep, that classifies each animal as a mammal, bird, reptile, fish, amphibian, insect or invertebrate.
- Read about the ARFF format here. Construct the ARFF header for the animal file.
- Download datasets.zip and unzip it. Start WEKA, choose the Explorer, and open zoo.arff.
- Find out in WEKA how many animals this dataset contains.
- Go to the Classify tab and select the decision tree classifier J48. Click on the text field next to the 'Choose' button. This shows the parameters you can set and a button called 'More'. Which algorithm is implemented by J48?
- What percentage of instances is correctly classified by J48? Which families are mistaken for each other?
- Go to the parameter settings again by clicking on the text field next to the 'Choose' button. Now change binarySplits to true and build a new decision tree. What is the difference?
- Experiment with some of the other classifiers until you get a better classification performance. Write down the classifier and its performance.
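To answer the questions about correctly classified instances and mistaken families, you can read the confusion matrix that WEKA prints in the classifier output (rows are the actual class, columns the predicted class). The sketch below shows how the percentage and the confused pairs follow from such a matrix. The labels and counts are made up for illustration and are not results for zoo.arff; the function names are hypothetical.

```python
from typing import List, Tuple


def accuracy(matrix: List[List[int]]) -> float:
    """Fraction of instances on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def confusions(matrix: List[List[int]], labels: List[str]) -> List[Tuple[str, str, int]]:
    """List (actual, predicted, count) for every non-zero off-diagonal cell,
    largest count first."""
    n = len(matrix)
    pairs = [
        (labels[i], labels[j], matrix[i][j])
        for i in range(n)
        for j in range(n)
        if i != j and matrix[i][j] > 0
    ]
    return sorted(pairs, key=lambda p: -p[2])


# Made-up example: three classes, 25 instances in total.
labels = ["classA", "classB", "classC"]
matrix = [
    [8, 2, 0],  # actual classA: 8 correct, 2 predicted as classB
    [1, 9, 0],  # actual classB: 1 predicted as classA, 9 correct
    [0, 0, 5],  # actual classC: all correct
]

print(f"correctly classified: {100 * accuracy(matrix):.1f}%")
for actual, predicted, count in confusions(matrix, labels):
    print(f"{count} x {actual} classified as {predicted}")
```

The same reading applies to the output of any other classifier you try, which makes it easier to compare classifiers on more than the overall percentage.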
Pima Indians, mushrooms and politicians
- The datasets.zip file contains several datasets, ranging from predicting diabetes in a Pima Indian population and distinguishing edible mushrooms from poisonous ones to separating Republicans from Democrats. Most datasets contain a short description in the 'header'. Choose at least one dataset and answer the following questions:
- What needs to be predicted (i.e. which attribute is the class)?
- Build a classifier and give the quality of the prediction.
- Give one or more interesting association rules and explain why they are interesting.
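When judging whether an association rule is interesting, two commonly used measures are its support and its confidence. The sketch below computes both for a rule "antecedent => consequent" over a tiny, made-up list of instances; the attribute names and values are invented and do not come from any of the datasets, and the function names are hypothetical.

```python
from typing import Iterable, List, Set


def support(instances: List[Set[str]], items: Iterable[str]) -> float:
    """Fraction of instances that contain every item in `items`."""
    wanted = set(items)
    return sum(1 for inst in instances if wanted <= inst) / len(instances)


def confidence(instances: List[Set[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """support(antecedent and consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(instances, both) / support(instances, antecedent)


# Made-up instances, each written as a set of "attribute=value" items.
instances = [
    {"a=1", "b=yes"},
    {"a=1", "b=yes"},
    {"a=0", "b=no"},
    {"a=0", "b=yes"},
]

# Rule: a=1 => b=yes
print("support:", support(instances, {"a=1", "b=yes"}))
print("confidence:", confidence(instances, {"a=1"}, {"b=yes"}))
```

A rule with high confidence is not automatically interesting: if the consequent holds for almost every instance anyway, the rule tells you little, so compare the confidence with the support of the consequent on its own.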
Other datasets
On the internet you can find many more datasets, but not all of them are in the ARFF format. Choose one of the datasets from http://archive.ics.uci.edu/ml/, convert it to the ARFF format and try different data mining techniques.
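Many of these datasets are plain comma-separated files without a header, so the conversion mainly consists of writing the @relation, @attribute and @data sections yourself. The following is a minimal sketch of such a conversion, under these assumptions: you supply the attribute names by hand, missing values are marked with "?", and neither names nor values contain spaces, commas or quotes (otherwise they would need quoting). The function name csv_to_arff and the sample data are hypothetical.

```python
import csv
import io
from typing import List


def is_number(text: str) -> bool:
    """True if the text can be parsed as a floating-point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text: str, relation: str, names: List[str]) -> str:
    """Convert header-less CSV text to ARFF text.

    A column becomes 'numeric' if all its non-missing values parse as
    numbers; otherwise it becomes nominal with the sorted distinct values.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    lines = [f"@relation {relation}", ""]
    for i, name in enumerate(names):
        values = [row[i] for row in rows if row[i] != "?"]
        if values and all(is_number(v) for v in values):
            lines.append(f"@attribute {name} numeric")
        else:
            nominal = ",".join(sorted(set(values)))
            lines.append(f"@attribute {name} {{{nominal}}}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in rows]
    return "\n".join(lines) + "\n"


# Made-up sample: two numeric columns and one nominal class column.
sample = "5.1,3.5,yes\n4.9,?,no\n"
print(csv_to_arff(sample, "toy", ["length", "width", "label"]))
```

Check the inferred types before loading the result in WEKA: a class coded as 0/1, for example, would be declared numeric by this sketch and has to be changed to a nominal attribute by hand if you want to use it as a class for J48.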