Assignment 1: Introduction to the WEKA Data Mining Software
Starting WEKA in the PC room
- Linux: Download the WEKA jar file. You can put it on the Desktop.
- Open a terminal window, e.g. GNOME Terminal.
- Go to desktop: cd Desktop
- Start the jar file: java -jar weka.jar
- Study the animals in the document (zoo.xls or zoo.txt). Without using a data mining tool, draw a decision tree, three to five levels deep, that classifies each animal as a mammal, bird, reptile, fish, amphibian, insect or invertebrate.
- Read about the ARFF format here. Construct the ARFF header for the animal file.
- Download datasets.zip and unzip it. Start WEKA, choose the Explorer, and open zoo.arff.
- Find out in WEKA how many animals this dataset contains.
- Go to the Classify tab and select the decision tree classifier J48. Click on the text box next to the 'Choose' button. This shows the parameters you can set and a button called 'More'. Which algorithm is implemented by J48?
- Run the selected classifier. What percentage of instances is correctly classified by J48? Which animal types are mistaken for each other?
- Go to the parameter settings again by clicking on the text box next to the 'Choose' button. Change the reducedErrorPruning setting to True and run the classifier again. What is the difference in the results?
- Experiment with some of the other classifiers until you get a better classification performance. Write down the classifier and its performance.
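For the Classify steps above, WEKA reports the percentage of correctly classified instances together with a confusion matrix. As a rough illustration of how the two relate, the Python sketch below computes both from a list of actual and predicted labels. The labels are made up for illustration and are not results from zoo.arff; the function names are hypothetical and have nothing to do with WEKA's own API.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of instances whose predicted class equals the actual class."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) pairs; pairs with differing labels are mistakes."""
    return Counter(zip(actual, predicted))


# Made-up labels for illustration only (not output from WEKA).
actual = ["mammal", "bird", "reptile", "fish", "reptile", "amphibian"]
predicted = ["mammal", "bird", "amphibian", "fish", "reptile", "amphibian"]

acc = accuracy(actual, predicted)
matrix = confusion_matrix(actual, predicted)

# Keep only the off-diagonal cells: these show which classes get confused.
mistakes = {pair: n for pair, n in matrix.items() if pair[0] != pair[1]}

print(f"Correctly classified: {acc:.1%}")
for (a, p), n in sorted(mistakes.items()):
    print(f"{n} x actual {a} classified as {p}")
```

Reading the off-diagonal cells of WEKA's confusion matrix in the same way answers the question of which animal types are mistaken for each other.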
Diabetes, contact lenses and politicians
- The datasets.zip file contains different data sets, ranging from predicting diabetes in an Indian population to distinguishing Republicans from Democrats. Most data sets contain a short description in the 'header'. Choose at least one data set and answer the following questions:
- What needs to be predicted (i.e. which class)?
- Build a classifier and give the quality of the prediction.
Many more data sets can be found on the internet, but not all of them are in the ARFF format. Have a look at the UCI Machine Learning Repository, the standard repository for data sets: http://archive.ics.uci.edu/ml/. Convert the Iris data set to the ARFF format and try different data mining techniques on it.
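One possible way to do the conversion is a short script. The Python sketch below is a minimal example, assuming the input is comma-separated text without a header row, with numeric attributes followed by a nominal class in the last column. The function name csv_to_arff, the relation name and the attribute names are choices made for this illustration; they are not prescribed by the UCI file or by WEKA. The three sample rows only stand in for the full data file.

```python
import csv
import io


def csv_to_arff(csv_text, relation, attribute_names):
    """Convert headerless CSV text to ARFF text.

    Assumption: every column except the last is numeric, and the last
    column is a nominal class whose values are collected from the data.
    """
    # Skip empty lines, which csv.reader returns as empty lists.
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    class_values = sorted({row[-1] for row in rows})

    lines = [f"@RELATION {relation}", ""]
    for name in attribute_names[:-1]:
        lines.append(f"@ATTRIBUTE {name} NUMERIC")
    lines.append(f"@ATTRIBUTE {attribute_names[-1]} {{{','.join(class_values)}}}")
    lines += ["", "@DATA"]
    lines += [",".join(row) for row in rows]
    return "\n".join(lines) + "\n"


# Three sample rows in the layout assumed above, plus a trailing blank line.
sample = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
    "\n"
)
# Attribute names are an assumption; the input has no header row.
names = ["sepallength", "sepalwidth", "petallength", "petalwidth", "class"]

arff = csv_to_arff(sample, "iris", names)
print(arff)
```

To convert a downloaded file, read its contents into a string, pass it to csv_to_arff, and write the result to a file ending in .arff, which can then be opened in the WEKA Explorer. Data sets with nominal or missing attribute values would need a more careful treatment than this sketch provides.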