Assignment 3: A data mining challenge

Introduction

The final assignment of this course is a competition. The challenge is to build a classification model for a given dataset, and the most accurate model wins. You are given a labeled dataset on which you can try out several algorithms, and an unlabeled dataset for which you are asked to provide the labels (predictions) using your model.

You participate in this challenge individually. You may ask classmates questions, but the final process and report must be your own work. Your submitted solution counts for 1 out of 10 points of your final grade (the remaining 9 points are based on the written exam). Since this is a competition, the best solutions will be rewarded: submissions are ranked on predictive accuracy, and the top 30% gain a bonus point.

Experimenting

The labeled dataset can be found here. Use it to preprocess the data, select algorithms, optimize parameters and build models, using the WEKA Explorer or Experimenter. Note that this is a rather large dataset, and some classifiers may require a lot of memory. It is therefore a good idea to start WEKA with additional memory, e.g., using 'java -Xmx1000M -jar weka.jar'.

Submitting your predictions

The unlabeled dataset can be found here. When you have done all preprocessing and have selected your classifier and parameter settings, you should use the generated model to generate predictions for this unlabeled dataset. Guidelines can be found here. For instance, you can do the following:

  • Use the Explorer to build a model on the labeled dataset. You will get performance results as usual.
  • On the left, select 'Supplied test set' and set it to the unlabeled dataset.
  • Under 'More Options', make sure that 'Output predictions' is checked. If you use WEKA 3.7, select CSV as output format.
  • In the result list (bottom left), right-click your model and choose 'Re-evaluate model on current test set'.
  • In the output window (right), you will find the predictions under the header '=== Predictions on user test set ==='

Using this method, you will get an output that looks like this:

=== Predictions on test set ===
inst# actual predicted error probability distribution
1 ? 2:+ + 0.104 *0.896
2 ? 1:- + *0.741 0.259

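These columns are the instance number, the actual label (unknown, shown as '?'), the predicted label (pos or neg), the error flag (not meaningful here, since the actual label is unknown) and the predicted probability of each class, with an asterisk marking the predicted class. If you use WEKA 3.7, the output can be slightly different.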
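If you want to inspect the predictions programmatically before sending them in, the rows can be parsed with a few lines of Python. The following is only a minimal sketch: it assumes the columns are separated by whitespace, as in `1 ? 2:+ + 0.104 *0.896`, and the function name `parse_prediction` is made up for this illustration. Other WEKA versions may print a different layout.

```python
import re

# One prediction row: instance number, actual label, "index:label", then the rest.
# Assumption: columns are whitespace-separated, e.g. "1 ? 2:+ + 0.104 *0.896".
ROW = re.compile(
    r"^\s*(?P<inst>\d+)\s+(?P<actual>\S+)\s+(?P<index>\d+):(?P<label>\S+)"
    r"\s+(?P<rest>.*)$"
)


def parse_prediction(line):
    """Return (instance number, predicted label, probabilities), or None
    if the line is not a prediction row (e.g. a header line)."""
    match = ROW.match(line)
    if match is None:
        return None
    probabilities = []
    for token in match.group("rest").split():
        # The asterisk marks the predicted class; drop it before converting.
        try:
            probabilities.append(float(token.lstrip("*")))
        except ValueError:
            continue  # not a number, e.g. the "+" in the error column
    return int(match.group("inst")), match.group("label"), probabilities


if __name__ == "__main__":
    for row in ["1 ? 2:+ + 0.104 *0.896", "2 ? 1:- + *0.741 0.259"]:
        print(parse_prediction(row))
```

Header lines do not start with an instance number, so the function returns None for them and they can simply be skipped.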

Finally, send in the entire prediction list.

Timing

  • Start: October 16, 2017
  • Questions and answers: November 13, 2017
  • Deadline: December 1, 2017
  • Results published: Mid December, 2017

What to hand in

You should hand in two separate, uncompressed files.
Name them xxxxxxx-prediction.csv and xxxxxxx-report.pdf, where xxxxxxx is your student number (no leading 's').
Do not send a single file with the combined content of these two files.
Do not put the two files together in an archive (e.g., zip), and do not compress them individually.

The prediction file should start with the header line 'inst# actual predicted error probability distribution'.
The number of instances should be exactly equal to the number of records in the unlabeled dataset file. If it is not, you are probably about to hand in a result on the training data.
Both comma and tab delimited prediction files are accepted.
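
A quick self-check of the prediction file before submitting can catch the mistakes mentioned above. The sketch below is one possible way to do this, not an official validator. It assumes that the unlabeled dataset is an ARFF file (WEKA's native format, which this page does not state explicitly), and the function names `count_arff_instances` and `check_predictions` are hypothetical.

```python
EXPECTED_HEADER = [
    "inst#", "actual", "predicted", "error", "probability", "distribution",
]


def count_arff_instances(lines):
    """Count the data rows of an ARFF file given as an iterable of lines.
    Assumption: every non-empty, non-comment line after @data is one instance."""
    in_data = False
    count = 0
    for line in lines:
        stripped = line.strip()
        if not in_data:
            if stripped.lower() == "@data":
                in_data = True
            continue
        if stripped and not stripped.startswith("%"):
            count += 1
    return count


def check_predictions(pred_lines, n_expected):
    """Return a list of problems; an empty list means both checks passed."""
    rows = [line.strip() for line in pred_lines if line.strip()]
    if not rows:
        return ["file is empty"]
    problems = []
    # Accept space, tab and comma as delimiters in the header line.
    header = rows[0].replace(",", " ").split()
    if header != EXPECTED_HEADER:
        problems.append("unexpected header line")
    n_rows = len(rows) - 1  # everything after the header
    if n_rows != n_expected:
        problems.append(f"expected {n_expected} predictions, found {n_rows}")
    return problems
```

A typical use would be to open both files, pass the unlabeled dataset to `count_arff_instances`, and pass the result together with the lines of the prediction file to `check_predictions`.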

Also required is a report (2-4 pages) describing the process, the classifiers tried, the pre-processing tried, and the final choices that led to your prediction.
Put your name and student number in the report.

Both files can be sent to m.meeng@liacs.leidenuniv.nl
Construct your e-mail as follows, where xxxxxxx is your student number (no leading 's'):

  • Subject: dami2017-xxxxxxx
  • Attachments: xxxxxxx-prediction.csv, xxxxxxx-report.pdf
  • Put your name and student number in the message body.
    No further information is required in the message.
If you hand in a written report for assignment 1 or 2, follow the same guidelines as above.
Name your report xxxxxxx-assignment1.pdf or xxxxxxx-assignment2.pdf, respectively.
Put your name and student number in the report.
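
The naming rules for the challenge submission can be summarized in a small helper. This is only an illustrative sketch: the function name `submission_names` and the student number used in the example are made up, and the code assumes that a student number consists of digits only once a leading 's' has been dropped.

```python
import re


def submission_names(student_number):
    """Build the e-mail subject and attachment names for the challenge."""
    number = student_number.strip()
    # The file names and subject use the number without a leading 's'.
    if number[:1].lower() == "s":
        number = number[1:]
    # Assumption: what remains should be digits only.
    if not re.fullmatch(r"\d+", number):
        raise ValueError(f"not a valid student number: {student_number!r}")
    return {
        "subject": f"dami2017-{number}",
        "attachments": [f"{number}-prediction.csv", f"{number}-report.pdf"],
    }


if __name__ == "__main__":
    # Hypothetical student number, for illustration only.
    print(submission_names("s1234567"))
```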