Assignment 2: Weka Experiment Environment


The Weka Experiment Environment enables the user to create, run, modify, and analyse experiments in a more convenient manner than is possible when processing the schemes individually. For example, a user can create an experiment that runs several schemes against a series of datasets and then analyse the results to determine if one of the schemes is (statistically) better than the other schemes.
To start the Experiment Environment GUI, start Weka and click Experimenter in the Weka GUI Chooser window.

Defining an Experiment

When the Experimenter is started, the Setup window (actually a pane) is displayed. Click New to initialize an experiment. This causes default parameters to be defined for the experiment.
To define the dataset to be processed by a scheme, first select “Use relative paths” in the Datasets panel of the Setup window, click New, and then click on “Add new ...” to open a dialog window.
(The ARFF files can be found in c:\program files\weka-3-x, or download them here.)
Double click on the “data” folder to view the available datasets or navigate to an alternate location. Select iris.arff and click Open to select the Iris dataset. The dataset name is now displayed in the Datasets panel of the Setup window.

Saving the Results of the Experiment

To identify a file to which the results are to be sent, click on the “CSV file” entry in the Destination panel. Type the name of the output file. (If you save results, do it on your student account (Y:).)

Saving the Experiment Definition

The experiment definition can be saved at any time. Select “Save” and type a file name with the extension “exp” (or select the file if the experiment definition file already exists). The experiment can be restored later by selecting Open.

Running an Experiment

First add the ZeroR algorithm in the Algorithms panel using “Add new ...”. In the Setup window, select the experiment type so that the experiment performs 10 randomized train and test runs on the Iris dataset, using 66% of the patterns for training and 34% for testing, with the ZeroR scheme. Then click the Run tab at the top of the Experiment Environment window and click Start to run the experiment.
If the experiment was defined correctly, 3 messages will be displayed in the Log panel. The results of the experiment are saved in the comma-separated value file you selected earlier. Load it into Excel for analysis.
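As an alternative to Excel, the CSV file can also be summarized with a short script. The sketch below is only an illustration: it assumes that the file contains columns named Key_Scheme and Percent_correct (check the header line of your own file), and it uses a small made-up sample in place of a real results file.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Made-up sample standing in for the Experimenter's CSV output.
# The column names Key_Scheme and Percent_correct are assumptions;
# check the header line of your own results file.
SAMPLE_CSV = """Key_Dataset,Key_Run,Key_Scheme,Percent_correct
iris,1,weka.classifiers.rules.ZeroR,33.0
iris,2,weka.classifiers.rules.ZeroR,31.0
iris,1,weka.classifiers.trees.J48,94.0
iris,2,weka.classifiers.trees.J48,96.0
"""


def mean_percent_correct(csv_file):
    """Return {scheme: mean Percent_correct} over all rows of an open CSV file."""
    per_scheme = defaultdict(list)
    for row in csv.DictReader(csv_file):
        per_scheme[row["Key_Scheme"]].append(float(row["Percent_correct"]))
    return {scheme: mean(values) for scheme, values in per_scheme.items()}


summary = mean_percent_correct(io.StringIO(SAMPLE_CSV))
for scheme, value in summary.items():
    print(f"{scheme}: {value:.2f}")
```

To use it on a real file, pass a file opened with open(..., newline="") instead of the io.StringIO sample.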

Changing the Experiment Parameters

Select the classifier entry (ZeroR) and click “Edit selected”. This scheme has no modifiable properties, but most other schemes do have properties that can be modified by the user. Click “Add new ...” to add J48. Look at how its parameters can be edited and, if desired, modify them.
Run the experiment and observe that results are generated for both schemes.
To add additional schemes, repeat this process. To remove a scheme, select the scheme by clicking on it and then click Delete. Run an experiment with a number of data sets and a number of classifiers at the same time.

Adding Additional Datasets

The scheme(s) may be run on any number of datasets at a time. Additional datasets are added by clicking “Add new …” in the Datasets panel. Datasets are deleted from the experiment by selecting the dataset and then clicking Delete Selected.

Experiment Analyser

Weka includes an experiment analyser that can be used to analyse the results of experiments. Set up an experiment that uses 3 schemes, ZeroR, OneR, and J48, to classify the Iris data using 10 train and test runs, with 66% of the data used for training and 34% used for testing.
After the experiment setup is complete, run the experiment. Then, to analyse the results, select the Analyse tab at the top of the Experiment Environment window. Click “Experiment” (or, alternatively, “File...”) to load the results of the current experiment.
The number of result lines available (“Got 30 results”) is shown in the Source panel. This experiment consisted of 10 runs, for 3 schemes, for 1 dataset, for a total of 30 result lines.
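The count of result lines is a simple product, which the following minimal sketch reproduces. The function name result_lines is made up for this illustration and is not part of Weka.

```python
def result_lines(runs, schemes, datasets):
    """Number of result lines: one per run, per scheme, per dataset."""
    return runs * len(schemes) * len(datasets)


n_lines = result_lines(10, ["ZeroR", "OneR", "J48"], ["iris"])
print(f"Got {n_lines} results")
```

Adding a second dataset to the same experiment would double the number of result lines.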
Select the Percent_correct attribute from the Comparison field and click Perform test to generate a comparison of the 3 schemes.
The schemes used in the experiment are shown in the columns and the datasets used are shown in the rows.
The percentage correct for each of the 3 schemes is shown in each dataset row. The annotation “v” or “*” indicates that a specific result is statistically better (v) or worse (*) than the baseline scheme (in this case, ZeroR) at the significance level specified (currently 0.05). The results of both OneR and J48 are statistically better than the baseline established by ZeroR. At the bottom of each column after the first column is a count (xx/yy/zz) of the number of times that the scheme was better than (xx), the same as (yy), or worse than (zz) the baseline scheme on the datasets used in the experiment. In this example, there was only one dataset and OneR was better than ZeroR once and never equivalent to or worse than ZeroR (1/0/0); J48 was also better than ZeroR on the dataset.
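The (xx/yy/zz) count is just a tally of the annotations in one column. A minimal sketch, assuming the annotations for one scheme are collected in a list with one entry per dataset (“v”, “*”, or an empty string when there is no annotation); the function name and the second example list are made up:

```python
from collections import Counter


def tally(annotations):
    """Count better (v), same (no annotation), and worse (*) results."""
    counts = Counter(annotations)
    return counts["v"], counts[""], counts["*"]


# OneR against the ZeroR baseline on the single Iris dataset.
better, same, worse = tally(["v"])
print(f"({better}/{same}/{worse})")

# A made-up column covering four datasets.
print(tally(["v", "", "*", "v"]))
```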
The value “(10)” at the beginning of the “iris” row defines the number of runs of the experiment.
The standard deviation of the attribute being evaluated can be generated by selecting the Show std. deviations check box.
Selecting Number_correct as the comparison field and clicking Perform test generates the average number correct (out of a maximum of 51 test patterns, which is 34% of 150 patterns in the Iris dataset).
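The figure of 51 test patterns follows from the 66%/34% split of the 150 Iris patterns. The sketch below illustrates one way such a randomized split can be made. It is not Weka's own code: the function name, the use of the run number as random seed, and the rounding rule are assumptions made for this example.

```python
import random


def random_split(n_instances, train_percent, seed):
    """Shuffle the instance indices and cut them into a train and a test part."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    train_size = round(n_instances * train_percent / 100)
    return indices[:train_size], indices[train_size:]


# 10 randomized runs on 150 patterns, each with a different seed.
for run in range(1, 11):
    train, test = random_split(150, 66, seed=run)
    print(f"run {run}: {len(train)} training patterns, {len(test)} test patterns")
```

Every run uses 99 patterns for training and 51 for testing, but a different random selection of patterns ends up in each part.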

Saving the Results

The information displayed in the Test output panel is controlled by the currently-selected entry in the Result list panel. Clicking on an entry causes the results corresponding to that entry to be displayed. The results shown in the Test output panel can be saved to a file by clicking Save output.

Changing the Baseline Scheme

The baseline scheme can be changed by clicking Select base… and then selecting the desired scheme. Select the OneR scheme. This causes the other schemes to be compared individually with the OneR scheme.
Use the Percent_correct field with OneR as the base scheme. The system will indicate that there is no statistical difference between the results for OneR and J48. Is there a statistically significant difference between OneR and ZeroR?

Statistical Significance

The term “statistical significance” used in the previous section refers to the result of a pair-wise comparison of schemes using a “t-test”. As the significance level is decreased, a larger difference between the schemes is required before it is reported as significant, so the confidence in a reported difference increases.
In the current experiment, there is not a statistically significant difference between the OneR and J48 schemes. Experiment with different significance levels and observe how the results change.
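To see how the significance level influences the outcome, the following sketch computes a plain textbook paired t-test on two lists of per-run results. It is an illustration only: the percent-correct values are made up (they are not Weka output), the names are hypothetical, and Weka's own tester may use a different variant of the test. The critical values are the two-sided values of Student's t for 9 degrees of freedom (10 paired runs).

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided critical values of Student's t for 9 degrees of freedom,
# taken from a standard t table.
T_CRITICAL_DF9 = {0.05: 2.262, 0.01: 3.250}

# Made-up percent-correct values for 10 paired runs of two schemes.
scheme_a = [96.5, 95.5, 95.5, 94.5, 98.5, 93.5, 95.5, 97.5, 96.5, 95.5]
scheme_b = [94.0, 96.0, 92.0, 94.0, 96.0, 94.0, 92.0, 96.0, 94.0, 96.0]


def paired_t_statistic(a, b):
    """t statistic of the per-run differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def is_significant(a, b, level):
    """True if the difference is significant at the given level (10 runs only)."""
    return abs(paired_t_statistic(a, b)) > T_CRITICAL_DF9[level]


t_value = paired_t_statistic(scheme_a, scheme_b)
print(f"t = {t_value:.3f}")
for level in (0.05, 0.01):
    print(f"significant at {level}: {is_significant(scheme_a, scheme_b, level)}")
```

With these made-up values the difference is significant at the 0.05 level but no longer at the stricter 0.01 level.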

Summary Test

Select Summary as the Test base and perform a test. The output (ignore the numbers inside the brackets) contains a first row “- 1 1”, which indicates that column “b” (OneR) is better than row “a” (ZeroR) and that column “c” (J48) is also better than row “a”. The remaining entries are 0 because there is no significant difference between OneR and J48 on the dataset that was used in the experiment.
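The summary table can be thought of as a matrix of win counts. A minimal sketch, assuming the significant wins are available per dataset as (winner, loser) pairs; the function name and data layout are made up for this example:

```python
def summary_matrix(schemes, wins_per_dataset):
    """matrix[row][col] = number of datasets where scheme col beats scheme row."""
    index = {name: i for i, name in enumerate(schemes)}
    matrix = [[0] * len(schemes) for _ in schemes]
    for wins in wins_per_dataset:
        for winner, loser in wins:
            matrix[index[loser]][index[winner]] += 1
    return matrix


schemes = ["ZeroR", "OneR", "J48"]
# One dataset (Iris): OneR and J48 both beat ZeroR, OneR and J48 are tied.
iris_wins = [("OneR", "ZeroR"), ("J48", "ZeroR")]
matrix = summary_matrix(schemes, [iris_wins])

for i, row in enumerate(matrix):
    cells = ["-" if i == j else str(value) for j, value in enumerate(row)]
    print(" ".join(cells), "|", schemes[i])
```

The first printed row is “- 1 1”, matching the row for ZeroR described above.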

Ranking Test

Select Ranking from Test base. The ranking test ranks the schemes according to the total wins (“>”) and losses (“<”) against the other schemes. The first column (“>-<”) is the difference between the number of wins and the number of losses.
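The ranking can be sketched in the same way. The example below assumes the significant wins are given as (winner, loser) pairs; the function name and data layout are made up for this illustration.

```python
def rank_schemes(schemes, wins):
    """Return (scheme, wins - losses, wins, losses) rows, best scheme first."""
    rows = []
    for name in schemes:
        won = sum(1 for winner, _ in wins if winner == name)
        lost = sum(1 for _, loser in wins if loser == name)
        rows.append((name, won - lost, won, lost))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Significant wins on the Iris dataset: OneR and J48 both beat ZeroR.
wins = [("OneR", "ZeroR"), ("J48", "ZeroR")]
ranking = rank_schemes(["ZeroR", "OneR", "J48"], wins)

print(">-<  >  <  scheme")
for name, difference, won, lost in ranking:
    print(f"{difference:3d} {won:2d} {lost:2d}  {name}")
```

OneR and J48 each have one win and no losses, while ZeroR has two losses and therefore ends up last.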


Cross-Validation Result Producer

To change from random train and test experiments to cross-validation experiments, choose the cross-validation experiment type in the Setup window.
Set the number of iterations to 1 in the Setup window.
Run and analyse this experiment. There are 30 result lines: 1 run times 10 folds times 3 schemes.
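The following sketch illustrates how 10-fold cross-validation divides the 150 Iris patterns and where the 30 result lines come from. It is a simplified illustration with a made-up function name; it only computes fold sizes and ignores details such as shuffling or balancing the classes across folds.

```python
def fold_sizes(n_instances, n_folds):
    """Sizes of the test folds when the data is divided as evenly as possible."""
    base, extra = divmod(n_instances, n_folds)
    return [base + 1 if i < extra else base for i in range(n_folds)]


sizes = fold_sizes(150, 10)
for fold, test_size in enumerate(sizes, start=1):
    print(f"fold {fold}: train on {150 - test_size}, test on {test_size}")

runs, folds, schemes = 1, 10, 3
cv_result_lines = runs * folds * schemes
print(f"{cv_result_lines} result lines")
```

Each pattern is used for testing exactly once and for training in the other nine folds.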

Averaging Result Producer

An alternative to the cross-validation result producer is the averaging result producer. This result producer takes the average of a set of runs (which are typically cross-validation runs). It is selected by clicking Advanced in the Setup window, then clicking the Result Generator panel and selecting AveragingResultProducer from the drop-down list.
Conduct an experiment in which the ZeroR, OneR, and J48 schemes are run 10 times with 10-fold cross-validation. Each run of 10 cross-validation folds is then averaged, producing one result line for each run (instead of one result line for each fold as in the previous example using the cross-validation result producer), for a total of 30 result lines.
Note that while the results generated by the averaging result producer are slightly worse than those generated by the cross-validation result producer, the standard deviations are considerably smaller with the averaging result producer.
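The smaller standard deviations are a consequence of averaging: the average of 10 folds varies less from run to run than the individual fold results do. The sketch below illustrates this effect with randomly generated percent-correct values for a single scheme. The numbers are made up and are not Weka output.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)
# Made-up percent-correct values: 10 runs x 10 folds for one scheme.
fold_results = [[rng.gauss(94.0, 5.0) for _ in range(10)] for _ in range(10)]

# One result line per fold (as with the cross-validation result producer).
all_folds = [value for run in fold_results for value in run]
# One result line per run (as with the averaging result producer).
run_averages = [mean(run) for run in fold_results]

print(f"{len(all_folds)} fold results, std. deviation {pstdev(all_folds):.2f}")
print(f"{len(run_averages)} run averages, std. deviation {pstdev(run_averages):.2f}")
```

With 3 schemes, the 10 averaged lines per scheme give the 30 result lines mentioned above.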