Machine Learning Algorithms in Java
Ian H. Witten
Department of Computer Science University of Waikato Hamilton, New Zealand E-mail: email@example.com
Department of Computer Science University of Waikato Hamilton, New Zealand E-mail: firstname.lastname@example.org
This tutorial is Chapter 8 of the book Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Cross-references are to other sections of that book. © 2000 Morgan Kaufmann Publishers. All rights reserved.
Nuts and bolts: Machine learning algorithms in Java
All the algorithms discussed in this book have been implemented and made freely available on the World Wide Web (www.cs.waikato.ac.nz/ml/weka) for you to experiment with. This will allow you to learn more about how they work and what they do. The implementations are part of a system called Weka, developed at the University of Waikato in New Zealand. “Weka” stands for the Waikato Environment for Knowledge Analysis. (Also, the weka, pronounced to rhyme with Mecca, is a flightless bird with an inquisitive nature found only on the islands of New Zealand.) The system is written in Java, an object-oriented programming language that is widely available for all major computer platforms, and Weka has been tested under Linux, Windows, and Macintosh operating systems. Java allows us to provide a uniform interface to many different learning algorithms, along with methods for pre- and postprocessing and for evaluating the result of learning schemes on any given dataset. The interface is described in this chapter.
There are several different levels at which Weka can be used. First of all, it provides implementations of state-of-the-art learning algorithms that you can apply to your dataset from the command line. It also includes a variety of tools for transforming datasets, like the algorithms for discretization discussed in Chapter 7. You can preprocess a dataset, feed it into a learning scheme, and analyze the resulting classifier and its performance, all without writing any program code at all. As an example to get you started, we will explain how to transform a spreadsheet into a dataset with the right format for this process, and how to build a decision tree from it.
Learning how to build decision trees is just the beginning: there are many other algorithms to explore. The most important resource for navigating through the software is the online documentation, which has been automatically generated from the source code and concisely reflects its structure. We will explain how to use this documentation and identify Weka’s major building blocks, highlighting which parts contain supervised learning methods, which contain tools for data preprocessing, and which contain methods for other learning schemes. The online documentation is very helpful even if you do no more than process datasets from the command line, because it is the only complete list of available algorithms. Weka is continually growing, and because it is generated automatically from the source code, the online documentation is always up to date. Moreover, it becomes essential if you want to proceed to the next level and access the library from your own Java programs, or to write and test learning schemes of your own.
One way of using Weka is to apply a learning method to a dataset and analyze its output to extract information about the data. Another is to apply several learners and compare their performance in order to choose one for prediction. The learning methods are called classifiers. They all have the same command-line interface, and there is a set of generic command-line options, as well as some scheme-specific ones. The performance of all classifiers is measured by a common evaluation module.
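The idea of several learners sharing one interface and being scored by one evaluation routine can be illustrated with a small, self-contained Java sketch. It does not use Weka or reproduce its API. The names Learner, MajorityLearner, NearestNeighbourLearner, accuracy, and the toy data are all hypothetical, chosen only for illustration, and the two learners are deliberately simple stand-ins for real schemes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UniformInterfaceSketch {

    /** A labelled example: numeric attribute values plus a class label. */
    static final class Example {
        final double[] values;
        final String label;
        Example(double[] values, String label) {
            this.values = values;
            this.label = label;
        }
    }

    /** Hypothetical uniform interface (an assumption; not Weka's own API). */
    interface Learner {
        void build(List<Example> training);
        String classify(double[] values);
    }

    /** Baseline: always predicts the most frequent training label. */
    static final class MajorityLearner implements Learner {
        private String majority;
        public void build(List<Example> training) {
            Map<String, Integer> counts = new HashMap<>();
            int best = 0;
            for (Example e : training) {
                int c = counts.merge(e.label, 1, Integer::sum);
                if (c > best) { best = c; majority = e.label; }
            }
        }
        public String classify(double[] values) { return majority; }
    }

    /** Predicts the label of the closest stored training example. */
    static final class NearestNeighbourLearner implements Learner {
        private List<Example> stored = new ArrayList<>();
        public void build(List<Example> training) {
            stored = new ArrayList<>(training);
        }
        public String classify(double[] values) {
            String label = null;
            double bestDist = Double.POSITIVE_INFINITY;
            for (Example e : stored) {
                double d = 0;  // squared Euclidean distance
                for (int i = 0; i < values.length; i++) {
                    double diff = values[i] - e.values[i];
                    d += diff * diff;
                }
                if (d < bestDist) { bestDist = d; label = e.label; }
            }
            return label;
        }
    }

    /** Common evaluation: train on one set, return accuracy on another. */
    static double accuracy(Learner learner, List<Example> train, List<Example> test) {
        learner.build(train);
        int correct = 0;
        for (Example e : test) {
            if (e.label.equals(learner.classify(e.values))) correct++;
        }
        return (double) correct / test.size();
    }

    /** Made-up training data: three "no" examples and two "yes" examples. */
    static List<Example> toyTrain() {
        List<Example> d = new ArrayList<>();
        d.add(new Example(new double[] {1.0, 1.0}, "no"));
        d.add(new Example(new double[] {1.2, 0.8}, "no"));
        d.add(new Example(new double[] {0.9, 1.1}, "no"));
        d.add(new Example(new double[] {3.0, 3.1}, "yes"));
        d.add(new Example(new double[] {3.2, 2.9}, "yes"));
        return d;
    }

    /** Made-up test data: two examples of each class. */
    static List<Example> toyTest() {
        List<Example> d = new ArrayList<>();
        d.add(new Example(new double[] {1.1, 1.0}, "no"));
        d.add(new Example(new double[] {3.1, 3.0}, "yes"));
        d.add(new Example(new double[] {2.9, 3.2}, "yes"));
        d.add(new Example(new double[] {0.8, 0.9}, "no"));
        return d;
    }

    public static void main(String[] args) {
        // Every learner goes through the same evaluation call.
        Learner[] learners = { new MajorityLearner(), new NearestNeighbourLearner() };
        for (Learner l : learners) {
            System.out.println(l.getClass().getSimpleName() + " accuracy: "
                    + accuracy(l, toyTrain(), toyTest()));
        }
    }
}
```

On this toy data the majority baseline gets half of the four test examples right, while the nearest-neighbour learner gets all four. Because both learners are scored by the same accuracy method, the two numbers are directly comparable, which is the point of a common evaluation module.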
We explain the command-line options and show how to interpret the output of the evaluation procedure. We describe the...