Chapter 02 - Concept Learning
Date:
Book Name: Machine Learning
Chapter Name: 02 - Concept Learning and the General-to-Specific Ordering
Writer: Tom Mitchell
This chapter is all about concept learning, which is the process of generalizing a category from a series of positive and negative samples of the category.
Concept Learning (a.k.a. classification) can be thought of as the process of inferring a boolean-valued function from training examples of its input and output.
In general, any concept learning task is described using:
The set of instances over which the target function is defined.
The target function.
The set of candidate hypotheses considered by the learner.
The set of available training examples.
Once all of these things have been properly defined, the objective of the learner is to use the training examples to estimate the target function by picking one function among all of the candidate hypotheses.
Since, in the context of a concept learning task, the only information about the target function available to the learner is the training examples, it follows that
Inductive learning algorithms can, at best, guarantee that the output hypothesis fits the target concept over the training data.
This brings us to the fundamental assumption of inductive learning, also known as the inductive learning hypothesis:
Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples.
Concept learning can be viewed as the task of searching through a large space of hypotheses implicitly defined by the hypothesis representation.
By selecting a hypothesis representation, the designer of the learning algorithm implicitly defines the space of all hypotheses that the program can ever represent and therefore can ever learn.
With this view in mind, it is then natural that our study of learning algorithms will examine different strategies for searching the hypothesis space.
Many algorithms for concept learning search through the hypothesis space by using a very useful structure that exists for any concept learning problem: a general-to-specific ordering of hypotheses.
Definition: Let \(h_j\) and \(h_k\) be boolean-valued functions defined over \(X\). We say that \(h_j\) is more_general_than_or_equal_to \(h_k\), and we write \(h_j \geq_g h_k\), if and only if
\[\forall x \in X: \;\; (h_k(x) = 1) \implies (h_j(x) = 1)\]
Sometimes it is also useful to consider when one hypothesis is strictly more general than another. In these cases we say that \(h_j\) is (strictly) more_general_than \(h_k\), and we write \(h_j >_g h_k\), if and only if
\[(h_j \geq_g h_k) \land (h_k \not \geq_g h_j)\]
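For a small, finite instance space the \(\geq_g\) relation can be checked directly from its definition. A minimal sketch, assuming the conjunctive attribute-constraint representation used later in the chapter (tuples of constraints, with `'?'` as the "accept anything" wildcard):

```python
from itertools import product

def satisfies(h, x):
    # x satisfies h when every attribute constraint is '?' or equals
    # the corresponding attribute value of x
    return all(c == '?' or c == v for c, v in zip(h, x))

def more_general_or_equal(h_j, h_k, domains):
    # h_j >=_g h_k  iff  every instance satisfying h_k also satisfies h_j
    return all(satisfies(h_j, x) or not satisfies(h_k, x)
               for x in product(*domains))

def strictly_more_general(h_j, h_k, domains):
    # h_j >_g h_k  iff  h_j >=_g h_k  and not  h_k >=_g h_j
    return (more_general_or_equal(h_j, h_k, domains)
            and not more_general_or_equal(h_k, h_j, domains))
```

For instance, with two attributes, \(( \text{Sunny}, ? ) \geq_g ( \text{Sunny}, \text{Warm} )\), and the check confirms it.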
The \(\geq_g\) relation is important because it provides a useful structure over the hypothesis space \(H\) for any concept learning problem.
One simple way of using the more_general_than partial ordering is to begin with the most specific possible hypothesis in \(H\), then generalize this hypothesis each time it fails to cover an observed positive training example.
This brings us to the Find-S algorithm, which can be used for hypothesis spaces described by conjunctions of attribute constraints:
    Initialize h to the most specific hypothesis in H
    For each positive training instance x
        For each attribute constraint a_i in h
            If the constraint a_i is NOT satisfied by x
                replace a_i in h by the next more general constraint that is satisfied by x
    Output hypothesis h
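The pseudocode above can be sketched in Python for the conjunctive case. The encoding is an assumption: `'0'` stands for the "accept nothing" constraint, `'?'` for the wildcard, so for a single attribute the "next more general constraint" is the observed value (from `'0'`) or `'?'` (from a mismatching value):

```python
def find_s(positive_examples, n_attributes):
    # start from the most specific hypothesis: '0' in every position
    h = ['0'] * n_attributes
    for x in positive_examples:
        for i, (c, v) in enumerate(zip(h, x)):
            if c == '0':
                h[i] = v              # first covering value: adopt it
            elif c != '?' and c != v:
                h[i] = '?'            # constraint violated: generalize
    return tuple(h)
```

On the chapter's EnjoySport positive examples this converges to (Sunny, Warm, ?, Strong, ?, ?).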
Notice that negative examples are never considered by the Find-S algorithm. This is because, in general, as long as we assume that the hypothesis space \(H\) contains a hypothesis that describes the true target concept \(c\) and that the training data contains no errors, then the current hypothesis \(h\) can never require a revision in response to a negative example.
While the Find-S algorithm, when applied to a hypothesis space described by conjunctions of attribute constraints, is guaranteed to output the most specific hypothesis within \(H\) that is consistent with the positive training examples, there are still several questions left unanswered:
Has the learning converged to the correct target concept?
Why prefer the most specific hypothesis?
Are the training examples consistent?
What if there are several maximally specific consistent hypotheses?
Let us say that a hypothesis is consistent with the training examples if it correctly classifies these examples. Formally,
Definition: A hypothesis \(h\) is consistent with a set of training examples \(D\) if and only if \(h(x) = c(x)\) for each example \((x, c(x)) \in D\).
Notice the key difference between consistency and satisfaction for a given element \(x\):
An example \(x\) is said to satisfy hypothesis \(h\) when \(h(x) = 1\), regardless of whether \(x\) is a positive or negative example.
An example \(x\) is instead said to be consistent with \(h\) when \(h(x) = c(x)\).
As we can see, satisfaction depends only on \(h\), while consistency depends on the relation between \(h\) and the target function \(c\).
The subset of all the hypotheses that are consistent with the observed training examples is called the version space. Formally,
Definition: The version space, denoted \(VS_{H,D}\), with respect to hypothesis space \(H\) and training examples \(D\), is the subset of hypotheses from \(H\) consistent with the training examples in \(D\).
\[VS_{H,D} = \{h \in H \;\;|\;\; \text{Consistent}(h, D)\}\]
One obvious way to represent the version space is simply to list all of its members.
This brings us to the following algorithm, called List-Then-Eliminate.
    VersionSpace <- list containing every hypothesis in H
    For each training example (x, c(x))
        remove from VersionSpace any hypothesis h for which h(x) \neq c(x)
    Output VersionSpace
In principle, the List-Then-Eliminate algorithm can be applied whenever the hypothesis space \(H\) is finite. It has many advantages, but unfortunately it requires exhaustively enumerating all hypotheses in \(H\), which is an unrealistic requirement.
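For a small, finite conjunctive hypothesis space the algorithm can be sketched directly. The encoding is an assumption (each constraint is `'?'` or a single attribute value; the "reject everything" hypothesis is omitted for brevity):

```python
from itertools import product

def satisfies(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

def list_then_eliminate(examples, domains):
    # enumerate every conjunctive hypothesis: '?' or one value per attribute
    version_space = list(product(*[('?',) + tuple(d) for d in domains]))
    # eliminate any hypothesis that misclassifies some training example
    for x, label in examples:
        version_space = [h for h in version_space if satisfies(h, x) == label]
    return version_space
```

Even on a toy two-attribute problem the enumeration is the bottleneck: the space grows as the product of (domain size + 1) over all attributes, which is what makes the algorithm unrealistic in general.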
It is intuitively plausible that we can represent the version space in terms of its most specific and most general members. These are defined below:
Definition 1: The general boundary \(G\), with respect to hypothesis space \(H\) and training data \(D\), is the set of maximally general members of \(H\) consistent with \(D\).
\[G = \{g \in H \;|\; \text{Consistent}(g, D) \land \not \exists g^{'} \in H: (g^{'} >_g g) \land \text{Consistent}(g^{'}, D)\}\]
Definition 2: The specific boundary \(S\), with respect to hypothesis space \(H\) and training data \(D\), is the set of minimally general (i.e. maximally specific) members of \(H\) consistent with \(D\).
\[S = \{s \in H \;|\; \text{Consistent}(s, D) \land \not \exists s^{'} \in H: (s >_g s^{'}) \land \text{Consistent}(s^{'}, D)\}\]
As long as the sets \(G\) and \(S\) are well defined, they completely specify the version space. In particular, we have the following theorem, which states that the version space is precisely the set of hypotheses contained in \(G\), plus those contained in \(S\), plus those that lie in between \(G\) and \(S\).
Theorem (Version space representation theorem): Let \(X\) be an arbitrary set of instances, and let \(H\) be a set of boolean-valued hypotheses defined over \(X\). Let \(c: X \to \{0, 1\}\) be an arbitrary target concept defined over \(X\), and let \(D\) be an arbitrary set of training examples \(\{(x, c(x))\}\). For all \(X, H, c\) and \(D\) such that \(S\) and \(G\) are well defined, we have that
\[VS_{H,D} = \{h \in H \;|\; \exists s \in S, \exists g \in G: g \geq_g h \geq_g s \}\]
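The theorem can be sanity-checked on a toy version space by computing its maximal and minimal members under a syntactic \(\geq_g\) test. A sketch, assuming the conjunctive tuple representation (the helper names are mine):

```python
def geq(h_j, h_k):
    # syntactic >=_g for conjunctive hypotheses: every constraint of h_j
    # is either '?' or identical to the corresponding constraint of h_k
    return all(cj == '?' or cj == ck for cj, ck in zip(h_j, h_k))

def boundary_sets(version_space):
    strictly = lambda a, b: geq(a, b) and not geq(b, a)
    # G: members with no strictly more general member in the version space
    G = [h for h in version_space
         if not any(strictly(h2, h) for h2 in version_space)]
    # S: members with no strictly less general member in the version space
    S = [h for h in version_space
         if not any(strictly(h, h2) for h2 in version_space)]
    return S, G
```

Every member of the version space then lies between some element of \(S\) and some element of \(G\), exactly as the theorem states.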
The Candidate-Elimination algorithm works on the same principle as the List-Then-Eliminate one, but it employs a much more compact representation of the version space. In particular, the version space is represented by its most general and least general members.
The algorithm is described as follows
    Initialize G to be the set of maximally general hypotheses in H
    Initialize S to be the set of maximally specific hypotheses in H
    For each training example d do
        If d is a positive example
            ...
        If d is a negative example
            ...
As we can see, as each training example is considered by the algorithm, the \(S\) and \(G\) boundaries are generalized and specialized, respectively, to eliminate from the version space any hypotheses found inconsistent with the new training example. After all the examples have been processed, the computed version space contains all the hypotheses consistent with these examples, and only these hypotheses.
The detailed implementation of these operations will depend on the specific representations for instances and hypotheses. However, the algorithm itself can be applied to any concept learning task and hypothesis space for which these operations are well-defined.
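For the conjunctive attribute-constraint case, these operations can be sketched concretely. This is one possible implementation under stated assumptions, not the definitive one: the `'0'`/`'?'` encoding and helper names are mine, minimal generalization/specialization follow the conjunctive rules, and the \(S\)-side maximality prune is omitted since \(S\) stays a singleton in this representation:

```python
def satisfies(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

def geq(h_j, h_k):
    # syntactic >=_g; a hypothesis containing '0' covers no instance,
    # so every hypothesis is at least as general as it
    if '0' in h_k:
        return True
    return all(cj == '?' or cj == ck for cj, ck in zip(h_j, h_k))

def min_generalization(s, x):
    # minimal generalization of s that covers the positive example x
    return tuple(v if c == '0' else (c if c == v else '?')
                 for c, v in zip(s, x))

def min_specializations(g, x, domains):
    # minimal specializations of g that exclude the negative example x:
    # replace one '?' with any attribute value other than the one in x
    out = []
    for i, c in enumerate(g):
        if c == '?':
            out += [g[:i] + (v,) + g[i+1:] for v in domains[i] if v != x[i]]
    return out

def candidate_elimination(examples, domains):
    n = len(domains)
    G, S = {('?',) * n}, {('0',) * n}
    for x, positive in examples:
        if positive:
            G = {g for g in G if satisfies(g, x)}
            S = {s if satisfies(s, x) else min_generalization(s, x) for s in S}
            S = {s for s in S if any(geq(g, s) for g in G)}
        else:
            S = {s for s in S if not satisfies(s, x)}
            new_G = set()
            for g in G:
                if not satisfies(g, x):
                    new_G.add(g)
                else:
                    new_G |= {h for h in min_specializations(g, x, domains)
                              if any(geq(h, s) for s in S)}
            # keep only maximally general members
            G = {g for g in new_G
                 if not any(g2 != g and geq(g2, g) and not geq(g, g2)
                            for g2 in new_G)}
    return S, G
```

On the chapter's four EnjoySport examples this yields \(S = \{(\text{Sunny}, \text{Warm}, ?, \text{Strong}, ?, ?)\}\) and \(G = \{(\text{Sunny}, ?, ?, ?, ?, ?), (?, \text{Warm}, ?, ?, ?, ?)\}\).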
The version space learned by the CANDIDATE-ELIMINATION algorithm will converge towards the hypothesis that correctly describes the target concept provided that:
There are no errors in the training examples.
There is some hypothesis in H that correctly describes the target concept.
The target concept is exactly learned when the \(S\) and \(G\) boundary sets converge to a single, identical hypothesis.
Sometimes it is possible to classify certain examples with the same degree of confidence as if the target concept had been uniquely identified, even when the version space contains multiple hypotheses.
For example, if all hypothesis say that a certain instance is positive, then that instance must be positive.
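This unanimous-vote rule is easy to state in code. A sketch, assuming the conjunctive-constraint `satisfies` test used throughout:

```python
def satisfies(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

def classify(version_space, x):
    # classify x only when every hypothesis in the version space agrees;
    # otherwise return None, meaning the instance is ambiguous
    votes = {satisfies(h, x) for h in version_space}
    return votes.pop() if len(votes) == 1 else None
```

When the hypotheses disagree, the learner must refuse to commit, which is exactly the behavior the text describes.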
Suppose we wish to assure that the hypothesis space contains the unknown target concept. The obvious solution is to provide a hypothesis space capable of representing every teachable concept; that is, one capable of representing every possible subset of the instances \(X\). In general, the set of all subsets of a set \(X\) is called the power set of \(X\).
Given this hypothesis space, we can safely use the CANDIDATE-ELIMINATION algorithm without worrying that the target might not be expressible. However, this raises a new problem: our concept learning algorithm is now completely unable to generalize beyond the observed examples.
The above point illustrates a fundamental property of inductive inference:
A learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances.
Because inductive learning requires some form of prior assumptions, or inductive bias, we will characterize different learning approaches by the inductive bias they employ. That is, we wish to capture the policy by which the learner generalizes beyond the observed training data to infer the classification of new instances. This brings us to the following definition:
Definition: Consider a concept learning algorithm \(L\) for the set of instances \(X\). Let \(c\) be an arbitrary concept defined over \(X\), and let \(D_c = \{(x, c(x))\}\) be an arbitrary set of training examples of \(c\). Let \(L(x_i, D_c)\) denote the classification assigned to the instance \(x_i\) by \(L\) after training on the data \(D_c\). The inductive bias of \(L\) is any minimal set of assertions \(B\) such that for any target concept \(c\) and corresponding training examples \(D_c\) we have
\[\forall x_i \in X: B \land D_c \land x_i \models L(x_i, D_c)\]
where the notation \(y \models z\) indicates that \(z\) follows deductively from \(y\) (i.e., that \(z\) is provable from \(y\)).
Consider the following three learning algorithms, which are listed from weakest to strongest bias.
Rote-Learner
Learning: Learning corresponds to storing each observed training example in memory, and subsequent instances are classified by looking them up in memory.
Bias: no bias.
Candidate-Elimination
Learning: corresponds to maintaining the version space of all hypotheses consistent with the training examples; new instances are classified only when all members of the version space agree on the classification.
Bias: Target concept can be represented in its hypothesis space.
Find-S:
Learning: corresponds to finding the most specific hypothesis consistent with the positive training examples; this hypothesis is then used to classify all subsequent instances.
Bias: the target concept can be represented in its hypothesis space, and all instances are negative instances unless the opposite is entailed by its other knowledge.
As we examine other inductive inference methods, it is useful to keep in mind this means of characterizing them and the strength of their inductive bias.