The previous section has shown how features specific to the classification problem at hand can be extracted through a multimodal information-theoretic framework. The application of this framework decreases the estimation error probability. But the question of minimizing the probability P_E of committing an error over the whole classification process still remains. It depends on the choice of a classifier able to classify the extracted features as correctly as possible.
Hypothesis testing for classification
Hypothesis tests are used in detection problems in order to take the most appropriate decision given an observation x of a random variable X. In the problem at hand, the decision function has to decide whether two measurements A and V (or their corresponding extracted features F_A and F_V) originate from a common bimodal source S – the speaker – or from two independent sources – speech and video noise. As previously stated, the problem of deciding which of two mouth regions is responsible for the simultaneously recorded speech audio signal can be solved by evaluating the synchrony, or dependence relationship, between this audio signal and each of the two video signals.
From a statistical point of view, the dependence between the audio and the video features corresponding to a given mouth region can be expressed through a hypothesis framework, as follows:
H0 : (f_a, f_v) ~ P0 = P(f_a) · P(f_v),
H1 : (f_a, f_v) ~ P1 = P(f_a, f_v).
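As an illustration, the two hypotheses can be simulated with a toy binary model (an assumption for illustration only; the paper's features are continuous): under H1 the video feature follows the audio feature most of the time, while under H0 the two are drawn independently, so the empirical joint distribution factorizes into its marginals.

```python
import random

# Toy binary audio/video features (illustrative assumption, not the
# paper's data). Under H1 the video feature copies the audio feature
# 90% of the time; under H0 they are independent, so
# P1(fa, fv) != P0(fa, fv) = P(fa) * P(fv).

def sample_pair(dependent, rng):
    fa = rng.randint(0, 1)
    if dependent:                        # H1: fv follows fa
        fv = fa if rng.random() < 0.9 else 1 - fa
    else:                                # H0: fv independent of fa
        fv = rng.randint(0, 1)
    return fa, fv

def joint_counts(dependent, n, seed):
    """Empirical joint distribution over the four (fa, fv) cells."""
    rng = random.Random(seed)
    counts = {(a, v): 0 for a in (0, 1) for v in (0, 1)}
    for _ in range(n):
        counts[sample_pair(dependent, rng)] += 1
    return {cell: c / n for cell, c in counts.items()}

p1 = joint_counts(True, 20000, 0)    # mass concentrates on fa == fv
p0 = joint_counts(False, 20000, 0)   # mass close to 0.25 per cell
```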
H0 postulates that the data f_a and f_v are governed by a probability density function stating the independence of the video and audio sources. The mouth region should therefore be labeled as "non-speaker". Hypothesis H1 states the dependence between the two modalities: the mouth region is then associated with the measured speech signal and classified as "speaker". The two hypotheses are obviously mutually exclusive. In the Neyman-Pearson approach [10], certain probabilities associated with the hypothesis test are formulated. The false-alarm probability P_FA, or size α of the test, is defined as:

P_FA = P(decide H1 | H0 true) = α,   (7)
while the detection probability P_D, or power β of the test, is given by:

P_D = P(decide H1 | H1 true) = β.   (8)
The Neyman-Pearson criterion selects the most powerful test of size α: the decision rule should be constructed so that the probability of detection is maximal while the probability of false alarm does not exceed a given value α. Using the log-likelihood ratio, the Neyman-Pearson test can be expressed as follows:

L(f_a, f_v) = log [ P1(f_a, f_v) / P0(f_a, f_v) ] = log [ P(f_a, f_v) / (P(f_a) · P(f_v)) ] ≷ η,   (9)

H1 being decided when the ratio exceeds η, and H0 otherwise.
The test function must then decide which of the hypotheses most likely describes the probability density functions of the observations f_a and f_v, by finding the threshold η that gives the best test of size α.
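To make the criterion concrete, the sketch below applies it to a toy binary model (an illustrative assumption; the paper's features are continuous, and the distributions P0, P1 and the Monte-Carlo threshold routine are not from the paper): the threshold η is taken as the (1 − α) quantile of the log-likelihood ratio under H0, so the false-alarm probability stays at α while the detection probability is kept as high as possible.

```python
import math
import random

# Toy model (assumption): under P1 the video feature copies the audio
# feature with probability 0.9; P0 is the independent product of
# uniform marginals.
P1 = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
P0 = {cell: 0.25 for cell in P1}

def llr(pairs):
    """Log-likelihood ratio log P1(pairs)/P0(pairs) of a batch of observations."""
    return sum(math.log(P1[p] / P0[p]) for p in pairs)

def threshold(alpha, n, trials=2000, seed=0):
    """Monte-Carlo estimate of the (1 - alpha) quantile of the LLR under H0."""
    rng = random.Random(seed)
    stats = sorted(
        llr([(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(n)])
        for _ in range(trials))
    return stats[int((1 - alpha) * trials)]

def decide(pairs, eta):
    """Choose H1 (dependent source) when the LLR exceeds the threshold."""
    return "H1" if llr(pairs) > eta else "H0"

# Dependent observations should be flagged as H1 at size alpha = 0.05.
rng = random.Random(1)
h1_pairs = []
for _ in range(100):
    fa = rng.randint(0, 1)
    fv = fa if rng.random() < 0.9 else 1 - fa
    h1_pairs.append((fa, fv))
eta = threshold(0.05, 100)
```

Under H0 the expected per-observation log-ratio is negative (it equals −D(P0 ‖ P1)), so the threshold sits well below zero while dependent batches score far above it.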
The mutual information measures the divergence between a joint distribution stating the dependence of the variables and a distribution stating the independence of those same variables:

I(F_A, F_V) = Σ_{f_a, f_v} P(f_a, f_v) · log [ P(f_a, f_v) / (P(f_a) · P(f_v)) ].   (10)
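The quantity in Eq. (10) can be estimated directly from co-occurrence counts. The sketch below (a plug-in estimate over discrete symbols; the paper's features are continuous, so this discretized version is an illustrative assumption) accumulates the empirical joint and marginal distributions and sums the weighted log-ratios.

```python
import math

def mutual_information(pairs):
    """Plug-in estimate of I(F_A, F_V) from a list of (f_a, f_v) samples."""
    n = len(pairs)
    joint, pa, pv = {}, {}, {}
    for fa, fv in pairs:
        joint[(fa, fv)] = joint.get((fa, fv), 0) + 1 / n
        pa[fa] = pa.get(fa, 0) + 1 / n
        pv[fv] = pv.get(fv, 0) + 1 / n
    # Sum of P(fa, fv) * log[ P(fa, fv) / (P(fa) P(fv)) ], as in Eq. (10).
    return sum(p * math.log(p / (pa[a] * pv[v]))
               for (a, v), p in joint.items())

dependent = [(0, 0), (1, 1)] * 100                    # perfectly synchronized
independent = [(a, v) for a in (0, 1) for v in (0, 1)] * 50
```

For perfectly synchronized binary pairs the estimate reaches log 2 (one bit, in nats), while for uniformly independent pairs it vanishes.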
The link with the hypothesis test of Eq. (9) seems straightforward. Indeed, as the number of observations f_a and f_v grows large, the normalized log-likelihood ratio approaches its expected value and becomes equal to the mutual information between the random variables F_A and F_V [9]. The test function can then be defined as a simple evaluation of the mutual information between the audio and video random variables, with respect to a threshold η. This result differs from the approach of Fisher et al. in [6], where the mouth region exhibiting the largest mutual information value is assumed to have produced the speech audio signal. Formulating the hypothesis test within a Neyman-Pearson approach makes it possible to define a measure of confidence on the decision taken by the classifier, in the sense that the α-β trade-off is known. Considering that two mouth regions could potentially be associated with the current audio signal, and defining one hypothesis test (with associated thresholds η1 and η2) for each of these regions, four different cases can occur:
1. I(F_A, F_V1) > η1 and I(F_A, F_V2) < η2: speaker 1 is speaking and speaker 2 is not;
2. I(F_A, F_V1) < η1 and I(F_A, F_V2) > η2: speaker 2 is speaking and speaker 1 is not;
3. I(F_A, F_V1) < η1 and I(F_A, F_V2) < η2: neither speaker is speaking;
4. I(F_A, F_V1) > η1 and I(F_A, F_V2) > η2: both speakers are speaking.
The experimental conditions are defined so as to eliminate possibilities 3 and 4: the test set is composed of sequences where speakers 1 and 2 speak in turn, without silent states. In the context of this preliminary work, this allows the following simpler rule: if one speaker is silent, the other one is actually speaking. Notice also that a possible equality with the threshold is resolved by randomly assigning a class to the random variable pair.
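The per-region decision just described, including the random tie-break at the threshold, can be sketched as follows (function and parameter names are illustrative, not from the paper):

```python
import random

def label_region(i_av, eta, rng=random):
    """Label one mouth region from its estimated audio-video mutual
    information i_av and its threshold eta (Neyman-Pearson test of
    size alpha); equality with the threshold is broken at random."""
    if i_av > eta:
        return "speaker"
    if i_av < eta:
        return "non-speaker"
    return rng.choice(("speaker", "non-speaker"))  # i_av == eta
```

Under the experimental conditions above, where exactly one of the two regions is speaking at any time, labeling one region as "non-speaker" directly implies that the other region is the speaker.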