Thirty-three analyses were performed. Twenty-four voice samples originated from 12 patients diagnosed with spasmodic dysphonia (SD) by at least two experienced laryngologists and analyzed just before and a few weeks after treatment. Seven patients had no post-treatment analysis. Two patients had two pre-treatment recordings made several months apart. There were 11 females and 8 males. Mean age was 60.6 (±9.3) years.
All patients read a standardized list of 40 short German sentences, phonetically selected for being constantly voiced. This is intended to increase the sensitivity for detecting interruptions of vocal fold vibration induced by SD. Reading takes about 2 min 30 s.
The digital recording was made in a quiet room at a sampling frequency of 44,100 Hz (44.1 kHz).
Perceptual parameters
Perceptual parameters were scored on a scale of 0–10, with 0 the worst possible rating and 10 the best. Scoring was performed blindly and independently by three experienced voice clinicians, and the scores were averaged. When two recordings were available for a given patient, they were rated comparatively, but without knowledge of the condition (pre- or post-treatment).
Traditional perceptual voice characteristics [2]:
- G (Grade): overall impression of quality, integrating all specific characteristics.
- B (Breathiness): audible unintended additive turbulent noise.
- R (Roughness): the impression of irregular Fo, creakiness, harshness, including perception of individual acoustic impulses (fry).
Perceptual voice characteristics dedicated to SD:
- I (Intelligibility): the impression of intelligibility, i.e., the extent to which the message can be correctly understood.
- F (Fluency): smoothness of speech production.
- Vo (Voicing): the extent to which speech is voiced or unvoiced when it actually needs to be voiced or unvoiced.
- S (Spasmodicity): the specific perceptual characteristic of adductor spasmodic dysphonia, combining strain, perception of spasms and tremor.
I, F, and Vo are taken over from the INFVo rating scale developed for and investigated on substitution voices [7] and already tried out on SD voices [8].
Acoustic parameters
The analysis program “AMPEX” (Auditory Model Based Pitch Extractor), created by Van Immerseel and Martens [9] and further developed until very recently, was used for the acoustic measurements. It has been shown to extract the period validly from irregular signals with background noise. It also detects low-frequency components (<0.1 kHz), is suited for running speech, and has been used efficiently for substitution voices [10]. A characteristic of this program is that it covers the three deviant acoustic events that were found relevant for characterizing SD (aperiodicity, phonatory breaks, and frequency shifts) without requiring subjective intervention by an experimenter to place cursors and identify deviant events, as in the experiment of Sapienza et al. [5].
The acoustic analysis is performed in three stages. In the first stage, short-term acoustic features are extracted every 10 ms by the auditory model described in [9]. Then these features are employed to distinguish speech frames from background (silence) frames. Finally, a global analysis of the short-term acoustic feature patterns over the entire recording is performed to produce a limited set of features that is expected to characterize the voice of the recorded speaker.
Every 10 ms, the auditory model produces a set of more than 30 features, but for the present study, only 4 of them are relevant, namely, the energy (E), the voicing evidence (VE), the voiced/unvoiced nature (VU), and the pitch frequency (Fo) (in case of voicing) of the frame. The reader is referred to [5] for more details as to how these features are actually computed.
The speech/background classification of the frames is based on an analysis of the smoothed energy pattern. The smoothed energy of frame i is computed as the mean of the energies in frames i − 2 to i + 2. In a first step, a background threshold is determined as 1.1 times the minimal energy plus 0.05 times the maximum energy found in the recording. All frames exceeding this threshold are initially labeled as speech and the others as background. However, to prevent too many weak parts of speech (e.g., closures of plosives, weak consonants) from being classified as background, any interval shorter than 100 ms that was labeled as background is converted to speech again.
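The classification rule just described can be sketched in a few lines of Python. This is an illustrative sketch, not the AMPEX implementation: the function name `classify_speech_frames` is hypothetical, and three details the text leaves open are assumptions here, namely that the minimum and maximum are taken over the smoothed energies, that the recording edges are padded by repeating the first and last frame, and that short background runs at the very start or end of the recording are also converted.

```python
import numpy as np

def classify_speech_frames(energy, min_gap_frames=10):
    """Return a boolean array: True = speech frame, False = background.

    energy: per-frame energies, one value every 10 ms.
    min_gap_frames: background runs shorter than this (10 frames = 100 ms)
    are relabeled as speech.
    """
    energy = np.asarray(energy, dtype=float)
    # Smoothed energy of frame i = mean of frames i-2 .. i+2
    # (edge padding is an assumption, not stated in the text).
    padded = np.pad(energy, 2, mode="edge")
    smoothed = np.convolve(padded, np.ones(5) / 5.0, mode="valid")
    # Background threshold: 1.1 * minimum + 0.05 * maximum
    # (taken over the smoothed energies, an assumption).
    threshold = 1.1 * smoothed.min() + 0.05 * smoothed.max()
    speech = smoothed > threshold
    # Relabel background runs shorter than 100 ms as speech.
    i, n = 0, len(speech)
    while i < n:
        if speech[i]:
            i += 1
            continue
        j = i
        while j < n and not speech[j]:
            j += 1
        if j - i < min_gap_frames:
            speech[i:j] = True
        i = j
    return speech
```

With a 50 ms dip in energy between two loud stretches, the dip is absorbed into speech, while the long silences before and after remain background.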
The first feature emerging from the global analysis stage characterizes the ability of the speaker to produce voicing. It comes in two flavors: the proportion of voiced frames (PVF) in the entire recording and the proportion of voiced speech frames (PVS). Because pauses and weak speech sounds are typically unvoiced, PVS is expected to be larger than PVF.
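As an illustration of the two variants, the sketch below computes both proportions from per-frame voiced and speech labels. The function name `voiced_proportions` and the choice to report percentages are assumptions made for the example.

```python
import numpy as np

def voiced_proportions(voiced, speech):
    """Return (PVF, PVS) in percent.

    voiced: per-frame voiced/unvoiced decisions (VU).
    speech: per-frame speech/background labels.
    """
    voiced = np.asarray(voiced, dtype=bool)
    speech = np.asarray(speech, dtype=bool)
    # PVF: voiced frames relative to all frames in the recording.
    pvf = 100.0 * voiced.mean()
    # PVS: voiced frames relative to the frames labeled as speech.
    if speech.any():
        pvs = 100.0 * voiced[speech].mean()
    else:
        pvs = float("nan")  # no speech frames: PVS is undefined
    return float(pvf), float(pvs)
```

For eight frames of which three are voiced and four are labeled as speech (including all voiced ones), this gives PVF = 37.5% and PVS = 75%.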
The second feature is the average voicing evidence (AVE) in the voiced frames. It characterizes the degree of regularity/periodicity in the voiced frames. Since the real background frames are normally unvoiced, the analysis is performed on all frames, and not just on the speech frames, in the hope of being more robust against possible errors of the speech/background classification. That classification is, after all, purely energy based, whereas the voicing evidence is derived from an analysis of all the subband signals created by the auditory model.
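A minimal sketch of this feature, assuming per-frame arrays of voicing evidence and voiced/unvoiced decisions are available (the function name `average_voicing_evidence` is hypothetical):

```python
import numpy as np

def average_voicing_evidence(ve, voiced):
    """Mean voicing evidence over all voiced frames of the recording.

    No speech/background mask is applied, matching the description above.
    """
    ve = np.asarray(ve, dtype=float)
    voiced = np.asarray(voiced, dtype=bool)
    if not voiced.any():
        return float("nan")  # no voiced frames: AVE is undefined
    return float(ve[voiced].mean())
```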
The third feature is the traditional ‘Jitter’: JIT and JITc (corrected jitter) represent the Fo-jitter in all voiced frame pairs (=two consecutive frames) and in the voiced frame pairs with a reliable Fo in each of the two frames, respectively. The jitter is computed as:
$$ \text{Jitter} = \frac{\sum_{i} \text{VE}(i)\,\left| T_0(i) - T_0(i-1) \right|}{\sum_{i} \text{VE}(i)\,T_0(i-1)}, \quad T_0 = 1/F_0 $$
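The formula can be read as a VE-weighted mean absolute period difference, normalized by the VE-weighted mean period. The sketch below follows that reading; it is an assumption about how the sums are grouped, and the function name `weighted_jitter` and its optional `reliable` mask (used here to mimic JITc) are hypothetical.

```python
import numpy as np

def weighted_jitter(f0, ve, voiced, reliable=None):
    """VE-weighted jitter over pairs of consecutive voiced frames.

    f0: per-frame pitch in Hz (ignored where voiced is False).
    ve: per-frame voicing evidence, used as weight VE(i).
    reliable: optional per-frame mask; if given, only pairs in which
    both frames are reliable are used (a JITc-like variant).
    """
    f0 = np.asarray(f0, dtype=float)
    ve = np.asarray(ve, dtype=float)
    ok = np.asarray(voiced, dtype=bool)
    if reliable is not None:
        ok = ok & np.asarray(reliable, dtype=bool)
    # Period T0 = 1 / F0, computed only where the frame is usable.
    t0 = np.full(f0.shape, np.nan)
    t0[ok] = 1.0 / f0[ok]
    # A pair (i-1, i) counts only if both frames are usable.
    pairs = ok[1:] & ok[:-1]
    cur, prev, w = t0[1:][pairs], t0[:-1][pairs], ve[1:][pairs]
    denom = np.sum(w * prev)
    if denom == 0:
        return float("nan")  # no usable pairs
    return float(np.sum(w * np.abs(cur - prev)) / denom)
```

A perfectly steady Fo yields a jitter of 0, and unvoiced frames (where Fo may be stored as 0) never enter the computation.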
A fourth feature is the 90th percentile (VL 90) of the voicing length distribution. It is considered a robust estimate of the maximum voicing duration. The voicing length is defined as the number of consecutive voiced frames in the data.
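To make the definition concrete, the following sketch collects the lengths of all runs of consecutive voiced frames and takes their 90th percentile. The function names are hypothetical, the result is expressed in frames (10 ms each), and the use of NumPy's default linear-interpolation percentile is an assumption, since the text does not specify the percentile method.

```python
import numpy as np

def voicing_lengths(voiced):
    """Lengths (in frames) of all runs of consecutive voiced frames."""
    lengths, run = [], 0
    for v in voiced:
        if v:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:  # close a run that reaches the end of the recording
        lengths.append(run)
    return lengths

def vl90(voiced):
    """90th percentile of the voicing length distribution, in frames."""
    lengths = voicing_lengths(voiced)
    if not lengths:
        return float("nan")  # no voiced frames at all
    return float(np.percentile(lengths, 90))
```

Because a phonatory break splits one long voiced run into two shorter ones, frequent breaks shift the whole distribution, and hence its 90th percentile, toward smaller values.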
Acoustic measurements
With AMPEX, the following features have been estimated:
- PVF/PVS: PVF is the proportion of voiced frames and depends on the pauses appearing in speech. In addition, the PVS, the proportion of voiced speech frames, is computed, thus considering only frames that are classified as speech in the first step of the analysis. Since pauses and weak sounds are typically unvoiced, PVS will typically be larger than PVF. For sustained vowels it should be expected that PVS = PVF = 100% in a normal voice. For constantly voiced sentences, the rule is: the better the voice, the higher the percentages.
- AVE: the average voicing evidence in voiced frames. The more regular (periodic) the voiced frames, the higher the AVE.
- VL 90: the 90th percentile of the voicing length distribution. The voicing length is defined as the number of consecutive voiced frames found in the data. The 90th percentile of the voicing length distribution may be considered a robust estimate of the maximum voicing duration. Phonatory breaks decrease the value of this feature.
- JIT and JITc: the cycle-to-cycle period perturbation and the corrected cycle-to-cycle period perturbation. JIT represents the Fo-jitter in all voiced frame pairs (=2 consecutive frames), and JITc the Fo-jitter in the voiced pairs with a reliable Fo in each of the two frames.
- JITN and JITNc: the corresponding jitter features computed without applying the VE (voicing evidence) weighting.
- PVFU: the percentage of frames with an “unreliable” Fo. For example, observed sudden frequency shifts suggest that the Fo estimate is unreliable.
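For orientation, the unweighted variant can be written by setting all weights VE(i) to 1 in the jitter formula. This is an assumed reading of “without applying the VE weighting”, not a formula given in the source; the sum runs over the same voiced frame pairs as for JIT (or, for JITNc, over the pairs with a reliable Fo in both frames):

$$ \text{JITN} = \frac{\sum_{i} \left| T_0(i) - T_0(i-1) \right|}{\sum_{i} T_0(i-1)}, \quad T_0 = 1/F_0 $$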
Further, the total duration required for reading the 40 sentences was also measured in seconds.