Automatic Speech Sequence Segmentation

Bhukya Venkatesh, Roll No.: 150102012, Branch: ECE


Bodda Sai Charan, Roll No.: 150102013, Branch: ECE


Chintapalli Tarun, Roll No.: 150102014, Branch: ECE


Gaddabathini Rahul, Roll No.: 150102019, Branch: ECE

Abstract

This project aims at an unsupervised method of speaker segmentation and clustering of audio data using MFCC (Mel-Frequency Cepstral Coefficients) and features extracted from them. Here the characteristics of the speakers and the number of speakers are unknown, and the audio is split into speech sequences based on speaker transitions.

The first step before doing any audio processing is extracting features from the audio data. The features that best describe the human perception of sensitivity with respect to frequency, and that help distinguish different speakers, are the mel-frequency cepstral coefficients. This is followed by calculation of delta coefficients and delta-delta (acceleration) coefficients, which are then fed into a clustering algorithm (K-means, GMM), followed by speech and speaker segmentation. In this project we have implemented K-means and spherical K-means clustering, along with a modified version of K-means based on a divide-and-conquer algorithm.

1. Introduction
Nowadays, with the rapid advancement in technology, a large increase in the volume of recorded speech is evident. Indeed, television and audio broadcasting, meeting recordings, and voice mails have become commonplace. It is very important for organisations and companies to organise speech data into different classes. However, the huge volume hinders content organization, navigation, browsing, and retrieval. Speaker segmentation and speaker clustering are tools that alleviate the management of huge audio archives. Speaker segmentation aims at splitting an audio stream into acoustically homogeneous segments based on the speaker.
1.1 Introduction to Problem
The main aim of this project is to segment and cluster an audio sample by speaker when the number of speakers is not known beforehand. The main challenge in the process of speaker recognition is separating the audio based on the speaker. Segmentation can enhance the readability of an automatic speech transcription by structuring the audio stream into speaker turns and, when used together with speaker recognition systems, by providing the speaker's true identity. Other challenges arise when multiple speakers are present at the same time instant.
1.2 Figure
[Figure unavailable]
1.3 Literature Review
Unsupervised Speaker Diarization

It explores the conventional techniques, which involve hierarchical agglomerative clustering, and later shifts to Integer Linear Programming (ILP) clustering, which gives state-of-the-art results for unsupervised speaker diarization. In ILP clustering, the k-means problem is modified to obtain a set of clusters.


Vector Quantization Approach for Speaker Recognition

This paper mainly concentrates on using a vector quantization approach for mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and can be represented by its center, called a codeword.


The HTK Book

Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V., Woodland, P., 2006. This book mainly emphasises the implementation of MFCCs and discusses the VOICEBOX toolkit used in MATLAB for MFCC extraction.


1.4 Proposed Approach
The technique of speaker segmentation relies on the following steps:
1) Removing noise from the input audio sample (using amplitude thresholding and ZCR)
2) MFCC feature extraction
3) Delta, delta-delta and other features from the MFCCs
4) Speech segmentation and clustering using K-means and spherical K-means clustering
1.5 Report Organization
The rest of the report is organised as follows: Section 2 describes the proposed approach in detail, Section 3 presents the experiments and results, and Section 4 summarises the conclusions and future extensions.
2. Proposed Approach
The technique of speaker segmentation relies on the following steps:

1) Removal of noise from the input sound sample
Most of the noise is present in the 'silence' parts of the speech signal, so the task is to identify the silent parts of the speech signal and reduce the noise present there. Silence detection is carried out using amplitude thresholding: the audio sample is broken down into many small frames, each of 20 ms, and the maximum amplitude of each frame is found. If the maximum amplitude of a particular frame is less than the threshold (0.05), the data of the frame is replaced by zeros. In this way, silence is detected and the noise present there is removed.
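A minimal MATLAB sketch of this thresholding step is given below; the names x (the mono input signal, assumed normalised to a peak amplitude of 1) and fs (the sampling rate) are illustrative, not part of the original code.

% Amplitude-threshold silence detection: zero out 20 ms frames whose peak
% amplitude falls below the threshold (0.05), as described above.
frame_len  = round(0.020*fs);                 % 20 ms frame length in samples
num_frames = floor(length(x)/frame_len);
threshold  = 0.05;
for k = 1:num_frames
    idx = (k-1)*frame_len + (1:frame_len);    % samples belonging to frame k
    if max(abs(x(idx))) < threshold
        x(idx) = 0;                           % treat the frame as silence
    end
end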

2) MFCC feature extraction
[Figure unavailable]
After the removal of noise from the audio signal, features such as MFCCs, delta MFCCs and delta-delta MFCCs are extracted from it. Extraction of the MFC coefficients from the audio signal is carried out as follows.
The speech signal is first pre-emphasised using a first-order FIR filter. The pre-emphasised speech signal is then subjected to short-time Fourier transform analysis with frame durations of 25 ms, frame shifts of 10 ms and a Hamming analysis window. This is followed by computation of the magnitude spectrum and design of a filterbank with 16 triangular filters uniformly spaced on the mel scale between lower and upper frequency limits of 300 Hz and 3000 Hz. The filterbank is applied to the magnitude spectrum values to produce filterbank energies (FBEs). Log-compressed FBEs are then decorrelated using the discrete cosine transform to produce cepstral coefficients.

i) PRE-EMPHASIS: In speech processing, the original signal usually has too much low-frequency energy, and processing the signal to emphasise higher-frequency energy is necessary. To perform pre-emphasis, we choose some value a between 0.9 and 1. Then each value in the signal is re-evaluated using the formula y[n] = x[n] - a*x[n-1]. This is effectively a first-order high-pass filter. Another good property of pre-emphasis is that it helps deal with the DC offset often present in recordings, and thus it can improve energy-based voice activity detection.
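As a sketch, the same pre-emphasis can be written as a one-line MATLAB filter call; a = 0.97 is used here only as an illustrative value within the stated range.

a = 0.97;                       % pre-emphasis coefficient, chosen between 0.9 and 1
y = filter([1 -a], 1, x);       % first-order FIR high-pass: y[n] = x[n] - a*x[n-1]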

ii) WINDOWING: Speech is a non-stationary signal whose properties change quite rapidly over time. This is perfectly natural, but it makes direct use of the DFT or autocorrelation impossible. For most phonemes, however, the properties of the speech remain invariant for a short period of time (5-100 ms), so over a short window traditional signal-processing methods can be applied relatively successfully. The speech signal is therefore divided into short frames of 25 ms. Finally, it is usually beneficial to taper the samples in each window so that discontinuities at the window edges are attenuated. This is done with a Hamming window:
hamming = @(N)(0.54 - 0.46*cos(2*pi*[0:N-1].'/(N-1)));
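A short sketch of the framing and windowing step under the values quoted above (25 ms frames, 10 ms shift); x and fs again denote the pre-emphasised signal and its sampling rate, and the window comes from the anonymous function defined above.

x = x(:);                                                   % work with a column vector
frame_len  = round(0.025*fs);                               % 25 ms frames
frame_step = round(0.010*fs);                               % 10 ms frame shift
win = hamming(frame_len);                                   % Hamming taper
num_frames = 1 + floor((length(x) - frame_len)/frame_step);
frames = zeros(frame_len, num_frames);
for k = 1:num_frames
    idx = (k-1)*frame_step + (1:frame_len);
    frames(:,k) = x(idx) .* win;                            % one windowed frame per column
end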

iii) FFT MAGNITUDE SPECTRUM: The magnitude spectrum of the discrete Fourier transform of each windowed frame is calculated.

iv) FILTER BANK GENERATION: The human ear resolves frequencies non-linearly across the audio spectrum, and empirical evidence suggests that designing a front-end to operate in a similar non-linear manner improves recognition performance. So, filter banks are generated using triangular filters uniformly spaced on the mel scale (approximately logarithmic in frequency). The filterbank is applied to the magnitude spectrum values to produce filterbank energies (FBEs).

v) CEPSTRAL FEATURES: Most often, however, cepstral parameters are required; these are indicated by setting the target kind to MFCC, standing for Mel-Frequency Cepstral Coefficients (MFCCs). They are calculated from the log filterbank amplitudes {mj} using the Discrete Cosine Transform.
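Steps iii) to v) can be sketched compactly in MATLAB under the parameter values quoted earlier (16 triangular filters between 300 Hz and 3000 Hz, 12 cepstral coefficients); the VOICEBOX/HTK implementations differ in details such as filter normalisation, so this is only illustrative, and frames and fs come from the previous step.

nfft = 2^nextpow2(size(frames,1));
MAG  = abs(fft(frames, nfft));                        % step iii): magnitude spectrum per frame
MAG  = MAG(1:nfft/2+1, :);

M = 16;  fl = 300;  fh = 3000;  C = 12;               % filterbank and cepstrum sizes
hz2mel = @(f) 2595*log10(1 + f/700);                  % mel scale conversions
mel2hz = @(m) 700*(10.^(m/2595) - 1);
edges  = floor(nfft * mel2hz(linspace(hz2mel(fl), hz2mel(fh), M+2)) / fs) + 1;

H = zeros(M, nfft/2+1);                               % step iv): triangular mel filterbank
for m = 1:M
    H(m, edges(m):edges(m+1))   = linspace(0, 1, edges(m+1)-edges(m)+1);
    H(m, edges(m+1):edges(m+2)) = linspace(1, 0, edges(m+2)-edges(m+1)+1);
end
FBE = H * MAG;                                        % filterbank energies

mfccs = dct(log(FBE + eps));                          % step v): log compression + DCT
mfccs = mfccs(2:C+1, :);                              % keep 12 cepstral coefficients (drop c0)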

3) Delta, delta-delta and other features from the MFCCs
Delta and delta-delta coefficients are calculated from the MFCCs using the following formula:
d(t) = [ sum_{n=1..N} n * ( c(t+n) - c(t-n) ) ] / [ 2 * sum_{n=1..N} n^2 ]
where c(t) is the MFCC vector of frame t, d(t) the corresponding delta vector, and N the regression window size; the delta-delta (acceleration) coefficients are obtained by applying the same formula to the delta coefficients.
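A sketch of this regression in MATLAB, assuming mfccs is the coefficient matrix from the previous step (one frame per column) and a window size of N = 2; the delta-delta coefficients follow by running the same code on deltas.

N = 2;                                                                % regression window size
[ncep, T] = size(mfccs);
padded = [repmat(mfccs(:,1),1,N), mfccs, repmat(mfccs(:,end),1,N)];   % replicate edge frames
deltas = zeros(ncep, T);
for t = 1:T
    acc = zeros(ncep, 1);
    for n = 1:N
        acc = acc + n*(padded(:, t+N+n) - padded(:, t+N-n));
    end
    deltas(:,t) = acc / (2*sum((1:N).^2));                            % d(t) from the formula above
end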



The silences found in the first step are employed here: the 12-dimensional MFCCs found between two adjacent silences are averaged and stored in another matrix corresponding to the speech signal.

4) Speech segmentation and Clustering using k-means clustering and spherical k-means clustering (UNSUPERVISED)
Here, we assume that there is at least 5 ms of silence between speaker turns.
Since the number of speakers is not known beforehand, we have implemented a program (kmean_un.m under /code/ in the GitHub repository) that starts with a large number of clusters n. Let their centroids be A1, A2, A3, .... If the distances between any three adjacent cluster centroids do not satisfy the relation
0.4 < Distance(An, An-1) / (Distance(An, An-1) + Distance(An-1, An-2)) < 0.6
we decrement n and repeat the procedure until the relation is satisfied. This results in almost equidistant clusters, which showed better results and converged to the actual number of speakers.
The mean values of the MFCCs between adjacent silences are calculated and clustered using K-means and spherical K-means clustering, with the number of clusters obtained above (kmean_un). The audio data corresponding to these mean MFCC vectors are then joined together, and hence the speech audio of the different speakers is obtained separately.
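As a sketch of this step (not the exact kmean_un.m implementation), assume seg_mfcc holds one averaged 12-dimensional MFCC vector per inter-silence segment (one row per segment), that "adjacent" centroids are taken as consecutive rows of the centroid matrix returned by kmeans, and that MATLAB's kmeans with the cosine distance stands in for spherical K-means.

n = min(10, size(seg_mfcc,1));                        % illustrative upper bound on the number of speakers
while n > 2
    [~, cent] = kmeans(seg_mfcc, n, 'Replicates', 5);
    d = sqrt(sum(diff(cent).^2, 2));                  % distances between adjacent centroids
    r = d(2:end) ./ (d(2:end) + d(1:end-1));          % ratio used in the relation above
    if all(r > 0.4 & r < 0.6)
        break;                                        % clusters roughly equidistant: accept n
    end
    n = n - 1;
end
labels     = kmeans(seg_mfcc, n, 'Replicates', 5);                        % plain K-means
sph_labels = kmeans(seg_mfcc, n, 'Distance', 'cosine', 'Replicates', 5);  % spherical K-means via cosine distance
% segments that share a label are concatenated to recover each speaker's audio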

5) Modified K-Means Clustering for more than 3 speakers

Since K-means is a classifier based on a distance metric, it does not work as expected as the number of speakers increases. To address this problem, we have modified the traditional use of K-means. We found that K-means is more accurate when clustering 2 speakers, so we first find the number of speakers using the program mentioned above (kmean_un); if n > 2, we cluster the data into 2 parts and repeat the same procedure on the child clusters until no child contains more than 2 speakers.

Pseudo Code
Step 1: Pass the sample to kmean_un to find the number of speakers n.
Step 2: If n >= 2, split the sample into 2 clusters.
Step 3: Pass each child cluster to kmean_un.
Step 4: Repeat Step 2 for each child cluster; the output is obtained when no child cluster can be divided further.
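The pseudo code can be turned into a small recursive MATLAB function; this is only a sketch, which assumes kmean_un(X) returns the estimated number of speakers for the rows of the feature matrix X, as described above, and split_cluster is an illustrative name.

function labels = split_cluster(X)
% Divide-and-conquer K-means: split into two, recurse on children that
% still contain more than one estimated speaker.
    n = kmean_un(X);                        % estimated number of speakers in this subset
    if n < 2
        labels = ones(size(X,1), 1);        % a single speaker: nothing to split
        return;
    end
    half = kmeans(X, 2);                    % K-means is reliable for a 2-way split
    if n == 2
        labels = half;                      % base case: exactly two speakers
        return;
    end
    labels = zeros(size(X,1), 1);
    l1 = split_cluster(X(half == 1, :));    % recurse on each child cluster
    l2 = split_cluster(X(half == 2, :));
    labels(half == 1) = l1;
    labels(half == 2) = max(l1) + l2;       % offset so labels remain distinct across children
end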

[Figure unavailable]
3. Experiments & Results
3.1 Analysis

The audio signal can be analysed in two ways: in the time domain and in the frequency domain. Frequency-domain analysis becomes very handy when it comes to differentiating the speakers in an audio signal. For our project, we started with the Fourier transform of the speech signal to check whether we could find similarities in the spectra of audio clips of the same speaker and differences between the spectra of different speakers. Unfortunately, we were unable to see any similarities in the spectra of the same speaker; moreover, the spectra of different speakers sometimes appeared the same. We also found high-frequency components in the spectrum of the speech signal, which shows that noise is present.

To measure the noise in the speech signal at a particular instant, we made use of the ZCR (zero-crossing rate). From this we found that most of the noise is present in the silent parts of the audio signal, i.e. the parts where no one is speaking. To remove this noise, we employed amplitude thresholding: the parts of the signal whose amplitude is less than 5% of the maximum amplitude are replaced with zeros. This ensures that there is no noise in those regions and also tells us the positions of the silences in the signal.

The features we extract should approximate the human auditory system's response. By going through research papers and literature reviews, we found that MFCCs approximate this response, so we extracted the MFC coefficients from the audio signal as described in the proposed approach above.

The final part is to cluster the feature vectors obtained. We used K-means as well as spherical K-means clustering. They work well for two clusters, but when there are more than two clusters K-means does not give the expected results, because K-means and spherical K-means cluster based on a distance metric: the cluster boundary lies exactly halfway between two centroids. That should not happen in this case, because the clusters formed by the feature vectors are closer to Gaussian distributed. To find the number of speakers in the audio signal and cluster the data accordingly, we therefore did the following: first we cluster the data into two clusters (each of which may contain more clusters); next, we send each cluster to an algorithm that decides the number of clusters present within that parent cluster; then we cluster accordingly.
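For reference, the ZCR measure mentioned above can be computed per frame with a few lines of MATLAB; x and fs are the input signal and sampling rate, and the 20 ms frame length is illustrative.

frame_len  = round(0.020*fs);
num_frames = floor(length(x)/frame_len);
zcr = zeros(num_frames, 1);
for k = 1:num_frames
    f = x((k-1)*frame_len + (1:frame_len));
    zcr(k) = sum(abs(diff(sign(f)))) / (2*frame_len);   % fraction of sample pairs with a sign change
end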

3.2 Dataset Description and Observations

Data Set 1: Male-Female (2 speakers)

[Figure unavailable]

Clustered audio according to speaker using K-means

[Figure unavailable]
Speaker 1 clustered audio using K-means
Speaker 2 clustered audio using K-means

Clustered audio according to speaker using spherical K-means

[Figure unavailable]
Speaker 1 clustered audio using spherical K-means
Speaker 2 clustered audio using spherical K-means


Data Set 2: Male-Male (2 speakers)

[Figure unavailable]

Clustered audio according to speaker using K-means

[Figure unavailable]
Speaker 1 clustered audio using K-means
Speaker 2 clustered audio using K-means

Clustered audio according to speaker using spherical K-means

[Figure unavailable]
Speaker 1 clustered audio using spherical K-means
Speaker 2 clustered audio using spherical K-means


Data Set 3: Merge of Data Set 1 and Data Set 2 (4 speakers)

[Figure unavailable]

Clustered audio according to speaker using K-means

[Figure unavailable]
Speaker 1 clustered audio using K-means
Speaker 2 clustered audio using K-means
Speaker 3 clustered audio using K-means
Speaker 4 clustered audio using K-means

Clustered audio according to speaker using spherical K-means

[Figure unavailable]
Speaker 1 clustered audio using spherical K-means
Speaker 2 clustered audio using spherical K-means
Speaker 3 clustered audio using spherical K-means
Speaker 4 clustered audio using spherical K-means

[Figure unavailable]

Clustered audio according to speaker using K-means (modified)

[Figure unavailable]
Speaker 1 clustered audio using K-means (modified)
Speaker 2 clustered audio using K-means (modified)
Speaker 3 clustered audio using K-means (modified)
Speaker 4 clustered audio using K-means (modified)




4. Conclusions
4.1 Summary
Results

The results obtained for the data sets are as follows. For Data Set 1, the algorithm clusters perfectly and two different audio clips are obtained, one for each speaker (male and female). For Data Set 2, the algorithm again clusters perfectly and two different audio clips are obtained, one for each speaker (male and male). When Data Set 1 and Data Set 2 are merged and given as input, the results are not up to the mark using the K-means algorithm alone. When the merged dataset is given to the divide-and-conquer version of K-means, the clustering is better than with plain K-means or spherical K-means.

Conclusions

The results obtained are satisfactory. When there are two speakers, the algorithm clusters perfectly, but when there are more than two speakers it is somewhat less efficient in clustering. K-means is the best fit when there are two speakers. Better results can be obtained with clustering algorithms suited to Gaussian-distributed data; GMMs or HMMs can be employed for that purpose.

4.2 Future Extensions

Clustering using GMMs and HMMs can be employed for better performance.
The algorithm can be made supervised by training it with audio samples of known speakers and then identifying the speaker in new audio data.