1. Introduction
In the era of the Internet of Things, intelligent assistants (e.g., Alexa, Siri, Cortana, and Google Assistant) can interact with humans using certain default language settings (mostly English), and these smart assistants rely heavily on Automatic Speech Recognition (ASR). The motivation for our work stems from the inadequacy of virtual assistants in providing support in multilingual settings. To make intelligent assistants more versatile, language identification (LID) can be implemented to recognise the speaker's language automatically and adjust the language settings accordingly. Psychological studies show that humans have an inherent capability to determine the language of an utterance nearly instantly. Automatic LID seeks to classify a speaker's language from their speech utterances.
We focus our study of LID on Indian languages since India, the world's second most populous country and seventh largest by land area, is linguistically highly diverse. India currently has 28 states and 8 Union Territories, each with its own language(s), but none of these languages is recognised as the national language of the country. Only English and Hindi are used as official languages according to the Constitution of India, Part XVII, Chapter 1, Article 343. The Eighth Schedule of the Constitution currently lists 22 languages. Table 1 describes the 22 languages recognised under the Eighth Schedule of the Constitution of India, as of 1 December 2007.
Most of the Indian languages originated from the Indo-Aryan and Dravidian language families. It can be seen from Table 1 that different languages are spoken in different states; however, languages do not obey geographical boundaries. Therefore, many of these languages, particularly in the neighbouring regions, have multiple dialects which are amalgamations of two or more languages.
Such enormous linguistic diversity makes it difficult for citizens to communicate in different parts of the country. Bilingualism and multilingualism are the norm in India. In this context, a LID system becomes a crucial component of any speech-based smart assistant. The biggest challenge, and hence an area of active innovation, for Indian languages is that most of these languages are under-resourced.
Every spoken language has its underlying lexical, speaker, channel, environment, and other variations. The likely differences among spoken languages lie in their phoneme inventories, the frequency of occurrence of the phonemes, acoustics, the span of the sound units, and intonation patterns at higher levels. The overlap between the phoneme sets of two or more related languages makes recognition challenging, and the low-resource status of these languages makes the training of machine learning models doubly difficult. Our methodology aims to identify the spoken language correctly despite these limitations.
Convolutional neural networks (CNNs) have been heavily utilised by natural language processing (NLP) researchers from the very beginning due to their efficient use of local features. While recurrent neural networks (RNNs) have been shown to be effective in a variety of NLP tasks in the past, recent work with Attention-based methods has outperformed all previous models and architectures because of their ability to capture global interactions. Yamada et al. (2020) were able to achieve better results than BERT (Devlin et al. 2019), SpanBERT (Joshi et al. 2020), XLNet (Yang et al. 2019), and ALBERT (Lan et al. 2020) using their Attention-based methods in the question-answering domain. Researchers (Gu, Wang, and Junbo 2019; Chen and Heafield 2020; Takase and Kiyono 2021) have employed Attention-based methods to achieve state-of-the-art (SOTA) performance in machine translation. Transformers (Vaswani et al. 2017), which utilise a self-attention mechanism, have found extensive application in almost all fields of NLP, such as language modelling, text classification, topic modelling, emotion classification, and sentiment analysis, and have produced SOTA performance.
In this work, we present LID for Indian languages using a combination of CNN, RNN, and Attention-based methods. Our LID methods cover 13 Indian languages. Additionally, our method is language-agnostic. The main contributions of this work can be summarised as follows:
- We carried out exhaustive experiments using CNN, convolutional recurrent neural network (CRNN), and Attention-based CRNN frameworks for the LID task on 13 Indian languages and achieved SOTA results.
- The model exhibits exceptional performance on languages that belong to the same language family, as well as on diverse language sets, under both normal and noisy conditions.
- We empirically show that the CRNN framework achieves better or similar results compared to the CRNN with Attention framework, while requiring less computational overhead.
2. Related works
Extraction of language-dependent features, for example, prosody and phonemes, was widely used to classify spoken languages (Zissman 1996; Martínez et al. 2011; Ferrer, Scheffer, and Shriberg 2010). Following the success of speaker verification systems, identity vectors (i-vectors) have also been used as features in various classification frameworks. Use of i-vectors requires significant domain knowledge (Dehak et al. 2011b; Martínez et al. 2011). In recent trends, researchers rely on neural networks for feature extraction and classification (Lopez-Moreno et al. 2014; Ganapathy et al. 2014). Revay and Teschke (2019) used the ResNet50 (He et al. 2016) framework for classifying languages by generating the log-Mel spectra of each raw audio sample. The framework uses a cyclic learning rate in which the learning rate increases and then decreases linearly. The maximum learning rate for a cycle is set by finding the optimal learning rate using fastai (Howard and Gugger 2020).
Gazeau and Varol (2018) established the use of neural networks, support vector machines (SVMs), and hidden Markov models (HMMs) to identify different languages. HMMs convert speech into a sequence of vectors and are used to capture temporal features in speech. Established LID systems (Dehak et al. 2011a; Martínez et al. 2011; Plchot et al. 2016; Zazo et al. 2016) are based on identity vector (i-vector) representations for language processing tasks. In Dehak et al. (2011a), i-vectors are used as data representations for a speaker verification task and fed to the classifier as the input. Dehak et al. (2011a) applied an SVM with cosine kernels as the classifier, while Martínez et al. (2011) used logistic regression for the actual classification task. Recent years have seen the use of neural networks for feature extraction, particularly long short-term memory (LSTM) networks (Lozano-Diez et al. 2015; Zazo et al. 2016; Gelly et al. 2016). These neural networks produce better accuracy while being simpler in design than the conventional LID methods (Dehak et al. 2011a; Martínez et al. 2011; Plchot et al. 2016). Recent trends in developing LID systems are mainly focused on different forms of LSTMs with deep neural networks (DNNs). Plchot et al. (2016) used a three-layered CNN where i-vectors formed the input layer and a softmax activation function formed the output layer. Zazo et al. (2016) used mel-frequency cepstral coefficient (MFCC) features with shifted delta coefficients as input to a unidirectional LSTM layer that is directly connected to a softmax classifier. Gelly et al. (2016) used audio transformed to perceptual linear prediction (PLP) coefficients and their first- and second-order derivatives as input to a bidirectional LSTM running in the forward and backward directions. The forward and backward sequences generated by the bidirectional LSTM were joined together and used to classify the language of the input samples. Lozano-Diez et al. (2015) used CNNs for their LID system. They transformed the input data into an image containing MFCCs with shifted delta coefficient features, where the x-axis represents the time domain and the y-axis represents frequency bins.
Lozano-Diez et al. (2015) used a CNN as the feature extractor for the identity vectors. They achieved better performance when combining both the CNN features and identity vectors. Revay and Teschke (2019) used the ResNet (He et al. 2016) framework for language classification by generating spectrograms of each audio sample. A cyclic learning rate (Smith 2018) was used, in which the learning rate increases and then decreases linearly. Venkatesan et al. (2018) utilised MFCCs to infer aspects of speech signals from Kannada, Hindi, Tamil, and Telugu. They obtained accuracies of 76 per cent and 73 per cent using SVM and decision tree classifiers, respectively, on 5 hours of training data. Mukherjee et al. (2019) used CNNs for LID in German, Spanish, and English. They used filter banks to extract features from frequency-domain representations of the signal. Aarti and Kopparapu (2017) experimented with several auditory features in order to determine the optimal feature set for a classifier to detect Indian spoken languages. Sisodia et al. (2020) evaluated ensemble learning models for classifying spoken languages such as German, Dutch, English, French, and Portuguese. Bagging, AdaBoost, random forests, gradient boosting, and extra trees were used in their ensemble learning models.
Heracleous et al. (2018) presented a comparative study of DNNs and CNNs for spoken LID, with SVMs as the baseline, and also reported the performance of a fusion of these methods. The NIST 2015 i-vector machine learning challenge dataset was used to assess the system's performance, with the goal of detecting 50 in-set languages. Bartz et al. (2017) tackled the problem of LID in the image domain rather than the typical acoustic domain. A hybrid CRNN is employed for this, which operates on spectrogram images of the provided audio clips. Draghici et al. (2020) addressed the LID task using mel-spectrogram images as input features, comparing the performance of CNN and CRNN architectures under this strategy. This work is characterised by a modified training strategy that provides equal class distribution and efficient memory utilisation. Ganapathy et al. (2014) reported how they used bottleneck features from a CNN for the LID task. Bottleneck features were used in conjunction with conventional acoustic features, and performance was evaluated. Experiments revealed that, compared to a system without bottleneck features, average relative improvements of up to 25 per cent are achieved. Zazo et al. (2016) proposed an open-source, end-to-end LSTM-RNN system that outperforms a more recent reference i-vector system by up to 26 per cent when both are tested on a subset of the NIST Language Recognition Evaluation (LRE) with eight target languages.
Our research differs from the previous works on LID in the following aspects:
- Comparison of the performance of CNN, CRNN, and CRNN with Attention frameworks.
- Extensive experiments with our proposed model show its applicability to both close-language and noisy-speech scenarios.
3. Model framework
Our proposed framework consists of three models.
- CNN-based framework
- CRNN-based framework
- CRNN with Attention-based framework
We made use of the capacity of CNNs to capture spatial information to identify languages from audio samples. In a CNN-based framework, our network uses four convolution layers, where each layer is followed by the ReLU (Nair and Hinton, Reference Nair and Hinton2010) activation function and max pooling with a stride of 3 and a pool size of 3. The kernel sizes and the number of filters for each convolution layer are (3, 512), (3, 512), (3, 256), and (3, 128), respectively.
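For concreteness, a minimal Keras sketch of this convolutional stack is shown below; the layer sizes follow the description above, the $(1000, 13)$ MFCC input shape follows Section 4.1, and the padding, global pooling, and classifier head are our assumptions rather than the exact implementation.

```python
# A minimal sketch of the CNN branch described above (not the authors' exact code).
# Assumptions: 'valid' padding, global average pooling, and a softmax classifier head.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(num_classes=13, input_shape=(1000, 13)):
    inputs = layers.Input(shape=input_shape)              # MFCC frames x coefficients
    x = inputs
    for filters in (512, 512, 256, 128):                  # four Conv1D blocks, kernel size 3
        x = layers.Conv1D(filters, kernel_size=3, activation="relu")(x)
        x = layers.MaxPooling1D(pool_size=3, strides=3)(x)
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)

cnn_model = build_cnn()
```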
Figure 1 provides a schematic overview of the framework. The CRNN framework passes the output of the convolutional module to a bidirectional LSTM consisting of a single LSTM with 256 output units. The LSTM’s activation function is $tanh$ , and its recurrent activation is $sigmoid$ . The Attention mechanism used in our framework is based on Hierarchical Attention Networks (Yang et al. Reference Yang, Yang, Dyer, He, Smola and Hovy2016). In the Attention mechanism, contexts of features are summarised with a bidirectional LSTM by going forward and backwards:
In equation (1), $L$ is the number of audio specimens, $a_{n}$ is the input sequence to the LSTM network, and $\overrightarrow{a_{n}}$ and $\overleftarrow{a_{n}}$ are the learned vectors from the forward and backward directions of the LSTM, respectively. The vector $a_{i}$ builds the base for the Attention mechanism. The goal of the Attention mechanism is to learn the model through training with randomly initialised weights and biases. The layer also ensures, via the $\tanh$ function, that the network does not stall: the function keeps the input values between –1 and 1 and maps zeros to near-zero values. The output of the $\tanh$ layer is then multiplied by a trainable context vector $u_{i}$. The trainable context vector refers to a vector learned during the training process and used as a fixed-length representation of the entire input. In our framework, the Attention mechanism is used to compute a weighted sum of the sequences for each speech utterance, where the weights are learned based on the relevance of each sequence to the speech utterance. This produces a fixed-length vector for each utterance that captures the most salient information in the sequences. The context weight vector $u_{i}$ is randomly initialised and jointly learned during the training process. The improved vectors are represented by $a_{i}^{\prime}$ as shown in equation (2):
Context vectors are finally calculated by assigning a weight to each $a_{i}^{\prime}$: each weight is obtained by dividing the exponential of the corresponding previously generated vector by the sum of the exponentials of all previously generated vectors, as shown in equation (3). To avoid division by zero, an epsilon is added to the denominator:
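The displayed equations (1)-(3) referenced in this passage are not reproduced here; the following is our reconstruction of a standard Hierarchical-Attention-style formulation consistent with the description above (notation and exact form are assumptions, not the original typesetting):

```latex
% Reconstruction (not the original typesetting) of equations (1)-(3).
\begin{align}
\overrightarrow{a_{n}} = \overrightarrow{\mathrm{LSTM}}(a_{n}), \qquad
\overleftarrow{a_{n}} = \overleftarrow{\mathrm{LSTM}}(a_{n}), \qquad n = 1,\dots,L \tag{1}\\
a_{i}^{\prime} = \tanh\!\left(W a_{i} + b\right) \tag{2}\\
\alpha_{i} = \frac{\exp\!\left({a_{i}^{\prime}}^{\top} u_{i}\right)}
                  {\sum_{j=1}^{L} \exp\!\left({a_{j}^{\prime}}^{\top} u_{j}\right) + \epsilon} \tag{3}
\end{align}
```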
The sum of these importance weights concatenated with the previously calculated context vectors is fed to a linear layer with thirteen output units serving as a classifier for the thirteen languages.
Figure 2 presents the schematic diagram of the Attention module, where $a_{i}$, the output of the bidirectional LSTM layers, is the input to the module.
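As a rough illustration, the attention module of Figure 2 can be implemented as a custom Keras layer along the following lines; the layer name, weight shapes, and initialisers are our assumptions.

```python
# A sketch of the HAN-style attention applied to the BiLSTM outputs (assumptions noted above).
import tensorflow as tf
from tensorflow.keras import layers

class HANAttention(layers.Layer):
    def __init__(self, units=256, **kwargs):
        super().__init__(**kwargs)
        self.units = units

    def build(self, input_shape):
        dim = int(input_shape[-1])                         # e.g. 512 for a 256-unit BiLSTM
        self.W = self.add_weight(name="W", shape=(dim, self.units), initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(self.units,), initializer="zeros")
        self.u = self.add_weight(name="u", shape=(self.units, 1), initializer="glorot_uniform")

    def call(self, a):                                     # a: (batch, time, dim)
        a_prime = tf.tanh(tf.einsum("btd,du->btu", a, self.W) + self.b)   # cf. equation (2)
        scores = tf.einsum("btu,uo->bto", a_prime, self.u)
        alpha = tf.exp(scores) / (tf.reduce_sum(tf.exp(scores), axis=1, keepdims=True) + 1e-8)
        return tf.reduce_sum(alpha * a, axis=1)            # weighted sum over time, cf. equation (3)

# usage: context = HANAttention()(bilstm_outputs)
```

The returned context vector would then be passed to the 13-unit classification layer described above.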
4. Experiments
4.1 Feature extraction
For feature extraction from spoken utterances, we used MFCCs. For calculating MFCCs, we used a pre-emphasis coefficient ($pre\_emphasis$) of 0.97, a frame size ($f\_size$) of 0.025 (25 ms), a frame stride ($f\_stride$) of 0.015 (15 ms), an N-point fast Fourier transform ($NFFT$) of 512, a low-frequency mel ($lf$) of 0, a number of filters ($nfilt$) of 40, a number of cepstral coefficients ($ncoef$) of 13, and a cepstral lifter ($lifter$) of 22. We used a frame size of 25 ms, as frame sizes in the speech processing domain typically range from 20 ms to 40 ms with roughly 50 per cent overlap between consecutive frames (a 15 ms frame stride in our case):
We set the low-frequency mel ($lf$) to 0, and the high-frequency mel ($hf$) is calculated using equation (4). $lf$ and $hf$ model the non-linear perception of sound by the human ear, which is more discriminative at lower frequencies and less discriminative at higher frequencies:
As shown in equation (5), the emphasised signal is calculated by applying a first-order pre-emphasis filter to the signal ($sig$). The number of frames is calculated by taking the ceiling of the absolute difference between the signal length ($sig\_len$) and the product of frame size ($f\_size$) and sample rate ($sr$), divided by the product of frame stride ($f\_stride$) and sample rate ($sr$), as shown in equation (6). The signal length is the length of $emphasized\_signal$ calculated in equation (5):
Using equation (7), $pad\_signal$ is generated by concatenating $emphasized\_signal$ with a zero-valued array of dimension ($pad\_signal\_length - signal\_length$) $\times$ 1, where $pad\_signal\_length$ is calculated as $n\_frames\times (f\_stride \times sr) + (f\_size \times sr)$:
Frames are extracted from the $pad\_signal$ elements as shown in equation (8): the frame indices are obtained by adding an array of integers from 0 to $f\_size\times sr$, tiled $n\_frames$ times, to the transpose of an array of length $n\_frames$ whose consecutive elements differ by $(f\_stride\times sr)$:
Power frames, shown in equation (9), are calculated as the squared magnitude of the $NFFT$-point discrete Fourier transform (DFT) of the product of the Hamming window and each frame:
Mel points form an array whose elements are calculated as shown in equation (10), where $i$ takes values from $lf$ to $hf$:
From equation (11), bins are calculated by taking the floor of the product of the hertz points and $NFFT + 1$ divided by the sample rate. Hertz points are calculated by multiplying 700 by $10^{\frac{mel\_points}{2595}} - 1$:
Bins calculated from equation (11) are used to calculate filter banks as shown in equation (12). Each filter in the filter bank is triangular, with a response of 1 at the central frequency and a linear drop to 0 till it meets the central frequencies of the two adjacent filters, where the response is 0.
Finally, the MFCCs are calculated as shown in equation (13) by decorrelating the filter bank coefficients using the discrete cosine transform (DCT) to obtain a compressed representation of the filter banks. Sinusoidal liftering is applied to the MFCCs to de-emphasise the higher coefficients, which improves classification for noisy signals:
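Since the displayed equations (4)-(13) are not reproduced here, the following NumPy sketch reconstructs the described pipeline with the parameter values of Section 4.1; it is our reconstruction under those assumptions, not the authors' extraction code.

```python
# Hedged reconstruction of the MFCC pipeline described by equations (4)-(13).
# Parameter values follow Section 4.1; everything else is an assumption.
import numpy as np
from scipy.fftpack import dct

def mfcc(sig, sr, pre_emphasis=0.97, f_size=0.025, f_stride=0.015,
         NFFT=512, nfilt=40, ncoef=13, lifter=22, lf=0):
    # Pre-emphasis (cf. equation (5))
    emphasized = np.append(sig[0], sig[1:] - pre_emphasis * sig[:-1])
    # Framing (cf. equations (6)-(8))
    frame_len, frame_step = int(round(f_size * sr)), int(round(f_stride * sr))
    n_frames = int(np.ceil(abs(len(emphasized) - frame_len) / frame_step))
    pad_len = n_frames * frame_step + frame_len
    pad_signal = np.append(emphasized, np.zeros(pad_len - len(emphasized)))
    idx = (np.tile(np.arange(frame_len), (n_frames, 1))
           + np.tile(np.arange(0, n_frames * frame_step, frame_step), (frame_len, 1)).T)
    frames = pad_signal[idx] * np.hamming(frame_len)
    # Power spectrum (cf. equation (9))
    pow_frames = (np.abs(np.fft.rfft(frames, NFFT)) ** 2) / NFFT
    # Mel filter bank (cf. equations (4), (10)-(12))
    hf = 2595 * np.log10(1 + (sr / 2) / 700)
    hz_points = 700 * (10 ** (np.linspace(lf, hf, nfilt + 2) / 2595) - 1)
    bins = np.floor((NFFT + 1) * hz_points / sr).astype(int)
    fbank = np.zeros((nfilt, NFFT // 2 + 1))
    for m in range(1, nfilt + 1):                              # triangular filters
        left, centre, right = bins[m - 1], max(bins[m], bins[m - 1] + 1), bins[m + 1]
        right = max(right, centre + 1)
        fbank[m - 1, left:centre] = (np.arange(left, centre) - left) / (centre - left)
        fbank[m - 1, centre:right] = (right - np.arange(centre, right)) / (right - centre)
    filter_banks = np.dot(pow_frames, fbank.T)
    filter_banks = 20 * np.log10(np.where(filter_banks == 0, np.finfo(float).eps, filter_banks))
    # DCT and sinusoidal liftering (cf. equation (13))
    coefs = dct(filter_banks, type=2, axis=1, norm="ortho")[:, :ncoef]
    n = np.arange(ncoef)
    coefs *= 1 + (lifter / 2) * np.sin(np.pi * n / lifter)
    return coefs
```

For a 16 kHz utterance this yields an (n_frames, 13) matrix, which we assume is then cropped or zero-padded to the $(1000, 13)$ input shape mentioned below.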
The MFCC features of shape $(1000, 13)$ generated from equation (13) are provided as input to the neural network, which expects this dimension, followed by the convolution layers described in Section 3. The raw speech signal cannot be fed directly to the framework as it contains a large amount of noise; extracting features from the speech signal and using them as input to the model therefore produces better performance than using the raw speech signal directly. Our motivation for using MFCC features is that the feature count is small enough to force the model to learn the essential information in the sample. The parameters relate to the amplitudes of frequencies and provide frequency channels with which to analyse the speech specimen.
4.2 Data
4.2.1 Benchmark data
The Indian language (IL) dataset was acquired from the Indian Institute of Technology, Madras. The dataset includes thirteen widely used Indian languages. Table 2 presents the statistics of this dataset, which we used for our experiments.
4.2.2 Experimental data
Over the past two decades, the development of LID methods has been largely fostered by the NIST LREs. As a result, the most popular benchmark for evaluating new LID models and methods is the NIST LRE evaluation dataset (Sadjadi et al. 2018). The NIST LRE datasets mostly contain narrow-band telephone speech. The datasets are typically distributed by the Linguistic Data Consortium (LDC) and cost thousands of dollars; for example, the standard Kaldi (Povey et al. 2011) recipe for LRE07 relies on 18 LDC SLR datasets that cost approximately $15,000 for LDC non-members. This makes it difficult for new research groups to enter the academic field of LID. Furthermore, the NIST LRE evaluations focus mostly on telephone speech.
As the NIST LRE dataset is not freely available, we used the EU dataset (Bartz et al. 2017), which is open source. The EU dataset contains YouTube news data for four major European languages: English (en), French (fr), German (de), and Spanish (es). Statistics of the dataset are given in Table 3.
4.3 Environment
We implemented our framework using the TensorFlow (Abadi et al. 2016) backend. We split the Indian language dataset into training, validation, and testing sets containing 80%, 10%, and 10% of the data, respectively, for each language and gender.
For regularisation, we apply dropout (Srivastava et al. 2014) after the max-pooling layer and the bidirectional LSTM layer, with a dropout rate of 0.1. An $l_{2}$ regularisation with weight $10^{-6}$ is also added to all the trainable weights in the network. We train the model with the Adam (Kingma and Ba 2014) optimiser with $\beta _{1} = 0.9$, $\beta _{2} = 0.98$, and $\varepsilon = 10^{-9}$ and the learning rate schedule of Vaswani et al. (2017), with 4k warm-up steps and a peak learning rate of $0.05/\sqrt{d}$, where $d$ is 128. A batch size of 64 with sparse categorical cross-entropy as the loss function was used.
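A sketch of this optimiser and warm-up schedule in Keras is given below; the functional form of the schedule (linear warm-up to the peak, followed by inverse-square-root decay, as in Vaswani et al. 2017) is our assumption.

```python
# Sketch of the training configuration described above; the exact schedule shape is an assumption.
import tensorflow as tf

class WarmupSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d=128, warmup_steps=4000, scale=0.05):
        super().__init__()
        self.peak = scale / tf.sqrt(tf.cast(d, tf.float32))        # peak learning rate 0.05 / sqrt(d)
        self.warmup_steps = tf.cast(warmup_steps, tf.float32)

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        # linear warm-up to the peak, then inverse-square-root decay
        return self.peak * tf.minimum(step / self.warmup_steps,
                                      tf.sqrt(self.warmup_steps / tf.maximum(step, 1.0)))

optimizer = tf.keras.optimizers.Adam(learning_rate=WarmupSchedule(),
                                     beta_1=0.9, beta_2=0.98, epsilon=1e-9)
# model.compile(optimizer=optimizer,
#               loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```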
4.4 Result on Indian language dataset
The proposed framework was assessed against Kulkarni et al. (2022) using identical datasets. Both CRNN and CRNN with Attention exhibited superior performance compared to the results reported by Kulkarni et al. (2022), as shown in Table 4. Their CNN framework uses six linear layers with 256, 256, 128, 64, 32, and 13 units, respectively, whereas their DNN framework uses three LSTM layers with 256, 256, and 128 units, respectively, followed by a dropout layer, three time-distributed layers, and a linear layer with 13 units.
We evaluated system performance using the following evaluation metrics: recall (TPR), precision (PPV), F1 score, and accuracy. Since one of our major objectives was to measure the accessibility of the network to new languages, we introduced balancing of the training data for each class, as the number of samples available per class may vary drastically. This is the case for the Indian language dataset, as shown in Table 2, in which Kannada, Marathi, and particularly Bodo have a limited amount of data compared to the rest of the languages. To alleviate this data imbalance problem, we used dynamic class weight balancing computed with scikit-learn (Pedregosa et al. 2011).
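The dynamic class weighting can be computed with scikit-learn roughly as follows; the label array here is an illustrative stand-in for the actual training labels.

```python
# Sketch of the 'balanced' class-weight computation; y_train below is a toy stand-in.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_train = np.array([0, 0, 0, 0, 1, 1, 2])          # illustrative integer language labels
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes.tolist(), weights.tolist()))

# passed to Keras at training time, e.g.:
# model.fit(x_train, y_train, class_weight=class_weight, batch_size=64, ...)
```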
PPV, TPR, F1 score, and accuracy are reported in Table 5 for the three frameworks: CNN, CRNN, and CRNN with Attention. From Table 5, it is clearly visible that both the CRNN framework and CRNN with Attention provide competitive results with an accuracy of 0.987. Tables 6, 7, and 8 show the confusion matrices for CNN, CRNN, and CRNN with Attention.
From Tables 6, 7, and 8, it can be observed that Assamese gets confused with Manipuri; Bengali gets confused with Assamese, Manipuri, Tamil, and Telugu; and Hindi gets confused with Malayalam.
Assamese and Bengali originate from the same language family, and they share approximately the same phoneme set. Bengali and Tamil, however, are from different language families but share a similar phoneme set. For example, in Bengali, cigar is churut and star is nakshatra, while cigar in Tamil is charuttu and star is natsattira, which are quite similar. Similarly, Manipuri and Assamese share similar phonemes. On close study, we observed that Hindi and Malayalam also have similar phoneme sets, as both languages borrowed most of their vocabulary from Sanskrit. For example, 'arrogant' is Ahankar in Hindi and Ahankaram in Malayalam. Similarly, Sathyu, commonly spoken as Satya in Hindi, means 'truth', which is Sathyam in Malayalam. Also, the word Sundar in Hindi is Sundaram in Malayalam, which means 'beautiful'.
Table 9 shows the most common classification errors encountered during evaluation.
4.5 Result on same language families on Indian language dataset
A deeper study of these thirteen Indian languages led us to define five clusters of languages based on their phonetic similarity. The languages within a cluster are phonetically similar and geographically contiguous, and hence difficult to differentiate.
- Cluster 1: Assamese, Bengali, Odia
- Cluster 2: Gujarati, Hindi, Marathi, Rajasthani
- Cluster 3: Kannada, Malayalam, Tamil, Telugu
- Cluster 4: Bodo
- Cluster 5: Manipuri
Bodo and Manipuri are phonetically very distant from all of the other languages; thus, they form singleton clusters. We carried out separate experiments for the identification of the cluster-internal languages of clusters 1, 2, and 3, and the experimental results are presented in Table 10.
It can be clearly observed from Table 10 that both the CRNN framework and CRNN with Attention provide competitive results for every language cluster: they achieve accuracies of 0.98/0.974 for cluster 1, 0.999/0.999 for cluster 2, and 0.999/1 for cluster 3, respectively. The CNN framework also provides results comparable to the other two frameworks.
Tables 11, 12, and 13 present the confusion matrices for cluster 1, cluster 2, and cluster 3, respectively. From Table 11, we observed that Bengali gets confused with Assamese and Odia, which is quite expected since these languages are spoken in neighbouring states and share almost the same phonemes. For example, rice is pronounced bhata in Odia and bhat in Bengali; similarly, fish is machha in Odia and machh in Bengali. Both CRNN and CRNN with Attention perform well in discriminating between Bengali and Odia. It can be observed from Table 13 that CNN creates a lot of confusion when discriminating between these four languages, while both CRNN and CRNN with Attention prove to be better at discriminating among them. From the results in Tables 10, 11, 12, and 13, it is clear that CRNN (a bidirectional LSTM over a CNN) and CRNN with Attention are more effective for Indian LID, and they perform almost at par. Another important observation is that the languages in cluster 1 are harder to classify than those in the other two clusters.
4.6 Results on European languages
We evaluated our model in two environments: No Noise and White Noise. Our intuition is that in real-life scenarios, background noise such as chatter and other sounds is likely to be captured while predicting the language. For the White Noise evaluation setup, we mixed white noise into each test sample such that it has a solid audible presence while the identity of the language is retained.
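For reference, this noise condition can be reproduced by mixing Gaussian noise into each test waveform at a fixed signal-to-noise ratio; a minimal sketch follows, where the 10 dB target SNR is our assumption rather than the level used in the evaluation.

```python
# Sketch of white-noise mixing for the White Noise test condition; the SNR value is an assumption.
import numpy as np

def add_white_noise(sig, snr_db=10.0, seed=0):
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(sig))
    scale = np.sqrt(np.mean(sig ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return sig + scale * noise

# usage: noisy_waveform = add_white_noise(clean_waveform, snr_db=10.0)
```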
Table 14 compares the results of our models on the EU dataset with the SOTA models presented by Bartz et al. (2017). The model proposed by Bartz et al. (2017) consists of a CRNN and uses Google's Inception-v3 framework (Szegedy et al. 2016). The feature extractor performs convolutional operations on the input image through multiple stages, resulting in a feature map with a height of 1. The feature map is partitioned horizontally along the x-axis, and each partition is employed as a temporal unit for the subsequent bidirectional LSTM network. The network employs a total of five convolutional layers, each succeeded by the ReLU activation function, batch normalisation, and $2\times 2$ max pooling with a stride of 2. The convolutional layers are characterised by their respective kernel sizes and numbers of filters, as follows: ($7\times 7$, 16), ($5\times 5$, 32), ($3\times 3$, 64), ($3\times 3$, 128), and ($3\times 3$, 256). The bidirectional LSTM comprises a pair of individual LSTMs, each with 256 output units. The concatenation of the two outputs is transformed into a 512-dimensional vector, which is then input to a fully connected layer with either four or six output units, functioning as the classifier. They experimented in four different environments: No Noise, White Noise, Cracking Noise, and Background Noise. All our evaluation results are rounded to 3 digits after the decimal point.
The CNN model failed to achieve competitive results; it provided an accuracy of 0.948/0.871 under the No Noise/White Noise conditions. The CRNN framework provides an accuracy of 0.967/0.912 in the No Noise/White Noise scenario, outperforming the SOTA results of Bartz et al. (2017). Use of Attention improves over the Inception-v3 CRNN in the No Noise scenario; however, it does not perform well on White Noise.
4.7 Ablation studies
4.7.1 Convolution kernel size
To study the effect of kernel sizes in the convolution layers, we swept the kernel size over 3, 7, 17, 32, and 65. We found that performance decreases with larger kernel sizes, as shown in Table 15. Comparing accuracy up to the second decimal place, kernel size 3 performs better than the rest.
4.7.2 Automatic class weight vs. manual class weight
Balancing the data using class weights gives better accuracy for CRNN with Attention (98.7 per cent) and CRNN (98.7 per cent) compared to CNN (98.3 per cent), as shown in Table 5. We studied the efficacy of the frameworks by manually balancing the datasets using 100, 200, and 571 samples drawn randomly from the dataset; the results of these experiments are presented in Tables 16, 17, and 18, respectively.
The objective of this study was to observe the performance of the frameworks with increasing sample size. Since the Bodo language has the fewest samples (571) among all the languages in the dataset, we capped our experiments at 571 samples.
A comparison of the results in Tables 16, 17, and 18 reveals the following observations.
- All the models perform consistently better with more training data.
- CRNN and CRNN with Attention perform consistently better than CNN.
- CRNN is the least data-hungry of the three models and performs best in the lowest-data scenario.
Figure 3 graphically shows the performance improvement with increasing data samples. The confusion matrices for the three frameworks on the three datasets are presented in Tables A.1, A.2, A.3, B.1, B.2, B.3, C.1, C.2, and C.3 in the Appendix.
4.7.3 Additional performance and parameter size analysis of our frameworks
Table 19 demonstrates that both CRNN and CRNN with Attention perform better compared to the CNN-based framework. At the same time, CRNN itself produces better or equivalent performance compared to CRNN with an Attention-based mechanism. CRNN with Attention performs better only for cluster 1 of the Indian dataset; CRNN itself produces the best results in all other tasks, sometimes jointly with CRNN with Attention. This is despite the fact that the Attention-based framework has more parameters than the other models. The underlying intuition is that the Attention-based framework generally suffers from overfitting problems due to its additional parameter count. An Attention-based framework needs to learn how to assign importance to different parts of the input sequence, which may require a large number of training instances to produce a generalised performance. Thus, CRNN with Attention makes the experimental set-up time-consuming and resource-intensive, but still, it is not able to improve over CRNN.
5. Conclusion and future work
In this work, we proposed a LID method using CRNN that works on MFCC features of speech signals. Our framework efficiently identifies the language in both close-language and noisy scenarios. We carried out extensive experiments, and our framework produced SOTA results. Through our experiments, we have also shown our framework's robustness to noise and its extensibility to new languages. The model exhibits an overall best accuracy of 98.7 per cent, which improves over the traditional use of CNN (98.3 per cent). CRNN with Attention performs almost at par with CRNN; however, the Attention mechanism, which incurs additional computational overhead, does not improve over CRNN in most cases.
In the future, we would like to extend our work by increasing the number of language classes with speech specimens recorded in different environments. We would also like to evaluate the usefulness of the proposed framework on shorter speech samples, from which we can deduce the minimum duration required to classify the languages with high accuracy. We would also like to test our method on language dialect identification.
Acknowledgements
This research was supported by the TPU Research Cloud (TRC) program, a Google Research initiative, and funded by Rashtriya Uchchatar Shiksha Abhiyan 2.0 (grant number R-11/828/19).
Competing interests
The author(s) declare none.
Appendix A. CNN framework
Appendix B. CRNN framework
Appendix C. CRNN with Attention framework