A review of blind source separation methods: two converging routes to ILRMA originating from ICA and NMF

Hiroshi Sawada; Nobutaka Ono; Hirokazu Kameoka; Daichi Kitamura; Hiroshi Saruwatari

doi:10.1017/ATSIP.2019.5

Published online by Cambridge University Press: 14 May 2019

Hiroshi Sawada*: Affiliation:
NTT Corporation, Tokyo, Japan
Nobutaka Ono: Affiliation:
Tokyo Metropolitan University, Hino, Japan
Hirokazu Kameoka: Affiliation:
NTT Corporation, Tokyo, Japan
Daichi Kitamura: Affiliation:
National Institute of Technology, Kagawa College, Takamatsu, Japan
Hiroshi Saruwatari: Affiliation:
The University of Tokyo, Tokyo, Japan
*: Corresponding author: Hiroshi Sawada Email: sawada.hiroshi@lab.ntt.co.jp

Abstract

This paper describes several important methods for the blind source separation of audio signals in an integrated manner. Two historically developed routes are featured. One started from independent component analysis and evolved to independent vector analysis (IVA) by extending the notion of independence from a scalar to a vector. In the other route, nonnegative matrix factorization (NMF) has been extended to multichannel NMF (MNMF). As a convergence point of these two routes, independent low-rank matrix analysis has been proposed, which integrates IVA and MNMF in a clever way. All the objective functions in these methods are efficiently optimized by majorization-minimization algorithms with appropriately designed auxiliary functions. Experimental results for a simple two-source two-microphone case are given to illustrate the characteristics of these five methods.

Keywords

Blind source separation (BSS); Time-frequency-channel tensor; Independent component analysis (ICA); Nonnegative matrix factorization (NMF); Majorization-minimization algorithm with auxiliary function

Information

Type: Overview Paper
Information: APSIPA Transactions on Signal and Information Processing, Volume 8, 2019, e12

DOI: https://doi.org/10.1017/ATSIP.2019.5
Creative Commons: This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike licence (http://creativecommons.org/licenses/by-nc-sa/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the same Creative Commons licence is included and the original work is properly cited. The written permission of Cambridge University Press must be obtained for commercial re-use.
Copyright: Copyright © The Authors, 2019

I. INTRODUCTION

The technique of blind source separation (BSS) has been studied for decades [Reference Jutten and Herault1–Reference Makino, Lee and Sawada5], and the research is still in progress. The term “blind” refers to the situation in which the source activities and the mixing system information are unknown. Even when the focus is restricted to audio signals, this technology is developed for many diverse purposes, such as (1) implementing the cocktail party effect as an artificial intelligence, (2) extracting the target speech in a noisy environment for better speech recognition results, and (3) separating each musical instrumental part of an orchestra performance for music analysis.

Various signal processing and machine learning methods have been proposed for BSS. They can be classified using two axes (Fig. 1). The horizontal axis relates to the number M of microphones used to observe sound mixtures. The most critical distinction is whether M=1 or M ≥ 2, i.e., a single-channel or multichannel case. In a multichannel case, the spatial information of a source signal (e.g., source position) can be utilized as an important cue for separation. The second critical distinction is whether the number M of microphones is greater than or equal to the number N of source signals. In determined (N=M) and overdetermined (N<M) cases, the separation can be achieved using linear filters. For underdetermined (N>M) cases, one popular approach is based on clustering, such as by the Gaussian mixture model (GMM), followed by time-frequency masking [Reference Jourjine, Rickard and Yilmaz6–Reference Ito, Araki and Nakatani12]. The vertical axis indicates whether training data are utilized or not. If so, the characteristics of speech and audio signals can be learned beforehand. The learned knowledge helps to optimize the separation system, especially for single-channel cases where no spatial cues can be utilized. Recently, many methods based on deep neural networks (DNNs) have been proposed [Reference Hershey, Chen, Le Roux and Watanabe13–Reference Leglaive, Girin and Horaud21].

Fig. 1. Various methods for blind audio source separation. Methods in blue are discussed in this paper in an integrated manner.

Among the various methods shown in Fig. 1, this paper discusses the methods in blue. The motivation for selecting these methods is twofold: (1) As shown in Fig. 2, two originally different methods, independent component analysis (ICA) [Reference Hyvärinen, Karhunen and Oja3,Reference Cichocki and Amari4,Reference Comon22–Reference Ono and Miyabe29] and nonnegative matrix factorization (NMF) [Reference Lee and Seung30–Reference Févotte and Idier36], have historically been extended to independent vector analysis (IVA) [Reference Hiroe37–Reference Ikeshita, Kawaguchi, Togami, Fujita and Nagamatsu46] and multichannel NMF [Reference Ozerov and Févotte47–Reference Kameoka, Sawada, Higuchi and Makino54], respectively, which have recently been unified as independent low-rank matrix analysis (ILRMA) [Reference Kameoka, Yoshioka, Hamamura, Le Roux and Kashino55–Reference Mogami60]. (2) The objective functions used in these methods can effectively be minimized by majorization-minimization algorithms with appropriately designed auxiliary functions [Reference Févotte and Idier36,Reference Lange, Hunter and Yang61–Reference Sun, Babu and Palomar68]. With regard to these two aspects, all the selected methods are related and worth explaining in a single review paper.

Fig. 2. Historical development of BSS methods.

Although the mixing situation is unknown in the BSS problem, the mixing model is described as follows. Let $s_1, \ldots, s_N$ be N original sources and $x_1, \ldots, x_M$ be M mixtures at microphones. Let $h_{mn}$ denote the transfer characteristic from source $s_n$ to mixture $x_m$. When $h_{mn}$ is described by a scalar, the problem is called instantaneous BSS and the mixtures are modeled as

(1)

$$x_m(t) = \sum_{n=1}^N h_{mn} s_n(t), \quad m=1, \ldots, M, $$

where t represents time. When $h_{mn}$ is described by an impulse response of L samples that represents the delay and reverberations in a real-room situation, the problem is called convolutive BSS and the mixtures are modeled as

(2)

$$x_m(t) = \sum_{n=1}^N \sum_{\tau=0}^{L-1} h_{mn}(\tau) s_n(t-\tau), \quad m=1,\ldots, M.$$

To cope with a real-room situation, we need to solve the convolutive BSS problem.
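As a minimal sketch of the two mixing models, the following NumPy code implements (1) and (2) directly. The function names, array shapes, and toy sizes (two sources, two microphones, L = 8 taps) are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def instantaneous_mix(H, s):
    """Eq. (1): H is (M, N) scalar gains, s is (N, T) sources -> (M, T) mixtures."""
    return H @ s

def convolutive_mix(h, s):
    """Eq. (2): h is (M, N, L) impulse responses, s is (N, T) sources -> (M, T) mixtures."""
    M, N, L = h.shape
    T = s.shape[1]
    x = np.zeros((M, T))
    for m in range(M):
        for n in range(N):
            # Full convolution truncated to T samples: sum_tau h[tau] * s[t - tau],
            # with s[t] treated as zero for t < 0.
            x[m] += np.convolve(h[m, n], s[n])[:T]
    return x

rng = np.random.default_rng(0)
s = rng.standard_normal((2, 1000))   # N = 2 toy sources
h = rng.standard_normal((2, 2, 8))   # M = 2 microphones, L = 8 taps (hypothetical)

x_conv = convolutive_mix(h, s)
# With L = 1, the convolutive model reduces to the instantaneous one.
x_inst = convolutive_mix(h[:, :, :1], s)
```

The last line keeps only the first tap of each impulse response, which is the special case where (2) collapses to (1).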

Although time-domain approaches [Reference Amari, Douglas, Cichocki and Yang69–Reference Koldovsky and Tichavsky75] to the convolutive BSS problem have been proposed, a more suitable approach for combining ICA and NMF is a frequency-domain approach [Reference Smaragdis76–Reference Duong, Vincent and Gribonval85], where we apply a short-time Fourier transform (STFT) to the time-domain mixtures (2). Using a sufficiently long STFT window to cover the main part of the impulse responses, the convolutive mixing model (2) can be approximated with the instantaneous mixing model

(3)

$$x_{ij,m} = \sum_{n=1}^N h_{i,mn} s_{ij,n}, \quad m=1,\ldots, M$$

in each frequency bin i, with time frame j representing the position index of each STFT window. Table 1 summarizes the notations used in this paper.

Table 1. Notations.

The data structure that we deal with is a complex-valued tensor with three axes, frequency i, time j, and channel (mixture m or source n), as shown on the left-hand side of Fig. 3. Until IVA was invented in 2006, there had been no clear way to handle the tensor in a unified manner. A practical way was to slice the tensor into frequency-dependent matrices with time and channel axes, and apply ICA to the matrices. Another historical path is from NMF, applied to a matrix with time and frequency axes, to multichannel NMF. These two historical paths merged with the invention of ILRMA, as shown in Figs 2 and 3.

Fig. 3. Tensor and sliced matrices.

The rest of the paper is organized as follows. In Section II, we introduce probabilistic models for all the above methods and define corresponding objective functions. In Section III, we explain how to optimize the objective functions based on majorization-minimization by designing auxiliary functions. Section IV shows illustrative experimental results to provide an intuitive understanding of the characteristics of all these methods. Section V concludes the paper.

II. MODELS

A) ICA and IVA

In this subsection, we assume determined (N=M) cases for the application of ICA and IVA. For overdetermined (N<M) cases, we typically apply a dimension reduction method such as principal component analysis to the microphone observations as a preprocessing step [Reference Winter, Sawada and Makino86,Reference Osterwise and Grant87].

1) ICA

Let the sliced matrix depicted in the upper right of Fig. 3 be ${\bf X}_i = \{{\bf x}_{ij}\}_{j=1}^J$ with ${\bf x}_{ij} = [x_{ij,1},\ldots ,x_{ij,M}]^{\ssf T}$. ICA calculates an M-dimensional square separation matrix ${\bf W}_i$ that linearly transforms the mixtures ${\bf x}_{ij}$ to source estimates ${\bf y}_{ij} = [y_{ij,1},\ldots ,y_{ij,N}]^{\ssf T}$ by

(4)

$${\bf y}_{ij} = {\bf W}_i\,{\bf x}_{ij}.$$

The separation matrix W_i can be optimized in a maximum likelihood sense [Reference Cardoso26]. We assume that the likelihood of W_i is decomposed into time samples

(5)

$$p({\bf X}_i \vert {\bf W}_i) = \prod_{j=1}^J p({\bf x}_{ij}\vert {\bf W}_i).$$

The complex-valued linear operation (4) transforms the density as

(6)

$$p({\bf x}_{ij}\vert {\bf W}_i) = \vert {\rm det}\,{\bf W}_i\vert^2\, p({\bf y}_{ij}).$$

We assume that the source estimates are independent of each other,

(7)

$$p({\bf y}_{ij}) = \prod_{n=1}^N p(y_{ij,n}).$$

Putting (5)–(7) together, the negative log-likelihood ${\cal C} ({\bf W}_i) = -\log p({\bf X}_i\vert {\bf W}_i)$, as the objective function to be minimized, is given by

(8)

$${\cal C}({\bf W}_i) = \sum_{j=1}^J\sum_{n=1}^N G(y_{ij,n}) -2J\log \vert {\rm det}\,{\bf W}_i\vert,$$

where $G(y_{ij,n}) = -\log p(y_{ij,n})$ is called a contrast function. In speech/audio applications, a typical choice for the density function is the super-Gaussian distribution

(9)

$$p(y_{ij,n}) \propto \exp \left( -\displaystyle{\sqrt{\vert y_{ij,n}\vert ^2+\alpha} \over \beta}\right),$$

with nonnegative parameters α and β. How to minimize the objective function (8) will be explained in Section III.

By applying ICA to every sliced matrix, we have N source estimates for every frequency bin. However, the order of the N source estimates in each frequency bin is arbitrary, and therefore we have the so-called permutation problem. One approach to this problem is to align the permutations in a post-processing step [Reference Sawada, Araki and Makino11,Reference Sawada, Mukai, Araki and Makino88]. This paper focuses on tensor methods (IVA and ILRMA) as another approach that automatically solves the permutation problem.
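The arbitrariness of the output order can be checked numerically. The sketch below evaluates the objective (8) with the contrast from (9) for one frequency bin and compares a separation matrix with a row-permuted copy. The function name, the default values of α and β, and the random test data are assumptions made for illustration.

```python
import numpy as np

def ica_objective(W, X, alpha=1e-6, beta=1.0):
    """Objective (8) for one frequency bin, with the contrast taken from (9).

    W: (N, M) complex separation matrix, X: (M, J) complex mixtures.
    The normalization constant of (9) is dropped (it does not depend on W).
    """
    J = X.shape[1]
    Y = W @ X                                          # Eq. (4) for all frames j
    contrast = np.sqrt(np.abs(Y) ** 2 + alpha) / beta  # G(y) up to a constant
    _, logabsdet = np.linalg.slogdet(W)                # log |det W|
    return contrast.sum() - 2 * J * logabsdet

rng = np.random.default_rng(0)
M, J = 2, 200
X = rng.standard_normal((M, J)) + 1j * rng.standard_normal((M, J))
W = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
P = np.array([[0.0, 1.0], [1.0, 0.0]])                 # swaps the two outputs

c_original = ica_objective(W, X)
c_permuted = ica_objective(P @ W, X)
```

Both values agree because the sum of scalar contrasts and |det W_i| are unchanged by reordering the rows of W_i, so the objective alone cannot fix the order in each bin.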

2) IVA

Figure 4 shows the difference between ICA and IVA. In ICA, we assume the independence of scalar variables, e.g., y _ij,1 and y _ij,2. In IVA, the notion of independence is extended to vector variables. Let us define a vector of source estimates spanning all frequency bins as ${\bf y}_{j,n} = [y_{1j,n},\ldots ,y_{Ij,n}]^{\ssf T}$. The independence among source estimate vectors is expressed as

(10)

$$p(\{{\bf y}_{j,n}\}_{n=1}^N) = \prod_{n=1}^N p({\bf y}_{j,n}) .$$

We now focus on the left-hand side of Fig. 3. The mixture is denoted by two types of vectors. The first one is channel-wise ${\bf x}_{ij} = [x_{ij,1},\ldots , x_{ij,M}]^{\ssf T}$. The second one is frequency-wise ${\bf x}_{j,m} = [x_{1j,m},\ldots ,x_{Ij,m}]^{\ssf T}$. The source estimates are calculated by (4) using the first type for all frequency bins i = 1, …, I. A density transformation similar to (6) is expressed using the second type as follows:

(11)

$$p(\{{\bf x}_{j,m}\}_{m=1}^M\vert {\cal W}) = p(\{{\bf y}_{j,n}\}_{n=1}^N) \prod_{i=1}^I \vert {\rm det}\,{\bf W}_i\vert ^2,$$

with ${\cal W} = \{{\bf W}_i\}_{i=1}^I$ being the set of separation matrices of all frequency bins. Similarly to (5), the likelihood of ${\cal W}$ is decomposed into time samples as

(12)

$$p({\cal X}\vert {\cal W}) = \prod_{j=1}^J p(\{{\bf x}_{j,m}\}_{m=1}^M\vert {\cal W}),$$

where ${\cal X} = \{\{{\bf x}_{j,m}\}_{m=1}^M\}_{j=1}^J$. Putting (10)–(12) together, the objective function, i.e., the negative log-likelihood ${\cal C} ({\cal W}) = -\log p({\cal X}\vert {\cal W})$, is given as

(13)

$${\cal C} ({\cal W}) = \sum_{j=1}^J\sum_{n=1}^N G({\bf y}_{j,n}) -2J\sum_{i=1}^I\log \vert {\rm det}\,{\bf W}_i\vert,$$

where G(y_j,n) = −log p(y_j,n) is again a contrast function. A typical choice for the density function is the spherical super-Gaussian distribution

(14)

$$p({\bf y}_{j,n}) \propto \exp\left(- \displaystyle{\sqrt{\sum_{i=1}^I \vert y_{ij,n}\vert ^2+\alpha}\over \beta} \right),$$

with nonnegative parameters α and β. How to minimize the objective function (13) will be explained in Section III.

Fig. 4. Independence in ICA and IVA.

Comparing (9) and (14), we see that there are frequency dependences in the IVA cases. These dependences contribute to solving the permutation problem.
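To illustrate this frequency dependence, the following sketch compares the summed scalar contrasts of (9) with the summed vector contrasts of (14) when the outputs of a single frequency bin are swapped. The toy activity pattern (one source active in the first half of the frames, the other in the second half) and all names are hypothetical. The log-determinant terms of (8) and (13) are omitted because a permutation does not change them.

```python
import numpy as np

def ica_contrast_sum(Y, alpha=1e-8, beta=1.0):
    """Sum of scalar contrasts from (9) over all i, j, n. Y: (I, J, N)."""
    return (np.sqrt(np.abs(Y) ** 2 + alpha) / beta).sum()

def iva_contrast_sum(Y, alpha=1e-8, beta=1.0):
    """Sum of vector contrasts from (14): the norm is taken over frequency i."""
    power = (np.abs(Y) ** 2).sum(axis=0)               # shape (J, N)
    return (np.sqrt(power + alpha) / beta).sum()

rng = np.random.default_rng(0)
I, J, N = 4, 100, 2
# Toy source estimates: source 0 is active in the first half of the frames,
# source 1 in the second half (hypothetical activity pattern).
envelope = np.full((J, N), 0.01)
envelope[: J // 2, 0] = 1.0
envelope[J // 2 :, 1] = 1.0
noise = rng.standard_normal((I, J, N)) + 1j * rng.standard_normal((I, J, N))
Y = envelope[None] * noise

# Swap the two outputs in frequency bin 0 only (a permutation error).
Y_perm = Y.copy()
Y_perm[0] = Y[0][:, ::-1]

ica_aligned, ica_swapped = ica_contrast_sum(Y), ica_contrast_sum(Y_perm)
iva_aligned, iva_swapped = iva_contrast_sum(Y), iva_contrast_sum(Y_perm)
```

In this toy setting the scalar contrast sum is identical for both orderings, whereas the vector contrast sum is larger for the misaligned bin, so minimizing (13) favors consistent ordering across frequencies.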

B) NMF and MNMF

Generally, NMF objective functions are defined as the distances or divergences between an observed matrix and a low-rank matrix. Popular distance/divergence measures are the Euclidean distance [Reference Lee and Seung31], the generalized Kullback–Leibler (KL) divergence [Reference Lee and Seung31], and the Itakura–Saito (IS) divergence [Reference Févotte, Bertin and Durrieu33]. In this paper, aiming to clarify the connection of NMF to IVA and ILRMA, we discuss NMF with the IS divergence (IS-NMF).

1) NMF

Let the sliced matrix depicted in the lower right of Fig. 3 be ${\bf X}$, $[{\bf X}]_{ij} = x_{ij}$. Microphone index m is omitted here for simplicity. The nonnegative values considered in IS-NMF are the power spectrograms $\vert x_{ij}\vert ^2$, and they are approximated with the rank-K structure

(15)

$$ \vert x_{ij}\vert ^2 \approx \sum_{k=1}^K t_{ik}v_{kj} = \hat{x}_{ij},$$

with nonnegative matrices ${\bf T}$, $[{\bf T}]_{ik} = t_{ik}$, and ${\bf V}$, $[{\bf V}]_{kj} = v_{kj}$, for i = 1, …, I and j = 1, …, J. In matrix notation, we have

(16)

$$\hat{\bf X} = {\bf TV}, \quad [\hat{\bf X}]_{ij} = \hat{x}_{ij},$$

as a matrix factorization form. Figure 5 shows that a spectrogram can be modeled with this NMF model.

Fig. 5. NMF as spectrogram model fitting.

The objective function of IS-NMF can be derived in a maximum-likelihood sense. We assume that the likelihood of T and V for X is decomposed into matrix elements

(17)

$$ p({\bf X}\vert {\bf T},{\bf V}) = \prod_{i=1}^I\prod_{j=1}^J p(x_{ij}\vert \hat{x}_{ij}),$$

and each element x _ij follows a zero-mean complex Gaussian distribution with variance $\hat{x}_{ij}$ defined in (15),

(18)

$$ p(x_{ij}\vert \hat{x}_{ij}) \propto \displaystyle{1 \over \hat{x}_{ij}}\exp \left(-\displaystyle{\vert x_{ij}\vert ^2 \over \hat{x}_{ij}}\right).$$

Then, the objective function ${\cal C} ({\bf T},{\bf V}) = -\log p({\bf X}\vert {\bf T},{\bf V})$ is simply given as

(19)

$$ {\cal C}({\bf T},{\bf V}) = \sum_{i=1}^I\sum_{j=1}^J \left[ \displaystyle{\vert x_{ij}\vert ^2 \over \hat{x}_{ij}} + \log\hat{x}_{ij} \right].$$

The IS divergence between |x _ij|² and $\hat{x}_{ij}$ is defined as [Reference Févotte, Bertin and Durrieu33]

(20)

$$d_{IS}(\vert x_{ij}\vert ^2, \hat{x}_{ij}) = \displaystyle{\vert x_{ij}\vert ^2 \over \hat{x}_{ij}} - \log \displaystyle{\vert x_{ij}\vert ^2 \over \hat{x}_{ij}} - 1 ,$$

and is equivalent to the ij-element of the objective function (19) up to a constant term. How to minimize the objective function (19) will be explained in Section III.
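The "up to a constant" relation between (19) and (20) can be verified numerically. In the sketch below, which uses assumed function names and random toy data, the gap between the objective (19) and the total IS divergence equals the sum of log|x_ij|² + 1, a quantity that depends only on the observation and not on T or V.

```python
import numpy as np

def is_divergence(p, x_hat):
    """Eq. (20): Itakura-Saito divergence between power p = |x|^2 and model x_hat."""
    ratio = p / x_hat
    return ratio - np.log(ratio) - 1.0

def isnmf_objective(X, T, V):
    """Eq. (19): negative log-likelihood (up to constants) of the model (15)."""
    X_hat = T @ V
    return (np.abs(X) ** 2 / X_hat + np.log(X_hat)).sum()

rng = np.random.default_rng(0)
I, J, K = 6, 8, 2
X = rng.standard_normal((I, J)) + 1j * rng.standard_normal((I, J))
T = rng.uniform(0.1, 1.0, (I, K))
V = rng.uniform(0.1, 1.0, (K, J))

P = np.abs(X) ** 2                       # power spectrogram
objective = isnmf_objective(X, T, V)
total_is = is_divergence(P, T @ V).sum()
# The gap depends only on the data P, not on T or V.
gap = objective - total_is
data_only_constant = (np.log(P) + 1.0).sum()
```

Because the gap is constant in T and V, minimizing (19) and minimizing the total IS divergence lead to the same factors.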

2) MNMF

We now return to the left-hand side of Fig. 3 from the lower-right corner, and the scalar x _ij,m is extended to the channel-wise vector ${\bf x}_{ij} = [x_{ij,1},\ldots ,x_{ij,M}]^{\ssf T}$. The power spectrograms |x _ij|² considered in NMF are now extended to the outer product of the channel vector

(21)

$$ {\ssf X}_{ij} = {\bf x}_{ij}{\bf x}^{\ssf H}_{ij} = \left[\matrix{\vert x_{ij,1}\vert ^2 & \ldots & \!\! x_{ij,1}x_{ij,M}^{\ast} \cr \vdots & \ddots & \vdots \cr x_{ij,M}x_{ij,1}^{\ast} \!\! & \ldots & \vert x_{ij,M}\vert ^2}\right].$$

To build a multichannel NMF model, let us introduce a Hermitian positive semidefinite matrix ${\ssf H}_{ik}$ that is the same size as ${\ssf X}_{ij}$ and models the spatial property [Reference Arberet48,Reference Sawada, Kameoka, Araki and Ueda49,Reference Vincent, Jafari, Abdallah, Plumbley, Davies and Wang84,Reference Duong, Vincent and Gribonval85] of the kth NMF basis in the ith frequency bin. Then, the outer products are approximated with a rank-K structure similar to (15),

(22)

$$ {\ssf X}_{ij} \approx \sum_{k=1}^K {\ssf H}_{ik}t_{ik}v_{kj} = \hat{\ssf X}_{ij} .$$

The objective function of MNMF can basically be defined as the total sum $\sum _{i=1}^I \sum _{j=1}^J d_{IS}({\ssf X}_{ij}, \hat{\ssf X}_{ij})$ of the multichannel IS divergence (see [Reference Sawada, Kameoka, Araki and Ueda49] for the definition) between ${\ssf X}_{ij}$ and $\hat{\ssf X}_{ij}$, and can also be derived in a maximum-likelihood sense. Let $\underline{\bf H}$ be an I × K hierarchical matrix such that $[\underline{\bf H}]_{ik} = {\ssf H}_{ik}$. We assume that the likelihood of T, V, and $\underline{\bf H}$ for ${\cal X} = \{\{{\bf x}_{ij}\}_{i=1}^I\}_{j=1}^J$ is decomposed as

(23)

$$ p({\cal X}\vert {\bf T},{\bf V},\underline{\bf H}) = \prod_{i=1}^I\prod_{j=1}^J p({\bf x}_{ij}\vert \hat{\ssf X}_{ij}),$$

and that each vector x_ij follows a zero-mean multivariate complex Gaussian distribution with the covariance matrix $\hat{\ssf X}_{ij}$ defined in (22),

(24)

$$p({\bf x}_{ij}\vert \hat{\ssf X}_{ij}) \propto \displaystyle{1 \over {\rm det}\hat{\ssf X}_{ij}}\exp \left(- {\bf x}_{ij}^{\ssf H}\hat{\ssf X}_{ij}^{-1}{\bf x}_{ij}\right).$$

Then, similar to (19), the objective function ${\cal C} ({\bf T},{\bf V},\underline{\bf H}) = -\log p({\cal X}\vert {\bf T},{\bf V},\underline{\bf H})$ is given as

(25)

$${\cal C}({\bf T},{\bf V},\underline{\bf H}) = \sum_{i=1}^I\sum_{j=1}^J \left[ {\bf x}_{ij}^{\ssf H}\hat{\ssf X}_{ij}^{-1}{\bf x}_{ij} + \log {\rm det}\hat{\ssf X}_{ij} \right].$$

How to minimize the objective function (25) will be explained in Section III.
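A minimal NumPy sketch of the model (22) and the objective (25) is given below. The array layouts, function names, and random toy data are assumptions chosen for illustration. As a sanity check, the single-channel case M = 1 with H_ik = 1 is compared against the IS-NMF objective (19).

```python
import numpy as np

def mnmf_model(H, T, V):
    """Eq. (22): X_hat[i, j] = sum_k H[i, k] * t[i, k] * v[k, j].

    H: (I, K, M, M) Hermitian PSD matrices, T: (I, K), V: (K, J).
    Returns X_hat with shape (I, J, M, M).
    """
    return np.einsum("ikab,ik,kj->ijab", H, T, V)

def mnmf_objective(X, X_hat):
    """Eq. (25). X: (I, J, M) complex mixtures, X_hat: (I, J, M, M)."""
    X_hat_inv = np.linalg.inv(X_hat)
    quad = np.einsum("ija,ijab,ijb->ij", X.conj(), X_hat_inv, X).real
    _, logdet = np.linalg.slogdet(X_hat)
    return (quad + logdet).sum()

rng = np.random.default_rng(0)
I, J, K = 4, 6, 3
T = rng.uniform(0.1, 1.0, (I, K))
V = rng.uniform(0.1, 1.0, (K, J))

# Single-channel sanity check: with M = 1 and H_ik = 1, (25) reduces to (19).
X1 = rng.standard_normal((I, J, 1)) + 1j * rng.standard_normal((I, J, 1))
H1 = np.ones((I, K, 1, 1))
c_mnmf = mnmf_objective(X1, mnmf_model(H1, T, V))
X_hat = T @ V
c_nmf = (np.abs(X1[..., 0]) ** 2 / X_hat + np.log(X_hat)).sum()

# Two-channel toy example with random Hermitian PSD spatial properties.
M = 2
A = rng.standard_normal((I, K, M, M)) + 1j * rng.standard_normal((I, K, M, M))
H2 = A @ A.conj().swapaxes(-1, -2)
X2 = rng.standard_normal((I, J, M)) + 1j * rng.standard_normal((I, J, M))
c_two_channel = mnmf_objective(X2, mnmf_model(H2, T, V))
```

The explicit matrix inverse is used here only for readability; it is not meant to suggest how an efficient implementation would evaluate (25).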

The spatial properties ${\ssf H}_{ik}$ learned by the model (22) can be used as spatial cues for clustering NMF bases. In particular, the argument $\arg ([{\ssf H}_{ik}]_{mm'})$ of an off-diagonal element m ≠ m′ represents the phase difference between the two microphones m and m′. The left plot of Fig. 6 follows model (22) with k = 1, …, 10. The 10 bases can be clustered into two sources based on their arguments in a post-processing step. However, a more elegant way is to introduce the cluster-assignment variable [Reference Ozerov, Févotte, Blouet and Durrieu89] $z_{kn} \geq 0$, $\sum _{n=1}^N z_{kn}=1$, k = 1, …, K, n = 1, …, N, and the source-wise spatial property ${\ssf H}_{in}$, and express the basis-wise property as ${\ssf H}_{ik} = \sum _{n=1}^N z_{kn}{\ssf H}_{in}$. As a result, the model (22) and the objective function (25) respectively become

(26)

$$\hat{\ssf X}_{ij} = \sum_{k=1}^K \sum_{n=1}^N z_{kn}{\ssf H}_{in}t_{ik}v_{kj},$$

(27)

$${\cal C}({\bf T},{\bf V},\underline{\bf H},{\bf Z}) = \sum_{i=1}^I\sum_{j=1}^J \left[ {\bf x}_{ij}^{\ssf H}\hat{\ssf X}_{ij}^{-1}{\bf x}_{ij} + \log{\rm det}\hat{\ssf X}_{ij} \right],$$

with [Z]_kn = z _kn and the size of $\underline{\bf H}$ being I × N. The middle plot of Fig. 6 shows the result following the model (26). We see that source-wise spatial properties are successfully learned. The objective function (27) can be minimized in a similar manner to (25).
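The relation between the basis-wise model (22) and the source-wise model (26) can be sketched as follows. The helper `random_psd`, the array layouts, and the toy sizes are hypothetical. The code builds the model (26) directly and, alternatively, by first forming H_ik as the z_kn-weighted sum of H_in and inserting it into (22).

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, N, M = 3, 5, 4, 2, 2

def random_psd(*shape):
    """Random Hermitian positive definite M x M matrices (toy spatial properties)."""
    A = rng.standard_normal(shape + (M, M)) + 1j * rng.standard_normal(shape + (M, M))
    return A @ A.conj().swapaxes(-1, -2) + 1e-3 * np.eye(M)

H_src = random_psd(I, N)                   # source-wise H_in, shape (I, N, M, M)
T = rng.uniform(0.1, 1.0, (I, K))
V = rng.uniform(0.1, 1.0, (K, J))
Z = rng.uniform(size=(K, N))
Z /= Z.sum(axis=1, keepdims=True)          # z_kn >= 0 and sum_n z_kn = 1

# Model (26): sum over bases k and sources n.
X_hat_26 = np.einsum("kn,inab,ik,kj->ijab", Z, H_src, T, V)

# Same model via basis-wise properties H_ik = sum_n z_kn H_in inserted into (22).
H_basis = np.einsum("kn,inab->ikab", Z, H_src)
X_hat_22 = np.einsum("ikab,ik,kj->ijab", H_basis, T, V)
```

Only N spatial matrices per frequency bin are stored in this parameterization, instead of K, and the assignment of bases to sources is carried by Z.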

Fig. 6. Example of MNMF-learned spatial property. The left and middle plots show the learned complex arguments ${\rm arg}([{\ssf H}_{ik}]_{12}), k=1,\ldots,10$, and ${\rm arg}([{\ssf H}_{in}]_{12}), n=1,2$, respectively. The right figure illustrates the corresponding two-source two-microphone situation.

C) ILRMA

ILRMA can be explained in two ways, as there are two paths in Fig. 2.

1) Extending IVA with NMF

The first way is to extend IVA by introducing NMF for source estimates, as illustrated in Fig. 7, with the aim of developing more precise spectral models. Let the objective function (13) of IVA be rewritten as

(28)

$${\cal C}({\cal W}) = \sum_{n=1}^N G({\bf Y}_{n}) -2J\sum_{i=1}^I\log \vert {\rm det}\,{\bf W}_i\vert$$

with Y_n being an I × J matrix, [Y_n]_ij = y _ij,n. Then, let us introduce the NMF model for Y_n as

(29)

$$p({\bf Y}_{n}\vert {\bf T}_n,{\bf V}_n) = \prod_{i=1}^I \prod_{j=1}^J p(y_{ij,n}\vert \hat{y}_{ij,n})$$

(30)

$$p(y_{ij,n}\vert \hat{y}_{ij,n}) \propto \displaystyle{1 \over \hat{y}_{ij,n}}\exp\left( -\displaystyle{\vert y_{ij,n}\vert ^2 \over \hat{y}_{ij,n}} \right)$$

(31)

$$\hat{y}_{ij,n} =\sum_{k=1}^K t_{ik,n}v_{kj,n}$$

with [T_n]_ik = t _ik,n and [V_n]_kj = v _kj,n. The objective function is then

(32)

$$\eqalign{{\cal C}({\cal W},\{{\bf T}_n\}_{n=1}^N, \{{\bf V}_n\}_{n=1}^N) &=\sum_{n=1}^N\sum_{i=1}^I\sum_{j=1}^J \left[ \displaystyle{\vert y_{ij,n}\vert ^2 \over \hat{y}_{ij,n}} + \log\hat{y}_{ij,n} \right]\cr &\quad -2J\sum_{i=1}^I\log \vert {\rm det}\,{\bf W}_i\vert.}$$

Fig. 7. ILRMA: unified method of IVA and NMF.
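As a rough sketch, the objective (32) can be evaluated as below. The function name, tensor layouts, and toy data are assumptions. The check at the end uses the degenerate case N = M = 1 with W_i = 1, in which the log-determinant term vanishes and (32) should coincide with the IS-NMF objective (19).

```python
import numpy as np

def ilrma_objective(W, X, T, V):
    """Objective (32).

    W: (I, N, M) separation matrices, X: (I, J, M) mixtures,
    T: (N, I, K) and V: (N, K, J) nonnegative per-source NMF factors.
    """
    J = X.shape[1]
    Y = np.einsum("inm,ijm->ijn", W, X)          # Eq. (4) in every bin i
    Y_hat = np.einsum("nik,nkj->ijn", T, V)      # Eq. (31)
    fit = (np.abs(Y) ** 2 / Y_hat + np.log(Y_hat)).sum()
    _, logabsdet = np.linalg.slogdet(W)          # one value per bin i
    return fit - 2 * J * logabsdet.sum()

rng = np.random.default_rng(0)
I, J, K = 4, 6, 3
X1 = rng.standard_normal((I, J, 1)) + 1j * rng.standard_normal((I, J, 1))
T1 = rng.uniform(0.1, 1.0, (1, I, K))
V1 = rng.uniform(0.1, 1.0, (1, K, J))
W1 = np.ones((I, 1, 1), dtype=complex)

# With N = M = 1 and W_i = 1, (32) collapses to the IS-NMF objective (19).
c_ilrma = ilrma_objective(W1, X1, T1, V1)
X_hat = T1[0] @ V1[0]
c_nmf = (np.abs(X1[..., 0]) ** 2 / X_hat + np.log(X_hat)).sum()
```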

2) Restricting MNMF

The second way is to restrict MNMF in the following manner for computational efficiency. Let the spatial property matrix ${\ssf H}_{in}$ be restricted to rank-1 ${\ssf H}_{in}={\bf h}_{in}{\bf h}_{in}^{\ssf H}$ with ${\bf h}_{in} = [h_{i1n},\ldots ,h_{iMn}]^{\ssf T}$. Then, the MNMF model (26) can be simplified as

(33)

$$\hat{\ssf X}_{ij} = {\bf H}_i{\bf D}_{ij}{\bf H}_i^{\ssf H}$$

with H_i = [h_i1, …, h_iN] and an N × N diagonal matrix D_ij whose nth diagonal element is

(34)

$$\hat{y}_{ij,n} = \sum_{k=1}^K z_{kn}t_{ik}v_{kj}.$$

We further restrict the mixing system to be determined, i.e., N=M, enabling us to convert the mixing matrix H_i to the separation matrix W_i by ${\bf H}_i = {\bf W}_i^{-1}$. Substituting (33) into (27), we have

(35)

$$\eqalign{{\cal C}({\cal W},{\bf T},{\bf V},{\bf Z}) &=\sum_{i=1}^I\sum_{j=1}^J\sum_{n=1}^N\left[ \displaystyle{\vert y_{ij,n}\vert ^2 \over \hat{y}_{ij,n}} + \log\hat{y}_{ij,n} \right]\cr &\quad -2J\sum_{i=1}^I\log\vert {\rm det}\,{\bf W}_i\vert.}$$
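The substitution rests on two identities that hold when ${\bf H}_i = {\bf W}_i^{-1}$: the quadratic form ${\bf x}_{ij}^{\ssf H}\hat{\ssf X}_{ij}^{-1}{\bf x}_{ij}$ equals $\sum_n \vert y_{ij,n}\vert^2/\hat{y}_{ij,n}$, and $\log {\rm det}\hat{\ssf X}_{ij}$ equals $\sum_n \log\hat{y}_{ij,n} - 2\log\vert {\rm det}\,{\bf W}_i\vert$. The following sketch checks one (i, j) term numerically, with random toy values as assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = M = 3
W = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))  # separation matrix W_i
x = rng.standard_normal(M) + 1j * rng.standard_normal(M)            # one mixture vector x_ij
y_hat = rng.uniform(0.5, 2.0, N)                                    # model variances from (34)

# Rank-1 MNMF model (33) with H_i = W_i^{-1}.
H = np.linalg.inv(W)
X_hat = H @ np.diag(y_hat) @ H.conj().T

# One (i, j) term of the MNMF objective (27).
mnmf_term = (x.conj() @ np.linalg.inv(X_hat) @ x).real + np.linalg.slogdet(X_hat)[1]

# The matching (i, j) term of the ILRMA objective (35).
y = W @ x
ilrma_term = (np.abs(y) ** 2 / y_hat + np.log(y_hat)).sum() - 2 * np.linalg.slogdet(W)[1]
```

Summing the log-determinant part over the J frames gives the factor 2J that appears in (35).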

3) Difference between two models

The two ILRMA objective functions (32) and (35) differ in the models (31) and (34) of the source estimates. In (31), each source estimate n has its own NMF bases, which are not shared among the source estimates throughout the optimization process. In (34), the NMF bases are shared at the beginning of the optimization in accordance with randomly generated cluster-assignment variables $0 \leq z_{kn} \leq 1$, and are assigned dynamically to the source estimates by optimizing the variables $z_{kn}$.
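The link between the two variance models can be made concrete with a small sketch. The function names and toy sizes are hypothetical. When the cluster-assignment variables are hard (each basis belongs to exactly one source), the shared-basis model (34) produces the same variances as the separate-basis model (31) built from the corresponding subsets of bases.

```python
import numpy as np

def variances_separate(T, V):
    """Model (31). T: (N, I, K), V: (N, K, J) -> y_hat: (N, I, J)."""
    return np.einsum("nik,nkj->nij", T, V)

def variances_shared(T, V, Z):
    """Model (34). T: (I, K), V: (K, J), Z: (K, N) -> y_hat: (N, I, J)."""
    return np.einsum("kn,ik,kj->nij", Z, T, V)

rng = np.random.default_rng(0)
I, J, K, N = 5, 7, 4, 2
T = rng.uniform(0.1, 1.0, (I, K))
V = rng.uniform(0.1, 1.0, (K, J))

# Hard assignment: bases 0, 1 belong to source 0 and bases 2, 3 to source 1.
Z_hard = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
shared = variances_shared(T, V, Z_hard)

# The same model written with separate per-source bases (two bases each).
T_sep = np.stack([T[:, :2], T[:, 2:]])      # (N, I, 2)
V_sep = np.stack([V[:2], V[2:]])            # (N, 2, J)
separate = variances_separate(T_sep, V_sep)
```

With soft values of z_kn, a basis contributes to several sources at once, which has no counterpart in (31).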

How to optimize the objective functions (32) and (35) will be explained in the next section.

III. OPTIMIZATION

The objective functions (8), (13), (19), (25), (27), (32), and (35) can be optimized in various ways. Regarding ICA (8), for instance, gradient descent [Reference Bell and Sejnowski23], natural gradient [Reference Amari, Cichocki, Yang, Touretzky, Mozer and Hasselmo24], FastICA [Reference Bingham and Hyvärinen27,Reference Hyvärinen90], and auxiliary function-based optimization (AuxICA) [Reference Ono and Miyabe29], to name a few, have been proposed as optimization methods. This paper focuses on an auxiliary function approach because all the above objective functions can efficiently be optimized by updates derived from this approach.

A) Auxiliary function approach

This subsection explains the general framework of the approach known as the majorization-minimization algorithm [Reference Lange, Hunter and Yang61–Reference Hunter and Lange63]. Let θ be a set of objective variables, e.g., θ = {T, V} in the case of NMF (19). For an objective function ${\cal C} (\theta )$, an auxiliary function ${\cal C} ^+(\theta, \tilde{\theta})$ with a set of auxiliary variables $\tilde{\theta}$ satisfies the following two conditions.

• The auxiliary function is greater than or equal to the objective function,
(36)$$ {\cal C}^+(\theta,\tilde{\theta}) \geq {\cal C}(\theta).$$
• When minimized with respect to the auxiliary variables, both functions become the same,
(37)$$ {\rm min}_{\tilde{\theta}}\,{\cal C}^+(\theta,\tilde{\theta}) = {\cal C}(\theta).$$

With these conditions, one can indirectly minimize the objective function ${\cal C} (\theta)$ by minimizing the auxiliary function ${\cal C} ^+(\theta ,\tilde{\theta})$ through the iteration of the following updates:

(i) the update of auxiliary variables
(38)$$ \tilde{\theta}^{(\ell)} \leftarrow {\rm argmin}_{\tilde{\theta}}\, {\cal C}^+(\theta^{(\ell-1)},\tilde{\theta}) ,$$
(ii) the update of objective variables
(39)$$ \theta^{(\ell)} \leftarrow {\rm argmin}_{\theta}\, {\cal C}^+(\theta,\tilde{\theta}^{(\ell)}),$$

as illustrated in Fig. 8. The superscript $\cdot ^{(\ell )}$ indicates that the update is in the ℓth iteration, starting from the initial sets θ⁽⁰⁾ and $\tilde{\theta}^{(0)}$ of variables (randomly initialized in most cases).

Fig. 8. Majorization-minimization: minimizing the auxiliary function indirectly minimizes the objective function.

This approach is typically taken when the objective function is complicated and not easy to minimize directly, but an auxiliary function can be defined that is easy to minimize.
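The updates (38) and (39) can be illustrated on a one-dimensional toy problem that is not taken from the paper: minimizing a sum of square-root terms of the same form as the contrast in (9). Each term $\sqrt{u}$ is bounded from above by $u/(2r) + r/2$ for any $r > 0$, with equality at $r = \sqrt{u}$, so the auxiliary function is quadratic in θ and has a closed-form minimizer. The data points, the value of α, and the initial value are arbitrary assumptions.

```python
import numpy as np

a = np.array([0.0, 1.0, 2.0, 10.0])    # hypothetical data points
alpha = 1e-3

def objective(theta):
    """C(theta) = sum_i sqrt((theta - a_i)^2 + alpha)."""
    return np.sqrt((theta - a) ** 2 + alpha).sum()

def auxiliary(theta, r):
    """C+(theta, r): quadratic upper bound with auxiliary variables r_i > 0."""
    return (((theta - a) ** 2 + alpha) / (2 * r) + r / 2).sum()

theta = 8.0                             # initial value theta^(0)
history = [objective(theta)]
for _ in range(50):
    # Update (38): auxiliary variables that make the bound tight at the current theta.
    r = np.sqrt((theta - a) ** 2 + alpha)
    # Update (39): closed-form minimizer of the quadratic auxiliary function.
    theta = (a / r).sum() / (1.0 / r).sum()
    history.append(objective(theta))
```

The recorded objective values never increase, which follows from conditions (36) and (37) alone and does not depend on the particular bound chosen here.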

In the next three subsections, we explain how to minimize the objective functions introduced in Section II. The order is NMF/MNMF, IVA/ICA, and ILRMA, which is different from that of Section II. The NMF/MNMF case comes first because its derivation is simpler than that of the IVA/ICA case and follows directly from the auxiliary function approach.