DAMPED CHIRP MIXTURE ESTIMATION VIA NONLINEAR BAYESIAN REGRESSION

Estimating mixtures of damped chirp sinusoids in noise is a problem that affects audio analysis, coding, and synthesis applications. Phase-based non-stationary parameter estimators assume that sinusoids can be resolved in the Fourier transform domain, whereas high-resolution methods estimate superimposed components with accuracy close to the theoretical limits, but only for sinusoids with constant frequencies. We present a new method for estimating the parameters of superimposed damped chirps that has an accuracy competitive with existing non-stationary estimators but also has a high-resolution like subspace techniques. After providing the analytical expression for a Gaussian-windowed damped chirp signal’s Fourier transform, we propose an efficient variational EM algorithm for nonlinear Bayesian regression that jointly estimates the amplitudes, phases, frequencies, chirp rates, and decay rates of multiple non-stationary components that may be obfuscated under the same local maximum in the frequency spectrum. Quantitative results show that the new method not only has an estimation accuracy that is close to the Cramér-Rao bound, but also a high resolution that outperforms the state-of-the-art.


INTRODUCTION
Sinusoidal modeling is a primary topic in audio signal processing with many applications for audio analysis, coding, transformation, and re-synthesis. Rooted in additive synthesis and Fourier theory, sinusoidal modeling assumes that a signal is composed of a sum of sinusoidal oscillations and noise. Stationary models assume that the sinusoids have constant frequencies and amplitudes within the finite temporal window of analysis. But musical audio typically exhibits time-varying features, for example from the vibrato of a singing voice or attack of a plucked string. Estimating the parameters of non-stationary sinusoids can improve the quality and efficiency of the signal representation, enabling precise transformation and re-synthesis even when there are modulations within the short temporal analysis window. Indeed, non-stationary analysis has received much attention over the last several decades, as summarized in Section 1.1.
However, estimating mixtures of non-stationary sinusoids is still a difficult problem, especially in polyphonic music where superimposed non-stationary sinusoids with strong modulations cross in the time-frequency plane. Typically, non-stationary estimators assume that sinusoids do not overlap in the frequency domain and detect them by picking peaks one-by-one. After detection, a few spectral values around the peak are used to estimate first the nonconstant parameters, and second the constant parameters. They do not usually consider the covariance between the sinusoids nor the contribution of negative frequencies to the Fourier transform that can bias their estimations. While subspace methods can detect overlapping components with high-resolution thanks to their measure of temporal covariance from an auto-correlation function, their performance is highly sensitive to the model order and they cannot estimate frequency modulations.
This paper addresses the gap between non-stationary and highresolution estimators with a new method that jointly resolves and estimates mixtures of damped chirp sinusoids in noise. We present a variational algorithm for nonlinear Bayesian regression that fits the entire frequency spectrum to a weighted sum of spectral basis functions: the weights encode the amplitudes and phases, and the basis functions encode the frequencies, chirp rates, and decay rates of the damped chirps. As opposed to existing methods, this offers a high resolution because it infers a distribution that encodes the spectral covariance between the chirps and a high accuracy because it integrates information from the entire spectrum. The new method can be used with any analysis window: the spectrum of a windowed damped chirp is computed analytically when using a Gaussian window or numerically when using a non-Gaussian window. Accuracy and resolution experiments show the new method's high quality compared to the state-of-the-art and theoretical bounds.

Overview of Previous Work
Parameter estimation of a noisy chirp was proposed by Djuric et. al. [1]. Zhou et. al. [2] generalized the estimation problem to polynomial phase signals, which includes the damped chirp. The quadratically interpolated fast Fourier transform (QIFFT) [3] assumes that a Gaussian analysis window is used and estimates nonstationary parameters after fitting a parabola to the log-magnitude spectrum and the unwrapped phase spectrum. The reassignment method was initially proposed by Kodera et. al. [4,5], generalized to time-frequency analysis by Auger and Flandrin [6], and to estimate a non-stationary sinusoid with the generalized reassignment method (GRM) by Röbel [7] and Hainsworth [8]. Marchand and Depalle [9] developed a generalized derivative method (GDM) that uses a signal's first and second derivatives to estimate a non-stationary sinusoid's parameters. Betser [10] proposed a distribution derivative method that estimates the parameters of a general polynomial signal model by solving a set of linear equations formed from the signal and its derivatives. A comparison of these non-stationary parameter estimation methods was presented in [11]. Estimation of signal parameters via rotational invariance techniques (ESPRIT) is a high-resolution method for damped, stationary frequency sinusoids proposed by Roy and Kailath [12] and DAFx.1 adapted to audio analysis by Badeau et. al. [13], that has an estimation accuracy close to the theoretical bound.

A DAMPED CHIRP AND ITS FOURIER TRANSFORM
A damped chirp is a sinusoidal oscillation with linear frequency modulation (FM) and exponential amplitude modulation (AM), Its parameters include the amplitude ρ, the phase θ in radians, the decay rate α in log-amplitude per second, the angular frequency ω0 in radians per second, and the chirp rate ψ in radians per second squared. The decay rate can be positive or negative to make an exponential decrease (damped) or increase (undamped), respectively. A Gaussian window with scale β > 0 is defined as The Fourier transform x(ω) ∈ C of a damped chirp signal has a closed-form solution when it is multiplied by the Gaussian window in equation (2): where a coefficient v ∈ C encodes the amplitude and phase, and a spectral basis function f ϕ (ω) ∈ C that has the parameter set ϕ = [α, ω0, ψ] is defined as hα,ω 0 (ω) = α + i(ω − ω0) .
Since the damped chirp in equation (1) is real, it consists of an equal contribution of positive and negative frequency components, f ϕ (ω)v andf ϕ (−ω)v, respectively. In theory, a Gaussian window and a damped chirp both have infinite time support and are non-causal. Practically, since we process causal digital signals that have finite temporal support, we assume the onset of the exponential in equation (1) occurs outside of the analysis window. This assumption is common to sinusoidal model estimators, which succumb to distortions in the frequency spectrum and possible estimation biases when the onset of a damped exponential occurs within the analysis window.
This model is temporally local in two ways. First, we use a window of finite duration T that localizes the estimation around its center. Scale β may be set such that w(±T /2) = ν, where 0 < ν ≪ 1 is an acceptably small threshold, using the equation Second, since real-world signals have evolving features, we are interested in using a local analysis window that slides through time, estimating parameters from a short snapshot of the signal. Perfect reconstruction is possible when the time between analysis windows, the hop length L, satisfies the following condition,

BAYESIAN REGRESSION MODEL
Now we turn to estimating the parameters of damped chirps given an audio signal, which we address as a nonlinear Bayesian regression problem. In nonlinear regression, the goal is to estimate the parameters of nonlinear functions, called basis functions, and regression coefficients (weights), such that the weighted sum of basis functions matches the data [14]. Data is assumed to be generated from a sum of M real-valued damped chirps plus zero-mean, normally-distributed noise. In the frequency domain, the noisy damped chirps model for complex Hermitian spectral data x(ω) ∈ C is where σ 2 x is the variance of the Hermitian noise η(ω) ∈ C, vm is the regression coefficient and ϕ (m) = [αm, ω0,m, ψm] is the parameter set of the mth damped chirp.
A real signal's negative and positive frequencies cause estimation bias and detection errors for existing estimators. Energy accumulation is prominent not only in lower frequencies where a component's bandwidth is likely greater than its center frequency, but also along the entire frequency range when a signal has significant chirp or decay rates. Previously, the Hilbert transform has been used to approximately remove the negative frequencies, and windows with low side-lobes were designed to reduce overlap. Instead, we enhance the estimator by explicitly modeling the contribution of both the positive and negative frequency components.

Discretization
In order to apply the concept of nonlinear Bayesian regression, we consider discretized data, basis functions, and regression coefficients, and express them in matrix form. Further, the parameters and coefficients are considered to be stochastic, random variables whose properties are described through their statistical distributions. In the following, the variables are normalized to make them independent of the sampling rate, e.g. ω is in radians per sample.
Data vector x ∈ C N ×1 contains N observations of a spectrum obtained by taking a zero-phased discrete Fourier transform (DFT) 1 of a truncated windowed signal with a duration of N samples, evaluated at frequencies ωn = 2π(n − 1)/N , . . .
Matrix − → F ∈ C N ×M contains values of the basis function f ϕ (m) (ωn) for each of the M damped chirps evaluated at the same N discrete frequencies of x, Likewise, Basis function parameters are concatenated in a 1×3M vector Vector To simplify estimation and notation, we introduce the real-valued vector of variables z ∈ R 2M ×1 that is formed from the vertical concatenation of v's real and imaginary parts, We define a M × 2M complex-valued matrix so v and z are related through a linear transformation, Lastly, we define the composite matrix Q ∈ C N ×2M as

Data likelihood
Following from equation (10), the likelihood of the data x given z and the entire parameter set ϕ is

Priors
The vector of regression coefficients z is given a zero-mean normal prior with diagonal 2M × 2M precision (inverse covariance) matrix Λ = Diag(λ) that is scaled by the data noise's level σ 2 x , Estimation of λ is key to relevance vector regression [15], a method of finding sparse solutions to Bayesian regression problems that represents the data with only a few relevant components. An irrelevant component is trimmed from the model as its coefficient magnitude |zj| is pushed towards the prior mean of zero. Relevant components correspond to the underlying damped chirps, while irrelevant components correspond to spurious peaks from the noise-power spectral density estimate, secondary lobes of the window, and distortion.
The mth component's frequency ω0,m > 0 has a normal prior with variance τω 0 = M −1 and mean uniformly spaced in the range (0, π/2), Assuming a normal prior distribution (even for positive-valued variables) simplifies estimation because it is a conjugate distribution to the normal likelihood. The prior variance encourages the mth estimate to be close to ω0,m. Further, our inclusion of both positive and negative components in the model enables accurate estimation around frequency zero. The mth component's chirp rate ψm ∈ R has a normal prior with mean ψm = 0 and variance τ ψm , This prior encourages the chirp rate to be within ± ω0,m/N because the instantaneous frequency, ω0,m ± 1 2 ψmn, should be greater than zero and less than the Nyquist frequency at n ± N/2. For example, if ω0,m is 0 or π, then τ ψm = 0, and the estimate of ψm will be ≈ 0. If ω0,m = π 2 , then the chirp rate is likely in the range − π 2N < ψm < π 2N . This parametrization is opposed to directly relating the chirp rate to the frequency through a factor, which would complicate the estimation of both variables.
Lastly, the mth component's decay rate αm ∈ R has a normal prior with mean αm = 0 and variance τα = N −1 ,

ESTIMATION
Applying Bayes' theorem gives the posterior distribution over the latent variables, ϕ and z, given the data, p(x) = p(x|z, ϕ)p(z, ϕ)dzdϕ .
As with most non-trivial models, it is not possible to solve the integral for the model evidence p(x) analytically. Since exact inference is intractable for this model, we develop a variational inference algorithm for approximate inference of the posterior over the regression coefficients and the parameters.
Variational inference circumvents the intractable integral involved in minimizing the Kullback-Leibler (KL) [16] divergence from q to p by instead maximizing the lower bound on model evidence [17,18], Approximate posterior q is factorized between the regression coefficients and basis function parameters, The optimal log distributions that maximize the lower bound are given by the calculus of variations, Each distribution is updated in turn to maximize L(q).

Regression coefficients (amplitudes and phases)
The optimal approximate posterior q ⋆ z (z) introduced in (34) is normal and has a closed-form solution, where the mean µz and covariance matrix Σz are derived using the properties [19] of marginal and conditional normal distributions, Indeed, Λ regularizes the ill-posed problem when Q H Q is illconditioned, which may happen when |αm| is very large. Amplitude and phase estimates are given by the absolute value and argument (angle) of Cµz, respectively, Maximizing the expected log-prior ⟨ln p(z)⟩q z w.r.t. λ gives

Algorithm 1 Damped Chirps Estimator
As an element of µz and its variance go towards zero, its corresponding precision will go to infinity. In turn, a large element in Λ = Diag( λ) causes the corresponding element in the estimate µz to go to zero. Therefore, this variational estimation of the regression coefficient and precision leads to sparse solutions, since irrelevant components (ones with small magnitudes and small variances) are driven to zero, while relevant components with significant magnitudes are not altered. This is referred to as a relevance vector machine [15], or more generally, automatic relevancy determination [20].
Estimation of ϕ is an unconstrained optimization problem [22] that can be addressed with gradient root finding. Gradient ascent performs well in this application because it is less sensitive to different initializations when compared to Newton's method but still converges quickly. The algorithm updates ϕ as follows, where ∇ ϕ denotes the gradient w.r.t. ϕ and γ is the learning rate. Section 8.1 provides closed-form equations for the gradient of the expected log-joint, which is decomposed as Pseudo-code for estimating damped chirps is presented in Algorithm 1. Its computational complexity is mainly from the 2M × 2M matrix inverse in equation (38) to update µz, and the inner product in equation (50), Section 8.1, to update the gradient.

Expected marginal likelihood
After inference, an interesting statistic that is directly available using the sum rule of probability is the expected likelihood of the data w.r.t. the inferred posterior over z, The mean is a smoothed (de-noised) representation of the data, and the covariance measures uncertainty about the data.

Generalization to non-Gaussian windows
The proposed method can be used with any analysis window. While the Fourier transform of a damped chirp with a non-Gaussian window does not admit a closed-form solution, we can use the damped chirp's time domain basis function and derivatives given in Section 8.1, multiply them with a non-Gaussian window, then numerically compute the DFTs. This provides us with the spectral basis functions and its derivatives w.r.t. the parameters for any analysis window. For example, the Hann window is often used for spectral analysis due to its compact main lobe and attenuated side-lobes [23].

Accuracy experiments (single component)
Nonlinear Bayesian regression (NLR) was tasked with estimating the parameters of damped chirps across a range of noise levels. It was compared to the Cramér-Rao bound (CRB) defined in Section 8.2, and existing state-of-the-art non-stationary sinusoidal model estimators discussed in Section 1.1: GRM [6,7], GDM [9], QIFFT [3], and ESPRIT [13]. For NLR, the spectral data was either obtained using a Gaussian window (NLR-G) or a Hann window (NLR-H). For NLR-G, the closed-form expressions for the spectrum and derivatives were used. The Gaussian window's scale was set using equation (8) with ν < 10 −3 . For NLR-H, the spectral basis function and its derivatives were computed numerically by taking DFTs of a Hannwindowed complex-valued damped chirp and its derivatives.
The duration of each test signal was N = 512 samples. The signal-to-noise ratio (SNR) in dB was SN R = 10 log 10 (48) and went from 0dB to +120dB by steps of 20dB, where sn is the signal and ζn is the zero-mean white noise of the nth time-domain sample. Since NLR does not have access to the true noise variance, we set σ 2 x = .1N constant for all tests. For each SNR and each analysis method, the variance of error was evaluated from R = 1000 Monte Carlo runs. The variance of error was approximated by the equation where ξ (r) was a target value and ξ (r) was an estimate from the rth Monte Carlo run. Monte Carlo run r involved randomly sampling from uniform distributions the phase θ (r) ∈ (0, 2π), and frequency in the range ω (r) 0 ∈ (4π/N, π/4), as GRM, GDM, and QIFFT had problems with ω (r) 0 < 4π/N . The chirp rate was either ψ (r) = 0 (no FM) or randomly sampled in the range ψ (r) ∈ (−2ω0/N, 2ω0/N ) (FM), so the instantaneous frequency could span 2ω0 in N samples. Similarly, the decay rate was either α (r) = 0 (no AM) or randomly sampled α ∈ (−2/N, 2/N ) (AM). Figures 1 and 2 show the variance of error given stationary signals (no FM or AM), and non-stationary, damped chirp signals (FM and AM), respectively. ESPRIT is closest to the CRB for stationary signals, but does not estimate chirp rate and its quality degrades for damped chirp signals because it assumes that signals have stationary frequencies (no FM). NLR has similar accuracy to GRM, GDM, and QIFFT below 60dB and more accuracy above 60dB, remaining close to the CRB for all parameters. Compared to a Gaussian window spectrum, a Hann window's spectrum has a small main lobe and low side-lobes enabling better frequency resolution. Indeed, NLR-H was better than NLR-G across all SNRs.

Resolution experiments (multiple components)
Joint estimation of multiple frequency components in a noisy signal is a difficult problem. First, Figure 3 shows the result of using NLR-G to estimate many stationary components from a noisy signal of duration N = 256 that are close in frequency and where some share the same spectral peak. The proposed method demonstrates its high resolution, detecting obfuscated components and estimating their parameters. The number of components was set to M > N/8, which allowed each component to be properly detected. Irrelevant components were automatically driven to zero DAFx.5 amplitudes thanks to the sparsity inducing prior over the regression coefficients and the estimation of λ. ESPRIT with order 24 (12 components) had a similar resolution but was sensitive to the order, did not trim irrelevant components, and did not consider frequency modulation. Figure 4 shows the results of estimating several damped chirp components from a noisy signal of duration N = 256. Fast chirps and decays created wide spectral lobes that caused significant phase cancellation artifacts in the magnitude spectrum, most noticeably at 4.3 kHz where a component's center frequency is above a valley. The proposed method resolved this difficult situation, estimating each component's chirp rate and decay rate, as shown in the bottom panels of Figure 4. GRM and GDM would not resolve these components, as they detect components by picking spectral peaks.

Short-term analysis with a sliding window
To analyze longer duration evolving sounds, we used a sliding Gaussian window of N = 256 duration with an L = 256  Figures 5a and 5b show the short-term estimation of noisy (20dB SNR) signals with fast chirp rates, and two cosine FM components that cross in the time-frequency plane. When the components cross at time 0.12 seconds, their chirp rates have the same magnitude but are of opposite signs. NLR resolves the two components and shows their crossing even though they are under the same spectral peak at 4kHz. This quality is relevant for subsequent partial tracking [24] and additive synthesis, where incorrect detection can lead to distortions in the re-synthesized sounds. (b) NLR accurately resolves and estimates the two cosine FM sounds from a noisy signal, even at 0.12 seconds where they share the same 4kHz spectral peak. Figure 5: Chirpograms: a line's position, slope, and color, correspond to an estimated frequency, chirp rate, and amplitude, respectively. Chirp rate and window scale affect the sizes of the spectrogram lobes (dark parts of background image).

CONCLUSIONS
This paper presented a method for resolving and estimating the parameters of superimposed damped chirp sinusoids in noise. Nonlinear Bayesian regression in the frequency domain was addressed using an analytic expression for a Gaussian-windowed damped chirp's Fourier transform, and generalized to any window by computing DFTs of a damped chirp's temporal basis function and its derivatives. Explicitly modeling the additive contribution of a realvalued signal's negative and positive frequency components enables quality estimation over the entire spectrum. Its high resolution is attributed to the probabilistic modeling and inference of the non-stationary parameters and their covariances, given all the data available. It can resolve crossing partials with fast chirp rates that even share the same spectral peak, and has more accuracy than existing estimators for highly non-stationary signals. These qualities are especially relevant for analyzing, coding, and synthesizing vocal sounds and polyphonic music. Future work could accelerate the algorithm. While our structured variational approximation provided high resolution and accuracy, a mean-field factorization of the model's variables may significantly reduce the computational complexity of estimation without altering its quality. A second order optimization algorithm like Newton's method could improve convergence speed after a careful initialization. Alternatively, a variational auto-encoder could be used to map the data to approximate posterior means and variances of the model parameters.

Gradients of the basis functions
The gradient of the log-likelihood w.r.t. ϕ is where ⊙ denotes element-wise multiplication and we defined Second, the gradient of a log-prior w.r.t. ϕ is τ ϕ = τω 0 τ ψ τα .
Considering equation (43), σ 2 x gives more or less influence to the likelihood compared to the prior. We keep this value fixed regardless of the actual noise level, which we assume is unknown.

Matrix−
→ F ∈ C N ×3M contains the partial derivatives of the M spectral basis functions with respect to their three parameters, where ∂ − → F /∂ϕj is an N × 1 vector because only one of the basis functions depends on ϕj, ∀j ∈ [1, 3M ].
where the index m is not subscripted to simplify the equations. For non-Gaussian windows, the spectral basis function and its derivatives are numerically computed with windowed DFTs of the temporal basis function d ϕ (t) and its derivatives,

Theoretical bounds
The Cramér-Rao bound (CRB) is defined as the best possible performance of an unbiased estimator in the presence of noise for a given dataset. For the damped chirp model in (1), the CRBs have been derived by Zhou et al. [2]. The CRB depends on the value where 0 ≤ n0 < N is the time sample at which the parameters are estimated. Djurić and Kay [1] noted that the optimal choice in terms of the CRB is n0 = N 2 , i.e. the center of the analysis frame. The Fischer information matrix (FIM) and CRB for the amplitude and decay rate are The FIM and CRB for the phase, frequency, and chirp rate are