A Music Cognition–Guided Framework for Multi-pitch Estimation

As one of the most important subtasks of automatic music transcription (AMT), multi-pitch estimation (MPE), which predicts the fundamental frequencies in the frames of audio recordings, has been studied extensively during the past decade. However, how to use music perception and cognition for MPE has not yet been thoroughly investigated. Motivated by this, this paper demonstrates how to effectively detect the fundamental frequency and the harmonic structure of polyphonic music using a cognitive framework. Inspired by cognitive neuroscience, an integration of the constant Q transform and a state-of-the-art matrix factorization method called shift-invariant probabilistic latent component analysis (SI-PLCA) is proposed to resolve the polyphonic short-time magnitude log-spectra for multiple pitch estimation and source-specific feature extraction. The cognitions of rhythm, harmonic periodicity and instrument timbre are used to guide the analysis of contiguous notes and of the relationship between the fundamental frequency and the harmonic frequencies, in order to detect the pitches from the outcomes of SI-PLCA. In the experiments, we compare the performance of the proposed MPE system with a number of existing state-of-the-art approaches (seven shallow learning methods and four deep learning methods) on three widely used datasets (i.e. MAPS, BACH10 and TRIOS) in terms of F-measure (F1) values. The experimental results show that the proposed MPE method provides the best overall performance among the compared methods.


Introduction
Estimation and tracking of multiple fundamental frequencies is one of the major tasks in automatic music transcription (AMT) of polyphonic music [1] and in music information retrieval (MIR) [2], and it is listed as a subtask in the Music Information Retrieval Evaluation eXchange (MIREX). Multiple fundamental frequency estimation (MFE), also known as multiple pitch estimation (MPE), is challenging because it must process simultaneous notes from multiple instruments in polyphonic music [3,4]. Algorithms for this task are more complex than those for single-pitch estimation, and there is often a trade-off between their robustness and efficiency.
According to Benetos et al. [5], MPE approaches are categorised into three types, i.e. feature-based, spectrogram factorization-based and statistical model-based methods. In feature-based methods, signal processing techniques such as the pitch salience function [6] and the pitch candidate set score function [7] are used. In spectrogram factorization-based methods, both nonnegative matrix factorisation (NMF) and probabilistic latent component analysis (PLCA) have received a lot of attention in recent years [6], and numerous improved versions [8,9] of both methods have been published and are recognised as leading spectrogram factorization-based methods in the MPE domain. The statistical model-based methods employ maximum a posteriori (MAP) estimation [3], maximum likelihood (ML) or Bayesian theory [10] to detect the fundamental frequencies. It is worth noting that these three types of MPE approaches can be combined [6] for a variety of applications.
In recent years, many deep learning (DL)-based supervised MPE approaches have also been developed. Cheuk et al. [11] presented a DL model for AMT by combining the U-Net and bidirectional long short-term memory (BiLSTM) neural network modules. Mukherjee et al. [12] used statistical characteristics and an extreme learning machine for musical instrument segregation, where LSTM and the recurrent neural network (RNN) [13] were combined to differentiate the musical chords for AMT. Fan et al. [14] proposed a deep neural network to extract the singing voice, followed by a dynamic unbroken pitch determination algorithm to track pitches. Sigtia et al. [15] developed a supervised approach for polyphonic piano music transcription that included an RNN and a probabilistic graphical model.
Although DL approaches may provide adequate music transcriptions, they often require high-performance computers and powerful graphics processing units (GPUs) to speed up the lengthy training process [16]. Furthermore, DL algorithms may suffer from inaccurately labelled data, and their performance may be sensitive to the training samples and the learning procedures used. For these reasons, this paper focuses mainly on a cognitive method, in which the cognitive theories and assumptions from previous studies [17][18][19] are used to guide the fundamental pitch detection in polyphonic music pieces.
To distinguish the pitch using harmonic analysis, two types of statistical models are often used. One is the expectation-maximization (EM)-based algorithms [20], and the other is Bayesian-based algorithms [21]. For EM-based methods, Emiya et al. [22] proposed a maximum likelihood-based method for multi-pitch estimation. Duan and Temperley [23] proposed a three-stage music transcription system and applied maximum likelihood for final note tracking. For Bayesian-based methods, Alvarado Duran [24] combined Gaussian processes and Bayesian models for multi-pitch estimation. Nishikimi et al. [25] integrated a hidden Markov model with Bayesian inference to precisely detect the vocal pitch. These statistical models can also be considered shallow learning methods: data are first observed to gain some prior knowledge, based on which the experiments are then conducted. As the information from new samples is continually added to the prior distribution, the posterior inference is delivered along with the final results. Although the shallow learning approaches have been widely investigated [26], they still have much room to improve.
Apart from the aforementioned issues, most MPE methods are designed from the viewpoint of signal processing rather than music cognition, resulting in a lack of sufficient underpinning theory and in inefficient modelling. To tackle this issue, we propose a general framework in which music cognitions are used to guide the entire process of MPE. In the pre-processing, inspired by the cognitive neuroscience of music [19], the Constant Q transform (CQT) [27] is employed to transform the audio signal into a time-frequency spectrogram. The pianoroll transcription is then generated using a conventional matrix factorization approach, shift-invariant probabilistic latent component analysis (SI-PLCA) [9]. In the harmonic structure detection (HSD) process, the cognitions of harmonic periodicity and instrument timbre [18] are used to guide the extraction of multiple pitches. The efficacy of the proposed methodology has been validated by experiments on three publicly available datasets.
The major contributions of this paper are as follows. First, a new HSD model that incorporates music cognition for multiple fundamental frequency extraction is proposed. Second, a new note tracking method guided by music connectivity and a multi-pitch model is proposed. By combining conventional pianoroll transcription approaches and the proposed HSD model, a new music cognition-guided optimization framework is introduced for MPE. Experimental results on three datasets have demonstrated the merits of our approach when benchmarked against 11 state-of-the-art methods.
The rest of the paper is structured as follows: "Cognition-guided multiple pitch estimation" describes pre-processing for MPE including time-frequency representation, matrix factorization and the implementation of the proposed harmonic structure detection method. "Experimental results" presents the experimental results and performance analysis. Finally, a thorough conclusion is drawn in "Conclusion".

System Overview
The objective of this work is to detect multiple pitches from music pieces of mixed instruments. The proposed MPE system contains three key modules, i.e. pre-processing, harmonic structure detection and note tracking. Pre-processing covers a standard procedure, in which an input music signal goes through time-frequency (TF) representation and matrix factorization for feature extraction. The overall diagram of the MPE framework is illustrated in Fig. 1, and the implementation details are presented as follows.

Pre-processing
According to the cognitive neuroscience of music [19,28], before selectively stimulating the auditory cortex, different frequencies within the music need to be first filtered by human cochlea. As the frequency of human auditory perception is logarithmically distributed [27], there is a greater discrimination when hearing relatively lower frequencies.

The Constant Q transform (CQT) [29], based on the FFT principle, performs a logarithmic compression similar to that of the helical structure of the human cochlea [29]. Therefore, the CQT is employed as the TF representation module to derive the TF spectrogram, as it is efficient at lower frequencies.
There are fewer frequencies required in a given range, which has testified its usefulness when the frequency distribution in several octaves is discrete. Meanwhile, an increased frequency bin correlates to a decrease in the temporal resolution rate, making it suitable for auditory applications. A spectral resolution of 60 bins per octave is used as suggested by Brown [27]. The outputs from the TF transformation are linear when using the Fast Fourier Transform (FFT) to analyse the frequency (Fig. 2a).
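The logarithmic bin spacing described above can be illustrated with a short sketch. It only computes the geometrically spaced centre frequencies and the resulting quality factor for 60 bins per octave; the function names and the starting frequency of 27.5 Hz (the note A0) are assumptions made for illustration and are not taken from the paper.

```python
def cqt_center_frequencies(f_min_hz, n_octaves, bins_per_octave=60):
    """Geometrically spaced centre frequencies: f_k = f_min * 2^(k / B)."""
    n_bins = n_octaves * bins_per_octave
    return [f_min_hz * 2 ** (k / bins_per_octave) for k in range(n_bins)]


def q_factor(bins_per_octave=60):
    """Ratio of centre frequency to bin spacing, identical for every bin."""
    return 1.0 / (2 ** (1.0 / bins_per_octave) - 1.0)


# Hypothetical setting: two octaves starting at A0 (27.5 Hz).
freqs = cqt_center_frequencies(27.5, n_octaves=2)
print(len(freqs), freqs[0], freqs[60])
print(round(q_factor(), 2))
```

Because the ratio between neighbouring centre frequencies is fixed, each octave receives the same number of bins, which is what makes the spacing between harmonics independent of pitch on the log-frequency axis.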
In the matrix factorization module, the CQT spectrogram is used as the input, approximately modelled as a bivariate probability distribution P(p, t). The output of this module is a 2-dimensional non-binary representation of the pianoroll transcription (a pitch vs. time matrix shown in Fig. 2b). In this paper, the fast shift-invariant probabilistic latent component analysis (SI-PLCA) [30] is used for automatic transcription of polyphonic music, as it is well suited to log-frequency spectrograms, in which the inter-harmonic spacing is the same for all periodic sounds [31]. Given an input signal X_t, the output of the CQT is a log-frequency spectrogram V_{z,t} that can be considered as a joint time-frequency distribution P(z, t), where z and t denote the frequency and time, respectively. After applying the SI-PLCA, P(z, t) can be further decomposed into several components by [30]:

$$P(z, t) = P(t) \sum_{p, f, s} P(z - f \mid s, p)\, P_t(f \mid p)\, P_t(s \mid p)\, P_t(p) \tag{1}$$

where p, f and s are latent variables which denote respectively the pitch index, the pitch-shifting parameter and the instrument source. In Eq. (1), P(t) is the energy distribution of the spectrogram, which is known from the input signal. P(z − f|s, p) denotes the spectral templates for a given pitch p and instrument source s, shifted by f across the log-frequency axis. P_t(f|p) is the log-frequency shift for each pitch in time frame t, P_t(s|p) represents the instrumentation contribution for the pitch in time frame t, and P_t(p) is the pitch contribution, which can be considered as the transcription matrix in time frame t. Since there are latent variables in this model, the expectation maximization (EM) algorithm [20] is used to iteratively estimate the corresponding unknown variables.
In the Expectation step, Bayes' theorem is adopted to estimate the contribution of the latent variables p, f, s for reconstruction of the model:

$$P_t(p, f, s \mid z) = \frac{P(z - f \mid s, p)\, P_t(f \mid p)\, P_t(s \mid p)\, P_t(p)}{\sum_{p, f, s} P(z - f \mid s, p)\, P_t(f \mid p)\, P_t(s \mid p)\, P_t(p)} \tag{2}$$

In the Maximization step, the posterior of Eq. (2) is used to maximise the log-likelihood function in Eq. (3), which leads to the updates of Eqs. (4)-(7). As suggested in [30], this procedure converges after 15-20 iterations. The final result of the pianoroll transcription is derived by P(p, t) = P(t)P_t(p).
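For orientation, the log-likelihood and the Maximization-step updates usually given for a shift-invariant model of this kind can be sketched as follows. This is an assumed reconstruction in the notation of this section, not a copy of Eqs. (3)-(7); here P_t(p, f, s | z) stands for the Expectation-step posterior and V_{z,t} for the CQT spectrogram.

```latex
% Log-likelihood of the observed spectrogram under the model
\mathcal{L} = \sum_{z,t} V_{z,t} \log P(z,t)

% Spectral templates (accumulated over all shifts f and frames t)
P(z \mid s, p) =
  \frac{\sum_{f,t} P_t(p,f,s \mid z+f)\, V_{z+f,t}}
       {\sum_{z,f,t} P_t(p,f,s \mid z+f)\, V_{z+f,t}}

% Per-frame pitch shift, instrument contribution and pitch activation
P_t(f \mid p) =
  \frac{\sum_{z,s} P_t(p,f,s \mid z)\, V_{z,t}}
       {\sum_{f,z,s} P_t(p,f,s \mid z)\, V_{z,t}}
\qquad
P_t(s \mid p) =
  \frac{\sum_{z,f} P_t(p,f,s \mid z)\, V_{z,t}}
       {\sum_{s,z,f} P_t(p,f,s \mid z)\, V_{z,t}}
\qquad
P_t(p) =
  \frac{\sum_{z,f,s} P_t(p,f,s \mid z)\, V_{z,t}}
       {\sum_{p,z,f,s} P_t(p,f,s \mid z)\, V_{z,t}}
```

Each update is a normalised, spectrogram-weighted sum of the posterior over the variables that the target distribution does not depend on, so every estimated factor remains a valid probability distribution after each iteration.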

Harmonic Structure Detection
This section is the core of the proposed MPE system where music theories in terms of the pattern of beat length and assumption of equal energy between mixed monophonic and polyphonic music pieces are used to guide the model for the extraction of the multiple fundamental frequencies from a mixture of music sources.
For a given piece of music, the time domain representation is illustrated in the input module in Fig. 1. The results of CQT and SI-PLCA are given in Fig. 2a and b, respectively. In Fig. 2b, the fundamental pitch and its harmonics are highlighted by the shaded black and grey strips. However, there is considerable noise and redundant information, represented by small grey dots, which may be mistaken for pitches at lower frequencies. Furthermore, the white gaps in the black and grey strips indicate frequency information that has been lost in the analysis. This suggests that the consistency of the fundamental pitch is insufficient if considered frame by frame (each frame was set to 10 ms). To address these inconsistencies, the HSD method is proposed, followed by a note tracking process (Fig. 1). The proposed HSD includes two main stages. In the first stage, the pianoroll transcription P(p, t) is normalised into [0, 1] by using the following max-mean sigmoid activation function [32]: where PN represents the normalised P(p, t). By applying a mean filter in Eqs. (8) and (9), the spectrogram can be smoothed, and extreme values that are much larger or smaller than expected are also moderated. For any PN, the value of PN_t at time t can be expressed by Eq. (10).
Inspired by the music theory that most high-order harmonic components lie in the high-frequency range with low amplitude [17], a two-step hard constraint is used to remove most of the high-frequency components, noise and redundancy. In the first step, a fixed threshold TH1 is applied in Eq. (11) to remove small values. Based on the characteristics of the sigmoid function (Eq. (8)), TH1 is set to 0.5. The filtered result PF over all frames is then obtained, as shown in Fig. 3a.
In the second step, the statistics of the beat length are used to guide the removal of noise and redundant information. According to the cognition of music perception, musical rhythms contain a large number of crotchets and quavers, but fewer semiquavers and demisemiquavers [33]. The rate of occurrence of different notes in the BACH10 database was measured according to the ground truth. Figure 4 plots the rate of occurrence against time, with the labelled fractions (i.e. 1/2, 1/4, 1/8, 1/16, 1/32) denoting minim, crotchet, quaver, semiquaver and demisemiquaver, respectively. Figure 4 illustrates that the rate of occurrence of crotchets and quavers is larger than that of demisemiquavers, semiquavers and minims. In particular, the number of demisemiquavers and semiquavers is extremely low. Since the length of a demisemiquaver is 1/32 of the length of a semibreve, any note shorter than a demisemiquaver is removed from PF before any further processing in the second stage. In Fig. 4, a peak is identified at the shortest durations, which may be due to two reasons. Firstly, manually played music may contain some timing errors, as it may be impossible to hold every note in the piece for its precise duration. Secondly, ornaments such as vibrato and glissando may be performed despite not being present in the music score. The length of such vibrato and glissando is equal to a demisemiquaver or lower [34]. To extract more of the main body of multiple pitches, factors such as human playing habits or ornaments are ignored in the proposed work. Relevant results given in "Experimental results" demonstrate that the multiple pitches are highlighted whilst most of the unwanted noise is removed.
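The duration-based filtering described above can be sketched in a few lines. The sketch assumes that the crotchet carries the beat, so a semibreve lasts four beats, and it uses the 10 ms frame length mentioned earlier; the function names and the example tempo are hypothetical and chosen only for illustration.

```python
def min_note_frames(bpm, frame_ms=10.0, fraction=1 / 32):
    """Length of a demisemiquaver (1/32 of a semibreve) in frames.

    Assumes the crotchet carries the beat, so a semibreve lasts 4 beats.
    """
    semibreve_s = 4 * 60.0 / bpm
    return semibreve_s * fraction / (frame_ms / 1000.0)


def remove_short_notes(row, min_frames):
    """Zero out runs of non-zero values shorter than min_frames.

    `row` holds the filtered amplitudes of one pitch over time.
    """
    out = list(row)
    start = None
    # A trailing zero acts as a sentinel that closes a run ending at the last frame.
    for i, value in enumerate(list(row) + [0]):
        if value > 0 and start is None:
            start = i
        elif value <= 0 and start is not None:
            if i - start < min_frames:
                for j in range(start, i):
                    out[j] = 0
            start = None
    return out


# Hypothetical example: at 60 bpm a demisemiquaver spans 12.5 frames of 10 ms.
print(min_note_frames(60))
print(remove_short_notes([0, 1, 1, 0, 1, 1, 1, 1, 0], min_frames=3))
```

Applying the filter per pitch row keeps sustained notes intact while discarding isolated activations that are too short to be plausible notes.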
After filtering the amplitudes from PLCA, the HSD framework is applied to detect the fundamental pitch in the second stage. The flowchart in Fig. 5 outlines the process of HSD, and Table 1 lists the description of each parameter. As described in the flowchart in Fig. 5, the output from the previous steps is analysed in two domains, i.e. the pitch domain PD and the energy domain ED. In this context, each frame of PF is split into two vectors, PD(n) and ED(n). PD(n) ∈ ℝ^{N×1} contains the indices of the non-zero notes in each frame, ED(n) ∈ ℝ^{N×1} contains the amplitudes of PD(n), and N is the number of non-zero notes. For efficiency, the process is applied only to the non-zero notes rather than to the whole frame, as zero-valued notes need not be analysed.

Pitch Domain Analysis
After that, a matrix of pitch candidates and their corresponding harmonics PCH ∈ ℝ^{N×H} can be extended from PD(n). The first column of this matrix contains the non-zero pitch values and the remaining columns contain the associated harmonic pitches of each non-zero pitch, where the harmonic pitch is the pitch value corresponding to the harmonic frequency. A harmonic map HMap ∈ ℝ^{M×H} is employed here to guide the extension process, which includes the pianoroll number (m) of the fundamental frequency (F0) and the corresponding harmonic frequency for every note. Following the MIDI tuning standard, we convert the nth non-zero fundamental frequency to its corresponding pianoroll number using Eq. (12). Here, PD needs to be subtracted by 20 due to the difference between the pianoroll and the MIDI number:  and music pieces. It is worth mentioning that our algorithm does not rely on the frequency setting of concert A, as our algorithm focuses on the analysis of the relationship between the fundamental frequency and the harmonic frequencies, which mainly depends on the music temperament.
An example of calculating MIDI number of harmonic frequency in HMap is given in Table 2.
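The conversion and the harmonic map can be illustrated with a small sketch. It assumes the common MIDI tuning convention (note 69 at a concert A of 440 Hz), the offset of 20 between MIDI and pianoroll numbers mentioned above, and rounding of each harmonic to the nearest semitone; the function names are hypothetical, and the rounding rule is an assumption rather than the procedure behind Table 2.

```python
import math


def freq_to_pianoroll(freq_hz, concert_a_hz=440.0):
    """MIDI tuning standard, then shifted by 20 to obtain the pianoroll number."""
    midi = 69 + 12 * math.log2(freq_hz / concert_a_hz)
    return round(midi) - 20


def harmonic_row(pianoroll_number, num_harmonics=5):
    """Pianoroll numbers of harmonics 1..H of a note.

    The h-th harmonic lies 12 * log2(h) semitones above the fundamental,
    rounded here to the nearest semitone (an assumption).
    """
    return [pianoroll_number + round(12 * math.log2(h))
            for h in range(1, num_harmonics + 1)]


a4 = freq_to_pianoroll(440.0)
print(a4)
print(harmonic_row(a4))
```

The semitone offsets in `harmonic_row` depend only on the harmonic index, which is consistent with the remark that the analysis does not depend on the frequency chosen for concert A.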
PCH(n, h) is the hth harmonic pitch component of the pitch n, where n lies within [1, N] and h is within [1, H]. H is set to 5 in the experiment, and N is the number of non-zero values in each frame:

$$PCH(n, h) = HMap(m(n), h), \quad PCH \in \mathbb{R}^{N \times H} \tag{13}$$

Let PCP be a matrix of the harmonics and their potential corresponding pitches, which contains the harmonic components and their associated pitches calculated from the original pitch at a specific value of h as follows: where (x − y) is a function of the equivalence gate with two inputs. The output of the equivalence gate is 1 if the two inputs are equal (i.e. h = 1); otherwise, it is zero. Using Eqs. (14) and (15)

Energy Domain Analysis
In the energy domain, EDG(n, h) is a value generated from ED ∈ ℝ^{N×H} and PHC(n, h) as defined below:

$$EDG(n, h) = ED(n) \cdot [PHC(n, h) - PHC(n, 1)], \quad EDG \in \mathbb{R}^{N \times H} \tag{17}$$

In the following, we describe two cognitive theories which have inspired our proposed guided weight mechanism for fundamental frequency detection. First, according to the harmonic periodicity and instrument timbre theory [18], the harmonic periodicity of different instruments should be the same, although their sound varies with their timbres, as reflected in the ratio of the harmonic amplitude to the fundamental amplitude [35]. The instruments from different families will have a large ratio, and vice versa. For instruments that produce sound from strings, such as the piano and the violin (Fig. 6d), the harmonic amplitudes generally decrease gradually. In contrast, for woodwind instruments such as the clarinet (Fig. 6c) and the bassoon (Fig. 6a), the amplitude of the first harmonic is lower than that of the second harmonic. Therefore, the energy ratio between the fundamental frequency and the harmonic frequencies (timbre) is unaffected by monophonic or polyphonic textures, but is unique to individual instruments. Second, according to acoustic theory [36], when two or more sound waves occupy the same space, they move through rather than bounce off each other. In other words, the result of any combination of sound waves is simply the addition of these waves. Theoretically, the energy of the mixed monophonic and polyphonic audio should be the same, though there is an unavoidable difference in real cases. The results of a single frame after step 1 of the harmonic structure detection (HSD) are plotted as profiles of pitch values in Fig. 6. The profiles of four single music sources are shown in Fig. 6a-d. The profile of the mixed monophonic notes is given in Fig. 6e, which is composed of four single music sources, i.e. notes no. 1-no. 4, and the profile of the polyphonic notes shown in Fig. 6f is generated from one mixed channel. The profile of the mixed monophonic notes can be regarded as the ideal value, and the profile of the polyphonic notes as the actual value. As seen in Fig. 6f, there are a few amplitude differences between the profiles of the polyphonic and monophonic notes due to the resonance in the polyphonic notes and channel distortion during data recording and transmission, but the overall trend of the two profiles is very similar.
Motivated by these observations, we propose the guided weight mechanism, denoted as Eq. (18) in our model, to improve the detection of the fundamental frequency. The guiding weight is calculated as the averaged ratio between the amplitude of the harmonic ED_mono(h) and that of the fundamental frequency ED_mono(1) in the monophonic data, before being applied to the polyphonic data. The variable I is the number of known instruments that can be identified in the music piece: where T is the number of time frames in the monophonic data, the first non-zero value ED_mono_t(1) is always the fundamental frequency, and the remaining non-zero values ED_mono_t(h) are the harmonic frequencies.
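The averaged ratio described above can be sketched for a single instrument as follows. The sketch assumes that each monophonic frame is given as a list of amplitudes ordered as fundamental, second harmonic, and so on, and that frames with a zero fundamental are skipped to avoid division by zero; the name `guided_weights`, the input layout and the skipping rule are assumptions for illustration, not the exact form of Eq. (18).

```python
def guided_weights(mono_frames):
    """Average harmonic-to-fundamental amplitude ratio over monophonic frames.

    mono_frames: list of frames, each [ED_mono(1), ED_mono(2), ..., ED_mono(H)],
    where index 0 holds the fundamental amplitude.
    """
    # Frames without an active fundamental carry no ratio information (assumption).
    usable = [frame for frame in mono_frames if frame[0] > 0]
    if not usable:
        raise ValueError("no frame with a non-zero fundamental amplitude")
    num_harmonics = len(usable[0])
    return [
        sum(frame[h] / frame[0] for frame in usable) / len(usable)
        for h in range(num_harmonics)
    ]


# Hypothetical monophonic data with two active frames and one silent frame.
frames = [[1.0, 0.5, 0.25], [2.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
print(guided_weights(frames))
```

By construction the weight of the fundamental is 1, and the remaining entries describe how strongly each harmonic is expected to appear relative to it for that instrument.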
Equation (19) estimates the amplitude of the harmonic components (EHC) present in the pitch n by multiplying the guided weight of the selected instrument by EDG. Theoretically, the amplitude of a harmonic should be proportional to the amplitude of the fundamental frequency. Note that the fundamental frequencies must occur at h = 1, and the harmonic frequencies occur at h = 2, ..., H.
Based on the EHC i determined from Eq. (19), the amplitude of fundamental frequency in pitch n after subtracting the summed harmonic components' amplitude will be kept updating until the fundamental frequencies from all instruments are estimated.
Eventually, the amplitude of the fundamental frequency in pitch n, represented as EFF, can be obtained by Eq. (21):

$$EFF(n) = ED(n), \quad EFF \in \mathbb{R}^{N \times 1} \tag{21}$$

Each non-zero pitch n in each frame t is assigned a rank value R(n) according to EFF(n). A 2D rank map R(n, t) is then generated for the whole music piece, i.e. pitch/pianoroll vs. time frame as shown in Fig. 3b, which is used to represent the detected harmonic structure. A brief implementation of the energy domain procedure is summarized in Algorithm 1.

Algorithm 1
Inputs: ED(n)
Step 1: Generate a matrix including the amplitude of fundamental pitch and their corresponding harmonic pitches using Eq. (17)
Step 2: Calculate the weight for each type of instrument using Eq. (18)
Step 3: Estimate the amplitude of harmonic components (EHC) presented in the pitch n using Eq. (19)
Step 4: Update ED by Eq. (20)
Step 5: Repeat steps 1-4 until the fundamental frequencies from all instruments are estimated
Obtain the final estimated amplitude of fundamental frequency in pitch n by Eq. (21)

Note Tracking
As seen in Fig. 3b, although most fundamental pitches have been extracted, the notes still show poor consistency. To improve this, a note tracking method based on music perception and a multi-pitch probability weight is proposed. According to music theory [33], the occurrence of demisemiquavers is generally quite low in music pieces. As a result, notes with a length shorter than a demisemiquaver are filtered out. The averaged rank of each connected pitch group in the rank map is calculated and denoted as R. If R is larger than an adaptive threshold TH2, the pitch group is considered a harmonic and is skipped from the analysis. As the polyphonic music pitches vary over time, TH2 also changes accordingly. To account for this change, a fitting function was generated for TH2 (Fig. 7a), which is adaptive to the number of notes x ∈ [1, 12] in each frame, as given:

$$TH_2 = 1.26\, x^{0.9} \tag{22}$$

The fitting curve of TH2 is obtained by minimising the fitting error between the ground truth and our estimate. Figure 7b displays the note tracking results, where most of the noise and the inconsistencies have been filtered out. The result also has a profile similar to that of the ground truth data.
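The rank-based decision described above can be sketched as follows. The sketch reads the fitted threshold of Eq. (22) as 1.26 multiplied by x raised to the power 0.9, which is an assumption about the extracted formula, and it treats a pitch group simply as the list of rank values of its frames; the function names and the example values are hypothetical.

```python
def adaptive_threshold(num_notes):
    """Fitted threshold TH2 as a function of the number of notes x in a frame.

    Assumes Eq. (22) reads TH2 = 1.26 * x ** 0.9.
    """
    return 1.26 * num_notes ** 0.9


def is_harmonic_group(ranks, num_notes):
    """A connected pitch group is treated as a harmonic (and skipped)
    when its average rank exceeds the adaptive threshold."""
    mean_rank = sum(ranks) / len(ranks)
    return mean_rank > adaptive_threshold(num_notes)


# Hypothetical groups in frames that contain four notes.
print(round(adaptive_threshold(4), 3))
print(is_harmonic_group([5, 6, 5, 6], num_notes=4))
print(is_harmonic_group([1, 2, 1, 1], num_notes=4))
```

Letting the threshold grow with the number of notes means that, in dense frames, more candidates can be retained as fundamentals before a group is dismissed as a harmonic.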

Experimental Settings
To validate the effectiveness of the proposed approach, the first dataset used for evaluation is the MIDI Aligned Piano Sounds (MAPS) [37], in which all music pieces are initially recorded in the MIDI format and then converted into ".wav" format. MAPS also has subsets for different purposes, such as monophonic excerpts and chords. Here, only the subset that contains polyphonic music pieces is used. In addition, MAPS covers several instruments and recording conditions. The "ENSTDkCl" subset is chosen because its music is played on a real piano rather than a synthesised one (i.e. a virtual instrument), and it is recorded in soundproofed conditions. The second dataset is BACH10 [38], which contains 10 pieces from J.S. Bach chorales played on violin, clarinet, saxophone and bassoon, where each piece lasts approximately 30 s. The third dataset is TRIOS [39], which is the most complex one among the three as it contains five multitrack chamber music trio pieces. The sampling rate for all music pieces is 44,100 Hz.
For objective assessment, the most commonly used frame-based metric, the F-measure (F1) [40,41], is adopted. It combines the positive predictive value (PPV, also known as precision) and the true positive rate (TPR, also known as recall) for a comprehensive evaluation as follows:

$$F_1 = \frac{2 \cdot PPV \cdot TPR}{PPV + TPR}$$

where $TPR = \frac{T_p}{T_p + F_n}$, $PPV = \frac{T_p}{T_p + F_p}$, and T_p, F_p and F_n refer respectively to the number of correctly detected F0, incorrectly detected F0 and missed F0. Specifically, these three components can be calculated by comparing the binary masks of the detected MPE results and the ground truth. Table 3 shows the quantitative assessment of 11 benchmarking methods on the MAPS, BACH10 and TRIOS datasets. We divide all benchmarking methods into two categories: shallow learning methods and DL methods. Shallow learning methods include traditional machine learning models or prior knowledge-based models, whereas DL methods include deep neural networks and deep convolutional neural networks.
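The frame-based metric can be sketched directly from its definition. The sketch assumes that the detection result and the ground truth are given as binary pitch-by-time masks of the same shape, stored as nested lists; the function name and the small example masks are hypothetical.

```python
def f_measure(pred_mask, truth_mask):
    """Return (precision, recall, F1) from two binary masks of equal shape."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred_mask, truth_mask):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1      # correctly detected F0
            elif p and not t:
                fp += 1      # incorrectly detected F0
            elif t and not p:
                fn += 1      # missed F0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * ppv * tpr / (ppv + tpr) if ppv + tpr else 0.0
    return ppv, tpr, f1


# Hypothetical masks: two pitches over three frames.
pred = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 1, 1]]
print(f_measure(pred, truth))
```

Cells that are inactive in both masks do not enter any of the three counts, so the score is not inflated by the large number of silent pitch-time cells in a pianoroll.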

Performance Evaluation
Many MPE approaches select a pair of methods from CQT, PLCA, equivalent rectangular bandwidth (ERB) and NMF for pianoroll transcription. Therefore, two of the most representative methods, i.e. CQT + PLCA proposed by Benetos and Dixon [31] and ERB + NMF proposed by Vincent et al. [42], are chosen for benchmarking. In Table 3, Benetos et al. [43] and Vincent [42] produce the second-best performance on the MAPS and TRIOS datasets, respectively, which validates the effectiveness of CQT + PLCA and ERB + NMF. However, due to the lack of efficient harmonic analysis, the performance of both methods is inferior to that of the proposed HSD method. Unlike the methods from Benetos and Vincent, other methods adopt different ideas for MPE. SONIC [44] is a connectionist approach in which an adaptive oscillator network is used to track the partials in the music signal. However, without a matrix factorization process, its performance is limited on the three datasets. Su and Yang [8] proposed a combined frequency and periodicity (CFP) method to detect the pitch in both frequency domain and lag (frequency) domain. The CFP method in Table 3 gives the best performance on the BACH10 dataset, but relatively poorer results on the other two datasets. A possible reason is that the music pieces in the MAPS and TRIOS datasets have more short notes than those in the BACH10 dataset, and CFP has limited ability to detect short notes but exhibits fewer errors for continuous long notes. Furthermore, the assumption of CFP does not hold for the high-pitch notes of the piano, and both MAPS and TRIOS contain many piano music pieces. In addition, the music pieces in the MAPS database contain multiple notes in most frames, which leads to extra difficulty for polyphonic detection.
However, the proposed method can still successfully solve this problem by effectively analysing the relationship of the position and energy between the fundamental frequency and harmonic frequencies for the notes. As a result, the performance of the proposed method on MAPS is the best, which is 8% higher than that of CFP.
Klapuri [3] proposed an auditory model-based F0 estimator, and Duan [38] proposed a maximum-likelihood approach for multiple F0 estimation, but both methods result in inferior performance compared to the results achieved by Benetos et al. [31,43], Vincent et al. [42] or CFP [8]. Furthermore, Klapuri's [3] and Duan et al.'s [38] methods lack an effective pre-processing stage (i.e. TF representation and matrix factorization) or harmonic analysis, which is the main reason why their overall performance is less effective than ours. The proposed method was also compared with four deep learning-based supervised approaches on the MAPS dataset. Due to the lack of publicly available source code, only the results reported in the original papers are reproduced for comparison. The first two methods were proposed by Sigtia et al. [15] and are mainly based on music language models (MLMs). However, the shortage of labelled data for training in the existing polyphonic music databases limits further analysis of DL-based approaches. Furthermore, the MLM model is not robust to ambient noise, whereas music pieces in reality generally contain a lot of ambient noise. As a result, DL-based methods fail to fully analyse the inner structure of the music pieces, and they cannot achieve the same performance as the HSD method or some of the other unsupervised methods such as Benetos et al. [43] on the MAPS dataset. Su [40] and Kelz [41] also proposed DL-based methods for AMT. Although better than [15], their performance is still not ideal as insufficient music knowledge is embedded. To this end, more music theories should be introduced for improved AMT.
In summary, referring to Table 3, the proposed method yields the best results on both the MAPS and TRIOS datasets, also the second-best in BACH10 according to F 1 value, thanks to the guidance of music cognition. However, the method can still be improved, especially for reducing the computation cost. As it takes 2 min to process a 30-s music piece, this is longer than some other methods. In addition, although the profile of the real polyphonic note is close to the expected mixed monophonic note, as shown in Fig. 6e, f, there are still some differences in the final values of the monophonic and polyphonic profiles which can be further improved.

Key Stage Analysis
In this section, the contribution of several major stages in the proposed MPE system is discussed, where the performance of each stage is evaluated on the MAPS dataset in terms of the precision, recall and F1. To calculate these three metrics, the result of each stage is normalised by using Eqs. (8) and (9), and the results are binarized with a fixed threshold value of 0.5. We generalize our proposed MPE system into four key stages (A-D), and Table 4 illustrates the details of the system configurations. By combining different key stages, the corresponding system is built up for evaluation. Each stage has specific components which are indispensable to the results of the system. Stage A shows the highest recall and lowest precision after applying CQT and SI-PLCA. F0 and harmonics are all detected; however, many amplitudes are concentrated in higher-frequency (harmonic) regions, which inhibits the identification of F0. After combining stage B, the recall value decreases by 0.03%, but the precision value increases by almost 3%. This is mainly due to the removal of noise in HSD. In stage C, the core of the MPE system contributes to an increase of nearly 30% for precision and 15-18% for F1 compared to the previous combinations. Finally, after applying the proposed note tracking step (stage D), the recall value is further improved by 5.5%, which improves the final F1 value by 3.8% compared to the previous stage.

Assessment of CQT and ERB
In our proposed MPE system, CQT is employed to model human cochlear perception. However, cochlear perception is not always constant in Q. Therefore, apart from CQT, the equivalent rectangular bandwidth (ERB) method is also widely used for time-frequency transformation [42]. Most ERB methods are based on the gammatone filter bank, which models the human auditory system [45]: the signal is decomposed by passing it through a bank of gammatone filters equally spaced on the ERB scale. However, ERB methods do not necessarily produce better MPE performance than CQT. To validate this assumption, we have combined CQT [27] and ERB [42] pairwise with PLCA [43] and NMF [42] to form four hybrid methods, i.e. CQT + PLCA, CQT + NMF, ERB + PLCA and ERB + NMF, for quantitative analysis in terms of the precision-recall, ROC and F-measure curves (Fig. 8) and the AUC, MAE and maxF values (Table 5). Here AUC, MAE and maxF denote, respectively, the area under the ROC curve, the mean absolute error and the maximum value of the F-measure curve. These three criteria are treated as equally important. As seen in Fig. 8, ERB + NMF and CQT + PLCA show comparable results, and both outperform the other two methods. In Table 5, although ERB + NMF gives the best maxF value, CQT + PLCA gives the best AUC and the lowest MAE, indicating fewer false alarms. Therefore, CQT + PLCA is considered the best among these four methods, which is also the main reason why it is used in our proposed MPE system.
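To make the three summary criteria concrete, the following Python sketch computes AUC, MAE and maxF from a normalised salience map and a binary reference. It assumes that the curves are obtained by sweeping a threshold over [0, 1] and that MAE is the mean absolute difference between salience and reference. The function names, the number of thresholds and the toy data are hypothetical, not taken from the experiments.

```python
import numpy as np

def sweep_curves(salience, reference, n_thresholds=101):
    """Return (fpr, tpr, f_measure) arrays for thresholds rising from 0 to 1."""
    s = np.asarray(salience, dtype=float).ravel()
    r = np.asarray(reference, dtype=bool).ravel()
    fpr, tpr, f_measure = [], [], []
    for t in np.linspace(0.0, 1.0, n_thresholds):
        est = s >= t
        tp = np.sum(est & r)
        fp = np.sum(est & ~r)
        fn = np.sum(~est & r)
        tn = np.sum(~est & ~r)
        tpr.append(tp / (tp + fn) if tp + fn else 0.0)
        fpr.append(fp / (fp + tn) if fp + tn else 0.0)
        f_measure.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return np.array(fpr), np.array(tpr), np.array(f_measure)

def summary_scores(salience, reference):
    """AUC of the ROC curve, mean absolute error, and maximum F-measure."""
    fpr, tpr, f_measure = sweep_curves(salience, reference)
    # Reverse so that FPR is non-decreasing, and anchor the curve at (0, 0).
    x = np.concatenate(([0.0], fpr[::-1]))
    y = np.concatenate(([0.0], tpr[::-1]))
    auc = float(np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0))  # trapezoidal rule
    s = np.asarray(salience, dtype=float)
    r = np.asarray(reference, dtype=float)
    mae = float(np.mean(np.abs(s - r)))
    return auc, mae, float(f_measure.max())

# Hypothetical toy data: four cells, one inactive cell outranks an active one.
salience = np.array([0.9, 0.4, 0.6, 0.1])
reference = np.array([1, 1, 0, 0])
auc, mae, max_f = summary_scores(salience, reference)
```

A higher AUC and maxF and a lower MAE indicate better agreement with the reference. Because MAE is computed before any thresholding, it penalises residual energy in inactive cells, which is consistent with reading a low MAE as fewer false alarms.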

Conclusion
In this paper, a harmonic analysis method inspired by music cognition and perception is proposed for the MPE system. CQT and SI-PLCA are employed in the pre-processing stage for pianoroll transcription of the mixed music audio signal, from which the proposed HSD is used to extract the multi-pitch pianorolls. The proposed MPE system is not limited by the number of instruments. For multi-instrument cases (e.g. the ensemble recordings in the BACH10 and TRIOS datasets), the mixture characteristics of each instrument can be extracted for adaptive detection of the fundamental frequencies. According to the experimental results, the proposed MPE system yields the best performance on the MAPS and TRIOS datasets and the second-best on the BACH10 dataset. The investigation of the key components shows that the HSD provides the greatest contribution to the system, which validates the value of adding an efficient harmonic analysis model for significantly improving the performance of the MPE system. Furthermore, adding note tracking further improves the efficacy of the MPE system. However, the proposed MPE system still has much room for improvement. First, the expectation maximization (EM) algorithm has some limitations, notably slow convergence, sensitivity to initial settings and a tendency to reach local optima because the objective is non-convex. As a result, PLCA is very time consuming and can even be unsuitable for processing large datasets. Therefore, how to better select the initial values and speed up the convergence is a valuable topic for future investigation. Second, the assumption that the types of instruments in the music pieces are known is often unrealistic in real scenarios. Therefore, blind source separation can be integrated into our model to tackle this limitation. Third, analysis of the beat and chord, along with integrated deep-learning models such as transformer networks [46] and long short-term memory [47], can be considered to further enhance the accuracy of pitch estimation.
In addition, introducing more aspects of music perception, such as ornaments and rhythm, into the model will help to interpret the music pieces more precisely. Furthermore, an improved note tracking process can be introduced by fusing self-attention [48] and natural language processing models [49]. Finally, testing on larger datasets such as MusicNet [50] and MAESTRO [51] will be beneficial for more comprehensive modelling and validation.