
Hybrid generative discriminative approaches based on Multinomial Scaled Dirichlet mixture models


Abstract

Developing both generative and discriminative techniques for classification has seen significant progress in recent years. Given the complementary capabilities and limitations of the two families, hybrid generative discriminative approaches have received increasing attention. Our goal is to combine the advantages and desirable properties of generative models, namely finite mixtures, with Support Vector Machines (SVMs) as powerful discriminative techniques for modeling count data, which arises in many machine learning and computer vision applications. In particular, we derive accurate kernels from mixtures of the Multinomial Scaled Dirichlet distribution and its exponential approximation (EMSD) and use them within support vector machines. We demonstrate the effectiveness and merits of the proposed framework on challenging real-world applications, namely object recognition and visual scene classification. The empirical study considers large-scale datasets such as Microsoft MOCR, Fruits-360 and MIT Places.
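To make the overall pipeline concrete, the following minimal sketch builds a divergence-based Gram matrix from per-sample generative representations of count vectors and feeds it to an SVM with a precomputed kernel. The symmetric KL divergence between smoothed empirical distributions used here is only a placeholder for the MSD/EMSD mixture-based divergences developed in the paper (see Appendices C and D); the data, labels and function names are purely illustrative.

```python
# Hybrid generative/discriminative sketch: generative representation -> divergence
# -> precomputed kernel -> SVM. The divergence is a simple placeholder.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.poisson(lam=3.0, size=(60, 50))   # toy count vectors (e.g. bags of visual words)
y = rng.integers(0, 2, size=60)           # toy binary labels

def to_dist(x, eps=1e-2):
    """Smoothed empirical distribution of one count vector."""
    p = x + eps
    return p / p.sum()

def sym_kl(p, q):
    """Symmetrised KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

P = np.array([to_dist(x) for x in X])
D = np.array([[sym_kl(p, q) for q in P] for p in P])
K = np.exp(-0.5 * D)                      # divergence -> kernel, as in KL-kernel approaches

clf = SVC(kernel="precomputed").fit(K, y)
print("training accuracy:", clf.score(K, y))
```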


Notes

  1. The size of each vector depends on the image representation approach; in our case the vectors are 128-dimensional, since we represent each image as a bag of SIFT descriptors [46] (a minimal sketch of this representation is given after these notes).

  2. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data

  3. http://kdd.ics.uci.edu/databases/reuters21578

  4. https://cs.nyu.edu/∼roweis/data.html
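For completeness, here is a rough sketch of how such bag-of-SIFT count vectors can be built, assuming OpenCV (cv2.SIFT_create) and scikit-learn are available; the vocabulary size W, the helper names and the image paths are illustrative and not taken from the paper.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def sift_descriptors(path):
    """128-dimensional SIFT descriptors of one grayscale image."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = cv2.SIFT_create().detectAndCompute(img, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

def bag_of_sift(image_paths, W=500):
    """Quantise all descriptors into W visual words; return one count vector per image."""
    per_image = [sift_descriptors(p) for p in image_paths]
    vocab = KMeans(n_clusters=W, n_init=4, random_state=0).fit(np.vstack(per_image))
    counts = np.zeros((len(image_paths), W), dtype=int)
    for i, desc in enumerate(per_image):
        if len(desc):
            counts[i] = np.bincount(vocab.predict(desc), minlength=W)
    return counts  # count vectors of the kind modelled by the MSD/EMSD mixtures
```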

References

  1. Agarwal A, Daumé III H et al (2011) Generative kernels for exponential families. In: Proceedings of the 14th international conference on artificial intelligence and statistics, pp 85–92

  2. Amayri O, Bouguila N (2015) Beyond hybrid generative discriminative learning: spherical data classification. Pattern Anal Appl 18(1):113–133


  3. Banerjee A, Dhillon IS, Ghosh J, Sra S (2005) Clustering on the unit hypersphere using von mises-fisher distributions. J Mach Learn Res 6:1345–1382


  4. Bdiri T, Bouguila N (2013) Bayesian learning of inverted dirichlet mixtures for svm kernels generation. Neural Comput Appl 23(5):1443–1458


  5. Berk RA (2016) Support vector machines. In: Statistical learning from a regression perspective. Springer, pp 291–310

  6. Bishop C (2006) Pattern recognition and machine learning. Information science and statistics. Springer, New York


  7. Bishop C, Bishop CM et al (1995) Neural networks for pattern recognition. Oxford University Press, Oxford


  8. Bosch A, Muñoz X, Martí R (2007) Which is the best way to organize/classify images by content? Image Vis Comput 25(6):778–791


  9. Bouguila N (2008) Clustering of count data using generalized dirichlet multinomial distributions. IEEE Trans Knowl Data Eng 20(4):462–474


  10. Bouguila N (2011) Bayesian hybrid generative discriminative learning based on finite liouville mixture models. Pattern Recogn 44(6):1183–1200


  11. Bouguila N (2011) Count data modeling and classification using finite mixtures of distributions. IEEE Trans Neural Netw 22(2):186–198


  12. Bouguila N (2012) Hybrid generative/discriminative approaches for proportional data modeling and classification. IEEE Trans Knowl Data Eng 24(12):2184–2202


  13. Bouguila N (2013) Deriving kernels from generalized dirichlet mixture models and applications. Inf Process Manag 49(1):123–137


  14. Bouguila N, Amayri O (2009) A discrete mixture-based kernel for svms: application to spam and image categorization. Inf Process Manag 45(6):631–642


  15. Bouguila N, Ziou D (2007) Unsupervised learning of a finite discrete mixture: Applications to texture modeling and image databases summarization. J Vis Commun Image Represent 18(4):295–309


  16. Brown LD (1986) Fundamentals of statistical exponential families: with applications in statistical decision theory. Institute of Mathematical Statistics

  17. Burges CJ (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Disc 2 (2):121–167


  18. Campbell WM, Sturim DE, Reynolds DA (2006) Support vector machines using gmm super vectors for speaker verification. IEEE Signal Process Lett 13(5):308–311


  19. Chan AB, Vasconcelos N, Moreno PJ (2004) A family of probabilistic kernels based on information divergence. Univ. California, San Diego, CA, Tech. Rep SVCL-TR-2004-1

  20. Chang SK, Hsu A (1992) Image information systems: where do we go from here? IEEE Trans Knowl Data Eng 4(5):431–442


  21. Cristianini N, Shawe-Taylor J (2000) Support vector machines, vol 93. Cambridge University Press, Cambridge, pp 935–948


  22. Church KW, Gale WA (1995) Poisson mixtures. Nat Lang Eng 1(2):163–190


  23. Csurka G, Dance C, Fan L, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: Workshop on statistical learning in computer vision, ECCV. Prague, vol 1, pp 1–2

  24. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the em algorithm. J R Stat Soc Ser B Methodol 39(1):1–22


  25. Deng J, Xu X, Zhang Z, Frühholz S., Grandjean D, Schuller B (2017) Fisher kernels on phase-based features for speech emotion recognition. In: Dialogues with social robots. Springer, pp 195–203

  26. Elisseeff A, Weston J (2002) A kernel method for multi-labelled classification. In: Advances in neural information processing systems, pp 681–687

  27. Elkan C (2006) Clustering documents with an exponential-family approximation of the dirichlet compound multinomial distribution. In: Proceedings of the 23rd international conference on machine learning. ACM, pp 289–296

  28. Erhan D, Bengio Y, Courville A, Manzagol PA, Vincent P, Bengio S (2010) Why does unsupervised pre-training help deep learning? J Mach Learn Res 11(Feb):625–660


  29. Fei-Fei L, Fergus R, Perona P (2007) Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. Comput Vis Image Underst 106(1):59–70


  30. Fei-Fei L, Perona P (2005) A bayesian hierarchical model for learning natural scene categories. In: IEEE computer society conference on computer vision and pattern recognition, 2005. CVPR 2005, vol 2. IEEE, pp 524–531

  31. Ferrari V, Tuytelaars T, Van Gool L (2006) Object detection by contour segment networks. In: European conference on computer vision. Springer, pp 14–28

  32. Grauman K, Darrell T (2005) The pyramid match kernel: Discriminative classification with sets of image features. In: 10th IEEE international conference on computer vision, 2005. ICCV 2005, vol 2. IEEE, pp 1458–1465

  33. Gupta RD, Richards DSP (1987) Multivariate liouville distributions. J Multivar Anal 23(2):233–256


  34. Han X, Dai Q (2018) Batch-normalized mlpconv-wise supervised pre-training network in network. Appl Intell 48(1):142–155


  35. Hankin RK et al (2010) A generalization of the dirichlet distribution. J Stat Softw 33(11):1–18


  36. Jaakkola T, Haussler D (1999) Exploiting generative models in discriminative classifiers. In: Advances in neural information processing systems, pp 487–493

  37. Jebara T (2003) Images as bags of pixels. In: ICCV, pp 265–272

  38. Jebara T, Kondor R, Howard A (2004) Probability product kernels. J Mach Learn Res 5(Jul):819–844


  39. Jégou H, Douze M, Schmid C (2009) On the burstiness of visual elements. In: IEEE conference on computer vision and pattern recognition, 2009. CVPR 2009. IEEE, pp 1169–1176

  40. Kailath T (1967) The divergence and bhattacharyya distance measures in signal selection. IEEE Trans Commun Technol 15(1):52–60


  41. Katz SM (1996) Distribution of content words and phrases in text and language modelling. Nat Lang Eng 2 (1):15–59


  42. Keerthi SS, Lin CJ (2003) Asymptotic behaviors of support vector machines with gaussian kernel. Neural Comput 15(7):1667–1689


  43. Lin HT, Lin CJ (2003) A study on sigmoid kernels for svm and the training of non-psd kernels by smo-type methods. submitted to Neural Computation 3:1–32


  44. Lin J (1991) Divergence measures based on the shannon entropy. IEEE Trans Inf Theor 37(1):145–151


  45. Lochner RH (1975) A generalized dirichlet distribution in bayesian life testing. J R Stat Soc Ser B Methodol 37(1):103–113


  46. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110


  47. Ma Y, Guo G (2014) Support vector machines applications. Springer, New York


  48. Madsen RE, Kauchak D, Elkan C (2005) Modeling word burstiness using the dirichlet distribution. In: Proceedings of the 22nd international conference on machine learning. ACM, pp 545–552

  49. McCallum A, Nigam K (1998) A comparison of event models for naive bayes text classification. In: Proceedings of the AAAI-98 workshop on learning for text categorization, vol 752. Citeseer, pp 41–48

  50. McCallum AK (1996) Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/∼mccallum/bow

  51. McLachlan G, Krishnan T (2007) The EM algorithm and extensions, vol 382. Wiley, New Jersey


  52. Migliorati S, Monti GS, Ongaro A (2008) E–m algorithm: an application to a mixture model for compositional data. In: Proceedings of the 44th scientific meeting of the italian statistical society

  53. Moguerza JM, Muñoz A, et al. (2006) Support vector machines with applications. Stat Sci 21(3):322–336


  54. Monti GS, Mateu-Figueras G, Pawlowsky-Glahn V (2011) Compositional Data Analysis: Theory and Applications, chap. Notes on the scaled Dirichlet distribution. Wiley, Chichester. https://doi.org/10.1002/9781119976462.ch10


  55. Moreno PJ, Ho PP, Vasconcelos N (2004) A kullback-leibler divergence based kernel for svm classification in multimedia applications. In: Advances in neural information processing systems, pp 1385–1392

  56. Mosimann JE (1962) On the compound multinomial distribution, the multivariate β-distribution, and correlations among proportions. Biometrika 49(1/2):65–82


  57. Mureşan H, Oltean M (2018) Fruit recognition from images using deep learning. Acta Universitatis Sapientiae, Informatica 10(1):26–42


  58. Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: a comparison of logistic regression and naive bayes. In: Advances in neural information processing systems, pp 841–848

  59. Oboh BS, Bouguila N (2017) Unsupervised learning of finite mixtures using scaled dirichlet distribution and its application to software modules categorization. In: Proceedings of the 2017 IEEE international conference on industrial technology (ICIT). IEEE, pp 1085–1090

  60. Van den Oord A, Schrauwen B (2014) Factoring variations in natural images with deep gaussian mixture models. In: Advances in neural information processing systems, pp 3518–3526

  61. Penny WD (2001) Kullback-liebler divergences of normal, gamma, dirichlet and wishart densities. Wellcome Department of Cognitive Neurology

  62. Pérez-Cruz F (2008) Kullback-leibler divergence estimation of continuous distributions. In: IEEE international symposium on information theory, 2008. ISIT 2008. IEEE, pp 1666–1670

  63. Raina R, Shen Y, Mccallum A, Ng AY (2004) Classification with hybrid generative/discriminative models. In: Advances in neural information processing systems, pp 545–552

  64. Rennie JDM, Shih L, Teevan J, Karger DR (2003) Tackling the poor assumptions of naive bayes text classifiers. In: Proceedings of the 20th international conference on machine learning ICML, vol 3, pp 616–623

  65. Rényi A et al (1961) On measures of entropy and information. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California

  66. Rubinstein YD, Hastie T et al (1997) Discriminative vs informative learning. In: KDD, vol 5, pp 49–53

  67. Scholkopf B, Smola AJ (2001) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge


  68. Shmilovici A (2010) Support vector machines. In: Data mining and knowledge discovery handbook. Springer, pp 231–247

  69. Sivazlian B (1981) On a multivariate extension of the gamma and beta distributions. SIAM J Appl Math 41 (2):205–209


  70. Song G, Dai Q (2017) A novel double deep elms ensemble system for time series forecasting. Knowl-Based Syst 134:31–49


  71. Van Der Maaten L (2011) Learning discriminative fisher kernels. In: ICML, vol 11, pp 217–224

  72. Vapnik V (2013) The nature of statistical learning theory. Springer Science & Business Media, New York


  73. Vapnik VN (1995) The nature of statistical learning theory


  74. Variani E, McDermott E, Heigold G (2015) A gaussian mixture model layer jointly optimized with discriminative features within a deep neural network architecture. In: 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 4270–4274

  75. Vasconcelos N, Ho P, Moreno P (2004) The kullback-leibler kernel as a framework for discriminant and localized representations for visual recognition. In: European conference on computer vision. Springer, pp 430–441

  76. Wang P, Sun L, Yang S, Smeaton AF (2015) Improving the classification of quantified self activities and behaviour using a fisher kernel. In: Adjunct Proceedings of the 2015 ACM international joint conference on pervasive and ubiquitous computing and proceedings of the 2015 ACM international symposium on wearable computers. ACM, pp 979–984

  77. Winn J, Criminisi A, Minka T (2005) Object categorization by learned universal visual dictionary. In: 10th IEEE international conference on computer vision, 2005. ICCV 2005, vol 2. IEEE, pp 1800–1807

  78. Wong TT (2009) Alternative prior assumptions for improving the performance of naïve bayesian classifiers. Data Min Knowl Disc 18(2):183–213


  79. Zamzami N, Bouguila N (2018) Text modeling using multinomial scaled dirichlet distributions. In: International conference on industrial, engineering and other applications of applied intelligent systems. Springer, pp 69–80

  80. Zhou B, Lapedriza A, Xiao J, Torralba A, Oliva A (2014) Learning deep features for scene recognition using places database. In: Advances in neural information processing systems , pp 487–495


Author information


Corresponding author

Correspondence to Nuha Zamzami.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Proof of (5)

The compound of the Multinomial and the Scaled Dirichlet distributions is obtained by the following integration:

$$\begin{array}{@{}rcl@{}} \mathcal{M}\mathcal{S}\mathcal{D}(\mathbf{X}|\boldsymbol{\alpha},\boldsymbol{\beta} )&=& {\int}_{\rho} \mathcal{M}(\mathbf{X}|\boldsymbol{\rho}) \mathcal{S}\mathcal{D}(\boldsymbol{\rho}|\boldsymbol{\alpha}, \boldsymbol{\beta}) d\rho \\ &=& {\int}_{\rho} \frac{n!}{\prod\limits_{d = 1}^{D} x_{d}!} \prod\limits_{d = 1}^{D} \rho_{d}^{x_{d}} \frac{{\Gamma} (A)}{\prod\limits_{d = 1}^{D} {\Gamma}(\alpha_{d})} \frac{\prod\limits_{d = 1}^{D} \beta_{d}^{\alpha_{d}} \rho_{d}^{\alpha_{d}-1}}{\left( \sum\limits_{d = 1}^{D} \beta_{d} \rho_{d} \right)^{A}} d\rho \end{array} $$
$$\begin{array}{@{}rcl@{}} &=& \frac{n!}{\prod\limits_{d = 1}^{D} x_{d}!}\frac{{\Gamma} (A)}{\prod\limits_{d = 1}^{D} {\Gamma}(\alpha_{d})} \prod\limits_{d = 1}^{D} \beta_{d}^{\alpha_{d}} {\int}_{\rho} \frac{\prod \limits_{d = 1}^{D} \rho_{d}^{x_{d}+\alpha_{d}-1}}{\left( \sum\limits_{d = 1}^{D} \beta_{d} \rho_{d} \right)^{A}} d\rho\\ \end{array} $$
(29)

Using the fact that the Scaled Dirichlet PDF integrates to one, i.e. \({\int}_{\rho} \mathcal{S}\mathcal{D}(\boldsymbol{\rho}|\boldsymbol{\alpha}, \boldsymbol{\beta})\, d\rho = 1\), straightforward manipulation yields:

$$\begin{array}{@{}rcl@{}} && {\int}_{\rho} \frac{{\Gamma} \left( \sum\limits_{d = 1}^{D} \alpha_{d}\right)}{\prod\limits_{d = 1}^{D} {\Gamma}(\alpha_{d})} \frac{\prod\limits_{d = 1}^{D} \beta_{d}^{\alpha_{d}} \rho_{d}^{\alpha_{d}-1}}{\left( \sum\limits_{d = 1}^{D} \beta_{d} \rho_{d} \right)^{A}} d\rho = 1 \\ && \frac{{\Gamma} \left( \sum\limits_{d = 1}^{D} \alpha_{d}\right) \prod\limits_{d = 1}^{D} \beta_{d}^{\alpha_{d}}}{\prod\limits_{d = 1}^{D} {\Gamma}(\alpha_{d})} {\int}_{\rho} \frac{\prod\limits_{d = 1}^{D} \rho_{d}^{\alpha_{d}-1}}{\left( \sum\limits_{d = 1}^{D} \beta_{d} \rho_{d} \right)^{A}} d\rho = 1 \end{array} $$
(30)

and we solve the integration using the following empirically found approximation: \(\left ({\sum }_{d = 1}^{D} \beta _{d} \ \rho _{d}\right )^{{\sum }_{d = 1}^{D} x_{d}} \simeq {\prod }_{d = 1}^{D} \beta _{d}^{x_{d}}\), as:

$$\begin{array}{@{}rcl@{}} {\int}_{\rho} \frac{\prod\limits_{d = 1}^{D} \rho_{d}^{\alpha_{d}-1}}{\left( \sum\limits_{d = 1}^{D} \beta_{d} \rho_{d} \right)^{A}} d\rho = \frac{\prod\limits_{d = 1}^{D} {\Gamma}(\alpha_{d})}{{\Gamma} \left( \sum\limits_{d = 1}^{D} \alpha_{d}\right) \prod\limits_{d = 1}^{D} \beta_{d}^{\alpha_{d}}} \end{array} $$
(31)

Using this to solve the integration in (29), we obtain (5).
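Since (31) is just the normalization of the Scaled Dirichlet density in (30) rearranged, it can be sanity-checked numerically: dividing both sides by \({\prod}_{d} {\Gamma}(\alpha_{d})/{\Gamma}(A)\) turns the left-hand side into a Dirichlet expectation, and (31) becomes \(E_{\text{Dir}(\boldsymbol{\alpha})}\left[\left({\sum}_{d} \beta_{d}\rho_{d}\right)^{-A}\right]={\prod}_{d} \beta_{d}^{-\alpha_{d}}\). The short Monte Carlo check below (plain NumPy, arbitrary toy parameter values) illustrates this.

```python
# Numerical sanity check of (31): Dirichlet expectation vs. closed form.
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([2.0, 3.5, 1.2])
beta = np.array([0.7, 1.8, 2.4])
A = alpha.sum()

rho = rng.dirichlet(alpha, size=1_000_000)   # samples on the simplex
monte_carlo = np.mean((rho @ beta) ** (-A))  # E[(sum_d beta_d rho_d)^(-A)]
closed_form = np.prod(beta ** (-alpha))      # prod_d beta_d^(-alpha_d)

print(monte_carlo, closed_form)              # the two values agree closely
```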

Appendix B: Newton Raphson approach

The complete data log likelihood corresponding to a K-component mixture is given by:

$$ \mathcal{L}(\mathcal{X},\mathcal{Z}|{\Theta})=\sum\limits_{k = 1}^{K} \sum\limits_{i = 1}^{N} z_{ik} \left( \log \pi_{k} + \log p(\mathbf{X}_{i}|\theta_{k}) \right) $$
(32)

By computing the second and mixed derivatives of \( \mathcal {L}(\mathcal {X},\mathcal {Z}|{\Theta })\) with respect to \(\alpha _{kd},\ d = 1,\dots ,D\), we obtain:

$$\begin{array}{@{}rcl@{}} &&\frac{\partial^{2} \mathcal{L}(\mathcal{X},\mathcal{Z}|{\Theta})}{\partial\alpha_{kd_{1}}\partial\alpha_{kd_{2}}} = \\ &&\left\{\begin{array}{ll} \sum\limits_{i = 1}^{N} z_{ik} \left( {\Psi}^{\prime}(A)-{\Psi}^{\prime}(n_{i}+A)\right. \\ \left. +{\Psi}^{\prime}(x_{id}+\alpha_{kd})-{\Psi}^{\prime}(\alpha_{kd}) \right) &\text{if}\quad d_{1}=d_{2}=d, \\ \sum\limits_{i = 1}^{N} z_{ik} \left( {\Psi}^{\prime}(A)-{\Psi}^{\prime}(n_{i}+A) \right) & \text{otherwise,} \end{array}\right. \end{array} $$
(33)

where \({\Psi }^{\prime }\) is the trigamma function. By computing the second and mixed derivatives of \( \mathcal {L}(\mathcal {X},\mathcal {Z}|{\Theta })\) with respect to \(\beta _{kd},\ d = 1,\dots ,D\), we obtain:

$$ \frac{\partial^{2} \mathcal{L}(\mathcal{X},\mathcal{Z}|{\Theta})}{\partial\beta_{kd_{1}}\partial\beta_{kd_{2}}} = \left\{\begin{array}{ll} \sum\limits_{i = 1}^{N} z_{ik} \left( \frac{x_{id}}{\beta_{kd}^{2}} \right) & \text{if}\ d_{1}=d_{2}=d, \\ \\ 0 & \text{otherwise,} \end{array}\right. $$
(34)

The mixed second derivatives of \(\mathcal{L}(\mathcal{X},\mathcal{Z}|{\Theta})\) with respect to \(\alpha_{kd}\) and \(\beta_{kd}\), \(d = 1,\dots,D\), are zero.
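As a concrete illustration, the Hessian blocks in (33) and (34) can be assembled directly with the trigamma function. The minimal sketch below treats one component \(k\), assuming X is an \(N \times D\) array of counts, z the vector of responsibilities \(z_{ik}\), and alpha, beta the parameter vectors of that component; all variable names are illustrative.

```python
import numpy as np
from scipy.special import polygamma

def trigamma(x):
    return polygamma(1, x)

def hessian_alpha(alpha, X, z):
    """Hessian of the complete-data log-likelihood w.r.t. alpha_k, eq. (33)."""
    A = alpha.sum()
    n = X.sum(axis=1)                                      # n_i = sum_d x_id
    common = np.sum(z * (trigamma(A) - trigamma(n + A)))   # term shared by all entries
    H = np.full((len(alpha), len(alpha)), common)
    diag_extra = np.sum(z[:, None] * (trigamma(X + alpha) - trigamma(alpha)), axis=0)
    H[np.diag_indices_from(H)] += diag_extra               # extra term on the diagonal only
    return H

def hessian_beta(beta, X, z):
    """Hessian of the complete-data log-likelihood w.r.t. beta_k, eq. (34): diagonal."""
    return np.diag(np.sum(z[:, None] * X, axis=0) / beta ** 2)
```

A Newton-Raphson step for, e.g., \(\alpha_{k}\) would then update \(\alpha_{k} \leftarrow \alpha_{k} - H^{-1} g\), where \(g\) is the corresponding gradient (not reproduced in this appendix).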

Appendix C: Proof of (19)

The KL divergence between two exponential-family distributions is given by [40]:

$$\begin{array}{@{}rcl@{}} KL(p(X|{\Theta}),p^{\prime}(X|{\Theta}^{\prime}))&=&{\Phi}(\theta)-{\Phi}(\theta^{\prime})\\&&+[G(\theta)-G(\theta^{\prime})]^{tr} E_{\theta}[T(X)]\\ \end{array} $$
(35)

where \(E_{\theta}\) is the expectation with respect to \(p(X|\theta)\). Moreover, we have the following [16]:

$$ E_{\theta}[T(X)]=-{\Phi}^{\prime}(\theta) $$
(36)

Thus, according to (14), we have:

$$\begin{array}{@{}rcl@{}} E_{\theta} \left[\sum\limits_{d = 1}^{D} I(x_{d} \geq 1)\right]&=&-\frac{\partial {\Phi}(\theta)}{\partial \lambda_{d}} \\&=&{\Psi}\left( \sum\limits_{d = 1}^{D} \lambda_{d}+n\right) -{\Psi}\left( \sum\limits_{d = 1}^{D} \lambda_{d}\right) \\ E_{\theta} \left[\sum\limits_{d = 1}^{D} I(x_{d} \geq 1) x_{d}\right]&=&-\frac{\partial {\Phi}(\theta)}{\partial \nu_{d}} = 0 \end{array} $$
(37)

where \(n={\sum}_{d = 1}^{D} x_{d}\) and \({\Psi}(\cdot)\) is the digamma function. By substituting the previous two equations into (35), we obtain:

$$\begin{array}{@{}rcl@{}} KL &&(p(X|{\Theta}),p^{\prime}(X|{\Theta}^{\prime})) \\ &=&\log \left( {\Gamma}\left( \sum\limits_{d = 1}^{D} \lambda_{d}\right)\right)-\log \left( {\Gamma}\left( \sum\limits_{d = 1}^{D} \lambda^{\prime}_{d}\right)\right)\\ &&-\log\left( {\Gamma}\left( \sum\limits_{d = 1}^{D} \lambda_{d} +n\right)\right)+\log \left( {\Gamma}\left( \sum\limits_{d = 1}^{D} \lambda^{\prime}_{d} +n\right)\right) \\ &&+{\sum}_{d = 1}^{D} \left( {\Psi}\left( \sum\limits_{d = 1}^{D} \lambda_{d}+n\right) -{\Psi}\left( \sum\limits_{d = 1}^{D} \lambda_{d}\right) \right) (\lambda_{d}-\lambda^{\prime}_{d}) \\ &=& \log \left[\frac{{\Gamma}\left( {\sum}_{d = 1}^{D} \lambda_{d}\right).{\Gamma}\left( {\sum}_{d = 1}^{D} \lambda^{\prime}_{d} +n\right)}{{\Gamma}\left( {\sum}_{d = 1}^{D} \lambda^{\prime}_{d}\right).{\Gamma}\left( {\sum}_{d = 1}^{D} \lambda_{d} +n\right)}\right] \\ &&+{\sum}_{d = 1}^{D} \left( {\Psi}\left( \sum\limits_{d = 1}^{D} \lambda_{d}+n\right) -{\Psi}\left( \sum\limits_{d = 1}^{D} \lambda_{d}\right) \right) (\lambda_{d}-\lambda^{\prime}_{d})\\ \end{array} $$
(38)
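Expression (38) is straightforward to evaluate with log-gamma and digamma functions. The following is a direct transcription, assuming lam and lam_p hold the vectors \(\lambda_{d}\) and \(\lambda^{\prime}_{d}\) of the two EMSD distributions and n is the total count \({\sum}_{d} x_{d}\); the names are illustrative.

```python
import numpy as np
from scipy.special import gammaln, digamma

def kl_emsd(lam, lam_p, n):
    """KL divergence between two EMSD distributions, eq. (38)."""
    s, s_p = lam.sum(), lam_p.sum()
    log_ratio = gammaln(s) + gammaln(s_p + n) - gammaln(s_p) - gammaln(s + n)
    return log_ratio + (digamma(s + n) - digamma(s)) * np.sum(lam - lam_p)

# e.g. kl_emsd(np.array([1.0, 2.0, 0.5]), np.array([0.8, 2.5, 0.4]), n=40)
```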

Appendix D: Proof of (23)

In the case of the EMSD distribution, we can show that:

$$\begin{array}{@{}rcl@{}} {\int}_{0}^{+\infty} & p&(\mathbf{X}|{\Theta})^{\sigma} p^{\prime}(\mathbf{X}|{\Theta}^{\prime})^{1-\sigma} dX= \\ && \left[\frac{{\Gamma}\left( \sum\limits_{d = 1}^{D} \lambda_{d}\right)}{{\Gamma}\left( \sum\limits_{d = 1}^{D} \lambda_{d}+n\right)}\right]^{\sigma} \left[\frac{{\Gamma}\left( \sum\limits_{d = 1}^{D} {\lambda^{\prime}_{d}}\right)}{{\Gamma}\left( \sum\limits_{d = 1}^{D} {\lambda^{\prime}_{d}}+\sum\limits_{d = 1}^{D} x_{d}\right)}\right]^{1-\sigma} \\ &\times& {\int}_{0}^{+\infty} \left[\frac{n!}{\prod\limits_{d = 1}^{D} x_{d}} \prod\limits_{d = 1}^{D} \frac{ \lambda_{d}}{\nu_{d}^{x_{d}}}\right]^{\sigma} dX \\ &\times& {\int}_{0}^{+\infty} \left[\frac{n!}{\prod\limits_{d = 1}^{D} x_{d}} \prod\limits_{d = 1}^{D} \frac{\lambda^{\prime}_{d}}{{\nu^{\prime}_{d}}^{x_{d}}}\right]^{1-\sigma} dX \\ &=&\left[\frac{{\Gamma}\left( \sum\limits_{d = 1}^{D} \lambda_{d}\right) }{{\Gamma}\left( \sum\limits_{d = 1}^{D} \lambda_{d}+\sum\limits_{d = 1}^{D} x_{d}\right)}\right]^{\sigma} \left[\frac{{\Gamma}\left( \sum\limits_{d = 1}^{D} {\lambda^{\prime}_{d}}\right)}{{\Gamma}\left( \sum\limits_{d = 1}^{D} {\lambda^{\prime}_{d}}+n\right)}\right]^{1-\sigma} \\ &\times& {\int}_{0}^{+\infty} \frac{n!}{\prod\limits_{d = 1}^{D} x_{d}} \prod\limits_{d = 1}^{D} \lambda_{d} \nu_{d}^{-\sigma x_{d}} dX \\ &\times& {\int}_{0}^{+\infty} \frac{n!}{\prod\limits_{d = 1}^{D} x_{d}} \prod\limits_{d = 1}^{D} \lambda^{\prime}_{d}{\nu^{\prime}_{d}}^{-x_{d}+\sigma x_{d}} dX \end{array} $$
(39)

Since the EMSD distribution normalizes to one, its unnormalized part satisfies:

$$ {\int}_{0}^{+\infty} \frac{n!}{{\prod}_{d = 1}^{D} x_{d}} \prod\limits_{d = 1}^{D} \frac{\lambda_{d}}{\nu_{d}^{x_{d}}}dX=\frac{{\Gamma}\left( {\sum}_{d = 1}^{D} \lambda_{d} + {\sum}_{d = 1}^{D} x_{d}\right)}{{\Gamma}\left( {\sum}_{d = 1}^{D} \lambda_{d}\right) } $$
(40)

By substituting (40) into (39), we obtain:

$$\begin{array}{@{}rcl@{}} {\int}_{0}^{+\infty} & p&(\mathbf{X}|{\Theta})^{\sigma} p^{\prime}(\mathbf{X}|{\Theta}^{\prime})^{1-\sigma} dX= \\ && \left[\frac{{\Gamma}\left( {\sum}_{d = 1}^{D} \lambda_{d}\right) }{{\Gamma}\left( {\sum}_{d = 1}^{D} \lambda_{d}+n\right)}\right]^{\sigma} \left[\frac{{\Gamma}\left( {\sum}_{d = 1}^{D} {\lambda^{\prime}_{d}}\right)}{{\Gamma}\left( {\sum}_{d = 1}^{D} {\lambda^{\prime}_{d}}+n\right)}\right]^{1-\sigma} \\ &\times& \frac{{\Gamma}\left( {\sum}_{d = 1}^{D} \lambda_{d}+\sigma n\right)}{{\Gamma}\left( {\sum}_{d = 1}^{D} \lambda_{d}\right)}\\ &\times& \frac{{\Gamma}\left( {\sum}_{d = 1}^{D} {\lambda^{\prime}_{d}}+(1-\sigma) n\right)}{{\Gamma}\left( {\sum}_{d = 1}^{D} {\lambda^{\prime}_{d}}\right)} \end{array} $$
(41)
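For reference, (41) can be evaluated in the log domain with log-gamma functions. The helper below does so, assuming lam and lam_p hold \(\lambda_{d}\) and \(\lambda^{\prime}_{d}\), n is the total count, and \(\sigma = 0.5\) gives the Bhattacharyya-type quantity used for the kernel; the function name and argument layout are illustrative.

```python
import numpy as np
from scipy.special import gammaln

def log_prob_product_emsd(lam, lam_p, n, sigma=0.5):
    """Log of the probability product integral (41) for two EMSD distributions."""
    s, s_p = lam.sum(), lam_p.sum()
    out = sigma * (gammaln(s) - gammaln(s + n))
    out += (1.0 - sigma) * (gammaln(s_p) - gammaln(s_p + n))
    out += gammaln(s + sigma * n) - gammaln(s)
    out += gammaln(s_p + (1.0 - sigma) * n) - gammaln(s_p)
    return out
```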

Appendix E: Proof of (27)

$$\begin{array}{@{}rcl@{}} H[p(\mathbf{X}|{\Theta})]&=&- {\int}_{0}^{+ \infty} p(\mathbf{X}|{\Theta}) \log p(\mathbf{X}|{\Theta}) dX \\ &=&- {\int}_{0}^{+ \infty} p(\mathbf{X}|{\Theta}) \left[\log {\Gamma}\left( \sum\limits_{d = 1}^{D} \lambda_{d}\right) \right.\\&&\left.-\log {\Gamma}\left( \sum\limits_{d = 1}^{D} \lambda_{d}+n\right)\right. \\ &&+\sum\limits_{d = 1}^{D} \log(\lambda_{d}) E_{\theta}[I(x_{d} \geq 1)] \\ &&\left.-\sum\limits_{d = 1}^{D} \log(\nu_{d}) E_{\theta} [I(x_{d} \geq 1) x_{d}]\right] \end{array} $$
(42)

By substituting (37) into the previous equation, we obtain the following:

$$\begin{array}{@{}rcl@{}} H[p(\mathbf{X}|{\Theta})]&=&-\log {\Gamma}\left( \sum\limits_{d = 1}^{D} \lambda_{d}\right) +\log {\Gamma}\left( \sum\limits_{d = 1}^{D} \lambda_{d}+n\right) \\ &-&\sum\limits_{d = 1}^{D} \log(\lambda_{d}) \left( {\Psi}\left( \sum\limits_{d = 1}^{D} \lambda_{d}+n\right) \right.\\&&\left.-{\Psi}\left( \sum\limits_{d = 1}^{D} \lambda_{d}\right) \right) \end{array} $$
(43)
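The entropy in (43) reduces to a one-line computation with log-gamma and digamma functions; a minimal helper, with illustrative names, is given below.

```python
import numpy as np
from scipy.special import gammaln, digamma

def entropy_emsd(lam, n):
    """Entropy of an EMSD distribution, eq. (43)."""
    s = lam.sum()
    return (gammaln(s + n) - gammaln(s)
            - np.sum(np.log(lam)) * (digamma(s + n) - digamma(s)))
```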


About this article


Cite this article

Zamzami, N., Bouguila, N. Hybrid generative discriminative approaches based on Multinomial Scaled Dirichlet mixture models. Appl Intell 49, 3783–3800 (2019). https://doi.org/10.1007/s10489-019-01437-0
