Outlier-robust parameter estimation for unnormalized statistical models

  • Original Paper
  • Japanese Journal of Statistics and Data Science

Abstract

Unnormalized statistical models are ubiquitous in modern statistical data analysis. Recent methods take a classification approach to estimate unnormalized models. However, the classification problem is often solved by maximum-likelihood estimation, which can be seriously hampered by contamination with outliers. In this paper, we propose two outlier-robust methods for the estimation of unnormalized statistical models. The proposed methods are developed by combining robust divergences with the classification approach, and their robustness is theoretically investigated based on the influence function. Interestingly, our theoretical analysis reveals a counter-intuitive robustness of the proposed methods, and shows the importance of not only employing robust divergences but also taking the classification approach for outlier-robust estimation. Finally, we experimentally demonstrate that the proposed methods are robust against outliers.

Notes

  1. More specifically, the objective function in Hung et al. [2018, Eq. (13)] corresponds to a simple monotonic transformation of the right-hand side of (10), which is equal to \(\exp (-\gamma D_{\gamma }(\varvec{\alpha }))=E_{\textrm{CU}}\left[ \left( \frac{f_C(\varvec{u};\varvec{\alpha },\nu )^{\gamma +1}}{f_0(\varvec{u};\varvec{\alpha },\nu )^{\gamma +1}+f_1(\varvec{u};\varvec{\alpha },\nu )^{\gamma +1}} \right) ^{\frac{\gamma }{\gamma +1}}\right] \) with the expectation \(E_{\textrm{CU}}[\cdot ]\) over \(p(C,\varvec{u})\).

  2. More precisely, Jones et al. (2001) derived the influence function for the pseudospherical score (Good, 1971), from which the \(\gamma \)-cross entropy can be obtained through a simple logarithmic transformation.

  3. The Python library SciPy was used.

References

  • Basak, S., Basu, A., & Jones, M. (2021). On the ‘optimal’ density power divergence tuning parameter. Journal of Applied Statistics, 48(3), 536–556.

  • Basu, A., Harris, I., Hjort, N., & Jones, M. (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3), 549–559.

  • Besag, J. (1975). Statistical analysis of non-lattice data. Journal of the Royal Statistical Society: Series D (The Statistician), 24(3), 179–195.

  • Bregman, L. M. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7(3), 200–217.

  • Fujisawa, H., & Eguchi, S. (2008). Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis, 99(9), 2053–2081.

  • Good, I. (1971). Comment on “Measuring information and uncertainty” by Robert J. Buehler. Foundations of Statistical Inference, 337–339.

  • Gutmann, M., & Hyvärinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 297–304.

  • Gutmann, M., & Hyvärinen, A. (2012). Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13, 307–361.

  • Gutmann, M. U., & Hirayama, J. (2011). Bregman divergence as general framework to estimate unnormalized statistical models. In Proceedings of the Twenty-seventh Conference on Uncertainty in Artificial Intelligence (UAI), pp. 283–290.

  • Gutmann, M. U., & Hyvärinen, A. (2013). A three-layer model of natural image statistics. Journal of Physiology-Paris, 107(5), 369–398.

  • Gutmann, M. U., Kleinegesse, S., & Rhodes, B. (2022). Statistical applications of contrastive learning. Behaviormetrika, 49(2), 277–301.

  • Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (2011). Robust statistics: The approach based on influence functions. Wiley.

  • Hinton, G. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771–1800.

  • Huber, P. J., & Ronchetti, E. M. (2009). Robust statistics. Wiley.

  • Hung, H., Jou, Z.-Y., & Huang, S.-Y. (2018). Robust mislabel logistic regression without modeling mislabel probabilities. Biometrics, 74(1), 145–154.

  • Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6, 695–709.

  • Jones, M., Hjort, N. L., Harris, I. R., & Basu, A. (2001). A comparison of related density based minimum divergence estimators. Biometrika, 88(3), 865–873.

  • Kanamori, T., & Fujisawa, H. (2015). Robust estimation under heavy contamination using unnormalized models. Biometrika, 102(3), 559–572.

  • Kanamori, T., & Sugiyama, M. (2014). Statistical analysis of distance estimators with density differences and density ratios. Entropy, 16(2), 921–942.

  • Maronna, R. A., Martin, R. D., Yohai, V. J., & Salibián-Barrera, M. (2019). Robust statistics: Theory and methods (with R). Wiley.

  • Matsuda, T., Uehara, M., & Hyvärinen, A. (2021). Information criteria for nonnormalized models. Journal of Machine Learning Research, 22(158), 1–33.

  • Minami, M., & Eguchi, S. (2003). Adaptive selection for minimum \(\beta \)-divergence method. In Proceedings of the Fourth International Symposium on Independent Component Analysis and Blind Source Separation.

  • Mnih, A., & Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-contrastive estimation. In Advances in neural information processing systems (NeurIPS), vol. 26.

  • Sasaki, H., & Takenouchi, T. (2022). Representation learning for maximization of MI, nonlinear ICA and nonlinear subspaces with robust density ratio estimation. Journal of Machine Learning Research, 23(231), 1–55.

  • Sugasawa, S., & Yonekura, S. (2021). On selection criteria for the tuning parameter in robust divergence. Entropy, 23(9), 1147.

  • Sugiyama, M., Suzuki, T., & Kanamori, T. (2012). Density ratio estimation in machine learning. Cambridge University Press.

  • Takenouchi, T., & Kanamori, T. (2017). Statistical inference with unnormalized discrete models and localized homogeneous divergences. Journal of Machine Learning Research, 18(1), 1804–1829.

  • Thomas, O., Dutta, R., Corander, J., Kaski, S., & Gutmann, M. U. (2022). Likelihood-free inference by ratio estimation. Bayesian Analysis, 17(1), 1–31.

  • Uehara, M., Kanamori, T., Takenouchi, T., & Matsuda, T. (2020). A unified statistically efficient estimation framework for unnormalized models. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 809–819.

  • Van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge University Press.

  • Wasserman, L. (2004). All of statistics. Springer.


Acknowledgements

The authors would like to thank Prof. Takafumi Kanamori and Dr. Takayuki Kawashima for their helpful comments. Hiroaki Sasaki was partially supported by JSPS KAKENHI Grant No. 23H03460. Takashi Takenouchi was partially supported by JSPS KAKENHI Grant No. 20K03753 and 19H04071.

Author information

Corresponding author

Correspondence to Hiroaki Sasaki.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest to declare.

Appendices

Appendix A: Limits of \(D_{\gamma }\) and \(D_{\beta }\)

As in Fujisawa and Eguchi (2008), to derive the respective limits of \(D_{\gamma }\) and \(D_{\beta }\), we employ the following Taylor expansions: for \(x>0\) and \(\gamma >0\),

$$\begin{aligned} x^{\gamma }&=1+\gamma \log {x}+O(\gamma ^2) \end{aligned}$$
(A1)
$$\begin{aligned} \log (1+x)&=x+O(x^2). \end{aligned}$$
(A2)
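
A quick numerical check of (A1) and (A2) (an illustrative sketch, not part of the paper): the remainders shrink quadratically, as the expansions claim.

```python
import numpy as np

# Remainder of (A1): x^gamma - (1 + gamma*log x) should be O(gamma^2);
# the ratio to gamma^2 approaches (log x)^2 / 2.
x = 2.0
for gamma in (1e-1, 1e-2, 1e-3):
    err = x ** gamma - (1.0 + gamma * np.log(x))
    print(gamma, err / gamma ** 2)

# Remainder of (A2): log(1 + t) - t should be O(t^2); the ratio tends to -1/2.
for t in (1e-1, 1e-2, 1e-3):
    err = np.log(1.0 + t) - t
    print(t, err / t ** 2)
```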

First, we show that in the limit of \(\gamma \rightarrow 0\), \(D_{\gamma }(\varvec{\alpha })\) for GNCE converges to \(D_{\textrm{lr}}(\varvec{\alpha })\) for NCE. Applying (A1) to \(h_{\gamma }(\varvec{X};\varvec{\alpha },\nu )^{\frac{\gamma }{\gamma +1}}\) yields

$$\begin{aligned} h_{\gamma }(\varvec{X};\varvec{\alpha },\nu )^{\frac{\gamma }{\gamma +1}}&= 1+\frac{\gamma }{\gamma +1}\log h_{\gamma }(\varvec{X};\varvec{\alpha },\nu ) +O\left( \left( \frac{\gamma }{\gamma +1}\right) ^2\right) . \end{aligned}$$
(A3)

Equation (A3) enables us to express \(D_{\gamma }(\varvec{\alpha })\) as

$$\begin{aligned} D_{\gamma }(\varvec{\alpha })&=-\frac{1}{\gamma }\log \left[ 1+\frac{p_0\gamma }{\gamma +1}E_{\textrm{d}}[\log h_{\gamma }(\varvec{X};\varvec{\alpha },\nu )]\right. \\&\quad \left. +\frac{p_1\gamma }{\gamma +1}E_{\textrm{r}}[\log \left( 1-h_{\gamma }(\varvec{Y};\varvec{\alpha },\nu )\right) ] +O\left( \left( \frac{\gamma }{\gamma +1}\right) ^2\right) \right] ,\\&= -\frac{p_0}{\gamma +1}\left( E_{\textrm{d}}[\log h_{\gamma }(\varvec{X};\varvec{\alpha },\nu )] +\nu E_{\textrm{r}}[\log \left( 1-h_{\gamma }(\varvec{Y};\varvec{\alpha },\nu )\right) ]\right) \\&\quad +O\left( \frac{\gamma }{(\gamma +1)^2}\right) , \end{aligned}$$

where (A2) was applied on the last line. Since \(\lim _{\gamma \rightarrow {0}}h_{\gamma }(\varvec{u};\varvec{\alpha },\nu )=f_0(\varvec{u};\varvec{\alpha },\nu )\), we have

$$\begin{aligned} \lim _{\gamma \rightarrow {0}}D_{\gamma }(\varvec{\alpha })&=D_{\textrm{lr}}(\varvec{\alpha }). \end{aligned}$$

Next, we investigate the relationship between \(D_{\beta }(\varvec{\alpha })\) and \(D_{\textrm{lr}}(\varvec{\alpha })\) in the limit of \(\beta \rightarrow 0\). Equation (A1) gives the following expression of \(D_{\beta }(\varvec{\alpha })\):

$$\begin{aligned} D_{\beta }(\varvec{\alpha })&=-p_0\left( 1+\nu +E_{\textrm{d}}[\log {f}_0(\varvec{X};\varvec{\alpha },\nu )] +\nu E_{\textrm{r}}[\log {f}_1(\varvec{Y};\varvec{\alpha },\nu )]+O(\beta )\right) \nonumber \\&\quad +\frac{p_0}{\beta +1}\left( E_{\textrm{d}}\left[ \sum _{C=0}^1f_C(\varvec{X};\varvec{\alpha },\nu )^{\beta +1}\right] +\nu E_{\textrm{r}}\left[ \sum _{C=0}^1f_C(\varvec{Y};\varvec{\alpha },\nu )^{\beta +1}\right] \right) . \end{aligned}$$

Since \(\lim _{\beta \rightarrow 0}\sum _{C=0}^1f_C(\varvec{u};\varvec{\alpha },\nu )^{\beta +1}=1\) for all \(\varvec{u}\),

$$\begin{aligned} \lim _{\beta \rightarrow 0}D_{\beta }(\varvec{\alpha })&=D_{\textrm{lr}}(\varvec{\alpha }). \end{aligned}$$

Appendix B: Consistency of GNCE and PNCE

This appendix first proves the consistency of GNCE. To this end, we recall

$$\begin{aligned} \varvec{\alpha }_{\star }:=\mathop {\textrm{argmin}}\limits _{\varvec{\alpha }}D_{\gamma }(\varvec{\alpha })\quad \text {and}\quad \widehat{\varvec{\alpha }}:=\mathop {\textrm{argmin}}\limits _{\varvec{\alpha }}\widehat{D}_{\gamma }(\varvec{\alpha }), \end{aligned}$$

and decompose them into \(\varvec{\alpha }_{\star }=(\varvec{\theta }_{\star },c_{\star })\) and \(\widehat{\varvec{\alpha }}=(\widehat{\varvec{\theta }},\widehat{c})\), respectively. The notion of identifiability (Van der Vaart, 1998) is defined as follows: A statistical model \(p_{\textrm{m}}(\cdot ;\varvec{\theta })\) is said to be identifiable if \(p_{\textrm{m}}(\cdot ;\varvec{\theta })=p_{\textrm{m}}(\cdot ;\varvec{\theta }')\) implies \(\varvec{\theta }=\varvec{\theta }'\).

Intuitively, when \(p_{\textrm{d}}(\cdot )=p_{\textrm{m}}(\cdot ;\varvec{\theta }_0)\) with some constant vector \(\varvec{\theta }_0\), the definition of \(\varvec{\alpha }_{\star }\) ensures \(p^0_{\textrm{m}}(\cdot ;\varvec{\alpha }_{\star })=p_{\textrm{m}}(\cdot ;\varvec{\theta }_0)\) and the identifiability of \(p_{\textrm{m}}(\cdot ;\varvec{\theta })\) implies \(\varvec{\theta }_{\star }=\varvec{\theta }_0\). Thus, \(\widehat{\varvec{\theta }}\) should converge to \(\varvec{\theta }_0\) in probability as \(n\rightarrow \infty \) and \(n_{\textrm{y}}\rightarrow \infty \). To formally state the consistency of GNCE, we establish the following theorem:

Theorem 8

Suppose that the following assumptions hold:

  • (A0) \(\textrm{supp}(p_{\textrm{m}}(\cdot ;\varvec{\theta }))\) is included in \(\textrm{supp}(p_{\textrm{r}})\) for all \(\varvec{\theta }\).

  • (A1) \(p_{\textrm{d}}(\cdot )\) belongs to the same parametric family as \(p_{\textrm{m}}(\cdot ;\varvec{\theta })\), i.e., there exists a constant vector \(\varvec{\theta }_0\), such that \(p_{\textrm{d}}(\cdot )=p_{\textrm{m}}(\cdot ;\varvec{\theta }_0)\).

  • (A2) \(p_{\textrm{m}}(\cdot ;\varvec{\theta })\) is identifiable.

  • (A3) \(\inf _{\varvec{\alpha }:\Vert \varvec{\alpha }-\varvec{\alpha }_{\star }\Vert \ge \epsilon }D_{\gamma }(\varvec{\alpha })>D_{\gamma }(\varvec{\alpha }_{\star })\) for all \(\epsilon >0\).

  • (A4) \(\sup _{\varvec{\alpha }}|D_{\gamma }(\varvec{\alpha })-\widehat{D}_{\gamma }(\varvec{\alpha })|\) converges to zero in probability.

  • (A5) \(\widehat{\nu }=n_{\textrm{y}}/n\rightarrow {\nu }\) as \(n\rightarrow \infty \).

Then, \(\widehat{\varvec{\theta }}\) converges to \(\varvec{\theta }_0\) in probability.

The proof of Theorem 8 is given in Appendix B.1. Since \(\textrm{supp}(p_{\textrm{m}}(\cdot ;\varvec{\theta }))=\textrm{supp}(p^0_{\textrm{m}}(\cdot ;\varvec{\alpha }))\) with \(\varvec{\alpha }=(\varvec{\theta },c)\) by definition, Assumption (A0) is required to properly estimate \(p^0_{\textrm{m}}(\cdot ;\varvec{\alpha })\) on its support, as discussed in Sect. 3.1 and assumed in previous works related to NCE (Gutmann & Hyvärinen, 2012; Matsuda et al., 2021). Assumption (A3) means that \(\varvec{\alpha }_{\star }\) is a well-separated minimizer of \(D_{\gamma }(\varvec{\alpha })\) (Van der Vaart, 1998). Assumption (A4) is the well-known uniform law of large numbers. Assumption (A5) implies \(n_{\textrm{y}}\rightarrow \infty \) as \(n\rightarrow \infty \).
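
To make the consistency statement concrete, the following is a minimal simulation sketch (not part of the paper) for the special case \(\gamma \rightarrow 0\), where \(\widehat{D}_{\gamma }\) reduces to the logistic NCE objective \(\widehat{D}_{\textrm{lr}}\) (Gutmann & Hyvärinen, 2010). The one-dimensional unnormalized Gaussian model, the reference density \(N(0,1)\), and all helper names are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def log_pm0(u, alpha):
    # Unnormalized Gaussian model p_m^0(u; alpha) = exp(-(u - theta)^2/2 + c),
    # with alpha = (theta, c); c plays the role of the negative log-normalizer.
    theta, c = alpha
    return -0.5 * (u - theta) ** 2 + c

def log_pr(u):
    # Reference (noise) density p_r = N(0, 1)
    return -0.5 * u ** 2 - 0.5 * np.log(2.0 * np.pi)

def nce_loss(alpha, x, y, nu):
    # Empirical logistic objective: -mean log h(X) - nu * mean log(1 - h(Y)),
    # where h(u) = p_m^0(u)/(p_m^0(u) + nu*p_r(u)); softplus keeps it stable.
    logit_x = log_pm0(x, alpha) - log_pr(x) - np.log(nu)
    logit_y = log_pm0(y, alpha) - log_pr(y) - np.log(nu)
    return np.logaddexp(0.0, -logit_x).mean() + nu * np.logaddexp(0.0, logit_y).mean()

theta0, nu = 1.5, 1.0
for n in (100, 1000, 10000):
    x = rng.normal(theta0, 1.0, n)          # data from p_d = N(theta0, 1)
    y = rng.normal(0.0, 1.0, int(nu * n))   # reference samples from p_r
    alpha_hat = minimize(nce_loss, np.zeros(2), args=(x, y, nu)).x
    print(n, alpha_hat)  # theta-hat -> theta0 = 1.5, c-hat -> -0.5*log(2*pi)
```

As \(n\) grows, \(\widehat{\varvec{\theta }}\) approaches \(\varvec{\theta }_0\), in line with Theorem 8.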

For PNCE, consistency can be established by replacing Assumptions (A3–A4) with the following ones, while the other assumptions are still maintained:

  • (B3) \(\inf _{\varvec{\alpha }:\Vert \varvec{\alpha }-\varvec{\alpha }_{\star }^{\beta }\Vert \ge \epsilon }D_{\beta }(\varvec{\alpha })>D_{\beta }(\varvec{\alpha }_{\star }^{\beta })\) for all \(\epsilon >0\).

  • (B4) \(\sup _{\varvec{\alpha }}|D_{\beta }(\varvec{\alpha })-\widehat{D}_{\beta }(\varvec{\alpha })|\) converges to zero in probability.

Since the proof is fundamentally the same as that for GNCE, we omit the proof of consistency for PNCE.

1.1 B.1: Proof of Theorem 8

Proof

Our proof essentially follows existing ones (Van der Vaart, 1998; Wasserman, 2004). We first state the following lemma, whose proof is given in Appendix B.2:

Lemma 9

Suppose that Assumptions (A0–A2) hold. Then, \(\varvec{\theta }_{\star }=\varvec{\theta }_0\).

By Lemma 9, it suffices to prove that \(\widehat{\varvec{\alpha }}\) converges to \(\varvec{\alpha }_{\star }\) in probability. Assumption (A3) implies that for all \(\epsilon >0\), there exists some \(\eta >0\), such that \(D_{\gamma }(\varvec{\alpha })>D_{\gamma }(\varvec{\alpha }_{\star })+\eta \) for every \(\varvec{\alpha }\) with \(\Vert \varvec{\alpha }-\varvec{\alpha }_{\star }\Vert \ge \epsilon \). Thus, the event \(\{\Vert \widehat{\varvec{\alpha }}-\varvec{\alpha }_{\star }\Vert \ge \epsilon \}\) is included in the event \(\{D_{\gamma }(\widehat{\varvec{\alpha }})>D_{\gamma }(\varvec{\alpha }_{\star })+\eta \}\), and the probabilities of these events satisfy

$$\begin{aligned} P(\Vert \widehat{\varvec{\alpha }}-\varvec{\alpha }_{\star }\Vert \ge \epsilon )\le {P}(D_{\gamma }(\widehat{\varvec{\alpha }})-D_{\gamma }(\varvec{\alpha }_{\star })>\eta ). \end{aligned}$$

The inequality above indicates that the convergence of \(\widehat{\varvec{\alpha }}\) to \(\varvec{\alpha }_{\star }\) in probability is confirmed by showing that \(P(D_{\gamma }(\widehat{\varvec{\alpha }})-D_{\gamma }(\varvec{\alpha }_{\star })>\eta )\) converges to zero. To this end, we derive the following upper bound:

$$\begin{aligned} D_{\gamma }(\widehat{\varvec{\alpha }})-D_{\gamma }(\varvec{\alpha }_{\star })&=D_{\gamma }(\widehat{\varvec{\alpha }})-\widehat{D}_{\gamma }(\varvec{\alpha }_{\star })+\widehat{D}_{\gamma }(\varvec{\alpha }_{\star })-D_{\gamma }(\varvec{\alpha }_{\star })\\&\le D_{\gamma }(\widehat{\varvec{\alpha }})-\widehat{D}_{\gamma }(\widehat{\varvec{\alpha }})+\widehat{D}_{\gamma }(\varvec{\alpha }_{\star })-D_{\gamma }(\varvec{\alpha }_{\star })\\&\le 2\sup _{\varvec{\alpha }}|D_{\gamma }(\varvec{\alpha })-\widehat{D}_{\gamma }(\varvec{\alpha })|. \end{aligned}$$

Assumption (A4) ensures that \(D_{\gamma }(\widehat{\varvec{\alpha }})-D_{\gamma }(\varvec{\alpha }_{\star })\) converges to zero in probability, indicating that \(P(D_{\gamma }(\widehat{\varvec{\alpha }})-D_{\gamma }(\varvec{\alpha }_{\star })>\eta )\) also converges to zero. The proof is completed. \(\square \)
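
The inequality \(D_{\gamma }(\widehat{\varvec{\alpha }})-D_{\gamma }(\varvec{\alpha }_{\star })\le 2\sup _{\varvec{\alpha }}|D_{\gamma }(\varvec{\alpha })-\widehat{D}_{\gamma }(\varvec{\alpha })|\) is easy to illustrate numerically; here is a toy sketch (purely illustrative, not part of the paper) with a population objective \(D(a)=a^2\) and a perturbed empirical objective:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# D(a) = a^2 has minimizer a_star = 0; D_hat(a) = a^2 + delta*sin(3a)
# satisfies sup |D - D_hat| <= delta, so the excess risk D(a_hat) - D(a_star)
# must be at most 2*delta.
delta = 0.05
D = lambda a: a ** 2
D_hat = lambda a: a ** 2 + delta * np.sin(3.0 * a)
a_hat = minimize_scalar(D_hat, bounds=(-2.0, 2.0), method="bounded").x
print(D(a_hat) - D(0.0), "<=", 2.0 * delta)  # the excess risk obeys the bound
```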

1.2 B.2: Proof of Lemma 9

Proof

We first define the \(\gamma \)-divergence \(d(\varvec{\alpha })\) (not the cross entropy) as

$$\begin{aligned} d(\varvec{\alpha })&:= \frac{1}{\gamma }\log \int \left\{ \sum _{C=0}^1 p(C|\varvec{u})^{\gamma +1}\right\} ^{\frac{1}{\gamma +1}} \hspace{-4mm}p(\varvec{u})\textrm{d}\varvec{u}\\&\quad -\frac{1}{\gamma }\log \int \left\{ \frac{\sum _{C=0}^1 p(C|\varvec{u})f_C(\varvec{u};\varvec{\alpha },\nu )^{\gamma }}{\left( \sum _{C=0}^1f_C(\varvec{u};\varvec{\alpha },\nu )^{\gamma +1}\right) ^{\frac{\gamma }{\gamma +1}}} \right\} p(\varvec{u})\textrm{d}\varvec{u}, \end{aligned}$$

where we note that the second term on the right-hand side above is equal to the \(\gamma \)-cross entropy \(D_{\gamma }(\varvec{\alpha })\) used in this paper. Non-negativity of the \(\gamma \)-divergence (i.e., \(d(\varvec{\alpha })\ge 0\)) is guaranteed by the following Hölder inequality (Fujisawa & Eguchi, 2008): for all \(\varvec{u}\in \textrm{supp}(p_{\textrm{r}})\),

$$\begin{aligned} \left\{ \sum _{C=0}^1 p(C|\varvec{u})^{\gamma +1}\right\} ^{\frac{1}{\gamma +1}} \left\{ \sum _{C=0}^1 f_C(\varvec{u};\varvec{\alpha },\nu )^{\gamma +1}\right\} ^{\frac{\gamma }{\gamma +1}} \ge \sum _{C=0}^1 p(C|\varvec{u})f_C(\varvec{u};\varvec{\alpha },\nu )^{\gamma }. \end{aligned}$$
(B4)

Equality in (B4), i.e., \(d(\varvec{\alpha })=0\), holds if and only if \(p(C|\varvec{u})\) and \(f_C(\varvec{u};\varvec{\alpha },\nu )\) are linearly dependent. In addition, by the definition of \(\varvec{\alpha }_{\star }\), \(d(\varvec{\alpha }_{\star })=0\). Thus, with a positive function \(\omega (\varvec{u})>0\), we obtain

$$\begin{aligned} p(C|\varvec{u})=\omega (\varvec{u})f_C(\varvec{u};\varvec{\alpha }_{\star },\nu ). \end{aligned}$$
(B5)

By substituting (5) and (7) into (B5) with \(p_{\textrm{d}}(\cdot )=p_{\textrm{m}}(\cdot ;\varvec{\theta }_0)\), we have

$$\begin{aligned} \frac{p_{\textrm{m}}(\varvec{u};\varvec{\theta }_0)}{p_{\textrm{m}}(\varvec{u};\varvec{\theta }_0)+\nu p_{\textrm{r}}(\varvec{u})}= \omega (\varvec{u})\frac{p^0_{\textrm{m}}(\varvec{u};\varvec{\alpha }_{\star })}{p^0_{\textrm{m}}(\varvec{u};\varvec{\alpha }_{\star })+\nu p_{\textrm{r}}(\varvec{u})}, \end{aligned}$$

which yields

$$\begin{aligned} (1-\omega (\varvec{u}))p_{\textrm{m}}(\varvec{u};\varvec{\theta }_0)p^0_{\textrm{m}}(\varvec{u};\varvec{\alpha }_{\star }) +\nu \{p_{\textrm{m}}(\varvec{u};\varvec{\theta }_0)-\omega (\varvec{u})p^0_{\textrm{m}}(\varvec{u};\varvec{\alpha }_{\star })\}p_{\textrm{r}}(\varvec{u})=0. \end{aligned}$$
(B6)

It follows from (B6) that the supports of \(p_{\textrm{m}}(\varvec{u};\varvec{\theta }_0)\) and \(p^0_{\textrm{m}}(\varvec{u};\varvec{\alpha }_{\star })\) coincide; otherwise, there exists a point \(\tilde{\varvec{u}}\), such that \(p_{\textrm{m}}(\tilde{\varvec{u}};\varvec{\theta }_0)=0\) yet \(p^0_{\textrm{m}}(\tilde{\varvec{u}};\varvec{\alpha }_{\star }),p_{\textrm{r}}(\tilde{\varvec{u}}), \omega (\tilde{\varvec{u}})>0\), and (B6) does not hold at \(\varvec{u}=\tilde{\varvec{u}}\). Since (B6) holds for all \(\varvec{u}\in \textrm{supp}(p_{\textrm{r}})\), we obtain \(\omega (\varvec{u})=1\) and

$$\begin{aligned} p_{\textrm{m}}(\varvec{u};\varvec{\theta }_0)=p^0_{\textrm{m}}(\varvec{u};\varvec{\alpha }_{\star }). \end{aligned}$$
(B7)

Integrating both sides of (B7) indicates that \(p^0_{\textrm{m}}(\varvec{u};\varvec{\alpha }_{\star })\) is normalized, i.e., \(p^0_{\textrm{m}}(\varvec{u};\varvec{\alpha }_{\star })=p_{\textrm{m}}(\varvec{u};\varvec{\theta }_{\star })\). Finally, the identifiability of \(p_{\textrm{m}}\) ensures \(\varvec{\theta }_0=\varvec{\theta }_{\star }\). \(\square \)
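
As a numerical sanity check of the Hölder inequality (B4) (an illustrative sketch, not part of the paper), one can draw random pairs \((p_0,p_1)\) and \((f_0,f_1)\) and confirm the claimed direction and equality condition:

```python
import numpy as np

# Check (sum_C p_C^(g+1))^(1/(g+1)) * (sum_C f_C^(g+1))^(g/(g+1)) >= sum_C p_C*f_C^g
# for two classes C in {0, 1}; equality holds iff (p_0, p_1) and (f_0, f_1)
# are proportional, i.e., equal here since both sum to one.
rng = np.random.default_rng(0)
g = 0.5  # gamma > 0
for _ in range(5):
    p = rng.dirichlet([1.0, 1.0])   # plays the role of p(C | u)
    f = rng.dirichlet([1.0, 1.0])   # plays the role of f_C(u; alpha, nu)
    lhs = (p ** (g + 1)).sum() ** (1 / (g + 1)) * (f ** (g + 1)).sum() ** (g / (g + 1))
    rhs = (p * f ** g).sum()
    print(lhs >= rhs, lhs - rhs)    # always True; the gap vanishes as f -> p
```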

Appendix C: Proof of Proposition 1

Proof

Let us express

$$\begin{aligned} L_{\gamma }(\varvec{\alpha })&=\exp \left( -\gamma D_{\gamma }(\varvec{\alpha })+\log {p}_0\right) \\&=E_{\textrm{d}}\Big [h_{\gamma }(\varvec{X};\varvec{\alpha },\nu )^{\frac{\gamma }{\gamma +1}}\Big ] +\nu E_{\textrm{r}}\Big [\left( 1-h_{\gamma }(\varvec{Y};\varvec{\alpha },\nu )\right) ^{\frac{\gamma }{\gamma +1}}\Big ]. \end{aligned}$$

Since \(\nabla _{\varvec{\alpha }}D_{\gamma }(\varvec{\alpha })\bigr |_{\varvec{\alpha }=\varvec{\alpha }_{\star }}=\varvec{0}\) from the definition of \(\varvec{\alpha }_{\star }\), we have

$$\begin{aligned} \nabla _{\varvec{\alpha }}L_{\gamma }(\varvec{\alpha })\Bigr |_{\varvec{\alpha }=\varvec{\alpha }_{\star }}&=\varvec{0}, \end{aligned}$$
(C8)

which implies that

$$\begin{aligned} \left\{ E_{\textrm{d}}\Big [\nabla _{\varvec{\alpha }}h_{\gamma }(\varvec{X};\varvec{\alpha },\nu )^{\frac{\gamma }{\gamma +1}}\Big ] +\nu E_{\textrm{r}}\Big [\nabla _{\varvec{\alpha }}(1-h_{\gamma }(\varvec{Y};\varvec{\alpha },\nu ))^{\frac{\gamma }{\gamma +1}}\Big ] \right\} \Bigr |_{\varvec{\alpha }=\varvec{\alpha }_{\star }}=\varvec{0}. \end{aligned}$$
(C9)

Similarly, we define the contaminated version of \(L_{\gamma }(\varvec{\alpha })\) as

$$\begin{aligned} \bar{L}_{\gamma }(\varvec{\alpha })&:=\exp \left( -\gamma \bar{D}_{\gamma }(\varvec{\alpha })+\log {p_0}\right) \nonumber \\ {}&= \bar{E}_{\textrm{d}}\Big [h_{\gamma }(\varvec{X};\varvec{\alpha },\nu )^{\frac{\gamma }{\gamma +1}}\Big ]+\nu \bar{E}_{\textrm{r}}\Big [\left( 1-h_{\gamma }(\varvec{Y};\varvec{\alpha },\nu )\right) ^{\frac{\gamma }{\gamma +1}}\Big ]. \end{aligned}$$

Then, by the definition of \(\varvec{\alpha }_{\epsilon }\)

$$\begin{aligned} \nabla _{\varvec{\alpha }}\bar{L}_{\gamma }(\varvec{\alpha })\Bigr |_{\varvec{\alpha }=\varvec{\alpha }_{\epsilon }}&=\varvec{0}. \end{aligned}$$
(C10)

Here, we recall that the influence function in (20) is the (Gâteaux) derivative of \(\varvec{\alpha }_{\epsilon }\) with respect to \(\epsilon \) at \(\epsilon =0\). Thus, since the gradient of \(\bar{L}_{\gamma }(\varvec{\alpha })\) is computed as

$$\begin{aligned} \nabla _{\varvec{\alpha }}\bar{L}_{\gamma }(\varvec{\alpha })&=\nabla _{\varvec{\alpha }}L_{\gamma }(\varvec{\alpha }) +\epsilon \left\{ \nabla _{\varvec{\alpha }}h_{\gamma }(\varvec{x}_{\textrm{o}};\varvec{\alpha },\nu )^{\frac{\gamma }{\gamma +1}} -E_{\textrm{d}}\Big [\nabla _{\varvec{\alpha }}h_{\gamma }(\varvec{X};\varvec{\alpha },\nu )^{\frac{\gamma }{\gamma +1}}\Big ] \right. \\&\quad \left. +\nu \eta \nabla _{\varvec{\alpha }}(1-h_{\gamma }(\varvec{y}_{\textrm{o}};\varvec{\alpha },\nu ))^{\frac{\gamma }{\gamma +1}} -\nu \eta E_{\textrm{r}}\Big [\nabla _{\varvec{\alpha }}(1-h_{\gamma }(\varvec{Y};\varvec{\alpha },\nu ))^{\frac{\gamma }{\gamma +1}}\Big ] \right\} , \end{aligned}$$

we obtain the following equation by differentiating (C10) with respect to \(\epsilon \) at \(\epsilon =0\):

$$\begin{aligned}&\varvec{H}\cdot \textrm{IF}(\varvec{x}_{\textrm{o}},\varvec{y}_{\textrm{o}}) +\left\{ (\eta -1)E_{\textrm{d}}\Big [\nabla _{\varvec{\alpha }}h_{\gamma }(\varvec{X};\varvec{\alpha },\nu )^{\frac{\gamma }{\gamma +1}}\Big ] \right. \nonumber \\&\quad \left. +\nabla _{\varvec{\alpha }}h_{\gamma }(\varvec{x}_{\textrm{o}};\varvec{\alpha },\nu )^{\frac{\gamma }{\gamma +1}} +\nu \eta \nabla _{\varvec{\alpha }}(1-h_{\gamma }(\varvec{y}_{\textrm{o}};\varvec{\alpha },\nu ))^{\frac{\gamma }{\gamma +1}} \right\} \Bigr |_{\varvec{\alpha }=\varvec{\alpha }_{\star }}=\varvec{0}, \end{aligned}$$
(C11)

where \(\varvec{\alpha }_{\epsilon }\bigr |_{\epsilon =0}=\varvec{\alpha }_{\star }\), \(\varvec{H}:=\nabla ^2_{\varvec{\alpha }}L_{\gamma }(\varvec{\alpha })\Bigr |_{\varvec{\alpha }=\varvec{\alpha }_{\star }}\), and we applied the following identity, derived from (C9):

$$\begin{aligned}&\left\{ E_{\textrm{d}}\Big [\nabla _{\varvec{\alpha }}h_{\gamma }(\varvec{X};\varvec{\alpha },\nu )^{\frac{\gamma }{\gamma +1}}\Big ] +\nu \eta E_{\textrm{r}}\Big [\nabla _{\varvec{\alpha }}(1-h_{\gamma }(\varvec{Y};\varvec{\alpha },\nu ))^{\frac{\gamma }{\gamma +1}}\Big ] \right\} \Bigr |_{\varvec{\alpha }=\varvec{\alpha }_{\star }}\\&\quad =(1-\eta )E_{\textrm{d}}\Big [\nabla _{\varvec{\alpha }}h_{\gamma }(\varvec{X};\varvec{\alpha },\nu )^{\frac{\gamma }{\gamma +1}}\Big ]\Bigr |_{\varvec{\alpha }=\varvec{\alpha }_{\star }}. \end{aligned}$$

Finally, denoting \(\nabla _{\varvec{\alpha }}\log {p^0_{\textrm{m}}}(\varvec{x};\varvec{\alpha })\) by \(\varvec{g}_{\textrm{m}}(\varvec{x};\varvec{\alpha })\), the proof is completed by substituting

$$\begin{aligned} \nabla _{\varvec{\alpha }}h_{\gamma }(\varvec{u};\varvec{\alpha },\nu )^{\frac{\gamma }{\gamma +1}}&=\gamma h_{\gamma }(\varvec{u};\varvec{\alpha },\nu )^{\frac{\gamma }{\gamma +1}}(1-h_{\gamma }(\varvec{u};\varvec{\alpha },\nu ))\varvec{g}_{\textrm{m}}(\varvec{u};\varvec{\alpha })\\ \nabla _{\varvec{\alpha }}(1-h_{\gamma }(\varvec{u};\varvec{\alpha },\nu ))^{\frac{\gamma }{\gamma +1}}&=-\gamma h_{\gamma }(\varvec{u};\varvec{\alpha },\nu )(1-h_{\gamma }(\varvec{u};\varvec{\alpha },\nu ))^{\frac{\gamma }{\gamma +1}} \varvec{g}_{\textrm{m}}(\varvec{u};\varvec{\alpha }), \end{aligned}$$

into (C11) and using the relation

$$\begin{aligned} \varvec{H}&=\left\{ \gamma ^2L_{\gamma }(\varvec{\alpha })(\nabla _{\varvec{\alpha }}D_{\gamma }(\varvec{\alpha }))(\nabla _{\varvec{\alpha }}D_{\gamma }(\varvec{\alpha }))^{\top } -\gamma L_{\gamma }(\varvec{\alpha })\nabla ^2_{\varvec{\alpha }}D_{\gamma }(\varvec{\alpha })\right\} \Bigr |_{\varvec{\alpha }=\varvec{\alpha }_{\star }} \nonumber \\&=-\gamma L_{\gamma }(\varvec{\alpha })\nabla ^2_{\varvec{\alpha }}D_{\gamma }(\varvec{\alpha })\Bigr |_{\varvec{\alpha }=\varvec{\alpha }_{\star }} =\gamma \tilde{\varvec{H}}, \end{aligned}$$
(C12)

where \(\nabla _{\varvec{\alpha }}D_{\gamma }(\varvec{\alpha })\Bigr |_{\varvec{\alpha }=\varvec{\alpha }_{\star }}=\varvec{0}\). \(\square \)
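
The influence function can also be approximated numerically as the finite difference \((\varvec{\alpha }_{\epsilon }-\varvec{\alpha }_{\star })/\epsilon \) for small \(\epsilon \), which gives a practical way to cross-check results of this kind. Below is a minimal sketch (not part of the paper) for the \(\gamma \rightarrow 0\) special case (logistic NCE), contaminating only the data side with a point mass at \(x_{\textrm{o}}\) (the reference-side outlier \(\varvec{y}_{\textrm{o}}\) is dropped); the one-dimensional Gaussian model and all helper names are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
theta0, nu, n = 0.0, 1.0, 5000
x = rng.normal(theta0, 1.0, n)   # data from p_d = N(0, 1)
y = rng.normal(0.0, 1.0, n)      # reference samples from p_r = N(0, 1)

def log_pm0(u, a):               # unnormalized Gaussian model, a = (theta, c)
    return -0.5 * (u - a[0]) ** 2 + a[1]

def log_pr(u):                   # reference density N(0, 1)
    return -0.5 * u ** 2 - 0.5 * np.log(2.0 * np.pi)

def data_term(u, a):             # -log h(u), in numerically stable softplus form
    return np.logaddexp(0.0, -(log_pm0(u, a) - log_pr(u) - np.log(nu)))

def ref_term(u, a):              # -log(1 - h(u))
    return np.logaddexp(0.0, log_pm0(u, a) - log_pr(u) - np.log(nu))

def loss(a, eps, x_o):           # data term contaminated by a point mass at x_o
    return ((1.0 - eps) * data_term(x, a).mean()
            + eps * data_term(np.array([x_o]), a)[0]
            + nu * ref_term(y, a).mean())

eps = 1e-3
a_star = minimize(loss, np.zeros(2), args=(0.0, 0.0)).x
for x_o in (2.0, 5.0, 10.0):
    a_eps = minimize(loss, a_star, args=(eps, x_o)).x
    print(x_o, (a_eps - a_star) / eps)   # finite-difference influence function
```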

Appendix D: Proof of Proposition 4

Here, we give a brief proof of Proposition 4, because it essentially follows the same steps as the proof of Proposition 1.

Proof

We first compute the gradient of \(\bar{D}_{\beta }(\varvec{\alpha })/p_0\) as

$$\begin{aligned} \frac{\nabla _{\varvec{\alpha }}\bar{D}_{\beta }(\varvec{\alpha })}{p_0}&=\frac{\nabla _{\varvec{\alpha }}D_{\beta }(\varvec{\alpha })}{p_0} +\epsilon \left\{ \frac{1}{\beta }\left( E_{\textrm{d}}\big [\nabla _{\varvec{\alpha }}f_0(\varvec{X};\varvec{\alpha },\nu )^{\beta }\big ]-\nabla _{\varvec{\alpha }}f_0(\varvec{x}_{\textrm{o}};\varvec{\alpha },\nu )^{\beta } \right. \right. \\&\quad \left. +\nu \eta E_{\textrm{r}}\big [\nabla _{\varvec{\alpha }}f_1(\varvec{Y};\varvec{\alpha },\nu )^{\beta }\big ] -\nu \eta \nabla _{\varvec{\alpha }}f_1(\varvec{y}_{\textrm{o}};\varvec{\alpha },\nu )^{\beta }\right) \\&\quad +\frac{1}{\beta +1}\sum _{C=0}^1\left( \nabla _{\varvec{\alpha }}f_C(\varvec{x}_{\textrm{o}};\varvec{\alpha },\nu )^{\beta +1}-E_{\textrm{d}}\big [\nabla _{\varvec{\alpha }}f_C(\varvec{X};\varvec{\alpha },\nu )^{\beta +1}\big ] \right. \\&\quad \left. \left. +\nu \eta \nabla _{\varvec{\alpha }}f_C(\varvec{y}_{\textrm{o}};\varvec{\alpha },\nu )^{\beta +1} -\nu \eta E_{\textrm{r}}\big [\nabla _{\varvec{\alpha }}f_C(\varvec{Y};\varvec{\alpha },\nu )^{\beta +1}\big ] \right) \right\} . \end{aligned}$$

Then, we differentiate both sides of \(\nabla _{\varvec{\alpha }}\bar{D}_{\beta }(\varvec{\alpha })/p_0|_{\varvec{\alpha }=\varvec{\alpha }_{\epsilon }}=\varvec{0}\) with respect to \(\epsilon \) at \(\epsilon =0\) to obtain

$$\begin{aligned}&\varvec{H}_{\beta }\cdot \textrm{IF}(\varvec{x}_{\textrm{o}},\varvec{y}_{\textrm{o}}) +\Biggl \{(1-\eta )\left( \frac{1}{\beta }E_{\textrm{d}}[\nabla _{\varvec{\alpha }}f_0(\varvec{X};\varvec{\alpha },\nu )^{\beta }]\right. \nonumber \\&\quad \left. -\frac{1}{\beta +1}E_{\textrm{d}}\left[ \sum _{C=0}^1\nabla _{\varvec{\alpha }}f_C(\varvec{X};\varvec{\alpha },\nu )^{\beta +1} \right] \right) \nonumber \\&\quad -\frac{1}{\beta }\left( \nabla _{\varvec{\alpha }}f_0(\varvec{x}_{\textrm{o}};\varvec{\alpha },\nu )^{\beta } +\nu \eta \nabla _{\varvec{\alpha }}f_1(\varvec{y}_{\textrm{o}};\varvec{\alpha },\nu )^{\beta }\right) \nonumber \\&\quad +\frac{1}{\beta +1}\sum _{C=0}^1\left( \nabla _{\varvec{\alpha }}f_C(\varvec{x}_{\textrm{o}};\varvec{\alpha },\nu )^{\beta +1} +\nu \eta \nabla _{\varvec{\alpha }}f_C(\varvec{y}_{\textrm{o}};\varvec{\alpha },\nu )^{\beta +1}\right) \Biggr \} \Biggr |_{\varvec{\alpha }=\varvec{\alpha }_{\star }^{\beta }}=\varvec{0}, \end{aligned}$$
(D13)

where we used the following relation derived from \(\nabla _{\varvec{\alpha }}D_{\beta }(\varvec{\alpha })|_{\varvec{\alpha }=\varvec{\alpha }_{\star }^{\beta }}=\varvec{0}\):

$$\begin{aligned}&-\frac{1}{\beta }\left( E_{\textrm{d}}[\nabla _{\varvec{\alpha }}f_0(\varvec{X};\varvec{\alpha },\nu )^{\beta }] +\nu E_{\textrm{r}}\big [\nabla _{\varvec{\alpha }}f_1(\varvec{Y};\varvec{\alpha },\nu )^{\beta }\big ]\right) \\&\quad +\frac{1}{\beta +1}\sum _{C=0}^1\left( E_{\textrm{d}}\big [\nabla _{\varvec{\alpha }}f_C(\varvec{X};\varvec{\alpha },\nu )^{\beta +1}\big ] +\nu E_{\textrm{r}}\big [\nabla _{\varvec{\alpha }}f_C(\varvec{Y};\varvec{\alpha },\nu )^{\beta +1}\big ]\right) =\varvec{0}. \end{aligned}$$

We complete the proof by substituting the formulas

$$\begin{aligned} \nabla _{\varvec{\alpha }}f_0(\varvec{X};\varvec{\alpha },\nu )^{\beta }&=\beta {f}_0(\varvec{X};\varvec{\alpha },\nu )^{\beta } f_1(\varvec{X};\varvec{\alpha },\nu )\varvec{g}_{\textrm{m}}(\varvec{X};\varvec{\alpha })\\ \nabla _{\varvec{\alpha }}f_1(\varvec{X};\varvec{\alpha },\nu )^{\beta }&=-\beta {f}_0(\varvec{X};\varvec{\alpha },\nu )f_1(\varvec{X};\varvec{\alpha },\nu )^{\beta }\varvec{g}_{\textrm{m}}(\varvec{X};\varvec{\alpha }), \end{aligned}$$

into (D13). \(\square \)
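
The two gradient formulas substituted above follow from \(f_1=1-f_0\) and the chain rule, and can be checked by finite differences; a minimal sketch (not part of the paper) with an illustrative scalar model \(p^0_{\textrm{m}}(u;a)=\exp (au)\) and reference \(p_{\textrm{r}}=N(0,1)\):

```python
import numpy as np

# Check d/da f_0(u; a)^beta = beta * f_0^beta * f_1 * g_m against a central
# difference, where f_0 = p_m^0/(p_m^0 + nu*p_r), f_1 = 1 - f_0, and
# g_m = d/da log p_m^0 = u for the model p_m^0(u; a) = exp(a*u).
nu, beta, u, a, h = 1.0, 0.5, 0.7, 0.3, 1e-6

def f0(a):
    w = np.exp(a * u)                                   # p_m^0(u; a)
    pr = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)   # p_r(u), N(0, 1)
    return w / (w + nu * pr)

g_m = u
num = (f0(a + h) ** beta - f0(a - h) ** beta) / (2.0 * h)  # numerical gradient
ana = beta * f0(a) ** beta * (1.0 - f0(a)) * g_m           # claimed formula
print(num, ana)                                            # should agree closely
```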

Appendix E: Proof of Proposition 7

Since the proof is similar to that of Proposition 1, we give only a short proof of Proposition 7.

Proof

First, the gradient of \(\bar{J}_{\gamma }(\varvec{\alpha })\) is computed as

$$\begin{aligned} \nabla _{\varvec{\alpha }}\bar{J}_{\gamma }(\varvec{\alpha })&=\frac{\bar{E}_{\textrm{r}}\left[ \nabla _{\varvec{\alpha }}r(\varvec{Y};\varvec{\alpha })^{\gamma +1}\right] }{(\gamma +1)\bar{E}_{\textrm{r}}\left[ r(\varvec{Y};\varvec{\alpha })^{\gamma +1}\right] } -\frac{\bar{E}_{\textrm{d}}\left[ \nabla _{\varvec{\alpha }}r(\varvec{X};\varvec{\alpha })^{\gamma }\right] }{\gamma \bar{E}_{\textrm{d}}\left[ r(\varvec{X};\varvec{\alpha })^{\gamma }\right] }\\&\simeq \nabla _{\varvec{\alpha }}J_{\gamma }(\varvec{\alpha })\\&\quad +\eta \epsilon \frac{E_{\textrm{r}}\left[ r(\varvec{Y};\varvec{\alpha })^{\gamma +1}\right] \left\{ \nabla _{\varvec{\alpha }}r(\varvec{y}_{\textrm{o}};\varvec{\alpha })^{\gamma +1}\right\} -E_{\textrm{r}}\left[ \nabla _{\varvec{\alpha }}r(\varvec{Y};\varvec{\alpha })^{\gamma +1}\right] r(\varvec{y}_{\textrm{o}};\varvec{\alpha })^{\gamma +1}}{(\gamma +1)E_{\textrm{r}}\left[ r(\varvec{Y};\varvec{\alpha })^{\gamma +1}\right] ^2}\\&\quad -\epsilon \frac{ E_{\textrm{d}}\left[ r(\varvec{X};\varvec{\alpha })^{\gamma }\right] \left\{ \nabla _{\varvec{\alpha }}r(\varvec{x}_{\textrm{o}};\varvec{\alpha })^{\gamma }\right\} -E_{\textrm{d}}\left[ \nabla _{\varvec{\alpha }}r(\varvec{X};\varvec{\alpha })^{\gamma }\right] r(\varvec{x}_{\textrm{o}};\varvec{\alpha })^{\gamma }}{\gamma E_{\textrm{d}}\left[ r(\varvec{X};\varvec{\alpha })^{\gamma }\right] ^2}, \end{aligned}$$

where \(\simeq \) denotes equality up to \(O(\epsilon ^2)\) terms. As in the previous proofs, differentiating both sides of \(\nabla _{\varvec{\alpha }}\bar{J}_{\gamma }(\varvec{\alpha })|_{\varvec{\alpha }=\varvec{\alpha }_{\epsilon }^{\gamma }}=\varvec{0}\) with respect to \(\epsilon \) at \(\epsilon =0\) yields

$$\begin{aligned}&\varvec{H}_{\gamma }\cdot \textrm{IF}(\varvec{x}_{\textrm{o}},\varvec{y}_{\textrm{o}})\nonumber \\&\quad +\left[ \eta \frac{E_{\textrm{r}}\left[ r(\varvec{Y};\varvec{\alpha })^{\gamma +1}\right] \left\{ \nabla _{\varvec{\alpha }}r(\varvec{y}_{\textrm{o}};\varvec{\alpha })^{\gamma +1}\right\} -E_{\textrm{r}}\left[ \nabla _{\varvec{\alpha }}r(\varvec{Y};\varvec{\alpha })^{\gamma +1}\right] r(\varvec{y}_{\textrm{o}};\varvec{\alpha })^{\gamma +1}}{(\gamma +1) E_{\textrm{r}}\left[ r(\varvec{Y};\varvec{\alpha })^{\gamma +1}\right] ^2} \right. \nonumber \\&\quad \left. -\frac{E_{\textrm{d}}\left[ r(\varvec{X};\varvec{\alpha })^{\gamma }\right] \left\{ \nabla _{\varvec{\alpha }}r(\varvec{x}_{\textrm{o}};\varvec{\alpha })^{\gamma }\right\} -E_{\textrm{d}}\left[ \nabla _{\varvec{\alpha }}r(\varvec{X};\varvec{\alpha })^{\gamma }\right] r(\varvec{x}_{\textrm{o}};\varvec{\alpha })^{\gamma }}{\gamma E_{\textrm{d}}\left[ r(\varvec{X};\varvec{\alpha })^{\gamma }\right] ^2} \right] \Biggr |_{\varvec{\alpha }=\varvec{\alpha }_{\star }^{\gamma }}=\varvec{0}. \end{aligned}$$
(E14)

The proof is completed by applying

$$\begin{aligned} \nabla _{\varvec{\alpha }}r(\varvec{x};\varvec{\alpha })^{\gamma }&=\gamma \cdot {r}(\varvec{x};\varvec{\alpha })^{\gamma }\varvec{g}_{\textrm{m}}(\varvec{x};\varvec{\alpha })\\ \frac{E_{\textrm{r}}\left[ \nabla _{\varvec{\alpha }}r(\varvec{Y};\varvec{\alpha })^{\gamma +1}\right] }{(\gamma +1)E_{\textrm{r}}\left[ r(\varvec{Y};\varvec{\alpha })^{\gamma +1}\right] } \Biggr |_{\varvec{\alpha }=\varvec{\alpha }_{\star }^{\gamma }}&=\frac{E_{\textrm{d}}\left[ \nabla _{\varvec{\alpha }}r(\varvec{X};\varvec{\alpha })^{\gamma }\right] }{\gamma E_{\textrm{d}}\left[ r(\varvec{X};\varvec{\alpha })^{\gamma }\right] } \Biggr |_{\varvec{\alpha }=\varvec{\alpha }_{\star }^{\gamma }}, \end{aligned}$$

to (E14), where the second equation is derived from \(\nabla J_{\gamma }(\varvec{\alpha })|_{\varvec{\alpha }=\varvec{\alpha }_{\star }^{\gamma }}=\varvec{0}\). \(\square \)

Cite this article

Sasaki, H., Takenouchi, T. Outlier-robust parameter estimation for unnormalized statistical models. Jpn J Stat Data Sci (2024). https://doi.org/10.1007/s42081-023-00237-8
