Journal of Signal Processing Systems, Volume 55, Issue 1–3, pp 185–207

Balancing the Role of Priors in Multi-Observer Segmentation Evaluation

  • Yaoyao Zhu
  • Xiaolei Huang
  • Wei Wang
  • Daniel Lopresti
  • Rodney Long
  • Sameer Antani
  • Zhiyun Xue
  • George Thoma


Comparing a group of multiple-observer segmentations is known to be a challenging problem. A good segmentation evaluation method would allow different segmentations not only to be compared, but also to be combined to generate a “true” segmentation with higher consensus. Numerous multi-observer segmentation evaluation approaches have been proposed in the literature; STAPLE in particular probabilistically estimates the true segmentation by optimally combining the observed segmentations with a prior model of the truth. Because STAPLE is an Expectation–Maximization (EM) algorithm, its convergence to the desired local optimum depends on good initializations for the truth prior and the observer-performance prior. However, accurate modeling of the initial truth prior is nontrivial. Moreover, of the two priors, the truth prior always dominates, so that in scenarios where meaningful observer-performance priors are available, STAPLE cannot take advantage of that information. In this paper, we propose a Bayesian decision formulation of the problem that permits the two types of prior knowledge to be integrated in a complementary manner in four cases with differing application purposes: (1) with a known truth prior; (2) with an observer prior; (3) with neither a truth prior nor an observer prior; and (4) with both a truth prior and an observer prior. The third and fourth cases are not discussed (or are effectively ignored) by STAPLE. We propose a new method to combine multiple-observer segmentations based on the maximum a posteriori (MAP) principle, which respects the observer prior regardless of the availability of the truth prior. Based on the four scenarios, we have developed a web-based software application that implements this flexible segmentation evaluation framework for digitized uterine cervix images. Experimental results show that our framework is flexible in effectively integrating different priors for multi-observer segmentation evaluation, and that it generates results that compare favorably with those of the STAPLE algorithm and the Majority Vote Rule.
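
To make the contrast between a majority vote and a prior-aware MAP decision concrete, the following is a minimal per-pixel sketch. It is not the paper's formulation, which the abstract does not spell out: it applies the textbook Bayes rule under the assumption that observers are conditionally independent given the truth. The function names, the sensitivity and specificity values, and the use of 0.5 as a stand-in for "no truth prior" are all illustrative assumptions.

```python
from math import prod


def majority_vote(decisions):
    """Label a pixel foreground (1) when more than half of the observers say so."""
    return 1 if sum(decisions) * 2 > len(decisions) else 0


def map_label(decisions, sensitivity, specificity, truth_prior=0.5):
    """Per-pixel MAP label, assuming conditionally independent observers.

    decisions[j]   : 0/1 label given by observer j
    sensitivity[j] : assumed P(observer j says 1 | truth is 1), strictly in (0, 1)
    specificity[j] : assumed P(observer j says 0 | truth is 0), strictly in (0, 1)
    truth_prior    : assumed P(truth is 1); 0.5 stands in for "no truth prior"
    Returns (label, posterior probability that the truth is 1).
    """
    # Likelihood of the observed labels if the pixel is truly foreground.
    like_fg = prod(p if d else 1.0 - p for d, p in zip(decisions, sensitivity))
    # Likelihood of the observed labels if the pixel is truly background.
    like_bg = prod(1.0 - q if d else q for d, q in zip(decisions, specificity))
    post_fg = truth_prior * like_fg
    post_bg = (1.0 - truth_prior) * like_bg
    posterior = post_fg / (post_fg + post_bg)
    return (1 if posterior > 0.5 else 0), posterior


# Hypothetical pixel: one reliable observer says foreground, two weaker ones disagree.
votes = [1, 0, 0]
sens = [0.95, 0.60, 0.60]
spec = [0.95, 0.60, 0.60]

vote_label = majority_vote(votes)
map_flat, post_flat = map_label(votes, sens, spec)                   # flat truth prior
map_low, post_low = map_label(votes, sens, spec, truth_prior=0.05)   # strong truth prior
```

With a flat truth prior, the reliable observer outweighs the two weaker ones and the MAP label differs from the majority vote. With a strongly skewed truth prior the decision flips back, which illustrates, in this toy setting only, how a truth prior can override observer-performance information.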


Keywords: Ground truth, Bayesian decision, Precision, Segmentation, Multi-observer, Sensitivity, Specificity, STAPLE, Validation



Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Yaoyao Zhu (1)
  • Xiaolei Huang (1)
  • Wei Wang (1)
  • Daniel Lopresti (1)
  • Rodney Long (2)
  • Sameer Antani (2)
  • Zhiyun Xue (2)
  • George Thoma (2)

  1. Department of Computer Science and Engineering, Lehigh University, Bethlehem, USA
  2. National Library of Medicine, National Institutes of Health, Bethesda, USA
