Benchmarking Fast-Data Platforms for the Aadhaar Biometric Database
Aadhaar is the world’s largest biometric database, with a billion records, and is being compiled as an identity platform to deliver social services to residents of India. Aadhaar processes streams of biometric data as residents are enrolled and updated. Besides \(\sim\)1 million enrollments and updates per day, up to 100 million daily biometric authentications are expected during the delivery of various public services. These form critical Big Data applications, with large volumes and high velocity of data. Here, we propose a stream processing workload, based on the Aadhaar enrollment and authentication applications, as a Big Data benchmark for distributed stream processing systems. We describe the composition of these applications and characterize their task latencies and selectivity, as well as their data rate and size distributions, based on real observations. We also validate this benchmark on Apache Storm using synthetic streams and simulated application logic. This paper offers a unique glimpse into an operational national identity infrastructure, and proposes a benchmark for “fast data” platforms to support such eGovernance applications.
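To give a feel for the stream rates these daily volumes imply, the following minimal sketch converts them into mean events per second. It assumes the load is spread uniformly over a 24-hour day, which is an illustrative simplification and not a claim from the paper; actual traffic may be considerably peakier. The function and variable names are hypothetical.

```python
# Convert the daily volumes quoted in the abstract into mean per-second rates.
# Assumption (not from the source): events are spread uniformly over 24 hours.

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def mean_rate_per_second(events_per_day: float) -> float:
    """Return the average number of events per second for a daily volume."""
    return events_per_day / SECONDS_PER_DAY


# Daily volumes as stated in the abstract.
enrollment_rate = mean_rate_per_second(1_000_000)        # enrollments and updates
authentication_rate = mean_rate_per_second(100_000_000)  # upper bound on authentications

print(f"Enrollments/updates: {enrollment_rate:.1f} events/s on average")
print(f"Authentications:     {authentication_rate:.1f} events/s on average")
```

Under this uniform-load assumption, the authentication stream averages roughly a hundred times the enrollment stream, which is why the two applications stress a stream processing platform differently.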
We are grateful for the inputs provided by Dr. Vivek Raghavan of UIDAI, and for UIDAI’s public reports, which we drew on in preparing this article. The views and opinions of the authors expressed herein do not necessarily state or reflect those of the Government of India or any agency thereof, the UIDAI, or any of their employees.