Abstract
Spark is a large-scale data processing engine that is reported to run up to a hundred times faster than the Hadoop MapReduce processing engine. Although Spark is an in-memory framework and offers fewer big data platform facilities than Hadoop, the Spark analytics engine combined with the Hadoop Distributed File System (HDFS) gives better throughput than Hadoop alone. The main contribution of this paper is insight into the behaviour of an HDFS-based Azure cloud Spark cluster, with a discussion and evaluation of its strengths and limitations using a large NHS prescription dataset. The NHS prescription data, covering 2015 to April 2022, exceed 500 GB of records. A public dashboard for individual BNF code analysis and studies on NHS cost analysis exist, but no analysis covering this date range and volume of NHS prescriptions has been conducted, and none using newer big data processing engines such as Spark. This study also contributes descriptive statistics and machine learning models of prescription data trends, built with a cloud Spark engine and PySpark, technologies that have not been used in this context before. The study profiles regions and GP practices by reimbursement cost, drug consumption level, drug type, and disease type; traces how demand for dispensed chemical substances has varied over the years; and shows which diseases have increased or decreased over that period, together with the total cost and its trends.
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Fernando, S., Mydlarz, V.S., Katanani, A., Virdee, B. (2024). Cloud Spark Cluster to Analyse English Prescription Big Data for NHS Intelligence. In: Swaroop, A., Polkowski, Z., Correia, S.D., Virdee, B. (eds) Proceedings of Data Analytics and Management. ICDAM 2023. Lecture Notes in Networks and Systems, vol 785. Springer, Singapore. https://doi.org/10.1007/978-981-99-6544-1_27
DOI: https://doi.org/10.1007/978-981-99-6544-1_27
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-6543-4
Online ISBN: 978-981-99-6544-1
eBook Packages: Intelligent Technologies and Robotics (R0)