In the chapters so far, you have relied on HDFS as your storage medium. It has two major advantages for the type of processing we wanted to do: it excels at storing large files, and it enables distributed processing of these files with the help of MapReduce. HDFS is most efficient for tasks that require a pass through all the data in a file (or a set of files). If you only need to access a single element in a dataset (an operation sometimes called a point query) or a contiguous range of elements (sometimes called a range query), HDFS does not provide an efficient toolkit for the task. You are forced to scan all elements to pick out the ones you are interested in.
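To make the difference concrete, here is a minimal in-memory sketch in Python. It does not use HDFS or any Hadoop API. The `records` list, the key layout, and all function names are hypothetical, chosen only for illustration. The scan functions mimic what you are forced to do with a plain file: touch every record. The indexed functions assume the records are kept sorted by a unique key, which is one common way a storage system can answer point and range queries without a full pass.

```python
from bisect import bisect_left, bisect_right

# Hypothetical dataset: (key, value) records, sorted by unique key.
records = [(k, f"value-{k}") for k in range(0, 1000, 5)]
keys = [k for k, _ in records]  # sorted keys, used as a simple index


def scan_point_query(records, key):
    """Point query by full scan: every record is examined."""
    return [v for k, v in records if k == key]


def scan_range_query(records, low, high):
    """Range query (inclusive bounds) by full scan."""
    return [v for k, v in records if low <= k <= high]


def indexed_point_query(records, keys, key):
    """Point query by binary search over the sorted keys."""
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return [records[i][1]]
    return []


def indexed_range_query(records, keys, low, high):
    """Range query: binary search for both ends, then read only that slice."""
    start = bisect_left(keys, low)
    end = bisect_right(keys, high)
    return [v for _, v in records[start:end]]


if __name__ == "__main__":
    print(scan_point_query(records, 25))
    print(indexed_point_query(records, keys, 25))
    print(indexed_range_query(records, keys, 10, 20))
```

Both approaches return the same answers. The difference is the amount of data touched: the scan versions read all records regardless of how few match, while the indexed versions need a number of comparisons that grows only logarithmically with the dataset size, plus the matching records themselves.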