A big data management framework organizes information according to principles and practices that provide schema flexibility, scalability, and the ability to process huge volumes of data, requirements for which traditional RDBMSs are not well suited and become impractical. New data models and technologies are therefore needed to handle such big data. Recent research has shown that big data management frameworks can be classified into three layers: file systems, database technology, and programming models [19]. In this article, however, we focus only on database technologies in the context of the healthcare domain, from a real-time and temporal perspective.
NoSQL can be interpreted as an abbreviation of “Not Only SQL” or as “no SQL at all” [45]; the term was first used by Carlo Strozzi in 1998 for his RDBMS that did not offer an SQL interface [47]. NoSQL databases are often used to store big data in a non-relational and distributed manner, and their concurrency model is weaker than the ACID transactions of relational SQL-like database systems [14]. This is because most NoSQL systems are ACID non-compliant by design, so the complexity involved in enforcing ACID properties does not exist for them [10]. There are exceptions: ACID-compliant NoSQL databases include Redis [48], Aerospike [49] and Voldemort [50], which are key-value stores [51], and Neo4j [52] and Sparksee, which are graph-based data stores [8, 51]. In contrast, MongoDB is a document-oriented database that is not ACID compliant [8].
The V’s of big data
In healthcare as elsewhere, the V’s of big data mainly refer to Volume, Variety, Velocity and Veracity [14]. The first three V’s were introduced in [53], and the V for Veracity was introduced by Dwaine Snow in his blog Thoughts on Databases and Data Management [54].
Volume: Data produced at scale creates issues ranging from storage to processing.
Velocity: Real-time data and data streams, which call for analysis of streaming data, in-memory data snapshots for quick responses, and availability for access and delivery.
Variety: Data in many formats, including structured, unstructured, semi-structured and media.
Veracity: Deals with uncertain or imprecise data and its cleaning before processing. Variety and Velocity work against it, as both leave little opportunity to clean the data.
NoSQL database categories
Based on differences in their data models, NoSQL databases can be organized into the following basic categories: key-value stores, document databases, column-oriented databases and graph databases [10, 14, 19, 20].
Key-value stores
These systems store values against index keys, as key-value pairs. The keys are unique and are used to identify and request data values from the collections. Such databases have emerged recently and are heavily influenced by Amazon’s Dynamo key-value store, in which data is distributed and replicated across multiple servers [62]. The values in such databases are schema flexible and can be simple text strings or more complex structures such as arrays. The simplicity of the data model makes information retrieval very fast, and therefore supports real-time processing of big data along with scalability, reliability and high availability. Some key-value databases store data ordered by key, such as Memcached [55] or Berkeley DB [66], while others, such as Voldemort [50], do not. Some keep the entire data set in memory, such as Aerospike [49] and Redis [48]; others first write it permanently to disk (such as Aerospike and MemcacheDB [56]), trading some real-time query responsiveness for durability. Scalability, durability and flexibility depend upon mechanisms such as partitioning, replication, object versioning and schema evolution [19]. Sharding, also known as partitioning, splits the data based upon the keys, whereas replication, also known as mirroring, copies the data to different nodes.
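The partitioning and replication mechanisms described above can be sketched as follows. This is a minimal in-memory illustration, not the design of Dynamo or any other cited system: the class name `ShardedKeyValueStore`, the modulo-based placement rule, the key names and the `failed_nodes` parameter are all assumptions made for the example.

```python
import hashlib


class ShardedKeyValueStore:
    """Toy in-memory key-value store with hash partitioning and replication."""

    def __init__(self, num_nodes=4, replicas=2):
        if not 1 <= replicas <= num_nodes:
            raise ValueError("replicas must be between 1 and num_nodes")
        self.num_nodes = num_nodes
        self.replicas = replicas
        # Each node is modeled as a plain dictionary (hypothetical stand-in
        # for a server).
        self.nodes = [{} for _ in range(num_nodes)]

    def _owners(self, key):
        # Sharding: a stable hash of the key selects the primary node.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        primary = int(digest, 16) % self.num_nodes
        # Replication: the next nodes in sequence hold copies of the value.
        return [(primary + i) % self.num_nodes for i in range(self.replicas)]

    def put(self, key, value):
        for node_id in self._owners(key):
            self.nodes[node_id][key] = value

    def get(self, key, failed_nodes=()):
        # Read from the first replica that is still available.
        for node_id in self._owners(key):
            if node_id not in failed_nodes and key in self.nodes[node_id]:
                return self.nodes[node_id][key]
        raise KeyError(key)


store = ShardedKeyValueStore(num_nodes=4, replicas=2)
store.put("patient:42:heart_rate", [72, 75, 71])  # a value may be an array
store.put("patient:42:name", "Jane Doe")          # or a simple string

# Simulate the loss of the primary node; the replica still answers.
primary = store._owners("patient:42:name")[0]
recovered = store.get("patient:42:name", failed_nodes={primary})
```

Because every key is written to two distinct nodes in this sketch, a read still succeeds when the primary node is treated as failed, which is the availability benefit that replication is meant to provide.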
Amazon’s Dynamo and Voldemort [50], the latter used by LinkedIn, apply this data model successfully. Other databases that use this data model include Redis [48], Tokyo Cabinet [67] and Tokyo Tyrant [68], Memcached [55] and MemcacheDB [56], Basho Riak [60], Berkeley DB [66] and Scalaris [69]. Cassandra, in contrast, is a hybrid of the key-value and column-oriented database models [57]. Table 1 summarizes the characteristics of some of the key-value stores.
Column-oriented databases
Relational databases focus on rows: they store instances or records as rows and return rows, or instances of an entity, in response to a data retrieval query. Each row possesses a unique key for locating the information. Column-oriented databases, in contrast, store their data as columns instead of rows and use index-based unique keys over the columns for data retrieval. This supports attribute-level access rather than a tuple-level access pattern.
Only the columns relevant to a query need to be loaded, which reduces the I/O cost significantly [70]. These databases suit read-intensive applications, because they read only the relevant data and each column contains contiguous, similar values, so calculating aggregate values is also very fast. More columns can easily be added, and a column may be further structured as a super-column, which contains nested (sub)columns (e.g., in Cassandra) [14]. Super-columns are key-value pairs in which the values are columns. Columns and super-columns are both tuples with a name and a value. The key difference is that a standard column’s value is a string, whereas a super-column’s value is a map of columns. Super-columns are sorted associative arrays of columns [71].
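The difference between the two layouts, and between a standard column and a super-column, can be illustrated with a small sketch. The records, attribute names and the helper `mean_column` below are hypothetical and serve only to show why an aggregate needs to touch a single column in a column-oriented layout.

```python
# Row-oriented layout: one record per entity instance (hypothetical data).
rows = [
    {"id": 1, "name": "Ann", "age": 34, "heart_rate": 72},
    {"id": 2, "name": "Bob", "age": 51, "heart_rate": 80},
    {"id": 3, "name": "Cy", "age": 47, "heart_rate": 76},
]

# Column-oriented layout: one contiguous list of values per attribute.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def mean_column(store, column_name):
    """Aggregate by reading only the one column the query needs."""
    values = store[column_name]
    return sum(values) / len(values)


avg_heart_rate = mean_column(columns, "heart_rate")

# A standard column is a (name, value) tuple whose value is a string;
# a super-column is a (name, value) tuple whose value is a map of columns.
standard_column = ("city", "Oslo")
super_column = ("address", {"city": "Oslo", "zip": "0150"})
```

In the row layout the same average would require visiting every record and skipping the attributes that the query does not use.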
Table 2 Column-oriented databases
Google’s Bigtable, which inspired the column databases [74], is a compressed, high-performance, scalable, sparse, distributed multi-dimensional database built on a number of technologies, such as the Google File System (GFS) [75], a cluster management system, the SSTable file format and Chubby [76]. It provides indexes over rows and columns, as well as a third, timestamp dimension. Bigtable is designed to scale across thousands of system nodes and allows more nodes to be added easily through automatic configuration.
Bigtable was the first widely popular column-oriented database of its type; later, many companies introduced variants of it. For example, Facebook’s Cassandra [77] integrates the distributed system technologies of Dynamo with the data model of Bigtable. It distributes multi-dimensional structures across different nodes based upon four dimensions: rows, column families, columns, and super columns. Cassandra was open sourced in 2008, while HBase [72] and Hypertable [78], both modeled on the proprietary Bigtable technology, have emerged to implement similar open source data models. Table 2 describes some column-oriented databases in a categorical format.
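The three index dimensions mentioned above (row, column and timestamp) can be pictured as a sparse map from (row, column, timestamp) to a value. The sketch below is only an illustration of that idea under simplifying assumptions; the class `SparseTimestampedTable`, its methods and the sample keys are hypothetical and do not reproduce Bigtable’s API or storage format.

```python
class SparseTimestampedTable:
    """Toy map indexed by (row key, column name, timestamp)."""

    def __init__(self):
        # Only cells that were actually written are stored, so the table
        # is sparse: absent cells cost nothing.
        self._cells = {}

    def put(self, row, column, timestamp, value):
        versions = self._cells.setdefault((row, column), {})
        versions[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        versions = self._cells.get((row, column), {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        if not candidates:
            return None
        return versions[max(candidates)]


table = SparseTimestampedTable()
table.put("patient42", "vitals:heart_rate", 100, 72)
table.put("patient42", "vitals:heart_rate", 200, 80)

latest = table.get("patient42", "vitals:heart_rate")
earlier = table.get("patient42", "vitals:heart_rate", timestamp=150)
```

Keeping several timestamped versions per cell is what makes the timestamp a genuine third dimension: the same row and column can be read as of different points in time.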
Graph databases
Graph databases, as a category of NoSQL technologies, represent data as a network of nodes connected by edges, which carry properties in the form of key-value pairs. Working with relationships, detecting patterns and finding paths are the applications best solved by representing data as graphs. Neo4j [52], AllegroGraph [79], ArangoDB [80] and OrientDB [81] are a few examples of such systems, and are described along with their characteristics in a categorical format in Table 3. Neo4j is the most popular open source graph database; it is embedded and fully transactional, with ACID characteristics. It is schema flexible and stores data as a network of nodes, edges and their attributes. It also supports custom data types through its Java persistence engine. Neo4j does not support sharding a graph across different nodes; rather, it supports in-memory cache sharding [52, 82]. The reason is that the mathematical problem of optimally partitioning a graph across a set of servers is near-impossible (NP-complete) for large graphs [82]. AllegroGraph, in turn, is a Resource Description Framework (RDF) [83] triple store for linked data and is widely used by different organizations, such as Stanford University, IBM, Ford, AT&T, Siemens, NASA and the United States Census Bureau.
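As an illustration of the graph data model and of path finding, the following sketch stores nodes and edges with key-value properties and runs a breadth-first search. It is a generic in-memory example, not Neo4j’s storage engine or query interface; the class `PropertyGraph`, the node identifiers and the property names are hypothetical.

```python
from collections import deque


class PropertyGraph:
    """Toy property graph: nodes and edges carry key-value properties."""

    def __init__(self):
        self.nodes = {}  # node id -> properties
        self.edges = {}  # node id -> list of (neighbor id, properties)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties
        self.edges.setdefault(node_id, [])

    def add_edge(self, source, target, **properties):
        # Edges are directed in this sketch; both nodes must already exist.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("add both nodes before connecting them")
        self.edges[source].append((target, properties))

    def shortest_path(self, start, goal):
        """Breadth-first search; returns a path with the fewest edges, or None."""
        queue = deque([[start]])
        visited = {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for neighbor, _ in self.edges[path[-1]]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    queue.append(path + [neighbor])
        return None


graph = PropertyGraph()
graph.add_node("alice", kind="patient", age=34)
graph.add_node("dr_lee", kind="physician")
graph.add_node("clinic", kind="facility", city="Oslo")
graph.add_edge("alice", "dr_lee", relation="treated_by", since=2015)
graph.add_edge("dr_lee", "clinic", relation="works_at")

path = graph.shortest_path("alice", "clinic")
```

The traversal follows edges directly from node to node, which is why relationship-heavy queries map naturally onto this model.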
Table 4 Document-oriented databases
Document databases
These are the most general models, which use JSON (JavaScript Object Notation) or BSON (Binary JSON) format to represent and store data structures as documents. Document stores provide schema flexibility by allowing arbitrarily complex documents, i.e. sub-documents within documents or sub-documents, and lists of documents. A database comprises one or more collections, where each collection is a named group of documents. A document can be a simple or complex value: a set of attribute-value pairs, whose values can be simple values, lists, or even nested sub-documents. Documents are schema flexible, as the schema can be altered at run time; this gives programmers the flexibility to save object instances in different formats, thus supporting polymorphism at the database level [92]. Figure 1 illustrates two collections of documents, students and courses, within an academic management system. A collection can hold JSON documents of different formats, and the documents have hierarchies among themselves; for example, courses has an attribute books that contains a list of sub-documents of different formats.
These databases store and manage large collections of textual documents (e.g. emails, web pages, text file books), as well as semi-structured, unstructured and de-normalized data that would require extensive use of null values in an RDBMS [93]. Unlike key-value stores, document databases support secondary indexes on sub-documents to allow fast searching. They allow horizontal scaling of the data over multiple servers, called shards. MongoDB [38], CouchDB [88], Couchbase [89], RethinkDB [90], and Cloudant [91] are some of the most popular document-oriented databases, as shown in Table 4 in a categorical format along with their characteristics. Among these, MongoDB is the most popular due to its efficiency, in-memory processing and complex data type features [94]. The other databases, such as Couchbase, RethinkDB, Cloudant and CouchDB, do not offer in-memory processing features, although the former three offer a list of data types. MongoDB’s query language is more like the SQL of RDBMSs, so it is easy for programmers to use. MongoDB is good for dynamic queries, which other document-oriented databases such as CouchDB or Couchbase lack [95]. In addition, different object relational mapping middlewares are available [96] to define multiple schemas out of the box, depending upon the application requirements. The object nature of MongoDB documents makes this mapping even more fluid and fast, for example when using Mongoose [96] or Morphia [97].
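A database of collections whose documents differ in shape can be sketched with plain JSON-compatible structures. The field names and values below are invented for illustration and are not the contents of Figure 1; they only mirror the idea of a students and a courses collection in which books is a list of differently shaped sub-documents.

```python
import json

# A database is a set of named collections; each collection is a list of
# documents. All names and values here are hypothetical.
database = {
    "students": [
        {"_id": 1, "name": "Ann", "enrolled": ["CS101"]},
        # Same collection, different shape: a nested contact sub-document.
        {"_id": 2, "name": "Bob", "contact": {"email": "bob@example.org"}},
    ],
    "courses": [
        {
            "_id": "CS101",
            "title": "Databases",
            # A list of sub-documents that do not share one format.
            "books": [
                {"title": "Book A", "authors": ["X", "Y"]},
                {"title": "Book B", "edition": 2, "publisher": {"name": "P"}},
            ],
        }
    ],
}

# The whole structure serializes to JSON without a predefined schema.
encoded = json.dumps(database)
decoded = json.loads(encoded)

# Each book keeps its own set of attributes.
book_fields = [sorted(book) for book in decoded["courses"][0]["books"]]
```

No table definition had to be changed to store the second student or the second book, which is the schema flexibility the paragraph describes.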
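The idea of a secondary index over a field inside a sub-document can be sketched in a few lines. This is a conceptual illustration only and does not reflect how MongoDB or any other cited system implements its indexes; the helpers `get_path` and `build_secondary_index`, the dotted-path convention as used here, and the sample patient documents are assumptions.

```python
from collections import defaultdict


def get_path(document, dotted_path):
    """Follow a dotted path such as 'address.city' into nested sub-documents."""
    current = document
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def build_secondary_index(collection, dotted_path):
    """Map each value found at `dotted_path` to the positions of its documents."""
    index = defaultdict(list)
    for position, document in enumerate(collection):
        value = get_path(document, dotted_path)
        if value is not None:
            index[value].append(position)
    return index


patients = [
    {"_id": 1, "name": "Ann", "address": {"city": "Oslo"}},
    {"_id": 2, "name": "Bob", "address": {"city": "Bergen"}},
    {"_id": 3, "name": "Cy"},  # no address sub-document at all
    {"_id": 4, "name": "Dee", "address": {"city": "Oslo"}},
]

city_index = build_secondary_index(patients, "address.city")
oslo_ids = [patients[i]["_id"] for i in city_index["Oslo"]]
```

Once the index exists, a lookup by city reads only the matching positions instead of scanning every document, and documents that lack the field are simply left out of the index.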
Document databases: MongoDB
MongoDB, created by 10gen in 2007, is a document-oriented database for today’s applications that cannot be developed using traditional relational databases [98]. It is an IoT database that, instead of tables (as in an RDBMS), provides one or more collections as its main storage components, each consisting of similar or different JSON- or BSON-based documents or sub-documents. Documents that share a similar structure are organized as collections, which can be created at any time without predefinition. A document can be thought of as a row, or an instance of an entity, in an RDBMS; the difference is that MongoDB allows instances within instances, i.e. documents within documents, and even lists or arrays of documents. The attributes of a document can be of any basic data type, such as numbers, strings, dates or arrays, or can even be a sub-document.
MongoDB uniquely provides multiple storage engines within a single deployment and automatically manages the movement of data between storage engine technologies using native replication. MongoDB 3.2 includes four efficient storage engines, as shown in Fig. 2, all of which can coexist within a single MongoDB replica set [99]. The default WiredTiger storage engine provides concurrency control and native compression, with the best storage and performance efficiency. MongoDB also allows an in-memory engine, for ultra-low-latency operations, to be combined with a disk-based engine for persistence.
It allows large-scale, highly available, robust systems to be built and enables different sensors and applications to store their data in a schema-flexible manner. There is no database blockage of the kind encountered with alter table commands during schema migrations in an RDBMS. In rare cases, however, such as write-intensive scenarios under MongoDB’s master-slave architecture, there may be blocking at the document level, or a system bottleneck if sharding is not used; these cases are avoidable. MongoDB enables horizontal scalability because table joins are not as important as they are in a traditional RDBMS. MongoDB provides auto-sharding, in which more replica server nodes can easily be added to a system. It is a very fast database and provides indexes not only on primary attributes but also on secondary attributes, even within sub-documents. For cross-comparison analysis between different collections, different technologies are available, such as the aggregation framework [100], MapReduce [101, 102] and Hadoop [14, 19]. These processing techniques will be addressed in future work and are not the focus of this paper. Next, we discuss data modeling methodologies, followed by a description of the data model we used to store the real-time temporal data of ANT+ sensors in the healthcare domain.