Data Retrieval for Heterogeneous Data Models

Nadkarni, Prakash M.

doi:10.1007/978-0-85729-510-1_15

Prakash M. Nadkarni M.D.²

Part of the book series: Health Informatics ((HI))

962 Accesses

Abstract

Retrieving data from a heterogeneously modeled database is understandably complex because the same conceptual operation requires different types of queries depending on how the class of data that you wish to retrieve is physically organized. Thus: 1. For conventional (columnar)-structured data, the conceptual operation and the query mechanism are identical. A a query that fetches only attributes of interest (i.e., without aggregates), even across multiple tables, can usually be effected with a single SQL statement that has a simple structure – that is, it has no subqueries. 2. For EAV-structured data, query of data based on combinations of values of attributes takes many more steps than for columnar (traditionally modeled) data: the AND, OR and NOT operations must be substituted respectively by the set operations of intersection, union and difference. 3. For hybrid classes, a combination of approaches would be necessary, based on whether the attributes you are fetching are columnar or EAV-modeled. 4. Querying a mixture of conventional, EAV, and hybrid classes is the most difficult. It requires decomposing the query task into individual operations on each class, and then combining the results.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 99.00; Price excludes VAT (USA)

Softcover Book: USD 129.00; Price excludes VAT (USA)

Hardcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Chatterjee J. Working with parameters with crystal reports and ASP.NET 2.0. Available from: http://www.aspfree.com/c/a/ASP.NET/Working-with-Parameters-with-Crystal-Reports-and-ASP-NET-2-0/1/. Cited 11/1/10.
IBM Corporation. Full outer joins in DB2. 2010.
Google Scholar
Snodgrass RT, Ahn I, Ariav G, et al. TSQL2 language specification. ACM SIGMOD Record. 1994;23(1):65-86.
Article Google Scholar
Ben-gan I. Calculating the median gets easier in SQL Server 2005. SQL Server Magazine. 2010.
Google Scholar
Johnson S, Chatziantoniou D. Extended SQL for manipulating clinical warehouse data. Proceedings of the AMIA Annual Fall Symposium; 1999:819–823.
Google Scholar
Deshpande A, Brandt C, Nadkarni P. Metadata-driven ad hoc query of patient data: meeting the needs of clinical studies. J Am Med Inform Assoc. 2002;9(4):369-382.
Article PubMed Google Scholar
Nadkarni P, Brandt C. Data extraction and ad hoc query of an Entity-Attribute-Value database. J Am Med Inform Assoc. 1998;5(6):511-527.
Article PubMed CAS Google Scholar
Pillai P. Infix to postfix conversion. Available from: http://scriptasylum.com/tutorials/infix_postfix/algorithms/infix-postfix/index.htm. Cited 10/1/10.
Wikipedia. Reverse polish notation. Available from: http://en.wikipedia.org/wiki/Reverse_Polish_notation. Cited 9/1/10.
Marenco L, Tosches N, Crasto C, Shepherd G, Miller P, Nadkarni P. Achieving evolvable Web-database bioscience applications using the EAV/CR framework: recent advances. J Am Med Inform Assoc. 2003;10(5):444-453.
Article PubMed Google Scholar
Fisk J, Mutalik P, Levin F, Erdos J, Taylor C, Nadkarni P. Integrating query of relational and textual data in relational databases: a case study. J Am Med Inform Assoc. 2003;10(1):21-38.
Article PubMed Google Scholar
Agrawal R, Srikant R. Fast algorithms for mining association rules. Proceedings of the 20th VLDB Conference. Santiago, Chile; 1994.
Google Scholar
Dinu V, Nadkarni P, Brandt C. Pivoting approaches for bulk extraction of Entity-Attribute-Value data. Comput Methods Programs Biomed. 2006;82(1):38-43.
Article PubMed Google Scholar
AT&T Research. Graphviz – a graph visualization tool. Available from: http://www.graphviz.org/. Cited 8/2/10.
Smith ML, Glass GV. Meta-analysis of psychotherapy outcome studies. Am Psychol. 1977;32:752-760.
Article PubMed CAS Google Scholar
Leslie H. International developments in openEHR archetypes and templates. HIM J. 2008;37(1):38-39.
PubMed Google Scholar
Kalra D, Beale T, Heard S. The openEHR Foundation. Stud Health Technol Inform. 2005;115:153-173.
PubMed Google Scholar
Sedgewick R. Algorithms in C ⁺⁺. 3rd ed. Reading: Addison-Wesley; 2002.
Google Scholar
Bard JF. Cycles in an undirected graph. Available from: http://www.me.utexas.edu/~bard/IP/Handouts/cycles.pdf. Cited 10/1/2010.

Download references

Author information

Authors and Affiliations

School of Medicine, Yale University, Cedar St. 333, 06510, New Haven, Connecticut, USA
Dr. Prakash M. Nadkarni M.D.

Authors

Dr. Prakash M. Nadkarni M.D.
View author publications
You can also search for this author in PubMed Google Scholar

Appendices

Appendix: Composing SQL that Uses Relationship Metadata

Understanding this section requires some computer-science exposure – specifically, graph theory. If you need to bone up, a data-structures/algorithms book that is reasonable and not overwhelming is Sedgewick.18 This section is relevant to the ad hoc query of non-sparse, heterogeneous data as modeled using an EAV/CR framework, though it also applies to a conventionally modeled schema.

15.1.1 The Universal Database Concept

A traditionally implemented (column-modeled) database schema can be regarded as an undirected graph data structure, where a table can be regarded as a vertex or node, while a primary-key/foreign key relationship between two tables can be treated as an edge. Undirected means that one is permitted to navigate in either direction along any edge, even though the ends of the edge have a specific meaning in terms of primary key vs. foreign key.

In theory, if one wants to retrieve columns from several tables, through criteria on columns that are dispersed across several tables, a program could help in composing part of the SQL – specifically the join expression:

For a pair of physical tables that are directly related, the join expressions is: primary-key-table. primary-key-field = foreign-key-table. foreign-key-field.If there are multiple columns in the join, the above is repeated, and the individual column-pairs are combined with an AND operator.
If the paths between the tables whose columns the user has specified are known, the software could automatically include intermediate tables in the path, even without the user having specified these tables.

An idea originating in the late 1980s with David Maier and Jeffrey Ullman – the “Universal database” (UDB) – takes this insight to an extreme level. The UDB concept attempts to create the illusion, for the end-user, that a database consisting possibly of hundreds of tables is one gigantic table with a large number of columns. In other words, the user concerns oneself only with individual columns, and need not be bothered with the existence of individual tables: foreign keys in particular, can be hidden from the user. The software SAP Business Objects (BO) is based on this premise: in BO terminology, the virtual giant table is the “Universe” and the “Objects” are the individual columns.

15.1.2 Weaknesses in the UDB Premise

In terms of graph theory, the basic UDB approach is guaranteed to work only for schemas that happen to be trees, where there is only one path that connects any two tables. (If there is more than one path, the schema is said to have cycles.) In real schemas, such an assumption never holds. There are two common conditions where the assumption is violated: see Fig. 15.8 below.

One instance of cycles is where one table is directly joined to another more once. (In Fig. 15.8 , there are two links between A and D.) An example is an employee-hierarchy table, consisting of two columns, Manager ID and Employee ID, both these columns are linked to a Persons table that contains details of the individual. Another example is Ship-From and Ship-to address IDs in a sales order, both of which point to an Addresses table.
Another cycle is where a table is directly linked to other tables, which also happen to be linked to each other by different paths. (In Fig. 15.8 , C is linked directly to A and B, but A is also linked to B, so to get from A to C, one can use the direct route A-C, or the indirect one A-B-C.) This happens in reference databases of biomedical information where multiple tables store different kinds of facts, but each fact needs to be backed up with a bibliographic citation. If all citations are stored centrally in a single table, then multiple tables may contain a foreign-key column, Citation ID, which points to the citations table.
Another concern here is that it is not enough to know that two tables are connected, but how they should be connected in a specific circumstance. Most of the time, one needs a standard (inner) join, but at other time, one needs a left or right outer join, and very rarely a full outer join (the union of left and right joins).

In other words, the basic UDB paradigm will take you only so far: at some point, the user must leave Never Land and become aware of the real world, where separate tables exist.

15.1.3 Cycle Detection and Elimination

In Business Objects, setting up a “Universe” involves detecting and eliminating cycles. The standard algorithm for cycle-detection in undirected graphs is based on depth-first search: see Bard19 for an accessible explanation. Essentially, one maintains lists of edges and vertices: all vertices are set to “unvisited”. Starting with any arbitrary vertex, you move along each of the edges connected to it that have not already been used. Any unvisited vertex encountered is set to “visited”. If a previously visited vertex is encountered, the edge that got you to this vertex is a cycle, or part of one. For example, in Fig. 15.8 , starting from A, you would reach D, B and C. Moving on to B, you would reach C, which has already been visited: therefore A–C is a cycle. Moving from D to A would similarly detect a cycle.

To eliminate cycles, you need to duplicate the multiply-connected vertexes: in terms of database schemas, you introduce alternative aliases for a table. You keep repeating this process until no cycles are detected. In Fig. 15.9 , which is a transformation of Fig. 15.8 , adding the two aliases A2 and D2 for A and D would seem to have eliminated cycles.

However, the problem gets more complicated. Remember that in the double-link between A and D of Fig. 15.8 , two separate columns in one table (the foreign keys) are linked to the primary key of the second table. Let’s assume that D is the foreign-key table with two columns Col1 and Col2. (These could be Ship-From and Ship-to address IDs.) Let’s assume that Col1 is linked to A, and Col2 is linked to A2. Then, in Fig. 15.9 , the only way to go from Col2 to B is via A2 and C (which may not be what the user wants). Similarly, any route from Col1 must involve B, which again may not be what is needed. So one must duplicate each D node once again, creating two new nodes D3 and D4, with D3 linked to A, and D4 linked to A2.

You must deal with the outer-join problem in a similar way. Let’s say that you mostly link A and B with an inner join, but occasionally need to use A right-outer-join B. You would duplicate A to create A3 (you could alternatively duplicate B – the choice is immaterial). You could then designate the link A–B as an inner join and A3–B as a right outer join. In general, if there are N cycles, you have to create N*2 extra aliases in the worst case.

15.1.4 Can Database Schemas be “Hidden” Successfully Using a UDB Approach?

As the number of multiple links between tables grows, the setup of a “Universe” becomes increasingly laborious. For the dubious “benefit” of hiding the physical schema and inter-table relationships from a supposedly naïve and E-R-diagram-phobic user, the virtual UDB schema can become much more complex than the actual physical schema. There are several practical issues here.

The UDB concept originated at a time when GUIs were non-existent, and E-R diagrams existed only on paper as a modeling tool SQL, including join expressions, had to be entered mostly by typing. In most modern environments, E-R diagrams are nothing to be scared of. Modern query GUIs such as MS-Access’s VQBE generate the join expression between two tables for you if you click and drag between the two columns that you want to connect. (You can specify the join type by changing the connection’s properties.) VQBE will actually utilize existing relationships in your schema, if you select two tables that happen to be directly related, pre-connecting them. (If you select two tables that are not directly connected, VQBE is currently not smart enough to include intermediate tables. You need to refer to the E-R diagram and add intermediate tables manually – which is not too big a deal.)
If you still insist on going the BO route, a fairly challenging issue involves giving the duplicated column aliases meaningful names that would somehow indicate the path desired by the user when a particular alias (or aliases, such as D4 and A2 in our running example) were picked implicitly. This is a problem that I believe contains an intractable contradiction: the user who is supposedly too naïve to know about separate tables must now be forced to not only deal with the tables but also the paths between them.
In my experience, serious analysts, who know SQL and are quite comfortable with E-R diagrams, find the BO approach particularly condescending and confining: they are likely to spend more time fighting against the software than getting work done. Such individuals would be more productive with direct read-only SQL access to the schema. (An amusing episode involved watching a friend and colleague of mine, the Universe designer/maintainer for an institutional Business Objects installation, bypass BO completely and type SQL to answer queries posed to him by a researcher. This contradicts the well-known dictum about “eating your own dog food”: don’t force your users to use something that you wouldn’t use yourself.)
Finally, in the case of EAV designs, where the same table can be joined to itself multiple times for set intersection operations, and many join operations are not readily predictable, setup of a “Universe” becomes so complicated that your typical developers – unless they are billing by the hour and primarily focused on job security – are likely to throw up their hands in frustration and turn in their resignations.

This is not to say that automatic join composition cannot be a source of developer productivity. The danger is of creating software that so insists on helping a user that it becomes a straitjacket or gatekeeper that persistently underestimates the user’s intelligence: it then becomes a database analog of the infamous Microsoft Office Assistant. Good software should know when to get out of the way, and should allow the power user ways of bypassing it if desired: VQBE, for example, lets developers type SQL.

Join Metadata in EAV/CR Designs

If you are using the EAV/CR approach of maintaining information on your inter-class/table joins in your metadata, this information can be used to compose join expressions. Remember that in an EAV/CR Design, the Meta_Relationships table’s contents the represent the edges in the schema’s graph, while Meta_Classes and Meta_Attributes hold vertex information. The difference, compared to traditional schema is that, based on which classes are modeled using EAV, some of the relationships may be simulated rather than physical. (I’ll explain shortly how join expressions can be composed for simulated relationships.)

At runtime, the graph is best loaded into memory to speed up the operations of cycle detection and join composition. The linked-list representation of the graph (see Sedgewick18) is most appropriate, though it can be optimized for performance. The edges are loaded into an array, and contain the primary-key class, the foreign-key class, and the foreign-key attribute: the primary-key attributes are known from the definition of the primary-key class. IDs are used rather than column names: the latter can be readily determined by lookup of the Class or Attribute IDs.

Given a set of classes that the user has specified, the problem is to find a unique path that connects these. This can be done by picking any class (vertex) in the list, and traversing its connecting edges until all the classes specified by the user have been reached. By the time this has been done, one or more intermediate classes may have been encountered. One then works with this subset of nodes to detect cycles. Conceptually, cycle detection is a second step, but in practice, by keeping track of which vertexes have already been visited, we can detect cycles simultaneously with the traversal process. There are also certain clues that would immediately identify cycles, such as a double relationship between a pair of tables.

1.
If you do not detect cycles, exit; a join expression can be composed.
2.
If you detect a cycle, a join expression cannot be composed without user assistance: the user must be shown the subset of the graph that contains the classes of interest, where the edges are labeled with the foreign-key attributes, and asked to disambiguate by selecting the desired edges (or deselecting the undesired edges). Note that, unlike the BO approach, we are assuming here that the user is aware of the database structure and knows about relationships. (The graph can be drawn using one of several graph-drawing subroutine libraries.)

This will work in most circumstances: sometimes, however, when there is a double relationship between a pair of classes, the user may specify both edges. (In the manager-employee situation, for example, the user may want details of both manager and employee.) In such a case, the software must create an extra alias for the primary-key table – preferably after having the user confirm that s/he really intends to use that table more than once. One may have to create aliases for other tables that are also intended to be used more than once, if they are part of the chain.
3.
After the user has selected edges, go to step 1, working with the reduced edge-list this time. If any tables that are used more than once, the diagram must be modified to include the extra aliases.

Note that the level of software engineering to implement the above steps is not trivial. You should certainly not attempt graph-drawing from first principles: commercial programs also include smart layout algorithms that place the vertexes in 2-D space so as to prevent edges from crossing each other as far as possible: such packages are well worth the modest investment. However, for relationships that are simulated using classes modeled as EAV, the effort of composing SQL by working directly with the schema and using a commercial graphical query tool that is not EAV-aware can be considerable. If you have your developer team try to compose queries this way a couple of times, they are likely to find the experience so disheartening that you should be persuaded that it is better to bite the bullet and solve the problem once and for all with a framework.

15.2.1 Dealing with Simulated Relationships

The join-expression composition approach for a pair of tables is fairly straightforward. However, for simulated relationships in an EAV/CR design, the join is only modestly more complicated. Recall that every object has a unique Object_ID, which points back to the Objects table, and that foreign-key relationships are implemented through the EAV_Objects table. See Fig. 15.10 below. (this diagram has been composed using MS-Access VQBE.)

Focus on the central table E (an alias for EAV_Objects) in Fig. 15.10 . The Object ID column and the Value column both point to the Objects table; here, we have used two aliases for Objects, Fkey_Obj and Pkey_Obj (to indicate the foreign-key and primary-key classes respectively). The relationship information connecting the two classes has three columns: the Primary-Key class ID, the Foreign-key class ID, and the Foreign-Key class’s Foreign-key Attribute ID. In the figure, we have denoted these in the Criteria row by abbreviated forms within square brackets: these would be substituted with actual IDs.

Fig. 15.10 also shows the SQL (automatically generated by VQBE) that implements the relationship: the criteria values have been italicized, to indicate that they are placeholders that would be substituted.

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Nadkarni, P.M. (2011). Data Retrieval for Heterogeneous Data Models. In: Metadata-driven Software Systems in Biomedicine. Health Informatics. Springer, London. https://doi.org/10.1007/978-0-85729-510-1_15

Download citation

DOI: https://doi.org/10.1007/978-0-85729-510-1_15
Published: 29 April 2011
Publisher Name: Springer, London
Print ISBN: 978-0-85729-509-5
Online ISBN: 978-0-85729-510-1
eBook Packages: MedicineMedicine (R0)

Publish with us

Policies and ethics