
AutoG: a visual query autocompletion framework for graph databases

  • Regular Paper
  • Published in: The VLDB Journal

Abstract

Composing queries is evidently a tedious task. This is particularly true of graph queries as they are typically complex and prone to errors, compounded by the fact that graph schemas can be missing or too loose to be helpful for query formulation. Despite the great success of query formulation aids, in particular, automatic query completion, graph query autocompletion has received much less research attention. In this paper, we propose a novel framework for subgraph query autocompletion (called AutoG). Given an initial query q and a user’s preference as input, AutoG returns ranked query suggestions \(Q'\) as output. Users may choose a query from \(Q'\) and iteratively apply AutoG to compose their queries. The novelties of AutoG are as follows: First, we formalize query composition. Second, we propose to increment a query with the logical units called c-prime features that are (i) frequent subgraphs and (ii) constructed from smaller c-prime features in no more than c ways. Third, we propose algorithms to rank candidate suggestions. Fourth, we propose a novel index called feature Dag (FDag) to optimize the ranking. We study the query suggestion quality with simulations and real users and conduct an extensive performance evaluation. The results show that the query suggestions are useful (saving roughly 40% of users’ mouse clicks) and that AutoG returns suggestions promptly under a large variety of parameter settings.



Notes

  1. https://pubchem.ncbi.nlm.nih.gov/.

  2. https://www.emolecules.com/.

  3. Ideally, the user interface may automatically show useful suggestions to users. The current GUI (Fig. 1) provides an “Autocomplete” button for users to fetch the top-k suggestions, which allows an explicit comparison of the experience with and without suggestions in the user tests.

  4. Query composition refers to an intuitive step when users are composing their queries, which is obviously different from the compositions used in the functional programming literature.

  5. We assume the non-feature parts of the queries are input by users; that is, they are not composed by AutoG.

  6. In contrast, for keyword search (of strings), key phrases are simply composed by a union/concatenation of keywords.

  7. https://pubchem.ncbi.nlm.nih.gov/edit2/index.html?cnt=0.

  8. Almost all query templates of PubChem on their user interface are also frequent subgraphs, returned by gSpan using its default parameters. This shows that the domain experts indeed set frequent subgraphs as query templates. We omit the advanced templates that PubChem shows only after a user’s further click on the UI.

  9. The automorphism of \(f_i\) is not used for pruning because \(f_i\) is embedded in q, and the automorphism of q is unknown offline.

  10. The automorphism relation \(A_{\mathsf {cs}}\) of \(\mathsf {cs}\) can be retrieved from FDag.

  11. Readers may find the full list of queries used for investigating the suggestion quality at the project site https://goo.gl/Xr9MRY. Further, a short video shows how users may interact with the AutoG prototype.

  12. The questionnaire used in the tests can be found at http://goo.gl/dFRdwj.

References

  1. Abiteboul, S., Amsterdamer, Y., Milo, T., Senellart, P.: Auto-completion learning for XML. In: SIGMOD (2012)

  2. Bast, H., Weber, I.: Type less, find more: fast autocompletion search with a succinct index. In: SIGIR (2006)

  3. Bhowmick, S.S., Choi, B., Zhou, S.: VOGUE: towards a visual interaction-aware graph query processing framework. In: CIDR (2013)

  4. Bhowmick, S.S., Chua, H.-E., Thian, B., Choi, B.: ViSual: An HCI-inspired simulator for blending visual subgraph query construction and processing. In: ICDE (2015)

  5. Bhowmick, S.S., Dyreson, C.E., Choi, B., Ang, M.-H.: Interruption-sensitive empty result feedback: Rethinking the visual query feedback paradigm for semistructured data. In: CIKM (2015)

  6. Borodin, A., Lee, H.C., Ye, Y.: Max-sum diversification, monotone submodular functions and dynamic updates. In: PODS (2012)

  7. Braga, D., Campi, A., Ceri, S.: XQBE (XQuery By Example): a visual interface to the standard XML query language. In: TODS (2005)

  8. Broder, A.Z.: On the resemblance and containment of documents. In: Compression and complexity of sequences (1997)

  9. Bunke, H., Shearer, K.: A graph distance metric based on the maximal common subgraph. Pattern Recognit. Lett. 19(3), 255–259 (1998)


  10. Cheng, J., Ke, Y., Ng, W., Lu, A.: Fg-index: towards verification-free query processing on graph databases. In: SIGMOD (2007)

  11. Comai, S., Damiani, E., Fraternali, P.: Computing graphical queries over XML data. In: TOIS (2001)

  12. Cordella, L.P., Foggia, P., Sansone, C., Vento, M.: A (sub)graph isomorphism algorithm for matching large graphs. In: PAMI (2004)

  13. Fan, Z., Peng, Y., Choi, B., Xu, J., Bhowmick, S.S.: Towards efficient authenticated subgraph query service in outsourced graph databases. In: TSC (2014)

  14. Feng, J., Li, G.: Efficient fuzzy type-ahead search in XML data. In: TKDE, pp. 882–895 (2012)

  15. Gollapudi, S., Sharma, A.: An axiomatic approach for result diversification. In: WWW (2009)

  16. Han, W.-S., Lee, J., Pham, M.-D., Yu, J.X.: iGraph: a framework for comparisons of disk-based graph indexing techniques. In: PVLDB, pp. 449–459 (2010)

  17. Herschel, M., Tzitzikas, Y., Candan, K.S., Marian, A.: Exploratory search: New name for an old hat? http://wp.sigmod.org/?p=1183 (2014)

  18. Hung, H.H., Bhowmick, S.S., Truong, B.Q., Choi, B., Zhou, S.: QUBLE: blending visual subgraph query formulation with query processing on large networks. In: SIGMOD, pp. 1097–1100 (2013)

  19. Jayaram, N., Goyal, S., Li, C.: VIIQ: auto-suggestion enabled visual interface for interactive graph query formulation. In: PVLDB, pp. 1940–1951 (2015)

  20. Jayaram, N., Gupta, M., Khan, A., Li, C., Yan, X., Elmasri, R.: GQBE: querying knowledge graphs by example entity tuples. In: ICDE (2014)

  21. Jin, C., Bhowmick, S.S., Xiao, X., Cheng, J., Choi, B.: GBLENDER: towards blending visual query formulation and query processing in graph databases. In: SIGMOD (2010)

  22. Kriege, N., Mutzel, P., Schäfer, T.: Practical SAHN clustering for very large data sets and expensive distance metrics. J. Graph Algorithms Appl. 18, 577–602 (2014)

  23. Li, Y., Yu, C., Jagadish, H.V.: Enabling schema-free XQuery with meaningful query focus. VLDB J. 17, 355–377 (2008)

  24. Lin, C., Lu, J., Ling, T.W., Cautis, B.: LotusX: a position-aware XML graphical search system with auto-completion. In: ICDE (2012)

  25. Luks, E.M.: Isomorphism of graphs of bounded valence can be tested in polynomial time. J. Comput. Syst. Sci. 25, 42–65 (1982)

  26. Marchionini, G.: Exploratory search: from finding to understanding. Commun. ACM 49, 41–46 (2006)

  27. McGregor, J.J.: Backtrack search algorithms and the maximal common subgraph problem. Softw. Pract. Exp. 12, 23–34 (1982)

  28. Mottin, D., Bonchi, F., Gullo, F.: Graph query reformulation with diversity. In: KDD, pp. 825–834 (2015)

  29. Nandi, A., Jagadish, H.V.: Effective phrase prediction. In: VLDB, pp. 219–230 (2007)

  30. NCI. AIDS. https://dtp.cancer.gov/databases_tools/bulk_data.htm

  31. NLM. PubChem. ftp://ftp.ncbi.nlm.nih.gov/pubchem/

  32. Pandey, S., Punera, K.: Unsupervised extraction of template structure in web search queries. In: WWW, pp. 409–418 (2012)

  33. Papakonstantinou, Y., Petropoulos, M., Vassalos, V.: QURSED: querying and reporting semistructured data. In: SIGMOD (2002)

  34. Qin, L., Yu, J.X., Chang, L.: Diversifying top-k results. CoRR, arXiv:1208.0076 (2012)

  35. Shasha, D., Wang, J.T.-L., Giugno, R.: Algorithmics and applications of tree and graph searching. In: PODS (2002)

  36. Venero, M.L.F., Valiente, G.: A graph distance metric combining maximum common subgraph and minimum common supergraph. Pattern Recognit. Lett. 22, 753–758 (2001)


  37. Vieira, M.R., Razente, H.L., Barioni, M.C.N., Hadjieleftheriou, M., Srivastava, D., Traina, C., Tsotras, V.J.: On query result diversification. In: ICDE (2011)

  38. Wallis, W.D., Shoubridge, P., Kraetzl, M., Ray, D.: Graph distances using graph union. Pattern Recognit. Lett. 22, 701–704 (2001)


  39. Xiao, C., Qin, J., Wang, W., Ishikawa, Y., Tsuda, K., Sadakane, K.: Efficient error-tolerant query autocompletion. In: PVLDB (2013)

  40. Xie, X., Fan, Z., Choi, B., Yi, P., Bhowmick, S.S., Zhou, S.: PIGEON: Progress indicator for subgraph queries. In: ICDE (2015)

  41. Yan, X., Han, J.: gSpan: Graph-based substructure pattern mining. In: ICDM, pp. 721–724 (2002)

  42. Yan, X., Yu, P.S., Han, J.: Graph indexing: a frequent structure-based approach. In: SIGMOD (2004)

  43. Yi, P., Choi, B., Bhowmick, S.S., Xu, J.: AutoG: A visual query autocompletion framework for graph databases. https://goo.gl/Xr9MRY (2016)

  44. Yi, P., Choi, B., Bhowmick, S.S., Xu, J.: AutoG: a visual query autocompletion framework for graph databases [demo]. PVLDB 9, 1505–1508 (2016)


  45. Yuan, D., Mitra, P.: Lindex: a lattice-based index for graph databases. VLDB J. 22, 229–252 (2013)



Acknowledgements

Peipei Yi and Byron Choi are partially supported by the HK-RGC GRF 12201315 and 12232716. Sourav S Bhowmick is supported by the Singapore MOE AcRF Tier-1 Grant RG24/12 and MOE AcRF Tier-2 Grant 2015-T2-1-040. Jianliang Xu is partially supported by the HK-RGC GRF 12244916 and 12200114.

Author information

Corresponding author: Peipei Yi.

Appendices

Appendix 1: Properties of \(c\)-prime features

This appendix presents the definition of frequent features and the anti-monotonicity and downward-closure properties of \(c\)-prime features.

Definition 12

(Frequent feature) Given a graph database D, a frequent feature f is a subgraph of the graphs in D and \(|D_f| \ge minSup\), where minSup is the minimum support of features. F is the frequent feature set of D with respect to minSup. \(\square \)
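To make the definition concrete, the following is a minimal sketch of the frequency check, assuming the database D is a list of labeled networkx graphs; the choice of networkx's VF2 matcher (a node-induced test) is an illustrative assumption, not the paper's implementation.

```python
from networkx.algorithms import isomorphism

def supporting_graphs(feature, database):
    """Return D_f: the graphs in the database that contain `feature` as a subgraph."""
    hits = []
    for g in database:
        gm = isomorphism.GraphMatcher(
            g, feature,
            node_match=isomorphism.categorical_node_match("label", None),
            edge_match=isomorphism.categorical_edge_match("label", None),
        )
        # Node-induced subgraph test; an edge-induced (monomorphism) test may be
        # closer to the paper's setting, but this suffices for illustration.
        if gm.subgraph_is_isomorphic():
            hits.append(g)
    return hits

def is_frequent(feature, database, min_sup):
    """Definition 12: f is frequent iff |D_f| >= minSup."""
    return len(supporting_graphs(feature, database)) >= min_sup
```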

Proposition 3

(Anti-monotonicity property) If f is a \(c\)-prime feature and \(f'\) is a subgraph of f (i.e., \(f' \subseteq _{\lambda }f\)), then \(f'\) is a \(c\)-prime feature. \(\square \)

Proof sketch It is obvious that \(f'\) is a frequent subgraph. Consider any \(f'_i\) and \(f'_j\) such that \(f' = \mathsf {compose}(f'_i, f'_j, \mathsf {cs}(f'_i, f'_j), \lambda '_i, \lambda '_j)\). Such a composition is possible because, by the a priori property of frequent subgraphs, \(f'_i\), \(f'_j\) and \(\mathsf {cs}_k(f'_i, f'_j)\) are frequent and can therefore be used to compose queries. Since f is connected, we can always find connected supergraphs of \(f'_i\) and \(f'_j\), call them \(f_i\) and \(f_j\), such that \(f = \mathsf {compose}(f_i,\, f_j,\, \mathsf {cs}(f_i, f_j),\, \lambda _i,\, \lambda _j)\). Again, by the a priori property of frequent subgraphs, \(f_i\) and \(f_j\) are frequent. Therefore, for each way of constructing \(f'\), we can always determine a corresponding way to construct f. Hence, the composability of f is larger than or equal to that of \(f'\). Since f is \(c\)-prime, \(f'\) is \(c\)-prime. \(\square \)

In the context of \(c\)-prime features, the downward-closure property is simply a restatement (the contrapositive) of the anti-monotonicity property.

Proposition 4

(Downward-closure property) If f is a non-\(c\)-prime feature, and \(f''\) is a frequent feature and a supergraph of f (i.e., \(f \subseteq _{\lambda }f''\)), then \(f''\) is a non-\(c\)-prime feature. \(\square \)

Proof sketch We can apply the same argument as in the proof sketch of Proposition 3. Since \(f''\) is a supergraph of f, the composability of \(f''\) is larger than or equal to that of f. Since f is not \(c\)-prime, \(f''\) is not \(c\)-prime. \(\square \)

Appendix 2: Analysis of the ranked subgraph query suggestion (Rsq) problem

Theorem 2

The Rsq problem is NP-hard. \(\square \)

Fig. 18: An illustration of the query suggestions generated from an Mis instance

Proof sketch [Mis] Given a graph G = (V, E), where V and E are the sets of vertices and edges, respectively, an independent set (Is) is a set of vertices \(V' \subseteq V\) such that there do not exist \(v_i, v_j \in V'\) with (\(v_i\), \(v_j\)) \(\in \) E. The Mis problem is to determine an Is \(V'\) such that no Is of larger size exists.

Reduction We start with an instance of the Mis problem, G = (V, E). We construct a query suggestion set \(Q_V\) that contains one query suggestion \(q_v\) for each vertex v in V, such that each query \(q_v\) in \(Q_V\) has exactly \(|Q_V|\) edges. The structure of each query is a star with a common first edge (\(v_a, v_b\)); the other edges encode the following (also illustrated in Fig. 18):

  1. If (\(v_i\), \(v_j\)) \(\in \) E, then (\(v_b\), \(v_{i,j}\)) is an edge of both \(q_{v_i}\) and \(q_{v_j}\). That is, the \(\mathsf {mces}\) between \(q_{v_i}\) and \(q_{v_j}\) is {(\(v_a, v_b\)), (\(v_b\), \(v_{i,j}\))}.

  2. Otherwise, an edge (\(v_b\), \(v^i_{i,j}\)) is introduced to \(q_{v_i}\) and (\(v_b\), \(v^j_{i,j}\)) is introduced to \(q_{v_j}\). Then, the \(\mathsf {mces}\) between \(q_{v_i}\) and \(q_{v_j}\) is {(\(v_a, v_b\))} only.

The maximum independent set has size at most |V|/2. Therefore, we invoke Rsq on \(Q_V\), with k ranging from 1 to |V|/2 and \(\alpha \) set to 0; that is, only the diversity component of the ranking function is considered.

Case (1) Suppose \(Q_v'\) is a solution of Rsq. If for some \(q_i\), \(q_j\) \(\in \) \(Q_v'\), \(\mathsf {mces}\)(\(q_i\), \(q_j\)) is not just (\(v_a\), \(v_b\)), then there does not exist \(Q_v''\) such that \(|Q_v'|\) = \(|Q_v''|\) and, for all \(q_i'\), \(q_j'\) \(\in \) \(Q_v''\), \(\mathsf {mces}\)(\(q_i'\), \(q_j'\)) is just (\(v_a\), \(v_b\)). This is because such a \(Q_v''\) would have been ranked higher according to \(\mathsf {util}\) (i.e., \(Q_v''\) is more diversified than \(Q_v'\)). By the reduction above, there is then an edge (\(v_i\), \(v_j\)) in E, so the corresponding \(V'\) is not an Is.

Case (2) Suppose \(Q_v'\) is a solution of Rsq and for all \(q_i\), \(q_j\) \(\in \) \(Q_v'\), \(\mathsf {mces}\)(\(q_i\), \(q_j\)) is (\(v_a\), \(v_b\)). By the reduction above, there is no edge between \(v_i\) and \(v_j\), for all \(v_i\), \(v_j\). Thus, the corresponding \(V'\) is an Is.

Putting these together, let \(Q_v'\) be the largest set, returned by invoking Rsq for k ranging from 1 to |V|/2, whose corresponding \(V'\) is an Is. Suppose \(Q_V''\) is returned by Rsq and is larger than \(Q_v'\). Then, \(Q_V''\) belongs to Case (1). According to Case (1), no Is of size \(|Q_V''|\) can be obtained. Therefore, \(V'\) is the maximum independent set. \(\square \)
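For illustration, the reduction gadget above can be materialized as edge lists: one star query per vertex over a shared first edge, with a shared or private branch vertex depending on adjacency. The vertex naming in this sketch is ours, chosen only to mirror the notation above.

```python
from itertools import combinations

def build_query_suggestions(V, E):
    """Build Q_V: one star query per vertex, each with exactly |V| edges."""
    E = {frozenset(e) for e in E}
    queries = {v: [("v_a", "v_b")] for v in V}               # common first edge
    for vi, vj in combinations(sorted(V), 2):
        if frozenset((vi, vj)) in E:
            shared = f"v_{vi},{vj}"                          # case 1: shared branch vertex, so
            queries[vi].append(("v_b", shared))              #   mces(q_vi, q_vj) =
            queries[vj].append(("v_b", shared))              #   {(v_a, v_b), (v_b, v_ij)}
        else:
            queries[vi].append(("v_b", f"v^{vi}_{vi},{vj}")) # case 2: private branches, so
            queries[vj].append(("v_b", f"v^{vj}_{vi},{vj}")) #   mces = {(v_a, v_b)} only
    return queries

# Example: build_query_suggestions({1, 2, 3}, {(1, 2)}) yields three 3-edge star queries.
```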

Appendix 3: Additional experiments

1.1 Suggestion qualities with different underlying definitions in AutoG

Quality metrics under other \(\mathsf {mces}\) distance metrics For illustration purposes, the paper adopts the maximum common edge subgraph (\(\mathsf {mces}\)) for \(\mathsf {dist}\) (see BS in Definition 8). It is possible to plug other edge-based distance metrics into the AutoG framework (without modifications) to represent the “intra-dissimilarity” between a pair of suggestions. In this experiment, we report the quality metrics of the suggestions when AutoG uses two other \(\mathsf {mces}\)-based distance metrics.

The two distance metrics, denoted WSKR and FV, are presented in [22, 36, 38]. The first distance metric (WSKR) uses the size of the graph union, instead of the size of the larger graph, to account for variations in the graph sizes [38]:

$$\begin{aligned} \mathsf {dist}_{_{WSKR}}(g_1, g_2) = 1 - \frac{|\mathsf {mces}(g_1, g_2)|}{|g_1| + |g_2| - |\mathsf {mces}(g_1, g_2)|} \end{aligned}$$
(1)

The second distance metric (FV) is based on the maximum common subgraph and minimum common supergraph of the two graphs. Therefore, it takes into consideration both the superfluous and the missing structural information of the two graphs [36]. We normalize FV by dividing it by \(|g_1| + |g_2|\).

$$\begin{aligned} \mathsf {dist}_{_{FV}}(g_1, g_2) = \frac{|g_1| + |g_2| - 2|\mathsf {mces}(g_1, g_2)|}{|g_1| + |g_2|} \end{aligned}$$
(2)
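Both distances reduce to simple arithmetic over three edge counts once \(|\mathsf {mces}(g_1, g_2)|\) is known. The helpers below are a sketch; computing the \(\mathsf {mces}\) size itself is assumed to be done elsewhere.

```python
def dist_wskr(g1_size, g2_size, mces_size):
    """Eq. (1): mces distance normalized by the size of the graph union [38]."""
    return 1.0 - mces_size / (g1_size + g2_size - mces_size)

def dist_fv(g1_size, g2_size, mces_size):
    """Eq. (2): FV distance [36], normalized by |g1| + |g2|."""
    return (g1_size + g2_size - 2 * mces_size) / (g1_size + g2_size)

# Example with |g1| = 8, |g2| = 6, |mces| = 4:
#   dist_wskr -> 1 - 4/10 = 0.6,  dist_fv -> (14 - 8)/14 ≈ 0.43
```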
Table 12 #AutoG versus \({\small \delta }\) (PubChem)
Table 13 #AutoG versus |q| (PubChem when \({\small \delta }\)=3)

AutoG achieved similarly stable qualities under these different distance metrics and parameter settings. For brevity, we report the suggestion qualities in terms of #AutoG and TPM only. Tables 12, 13 and 14 show that, regardless of the distance adopted, the suggestions were used in multiple iterations of query formulation. BS performed slightly better than FV and WSKR. Tables 15, 16 and 17 show similar trends for BS, FV and WSKR, while BS almost always performed the best. Therefore, users may pick the distance metric that is most intuitive for their applications.

Table 14 #AutoG versus k (PubChem when \({\small \delta }\) = 3)
Table 15 TPM versus \({\small \delta }\) (PubChem)
Table 16 TPM versus |q| (PubChem when \({\small \delta }\)=3)
Table 17 TPM versus k (PubChem when \({\small \delta }\) = 3)

Suggestion qualities of the \(c\)-prime features on top of gIndex In the paper, \(c\)-prime features are defined (Definition 7) with frequent features (Definition 12). As discussed, \(c\)-prime features are orthogonal to other features. As a proof of concept for integrating other features into AutoG, we implemented \(c\)-prime features on top of the discriminative features proposed in gIndex, the seminal work on using features for subgraph query processing.

In a nutshell, we counted the composability of the discriminative features and indexed those that are \(c\)-prime. We used the implementation from iGraph [16] and adopted the default parameter values for gIndex: the support threshold was set to 10%, the maximum feature size maxL was 10, and the discriminative ratio \(\gamma _{min}\) was 2. We implemented the same size-increasing function as in [42]. Under this setting, we obtained 2370 discriminative features from the PubChem dataset. We then constructed the FDag index as before.
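The selection step can be summarized by a short filter over the discriminative feature set, assuming features are networkx graphs; `composability_of` is a hypothetical stand-in for the composability counting performed during the enumeration in Appendix 4, not a gIndex API.

```python
def select_c_prime(features, composability_of, c):
    """Keep the features whose composability does not exceed c.
    composability_of(f, smaller_c_prime) is a hypothetical callback that returns the
    number of ways f can be composed from the smaller c-prime features found so far."""
    c_prime = []
    for f in sorted(features, key=lambda g: g.number_of_edges()):
        if composability_of(f, c_prime) <= c:
            c_prime.append(f)
    return c_prime
```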

We compared the suggestion qualities of AutoG using the proposed \(c\)-prime features and the \(c\)-prime features built on top of discriminative features, simply denoted as gIndex. We report the suggestion qualities from simulations in Tables 18, 19 and 20. The results show that such suggestions were still somewhat useful. As expected, when compared to the results in Tables 4, 5 and 6, the proposed \(c\)-prime features gave clearly higher-quality suggestions than those obtained using discriminative features. More specifically, when we varied \(\delta \), the average values of 1 Hit (%), #AutoG, AutoG |E| and TPM (%) of our proposed \(c\)-prime features were 99%, 2.8, 3.6, and 48%, whereas those of gIndex were 46%, 0.5, 1.2, and 19%. When we varied |q|, those quality metrics of the proposed features were 99%, 3.2, 5.0, and 44%, whereas those of gIndex were 76%, 1.1, 3.0, and 27%. Similarly, when we varied k, those quality metrics of the proposed features were 93%, 1.9, 2.7, and 36%, whereas those of gIndex were 49%, 0.5, 1.4, and 21%. The reason is simple: the design goal of gIndex is to efficiently prune non-answer graphs of a query, not query autocompletion.

Table 18 Quality metrics versus \({\small \delta }\) (PubChem, using gIndex)
Table 19 Quality metrics versus |q| (PubChem when \({\small \delta }\)=3, using gIndex)
Table 20 Quality metrics versus k (PubChem when \({\small \delta }\) = 3, using gIndex)

1.2 Online performance breakdowns

Next, we present a detailed performance study of each major step of the online processing.

Query decomposition We report that the query decomposition phase always took at most a few milliseconds, for all queries and datasets under all the aforementioned parameter settings.

Local ranking The runtimes of the local ranking phase under various parameter settings are presented in Figs. 19, 20, 21, 22, 23 and 24. Comparing Fig. 19 to Fig. 10, we observed that local ranking was the bottleneck of online processing. The reason is that local ranking compares many pairs of candidate suggestions, even though they are indexed by FDag. Consistent results can be observed from the experiments in which the parameters \(m\), \(\gamma \), \(\alpha \) and top-k were varied (e.g., Figs. 20 and 11), except in the following situation: under the default setting, the runtimes of the local part increased slightly with k, as shown in Fig. 23. We also noted from Fig. 24 that the local-ranking part of the online processing time increased sub-linearly with |q|.

Fig. 19: Local—default
Fig. 20: Local—time versus m
Fig. 21: Local—time versus \(\gamma \)
Fig. 22: Local—time versus \(\alpha \)
Fig. 23: Local—time versus k
Fig. 24: Local—time versus |q|

Global ranking The runtimes of the global ranking phase under a large variety of parameter settings are reported in Figs. 25, 26, 27, 28, 29 and 30. When varying m, \(\gamma \) and \(\alpha \), the times were quite stable. We noted from the experimental results that the global diversification took \({<}\)60 ms under the default setting. Fig. 29 shows that its runtimes were roughly linear in the number of suggestions k. Fig. 30, however, shows that the runtimes increased sharply as the query size |q| increased. There are two main reasons: (1) when |q| increases, the queries may be decomposed into large features, and computing the \(\mathsf {mces}\) of large features online is known to be costly; and (2) large queries may be decomposed into more features, which in turn results in more \(\mathsf {mces}\) calls.

Fig. 25: Global—default
Fig. 26: Global—time versus m
Fig. 27: Global—time versus \(\gamma \)
Fig. 28: Global—time versus \(\alpha \)
Fig. 29: Global—time versus k
Fig. 30: Global—time versus |q|

From the last experiment, we found that the cost of online processing was determined by either the local or the global ranking, which are in turn dependent on |q|. We further illustrate the relation between the two by reporting the performance breakdown on all datasets (see Figs. 31, 32, 33, 34). For Aids and PubChem, we observed that the runtimes of global ranking increased much faster than those of local ranking, due to the costly online \(\mathsf {mces}\) computations. For Syn-1 and Syn-2, the runtimes of the two phases exhibited linear trends.

Fig. 31: Performance breakdown (Aids)—vary |q|
Fig. 32: Performance breakdown (Syn-1)—vary |q|
Fig. 33: Performance breakdown (Syn-2)—vary |q|
Fig. 34: Performance breakdown (PubChem)—vary |q|

Selectivity estimation We evaluated various facets of selectivity estimation in AutoG. The results are summarized in Figs. 35 and 36. We noted that the estimation results clearly formed three distinct categories: “mis-estimated” denotes queries that were estimated to be non-empty but are in fact empty; “large error” denotes queries whose estimation errors were larger than 100%; and “small error” denotes the remaining queries. Fig. 35 shows the percentages for each of the three categories under different values of the sampling step \(m\). About 80% of the queries were estimated correctly. Fig. 36 reports that the mean estimation errors in the “small error” category were all below 1.2%. Thus, the selectivity estimation was accurate.
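The three-way bucketing can be reproduced from pairs of estimated and actual result sizes; the thresholds follow the definitions above, while the function name and data layout are illustrative assumptions.

```python
def classify_estimates(pairs):
    """pairs: iterable of (estimated, actual) result-set sizes for the test queries."""
    buckets = {"mis-estimated": 0, "large error": 0, "small error": 0}
    for estimated, actual in pairs:
        if actual == 0 and estimated > 0:
            buckets["mis-estimated"] += 1       # estimated non-empty, actually empty
        elif actual > 0 and abs(estimated - actual) / actual > 1.0:
            buckets["large error"] += 1         # relative error above 100%
        else:
            buckets["small error"] += 1
    return buckets
```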

Fig. 35: Percentages of estimates in each error category (PubChem)—vary m
Fig. 36: The errors of the estimates in the “small error” category (PubChem)—vary m

Effectiveness of structural trimming for \(\mathsf {mces}\) Fig. 37 reports the average speedups due to the trimming technique introduced in Sect. 4.3.2 on all datasets. There were at least three orders of magnitude of speedup. We remark that some large queries could not be finished without trimming; these queries were excluded. Recall that \(\mathsf {mces}\) was a performance bottleneck. With the trimming technique, Fig. 38 shows that the costs of \(\mathsf {mces}\) were around 30% (respectively, 60%) of the global ranking costs for the synthetic datasets (respectively, the real datasets).

Fig. 37: Effects of the trimming technique on \(\mathsf {mces}\)
Fig. 38: \(\mathsf {mces}\) cost in global ranking (with trimming)

Appendix 4: The FDAG construction

Algorithm 3: The FDag construction

In this appendix, we present the construction algorithm of FDag (shown in Algorithm 3). The details of Algorithm 3 are as follows. First, we sort F in ascending order of edge count (Line 2). Then, we process the features one by one (Line 3): we create a node \(v_f\) in FDag for each \(f \in F\) and compute the automorphism relation of f (\(A_f\)). For index nodes \(v_f\) and \(v_{f'}\), we perform a subgraph isomorphism test between f and \(f'\) (Line 9). If a subgraph isomorphism relation exists, we add the edge (\(v_f\), \(v_{f'}\)) to FDag and associate the subgraph embeddings with the edge (Lines 10–11). Finally, we generate \(\mathsf {anc}\) and \(\mathsf {des}\) from the FDag structure (Line 12). Next, Line 13 enumerates the possible compositions of each feature pair and determines intermediate results for computing the structural difference between compositions (as presented in Algorithm 4).
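A condensed sketch of this construction is given below, assuming features are labeled networkx graphs and delegating the Line 9 test to its VF2 matcher; the automorphism computation and the anc/des closures are elided, and the data layout is ours rather than the paper's.

```python
from networkx.algorithms import isomorphism

def build_fdag(features):
    """Sketch of Algorithm 3: nodes are the features; an edge (v_f, v_f') records f ⊆ f'."""
    feats = sorted(features, key=lambda f: f.number_of_edges())              # Line 2
    edges, embeddings = [], {}
    for i, f in enumerate(feats):                                            # Line 3
        # (the automorphism relation A_f of f would be computed here)
        for fp in feats[i + 1:]:
            gm = isomorphism.GraphMatcher(
                fp, f, node_match=isomorphism.categorical_node_match("label", None))
            if gm.subgraph_is_isomorphic():                                  # Line 9
                edges.append((f, fp))                                        # Line 10
                embeddings[(f, fp)] = list(gm.subgraph_isomorphisms_iter())  # Line 11
    # Line 12: anc/des can be derived from the transitive closure of `edges`.
    return feats, edges, embeddings
```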

Feature pair composition enumeration Algorithm 4 requires further elaboration. It takes FDag and the maximum increment size as input, and outputs a set of feature pair compositions and some auxiliary data (i.e., \(f_{ij}\) in Line 5 and \(F_l\) in Line 12). We highlight some important steps before the details. (i) Line 5 shows that a query composition \((f_i, f_j, \mathsf {cs}, \lambda _i, \lambda _j)\) forms \(f_{ij}\), and \(f_{ij}\) itself can be a feature. We record it in \(\zeta \); we count the occurrences of \(f_{ij}\) in all \(\zeta \)s and obtain \(f_{ij}\)’s composability. (ii) Even if \(f_{ij}\) is not a feature, it may contain features other than \(f_i\) and \(f_j\); we record such features in \(F_l\). It is known that in feature-based query processing, the more features (\(F_q\)) the query has, the more accurate the candidate query answer set (\(D_q\)) is, because \(D_q = \cap _{f \in F_q} D_{f}\). (iii) The parameter \(\delta \) is the threshold on the size increment of a query suggestion. The number of candidate suggestions increases exponentially with \(\delta \). Therefore, to ensure an interactive response, we may set a modest \(\delta \) (its default is 5 in our experiments).

Algorithm 4: Feature pair composition enumeration

The details of Algorithm 4 are as follows. It starts the enumeration from a feature (\(\mathsf {cs}\)) and composes larger query graphs from two descendants (\(f_i\) and \(f_j\)) of \(\mathsf {cs}\) (Lines 1–2). Line 3 checks whether the increment is smaller than \(\delta \). In Line 4, we iterate through each embedding of \(\mathsf {cs}\) in \(f_i\) and in \(f_j\), respectively. In Line 5, we compose a larger graph (denoted as \(f_{ij}\)) from \(f_i\) and \(f_j\) via \(\mathsf {cs}\). If \(f_{ij}\) is a feature, then Algorithm 4 has just detected one possible way to compose \(f_{ij}\). In Lines 6–7, we employ the techniques in Sect. 4.2 to prune empty queries.

Lines 8–15 compute further information to optimize the ranking procedure (Sect. 4.3). Lines 8–9 check whether \(f_{ij}\) is a feature. If so, \(f_{ij}\) is recorded for determining its composability. Otherwise, Lines 10–12 determine whether there are any features (other than \(f_i\) and \(f_j\)) embedded in \(f_{ij}\), specifically \(F_l = \{ f \in F \mid \mathsf {cs}\subseteq _{\lambda }f \wedge f \subseteq _{\lambda }f_{ij}\}\); \(F_l\) is the set of features contained in \(f_{ij}\). As presented in Sect. 4.3.2, the selectivity of a query is estimated from the intersection of the candidate answer sets of the query’s features. When \(F_l\) and \(f_{ij}\) are used in this estimation, it is more accurate than the estimation with \(f_{i}\) and \(f_{j}\) only. Lines 13–15 compute the auxiliary structures that optimize the online \(\mathsf {mces}\) distance computation (Sect. 4.3.2).
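As a tiny illustration of the candidate intersection used in this estimation, intersecting more feature candidate sets yields a tighter \(D_q\); the graph identifiers below are made up for the example.

```python
def candidate_set(feature_hits):
    """D_q = the intersection of the candidate sets D_f of the query's features."""
    return set.intersection(*feature_hits) if feature_hits else set()

# With f_i and f_j only:         candidate_set([{1, 2, 3, 5}, {2, 3, 5, 8}]) == {2, 3, 5}
# Adding a feature from F_l:     candidate_set([{1, 2, 3, 5}, {2, 3, 5, 8}, {2, 5, 9}]) == {2, 5}
```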


About this article


Cite this article

Yi, P., Choi, B., Bhowmick, S.S. et al. AutoG: a visual query autocompletion framework for graph databases. The VLDB Journal 26, 347–372 (2017). https://doi.org/10.1007/s00778-017-0454-9
