
Detecting subgraph isomorphism with MapReduce

The Journal of Supercomputing

Abstract

In recent years, the MapReduce framework has become one of the most popular parallel computing platforms for processing big data. MapReduce is used by companies such as Facebook, IBM, and Google to process or analyze massive data sets. Since the approach is frequently used in industrial solutions, algorithms based on the MapReduce framework have gained significant attention within the scientific community. Subgraph isomorphism is a fundamental graph theory problem, and finding small patterns in large graphs is a core challenge in the analysis of applications with big data sets. This paper introduces two novel algorithms capable of finding matching patterns in arbitrarily large graphs. The algorithms are designed to exploit the straightforward parallelization offered by the MapReduce framework. The approaches are evaluated with regard to their space and memory requirements. The paper also describes the applied data structure and presents a formal analysis of the algorithms.



Notes

  1. The software package is published at https://www.aut.bme.hu/Upload/Pages/Research/VMTS/Papers/MRSI_Implementation.zip.


Acknowledgments

This work was partially supported by the European Union and the European Social Fund through project FuturICT.hu (Grant No.: TAMOP-4.2.2.C-11/1/KONV-2012-0013) organized by VIKING Zrt. Balatonfüred. This work was partially supported by the Hungarian Government, managed by the National Development Agency, and financed by the Research and Technology Innovation Fund (Grant No.: KMR_12-1-2012-0441).

Author information


Corresponding author

Correspondence to Péter Fehér.

Appendix A: Detailed Proofs

A.1 Proof of Proposition 1

Proof

Let \(R(n_{S}, n_{T}, m):G_{S}\times G_{T} \rightarrow \{0,1\}\) be a predicate function implementing the feasibility rules of the VF2 algorithm. R takes nodes \(n_{S}\in V_{S}\), \(n_{T}\in V_{T}\), and partial mapping m as arguments, and returns true if \(n_{S}\) and \(n_{T}\) form a match that can lead to a complete match with regard to the partial mapping m. Let E(cp) be a function that emits the candidate node pair cp, given as a \({<}cp_{S}, cp_{T}, m{>}\) object, where \(cp_{S} \in V_{S}\), \(cp_{T} \in V_{T}\), and m is the partial mapping. Suppose that the first target graph node to find a pair for is \(x|x \in V_{T}\). The first mapper function generates \(cp_{1}\) of the form \({<}n, x, \emptyset {>},\forall \; n \in V_{S}\). In this case, the behavior of the first reducer is: \(E(cp_{1})\Leftrightarrow R(v_{S},v_{T},m) = 1 | v_{S}\in cp_{1,S} \wedge v_{T}\in cp_{1,T}\). In this manner, exactly those candidate node pairs are emitted that satisfy the feasibility rules.

In the second iteration, where the target graph node to find a pair for is \(y|y \in V_{T} \wedge x \ne y\), the mapper functions generate all \(cp_{2}: {<}s,y,m{>}, \forall \;s \in T^{\mathrm{In}} \cup T^{\mathrm{Out}}\). Then, the reducer emits \(E(cp_{2})\Leftrightarrow R(v_{S},v_{T},m) = 1 | v_{S}\in cp_{2,S} \wedge v_{T}\in cp_{2,T}\). This means that all emitted candidate node pairs are valid subgraphs with regard to the x and y nodes. Moreover, since all permutations of the nodes were checked (more precisely, those that met the feasibility rules), all possible partial mappings are emitted.

After the kth iteration, the output of the reducer contains exactly those candidate node pairs where \(R(v_{S},v_{T},m) = 1 | v_{S}\in cp_{k,S} \wedge v_{T}\in cp_{k,T}\) and the m partial mapping contains exactly k node pairs. If \(k=|V_{T}| \rightarrow \forall \;n_{T} \in cp_{k}|n_{T} \in V_{T}\), that is, all candidate node pairs contain each and every node of the target graph. Since the reducers emit exactly those candidate node pairs that satisfy the conditions of subgraph isomorphism, the \(\mathrm{MRSI}_{H {}}\) algorithm finds all subgraphs in \(G_{S}\) that are isomorphic to \(G_{T}\). \(\square \)
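The iteration scheme used in the proof can be illustrated with a minimal, self-contained sketch. This is not the paper's Hadoop implementation: the toy graphs, the fixed node order, and the simplified `feasible` predicate are illustrative assumptions (the real R implements the full VF2 feasibility rules, and the map and reduce phases run distributed rather than as plain loops):

```python
# Toy source graph: a triangle {0,1,2} with a pendant node 3 (undirected,
# stored as adjacency sets); target graph: a single edge.
source = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
target = {0: {1}, 1: {0}}
order = [0, 1]  # predefined order of the target graph nodes

def feasible(n_s, n_t, mapping):
    """Simplified stand-in for the VF2 feasibility predicate R(n_s, n_t, m)."""
    if n_s in mapping.values():
        return False  # injectivity: n_s is already used
    # every already-mapped target neighbor of n_t must map to a neighbor of n_s
    return all(mapping[t] in source[n_s] for t in target[n_t] if t in mapping)

mappings = [{}]  # start from the empty partial mapping
for n_t in order:  # one MapReduce iteration per target graph node
    # "mapper": propose candidate pairs; "reducer": keep only feasible ones
    candidates = [(n_s, n_t, m) for m in mappings for n_s in source]
    mappings = [{**m, n_t: n_s} for n_s, n_t, m in candidates
                if feasible(n_s, n_t, m)]

print(len(mappings))
```

On this toy input, each of the four undirected source edges yields two ordered matches of the target edge, so eight complete mappings are emitted.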

A.2 Proof of Proposition 2

Proof

Let \(R(n_{S}, n_{T}, m):G_{S}\times G_{T} \rightarrow \{0,1\}\) be a predicate function implementing the feasibility rules of the VF2 algorithm. R takes nodes \(n_{S}\in V_{S}\), \(n_{T}\in V_{T}\), and partial mapping m as arguments, and returns true if \(n_{S}\) and \(n_{T}\) form a match that can lead to a complete match with regard to the partial mapping m. Let E(cp) be a function that takes the candidate node pair cp as argument, given as a \({<}cp_{S}, cp_{T}, m{>}\) object, where \(cp_{S} \in V_{S}\), \(cp_{T} \in V_{T}\), and m is the partial mapping. E(cp) emits the partial mapping m extended with the \({<}cp_{S}, cp_{T}{>}\) node pair.

Suppose that the first node to find a pair for in the target graph is \(x|x \in V_{T}\). In the first iteration, the \(M_{S {}}^{'}\) instances generate \(E(cp_{1})\Leftrightarrow R(v_{S},v_{T},m) = 1 | v_{S}\in cp_{1,S} \wedge v_{T}\in cp_{1,T}\), where \(cp_{1} = {<}n, x, \emptyset {>},\forall \;n \in V_{S}\). In this manner, exactly those partial mappings are emitted that satisfy the feasibility rules. Next, the \(R_{S {}}^{'}\), \(M_{S {}}^{''}\), and \(R_{S {}}^{''}\) instances reorder the partial mappings. After the \(R_{S {}}^{''}\) phase, a partial mapping appears in a row if the node described in the row has connecting edges to at least one of the nodes contained in the partial mapping. Partial mappings are omitted only in case of duplication; thus, the number of distinct partial mappings does not change.

Suppose that the next target graph node to find a pair for is \(y|y \in V_{T}\). In the second iteration, the \(M_{S {}}^{'}\) instances generate \(E(cp_{2})\Leftrightarrow R(v_{S},v_{T},m) = 1 | v_{S}\in cp_{2,S} \wedge v_{T}\in cp_{2,T}\), where \(cp_{2} = {<}s, y, m{>},\forall \;s \in T^{\mathrm{In}}_{m} \cup T^{\mathrm{Out}}_{m}\). This means that all emitted partial mappings are valid subgraphs with regard to the x and y nodes. Moreover, since all permutations of the nodes where the x node is successfully matched were checked, all possible partial mappings are emitted.

After the kth iteration, the output of the \(M_{S {}}^{'}\) contains exactly those partial mappings where \(R(v_{S},v_{T},m) = 1 | v_{S}\in cp_{k,S} \wedge v_{T}\in cp_{k,T}\) and the m partial mapping contains exactly \(k - 1\) node pairs. If \(k=|V_{T}| \rightarrow \forall \;n_{T} \in cp_{k} \cup m_{T}|n_{T} \in V_{T}\), where \(m_{T}\) contains the target nodes already in the partial mapping. This means that the partial mappings emitted by the kth \(M_{S {}}^{'}\) contain each and every node of the target graph. Since there is no other target node to find a pair for, after the kth \(M_{S {}}^{'}\), the \(\mathrm{MRSI}_{S {}}\) algorithm finds all subgraphs in \(G_{S}\) that are isomorphic to \(G_{T}\). \(\square \)

A.3 Proof of Proposition 3

Proof

As described in Sect. 4.3, the mapper functions are responsible for generating all possible candidate node pairs that can possibly be added to the given partial mappings. The first \(M_{H {}}\) phase is a special case, because this time the set containing the candidate nodes is empty. At the first iteration, the \(M_{H {}}\) functions suggest \(\forall \;m \in V_{S}\) as the candidate node for the first target graph node. This results in m new lines in the output file, where each line contains the identifier of the suggested node (let us ignore the three semicolons in this case). Thus, after the first iteration:

$$\begin{aligned} N_{1}^{M_{H {}}{}} = m \quad L_{1}^{M_{H {}}{}} = f \end{aligned}$$

The reducer phase of the \(\mathrm{MRSI}_{H {}}\) algorithm checks whether the candidate node pairs can be added to the partial match considering the different feasibility rules. The output file contains only the rows related to the candidate nodes where this examination was successful. In the worst-case scenario, this examination always returns true; therefore, the number of rows in the output file does not differ from the number of rows generated by the \(M_{H {}}\) functions. In case a candidate node pair can be added to the partial mapping, the \(R_{H {}}\) function also refreshes the set containing the related candidate nodes with the neighbors of the newly added node, as 9b denotes. Since the source graph is a complete graph, each node appears in either the set of the candidate nodes or in the partial mapping. In addition, the newly added node also appears as the key of the line. This leads to the following:

$$\begin{aligned} N_{1}^{R_{H {}}{}} = m \quad L_{1}^{R_{H {}}{}} = (m + 1) \cdot f \end{aligned}$$

After the first iteration, the \(M_{H {}}\) functions generate the candidate nodes for all partial mappings. This means that the mappers create \(m - 1\) new rows for each row, because the cardinality of the candidate node set equals the number of nodes not contained in the partial mapping. Since the algorithm orders the target graph nodes based on a predefined algorithm, there is no need to include them in the output file. The only difference between a line generated by an \(M_{H {}}\) function and one created by an \(R_{H {}}\) is the key value. This means that the length of the row remains the same:

$$\begin{aligned} N_{2}^{M_{H {}}{}} = m \cdot (m - 1) \quad L_{2}^{M_{H {}}{}} = (m + 1) \cdot f \end{aligned}$$

The second \(R_{H {}}\) checks the new candidate nodes and refreshes the candidate node sets. Since in the worst-case scenario all candidates satisfy the feasibility rules, the number of the emitted lines does not change. The length of a row does not change either, because the only difference between the input and the output is that a candidate node is moved from the candidate set to the partial mapping. This leads to the following:

$$\begin{aligned} N_{2}^{R_{H {}}{}} = m \cdot (m - 1) \quad L_{2}^{R_{H {}}{}} = (m + 1) \cdot f \end{aligned}$$

In general, the kth mapper phase generates, for every \(m \in V_{G_{S}}\), all ordered sets in which the partial mapping consists of \(k - 1\) elements and does not contain the given m. Each of these rows contains all nodes of the source graph, as part of either the partial mapping or the candidate node set.

$$\begin{aligned} N_{k}^{M_{H {}}{}} = \frac{m!}{(m - k)!} \quad L_{k}^{M_{H {}}{}} = {\left\{ \begin{array}{ll} f &{} \quad \text {if } \; k = 1\\ (m + 1) \cdot f &{} \quad \text {otherwise} \end{array}\right. } \end{aligned}$$
(1)

Since in the worst-case scenario all candidate node pairs satisfy the feasibility rules, \(R_{H {}}\) emits all input rows; thus, the number of output rows remains the same. The length of the rows does not change either, since the candidate node set contains all neighbor nodes not contained in the partial mapping, which equals \(m - k\) in case of a complete graph, where the partial mapping consists of k elements.

$$\begin{aligned} N_{k}^{R_{H {}}{}} = \frac{m!}{(m - k)!} \quad L_{k}^{R_{H {}}{}} = (m + 1) \cdot f \end{aligned}$$
(2)

This means that if \(k > 1\), then \(N_{k}^{M_{H {}}{}} = N_{k}^{R_{H {}}{}}\) and \(L_{k}^{M_{H {}}{}} = L_{k}^{R_{H {}}{}}\). Based on (1) and (2), the size of the output file after the kth iteration is the following:

$$\begin{aligned} S_{k}^{M_{H {}}{}} = S_{k}^{R_{H {}}{}} = s + \frac{m!}{(m - k)!} \cdot (m + 1) \cdot f \end{aligned}$$
(3)

\(\square \)
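The counts in (1)–(3) can be checked numerically. The sketch below is illustrative only: the parameter values are hypothetical, and the source-graph term s is taken at its worst-case value \(m \cdot m \cdot f\):

```python
from math import factorial

def rows(m, k):
    """Worst-case number of rows after the k-th iteration: m!/(m-k)!."""
    return factorial(m) // factorial(m - k)

def size_H(m, k, f=1):
    """Worst-case output size of MRSI_H after the k-th iteration, Eq. (3),
    with s taken at its maximum m*m*f (complete source graph)."""
    s = m * m * f
    return s + rows(m, k) * (m + 1) * f

assert rows(5, 1) == 5    # first iteration: one row per source node
assert rows(5, 2) == 20   # ordered pairs of distinct nodes
print(size_H(5, 2))       # 25 + 20 * (5 + 1) * 1 = 145
```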

A.4 Proof of Proposition 4

Proof

Each \(R_{H {}}\) task obtains a key and all values related to this key. The worst-case scenario is when the \(M_{H {}}\) generates a line with this key for all possible k-element permutations. Since the given node cannot be part of the partial mapping, the number of these rows equals \(\frac{(m-1)!}{(m-k)!}\). Based on (2), the size of each row is less than or equal to \((m+1) \cdot f\). There is also a row representing the connections of the given node; its size is \(m \cdot f\). Therefore, the memory used by a single \(R_{H {}}\) is no more than:

$$\begin{aligned} \Delta _{k}^{R_{H {}}{}}{} = m \cdot f + \frac{(m-1)!}{(m-k)!} \cdot (m+1) \cdot f \end{aligned}$$
(4)

\(\square \)

A.5 Proof of Proposition 5

Proof

As already mentioned in Sect. 4.4, \(M_{S {}}^{'}\) always checks whether the given candidate node can be mapped to the next target node. The first step is always special, since there are no candidates prepared to be examined. Therefore, in this case, all nodes of the source graph are examined. The output of this mapper contains the source graph as separate rows and a new row for each successful match. The rows representing the successful matches consist of the matched node as the key value and the current state of the partial mapping.

$$\begin{aligned} N_{1}^{M_{S {}}^{'}{}} = m \quad L_{1}^{M_{S {}}^{'}{}} = (k+1) \cdot f \end{aligned}$$

If \(k > 1\), then the lines to be processed by the \(M_{S {}}^{'}\) contain not only the candidate node to examine, but all partial mappings the new candidate node might be added to. The mapper emits a certain number of new lines for each partial mapping. In this manner, the worst-case scenario is when the number of partial mappings contained in the given line is maximal. The partial mappings do not contain the candidate node and consist of \(k-1\) nodes. Therefore, the number of possible ordered sets equals \(\frac{(m-1)!}{(m-1-(k-1))!} = \frac{(m-1)!}{(m-k)!}\).

The emitted lines are obtained by the reducer functions sorted by their keys, and \(M_{S {}}^{''}\) generates the candidate nodes. This generation is based on the neighboring nodes. Therefore, the new partial mappings must be emitted as a value for each row contained by the partial mapping. This means that the same value is emitted k times (the partial mappings now contain the candidate node), since after the kth iteration the cardinality of the partial mapping is k. In the worst-case scenario, the maximal output size of the \(M_{S {}}^{'}\) after the kth iteration becomes the following:

$$\begin{aligned} N_{k}^{M_{S {}}^{'}{}} = k \cdot m \cdot \frac{(m-1)!}{(m-k)!} = k \cdot \frac{m!}{(m-k)!} \quad L_{k}^{M_{S {}}^{'}{}} = (k+1) \cdot f \end{aligned}$$
(5)

Given the number of emitted lines and the length of each line, in the worst-case scenario, the size of the output after the kth iteration can be calculated as follows:

$$\begin{aligned} S_{k}^{M_{S {}}^{'}{}} = s + k \cdot \frac{m!}{(m-k)!} \cdot (k+1) \cdot f \end{aligned}$$
(6)

\(\square \)
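As with (3), the bound (6) is easy to evaluate numerically; the snippet below is a sketch with illustrative values of m, k, and f, again taking s at its worst-case value:

```python
from math import factorial

def size_MS1(m, k, f=1):
    """Worst-case output size of M'_S after the k-th iteration, Eq. (6):
    k * m!/(m-k)! rows from Eq. (5), each (k+1)*f long, plus the graph s."""
    s = m * m * f
    return s + k * (factorial(m) // factorial(m - k)) * (k + 1) * f

print(size_MS1(6, 2))   # 36 + 2 * 30 * 3 * 1 = 216
```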

A.6 Proof of Proposition 6

Proof

\(R_{S {}}^{'}\) is responsible for generating the appropriate input for \(M_{S {}}^{''}\), which creates the possible candidate nodes for the partial mappings. Therefore, the reducer collects all partial mappings with the same key, and extends the row representing the node in the source graph with them. In this manner, each output value of \(M_{S {}}^{'}\) is moved to the row representing the exact node that was its key value. The maximum number of ordered sets with k elements that contain the key node is \(k \cdot \frac{(m-1)!}{(m-k)!}\). Each of these partial mappings contains k nodes.

The row also describes the given node; therefore, the identity and connections of the node are also represented in the row. In the worst-case scenario, the node has an edge to all other source graph nodes. Therefore, the size of the part of the row dedicated to the representation of the node is \(m \cdot f\). This means that the properties of the output are the following:

$$\begin{aligned} N_{k}^{R_{S {}}^{'}{}} = m \quad L_{k}^{R_{S {}}^{'}{}} = m \cdot f + k^{2} \cdot \frac{(m-1)!}{(m-k)!} \cdot f \end{aligned}$$

In the worst-case scenario, that is, when the maximal number of rows is emitted by the \(R_{S {}}^{'}\), the size of the output can be calculated as follows:

$$\begin{aligned} S_{k}^{R_{S {}}^{'}{}} = m \cdot m \cdot f + m \cdot k^{2} \cdot \frac{(m-1)!}{(m-k)!} \cdot f = s + k^{2} \cdot \frac{m!}{(m-k)!} \cdot f \end{aligned}$$
(7)

\(\square \)

A.7 Proof of Proposition 7

Proof

\(M_{S {}}^{''}\) obtains the output of the \(R_{S {}}^{'}\) and attempts to create new candidate nodes for each partial mapping found in the given row. The new candidate nodes are generated based on the neighbors of the row. Each neighbor node that is not contained in the partial mapping is emitted with the whole partial mapping as a new row. In this manner, each partial mapping of a row is emitted with a neighbor that is not yet contained. The number of such neighbors is \(m-k\). The emitted lines contain the candidate node and the partial mapping with k nodes. Therefore, the properties of the output are the following:

$$\begin{aligned} N_{k}^{M_{S {}}^{''}{}} = m \cdot (m-k) \cdot k \cdot \frac{(m-1)!}{(m-k)!} = k \cdot \frac{m!}{(m-k-1)!} \quad L_{k}^{M_{S {}}^{''}{}} = (k+1) \cdot f \end{aligned}$$
(8)

Based on these properties, in the worst-case scenario, where the number of partial mappings and the number of neighbor nodes are maximal, the size of the output can be calculated as follows:

$$\begin{aligned} S_{k}^{M_{S {}}^{''}{}} = s + k \cdot \frac{m!}{(m-k-1)!} \cdot (k+1) \cdot f \end{aligned}$$
(9)

\(\square \)
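The algebraic simplification in (8), \(m \cdot (m-k) \cdot k \cdot \frac{(m-1)!}{(m-k)!} = k \cdot \frac{m!}{(m-k-1)!}\), can be verified numerically; the values of m and k below are chosen arbitrarily for illustration:

```python
from math import factorial

m, k = 7, 3
# left-hand side of Eq. (8): m rows, (m-k) new neighbors each,
# k * (m-1)!/(m-k)! partial mappings per row
lhs = m * (m - k) * k * factorial(m - 1) // factorial(m - k)
# right-hand side: the simplified closed form
rhs = k * factorial(m) // factorial(m - k - 1)
assert lhs == rhs   # both count the rows emitted by M''_S
print(lhs)          # 2520
```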

A.8 Proof of Proposition 8

Proof

\(R_{S {}}^{''}\) is responsible for generating an output where each candidate node (with its partial mapping) is moved to the row representing the node in the source graph. Although \(M_{S {}}^{''}\) might emit a candidate node with the same partial mapping multiple times, \(R_{S {}}^{''}\) emits each partial mapping related to the candidate node only once. In this manner, the output file consists of m rows, where each row contains the connections of the node and the possible partial mappings consisting of exactly k elements. The number of partial mappings equals the number of ordered subsets of k elements from a set of \(m-1\) elements, and each partial mapping contains k elements, therefore:

$$\begin{aligned} N_{k}^{R_{S {}}^{''}{}} = m \quad L_{k}^{R_{S {}}^{''}{}} = m \cdot f + \frac{(m-1)!}{(m-k-1)!} \cdot k \cdot f \end{aligned}$$

This means that in the worst-case scenario, the size of the output can be calculated as follows:

$$\begin{aligned} S_{k}^{R_{S {}}^{''}{}} = m \cdot m \cdot f + m \cdot \frac{(m-1)!}{(m-k-1)!} \cdot k \cdot f = s + \frac{m!}{(m-k-1)!} \cdot k \cdot f \end{aligned}$$
(10)

\(\square \)

A.9 Proof of Proposition 9

Proof

Each reducer task obtains a key and all values related to this key. The worst-case scenario is the following:

On one hand, the reducer obtains the values where the partial mappings are extended with the key node. The partial mappings had \(k-1\) nodes before the addition of the key node; therefore, the maximal number of these partial mappings equals the number of ordered subsets of \(k-1\) elements from a set of \(m-1\) elements, i.e., \(\frac{(m-1)!}{(m-k)!}\).

On the other hand, the reducer also obtains the values where the key node n was already part of the partial mappings. This is required because the algorithm must be able to extend these partial mappings with the neighbors of the n node as well. These values are created as follows: at the previous \(M_{S {}}^{'}\) phase, the maximal number of partial mappings with \(k-1\) elements that already contain the key node n but do not contain the examined node is \((k-1) \cdot \frac{(m-2)!}{(m-k)!}\). (The formula calculating the number of ordered sets of k elements from a set of m elements that contain a marked element is \(k \cdot \frac{(m-1)!}{(m-k)!}\).) Since there are \(m-1\) nodes that might extend the partial mappings already containing the n node, the maximal number of such values equals \((k-1) \cdot \frac{(m-1)!}{(m-k)!}\).

To sum up, a node can be the key of a row at the kth \(R_{S {}}^{'}\) phase at most \(k \cdot \frac{(m-1)!}{(m-k)!}\) times. This equals the number of different ordered sets with k elements from a set of m elements, where the ordered sets contain the n node. As described in (5), each row has the length of \((k+1) \cdot f\). Moreover, \(R_{S {}}^{'}\) also obtains the row responsible for describing its connections, which has the length of \(m \cdot f\). This means that the input of a single \(R_{S {}}^{'}\) has the maximum size of:

$$\begin{aligned} \Delta _{k}^{R_{S {}}^{'}{}}{} = m \cdot f + (k \cdot \frac{(m-1)!}{(m-k)!}) \cdot (k+1) \cdot f \end{aligned}$$
(11)

\(\square \)

A.10 Proof of Proposition 10

Proof

Each reducer obtains all partial mappings related to a candidate node. \(M_{S {}}^{''}\) attaches each candidate node to the partial mapping. The node n appears as a candidate node for each partial mapping with k elements, where the mapping does not contain the node n. The number of ordered subsets of k elements from a set of \(m-1\) elements equals \(\frac{(m-1)!}{(m-k-1)!}\).

After \(M_{S {}}^{'}\), a partial mapping appears k times, since the algorithm must examine all possible neighbors as a candidate node. In this manner, the node n can be a key at most \(k \cdot \frac{(m-1)!}{(m-k-1)!}\) times. As described in (8), each row has the length of \((k+1) \cdot f\).

A reducer also obtains the row responsible for describing its connections, which has the length of \(m \cdot f\). This means that the input of an \(R_{S {}}^{''}\) has the maximum size of:

$$\begin{aligned} \Delta _{k}^{R_{S {}}^{''}{}}{} = m \cdot f + (k \cdot \frac{(m-1)!}{(m-k-1)!}) \cdot (k+1) \cdot f \end{aligned}$$
(12)

\(\square \)

A.11 Proof of Theorem 1

Proof

Although s depends on m, since \(s \le m \cdot m \cdot f\), s can be omitted from the calculation based on the output of the mapper/reducer functions if \(k \ge 2\), because \(m^{2}\) is only a small term with respect to the other addends. Moreover, f is practically a constant multiplier accounting for the appropriate separator characters between the nodes.

The worst-case scenario is when no candidate node fails to satisfy the feasibility rules. In this case, the order of magnitudes can be calculated as follows.

According to (3), the size of the output of the \(\mathrm{MRSI}_{H {}}\) algorithm is the following:

$$\begin{aligned} S_{k}^{M_{H {}}{}} = S_{k}^{R_{H {}}{}} = {\mathcal {O}}(s + \frac{m!}{(m-k)!} \cdot (m + 1) \cdot f) = {\mathcal {O}}(m^{k+1}) \end{aligned}$$
(13)

According to (6), the output of \(M_{S {}}^{'}\) of the algorithm \(\mathrm{MRSI}_{S {}}\) equals to:

$$\begin{aligned} S_{k}^{M_{S {}}^{'}{}} = {\mathcal {O}}(s + k \cdot \frac{m!}{(m-k)!} \cdot (k+1) \cdot f) = {\mathcal {O}}(m^{k} \cdot k^{2}) \end{aligned}$$
(14)

According to (7), the size of the output of \(R_{S {}}^{'}\) used in the algorithm \(\mathrm{MRSI}_{S {}}\) is:

$$\begin{aligned} S_{k}^{R_{S {}}^{'}{}} = {\mathcal {O}}(s + k^{2} \cdot \frac{m!}{(m-k)!} \cdot f) = {\mathcal {O}}(m^{k} \cdot k^{2}) \end{aligned}$$
(15)

The size of the output produced by the \(M_{S {}}^{''}\) of the \(\mathrm{MRSI}_{S {}}\) algorithm, based on (9), is the following:

$$\begin{aligned} S_{k}^{M_{S {}}^{''}{}} = {\mathcal {O}}(s + k \cdot \frac{m!}{(m-k-1)!} \cdot (k+1) \cdot f) = {\mathcal {O}}(m^{k+1} \cdot k^{2}) \end{aligned}$$
(16)

Finally, based on (10), the size of \(R_{S {}}^{''}\) of the algorithm \(\mathrm{MRSI}_{S {}}\) is:

$$\begin{aligned} S_{k}^{R_{S {}}^{''}{}} = {\mathcal {O}}(s + \frac{m!}{(m-k-1)!} \cdot k \cdot f) = {\mathcal {O}}(m^{k+1} \cdot k) \end{aligned}$$
(17)

Since the target graphs are small (which makes it possible to store them in memory) and the source graphs are larger by multiple orders of magnitude, it can be stated that \(k \ll m\). Therefore, the order of magnitude of the output produced by the \(M_{S {}}^{'}\) and \(R_{S {}}^{'}\) of the algorithm \(\mathrm{MRSI}_{S {}}\) is smaller than that of the output produced by the algorithm \(\mathrm{MRSI}_{H {}}\). The size of the output of \(M_{S {}}^{''}\) and \(R_{S {}}^{''}\) of the algorithm \(\mathrm{MRSI}_{S {}}\) has the same order of magnitude as the outputs of the \(\mathrm{MRSI}_{H {}}\) algorithm.

As for the general case, after k steps, both algorithms provide the same number of partial mappings. However, the algorithm \(\mathrm{MRSI}_{S {}}\) checks the feasibility rules in \(M_{S {}}^{'}\), in contrast to the \(\mathrm{MRSI}_{H {}}\) algorithm, which examines the partial mapping in \(R_{H {}}\). Let b denote the number of (candidate node, partial mapping) pairs that do not satisfy the feasibility rules in the kth step. The number of rows is decreased by \(k \cdot b\), since each partial mapping is emitted with all of its nodes as a key.

This results in less data in the output of \(R_{S {}}^{'}\) as well, because every partial mapping appears in one of the m rows.

The output generated by \(M_{S {}}^{''}\) is decreased by \(k \cdot b \cdot c\), where c denotes the number of neighbors of the partial mappings rejected in the \(M_{S {}}^{'}\).

The output of \(R_{S {}}^{''}\) of the \(\mathrm{MRSI}_{S {}}\) algorithm consists of all partial mappings and the data responsible for describing the connections of the source graphs. This information can be found in the output of the \(R_{H {}}\) of the algorithm \(\mathrm{MRSI}_{H {}}\) as well and is extended with the candidate nodes. Therefore, the size of the output of the \(R_{S {}}^{''}\) is always smaller than the output produced by the \(R_{H {}}\) in the \(\mathrm{MRSI}_{H {}}\) algorithm. \(\square \)
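The worst-case comparison in the first part of the proof can be made concrete by evaluating (3), (6), and (7) for a small k and a much larger m. The values below are illustrative (f is set to 1 and s to its worst-case value):

```python
from math import factorial

def perm(m, k):
    return factorial(m) // factorial(m - k)   # m!/(m-k)!

m, k, f = 100, 3, 1
s = m * m * f
S_H   = s + perm(m, k) * (m + 1) * f          # Eq. (3), MRSI_H output
S_MS1 = s + k * perm(m, k) * (k + 1) * f      # Eq. (6), M'_S output
S_RS1 = s + k * k * perm(m, k) * f            # Eq. (7), R'_S output

# with k << m, the first MRSI_S phase emits far less intermediate data,
# because k*(k+1) and k^2 are much smaller than the m+1 factor of MRSI_H
assert S_MS1 < S_H and S_RS1 < S_H
```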

A.12 Proof of Theorem 2

Proof

Similar to Theorem 1, the worst-case scenario is when no candidate node fails to satisfy the feasibility rules. In this case, the used memory is calculated as follows.

According to (4), the memory used by a single reducer in \(\mathrm{MRSI}_{H {}}\) at the kth iteration is the following:

$$\begin{aligned} \Delta _{k}^{R_{H {}}{}}{} = {\mathcal {O}}(m \cdot f + \frac{(m-1)!}{(m-k)!} \cdot (m+1) \cdot f) = {\mathcal {O}}(m^{k}) \end{aligned}$$
(18)

The maximum memory used by a single \(R_{S {}}^{'}\) and \(R_{S {}}^{''}\) in \(\mathrm{MRSI}_{S {}}\) is calculated in (11) and (12), respectively. Therefore, the orders of magnitude are the following:

$$\begin{aligned} \Delta _{k}^{R_{S {}}^{'}{}}{}= & {} m \cdot f + (k \cdot \frac{(m-1)!}{(m-k)!}) \cdot (k+1) \cdot f = {\mathcal {O}}(m^{(k-1)} \cdot k^{2}) \end{aligned}$$
(19)
$$\begin{aligned} \Delta _{k}^{R_{S {}}^{''}{}}{}= & {} m \cdot f + (k \cdot \frac{(m-1)!}{(m-k-1)!}) \cdot (k+1) \cdot f = {\mathcal {O}}(m^{k} \cdot k^{2}) \end{aligned}$$
(20)

Since the target graphs are small graphs, it can be stated that \(k^{2} \ll m\). In this manner, the memory used by a single reducer in \(\mathrm{MRSI}_{S {}}\) is smaller than, or of the same order of magnitude as, the memory required for a reducer instance in \(\mathrm{MRSI}_{H {}}\).

As for the general case, it has already been proven in Theorem 1 that the outputs of the \(M_{S {}}^{'}\) and \(M_{S {}}^{''}\) are smaller than the output of the \(M_{H {}}\). Moreover, the candidates that fail to satisfy the feasibility rules in the kth step are not emitted by \(M_{S {}}^{'}\), in contrast to the \(M_{H {}}\). This means that the memory requirement of the \(R_{S {}}^{'}\) and \(R_{S {}}^{''}\) is decreased even further. \(\square \)
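The per-reducer memory bounds (4), (11), and (12) can be compared the same way; again, the parameter values are illustrative and f is set to 1:

```python
from math import factorial

def perm(m, k):
    return factorial(m) // factorial(m - k)   # m!/(m-k)!

m, k, f = 100, 3, 1
# Eq. (4): memory of a single R_H reducer; note (m-1)!/(m-k)! = perm(m-1, k-1)
d_RH  = m * f + perm(m - 1, k - 1) * (m + 1) * f
# Eq. (11): memory of a single R'_S reducer
d_RS1 = m * f + k * perm(m - 1, k - 1) * (k + 1) * f
# Eq. (12): memory of a single R''_S reducer
d_RS2 = m * f + k * perm(m - 1, k) * (k + 1) * f

# R'_S needs less memory than R_H whenever k*(k+1) < m+1, i.e., k^2 << m
assert d_RS1 < d_RH
```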


Cite this article

Fehér, P., Asztalos, M., Vajk, T. et al. Detecting subgraph isomorphism with MapReduce. J Supercomput 73, 1810–1851 (2017). https://doi.org/10.1007/s11227-016-1885-6
