
node2bits: Compact Time- and Attribute-Aware Node Representations for User Stitching

  • Conference paper
Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2019)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 11906))

Abstract

Identity stitching, the task of identifying and matching various online references (e.g., sessions over different devices and timespans) to the same user in real-world web services, is crucial for personalization and recommendations. However, traditional user stitching approaches, such as grouping or blocking, require pairwise comparisons between a massive number of user activities, thus posing both computational and storage challenges. Recent works, which are often application-specific, heuristically seek to reduce the number of comparisons, but they suffer from low precision and recall. To solve the problem in an application-independent way, we take a heterogeneous network-based approach in which users (nodes) interact with content (e.g., sessions, websites), and may have attributes (e.g., location). We propose node2bits, an efficient framework that represents multi-dimensional features of node contexts with binary hashcodes. node2bits leverages feature-based temporal walks to encapsulate short- and long-term interactions between nodes in heterogeneous web networks, and adopts SimHash to obtain compact, binary representations and avoid the quadratic complexity for similarity search. Extensive experiments on large-scale real networks show that node2bits outperforms traditional techniques and existing works that generate real-valued embeddings by up to \(5.16\%\) in F1 score on user stitching, while requiring only up to \(1.56\%\) of their storage.


Notes

  1. We assume that the length of each sketch at distance \({\varDelta t}\) is given as \(K^{{\varDelta t}}=\frac{K}{\text {MAX}}\).
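For instance, with the experimental defaults \(K=128\) and \(\text {MAX}=3\) (cf. Appendices D and E.2), each per-distance sketch has \(K^{\varDelta t} = 128/3 \approx 42\) bits.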

References

  1. Ahmed, N.K., et al.: Learning role-based graph embeddings. In: StarAI Workshop at IJCAI (2018)

  2. Bhattacharya, I., Getoor, L.: Collective entity resolution in relational data. TKDD 1(1), 1–36 (2007)

  3. Broder, A.Z., Glassman, S.C., Manasse, M.S., Zweig, G.: Syntactic clustering of the web. Comput. Netw. ISDN Syst. 29(8), 1157–1166 (1997)

  4. Cao, S., Lu, W., Xu, Q.: Deep neural networks for learning graph representations. In: AAAI, pp. 1145–1152 (2016)

  5. Charikar, M.S.: Similarity estimation techniques from rounding algorithms. In: STOC, pp. 380–388 (2002)

  6. Christen, P.: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, Heidelberg (2012)

  7. Cohen, W.W., Richman, J.: Learning to match and cluster large high-dimensional data sets for data integration. In: KDD, pp. 475–480 (2002)

  8. Dasgupta, A., Gurevich, M., Zhang, L., Tseng, B., Thomas, A.O.: Overcoming browser cookie churn with clustering. In: WSDM, pp. 83–92 (2012)

  9. Dong, X.L., Naumann, F.: Data fusion: resolving data conflicts for integration. VLDB 2(2), 1654–1655 (2009)

  10. Dong, Y., Chawla, N.V., Swami, A.: metapath2vec: scalable representation learning for heterogeneous networks. In: KDD, pp. 135–144 (2017)

  11. Donnat, C., Zitnik, M., Hallac, D., Leskovec, J.: Learning structural node embeddings via diffusion wavelets. In: KDD, pp. 1320–1329 (2018)

  12. Eckersley, P.: How unique is your web browser? In: Atallah, M.J., Hopper, N.J. (eds.) PETS 2010. LNCS, vol. 6205, pp. 1–18. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-14527-8_1

  13. Getoor, L., Machanavajjhala, A.: Entity resolution for big data. In: KDD, p. 1527 (2013)

  14. Grover, A., Leskovec, J.: node2vec: scalable feature learning for networks. In: KDD, pp. 855–864 (2016)

  15. Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large graphs. In: NeurIPS, pp. 1024–1034 (2017)

  16. Heimann, M., Shen, H., Safavi, T., Koutra, D.: REGAL: representation learning-based graph alignment. In: CIKM, pp. 117–126 (2018)

  17. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: STOC, pp. 604–613 (1998)

  18. Jin, D., et al.: Smart roles: inferring professional roles in email networks. In: KDD (2019)

  19. Jin, D., Rossi, R.A., Koh, E., Kim, S., Rao, A., Koutra, D.: Latent network summarization: bridging network embedding and summarization. In: KDD (2019)

  20. Kim, S., Kini, N., Pujara, J., Koh, E., Getoor, L.: Probabilistic visitor stitching on cross-device web logs. In: WWW, pp. 1581–1589 (2017)

  21. Kolb, L., Thor, A., Rahm, E.: Dedoop: efficient deduplication with Hadoop. VLDB 5(12), 1878–1881 (2012)

  22. Liu, Y., Zhu, L., Szekely, P., Galstyan, A., Koutra, D.: Coupled clustering of time-series and networks. In: SDM (2019)

  23. Nguyen, G.H., Lee, J.B., Rossi, R.A., Ahmed, N.K., Koh, E., Kim, S.: Continuous-time dynamic network embeddings. In: WWW BigNet Workshop (2018)

  24. Papadakis, G., Svirsky, J., Gal, A., Palpanas, T.: Comparative analysis of approximate blocking techniques for entity resolution. VLDB 9(9), 684–695 (2016)

  25. Perozzi, B., Al-Rfou, R., Skiena, S.: DeepWalk: online learning of social representations. In: KDD (2014)

  26. Rajaraman, A., Leskovec, J., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press (2014)

  27. Ribeiro, L.F., Saverese, P.H., Figueiredo, D.R.: struc2vec: learning node representations from structural identity. In: KDD, pp. 385–394 (2017)

  28. Rossi, R.A., Ahmed, N.K.: The network data repository with interactive graph analytics and visualization. In: AAAI (2015). http://networkrepository.com

  29. Rossi, R.A., Zhou, R., Ahmed, N.K.: Deep inductive network representation learning. In: WWW, pp. 953–960 (2018)

  30. Saha Roy, R., Sinha, R., Chhaya, N., Saini, S.: Probabilistic deduplication of anonymous web traffic. In: WWW, pp. 103–104 (2015)

  31. Shi, Y., Gui, H., Zhu, Q., Kaplan, L., Han, J.: AspEm: embedding learning by aspects in heterogeneous information networks. In: SDM, pp. 144–152 (2018)

  32. Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., Mei, Q.: LINE: large-scale information network embedding. In: WWW (2015)

  33. Von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)

  34. Wang, Q., Wang, S., Gong, M., Wu, Y.: Feature hashing for network representation learning. In: IJCAI, pp. 2812–2818 (2018)

Additional References

  35. Aghazadeh, A., Lan, A., Shrivastava, A., Baraniuk, R.: RHash: robust hashing via \(\ell_\infty\)-norm distortion. In: IJCAI, pp. 1386–1394 (2017)

  36. Heimann, M., Lee, W., Pan, S., Chen, K.-Y., Koutra, D.: HashAlign: hash-based alignment of multiple graphs. In: Phung, D., Tseng, V.S., Webb, G.I., Ho, B., Ganji, M., Rashidi, L. (eds.) PAKDD 2018. LNCS (LNAI), vol. 10939, pp. 726–739. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-93040-4_57

  37. Hisano, R.: Semi-supervised graph embedding approach to dynamic link prediction. In: Cornelius, S., Coronges, K., Gonçalves, B., Sinatra, R., Vespignani, A. (eds.) CompleNet 2018. SPC, pp. 109–121. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-73198-8_10

  38. Ji, J., Li, J., Yan, S., Tian, Q., Zhang, B.: Min-max hash for Jaccard similarity. In: ICDM, pp. 301–309 (2013)

  39. Kang, B., Jung, K.: Robust and efficient locality sensitive hashing for nearest neighbor search in large data sets. In: NeurIPS BigLearn Workshop, pp. 1–8 (2012)

  40. Li, P., Hastie, T.J., Church, K.W.: Very sparse random projections. In: KDD, pp. 287–296 (2006)

  41. Li, P., König, A.C.: Theory and applications of b-bit minwise hashing. Commun. ACM 54(8), 101–109 (2011)

  42. Li, P., Owen, A., Zhang, C.H.: One permutation hashing. In: NeurIPS, pp. 3113–3121 (2012)

  43. Lv, Q., Josephson, W., Wang, Z., Charikar, M., Li, K.: Multi-probe LSH: efficient indexing for high-dimensional similarity search. In: VLDB, pp. 950–961 (2007)

  44. Manzoor, E., Milajerdi, S.M., Akoglu, L.: Fast memory-efficient anomaly detection in streaming heterogeneous graphs. In: KDD, pp. 1035–1044 (2016)

  45. Qiu, J., Dong, Y., Ma, H., Li, J., Wang, K., Tang, J.: Network embedding as matrix factorization: unifying DeepWalk, LINE, PTE, and node2vec. In: WSDM, pp. 459–467 (2018)

  46. Rossi, R.A., Ahmed, N.K.: Role discovery in networks. TKDE 27(4), 1112–1131 (2015)

  47. Safavi, T., Sripada, C., Koutra, D.: Scalable hashing-based network discovery. In: ICDM, pp. 405–414 (2017)

  48. Shrivastava, A., Li, P.: Densifying one permutation hashing via rotation for fast near neighbor search. In: ICML, pp. 557–565 (2014)

  49. Wang, D., Cui, P., Zhu, W.: Structural deep network embedding. In: KDD, pp. 1225–1234 (2016)

  50. Weiss, Y., Torralba, A., Fergus, R.: Spectral hashing. In: NeurIPS, pp. 1753–1760 (2009)

  51. Zhu, L., Guo, D., Yin, J., Ver Steeg, G., Galstyan, A.: Scalable temporal latent space inference for link prediction in dynamic social networks. IEEE TKDE 28(10), 2765–2777 (2016)


Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. IIS 1845491, Army Young Investigator Award No. W911NF1810397, an Adobe Digital Experience research faculty award, and an Amazon research faculty award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF or other funding parties.

Author information

Correspondence to Di Jin.

Appendices

A Detailed Algorithm

In Sect. 3 we gave an overview of our proposed method, node2bits. For reproducibility, we sketch its main steps in more detail below.

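The original pseudocode figure is not reproduced here; the following Python sketch illustrates the pipeline described in Sect. 3: feature-based temporal walks, per-distance feature histograms, and SimHash-style signed random projections. All names, data layouts, and sampling details are illustrative assumptions rather than the released implementation.

```python
import random

import numpy as np


def temporal_walk(edges, start_node, start_time, max_len):
    """Sample a temporally valid walk: timestamps along the walk are non-decreasing."""
    walk, node, t = [start_node], start_node, start_time
    for _ in range(max_len - 1):
        # edges: dict mapping node -> list of (neighbor, timestamp) pairs
        candidates = [(v, ts) for v, ts in edges.get(node, []) if ts >= t]
        if not candidates:
            break
        node, t = random.choice(candidates)
        walk.append(node)
    return walk


def node2bits_sketch(edges, features, num_walks=10, max_len=20, max_dist=3, K=128, seed=0):
    """Minimal node2bits-style pipeline: temporal walks -> per-distance
    feature histograms -> SimHash binary codes. Illustrative only."""
    rng = np.random.default_rng(seed)
    nodes = sorted(edges)
    F = features.shape[1]                          # number of node feature values
    hists = {u: np.zeros((max_dist, F)) for u in nodes}

    for u in nodes:
        for v, ts in edges[u][:num_walks]:         # walks seeded from the first num_walks edges
            walk = temporal_walk(edges, v, ts, max_len)
            for d, w in enumerate(walk[:max_dist]):
                hists[u][d] += features[w]         # aggregate features of the context at distance d+1

    # Hash each normalized per-distance histogram with random hyperplanes
    # (SimHash); each distance gets K // max_dist bits (cf. footnote 1).
    k_per = K // max_dist
    planes = rng.standard_normal((max_dist, F, k_per))
    codes = {}
    for u in nodes:
        h = hists[u] / np.maximum(hists[u].sum(axis=1, keepdims=True), 1)
        bits = [(h[d] @ planes[d] >= 0).astype(np.uint8) for d in range(max_dist)]
        codes[u] = np.concatenate(bits)            # compact binary embedding of ~K bits
    return codes
```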

B Complexity Analysis

Time Complexity. The runtime complexity of node2bits includes deriving (1) the set of R temporal random walks of length up to L, which is \(\mathcal {O}(MRL)\) in the worst case; (2) the feature values of nodes in the walks from step (1); and (3) the hashcodes of the feature values of nodes in the context via random projection, which is \(\mathcal {O}(NK)\). Thus, the total runtime complexity is \(\mathcal {O}(MRL + NK)\), which is linear in the number of edges M when \(M\gg N\), as K is relatively small (R3).

Runtime Space Complexity. The space required at runtime consists of three parts: (1) the set of temporal random walks (represented as vectors) per edge, with complexity \(\mathcal {O}(MRL)\); (2) the histograms of feature contexts, \( N|\mathcal {F}||\mathcal {T}_V|\); and (3) the set of randomly generated hyperplanes, NK. Therefore, the total runtime space complexity is \(\mathcal {O}(MRL + N(|\mathcal {F}||\mathcal {T}_V|+K))\).

Output Space Complexity. The output space complexity of node2bits is \(\mathcal {O}(NK)\) bits. Storing binary vectors is guaranteed to take \(32{\times }\) less space than storing real-valued float vectors (4 bytes per dimension) of the same dimensionality. In practice, node2bits requires even less storage when the binary vectors are stored in a sparse format (see Sect. 4.4 for empirical results).
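To make the \(32{\times }\) claim concrete, a quick back-of-the-envelope check (N is illustrative; see Table 7 for the measured numbers):

```python
N, K = 1_000_000, 128

dense_float_mb = N * K * 4 / 2**20   # float32 embeddings: 4 bytes per dimension
binary_mb = N * K / 8 / 2**20        # binary codes: 1 bit per dimension

print(f"float32: {dense_float_mb:.0f} MB, binary: {binary_mb:.0f} MB "
      f"({dense_float_mb / binary_mb:.0f}x smaller)")
# float32: 488 MB, binary: 15 MB (32x smaller)
```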

C Data Description

Below we provide a more detailed description of the network datasets that we use in our experiments (Table 2).

  • citeseer: CiteSeerX is an undirected, heterogeneous network containing bipartite relations between authors and the papers they contributed to.

  • yahoo: Yahoo! Messenger Logs is a heterogeneous network capturing message exchanges between users at different locations (node attribute).

  • bitcoin: soc-bitcoinA is a who-trusts-whom network on the Bitcoin Alpha platform. The directed edges indicate user ratings.

  • digg: This heterogeneous network consists of users voting for stories they like and forming friendships with other users.

  • wiki: wiki-talk is a temporal homogeneous network capturing Wikipedia users editing each other’s Talk page over time.

  • comp-X: A temporal heterogeneous network derived from a company’s web logs, consisting of users’ web sessions and their IPs. In the stitching task, we predict which web session IDs correspond to the same user.

D Configuration of Baselines

As mentioned in Sect. 4.1, we configured all the baselines to achieve the best performance according to their respective papers. For all baselines based on random walks (i.e., node2vec, struc2vec, DeepWalk, metapath2vec, CTDNE), we set the number of walks to 20 and the maximum walk length to \(L=20\). For node2vec, we perform a grid search over \(p,q \in \{0.25, 0.50, 1, 2, 4\}\) as mentioned in [14] and report the best performance, as sketched below. For metapath2vec, we adopt the recommended meta-path “Type 1-Type 2-Type 1” (e.g., type 1 = author; type 2 = publication). For DNGR, we set the random surfing probability \(\alpha =0.98\) and use a 3-layer neural network model whose hidden layer has 1024 nodes. We use 2nd-LINE to incorporate 2nd-order proximity in the graph. For all embedding methods, we set the embedding dimension to \(K=128\). Unlike these methods, CN outputs clusters, each of which corresponds to one entity.
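For illustration, the node2vec grid search over \(p, q\) can be sketched as follows; here `run_node2vec` and `evaluate_f1` are hypothetical placeholders for the actual training and evaluation routines, not functions from any released codebase:

```python
from itertools import product

grid = [0.25, 0.50, 1, 2, 4]
best_f1, best_pq = -1.0, None
for p, q in product(grid, grid):
    # run_node2vec / evaluate_f1 are stand-ins for the actual pipeline.
    emb = run_node2vec(graph, p=p, q=q, num_walks=20, walk_length=20, dim=128)
    f1 = evaluate_f1(emb, labels)
    if f1 > best_f1:
        best_f1, best_pq = f1, (p, q)
print(f"best (p, q) = {best_pq}, F1 = {best_f1:.3f}")
```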

E Additional Empirical Analysis

E.1 Justification of Hashing

In this experiment we hash the outputs of the baseline embedding methods using SimHash [5] and then perform stitching on two temporal graphs to study their performance under a storage budget comparable to node2bits. Based on Table 6, we observe fluctuations in the stitching performance of the baseline methods: almost all baselines, especially struc2vec, score worse on all metrics on the bitcoin dataset. On the other hand, node2vec, LINE and CTDNE score slightly higher on the yahoo dataset, likely because small real values are amplified when hashed into binary inputs for logistic-regression classification; the graph structure may also play a role. We leave further investigation to future work, but node2bits nevertheless outperforms these baselines in all cases. This experiment demonstrates that node2bits effectively preserves context information in its binary hashcodes.
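The hashing applied to the baselines is standard SimHash [5]; below is a minimal sketch of signed random projections, assuming the baseline embeddings are given as an \(N \times d\) NumPy array (names are illustrative):

```python
import numpy as np


def simhash(embeddings, num_bits=128, seed=0):
    """Hash real-valued embeddings (N x d) into num_bits-bit binary codes
    via signed random projections (SimHash)."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((embeddings.shape[1], num_bits))
    return (embeddings @ planes >= 0).astype(np.uint8)

# Usage: codes = simhash(baseline_embeddings); Hamming similarity between
# codes then approximates cosine similarity of the original embeddings.
```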

Table 6. Justification of hashing
Fig. 4. Sensitivity analysis on the bitcoin dataset. node2bits achieves the highest AUC, ACC, and F1 scores when \(\text {MAX}=3\). Increasing the number of walks or their lengths does not significantly affect the performance of node2bits.

E.2 Sensitivity Analysis

We also perform a sensitivity analysis on the bitcoin dataset for the hyperparameters used in this work. In particular, we perform a grid analysis by varying (1) the maximum temporal distance MAX, (2) the number of temporal walks per edge, and (3) the length of the walks. The results are given in Fig. 4. Figure 4a indicates that node2bits achieves the best performance when \(\text {MAX}=3\). This implies that although incorporating nodes from temporally distant contexts is potentially beneficial, it also introduces less relevant information. We therefore set \(\text {MAX}=3\) by default in the experiments of this work. Figures 4b and c show that the performance of node2bits is not significantly affected by the number of temporal walks performed or by their length. This is reasonable because node2bits uses these walks to collect node features into the context and normalizes their occurrences in the histograms, so adding more nodes to the ordered temporal contexts does not provide extra useful information. We empirically set the number of walks per edge to 10 and the walk length to 20 in the experiments of this work.

E.3 Output Storage and Runtime in Detail

We report the detailed output storage in Table 7 and the time elapsed when running all methods in Table 8. The node-wise sparse binary vectors generated by node2bits take a trivial amount of storage compared to the other methods, while its runtime is comparable to node2vec. node2bits finished running on all datasets, whereas most baselines failed to finish within the time limit on the large datasets, digg and wiki.

Table 7. Space required to store the output in MB. node2bits requires 63\(\times \)–339\(\times \) less space than other embedding methods. ‘–’ indicates that the method does not apply to that dataset, or encounters errors such as out-of-memory or out-of-time.
Table 8. Comparison between node2bits and baselines in terms of runtime (in seconds). Note the runtime of dynamic node2bits (short-term) for the temporal networks is shown in parentheses.

F Additional Related Work

In this section we provide additional related work, complementing our discussion in Sect. 5.

Node Embeddings. Here we give more details about proximity-based methods, which we employ in our experiments. DeepWalk [25] and node2vec [14] leverage vanilla and 2nd-order random walks, respectively, to explore the identities of the neighborhood; LINE [32] can be seen as a special case of DeepWalk with the context window set to 1 [45]; metapath2vec [10] relies on a predefined meta-schema to perform random walks in heterogeneous networks. In the field of temporal network embedding, most approaches [37, 51] approximate the dynamic network as discrete static snapshots over time, which does not apply to user stitching tasks, as sessions corresponding to the same user may occur in multiple timespans. CTDNE [23] first explored temporal proximity by learning temporally valid embeddings based on a corpus of temporal random walks. Another related field is hashing-based embedding: for example, node2hash [34] hashes pairwise node proximities derived from random walks into low-dimensional hashcodes that serve as embeddings. Due to the quadratic complexity of computing pairwise proximities between nodes, node2hash does not apply to large-scale networks. One limitation of these methods is that training a skip-gram architecture on the entire corpus sampled by random walks can be memory-intensive. A further limitation of these approaches, as well as of existing deep architectures [15, 49], is that for nodes to have similar embeddings, they must be in close proximity (e.g., neighbors or nodes with several common neighbors) in the network. This is not necessarily the case for user stitching, where corresponding entities may exhibit similar behavior (resulting in similar local topologies) but not connect to the same entities.

Compared with proximity-based methods, embedding works that explore structural equivalence or similarity [1, 11, 16, 18, 19, 22, 27, 46] are better suited to user stitching. Representative examples include the following: struc2vec [27], xNetMF [16], and EMBER [18] define similarity in terms of degree sequences in node-centric subgraphs; DeepGL [29] learns deep relational functions applied to degree, triangle counts, and other graph invariants in an inductive scheme. Role2vec [1] proposes a framework that inductively learns structural similarity by introducing attributed random walks atop relational operators, while MultiLENS [19] summarizes node embeddings obtained by recursive application of relational operators. CCTN [22] embeds and clusters nodes in a network that are not only well-connected but also share similar behavioral patterns (e.g., similar patterns in the degree or other structural properties over time).

Locality Sensitive Hashing (LSH). More recently, LSH functions that are robust to distortion [35], require less storage for the hash codes [41, 50], generate codewords with balanced numbers of items [39], or can be computed efficiently [38, 40, 42, 48] have attracted much attention. LSH has been used in a variety of data mining applications, including network alignment [36], network inference [47], anomaly detection [44], and more. In addition, there are works devoted to learning to hash [35], where the main idea is to learn hash codes by optimizing an objective function, or to intelligently probe multiple adjacent codewords that are likely to contain query results in a hash table for similarity search [43]. However, these methods do not directly apply to large-scale graphs.


Copyright information

© 2020 Springer Nature Switzerland AG


Cite this paper

Jin, D., Heimann, M., Rossi, R.A., Koutra, D. (2020). node2bits: Compact Time- and Attribute-Aware Node Representations for User Stitching. In: Brefeld, U., Fromont, E., Hotho, A., Knobbe, A., Maathuis, M., Robardet, C. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019. Lecture Notes in Computer Science(), vol 11906. Springer, Cham. https://doi.org/10.1007/978-3-030-46150-8_29

  • DOI: https://doi.org/10.1007/978-3-030-46150-8_29

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-46149-2

  • Online ISBN: 978-3-030-46150-8
