
node2bits: Compact Time- and Attribute-Aware Node Representations for User Stitching

  • Conference paper
Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2019)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 11906))

Abstract

Identity stitching, the task of identifying and matching various online references (e.g., sessions over different devices and timespans) to the same user in real-world web services, is crucial for personalization and recommendations. However, traditional user stitching approaches, such as grouping or blocking, require pairwise comparisons between a massive number of user activities, thus posing both computational and storage challenges. Recent works, which are often application-specific, heuristically seek to reduce the number of comparisons, but they suffer from low precision and recall. To solve the problem in an application-independent way, we take a heterogeneous network-based approach in which users (nodes) interact with content (e.g., sessions, websites), and may have attributes (e.g., location). We propose node2bits, an efficient framework that represents multi-dimensional features of node contexts with binary hashcodes. node2bits leverages feature-based temporal walks to encapsulate short- and long-term interactions between nodes in heterogeneous web networks, and adopts SimHash to obtain compact, binary representations and avoid the quadratic complexity for similarity search. Extensive experiments on large-scale real networks show that node2bits outperforms traditional techniques and existing works that generate real-valued embeddings by up to \(5.16\%\) in F1 score on user stitching, while requiring only up to \(1.56\%\) of their storage.


Notes

  1. We assume that the length of each sketch at distance \({\varDelta t}\) is given as \(K^{{\varDelta t}}=\frac{K}{\text {MAX}}\).
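For instance, with the experimental defaults \(K=128\) and \(\text {MAX}=3\) (cf. Appendices D and E.2), each per-distance sketch has \(K^{\varDelta t} = 128/3 \approx 42\) bits.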

References

  1. Ahmed, N.K., et al.: Learning role-based graph embeddings. In: StarAI Workshop at IJCAI (2018)

  2. Bhattacharya, I., Getoor, L.: Collective entity resolution in relational data. TKDD 1(1), 1–36 (2007)

  3. Broder, A.Z., Glassman, S.C., Manasse, M.S., Zweig, G.: Syntactic clustering of the web. Comput. Netw. ISDN Syst. 29(8), 1157–1166 (1997)

  4. Cao, S., Lu, W., Xu, Q.: Deep neural networks for learning graph representations. In: AAAI, pp. 1145–1152 (2016)

  5. Charikar, M.S.: Similarity estimation techniques from rounding algorithms. In: STOC, pp. 380–388 (2002)

  6. Christen, P.: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, Heidelberg (2012)

  7. Cohen, W.W., Richman, J.: Learning to match and cluster large high-dimensional data sets for data integration. In: KDD, pp. 475–480 (2002)

  8. Dasgupta, A., Gurevich, M., Zhang, L., Tseng, B., Thomas, A.O.: Overcoming browser cookie churn with clustering. In: WSDM, pp. 83–92 (2012)

  9. Dong, X.L., Naumann, F.: Data fusion: resolving data conflicts for integration. VLDB 2(2), 1654–1655 (2009)

  10. Dong, Y., Chawla, N.V., Swami, A.: metapath2vec: scalable representation learning for heterogeneous networks. In: KDD, pp. 135–144 (2017)

  11. Donnat, C., Zitnik, M., Hallac, D., Leskovec, J.: Learning structural node embeddings via diffusion wavelets. In: KDD, pp. 1320–1329 (2018)

  12. Eckersley, P.: How unique is your web browser? In: Atallah, M.J., Hopper, N.J. (eds.) PETS 2010. LNCS, vol. 6205, pp. 1–18. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-14527-8_1

  13. Getoor, L., Machanavajjhala, A.: Entity resolution for big data. In: KDD, p. 1527 (2013)

  14. Grover, A., Leskovec, J.: node2vec: scalable feature learning for networks. In: KDD, pp. 855–864 (2016)

  15. Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large graphs. In: NeurIPS, pp. 1024–1034 (2017)

  16. Heimann, M., Shen, H., Safavi, T., Koutra, D.: REGAL: representation learning-based graph alignment. In: CIKM, pp. 117–126 (2018)

  17. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: STOC, pp. 604–613 (1998)

  18. Jin, D., et al.: Smart roles: inferring professional roles in email networks. In: KDD (2019)

  19. Jin, D., Rossi, R.A., Koh, E., Kim, S., Rao, A., Koutra, D.: Latent network summarization: bridging network embedding and summarization. In: KDD (2019)

  20. Kim, S., Kini, N., Pujara, J., Koh, E., Getoor, L.: Probabilistic visitor stitching on cross-device web logs. In: WWW, pp. 1581–1589 (2017)

  21. Kolb, L., Thor, A., Rahm, E.: Dedoop: efficient deduplication with Hadoop. VLDB 5(12), 1878–1881 (2012)

  22. Liu, Y., Zhu, L., Szekely, P., Galstyan, A., Koutra, D.: Coupled clustering of time-series and networks. In: SDM (2019)

  23. Nguyen, G.H., Lee, J.B., Rossi, R.A., Ahmed, N.K., Koh, E., Kim, S.: Continuous-time dynamic network embeddings. In: WWW BigNet Workshop (2018)

  24. Papadakis, G., Svirsky, J., Gal, A., Palpanas, T.: Comparative analysis of approximate blocking techniques for entity resolution. VLDB 9(9), 684–695 (2016)

  25. Perozzi, B., Al-Rfou, R., Skiena, S.: DeepWalk: online learning of social representations. In: KDD (2014)

  26. Rajaraman, A., Leskovec, J., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press (2014)

  27. Ribeiro, L.F., Saverese, P.H., Figueiredo, D.R.: struc2vec: learning node representations from structural identity. In: KDD, pp. 385–394 (2017)

  28. Rossi, R.A., Ahmed, N.K.: The network data repository with interactive graph analytics and visualization. In: AAAI (2015). http://networkrepository.com

  29. Rossi, R.A., Zhou, R., Ahmed, N.K.: Deep inductive network representation learning. In: WWW, pp. 953–960 (2018)

  30. Saha Roy, R., Sinha, R., Chhaya, N., Saini, S.: Probabilistic deduplication of anonymous web traffic. In: WWW, pp. 103–104 (2015)

  31. Shi, Y., Gui, H., Zhu, Q., Kaplan, L., Han, J.: AspEm: embedding learning by aspects in heterogeneous information networks. In: SDM, pp. 144–152 (2018)

  32. Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., Mei, Q.: LINE: large-scale information network embedding. In: WWW (2015)

  33. Von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)

  34. Wang, Q., Wang, S., Gong, M., Wu, Y.: Feature hashing for network representation learning. In: IJCAI, pp. 2812–2818 (2018)

Additional References

  35. Aghazadeh, A., Lan, A., Shrivastava, A., Baraniuk, R.: RHash: robust hashing via \(\ell_\infty\)-norm distortion. In: IJCAI, pp. 1386–1394 (2017)

  36. Heimann, M., Lee, W., Pan, S., Chen, K.-Y., Koutra, D.: HashAlign: hash-based alignment of multiple graphs. In: Phung, D., Tseng, V.S., Webb, G.I., Ho, B., Ganji, M., Rashidi, L. (eds.) PAKDD 2018. LNCS (LNAI), vol. 10939, pp. 726–739. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-93040-4_57

  37. Hisano, R.: Semi-supervised graph embedding approach to dynamic link prediction. In: Cornelius, S., Coronges, K., Gonçalves, B., Sinatra, R., Vespignani, A. (eds.) CompleNet 2018. SPC, pp. 109–121. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-73198-8_10

  38. Ji, J., Li, J., Yan, S., Tian, Q., Zhang, B.: Min-max hash for Jaccard similarity. In: ICDM, pp. 301–309 (2013)

  39. Kang, B., Jung, K.: Robust and efficient locality sensitive hashing for nearest neighbor search in large data sets. In: NeurIPS BigLearn Workshop, pp. 1–8 (2012)

  40. Li, P., Hastie, T.J., Church, K.W.: Very sparse random projections. In: KDD, pp. 287–296 (2006)

  41. Li, P., König, A.C.: Theory and applications of b-bit minwise hashing. Commun. ACM 54(8), 101–109 (2011)

  42. Li, P., Owen, A., Zhang, C.H.: One permutation hashing. In: NeurIPS, pp. 3113–3121 (2012)

  43. Lv, Q., Josephson, W., Wang, Z., Charikar, M., Li, K.: Multi-probe LSH: efficient indexing for high-dimensional similarity search. In: VLDB, pp. 950–961 (2007)

  44. Manzoor, E., Milajerdi, S.M., Akoglu, L.: Fast memory-efficient anomaly detection in streaming heterogeneous graphs. In: KDD, pp. 1035–1044 (2016)

  45. Qiu, J., Dong, Y., Ma, H., Li, J., Wang, K., Tang, J.: Network embedding as matrix factorization: unifying DeepWalk, LINE, PTE, and node2vec. In: WSDM, pp. 459–467 (2018)

  46. Rossi, R.A., Ahmed, N.K.: Role discovery in networks. TKDE 27(4), 1112–1131 (2015)

  47. Safavi, T., Sripada, C., Koutra, D.: Scalable hashing-based network discovery. In: ICDM, pp. 405–414 (2017)

  48. Shrivastava, A., Li, P.: Densifying one permutation hashing via rotation for fast near neighbor search. In: ICML, pp. 557–565 (2014)

  49. Wang, D., Cui, P., Zhu, W.: Structural deep network embedding. In: KDD, pp. 1225–1234 (2016)

  50. Weiss, Y., Torralba, A., Fergus, R.: Spectral hashing. In: NeurIPS, pp. 1753–1760 (2009)

  51. Zhu, L., Guo, D., Yin, J., Ver Steeg, G., Galstyan, A.: Scalable temporal latent space inference for link prediction in dynamic social networks. IEEE TKDE 28(10), 2765–2777 (2016)


Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. IIS 1845491, Army Young Investigator Award No. W911NF1810397, an Adobe Digital Experience research faculty award, and an Amazon research faculty award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF or other funding parties.

Author information

Correspondence to Di Jin.

Appendices

A Detailed Algorithm

In Sect. 3 we gave an overview of our proposed method, node2bits. For reproducibility, we sketch its main steps in more detail below.

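The original pseudocode figure is not reproduced here; the following Python sketch illustrates the pipeline described in Sect. 3: feature-based temporal walks, per-distance feature histograms, and SimHash-style signed random projections. All names, data layouts, and sampling details are illustrative assumptions rather than the released implementation.

```python
import random

import numpy as np


def temporal_walk(edges, start_node, start_time, max_len):
    """Sample a temporally valid walk: timestamps along the walk are non-decreasing."""
    walk, node, t = [start_node], start_node, start_time
    for _ in range(max_len - 1):
        # edges: dict mapping node -> list of (neighbor, timestamp) pairs
        candidates = [(v, ts) for v, ts in edges.get(node, []) if ts >= t]
        if not candidates:
            break
        node, t = random.choice(candidates)
        walk.append(node)
    return walk


def node2bits_sketch(edges, features, num_walks=10, max_len=20, max_dist=3, K=128, seed=0):
    """Minimal node2bits-style pipeline: temporal walks -> per-distance
    feature histograms -> SimHash binary codes. Illustrative only."""
    rng = np.random.default_rng(seed)
    nodes = sorted(edges)
    F = features.shape[1]                          # number of node feature values
    hists = {u: np.zeros((max_dist, F)) for u in nodes}

    for u in nodes:
        for v, ts in edges[u][:num_walks]:         # walks seeded from the first num_walks edges
            walk = temporal_walk(edges, v, ts, max_len)
            for d, w in enumerate(walk[:max_dist]):
                hists[u][d] += features[w]         # aggregate features of the context at distance d+1

    # Hash each normalized per-distance histogram with random hyperplanes
    # (SimHash); each distance gets K // max_dist bits (cf. footnote 1).
    k_per = K // max_dist
    planes = rng.standard_normal((max_dist, F, k_per))
    codes = {}
    for u in nodes:
        h = hists[u] / np.maximum(hists[u].sum(axis=1, keepdims=True), 1)
        bits = [(h[d] @ planes[d] >= 0).astype(np.uint8) for d in range(max_dist)]
        codes[u] = np.concatenate(bits)            # compact binary embedding of ~K bits
    return codes
```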

B Complexity Analysis

Time Complexity. The runtime complexity of node2bits includes deriving (1) the set of R temporal random walks of length up to L, which is \(\mathcal {O}(MRL)\) in the worst case; (2) the feature values of nodes in the walks from step (1); and (3) the hashcodes of the feature values of nodes in the context via random projection, which is \(\mathcal {O}(NK)\). Thus, the total runtime complexity is \(\mathcal {O}(MRL + NK)\), which is linear in the number of edges M when \(M\gg N\), as K is relatively small (R3).

Runtime Space Complexity. The space required at runtime consists of three parts: (1) the set of temporal random walks (represented as vectors) per edge, with complexity \(\mathcal {O}(MRL)\); (2) the histograms of feature contexts, \( N|\mathcal {F}||\mathcal {T}_V|\); and (3) the set of randomly generated hyperplanes, NK. Therefore, the total runtime space complexity is \(\mathcal {O}(MRL + N(|\mathcal {F}||\mathcal {T}_V|+K))\).

Output Space Complexity. The output space complexity of node2bits is \(\mathcal {O}(NK)\) bits. Storing binary vectors is guaranteed to take \(32{\times }\) less space than storing real-valued float vectors (4 bytes per dimension) of the same dimensionality. In practice, node2bits requires even less storage when the binary vectors are stored in a sparse format (see Sect. 4.4 for empirical results).
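To make the \(32{\times }\) claim concrete, a quick back-of-the-envelope check (N is illustrative; see Table 7 for the measured numbers):

```python
N, K = 1_000_000, 128

dense_float_mb = N * K * 4 / 2**20   # float32 embeddings: 4 bytes per dimension
binary_mb = N * K / 8 / 2**20        # binary codes: 1 bit per dimension

print(f"float32: {dense_float_mb:.0f} MB, binary: {binary_mb:.0f} MB "
      f"({dense_float_mb / binary_mb:.0f}x smaller)")
# float32: 488 MB, binary: 15 MB (32x smaller)
```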

C Data Description

Below we provide a more detailed description of the network datasets that we use in our experiments (Table 2).

  • citeseer: CiteSeerX is an undirected, heterogeneous network containing bipartite relations between authors and the papers they contributed to.

  • yahoo: Yahoo! Messenger Logs is a heterogeneous network capturing message exchanges between users at different locations (node attribute).

  • bitcoin: soc-bitcoinA is a who-trusts-whom network on the Bitcoin Alpha platform. The directed edges indicate user ratings.

  • digg: This heterogeneous network consists of users voting for stories they like and forming friendships with other users.

  • wiki: wiki-talk is a temporal homogeneous network capturing Wikipedia users editing each other’s Talk page over time.

  • comp-X: A temporal heterogeneous network derived from a company’s web logs, consisting of users’ web sessions and their IPs. In the stitching task, we predict which web session IDs correspond to the same user.

D Configuration of Baselines

As mentioned in Sect. 4.1, we configured all the baselines to achieve the best performance according to their respective papers. For all baselines based on random walks (i.e., node2vec, struc2vec, DeepWalk, metapath2vec, CTDNE), we set the number of walks to 20 and the maximum walk length to \(L=20\). For node2vec, we perform a grid search over \(p,q \in \{0.25, 0.50, 1, 2, 4\}\) as mentioned in [14] and report the best performance, as sketched below. For metapath2vec, we adopt the recommended meta-path “Type 1-Type 2-Type 1” (e.g., type 1 = author; type 2 = publication). For DNGR, we set the random surfing probability \(\alpha =0.98\) and use a 3-layer neural network model whose hidden layer has 1024 nodes. We use 2nd-LINE to incorporate 2nd-order proximity in the graph. For all embedding methods, we set the embedding dimension to \(K=128\). Unlike these methods, CN outputs clusters, each of which corresponds to one entity.
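For illustration, the node2vec grid search over \(p, q\) can be sketched as follows; here `run_node2vec` and `evaluate_f1` are hypothetical placeholders for the actual training and evaluation routines, not functions from any released codebase:

```python
from itertools import product

grid = [0.25, 0.50, 1, 2, 4]
best_f1, best_pq = -1.0, None
for p, q in product(grid, grid):
    # run_node2vec / evaluate_f1 are stand-ins for the actual pipeline.
    emb = run_node2vec(graph, p=p, q=q, num_walks=20, walk_length=20, dim=128)
    f1 = evaluate_f1(emb, labels)
    if f1 > best_f1:
        best_f1, best_pq = f1, (p, q)
print(f"best (p, q) = {best_pq}, F1 = {best_f1:.3f}")
```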

E Additional Empirical Analysis

E.1 Justification of Hashing

In this experiment we hash the outputs of the baseline embedding methods using SimHash [5] and then perform stitching on two temporal graphs to study their performance under a storage budget comparable to node2bits. Based on Table 6, we observe fluctuations in the stitching performance of the baseline methods: almost all baselines, especially struc2vec, score worse on all metrics on the bitcoin dataset. On the other hand, node2vec, LINE and CTDNE score slightly higher on the yahoo dataset, likely because small real values are amplified when hashed into binary inputs for logistic-regression classification; the graph structure may also play a role. We leave further investigation to future work, but node2bits nevertheless outperforms these baselines in all cases. This experiment demonstrates that node2bits effectively preserves context information in its binary hashcodes.
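The hashing applied to the baselines is standard SimHash [5]; below is a minimal sketch of signed random projections, assuming the baseline embeddings are given as an \(N \times d\) NumPy array (names are illustrative):

```python
import numpy as np


def simhash(embeddings, num_bits=128, seed=0):
    """Hash real-valued embeddings (N x d) into num_bits-bit binary codes
    via signed random projections (SimHash)."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((embeddings.shape[1], num_bits))
    return (embeddings @ planes >= 0).astype(np.uint8)

# Usage: codes = simhash(baseline_embeddings); Hamming similarity between
# codes then approximates cosine similarity of the original embeddings.
```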

Table 6. Justification of hashing
Fig. 4. Sensitivity analysis on the bitcoin dataset. node2bits achieves the highest AUC, ACC, and F1 scores when \(\text {MAX}=3\). Increasing the number of walks or their lengths does not significantly affect the performance of node2bits.

E.2 Sensitivity Analysis

We also perform a sensitivity analysis on the bitcoin dataset for the hyperparameters used in this work. In particular, we perform a grid analysis by varying (1) the maximum temporal distance MAX, (2) the number of temporal walks per edge, and (3) the length of the walks. The results are given in Fig. 4. Figure 4a indicates that node2bits achieves the best performance when \(\text {MAX}=3\). This implies that although incorporating nodes from temporally distant contexts is potentially beneficial, it also introduces less relevant information. We therefore set \(\text {MAX}=3\) by default in the experiments of this work. Figures 4b and c show that the performance of node2bits is not significantly affected by the number of temporal walks performed or by their length. This is reasonable because node2bits uses these walks to collect node features into the context and normalizes their occurrences in the histograms, so adding more nodes to the ordered temporal contexts does not provide extra useful information. We empirically set the number of walks per edge to 10 and the walk length to 20 in the experiments of this work.

E.3 Output Storage and Runtime in Detail

We report the detailed output storage in Table 7 and the time elapsed when running all methods in Table 8. The node-wise sparse binary vectors generated by node2bits take a trivial amount of storage compared to the other methods, while its runtime is comparable to node2vec. node2bits finished running on all datasets, whereas most baselines failed to finish within the time limit on the large datasets, digg and wiki.

Table 7. Space required to store the output in MB. node2bits requires 63\(\times \)–339\(\times \) less space than other embedding methods. ‘–’ indicates that the method does not apply to that dataset, or encounters errors such as out-of-memory or out-of-time.
Table 8. Comparison between node2bits and baselines in terms of runtime (in seconds). Note the runtime of dynamic node2bits (short-term) for the temporal networks is shown in parentheses.

F Additional Related Work

In this section we provide additional related work, complementing our discussion in Sect. 5.

Node Embeddings. Here we give more details about proximity-based methods, which we employ in our experiments. DeepWalk [25] and node2vec [14] leverage vanilla and 2nd-order random walks, respectively, to explore the identities of the neighborhood; LINE [32] can be seen as a special case of DeepWalk with the context window set to 1 [45]; metapath2vec [10] relies on a predefined meta-schema to perform random walks in heterogeneous networks. In the field of temporal network embedding, most approaches [37, 51] approximate the dynamic network as discrete static snapshots over time, which does not apply to user stitching tasks, as sessions corresponding to the same user may occur in multiple timespans. CTDNE [23] first explored temporal proximity by learning temporally valid embeddings based on a corpus of temporal random walks. Another related field is hashing-based embedding: for example, node2hash [34] hashes pairwise node proximities derived from random walks into low-dimensional hashcodes that serve as embeddings. Due to the quadratic complexity of computing pairwise proximities between nodes, node2hash does not apply to large-scale networks. One limitation of these methods is that training a skip-gram architecture on the entire corpus sampled by random walks can be memory-intensive. A further limitation of these approaches, as well as of existing deep architectures [15, 49], is that for nodes to have similar embeddings, they must be in close proximity (e.g., neighbors or nodes with several common neighbors) in the network. This is not necessarily the case for user stitching, where corresponding entities may exhibit similar behavior (resulting in similar local topologies) but not connect to the same entities.

Compared with proximity-based methods, embedding works that explore structural equivalence or similarity [1, 11, 16, 18, 19, 22, 27, 46] are better suited to user stitching. Representative examples include the following: struc2vec [27], xNetMF [16], and EMBER [18] define similarity in terms of degree sequences in node-centric subgraphs; DeepGL [29] learns deep relational functions applied to degree, triangle counts, and other graph invariants in an inductive scheme. Role2vec [1] proposes a framework that inductively learns structural similarity by introducing attributed random walks atop relational operators, while MultiLENS [19] summarizes node embeddings obtained by recursive application of relational operators. CCTN [22] embeds and clusters nodes in a network that are not only well-connected but also share similar behavioral patterns (e.g., similar patterns in the degree or other structural properties over time).

Locality Sensitive Hashing (LSH). More recently, LSH functions that are robust to distortion [35], require less storage for the hash codes [41, 50], generate codewords with balanced numbers of items [39], or can be computed efficiently [38, 40, 42, 48] have attracted much attention. LSH has been used in a variety of data mining applications, including network alignment [36], network inference [47], anomaly detection [44], and more. In addition, there are works devoted to learning to hash [35], where the main idea is to learn hash codes by optimizing an objective function, or to intelligently probe multiple adjacent codewords that are likely to contain query results in a hash table for similarity search [43]. However, these methods do not directly apply to large-scale graphs.


Copyright information

© 2020 Springer Nature Switzerland AG


Cite this paper

Jin, D., Heimann, M., Rossi, R.A., Koutra, D. (2020). node2bits: Compact Time- and Attribute-Aware Node Representations for User Stitching. In: Brefeld, U., Fromont, E., Hotho, A., Knobbe, A., Maathuis, M., Robardet, C. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019. Lecture Notes in Computer Science(), vol 11906. Springer, Cham. https://doi.org/10.1007/978-3-030-46150-8_29

  • DOI: https://doi.org/10.1007/978-3-030-46150-8_29

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-46149-2

  • Online ISBN: 978-3-030-46150-8
