A Two-Stage Authorship Attribution Method Using Text and Structured Data for De-Anonymizing User-Generated Content

Abstract

User-generated content (UGC) is an important source of information on products and services for consumers and firms. Although incentivizing high-quality UGC is an important business objective for any content platform, we show that it is also possible to identify anonymous posters by exploiting the characteristics of posted content. We present a novel two-stage authorship attribution methodology that combines structured and text data by identifying an author first by the amount and granularity of structured data (e.g., location, first name) posted with the UGC and second by the author’s writing style. As a case study, we show that 75% of the 1.3 million users in data publicly released by Yelp are uniquely identified by three structured variable combinations. For the remaining 25%, when the number of potential authors with (nearly) identically structured data ranges from 100 to 5 and sufficient training data exists for text analysis, the average probabilities of identification range from 40 to 81%. Our findings suggest that UGC platforms concerned with the potential negative effects of privacy-related incidents should limit or generalize their posters’ structured data when it is adjoined with textual content or mentioned in the text itself. We also show that although protection policies that focus on structured data remove the most predictive elements of authorship, they also have a small negative effect on the usefulness of content.
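
The two stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy records, the choice of first name and city as the structured variables, and the helper names are all hypothetical, and the second stage swaps in a simple character-trigram cosine similarity in place of the classifier used in the paper. Stage 1 measures how many users are uniquely identified by their structured data; stage 2 ranks the remaining candidates by writing style.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy records: (user_id, first_name, city, known_review_text).
USERS = [
    ("u1", "Ana", "Phoenix", "Loved the tacos!!! Will def come back!!!"),
    ("u2", "Ben", "Phoenix", "The service was adequate; the food, however, was cold."),
    ("u3", "Ben", "Phoenix", "omg best pizza ever omg so good"),
    ("u4", "Cy", "Tempe", "Decent coffee. Slow wifi."),
]


def structured_key(user):
    """Structured variables posted with the text (assumed here: first name, city)."""
    _, first_name, city, _ = user
    return (first_name, city)


def share_unique(users):
    """Stage 1: fraction of users whose structured key matches no other user."""
    counts = Counter(structured_key(u) for u in users)
    return sum(1 for u in users if counts[structured_key(u)] == 1) / len(users)


def trigram_profile(text):
    """Character-trigram counts, a simple stand-in for stylometric features."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def attribute(anon_key, anon_text, users):
    """Stage 1 narrows to users sharing the key; stage 2 ranks them by style."""
    candidates = [u for u in users if structured_key(u) == anon_key]
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0][0]  # structured data alone identifies the author
    target = trigram_profile(anon_text)
    best = max(candidates, key=lambda u: cosine(target, trigram_profile(u[3])))
    return best[0]
```

In this toy data, two of the four users are unique on (first name, city), so text analysis is only needed to separate the two users named Ben in Phoenix. Generalizing or suppressing the structured fields enlarges the candidate set that stage 2 has to search.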

Notes

  1.

    As reviewed in Shu et al. [57], this problem is known by a number of names, including User Identity Linkage, Social Identity Linkage, User Identity Resolution, Social Network Reconciliation, User Account Linkage Inference, Profile Linkage, Anchor Link Prediction, and Detecting me edges.

  2.

    We explored several different kernels for SVM, including polynomial (2nd and 3rd order) and other nonlinear specifications. A linear kernel achieved the best results and is therefore presented throughout the paper.

References

  1.

    Abbasi A, Chen H (2008) Writeprints: a stylometric approach to identity-level identification and similarity detection in cyberspace. ACM Transact Inform Syst (TOIS) 26(2):1–29

  2.

    Abbasi A, Chen H, Nunamaker JF (2008) Stylometric identification in electronic markets: scalability and robustness. J Manag Inf Syst 25(1):49–78

  3.

    Aggarwal CC, Yu PS (2008) A general survey of privacy-preserving data mining models and algorithms. In: Privacy-preserving data mining. Springer, Boston, pp 11–52

  4.

    Ahn D-Y, Duan JA, Mela CF (2015) Managing user-generated content: a dynamic rational expectations equilibrium approach. Mark Sci 35(2):284–303

  5.

    Almishari M, Tsudik G (2012) Exploring linkability of user reviews. In: European Symposium on Research in Computer Security. Springer, Berlin, pp 307–324

  6.

    AMZ Tracker, 2018. How to deal with negative reviews. URL: https://www.amztracker.com/blog/deal-negative-reviews/. Accessed: July 24, 2020.

  7.

    André Q, Carmon Z, Wertenbroch K, Crum A, Frank D, Goldstein W, Huber J, Van Boven L, Weber B, Yang H (2018) Consumer choice and autonomy in the age of artificial intelligence and big data. Cust Needs Solut 5(1):28–37

  8.

    Bordes A, Ertekin S, Weston J, Bottou L (2005) Fast kernel classifiers with online and active learning. J Mach Learn Res 6(Sep):1579–1619

  9.

    Breiman L (2001) Random forests. Mach Learn 45(1):5–32

  10.

    Brennan M, Afroz S, Greenstadt R (2012) Adversarial stylometry: circumventing authorship recognition to preserve privacy and anonymity. ACM Transac Inform Syst Secur (TISSEC) 15(3):1–22

  11.

    Brizan DG, Tansel AU (2006) A survey of entity resolution and record linkage methodologies. Commun IIMA 6(3):5

  12.

    Büschken J, Allenby GM (2016) Sentence-based text analysis for customer reviews. Mark Sci 35(6):953–975

  13.

    Campbell J, Goldfarb A, Tucker C (2015) Privacy regulation and market structure. J Econ Manag Strateg 24(1):47–73

  14.

    Caselaw, (2017). ZL TECHNOLOGIES INC v. GLASSDOOR INC. Court of Appeal, First District, Division 4, California. URL: https://caselaw.findlaw.com/ca-court-of-appeal/1868279.html. Accessed July 24, 2020.

  15.

    De Jong MG, Pieters R, Fox JP (2010) Reducing social desirability bias through item randomized response: an application to measure underreported desires. J Mark Res 47(1):14–27

  16.

    De Montjoye YA, Hidalgo CA, Verleysen M, Blondel VD (2013) Unique in the crowd: the privacy bounds of human mobility. Sci Rep 3(1):1376

  17.

    De Montjoye YA, Radaelli L, Singh VK (2015) Unique in the shopping mall: on the reidentifiability of credit card metadata. Science 347(6221):536–539

  18.

    Douglas DM (2016) Doxing: a conceptual analysis. Ethics Inf Technol 18(3):199–210

  19.

    DuBay WH, (2004). The principles of readability. Accessed April 7, 2020. http://en.copian.ca/library/research/readab/readab.pdf.

  20.

    Elmagarmid AK, Ipeirotis PG, Verykios VS (2006) Duplicate record detection: a survey. IEEE Trans Knowl Data Eng 19(1):1–16

  21.

    Farr C, (2018). Facebook sent a doctor on a secret mission to ask hospitals to share data. CNBC. URL: https://www.cnbc.com/2018/04/05/facebook-building-8-explored-data-sharing-agreement-with-hospitals.html. Accessed: July 24, 2020.

  22.

    Getoor L, Machanavajjhala A (2012) Entity resolution: theory, practice & open challenges. Proc VLDB Endowment 5(12):2018–2019

  23.

    Ghose A, Ipeirotis PG (2010) Estimating the helpfulness and economic impact of product reviews: mining text and reviewer characteristics. IEEE Trans Knowl Data Eng 23(10):1498–1512

  24.

    Goldfarb A, Tucker C (2013) Why managing consumer privacy can be an opportunity. MIT Sloan Manag Rev 54(3):10

  25.

    Gravano L, Ipeirotis PG, Koudas N and Srivastava D, (2003). Text joins for data cleansing and integration in an RDBMS. In Proceedings 19th International Conference on Data Engineering (Cat. No. 03CH37405) (pp. 729-731). IEEE.

  26.

    Hewett K, Rand W, Rust RT, van Heerde HJ (2016) Brand buzz in the echoverse. J Mark 80(3):1–24

  27.

    Hill S, Provost F (2003) The myth of the double-blind review? Author identification using only citations. Acm Sigkdd Explor Newslett 5(2):179–184

  28.

    Hsu CW, Lin CJ (2002) A comparison of methods for multiclass support vector machines. IEEE Trans Neural Netw 13(2):415–425

  29.

    Hu M and Liu B, 2004. Mining opinion features in customer reviews. In AAAI (Vol. 4, No. 4, pp. 755-760).

  30.

    Jones R, (2017). Court rules Yelp must identify anonymous user in defamation case. Gizmodo. URL: https://gizmodo.com/court-rules-yelp-must-identify-anonymous-user-in-defama-1820433103. Accessed: July 24, 2020.

  31.

    Juola P (2012) Large-scale experiments in authorship attribution. Engl Stud 93(3):275–283

  32.

    Juola P and Vescovi D, (2010). Empirical evaluation of authorship obfuscation using JGAAP. In Proceedings of the 3rd ACM workshop on Artificial Intelligence and Security (pp. 14-18).

  33.

    Kincaid JP, Fishburne Jr RP, Rogers RL and Chissom BS, (1975). Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel. Naval Technical Training Command Millington TN Research Branch.

  34.

    Klemko R (2021) A small group of sleuths has been identifying right-wing extremists long before the attack on the Capitol. URL: https://www.washingtonpost.com/national-security/antifa-far-right-doxing-identities/2021/01/10/41721de0-4dd7-11eb-bda4-615aaefd0555_story.html. Accessed January 2, 2021.

  35.

    Koppel M, Schler J, Argamon S (2009) Computational methods in authorship attribution. J Am Soc Inf Sci Technol 60(1):9–26

  36.

    Kourtis I, Stamatatos E (2011) Author identification using semi-supervised learning. In: CLEF 2011: Proceedings of the 2011 Conference on Multilingual and Multimodal Information Access Evaluation (Lab and Workshop Notebook Papers). Amsterdam, The Netherlands

  37.

    Krishnamoorthy S (2015) Linguistic features for review helpfulness prediction. Expert Syst Appl 42(7):3751–3759

  38.

    Kroft S, (2014). The data brokers: selling your personal information. 60 Minutes. URL: https://www.cbsnews.com/news/the-data-brokers-selling-your-personal-information/. Accessed: July 24, 2020.

  39.

    Kumar V, Reinartz W (2018) Customer privacy concerns and privacy protective responses. In: Customer relationship management. Springer, Berlin, pp 285–309

  40.

    Li XB, Qin J (2017) Anonymizing and sharing medical text records. Inf Syst Res 28(2):332–352

  41.

    Li XB, Sarkar S (2006) Privacy protection in data mining: a perturbation approach for categorical data. Inf Syst Res 17(3):254–270

  42.

    Mankad S, Han HS, Goh J, Gavirneni S (2016) Understanding online hotel reviews through automated text analysis. Serv Sci 8(2):124–138

  43.

    Martin KD, Murphy PE (2017) The role of data privacy in marketing. J Acad Mark Sci 45(2):135–155

  44.

    Menon S, Sarkar S (2016) Privacy and big data: scalable approaches to sanitize large transactional databases for sharing. MIS Q 40(4):963–981

  45.

    Moe WW, Schweidel DA (2012) Online product opinions: incidence, evaluation, and evolution. Mark Sci 31(3):372–386

  46.

    Narayanan A, Paskov H, Gong NZ, Bethencourt J, Stefanov E, Shin ECR and Song D, (2012). On the feasibility of internet-scale author identification. In 2012 IEEE Symposium on Security and Privacy (pp. 300-314). IEEE.

  47.

    Narayanan A and Shmatikov V, 2008, May. Robust de-anonymization of large datasets. In Proceedings of the 2008 IEEE Symposium on Security and Privacy.

  48.

    The Associated Press, (2017). Yelp says lawsuit might eliminate all negative reviews. New York Daily News. URL: https://www.nydailynews.com/news/national/yelp-lawsuit-eliminate-negative-reviews-article-1.2796087. Accessed July 24, 2020.

  49.

    Payer M, Huang L, Gong NZ, Borgolte K, Frank M (2014) What you submit is who you are: a multimodal approach for deanonymizing scientific publications. IEEE Transact Inform Forensics Secur 10(1):200–212

  50.

    Peer E, Vosgerau J, Acquisti A (2014) Reputation as a sufficient condition for data quality on Amazon Mechanical Turk. Behav Res Methods 46(4):1023–1031

  51.

    Porter J, (2019). Fraudulent Yelp posting protected under the law, ridiculous. Tahoe Daily Tribune, May 20, 2019. URL: https://www.tahoedailytribune.com/news/jim-porter-fraudulent-yelp-posting-protected-under-the-law-ridiculous/. Accessed July 24, 2020.

  52.

    Proserpio D, Zervas G (2017) Online reputation management: estimating the impact of management responses on consumer reviews. Mark Sci 36(5):645–665

  53.

    Qian T, Liu B, Chen L and Peng Z, (2014). Tri-training for authorship attribution with limited training data. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 345-351).

  54.

    Rochet JC, Tirole J (2003) Platform competition in two-sided markets. J Eur Econ Assoc 1(4):990–1029

  55.

    Schneider MJ, Jagpal S, Gupta S, Li S, Yu Y (2017) Protecting customer privacy when marketing with second-party data. Int J Res Mark 34(3):593–603

  56.

    Schneider MJ, Jagpal S, Gupta S, Li S, Yu Y (2018) A flexible method for protecting marketing data: an application to point-of-sale data. Mark Sci 37:153–171. https://doi.org/10.1287/mksc.2017.1064

  57.

    Shu K, Wang S, Tang J, Zafarani R, Liu H (2017) User identity linkage across online social networks: a review. Acm Sigkdd Explor Newslett 18(2):5–17

  58.

    Singh JP, Irani S, Rana NP, Dwivedi YK, Saumya S, Roy PK (2017) Predicting the “helpfulness” of online consumer reviews. J Bus Res 70:346–355

  59.

    Snyder P, Doerfler P, Kanich C and McCoy D, (2017). Fifteen minutes of unwanted fame: detecting and characterizing doxing. In Proceedings of the 2017 Internet Measurement Conference (pp. 432-444).

  60.

    Stamatatos E (2009) A survey of modern authorship attribution methods. J Am Soc Inf Sci Technol 60(3):538–556

  61.

    Steyvers M, Griffiths T (2007) Probabilistic topic models. Handb Latent Semantic Anal 427(7):424–440

  62.

    Stone EF, Spool MD, Rabinowitz S (1977) Effects of anonymity and retaliatory potential on student evaluations of faculty performance. Res High Educ 6(4):313–325

  63.

    Sweeney L (2000) Simple demographics often identify people uniquely. Health (San Francisco) 671(2000):1–34

  64.

    Sweeney L (2002a) k-anonymity: a model for protecting privacy. Int J Uncertaint Fuzziness Knowl-Based Syst 10(05):557–570

  65.

    Sweeney L (2002b) Achieving k-anonymity privacy protection using generalization and suppression. Int J Uncertaint Fuzziness Knowl-Based Syst 10(05):571–588

  66.

    Tirunillai S, Tellis GJ (2014) Mining marketing meaning from online chatter: strategic brand analysis of big data using latent Dirichlet allocation. J Mark Res 51(4):463–479

  67.

    Turjeman D and Feinberg FM, (2019). When the data are out: measuring behavioral changes following a data breach. Available at SSRN 3427254.

  68.

    Tweedie FJ, Baayen RH (1998) How variable may a constant be? Measures of lexical richness in perspective. Comput Hum 32(5):323–352

  69.

    US Census Bureau, (2016). Decennial Census Surname Files (2010, 2000). URL: https://www.census.gov/data/developers/data-sets/surnames.html. Accessed July 24, 2020.

  70.

    US Social Security Administration, (2019). Baby names from social security card applications - national data. Data.gov. URL: https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data. Accessed: July 24, 2020.

  71.

    Wedel M, Kannan PK (2016) Marketing analytics for data-rich environments. J Mark 80(6):97–121

  72.

    Winkler WE, (1999). The state of record linkage and current research problems. In Statistical Research Division, US Census Bureau.

  73.

    Xia D, Mankad S, Michailidis G (2016) Measuring influence of users in Twitter ecosystems using a counting process modeling framework. Technometrics 58(3):360–370

  74.

    Xu J, Ding M (2019) Using the double transparency of autonomous vehicles to increase fairness and social welfare. Cust Needs Solut 6(1):26–35

  75.

    Yelp, 2020. https://terms.yelp.com/privacy/en_us/20200101_en_us/#Controlling-Your-Personal-Data.

  76.

    Yule, G.U., 1944. The statistical study of literary vocabulary. In Mathematical Proceedings of the Cambridge Philosophical Society (Vol. 42, pp. b1-9).

  77.

    Zhang Y, Moe WW, Schweidel DA (2017) Modeling the role of message content and influencers in social media rebroadcasting. Int J Res Mark 34(1):100–119

  78.

    Zhao Y, Yang S, Narayan V, Zhao Y (2013) Modeling consumer learning from online product reviews. Mark Sci 32(1):153–169

Acknowledgements

We are thankful to Elea Feit, Sachin Gupta, Cameron Bale, and Sharan Jagpal for their helpful comments on earlier versions of this paper.

Author information

Corresponding author

Correspondence to Matthew J. Schneider.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Appendix. Expanded results for the Yelp data

Figure 4 provides out-of-sample accuracy results for the Yelp data. Accuracy consistently improves as the sophistication of the data intruder and the size of the training data increase.

Fig. 4 Full second-stage accuracy results for the Yelp dataset

Cite this article

Schneider, M.J., Mankad, S. A Two-Stage Authorship Attribution Method Using Text and Structured Data for De-Anonymizing User-Generated Content. Cust. Need. and Solut. 8, 66–83 (2021). https://doi.org/10.1007/s40547-021-00116-x

Keywords

  • Data privacy
  • De-anonymization
  • Stylometry
  • User identity linkage
  • Data linkage