The workforce analyzer: group discovery among LinkedIn public profiles

  • Kais Dai
  • Ana Fernández Vilas
  • Rebeca P. Díaz Redondo
Original Research


In this paper, we describe two methods for discovering groups of users among LinkedIn public profiles. We first cluster profiles according to their professional background, combining the K-means technique with the gap statistic method and using tag clouds to examine the resulting groups. In the second phase, we classify the same profiles by relying on a knowledge base: we design a multi-label classifier based on support vector machines that takes advantage of the LinkedIn job ads taxonomy. Finally, we contrast the results of both methods and provide insights into the trending professional orientations of the workforce from an online perspective.
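To make the first phase more concrete, the following is a minimal sketch of how K-means can be combined with the gap statistic to estimate the number of groups. It is not the authors' implementation: the function names (`kmeans_dispersion`, `gap_statistic`), the farthest-first seeding, the uniform bounding-box reference data, and the synthetic 2-D points standing in for profile feature vectors are all assumptions made for illustration.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_dispersion(points, k, rng, iters=100):
    """Run Lloyd's K-means and return W_k, the within-cluster sum of squares."""
    # Farthest-first seeding: a simplification assumed for this sketch.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def gap_statistic(points, k_max, n_refs=10, seed=0):
    """Return (best_k, gaps, errors) using uniform reference data sets."""
    rng = random.Random(seed)
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    gaps, errors = {}, {}
    for k in range(1, k_max + 1):
        log_w = math.log(kmeans_dispersion(points, k, rng))
        ref_logs = []
        for _ in range(n_refs):
            # Reference set: same size as the data, uniform over its bounding box.
            ref = [
                tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                for _ in points
            ]
            ref_logs.append(math.log(kmeans_dispersion(ref, k, rng)))
        mean_ref = sum(ref_logs) / n_refs
        sd = math.sqrt(sum((v - mean_ref) ** 2 for v in ref_logs) / n_refs)
        gaps[k] = mean_ref - log_w
        errors[k] = sd * math.sqrt(1 + 1 / n_refs)
    # Choose the smallest k with Gap(k) >= Gap(k+1) - s_(k+1).
    for k in range(1, k_max):
        if gaps[k] >= gaps[k + 1] - errors[k + 1]:
            return k, gaps, errors
    return k_max, gaps, errors


if __name__ == "__main__":
    # Synthetic stand-in for profile vectors: three well-separated blobs.
    demo_rng = random.Random(42)
    centers = [(0.0, 0.0), (8.0, 0.0), (4.0, 12.0)]
    data = [
        (demo_rng.gauss(cx, 0.5), demo_rng.gauss(cy, 0.5))
        for cx, cy in centers
        for _ in range(30)
    ]
    best_k, gaps, _ = gap_statistic(data, k_max=5)
    print("estimated number of clusters:", best_k)
```

In a setting like the one described above, the input points would instead be numeric vectors derived from profile text, and a library K-means implementation with several random restarts would likely be preferable to the hand-written loop shown here.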


Keywords: Group discovery; LinkedIn; Text mining; Clustering; Multi-label classification; N-gram



This work is funded by the Spanish Ministry of Economy and Competitiveness under the National Science Program (TEC2014-54335-C4-3-R), and by the European Regional Development Fund (ERDF) and the Galician Regional Government under the agreement for funding the Atlantic Research Center for Information and Communication Technologies (AtlantTIC). It is also partially funded by the European Commission under the Erasmus Mundus GreenIT Project (3772227-1-2012-ES-ERA MUNDUS-EMA21).



Copyright information

© Springer-Verlag Berlin Heidelberg 2017

Authors and Affiliations

  • Kais Dai (1)
  • Ana Fernández Vilas (1)
  • Rebeca P. Díaz Redondo (1)
  1. Information and Computing Laboratory, AtlantTIC Research Center, University of Vigo, Vigo, Spain
