Identification of the Words Most Frequently Used by Different Generations of Twitter Users

Majkowska, Agata; Migdał-Najman, Kamila; Najman, Krzysztof; Raca, Katarzyna

doi:10.1007/978-3-030-75190-6_3

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Included in the following conference series:

Conference of the Section on Classification and Data Analysis of the Polish Statistical Association

Abstract

Text data constitute a significant part of all data generated on the Internet, including the comments and posts of social network users. Each social networking site offers its users different functionalities: LinkedIn focuses mainly on the labor market and on professional and business contacts; Facebook lets users create groups and share photos and messages with friends; Twitter allows users to post and follow short text messages. One piece of information that researchers would like to obtain about the users of these portals is their age, which is crucial for marketing, social, and economic research. Each social network, however, has different privacy rules, including rules on publishing a user’s date of birth, which makes this information difficult for researchers to obtain. The aim of the research presented here is to characterize the words typically used in the messages published by Twitter users. Twitter was chosen because its data can be downloaded without additional user consent. The research used text mining methods and techniques and focused mainly on the analysis of individual words and collocations occurring in the users’ tweets.
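As a rough illustration of what an analysis of individual words and collocations can look like, the sketch below counts single words and adjacent word pairs (bigrams) in a handful of tweets. It is a minimal example under stated assumptions, not the authors' procedure: the sample tweets, the regular-expression tokenizer, and the function names `tokenize`, `word_counts`, and `bigram_counts` are all hypothetical, and the abstract does not say which tools or preprocessing steps were actually used.

```python
import re
from collections import Counter

# Hypothetical sample tweets; the chapter's actual data set is not shown here.
TWEETS = [
    "Good morning everyone, good coffee today",
    "Good morning from Gdansk",
    "Coffee today, coffee tomorrow",
]

# Assumption: a token is a run of ASCII letters or apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def tokenize(text):
    """Lowercase a tweet and split it into simple word tokens."""
    return TOKEN_RE.findall(text.lower())


def word_counts(tweets):
    """Count how often each individual word occurs across all tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tokenize(tweet))
    return counts


def bigram_counts(tweets):
    """Count adjacent word pairs within each tweet (a simple collocation proxy)."""
    counts = Counter()
    for tweet in tweets:
        tokens = tokenize(tweet)
        # Pair each token with its successor; pairs never span two tweets.
        counts.update(zip(tokens, tokens[1:]))
    return counts


words = word_counts(TWEETS)
bigrams = bigram_counts(TWEETS)

print(words.most_common(3))
print(bigrams.most_common(2))
```

Raw bigram counts are only the simplest stand-in for collocations; a fuller analysis would typically remove stop words and rank pairs with an association measure rather than by frequency alone.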

Notes

1. The term “Silent Generation” first appeared on November 5, 1951, in “Time” magazine, in the article “The younger generation”.
2. X is the number of years declared by the users.
3. The difference in the number of users results from the interval between downloading the tweets and downloading the metadata. During that time, usernames could have been changed and user accounts deleted or blocked, which resulted in a smaller number of users in the database.

Author information

Authors and Affiliations

University of Gdańsk, Gdańsk, Poland
Agata Majkowska, Kamila Migdał-Najman, Krzysztof Najman & Katarzyna Raca

Authors

Agata Majkowska
Kamila Migdał-Najman
Krzysztof Najman
Katarzyna Raca

Corresponding author

Correspondence to Krzysztof Najman.

Editor information

Editors and Affiliations

Department of Financial Investments and Risk Management, Wroclaw University of Economics and Business, Wroclaw, Poland
Krzysztof Jajuga
Department of Statistics, University of Gdańsk, Sopot, Poland
Krzysztof Najman
Department of Econometrics and Computer Science, Wroclaw University of Economics and Business, Jelenia Góra, Poland
Marek Walesiak

About this paper

Cite this paper

Majkowska, A., Migdał-Najman, K., Najman, K., Raca, K. (2021). Identification of the Words Most Frequently Used by Different Generations of Twitter Users. In: Jajuga, K., Najman, K., Walesiak, M. (eds) Data Analysis and Classification. SKAD 2020. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-030-75190-6_3

DOI: https://doi.org/10.1007/978-3-030-75190-6_3
Published: 28 June 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-75189-0
Online ISBN: 978-3-030-75190-6
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
