Resampling imbalanced data to detect fake reviews using machine learning classifiers and textual-based features

Multimedia Tools and Applications

Abstract

Fraudulent online sellers often collude with reviewers to garner fake reviews for their products. This practice undermines buyers' trust in product reviews and can reduce the effectiveness of online markets, so being able to detect fake reviews accurately is critical. In this study, we investigate several preprocessing and textual-based feature extraction methods along with machine learning classifiers, including single and ensemble models, to build a fake review detection system. Because fake reviews are far fewer than genuine reviews in product review data, we examine the results for each class in detail in addition to the overall results. Our preliminary analysis shows that, owing to the imbalanced data, the per-class accuracies differ sharply (e.g., 1.3% for the fake review class and 99.7% for the genuine review class), even though the overall accuracy looks promising (around 89.7%). To address this class imbalance problem, we propose two dynamic random sampling techniques that are applicable to textual-based feature extraction methods. Our results indicate that both sampling techniques can improve the accuracy of the fake review class: for balanced datasets, it can be improved to a maximum of 84.5% and 75.6% for random under-sampling and over-sampling, respectively. However, the accuracies for genuine reviews decrease to 75% and 58.8% for random under-sampling and over-sampling, respectively. We also find that, for smaller datasets, the Adaptive Boosting ensemble model outperforms the single classifiers, whereas for larger datasets the performance improvement from ensemble models is insignificant compared to the best results obtained by single classifiers.
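To make the two ideas in the abstract concrete, the following minimal Python sketch illustrates plain random under-sampling and over-sampling, together with a per-class accuracy measure. It is an illustration under stated assumptions, not the dynamic sampling techniques proposed in the paper, whose details are not given in the abstract. The function names `random_resample` and `per_class_accuracy`, the toy data, and the 90/10 class split are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_resample(samples, labels, mode="under", seed=0):
    """Balance classes by plain random under- or over-sampling.

    mode="under": randomly drop items until every class matches the smallest.
    mode="over":  randomly duplicate items until every class matches the largest.
    (Hypothetical helper; not the authors' dynamic variant.)
    """
    if mode not in ("under", "over"):
        raise ValueError("mode must be 'under' or 'over'")
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    sizes = [len(items) for items in by_class.values()]
    target = min(sizes) if mode == "under" else max(sizes)

    pairs = []
    for label, items in by_class.items():
        if len(items) >= target:
            # Draw without replacement down to the target size.
            chosen = rng.sample(items, target)
        else:
            # Keep every original item, then top up with random duplicates.
            chosen = items + rng.choices(items, k=target - len(items))
        pairs.extend((item, label) for item in chosen)

    rng.shuffle(pairs)
    return [p[0] for p in pairs], [p[1] for p in pairs]


def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly predicted items within each true class."""
    total, correct = Counter(y_true), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


# Toy imbalanced data (assumed): 90 genuine reviews, 10 fake reviews.
texts = [f"review {i}" for i in range(100)]
labels = ["genuine"] * 90 + ["fake"] * 10

under_x, under_y = random_resample(texts, labels, mode="under")
over_x, over_y = random_resample(texts, labels, mode="over")
print(sorted(Counter(under_y).items()))  # 10 items per class
print(sorted(Counter(over_y).items()))   # 90 items per class

# A classifier that always answers "genuine" scores 90% overall on this toy
# data, yet it never detects a single fake review.
always_genuine = ["genuine"] * len(labels)
print(per_class_accuracy(labels, always_genuine))
```

The last lines mirror the effect described in the abstract: a high overall accuracy can hide a near-zero accuracy on the minority class, which is why the results are reported per class. In practice, resampling would normally be applied to the training split only, so that the test data keeps its original class distribution.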

Acknowledgements

The first author would like to acknowledge financial support from the Indonesian Endowment Fund for Education (LPDP), Ministry of Finance, and the Directorate General of Higher Education (DIKTI), Ministry of Education and Culture, Republic of Indonesia.

Author information

Corresponding authors

Correspondence to Raymond Chiong or Zuli Wang.

About this article

Cite this article

Budhi, G.S., Chiong, R. & Wang, Z. Resampling imbalanced data to detect fake reviews using machine learning classifiers and textual-based features. Multimed Tools Appl 80, 13079–13097 (2021). https://doi.org/10.1007/s11042-020-10299-5

  • DOI: https://doi.org/10.1007/s11042-020-10299-5
