Spam analysis of big reviews dataset using Fuzzy Ranking Evaluation Algorithm and Hadoop

  • Original Article
International Journal of Machine Learning and Cybernetics

Abstract

Online reviews are the most easily available free information sources, used by both organizations and customers to make decisions. Some establishments exploit the influence of opinions to earn undue profit by hiring professionals, known as spammers, to post positive comments on their own products and negative opinions on their competitors' products. This activity is known as opinion spamming, and it must be identified so that results reflect genuine sentiment towards a product. So far, opinion spam detection has been treated as a discrete classification problem, generally spam versus non-spam. However, it involves uncertainty, as a user's suspicious behavior might be due to coincidence. Because fuzzy logic handles real-world uncertainty very well, we propose a novel fuzzy-modeling-based solution to the problem. We propose four fuzzy input linguistic variables and consider the suspicion level of a spammer group to be one of: Ultra, Mega, Immense, Highly, Moderate, Slightly and Feebly. We define a novel FSL Deduction Algorithm, which generates 81 fuzzy rules, and the Fuzzy Ranking Evaluation Algorithm (FREA), which determines the extent to which a group is suspicious. As review datasets satisfy the three V's of big data (Volume, Velocity and Variety), we treat this as a big data problem and use Hadoop for storage and analysis. We further demonstrate the proposed algorithm on a sample reviews dataset and an Amazon reviews dataset, achieving an accuracy of 80.77% which, unlike other approaches, remains steady for a large number of groups and deals well with the uncertainty involved in opinion spam detection.
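To make the shape of such a rule base concrete, the following is a minimal sketch, not the FSL Deduction Algorithm or FREA themselves, whose details are not given in the abstract. It assumes that each of the four input variables takes three linguistic terms (one reading consistent with 81 = 3^4 rules), that the seven suspicion levels are ordered from Feebly up to Ultra, and that inputs are crisp scores in [0, 1]. The variable names, term names, triangular membership functions and the mapping from antecedents to levels are all hypothetical placeholders.

```python
from itertools import product

# Hypothetical placeholders: the abstract names neither the four input
# variables nor their linguistic terms.
VARIABLES = ["v1", "v2", "v3", "v4"]
TERMS = ["low", "medium", "high"]  # assumed: three terms, since 3**4 == 81
# Seven output levels from the abstract, assumed ordered weakest to strongest.
LEVELS = ["Feebly", "Slightly", "Moderate", "Highly", "Immense", "Mega", "Ultra"]


def triangular(x, a, b, c):
    """Triangular membership with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def fuzzify(x):
    """Map a crisp score in [0, 1] to membership degrees for the assumed terms."""
    return {
        "low": triangular(x, -0.5, 0.0, 0.5),
        "medium": triangular(x, 0.0, 0.5, 1.0),
        "high": triangular(x, 0.5, 1.0, 1.5),
    }


def build_rules():
    """Enumerate all 81 antecedent combinations with a placeholder consequent."""
    rules = {}
    for combo in product(range(len(TERMS)), repeat=len(VARIABLES)):
        # Placeholder mapping (not the paper's): scale the summed term
        # indices (0..8) onto the seven level indices (0..6).
        level = (sum(combo) * 6 + 4) // 8
        rules[tuple(TERMS[i] for i in combo)] = LEVELS[level]
    return rules


def evaluate(scores, rules):
    """Min/max (Mamdani-style) inference; return the most strongly fired level."""
    degrees = [fuzzify(s) for s in scores]
    strength = {}
    for antecedent, level in rules.items():
        fire = min(d[t] for d, t in zip(degrees, antecedent))
        strength[level] = max(strength.get(level, 0.0), fire)
    return max(strength, key=strength.get)


rules = build_rules()
print(len(rules))                             # number of enumerated rules
print(evaluate([0.9, 0.8, 1.0, 0.7], rules))  # level for one hypothetical group
```

The point of the sketch is only that a graded output replaces the binary spam/non-spam label: a group whose scores are borderline fires several rules partially instead of being forced into one class.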




Author information

Corresponding author

Correspondence to Komal Dhingra.

About this article

Cite this article

Dhingra, K., Yadav, S.K. Spam analysis of big reviews dataset using Fuzzy Ranking Evaluation Algorithm and Hadoop. Int. J. Mach. Learn. & Cyber. 10, 2143–2162 (2019). https://doi.org/10.1007/s13042-017-0768-3

  • DOI: https://doi.org/10.1007/s13042-017-0768-3
