Abstract
This paper proposes two algorithms embedded in an integrated system: a Dynamic Path Selection Clustering (DPSC) algorithm for document clustering and a Rearward Binary Window Match (RBWM) algorithm for the user's search engine. The DPSC algorithm is derived from the concept of Google's crawler technique and is implemented as offline processing; the RBWM algorithm is derived from the techniques of other search algorithms. The system is designed to give an appropriate data structure to the content of the input dataset, the Enron dataset, which is large in volume and unstructured. It integrates four individual, independent units under one frame: data preprocessing, document clustering, mapping of clusters and the search engine. Such a refined, integrated frame would likely serve evidence gathering better, because a system defined by a search engine alone tends to retrieve more irrelevant information. Many existing systems in forensic departments consist only of a simple search engine without any other processing, and irrelevant retrieval is seen in them to a large extent. Consequently, an automated integrated system built from the above well-defined, configured units is proposed. This systematic approach is intended to make adequate use of digital textual evidence, which assists in quicker crime identification. The outcomes of the proposed system are analyzed by obtaining precision and recall values and comparing them with the results of metasearch engines such as Dogpile and Metacrawler, to test the efficacy of retrieval.
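The precision and recall values mentioned in the abstract follow the standard information-retrieval definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The sketch below is a minimal illustration of those definitions only; it is not the authors' evaluation code. The function name `precision_recall` and the document identifiers are hypothetical and are not taken from the paper or from the Enron dataset.

```python
from typing import Iterable, Set, Tuple


def precision_recall(retrieved: Iterable[str],
                     relevant: Iterable[str]) -> Tuple[float, float]:
    """Return (precision, recall) for a single query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    An empty denominator yields 0.0 (an assumption made for this sketch).
    """
    retrieved_set: Set[str] = set(retrieved)
    relevant_set: Set[str] = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical document identifiers, used only for illustration.
retrieved_docs = ["mail_01", "mail_02", "mail_03", "mail_04"]
relevant_docs = ["mail_02", "mail_04", "mail_07"]

p, r = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

Computing both values per query and averaging over a query set is one common way to compare two retrieval systems on the same footing.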
Acknowledgments
We thank Sathyabama University for providing us with various resources and unconditional support for carrying out this work.
Additional information
Enron dataset: http://www.cs.cmu.edu/~enron/
Crawling & Indexing: http://www.google.com/intl/en/insidesearch/howsearchworks/crawling–indexing.html
Google Crawlers: https://support.google.com/webmasters/answer/1061943?hl=en
Web crawler: http://en.wikipedia.org/wiki/Web_crawler
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Shanmugam, G., Sankar, A. Strategic enhancement of the collaborative framework for novelty in retrieval from digital textual data corpus by deploying DPSC and RBWM algorithms for forensic analysis. J Engin Res 3, 33 (2015). https://doi.org/10.7603/s40632-015-0033-4
DOI: https://doi.org/10.7603/s40632-015-0033-4