1 Introduction

Nowadays, social media has become an important part of everyday life and a major source of real-time content. Timely information is critical in transportation systems. Social media can be analyzed to detect traffic-related events such as congestion, incidents, natural disasters, and other events that affect transport [1].

Among social network platforms, Twitter is a commonly used microblogging site. Organizations use Twitter to communicate with customers, and it provides a cost-effective and reliable method of sharing information [4]. Twitter has more than 200 million monthly active users [2], who currently post about 340,000,000 tweets per day. People share traffic-related information mostly through status update messages (SUMs) [5]. These messages often describe the current traffic situation as observed while driving, so Intelligent Transportation Systems (ITSs) can use them to detect traffic-related events [14].

In this paper, we review two main techniques that use Twitter to handle traffic incidents. The first technique is an ITS [20] that relies on machine learning algorithms and text mining for real-time detection of incidents and events [3] from the Twitter stream [10]. The second technique is a methodology for processing, classifying, and retrieving public tweets that combines popular Natural Language Processing (NLP) techniques with a Support Vector Machine (SVM) algorithm for text classification. The paper is structured as follows. Section 2 describes the problem addressed by the system, and Sect. 3 details both techniques and their implementation. Section 4 compares the results of the experiments. Finally, Sect. 5 presents the conclusions.

2 Problem Statement

Many people tweet about different traffic problems, such as traffic jams, no-parking areas, heavy vehicles, and U-turns [1]. The causes of these problems may stay the same or vary from one location to another. Furthermore, some tweets mention multiple problems related to a particular scenario. The detection system must handle these problems in order to predict the traffic situation [18].

3 Methods

The two main techniques that use Twitter to handle traffic incidents are as follows:

  • Detection Using Data of Twitter

  • Real-Time Detection Using Twitter Data

3.1 Detection Using Data of Twitter

This technique classifies and retrieves traffic-related tweets. An API is selected for real-time streaming of data, tweets are filtered by road names and keywords, and special characters are removed using NLP techniques [7]. Lastly, an SVM algorithm classifies the tweets as traffic or non-traffic. The following paragraphs describe each step in detail.

Twitter Data Gain.

Twitter provides free access to public posts through two different kinds of tools. With the REST API, users can query by keyword or location and retrieve popular tweets, but queries are limited to 350 every 15 min. We chose the Streaming API for real-time streaming of data [13]. The collected tweets are then filtered by road names and traffic-related keywords (e.g., M6, accidents). This filtering is done using regular expressions [6].
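The filtering step can be sketched with Python's standard `re` module. This is only a minimal illustration: the road names and keywords beyond "M6" and "accidents" are hypothetical, and treating a match on either a road name or a keyword as sufficient is an assumption, since the source does not state how the two criteria are combined.

```python
import re

# Hypothetical filter terms; the source only names "M6" and "accidents".
ROAD_NAMES = ["M6", "M1", "A38"]
TRAFFIC_KEYWORDS = ["accident", "congestion", "traffic", "delay"]

# \b keeps "M6" from matching inside longer tokens such as "M60".
ROAD_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, ROAD_NAMES)) + r")\b", re.IGNORECASE
)
# The optional "s" catches simple plurals such as "accidents".
KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, TRAFFIC_KEYWORDS)) + r")s?\b", re.IGNORECASE
)


def is_candidate(tweet_text):
    """Return True if the tweet names a road or uses a traffic keyword (assumed OR rule)."""
    return bool(ROAD_RE.search(tweet_text) or KEYWORD_RE.search(tweet_text))


def filter_stream(tweets):
    """Yield only candidate tweets from any iterable of tweet texts."""
    for text in tweets:
        if is_candidate(text):
            yield text
```

Because `filter_stream` accepts any iterable, it could be applied to texts arriving from a streaming client as well as to a stored list.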

Pre-processing.

Text on social media is usually written in informal language: tweets often include emoticons, special characters, hashtags, and so on. It is therefore important to clean the text with text-mining techniques before sending it to the classifier. The following steps are applied to the dataset to clean the text:

  (a) Tokenization: The whole text is broken down into tokens [7]. In this process, non-alphanumeric characters such as emoticons, hashtags, and punctuation are removed, so that the result is a set of words [19]. This task is carried out in Python.

  (b) Stop-word removal: This step eliminates words that do not help characterize a text, such as conjunctions, prepositions, and articles. The full list of English stop words is taken from the Natural Language Toolkit (NLTK), a well-known Python library.
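The two cleaning steps above can be sketched as follows. To keep the example self-contained, it uses a short hand-written stop-word set as a stand-in; this set is an assumption, and the full English list described in the text comes from NLTK, which requires a one-time corpus download.

```python
import re

# Illustrative stand-in for the NLTK English stop-word list (assumption).
STOP_WORDS = {
    "a", "an", "the", "and", "or", "but",
    "on", "in", "at", "to", "of", "is", "are", "for",
}


def tokenize(text):
    """Lowercase the text and keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]


def preprocess(text):
    """Tokenize, then remove stop words."""
    return remove_stop_words(tokenize(text))
```

For example, `preprocess("Huge #accident on the M6 :( avoid!!")` returns `["huge", "accident", "m6", "avoid"]`: the hashtag sign, emoticon, and punctuation disappear during tokenization, and "on" and "the" are removed as stop words.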

Classification.

The last step of detection is to classify the pre-processed tweets as traffic or non-traffic. Many machine learning algorithms, such as SVM, naïve Bayes, and neural networks, can be used for classification; here it is implemented using SVM with the Scikit-learn library [9, 13].
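A minimal sketch of such a classifier with Scikit-learn is given below. The source states only that an SVM from Scikit-learn is used; the TF-IDF features, the linear kernel (`LinearSVC`), and the toy training texts are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, hand-written examples; they are not taken from the reviewed dataset.
train_texts = [
    "accident on m6 northbound long delays",
    "heavy congestion near junction 10",
    "road closed after crash on a38",
    "traffic jam on the ring road",
    "lovely coffee this morning",
    "watching a movie tonight",
    "great match at the stadium",
    "new phone arrived today",
]
train_labels = ["traffic"] * 4 + ["non-traffic"] * 4

# Assumed feature extraction (TF-IDF) followed by a linear SVM.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)


def classify(text):
    """Return the predicted label for a single pre-processed tweet."""
    return model.predict([text])[0]
```

With so few examples the predictions are not meaningful; the sketch only shows how vectorization and the SVM fit together in one pipeline.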

Implementation.

We now describe the dataset used for the experiments. A total of 3,956,871 tweets were collected using the Twitter Streaming API from March 1st, 2017 to May 31st, 2017. These tweets were labeled and then divided into a training dataset and a testing dataset.

  (a) Training Data Set: The filtered tweets are used to train the algorithm and are labeled as traffic (Good) or non-traffic (Bad). The training set contains 870 or 871 traffic-related tweets and 870 or 871 non-traffic tweets. After that, 10-fold cross-validation is performed on this set [6].

  (b) Test Data Set: The remaining tweets form the testing dataset, which contains 289 or 290 traffic-related tweets and 289 or 290 non-traffic tweets and is used to evaluate the model fitted on the training data [6].
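The 10-fold cross-validation mentioned for the training set can be illustrated with a small index splitter. This is a generic sketch, not the procedure of the reviewed work; the function names are hypothetical, and the sample count of 1741 used below is an assumed total (870 + 871), since the source gives the class sizes only approximately.

```python
def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validation_splits(n_samples, k=10):
    """Yield (train_indices, validation_indices) once per fold."""
    folds = k_fold_indices(n_samples, k)
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Assumed training-set size: 870 + 871 tweets.
fold_sizes = [len(fold) for fold in k_fold_indices(1741, 10)]
```

Each tweet is used for validation exactly once and for training in the other nine rounds. In practice the examples would be shuffled before splitting so that each fold contains both classes.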

3.2 Real-Time Detection Using Twitter Data

The proposed system is based on machine learning algorithms and text mining for real-time detection of traffic events through Twitter stream analysis. The system is event-driven and built on a service-oriented architecture (SOA). It performs multi-class classification to distinguish traffic from non-traffic messages [10, 14].

Pre-Processing and SUMs Fetching.

The proposed system first performs SUM fetching and pre-processing. It extracts raw posts from Twitter based on multiple search criteria, such as geographic coordinates and keywords appearing in the text of the tweet [5]. Once the SUMs matching the search criteria have been fetched, they are pre-processed: a regular expression filter is applied to the text of each raw tweet to remove additional information associated with the text [8, 10, 12].
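A regular expression filter of this kind might look like the sketch below. The source does not list which pieces of additional information are removed, so the choice of links, user mentions, and non-alphanumeric symbols is an assumption, as is the function name `clean_sum`.

```python
import re

URL_RE = re.compile(r"https?://\S+")         # links
MENTION_RE = re.compile(r"@\w+")             # user mentions
NON_TEXT_RE = re.compile(r"[^A-Za-z0-9\s]")  # '#', emoticons, punctuation
SPACE_RE = re.compile(r"\s+")


def clean_sum(raw_text):
    """Strip assumed meta-information from one raw SUM and normalize whitespace."""
    text = URL_RE.sub(" ", raw_text)
    text = MENTION_RE.sub(" ", text)
    text = NON_TEXT_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip().lower()
```

For instance, `clean_sum("@bob Queue on #A1!! http://t.co/xyz")` returns `"queue on a1"`. Links are removed before the symbol filter so that fragments of a URL do not survive as stray words.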

Elaboration of SUMs.

This module of the proposed system, named Elaboration of SUMs, applies text mining techniques to the SUMs to prepare them for classification [12]. The text mining steps implemented in this module are described below [8, 10, 14].

  (a) Tokenization is a text-mining process that transforms a stream of characters into a stream of processing units called tokens. This process removes all punctuation marks and divides every SUM into tokens, which correspond to words, so that each SUM is represented as a sequence of words [8].

  (b) Stop-word filtering eliminates words that carry little information for text analysis, such as articles, conjunctions, prepositions, and pronouns. Other words that are specific to a language or domain and appear frequently may also be treated as noise and removed [8].

  (c) Stemming is the process of reducing each token (word) to its stem or root form by removing its suffix. The aim of this step is to group words that share the same theme and have closely related semantics [10, 14].

  (d) Stem filtering is a process that reduces the number of stems per SUM. Each SUM is filtered by removing the stems that do not belong to the set of relevant stems [10, 14].
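The last two steps of this list, stemming and stem filtering, can be sketched in a few lines of Python. The suffix stripper below is a deliberately naive stand-in and not the stemmer used in the reviewed system, and the set of relevant stems is hypothetical.

```python
# Naive suffix stripping for illustration only; this is NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical set of relevant stems (assumption).
RELEVANT_STEMS = {"traffic", "queue", "accident", "block", "road", "clos"}


def stem(token):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def stem_filter(tokens):
    """Stem every token and keep only stems that belong to the relevant set."""
    stems = [stem(t) for t in tokens]
    return [s for s in stems if s in RELEVANT_STEMS]
```

With these definitions, `stem_filter(["roads", "blocked", "after", "accidents"])` returns `["road", "block", "accident"]`: "after" is left unchanged by the stemmer and is then discarded because it is not a relevant stem.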

Classification of SUMs.

The third module of the proposed system classifies each SUM and assigns it a class label associated with a traffic event [5]. The classification results can finally be used together with real-time traffic control systems [14]: the system continuously monitors a certain area and reports the presence of a traffic event according to a set of rules defined by the system administrator [8, 12] (Fig. 1).

Fig. 1. 2D dataset class

Implementation.

SUMs posted by users are classified into three classes: traffic, non-traffic, and traffic due to an external event. Classification is done with the Naive Bayes (NB) classifier. The dataset with the first two classes, traffic and non-traffic, is also called the 2-class dataset [5], and the dataset with all three classes, traffic, non-traffic, and traffic due to an external event, is also called the 3-class dataset. Here the SUMs are classified by applying the NB classifier, SVM, and text mining techniques [10] (Fig. 2).

Fig. 2. 3D dataset class
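To make the three-class setting concrete, the following is a minimal multinomial Naive Bayes sketch in plain Python with add-one (Laplace) smoothing. It is not the implementation of the reviewed system; the class labels, the toy token lists, and the smoothing choice are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.token_counts[label].update(tokens)
        self.vocab = {t for counts in self.token_counts.values() for t in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                if t in self.vocab:  # tokens never seen in training are ignored
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, hand-written SUMs after pre-processing (not from the reviewed dataset).
docs = [
    ["queue", "road", "accident"],
    ["traffic", "jam", "motorway"],
    ["traffic", "match", "stadium"],
    ["queue", "concert", "road"],
    ["coffee", "morning"],
    ["movie", "night"],
]
labels = [
    "traffic", "traffic",
    "traffic_external", "traffic_external",
    "non_traffic", "non_traffic",
]
nb = NaiveBayes().fit(docs, labels)
```

On this toy data, `nb.predict(["accident", "motorway"])` gives `"traffic"`, `nb.predict(["concert", "stadium"])` gives `"traffic_external"`, and `nb.predict(["coffee", "night"])` gives `"non_traffic"`. Dropping the external-event class from `docs` and `labels` yields the two-class variant.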

4 Expected Results

The Detection Using Data technique works with a geographical filter and does not operate in real time. Its processing involves two steps: tokenization and stop-word removal [6]. The real-time detection technique detects traffic events in real time. Its processing involves tokenization, stop-word filtering, stemming, and stem filtering [14] (Table 1).

Table 1. Detection using data result

With SVM as the classification model, the binary classification problem (traffic vs. non-traffic tweets) reached an accuracy of 95.75% [5]. The more demanding problem is multiclass classification, which also distinguishes whether or not the traffic is due to an external event; here an accuracy of 88.89% was attained [14]. The other technique, which combines popular NLP techniques with SVM for text classification, detected traffic tweets with an accuracy of 88.28% [6] (Tables 2 and 3).

Table 2. Real-time detection result (2-class dataset)
Table 3. Real-time detection result (3-class dataset)

5 Conclusion

This paper presents the details of two detection methodologies for processing, classifying, and retrieving public tweets, which use popular NLP techniques in combination with the SVM algorithm for text classification. We reviewed a framework that uses data from Twitter to manage incident detection in transport networks. The detection system handles the reported problems in order to predict the traffic situation, and further uses an appropriate algorithm to identify the problems and their causes from the tweets.