
1 Introduction

Popular microblogging services such as Twitter can generate as many as 600 million short texts in a day. Similar to real-world human conversations, these short texts, henceforth called tweets, cover all types of topics, including politics, sports, weather, product promotion, and interesting personal discoveries. Exploiting such mixed-domain data for the information needs of a narrow domain can prove extremely useful for identifying crucial information. Existing work in this field usually requires selecting small portions of data from a large, mixed-domain data body. For example, as a service such as Twitter allows public access to all its data, certain portions of this data have been collected for applications in narrow domains, including earthquake monitoring [15], influenza surveillance [2], election result prediction [18, 19], ideal point estimation [1], and rumor detection [5, 9].

In domain applications such as the above, the required data represents an extremely small portion of the collected datastream. For example, in a work attempting to capture disaster and crime events from tweets, the authors find that only 0.05% of all collected data is related to the application [7]. Thus the first step in existing approaches is to filter the required data out of the mixed-domain, unclassified data. In [15], earthquake-related tweets are classified. In [19], tweets related to the two candidates are collected. In [9], only tweets related to the bombing incident are selected. The techniques for such filtering range from machine learning based approaches [13, 15] to keyword generation [11] and clustering [20]. However, most of these filtering solutions are designed specifically for the corresponding application and are not suitable for other domains or applications. In this paper, we focus on providing an information extraction solution that can be tailored to different specific applications based on existing domain knowledge.

Our approach relies on the insight that a narrow domain information consumer has some initial but incomplete knowledge of the data, including knowledge about the key elements or topics within the domain, which is quite often the case when a domain expert in an organization wants to build an information system based on text data. This knowledge can often be translated into a taxonomy. For example, in a previous work [21], short text messages containing the keyword “shooting” are collected for detecting shooting crimes, where distinctions must be made about the meaning of “shooting”, such as in “shooting photo”, “shooting gun”, or “shooting basketball”. Users may note that “photo” is an “imaging product” and “gun” is a “weapon”. They may also note the domain background, that “gun” is used in a “crime”, while “ball” is used in a “game”. We can construct a Multiple Domain Taxonomy (MDT) that contains these two kinds of relationships, namely, \(is\_a\) and \(in\_a\), to represent user knowledge. We show that using the concepts and relationships defined in a partially constructed MDT, we can effectively provide functions such as message domain classification and key concept recognition.

Given the MDT, we use a pure frequency approach on unlabeled data to identify the domain and concepts in a short text, as we describe in detail in Sect. 3. There are several advantages to using this approach. First, a pure frequency approach that does not involve grammar-based NLP (Natural Language Processing) techniques is language independent and suitable for processing informal microblog messages. Unlike formal texts, microblog messages are filled with common misspellings and word shortenings that cannot be found in a dictionary, but can be captured by frequency-based analysis over large data. Second, it is an unsupervised approach that does not require annotating data. As Twitter allows free access to one percent of its data traffic, one can easily collect millions of tweets in a day, very few of which, however, can be manually annotated. Our approach takes advantage of the large amount of unlabeled data and effectively improves identification accuracy. Finally, our approach does not require an external knowledge source. Since existing knowledge sources such as Wikipedia only provide information for more common concepts, the use of external knowledge sources generally limits the applicability of a method. Instead, our approach treats the unlabeled data as the context of the key terms and provides a similar accuracy improvement. To summarize, we make the following contributions:

  • We formally define the problems of domain classification and concept tagging given an existing taxonomy called the MDT. We propose the MDT as a new type of taxonomy based on the reality of narrow domain information consumption from mixed-domain data.

  • We propose an unsupervised, pure-frequency approach for identifying the domain and concepts of short texts. Our approach does not require annotation of training data and captures common misspellings and word shortenings, and is thus suitable for processing informal social media messages. Our approach is also a general solution applicable to any narrow domain, and aside from the partial MDT, which requires some initial knowledge of the data to construct, it needs no external input.

  • We test our approach extensively using real Twitter data. Our results show that the proposed domain classification method achieves much higher accuracy than existing classification methods, with up to a 52% precision increase; our concept tagging method similarly achieves relatively high accuracy.

2 Related Work

Given the emerging popularity of social media, short text classification has been widely studied. Sriram et al. [16] propose a classification method to identify pre-defined message categories, such as news, opinions, and deals. Targeting such categories, their method is a supervised learning approach based on text features such as opinion words, time-event phrases, and the use of dollar signs. Li et al. [7] propose a classification method to find Crime and Disaster Events (CDE). They use a supervised classifier that incorporates features including hashtags, URLs, and CDE-specific features such as time mentions. They found that including CDE features provides about 80% accuracy versus 60% accuracy without them. These works propose classification solutions for presumed target domains and often rely on specific domain characteristics. However, as we will show in our experiments, such solutions are usually not applicable to a different target domain. Olteanu et al. [11] propose a method for filtering relevant information based on keywords, and claim that the method can be applied to any domain. Their method generates discriminative keywords based on labeled data, with discriminative strength measured using PMI and frequency. However, their experiments show poor performance, with the proposed method providing almost no accuracy improvement over simple keyword filtering.

Some research exploits the message categories inherently associated with the messages. Ritter et al. [14] propose a method to automatically generate message types in addition to message classification. Based on the event messages and related phrases, the event type of each message is determined from the distribution of named entities and time expressions. The event messages and related phrases, however, are initially classified under a broad “event” label, which is first extracted using a supervised method based on signal words such as “announcement” and “new”, and thus may not be applicable depending on the application domain. Lucia and Ferrari [8] propose an unsupervised message classification method based on expanding lexical meanings using external knowledge sources. They automatically generate message categories based on existing type definitions provided by knowledge sources such as YAGO. Most works that automatically generate message categories, however, tend to produce only general categories such as sports, politics, and religion, and are insufficient for more specific classification needs in a particular domain.

Named entity recognition (NER) has been widely studied in computational linguistics, and well-known solutions have been developed, such as StanfordNER and OpenNLP. Traditional NER solutions, however, focus only on pre-defined term categories, such as person, organization, and location [3, 12]. Recently, some solutions have been proposed to tag names and entities without pre-defined categories. Tuan et al. [17] propose a method to find taxonomy relations between unlabeled terms in data. In addition to string inclusion and lexical-based rules, their method calculates subsumptions of contexts between terms, which relies on existing tools to extract (Object, Verb, Subject) triples. Their method achieves high precision in recognizing taxonomy relations in formal texts such as journal papers and government reports. However, it is unlikely their method can be applied to informal texts, since there is no existing tool that effectively extracts such structures from them. Topics extracted from topic models can also be regarded as concepts for a short text. Li et al. [6] propose a topic model, GPU-DMM, for extracting topics from short texts. The method enriches the topic model with learned word embeddings: semantically related words under the same topic are promoted during the sampling process using a generalized Pólya urn (GPU) model. However, this method highly depends on the word embeddings, which require a long time to learn and may not provide the desired categories. The work by Han et al. [4] has an aim similar to ours. They propose a frequency-based approach to link name mentions in texts to concepts in a knowledge graph, based on local compatibility and evidence propagation over the graph. Their method, however, relies on a pre-defined knowledge graph that has articles associated with each entity, and is thus difficult to tailor to a specific classification task in a user-defined domain. Our proposed method, on the other hand, can work on user-defined domains and only requires a handful of domain concepts.

3 Domain Classification and Concept Tagging

We define the Multiple Domain Taxonomy (MDT) as a taxonomy with two types of relationships, namely, domain association and taxonomy association, denoted as \(in\_a\) and \(is\_a\). Domain associations define the domain to which a concept belongs. Taxonomy associations define taxonomical hierarchies between concepts. One such MDT is shown in Fig. 1. In this example, the domains are crime and imaging activity, both of which could be present in a text dataset regarding a shooting. The concepts of suspect and victim are defined as “in a” crime, and camera “is a” tool “in a” imaging activity.

Fig. 1. An example multiple domain taxonomy

We define a multiple domain taxonomy as \(MDT=\{D,V,I,S\}\), where D is the set of domains, V is the taxonomy vocabulary, and each \(c \in V\) is a concept. \(I=V \mapsto D\) is the mapping of the \(in\_a\) relationship between concepts and domains, and \(S=V \mapsto V\) is the mapping of the \(is\_a\) relationship that describes the hierarchy of concepts. Here we stipulate that if \(\{c_1 \mapsto d\} \in I\) and \(\{c_2 \mapsto c_1\} \in S\), then \(\{c_2 \mapsto d\} \in I\); in other words, if a parent concept belongs to a domain, all of its child concepts also belong to the same domain. In this way we do not need to explicitly define the \(in\_a\) relationship for lower-level concepts.
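To make this definition concrete, the following minimal Python sketch shows one possible way to represent an MDT; the class name MDT and its fields are illustrative only and are not part of the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class MDT:
    """A minimal Multiple Domain Taxonomy container (illustrative).

    domains: the set D of domains
    in_a:    explicit in_a mappings, concept -> domain (the relation I)
    is_a:    is_a mappings, child concept -> parent concept (the relation S)
    """
    domains: set = field(default_factory=set)
    in_a: dict = field(default_factory=dict)
    is_a: dict = field(default_factory=dict)

    def domain_of(self, concept):
        """Resolve a concept's domain by walking up is_a links, so that
        child concepts inherit the in_a relationship of their ancestors."""
        while concept is not None:
            if concept in self.in_a:
                return self.in_a[concept]
            concept = self.is_a.get(concept)
        return None

# Part of the example in Fig. 1:
mdt = MDT(domains={"crime", "imaging"},
          in_a={"suspect": "crime", "victim": "crime", "tool": "imaging"},
          is_a={"camera": "tool"})
assert mdt.domain_of("camera") == "imaging"  # camera is_a tool in_a imaging
```

The domain_of method encodes the inheritance rule above: an \(in\_a\) relationship needs to be defined explicitly only for the highest-level concept of each branch.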

3.1 Problem Statement

We show that using a partial MDT constructed with some initial knowledge of the data, we can solve the problems of message domain classification and concept tagging. The problem of domain classification is to determine the domain of a message given a number of known domains. The problem of concept tagging is to tag unknown terms in a message with a concept label. An example of a tagged message would look like: “I took a photo[IMAGING:PRODUCT] of my girlfriend[IMAGING:TARGET] with my new camera[IMAGING:TOOL]”. We note that the text transformation is straightforward once we identify the compatible taxonomy concept for the term. We formally define the two problems as follows:

Problem 1

(Domain Classification). Given a number of possible domains \(D=\{d_1,...,d_l\}\), and the message m consisting of terms \(\{t_1,...,t_k\}\), find the domain association of m, such that \(\{m \mapsto d\}\) for some \(d \in D\).

Problem 2

(Concept Tagging). Given a number of concepts V, and a number of terms in a message m, \(T_m=\{t_1,...,t_k\}\), find a taxonomy association for each t such that \(\{t \mapsto c\}\), for some \(c \in V\).

For solving these problems, we assume an \(MDT=\{D,V,I,S\}\) has been constructed, such that D contains all known domains, and V contains an incomplete list of concepts that are mapped to D with I.

3.2 Message Domain Classification

To classify the domain of a message, we compare the semantic relatedness between message terms and the concepts in each domain. After aggregating the relatedness for all terms in each domain, we can determine which domain is more semantically related to the message.

To calculate the semantic relatedness between a term and a concept, we use a method proposed by Milne and Witten [10], which utilizes the presence of the term and the concept in the unlabeled data, and calculates the semantic relatedness score (SRS) as follows:

$$\begin{aligned} SRS(t,c)=1-\frac{log(max(|T|,|C|))-log(|T \bigcap C|)}{log(|W|)-log(min(|T|,|C|))} \end{aligned}$$
(1)

where t and c are the term and the concept, T and C are the sets of all messages that contain t and c, respectively, and W is the entire dataset.
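As a concrete illustration, Eq. (1) translates directly into the following Python sketch; it assumes messages have been pre-tokenized into sets of terms, and the function name srs is ours, not the paper's.

```python
import math

def srs(t, c, messages):
    """Semantic relatedness score of Eq. (1).

    t, c:     term and concept (strings)
    messages: the dataset W, given as a list of token sets
    """
    T = {i for i, msg in enumerate(messages) if t in msg}
    C = {i for i, msg in enumerate(messages) if c in msg}
    both = T & C
    if not both:
        return 0.0  # no co-occurrence: treat as unrelated
    num = math.log(max(len(T), len(C))) - math.log(len(both))
    den = math.log(len(messages)) - math.log(min(len(T), len(C)))
    return 1 - num / den if den > 0 else 0.0
```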

We use the highest SRS obtained when matching the term with the concepts of a domain as the term's score for that domain. After retrieving this score for each term, we calculate a message score for the domain (DS):

$$\begin{aligned} DS(m, d)=\sum _{i=1}^k max(SRS(t_i, c_j), \forall \{c_j \mapsto d\} \in I) \end{aligned}$$
(2)

where the message m consists of terms \(\{t_1,...,t_k\}\).

We calculate a domain score for each domain. Then the predicted domain for m is the domain that provides the highest domain score, \(\mathrm {arg\,max}_i DS(m,d_i)\).
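Building on the MDT and srs sketches above, the classification step of Eq. (2) and the final arg max could be sketched as follows (again, the function names are illustrative):

```python
def domain_score(message_terms, domain, mdt, messages):
    """DS(m, d) of Eq. (2): for each term take the best SRS over the
    domain's concepts, then sum over all terms of the message."""
    concepts = [c for c in set(mdt.in_a) | set(mdt.is_a)
                if mdt.domain_of(c) == domain]
    return sum(max((srs(t, c, messages) for c in concepts), default=0.0)
               for t in message_terms)

def classify(message_terms, mdt, messages):
    """Predicted domain: arg max over all domains of DS(m, d)."""
    return max(mdt.domains,
               key=lambda d: domain_score(message_terms, d, mdt, messages))
```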

3.3 Concept Tagging

We approach the concept tagging problem by finding the compatible concept in the taxonomy for a term. If a term is compatible with a concept, then it can inherit the concept's taxonomy associations. For example, if we find “film” is compatible with “movie”, and “movie” is defined as a product in the taxonomy, then we can consider “film” to also be a product. To calculate concept compatibility, we take into account the message contexts, which are formed from the words surrounding the term and the concept. We argue that if a term is in the same domain as a concept, and the contexts they appear in are similar, then it is very likely they are compatible.

The context of a term is usually represented as a number of words neighboring the keyword. Traditionally, the position of context words is ignored, and the context words are considered interchangeable. However, we found that the position of context words contains crucial information and should not be overlooked. For example, suppose we have two messages, “He took a new photo of the house” and “the house of cards took a new view on US politics”. In this example, if we ignore position, the two terms “photo” and “cards” have the same context, but they are certainly semantically incompatible. Based on this insight, our solution takes into account the position of context words.

To calculate the context similarity between a term and a concept, we first set a context width parameter q, which defines how many neighboring words will be considered as the context. From a number of unlabeled messages that contain the term, we extract a set of words at each position between \(p-q\) and \(p+q\), where p is the position of the term in the message. A total of 2q sets will be extracted, denoted as \(Q_t^1,...,Q_t^{2q}\). Similarly, we extract the context word sets for the compared concept, \(Q_c^1,...,Q_c^{2q}\). The context similarity is then calculated based on the similarity of context words in the same position:

$$\begin{aligned} contextSimilarity(t, c)=\frac{ \sum _{i=1}^{2q} sim(Q_t^i, Q_c^i)}{2q} \end{aligned}$$
(3)

where \(sim(Q_1,Q_2)\) is a similarity function that compares two clusters of words.

From existing work, we choose a similarity function proposed by Unankard et al. [20], which is based on term frequency and cosine similarity:

$$\begin{aligned} sim(Q_1, Q_2)=\frac{\sum _i tf(Q_1, t_i) \times tf(Q_2, t_i)}{\sqrt{\sum _i tf(Q_1,t_i)^2} \times \sqrt{\sum _i tf(Q_2,t_i)^2}} \end{aligned}$$
(4)

where \(t_i\) ranges over all the terms in \(Q_1 \bigcup Q_2\), and tf(Q, t) is the term frequency of term t in set Q.
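The positional context extraction and Eqs. (3) and (4) could be sketched as follows, using collections.Counter for term frequencies; the helper names are again ours:

```python
import math
from collections import Counter

def position_contexts(word, messages, q):
    """Collect the 2q context sets Q^1..Q^{2q} of `word`: the words at
    offsets -q..-1 and +1..+q around each occurrence, kept per position."""
    contexts = [Counter() for _ in range(2 * q)]
    offsets = list(range(-q, 0)) + list(range(1, q + 1))
    for msg in messages:  # msg is a list of tokens
        for p, tok in enumerate(msg):
            if tok != word:
                continue
            for j, off in enumerate(offsets):
                if 0 <= p + off < len(msg):
                    contexts[j][msg[p + off]] += 1
    return contexts

def set_sim(q1, q2):
    """Cosine similarity over term frequencies, Eq. (4)."""
    dot = sum(q1[t] * q2[t] for t in set(q1) | set(q2))
    n1 = math.sqrt(sum(v * v for v in q1.values()))
    n2 = math.sqrt(sum(v * v for v in q2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def context_similarity(term, concept, messages, q):
    """Eq. (3): average the per-position similarities of the term's and
    the concept's context sets."""
    qt = position_contexts(term, messages, q)
    qc = position_contexts(concept, messages, q)
    return sum(set_sim(a, b) for a, b in zip(qt, qc)) / (2 * q)
```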

To tag a term t in a message m, we first determine the domain for m using the method described above. Then we obtain all concepts that belong to the domain, i.e., each \(c_d \in V\) that satisfies \(\{c_d \mapsto d\} \in I\). We then calculate the context similarity between the term and each concept, and find the concept that produces the highest context similarity, \(c_{max}\). Finally, we consider t and \(c_{max}\) compatible, and assign \(\{t \mapsto c_p\}\) for any \(\{c_{max} \mapsto c_p\}\), in other words, allowing t to inherit the taxonomy associations that \(c_{max}\) has.
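Putting the pieces together, a sketch of the tagging step, reusing the classify and context_similarity functions from the earlier sketches:

```python
def tag_term(term, message_terms, mdt, messages, q=3):
    """Tag `term`: classify the message's domain, then match the term
    against that domain's concepts by positional context similarity."""
    d = classify(message_terms, mdt, messages)
    candidates = [c for c in set(mdt.in_a) | set(mdt.is_a)
                  if mdt.domain_of(c) == d]
    if not candidates:
        return None
    c_max = max(candidates,
                key=lambda c: context_similarity(term, c, messages, q))
    # t inherits the taxonomy association of c_max, i.e. {t -> c_p} for
    # the parent c_p of c_max (or c_max itself if it is a top-level concept).
    return mdt.is_a.get(c_max, c_max)
```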

Note that the identified concepts can be added to the MDT, based on the identified domain and compatible concepts, and thus the MDT can be iteratively improved. As more data is processed and more concepts are added to the taxonomy, we expect better recognition performance from our system with the improved MDT. In this work, however, we focus on the first iteration of this process. We will explore iterative MDT improvement with identified concepts in future work.

3.4 Improving Computational Efficiency

It is computationally expensive to collect contexts and calculate similarity for every concept-term pair in large unlabeled datasets. For example, for 100,000 unlabeled messages and \(q=3\), a total of 600,000 words will be compared for each pair. To improve efficiency, we compute some term frequency information when the system starts and store it in memory, so that we can quickly estimate the significance of the context similarity between a term and a concept, thus eliminating most contextual comparisons between insignificant pairs.

We call our runtime reduction technique reverse contextualizing (RC). First we compute the significance between a concept c and a context word w at position i. We collect the context of c at position i as \(Q_c^i\); the significance of a context word w is then calculated as:

$$\begin{aligned} sig(c, w, i)=\frac{tf(Q_c^i, w)}{|Q_c^i|} \end{aligned}$$
(5)

This score is the proportion of the context word among all words appearing in the concept's context at the given position. Then for each context word w, we also collect its context, at the reversed position \(2q-i\), as \(Q_w^{2q-i}\). For each term t in this reversed context set of w, the significance is calculated as:

$$\begin{aligned} sig(t, w, i)=\frac{tf(Q_w^{2q-i}, t)}{|Q_w^{2q-i}|} \end{aligned}$$
(6)

Finally we compute a significance score between concept c and term t as:

$$\begin{aligned} sig(c, t)= 100 \times \sum _{i=1}^{2q} \sum _{w \in Q_c^i} \left( sig(c, w, i) + sig(t, w, i) \right) \end{aligned}$$
(7)

As an example, suppose \(q=3\) and \(i=q+1\). Then for the concept police, we calculate the significance of the contextual word shooting in the position next to the concept as sig(“police”, “shooting”, \(q+1\)), based on the frequency of the phrase “police shooting”. Then for the word shooting, we calculate the significance of its context word kids in the position preceding the word as sig(“kids”, “shooting”, \(q+1\)), based on the frequency of the phrase “kids shooting”. Finally, combining the two results, we obtain the significance score between police and kids.

We compute this score for every concept-term pair that shares context words in the same relative positions in the data, and store it in memory. We also set a significance threshold \(\tau \). When tagging a term in a message, we first retrieve the significance score between the term and the concept, sig(c, t), and only when \(sig(c,t) > \tau \) do we proceed to calculate the actual context similarity.
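A sketch of the RC pre-computation, following Eqs. (5)-(7) with the paper's \(2q-i\) reversed-position convention (positions falling outside the valid range \(1,...,2q\) are skipped); as before, the names are illustrative, and a real implementation would cache the per-word context sets:

```python
from collections import defaultdict

def rc_scores(concepts, messages, q):
    """Precompute sig(c, t) of Eq. (7) for concept-term pairs that share
    context words at reversed positions (reverse contextualizing)."""
    scores = defaultdict(float)
    for c in concepts:
        qc = position_contexts(c, messages, q)  # Q_c^1 .. Q_c^{2q}
        for i, counts in enumerate(qc, start=1):
            mpos = 2 * q - i  # reversed position, per the paper
            if not (1 <= mpos <= 2 * q) or not counts:
                continue
            total = sum(counts.values())
            for w, n in counts.items():
                sig_cwi = n / total  # Eq. (5)
                # context of w at the reversed position (cache in practice)
                mirrored = position_contexts(w, messages, q)[mpos - 1]
                m_total = sum(mirrored.values())
                for t, m in mirrored.items():
                    # Eq. (6) for sig(t, w, i), accumulated into Eq. (7)
                    scores[(c, t)] += 100 * (sig_cwi + m / m_total)
    return scores
```

At tagging time, context_similarity is then computed only for the pairs whose precomputed score exceeds \(\tau \).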

4 Experimental Analysis

We have presented our domain and concept identification method as an effective unsupervised approach. We expect our method to achieve better accuracy than current supervised and unsupervised methods, while keeping computational costs low. We conduct experiments on real Twitter data to validate our approach. First we test the accuracy of our domain classification method. Then we test the accuracy of our concept tagging method. Finally, we study the runtime of our approach, and provide insights into the impact of different training data sizes and pre-computation on computational costs.

4.1 Datasets

Our experiments are conducted on two sets of real Twitter data. The first dataset, called the shooting dataset, is collected using the Twitter Filter API during September and October 2014. The dataset has about 2 million tweets containing the keyword shooting. After removing retweets, we obtain a set of 284,343 tweets. We examine the data and discover that the tweets are mainly related to four domains, namely, crime, imaging, game, and metaphor. After deciding the domains, we label a number of tweets according to their domains. The labeled data contains 1,083 tweets.

The second dataset, called the crisis dataset, is a publicly available dataset introduced by Olteanu et al. [11]. It contains labeled and unlabeled tweets related to 26 natural disasters and other crisis events. There are two types of labels: whether the tweet is related and informative, and the source of the tweet. We use only related tweets. Combining the tweets for all 26 events, we obtain 201,078 unlabeled tweets and 3,646 labeled tweets. The labeled tweets fall into five categories, namely, eyewitness, business, government, media, and ngo.

For each dataset we manually construct an MDT, shown in Table 1. Both MDTs have a flat structure, with the first level as domains and the second and third levels as concepts. Between domains and concepts, \(in\_a\) relationships are defined; between the second and third levels of concepts, \(is\_a\) relationships are defined. We spent no more than two hours constructing each MDT. For the crisis dataset, the five domains are taken from the five categories of the labeled data.

Table 1. MDT used in the experiments

4.2 Results for Domain Classification

In the first set of experiments, we test the domain classification accuracy of our approach. We first focus on the first domain of each dataset, namely, crime in the shooting dataset and eyewitness in the crisis dataset. We focus on these two domains because crime and eyewitness represent the more desirable information, and have been the topic of several studies [7, 22].

We compare our approach with three baselines. The first is accept all, which considers all messages as positive. The accept all method always achieves the highest recall of 1.0. The second baseline, proposed by Sriram et al. [16], is a supervised method based on eight features and the Naive Bayes model. The eight features include the author name, use of slang, time phrases, opinionated words, word emphasis, and the presence of currency signs, percentage signs, and mention signs at the beginning and in the middle of the message. The evaluation is based on five-fold cross-validation. The Sriram classifier has been shown to be effective in classifying tweets into categories such as news, opinions, deals, and events, but has not been tested in other applications. The third baseline (PA) is from our previous work [22]. It is an unsupervised approach that incorporates lexical analysis and user profiling. This method is shown to be effective for filtering personal observations from tweet messages.

The classification accuracy for the first domain of the two datasets achieved by the three baselines and our MDT-based approach is shown in Table 2. As can be seen from the results, our approach achieves extremely high precision compared to the baselines. For classifying the crime domain, it achieves 0.92 precision, a 52% increase over the baseline methods, as well as an f-value of 0.78, a 27% increase over the baseline methods. For classifying eyewitness, it also achieves a high precision of 0.73, a 9% increase over the baseline methods, and a 6% increase in f-value. The PA method is designed to distinguish observation messages according to their source, and thus achieves low accuracy classifying crime, which includes messages from various sources; for eyewitness, it achieves the highest accuracy among the baseline methods. Our MDT-based method, nevertheless, surpasses the PA method in both precision and recall for classifying eyewitness.

Table 2. Classification accuracy of the first domain

We also look at the other domains. Table 3 shows the classification accuracy across the four domains of the shooting dataset. As can be seen from the results, classification on the other domains achieves accuracy comparable to the crime domain, indicated by similar f-values. The crime domain does provide the highest precision, however, mainly because it is a narrower domain that can be better identified with a simple taxonomy.

Table 3. Classification accuracy for the shooting dataset

4.3 Results for Concept Tagging

In the second set of experiments, we test the accuracy of our concept tagging approach. We conduct two experiments. In the first experiment, which we call the take-out-one experiment, the leaf-level concepts in the MDT are taken out one by one and put back into the MDT using our approach. For example, for the shooting MDT, we first take out the police concept, and then use the proposed tagging method to match it with the MDT, now without the police concept. This process is run for every concept in the MDT. For the shooting MDT, 71 concepts are tested. For the crisis MDT, 80 concepts are tested. The proportion of correctly tagged concepts with respect to different training data sizes is shown in Table 4. From the results we can see that the take-out-one experiment reaches a very high precision. With only 15,000 training messages, we have over 0.95 precision for the shooting MDT, and over 0.92 precision for the crisis MDT. According to this result, we can confidently tag a concept with an MDT even with a small amount of training data, if it is known that the concept must be compatible with the MDT.

Table 4. Precision in take-out-one experiment

In the next experiment, we run concept tagging on raw data. We take 10,000 tweets from the shooting dataset and 3,000 tweets from the crisis dataset, and apply our concept tagging method. We use 30k training messages, which should provide optimal effectiveness based on the previous experiment. The detected taxonomy and the context similarity score are recorded for each tagged term, and thus a large number of tagged terms are generated. Table 5 shows the terms with the highest context similarity scores for the second-level concepts.

Table 5. Top terms for second-level concept

We can identify some errors in the above tagging, such as identifying baby instead of baby milk product commercial as the shooting target, and Queen in Queen Elizabeth High School as a person. Such errors are caused by the limitation of not considering multi-word terms, which we will address in future work. It is worth noting that word shortenings such as ppl are captured correctly.

To evaluate the overall accuracy, we manually check all the tagged terms with a context similarity score above 0.3. There are 296 terms and 554 terms that satisfy this requirement in the shooting and crisis test data, respectively. The tagging accuracy of these terms with respect to different context similarity score range is shown in Table 6. As a comparison, we also show the expected accuracy if we randomly choose a second-level concept for tagging.

Table 6. Tagging accuracy in different context similarity score range

We obtain around 50% tagging accuracy for terms that generate a context similarity score above 0.3. The low accuracy is possibly due to the many terms that do not have a taxonomy relationship with the MDT but still have contexts similar to the concepts, such as time and location words. This problem can be overcome by, for example, adding time-related concepts to the MDT. Nevertheless, compared to randomly assigning tags, our approach achieves much higher tagging accuracy.

4.4 Runtime Analysis

We test the effectiveness of our runtime reduction technique (RC). We measure the runtime of concept tagging for the 1,083 shooting tweets, with different training data sizes and two \(\tau \) values. The results are shown in Fig. 2. All experiments are run on a desktop computer with a 3.7 GHz eight-core Intel Xeon CPU, 15.6 GB memory, and Ubuntu 16.04.

Fig. 2. Runtime with different RC options

As we can see from the figure, our RC technique effectively removes most of the computation in the otherwise computation-heavy concept tagging task. Using 5k training messages, the runtime without RC is 3,045 s, while with RC it is 72 s for \(\tau =0.005\) and 23 s for \(\tau =0.05\). Using 10k training messages, the runtime without RC is 6,065 s, while with RC it is 202 s for \(\tau =0.005\) and 49 s for \(\tau =0.05\). In both cases, the runtime is reduced to a few hundredths of the original runtime. Looking at the absolute values, using 10k training messages, with which we have seen satisfactory tagging accuracy, the average tagging time for a single tweet is 5.98 s without RC, but only 0.04 s with RC (\(\tau =0.05\)). With the improved tagging speed, our concept tagging method becomes suitable even for real-time tweet processing.

5 Discussion

One of the hurdles in deploying our approach is the construction of the MDT. It is nearly impossible to extract narrow domain information from a large, mixed-domain dataset without any manual input. Compared to training data annotation in supervised approaches, though, we consider that constructing an MDT requires much less effort and translates knowledge in an efficient manner. We can also see that the extraction accuracy varies depending on the quality of the MDT. In our experiments, the extraction accuracy for the shooting dataset is higher than for the crisis dataset, most likely because we have more experience with the former, and thus constructed a more representative MDT for it. Adding identified concepts to the MDT can improve system performance, but manual checking is required given the errors in concept recognition discussed in the previous section. Based on our experience, adding wrong or ambiguous concepts will not improve identification accuracy, but rather decrease it.

Currently our method only considers single-word concepts, but in reality many concepts are expressed in multiple words, such as “video camera”, and we will run into errors if we cannot recognize them. Multi-word terms can be handled by generating all possible bi-grams and multi-grams from the data, as existing works have suggested [8].

6 Conclusion

Social media produces a significantly large volume of data covering a wide range of topics, and there is an increasing need to extract information for narrow domain applications from large, mixed-domain datasets. Currently, however, most applications develop classification and extraction solutions tailored to a single narrow domain, which are usually unsuitable for use in other applications and domains. Developing individual solutions is expensive, including the effort to develop algorithms and, for supervised solutions, to annotate training data. We therefore focus on a general solution that can be easily tailored to narrow domain needs and does not require training data annotation or other manual involvement.

In this paper, we propose the Multiple Domain Taxonomy (MDT), a representation of mixed-domain data. We show that using a partially constructed MDT, we can effectively classify short text messages and extract key concepts from them. The MDT can be constructed with some initial knowledge of the data, and can be quickly tailored to narrow domain needs. Our approach is frequency-based and unsupervised. It is robust to common misspellings and word shortenings, and does not require training data annotation. The effectiveness of our approach is verified extensively using real datasets; compared to baseline methods such as the Sriram classifier and the PA method, our approach increases precision by up to 52%. In the future, we plan to further improve the concept tagging accuracy and to investigate multi-word concepts.