A conditional random field framework for language process in product review mining

Part-of-speech (POS) tagging is widely used in natural language processing. Among the many statistical approaches in this area, the most popular is the Hidden Markov Model. In this paper, an alternative approach, linear-chain Conditional Random Fields (CRFs), is introduced. A CRF is a factor graph approach that can naturally incorporate arbitrary, non-independent features of the input without assuming conditional independence among the features or making distributional assumptions about the inputs. This paper applies CRFs to POS tagging of the words in car reviews and then to feature extraction, whose output can be used as an input to an opinion mining system. To reduce the computational time, we also propose training the CRFs with the limited-memory BFGS algorithm. Furthermore, this paper evaluates the CRFs against a classical graph approach on the car review dataset to demonstrate that the CRFs give a more robust result with a smaller training dataset.


Introduction
With the rapid growth of e-commerce, people are more likely to share their opinions and hands-on experiences of products or services they have purchased. This information is important for both business organizations and potential customers. Companies can use it to decide on their marketing strategies and product improvements, while customers can make better decisions when purchasing products or services. Unfortunately, the number of reviews now reaches hundreds of thousands, especially for popular products, which makes it challenging for a potential customer to go over all of them. Therefore, it is essential to provide coherent and concise summaries of the reviews.
Researchers have explored different angles on opinion mining to tackle this problem, aiming to extract the essential information from reviews and present it to the users. Previous works have mainly adopted rule-based techniques [3] and statistical methods [10]. Later, a machine learning approach based on Hidden Markov Models (HMMs) was proposed and proved more effective than previous works. However, these HMM-based methods are limited because it is difficult for them to model arbitrary, dependent features of the input word sequence.
Conditional Random Fields (CRFs) were introduced to address this limitation [Lafferty 2001], and the CRF framework was later summarized in [Sutton 2012]. A CRF is a discriminative factor graph model that can model overlapping and dependent features. Prior works on natural language processing (NLP) have demonstrated that CRFs outperform the classical HMMs [Peng 2006].
Motivated by these findings, we propose a linear-chain CRF-based framework to mine and extract opinions from product reviews on the web. The performance of the CRFs is impressive: although the training data for the CRFs is minimal, the CRFs still achieve results similar to those of the classical POS tagging method.
The rest of this paper is organized as follows. Section 2 describes the proposed framework and the CRFs used in it. Section 3 presents the experimental results. Section 4 demonstrates a further application of CRFs: feature extraction, i.e., extracting keywords from a sentence. Section 5 summarizes our work, and Section 6 presents future directions.

Methodology
Before applying the CRFs to POS tagging, three problems need to be solved: data pre-processing, feature design for the CRFs, and parameter estimation for the CRFs.

Proposed framework
The architectural overview of the framework can be divided into the following steps:
Step 1, pre-processing, which includes crawling the raw review data and cleaning it.
Step 2, POS tagging of the review data. In this step, we manually labeled the data using the Penn Treebank POS tags.
Step 3, training the linear-chain CRF model using the manually assigned POS tags.
Step 4, applying the model to the test set. For comparison, the tagger in the Python Natural Language Toolkit (NLTK 3.3) is also applied [2].
Step 5, using the POS tagging result generated by the CRF model to extract opinions by keeping only the nouns and adjectives from the review sentences.

Conditional random fields
Conditional random fields (CRFs) are conditional probability distributions on an undirected graph model [Lafferty 2001]. To reduce the complexity, we employed linear-chain CRFs as an approximation to restrict the relationship among tags. A 1st-order CRF $(X, Y)$ is specified by a vector $F$ of local features and a corresponding weight vector $\lambda$. Each local feature is either a transition feature $A_{y_{t-1}, y_t}$ or an emission feature $O_{y_t, x_t}$, where $y$ is the label sequence, $x$ is the input sequence, and $t$ is the position of a token in the sequence. We define the 1st-order features:
- The assignment of the current tag $y_t$ is supposed to depend on the current word $x_t$ only. The feature function is represented as an emission feature $O_{y_t, x_t}$ in the indicator form
$$O_{l, o}(y_t, x_t) = \mathbb{1}\{y_t = l\}\,\mathbb{1}\{x_t = o\} \qquad (1)$$
- The assignment of the current tag $y_t$ is supposed to depend on the previous tag $y_{t-1}$ only. The feature function is represented as a transition feature $A_{y_{t-1}, y_t}$ in the indicator form
$$A_{l', l}(y_{t-1}, y_t) = \mathbb{1}\{y_{t-1} = l'\}\,\mathbb{1}\{y_t = l\} \qquad (2)$$
where $l$ and $l'$ range over the tags and $o$ ranges over the words.
With these definitions, and writing $f_k$ for the $k$-th local feature in $F$ (transition or emission) and $\lambda_k$ for its weight, the conditional probability can be written as:
$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_{t-1}, y_t, x_t)\Big) \qquad (3)$$
where
$$Z(x) = \sum_{y'} \exp\Big(\sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y'_{t-1}, y'_t, x_t)\Big)$$
is called the partition function (or normalization factor), which is the summation over all possible label sequences (transitions and emissions). Hence, the most probable label sequence for an input sequence $x$,
$$y^{*} = \arg\max_{y} P(y \mid x),$$
can be found with the Viterbi algorithm. Therefore, the task of review mining can be transformed into an automatic labeling task, and the problem can be formalized as follows: given a sequence of words $x = x_1, x_2, \ldots, x_T$ and its corresponding POS tags $y = y_1, y_2, \ldots, y_T$, the objective is to find the sequence of tags that maximizes the conditional likelihood according to (3).
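To make the conditional probability in (3) concrete, the following minimal sketch evaluates it by brute force on a two-word sentence. The tag set, the weights, and the helper names (`score`, `partition`, `conditional_probability`) are hypothetical values chosen for illustration only; they are not the trained weights from our experiment, and enumerating all label sequences is only feasible for very short inputs.

```python
import itertools
import math

# Hypothetical toy weights (not from the paper): two tags, a few features.
LABELS = ["NN", "JJ"]
TRANSITION = {("START", "NN"): 0.2, ("START", "JJ"): 0.5,
              ("JJ", "NN"): 1.5, ("NN", "NN"): 0.3,
              ("NN", "JJ"): -0.4, ("JJ", "JJ"): 0.1}
EMISSION = {("JJ", "great"): 2.0, ("NN", "car"): 2.0, ("NN", "great"): -1.0}


def score(words, tags):
    """Sum of the weights of all transition and emission features that fire."""
    total, prev = 0.0, "START"
    for word, tag in zip(words, tags):
        total += TRANSITION.get((prev, tag), 0.0) + EMISSION.get((tag, word), 0.0)
        prev = tag
    return total


def partition(words):
    """Z(x): sum of exp(score) over every possible tag sequence."""
    return sum(math.exp(score(words, tags))
               for tags in itertools.product(LABELS, repeat=len(words)))


def conditional_probability(words, tags):
    """P(y | x) as the normalized exponentiated score."""
    return math.exp(score(words, tags)) / partition(words)


words = ["great", "car"]
probs = {tags: conditional_probability(words, tags)
         for tags in itertools.product(LABELS, repeat=len(words))}
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

With these toy weights the highest-probability sequence is (JJ, NN), and the probabilities of the four candidate sequences sum to one, as the normalization by the partition function requires.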

Parameter estimation
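To estimate the parameters $\theta = \{\lambda_k\}$ of a linear-chain CRF, given independent and identically distributed training data $D = \{x^{(i)}, y^{(i)}\}_{i=1}^{N}$, where each $x^{(i)} = \{x^{(i)}_1, \ldots, x^{(i)}_{T_i}\}$ is a sequence of inputs and each $y^{(i)} = \{y^{(i)}_1, \ldots, y^{(i)}_{T_i}\}$ is a sequence of the desired predictions (i.e. labels), the conditional log likelihood can be obtained as:
$$\ell(\theta) = \sum_{i=1}^{N} \sum_{t=1}^{T_i} \sum_{k=1}^{K} \lambda_k f_k(y^{(i)}_{t-1}, y^{(i)}_t, x^{(i)}_t) - \sum_{i=1}^{N} \log Z(x^{(i)}) - \sum_{k=1}^{K} \frac{\lambda_k^2}{2\sigma^2} \qquad (4)$$
where $f_k$ denotes the $k$-th local feature (transition or emission) and $\sum_{k=1}^{K} \frac{\lambda_k^2}{2\sigma^2}$ is the L2 regularization term added to the likelihood function in order to reduce overfitting. This term corresponds to a Gaussian prior with variance $\sigma^2$ on the weights, and the value of $\sigma^2$ is often taken up to 10 (we take $\sigma^2 = 10$ in our experiment). Since in general the function $\ell(\theta)$ cannot be maximized in closed form, dynamic programming and the L-BFGS algorithm are used to optimize the objective function. The partial derivative, or the gradient, of the objective function is computed as:
$$\frac{\partial \ell}{\partial \lambda_k} = \sum_{i=1}^{N} \sum_{t=1}^{T_i} f_k(y^{(i)}_{t-1}, y^{(i)}_t, x^{(i)}_t) - \sum_{i=1}^{N} \sum_{t=1}^{T_i} \sum_{y', y} f_k(y', y, x^{(i)}_t)\, P(y_{t-1} = y', y_t = y \mid x^{(i)}) - \frac{\lambda_k}{\sigma^2} \qquad (5)$$
where the first term is the empirical count of feature $k$ in the training data, the second term is the expected count of this feature under the current trained model, and the third term comes from the regularizer. Hence, apart from the regularization, the derivative measures the difference between the empirical count and the expected count of a feature under the current model. In order to obtain the gradient (5), we need to calculate the conditional probability $P(y_{t-1}, y_t \mid x^{(i)})$, which requires a sum over the whole label sequence $y$ and is intractable in a naive fashion. Hence we need to employ dynamic programming techniques for the calculation.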
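The step from the log likelihood to the expected-count term in the gradient (5) can be spelled out for a single training sequence. The short derivation below is a sketch in standard linear-chain CRF notation, where $f_k$ is the $k$-th local feature with weight $\lambda_k$ and $Z(x)$ is the partition function; it only uses the fact that $Z(x)$ is a sum of exponentiated scores.

```latex
\begin{aligned}
\frac{\partial}{\partial \lambda_k} \log Z(x)
  &= \frac{1}{Z(x)} \sum_{y}
     \exp\Big(\sum_{t}\sum_{j} \lambda_j f_j(y_{t-1}, y_t, x_t)\Big)
     \sum_{t} f_k(y_{t-1}, y_t, x_t) \\
  &= \sum_{y} P(y \mid x) \sum_{t} f_k(y_{t-1}, y_t, x_t) \\
  &= \sum_{t} \sum_{y', y} f_k(y', y, x_t)\,
     P(y_{t-1} = y',\, y_t = y \mid x)
\end{aligned}
```

The last line follows because $f_k$ at position $t$ depends only on the pair $(y_{t-1}, y_t)$, so all other positions can be summed out. This is why only the pairwise marginals are needed to evaluate the gradient.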

Dynamic programming for CRF probability as matrix computations
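For a linear-chain CRF where each label sequence of length $n$ is augmented by start and end states, $y_0 = \text{start}$ and $y_{n+1} = \text{end}$ respectively, the conditional probability of a label sequence $y$ given an observation sequence $x$ can be efficiently computed using matrices.
Let $\mathcal{Y}$ be the collection of all possible labels (including start and end), and define a set of $n + 1$ matrices $\{M_t(x) \mid t = 1, \ldots, n + 1\}$, where each $M_t(x)$ is a $|\mathcal{Y}| \times |\mathcal{Y}|$ matrix with elements
$$M_t(y', y \mid x) = \exp\Big(\sum_{k} \lambda_k f_k(y', y, x_t)\Big) \qquad (6)$$
Hence, the conditional probability can be written as the product of the appropriate elements of the $n + 1$ matrices for that pair of $y$ and $x$ sequences as
$$P(y \mid x) = \frac{\prod_{t=1}^{n+1} M_t(y_{t-1}, y_t \mid x)}{Z(x)} \qquad (7)$$
The partition function $Z(x)$ is given by the (start, end) entry of the product of all $n + 1$ matrices $M_t(x)$:
$$Z(x) = \Big[\prod_{t=1}^{n+1} M_t(x)\Big]_{\text{start},\,\text{end}} \qquad (8)$$
Therefore, the conditional probability can be calculated by a dynamic programming method that is similar to the forward-backward algorithm for HMMs. Define the forward and backward vectors $\alpha_t$ and $\beta_t$, starting with the base cases:
$$\alpha_0(y \mid x) = \begin{cases} 1 & \text{if } y = \text{start} \\ 0 & \text{otherwise} \end{cases} \qquad \beta_{n+1}(y \mid x) = \begin{cases} 1 & \text{if } y = \text{end} \\ 0 & \text{otherwise} \end{cases} \qquad (9)$$
and for the recurrence:
$$\alpha_t(x)^{\top} = \alpha_{t-1}(x)^{\top} M_t(x), \qquad \beta_t(x) = M_{t+1}(x)\, \beta_{t+1}(x) \qquad (10)$$
Finally, the conditional probability can be written as:
$$P(y_{t-1} = y', y_t = y \mid x) = \frac{\alpha_{t-1}(y' \mid x)\, M_t(y', y \mid x)\, \beta_t(y \mid x)}{Z(x)} \qquad (11)$$
which can thus be plugged into (5) to calculate the gradient.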
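As an illustration of the matrix formulation, the sketch below builds the $n + 1$ matrices for a toy sentence, runs the forward and backward recurrences, and checks the resulting partition function against brute-force enumeration. The weights in `log_potential` and all function names are hypothetical and chosen only for this example; plain Python lists stand in for matrices to keep the sketch dependency-free.

```python
import itertools
import math

LABELS = ["NN", "JJ"]
STATES = ["START"] + LABELS + ["END"]


def log_potential(prev, cur, word):
    """Hypothetical weighted feature sum for one step (toy numbers)."""
    transitions = {("START", "JJ"): 0.5, ("JJ", "NN"): 1.5, ("NN", "END"): 0.3}
    emissions = {("JJ", "great"): 2.0, ("NN", "car"): 2.0}
    return transitions.get((prev, cur), 0.0) + emissions.get((cur, word), 0.0)


def build_matrices(words):
    """One |STATES| x |STATES| matrix M_t per position t = 1..n+1."""
    n = len(words)
    matrices = []
    for t in range(n + 1):
        word = words[t] if t < n else None       # the last step only moves to END
        allowed = ["END"] if t == n else LABELS
        m = [[0.0] * len(STATES) for _ in STATES]
        for i, prev in enumerate(STATES):
            for j, cur in enumerate(STATES):
                if cur in allowed and prev != "END":
                    m[i][j] = math.exp(log_potential(prev, cur, word))
        matrices.append(m)
    return matrices


def forward(matrices):
    """alpha[t] for t = 0..n+1, with alpha_t^T = alpha_{t-1}^T M_t."""
    size = len(STATES)
    alpha = [[1.0 if s == "START" else 0.0 for s in STATES]]
    for m in matrices:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * m[i][j] for i in range(size))
                      for j in range(size)])
    return alpha


def backward(matrices):
    """beta[t] for t = 0..n+1, with beta_t = M_{t+1} beta_{t+1}."""
    size = len(STATES)
    beta = [[1.0 if s == "END" else 0.0 for s in STATES]]
    for m in reversed(matrices):
        nxt = beta[0]
        beta.insert(0, [sum(m[i][j] * nxt[j] for j in range(size))
                        for i in range(size)])
    return beta


def pairwise_marginal(matrices, alpha, beta, t, z):
    """P(y_{t-1} = a, y_t = b | x) for t = 1..n+1 as a nested list."""
    m, size = matrices[t - 1], len(STATES)
    return [[alpha[t - 1][i] * m[i][j] * beta[t][j] / z for j in range(size)]
            for i in range(size)]


def brute_force_z(words):
    """Partition function by explicit enumeration (for checking only)."""
    total = 0.0
    for tags in itertools.product(LABELS, repeat=len(words)):
        path = ["START", *tags, "END"]
        s = sum(log_potential(path[t], path[t + 1],
                              words[t] if t < len(words) else None)
                for t in range(len(path) - 1))
        total += math.exp(s)
    return total


words = ["great", "car"]
matrices = build_matrices(words)
alpha, beta = forward(matrices), backward(matrices)
z = alpha[-1][STATES.index("END")]
print(math.isclose(z, brute_force_z(words)))
```

The forward pass costs a number of operations proportional to the sentence length times the square of the number of states, whereas the enumeration grows exponentially with the sentence length, which is the reason the matrix form is used for the gradient.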

Training with limited-memory quasi-newton method
The traditional Newton methods for nonlinear optimization require calculating the inverse of the Hessian matrix (curvature information) of the log-likelihood to find the search direction, which in our case, is impractical. Limited-memory BFGS (L-BFGS) estimates the curvature information based on previous m gradients and weight updates. There is no theoretical guidance on how much information from previous steps should be kept to obtain sufficiently accurate curvature estimates. In our experiment, we used the previous m = 10 gradient and weight pairs, which worked well.
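Assume all vectors are column vectors. Let $\lambda_k$ denote the weight vector at the $k$-th iteration (in this section the subscript indexes iterations rather than features) and let $g_k \equiv \nabla f(\lambda_k)$ be the gradient, where $f$ is the objective function being minimized (the negative log likelihood). The last $m$ updates of the form $s_k = \lambda_{k+1} - \lambda_k$ and $y_k = g_{k+1} - g_k$ are stored.
Define $\rho_k = \frac{1}{y_k^{\top} s_k}$ and $H_k^0$ as the initial approximation of the inverse Hessian. Starting from $q_k = g_k$, define $a_i = \rho_i s_i^{\top} q_{i+1}$, and hence the first recursion calculates
$$q_i = q_{i+1} - a_i y_i, \qquad i = k-1, \ldots, k-m.$$
The second recursion starts from $z_{k-m} = H_k^0 q_{k-m}$, defines $b_i = \rho_i y_i^{\top} z_i$, and calculates
$$z_{i+1} = z_i + (a_i - b_i)\, s_i, \qquad i = k-m, \ldots, k-1.$$
Hence, the value $z_k$ is the approximation for the search direction. (Note: when performing minimization, the search direction is the negative of $z_k$, i.e. $d_k = -z_k$.) After obtaining the search direction at each step, a backtracking line search method is implemented to find and tune the learning rate (step size) such that it satisfies the sufficient decrease condition given by:
$$f(\lambda_k + \gamma_k^{\eta} d_k) \le f(\lambda_k) + \sigma\, \gamma_k^{\eta}\, g_k^{\top} d_k \qquad (12)$$
where $\gamma_k$ is the step size, $\sigma \in (0, 1)$ is a control parameter and $\eta$ is the scaling parameter that is increased iteratively in (12) until the condition is met. In our experiment, the initial step size is $\gamma_0 = 0.5$, $\sigma = 0.4$ and $\eta \in \{1, 2, \cdots, 20\}$. This step determines the optimal $\eta$ value, and then $\gamma_k^{\eta}$ becomes the new step size (learning rate) for the next iteration.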
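The two recursions and the backtracking line search can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not our training code: the function names are hypothetical, the initial inverse Hessian is taken as a scaled identity (a common choice that the text above does not fix), and the objective is a toy quadratic rather than the CRF negative log likelihood. The defaults `gamma=0.5`, `sigma=0.4`, `max_eta=20` and `m=10` mirror the values reported above.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def two_loop_direction(grad, s_hist, y_hist):
    """Return z, an approximation of (inverse Hessian) * grad.

    s_hist and y_hist hold the stored update pairs, oldest first.
    """
    q = list(grad)
    rhos = [1.0 / dot(y, s) for s, y in zip(s_hist, y_hist)]
    a = [0.0] * len(s_hist)
    for i in reversed(range(len(s_hist))):            # newest to oldest
        a[i] = rhos[i] * dot(s_hist[i], q)
        q = [qj - a[i] * yj for qj, yj in zip(q, y_hist[i])]
    # Assumed initial inverse-Hessian guess: a scaled identity matrix.
    scale = 1.0
    if s_hist:
        scale = dot(s_hist[-1], y_hist[-1]) / dot(y_hist[-1], y_hist[-1])
    z = [scale * qj for qj in q]
    for i in range(len(s_hist)):                      # oldest to newest
        b = rhos[i] * dot(y_hist[i], z)
        z = [zj + (a[i] - b) * sj for zj, sj in zip(z, s_hist[i])]
    return z


def backtracking_step(f, x, grad, direction, gamma=0.5, sigma=0.4, max_eta=20):
    """Try step sizes gamma**eta, eta = 1..max_eta, until sufficient decrease."""
    fx, slope = f(x), dot(grad, direction)
    for eta in range(1, max_eta + 1):
        step = gamma ** eta
        candidate = [xi + step * di for xi, di in zip(x, direction)]
        if f(candidate) <= fx + sigma * step * slope:
            return candidate
    return x  # no acceptable step found; keep the current point


def lbfgs_minimize(f, grad_f, x0, m=10, iterations=200, tol=1e-10):
    x, s_hist, y_hist = list(x0), [], []
    g = grad_f(x)
    for _ in range(iterations):
        if dot(g, g) < tol:
            break
        z = two_loop_direction(g, s_hist, y_hist)
        direction = [-zj for zj in z]                 # minimization: move along -z
        x_new = backtracking_step(f, x, g, direction)
        g_new = grad_f(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        if dot(y, s) > 1e-12:                         # keep curvature positive
            s_hist.append(s)
            y_hist.append(y)
            s_hist, y_hist = s_hist[-m:], y_hist[-m:]  # keep the last m pairs
        x, g = x_new, g_new
    return x


def toy_objective(x):
    return (x[0] - 3.0) ** 2 + 10.0 * (x[1] + 1.0) ** 2


def toy_gradient(x):
    return [2.0 * (x[0] - 3.0), 20.0 * (x[1] + 1.0)]


solution = lbfgs_minimize(toy_objective, toy_gradient, [0.0, 0.0])
print([round(v, 3) for v in solution])
```

Only the last `m` pairs of weight and gradient differences are kept, so the memory cost is linear in the number of parameters instead of quadratic as for a full inverse Hessian.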

Path prediction with viterbi algorithm
After training the model, the aim is to find the most probable sequence of POS tags for a given sequence of observed words. The Viterbi algorithm was employed to score all candidate tags with the trained model and then search for the best path with the maximal score.
Given an observed sequence $X = \{x_1, x_2, \cdots, x_T\}$ ($T$ being the number of tokens in this sequence) and the trained feature (transition and emission) weights, the most likely state sequence $Y = \{y_1, y_2, \cdots, y_T\}$, where each $y_t \in L = \{l_1, l_2, \cdots, l_V\}$ ($L$ being the label space obtained through training), can be calculated by the recurrence relations (forward step):
$$V_1(y) = \sum_{k} \lambda_k f_k(\text{start}, y, x_1) \qquad (13)$$
$$V_t(y) = \max_{y' \in L} \Big[ V_{t-1}(y') + \sum_{k} \lambda_k f_k(y', y, x_t) \Big], \qquad t = 2, \ldots, T \qquad (14)$$
where $V_t(y)$ is the score of the most probable state sequence that is responsible for the first $t$ observations and ends in state $y$. The Viterbi path can then be retrieved by saving back pointers that remember which state $y'$ was used in (14). Let $\mathrm{Ptr}(y_t, t)$ be the function that returns the value of $y_{t-1}$ used to compute $V_t(y_t)$; then we have:
$$y_T = \arg\max_{y \in L} V_T(y), \qquad y_{t-1} = \mathrm{Ptr}(y_t, t).$$
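The forward step and the back-pointer retrieval can be sketched as follows. The function name `viterbi`, the dictionary-based weight lookup, and the toy weights are assumptions made for this illustration; scores are additive (log-domain) weighted feature sums, and the input sentence is assumed to be non-empty.

```python
def viterbi(words, labels, transition, emission, start="START"):
    """Return the highest-scoring tag path under additive (log-domain) scores."""
    def step(prev, cur, word):
        return transition.get((prev, cur), 0.0) + emission.get((cur, word), 0.0)

    # V[t][y]: best score of any path that ends in tag y at position t.
    V = [{y: step(start, y, words[0]) for y in labels}]
    back = [{}]
    for t in range(1, len(words)):
        V.append({})
        back.append({})
        for y in labels:
            best_prev = max(labels,
                            key=lambda p: V[t - 1][p] + step(p, y, words[t]))
            V[t][y] = V[t - 1][best_prev] + step(best_prev, y, words[t])
            back[t][y] = best_prev               # back pointer for this state
    # Backtrack from the best final tag to recover the whole path.
    last = max(labels, key=lambda y: V[-1][y])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))


# Hypothetical toy weights, not the trained model.
toy_transition = {("START", "JJ"): 0.5, ("START", "NN"): 0.2,
                  ("JJ", "NN"): 1.5, ("NN", "NN"): 0.3,
                  ("NN", "JJ"): -0.4, ("JJ", "JJ"): 0.1}
toy_emission = {("JJ", "great"): 2.0, ("NN", "car"): 2.0, ("NN", "great"): -1.0}

predicted = viterbi(["great", "car"], ["NN", "JJ"], toy_transition, toy_emission)
print(predicted)
```

Because only the best predecessor is stored for each state and position, the search visits each pair of adjacent tags once per token instead of enumerating every complete tag sequence.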

Numeric experiment
To demonstrate the performance of CRFs in POS tagging, we applied the CRF model to the car review dataset.

Data description
We crawled the car review dataset on Toyota and Honda cars from Cars.com using Python Scrapy. A total of 1,126 reviews were collected. After the initial cleaning and removal of duplicates, 1,094 reviews were kept. Inspired by [4], additional transformations based on regular expressions (regex) were applied to the training and testing datasets. We then tokenized the review sentences at the word level, which gave a total of 18,440 words, and manually tagged each word with Penn Treebank POS tags; 45 POS tags are used (see Appendix Table 10). Note that a past participle (VBN) can be used as an adjective (JJ) to describe a noun.

Train the conditional random field part-of-speech tagger
The performance of the CRF model is measured using 10-fold cross-validation on the transformed dataset. For each validation, the transformed dataset was divided into a training set of 998 reviews and a testing set of 96 reviews. For such a small dataset, holding out about 10% of the reviews as test samples can provide an intuition about the model. After the pre-processing, which included tokenizing the corpus, there were 549 transition features and 2,475 emission features, which means there were a total of 3,024 parameters to be estimated. We ran the algorithm for 100 iterations, and the negative log-likelihood converged well. Figure 1 shows the distribution of the trained weights; most of the feature weights have values around 0. A few features have values towards the tails, meaning that certain words are likely or unlikely to emit certain POS tags, or that certain transitions, e.g. [Adjective (JJ) → Noun (NN)] vs [Adjective (JJ) → Verb (VB)], are likely or unlikely to happen.
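The transition and emission feature sets can be enumerated directly from a tagged corpus. The following sketch assumes the corpus is a list of sentences, each a list of (word, tag) pairs, and that a START state precedes the first tag; the function name and the two-sentence corpus are hypothetical, and whether start transitions are counted among the 549 transition features is an assumption of this example.

```python
def collect_features(tagged_sentences, start="START"):
    """Collect the distinct transition and emission features seen in the data."""
    transitions, emissions = set(), set()
    for sentence in tagged_sentences:
        prev = start
        for word, tag in sentence:
            transitions.add((prev, tag))   # one transition feature per tag pair
            emissions.add((tag, word))     # one emission feature per tag/word pair
            prev = tag
    return transitions, emissions


# Hypothetical two-sentence corpus for illustration.
corpus = [[("great", "JJ"), ("car", "NN")],
          [("car", "NN"), ("runs", "VBZ")]]
transitions, emissions = collect_features(corpus)
print(len(transitions), len(emissions), len(transitions) + len(emissions))
```

Each collected pair corresponds to one weight to be estimated, so the number of parameters is the sum of the sizes of the two sets.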

Performance evaluation
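The performance is evaluated based on precision, recall, and F-score. Precision, also referred to as positive predictive value, measures how precise the model is: out of those predicted positive, how many are actually positive:
$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} = \frac{\text{True Positive}}{\text{Predicted Positive}}$$
Recall, defined as the true positive rate or sensitivity, measures how many of the actual positives the model captures by labeling them as positive (true positive):
$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} = \frac{\text{True Positive}}{\text{Actual Positive}}$$
and the $F_1$ score is the harmonic mean of the precision and recall, which helps seek a balance between the two:
$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$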

Fig. 1 Distribution of Predicted Feature Weights
We computed both the macro and micro values for precision and recall. A macro-average will compute the metric independently for each class and then take the average (hence treating all classes equally). In contrast, a micro-average will aggregate the contributions of all classes to compute the average metric. In a multi-class classification setup, micro-average is preferable if one suspects there is a class imbalance.
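The difference between the two averages can be seen in a short sketch. The function name `precision_recall` and the five-token example are hypothetical; a class with no predictions (or no true instances) is given a score of 0 here, which is one common convention and an assumption of this example.

```python
from collections import Counter


def precision_recall(true_tags, pred_tags):
    """Macro- and micro-averaged precision and recall for single-label tagging."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_tags, pred_tags):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1    # predicted p, but it was wrong
            fn[t] += 1    # true tag t was missed
    labels = sorted(set(true_tags) | set(pred_tags))

    def ratio(num, den):
        return num / den if den else 0.0

    # Macro: compute the metric per class, then average over classes.
    macro_p = sum(ratio(tp[l], tp[l] + fp[l]) for l in labels) / len(labels)
    macro_r = sum(ratio(tp[l], tp[l] + fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute the metric once.
    total_tp, total_fp, total_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro_p = ratio(total_tp, total_tp + total_fp)
    micro_r = ratio(total_tp, total_tp + total_fn)
    return {"macro_precision": macro_p, "macro_recall": macro_r,
            "micro_precision": micro_p, "micro_recall": micro_r}


true_tags = ["NN", "NN", "NN", "JJ", "VBZ"]
pred_tags = ["NN", "NN", "JJ", "JJ", "NNS"]
metrics = precision_recall(true_tags, pred_tags)
print(metrics)
```

In this example three of the five tokens are tagged correctly, so both micro-averaged values equal 0.6, while the macro-averaged precision is 0.375 because the rare tags NNS and VBZ each contribute a score of 0 with the same weight as the frequent tag NN.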

Validation
To validate our CRF model, we used 10-fold cross-validation, where the training set was randomly partitioned into 898 reviews for training and the remaining 100 for validation. After each cycle, we reshuffled the training set and went through the 10-fold CV process again. The process was repeated 20 times to ensure the generality of our proposed CRF model. Hence, we obtained 200 validation results and calculated the three metrics accordingly, with the corresponding means and standard deviations listed in Table 1. In summary, the overall performance is good, as the lower bounds of the 95% confidence intervals rest above our threshold of 90% for both precision (0.9393) and recall (0.9195), indicating that no further model tuning is required at the moment.
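The repeated cross-validation scheme can be sketched with the standard library alone. The generator name `repeated_kfold`, the seed, and the way folds are cut from the shuffled indices are assumptions for illustration; with 998 items the folds hold 99 or 100 reviews each, which approximates the 898/100 partition described above.

```python
import random


def repeated_kfold(n_items, k=10, repeats=20, seed=0):
    """Yield (train_indices, validation_indices) for each fold of each repeat."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    for _ in range(repeats):
        rng.shuffle(indices)                           # reshuffle before every cycle
        folds = [indices[i::k] for i in range(k)]      # k nearly equal folds
        for held_out in range(k):
            validation = folds[held_out]
            train = [i for f, fold in enumerate(folds) if f != held_out
                     for i in fold]
            yield train, validation


splits = list(repeated_kfold(998))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each of the 200 splits would then be used to train a model on the training indices and compute precision, recall and $F_1$ on the validation indices before averaging.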

Testing
For the testing set, Fig. 2 displays the confusion matrix, where the overall accuracy is 0.9252 (overall accuracy alone, however, can be misleading when the classes are imbalanced, so it is not the only metric we report). Table 2 shows the average precision, recall and $F_1$ metrics.
We also computed these metrics for each label (overall 31 labels in our experiment), displayed in Table 3. Our tagger managed to capture each POS feature fairly well, given such a small data set.
The error matrix displayed in Fig. 3 shows the details of mispredicted classes, and we see that most misclassified tokens were between VBZ and NNS.
Based on the above metrics, the CRF performed well in sequential labeling of the Toyota and Honda car reviews. Taking the first sentence in our testing data as an example, Table 4 compares the true path with the predicted path; the only misclassification was on the word [inside].

Comparison
We compared the performance of the CRF tagger to the baseline tagger in Python NLTK 3.3, which is based on HMM. The side-by-side comparison is displayed in Table 5. The performances of the two competing taggers were very close, which is impressive as the CRF was trained on a small training data set. However, as we observed from the tagging results, the performance of the baseline tagger was inconsistent. The baseline tagger tends to classify any word whose first letter is capitalized as NNP, so capitalized common nouns and adjectives were classified as NNP instead of their ground-truth tags NN and JJ. Hence, on our data, the CRF tagger is more robust.

Feature extraction
After successfully training the CRF tagger, we extracted features based on the tagging result. As the first step, we extracted only nouns and adjectives from the review sentences, as these words contain most of the information one would need to generalize the ideas. Furthermore, since these keywords contain useful information for review mining, they can be used as an input to an opinion mining system (e.g., a new CRF model that classifies the opinions into 5 levels). The example in Table 6 illustrates how this works. When one is interested in finding out what people think about a specific feature (e.g., transmission), our framework takes in the keywords [transmission, transmissions] and outputs any summarized reviews that contain these keywords. From the generalized report on the feature transmission shown in Table 7, people can get abundant information on how the transmission performs.
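The keyword extraction and lookup described above can be sketched as a filter over tagged sentences. The function names, the decision to keep every Penn Treebank tag that starts with NN or JJ (which includes plural and proper nouns and comparative adjectives), and the two example reviews are assumptions made for illustration.

```python
KEEP_PREFIXES = ("NN", "JJ")  # assumed to cover NN, NNS, NNP, NNPS, JJ, JJR, JJS


def extract_keywords(tagged_sentence):
    """Keep only the nouns and adjectives of one tagged sentence."""
    return [word.lower() for word, tag in tagged_sentence
            if tag.startswith(KEEP_PREFIXES)]


def reviews_mentioning(tagged_reviews, query_terms):
    """Return the keyword summaries of the reviews that mention any query term."""
    query = {term.lower() for term in query_terms}
    hits = []
    for review in tagged_reviews:
        keywords = extract_keywords(review)
        if query & set(keywords):
            hits.append(keywords)
    return hits


# Hypothetical tagged reviews, not taken from the dataset.
reviews = [
    [("The", "DT"), ("transmission", "NN"), ("is", "VBZ"), ("smooth", "JJ")],
    [("Comfortable", "JJ"), ("seats", "NNS"), ("and", "CC"), ("quiet", "JJ"),
     ("cabin", "NN")],
]
summary = reviews_mentioning(reviews, ["transmission", "transmissions"])
print(summary)
```

Only the first review mentions the queried feature, so its keyword summary is the single entry returned.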

Conclusion
We proposed and built a CRF-based framework and integrated it with L-BFGS. The advantage of the CRF is that it makes fewer assumptions than generative models and hence allows more flexibility in feature engineering. Compared with the existing method, which has been trained on a large training set, the CRF model has a very similar accuracy and shows a more robust result even though it is trained on a minimal training set. Hence, the CRF model can be used as part of an exploratory data analysis. Furthermore, similar to deep learning, the CRF-based framework can be used to classify car reviews in future studies by defining more precise feature functions.

Future research
The current CRF model can be further expanded in the future. For example, since we only extracted information that is carried by the nouns and adjectives at the current stage, information that is carried by verbs or verb phrases such as "recommend," "outperform," "disappoint," etc. is not retained. Hence, we can improve the CRF model by introducing a set of self-defined entities and corresponding feature functions, listed in Table 8. A word that is not an entity is represented as a background word by (B). Furthermore, an entity can be a single word or a phrase. For a phrase entity, a position feature is assigned to each word in the phrase, and there are three possible positions: beginning of the phrase (Entity-B), middle of the phrase (Entity-M) and end of the phrase (Entity-E). As for an opinion entity, polarity can be represented by positive (P) and negative (N), and (Exp) and (Imp) respectively indicate an explicit opinion (the opinion is expressed explicitly) and an implicit opinion (the opinion needs to be induced from the review). For example, a word that begins an explicit, positive opinion phrase is tagged with the hybrid tag (Opinion-B-P-Exp). Therefore, after obtaining all the hybrid tags, we can identify the opinion orientation if a word is an opinion entity. Thus, second-order feature functions can be expanded on top of the first-order feature functions defined in Section 2.

Table 8 Different Types of Entities [7]
Entity | Description
Components | Physical objects of a product, e.g. engine, transmission, brake, seat ...
Functions | Capabilities provided by a product, e.g. horsepower, acceleration, adjustable seat ...
Opinions | Thoughts expressed by users on components, functions or features
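The hybrid tagging scheme can be illustrated with a small helper that assigns tags to the tokens of one entity. This is only a sketch of the proposed future extension: the function name `hybrid_tags`, its arguments, and the treatment of single-word entities (tagged without a position suffix) are assumptions, since those details are not fixed above.

```python
def hybrid_tags(n_tokens, entity=None, polarity=None, explicit=None):
    """Return one hybrid tag per token of a background span or an entity.

    entity   : None for background words, else e.g. "Component" or "Opinion"
    polarity : "P" or "N", used for opinion entities only
    explicit : True for explicit (Exp) and False for implicit (Imp) opinions
    """
    if entity is None:
        return ["B"] * n_tokens                      # background words
    suffix = ""
    if entity == "Opinion":
        suffix = f"-{polarity}-{'Exp' if explicit else 'Imp'}"
    if n_tokens == 1:
        return [f"{entity}{suffix}"]                 # assumed: no position for one word
    positions = ["B"] + ["M"] * (n_tokens - 2) + ["E"]
    return [f"{entity}-{pos}{suffix}" for pos in positions]


print(hybrid_tags(3, "Opinion", polarity="P", explicit=True))
print(hybrid_tags(2, "Component"))
print(hybrid_tags(1))
```

The first token of a three-word explicit positive opinion phrase receives the tag Opinion-B-P-Exp, matching the hybrid tag used as an example in the text.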

Conflict of Interests
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.