Mining text data (user reviews, feedback forms, medical notes) is a hot topic these days, which means there is a lot of hype and misinformation too. It's hard to know what's realistic to do with a data analytics program and what's just wishful thinking, at least for now.
I am going to attempt to explain the topic through a modest text mining project. My goal is to determine whether I can teach a machine to read an online restaurant review and decide whether the review is good or bad. I will be learning right along with the machine, as text mining is fairly new to me.
Let's get started!
Here are the specifics of my project, along with its goals. I have selected a popular restaurant in the downtown area of a major U.S. city. My criteria were simple: more than 1000 reviews, with a reasonable number of bad ones. I got my reviews from OpenTable.com, partly because it's easy to extract reviews and partly because of the way users can rate restaurants: with OpenTable, users can rate food, ambience and service individually. The star ratings are important because that's how we'll know if our algorithm works. Lastly, I want my algorithm to flag any review that is negative, so my theoretical restaurant owner can look for themes and possible ways to improve the dining experience.
For the analysis to be more accurate, I want approximately the same number of "Good" reviews as "Bad" ones. Out of 1000 reviews, only around 110 met our "Bad" criteria, so I used all of those and randomly selected an equal number of "Good" reviews. I ended up with about 210 reviews to train my system. The remaining 800 or so will be used for evaluation. (Yes, we can only test for one type of failure this way. If this project were for an actual client, I would be much more rigorous.)
The first thing we need to do is get organized. How do we define a "bad" review? Users can rate food, ambience and service on a scale from 1 to 5. For our purposes, if any of the three ratings is a "1", we'll call that a bad review. (We could have chosen any criterion. For example, I originally averaged the three ratings together and applied a threshold to that, but my results weren't as good.)
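That labeling rule is simple enough to sketch in a couple of lines; the function name and signature here are my own invention, just for illustration:

```python
# Hypothetical labeling rule from the text: a review is "Bad" if any of
# its three star ratings (food, ambience, service) is a 1.
def label_review(food: int, ambience: int, service: int) -> str:
    """Return "Bad" if any rating is a 1, otherwise "Good"."""
    return "Bad" if 1 in (food, ambience, service) else "Good"

print(label_review(1, 4, 5))  # a single 1-star rating makes it "Bad"
print(label_review(3, 4, 5))  # no 1-star rating, so "Good"
```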
Next we need to do a little accounting on the reviews. We are using a technique known as "Bag of Words". (Data scientists are clever.) We take all the words in the reviews and reduce them to tokens using a process called stemming. This sounds more complicated than it is: basically we are turning words like "walk", "walked" and "walking" into just "walk". This reduces the number of words we have to deal with, and all of those words mean basically the same thing.
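To give a feel for what stemming does, here is a toy suffix-stripper; a real project would use a proper algorithm such as Porter stemming rather than this crude sketch:

```python
def crude_stem(word: str) -> str:
    """Strip a few common suffixes. A toy illustration only -- real
    stemmers (e.g. the Porter algorithm) handle many more cases."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip if a reasonable-length stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# All three variants collapse to the same token.
print([crude_stem(w) for w in ["walk", "walked", "walking"]])
```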
Next we get a big list of every token that appears in all 1000 reviews. (We even use the 800 reviews we set aside for testing, because it makes the testing process much simpler.) After that we have a few options. We can look at every review and flag whether or not each token appears in it. For example, for our token "walk" we would have a "1" for review number 1 if that review included the phrase "We had to walk 10 minutes to get to our table." If the word "walk" appeared more than once, we would still just have a single "1". Another technique is to count the number of times "walk" appears and divide that by the total number of words in the review, a percentage known as Term Frequency. We won't be doing that in this case. (In fact, I tried that first and the results weren't as good.) There are a great many ways to do this part, but I think the flags are the simplest for demonstration.
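The presence-flag idea can be sketched in a few lines of Python. This uses plain lowercase word splitting (no stemming) and all names are my own, so treat it as a sketch rather than the actual pipeline:

```python
# Binary "presence" features: one column per token, 1 if the token
# appears in the review at all, 0 otherwise.
def build_vocab(reviews):
    """Collect every distinct lowercase word across all reviews."""
    return sorted({word for text in reviews for word in text.lower().split()})

def presence_vector(text, vocab):
    """One 0/1 flag per vocabulary token for a single review."""
    words = set(text.lower().split())
    return [1 if token in words else 0 for token in vocab]

reviews = [
    "we had to walk to our table",
    "walk walk walk",  # a repeated word still yields a single 1
]
vocab = build_vocab(reviews)
vectors = [presence_vector(r, vocab) for r in reviews]
print(vocab)
print(vectors)
```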
Now it's time to find a classification system that works. I tried three: a tree classifier, logistic regression, and naive Bayes. Logistic regression produced the best results. (For the curious, I am using Weka. The logistic regression used 10-fold cross-validation.)
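For readers who want to see what's under the hood, here is a from-scratch sketch of logistic regression on binary presence features. The actual analysis used Weka's implementation; this toy version (tiny synthetic data, my own function names) just shows the idea of one weight per token, a sigmoid, and gradient-descent training:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.5, epochs=200):
    """Plain stochastic gradient descent on the logistic loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(x, w, b):
    """1 = predicted Bad review, 0 = predicted Good review."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

# Toy data: feature 0 = "disappointment" present, feature 1 = "delicious" present.
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = [1, 1, 0, 0]  # 1 = Bad review
w, b = train(X, y)
```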
Here is the confusion matrix:
=== Confusion Matrix ===
   a   b   <-- classified as
  72  37 |  a = Bad
  24  77 |  b = Good
That's just shy of a 71% success rate, which is not terrible. When we try it on the roughly 800 reviews we didn't use for training, 134 of the 790 remaining "Good" reviews get flagged as "Bad", for a success rate of 83%. You might think that's a lot of false positives. I thought so too, so I did a little more investigating.
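Double-checking the arithmetic behind those two numbers:

```python
# Training-set confusion matrix from above:
#              predicted Bad   predicted Good
# actual Bad         72              37
# actual Good        24              77
correct = 72 + 77
total = 72 + 37 + 24 + 77
train_accuracy = correct / total           # ~0.71 ("just shy of 71%")

# Hold-out set: 790 "Good" reviews, of which 134 were flagged "Bad".
holdout_accuracy = (790 - 134) / 790       # ~0.83
print(round(train_accuracy, 3), round(holdout_accuracy, 3))
```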
A large number of the "false positives" look more like a mismatch between what the reviewer wrote and the number of stars they used for the rating. Here are a few examples:
- seafood alfredo was a disappointment
- calmari was a bit over-cooked
- was very disappointing
- cold clam chowder
- service was a bit slow
Do those sound like good reviews to you? Me neither. I think what is happening is that people are leaving bad feedback, but the stars don't really reflect just how bad that feedback is. That makes me think the algorithm is doing a pretty good job.
A few more of the false positives were due to very short reviews. Fewer words give the algorithm less to go on. A future improvement might be to handle those differently.
So, what words indicate a bad review? Here are a few that strongly correlated with a bad review:
- raw (note that this restaurant serves sushi but this was related to something else)
- worse (worst)
Okay, those are pretty obvious, but here is a pattern that isn't as obvious:
One interesting thing I noticed: positive reviews often seemed to describe the meal as part of a bigger event like going to the theater, a birthday, holiday or professional sporting event. Negative reviewers often stated or implied the meal was the event. There is no way to know for sure since negative reviewers may just have focused on the bad part of the experience.
Here are a few lessons I learned along the way:
- It may take several iterations and some thought before you get good results. You can see from the text that I made several attempts before getting results I was happy with.
- There is always room to improve. I think this analysis would be much better if we included tokens with two words instead of one. "Not happy" together means a lot more than "not" and "happy" mean alone.
- Misspelled words are a major challenge. I noticed a few tokens that appeared only once and were obviously misspelled. Those tokens basically went untreated in this analysis.
- The algorithm flagged reviews that had pretty good star ratings, but look a lot like bad reviews. I consider this to be a good thing.
- After we discarded a few tokens (not exactly stop-words like 'a' or 'the' but tokens that appeared in every document) we were left with around 1600 tokens. That's surprisingly few for 1000 reviews.
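The two-word-token idea mentioned above can be sketched as simple bigram extraction, where each adjacent pair of words becomes one token:

```python
def bigrams(text):
    """Return every adjacent word pair as a single two-word token, so
    "not happy" survives as a unit instead of two separate words."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

print(bigrams("we were not happy"))  # ['we were', 'were not', 'not happy']
```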
You can see how this type of information could be valuable for a restaurant owner. I was able to quickly zero in on a few problem spots in the restaurant: noise, steaks and bad/slow service. This process could easily be automated, with problems highlighted and statistics kept on trends. The more quickly you learn to use new tools, the better off your business can be.