Wednesday, June 24, 2015

Mining millions of reviews




Abstract
  1. Mine Reviews
  2. Get Score: present a product ranking model that applies weights to product review factors to calculate a product's ranking score.
  3. Sort reviews and products based on score.
Introduction
  1. Too many reviews 
  2. Reviews may contain negative feedback about the seller rather than the product. Those need to be filtered out.
  3. A review's credibility can be based on:
    1. Date or age of the review
    2. Number of helpful votes / total number of votes
  4. Ratings provided can have a personal bias.
Methodology

  1. Summary: Sentiment analysis for each relevant sentence
  2. When calculating product ranking scores, reviews should not be weighted equally; they should be mined and given proportional weights.
  3. Three stages are proposed for evaluating a review's weight:
    1. Filter out irrelevant sentences.
    2. Use helpfulness votes and age to derive the review's weight.
    3. Calculate the product's ranking score based on the review weights.
Filtering Irrelevant sentences

  1. This is treated as a binary classification problem
  2. Use a Support Vector Machine (SVM) to train a hypothesis function h that maps a sentence to h(sentence).
  3. The sentence is translated to a feature vector X.
  4. The SVM learns a linear decision function, h(X) = Beta-Transpose * X + b, and the sign of h(X) gives the class. This is classification rather than linear regression, although h has the same linear form. (Just as gradient descent was used to fit linear regression, an SVM is used to fit h here.)
  5. 1000 sentences are collected manually and used as the training set.
  6. 10-fold cross-validation is used.
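
The filtering step above can be sketched in plain Python. This is a minimal illustration, not the original implementation: the toy sentences, the bag-of-words vectorizer, the hyperparameters, and all function names are assumptions, and a simple stochastic subgradient descent on the hinge loss stands in for a real SVM solver.

```python
import random

def vectorize(sentence, vocab):
    """Translate a sentence into a bag-of-words count vector X."""
    words = sentence.lower().split()
    return [float(words.count(w)) for w in vocab]

def h(beta, b, x):
    """Linear decision function h(X) = beta^T X + b."""
    return sum(w * xi for w, xi in zip(beta, x)) + b

def train_linear_svm(X, y, epochs=300, lr=0.05, lam=0.01):
    """Minimize the L2-regularized hinge loss by stochastic subgradient descent."""
    beta, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            violated = yi * h(beta, b, xi) < 1          # inside the margin or misclassified
            beta = [w * (1 - lr * lam) for w in beta]   # regularization shrink
            if violated:
                beta = [w + lr * yi * v for w, v in zip(beta, xi)]
                b += lr * yi
    return beta, b

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, k):
    """Return the fraction of held-out sentences classified correctly."""
    correct = 0
    for held_out in k_fold_indices(len(X), k):
        train = [i for i in range(len(X)) if i not in held_out]
        beta, b = train_linear_svm([X[i] for i in train], [y[i] for i in train])
        correct += sum(1 for i in held_out
                       if (1 if h(beta, b, X[i]) >= 0 else -1) == y[i])
    return correct / len(X)

# Toy data (invented): +1 = about the product (relevant), -1 = about the seller (irrelevant)
sentences = ["the battery lasts long", "the screen is sharp", "sound quality is poor",
             "the seller shipped late", "shipping was slow", "seller never replied"]
labels = [1, 1, 1, -1, -1, -1]
vocab = sorted({w for s in sentences for w in s.lower().split()})
X = [vectorize(s, vocab) for s in sentences]
beta, b = train_linear_svm(X, labels)
predictions = [1 if h(beta, b, x) >= 0 else -1 for x in X]
```

With the 1000 labeled sentences mentioned above, `cross_validate(X, labels, 10)` would give the 10-fold estimate; the six toy sentences only support a smaller k.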
Calculating the Review's weight

  1. Helpfulness Vote - H
    1. At a bare minimum, it is "X out of Y people found this useful".
    2. To counter the bias in ratings, do the following:
      1. Ignore reviews with fewer than 10 votes.
      2. Use the simple X out of Y ratio for reviews with 10 to 200 votes.
      3. If the number of votes is greater than 200, multiply the ratio by a gain factor greater than 1.
  2. Age of review and durability - T
    1. Younger reviews have a greater weight.
      1. Younger reviews will naturally have fewer votes.
      2. Newer versions of the product will match newer reviews.
    2. The age-based metric is T = e^(decay rate * (time of review - time of product release)) + initializing factor.
  3. Sentence splitter and part-of-speech tagging
    1. Split reviews into sentences using MXTERMINATOR.
    2. Assign positive or negative sentiments to sentences.
      1. Use a part-of-speech tagger to help assess sentiment.
      2. Sentences are saved with part-of-speech tags.
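
The two weights H and T described above might be computed as follows. This is a sketch under stated assumptions: the thresholds are read as vote counts, the gain factor (1.2), decay rate (0.01), and initializing factor (0.0) are placeholder values, time is measured in days, and the initializing factor is assumed to sit outside the exponent. The function names are hypothetical.

```python
import math

def helpfulness_weight(helpful_votes, total_votes, gain=1.2):
    """H: helpful-vote ratio, ignored below 10 votes, boosted above 200 votes."""
    if total_votes < 10:
        return None                      # too few votes: the review is ignored
    ratio = helpful_votes / total_votes
    if total_votes <= 200:
        return ratio                     # plain "X out of Y" ratio
    return ratio * gain                  # gain > 1 is a placeholder value

def age_weight(review_day, release_day, decay_rate=0.01, init=0.0):
    """T = e^(decay_rate * (review_day - release_day)) + init."""
    return math.exp(decay_rate * (review_day - release_day)) + init
```

With a positive decay rate, a review written 100 days after release gets a larger T than one written 10 days after release, which matches the idea that younger reviews weigh more.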

Product Ranking Score Function

  1. Calculate the sentiment, or polarity, of the review. Use polarity, H and T to calculate the product ranking score.
  2. Calculating review sentiment
    1. Manually pick a set of common adjectives/adverbs as a seed list.
    2. Augment it with synonyms and antonyms.
    3. If a sentence has an adjective or adverb from the positive set, then it is positive.
    4. Negative sentences are handled similarly.
    5. If a sentence has more than one sentiment, then Polarity = sum of positives + sum of negatives.
  3. The final score of a product, taken over all of its reviews, is (sum of all (Polarity * H * T)) / (sum of all H * sum of all T).
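
As a rough illustration of the scoring function above, the sketch below counts each positive word as +1 and each negative word as -1, then combines polarity with H and T. The word lists are tiny invented stand-ins for the seed list and its synonym/antonym expansion, summing sentence polarities into a review polarity is an assumption, and the function names are hypothetical.

```python
# Invented stand-ins for the expanded seed lists of adjectives/adverbs
POSITIVE = {"good", "great", "excellent", "sharp", "fast"}
NEGATIVE = {"bad", "poor", "slow", "terrible", "flimsy"}

def sentence_polarity(sentence):
    """Sum of positives (+1 each) plus sum of negatives (-1 each)."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

def review_polarity(sentences):
    """Assumption: a review's polarity is the sum over its relevant sentences."""
    return sum(sentence_polarity(s) for s in sentences)

def product_score(reviews):
    """reviews: list of (polarity, H, T) tuples, one per review."""
    numerator = sum(p * h * t for p, h, t in reviews)
    denominator = sum(h for _, h, _ in reviews) * sum(t for _, _, t in reviews)
    return numerator / denominator if denominator else 0.0
```

For example, `product_score([(2, 0.5, 1.0), (-1, 0.5, 1.0)])` has numerator 0.5 and denominator 1.0 * 2.0, giving 0.25.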


Evaluation and Analysis

  1. Compare the review-based rank with the sales rank (Amazon-specific), which reflects how well the product sells within its category.
  2. Mean Average Precision (MAP)
    1. Spearman's rank correlation coefficient, computed for a set of products between the human ranking and the ranking produced by the above algorithm.
  3. Results
    1. Filtering out irrelevant sentences improves performance.
    2. Giving weights to reviews is an additional improvement.
    3. Weights plus age of reviews is even better.
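
Mean Average Precision can be computed as below. This is a generic sketch of the standard metric, not the evaluation code behind these notes: treating the top sellers in a category as the "relevant" set is an assumption, and the function names are hypothetical.

```python
def average_precision(ranked_items, relevant):
    """Average of precision@k taken at each position k that holds a relevant item."""
    hits, precision_sum = 0, 0.0
    for k, item in enumerate(ranked_items, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(rankings):
    """rankings: list of (ranked_items, relevant_set) pairs, e.g. one per category."""
    return sum(average_precision(r, rel) for r, rel in rankings) / len(rankings)
```

For the ranking `["a", "b", "c", "d"]` with relevant set `{"a", "c"}`, the precisions at the two hits are 1/1 and 2/3, so the average precision is about 0.83.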

Effects of Individual Features

  1. To find the features that contribute the most to the ranking, calculate the correlation between the ranking by each feature and the overall ranking.
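
One way to carry out this comparison is to rank the products by each feature alone and correlate that ranking with the overall one. The sketch below uses Spearman's rank correlation for this, which is an assumption (the notes only say "correlation"); it uses the simple formula that is valid when there are no tied scores, and the function names and numbers are made up for illustration.

```python
def ranks(scores):
    """Rank positions (1 = highest score); assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for position, i in enumerate(order, start=1):
        result[i] = position
    return result

def spearman(a, b):
    """Spearman's rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid without ties."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def feature_correlations(features, overall):
    """Correlate the ranking induced by each feature with the overall ranking."""
    return {name: spearman(values, overall) for name, values in features.items()}

# Made-up scores for four products
overall = [0.9, 0.7, 0.4, 0.1]
features = {
    "helpfulness": [0.8, 0.6, 0.5, 0.2],
    "age": [0.1, 0.2, 0.3, 0.4],
    "polarity": [0.6, 0.9, 0.3, 0.1],
}
correlations = feature_correlations(features, overall)
```

A feature whose correlation is close to 1 orders the products almost the same way as the overall score does, so it contributes most to the ranking.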

Future work: consider these additional attributes

  1. Reviewer Credibility
  2. Prioritizing features
  3. Look for Sarcasm :)
  4. Filter out Spam.
  5. Data from other sources.