Wednesday, June 24, 2015

Mining millions of reviews




Abstract
  1. Mine Reviews
  2. Get Score – present a product ranking model that applies weights to review factors to compute a product's ranking score.
  3. Sort reviews and products based on that score.
Introduction
  1. Too many reviews 
  2. Reviews may contain negative feedback about the seller rather than the product. Those need to be filtered out.
  3. A review's credibility can be based on:
    1. Date or Age of the review
    2. Number of helpful votes / total number of votes
  4. Ratings provided can have a personal bias.
Methodology

  1. Summary: Sentiment analysis for each relevant sentence
  2. While calculating the product ranking scores, reviews should not be weighted equally; they should be mined and given proportional weights.
  3. Three stages are proposed for evaluating a review's weight:
    1. filter out irrelevant sentences.
    2. use helpfulness votes and age to derive the review's weight.
    3. calculate the product's ranking score based on the review weights.
Filtering Irrelevant sentences

  1. This is treated as a binary classification problem
  2. Use a Support Vector Machine (SVM) to train a hypothesis function h: a sentence maps to h(sentence).
  3. Each sentence is translated to a feature vector X.
  4. The SVM learns a linear decision function h(X) = Beta-Transpose * X + b. The form looks like linear regression's hypothesis, but the SVM fits it as a max-margin classifier; the sign of h(X) separates relevant from irrelevant (see the sketch after this list).
  5. 1000 sentences are collected manually and used as the training set
  6. 10-fold cross-validation is used.
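
A minimal sketch of the classification step, assuming the SVM has already been trained offline so its weight vector and bias are available. The bag-of-words featurization and all names here are hypothetical placeholders for whatever vectorization the paper actually uses:

import java.util.Map;

// Sketch: apply a pre-trained linear SVM h(X) = beta^T * X + b to decide
// whether a review sentence is relevant to the product.
public class RelevanceFilter {
    private final Map<String, Integer> vocabulary; // word -> feature index
    private final double[] beta;                   // learned weight vector
    private final double bias;                     // learned bias term b

    public RelevanceFilter(Map<String, Integer> vocabulary, double[] beta, double bias) {
        this.vocabulary = vocabulary;
        this.beta = beta;
        this.bias = bias;
    }

    // Very simple bag-of-words featurization: X[i] = count of vocabulary word i.
    private double[] toVector(String sentence) {
        double[] x = new double[beta.length];
        for (String token : sentence.toLowerCase().split("\\s+")) {
            Integer idx = vocabulary.get(token);
            if (idx != null) {
                x[idx] += 1.0;
            }
        }
        return x;
    }

    // h(X) = beta^T * X + b; the sign decides the class.
    public boolean isRelevant(String sentence) {
        double[] x = toVector(sentence);
        double h = bias;
        for (int i = 0; i < x.length; i++) {
            h += beta[i] * x[i];
        }
        return h > 0; // positive side of the hyperplane = relevant
    }
}

The training itself (the 1000 hand-labeled sentences and 10-fold cross-validation) would be done with an SVM library; this sketch only shows how the resulting linear function is applied.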
Calculating the Review's weight

  1. Helpfulness Vote - H
    1. At a minimum, it is the ratio "X out of Y people found this review helpful".
    2. To reduce the bias in these votes, do the following (see the H/T weight sketch after this list):
      1. ignore reviews with fewer than 10 votes
      2. use the simple X/Y ratio for reviews with 10 – 200 votes
      3. if the number of votes is greater than 200, multiply by a gain factor > 1
  2. Age of review and durability - T
    1. Younger reviews have a greater weight.
      1. younger reviews will naturally have fewer votes
      2. newer versions of the product will match with newer reviews.
    2. Calculating the age-based metric: T = e^(decay_rate * (time of review – time of product release)) + initializing factor, so a review posted later in the product's life gets a larger exponent and hence a larger weight.
  3. Sentence splitter and part of speech tagging:-
    1. Split reviews into sentences, using MXTERMINATOR
    2. Assign positive or negative sentiments to sentences.
      1. Use a part-of-speech tagger to assess sentiment.
      2. Sentences are saved with their part-of-speech tags.
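
A minimal sketch of the two weight components above. All thresholds and constants are hypothetical placeholders; the paper's actual values are not reproduced here:

// Sketch: helpfulness weight H and age weight T for a single review.
public class ReviewWeights {
    static final int MIN_VOTES = 10;       // ignore reviews below this many votes
    static final int HIGH_VOTES = 200;     // above this, boost the ratio
    static final double GAIN = 1.2;        // hypothetical gain factor > 1
    static final double DECAY_RATE = 0.01; // hypothetical per-day rate in the exponent
    static final double INIT_FACTOR = 1.0; // hypothetical initializing factor

    // H: helpful votes X out of total votes Y, with the thresholds above.
    static double helpfulnessWeight(int helpfulVotes, int totalVotes) {
        if (totalVotes < MIN_VOTES) {
            return 0.0; // too few votes to be trusted; the review is ignored
        }
        double ratio = (double) helpfulVotes / totalVotes;
        return (totalVotes > HIGH_VOTES) ? ratio * GAIN : ratio;
    }

    // T: e^(decayRate * daysSinceRelease) + initFactor, so a review posted
    // later in the product's life gets a larger weight.
    static double ageWeight(long reviewDay, long productReleaseDay) {
        long daysSinceRelease = reviewDay - productReleaseDay;
        return Math.exp(DECAY_RATE * daysSinceRelease) + INIT_FACTOR;
    }
}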

Product Ranking Score Function:-

  1. Calculate the sentiment, or polarity, of the review. Use polarity, H and T to calculate the product ranking score.
  2. Calculating review sentiment (see the sketch after this list):
    1. Manually pick a set of common adjectives/adverbs as a seed list.
    2. Augment it with synonyms and antonyms.
    3. If a sentence has an adjective or adverb from the positive set, then it is positive.
    4. Negative sentences are handled similarly.
    5. If a sentence carries more than one sentiment, its polarity = sum of the positive scores + sum of the negative scores (negatives count below zero).
  3. Final score of a product, over all of its reviews: Sum of (Polarity * H * T) / (Sum of H * Sum of T).
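
A minimal sketch of the scoring step, assuming the per-review weights H and T are already computed. The seed word lists and the +1/-1 scoring are illustrative placeholders, not the paper's actual lexicon:

import java.util.Set;

// Sketch: sentence polarity from seed word lists, and the product's final
// ranking score aggregated over its reviews.
public class ProductScorer {
    static final Set<String> POSITIVE = Set.of("great", "excellent", "good", "fast");
    static final Set<String> NEGATIVE = Set.of("bad", "poor", "slow", "broken");

    // Polarity of one sentence: +1 per positive term, -1 per negative term,
    // so mixed sentences sum their positive and negative contributions.
    static int sentencePolarity(String sentence) {
        int polarity = 0;
        for (String token : sentence.toLowerCase().split("\\W+")) {
            if (POSITIVE.contains(token)) polarity += 1;
            if (NEGATIVE.contains(token)) polarity -= 1;
        }
        return polarity;
    }

    // Final score = Sum(polarity_i * H_i * T_i) / (Sum(H_i) * Sum(T_i)),
    // over all reviews i of the product.
    static double productScore(double[] polarity, double[] h, double[] t) {
        double numerator = 0, sumH = 0, sumT = 0;
        for (int i = 0; i < polarity.length; i++) {
            numerator += polarity[i] * h[i] * t[i];
            sumH += h[i];
            sumT += t[i];
        }
        return numerator / (sumH * sumT);
    }
}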


Evaluation and Analysis

  1. Compare the computed ranking with the sales rank (Amazon-specific), which reflects how well a product sells within its category.
  2. Mean Average Precision (MAP)
    1. Also Spearman's coefficient, computed for a set of products between the human ranking and the ranking from the above algorithm.
  3. Results
    1. Filtering out irrelevant sentences improves performance.
    2. Giving weights to reviews is an additional improvement.
    3. Weights combined with the age of reviews is better still.

Effects of Individual Features

  1. To find the features that contribute the most to the ranking, calculate the correlation between the ranking produced by each feature alone and the overall ranking (see the sketch below).
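
A minimal sketch of that correlation check using Spearman's coefficient, the same measure used in the evaluation above. It assumes both arrays already hold ranks 1..n with no ties:

// Sketch: Spearman's rank correlation between a single-feature ranking and
// the overall ranking. With no ties, rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
public class RankCorrelation {
    static double spearman(int[] featureRank, int[] overallRank) {
        int n = featureRank.length;
        double sumSquaredDiff = 0;
        for (int i = 0; i < n; i++) {
            double d = featureRank[i] - overallRank[i];
            sumSquaredDiff += d * d;
        }
        return 1.0 - (6.0 * sumSquaredDiff) / (n * ((double) n * n - 1));
    }
}

A feature whose standalone ranking correlates strongly with the overall ranking contributes the most to it.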

Future work: consider these additional attributes

  1. Reviewer Credibility
  2. Prioritizing features
  3. Look for Sarcasm :)
  4. Filter out Spam.
  5. Data from other sources.


Friday, March 27, 2015

Performance Testing

This is a rough outline of my post; it will be improved..

Goal: Performance test a Java application - a REST API, load balanced across Linux nodes.


  1. Generate requests, i.e. the load for the performance test, through JMeter or Silk (simulating 1000s of client requests).
  2. Take one or two server nodes as a sample and extrapolate.
  3. Monitoring tools can be Jconsole (comes with the JDK) and vmstat (available on Linux).
  4. Metrics to be measured are -
    1. CPU usage
    2. Number of threads
    3. Memory sawtooth (the rise and GC-driven drop of heap usage)
  5. Example of unnecessarily high CPU usage in code (a busy-wait loop):
// Busy-wait loop: spins as fast as the CPU allows, repeatedly checking a
// flag that some other thread is expected to set.
while (true) {
    if (someExternallySetFlag) {
        break;
    }
    // no sleep or wait here, so this thread pegs a core at ~100% CPU
}

This will cause an extraordinary amount of churn.
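
A sketch of two common fixes, using a hypothetical FlagWaiter class: either poll with a short sleep, or block on a monitor so the thread consumes no CPU until it is signaled.

// Sketch: two ways to avoid the busy-wait above. Names are placeholders.
public class FlagWaiter {
    private volatile boolean flag = false; // volatile so updates are visible across threads
    private final Object lock = new Object();

    // Option 1: poll with a short sleep, trading a little latency for far less CPU.
    void waitByPolling() throws InterruptedException {
        while (!flag) {
            Thread.sleep(50); // yields the CPU between checks
        }
    }

    // Option 2: block until another thread calls signal(); uses no CPU while waiting.
    void waitByBlocking() throws InterruptedException {
        synchronized (lock) {
            while (!flag) {
                lock.wait();
            }
        }
    }

    void signal() {
        flag = true;
        synchronized (lock) {
            lock.notifyAll();
        }
    }
}

Blocking primitives in java.util.concurrent, such as CountDownLatch, give the same effect without hand-rolled wait/notify.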


Sunday, February 1, 2015

Architect - enroute


I'm trying to transition from a senior-level engineer to a Software Architect. I believe the key to that transition is designing applications that can scale to handle enterprise-level performance loads. That in turn boils down to the following questions:-


  1. Most production systems are distributed, i.e. several nodes sit behind a domain address and share the workload as determined by a load balancer.
    1. How to do Capacity Planning for such a system:-
      1. number of nodes
      2. amount of memory and processors on each node
      3. How do you performance test such a system?
    2. If such a load-balanced cluster exists at the middleware level, what is the best way of running a batch job over that cluster?
    3. How does such a system maintain state or sessions?
  2. How to determine the best technology stack for a given application?
  3. When would you use an asynchronous channel of communication? JMS or Kafka?
  4. How would you determine if a regular RDBMS such as Oracle or a NoSQL system such as Cassandra needs to be used?
  5. When would you use object caching systems like MemcacheD or CouchBase? 
  6. How to trace requests that flow across several clusters? E.g. an incoming HTTP request for a page view could flow across a front-end server cluster, then proceed to a middleware cluster before hitting the DB, and then back..

I hope to find the answers to all these questions and come up with blog posts for each of these.. Architect en route.. :)