Wednesday, June 24, 2015

Mining millions of reviews




Abstract
  1. Mine Reviews
  2. Get Score – present a product ranking model that applies weights to review factors to compute a product's ranking score.
  3. Sort reviews and products based on that score.
Introduction
  1. Too many reviews 
  2. Reviews may contain negative feedback about the seller rather than the product. Those need to be filtered out.
  3. A review's credibility can be based on:
    1. Date or Age of the review
    2. Number of helpful votes / total number of votes
  4. Ratings provided can have a personal bias.
Methodology

  1. Summary: Sentiment analysis for each relevant sentence
  2. While calculating the product ranking scores, reviews should not be weighted equally; they should be mined and given proportional weights.
  3. Three stages are proposed for evaluating a review's weight:
    1. filter out irrelevant sentences.
    2. use helpfulness votes and age to derive the review's weight.
    3. calculate the product's ranking score based on the review weights.
Filtering Irrelevant sentences

  1. This is treated as a binary classification problem
  2. Use a Support Vector Machine (SVM) to train a hypothesis function h: a sentence maps to h(sentence).
  3. Each sentence is translated to a feature vector X.
  4. The SVM learns a linear decision function h(X) = Beta-Transpose * X + b. The form looks like linear regression's hypothesis, but the SVM fits it as a max-margin classifier; the sign of h(X) separates relevant from irrelevant (see the sketch after this list).
  5. 1000 sentences are collected manually and used as the training set
  6. 10-fold cross-validation is used.
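
A minimal sketch of the classification step, assuming the SVM has already been trained offline so its weight vector and bias are available. The bag-of-words featurization and all names here are hypothetical placeholders for whatever vectorization the paper actually uses:

import java.util.Map;

// Sketch: apply a pre-trained linear SVM h(X) = beta^T * X + b to decide
// whether a review sentence is relevant to the product.
public class RelevanceFilter {
    private final Map<String, Integer> vocabulary; // word -> feature index
    private final double[] beta;                   // learned weight vector
    private final double bias;                     // learned bias term b

    public RelevanceFilter(Map<String, Integer> vocabulary, double[] beta, double bias) {
        this.vocabulary = vocabulary;
        this.beta = beta;
        this.bias = bias;
    }

    // Very simple bag-of-words featurization: X[i] = count of vocabulary word i.
    private double[] toVector(String sentence) {
        double[] x = new double[beta.length];
        for (String token : sentence.toLowerCase().split("\\s+")) {
            Integer idx = vocabulary.get(token);
            if (idx != null) {
                x[idx] += 1.0;
            }
        }
        return x;
    }

    // h(X) = beta^T * X + b; the sign decides the class.
    public boolean isRelevant(String sentence) {
        double[] x = toVector(sentence);
        double h = bias;
        for (int i = 0; i < x.length; i++) {
            h += beta[i] * x[i];
        }
        return h > 0; // positive side of the hyperplane = relevant
    }
}

The training itself (the 1000 hand-labeled sentences and 10-fold cross-validation) would be done with an SVM library; this sketch only shows how the resulting linear function is applied.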
Calculating the Review's weight

  1. Helpfulness Vote - H
    1. At a minimum, it is the ratio "X out of Y people found this review helpful".
    2. To reduce the bias in these votes, do the following (see the H/T weight sketch after this list):
      1. ignore reviews with fewer than 10 votes
      2. use the simple X/Y ratio for reviews with 10 – 200 votes
      3. if the number of votes is greater than 200, multiply by a gain factor > 1
  2. Age of review and durability - T
    1. Younger reviews have a greater weight.
      1. younger reviews will naturally have fewer votes
      2. newer versions of the product will match with newer reviews.
    2. Calculating the age-based metric: T = e^(decay_rate * (time of review – time of product release)) + initializing factor, so a review posted later in the product's life gets a larger exponent and hence a larger weight.
  3. Sentence splitter and part of speech tagging:-
    1. Split reviews into sentences, using MXTERMINATOR
    2. Assign positive or negative sentiments to sentences.
      1. Use a part-of-speech tagger to assess sentiment.
      2. Sentences are saved with their part-of-speech tags.
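
A minimal sketch of the two weight components above. All thresholds and constants are hypothetical placeholders; the paper's actual values are not reproduced here:

// Sketch: helpfulness weight H and age weight T for a single review.
public class ReviewWeights {
    static final int MIN_VOTES = 10;       // ignore reviews below this many votes
    static final int HIGH_VOTES = 200;     // above this, boost the ratio
    static final double GAIN = 1.2;        // hypothetical gain factor > 1
    static final double DECAY_RATE = 0.01; // hypothetical per-day rate in the exponent
    static final double INIT_FACTOR = 1.0; // hypothetical initializing factor

    // H: helpful votes X out of total votes Y, with the thresholds above.
    static double helpfulnessWeight(int helpfulVotes, int totalVotes) {
        if (totalVotes < MIN_VOTES) {
            return 0.0; // too few votes to be trusted; the review is ignored
        }
        double ratio = (double) helpfulVotes / totalVotes;
        return (totalVotes > HIGH_VOTES) ? ratio * GAIN : ratio;
    }

    // T: e^(decayRate * daysSinceRelease) + initFactor, so a review posted
    // later in the product's life gets a larger weight.
    static double ageWeight(long reviewDay, long productReleaseDay) {
        long daysSinceRelease = reviewDay - productReleaseDay;
        return Math.exp(DECAY_RATE * daysSinceRelease) + INIT_FACTOR;
    }
}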

Product Ranking Score Function:-

  1. Calculate the sentiment, or polarity, of the review. Use polarity, H and T to calculate the product ranking score.
  2. Calculating review sentiment (see the sketch after this list):
    1. Manually pick a set of common adjectives/adverbs as a seed list.
    2. Augment it with synonyms and antonyms.
    3. If a sentence has an adjective or adverb from the positive set, then it is positive.
    4. Negative sentences are handled similarly.
    5. If a sentence carries more than one sentiment, its polarity = sum of the positive scores + sum of the negative scores (negatives count below zero).
  3. Final score of a product, over all of its reviews: Sum of (Polarity * H * T) / (Sum of H * Sum of T).
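
A minimal sketch of the scoring step, assuming the per-review weights H and T are already computed. The seed word lists and the +1/-1 scoring are illustrative placeholders, not the paper's actual lexicon:

import java.util.Set;

// Sketch: sentence polarity from seed word lists, and the product's final
// ranking score aggregated over its reviews.
public class ProductScorer {
    static final Set<String> POSITIVE = Set.of("great", "excellent", "good", "fast");
    static final Set<String> NEGATIVE = Set.of("bad", "poor", "slow", "broken");

    // Polarity of one sentence: +1 per positive term, -1 per negative term,
    // so mixed sentences sum their positive and negative contributions.
    static int sentencePolarity(String sentence) {
        int polarity = 0;
        for (String token : sentence.toLowerCase().split("\\W+")) {
            if (POSITIVE.contains(token)) polarity += 1;
            if (NEGATIVE.contains(token)) polarity -= 1;
        }
        return polarity;
    }

    // Final score = Sum(polarity_i * H_i * T_i) / (Sum(H_i) * Sum(T_i)),
    // over all reviews i of the product.
    static double productScore(double[] polarity, double[] h, double[] t) {
        double numerator = 0, sumH = 0, sumT = 0;
        for (int i = 0; i < polarity.length; i++) {
            numerator += polarity[i] * h[i] * t[i];
            sumH += h[i];
            sumT += t[i];
        }
        return numerator / (sumH * sumT);
    }
}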


Evaluation and Analysis

  1. Compare the computed ranking with the sales rank (Amazon-specific), which reflects how well a product sells within its category.
  2. Mean Average Precision (MAP)
    1. Also Spearman's coefficient, computed for a set of products between the human ranking and the ranking from the above algorithm.
  3. Results
    1. Filtering out irrelevant sentences improves performance.
    2. Giving weights to reviews is an additional improvement.
    3. Weights combined with the age of reviews is better still.

Effects of Individual Features

  1. To find the features that contribute the most to the ranking, calculate the correlation between the ranking produced by each feature alone and the overall ranking (see the sketch below).
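
A minimal sketch of that correlation check using Spearman's coefficient, the same measure used in the evaluation above. It assumes both arrays already hold ranks 1..n with no ties:

// Sketch: Spearman's rank correlation between a single-feature ranking and
// the overall ranking. With no ties, rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
public class RankCorrelation {
    static double spearman(int[] featureRank, int[] overallRank) {
        int n = featureRank.length;
        double sumSquaredDiff = 0;
        for (int i = 0; i < n; i++) {
            double d = featureRank[i] - overallRank[i];
            sumSquaredDiff += d * d;
        }
        return 1.0 - (6.0 * sumSquaredDiff) / (n * ((double) n * n - 1));
    }
}

A feature whose standalone ranking correlates strongly with the overall ranking contributes the most to it.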

Future work: consider these additional attributes

  1. Reviewer Credibility
  2. Prioritizing features
  3. Look for Sarcasm :)
  4. Filter out Spam.
  5. Data from other sources.


Friday, March 27, 2015

Performance Testing

This is a rough outline of my post; it will be improved..

Goal: Performance test a Java application - a REST API, load balanced across Linux nodes.


  1. Generate requests, i.e. the load for the performance test, through JMeter or Silk (simulating 1000s of client requests).
  2. Take one or two server nodes as a sample and extrapolate.
  3. Monitoring tools can be Jconsole (comes with the JDK) and vmstat (available on Linux).
  4. Metrics to be measured are -
    1. CPU usage
    2. Number of threads
    3. Memory sawtooth (the rise and GC-driven drop of heap usage)
  5. Example of unnecessarily high CPU usage in code (a busy-wait loop):
// Busy-wait loop: spins as fast as the CPU allows, repeatedly checking a
// flag that some other thread is expected to set.
while (true) {
    if (someExternallySetFlag) {
        break;
    }
    // no sleep or wait here, so this thread pegs a core at ~100% CPU
}

This will cause an extraordinary amount of churn.
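
A sketch of two common fixes, using a hypothetical FlagWaiter class: either poll with a short sleep, or block on a monitor so the thread consumes no CPU until it is signaled.

// Sketch: two ways to avoid the busy-wait above. Names are placeholders.
public class FlagWaiter {
    private volatile boolean flag = false; // volatile so updates are visible across threads
    private final Object lock = new Object();

    // Option 1: poll with a short sleep, trading a little latency for far less CPU.
    void waitByPolling() throws InterruptedException {
        while (!flag) {
            Thread.sleep(50); // yields the CPU between checks
        }
    }

    // Option 2: block until another thread calls signal(); uses no CPU while waiting.
    void waitByBlocking() throws InterruptedException {
        synchronized (lock) {
            while (!flag) {
                lock.wait();
            }
        }
    }

    void signal() {
        flag = true;
        synchronized (lock) {
            lock.notifyAll();
        }
    }
}

Blocking primitives in java.util.concurrent, such as CountDownLatch, give the same effect without hand-rolled wait/notify.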


Sunday, February 1, 2015

Architect - enroute


I'm trying to transition from a senior-level engineer to a Software Architect. I believe the key to that transition is designing applications that can scale to handle enterprise-level performance loads. That in turn boils down to the following questions:-


  1. Most production systems are distributed, i.e. several nodes sit behind a domain address and share the workload as determined by a load balancer.
    1. How to do Capacity Planning for such a system:-
      1. number of nodes
      2. amount of memory and processors on each node
      3. How do you performance test such a system?
    2. If such a load-balanced cluster exists at the middleware level, what is the best way of running a batch job over that cluster?
    3. How does such a system maintain state or sessions?
  2. How to determine the best technology stack for a given application?
  3. When would you use an asynchronous channel of communication? JMS or Kafka?
  4. How would you determine if a regular RDBMS such as Oracle or a NoSQL system such as Cassandra needs to be used?
  5. When would you use object caching systems like MemcacheD or CouchBase? 
  6. How to trace requests that flow across several clusters? E.g. an incoming HTTP request for a page view could flow across a front-end server cluster, then proceed to a middleware cluster before hitting the DB, and then back..

I hope to find the answers to all these questions and come up with blog posts for each of these.. Architect en route.. :)