My experiments with code

Tuesday, October 19, 2021

Adventure Scientists - Timber tracking

Goal: Identify Black Walnut trees and extract core, leaf, and twig samples

Some notes:

Make sure you don't sign up to sample from private land. Use the GaiaGPS app's shading feature to determine if a piece of land is private or public.
Identifying Black Walnut trees is pretty hard; use the iNaturalist app to determine if there are any previous sightings of Black Walnut trees - you need to cross-verify if those sightings are unsampled.
Extracting the leaves and twigs is very hard; extracting the core is probably the easiest thing to do. Make sure they send you a borer tool.

Thursday, November 26, 2020

Polar Codes

Just read a really nice article in the WIRED magazine about the inventor of the "Polar Codes", Erdal Arikan and Huawei.. they spend an equal amount of time praising Erdal Arikan and dissing Huawei...

First they talk about how Erdal invented the "Polar Codes". His obsession with the Shannon limit in Information Theory. This is the technical upper limit on how much information you can pack in a channel, factoring in the noise and redundancy. It was a very hard problem and even his mentor in MIT came close to a solution and gives up. Erdal goes back to Turkey and helps set up the engineering college in a University - Bilkent, which he heads. This helps him to work on this problem over 20 years, while other people in US have to work on small problems for the sake of tenure.

Simultaneously, they talk about Huawei's rise through the support of the Chinese Government and by stealing intellectual property. Apparently, Huawei screws Nortel, and when Nortel has to file for Bankruptcy, Huawei takes over their research team. It is the Chinese born head of the Nortel team who identifies Erdal's polar codes and starts working on them.

Now, even if one company develops the technology, the standards have to be agreed by a lot of companies, the governing body is called 3GPP. Huawei has a lot of leverage there, because it holds most of patents and it has been able to push the Polar Codes based technology. US doesn't even have an equivalent company - Europe has Ericsson, Japan has a few. And all the Chinese companies are working together in pushing the Polar Codes based standard including ZTE and Lenovo. They've succeeded :)

Now to claim legitimacy, Huawei is honoring the inventor Erdal - They start the article by saying the ceremony and settings were corny and cold-war style :)

https://youtu.be/BE5HuqEg0oY

Tuesday, November 3, 2020

Deep Learning - Coursera

This course was introduction to Neural Networks, I'll try to summarize this in as simple a manner as possible.

Logistic regression can be viewed as a simple, single layer neural network. Similarly, a neural network can be viewed as multiple layers of logistic regression.

The difference being that, logistic regression can detect linear patterns only i.e. just do line fitting through the training dataset.

Neural networks can detect non-linear patterns. This is because the each layer of the neural network has a non-linear activation function.

Logistic Regression with Gradient Descent

The goal of Logistic Regression is to train a model or create a model which can make predictions, more specifically True or False predictions.

The input is anything that can be represented as a matrix, say X, then we need a matrix W and a vector b such that:

W * X + b = a.

When an activation function say Ω is applied to a, we get the True or False result.

Rounding Ω(a) to an integer gives the True or False result.

Gradient Descent Algorithm

The training algorithm proceeds as follows:

We take a large number of training examples and We start with a zero matrix and zero vector for W and b.

For each training example X and training result Y,

we calculate a = Ω (W * X + b)
Next, we calculate the cost Y vs. a
Next, we adjust W and b based on the cost.

We do this until there is no difference in cost across different iterations i.e there is no gradient descent.

Neural Network

The Neural Network goal is similar, but goal is to train multiple layers and we start with random value matrices

The neural network algorithm is similar to the gradient descent algorithm, but once again it works across multiple layers.

As an equivalent of step 1 in logistic regression, we have "Forward Propagation"
Once again we calculate cost by comparing with intermediate step's result with the training result.
As an equivalent of step 3 in Logistic regression, we have backward propogation, which adjusts the weights across all the layers.

Other Stuff

Hyper-parameters

Things like the number of layers, the learning rate etc. Tuning these for optimal efficiency is a course of its own

Vectorization

This is a computation optimization where we avoid explicit for-loops in the code and instead use Python and Numpy's inbuilt features such as broadcasting

Learning Tip

Do the course with a friend, makes it much more easier and fun..

Saturday, April 8, 2017

Learning Scala online

1. Programming courses take a lot more time than any other course

2. Do all of the code covered in the videos, not just the assignments

3. This one took almost a year

4. Key concept learnt was "Pattern matching - applied to lists, objects etc."
https://github.com/gany-c/ScalaCoursera

Wednesday, June 24, 2015

Mining millions of reviews

Study Notes from http://users.eecs.northwestern.edu/~choudhar/Publications/MiningMillionsofReviewsATechniqueToRankProductsBasedOnImportanceofReviews.pdf

Abstract

Mine Reviews
Get Score – present a product ranking model that applies weights to product review factors to calculate a products ranking score. Rank reviews and products based on score.
Sort reviews and products based on score.

Introduction

Too many reviews
Reviews may contain –ve feedback about the seller and not the product. Those need to be filtered.
A Review's credibility can be based on -

Date or Age of the review
Number of Helpful votes/ number of votes

Ratings provided can have a personal bias.

Methodology

Summary: Sentiment analysis for each relevant sentence
While calculating the product ranking scores, reviews should not be equally weighted, they should be mined and given proportional weights.
3 stages are proposed in evaluating a review's weight

filter out irrelevant sentences.
use helpfulness votes and age to derive the review's weight.
calculate the product's ranking score based on the review weights.

Filtering Irrelevant sentences

This is treated as a binary classification problem
Use Support Vector Machines (SVM) to train a hypothesis function h. - sentence gives h(sentence)
The sentence is translated to a vector X
The SVM uses linear regression – h(X) = Beta-Transpose * X + b (Just like you used gradient descent to implement linear regression, use SVM here)
1000 sentences are collected manually and used as the training set
10 fold cross validation is used.

Calculating the Review's weight

Helpfulness Vote - H

bare minimum, it is X out of Y people found it useful.
to beat the bias in ratings do the following

ignore reviews with less than 10 votes
use simple X out of Y ratio for items with 10 – 200 reviews
if number of items is greater than 200, multiply by a gaining factor, > 1

Age of review and durability - T

Younger reviews have a greater weight.

younger reviews will naturally have less number of votes
newer versions of the product will match with newer reviews.

Calculating the age based metric = e^ Decay rate * ( time of review – time of product release) + initializing factor.

Sentence splitter and part of speech tagging:-

Split reviews into sentences, using MXTERMINATOR
Assign positive or negative sentiments to sentences.

Use part of speech tagger to do assess sentiment.
Sentences are saved with PART-OF-SPEECH tags.

Product Ranking Score Function:-

1. Calculate the sentiment or the polarity of the review. Use polarity, H and T to calculate the product ranking score.

2. Calculating review sentiment

1. Manually pick a set of common adjectives/ adverbs as a seed list.

2. Augment it with synonyms and antonyms

3. If a sentence has an adjective or adverb from a positive set then it is positive.

4. Negative sentences are handled similarly

5. If a sentence has more than 1 sentiment, then you Polarity = Sum of positives + sum of negatives.

3. Final Score of a product is – For all of its reviews, Sum of all ( Polarity * H * T) / Sum of all H * Sum of all T

Evaluation and Analysis

Compare review rank with sales rank (Amazon specific) - how well the product is sold within its category
Mean Average Precision or MAP

Spearman's coefficient for a set of products between both human ranking and the above algorithm.

Results

Filtering out irrelevant sentences improves performance
giving weights to reviews is an additional improvement
weights + age of reviews is even better

Effects of Individual Features

To get the features that contribute the most to the ranking, the correlation between the ranking by a feature and overall ranking can be calculated.

Future work, Consider these additional attributes

Reviewer Credibility
Prioritizing features
Look for Sarcasm :)
Filter out Spam.
Data from other sources.

Friday, March 27, 2015

Performance Testing

This is a rough outline of my post, will be improved..

Goal: Performance test a Java application - Rest API, load balanced on linux nodes.

1. Generate requests i.e. the load for performance test through JMeter or Silk (simulating 1000s of client requests)
2. Take one or 2 server nodes as sample and extrapolate.
3. Monitoring tools can be - Jconsole (comes with JDK) and vmstat - available in Linux.
4. Metrics to be measured are -

CPU usage
Number of threads
Memory saw tooth

Example of unnecessary high CPU usage in code

while(true)

{

if(some-externally-flag){

}

This will cause an extraordinary amount of churn.

Sunday, February 1, 2015

Architect - enroute

I'm trying to transition from a senior level engineer to a Software Architect. I believe the key to that transition is designing applications that can scale to take enterprise level performance loads. That in turn boils down to the following questions:-

Most production systems are distributed, i.e. several nodes sit behind a domain address and share the work load as determined by a load balancer.

How to do Capacity Planning for such a system:-

number of nodes
amount of memory and processors on each node
How do you performance test such a system.

If such a load balanced cluster exists at a middleware level, What is the best way of running a batch job over that cluster?
How does such a system maintain state or sessions?

How to determine the best technology stack for a given application?
When would you use an asynchronous channel of communication? JMS or Kafka?
How would you determine if a regular RDBMS such as Oracle or a NoSQL system such as Cassandra needs to be used?
When would you use object caching systems like MemcacheD or CouchBase?
How to trace requests that flow across several clusters e.g. an incoming HTTP request for a page view could flow across a front server cluster and then proceed to a middleware cluster before hitting the DB and then back..

I hope to find the answers to all these questions and come up with blog posts for each of these.. Architect en route.. :)