# Data Analytics

### Cohort Analysis

Cohort Analysis is a technique used to analyze the characteristics of a cohort (a group of customers who share a common characteristic) over time. It is effectively another form of customer segmentation, one that extends the analysis over a defined period.

A frequently applied use case in the sales function is to segment the customer base on a set of characteristics. The criteria could categorize customers into groups who are likely to continue buying, who are likely to defect, or who have already defected (gone inactive).

Once these groups are formed, some of the common applications for analysis would be to:

- Study customer retention – use the results to learn about conversion rates of certain groups and focus marketing initiatives accordingly (for example, on customers who could be retained)
- Forecast transactions for cohorts/individual customers and predict purchase volume
- Bring more business – Identify groups for upselling and cross-selling
- Estimate marketing costs by calculating lifetime value of a customer by cohort
- Improve customer experience based on individual customer needs across websites and stores
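The grouping step itself is simple to sketch. Below is a minimal, hypothetical example (invented customer IDs and months) that assigns each customer to a cohort by first-purchase month and counts how many cohort members are active in each later month:

```python
from collections import defaultdict

# Hypothetical transactions: (customer_id, purchase month)
transactions = [
    ("a", "2024-01"), ("b", "2024-01"), ("a", "2024-02"),
    ("c", "2024-02"), ("b", "2024-03"), ("c", "2024-03"),
]

# Cohort = month of each customer's first purchase
first_seen = {}
for cust, month in sorted(transactions, key=lambda t: t[1]):
    first_seen.setdefault(cust, month)

# For each (cohort, month) cell, collect the active customers
active = defaultdict(set)
for cust, month in transactions:
    active[(first_seen[cust], month)].add(cust)

retention = {cell: len(custs) for cell, custs in active.items()}
print(retention)
```

Here the 2024-01 cohort starts with two customers and retains one in each following month; a real analysis runs the same counting over a transactions table.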

### Marketing Analytics

Marketing is hugely important for a business to succeed. Clearly defining marketing objectives and prioritizing marketing spend accordingly is one of the major challenges marketers face. To tune their approach, marketers need key metrics from various business functions to determine marketing effectiveness. Below is an attempt to categorize some of the commonly applied analytic techniques that can be used to measure marketing performance.

Our first step in this analysis would be to identify relevant data sources and develop automation capabilities to streamline data into well-defined repositories. Next, we could use a combination of descriptive and predictive analytic techniques to gain insights. Further, we could integrate different models and automate their execution to perform prescriptive analytics for continuous monitoring and feedback.

Marketing drives sales and sales in turn should help improve marketing strategy. Let’s look at some techniques to identify sales patterns and then work on improving our mix of marketing activities.

**Sales**

| Applications | Applicable Tools/Techniques | Required Measures/Expected Results |
| --- | --- | --- |
| Sales Performance (Descriptive) | Visualizing data using Time Series Analysis and other metrics via standard/ad hoc reporting and operational dashboards that cater to different audiences | Use accumulated data over time to learn about correlations and identify patterns |
| | ARIMA models for time series data | |
| Sales Performance (Predictive) | Simple and multiple linear regression techniques for forecasting and simulation | Determine future possibilities and predict events to make more informed decisions |

**Customer Service**

| Applications | Applicable Tools/Techniques | Required Measures/Expected Results |
| --- | --- | --- |
| Customer Acquisition and Retention | Logistic Regression (Churn Analysis) | Use historical data to identify ingress and egress of customers |
| Customer Segmentation | Cluster Analysis | Identify potential markets and improve promotion, product, pricing and distribution decisions |
| | Decision Trees | |
| | Hypothesis Testing | |
| Product and Brand Feedback | Text Analytics using the Natural Language Toolkit (NLTK) from Python | Analyze unstructured data from social media platforms such as Facebook, Twitter, Yelp etc. |
| | Sentiment Analysis using Stanford NLP | |
| Customer Loyalty | Logistic Regression | Understand customer behavior and improve decisions around targeted promotions |
| | Multivariate Analysis using Factor Analysis, Principal Component Analysis or Canonical Correlation Analysis | |
| E-Marketing | Clickstream Analysis (traffic- and e-commerce-based) | Improve conversion and sales |
| | | Drive email marketing campaigns |
| | Google Analytics for website statistics | Search engine optimization (SEO) |
| | | Channel adaptation |
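As a toy illustration of the churn-analysis entry above: once a logistic regression has been fit, scoring a customer is just the logistic function applied to a linear combination of features. The coefficients and features below are invented for the sketch, not fitted values:

```python
import math

# Hypothetical coefficients: intercept, days since last purchase, orders/month
b0, b1, b2 = -2.0, 0.05, -0.6

def churn_probability(days_since_last_purchase, orders_per_month):
    # Logistic (sigmoid) of the linear predictor
    z = b0 + b1 * days_since_last_purchase + b2 * orders_per_month
    return 1 / (1 + math.exp(-z))

p_active = churn_probability(5, 4)   # recent, frequent buyer
p_lapsed = churn_probability(90, 0)  # long inactive
print(round(p_active, 3), round(p_lapsed, 3))
```

The long-inactive customer scores a far higher churn probability, which is exactly the signal used to separate likely defectors from retainable customers.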

Note: The techniques mentioned above can be applied across a range of problems, depending on their applicability.

After analyzing the results from our analytical models, we can take measures to improve crucial marketing activities such as lead generation, demand creation and product promotion. Further, the above analysis can be used to design and implement marketing strategies covering product and brand promotion, pricing, distribution and customer service. The findings can also be employed to improve questionnaires and other mechanisms for collecting marketing data and customer feedback, in order to learn about product performance and brand value.

With these new analytics capabilities, we can make predictions much more accurately and provide our marketing teams with new ideas to drive promotions and boost sales.

In general, adoption and effective application of these analytic techniques is challenging. Building the right analytics should be informed by industry knowledge and tailored to the business function in context. However, this is a process that requires constructive iteration over the long term, and in most cases it should lead to optimized marketing performance and deliver tremendous value to the organization.

### Text Analytics using Natural Language Processing

Natural Language Processing (NLP) combines artificial intelligence and machine learning techniques with linguistics to process and understand human language. Using NLP, various sources of unstructured data such as social media, call (text) logs, emails etc. could be leveraged to extract actionable insights. Some of the applications include text processing for information retrieval, sentiment analysis, question answering etc.

The core of the problem is that natural languages are constantly evolving, with growing vocabularies. In addition, inherent aspects of language such as grammar, syntax, semantics and varied writing styles add to the complexity of analysis. It is quite challenging to arrive at definitive rules when creating systems that make sense of language. As a result, a practical approach to building a parsing system should focus on application-specific techniques and the domain in context.

Some of these techniques are:

**NLP using Natural Language Toolkit (NLTK) library from Python**

Using NLTK 3.0, an open-source Python library, I was able to track the trend of a set of ailments (in the medical domain) by counting the frequencies of the ailment terms in call (text) logs from a given time period.
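The counting itself can be sketched in a few lines of plain Python (the ailment vocabulary and log snippets below are invented; in the actual application, NLTK handled tokenization and the logs came from a real call system):

```python
from collections import Counter
import re

ailments = {"migraine", "nausea", "fatigue"}  # hypothetical vocabulary
logs = [
    "Caller reported a migraine and severe nausea.",
    "Follow-up: migraine persists, some fatigue.",
    "Migraine and nausea again, with lingering fatigue.",
]

# Tokenize (NLTK's word_tokenize would normally do this) and count mentions
tokens = [w for log in logs for w in re.findall(r"[a-z]+", log.lower())]
trend = Counter(t for t in tokens if t in ailments)
print(trend.most_common())
```

Bucketing the same counts by week or month yields the trend over time.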

**Stanford NLP**

In another NLP application, I used the Stanford NLP libraries to understand customer opinion – specifically, to perform Sentiment Analysis on Yelp reviews.
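Stanford NLP scores sentiment with a trained recursive neural model; a toy lexicon-based scorer (the word lists here are invented) at least shows the shape of the task:

```python
import re

POSITIVE = {"great", "delicious", "friendly", "love"}  # toy lexicon
NEGATIVE = {"slow", "cold", "rude", "bland"}           # toy lexicon

def sentiment(review: str) -> str:
    # Count positive and negative words and compare
    words = re.findall(r"[a-z]+", review.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("Great food, friendly staff."))              # positive
print(sentiment("Service was slow and the food was cold."))  # negative
```

A trained model handles negation, intensity and context that a word-count lexicon cannot, which is why the real application used Stanford NLP.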

### Cluster Analysis using R

```r
# Load the data set europe
# View the first 10 rows of data
head(europe, n = 10)
# Perform the cluster analysis
euroclust <- hclust(dist(europe[-1]))
# Plot the dendrogram
plot(euroclust, labels = europe$Country)
# Add the rectangles to identify the five clusters
rect.hclust(euroclust, 5)
```

### Classification using Decision Trees in R

```r
# Loading the required libraries
install.packages("ISLR")
library(ISLR)
install.packages("tree")
library(tree)

attach(Carseats)
head(Carseats, n = 10)
dim(Carseats)
range(Sales)

# Creating a categorical variable for Sales data depending on the below condition
High = ifelse(Sales >= 8, "Yes", "No")
# Appending this column "High" to the Carseats dataset
Carseats = data.frame(Carseats, High)
dim(Carseats)
# Remove the Sales column from the dataset
Carseats = Carseats[, -1]
dim(Carseats)

# Split the dataset into training and testing
set.seed(2)
# Generating the training and testing datasets
train = sample(1:nrow(Carseats), nrow(Carseats) / 2)
test = -train
training_data = Carseats[train, ]
testing_data = Carseats[test, ]
# Keeping the actual labels to compare against our predictions
testing_High = High[test]

# Fit the tree model (full model) using training data
tree_model = tree(High ~ ., training_data)
plot(tree_model)
text(tree_model, pretty = 0)
```

```r
# Evaluate how the model performs on the testing data:
# predict with the tree model, using type = "class" for a class prediction
tree_pred = predict(tree_model, testing_data, type = "class")
# Check the misclassification error
mean(tree_pred != testing_High)
# 0.295 - 29.5% is a high number, which we can reduce
# We can prune our tree to reduce the misclassification error;
# perform cross-validation to check at what level to stop pruning
set.seed(3)
# Generate a cross-validation tree
cv_tree = cv.tree(tree_model, FUN = prune.misclass)
names(cv_tree)
# Plot the size of the tree versus the deviance (the error rate)
plot(cv_tree$size, cv_tree$dev, type = "b")
```

```r
# The minimum error rate is at tree size 9, so let's create a pruned model
pruned_model = prune.misclass(tree_model, best = 9)
plot(pruned_model)
text(pruned_model, pretty = 0)
```

```r
# Check how the pruned model is performing
tree_pred = predict(pruned_model, testing_data, type = "class")
# Misclassification error on the testing data
mean(tree_pred != testing_High)
# [1] 0.29 - we have reduced the misclassification rate by pruning our tree
```

### Market Basket Analysis and Association Rules using R

Market basket analysis provides great insights into the purchasing behaviors of customers. Based on customer purchase data and association rules, we arrive at groups of related products that people typically buy together.
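Before reaching for a library, the two core measures are easy to state: support is the fraction of baskets containing an itemset, and the confidence of a rule A → B is support(A ∪ B) / support(A). A small sketch with made-up baskets:

```python
# Hypothetical purchase baskets
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def support(itemset):
    # Fraction of baskets that contain every item in the itemset
    return sum(itemset <= basket for basket in baskets) / len(baskets)

# Confidence of the rule {bread} -> {butter}
confidence = support({"bread", "butter"}) / support({"bread"})
print(support({"bread", "butter"}), confidence)  # 0.75 1.0
```

The arules package computes exactly these measures, but efficiently over every candidate itemset via the Apriori algorithm.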

```r
# Load the required libraries
install.packages("arules")
library(arules)
library(datasets)
# Load the data set myData

# Fetch the rules with support as 0.001 and confidence as 0.7
rules <- apriori(myData, parameter = list(supp = 0.001, conf = 0.7))

# Sort the rules by confidence and inspect the top 10
rules <- sort(rules, by = "confidence", decreasing = TRUE)
options(digits = 2)
inspect(rules[1:10])

# Visualizing the results
install.packages("igraph")
install.packages("arulesViz")
library(arulesViz)
library(tcltk)
# Select the top rules by lift and plot them as a graph (one possible visualization)
rulesImp <- head(sort(rules, by = "lift"), 10)
plot(rulesImp, method = "graph")
```

References:

1. Arules Package: http://cran.at.r-project.org/web/packages/arules/arules.pdf

2. ArulesViz Package: http://cran.r-project.org/web/packages/arulesViz/vignettes/arulesViz.pdf

### Using Spotfire for Predictive Analytics (Regression Modeling)

We are building a model using linear regression to forecast sales:

`Sales` ~ `Order Quantity` + `Discount` + `Shipping Cost` + `Profit` + `Unit Price` + `Product Base Margin`

This is the model with `Sales` as the response variable and all the columns after the “~” treated as predictor variables.
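For intuition, here is the mechanics of a least-squares fit on a single predictor with made-up numbers (Spotfire fits the full multiple-regression model above, but the principle is the same):

```python
# Hypothetical observations
order_qty = [10.0, 20.0, 30.0, 40.0]
sales = [120.0, 210.0, 310.0, 400.0]

n = len(order_qty)
mean_x = sum(order_qty) / n
mean_y = sum(sales) / n

# Ordinary least squares for: sales = intercept + slope * order_qty
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(order_qty, sales))
         / sum((x - mean_x) ** 2 for x in order_qty))
intercept = mean_y - slope * mean_x
print(round(slope, 6), round(intercept, 6))  # 9.4 25.0
```

Multiple regression generalizes the same minimization of squared residuals to several predictors at once.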

Let us click “OK” to examine the results of the model.

In the Model Summary pane, we can check the summary metrics:

```
Residual standard error: 1421 on 8329 degrees of freedom
  (63 observations deleted due to missingness)
Multiple R-squared: 0.8421,  Adjusted R-squared: 0.842
F-statistic: 7406 on 6 and 8329 DF,  p-value: 0
```

Below is the significance of these summary metrics:

Residual Standard Error: A lower value indicates the model is a better fit for our data.

Adjusted R-Squared: This is a commonly used measure of fit of a regression equation. It penalizes the addition of too many variables while rewarding a good fit of the regression equation. A higher Adjusted R-Squared value indicates a better fit.
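The relationship between the two R-squared figures in the summary above can be verified directly: with p = 6 predictors and 8329 residual degrees of freedom, the sample size is n = 8329 + 6 + 1 = 8336, and the standard adjustment formula reproduces the reported value:

```python
# Figures taken from the model summary above
r2, p, n = 0.8421, 6, 8336

# Adjusted R-squared penalizes each additional predictor
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))  # 0.842
```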

p-value: Predictors whose p-values are closer to zero contribute more significantly to the model.

Other factors that influence our model are collinearity and multicollinearity; the Variance Inflation Factor (VIF), along with AIC and BIC values, can help assess the model.

Collinearity is the case of one independent variable being a linear function of another; in multicollinearity, a variable is a linear function of two or more variables. These issues increase the likelihood of drawing false conclusions from our estimates.
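For two predictors, the VIF reduces to 1 / (1 − r²), where r is their correlation, so a quick check is easy to sketch (the two near-collinear columns below are invented):

```python
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 8.0, 10.1]  # roughly 2 * x1, so nearly collinear

n = len(x1)
m1, m2 = sum(x1) / n, sum(x2) / n

# Pearson correlation between the two predictors
cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
r = cov / (sum((a - m1) ** 2 for a in x1) * sum((b - m2) ** 2 for b in x2)) ** 0.5

# Variance Inflation Factor for either predictor
vif = 1 / (1 - r * r)
print(round(r, 3), round(vif))
```

With more predictors, each VIF comes from regressing one predictor on all the others, but the interpretation is the same: a large value flags a variable the others can nearly reproduce.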

A high VIF means that multicollinearity significantly impacts the equation, whereas lower AIC and BIC values indicate a better model.

The Table of Coefficients lists p-values for the individual predictors (also called regressors); lower p-values indicate the more significant predictors in the model.

If there are patterns in the “Residuals vs. Fitted” plot, then the current model could be improved.

A simple horizontal bar chart signifies the relative importance of each predictor used in the model; here, Discount is the least important predictor.

If the normal Q-Q plot closely approximates the line y = x, then the model fits the data well.

In the above plot, larger values represent data points that are more influential and have to be investigated further.

Depending on these various factors, the model has to go through a series of investigative steps till a satisfactory level of fit is reached.

In addition to knowledge of statistics, domain-specific understanding is quite crucial in assessing the inputs and the results. For example, when analyzing sales, we examine specific types of sales broken into tiers depending on various criteria such as quarter of the year, geographic factors, economic indicators, seasonal influences etc.

We can exclude outliers, which would otherwise skew our results. Further, appropriate weights could be assigned to each input parameter to identify whether a specific type of sale is profitable to our business.