Airbnb User Behavior Classification
Table of Contents
1. Project Overview
2. Tools Used
3. Data Distribution
4. Phase 1, Multiclass Classification Models
5. Phase 2, Binary Classification Model
6. Phase 2, Establish Threshold
7. Final Results
8. Flask App
9. Github Repo - Link
1. Project Overview
The goal of this project was to classify new Airbnb users by their destination preference, based solely on online activity and static user data.
- Static User Data, such as:
  - Device Used
  - Signup Method
  - Age
- Dynamic User Data (Sessions), such as:
  - Active Session Time
  - Buttons Clicked
  - Browser Type
- 10 million user activity data points were used for the analysis.
- A total of 190 features were generated from these data points (see the aggregation sketch below).
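A minimal sketch of how the per-user session features might have been aggregated with pandas. The column names (`user_id`, `action`, `device_type`, `secs_elapsed`) and the file name are assumptions modeled on the public Airbnb sessions export, not the exact schema used in this project:

```python
import pandas as pd

# Assumed schema: user_id, action, device_type, secs_elapsed
sessions = pd.read_csv("sessions.csv")

# Aggregate per-user session behavior into model-ready features.
session_features = sessions.groupby("user_id").agg(
    total_session_time=("secs_elapsed", "sum"),
    mean_session_time=("secs_elapsed", "mean"),
    n_actions=("action", "count"),
    n_unique_actions=("action", "nunique"),
    n_devices=("device_type", "nunique"),
)

# Count features for the most common actions (e.g. buttons clicked).
top_actions = sessions["action"].value_counts().head(50).index
subset = sessions[sessions["action"].isin(top_actions)]
action_counts = pd.crosstab(subset["user_id"], subset["action"])

features = session_features.join(action_counts, how="left").fillna(0)
```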
The project was broken down into two phases:
- Multiclass Classification
- Binary Classification
2. Tools Used
scikit-learn, PostgreSQL, Flask, Plotly, D3, Python, Pandas, Matplotlib, Seaborn, Jupyter Notebook, HTML, CSS.
Algorithms Used:
- Logistic Regression
- KNN
- SVM (Support Vector Machines)
- Ensemble Random Forest Classifier
- Decision Tree Classifier
- Hyperparameter Tuning: sklearn GridSearchCV (see the sketch below)
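A hedged sketch of tuning one of the candidate models with GridSearchCV; the parameter grid and the `X_train`/`y_train` names are illustrative assumptions, not the actual search space used:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical parameter grid; the real search space is not documented here.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",   # the project scored models on F1
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)  # X_train / y_train assumed from the feature pipeline
print(search.best_params_, search.best_score_)
```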
3. Data Distribution
Exploratory data analysis was performed on the age distribution of the target demographic. Below is a histogram of average Airbnb user age, created with Seaborn.
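A minimal sketch of how such a plot can be produced, assuming a `users` DataFrame with an `age` column from the static user data:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Drop missing and implausible ages before plotting.
ages = users["age"].dropna()
ages = ages[(ages >= 18) & (ages <= 100)]

sns.histplot(ages, bins=40, kde=True)
plt.xlabel("Age")
plt.ylabel("Number of users")
plt.title("Distribution of Airbnb user age")
plt.show()
```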
4. Phase 1, Multiclass Classification Models
Choosing an Algorithm
The initial data was tested using several models, each cross-validated on held-out test sets; a sketch of the procedure follows. Below is the test-set model summary. The best-performing model was the Random Forest Classifier, but the results were not significant due to high class imbalance.
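A sketch of the kind of cross-validated comparison described above; the model defaults and the `X_train`/`y_train` names are assumptions for illustration:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Candidate models from the list above; settings are illustrative, not the tuned ones.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1_weighted")
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```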
5. Phase 2, Binary Classification Model
Due to the high class imbalance, we were unable to get significant results from multiclass classification. To derive significant results from the data, I narrowed the problem down to a binary one: users travelling within the US vs. the rest of the world.
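A minimal sketch of collapsing the destination labels into a binary target; the `country_destination` column name is an assumption modeled on the public Airbnb new-user bookings data:

```python
# Binary target: 1 if the user's destination is the US, 0 for everywhere else.
users["is_us"] = (users["country_destination"] == "US").astype(int)
y = users["is_us"]
```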
As you can see below, Logistic Regression outperformed the Random Forest Classifier by a large margin. The scoring metric was F1, which balances precision and recall.
Listed below are the features with the highest impact on the model; the sketch that follows shows one way to extract them. Since the model's R-squared remained consistent on the test set, we can conclude there was no overfitting.
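One possible way to rank feature impact for a fitted Logistic Regression; `logreg` and `feature_names` are assumed names, and the comparison only makes sense if the features were standardized:

```python
import pandas as pd

# Pair each coefficient with its feature name and rank by absolute magnitude.
coef = pd.Series(logreg.coef_[0], index=feature_names)
top_features = coef.reindex(coef.abs().sort_values(ascending=False).index).head(10)
print(top_features)
```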
6. Phase 2, Establish Threshold
Challenge: choosing the right threshold. How do we determine the most effective threshold?
- Solution 1, Precision-Recall Curve: it helps find a good balance between precision and recall (see the sketch below).
- Solution 2, Custom Cost Function: score each candidate threshold by its business cost (see the cost-function sketch further down).
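A sketch of Solution 1, plotting precision and recall against the decision threshold; `logreg`, `X_test`, and `y_test` are assumed from the earlier modeling step:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Predicted probability of the positive (US) class.
probs = logreg.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probs)

# precision/recall have one extra element, so drop the last value when plotting.
plt.plot(thresholds, precision[:-1], label="Precision")
plt.plot(thresholds, recall[:-1], label="Recall")
plt.xlabel("Decision threshold")
plt.ylabel("Score")
plt.legend()
plt.show()
```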
Testing the threshold via a custom cost function (a sketch follows this list):
- TP = advertised correctly: gain $1.50
- FN = missed opportunity: loss $0.50
- FP = wrong demographic: loss $0.50
- Best threshold: 0.35
- Profit margin of $4,000 with this model and threshold.
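A sketch of Solution 2, sweeping thresholds under the cost assumptions listed above; `probs` and `y_test` are assumed from the evaluation step, and the dollar values simply mirror the list:

```python
import numpy as np

def profit(y_true, probs, threshold, tp_gain=1.5, fn_loss=0.5, fp_loss=0.5):
    """Expected profit at a given threshold under the per-outcome costs above."""
    preds = (probs >= threshold).astype(int)
    tp = np.sum((preds == 1) & (y_true == 1))
    fn = np.sum((preds == 0) & (y_true == 1))
    fp = np.sum((preds == 1) & (y_true == 0))
    return tp * tp_gain - fn * fn_loss - fp * fp_loss

thresholds = np.arange(0.05, 0.95, 0.05)
profits = [profit(np.asarray(y_test), probs, t) for t in thresholds]
best = thresholds[int(np.argmax(profits))]
print(f"Best threshold: {best:.2f}, profit: ${max(profits):,.2f}")
```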
7. Final Results
The best-performing model on the binary dataset was Logistic Regression. The model accurately predicts 82% of US customers while minimizing losses from false positives.
Details on the model below:
| Metric | Before Threshold | After Threshold |
|---|---|---|
| Recall | 0.74 | 0.82 |
| Precision | 0.47 | 0.43 |
| F1 | 0.58 | 0.57 |
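For reference, the "After Threshold" column can be reproduced by scoring predictions at the chosen 0.35 cutoff instead of the default 0.5; `probs` and `y_test` are assumed from the earlier evaluation:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Apply the custom 0.35 threshold chosen via the cost function.
preds_after = (probs >= 0.35).astype(int)
print("Recall:", recall_score(y_test, preds_after))
print("Precision:", precision_score(y_test, preds_after))
print("F1:", f1_score(y_test, preds_after))
```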
8. Flask App
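A minimal sketch of what the Flask prediction endpoint could look like, assuming the fitted pipeline was pickled as `model.pkl`; the file name, route, and feature handling are illustrative, not the app's actual code:

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Assumed artifact: the fitted Logistic Regression pipeline saved as model.pkl.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON object of feature values; the order must match training.
    features = request.get_json()
    prob = model.predict_proba([list(features.values())])[0, 1]
    # The 0.35 threshold mirrors the write-up above.
    return jsonify({"us_probability": float(prob), "is_us": bool(prob >= 0.35)})

if __name__ == "__main__":
    app.run(debug=True)
```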
Contact:
Thank you for visiting the page, feel free to contact me at smeet.vikani@gmail.com