This survey contained nearly two hundred variables and over 2,000 observations. Moreover, there were quite a few types of answer choices, which meant that a fair amount of preprocessing would be necessary before feature selection and engineering could begin.
First, I imported all of the libraries needed for this project at the top of the notebook. It's important to have this not only for yourself, but for others who may look at your code or edit it.
I've also included code to expand the notebook view to fit the entire screen.
# Importing libraries
!pip install shap
!pip install pywaffle

import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import shap
import statsmodels.api as sm
import xgboost as xgb
import pywaffle as pyw
from pyarrow import csv, parquet, schema, string, float32, int32, int16, int64
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             roc_auc_score, roc_curve, confusion_matrix)
from sklearn.model_selection import train_test_split, cross_val_score
from statsmodels.stats.outliers_influence import variance_inflation_factor
from tabulate import tabulate
from IPython.display import display, HTML

%matplotlib inline
warnings.filterwarnings("ignore")
shap.initjs()

# Expand the notebook view to fit the entire screen
display(HTML("<style>.container { width:100% !important; }</style>"))
gaming_df = pd.read_csv("Deloitte.csv")
gaming_df.head()
Looking at the first five rows, I found that there were nearly two hundred variables in this data set. Moreover, there were different answer choices that needed labeling. Before getting into labeling, I wanted to look at the variable types in the dataset.
gaming_df.dtypes
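With nearly two hundred columns, scanning the full `dtypes` listing is slow. Calling `value_counts()` on it summarizes how the types break down, one row per dtype. A small sketch on a toy frame (the survey file itself isn't distributed, so the column names here are stand-ins):

```python
import pandas as pd

# Toy stand-in for gaming_df; the real survey has nearly 200 columns
toy_df = pd.DataFrame({
    "Q4 - What is your gender?": ["Male", "Female"],
    "Age": [34, 52],
    "% View movies on Smartphone": [20.0, 5.0],
})

# One row per dtype, with the number of columns of that type
print(toy_df.dtypes.value_counts())
```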
I realized that the labels in this data set were just too long for me to read. For instance, “QNEW2 — How old are the children in your home?-14–18 years” can quickly take up a lot of room on your notebook screen. I opted to use rename to shorten all of these labels.
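The explicit renames below give full control over each final label. For a quicker, less precise pass, a small helper could strip the "Qxx - question text?" prefix mechanically; this sketch is my own addition (the `shorten` helper and its regex are not from the original notebook):

```python
import re

def shorten(col: str) -> str:
    """Drop the 'Qxx - question text?' prefix, keeping any '-option' suffix."""
    m = re.match(r"^.*? - (?P<question>[^?]*)\?-?(?P<option>.*)$", col)
    if not m:
        return col  # not a survey-style name; leave it unchanged
    # Multi-option questions keep the option; plain questions keep the question
    return m.group("option") or m.group("question").strip()

print(shorten("Q4 - What is your gender?"))                                   # question only
print(shorten("QNEW2 - How old are the children in your home?-14-18 years"))  # keeps the option
```

Applied as `gaming_df.columns = [shorten(c) for c in gaming_df.columns]`, this would shorten every column at once, at the cost of less readable labels than the hand-picked ones used here.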
gaming_df.rename(columns={
    'Q4 - What is your gender?': 'Gender',
    'region - Region': 'Region',
    'age - you are...': 'Age',
    'Q2 - In which state do you currently reside?': 'State',
    'Q5 - Which category best describes your ethnicity?': 'Ethnicity',
    'QNEW2 - How old are the children in your home?-0-4 years': 'Children 0-4 years',
    'QNEW2 - How old are the children in your home?-5-9 years': 'Children 5-9 years',
    'QNEW2 - How old are the children in your home?-10-13 years': 'Children 10-13 years',
    'QNEW2 - How old are the children in your home?-14-18 years': 'Children 14-18 years',
    'QNEW2 - How old are the children in your home?-19-25 years': 'Children 19-25 years',
    'QNEW2 - How old are the children in your home?-26+ years': 'Children 26+ years',
    'Q6 - Into which of the following categories does your total annual household income fall before taxes? Again, we promise to keep this, and all of your answers, completely confidential.': 'Household_Income',
    'Q8 - Which of the following media or home entertainment equipment does your household own?-Flat panel television': 'Own Flat Panel TV',
    'Q8 - Which of the following media or home entertainment equipment does your household own?-Digital video recorder (DVR)': 'Own DVR',
    'Q8 - Which of the following media or home entertainment equipment does your household own?-Streaming media box or over-the-top box': 'Own streaming media box or over-the-top box',
    'Q8 - Which of the following media or home entertainment equipment does your household own?-Portable streaming thumb drive/fob': 'Own Portable streaming thumb drive/fob',
    'Q8 - Which of the following media or home entertainment equipment does your household own?-Over-the-air digital TV antenna (for free access to network broadcast without pay TV subscription)': 'Own Over-the-air digital TV antenna (for free access to network broadcast without pay TV subscription)',
    'Q8 - Which of the following media or home entertainment equipment does your household own?-Blu-ray disc player/DVD player': 'Own Blu-ray disc player/DVD player',
    'Q8 - Which of the following media or home entertainment equipment does your household own?-Gaming console': 'Own Gaming console',
    'Q8 - Which of the following media or home entertainment equipment does your household own?-Portable video game player': 'Own Portable video game player',
    'Q8 - Which of the following media or home entertainment equipment does your household own?-Computer network/router in your home for wireless computer/laptop usage': 'Own Computer network/router',
    'Q8 - Which of the following media or home entertainment equipment does your household own?-Desktop computer': 'Own Desktop computer',
    'Q8 - Which of the following media or home entertainment equipment does your household own?-Laptop computer': 'Own Laptop computer',
    'Q8 - Which of the following media or home entertainment equipment does your household own?-Tablet': 'Own Tablet',
    'Q8 - Which of the following media or home entertainment equipment does your household own?-Dedicated e-book reader': 'Own Dedicated e-book reader',
    'Q8 - Which of the following media or home entertainment equipment does your household own?-Smartphone': 'Own Smartphone',
    'Q8 - Which of the following media or home entertainment equipment does your household own?-Basic mobile phone (not a smartphone)': 'Own Basic mobile phone',
    'Q8 - Which of the following media or home entertainment equipment does your household own?-Smart watch': 'Own Smart watch',
    'Q8 - Which of the following media or home entertainment equipment does your household own?-Fitness band': 'Own Fitness band',
    'Q8 - Which of the following media or home entertainment equipment does your household own?-Virtual reality headset': 'Own Virtual reality headset',
    'Q8 - Which of the following media or home entertainment equipment does your household own?-Drone': 'Own Drone',
    'Q8 - Which of the following media or home entertainment equipment does your household own?-None of the above': 'Own None of the above',
}, inplace=True)
gaming_df.rename(columns={'Q8 - Which of the following media or home entertainment equipment does your household own?-Dont Know': 'Dont Know'}, inplace=True)
gaming_df.rename(columns={
    'Q10 - Of those products you indicated you do not currently own, which of the following do you plan to purchase in the next 12 months?-Flat panel television': 'Purchase Flat panel in next 12 months',
    'Q10 - Of those products you indicated you do not currently own, which of the following do you plan to purchase in the next 12 months?-Digital video recorder (DVR)': 'Purchase digital video recorder (DVR) in next 12 months',
    'Q10 - Of those products you indicated you do not currently own, which of the following do you plan to purchase in the next 12 months?-Streaming media box or over-the-top box': 'Purchase Streaming media box or over-the-top box in next 12 months',
    'Q10 - Of those products you indicated you do not currently own, which of the following do you plan to purchase in the next 12 months?-Portable streaming thumb drive/fob': 'Purchase Portable streaming thumb drive/fob in next 12 months',
    'Q10 - Of those products you indicated you do not currently own, which of the following do you plan to purchase in the next 12 months?-Over-the-air digital TV antenna (for free access to network broadcast without pay TV subscription)': 'Purchase Over-the-air digital TV antenna in next 12 months',
    'Q10 - Of those products you indicated you do not currently own, which of the following do you plan to purchase in the next 12 months?-Blu-ray disc player/DVD player': 'Purchase Blu-ray disc player/DVD player in next 12 months',
    'Q10 - Of those products you indicated you do not currently own, which of the following do you plan to purchase in the next 12 months?-Gaming console': 'Purchase gaming console in next 12 months',
    'Q10 - Of those products you indicated you do not currently own, which of the following do you plan to purchase in the next 12 months?-Portable video game player': 'Purchase Portable video game player in next 12 months',
    'Q10 - Of those products you indicated you do not currently own, which of the following do you plan to purchase in the next 12 months?-Computer network/router in your home for wireless computer/laptop usage': 'Purchase computer network/router in next 12 months',
    'Q10 - Of those products you indicated you do not currently own, which of the following do you plan to purchase in the next 12 months?-Desktop computer': 'Purchase Desktop computer in next 12 months',
    'Q10 - Of those products you indicated you do not currently own, which of the following do you plan to purchase in the next 12 months?-Laptop computer': 'Purchase Laptop computer in next 12 months',
    'Q10 - Of those products you indicated you do not currently own, which of the following do you plan to purchase in the next 12 months?-Tablet': 'Purchase Tablet in next 12 months',
    'Q10 - Of those products you indicated you do not currently own, which of the following do you plan to purchase in the next 12 months?-Dedicated e-book reader': 'Purchase Dedicated e-book reader in next 12 months',
    'Q10 - Of those products you indicated you do not currently own, which of the following do you plan to purchase in the next 12 months?-Smartphone': 'Purchase Smartphone in next 12 months',
    'Q10 - Of those products you indicated you do not currently own, which of the following do you plan to purchase in the next 12 months?-Basic mobile phone (not a smartphone)': 'Purchase Basic mobile phone in next 12 months',
    'Q10 - Of those products you indicated you do not currently own, which of the following do you plan to purchase in the next 12 months?-Smart watch': 'Purchase Smart watch in next 12 months',
    'Q10 - Of those products you indicated you do not currently own, which of the following do you plan to purchase in the next 12 months?-Fitness band': 'Purchase Fitness band in next 12 months',
    'Q10 - Of those products you indicated you do not currently own, which of the following do you plan to purchase in the next 12 months?-Virtual reality headset': 'Purchase Virtual reality headset in next 12 months',
    'Q10 - Of those products you indicated you do not currently own, which of the following do you plan to purchase in the next 12 months?-Drone': 'Purchase Drone in next 12 months',
    'Q10 - Of those products you indicated you do not currently own, which of the following do you plan to purchase in the next 12 months?-None of the above': 'Purchase None of the above in next 12 months',
    "Q10 - Of those products you indicated you do not currently own, which of the following do you plan to purchase in the next 12 months?-Don't Know": "Purchase Don't Know in next 12 months",
}, inplace=True)
gaming_df.rename(columns={
    'Q11r1 - Flat panel television - Of the products you indicated you own, which [totalcount] do you value the most? Please rank the top [totalcount], with "1" being the most valued. Make your selections by clicking each item in the order you wish to rank.': 'Value flat panel TV',
    'Q11r2 - Digital video recorder (DVR) - Of the products you indicated you own, which [totalcount] do you value the most? Please rank the top [totalcount], with "1" being the most valued. ': 'Value Digital video recorder (DVR)',
    'Q11r3 - Streaming media box or over-the-top box - Of the products you indicated you own, which [totalcount] do you value the most? Please rank the top [totalcount], with "1" being the most valued. ': 'Value Streaming media box or over-the-top',
    'Q11r4 - Portable streaming thumb drive/fob - Of the products you indicated you own, which [totalcount] do you value the most? Please rank the top [totalcount], with "1" being the most valued.': 'Value Portable streaming thumb drive/fob',
    'Q11rNew1 - Over-the-air digital TV antenna (for free access to network broadcast without pay TV subscription) - Of the products you indicated you own, which [totalcount] do you value the most? Please rank the top [totalcount], with "1" being the most value': 'Value Over-the-air digital TV antenna',
    'Q11r5 - Blu-ray disc player/DVD player - Of the products you indicated you own, which [totalcount] do you value the most? Please rank the top [totalcount], with "1" being the most valued. Make your selections by clicking each item in the order you wish to': 'Value Blu-ray disc player/DVD player',
    'Q11r6 - Gaming console - Of the products you indicated you own, which [totalcount] do you value the most? Please rank the top [totalcount], with "1" being the most valued. Make your selections by clicking each item in the order you wish': 'Value Gaming console',
    'Q11r7 - Portable video game player - Of the products you indicated you own, which [totalcount] do you value the most? Please rank the top [totalcount], with "1" being the most valued. Make your selections by clicking each': 'Value Portable video game player',
    'Q11r8 - Computer network/router in your home for wireless computer/laptop usage - Of the products you indicated you own, which [totalcount] do you value the most? Please rank the top [totalcount], with "1" being the most va': 'Value Computer network/router',
    'Q11r9 - Desktop computer - Of the products you indicated you own, which [totalcount] do you value the most? Please rank the top [totalcount], with "1" being the most valued. Make your selections by clicking each item in the order you wish to rank. ': 'Value Desktop computer',
    'Q11r10 - Laptop computer - Of the products you indicated you own, which [totalcount] do you value the most? Please rank the top [totalcount], with "1" being the most valued. Make your selections by clicking each item in the order you wish to rank.': 'Value Laptop computer',
    'Q11r12 - Tablet - Of the products you indicated you own, which [totalcount] do you value the most? Please rank the top [totalcount], with "1" being the most valued. Make your selections by clicking each item in the orde': 'Value Tablet',
    'Q11r14 - Dedicated e-book reader - Of the products you indicated you own, which [totalcount] do you value the most? Please rank the top [totalcount], with "1" being the most valued. Make your selections by clicking each item in': 'Value Dedicated e-book reader',
    'Q11r15 - Smartphone - Of the products you indicated you own, which [totalcount] do you value the most? Please rank the top [totalcount], with "1" being the most valued. Make your selections by clicking each item in th': 'Value Smartphone',
    'Q11r17 - Basic mobile phone (not a smartphone) - Of the products you indicated you own, which [totalcount] do you value the most? Please rank the top [totalcount], with "1" being the most valued. Make your selections by clicking each item in the order you': 'Value Basic mobile phone',
    'Q11r18 - Smart watch - Of the products you indicated you own, which [totalcount] do you value the most? Please rank the top [totalcount], with "1" being the most valued. Make your selections by clicking each item i': 'Value Smart watch',
    'Q11r19 - Fitness band - Of the products you indicated you own, which [totalcount] do you value the most? Please rank the top [totalcount], with "1" being the most valued. Make your selections by clicking each item in the order': 'Value Fitness band',
    'Q11rNew2 - Virtual reality headset - Of the products you indicated you own, which [totalcount] do you value the most? Please rank the top [totalcount], with "1" being the most valued. Make your sele': 'Value Virtual reality headset',
    'Q11rNew3 - Drone - Of the products you indicated you own, which [totalcount] do you value the most? Please rank the top [totalcount], with "1" being the most valued. Make your selections by clicking each item in the order you wish to rank. ': 'Value Drone',
    'Q11r22 - Placeholder - Of the products you indicated you own, which [totalcount] do you value the most? Please rank the top [totalcount], with "1" being the most valued. Make your selections by clicking each item in the order you wish to rank. ': 'Value Placeholder',
}, inplace=True)
gaming_df.rename(columns={
    'Q15r1 - Smartphone - Of the time you spend watching movies, what percentage of time do you watch on the following devices?': '% View movies on Smartphone',
    'Q15r2 - Tablet - Of the time you spend watching movies, what percentage of time do you watch on the following devices?': '% View movies on Tablet',
    'Q15r3 - Laptop/Desktop - Of the time you spend watching movies, what percentage of time do you watch on the following devices?': '% View movies on Laptop/Desktop',
    'Q15r4 - Television - Of the time you spend watching movies, what percentage of time do you watch on the following devices?': '% View movies on Television',
    'Q16r1 - Smartphone - Of the time you spend watching sports, what percentage of time do you watch on the following devices?': '% View sports on Smartphone',
    'Q16r2 - Tablet - Of the time you spend watching sports, what percentage of time do you watch on the following devices?': '% View sports on Tablet',
    'Q16r3 - Laptop/Desktop - Of the time you spend watching sports, what percentage of time do you watch on the following devices?': '% View sports on Laptop/Desktop',
    'Q16r4 - Television - Of the time you spend watching sports, what percentage of time do you watch on the following devices?': '% View sports on Television',
    'Q17r1 - Smartphone - Of the time you spend watching TV shows , what percentage of time do you watch on the following devices?': '% View TV shows on Smartphone',
    'Q17r2 - Tablet - Of the time you spend watching TV shows , what percentage of time do you watch on the following devices?': '% View TV shows on Tablet',
    'Q17r3 - Laptop/Desktop - Of the time you spend watching TV shows , what percentage of time do you watch on the following devices?': '% View TV shows on Laptop/Desktop',
    'Q17r4 - Television - Of the time you spend watching TV shows , what percentage of time do you watch on the following devices?': '% View TV shows on Television',
}, inplace=True)
gaming_df.rename(columns={
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Photo/video': 'Types of apps used - Photo/video',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Banking': 'Types of apps used - Banking',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Fitness/health': 'Types of apps used - Fitness/health',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Food/drink': 'Types of apps used - Food/drink',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Retail/shopping': 'Types of apps used - Retail/shopping',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Navigation': 'Types of apps used - Navigation',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Social networks': 'Types of apps used - Social networks',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Reading books': 'Types of apps used - Reading books',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Streaming music': 'Types of apps used - Streaming music',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Streaming video': 'Types of apps used - Streaming video',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Reviews/guides': 'Types of apps used - Reviews/guides',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-News consolidator': 'Types of apps used - News consolidator',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Newspaper/news broadcaster': 'Types of apps used - Newspaper/news broadcaster',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Magazines': 'Types of apps used - Magazines',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Diagnostic/utilities': 'Types of apps used - Diagnostic/utilities',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-QR Reader': 'Types of apps used - QR Reader',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Travel': 'Types of apps used - Travel',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Transportation': 'Types of apps used - Transportation',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Location': 'Types of apps used - Location',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Language': 'Types of apps used - Language',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Movie': 'Types of apps used - Movie',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Business': 'Types of apps used - Business',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Games': 'Types of apps used - Games',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Productivity': 'Types of apps used - Productivity',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Sports': 'Types of apps used - Sports',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Weather': 'Types of apps used - Weather',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Browser': 'Types of apps used - Browser',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-VOIP': 'Types of apps used - VOIP',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Dating': 'Types of apps used - Dating',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Messaging': 'Types of apps used - Messaging',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Mobile payment': 'Types of apps used - Mobile payment',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Education': 'Types of apps used - Education',
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Tickets': 'Types of apps used - Tickets',
}, inplace=True)
gaming_df.rename(columns={'Q22 - What sorts of apps do you utilize steadily (on a regular basis/weekly) in your smartphone?-Reservations' : 'Varieties of apps used - Reservations'}, inplace=True)
gaming_df.rename(columns={'Q22 - What sorts of apps do you utilize steadily (on a regular basis/weekly) in your smartphone?-Particular curiosity/Pastime apps' : 'Varieties of apps used - Particular curiosity/Pastime apps'}, inplace=True)
gaming_df.rename(columns={'Q22 - What sorts of apps do you utilize steadily (on a regular basis/weekly) in your smartphone?-I don't use any of the above sorts of apps on a frequent (on a regular basis/weekly) foundation.' : 'Varieties of apps used - None'}, inplace=True)
gaming_df.rename(columns={'Q22 - What sorts of apps do you utilize steadily (on a regular basis/weekly) in your smartphone?-Dont Know' : 'Varieties of apps used - Dont Know'}, inplace=True)
gaming_df.rename(columns={'Q26 - Which of the next subscriptions does your family buy?-Pay TV (conventional cable and/or satellite tv for pc bundle)' : 'Subscriptions - Pay TV'}, inplace=True)
gaming_df.rename(columns={'Q26 - Which of the next subscriptions does your family buy?-Dwelling web' : 'Subscriptions - Dwelling Web'}, inplace=True)
gaming_df.rename(columns={'Q26 - Which of the next subscriptions does your family buy?-Landline phone' : 'Subscriptions - Landline phone'}, inplace=True)
gaming_df.rename(columns={'Q26 - Which of the next subscriptions does your family buy?-Cellular voice (smartphone or primary cell phone calling plan)' : 'Subscriptions - Cellular voice'}, inplace=True)
gaming_df.rename(columns={'Q26 - Which of the next subscriptions does your family buy?-Cellular knowledge plan' : 'Subscriptions - Cellular knowledge plan'}, inplace=True)
gaming_df.rename(columns={'Q26 - Which of the next subscriptions does your family buy?-Streaming video service' : 'Subscriptions - Streaming video service'}, inplace=True)
gaming_df.rename(columns={'Q26 - Which of the next subscriptions does your family buy?-Streaming music service' : 'Subscriptions - Streaming music service'}, inplace=True)
gaming_df.rename(columns={'Q26 - Which of the next subscriptions does your family buy?-Gaming' : 'Subscriptions - Gaming'}, inplace=True)
gaming_df.rename(columns={'Q26 - Which of the next subscriptions does your family buy?-Information/Newspaper (print or digital)' : 'Subscriptions - Information/Newspaper(print or digital)'}, inplace=True)
gaming_df.rename(columns={'Q26 - Which of the next subscriptions does your family buy?-Journal (print or digital)' : 'Subscriptions - Journal (print or digital)'}, inplace=True)
gaming_df.rename(columns={'Q26 - Which of the next subscriptions does your family buy?-Not one of the above' : 'Subscriptions - Not one of the above'}, inplace=True)
gaming_df.rename(columns={'Q26 - Which of the next subscriptions does your family buy?-Dont Know' : 'Subscriptions - Dont Know'}, inplace=True)
gaming_df.rename(columns={'Q36r1 - Pay TV (conventional cable and/or satellite tv for pc bundle) - Of the providers you indicated your family purchases, which [totalcount] do you worth essentially the most?Please rank the highest [totalcount], with "1" being essentially the most valued. ' : 'Worth essentially the most - Pay TV'}, inplace=True)
gaming_df.rename(columns={'Q36r2 - Dwelling web - Of the providers you indicated your family purchases, which [totalcount] do you worth essentially the most?Please rank the highest [totalcount], with "1" being essentially the most valued. ' : 'Worth essentially the most - Dwelling Web'}, inplace=True)
gaming_df.rename(columns={'Q36r3 - Landline phone - Of the providers you indicated your family purchases, which [totalcount] do you worth essentially the most?Please rank the highest [totalcount], with "1" being essentially the most valued. ' : 'Worth essentially the most - Landline phone'}, inplace=True)
gaming_df.rename(columns={'Q36r4 - Cellular voice - Of the providers you indicated your family purchases, which [totalcount] do you worth essentially the most?Please rank the highest [totalcount], with "1" being essentially the most valued. ' : 'Worth essentially the most - Cellular voice'}, inplace=True)
gaming_df.rename(columns={'Q36r5 - Cellular knowledge plan - Of the providers you indicated your family purchases, which [totalcount] do you worth essentially the most?Please rank the highest [totalcount], with "1" being essentially the most valued. ' : 'Worth essentially the most - Cellular knowledge plan'}, inplace=True)
gaming_df.rename(columns={'Q36r6 - Streaming video service - Of the providers you indicated your family purchases, which [totalcount] do you worth essentially the most?Please rank the highest [totalcount], with "1" being essentially the most valued. ' : 'Worth essentially the most - Streaming video service'}, inplace=True)
gaming_df.rename(columns={'Q36r7 - Streaming music service - Of the providers you indicated your family purchases, which [totalcount] do you worth essentially the most?Please rank the highest [totalcount], with "1" being essentially the most valued. ' : 'Worth essentially the most - Streaming music service'}, inplace=True)
gaming_df.rename(columns={'Q36r8 - Gaming - Of the providers you indicated your family purchases, which [totalcount] do you worth essentially the most?Please rank the highest [totalcount], with "1" being essentially the most valued. ' : 'Worth essentially the most - Gaming'}, inplace=True)
gaming_df.rename(columns={'Q36r9 - Information/Newspaper (print or digital) - Of the providers you indicated your family purchases, which [totalcount] do you worth essentially the most?Please rank the highest [totalcount], with "1" being essentially the most valued. ' : 'Worth essentially the most - Information/Newspaper(print or digital)'}, inplace=True)
gaming_df.rename(columns={'Q36r10 - Journal (print or digital) - Of the providers you indicated your family purchases, which [totalcount] do you worth essentially the most?Please rank the highest [totalcount], with "1" being essentially the most valued. ': 'Worth essentially the most - Journal'}, inplace=True)
gaming_df.rename(columns={'Q29 - You stated that you simply subscribe to dwelling Web entry, how way more would you be keen to pay to obtain double your obtain velocity?': 'Prepared to pay to double obtain velocity'}, inplace=True)
gaming_df.rename(columns={'Q37r1 - Attending reside performances (sporting occasions, concert events, or stage (musical, dramatic, or different)) - For the next sorts of leisure actions, please rank your prime three, with "1" being essentially the most most well-liked.': 'Rank - Attending reside performances'}, inplace=True)
gaming_df.rename(columns={'Q37r2 - Going to the flicks - For the next sorts of leisure actions, please rank your prime three, with "1" being essentially the most most well-liked.': 'Rank - Going to the flicks'}, inplace=True)
gaming_df.rename(columns={'Q37r3 - Watching tv (video content material on any gadget) - For the next sorts of leisure actions, please rank your prime three, with "1" being essentially the most most well-liked.Please rank the highest three.': 'Rank - Watching tv'}, inplace=True)
gaming_df.rename(columns={'Q37r4 - Listening to music (utilizing any gadget) - For the next sorts of leisure actions, please rank your prime three, with "1" being essentially the most most well-liked.': 'Rank - Listening to music'}, inplace=True)
gaming_df.rename(columns={'Q37r5 - Studying books (both bodily books or through an e-book reader and/or on-line) - For the next sorts of leisure actions, please rank your prime three, with "1" being essentially the most most well-liked.': 'Rank - Studying books'}, inplace=True)
gaming_df.rename(columns={'Q37r6 - Studying magazines (both printed or on-line) - For the next sorts of leisure actions, please rank your prime three, with "1" being essentially the most most well-liked.': 'Rank - Studying magazines'}, inplace=True)
gaming_df.rename(columns={'Q37r7 - Studying newspapers (both printed or on-line) - For the next sorts of leisure actions, please rank your prime three, with "1" being essentially the most most well-liked.': 'Rank - Studying newspapers'}, inplace=True)
gaming_df.rename(columns={'Q37r8 - Listening to the radio (any format and/or gadget) - For the next sorts of leisure actions, please rank your prime three, with "1" being essentially the most most well-liked.': 'Rank - Listening to the radio'}, inplace=True)
gaming_df.rename(columns={'Q37r9 - Enjoying video video games (handhelds, PC, console, cell/mobile/smartphone, on-line) - For the next sorts of leisure actions, please rank your prime three, with "1" being essentially the most most well-liked.': 'Rank - Enjoying video video games'}, inplace=True)
gaming_df.rename(columns={'Q37r10 - Utilizing the Web for social or private pursuits - For the next sorts of leisure actions, please rank your prime three, with "1" being essentially the most most well-liked.': 'Rank - Utilizing the Web for social or private pursuits'}, inplace=True)
gaming_df.rename(columns={'QNEW19r1 - Hire a bodily DVD/Blu-ray - Fascinated about the way you watch films, how steadily do you do every of the next?': 'Film Frequency - Hire a bodily DVD/Blu-ray'}, inplace=True)
gaming_df.rename(columns={'QNEW19r2 - Buy a bodily DVD/Blu-ray - Fascinated about the way you watch films, how steadily do you do every of the next?': 'Film Frequency - Buy a bodily DVD/Blu-ray' }, inplace=True)
gaming_df.rename(columns={'QNEW19r3 - Buy digital video leisure to obtain onto your gadget through on-line service - Fascinated about the way you watch films, how steadily do you do every of the next?': 'Film Frequency - Buy digital video leisure to obtain' }, inplace=True)
gaming_df.rename(columns={'QNEW19r4 - Hire digital video leisure the place a digital file is downloaded to your gadget - Fascinated about the way you watch films, how steadily do you do every of the next?': 'Film Frequency - Hire digital video leisure to obtain' }, inplace=True)
gaming_df.rename(columns={'QNEW19r5 - Watch digital video leisure through a web based streaming service - Fascinated about the way you watch films, how steadily do you do every of the next?': 'Film Frequency - Watch digital video leisure to obtain' }, inplace=True)
gaming_df.rename(columns={"QNEW19r6 - Buy/lease a video through your tv service supplier's On-Demand or Pay-Per-View service (i.e., through a set-top-box) - Fascinated about the way you watch films, how steadily do you do every of the next?": "Film Frequency - Buy/lease a video through TV service On-Demand PPV"}, inplace=True)
gaming_df.rename(columns={'QNEW20r1 - Hire a bodily DVD/Blu-ray - Fascinated about the way you watch tv programming, how steadily do you do every of the next?': 'TV Frequency - Hire a bodily DVD/Blu-ray' }, inplace=True)
gaming_df.rename(columns={'QNEW20r2 - Buy a bodily DVD/Blu-ray - Fascinated about the way you watch tv programming, how steadily do you do every of the next?': 'TV Frequency - Buy a bodily DVD/Blu-ray' }, inplace=True)
gaming_df.rename(columns={'QNEW20r3 - Buy digital video leisure to obtain onto your gadget through on-line service - Fascinated about the way you watch tv programming, how steadily do you do every of the next?': 'TV Frequency - Buy digital video leisure' }, inplace=True)
gaming_df.rename(columns={'QNEW20r4 - Watch digital video leisure through a web based streaming service - Fascinated about the way you watch tv programming, how steadily do you do every of the next?': 'TV Frequency - Watch digital video leisure through a web based streaming service' }, inplace=True)
gaming_df.rename(columns={"QNEW20r5 - Buy/lease a video through your tv service supplier's On-Demand or Pay-Per-View service (i.e., through a set-top-box) - Fascinated about the way you watch tv programming, how steadily do you do every of the next?": "TV Frequency - Buy/lease a video through your tv on-demand or PPV" }, inplace=True)
gaming_df.rename(columns={'QNEW24 - Do you ever "binge-watch" tv exhibits, which means watching three or extra episodes of a TV collection in a single sitting?': 'Do you binge-watch tv exhibits - min of three episodes in a sitting' }, inplace=True)
gaming_df.rename(columns={"QNEW28 - How steadily do you utilize a good friend or member of the family's (somebody not residing in your family) subscription login data to observe digital content material?": "Frequency of utilizing household or good friend subscription log in"}, inplace=True)
gaming_df.rename(columns={"QNEW29 - While you use a good friend or member of the family's subscription login data to observe digital content material, what sort of content material do you most frequently watch?": "While you use household or good friend sub, what do you watch most frequently"}, inplace=True)
gaming_df.rename(columns={'Q73r2 - Learn for work and/or college - That are belongings you usually do whereas watching your property tv system?': 'Do throughout dwelling TV viewing - Learn for work/college' }, inplace=True)
gaming_df.rename(columns={'Q73r3 - Learn for pleasure - That are belongings you usually do whereas watching your property tv system?': 'Do throughout dwelling TV viewing - Learn for pleasure' }, inplace=True)
gaming_df.rename(columns={'Q73r4 - Browse and surf the Net - That are belongings you usually do whereas watching your property tv system?': 'Do throughout dwelling TV viewing - Browse and surf the Net' }, inplace=True)
gaming_df.rename(columns={'Q73r5 - Microblogging - That are belongings you usually do whereas watching your property tv system?': 'Do throughout dwelling TV viewing - Microblogging' }, inplace=True)
gaming_df.rename(columns={'Q73r6 - Learn e mail - That are belongings you usually do whereas watching your property tv system?': 'Do throughout dwelling TV viewing - Learn e mail' }, inplace=True)
gaming_df.rename(columns={'Q73r7 - Write e mail - That are belongings you usually do whereas watching your property tv system?': 'Do throughout dwelling TV viewing - Write e mail' }, inplace=True)
gaming_df.rename(columns={'Q73r8 - Textual content message - That are belongings you usually do whereas watching your property tv system?': 'Do throughout dwelling TV viewing - Textual content message' }, inplace=True)
gaming_df.rename(columns={'Q73r9 - Use a social community - That are belongings you usually do whereas watching your property tv system?': 'Do throughout dwelling TV viewing - Use a social community' }, inplace=True)
gaming_df.rename(columns={'Q73r10 - Discuss on the telephone - That are belongings you usually do whereas watching your property tv system?': 'Do throughout dwelling TV viewing - Discuss on the telephone' }, inplace=True)
gaming_df.rename(columns={'Q73r11 - Browse for services on-line - That are belongings you usually do whereas watching your property tv system?': 'Do throughout dwelling TV viewing - Browse for services on-line' }, inplace=True)
gaming_df.rename(columns={'Q73r12 - Buy services on-line - That are belongings you usually do whereas watching your property tv system?': 'Do throughout dwelling TV viewing - Buy services on-line' }, inplace=True)
gaming_df.rename(columns={'Q73r13 - Play video video games - That are belongings you usually do whereas watching your property tv system?': 'Do throughout dwelling TV viewing - Play video video games' }, inplace=True)
gaming_df.rename(columns={'Q39r1 - I'd quite pay for information on-line in trade for not being uncovered to commercials. - Utilizing the dimensions under, please point out how a lot you agree or disagree with the next statements. If the query doesn't apply to you, select "N/A."': 'I'd quite pay for information on-line to keep away from advertisements' }, inplace=True)
gaming_df.rename(columns={'Q39rNEW1 - I'd quite pay for sports activities data on-line in trade for not being uncovered to commercials. - Utilizing the dimensions under, please point out how a lot you agree or disagree with the next statements. ': 'I'd quite pay for sports activities data on-line to keep away from advertisements' }, inplace=True)
gaming_df.rename(columns={'Q39rNEW2 - I'd quite pay for video games on-line in trade for not being uncovered to commercials. - Utilizing the dimensions under, please point out how a lot you agree or disagree with the next statements. If the query doesn't apply to you, select "N/A."': 'I'd quite pay for video games on-line to keep away from advertisements' }, inplace=True)
gaming_df.rename(columns={'Q39rNEW3 - I'd quite pay for music on-line in trade for not being uncovered to commercials. - Utilizing the dimensions under, please point out how a lot you agree or disagree with the next statements. If the query doesn't apply to you, select "N/A."': 'I'd quite pay for music on-line to keep away from advertisements' }, inplace=True)
gaming_df.rename(columns={'Q39rNEW4 - I'd quite pay for TV exhibits on-line in trade for not being uncovered to commercials. - Utilizing the dimensions under, please point out how a lot you agree or disagree with the next statements.': 'I'd quite pay for TV exhibits on-line to keep away from advertisements' }, inplace=True)
gaming_df.rename(columns={'Q39rNEW5 - I'd quite pay for films on-line in trade for not being uncovered to commercials. - Utilizing the dimensions under, please point out how a lot you agree or disagree with the next statements. If the query doesn't apply to you, select "N/A."': 'I'd quite pay for films on-line to keep away from advertisements' }, inplace=True)
gaming_df.rename(columns={'Q39r2 - I'd be keen to offer extra private data on-line if that meant I might obtain promoting extra focused to my wants and pursuits. - Utilizing the dimensions under, please point out how a lot you agree or disagree with the next statements.': 'I'd quite present extra private data on-line to obtain focused advertisements' }, inplace=True)
gaming_df.rename(columns={'Q39r3 - By offering extra private data on-line, I'm frightened about changing into a sufferer of identification theft. - Utilizing the dimensions under, please point out how a lot you agree or disagree with the next statements. ': 'By offering private information on-line I could also be a sufferer of identification theft' }, inplace=True)
gaming_df.rename(columns={'Q39r4 - I'd be keen to view promoting with my streaming video programming if it considerably decreased the price of the subscription.(e.g., decreased subscription price by 25%) - Utilizing the dimensions under, please point out how a lot you agree or disagree with.': 'I'd be keen to view advertisements if it lower sub price' }, inplace=True)
gaming_df.rename(columns={'Q89 - Which of the next is your most steadily used mechanism to get information?': 'How do you get information' }, inplace=True)
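When the columns of a multi-select question follow a consistent "question text?-option" pattern, the mapping above could also be generated programmatically instead of listed by hand. The sketch below assumes that pattern; the two-column frame and the `prefixes` dict are illustrative, not the full survey.

```python
import pandas as pd

# Tiny stand-in frame mimicking the survey's "question?-option" column pattern
df = pd.DataFrame(columns=[
    'Q22 - What types of apps do you use frequently (everyday/weekly) on your smartphone?-Games',
    'Q26 - Which of the following subscriptions does your household purchase?-Gaming',
])

# Map each question prefix to a short, readable label
prefixes = {
    'Q22 - ': 'Types of apps used',
    'Q26 - ': 'Subscriptions',
}

# Build the rename mapping: keep the text after '?-' as the option suffix
rename_map = {}
for col in df.columns:
    for prefix, short in prefixes.items():
        if col.startswith(prefix) and '?-' in col:
            option = col.split('?-', 1)[1]
            rename_map[col] = f'{short} - {option}'

df = df.rename(columns=rename_map)
print(list(df.columns))
# → ['Types of apps used - Games', 'Subscriptions - Gaming']
```

The trade-off is that a generated mapping only works while every column really follows the pattern; irregular columns (like the Q36/Q37 grids, where the question text is a suffix) still need explicit entries.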
Moreover, it quickly became clear that I would need to replace some of the string responses with numerical codes. This would give me interval-level variables for further analyses. Fortunately, I scanned through the variables to note which scales were used throughout the survey. There was a lot of repetition, which meant that converting responses was relatively straightforward.
# Recode the recurring response scales as numbers with replace()
gaming_df.replace({'Yes': 1, 'No': 0}, inplace=True)
gaming_df.replace({'Never': 1, 'Rarely (one to three times a year)': 2, 'Occasionally (monthly)': 3, 'Frequently (everyday/weekly)': 4}, inplace=True)
gaming_df.replace({'Almost never': 1, 'Rarely (10%-50% of the time)': 2, 'Frequently (between 50% and 75% of the time)': 3, 'Almost always (greater than 75% of the time)': 4, 'Always (close to 100% of the time)': 5}, inplace=True)
gaming_df.replace({'N/A; I do not have a basis to respond': -999, 'Disagree strongly': 1, 'Disagree somewhat': 2, 'Agree somewhat': 3, 'Agree strongly': 4}, inplace=True)
Following this, it made sense to plot some basic demographic data before I performed further analyses. It's important to check for outliers or strange occurrences in your data before moving on in preprocessing. It's also worth remembering that some properties of the sample, such as who chose to respond, are things you'll be unable to change.
# Plot histogram of age distribution
plt.figure(figsize=(10, 6))
# Color map
cm = plt.cm.get_cmap('BrBG_r')
plt.xlabel('Age', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.title('Age Distribution', fontsize=16)
n, bins, patches = plt.hist(gaming_df['Q1r1 - To begin, what is your age?'].astype(float), 37, color='blue')
bin_centers = 0.5 * (bins[:-1] + bins[1:])
# Scale values to the interval [0, 1] and color each bar by its bin center
col = bin_centers - min(bin_centers)
col /= max(col)
for c, p in zip(col, patches):
    plt.setp(p, 'facecolor', cm(c))
plt.show()
It appears that many older and younger people took this survey. It may be that time and participant payment were contributing factors in how participant engagement shook out during data collection. Younger people have less income and (usually) more time, and the same could be argued for older people, who are often retired by their mid-60s to early 70s.
# Get gender counts
counts = gaming_df['Gender'].value_counts()
gender_df = pd.DataFrame(counts)
gender_df['percent'] = (gender_df['Gender'] /gender_df['Gender'].sum()) * 100
# To help me further understand the gender breakdown, I created the following waffle chart
fig = plt.figure(
    FigureClass=pyw.Waffle,
    rows=3,
    columns=9,
    values=gender_df['percent'],
    colors=("#FA8072", "#0000FF"),
    title={'label': 'Gender Distribution', 'loc': 'center', 'size': 20},
    labels=[f"{k} ({v:.2f}%)" for k, v in zip(gender_df.index, gender_df.percent)],
    legend={'loc': 'upper left', 'bbox_to_anchor': (1, 1)},
    icons=['female', 'male'],
    icon_size=45,
    icon_legend=True,
    figsize=(10, 8)
)
plt.show()
The gender count was about even. Next, I needed to look at the regional and employment counts to see how each broke down.
# Get region counts
# (region_df was not computed earlier in the post; built here the same way as gender_df)
counts = gaming_df['Region'].value_counts()
region_df = pd.DataFrame(counts)
region_df['percent'] = (region_df['Region'] / region_df['Region'].sum()) * 100
# Plot distribution
fig, ax = plt.subplots(figsize=(16, 8))
fig.suptitle('Region Distribution', size=20, color='black')
labels = ['South', 'West', 'Northeast', 'Midwest']
# Only explode the South, since it is the largest slice
explode = (0.1, 0, 0, 0)
sizes = region_df['percent']
colors = ['#607B8B', '#8DB6CD', '#A4D3EE', '#B0E2FF']
ax.pie(sizes, explode=explode, startangle=60, colors=colors, labels=labels, autopct='%1.0f%%', pctdistance=0.7)
# Cut a hole in the middle to make a donut chart
ax.add_artist(plt.Circle((0, 0), 0.4, fc='white'))
# Adjust the data label size
plt.rcParams['font.size'] = 15
plt.show()
# Get employment status counts
# (status_df was not computed earlier in the post; built here the same way as gender_df)
counts = gaming_df['QNEW3 - What is your employment status?'].value_counts()
status_df = pd.DataFrame(counts)
status_df['percent'] = (status_df.iloc[:, 0] / status_df.iloc[:, 0].sum()) * 100
# Plot distribution
fig, ax = plt.subplots(figsize=(16, 8))
fig.suptitle('Employment Status Distribution', size=20, color='black')
labels = ['Employed full-time or part-time', 'Retired', 'Student', 'Unemployed', 'Self-employed']
sizes = status_df['percent']
# Only explode the employed full-time, since it is the largest slice
explode = (0.1, 0, 0, 0, 0)
colors = ['#00FFFF', '#7FFFD4', '#76EEC6', '#66CDAA', '#87CEFA']
ax.pie(sizes, explode=explode, startangle=60, colors=colors, labels=labels, autopct='%1.0f%%', pctdistance=0.7)
# Cut a hole in the middle to make a donut chart
ax.add_artist(plt.Circle((0, 0), 0.4, fc='white'))
# Adjust the data label size
plt.rcParams['font.size'] = 15
plt.show()
# Shorten the longest label
gaming_df['Ethnicity'].replace('Pacific Islander (including Native Hawaiian, Native American, or Native Alaskan)', 'Pacific Islander & Related', inplace=True)
# Get ethnicity counts
counts = gaming_df['Ethnicity'].value_counts()
eth_df = pd.DataFrame(counts)
eth_df['percent'] = (eth_df['Ethnicity'] / eth_df['Ethnicity'].sum())
eth_df['percent']
# Sort the dataframe
eth_df.sort_values(by='percent', ascending=True, inplace=True)
# Create a color palette
palette = sns.color_palette("tab20c", len(eth_df['percent']))
# Create a horizontal bar plot
plt.barh(eth_df.index, eth_df['percent'], color=palette)
# Add labels and title
plt.title('Ethnicity Distribution')
plt.xlabel('Percentage')
plt.ylabel('Ethnicity')
plt.show()
So far, these numbers appear logical. The survey has a relatively even distribution, though 37% of the people who completed it were from the South. Nearly 50% of respondents were employed full-time or part-time, and nearly 69% identified as White or Caucasian. Clearly, there is a skew toward a particular consumer segment, which is important to remember as we move toward our results.
Missing values are essential to look for here. In particular, surveys often have high or otherwise notable rates of missing responses.
# I need to get a sense of how many missing values there are overall in this survey data set
missing_count = gaming_df.isna().sum()
missing_df = (pd.concat([missing_count.rename('Missing count'),
                         missing_count.div(len(gaming_df))
                                      .rename('Missing ratio')], axis=1)
              .loc[missing_count.ne(0)])
# Color the background of each cell according to its value
missing_df.style.background_gradient(cmap="winter")
Several variables are missing data, which is an issue. Some columns are missing more than 50% of their values; for example, Value digital video recorder (DVR) is missing 92.5% of its responses.
I could have imputed data, but I think that would obfuscate the results. Moreover, some completion rates are below 50%, which meant that imputing the mean (for numerical variables) or the mode (for categorical variables) would likely not be accurate. Instead, I dropped variables missing more than 60% of their values. This threshold let me focus on variables that could be incorporated into the model.
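For reference, the mean/mode imputation I decided against can be sketched as follows. The toy columns `hours_played` and `platform` are hypothetical, not from the survey:

```python
import pandas as pd
import numpy as np

# Toy frame with a numeric and a categorical column containing missing values
df = pd.DataFrame({
    'hours_played': [2.0, np.nan, 4.0, 6.0],    # numeric: impute the mean
    'platform': ['PC', 'Console', None, 'PC'],  # categorical: impute the mode
})

# Mean imputation for the numeric column (mean of 2, 4, 6 is 4.0)
df['hours_played'] = df['hours_played'].fillna(df['hours_played'].mean())

# Mode imputation for the categorical column (most frequent value is 'PC')
df['platform'] = df['platform'].fillna(df['platform'].mode()[0])

print(df)
```

The weakness is exactly the one noted above: every imputed cell is pulled toward the center of the observed distribution, so when most of a column is missing, the "filled" column mostly reflects the imputation rule rather than the respondents.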
# It makes sense to drop columns with more than 60% of their data missing, as imputing data won't capture the true variance in those columns
gaming_df = gaming_df.drop(missing_df[missing_df['Missing ratio'] > 0.60].index, axis=1)
Make sure to verify that this worked, and create a new data frame to check it.
missing_count1 = gaming_df.isna().sum()
missing_df1 = (pd.concat([missing_count1.rename('Missing count'),
                          missing_count1.div(len(gaming_df))
                                        .rename('Missing ratio')], axis=1)
               .loc[missing_count1.ne(0)])
# Color the background of each cell according to its value
missing_df1.style.background_gradient(cmap="winter")
This confirmed that no remaining variable had more than 60% missing responses. At this stage, it was time to consider dummy coding. Several survey variables had string responses, and numerical codes needed to be assigned to them.
# One-hot encoding is chosen for all variables where a value of 1 indicates the presence of that category
# and 0 indicates its absence
# It is suitable when there is no inherent order to the categories/no assumption of ordinality in the model

# Select the "Q4 - What is your gender?" variable from the DataFrame
gender_column = gaming_df['Gender']
# Perform one-hot encoding for the gender variable
gender_encoded_data = pd.get_dummies(gender_column, prefix='Gender')
# Select the "Q5 - Which category best describes your ethnicity?" variable from the DataFrame
ethnicity_column = gaming_df['Ethnicity']
# Perform one-hot encoding for the ethnicity variable
ethnicity_encoded_data = pd.get_dummies(ethnicity_column, prefix='Ethnicity')
# Select the "region - Region" variable from the DataFrame
region_column = gaming_df['Region']
# Perform one-hot encoding for the region variable
region_encoded_data = pd.get_dummies(region_column, prefix='Region')
# Select the "When you use family or friend sub, what do you watch most often" variable from the DataFrame
Subwatch_column = gaming_df['When you use family or friend sub, what do you watch most often']
# Perform one-hot encoding for the shared-subscription variable
Subwatch_encoded_data = pd.get_dummies(Subwatch_column, prefix='Sub Watch')
# Concatenate the encoded data with the original DataFrame
gaming_df_encoded = pd.concat([gaming_df, gender_encoded_data, ethnicity_encoded_data, region_encoded_data, Subwatch_encoded_data], axis=1)
# Print the resulting DataFrame
gaming_df_encoded
This code assigned a "1" whenever a particular response was present for a participant, and a "0" when it was not.
For label encoding, I needed to replace categorical values that had an ordinal/interval scale. This mattered for preserving the strength and direction of each variable's scale.
# Label encoding using the replace() method
# This converts each variable response into a unique coded value
# It is typically used when a variable has an inherent ordinal/interval relationship
# It is particularly useful when building models that can directly interpret the encoded values

# replace() mapping for Household Income
encoding_mapping = {
    "Less than $29,999": 1,
    "$30,000 to $49,999": 2,
    "$50,000 to $99,999": 3,
    "$100,000 to $299,999": 4,
    "More than $300,000": 5,
    "Do not know": -999,
}
# I'll be sure to drop the -999 values later, but for now this lets me flag what I won't need
# This will make dropping those answer choices easier
# Perform label encoding using the specified mapping
gaming_df_encoded["Household_Income"] = gaming_df_encoded["Household_Income"].replace(encoding_mapping)
# replace() mapping for QNEW3 - What is your employment status?
encoding_mapping = {
    "Retired": 0,
    "Unemployed": 1,
    "Student": 2,
    "Self-employed": 3,
    "Employed full-time or part-time": 4
}
# Perform label encoding using the specified mapping
gaming_df_encoded["QNEW3 - What is your employment status?"] = gaming_df_encoded["QNEW3 - What is your employment status?"].replace(encoding_mapping)
# Replace method for willingness to pay to double Internet speed
encoding_mapping = {
    "I am not willing to pay more for faster download speeds as my current speed is sufficient for my needs": 1,
    "I prefer faster speed but I am unwilling to pay more than I already do": 2,
    "I am willing to pay $5 per month on top of what I already pay": 3,
    "I am willing to pay $10 per month on top of what I already pay": 4,
    "I am willing to pay $20 per month on top of what I already pay": 5,
    "I am willing to pay $30 or more per month on top of what I already pay": 6,
}
# Perform label encoding using the specified mapping
gaming_df_encoded["Willing to pay to double download speed"] = gaming_df_encoded["Willing to pay to double download speed"].replace(encoding_mapping)
NaNs are values that are "not a number," typically missing or undefined entries in a data set. I coded responses that I wouldn't need in the future as -999. This is a useful tactic when there are responses (e.g., "Do not know") that genuinely can't be placed on a scale.
# The -999 responses are not useful in this data set
# Drop rows containing any -999 values
gaming_df_encoded = gaming_df_encoded[~(gaming_df_encoded == -999).any(axis=1)]
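An equivalent approach, sketched on a hypothetical miniature frame (not the survey data), is to map the sentinel to NaN and then use `dropna`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame using the same -999 sentinel convention
df = pd.DataFrame({"income": [1, 3, -999, 5], "speed": [2, -999, 4, 1]})

# Convert the sentinel to NaN, then drop incomplete rows
cleaned = df.replace(-999, np.nan).dropna()
print(cleaned)
```

This has the advantage that any NaN-aware pandas tooling (`isnull`, `fillna`, etc.) can see the missingness before you drop it.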
I executed one final check for NaNs.
# Check for NaN values in gaming_df_encoded
has_nan = gaming_df_encoded.isnull().any().any()
if has_nan:
    print("There are remaining NaN values in gaming_df_encoded.")
else:
    print("There are no remaining NaN values in gaming_df_encoded.")
It appeared that there were no NaNs here. This allowed me to look at univariate breakdowns for the variables.
gaming_df_encoded.describe()
Next, I looked at a correlation matrix to understand how particular variables interacted with one another.
# Calculate the correlation matrix
correlation_matrix = gaming_df_encoded.corr()
# Select features with correlation above a certain threshold
correlation_threshold = 0.7  # Adjust the threshold as needed
selected_features = correlation_matrix[abs(correlation_matrix['Subscriptions - Streaming video service']) > correlation_threshold].index.tolist()
correlation_matrix
Recall that there were over 200 variables at the start of this project. As a result, it was difficult to view the top correlations through this window. Instead, I wrote code that ranked correlations with an absolute value of at least 0.5.
# Find correlations
corr = gaming_df_encoded.corr()
# Mask the upper triangle so self-correlations and duplicate pairs drop out
corr = corr.mask(np.triu(np.ones(corr.shape)).astype(bool))
# Unstack the correlations
unstacked_corr = corr.unstack().reset_index()
# Rename columns
unstacked_corr.columns = ['Variable 1', 'Variable 2', 'Correlation']
# Sort correlations in descending order
sorted_corr = unstacked_corr.sort_values('Correlation', ascending=False)
# Filter out correlations with absolute value less than 0.5
strongest_corr = sorted_corr[abs(sorted_corr['Correlation']) >= 0.5]
# Add a ranking feature
strongest_corr['Rank'] = strongest_corr['Correlation'].abs().rank(ascending=False)
# Format the table for clearer display
strongest_corr_formatted = strongest_corr.round(3)  # Round correlation values to 3 decimal places
strongest_corr_formatted['Variable 1'] = strongest_corr_formatted['Variable 1'].astype(str)
strongest_corr_formatted['Variable 2'] = strongest_corr_formatted['Variable 2'].astype(str)
strongest_corr_formatted['Correlation'] = strongest_corr_formatted['Correlation'].apply(lambda x: f"{x:+.3f}")
# Display the formatted table
table = tabulate(strongest_corr_formatted, headers='keys', showindex=False, tablefmt='fancy_grid')
print(table)
I plotted distributions and histograms for all variables. This helped me understand whether I needed to normalize any variables. I added a warning in the code to flag skewed histograms and high variances. This is also useful for determining whether your data is homoscedastic (equal variance across all answer choices — good!) or heteroscedastic (unequal variance across all answer choices — bad!).
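As a side note, a formal way to check the equal-variance assumption is Levene's test; here is a minimal sketch on synthetic groups (not the survey responses):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Two hypothetical answer-choice groups with very different spreads
group_a = rng.normal(loc=0, scale=1.0, size=200)
group_b = rng.normal(loc=0, scale=3.0, size=200)

# Levene's test: the null hypothesis is equal variances across groups
stat, p_value = stats.levene(group_a, group_b)
# A small p-value suggests heteroscedasticity (unequal variances)
print(f"statistic={stat:.2f}, p={p_value:.4f}")
```

With a variance ratio this large the test rejects the null decisively; visual inspection of the histograms below serves the same purpose more informally.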
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# Define the frequency columns
frequency_columns = ['Movie Frequency - Rent digital video entertainment to download',
'Movie Frequency - Watch digital video entertainment to download',
'Movie Frequency - Purchase/rent a video via TV service On-Demand PPV',
'TV Frequency - Rent a physical DVD/Blu-ray',
'TV Frequency - Purchase a physical DVD/Blu-ray',
'TV Frequency - Purchase digital video entertainment',
'TV Frequency - Watch digital video entertainment via an online streaming service',
'TV Frequency - Purchase/rent a video via your television on-demand or PPV',
'Frequency of using family or friend subscription log in',
'Do during home TV viewing - Read for work/school',
'Do during home TV viewing - Read for pleasure',
'Do during home TV viewing - Browse and surf the Web',
'Do during home TV viewing - Microblogging',
'Do during home TV viewing - Read email',
'Do during home TV viewing - Write email',
'Do during home TV viewing - Text message',
'Do during home TV viewing - Use a social network',
'Do during home TV viewing - Talk on the phone',
'Do during home TV viewing - Browse for products and services online',
'Do during home TV viewing - Purchase products and services online',
'Do during home TV viewing - Play video games',
'I would rather pay for news online to avoid ads',
'I would rather pay for sports information online to avoid ads',
'I would rather pay for games online to avoid ads',
'I would rather pay for music online to avoid ads',
'I would rather pay for TV shows online to avoid ads',
'I would rather pay for movies online to avoid ads',
'I would rather provide more personal information online to receive targeted ads',
'By providing personal info online I may be a victim of identity theft',
'I would be willing to view ads if it cut sub cost']
# Iterate over each column and plot histograms with the distribution shape
for column in frequency_columns:
    plt.figure(figsize=(8, 6))
    # Plot histogram
    plt.subplot(2, 1, 1)
    sns.histplot(gaming_df_encoded[column], kde=True)
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.title('Histogram with Distribution Shape of {}'.format(column))
    # Plot distribution shape
    plt.subplot(2, 1, 2)
    sns.kdeplot(gaming_df_encoded[column])
    plt.xlabel(column)
    plt.ylabel('Density')
    plt.title('Distribution Shape of {}'.format(column))
    # Flag skewed histograms
    skewness = gaming_df_encoded[column].skew()
    if abs(skewness) > 1:
        plt.xlabel(f"Skewness: {skewness:.2f}", color='red')
        plt.axvline(gaming_df_encoded[column].mean(), color='red', linestyle='--')
    # Flag high variance (relative to the 75th percentile of variances across these columns)
    variance = gaming_df_encoded[column].var()
    if variance > np.percentile(gaming_df_encoded[frequency_columns].var(), 75):
        plt.xlabel(f"Variance: {variance:.2f}", color='orange')
        plt.axvline(gaming_df_encoded[column].mean(), color='orange', linestyle='--')
    plt.tight_layout()
    plt.show()
I was glad to see that nothing appeared to need normalization here: all of the charts were in blue, and the variances looked homoscedastic.
Before I could pursue a logistic regression, I needed to split the data into training and testing sets. Splitting first helps you avoid data leakage, where information from the test set influences how features are chosen, which can lead to overfitting. Overfitting means the model can't generalize to unseen data because it is too closely tailored to the training data. In other words, it can't ingest new data and make effective predictions.
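A minimal sketch of the ordering issue on synthetic data (not the survey): a selector fit on all rows has already seen the held-out rows, while a selector fit on the training rows only has not.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Leaky: the selector sees the test rows before they are held out
leaky = SelectKBest(score_func=f_classif, k=5).fit(X, y)
# Leak-free: the selector is fit on the training rows only
clean = SelectKBest(score_func=f_classif, k=5).fit(X_tr, y_tr)

print(leaky.get_support().sum(), clean.get_support().sum())
```

Both selectors still return five features; the difference is that only the second one's test-set performance is an honest estimate.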
# Before building a logistic regression, I need to split the data into training and testing sets
# Separate the target variable
column_to_drop = 'Subscriptions - Streaming video service'
X = gaming_df_encoded.drop(columns=[column_to_drop])
y = gaming_df_encoded[column_to_drop]
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define the number of top features to select
k = 5
# Perform SelectKBest feature selection on the training data only (to avoid leakage)
selector = SelectKBest(score_func=f_classif, k=k)
selector.fit(X_train, y_train)
# Selected features
selected_features = X.columns[selector.get_support()]
# Create a table for the selected features
selected_features_table = pd.DataFrame({'Features': selected_features})
# Set the HTML styling for the table
html_table = selected_features_table.style.set_table_attributes('style="font-size: 40pt;"').hide_index().render()
# Print the selected features table with increased font size
display(HTML(html_table))
I ran a logistic regression here. The dependent variable is binary ("1" for subscription, "0" for no subscription). Since I was interested in seeing how well a model could predict streaming subscriptions, I needed a classification method, and logistic regression is usually a good first step.
I wanted to be sure I had the key metrics available to me, including the number of iterations, p-values, z-scores, confidence intervals, the number of observations, and other factors. I also requested a ROC curve and a classification table.
from sklearn.preprocessing import StandardScaler

# Handle missing values
X_train_cleaned = X_train[selected_features].dropna()
# Scale the input data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_cleaned)
# Train a logistic regression model on the selected features
lr_model = LogisticRegression()
lr_model.fit(X_train_scaled, y_train[X_train_cleaned.index])
# Predict on the training set
X_train_selected_scaled = scaler.transform(X_train[selected_features].dropna())
y_train_pred = lr_model.predict(X_train_selected_scaled)
y_train_pred_proba = lr_model.predict_proba(X_train_selected_scaled)[:, 1]
# Calculate evaluation metrics on the training set
train_accuracy = accuracy_score(y_train[X_train_cleaned.index], y_train_pred)
precision = precision_score(y_train[X_train_cleaned.index], y_train_pred)
recall = recall_score(y_train[X_train_cleaned.index], y_train_pred)
f1 = f1_score(y_train[X_train_cleaned.index], y_train_pred)
roc_auc = roc_auc_score(y_train[X_train_cleaned.index], y_train_pred_proba)
# Classification table on the training set
classification_table = pd.crosstab(y_train[X_train_cleaned.index], y_train_pred, rownames=['Actual'], colnames=['Predicted'])
# Display the logistic regression output table
logit_model = sm.Logit(y_train[X_train_cleaned.index], sm.add_constant(X_train[selected_features].dropna()))
logit_results = logit_model.fit()
print(logit_results.summary())
# Plot ROC curve on the training set
fpr, tpr, thresholds = roc_curve(y_train[X_train_cleaned.index], y_train_pred_proba)
plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, label='Logistic Regression (AUC = {:.2f})'.format(roc_auc))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()
# Print evaluation metrics on the training set
print("Training Accuracy:", train_accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("ROC AUC Score:", roc_auc)
# Display the classification table on the training set
sns.heatmap(classification_table, annot=True, cmap='Blues', fmt='g')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Classification Table (Training Set)')
plt.show()
# Calculate true positives, true negatives, false positives, false negatives
true_positives = classification_table.loc[1, 1]
true_negatives = classification_table.loc[0, 0]
false_positives = classification_table.loc[0, 1]
false_negatives = classification_table.loc[1, 0]
# Print the counts
print("True Positives:", true_positives)
print("True Negatives:", true_negatives)
print("False Positives:", false_positives)
print("False Negatives:", false_negatives)
Training Accuracy: 0.7156153050672182
Precision: 0.7409638554216867
Recall: 0.7165048543689321
F1 Score: 0.7285291214215203
ROC AUC Score: 0.798642495059713
Overall, this model did a fair job of predicting values. The area under the curve (AUC) score was .80, which is pretty good but not stellar. There were many false positives (predicting a subscription where there was none; 129) and false negatives (predicting no subscription where there was one; 146). The recall score (.72) means the model was slightly worse at identifying actual subscribers out of the full pool of subscribers than it was at being correct when it did predict that someone had a TV/video subscription (precision = .74).
It is prudent to take a step back here. What does this mean? If I'm a marketer, I'm likely interested in finding people who do not have a TV and video subscription. It is therefore more costly for me to be wrong when I say that someone doesn't have a subscription, as that represents a missed opportunity to market to someone and try to convince them to get a subscription service. I would want to minimize my false negatives, which means I need to focus on improving recall. If I were instead worried about wrongly saying that someone does have a TV and video subscription, I would want to minimize false positives and focus on improving precision.
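One low-cost way to trade precision for recall, sketched on synthetic data (not this survey), is to lower the decision threshold below the default 0.5:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

# Default 0.5 threshold vs. a lower threshold that favors recall over precision
pred_default = (proba >= 0.5).astype(int)
pred_low = (proba >= 0.3).astype(int)
print(recall_score(y_te, pred_default), recall_score(y_te, pred_low))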
I wanted to be sure to run this model against the test data.
# Handle missing values
X_test_cleaned = X_test[selected_features].dropna()
# Scale the input data
X_test_scaled = scaler.transform(X_test_cleaned)
# Predict on the testing set
y_test_pred = lr_model.predict(X_test_scaled)
y_test_pred_proba = lr_model.predict_proba(X_test_scaled)[:, 1]
# Calculate evaluation metrics on the testing set
test_accuracy = accuracy_score(y_test[X_test_cleaned.index], y_test_pred)
precision_test = precision_score(y_test[X_test_cleaned.index], y_test_pred)
recall_test = recall_score(y_test[X_test_cleaned.index], y_test_pred)
f1_test = f1_score(y_test[X_test_cleaned.index], y_test_pred)
roc_auc_test = roc_auc_score(y_test[X_test_cleaned.index], y_test_pred_proba)
# Classification table on the testing set
classification_table_test = pd.crosstab(y_test[X_test_cleaned.index], y_test_pred, rownames=['Actual'], colnames=['Predicted'])
# Display the logistic regression output table for testing
logit_model_test = sm.Logit(y_test[X_test_cleaned.index], sm.add_constant(X_test[selected_features].dropna()))
logit_results_test = logit_model_test.fit()
print(logit_results_test.summary())
# Plot ROC curve on the testing set
fpr_test, tpr_test, thresholds_test = roc_curve(y_test[X_test_cleaned.index], y_test_pred_proba)
plt.figure(figsize=(10, 6))
plt.plot(fpr_test, tpr_test, label='Logistic Regression (AUC = {:.2f})'.format(roc_auc_test))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve (Testing Set)')
plt.legend(loc='lower right')
plt.show()
# Print evaluation metrics on the testing set
print("Testing Accuracy:", test_accuracy)
print("Precision (Testing):", precision_test)
print("Recall (Testing):", recall_test)
print("F1 Score (Testing):", f1_test)
print("ROC AUC Score (Testing):", roc_auc_test)
# Display the classification table on the testing set
sns.heatmap(classification_table_test, annot=True, cmap='Blues', fmt='g')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Classification Table (Testing Set)')
plt.show()
# Calculate true positives, true negatives, false positives, false negatives on the testing set
true_positives_test = classification_table_test.loc[1, 1]
true_negatives_test = classification_table_test.loc[0, 0]
false_positives_test = classification_table_test.loc[0, 1]
false_negatives_test = classification_table_test.loc[1, 0]
# Print the counts on the testing set
print("True Positives (Testing):", true_positives_test)
print("True Negatives (Testing):", true_negatives_test)
print("False Positives (Testing):", false_positives_test)
print("False Negatives (Testing):", false_negatives_test)
Testing Accuracy: 0.7107438016528925
Precision (Testing): 0.664
Recall (Testing): 0.7477477477477478
F1 Score (Testing): 0.7033898305084746
ROC AUC Score (Testing): 0.7731242693074754
The AUC score decreased slightly on the test data (.77). This is relatively normal. While above .80 is ideal, this is a decent model. It should be noted that many people have streaming video services, so it can be difficult to discern which traits genuinely help predict these subscriptions.
To ensure that this model's performance is robust and to estimate how it might perform on unseen data, cross-validation is exceptionally useful. It also helps avoid overfitting and is particularly valuable with small data sets.
# As a result, this looks like a reasonable model, and it's important to run cross-validation
# This is useful to ensure that the model generalizes across different subsets of the data
# Create a logistic regression model
model = LogisticRegression()
# Perform cross-validation
cv_scores = cross_val_score(model, X, y, cv=5)  # Specify the number of folds (e.g., cv=5 for 5-fold cross-validation)
# Create a DataFrame to store the cross-validation scores
cv_scores_df = pd.DataFrame({'Cross-Validation Score': cv_scores})
# Print the cross-validation scores in a table format
print(tabulate(cv_scores_df, headers='keys', tablefmt='psql'))
# Calculate the mean and standard deviation of the cross-validation scores
mean_cv_score = np.mean(cv_scores)
std_cv_score = np.std(cv_scores)
# Print the mean and standard deviation of the cross-validation scores
print("Mean CV Score:", mean_cv_score)
print("Std CV Score:", std_cv_score)
The cross-validation ran 5 folds, scoring 0.90, 1.00, 0.85, 0.95, and 0.95. The average was .93 and the standard deviation was .05. This demonstrated high accuracy and low variance, two things that are essential in cross-validation.
I wanted to see whether any hyperparameter tuning would help by running an elastic net regression.
# I wanted to see if I could use hyperparameter tuning with elastic net regression
# Vary the L1/L2 mix (l1_ratio) across the candidate models
clf = [
    LogisticRegression(solver='saga', penalty='elasticnet', l1_ratio=0.2, max_iter=1000),
    LogisticRegression(solver='saga', penalty='elasticnet', l1_ratio=0.4, max_iter=1000),
    LogisticRegression(solver='saga', penalty='elasticnet', l1_ratio=0.6, max_iter=1000),
    LogisticRegression(solver='saga', penalty='elasticnet', l1_ratio=0.8, max_iter=1000)
]
clf_columns = []
clf_compare = pd.DataFrame(columns=clf_columns)
row_index = 0
for alg in clf:
    predicted = alg.fit(X_train, y_train).predict(X_test)
    clf_name = alg.__class__.__name__
    clf_compare.loc[row_index, 'Train Accuracy'] = round(alg.score(X_train, y_train), 5)
    clf_compare.loc[row_index, 'Test Accuracy'] = round(alg.score(X_test, y_test), 5)
    clf_compare.loc[row_index, 'Precision'] = round(precision_score(y_test, predicted), 5)
    clf_compare.loc[row_index, 'Recall'] = round(recall_score(y_test, predicted), 5)
    clf_compare.loc[row_index, 'AUC'] = round(roc_auc_score(y_test, predicted), 5)
    row_index += 1
clf_compare.sort_values(by=['Test Accuracy'], ascending=False, inplace=True)
clf_compare
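A more systematic alternative, sketched here on synthetic data (not the survey), is `GridSearchCV`, which searches the L1/L2 mix and the regularization strength jointly:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)

# The two knobs that matter for an elastic-net logistic regression
param_grid = {"l1_ratio": [0.1, 0.5, 0.9], "C": [0.01, 0.1, 1.0]}
search = GridSearchCV(
    LogisticRegression(solver="saga", penalty="elasticnet", max_iter=5000),
    param_grid,
    scoring="roc_auc",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The grid values here are illustrative; in practice you would pick the candidate ranges based on how strongly you expect sparsity and shrinkage to matter.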
The AUC score remained virtually the same here, so I elected to move on to Shapley values.
Shapley values are particularly useful as a data visualization to help illustrate the impact of certain variables. They are especially valuable for feature importance, feature selection, and model improvement. I computed Shapley values below.
# Finally, I'm a fan of Shapley values as a data visualization to help people understand results
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(gaming_df_encoded[selected_features], gaming_df_encoded['Subscriptions - Streaming video service'], test_size=0.2, random_state=42)
# Train a machine learning model (Random Forest Regressor)
model = RandomForestRegressor()
model.fit(X_train, y_train)
# Initialize the explainer object
explainer = shap.Explainer(model, X_train)
# Calculate Shapley values for the testing data
shap_values = explainer.shap_values(X_test)
# Create a summary plot of Shapley values
shap.summary_plot(shap_values, X_test)
# Get features
y = gaming_df_encoded['Subscriptions - Streaming video service']
X = gaming_df_encoded[selected_features]
# Train model
model = xgb.XGBRegressor(objective="reg:squarederror")
model.fit(X, y)
# Get SHAP values
explainer = shap.Explainer(model)
shap_values = explainer(X)
# Waterfall plot for the first observation
shap.plots.waterfall(shap_values[0])
Somewhat unexpectedly, the Shapley values illustrated that only owning a gaming console meaningfully increased the predicted outcome compared to the other variables used here. While this is surprising, the study did rely on a small survey data set, and most participants did own a streaming service subscription.
Thank you very much for reading!