In this Research Project I would like to explore data related to Netflix and its Movies and TV Shows.
I chose this dataset because I've always been an avid fan of film, and one problem I have with Netflix is that its search function is not that helpful. I wanted a way to find new movies and TV shows to watch in the future based on specific parameters such as genre, duration, release date and director/cast.
My taste in film often changes based on what mood I'm in and so one question that I would like to investigate is how many genres Netflix has and which genre dominates the platform.
Another question that I wanted to investigate regards the spread of film durations. My choice of film often involves the length of the film as I usually like to watch short tv shows during exam season and longer movies on weekends when I don't have work. Therefore one question I have is what is the average film duration for tv shows and movies on Netflix?
Release date is also an interesting parameter to explore. For example, does Netflix own the rights to older movies, especially since a lot of these older movies are owned by big corporations that may not allow Netflix to stream them? What time period are most of the films on Netflix from?
Last but not least, I would like to be able to sort films based on director and cast. I've been seeing a lot more Asian films in the last year, and I was wondering whether Netflix's choice to include more modern Asian films means that there are a lot more films with Asian directors and cast members. Who are the international (non-US) directors/cast with the most films on Netflix?
These are just some of the basic questions that I will be exploring in this project.
Let us first take a look at what type of data this dataset contains. The dataset used is included in the Google Drive folder. The CSV file itself is called netflix_titles.
Link to original dataset: https://www.kaggle.com/shivamb/netflix-shows
This data was collected by Shivam Bansal on Kaggle.
Import Libraries:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
from collections import Counter
from google.colab import drive
drive.mount('/content/gdrive')
df_netflix = pd.read_csv('/content/gdrive/My Drive/Data Science/netflix_titles.csv')
pd.set_option("display.max.columns", None)
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
Let's take a look at the Data's information including its columns. From below we can see that there are 7787 different titles included in the dataset. The data that we have been provided include the show's id (not that useful), type, title, director, cast, country, date_added (not as useful), release_year, rating, duration, listed_in (genre) and description. In the project I will be focusing on sorting based on title, director, cast, country, release_year, rating, duration and listed_in. As shown by the column's type, only the release year is included as an integer, the rest is included as a string or object.
df_netflix.info()
#df_netflix.describe() use later for release year data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   show_id       7787 non-null   object
 1   type          7787 non-null   object
 2   title         7787 non-null   object
 3   director      5398 non-null   object
 4   cast          7069 non-null   object
 5   country       7280 non-null   object
 6   date_added    7777 non-null   object
 7   release_year  7787 non-null   int64
 8   rating        7780 non-null   object
 9   duration      7787 non-null   object
 10  listed_in     7787 non-null   object
 11  description   7787 non-null   object
dtypes: int64(1), object(11)
memory usage: 730.2+ KB
df_netflix.shape
(7787, 12)
df_netflix.dtypes
show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object
Next, let's look at the dataset's shape and structure. We will also take a clearer look at what data each column actually represents and how it is distributed.
As can be seen below, not every column has the same count, which shows that some rows are missing data. However, we can see that every title has an id, type and title, as there are 7787 of each.
There are not that many notable data quality issues. The number of titles with no director is a little concerning, but other than that most of the columns seem very clean and have at least 7000 entries.
We can also see some interesting facts, such as that David Attenborough is the most frequent cast entry and Raúl Campos and Jan Suter are the most frequent director entry. Unexpectedly, United States is the top location that the films were produced in.
There are some surprising things about this data. The duration for TV shows may be a little difficult to understand, as it is given by how many seasons they have instead of total duration or average episode duration. Also note that there are multiple directors and cast members listed for each title, and we might have to take some time to parse them into individual people.
df_netflix.isnull().sum()
show_id            0
type               0
title              0
director        2389
cast             718
country          507
date_added        10
release_year       0
rating             7
duration           0
listed_in          0
description        0
dtype: int64
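The raw null counts are easier to judge as a share of all rows: 2389 missing directors out of 7787 titles is roughly 31%. Below is a minimal sketch of that calculation. The helper name `missing_share` and the tiny sample frame are hypothetical and only stand in for df_netflix.

```python
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, largest first."""
    return (df.isnull().mean() * 100).round(1).sort_values(ascending=False)


# Made-up rows for illustration only; not taken from the Netflix dataset.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "country": ["US", "India", None, "US"],
})

shares = missing_share(sample)
print(shares)
```

On the real data the same call would be `missing_share(df_netflix)`.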
df_netflix.describe(include=object)  # np.object was removed from recent NumPy releases; the builtin object works the same here
| | show_id | type | title | director | cast | country | date_added | rating | duration | listed_in | description |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 7787 | 7787 | 7787 | 5398 | 7069 | 7280 | 7777 | 7780 | 7787 | 7787 | 7787 |
unique | 7787 | 2 | 7787 | 4049 | 6831 | 681 | 1565 | 14 | 216 | 492 | 7769 |
top | s1996 | Movie | The Last Face | Raúl Campos, Jan Suter | David Attenborough | United States | January 1, 2020 | TV-MA | 1 Season | Documentaries | Multiple women report their husbands as missin... |
freq | 1 | 5377 | 1 | 18 | 18 | 2555 | 118 | 2863 | 1608 | 334 | 3 |
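The "top" director above is the combined entry "Raúl Campos, Jan Suter", which shows why the comma-separated columns need to be parsed into individual people before counting. One possible approach, sketched here with a hypothetical helper `count_people` and made-up sample rows, is to split each entry and let pandas `explode` give every name its own row.

```python
import pandas as pd


def count_people(df: pd.DataFrame, column: str) -> pd.Series:
    """Count titles per individual name in a comma-separated column."""
    names = (
        df[column]
        .dropna()          # titles with no entry are skipped
        .str.split(",")    # "A, B" -> ["A", " B"]
        .explode()         # one row per name
        .str.strip()       # remove the space left after each comma
    )
    names = names[names != ""]  # guard against stray trailing commas
    return names.value_counts()


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "director": ["Raúl Campos, Jan Suter", "Jan Suter", None],
})

director_counts = count_people(sample, "director")
print(director_counts)
```

The same helper could be pointed at the cast column with `count_people(df_netflix, "cast")`.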
Now that we know what this data includes, we can clean the data by removing columns we don't need. We will sort individual column data later when we get to the Data Exploration and Visualization part.
df_netflix = df_netflix.drop(['show_id','description'], axis=1)
df_netflix.head()
| | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in |
---|---|---|---|---|---|---|---|---|---|---|
0 | TV Show | 3% | NaN | João Miguel, Bianca Comparato, Michel Gomes, R... | Brazil | August 14, 2020 | 2020 | TV-MA | 4 Seasons | International TV Shows, TV Dramas, TV Sci-Fi &... |
1 | Movie | 7:19 | Jorge Michel Grau | Demián Bichir, Héctor Bonilla, Oscar Serrano, ... | Mexico | December 23, 2016 | 2016 | TV-MA | 93 min | Dramas, International Movies |
2 | Movie | 23:59 | Gilbert Chan | Tedd Chan, Stella Chung, Henley Hii, Lawrence ... | Singapore | December 20, 2018 | 2011 | R | 78 min | Horror Movies, International Movies |
3 | Movie | 9 | Shane Acker | Elijah Wood, John C. Reilly, Jennifer Connelly... | United States | November 16, 2017 | 2009 | PG-13 | 80 min | Action & Adventure, Independent Movies, Sci-Fi... |
4 | Movie | 21 | Robert Luketic | Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar... | United States | January 1, 2020 | 2008 | PG-13 | 123 min | Dramas |
Let's look at what the Netflix titles are categorised by in regards to type.
Since the two types of data that type includes are "Movie" and "TV Show" we can sort the larger netflix dataframe into two separate dataframes for each type.
df_netflix.type.value_counts()
TVshows = df_netflix[df_netflix['type'] == 'TV Show']
Movie = df_netflix[df_netflix['type'] == 'Movie']
df_netflix.type.unique()
array(['TV Show', 'Movie'], dtype=object)
Let's take a look at the number of TV shows vs the number of Movies on Netflix.
As we can see below, there are far more Movies than TV shows. This makes sense, as the TV show industry started later than the film industry and there are probably more films out there that Netflix can get the rights to.
plt.figure(figsize=(10,7))
sns.set(style="darkgrid")
ax = sns.countplot(x="type", data=df_netflix, palette="mako")
ax.set(xlabel='Type', ylabel='Number of Type')
plt.title("Number of TV Shows vs Movies")
Text(0.5, 1.0, 'Number of TV Shows vs Movies')
Let's take a look at the data we were given regarding release dates for the movies and tv-shows.
As it is the only numerical data we have, we can take a more quantitative look at it.
print(df_netflix.release_year)
0       2020
1       2016
2       2011
3       2009
4       2008
        ...
7782    2005
7783    2015
7784    2019
7785    2019
7786    2019
Name: release_year, Length: 7787, dtype: int64
df_netflix.describe()
| | release_year |
---|---|
count | 7787.000000 |
mean | 2013.932580 |
std | 8.757395 |
min | 1925.000000 |
25% | 2013.000000 |
50% | 2017.000000 |
75% | 2018.000000 |
max | 2021.000000 |
Below is a boxplot and histogram of the data. It is apparent that most of the releases are concentrated in the 10 year period from 2010-2020.
number_of_releases = df_netflix['release_year'].value_counts()
plt.figure(figsize=(30,50))
bx = sns.catplot(y='release_year', data=df_netflix, palette="Set2", kind="box")
bx.set(xlabel='', ylabel='Release Year')  # the box plot puts release_year on the y-axis
plt.title("Boxplot of Releases")
Text(0.5, 1.0, 'Boxplot of Releases')
<Figure size 2160x3600 with 0 Axes>
bx = sns.histplot(x='release_year', data=df_netflix, palette="Set2")
bx.set(xlabel='Release Year', ylabel='Number of releases')
plt.title("Histogram of Releases")
Text(0.5, 1.0, 'Histogram of Releases')
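The summary statistics already hint at this concentration: the 25th percentile is 2013, so about three quarters of the titles were released in 2013 or later. To put a number on any period, a small check like the one below could be used. The function name `share_in_range` and the sample years are hypothetical.

```python
import pandas as pd


def share_in_range(years: pd.Series, start: int, end: int) -> float:
    """Fraction of titles whose release year lies in [start, end], inclusive."""
    return years.between(start, end).mean()


# Made-up release years for illustration only.
sample_years = pd.Series([1995, 2008, 2012, 2016, 2018, 2019, 2019, 2020])

share = share_in_range(sample_years, 2010, 2020)
print(f"{share:.0%} of the sample titles were released in 2010-2020")
```

On the real data this would be `share_in_range(df_netflix['release_year'], 2010, 2020)`.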
The bar chart below gives us a better idea of the years with the most releases. It is unsurprising that the top 5 are all from the last 5 years.
plt.figure(figsize=(15,10))
ax = sns.countplot(x='release_year', data=df_netflix, palette="Set2", order=number_of_releases.index[0:15])
ax.set(xlabel='Release Year', ylabel='Number of releases')
plt.title("Bar Graph of Releases")
Text(0.5, 1.0, 'Bar Graph of Releases')
I had a hunch that the decrease in releases in the last two years might be related to COVID. The graph below shows the change in the number of TV Shows and Movies released, with the COVID period shaded. Since COVID began, the number of releases for both Movies and TV Shows has decreased. This would make sense, as most film crews were unable to congregate due to the pandemic.
TVshows_progress = TVshows['release_year'].value_counts().sort_index()
Movie_progress = Movie['release_year'].value_counts().sort_index()
plt.figure(figsize=(14, 7))
plt.plot(TVshows_progress.index, TVshows_progress.values, label='TV shows')
plt.plot(Movie_progress.index, Movie_progress.values, label='Movie')
#plt.plot(df_netflix['release_year'].index, df_netflix['release_year'].values, label='Total')
plt.axvline(2019, alpha=0.3, linestyle='--', color='r')
plt.axvline(2021, alpha=0.3, linestyle='--', color='r')
plt.axvspan(2019, 2021, alpha=0.2, color='r', label='Coronavirus')
plt.xticks(list(range(1925, 2026, 5)), fontsize=12)
plt.title('Content growth throughout history', fontsize=18)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Amount of content', fontsize=14)
plt.yticks(fontsize=12)
plt.legend()
plt.show()
df_netflix2 = df_netflix[df_netflix.release_year>=2000]
Total_progress = df_netflix2['release_year'].value_counts().sort_index()
plt.figure(figsize=(14, 7))
plt.plot(Total_progress.index, Total_progress.values, label='Total Releases', color='g')
plt.axvline(2019, alpha=0.3, linestyle='--', color='r')
plt.axvline(2021, alpha=0.3, linestyle='--', color='r')
plt.axvspan(2019, 2021, alpha=0.2, color='r', label='Coronavirus')
plt.xticks(list(range(2000, 2022, 1)), fontsize=12)
plt.title('Content growth in the 21st century', fontsize=18)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Amount of content', fontsize=14)
plt.yticks(fontsize=12)
plt.legend()
plt.show()
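To see the size of the drop rather than just the shape of the line, the yearly counts can be turned into a year-over-year percentage change. This is only a sketch: `yearly_change` and the sample years are hypothetical, and a low count for the most recent year may also reflect when the dataset was collected, which is an assumption worth checking rather than something the data states.

```python
import pandas as pd


def yearly_change(release_years: pd.Series) -> pd.DataFrame:
    """Titles per release year plus the percentage change from the previous year."""
    counts = release_years.value_counts().sort_index()
    return pd.DataFrame({
        "titles": counts,
        "pct_change": counts.pct_change() * 100,  # first year has no predecessor (NaN)
    })


# Made-up release years for illustration only.
sample_years = pd.Series([2018] * 4 + [2019] * 5 + [2020] * 3)

change = yearly_change(sample_years)
print(change)
```

On the real data this would be `yearly_change(df_netflix2['release_year'])`, or the same call on the TVshows and Movie frames separately.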
The bar charts below show the countries with the highest number of movie and tv show releases. It is interesting that India is second in the list, highlighting the rise of Bollywood and Netflix's decision to include more Indian films on its platform (perhaps to increase viewers from Asian communities).
# print(df_netflix.columns)
# country=pd.Series(df_netflix.country)
# print(pd.unique(country))
list_country = [x.split(', ') for x in df_netflix.dropna(subset=['country'])['country'].tolist()]
list_country = list(itertools.chain(*list_country))
df_netflix_country_count = pd.DataFrame(Counter(list_country).most_common()[:10], columns=['Country', 'Count'])
plt.figure(figsize=(12,10))
sns.set(style="darkgrid")
ax = sns.barplot(y="Count", x='Country', data=df_netflix_country_count, palette="Set2", orient='v')
There are a total of 120 countries in the dataset. However, each movie and TV show can list multiple countries, which inflates the counts of countries with a lot of capital in film production, such as the United States.
df_netflix_country_count = pd.DataFrame(Counter(list_country).most_common()[:], columns=['Country', 'Count'])
print("Number of different countries:",len(df_netflix_country_count)-1)
print()
plt.figure(figsize=(20,50))
sns.set(style="darkgrid")
ax = sns.barplot(x="Count", y='Country', data=df_netflix_country_count, palette="Set2", orient='h')
Number of different countries: 120
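One way to gauge how much co-productions inflate these counts is to count only the first country listed for each title. This is a sketch under a loud assumption: that the first-listed country is the main producer, which the dataset does not state. The helper `primary_country_counts` and the sample values are hypothetical.

```python
import pandas as pd


def primary_country_counts(country: pd.Series) -> pd.Series:
    """Count titles by the first country listed (assumed to be the main producer)."""
    primary = country.dropna().str.split(",").str[0].str.strip()
    return primary[primary != ""].value_counts()


# Made-up rows for illustration only.
sample = pd.Series([
    "India, United States",
    "India",
    "United Kingdom, United States",
    "United States",
    None,
])

counts = primary_country_counts(sample)
print(counts)
```

In this sample the United States is mentioned three times in total but is the first-listed country only once, so comparing the two rankings shows how much of a country's total comes from co-productions.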
The bar charts below show the content ratings for both movies and TV shows. It is interesting that Movies have a lot more R-rated titles. Perhaps this is because TV channels would not allow R-rated content on their platforms.
rating = Movie['rating'].value_counts()
plt.figure(figsize=(14,7))
plt.title('Content ratings of Movies', fontsize=18)
plt.tick_params(labelsize=14)
sns.barplot(y=rating.index, x=rating.values, palette="Set2")
plt.xlabel('Count', fontsize=14)
plt.ylabel('Rating', fontsize=14)
plt.show()
rating = TVshows['rating'].value_counts()
plt.figure(figsize=(14,7))
plt.title('Content ratings of TV Shows', fontsize=18)
plt.tick_params(labelsize=14)
sns.barplot(y=rating.index, x=rating.values, palette="Set2")
plt.xlabel('Count', fontsize=14)
plt.ylabel('Rating', fontsize=14)
plt.show()
Below, I have analysed the distribution of films based on their length. Because the duration is recorded differently for movies and TV shows, I have separated the data into the two subcategories. Some highlights from the Movies data are that the mean duration was 99 minutes and the longest movie was a whopping 312 minutes long (however, that movie is not really that long, as it is an interactive film with multiple endings, which makes it an outlier).
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
pd.options.mode.chained_assignment = None # default='warn'
grp = df_netflix.groupby('type')
movie = grp.get_group('Movie')
movie['duration'] = [int(i.split(' ')[0]) for i in movie.duration.dropna()]
plt.figure(figsize=(14, 7))
print()
print(movie.describe())
sns.distplot(movie['duration'], bins=60).set(ylabel=None)
plt.title('Length distribution of films', fontsize=18)
plt.xlabel('Duration', fontsize=14)
plt.show()
       release_year     duration
count   5377.000000  5377.000000
mean    2012.920030    99.307978
std        9.663282    28.530881
min     1942.000000     3.000000
25%     2012.000000    86.000000
50%     2016.000000    98.000000
75%     2018.000000   114.000000
max     2021.000000   312.000000
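Since the duration column mixes strings like "93 min" and "4 Seasons", the leading number has to be pulled out before any average can be computed. The sketch below does this for both types at once with a regular expression. The helper `duration_number`, the column name `duration_num` and the sample rows are hypothetical.

```python
import pandas as pd


def duration_number(duration: pd.Series) -> pd.Series:
    """Extract the leading integer from strings like '93 min' or '4 Seasons'."""
    digits = duration.str.extract(r"(\d+)", expand=False)
    return pd.to_numeric(digits)  # rows with no digits become NaN


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show"],
    "duration": ["93 min", "123 min", "1 Season", "4 Seasons"],
})

sample["duration_num"] = duration_number(sample["duration"])
# Minutes for movies, seasons for TV shows, so the two means are not comparable.
averages = sample.groupby("type")["duration_num"].mean()
print(averages)
```

Writing the parsed values to a new column on the full frame also avoids assigning into a filtered copy, so the chained-assignment warning would not need to be silenced.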
short = movie.sort_values('duration')[['title', 'duration']].iloc[:20]
plt.figure(figsize=(14,7))
plt.title('Top 20 shortest movies available on Netflix', fontsize=18)
plt.tick_params(labelsize=14)
sns.barplot(y=short['title'], x=short['duration'], palette="Set2")
plt.xlabel('Duration', fontsize=14)
plt.ylabel('Title', fontsize=14)
plt.show()
long = movie.sort_values('duration')[['title', 'duration']].iloc[-20:]
#print(long)
plt.figure(figsize=(14,7))
plt.title('Top 20 longest movies available on Netflix', fontsize=18)
plt.tick_params(labelsize=14)
sns.barplot(y=long['title'], x=long['duration'], palette="Set2")
plt.xlabel('Duration', fontsize=14)
plt.ylabel('Title', fontsize=14)
plt.show()
Unlike the movies, the TV shows have their duration data given by how many seasons. As expected most of the shows only have one season and this makes sense as most newer TV shows have not released all their seasons.
list_TV_Show = [x.split(', ') for x in TVshows.dropna(subset=['duration'])['duration'].tolist()]
list_TV_Show = list(itertools.chain(*list_TV_Show))
df_netflix_tv_count = pd.DataFrame(Counter(list_TV_Show).most_common()[:],columns=['x', 'y'])
#print(df_netflix_tv_count)
labels = df_netflix_tv_count['x']
sizes = df_netflix_tv_count['y']
plt.figure(figsize=(30,20))
labels = df_netflix_tv_count.keys()
sns.barplot(x=df_netflix_tv_count['x'], y=df_netflix_tv_count['y'], palette="Set2")
plt.tick_params(labelsize=14)
plt.title('Duration for TV Shows on Netflix', fontsize=18)
plt.xlabel('Duration', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.show()
list_TV_Show = [x.split(', ') for x in TVshows.dropna(subset=['duration'])['duration'].tolist()]
list_TV_Show = list(itertools.chain(*list_TV_Show))
df_netflix_tv_count = pd.DataFrame(Counter(list_TV_Show).most_common()[:],columns=['x', 'y'])
searchfor = ['1 Season', '2 Seasons', '3 Seasons']
df_netflix_tv_count = df_netflix_tv_count[~df_netflix_tv_count['x'].isin(searchfor)]  # exact match, so '11 Seasons', '12 Seasons' and '13 Seasons' are not dropped by accident
#print(df_netflix_tv_count)
labels = df_netflix_tv_count['x']
sizes = df_netflix_tv_count['y']
plt.figure(figsize=(30,20))
labels = df_netflix_tv_count.keys()
sns.barplot(x=df_netflix_tv_count['x'], y=df_netflix_tv_count['y'], palette="Set2" )
plt.tick_params(labelsize=14)
plt.title('Duration for TV Shows on Netflix (Excluding 1-3 Seasons)', fontsize=18)
plt.xlabel('Duration', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.show()
list_TV_Show = [x.split(', ') for x in TVshows.dropna(subset=['duration'])['duration'].tolist()]
list_TV_Show = list(itertools.chain(*list_TV_Show))
df_netflix_tv_count = pd.DataFrame(Counter(list_TV_Show).most_common()[:],columns=['x', 'y'])
df_netflix_tv_count2 = pd.DataFrame(Counter(list_TV_Show).most_common()[:4],columns=['x', 'y'])
x = df_netflix_tv_count['y'].sum()-df_netflix_tv_count.iloc[0]['y']-df_netflix_tv_count.iloc[1]['y']-df_netflix_tv_count.iloc[2]['y']-df_netflix_tv_count.iloc[3]['y']
df = pd.DataFrame([['Other', x]], columns=list('xy'))
df_netflix_tv_count2 = pd.concat([df_netflix_tv_count2, df])  # DataFrame.append was removed in pandas 2.0
print(df_netflix_tv_count2)
labels = df_netflix_tv_count2['x']
sizes = df_netflix_tv_count2['y']
colors = ['#ff6666', '#ffcc99', '#99ff99', '#66b3ff','#ffb3ff']
# Plot
plt.pie(sizes, labels=labels, colors=colors, startangle=90,frame=True)
centre_circle = plt.Circle((0,0),0.5,color='black', fc='white',linewidth=0)
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.title('Pie Chart of Number of Seasons of TV Shows', fontsize=18)
plt.axis('equal')
plt.tight_layout()
plt.grid(False)
plt.xlabel('Duration', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.tick_params(top=False, bottom=False, left=False, right=False,
labelleft=False, labelbottom=False)
plt.show()
           x     y
0   1 Season  1608
1  2 Seasons   382
2  3 Seasons   184
3  4 Seasons    87
0      Other   149
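From the printed table, 1608 of the 2410 TV shows (about two thirds) have a single season. The "top 4 plus Other" grouping used for the pie chart can also be written as a small reusable function, sketched here with a hypothetical name `top_n_with_other` and made-up sample values.

```python
import pandas as pd


def top_n_with_other(values: pd.Series, n: int = 4) -> pd.Series:
    """Keep the n most common values and fold the remainder into 'Other'."""
    counts = values.value_counts()
    top = counts.iloc[:n]
    other = counts.iloc[n:].sum()
    if other > 0:
        top = pd.concat([top, pd.Series({"Other": other})])
    return top


# Made-up season labels for illustration only.
sample = pd.Series(
    ["1 Season"] * 5 + ["2 Seasons"] * 3 + ["3 Seasons"] * 2
    + ["5 Seasons", "9 Seasons"]
)

grouped = top_n_with_other(sample, n=2)
print(grouped)
```

The result can be passed to `plt.pie` directly, using `grouped.values` as the sizes and `grouped.index` as the labels.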
Both TV shows and Movies are described by the same set of genre labels. Note, however, that each movie and TV show is given multiple categories, so broader category names tend to show up more often in the dataset. This explains why the top 4 categories are all very broad (international, dramas and comedies). In total there were 41 different categories and, ignoring the category of International Movies, the top genre on Netflix would be Dramas.
print("Number of Releases with categories provided:", df_netflix.listed_in.count())
list_category = [x.split(', ') for x in df_netflix.dropna(subset=['listed_in'])['listed_in'].tolist()]
list_category = list(itertools.chain(*list_category))
df_netflix_category_count = pd.DataFrame(Counter(list_category).most_common()[:50], columns=['Category', 'Count'])  # count categories, not directors
print("Number of different categories:",len(df_netflix_category_count)-1)
#print(df_netflix_category_count)
Number of Releases with categories provided: 7787
Number of different categories: 41
plt.figure(figsize=(10,20))
plt.title('Top 50 Categories available on Netflix', fontsize=18)
plt.tick_params(labelsize=14)
sns.barplot(y=df_netflix_category_count.Category, x=df_netflix_category_count.Count, palette="Set2")
plt.xlabel('Count', fontsize=14)
plt.ylabel('Category', fontsize=14)
plt.show()
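Because each title carries several genre labels, raw counts are easier to interpret as the share of titles that carry each label. The sketch below shows one way to compute that. `genre_share` and the sample rows are hypothetical, and the shares add up to more than 1 precisely because titles are counted once per label.

```python
import pandas as pd


def genre_share(listed_in: pd.Series) -> pd.Series:
    """Share of titles carrying each genre label (shares can sum to more than 1)."""
    genres = listed_in.dropna().str.split(",").explode().str.strip()
    return genres.value_counts() / listed_in.notna().sum()


# Made-up rows for illustration only.
sample = pd.Series([
    "Dramas, International Movies",
    "Dramas",
    "Comedies, International Movies",
    "Documentaries",
])

shares = genre_share(sample)
print(shares)
```

On the real data this would be `genre_share(df_netflix['listed_in'])`.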
The bar chart below shows the directors that have the most releases on Netflix. There were a total of 4478 directors in the database.
list_director = [x.split(', ') for x in df_netflix.dropna(subset=['director'])['director'].tolist()]
list_director = list(itertools.chain(*list_director))
df_netflix_director_count = pd.DataFrame(Counter(list_director).most_common()[:], columns=['Director', 'Count'])
print("Number of different Directors:",len(df_netflix_director_count))
df_netflix_director_count = pd.DataFrame(Counter(list_director).most_common()[:50], columns=['Director', 'Count'])
print()
plt.figure(figsize=(10,20))
plt.title('Top Directors available on Netflix', fontsize=18)
plt.tick_params(labelsize=14)
sns.barplot(y=df_netflix_director_count.Director, x=df_netflix_director_count.Count, palette="Set2")
plt.xlabel('Count', fontsize=14)
plt.ylabel('Director Name', fontsize=14)
plt.show()
Number of different Directors: 4478
There are 2117 directors releasing movies/TV shows in the United States and 2464 directors releasing movies/TV shows outside the United States. Note that some of the directors releasing movies/TV shows in the US may also release them internationally.
united_states = df_netflix[df_netflix["country"].str.contains("United States",na=False)]
list_director = [x.split(', ') for x in united_states.dropna(subset=['director'])['director'].tolist()]
list_director = list(itertools.chain(*list_director))
df_netflix_director_count = pd.DataFrame(Counter(list_director).most_common()[:], columns=['director', 'Count'])
print("Number of different Directors:",len(df_netflix_director_count))
df_netflix_director_count = pd.DataFrame(Counter(list_director).most_common()[:10], columns=['director', 'Count'])
print()
plt.figure(figsize=(25,7))
plt.title('Top 10 US directors available on Netflix', fontsize=18)
plt.tick_params(labelsize=14)
sns.barplot(y="Count", x='director', data=df_netflix_director_count, palette="Set2", orient='v')
plt.xlabel('Director Name', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.show()
Number of different Directors: 2117
non_united_states = df_netflix[~df_netflix["country"].str.contains("United States",na=False)]
list_director = [x.split(', ') for x in non_united_states.dropna(subset=['director'])['director'].tolist()]
list_director = list(itertools.chain(*list_director))
df_netflix_director_count = pd.DataFrame(Counter(list_director).most_common()[:], columns=['director', 'Count'])
print("Number of different Directors:",len(df_netflix_director_count))
df_netflix_director_count = pd.DataFrame(Counter(list_director).most_common()[:10], columns=['director', 'Count'])
print()
plt.figure(figsize=(25,7))
plt.title('Top 10 Non-US directors available on Netflix', fontsize=18)
plt.tick_params(labelsize=14)
sns.barplot(y="Count", x='director', data=df_netflix_director_count, palette="Set2", orient='v')
plt.xlabel('Director Name', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.show()
Number of different Directors: 2464
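The two printed counts add up to 2117 + 2464 = 4581, which is 103 more than the 4478 distinct directors found earlier. Assuming the two filters together cover every title with a director (titles with no country fall into the non-US group because of `na=False`), this suggests about 103 directors appear on both sides. The overlap can be checked directly with sets, as in this sketch. `people_set`, `split_by_us` and the sample rows are hypothetical.

```python
import pandas as pd


def people_set(df: pd.DataFrame, column: str) -> set:
    """Set of individual names found in a comma-separated column."""
    names = df[column].dropna().str.split(",").explode().str.strip()
    return set(names) - {""}


def split_by_us(df: pd.DataFrame, column: str = "director"):
    """Return (US names, non-US names, names appearing in both groups)."""
    is_us = df["country"].str.contains("United States", na=False)
    us = people_set(df[is_us], column)
    non_us = people_set(df[~is_us], column)  # includes titles with no country
    return us, non_us, us & non_us


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "director": ["A, B", "B", "C", "D"],
    "country": ["United States", "India", None, "United States, India"],
})

us, non_us, both = split_by_us(sample)
print(len(us), len(non_us), sorted(both))
```

Passing `column="cast"` would give the same breakdown for cast members.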
The bar chart below shows the cast that have the most releases on Netflix. There were a total of 32,881 cast members in the database.
list_cast = [x.split(', ') for x in df_netflix.dropna(subset=['cast'])['cast'].tolist()]
list_cast = list(itertools.chain(*list_cast))
df_netflix_cast_count = pd.DataFrame(Counter(list_cast).most_common()[:], columns=['Cast', 'Count'])
print("Number of different Cast Members:",len(df_netflix_cast_count))
df_netflix_cast_count = pd.DataFrame(Counter(list_cast).most_common()[:50], columns=['Cast', 'Count'])
print()
plt.figure(figsize=(10,20))
plt.title('Top 50 Cast available on Netflix', fontsize=18)
plt.tick_params(labelsize=14)
sns.barplot(y=df_netflix_cast_count.Cast, x=df_netflix_cast_count.Count, palette="Set2")
plt.xlabel('Count', fontsize=14)
plt.ylabel('Cast Name', fontsize=14)
plt.show()
Number of different Cast Members: 32881
There are 13,786 cast members with movies/TV shows in the United States and 21,222 cast members with movies/TV shows outside the United States. Note that some of the cast members appearing in US movies/TV shows may also appear in international ones.
united_states = df_netflix[df_netflix["country"].str.contains("United States",na=False)]
list_cast = [x.split(', ') for x in united_states.dropna(subset=['cast'])['cast'].tolist()]
list_cast = list(itertools.chain(*list_cast))
df_netflix_cast_count = pd.DataFrame(Counter(list_cast).most_common()[:], columns=['Cast', 'Count'])
print("Number of different Cast Members:",len(df_netflix_cast_count))
df_netflix_cast_count = pd.DataFrame(Counter(list_cast).most_common()[:10], columns=['Cast', 'Count'])
print()
#print(list_cast)
plt.figure(figsize=(18,10))
ax = sns.barplot(y="Count", x='Cast', data=df_netflix_cast_count, palette="Set2", orient='v')
ax.set(xlabel='Cast', ylabel='Number of releases')
plt.ylabel('Count', fontsize=14)
plt.xlabel('Cast Name', fontsize=14)
plt.title('Top 10 US Cast available on Netflix', fontsize=18)
Number of different Cast Members: 13786
Text(0.5, 1.0, 'Top 10 US Cast available on Netflix')
non_united_states = df_netflix[~df_netflix["country"].str.contains("United States",na=False)]
list_cast = [x.split(', ') for x in non_united_states.dropna(subset=['cast'])['cast'].tolist()]
list_cast = list(itertools.chain(*list_cast))
df_netflix_cast_count = pd.DataFrame(Counter(list_cast).most_common()[:], columns=['Cast', 'Count'])
print("Number of different Cast Members:",len(df_netflix_cast_count))
df_netflix_cast_count = pd.DataFrame(Counter(list_cast).most_common()[:10], columns=['Cast', 'Count'])
print()
#print(list_cast)
plt.figure(figsize=(18,10))
ax = sns.barplot(y="Count", x='Cast', data=df_netflix_cast_count, palette="Set2", orient='v')
ax.set(xlabel='Cast', ylabel='Number of releases')
plt.ylabel('Count', fontsize=14)
plt.xlabel('Cast Name', fontsize=14)
plt.title('Top 10 Non-US Cast available on Netflix', fontsize=18)
Number of different Cast Members: 21222
Text(0.5, 1.0, 'Top 10 Non-US Cast available on Netflix')
I really enjoyed investigating this dataset, and I will definitely spend more time exploring it and the relationships between its columns/variables. I would like to combine this data with another database such as IMDb so that I can see how well each movie is rated. Additionally, I would like to replicate the recommendation feed that Netflix produces using NLP resources provided by sources like Google. Maybe if I had more time on my hands and data were available for platforms like Hulu and HBO, I could even generate a bigger database showing all movies/TV shows and which platform they can be watched on. Thank you for taking the time to read through this blog and my data analysis of Netflix.