Introduction to my Data Analysis Project

In this Research Project I would like to explore data related to Netflix and its Movies and TV Shows.

I chose this dataset because I have always been an avid fan of film, and one problem I have with Netflix is that its search function is not very helpful. I wanted a way to find new movies and TV shows to watch in the future based on specific parameters such as genre, duration, release date and director/cast.

My taste in film often changes based on what mood I'm in and so one question that I would like to investigate is how many genres Netflix has and which genre dominates the platform.

Another question that I wanted to investigate concerns the spread of film durations. Length often drives my choice, as I usually like to watch short TV shows during exam season and longer movies on weekends when I don't have work. Therefore, one question I have is: what is the average duration of TV shows and movies on Netflix?

Release date is also an interesting parameter to explore. For example, does Netflix hold the rights to many older movies, given that a lot of them are owned by big corporations that may not allow Netflix to stream them? What time period are most of the films on Netflix from?

Last but not least, I would like to be able to sort films based on director and cast. I have been seeing a lot more Asian films in the last year, and I was wondering whether Netflix's choice to include more modern Asian films means that there are a lot more films with Asian directors and cast members. Who are the international (non-US) directors and cast members with the most films on Netflix?

These are just some of the basic questions that I will be exploring in this project.

First Part: Data Cleaning and Data Checks

Let us first take a look at what type of data this dataset contains. The dataset used is included in the Google Drive folder. The CSV file itself is called netflix_titles.

Link to original dataset: https://www.kaggle.com/shivamb/netflix-shows

This data was collected by Shivam Bansal on Kaggle.

Import Libraries:

Let's take a look at the data's information, including its columns. From below we can see that there are 7787 different titles included in the dataset. The columns provided are the show's id (not that useful), type, title, director, cast, country, date_added (not as useful), release_year, rating, duration, listed_in (genre) and description. In the project I will be focusing on title, director, cast, country, release_year, rating, duration and listed_in. As the column types show, only release_year is stored as an integer; the rest are stored as strings (object).
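As a rough illustration of this inspection step, here is a minimal sketch assuming pandas. The two rows are invented placeholders that only mimic the column layout described above (the column name show_id is an assumption), so the printed shape is not the real 7787-row result.

```python
import io
import pandas as pd

# Two invented rows that mimic the column layout described above.
# With the real file this would be something like pd.read_csv("netflix_titles.csv").
sample_csv = io.StringIO(
    "show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description\n"
    's1,Movie,Example Film,Jane Doe,"Actor A, Actor B",United States,"January 1, 2020",2019,PG-13,99 min,Dramas,A placeholder description.\n'
    's2,TV Show,Example Series,,"Actor C",India,"March 5, 2021",2020,TV-MA,1 Season,"International TV Shows, TV Dramas",Another placeholder.\n'
)
netflix = pd.read_csv(sample_csv)

# Number of rows and columns, then the type pandas inferred for each column.
print(netflix.shape)
print(netflix.dtypes)
```

On a frame like this, release_year is the only column read as an integer, which matches the observation above.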

Next, let's look at the dataset's shape and structure. We will also take a clearer look at what data each column actually represents and how it is distributed.

As can be seen below, not every column has an equal count, showing that some rows are missing values in some columns. However, we can see that every title has an id, type and title, as there are 7787 of each of them.

There are not that many notable data quality issues. The number of titles with no director is a little concerning, but other than that most of the columns seem very clean and have at least 7000 non-missing entries.
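One way to put numbers on the missing data is to count empty cells per column. The sketch below assumes pandas and uses four invented rows, so the counts are illustrative only.

```python
import pandas as pd

# Invented rows; None marks a missing value, like a blank cell in the CSV.
netflix = pd.DataFrame({
    "title": ["Film A", "Film B", "Series C", "Series D"],
    "director": ["Jane Doe", None, None, "John Roe"],
    "cast": ["Actor A", "Actor B", None, "Actor C"],
    "country": ["United States", "India", "Japan", None],
})

# Missing values per column, largest first.
missing = netflix.isna().sum().sort_values(ascending=False)

# Share of rows missing in each column, as a percentage.
missing_pct = (netflix.isna().mean() * 100).round(1)

print(missing)
print(missing_pct)
```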

We can also see some interesting facts, such as that David Attenborough is the top cast member and Raul Campos and Jan Suter are the top directors. Unsurprisingly, the United States is the top location that the films were produced in.

There are some surprising things about this data. The duration for TV shows may be a little difficult to interpret, as it is given as a number of seasons instead of total duration or average episode duration. Also, note that multiple directors and cast members can be listed for each title, so we might have to take some time to parse them into individual people.

Now that we know what this data includes, we can clean the data by removing columns we don't need. We will sort individual column data later when we get to the Data Exploration and Visualization part.
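A minimal sketch of the column-dropping step, assuming pandas. Which columns to drop is my assumption based on the columns described earlier as less useful; the rows are invented.

```python
import pandas as pd

# Invented rows with a subset of the dataset's columns.
netflix = pd.DataFrame({
    "show_id": ["s1", "s2"],
    "type": ["Movie", "TV Show"],
    "title": ["Film A", "Series B"],
    "date_added": ["January 1, 2020", "March 5, 2021"],
    "release_year": [2019, 2020],
    "description": ["Placeholder.", "Placeholder."],
})

# Assumed choice: drop the columns that will not be analysed.
unused = ["show_id", "date_added", "description"]
netflix_clean = netflix.drop(columns=unused)

print(netflix_clean.columns.tolist())
```

drop returns a new frame, so the original netflix frame keeps all of its columns in case they are needed later.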

Second Part: Data Exploration and Visualization

1. Type of Netflix Titles

Let's look at what the Netflix titles are categorised by in regards to type.

Since the type column takes only two values, "Movie" and "TV Show", we can split the larger Netflix dataframe into two separate dataframes, one for each type.
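The split can be sketched as follows, assuming pandas. The frame and the names movies and tv_shows are illustrative, and the rows are invented.

```python
import pandas as pd

# Invented rows standing in for the full dataset.
netflix = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "title": ["Film A", "Series B", "Film C", "Film D"],
})

# Boolean masks select the rows of each type; .copy() gives independent
# frames that can be modified later without touching the original.
movies = netflix[netflix["type"] == "Movie"].copy()
tv_shows = netflix[netflix["type"] == "TV Show"].copy()

# value_counts gives the Movie vs TV Show totals directly.
type_counts = netflix["type"].value_counts()
print(type_counts)
```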

Let's take a look at the number of TV shows vs the number of Movies on Netflix.

As we can see below, there are far more movies than TV shows. This makes sense, as the TV show industry started later than the film industry and there are probably more films out there that Netflix can license.

2. Analysis on Release Dates

Let's take a look at the data we were given regarding release dates for the movies and tv-shows.

As it is the only numerical data we have, we can take a more quantitative look at it.

Below is a boxplot and histogram of the data. It is apparent that most of the releases are concentrated in the 10 year period from 2010-2020.

The bar chart below gives us a better idea of the years with the most releases. It is unsurprising that the top 5 consist of the last 5 years.
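The numbers behind the boxplot and the bar chart can be computed directly. This is a minimal sketch assuming pandas, with invented release years rather than the real column.

```python
import pandas as pd

# Invented release years standing in for the real release_year column.
release_year = pd.Series(
    [1995, 2016, 2017, 2017, 2018, 2018, 2018, 2019, 2019],
    name="release_year",
)

# Count, mean, std, min, quartiles and max: the numbers a boxplot draws.
summary = release_year.describe()

# Titles per year, most common year first: the data behind the bar chart.
top_years = release_year.value_counts().head(3)

# Share of titles released in 2010-2020 (both ends inclusive).
share_2010_2020 = release_year.between(2010, 2020).mean()

print(summary)
print(top_years)
print(share_2010_2020)
```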

3. Release Dates in regards to COVID

I had a hunch that the decrease in releases in the last two years might be related to COVID. The graph below shows the change in the number of TV shows and movies released alongside the corresponding COVID period. Since COVID began, the number of releases for both movies and TV shows has decreased. This makes sense, as most film crews were unable to congregate during the pandemic.
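A per-year count split by type is the table such a graph would be drawn from. The sketch below assumes pandas and uses invented rows, so it shows the shape of the computation rather than the real trend.

```python
import pandas as pd

# Invented rows; only the two columns needed for the comparison.
netflix = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "Movie"],
    "release_year": [2018, 2019, 2019, 2019, 2020, 2020],
})

# Rows are years, columns are types, cells are title counts
# (0 where a type has no titles in that year).
per_year = pd.crosstab(netflix["release_year"], netflix["type"])

# Restrict to the years around the start of the pandemic.
recent = per_year.loc[2018:2020]
print(recent)
```

A drop in the most recent year can also reflect when the data was collected, since titles released after the snapshot cannot appear in it.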

4. Country Analysis

The bar charts below show the countries with the highest number of movie and tv show releases. It is interesting that India is second in the list, highlighting the rise of Bollywood and Netflix's decision to include more Indian films on its platform (perhaps to increase viewers from Asian communities).

There are a total of 120 countries in the dataset. However, each movie or TV show can list multiple countries, which inflates the counts of countries with a lot of capital in film production, such as the United States.
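Because one cell can hold several countries, the column has to be split before counting. Below is a minimal sketch assuming pandas and a comma-separated country field, with invented rows. It also shows one possible way to sidestep the inflation, counting only the first-listed country, which is my suggestion rather than something done in the analysis above.

```python
import pandas as pd

# Invented rows; the first title lists two countries, the last has none.
netflix = pd.DataFrame({
    "title": ["Film A", "Film B", "Series C"],
    "country": ["United States, India", "India", None],
})

# One row per (title, country): split on commas, explode the lists,
# then trim the spaces left over from ", ".
countries = (
    netflix["country"]
    .dropna()
    .str.split(",")
    .explode()
    .str.strip()
)
countries = countries[countries != ""]

country_counts = countries.value_counts()
n_countries = countries.nunique()

# Alternative: credit each title only to its first-listed country.
primary_country = netflix["country"].str.split(",").str[0].str.strip()

print(country_counts)
print(n_countries)
```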

5. Rating

The bar charts below show the content ratings for both movies and TV shows. It is interesting that there are a lot more R-rated titles among the movies. Perhaps this is because TV channels would not allow R-rated content on their platforms.

6a. Duration for Movies

Below, I have analysed the distribution of films based on their length. Because the duration data is of a different type for movies and TV shows, I have separated the data for the two subcategories. Some highlights from the movies data are that the mean duration was 99 minutes and the longest movie was a whopping 312 minutes long (however, that movie is not really that long, as it is an interactive film with multiple endings, making it an outlier).
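To compute statistics like the mean, the duration text first has to become a number. This sketch assumes pandas and that movie durations look like "99 min"; the rows are invented, so the mean here is not the real 99-minute figure.

```python
import pandas as pd

# Invented movies; durations are strings, as in the dataset.
movies = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "duration": ["90 min", "120 min", "45 min"],
})

# Pull out the leading digits and convert them to integers.
movies["minutes"] = (
    movies["duration"].str.extract(r"(\d+)", expand=False).astype(int)
)

mean_minutes = movies["minutes"].mean()
longest_title = movies.loc[movies["minutes"].idxmax(), "title"]

print(mean_minutes)
print(longest_title)
```

Looking up the longest title by name, as in the last step, is a quick way to spot outliers such as the interactive film mentioned above.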

6b. Duration for TV Shows

Unlike the movies, the TV shows have their duration given as a number of seasons. As expected, most of the shows have only one season, which makes sense as many newer TV shows have not yet released all their seasons.
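The season distribution can be tabulated in a few lines. This is a sketch assuming pandas and duration strings such as "1 Season" or "2 Seasons"; the shows are invented.

```python
import pandas as pd

# Invented TV shows; durations are season counts stored as text.
tv_shows = pd.DataFrame({
    "title": ["Series A", "Series B", "Series C", "Series D"],
    "duration": ["1 Season", "2 Seasons", "1 Season", "5 Seasons"],
})

# The number before "Season(s)" becomes an integer column.
tv_shows["seasons"] = (
    tv_shows["duration"].str.extract(r"(\d+)", expand=False).astype(int)
)

# How many shows have 1, 2, 3, ... seasons, ordered by season count.
season_counts = tv_shows["seasons"].value_counts().sort_index()

# Fraction of shows with exactly one season.
share_one_season = (tv_shows["seasons"] == 1).mean()

print(season_counts)
print(share_one_season)
```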

7. Genre (listed_in)

Both TV shows and movies are described by the same set of genres. However, note that each movie and TV show was given multiple categories, so broader category names tend to show up more often in the dataset. This explains why the top 4 categories are all very broad (international, dramas and comedies). In total there were 41 different categories and, ignoring the category of International Movies, the top genre on Netflix would be Dramas.
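Counting genres while setting aside the catch-all category can be sketched like this, assuming pandas and a comma-separated listed_in field. The rows are invented.

```python
import pandas as pd

# Invented rows; each title lists more than one genre.
netflix = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show"],
    "listed_in": [
        "Dramas, International Movies",
        "Comedies, Dramas",
        "TV Dramas, International TV Shows",
    ],
})

# One row per (title, genre); ignore_index gives the result a clean index.
genres = netflix.assign(genre=netflix["listed_in"].str.split(",")).explode(
    "genre", ignore_index=True
)
genres["genre"] = genres["genre"].str.strip()

genre_counts = genres["genre"].value_counts()
n_genres = genres["genre"].nunique()

# Top genre once the broad "International Movies" label is set aside.
top_genre = genre_counts.drop("International Movies").idxmax()

print(genre_counts)
print(n_genres, top_genre)
```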

8. Director

The bar chart below shows the directors that have the most releases on Netflix. There were a total of 4478 directors in the database.

9. Director Analysis (United States vs non-United States)

There are 2117 directors releasing movies/TV shows in the United States and 2464 directors releasing movies/TV shows outside the United States. Note that some of the directors releasing movies/TV shows in the US also release titles internationally: the two groups add up to 4581, more than the 4478 directors in the dataset, so at least 103 directors appear in both.
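The overlap between the two groups can be made explicit with sets. This sketch assumes pandas and one possible definition: a director counts as "US" if any of their titles lists the United States, and as "non-US" if any of their titles lists another country. The rows and names are invented.

```python
import pandas as pd

# Invented rows; directors and countries can both be comma-separated lists.
netflix = pd.DataFrame({
    "director": ["Jane Doe", "Jane Doe, John Roe", "Ana Poe", None],
    "country": ["United States", "India", "United States, Japan", "India"],
})

# Build one row per (director, country) pair.
pairs = netflix.dropna(subset=["director", "country"]).copy()
pairs["director"] = pairs["director"].str.split(",")
pairs["country"] = pairs["country"].str.split(",")
pairs = pairs.explode("director", ignore_index=True)
pairs = pairs.explode("country", ignore_index=True)
pairs["director"] = pairs["director"].str.strip()
pairs["country"] = pairs["country"].str.strip()

is_us = pairs["country"] == "United States"
us_directors = set(pairs.loc[is_us, "director"])
non_us_directors = set(pairs.loc[~is_us, "director"])

# Directors in both groups are why the group sizes exceed the total.
both = us_directors & non_us_directors
total = len(us_directors | non_us_directors)

print(len(us_directors), len(non_us_directors), len(both), total)
```

The same pattern would apply to the cast column by swapping the column name.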

10. Cast Analysis

The bar chart below shows the cast members that have the most releases on Netflix. There were a total of 32,881 cast members in the database.

11. Cast Analysis (United States vs non-United States)

There are 13,786 cast members with movies/TV shows in the United States and 21,222 cast members with movies/TV shows outside the United States. Note that some of the cast members appearing in US titles also appear in international movies/TV shows: the two groups add up to 35,008, more than the 32,881 cast members in the dataset, so at least 2,127 appear in both.

Follow up Questions.

I really enjoyed investigating this dataset and I will definitely spend more time exploring it and the relations between its columns/variables. I would like to combine this data with another database such as IMDb so that I can create a database that shows how good each movie is. Additionally, I would like to replicate the recommendation feed that Netflix produces using NLP resources provided by sources like Google. Maybe if I had more time on my hands and access to data from platforms like Hulu and HBO, I could even generate a bigger database showing all movies/TV shows and which platform they are free to watch on. Thank you for taking the time to read through this blog and my data analysis of Netflix.

Link: https://chrislim1234.github.io/