Data Visualizations using Python and Seaborn
Data Visualization
Data visualization is the representation of data in visual formats such as graphs and charts. It is important because trends and hidden patterns are much easier for the human brain to spot in a picture than in a table of numbers.
Python provides several data visualization libraries, such as matplotlib, seaborn, plotly, and bokeh. In this article, we learn about data visualization using seaborn.
Seaborn
Seaborn is a Python data visualization library built on top of the matplotlib library. It provides a high-level interface for drawing attractive and informative statistical graphics. The best thing about seaborn is we get a lot of statistical tools that help us understand more about the data.
Before plotting, it helps to divide the features of the dataset into independent variables and dependent variables, because most plots are used to find the relation or pattern between two such variables. Seaborn comes with many different types of plots, and the rest of this article discusses the most common ones.
Installing Seaborn
Seaborn has a few requirements, which you need to have installed:
1. NumPy
2. pandas
3. matplotlib
4. SciPy (optional in recent seaborn releases)
To install seaborn, you should have the above-mentioned libraries installed on your system. In practice, pip normally installs any missing required dependencies for you. To install seaborn, you can use
pip install seaborn
With seaborn installed, we are all set to explore it.
To begin, we import the library as
import seaborn as sns
After importing, we will use the built-in function load_dataset(), which loads one of seaborn's example datasets. The data is downloaded from seaborn's online data repository, so an internet connection is needed the first time.
In this article, we will use the ‘tips’ dataset. It comes from a restaurant: each row records one party of customers and the tip they paid after their meal, based on the total bill. The features are total_bill, tip, sex, smoker, day, time, and size. You can load the dataset by using
import seaborn as sns
df=sns.load_dataset('tips')
df.head()
Output :
The aim with this dataset is to create a model that predicts the tip from the other available features.
Here, tip is the dependent feature, and total_bill, sex, smoker, day, time, and size are the independent features.
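The split into dependent and independent features described above can be sketched with pandas. The four rows below are made-up stand-ins with the same column names as the tips dataset, and the names X and y are only a common convention, not something seaborn requires.

```python
import pandas as pd

# Hypothetical rows with the same columns as the tips dataset (values are made up).
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "tip": [1.01, 1.66, 3.50, 3.31],
    "sex": ["Female", "Male", "Male", "Male"],
    "smoker": ["No", "No", "No", "Yes"],
    "day": ["Sun", "Sun", "Sat", "Sat"],
    "time": ["Dinner", "Dinner", "Dinner", "Dinner"],
    "size": [2, 3, 3, 2],
})

y = df["tip"]               # dependent feature: the value we want to predict
X = df.drop(columns="tip")  # independent features: every other column

print(list(X.columns))
print(y.name)
```

With the real dataset, the same two lines for X and y would apply to the DataFrame returned by load_dataset().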
Correlation Matrices
We will start by plotting the correlation matrix, which seaborn draws with the heatmap() function. A correlation matrix shows the correlation between every pair of features, which tells us how strongly the features are related to one another.
The basic requirement for computing a correlation is that the features are numerical, i.e. the data type must be int or float.
Correlations are not computed for categorical features, which are stored as category or object columns. The values of a correlation matrix range from -1 to +1; this is the Pearson correlation, the default in pandas. Recent pandas versions raise an error when corr() is called on a DataFrame that still contains non-numeric columns, so we select the numeric columns first. To find the correlation you can use
df.select_dtypes(include='number').corr()
Output :
As you can see, we get only 3 features (total_bill, tip, and size) because only these are numerical; the rest are categorical.
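To make the -1 to +1 range and the Pearson formula concrete, here is a small sketch that computes one coefficient by hand and compares it with pandas. The data is synthetic (random values with tips-like column names), so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the tips data (made-up values, same column names).
rng = np.random.default_rng(0)
n = 50
bill = rng.uniform(5, 50, n)
df = pd.DataFrame({
    "total_bill": bill,
    "tip": 0.15 * bill + rng.normal(0, 1, n),
    "size": rng.integers(1, 7, n),
    "sex": np.tile(["Male", "Female"], n // 2),  # categorical, excluded below
})

numeric = df.select_dtypes(include="number")  # keep only int/float columns
corr = numeric.corr()                         # Pearson is the default method

# Pearson r by hand: covariance divided by the product of the spreads.
x, y = numeric["total_bill"], numeric["tip"]
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(corr.round(2))
print(round(r_manual, 2))
```

The hand-computed value should match the total_bill/tip entry of the matrix, and every entry on the diagonal is 1 because each feature is perfectly correlated with itself.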
To visualize it for a better understanding, you can use
sns.heatmap(df.select_dtypes(include='number').corr())
Output :
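A heatmap is often easier to read with the values printed in the cells and the color scale pinned to the full -1 to +1 range. The sketch below uses the annot, fmt, vmin, vmax, and cmap parameters of heatmap() for that; the data is a synthetic stand-in, and the Agg backend line is only there so the code runs without a display (drop it in a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic numeric stand-in for the tips data (made-up values).
rng = np.random.default_rng(0)
n = 50
bill = rng.uniform(5, 50, n)
df = pd.DataFrame({
    "total_bill": bill,
    "tip": 0.15 * bill + rng.normal(0, 1, n),
    "size": rng.integers(1, 7, n),
})

corr = df.corr()
# annot=True writes each coefficient into its cell; vmin/vmax fix the color
# scale so that 0 sits in the middle of the diverging "coolwarm" colormap.
ax = sns.heatmap(corr, annot=True, fmt=".2f", vmin=-1, vmax=1, cmap="coolwarm")
ax.set_title("Correlation matrix")
```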
Scatterplot
Scatter plots place data points along a horizontal and a vertical axis. They show how one variable changes with another and how strongly the two are correlated, so they are a standard way to look for a relationship between two variables.
To draw a scatter plot, we use the relplot() function of the seaborn library, which draws a scatter plot by default. It can be done by using
sns.relplot(x='total_bill', y='tip', data=df)
Output :
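relplot() is a figure-level function that creates its own figure. When you would rather draw on a single matplotlib Axes (for example to set a title), seaborn also offers the axes-level scatterplot(). A minimal sketch on synthetic stand-in data; the Agg line is only for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the tips data (made-up values).
rng = np.random.default_rng(0)
n = 60
bill = rng.uniform(5, 50, n).round(2)
df = pd.DataFrame({
    "total_bill": bill,
    "tip": (0.15 * bill + rng.normal(0, 1, n)).round(2),
})

# scatterplot() returns the matplotlib Axes it drew on,
# so the usual Axes methods can be used to customize the plot.
ax = sns.scatterplot(x="total_bill", y="tip", data=df)
ax.set_title("Tip vs. total bill")
```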
Hue Plot
If you want to distinguish the points of a plot by some feature, you can pass that feature to the hue parameter, which colors the scatter plot by category. You can use hue as
sns.relplot(x= 'total_bill', y='tip', hue='sex', data=df)
Output :
As you can see, the points of the scatter plot are colored by sex.
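hue can be combined with the col parameter of relplot(), which splits the figure into one panel per category. The sketch below colors by sex and facets by time; the data is a synthetic stand-in with made-up values.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the tips data (made-up values).
rng = np.random.default_rng(0)
n = 60
bill = rng.uniform(5, 50, n).round(2)
df = pd.DataFrame({
    "total_bill": bill,
    "tip": (0.15 * bill + rng.normal(0, 1, n)).round(2),
    "sex": np.tile(["Male", "Female"], n // 2),
    "time": np.repeat(["Lunch", "Dinner"], n // 2),
})

# hue colors the points by sex; col draws one panel per value of time.
g = sns.relplot(x="total_bill", y="tip", hue="sex", col="time", data=df)
print(g.axes.shape)  # one row of panels, one column per time value
```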
Pair Plot
If there are more than two numerical features, we can use the pair plot. It is a grid of scatter plots in which every numerical feature is plotted against every other one, so all pairwise combinations of features are covered; the diagonal shows the distribution of each feature on its own. Only numerical columns are plotted. It can be plotted by using
sns.pairplot(df)
Output :
Dist Plot
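Dist Plot helps to create histograms, i.e. the frequency distribution of a continuous variable. Older tutorials use the distplot() function for this, but it has been deprecated since seaborn 0.11. Its replacements are the axes-level histplot() and the figure-level displot(). We use histplot() here, with kde=True to overlay a density curve as distplot() did. It can be created by using
sns.histplot(df['total_bill'], kde=True)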
Output :
Joint Plot
A joint plot helps to learn about the relationship between 2 numeric variables. It combines bivariate analysis (the central plot of one variable against the other) with univariate analysis (the distribution of each variable along the margins). You can plot a joint plot as
sns.jointplot(x='total_bill', y='tip', data=df, kind='hex')
Output :
You can also change the kind parameter to get a different visualization, for example
sns.jointplot(x='total_bill', y='tip', data=df, kind='kde')
Output :
Plotting Categorical Data
So far, we have visualized numerical data, i.e. columns of int or float datatype. Now we will analyze categorical data. In our dataset, there are 4 categorical features: day, sex, time, and smoker.
Count Plot
It shows the counts of observations in each categorical bin using bars. It can be thought of as a histogram across a categorical variable.
sns.countplot(x='sex', data=df)
Output :
As you can see, it draws one bar for each category of the feature.
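The bar heights of a count plot are the same numbers that pandas reports with value_counts(). A small sketch with a made-up day column (the counts per day are chosen arbitrarily for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import pandas as pd
import seaborn as sns

# Made-up day column: 10 x Thur, 5 x Fri, 25 x Sat, 20 x Sun.
df = pd.DataFrame({"day": ["Thur"] * 10 + ["Fri"] * 5 + ["Sat"] * 25 + ["Sun"] * 20})

counts = df["day"].value_counts()
ax = sns.countplot(x="day", data=df)

# Read the bar heights back from the plot (ignoring any zero-width helper patches).
heights = sorted(int(p.get_height()) for p in ax.patches if p.get_width() > 0)
print(heights)  # → [5, 10, 20, 25]
```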
Bar Plot
Bar plot looks similar to count plot, but here we have to specify both x and y. For each category of x, the height of the bar shows an estimate of y (by default the mean), and an error bar indicates the uncertainty of that estimate
sns.barplot(x='smoker' , y='total_bill', data=df)
Output :
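Since barplot() shows the mean of y per category by default, its bar heights should agree with a pandas groupby. The sketch below checks that on synthetic stand-in data with made-up values.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the tips data (made-up values).
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, n).round(2),
    "smoker": np.tile(["Yes", "No"], n // 2),
})

# Mean total_bill per smoker category, computed directly with pandas.
means = df.groupby("smoker")["total_bill"].mean()

ax = sns.barplot(x="smoker", y="total_bill", data=df)
heights = sorted(p.get_height() for p in ax.patches if p.get_width() > 0)

print(means.round(2).to_dict())
```

A different summary can be requested through the estimator parameter of barplot(), for example estimator=np.median.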
Box Plot
The box plot shows the quartile values of the distribution: the box spans the first to the third quartile, with a line at the median, and the whiskers extend to the rest of the data. Points that lie beyond the whiskers are drawn individually as outliers. It is often used in exploratory data analysis. You can plot a boxplot as
sns.boxplot(x='smoker', y='total_bill', data=df)
Output :
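The numbers behind a box plot can be reproduced with pandas. The sketch below computes the quartiles of a small made-up series and flags outliers with the 1.5 x IQR rule, which corresponds to the default whis=1.5 of boxplot().

```python
import pandas as pd

# Made-up bill amounts; 50 is deliberately far away from the rest.
bills = pd.Series([10, 12, 14, 15, 16, 18, 20, 22, 50])

q1, median, q3 = bills.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # interquartile range: the height of the box

# Whisker limits: points outside them are drawn as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = bills[(bills < lower) | (bills > upper)]

print(q1, median, q3)     # → 14.0 16.0 20.0
print(outliers.tolist())  # → [50]
```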
Violin Plot
The violin plot combines a box plot with a kernel density estimate, so it shows both the summary statistics and the shape of the distribution for each category. You can draw a violin plot as
sns.violinplot(x='day', y='total_bill', data=df)
Output :
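The inside of each violin can be changed with the inner parameter. As a sketch, inner="quartile" draws the three quartile lines inside the density shape instead of a miniature box plot; the data below is a synthetic stand-in with made-up values.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the tips data: 15 made-up bills for each day.
rng = np.random.default_rng(0)
days = ["Thur", "Fri", "Sat", "Sun"]
df = pd.DataFrame({
    "day": np.repeat(days, 15),
    "total_bill": rng.uniform(5, 50, 15 * len(days)).round(2),
})

# One violin per day; the quartiles are drawn as lines inside each violin.
ax = sns.violinplot(x="day", y="total_bill", data=df, inner="quartile")
```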
So far, we have discussed various data visualization techniques using seaborn. These techniques should help in your own data science projects. Hopefully you now understand the basics of data visualization using seaborn in Python.