## Top 5 Statistical Concepts for Every Data Scientist!

Statistics is an essential tool for Machine Learning and Data Science. Even a basic visualization such as a box plot conveys valuable information, and with the help of statistics we can extract further insights from data in a focused way.

By applying statistical concepts, we learn about the distribution and structure of our data, which tells us where we can apply Machine Learning procedures to extract more information. Let’s dive in.

1. Probability Distribution

2. Over and Under Sampling

3. Accuracy

4. Hypothesis Testing and Statistical Significance

5. Dimensionality Reduction

**1. Probability Distribution**

A **Uniform distribution** assigns equal probability to every value within a specific range, while the probability of anything outside that range is 0. We can think of it as a representation of a categorical variable that is either 0 or 1.

A **Normal Distribution**, also called a Gaussian Distribution, is characterized by its mean and standard deviation. The mean shifts the distribution spatially, while the standard deviation controls the spread. With a Gaussian distribution we know both the average value of our dataset and how widely the data is spread around it.

A **Poisson Distribution** is similar to the Normal but with the addition of skewness. At low skewness values it has a relatively uniform spread in all directions, like the Normal; when the skewness value is high, the spread of the data differs across directions.

There are numerous other distributions worth learning, each helping to interpret a different kind of data, just as the Uniform distribution helps us interpret categorical data.

Statistical intervals and hypothesis tests rely on explicit distributional assumptions.
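As a quick sketch of these ideas, we can draw samples from a uniform and a normal distribution using Python's standard library and check that the sample statistics recover the parameters we chose (the specific parameter values below are made up for illustration):

```python
import random
import statistics

random.seed(42)  # make the illustration reproducible

# Draw samples from a uniform and a normal (Gaussian) distribution
uniform_sample = [random.uniform(0, 1) for _ in range(10_000)]
normal_sample = [random.gauss(mu=5, sigma=2) for _ in range(10_000)]

# The sample mean and standard deviation recover the distribution's parameters
print(statistics.mean(uniform_sample))   # close to 0.5
print(statistics.mean(normal_sample))    # close to 5
print(statistics.stdev(normal_sample))   # close to 2
```

With 10,000 samples the estimates land very close to the true parameters; with fewer samples they fluctuate more, which is exactly why distributional assumptions matter for intervals and tests.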

**2. Over and Under Sampling**

These techniques are used in classification problems where the dataset is biased towards one class. For instance, we may have 1,000 examples of class 1 but only 200 of class 2. We have ML strategies to model the data and make predictions, but in this situation two preprocessing choices help in training our ML models.

Under-sampling means we select only some of the data from the majority class, the same number of samples as the minority class. Now the classes have an equal probability distribution, and the dataset is levelled out by picking fewer samples.

Over-sampling means we replicate the minority class so that it has the same count as the majority class. Now we have levelled out the dataset and the class distribution without collecting extra data.

In the example above, we can address the issue in two ways. Using under-sampling, we select just 200 records from class 1 to match the 200 records of class 2. Alternatively, using over-sampling (up-sampling), we replicate the 200 minority examples up to 1,000 so that both classes have 1,000 examples, which helps the model train better.
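The two options above can be sketched in a few lines of plain Python, using random sampling without replacement for under-sampling and sampling with replacement for over-sampling (the class labels and counts mirror the 1,000-vs-200 example):

```python
import random

random.seed(0)  # reproducible illustration

majority = [("class_1", i) for i in range(1000)]  # 1,000 majority samples
minority = [("class_2", i) for i in range(200)]   # 200 minority samples

# Under-sampling: keep only as many majority samples as there are minority ones
undersampled = random.sample(majority, len(minority)) + minority

# Over-sampling: draw with replacement from the minority until it matches the majority
oversampled = majority + random.choices(minority, k=len(majority))

print(len(undersampled))  # 400 samples, 200 per class
print(len(oversampled))   # 2000 samples, 1000 per class
```

In practice one would more likely reach for a library such as scikit-learn's `resample` utility or the imbalanced-learn package, but the underlying idea is just this.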

**3. Accuracy**

**True positive:** the model predicts positive and the condition is actually present.

**True negative:** the model predicts negative and the condition is actually absent.

**False positive:** the model predicts positive, but the condition is actually absent.

**False negative:** the model predicts negative, but the condition is actually present.

**Sensitivity:** also termed **recall**; measures the proportion of actual positive cases that were predicted as positive (true positives). Sensitivity = TP/(TP+FN).

**Specificity:** measures the proportion of actual negative cases that were predicted as negative (true negatives). Specificity = TN/(TN+FP).

**Precision:** measures the proportion of predicted positives that are actually positive. Precision = TP/(TP+FP).

Accuracy helps to evaluate the performance of models, but in some cases it is not an efficient metric. Precision describes how accurate our model is among the cases it predicted as positive. When the cost of a false positive is high, precision is a good measure to optimize; when the cost of a false negative is high, recall is the better metric for choosing a model.
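The definitions above translate directly into code. A minimal sketch, with hypothetical confusion-matrix counts chosen only for illustration:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the basic classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # recall: share of actual positives found
    specificity = tn / (tn + fp)   # share of actual negatives found
    precision = tp / (tp + fp)     # share of predicted positives that are correct
    return accuracy, sensitivity, specificity, precision

# Hypothetical counts: 100 actual positives, 100 actual negatives
acc, sens, spec, prec = classification_metrics(tp=80, tn=90, fp=10, fn=20)
print(acc, sens, spec, prec)  # 0.85, 0.8, 0.9, and about 0.889
```

Note how the same model can score differently on each metric, which is why the choice of metric should follow the cost structure of the problem.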

**4. Hypothesis Testing and Statistical Significance**

**Null Hypothesis:** the hypothesis that there is no difference between the specified populations.

**Alternative Hypothesis:** the hypothesis that something is happening to the sample observations due to an external cause.

**P-value:** the probability of obtaining a result at least as extreme as the one observed, assuming that the null hypothesis is true. A smaller p-value indicates stronger evidence in favor of the alternative hypothesis.

**Alpha:** the probability of rejecting the null hypothesis when it is true; this is also known as a Type 1 error.

**Beta:** the probability of failing to reject a false null hypothesis; this is also known as a Type 2 error.

Hypothesis testing is an essential step in statistics. It helps us assess two mutually exclusive statements about a population to figure out which one is better supported by the sample data. Statistical significance measures how unlikely the observed result would be if the null hypothesis were true, judged against an acceptable level of uncertainty. Conventionally, a p-value of 5% or lower is considered statistically significant. Statistical hypothesis testing helps determine whether the result from a dataset is statistically significant.
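As a small worked sketch of these definitions, here is a two-sided one-sample z-test built on the standard library's `statistics.NormalDist`. The numbers are hypothetical, chosen so the arithmetic is easy to follow; a real analysis would more likely use a t-test (e.g. `scipy.stats.ttest_1samp`) when the population standard deviation is unknown:

```python
from statistics import NormalDist

def two_sided_z_test(sample_mean, pop_mean, pop_sigma, n):
    """Two-sided one-sample z-test: is the sample mean consistent with pop_mean?"""
    standard_error = pop_sigma / n ** 0.5
    z = (sample_mean - pop_mean) / standard_error
    # p-value: probability of a result at least this extreme under the null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical data: a sample of 100 with mean 52; the null says the mean is 50
z, p = two_sided_z_test(sample_mean=52, pop_mean=50, pop_sigma=10, n=100)
print(round(z, 2), round(p, 4))  # z = 2.0, p about 0.0455 -> significant at alpha = 0.05
```

Since the p-value falls below the conventional alpha of 0.05, we would reject the null hypothesis here; with a sample mean of 51 instead, the same test would not reach significance.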

**5. Dimensionality Reduction**

Dimensionality reduction is the process of reducing the number of dimensions in our dataset. Its purpose is to address the problems that arise with high-dimensional datasets, in other words, datasets with many features. When a dataset contains more features, more samples are needed to cover every combination of features, which increases the complexity of our model. Dimensionality reduction condenses many features into less data, which helps with quicker computing, fewer redundancies, and more accurate models.

In this representation, consider our dataset as a cube with three dimensions and 1,000 points or values. With today’s computational power and techniques, 1,000 records are easy to process, but at a larger scale we may run into problems. However, when we look at our data in a 2-dimensional view, that is, one side of the cube, we can see that it is easy to separate all the colors from this view. Dimensionality reduction makes this projection of 3D data onto a 2D plane possible, effectively reducing the number of values we need to compute to around 100. With vast amounts of data, this reduction yields substantial computational savings while still producing good results.

Feature pruning is another way to perform dimensionality reduction. Here, we remove features that are not important to our analysis.

PCA (Principal Component Analysis) is one of the most well-known statistical techniques for dimensionality reduction. It creates vector representations of the features, based on their correlations, that capture most of their impact on the output.
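A minimal sketch of PCA using NumPy's SVD rather than a full library implementation; the small 2-D dataset below is made up purely for illustration, and real projects would typically use `sklearn.decomposition.PCA`:

```python
import numpy as np

def pca_project(X, k):
    """Project data onto its top-k principal components via SVD."""
    X_centered = X - X.mean(axis=0)          # PCA requires mean-centered data
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:k].T             # rows of Vt are the principal axes

# Hypothetical 2-D data with strongly correlated features
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
Z = pca_project(X, 1)  # six points reduced from 2 dimensions to 1
print(Z.shape)         # (6, 1)
```

Because the two features are highly correlated, the single retained component preserves most of the variance, which is exactly the compression the section describes.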

**Conclusion**

These are just the building blocks of data science; there are many other concepts worth knowing. Statistics helps solve complex problems in the real world, letting data scientists and researchers look for meaningful patterns and changes in data.