Statistics – Tutorial:
Every day we come across a lot of information in the form of facts, numerical figures, tables, graphs, etc. These are provided by newspapers, television, magazines, blogs and other means of communication. They may relate to cricket batting or bowling averages, profits of a company, temperatures of cities, expenditures in various sectors of a five-year budget plan, polling results, and so on. Numerical facts or figures collected with a definite purpose are called data (the plural form of the Latin word datum).
Our world is becoming more and more information-oriented. Every part of our lives uses data in one form or another. So it becomes essential for us to know how to extract meaningful information from such data, which is the subject of a branch of mathematics called statistics.
The word statistics appears to have been derived from the Latin word status (a political state), and originally meant the collection of data on different aspects of the life of the people, useful to the State.
Scientists seek to answer questions using rigorous methods and careful observations. These observations, collected from the likes of field notes, surveys, and experiments, form the backbone of a statistical investigation and are called data. Statistics is the study of how best to:
- collect,
- organize,
- analyze,
- interpret and
- present the data.
The topics scientists investigate are as diverse as the questions they ask. However, many of these investigations can be addressed with a small number of data collection techniques, analytic tools, and fundamental concepts in statistical inference.
Let’s get into details of each stage.
When we hear the word population, we typically think of all the people living in a town, state, or country. In statistics, however, a population is the entire group about which some information is to be ascertained. A statistical population need not consist only of people. We can have populations of heights, weights, BMIs, hemoglobin levels, events or outcomes.
A sample is any part of the fully defined population. A syringe full of blood drawn from the vein of a patient is a sample of all the blood in the patient's circulation at that moment. Similarly, 100 patients with schizophrenia in a clinical study are a sample of the population of schizophrenia patients, provided the sample is properly chosen and the inclusion and exclusion criteria are well defined.
The two main types of statistics are descriptive statistics and inferential statistics. As we know, the steps in studying a survey or an experiment are to collect, organize, analyze, interpret and present the data. These steps fall into two groups: collecting, organizing and presenting the data belong to descriptive statistics, while analyzing and interpreting the data (drawing conclusions) belong to inferential statistics.
Suppose we conduct a surprise test in a class and record the marks of the students, because we want to know how many students can attempt the test from memory. We get a list of marks whose value varies from student to student. The item being measured, here the marks, is known as a variable: a characteristic that is studied in a sample or population.
Datum is the singular form of the noun, whereas data is the plural form, although since the 20th century data has also commonly been treated as a singular mass noun. Data can be categorized as either numeric or non-numeric.
Data visualization is the technique of converting a set of data into visual insight. Its main aim is to give the data a meaningful representation. To create an instant understanding of multi-variable data, the data can be displayed as 2D or 3D images using techniques such as colorization, 3D imaging, animation and spatial annotation.
In statistics we deal with huge amounts of data related to a particular survey or experiment. We cannot pinpoint and analyze every individual value to make future predictions. The bulk of the data can be reduced by organizing it into a frequency table or histogram. A frequency distribution organizes the heap of data into a few meaningful categories.
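As a rough illustration of a frequency table, the sketch below groups values into equal-width classes and counts each class. The function name `frequency_table`, the class width and the marks are assumptions made up for this example.

```python
from collections import Counter

def frequency_table(values, bin_width):
    """Group numeric values into equal-width classes and count each class."""
    # Map every value to the lower bound of its class, then count.
    counts = Counter((v // bin_width) * bin_width for v in values)
    return {(low, low + bin_width): counts[low] for low in sorted(counts)}

marks = [12, 25, 37, 41, 18, 29, 33, 45, 22, 38]  # hypothetical test marks
for (low, high), count in frequency_table(marks, 10).items():
    print(f"{low} to {high - 1}: {count}")
```

Ten individual marks collapse into four classes, which is the kind of reduction the frequency distribution is meant to give.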
We have been familiar with the term average since our school days. In statistics, when dealing with a set of quantitative data, we call it the mean. It is computed by adding all the values in the data set and dividing by the number of observations.
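The computation of the mean can be sketched in a few lines of Python. The helper name `mean` and the sample marks are illustrative assumptions; the standard library's `statistics.mean` does the same job.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)

marks = [12, 25, 37, 41, 18]  # hypothetical marks
print(mean(marks))
```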
The median of a numerical data set is the middlemost value when the data is arranged in ascending or descending order. It is the halfway point of the data set and is also known as the positional average. The name echoes geometry: the median of a triangle joins a vertex to the midpoint of the opposite side and divides the triangle's area into two equal halves.
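A minimal sketch of finding the median follows. It assumes the common convention that, for an even number of observations, the median is the average of the two middle values; the function name `median` is chosen for this example.

```python
def median(values):
    """Middle value of the sorted data (mean of the two middle values if even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([7, 1, 5]))     # odd count: the single middle value
print(median([4, 1, 3, 2]))  # even count: average of the two middle values
```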
Along with the mean and median, the mode is one of the measures of central tendency of a data distribution. It is the value that occurs most often, so it does not involve tedious computation and can be found by simply observing how often each data value occurs. When data is tightly clustered around one or two values, the mode is the most meaningful average.
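Counting occurrences to find the mode might look like the sketch below. Returning every value tied for the highest count is a design choice made for this example, so that data clustered around two values reports both; the name `modes` is hypothetical.

```python
from collections import Counter

def modes(values):
    """Return all values that occur most often, in sorted order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([1, 2, 2, 3]))        # one mode
print(modes([2, 3, 3, 5, 5, 7]))  # two modes (bimodal data)
```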
Range is defined as the difference between the maximum and minimum values in a data set. The minimum and maximum values are useful to know and helpful in identifying outliers, but the range is extremely sensitive to outliers and not very useful as a general measure of dispersion in the data.
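The sensitivity of the range to outliers can be seen in a small sketch. The temperature readings, including the outlier of 95, are invented for illustration.

```python
def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)

temps = [21, 23, 22, 24, 23]          # hypothetical city temperatures
print(value_range(temps))             # small spread
print(value_range(temps + [95]))      # one outlier dominates the range
```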
The variance is the average of the squared differences from the mean. The smaller the variance, the closer the data points are to the mean and to each other. A higher variance indicates that the data points are widely spread out from the mean and from each other.
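The definition above translates directly into code. This sketch computes the population variance (dividing by the number of observations); a sample variance would divide by one less. The name `variance` and the data are assumptions for illustration.

```python
def variance(values):
    """Population variance: mean of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data with mean 5
print(variance(data))
```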
The standard deviation is one of the measures of spread. It is the square root of the variance and tells you how tightly the values in a data set are clustered around the mean. If the mean and standard deviation of a normal distribution are known, it is easy to calculate the percentile rank of any given score.
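As a sketch of both ideas, the code below computes a population standard deviation with the standard library and then uses `statistics.NormalDist` (Python 3.8 or later) to estimate a percentile rank. The helper `percentile_rank` and the example mean of 100 and standard deviation of 15 are assumptions, and the result is only meaningful if the scores really are normally distributed.

```python
import statistics

def percentile_rank(score, mean, sd):
    """Percentage of a normal distribution that lies below the score."""
    return statistics.NormalDist(mu=mean, sigma=sd).cdf(score) * 100

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(statistics.pstdev(data))                  # population standard deviation
print(round(percentile_rank(115, 100, 15), 1))  # score one SD above the mean
```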
A fundamental task in any statistical analysis is to characterize the location and variability of a data set. A distribution of data values can be symmetrical or asymmetrical. Skewness is asymmetry in a statistical distribution, where the curve appears distorted or skewed either to the left or to the right.
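One common way to quantify skewness is the moment coefficient, the third central moment divided by the 1.5 power of the second. The sketch below assumes that definition (other definitions exist) and uses invented data; a positive result suggests a longer right tail and a negative one a longer left tail.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

print(skewness([1, 2, 3, 4, 5]))   # symmetric data
print(skewness([1, 2, 3, 4, 15]))  # long right tail
```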
Kurtosis is a measure of the thickness of the tails of a distribution. Outliers in the data have a strong effect on this measure, and it is unitless. The kurtosis of a distribution can be classified as leptokurtic, mesokurtic or platykurtic.
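The three classes can be illustrated with excess kurtosis, the fourth central moment divided by the squared variance, minus 3 (the value for a normal distribution). The function names and the tolerance of 0.1 used to decide what counts as "close to zero" are arbitrary assumptions for this sketch.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3

def classify_kurtosis(values, tolerance=0.1):
    """Label the data by the sign of its excess kurtosis."""
    k = excess_kurtosis(values)
    if k > tolerance:
        return "leptokurtic"   # heavier tails than a normal distribution
    if k < -tolerance:
        return "platykurtic"   # lighter tails than a normal distribution
    return "mesokurtic"

print(classify_kurtosis([1, 2, 3, 4, 5]))
print(classify_kurtosis([0, 0, 0, 0, 0, 0, 0, 0, -10, 10]))
```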
Correlation is used to find relationships between quantitative variables or categorical variables. A positive correlation indicates the extent to which the variables increase or decrease in parallel; a negative (inverse) correlation indicates the extent to which one variable increases as the other decreases.
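For two quantitative variables, one widely used measure is the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below assumes that measure; the function name `pearson_r` and the data are illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

hours = [1, 2, 3, 4]
print(pearson_r(hours, [2, 4, 6, 8]))  # rise together: positive correlation
print(pearson_r(hours, [8, 6, 4, 2]))  # one rises, one falls: negative
```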
The covariance of two variables x and y in a data set is a measure of the directional relationship between them. A positive covariance indicates a positive linear relationship, in which the variables move together. A negative covariance indicates that the variables move inversely.
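A minimal sketch of the population covariance, the average product of the deviations of x and y from their means, is shown below. The name `covariance` and the data are assumptions for illustration; a sample covariance would divide by one less than the number of pairs.

```python
def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

x = [1, 2, 3, 4]
print(covariance(x, [2, 4, 6, 8]))  # positive: the variables move together
print(covariance(x, [8, 6, 4, 2]))  # negative: the variables move inversely
```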
Probability theory is very helpful for making predictions. Estimates and predictions form an important part of research investigation, and statistical methods are used to make these estimates for further analysis. In modern science, probability serves as a substitute for certainty.
A random variable is a variable whose value represents the outcome of a statistical experiment and lies within the sample space (the range of possible values). It is usually denoted by X. The two types of random variables are discrete random variables and continuous random variables.
A probability distribution describes the behavior of a random variable by indicating the likelihood of each event or outcome. To represent a probability distribution, we use equations and tables of variable values and probabilities.
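A table of values and probabilities can be built by enumerating outcomes. The sketch below does so for the number of heads in a few tosses of a fair coin; the coin example and the name `heads_distribution` are assumptions chosen for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def heads_distribution(n_tosses):
    """Probability table for the number of heads in n fair coin tosses."""
    outcomes = list(product("HT", repeat=n_tosses))  # equally likely outcomes
    counts = Counter(outcome.count("H") for outcome in outcomes)
    return {k: Fraction(c, len(outcomes)) for k, c in sorted(counts.items())}

for heads, prob in heads_distribution(2).items():
    print(heads, prob)
```

The probabilities in such a table always add up to 1.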
A Bernoulli distribution is a discrete probability distribution in which the random variable X takes only two possible values, as the outcome of a single Bernoulli trial. One possible value is 1 (success) with probability p and the other is 0 (failure) with probability 1 - p. Here, p denotes the probability of success.
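The two-valued probability function can be written out as a short sketch. The function name `bernoulli_pmf` and the value p = 0.3 are assumptions for illustration.

```python
def bernoulli_pmf(x, p):
    """P(X = x) for a Bernoulli trial with success probability p."""
    if x == 1:
        return p          # success
    if x == 0:
        return 1 - p      # failure
    return 0.0            # no other value is possible

print(bernoulli_pmf(1, 0.3), bernoulli_pmf(0, 0.3))
```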
A binomial distribution is a discrete probability distribution of the number of successes X in a fixed number of independent trials, where each trial has only two possible outcomes, success and failure. The probability of success is p and the probability of failure is q = 1 - p, and both remain constant throughout the experiment.
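The probability of exactly k successes in n such trials is C(n, k) p^k q^(n - k), which can be sketched as below. The function name `binomial_pmf` and the coin-toss numbers are illustrative; `math.comb` requires Python 3.8 or later.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k): probability of exactly k successes in n independent trials."""
    q = 1 - p
    return math.comb(n, k) * p ** k * q ** (n - k)

# Probability of exactly 2 heads in 4 tosses of a fair coin.
print(binomial_pmf(2, 4, 0.5))
```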
Consider a sequence of Bernoulli trials, each ending in failure or success. The geometric distribution gives the number of failures before the first success. For a geometric distribution with probability of success p, the probability that exactly x failures occur before the first success is (1 - p)^x p.
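Under the "failures before the first success" convention used here, the probability of exactly x failures is the chance of x failures in a row followed by one success. A minimal sketch, with the hypothetical name `geometric_pmf`:

```python
def geometric_pmf(x, p):
    """P(exactly x failures occur before the first success)."""
    return (1 - p) ** x * p

# Two tails and then the first head with a fair coin.
print(geometric_pmf(2, 0.5))
```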
In statistics, a distribution where selections are made from two groups without replacing members of the groups is known as the hypergeometric distribution. It is the probability distribution of a hypergeometric random variable.
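As a sketch, suppose a population of N items contains K items of one group (call them successes) and N - K of the other, and n items are drawn without replacement. The probability of drawing exactly k successes is then C(K, k) C(N - K, n - k) / C(N, n). The parameter names and the numbers below are assumptions for illustration.

```python
import math

def hypergeometric_pmf(k, N, K, n):
    """P(exactly k successes) when drawing n items without replacement
    from N items of which K are successes."""
    return math.comb(K, k) * math.comb(N - K, n - k) / math.comb(N, n)

# 10 items, 4 of them defective; chance that a draw of 3 holds exactly 1 defective.
print(hypergeometric_pmf(1, 10, 4, 3))
```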
The Poisson distribution is a discrete probability distribution that gives the probability of a given number of events k occurring in a fixed interval of time. It is used to calculate the probability of a number of successes based on the mean number of successes.
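With a mean of lam events per interval, the probability of exactly k events is lam^k e^(-lam) / k!. The sketch below assumes that standard formula; the name `poisson_pmf` and the rate of 2 events per interval are illustrative.

```python
import math

def poisson_pmf(k, lam):
    """P(exactly k events) when the mean number of events per interval is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Probability of 0, 1, 2 and 3 events when 2 are expected on average.
for k in range(4):
    print(k, round(poisson_pmf(k, 2), 4))
```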
The exponential distribution is a continuous, memoryless distribution that describes the time between events in a Poisson process. It is the continuous analogue of the geometric distribution.
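The memoryless property says that having already waited s units of time does not change the chance of waiting a further t units. The sketch below checks this numerically using the survival function P(T > t) = e^(-rate t); the function names and the rate of 0.5 are assumptions for illustration.

```python
import math

def survival(t, rate):
    """P(T > t) for an exponential waiting time with the given rate."""
    return math.exp(-rate * t)

def conditional_survival(s, t, rate):
    """P(T > s + t | T > s): chance of waiting t more after waiting s."""
    return survival(s + t, rate) / survival(s, rate)

print(survival(2, 0.5))
print(conditional_survival(3, 2, 0.5))  # matches the line above up to rounding
```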
Data sampling is a statistical technique used to select, manipulate and analyze a subset of data points to identify patterns and trends in the larger data set being examined. Data scientists, predictive modelers and data analysts often work with a small, manageable amount of data rather than the full data set.
Random sampling is a technique in which each item in the population has an equal chance of being selected in the sample. The selection of items depends entirely on chance, hence the name 'method of chances'.
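A simple random sample without replacement can be drawn with the standard library's `random.Random.sample`. The wrapper name `simple_random_sample`, the population of 100 numbered items and the seed are assumptions made so the example is repeatable.

```python
import random

def simple_random_sample(population, size, seed=None):
    """Draw `size` distinct items, each item being equally likely to be chosen."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(population, size)

population = list(range(1, 101))  # hypothetical items numbered 1 to 100
print(simple_random_sample(population, 5, seed=42))
```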
Systematic sampling is a probability sampling method in which the sample is chosen from a target population by selecting a random starting point and then selecting every subsequent member at a fixed 'sampling interval'. This sampling interval is calculated by dividing the population size by the desired sample size.
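The procedure can be sketched as follows. This version assumes the population is an ordered list at least as large as the sample, rounds the interval down to a whole number, and picks the random start within the first interval; the function name is hypothetical.

```python
import random

def systematic_sample(population, sample_size, seed=None):
    """Pick a random start, then every interval-th member after it."""
    interval = len(population) // sample_size       # sampling interval
    start = random.Random(seed).randrange(interval)  # random starting point
    return population[start::interval][:sample_size]

population = list(range(100))  # hypothetical ordered population
print(systematic_sample(population, 10, seed=7))
```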
Stratified sampling is a probability sampling technique in which the researcher divides the entire population into different non-overlapping subgroups (strata), then randomly selects the final items proportionally from the different strata. It is also known as proportional random sampling, and should not be confused with quota sampling, which is a non-probability technique.
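A proportional stratified draw might be sketched as below: items are grouped by stratum and the same fraction is sampled at random from each group. The function name, the `stratum_of` callback, the two class sections and the 10% fraction are all assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(items, stratum_of, fraction, seed=None):
    """Randomly sample the same fraction from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)  # each item falls in one stratum
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)     # proportional allocation
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 60 students in section A and 40 in section B.
students = [("A", i) for i in range(60)] + [("B", i) for i in range(40)]
picked = stratified_sample(students, lambda s: s[0], 0.1, seed=1)
print(sum(1 for s in picked if s[0] == "A"), sum(1 for s in picked if s[0] == "B"))
```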
Cluster sampling is a sampling technique that divides the main population into clusters, which may be formed according to parameters such as demographics, habits, background or any other attribute. It allows researchers to collect data by dividing the population into smaller, more manageable groups and sampling some of those groups instead of studying the entire population.
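One simple form, often called one-stage cluster sampling, picks whole clusters at random and keeps every member of the chosen clusters. The sketch below assumes that form; the function name and the village data are invented for illustration.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters and keep all of their members."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # pick cluster names
    return {name: clusters[name] for name in chosen}

# Hypothetical clusters: households grouped by village.
villages = {
    "north": ["n1", "n2", "n3"],
    "south": ["s1", "s2"],
    "east": ["e1", "e2", "e3", "e4"],
    "west": ["w1"],
}
print(cluster_sample(villages, 2, seed=3))
```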
Quota sampling is a non-probability sampling technique in which the assembled sample has the same proportions of individuals as the entire population with respect to known characteristics. The main reason for choosing quota samples is that they allow researchers to sample a subgroup that is of great interest to the study.
Convenience sampling is a non-probability sampling technique in which subjects are selected because they are convenient to reach. Convenience samples are also called accidental samples because the subjects happen to be selected by accident for the study.
Snowball sampling is a non-probability sampling technique in which the researcher begins with a small group of known individuals and expands the sample by asking those initial participants to identify others who should participate in the study.
A sampling method is called biased if the survey sample does not accurately represent the population. Sampling bias is sometimes called ascertainment bias or systematic bias. The term refers both to the sample and to the method of sampling. Bias can be intentional or unintentional. Sometimes even a poor measurement process can lead to bias.
A sampling distribution is the probability distribution of a statistic (such as the mean, mode, median, standard deviation or range) computed from randomly selected samples of a population, rather than from the entire population. This distribution helps in hypothesis testing, where we judge the likelihood of an outcome.
The central limit theorem states that, given a sufficiently large sample size drawn from a population with a finite variance, the means of repeated samples from that population will be approximately normally distributed around the mean of the population.
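A small simulation can make this concrete. The sketch below repeatedly draws samples from a uniform population on [0, 1), which is far from normal, and looks at the sample means. The sample size of 30, the 2000 repetitions and the seed are arbitrary assumptions; for this population the means should centre near 0.5 with a spread near sqrt(1/12) / sqrt(30).

```python
import math
import random
import statistics

def sample_means(n_samples, sample_size, seed=0):
    """Means of repeated random samples from a uniform [0, 1) population."""
    rng = random.Random(seed)
    return [
        sum(rng.random() for _ in range(sample_size)) / sample_size
        for _ in range(n_samples)
    ]

means = sample_means(2000, 30)
expected_spread = math.sqrt(1 / 12) / math.sqrt(30)
print(round(statistics.mean(means), 3), round(statistics.pstdev(means), 3))
print(round(expected_spread, 3))
```

Plotting a histogram of `means` would show the familiar bell shape even though the underlying population is flat.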
A standard normal distribution is a normal distribution with mean 0 and standard deviation 1; the area under its curve is 1 and it is of infinite extent. When the unit of measure is changed to standard deviations from the mean, all normal distributions become equivalent to the standard normal distribution.
A z-score, or standard score, indicates how many standard deviations an element is from the mean. A z-score can be calculated using the formula z = (x - μ) / σ, where x is the value, μ is the mean and σ is the standard deviation.
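The standard-score calculation, the distance of a value from the mean measured in standard deviations, can be sketched in Python as follows. The function name `z_score` and the example score of 85 with mean 70 and standard deviation 10 are assumptions for illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

print(z_score(85, 70, 10))  # above the mean: positive z-score
print(z_score(60, 70, 10))  # below the mean: negative z-score
```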