Machine Learning – Interview Questions Part 6

1. What are the different ways to handle missing data and invalid data in a dataset?

1. Ignore the data row

This is a quick solution, typically preferred when the percentage of missing values is relatively low (<5%). It is a crude approach, since you lose data. You can also choose to drop a row only if all of the values in that row are missing.

2. Back-fill or forward-fill to propagate the next or previous value respectively

Note that a NaN value will remain even after forward filling or back filling if a next or previous value isn't available, or if that value is itself NaN.

3. Replace with some constant value outside the fixed value range, e.g. -999 or -1

This method is useful because it makes it possible to group missing values into a separate category represented by a constant value. It is a preferred option when it doesn't make sense to try to predict a missing value.

4. Replace with the mean or median value

This simple imputation method treats every variable individually, ignoring any interrelationships with other variables. It is beneficial for simple linear models and neural networks.
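A minimal pandas sketch of the four options above, using a toy DataFrame with invented column names (age, city):

    import numpy as np
    import pandas as pd

    # Toy frame with missing values (illustrative data).
    df = pd.DataFrame({"age": [25.0, np.nan, 31.0, np.nan, 40.0],
                       "city": ["NY", None, "LA", None, "SF"]})

    # 1. Ignore the data row: drop any row with a NaN,
    #    or drop only rows where every value is missing.
    dropped_any = df.dropna()
    dropped_all = df.dropna(how="all")

    # 2. Forward-fill / back-fill; a NaN survives when no
    #    previous (or next) valid value exists.
    ffilled = df.ffill()
    bfilled = df.bfill()

    # 3. Replace with a constant outside the fixed value range.
    constant = df.fillna({"age": -999, "city": "missing"})

    # 4. Replace a numeric column with its mean or median.
    mean_imputed = df.fillna({"age": df["age"].mean()})
    median_imputed = df.fillna({"age": df["age"].median()})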

2. What are the differences between Labeled and Unlabeled Data?

Labeled data is a group of samples that have been marked with one or more labels. Labeling typically takes a set of unlabeled data and augments each piece of it with meaningful, informative tags.

Unlabeled data consists of pieces of data that have not been tagged with labels identifying characteristics, properties or classifications. Unlabeled data is typically used in unsupervised forms of machine learning.

3. What are the differences between Data Processing, Data Preprocessing and Data Wrangling?

Data Processing

Data Processing is the task of converting data from a given form to a more usable and desired form, i.e. making it more meaningful and informative. The output of this process can take any desired form, such as graphs, videos, charts, tables or images, depending on the task being performed and the requirements of the machine.

Data Preprocessing

Data Preprocessing is a technique used to convert a raw data set into a clean data set. In other words, data collected from different sources arrives in a raw format that is not feasible for analysis.

Hence, a certain sequence of steps is followed to convert the data into a small, clean data set. This set of steps is known as Data Preprocessing. The Data Preprocessing steps are:

  1. Data Cleaning
  2. Data Integration
  3. Data Transformation
  4. Data Reduction
  5. Data Wrangling

Data Wrangling

Data Wrangling is a technique performed at the time of making an interactive model. In other words, it is used to convert raw data into a format convenient for consumption.

Data Wrangling is also known as Data Munging. It follows a sequence of steps: extract the data from different data sources, sort it using a particular algorithm, decompose it into a different structured format and, finally, store it in another database.
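As a rough illustration, such a pipeline in pandas might look as follows; the file name, column names and SQLite database are all hypothetical:

    import sqlite3
    import pandas as pd

    # Extract: pull raw data from a source (hypothetical CSV file).
    raw = pd.read_csv("sales.csv")

    # Sort: order the records, here by a (hypothetical) date column.
    ordered = raw.sort_values("order_date")

    # Decompose: reshape the data into a different, long-form structure.
    long_form = ordered.melt(id_vars=["order_id", "order_date"],
                             var_name="field", value_name="value")

    # Store: write the wrangled result into another database.
    with sqlite3.connect("wrangled.db") as conn:
        long_form.to_sql("sales_long", conn, if_exists="replace", index=False)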

4. What do you mean by Univariate?

Univariate Data

Univariate data consists of only one variable. The analysis of univariate data is the simplest form of analysis, since the information deals with only one quantity that varies. It does not deal with causes or relationships; its main purpose is to describe the data and find patterns within it. Such patterns can be described using measures of central tendency (mean, median and mode), measures of dispersion or spread (range, minimum, maximum, quartiles, variance and standard deviation), and frequency distribution tables, histograms, pie charts, frequency polygons and bar charts.
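For instance, a quick univariate summary of a single variable with pandas (the values are made up):

    import pandas as pd

    marks = pd.Series([45, 50, 50, 62, 70, 70, 70, 85])

    # Central tendency: mean, median, mode.
    print(marks.mean(), marks.median(), marks.mode().tolist())

    # Dispersion: range, quartiles, variance, standard deviation.
    print(marks.min(), marks.max(),
          marks.quantile([0.25, 0.5, 0.75]).tolist(),
          marks.var(), marks.std())

    # Frequency distribution table.
    print(marks.value_counts())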

5. What do you mean by Bivariate Data?

Bivariate Data

Bivariate data consists of two different variables. Bivariate analysis deals with causes and relationships: it is done to find out the relationship between the two variables. Thus, bivariate data analysis involves comparisons, relationships, causes and explanations. The two variables are frequently plotted on the X and Y axes of a graph for a better understanding of the data; one of them is independent while the other is dependent.
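As a small illustration, the strength of such a relationship can be quantified with a correlation coefficient; the data below is invented:

    import pandas as pd

    df = pd.DataFrame({
        "hours_studied": [1, 2, 3, 4, 5, 6],        # independent variable (X)
        "exam_score":    [52, 55, 61, 68, 74, 80],  # dependent variable (Y)
    })

    # Pearson correlation between the two variables.
    print(df["hours_studied"].corr(df["exam_score"]))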

6. What do you mean by Multivariate Data?

Multivariate data consists of three or more variables. It is comparable to bivariate data but contains more than one dependent variable. The way analysis is performed on such data depends on the goals to be achieved. Some of the techniques are regression analysis, path analysis, factor analysis and multivariate analysis of variance (MANOVA).
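A minimal regression-analysis sketch on multivariate data with scikit-learn; the features and numbers are invented:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Three explanatory variables per row: age, years of experience, hours/week.
    X = np.array([[25, 2, 40], [30, 5, 45], [35, 8, 38],
                  [40, 12, 50], [45, 15, 42]])
    y = np.array([30_000, 45_000, 55_000, 72_000, 80_000])  # salary

    model = LinearRegression().fit(X, y)
    print(model.coef_, model.intercept_)
    print(model.predict([[33, 7, 44]]))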

7. What is the difference between Data Analyst and Data Scientist?

Data Scientist

A Data Scientist is a professional who understands data from a business point of view and makes predictions to help businesses take accurate decisions. Data Scientists combine skills in computer applications, modeling, statistics and math. They are efficient at picking the right problems, i.e. those whose solutions will add value to the organization.

Data Analyst

Data Analysts play a key role in Data Science. They perform various tasks related to gathering and organizing data and obtaining statistical information from it. They are also responsible for presenting the data in the form of charts, graphs and tables (also called Data Visualizations) and for using it to build relational databases for organizations.

8. What do you mean by Noise in a given Dataset, and how can you remove Noise from a Dataset?

Noise consists of unwanted data items, features or records which don't help in explaining the feature itself, or the relationship between the feature and the target. Noise often causes algorithms to miss patterns in the data. Noisy data is meaningless data; the term is often used as a synonym for corrupt data, but it also includes any data that cannot be understood and interpreted correctly by machines, such as unstructured text. Any data that has been received, stored or changed in such a manner that it cannot be read or used by the program can be described as noisy data.

Methods to detect and remove Noise in a Dataset

1. K-fold cross-validation

In this method, we look at the cross-validation score of each fold and analyze the folds that have poor CV scores, the common attributes of records in those folds, and so on.
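A short sketch with scikit-learn's cross_val_score on a bundled toy dataset; folds with noticeably worse scores can point to noisy records:

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    X, y = load_diabetes(return_X_y=True)

    # One R^2 score per fold; unusually poor folds deserve inspection.
    scores = cross_val_score(Ridge(), X, y, cv=5)
    for fold, score in enumerate(scores):
        print(f"fold {fold}: R^2 = {score:.3f}")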

2. Manual method

In this method, we evaluate each record individually (predicted vs. actual) and filter/analyze the records with a poor score. This helps us analyze why the noise is occurring in the first place.

3. Density-based anomaly detection

This method assumes that normal data points occur in a dense neighborhood, while abnormalities lie far away from it.
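One common implementation of this idea is the Local Outlier Factor; a minimal scikit-learn sketch with made-up points:

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    # A dense cluster plus two far-away points (likely noise).
    X = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9], [1.0, 0.95],
                  [8.0, 8.0], [-6.0, 7.5]])

    lof = LocalOutlierFactor(n_neighbors=3)
    labels = lof.fit_predict(X)   # -1 marks low-density (anomalous) points
    X_clean = X[labels == 1]      # keep only the dense, "normal" points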

9. What do you mean by the Law of Transformation of Skewed Variables?

Different features in a data set may have values in different ranges. For example, in an employee data set, the salary feature may range from thousands to lakhs, while the age feature lies within 20 to 60. This means one column carries more weight than another. Data transformation mainly deals with normalizing (also known as scaling) the data, handling skewness and aggregating attributes.
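A hedged sketch of the usual fixes, assuming a toy data set with a right-skewed salary column and an age column on a much smaller scale:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"salary": [20_000, 35_000, 50_000, 90_000, 1_500_000],
                       "age": [22, 30, 41, 48, 55]})

    # A log transform compresses the long right tail of a skewed variable.
    df["salary_log"] = np.log1p(df["salary"])

    # Min-max scaling puts both columns on a comparable 0-1 range,
    # so no column dominates by sheer magnitude.
    for col in ["salary_log", "age"]:
        rng = df[col].max() - df[col].min()
        df[col + "_scaled"] = (df[col] - df[col].min()) / rng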

10. What are the differences between Mean, Median and Mode? How are they helpful in dealing with missing values in a given dataset?

Mean

Mean is the average of the Dataset: the ratio of the sum of all observations to the total number of observations.

Median

It is the middle value of the Dataset. If the total number of observations in the Dataset is odd, the median is the middle-most value. If the total number of observations is even, the median is the average of the middle two values.

Mode

Mode is the most frequently occurring observation or value in the entire Dataset.

If the data is normally distributed, missing values can be imputed with the mean of all observations of the Dataset.

If the data is skewed, it is better to impute missing values with the median of all observations of the Dataset.

When the data is related to frequency (e.g. categorical values), the mode is used to impute the missing values in the respective Dataset.
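Putting these rules into a small pandas sketch; the series and the skewness cutoff are illustrative:

    import numpy as np
    import pandas as pd

    s = pd.Series([12.0, 15.0, np.nan, 14.0, 300.0, np.nan, 13.0])

    # Roughly symmetric data -> mean; clearly skewed data -> median.
    if abs(s.skew()) < 1:            # illustrative threshold
        filled = s.fillna(s.mean())
    else:
        filled = s.fillna(s.median())

    # Categorical / frequency data -> mode.
    c = pd.Series(["red", "blue", np.nan, "red", "green", np.nan])
    c_filled = c.fillna(c.mode()[0])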