How do you deal with missing data when using linear regression modelling strategies? What is the best way to inform business users about how this might impact the validity and reliability of the regression model's findings?

Associate Director, Data Science & Analytics in Travel and Hospitality, a year ago

There is no easy way out here, unfortunately. Linear regression cannot be fitted on rows that contain missing values, so you have to either impute the missing values or drop every row that has one. Both approaches can bias inference from the model.

You will have to make a judgment call after analyzing why the values are missing.

Are there patterns in the missing values? Then it is better to impute.

Do only a few columns have missing values, and only a few values in each? Then you may simply drop the rows with missing values.
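As a minimal sketch of that first look at the data, the snippet below counts the share of missing values per column and checks whether rows with a missing value differ on another, fully observed column. The records, the column names (`nights`, `price`, `rating`) and the helper functions are hypothetical and exist only for illustration.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing value.
rows = [
    {"nights": 2, "price": 120.0, "rating": 4.5},
    {"nights": 5, "price": None, "rating": 4.0},
    {"nights": 1, "price": 95.0, "rating": None},
    {"nights": 7, "price": None, "rating": 3.5},
    {"nights": 3, "price": 150.0, "rating": 4.8},
    {"nights": 6, "price": None, "rating": 4.1},
]


def missing_share(rows, column):
    """Fraction of rows in which `column` is missing."""
    return sum(r[column] is None for r in rows) / len(rows)


def mean_by_missingness(rows, target, other):
    """Mean of `other` for rows where `target` is missing vs. present.

    A large gap between the two means hints that the values are not
    missing completely at random. Assumes both groups are non-empty.
    """
    when_missing = [r[other] for r in rows
                    if r[target] is None and r[other] is not None]
    when_present = [r[other] for r in rows
                    if r[target] is not None and r[other] is not None]
    return mean(when_missing), mean(when_present)


shares = {col: missing_share(rows, col) for col in rows[0]}
nights_if_price_missing, nights_if_price_present = mean_by_missingness(
    rows, "price", "nights"
)

print(shares)
print(nights_if_price_missing, nights_if_price_present)
```

In this made-up data, `price` is missing mostly for long stays, which is the kind of pattern that would argue against simply dropping those rows.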

Whole chapters have been written about handling missing values, but there is no single conclusion that you can apply directly.

President & Chief Data Officer in Services (non-Government), a year ago

I pretty much agree with Rajesh's comment.

It depends on how much data is missing, whether the data is missing at random or there is a systematic pattern (bias), the amount of variability in the data, and the presence of outliers. A systematic pattern in the missing data could be problematic. One thing that can help is to eliminate columns with missing data, especially if there is multicollinearity in your dataset (that is, the column with missing data is strongly correlated with other columns in your dataset). I would recommend running the analysis with the missing data omitted (rows and/or columns) and again with imputation, then comparing the results.
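The omit-versus-impute comparison can be sketched as follows. This is an illustrative example with one predictor, made-up numbers, and mean imputation standing in for whatever imputation method you would actually use; `ols_fit` is a hypothetical helper, not a library function.

```python
from statistics import mean


def ols_fit(xs, ys):
    """Return (slope, intercept) of a one-predictor least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical data; None marks a missing predictor value.
x = [1.0, 2.0, None, 4.0, 5.0, None]
y = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]

# Run 1: omit every row with a missing predictor (complete-case analysis).
complete = [(xi, yi) for xi, yi in zip(x, y) if xi is not None]
dropped_fit = ols_fit([xi for xi, _ in complete], [yi for _, yi in complete])

# Run 2: replace missing predictors with the observed mean, keep all rows.
x_mean = mean(xi for xi in x if xi is not None)
x_imputed = [x_mean if xi is None else xi for xi in x]
imputed_fit = ols_fit(x_imputed, y)

print("rows dropped:", dropped_fit)
print("mean imputed:", imputed_fit)
```

On this toy data the two runs agree on the slope but disagree on the intercept. A gap like that between runs is worth showing to business users as a measure of how sensitive the findings are to the missing-data choice.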

Founder, CEO in Services (non-Government), a year ago

This can be tricky, but the optimal approach to handling missing data depends on a few factors. For example:

1. How much do you know about the space, and is its behaviour and shape consistent? That is, do you understand the general behaviour and shape of the context the data comes from (always skewed one way, always normal, etc.)?

2. How many data points are missing? If the missing data points are few, less than 5% of the data set, I can pull a random sample that meets a minimum level of confidence that is practical for the problem being solved (let us say 95%). I have done this once, where a randomized pull gave me zero missing values from the population. Then compare the sample against the whole data set, assessing the shape and key summary statistics. If all is OK, this may be a viable option.

3. Depending on the shape of the data set, replacing missing values with the mean or median may not bias your outcomes enough to lead to a different decision.

4. Decision sensitivity. Do you need directional or precise clarity?
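To illustrate points 2 and 3, here is a rough sketch that measures the share of missing values in a column and compares what mean and median replacement would fill in when the data is skewed. The values and the `fill` helper are hypothetical; the 5% threshold is the rule of thumb mentioned above, not a fixed standard.

```python
from statistics import mean, median

# Hypothetical right-skewed column with gaps marked as None.
values = [10.0, 12.0, None, 11.0, 13.0, 200.0, None, 12.0]

observed = [v for v in values if v is not None]
missing_share = 1 - len(observed) / len(values)

# Rule of thumb from the list above: few missing points means under 5%.
few_missing = missing_share < 0.05


def fill(values, replacement):
    """Return a copy of `values` with every None swapped for `replacement`."""
    return [replacement if v is None else v for v in values]


mean_filled = fill(values, mean(observed))
median_filled = fill(values, median(observed))

print(f"missing share: {missing_share:.0%}, few missing: {few_missing}")
print("mean fill value:  ", mean(observed))
print("median fill value:", median(observed))
```

With a single large outlier in the column, the mean fill value sits far from the typical observation while the median does not. That is why the shape of the data matters when choosing between them.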