How do you deal with missing data when using linear regression modelling strategies? What is the best way to inform business users about how this might impact the validity and reliability of the regression model's findings?

Associate Director, Data Science & Analytics in Travel and Hospitality, a year ago
There is no easy way out here, unfortunately. Linear regression cannot handle missing values, so you have to either impute the missing values or drop every row that contains one. Both approaches can bias any inference from the model.

You will have to make a judgment call after analyzing why the values are missing.

Is there a pattern in the missing values? Then it is better to impute.

Do only a few columns have missing values, and only a few values in each? Then you may just drop the rows with missing values.

There are whole chapters written about handling missing values, but no conclusion that you can directly use.
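
A first pass at "why are the values missing" can be as simple as measuring how much is missing per column and checking whether the outcome looks different on rows where a value is absent. The sketch below is only an illustration: it assumes rows are dictionaries with `None` marking a missing value, and the column names (`nights`, `income`, `spend`) and numbers are made up.

```python
from statistics import mean

def missing_share(rows, column):
    """Fraction of rows where `column` is missing (None)."""
    return sum(r[column] is None for r in rows) / len(rows)

def outcome_gap(rows, column, outcome):
    """Mean outcome on rows where `column` is missing minus the mean
    outcome on rows where it is observed. Returns None when either
    group is empty, so no comparison is possible."""
    missing = [r[outcome] for r in rows if r[column] is None]
    observed = [r[outcome] for r in rows if r[column] is not None]
    if not missing or not observed:
        return None
    return mean(missing) - mean(observed)

# Hypothetical data: `income` is missing on the high-spend rows.
rows = [
    {"nights": 2, "income": 40, "spend": 300},
    {"nights": 3, "income": None, "spend": 900},
    {"nights": 1, "income": 35, "spend": 150},
    {"nights": 4, "income": None, "spend": 1100},
    {"nights": 2, "income": 50, "spend": 350},
]

for col in ("nights", "income"):
    print(col, missing_share(rows, col), outcome_gap(rows, col, "spend"))
```

A large gap suggests the values are not missing at random, which is the situation where simply dropping rows is most likely to distort the model.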
President & Chief Data Officer in Services (non-Government), a year ago
I pretty much agree with Rajesh's comment.

It depends on how much data is missing, whether the data is missing at random or there is a systematic pattern (bias), the amount of variability in the data, and the presence of outliers. A systematic pattern in the missing data could be problematic. One thing that can help is to eliminate columns with missing data, especially if there is multicollinearity in your dataset (the column with missing data is strongly correlated with other columns). I would recommend running the analysis once with the missing data omitted (rows and/or columns) and again with imputation, then comparing the results.
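
The "run it both ways and compare" suggestion can be sketched as follows. This is a minimal illustration, not a recommended pipeline: the data is simulated, the predictor names `x1` and `x2` are hypothetical, about 20% of `x2` is assumed to go missing completely at random, and the small `ols` helper solves the normal equations directly so the example needs only the standard library.

```python
import random
from statistics import mean

def ols(rows, ys):
    """Least-squares coefficients [intercept, b1, b2, ...] obtained by
    solving the normal equations with Gauss-Jordan elimination."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, ys))] for i in range(k)]
    for c in range(k):
        # Partial pivoting: move the largest entry in column c to row c
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Simulated data (an assumption): y = 1 + 2*x1 + 3*x2 + noise,
# with x2 correlated with x1.
random.seed(42)
n = 500
x1 = [random.uniform(0, 10) for _ in range(n)]
x2 = [0.5 * a + random.gauss(0, 2) for a in x1]
y = [1.0 + 2.0 * a + 3.0 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

# Assumption: roughly 20% of x2 is missing completely at random.
x2_obs = [None if random.random() < 0.2 else b for b in x2]

# Run 1: omit every row with a missing value.
kept = [i for i in range(n) if x2_obs[i] is not None]
cc = ols([(x1[i], x2_obs[i]) for i in kept], [y[i] for i in kept])

# Run 2: keep all rows, filling missing x2 with its observed mean.
fill = mean(x2_obs[i] for i in kept)
imp = ols([(x1[i], fill if x2_obs[i] is None else x2_obs[i])
           for i in range(n)], y)

for name, fit in (("rows dropped", cc), ("mean imputed", imp)):
    print(name, [round(c, 3) for c in fit])
```

If the two sets of coefficients tell the same story, the missing data is probably not driving the conclusion. If they diverge, that gap is a concrete way to show business users how sensitive the findings are to the handling of missing values.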
Founder, CEO in Services (non-Government), a year ago
This can be tricky, but the optimal approach to handling missing data depends on a few factors. For example:

1. How well do you know the space? Is the behaviour and shape of the data consistent, i.e., do you understand the general behaviour and shape of the context the data comes from (always skewed one way, always normal, etc.)?

2. How many data points are missing? If only a few are missing, say less than 5% of the dataset, I can pull a random sample sized to a confidence level that is practical for the problem being solved (let us say 95%). I have done this once, where a randomized pull gave me a sample with zero missing values. Then compare the sample against the whole dataset: assess the shape and the key summary statistics. If all looks OK, this may be a viable option.

3. Depending on the shape of the dataset, replacing missing values with the mean or median may not bias your outcomes enough to inform a different decision.

4. Decision sensitivity. Do you need directional or precise clarity?
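
To make point 3 concrete, here is a small sketch of how mean and median replacement shift the summary statistics of a skewed column. The values are invented for illustration and `fill` is a hypothetical helper, not a library function.

```python
from statistics import mean, median, pstdev

def fill(values, how):
    """Replace None with the mean or the median of the observed values."""
    observed = [v for v in values if v is not None]
    stat = mean(observed) if how == "mean" else median(observed)
    return [stat if v is None else v for v in values]

# Hypothetical right-skewed column with two missing entries.
values = [10, 12, None, 11, 95, None, 13, 12]
observed = [v for v in values if v is not None]

for label, data in (
    ("observed only", observed),
    ("mean filled", fill(values, "mean")),
    ("median filled", fill(values, "median")),
):
    print(label, round(mean(data), 3), round(pstdev(data), 3))
```

Mean replacement keeps the column mean unchanged while median replacement pulls it toward the bulk of the data, and both shrink the spread. Whether that matters depends on the decision sensitivity in point 4.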
