I'm trying to have my data scientists focus on ways to spend less time cleaning data, but they always blame our business partners for poor data quality. Besides attacking data quality and blaming others, what are some good initiatives to evaluate that could empower my team to deliver faster / better insights?
Director of Data Architecture in Media, a year ago
I would start with a culture change: from "data scientists vs. business partners" to "data scientists + business partners vs. data quality." Everyone is responsible for data quality. This could be a lengthy post, so I will put some highlights as bullet points for technical initiatives that can help (depending on the organization's operating model: some orgs have MLE and DS as separate entities, some combine these roles, some have DS and data engineering, etc.):
-Make sure data is stored in the right storage objects (cost control, latency, and discoverability)
-Establishing a Feature Store
-MLOps
-Terminating model pipeline jobs upon unit-test failure rather than handling DQ downstream
-CI/CD for model pipelines
-Data Contracts
-Model Observability
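To make the fail-fast point concrete, here is a minimal sketch (assuming pandas; the table and column names are invented) of unit-test-style checks that terminate a pipeline job instead of patching data quality downstream:

```python
import pandas as pd

def check_quality(df: pd.DataFrame) -> None:
    """Unit-test-style data quality checks; raising here terminates the job."""
    assert df["order_id"].notna().all(), "order_id contains nulls"
    assert df["order_id"].is_unique, "order_id contains duplicates"
    assert (df["amount"] >= 0).all(), "negative order amounts"

def run_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    check_quality(df)  # fail fast here rather than handling DQ downstream
    return df.groupby("region", as_index=False)["amount"].sum()

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "region":   ["EU", "EU", "US"],
    "amount":   [10.0, 5.0, 7.5],
})
print(run_pipeline(orders))
```

In practice the same idea is usually expressed through a testing framework or a data contract tool, but the principle is identical: a failed check stops the job before bad data propagates.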
CMO in Services (non-Government), a year ago
Good suggestion. Thank you.
Chief Data Officer in Software, a year ago
If we agree that the most common dimensions of data quality are accuracy, timeliness, uniqueness, and completeness, then I think you will find that a relatively small portion of your DS team's time goes to actual quality issues. What they are calling 'poor quality' is more likely a function of different structures, formats, standards, semantics, etc. These issues arise because data in source systems is optimized for operational use cases, not analytical ones, and this will never change. The best thing you can do is drive a culture change in your team: help them realize that business stakeholders are acting with positive intentions, and that source data exists as it does because conscious decisions were made to optimize it for non-analytical uses.
While wrangling data may be drudgery and the worst part of a data scientist's job, it will never be eliminated. Blaming the business, when the business is operating with the intention of maximizing profits, is counterproductive and disempowering.
Director of BI & Insights in Services (non-Government), a year ago
Addressing data quality as soon as possible is crucial for any data science/analytics/BI team and can drastically increase its value. The team should spend its time creating and optimising models, not cleaning data. There are several proactive initiatives you can consider to empower your data science team to deliver faster and better insights:
-Automated Data Cleaning Tools: Invest in data cleaning and preprocessing tools that can automate routine tasks, such as missing value imputation, outlier detection, and standardization.
-Data Quality Framework: Develop a framework that defines data quality metrics, processes, and, most importantly, responsibilities (who is responsible for what). This framework can help establish clear standards for data accuracy, completeness, and consistency, reducing the potential for poor data quality.
-Collaborative Data Governance: Cross-functional teams, including data engineers, data scientists, and business partners, should raise data issues as early as possible so they can be resolved quickly.
-Education and Training: Provide ongoing training to your team (Data + Business) in data quality best practices, advanced data cleaning techniques, and tools.
-Standardized Data Collection Processes: Work with business partners to establish standardized & automated data collection processes. This helps prevent inconsistent data entry and reduces the need for extensive data cleaning downstream.
-Feedback Loops: Establish regular feedback loops with business partners to collaboratively address data quality issues. Foster a culture of continuous improvement.
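As one illustration of the automated-cleaning point above, here is a sketch using pandas and scikit-learn (the dataset and column names are invented): median imputation for missing values, an IQR-based outlier filter, and standardization.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy data: one missing value per column plus one extreme revenue outlier
df = pd.DataFrame({"revenue": [100.0, 110.0, np.nan, 95.0, 10_000.0],
                   "units":   [10.0, 11.0, 9.0, np.nan, 12.0]})

# 1. Missing-value imputation (median is robust to the outlier)
imputed = pd.DataFrame(SimpleImputer(strategy="median").fit_transform(df),
                       columns=df.columns)

# 2. Outlier detection with the 1.5 * IQR rule on revenue
q1, q3 = imputed["revenue"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = imputed["revenue"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = imputed[mask]

# 3. Standardization to zero mean / unit variance
scaled = pd.DataFrame(StandardScaler().fit_transform(clean),
                      columns=clean.columns, index=clean.index)
print(scaled.round(2))
```

Wrapping steps like these in a shared, versioned library is what turns one-off cleaning scripts into the reusable tooling the bullet describes.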
By combining these initiatives, you can create an environment where your data science team is focused on its core job of delivering faster and better insights, rather than performing mundane data cleaning tasks again and again.
Senior Data and Analytics Leader in Government, 8 months ago
In addition to what has been mentioned in other comments, establishing a centralized data catalog is key: it helps data scientists understand data sources, definitions, and transformations, making it easier for them to work with the data effectively. Additionally, by implementing data integration tools and workflows that automate the combining of data from various sources, you reduce the manual effort required for data cleaning and allow your data scientists to focus more on analysis and insight generation.
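A rough sketch of what an automated integration step can look like, assuming pandas and two hypothetical source extracts; the catalog is imagined as the source of the standard column names:

```python
import pandas as pd

# Two hypothetical source extracts with inconsistent key column names
crm = pd.DataFrame({"CustomerID": [1, 2], "segment": ["SMB", "ENT"]})
billing = pd.DataFrame({"cust_id": [1, 2], "mrr": [500.0, 4000.0]})

# Mapping of source-specific names to catalog-standard names
STANDARD_NAMES = {"CustomerID": "customer_id", "cust_id": "customer_id"}

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Rename columns to the catalog-standard names before joining."""
    return df.rename(columns=STANDARD_NAMES)

# validate="one_to_one" fails loudly if either side has duplicate keys,
# so the integration step doubles as a data quality check
combined = standardize(crm).merge(standardize(billing), on="customer_id",
                                  how="inner", validate="one_to_one")
print(combined)
```

Once such steps are scheduled in a workflow tool, the same cleanup no longer has to be repeated by every data scientist who touches these sources.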
That is, assuming the messy data always arrives in the same or similar formats. If not, it might be valuable to set up a meeting to air each side's grievances and potentially find a solution that streamlines data cleanup.