I'm trying to have my data scientists focus on ways to spend less time cleaning data, but they always blame our business partners for poor data quality. Besides attacking data quality and blaming others, what are some good initiatives to evaluate that could empower my team to deliver faster / better insights?
Director of Data Architecture in Media, a year ago
I would start with a culture change: from "data scientists vs. business partners" to "data scientists + business partners vs. data quality." Everyone is responsible for data quality. This could be a lengthy post, so I will put some highlights as bullet points for technical initiatives that can help (depending on the organization's operating model: some orgs have MLE and DS as separate entities, some combine these roles, some have DS and data engineering, etc.):
-Make sure data is stored in the right storage objects (cost control, latency, and discoverability)
-Establishing a Feature Store
-MLOps
-Terminating model pipeline jobs upon unit-test failure rather than handling DQ downstream
-CI/CD for model pipelines
-Data Contracts
-Model Observability
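To make the fail-fast point concrete, here is a minimal sketch (assuming pandas; the table and column names are invented) of unit-test-style checks that terminate a pipeline job instead of patching data quality downstream:

```python
import pandas as pd

def check_quality(df: pd.DataFrame) -> None:
    """Unit-test-style data quality checks; raising here terminates the job."""
    assert df["order_id"].notna().all(), "order_id contains nulls"
    assert df["order_id"].is_unique, "order_id contains duplicates"
    assert (df["amount"] >= 0).all(), "negative order amounts"

def run_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    check_quality(df)  # fail fast here rather than handling DQ downstream
    return df.groupby("region", as_index=False)["amount"].sum()

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "region":   ["EU", "EU", "US"],
    "amount":   [10.0, 5.0, 7.5],
})
print(run_pipeline(orders))
```

In practice the same idea is usually expressed through a testing framework or a data contract tool, but the principle is identical: a failed check stops the job before bad data propagates.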
CMO in Services (non-Government), a year ago
Good suggestion. Thank you.
Chief Data Officer in Software, a year ago
If we agree that the most common dimensions of data quality are accuracy, timeliness, uniqueness, and completeness, then I think you will find that a relatively small portion of your DS team's time goes to actual quality issues. What they are calling 'poor quality' is more likely a function of different structures, formats, standards, semantics, etc. These issues arise because data in source systems is optimized for operational use cases, not analytical ones, and this will never change. The best thing you can do is drive a culture change in your team: help them realize that business stakeholders are acting with positive intentions, and that source data exists as it does because conscious decisions were made to optimize it for non-analytical uses.
While wrangling data may be drudgery and the worst part of a data scientist's job, it will never be eliminated. Blaming the business, when the business is operating with the intention of maximizing profits, is counterproductive and disempowering.
Director of BI & Insights in Services (non-Government), a year ago
Addressing data quality as soon as possible is crucial for any data science/analytics/BI team and can drastically increase its value. The team should spend its time creating and optimising models, not cleaning data. There are several proactive initiatives you can consider to empower your data science team to deliver faster and better insights:
-Automated Data Cleaning Tools: Invest in data cleaning and preprocessing tools that can automate routine tasks, such as missing value imputation, outlier detection, and standardization.
-Data Quality Framework: Develop a framework that defines data quality metrics, processes, and, most importantly, responsibilities (who is responsible for what). This framework can help establish clear standards for data accuracy, completeness, and consistency, reducing the potential for poor data quality.
-Collaborative Data Governance: Cross-functional teams, including data engineers, data scientists, and business partners, should raise data issues as early as possible so they can be resolved quickly.
-Education and Training: Provide ongoing training to your team (Data + Business) in data quality best practices, advanced data cleaning techniques, and tools.
-Standardized Data Collection Processes: Work with business partners to establish standardized & automated data collection processes. This helps prevent inconsistent data entry and reduces the need for extensive data cleaning downstream.
-Feedback Loops: Establish regular feedback loops with business partners to collaboratively address data quality issues. Foster a culture of continuous improvement.
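As one illustration of the automated-cleaning point above, here is a sketch using pandas and scikit-learn (the dataset and column names are invented): median imputation for missing values, an IQR-based outlier filter, and standardization.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy data: one missing value per column plus one extreme revenue outlier
df = pd.DataFrame({"revenue": [100.0, 110.0, np.nan, 95.0, 10_000.0],
                   "units":   [10.0, 11.0, 9.0, np.nan, 12.0]})

# 1. Missing-value imputation (median is robust to the outlier)
imputed = pd.DataFrame(SimpleImputer(strategy="median").fit_transform(df),
                       columns=df.columns)

# 2. Outlier detection with the 1.5 * IQR rule on revenue
q1, q3 = imputed["revenue"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = imputed["revenue"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = imputed[mask]

# 3. Standardization to zero mean / unit variance
scaled = pd.DataFrame(StandardScaler().fit_transform(clean),
                      columns=clean.columns, index=clean.index)
print(scaled.round(2))
```

Wrapping steps like these in a shared, versioned library is what turns one-off cleaning scripts into the reusable tooling the bullet describes.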
By combining these initiatives, you can create an environment where your data science team is focused on its core job of delivering faster and better insights, rather than performing mundane data cleaning tasks again and again.
Senior Data and Analytics Leader in Government, 8 months ago
In addition to what has been mentioned in other comments, establishing a centralized data catalog is key: it helps data scientists understand data sources, definitions, and transformations, making it easier for them to work with the data effectively. Additionally, by implementing data integration tools and workflows that automate the combining of data from various sources, you reduce the manual effort required for data cleaning and allow your data scientists to focus more on analysis and insight generation.
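A rough sketch of what an automated integration step can look like, assuming pandas and two hypothetical source extracts; the catalog is imagined as the source of the standard column names:

```python
import pandas as pd

# Two hypothetical source extracts with inconsistent key column names
crm = pd.DataFrame({"CustomerID": [1, 2], "segment": ["SMB", "ENT"]})
billing = pd.DataFrame({"cust_id": [1, 2], "mrr": [500.0, 4000.0]})

# Mapping of source-specific names to catalog-standard names
STANDARD_NAMES = {"CustomerID": "customer_id", "cust_id": "customer_id"}

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Rename columns to the catalog-standard names before joining."""
    return df.rename(columns=STANDARD_NAMES)

# validate="one_to_one" fails loudly if either side has duplicate keys,
# so the integration step doubles as a data quality check
combined = standardize(crm).merge(standardize(billing), on="customer_id",
                                  how="inner", validate="one_to_one")
print(combined)
```

Once such steps are scheduled in a workflow tool, the same cleanup no longer has to be repeated by every data scientist who touches these sources.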
That is, assuming the messy data always arrives in the same or similar formats. If not, it might be valuable to set up a meeting to air each side's grievances and potentially find a solution that streamlines data cleanup.