Why should we (not) authorize AI solutions' developers to train their AI models on our public data?
Principal Software Engineer, Data Engineering in Energy and Utilities · 3 months ago
It depends on what is expected of the AI model.
1) If the output must include both generic information and private information, training on public data helps; for the private information, RAG can be used.
2) If the output must avoid hallucinated results and is more org/user domain-specific, then RAG is the best approach for contextual grounding.
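The RAG approach described above can be sketched in a few lines. This is a toy illustration only: the document names, the keyword-overlap retriever, and the prompt template are all assumptions standing in for a real vector store and LLM call.

```python
# Minimal sketch of Retrieval-Augmented Generation (RAG) grounding.
# All data and scoring here are illustrative; a production system
# would use embeddings, a vector store, and an LLM completion call.

def score(query: str, doc: str) -> int:
    """Count query terms that appear in the document (toy retriever)."""
    terms = set(query.lower().split())
    return sum(1 for t in terms if t in doc.lower())

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents with the highest term overlap."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Ground the model's answer in retrieved private context."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return (
        "Answer using ONLY the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )

private_docs = [  # hypothetical org-internal documents
    "Invoice 1042 was paid on 2024-03-01.",
    "The holiday policy grants 25 days leave.",
    "Server maintenance is scheduled for Friday.",
]
print(build_prompt("When was invoice 1042 paid?", private_docs))
```

The point of the pattern is that the private data never enters model weights: it stays in a retrievable store and is injected into the prompt at query time, which keeps the base model trainable on public data alone.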
Information Security Analyst in Government · a month ago
Good morning. Public data has data quality issues, and we need to take that into consideration when building any AI model. There are also unintended biases. We've taken the approach of sharing city data publicly via chatbots, while ensuring we have controls in place to review and limit specific responses to public queries. For example, we want users to focus on the scope of city services and data provided by city agencies, not news outside that scope. Start small and build incrementally.
Senior Director - Partner Solutions in Consumer Goods · a month ago
This is a complex question with an even more complex response, short of simply saying: it depends! Innovation and scientific and economic growth are direct factors that would benefit from allowing our public data to be trained on. But it is more complex, and the "it depends" comes in because . . .
- Let's say your data is public but contains some personal information that may be subject to data privacy laws: who will be responsible?
- Let's say there are copyright considerations in your public data: what are your expectations on fair use?
- If you are in the EU, the lifecycle of the data, once indexed, gets even more complicated with respect to GDPR.
Always take into account any privacy concerns. Public data often includes personal information about individuals, and allowing developers unrestricted access to it for training AI models can compromise people's privacy rights. Even if the data is anonymized, there is always a risk of re-identification through data linkage techniques.
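The re-identification risk mentioned above is easy to demonstrate. The sketch below uses entirely made-up data and a naive exact-match join, but it shows the mechanism: "anonymized" records can be linked back to named individuals when the remaining quasi-identifiers (here assumed to be zip code, birth year, and gender) are unique enough.

```python
# Toy illustration of re-identification by data linkage.
# All records are fabricated; real linkage attacks use the same idea
# at scale, often with fuzzier matching.

anonymized = [  # names removed, so nominally "anonymized"
    {"zip": "02139", "birth_year": 1985, "gender": "F", "diagnosis": "flu"},
    {"zip": "02139", "birth_year": 1990, "gender": "M", "diagnosis": "asthma"},
]

public_list = [  # a separate public dataset that still carries names
    {"name": "Alice", "zip": "02139", "birth_year": 1985, "gender": "F"},
    {"name": "Bob",   "zip": "02143", "birth_year": 1990, "gender": "M"},
]

QUASI_IDS = ("zip", "birth_year", "gender")

def link(anon_rows, public_rows):
    """Match anonymized rows to named rows on shared quasi-identifiers."""
    matches = []
    for a in anon_rows:
        hits = [p for p in public_rows
                if all(p[q] == a[q] for q in QUASI_IDS)]
        if len(hits) == 1:  # a unique match re-identifies the person
            matches.append((hits[0]["name"], a["diagnosis"]))
    return matches

print(link(anonymized, public_list))  # → [('Alice', 'flu')]
```

Only the first record links uniquely, but that is enough: Alice's diagnosis is exposed despite her name never appearing in the "anonymized" dataset. Defenses such as k-anonymity work by ensuring no quasi-identifier combination is that unique.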
Then there is the potential misuse of data. Developers might use public data for purposes that are not in the public interest or that go against ethical standards. There is potential for data to be used in ways that harm individuals or groups, such as discriminatory practices in AI decision-making. And how do we control it? Once public data has been released for training, restricting downstream use is difficult, which is why governance needs to be in place up front.
Then we have the point on responsible AI. Using public data without explicit consent raises ethical questions about fairness and justice. It may disproportionately benefit developers and tech companies without providing adequate benefits or protections to the individuals whose data is being used.
So yes, you can use public data, but build some guardrails into your framework so you don't have to struggle to justify it later on.
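The guardrail idea raised here, and the earlier point about limiting a city chatbot to in-scope queries, can be sketched as a simple pre-filter. This is a minimal sketch assuming a keyword allowlist; the topic list, refusal message, and `answer_fn` hook are all hypothetical, and a production system would use a trained classifier or LLM-based moderation instead.

```python
# Minimal sketch of a scope guardrail in front of a chatbot.
# The allowlist approach is illustrative only; real deployments
# typically combine classifiers, moderation APIs, and human review.

IN_SCOPE_TOPICS = {  # hypothetical city-service topics
    "permit", "trash", "recycling", "parking", "water", "library",
}

REFUSAL = ("I can only answer questions about city services. "
           "Please rephrase your question.")

def guarded_reply(query: str, answer_fn) -> str:
    """Answer only if the query touches an allowed topic."""
    words = {w.strip("?.!,").lower() for w in query.split()}
    if words & IN_SCOPE_TOPICS:
        return answer_fn(query)
    return REFUSAL

# Stand-in for the real chatbot backend.
bot = lambda q: f"Routing city-services question: {q}"

print(guarded_reply("When is trash pickup?", bot))
print(guarded_reply("Who will win the election?", bot))  # refused
```

Putting the check in front of the model, rather than relying on the model to refuse, makes the boundary auditable: you can log which queries were refused and tune the scope list without retraining anything.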