2026 Talks
Testing for Bias in Production ML Services
Video will be populated after the conference
Testing deployed ML services is different from testing traditional software. Unlike traditional software, which produces deterministic outcomes, ML systems are probabilistic and can return different results for the same input over time.
At TinyData, we make tools that help ensure the safety of production ML systems. To demonstrate this, I will showcase the approach we took to testing four commercial ML systems for gender bias. Because we could easily generate datasets for black-box testing, we were able to find large categories of images that produce gender labelling errors. We will then discuss the workflow required to turn these error-producing datasets into training data for improving the systems.
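The core idea of the abstract can be sketched as follows. This is a minimal, hypothetical illustration, not the speaker's actual tooling: `predict` stands in for a call to a commercial ML service, and the sample data, category names, and threshold are all invented for the example. The sketch queries a black-box labeller over a generated dataset, groups mistakes by image category, and surfaces categories with high error rates:

```python
from collections import defaultdict

def find_error_categories(predict, samples, threshold=0.2):
    """Group black-box prediction errors by image category.

    predict: callable mapping one sample to a predicted label
             (in practice, a remote call to the ML service under test)
    samples: iterable of (sample, category, true_label) tuples
    Returns categories whose error rate exceeds `threshold`.
    """
    errors = defaultdict(int)
    totals = defaultdict(int)
    for sample, category, true_label in samples:
        totals[category] += 1
        if predict(sample) != true_label:
            errors[category] += 1
    return {c: errors[c] / totals[c]
            for c in totals if errors[c] / totals[c] > threshold}

# Hypothetical stand-in for the commercial API: pretend the service
# mislabels any image whose name indicates partial occlusion.
def mock_predict(sample):
    return "female" if "occluded" in sample else "male"

samples = [
    ("img_plain_1", "plain", "male"),
    ("img_plain_2", "plain", "male"),
    ("img_occluded_1", "occluded", "male"),
    ("img_occluded_2", "occluded", "male"),
]
print(find_error_categories(mock_predict, samples))
# → {'occluded': 1.0}
```

Categories returned by a function like this become candidates for the second half of the workflow: the error-producing images are labelled and folded back into the training data.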
The AI Conference for Humans Who Ship
While other conferences theorize, AI Council features the engineers shipping tomorrow's breakthroughs today.