AI is the future and there’s lots of money to be made from it. But organisations keep making the news over AI governance failings, such as Microsoft’s chatbot that turned racist and Google Photos labelling African-Americans as gorillas.

We’re seeing a growth of ethics and governance councils but with mixed success – Google shut theirs down. Why is good governance proving so difficult? Does there have to be a trade-off between good governance and innovation?

The AI Rush

AI is taking off. The market for AI has been forecast to grow from $9.5bn in 2018 to $118.6bn by 2025. Naturally there is a race to get to the opportunities first.

However, making money from machine learning right now is not necessarily easy. Up to 87% of machine learning projects never go live. Some projects fail because the idea doesn’t work out or the data isn’t available. Even for projects that have a good idea and adequate data, the journey to running the model live isn’t easy.

Getting data, preparing it, keeping it fresh and managing changes to it are all challenging. There can also be demanding hardware needs. The data flowing through the model and the quality of predictions can require monitoring. All this makes for a challenging AI operations landscape.   

One approach is the “move fast and break things” way – just deal with the issues as they come. 

This is especially risky with machine learning as one can break things in a whole range of ways. 

Let’s look at some of them.

Challenges for Running AI in Prod

Machine learning models are only as good as the data that they are trained on. Central to the use of machine learning is taking patterns from known data and reapplying the extracted patterns to new data. If the known data is not representative of the new data then the model will give bad predictions. So a major challenge is getting data that is representative.

The level of quality required of predictions naturally varies with use-case. If the predictions are going to be followed up by a human (e.g. flagging transactions as potential fraud risks) then a poor prediction might not be too serious. If a self-driving car mistakes a stop sign for a speed limit sign then that could be very serious. In some cases even one bad prediction could be intolerable.


Sometimes a model can perform well but occasional predictions might go astray. This is likely to happen if there are particular data points that stand out from the others, known as outliers.

Outliers don’t fit the pattern followed by the rest of the data. Because the model is built to generalise, it will predict on them as though they were within the main pattern. If you can detect outliers then you can plan for them, withholding predictions and/or sending them down a different process. It takes time to get a process like that in place, so this is a risk to mitigate in advance rather than react to.
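As a rough illustration of withholding predictions for outliers, here is a minimal sketch using a simple z-score check on a single numeric feature. The function names, the threshold of 3 standard deviations, the toy model and the data are all assumptions made for this example; real systems usually need multivariate outlier detection.

```python
from statistics import mean, stdev

def fit_bounds(training_values):
    """Summarise one numeric feature of the training data as (mean, std dev)."""
    return mean(training_values), stdev(training_values)

def predict_or_defer(model, value, bounds, max_z=3.0):
    """Return the model's prediction, or None if the input looks like an outlier."""
    mu, sigma = bounds
    z = abs(value - mu) / sigma
    if z > max_z:
        return None  # withhold: route to a manual or fallback process instead
    return model(value)

# Hypothetical training data and model, for illustration only.
training_amounts = [20.0, 25.0, 22.0, 30.0, 27.0, 24.0, 26.0, 23.0]
bounds = fit_bounds(training_amounts)
toy_model = lambda amount: "ok" if amount < 28 else "review"

print(predict_or_defer(toy_model, 25.0, bounds))   # close to the training data
print(predict_or_defer(toy_model, 500.0, bounds))  # far outside it, so withheld
```

Returning None rather than a prediction makes the "different process" an explicit branch that the calling code has to handle.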

Concept Drift

You may get good representative data to train your model on and it might perform well in production. Then it might mysteriously start to perform worse and worse. This can happen if the relationships in the data change in the real world. For example, suppose you’re predicting which fashion items are likely to sell and the season changes: you’ll find your model suggesting swimwear in winter. To avoid losing money you’d need to scramble to get fresh data to train your model on.


There are data points that might correlate highly with certain outcomes but which we shouldn’t use for ethical reasons. For example, we wouldn’t find it acceptable to use race in automated recommendations for parole.

This kind of issue can sneak into a machine learning model by accident. When training a machine learning model, a whole data set might be processed with a large range of features without anyone necessarily choosing particular features explicitly. There could simply be features lurking in the data set that correlate with certain outcomes (and might therefore receive a high weight in the model) but which it would be unethical to use in a decision.

An apparent model bias situation hit the news recently with AppleCard offering David Heinemeier Hansson a much higher credit limit than his wife, despite his wife having a higher credit score. Apple co-founder Steve Wozniak was among those to report the same.

The operators of AppleCard have stated that they don’t use gender as a data-point. However, gender could enter indirectly. For example, if occupation is used this could indirectly bias for gender in some cases as certain occupations are dominated by one gender (e.g. primary school teachers tend to be female).
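One way to look for this kind of indirect route is to check how well a feature on its own lets you guess the protected attribute. The sketch below is only an illustration with made-up data; the function name and the field names are hypothetical, and a real check would also need to consider combinations of features.

```python
from collections import Counter, defaultdict

def proxy_strength(rows, feature, protected):
    """Compare how well `feature` alone guesses `protected` against a baseline.

    Returns (baseline_accuracy, proxy_accuracy). The baseline always guesses
    the most common protected value; the proxy guesses the most common
    protected value within each value of the feature.
    """
    overall = Counter(row[protected] for row in rows)
    baseline = overall.most_common(1)[0][1] / len(rows)

    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[feature]][row[protected]] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return baseline, correct / len(rows)

# Made-up applicants, for illustration only.
applicants = [
    {"occupation": "primary teacher", "gender": "F"},
    {"occupation": "primary teacher", "gender": "F"},
    {"occupation": "primary teacher", "gender": "F"},
    {"occupation": "primary teacher", "gender": "M"},
    {"occupation": "mechanic", "gender": "M"},
    {"occupation": "mechanic", "gender": "M"},
    {"occupation": "mechanic", "gender": "M"},
    {"occupation": "mechanic", "gender": "F"},
]
baseline, proxy = proxy_strength(applicants, "occupation", "gender")
print(f"baseline={baseline:.2f} proxy={proxy:.2f}")
```

A proxy accuracy well above the baseline suggests the feature carries information about the protected attribute, even though that attribute was never given to the model.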

The AppleCard case has attracted regulatory attention and highlights that bias represents a legal risk as well as a reputational risk. New York’s Department of Financial Services stated that an “algorithm that intentionally or not results in discriminatory treatment of women or any other protected class violates New York law.”


The Facebook and Cambridge Analytica scandal has drawn a lot of attention to privacy issues. A key part of that was the sharing of data without adequate consent from the person that the data concerned. There can be risks in this space for machine learning.

A machine learning model is likely to predict similarly to the nearest data points in the training data. Say you’re predicting voting and somebody asks for predictions for retired female voters in a given district. One might not expect that to reveal much about who was surveyed for the training data – but it might if there’s only a handful of retired female voters in that district. It might then be possible for somebody so-inclined to work backwards and figure out things about your data that you didn’t know you were revealing.
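One possible mitigation, sketched here under assumptions, is to refuse to answer group-level queries when too few training rows back the group. The function name, the field names and the minimum size of 10 are illustrative choices for this example rather than an established standard, and this alone is not a complete privacy guarantee.

```python
def safe_group_query(training_rows, matches_group, predict_for_group, min_size=10):
    """Answer a group-level query only if enough training rows support it."""
    support = sum(1 for row in training_rows if matches_group(row))
    if support < min_size:
        return None  # too few people: an answer could reveal who was surveyed
    return predict_for_group()

# Made-up survey data: 3 retired women and 12 working men in district A.
training_rows = (
    [{"district": "A", "retired": True, "gender": "F"}] * 3
    + [{"district": "A", "retired": False, "gender": "M"}] * 12
)
retired_women = lambda r: r["district"] == "A" and r["retired"] and r["gender"] == "F"
working_men = lambda r: r["district"] == "A" and not r["retired"] and r["gender"] == "M"

print(safe_group_query(training_rows, retired_women, lambda: 0.8))  # withheld
print(safe_group_query(training_rows, working_men, lambda: 0.6))    # answered
```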

Range of Risks

So we’re seeing a range of risks. Poor data might lead to poor predictions that cost money. Or poor predictions might be outright dangerous. Both happened with the failed Watson for Oncology project which had about $62M of funding and was shut down after making too many unsafe predictions. 

Even if the training data is representative for a wide range of cases, the real world might lead to unanticipated kinds of data. This happened with Apple’s face recognition software that was tricked by a mask and the self-driving car that wasn’t able to recognise jaywalkers.

Machine learning models can be briefly successful and then get tripped up when the real-world data changes (concept drift). But models can also be too responsive to new data. Microsoft’s famous chatbot was learning continuously from its interactions online and the result was that it became racist. 

It’s also possible to get into hot water by overlooking the ethical dimensions of using particular data points such as gender or race. An example of this was Amazon Rekognition being shown by ACLU to be far more likely to mistake a member of US Congress for a criminal (compared with criminal mugshots) if they were a person of colour.

With increasingly high-profile incidents hitting the news, organisations are concerned to ensure that AI adoption doesn’t come at an unacceptable cost.

Striving for Good Governance


Government and regulators are also concerned. Facebook has even gone so far as to call on government to provide more regulation. We are in the early days of forming regulation as regulators have concerns about what effects greater regulation might have, especially on innovation. But there are some notable attempts emerging.

The European Union’s GDPR legislation states that “the data subject shall have the right not to be subject to a decision based solely on automated processing, including profiling, which produces legal effects concerning him or her or similarly significantly affects him or her.” It adds that data subjects have a right to “meaningful information about the logic involved” and to “the significance and the envisaged consequences” of automated decision-making.

The general thrust is that if AI is used to make a decision that impacts on me then I should be able to challenge that decision and get a meaningful explanation. If concern is raised about the process then the organisation might be challenged on whether the process introduced unnecessary risk.

Guidelines have been issued by the US Federal Reserve for the financial industry on the use of quantitative models (SR 11-7). These also expect organisations not to take unnecessary risk and to be able to justify their design decisions. Furthermore, they call for active monitoring and evaluation of models. The management of risk involved in the design and use of a model should be able to stand up to challenge by an informed party and the institution is expected to have processes in place to ensure that the risk trade-offs of any design are justifiable.

Internal AI Ethics Bodies

Part of the AI governance picture now is organisations setting up councils to oversee guidelines on appropriate use of AI in their organisation. A key problem these councils face is how to close the gap between high-level aspirations like transparency and fairness and actionable advice.

Closing the Gap

So how to close the gap between governance aspirations and data science practice? And can it be done in a way that doesn’t compromise on innovation?

Lessons of DevOps

It is possible to go faster and with better governance. This is one of the lessons of the rise of DevOps. Automating processes lets them run faster and more predictably, and makes tracking more reliable. There may be an upfront cost to automating deployment and monitoring processes but the cost is paid back by the benefits of being able to make iterative improvements more often, gathering more feedback and responding to feedback more quickly.

DevOps for Machine Learning is Special

Unfortunately, more than just existing DevOps practices are needed in order to close the gap between AI governance aspirations and data science practice. 


If something goes wrong with a model then you might naturally want to be able to go back to how it was trained (that particular version of it), make a tweak and train it again. If your process is to stand up to challenge by a potential regulator/auditor then you’ll need to think about whether they’ll ask for this. The practice of being able to go back to a particular version and rebuild it from source is pretty typical for mainstream DevOps but it’s not an easy thing to achieve for machine learning. There can be a lot of data involved, it can change often and it can be transformed as it goes through the process.

An example machine learning pipeline might have several stages before generating a model.

For reproducibility you would want to be able to go back to a particular training run that resulted in a particular model (for example a .pkl file) and find all the versions of the source files (for example .py files) and the versions of the data, as well as who/what initiated the run and when and any metrics (e.g. accuracy).
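To make that concrete, here is a minimal sketch of a record that could be stored for each training run, using content hashes to pin the versions of source, data and model files. The record layout, the function names and the example file names are assumptions for illustration; dedicated tools track this in far more depth.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def file_digest(path):
    """SHA-256 of a file's bytes, used as a version identifier."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def build_run_record(source_files, data_files, model_file, initiated_by, metrics):
    """Collect what is needed to trace a model back to its training run."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "initiated_by": initiated_by,
        "sources": {str(p): file_digest(p) for p in source_files},
        "data": {str(p): file_digest(p) for p in data_files},
        "model": {str(model_file): file_digest(model_file)},
        "metrics": metrics,
    }

# Stand-in files so the sketch runs on its own; contents are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "train.py").write_bytes(b"print('train')\n")
    (tmp / "data.csv").write_bytes(b"age,income\n40,1\n")
    (tmp / "model.pkl").write_bytes(b"not-a-real-model")
    record = build_run_record(
        source_files=[tmp / "train.py"],
        data_files=[tmp / "data.csv"],
        model_file=tmp / "model.pkl",
        initiated_by="ci-bot",
        metrics={"accuracy": 0.91},
    )

print(json.dumps(record, indent=2))
```

Hashing content rather than trusting file names means that a silently changed data file shows up as a different version.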


For full reproducibility you would also need to know exactly what request was made and what prediction was returned. So this data needs to be stored and made retrievable. Full logging of predictions is useful because it gives you:

  1. An audit trail of requests that can be individually investigated.
  2. Fresh data that can be used for training the next iteration of the model.
  3. A stream of data that can be monitored against benchmarks.
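A minimal sketch of such prediction logging, assuming one JSON record per line, might look like this. The field names and the in-memory stream are illustrative assumptions; a production system would write to durable storage.

```python
import io
import json
from datetime import datetime, timezone

def log_prediction(stream, model_version, request, prediction):
    """Append one request/prediction pair to the log as a JSON line."""
    record = {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "request": request,
        "prediction": prediction,
    }
    stream.write(json.dumps(record, sort_keys=True) + "\n")
    return record

def read_log(stream):
    """Read every logged record back, e.g. for an audit or for retraining."""
    stream.seek(0)
    return [json.loads(line) for line in stream if line.strip()]

# In-memory stand-in for a log file, with made-up requests.
log = io.StringIO()
log_prediction(log, "v3", {"amount": 120.5, "country": "GB"}, "not_fraud")
log_prediction(log, "v3", {"amount": 9800.0, "country": "GB"}, "fraud")
records = read_log(log)
print(len(records), records[1]["prediction"])
```

Storing the model version alongside each request is what ties an individual prediction back to a specific training run.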

Monitoring of the data stream could be used to mitigate the risk of concept drift. This would involve watching whether live data is close enough to the training data for predictions to be trustworthy. Alerts could be set up to fire if the data strays too far.
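As a simple illustration of such a check, the sketch below measures how far the mean of a live feature has moved from the training mean, in units of the training standard deviation. The function names, the threshold of 2 and the temperature data are assumptions for this example; proper drift detection typically compares whole distributions across many features.

```python
from statistics import mean, stdev

def drift_score(training_values, live_values):
    """Shift of the live mean from the training mean, in training std devs."""
    return abs(mean(live_values) - mean(training_values)) / stdev(training_values)

def check_drift(training_values, live_values, threshold=2.0):
    """Return the score and whether it is large enough to raise an alert."""
    score = drift_score(training_values, live_values)
    return {"score": score, "alert": score > threshold}

# Made-up daily temperatures: the model was trained on summer data.
training_temps = [18, 20, 22, 21, 19, 20, 23, 17]
late_summer = [19, 21, 20, 22]
winter = [5, 3, 4, 6]

print(check_drift(training_temps, late_summer))  # small shift, no alert
print(check_drift(training_temps, winter))       # large shift, alert
```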


Deploying a new iteration of a model to take over all of live traffic in one go might be risky unless you’re entirely sure that live traffic matches the training data. Given the risks, you might want to tentatively deploy new versions of models to just run against a portion of live data before scaling up to take over from the old version.
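One common way to do this is to route a fixed fraction of requests to the new version. The sketch below hashes a request id so that routing is deterministic; the function name, the labels and the id format are assumptions for illustration, and in practice this is often handled by the serving infrastructure rather than application code.

```python
import hashlib

def route(request_id, canary_fraction):
    """Deterministically send a fixed share of request ids to the new model."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return "candidate" if bucket < canary_fraction * 2**64 else "current"

# Roughly 10% of these made-up request ids should reach the candidate model.
ids = [f"req-{i}" for i in range(10_000)]
share = sum(route(i, 0.1) == "candidate" for i in ids) / len(ids)
print(f"share routed to candidate: {share:.3f}")
```

Because the routing depends only on the id, the same request always goes to the same version, which makes individual predictions easier to investigate later.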


We’ve seen that GDPR asks that data subjects have a right to “meaningful information about the logic involved.” Providing explanations for machine learning decisions can be challenging. Machine learning takes patterns from data and reapplies them but the patterns are not always simple lines that can be visualized. There is a range of techniques and the kind of explanations that are achievable can vary.

There’s an example of the use of one such technique in Seldon’s Alibi explainer library. In that example income is predicted from US census data which includes features such as age, marital status, gender and occupation. A technique called ‘anchors’ is used to reveal which feature values were most relevant to a particular prediction. The technique scans the dataset and finds patterns in the predictions. A particular low-income prediction is chosen and the technique is used to reveal that a separated female would be classified as low-income in 95% of cases, meaning that other features (such as occupation or age) then have very little relevance for separated females.
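The 95% figure is the precision of the anchor rule: the share of cases matching the rule for which the model gives the same prediction. The sketch below computes that quantity for a given rule over a toy dataset. It is not Alibi’s API or implementation (real anchor methods also search for the rule and estimate precision by sampling); the function, the toy model and the rows are all invented for illustration.

```python
def anchor_precision(rows, predict, anchor, target):
    """Share of rows matching `anchor` for which the model predicts `target`."""
    matching = [r for r in rows if all(r[k] == v for k, v in anchor.items())]
    if not matching:
        return None  # the rule covers no rows, so precision is undefined
    hits = sum(1 for r in matching if predict(r) == target)
    return hits / len(matching)

# Hypothetical stand-in for a trained income classifier.
def toy_model(row):
    high = row["occupation"] == "Executive" and row["age"] >= 45
    return ">50K" if high else "<=50K"

# Made-up rows, for illustration only.
rows = [
    {"marital": "Separated", "sex": "Female", "occupation": "Clerical", "age": 30},
    {"marital": "Separated", "sex": "Female", "occupation": "Sales", "age": 52},
    {"marital": "Separated", "sex": "Female", "occupation": "Executive", "age": 50},
    {"marital": "Separated", "sex": "Female", "occupation": "Executive", "age": 35},
    {"marital": "Married", "sex": "Male", "occupation": "Executive", "age": 48},
    {"marital": "Married", "sex": "Female", "occupation": "Sales", "age": 41},
]
anchor = {"marital": "Separated", "sex": "Female"}
print(anchor_precision(rows, toy_model, anchor, "<=50K"))
```

A high precision means that once the anchor conditions hold, the remaining features rarely change the prediction.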

The income classifier example bears a striking resemblance to the incident with AppleCard offering David Heinemeier Hansson a much higher credit limit than his wife. Building explainability into the design would have allowed it to be determined quickly whether gender was the reason for the bias.

There are also explanation techniques applicable to text classification or image predictions. Another Alibi example shows how particular sections of an image can be determined to be especially relevant to how it was classified.

In that example, the nose region of the picture was especially important for determining that the ImageNet picture was of a Persian cat.

Rise of MLOps

We can now see that there’s a range of challenges to tackle for good AI governance. Many organisations are choosing whether to put together an MLOps infrastructure in-house to tackle these challenges or to choose a platform.

ML Platforms

ML Platforms come in a range of flavours. Some are part of a cloud provider offering, such as AWS SageMaker or AzureML. Others are an offering in themselves, such as Databricks MLflow.

The level of automation involved in platforms can vary. For example Google’s AutoML and DataRobot aim to enable models to be produced with minimal machine learning expertise. 

Some platforms are closed source, some are open (such as the Kubernetes-oriented Kubeflow). Some are more oriented towards particular stages of machine learning (e.g. training vs deployment) and the extent of support can vary when it comes to aiding reproducibility, monitoring or explainability. Organisations have to consider their use-cases and priorities when evaluating platforms.

Tool Landscape

Rather than choosing an existing platform, organisations or teams can choose to assemble their own. There is not yet a ‘canonical stack’ of obvious choices. Instead there’s a range of choices for each part of the pipeline and a particular assemblage of choices might not have been designed to work together. Teams have to evaluate tools for how well they fit their needs and also how well they fit their other tool choices. 

The Kubeflow platform represents an interesting position here as it aims to bring together best-of-breed open source tools and ensure they work well together. Seldon, where I am an Engineer, specialises in open source tools for deployment and governance. We partner with Kubeflow on tools such as KFServing. We also offer Seldon Deploy, an enterprise product that brings open source tools together to get the best out of them and speed up project delivery. Seldon Deploy can be used either standalone or together with other platforms to add an extra layer of visibility and governance.

Industry Initiatives

The Linux Foundation’s LF AI project aims to foster open source collaboration on machine learning tools. It has a great visualization of the tool landscape, which is large.

The Institute for Ethical AI is a volunteer-led research centre with over 150 expert members and has put a particular focus on providing practical tools and showing how to follow its principles. The Institute has created an AI explainability library, a framework to evaluate machine learning suppliers and listings of MLOps tools.

The Practical AI Ethics Alliance was recently formed by Dan Jeffries of Pachyderm (a company specialising in data pipelines, versioning and lineage). The alliance has put an emphasis on closing the gap between AI governance aspirations and realisation. It has emphasised auditing and explainability as well as suggesting the formation of AI QA/response teams.

This is just some of what’s going on in this fast-moving space.

Move Fast with Stable Infrastructure

Part of the future for machine learning is to follow Facebook’s “move fast with stable infrastructure” motto. MLOps infrastructure will enable organisations to go faster and with greater governance. Right now MLOps is a space of innovation. Projects looking to leverage MLOps have to first learn how to navigate the space and take advantage of it in a way that best suits their situation and aims.
