Ask the Expert feat. Sam Jenkins

As part of our ‘Ask the Expert’ blog series, we had an in-depth chat with Sam Jenkins, MLOps Lead at Ultraleap.

In this article, we explore Sam’s journey from chemical engineering to MLOps, discussing effective strategies for scaling machine learning models, monitoring model performance, and integrating ethical considerations.

Welcome, can you please introduce yourself?

Hi there! I’m Sam Jenkins and I’m the MLOps Lead at Ultraleap, a User Interface company that uses Computer Vision Machine Learning models to create hand tracking and haptics technologies. Here I’m responsible for the design, implementation and maintenance of the MLOps system and underlying infrastructure.

Prior to this I worked as a data scientist at a scientific instrumentation company called Malvern Panalytical, who were looking to implement machine learning solutions on their IoT data. At that time several companies were pushing for AI solutions on their existing products but hadn’t quite considered the full extent of the infrastructure requirements, so given my background, I took on the engineering work for the data platform in support of the modelling.

My background originally was in chemical / process engineering (now I think about it, you could consider MLOps to be the “process engineering” of the AI industry…). My experience in that field, as well as a large amount of systems engineering and modelling in MATLAB, definitely helped me along the way!

In the context of MLOps, what are the most effective strategies you’ve seen for scaling machine learning models to handle increasingly large and complex datasets? Can you share any specific tools or practices that have been particularly impactful in this area?

To this I would say that you can’t patch over bad system design. Having an understanding of the fundamentals of distributed systems is key to making appropriate decisions on technologies relative to your specific use case, which will allow you to scale more easily.

At Ultraleap we have to maintain the training and deployment / packaging of over 80 models, which makes for some very complex pipelines! Maintaining the lineage of data through to model over many experiments across a large team is important, so we make sure to properly track experiments, maintain metadata across pipelines for historical purposes, and maintain good API interfaces with our software teams to package models downstream.
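As a rough illustration of the lineage idea above (not a description of Ultraleap's actual system), the sketch below ties a model run to the exact dataset, code version and parameters that produced it. The names `LineageRecord`, `hash_dataset` and `fingerprint` are hypothetical and exist only for this example; a tracking tool would normally store this metadata for you.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


def hash_dataset(data: bytes) -> str:
    """Content hash of a dataset snapshot, so any change in the data is detectable."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class LineageRecord:
    """Hypothetical metadata record linking a trained model back to its inputs."""

    model_name: str
    dataset_sha256: str
    code_version: str  # e.g. a git commit hash (assumed convention)
    params: dict = field(default_factory=dict)

    def fingerprint(self) -> str:
        # Serialise with sorted keys so the same inputs always give the same ID.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:16]
```

Two runs with identical data, code and parameters share a fingerprint, while changing any one of them produces a new one, which is the property that makes historical comparison across many experiments tractable.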

More broadly I think a good grounding in DevOps practices is very important. Infrastructure as code, solid CI / CD pipelines, good environment segregation and containerisation will make maintaining your infrastructure easier in support of model training and deployment.

If we’re considering specific tools, Kubernetes obviously works great for controlling scale (more so for deployment than training). If you want good experiment tracking, metadata storage and model registries, look towards ClearML or MLflow, or AzureML / SageMaker if you want to keep consistent with a cloud provider. Picking a stack appropriate for your use case is particularly important, so make sure you consider your business’s unique challenges and edge cases!

Monitoring and management of machine learning models in production can be challenging. How do you approach the monitoring of model performance over time, particularly in terms of dealing with issues like model drift or data shifts? Additionally, what tools or methodologies do you recommend for effective version control and lifecycle management of these models?

Drift can be a complex problem. Is the model drifting because the production data falls outside its training data distribution? Or have the external parameters of what is considered a quality prediction changed, i.e. sentiment / concept drift? I think having a very good understanding of the domain you’re working in is essential here, which is where talented data scientists are worth their weight in gold! This is the starting point; then you can think about setting up monitoring / alerting for model drift based on your KPIs.
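For the first kind of drift, where production data moves away from the training distribution, one common and simple check is the Population Stability Index (PSI) on a single feature. The sketch below is a minimal pure-Python version under stated assumptions: the function name, the ten equal-width bins and the `eps` floor are illustrative choices, not something taken from the interview. It does not address concept drift, which needs labelled outcomes and domain knowledge.

```python
import math
from typing import Sequence


def population_stability_index(
    reference: Sequence[float],
    production: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between a training-time sample and a production sample of one feature."""
    lo, hi = min(reference), max(reference)
    # Equal-width bins over the reference range; fall back to 1.0 if it is constant.
    width = (hi - lo) / bins or 1.0

    def bucket_fractions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp so values outside the training range land in the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor empty bins at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    ref = bucket_fractions(reference)
    prod = bucket_fractions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))
```

A frequently quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but any alert threshold should be tuned against the KPIs that matter for the domain.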

Again we come back to having solid infrastructure, and part of this is having really good monitoring and observability of your system. If you have a distributed MLOps system running on Kubernetes, for example, utilising Prometheus and Grafana, as well as a good centralised logging service (e.g. CloudWatch if you’re using EKS), you’re going in the right direction.

MLOps has a wide range of applications across different sectors. In your experience, how have MLOps practices uniquely evolved to support diverse domains like Large Language Models (LLMs), Computer Vision (CV), and the Internet of Things (IoT)?

The LLM boom means we’re seeing drastically increasing computational requirements, either for training foundational models or fine-tuning existing ones. They also have requirements specific to the field, e.g. transfer learning, RLHF and vector databases, so much so that the industry seems to be coining the term LLMOps, which seems to refer to the extra bits on top of MLOps to support LLMs in production. Everyone loves a buzzword, and LLMOps seems to be the new one.

Tooling surrounding data management in ML has become more mature; maintaining versions of datasets and updating them can be very important in CV use cases. One other interesting aspect of CV is image annotation, and one area of MLOps tooling that I have noticed developing is the growing number of data annotation tools.

How do you see the role of MLOps in driving industry and societal change, particularly through automation and advanced AI applications? What are some of the most significant impacts you’ve observed or predict will occur in the near future due to the advancement of MLOps?

It’s hard to put a finger on the direct impact of MLOps upon industry and societal change, but companies who are building out their AI solutions are now taking the underlying infrastructure and system design more seriously. If you have great MLOps you have a much more efficient process for delivering your product. Going back to my process engineering comparison, you can liken it to an increasingly efficient factory to produce cars. As humans we have vast experience in improving manufacturing efficiency for producing objects in the physical world, and MLOps applies those principles to the digital world. I think we’ll see rapid advancements in certain domains as they’ll be able to train and deploy models much more effectively as we get better and better tooling, reducing the time from research to product.

Written by Scott Rogers, Principal Recruiter, Data Platform & Architecture