Ask HN: What does your ML pipeline look like?

As a DevOps engineer currently working for an ML-based company, and having worked for others in the past, here are my quick suggestions for production readiness.


If you are doing any kind of soft-realtime inference (i.e. not batch processing), exposing a model on a request-response lifecycle, use TensorFlow Serving for concurrency reasons.
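For reference, TensorFlow Serving's v1 REST API accepts predict requests of the form `{"instances": [...]}`. A minimal sketch of building such a request (the host, the default port 8501, and the model name are placeholders; the request is constructed but not sent here):

```python
import json
import urllib.request

def build_predict_request(host, model_name, instances, version=None):
    """Build (but do not send) a TensorFlow Serving REST predict request.

    The URL shape follows TF Serving's v1 REST API; host, port, and
    model name are placeholder assumptions.
    """
    version_part = f"/versions/{version}" if version is not None else ""
    url = f"http://{host}:8501/v1/models/{model_name}{version_part}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )

req = build_predict_request("localhost", "my_model", [[1.0, 2.0, 3.0]])
print(req.full_url)  # http://localhost:8501/v1/models/my_model:predict
```

Sending it with `urllib.request.urlopen(req)` would return the model's predictions as JSON, assuming a serving container is actually listening on that port.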

Version your models and track their training; use something like MLflow for that. Devise a versioning system that makes sense for your organization.
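One minimal versioning convention (hypothetical, not MLflow's own scheme): a model name plus a monotonically increasing integer version, which is roughly what most registries do under the hood:

```python
def next_model_version(existing_versions):
    """Return the next integer version given those already registered."""
    return max(existing_versions, default=0) + 1

def model_uri(name, version):
    """A hypothetical naming convention: <name>/v<version>."""
    return f"{name}/v{version}"

registered = [1, 2, 3]
v = next_model_version(registered)  # 4
print(model_uri("fraud-detector", v))  # fraud-detector/v4
```

The important part is not the scheme itself but that it is applied consistently, and that each version is linked back to the training run and data that produced it.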

If you are using Kubernetes in production, mount NFS in your containers to serve models. Do not download anything (from S3, for instance) at container startup time unless your models are small (<1 GB).
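Concretely, a minimal sketch of such a mount, assuming an NFS server reachable from the cluster (the server address, export path, and image below are placeholders):

```yaml
# Minimal sketch: mount an NFS share holding exported models into a pod.
apiVersion: v1
kind: Pod
metadata:
  name: model-server
spec:
  containers:
    - name: serving
      image: tensorflow/serving
      volumeMounts:
        - name: models
          mountPath: /models
          readOnly: true
  volumes:
    - name: models
      nfs:
        server: nfs.internal.example.com
        path: /exports/models
```

In a real deployment you would more likely use a PersistentVolume/PersistentVolumeClaim pair in a Deployment rather than an inline volume in a bare Pod, but the mount shape is the same.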

If you have to write heavy preprocessing or postprocessing steps, eventually port them to a more efficient language than Python, say Go or Rust.


Do NOT make your ML engineers/researchers write anything above the model stack. Don’t make them write queue-management logic, webservers, etc. That’s not their skill set, and they will write poorer, less performant code. Bring in a backend engineer EARLY.

Do NOT mix and match if you are working on an asynchronous model; i.e. don’t have a callback-based API backed by a mix of queues and synchronous HTTP calls. Use queues EVERYWHERE.
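The pattern can be sketched in-process with the standard library's `queue` module; in production the two queues would be a real broker (RabbitMQ, Kafka, SQS, etc.) and the model call would be a real inference, both of which are stubbed here:

```python
import queue
import threading

requests_q = queue.Queue()  # stands in for a broker's request queue
results_q = queue.Queue()   # callbacks would consume from here

def inference_worker():
    """Consume requests, run the (stubbed) model, publish results."""
    while True:
        item = requests_q.get()
        if item is None:  # sentinel: shut down the worker
            break
        request_id, features = item
        score = sum(features) / len(features)  # stub for a real model call
        results_q.put((request_id, score))
        requests_q.task_done()

worker = threading.Thread(target=inference_worker, daemon=True)
worker.start()

requests_q.put(("req-1", [1.0, 2.0, 3.0]))
requests_q.join()        # wait until the worker has drained the queue
requests_q.put(None)     # stop the worker

result = results_q.get()
print(result)  # ('req-1', 2.0)
```

The point of the shape is that producers never block on the model: if the worker dies, requests sit in the queue instead of being lost, which is exactly the reliability argument made further down the thread.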

DO NOT start new projects in Python 2.7. From past experience, some ML engineers/researchers are quite attached to older versions of Python. Python 2 reaches end of life in 2020, and it makes no sense to start a project with it now.

Can you elaborate on why downloading from S3 at startup is a bad idea? And why not synchronous everywhere as opposed to always queues?

Good points overall that I’d agree with.

Containers are meant to be stateless infrastructure. By downloading something at startup, you’re breaking that contract implicitly. Secondly, depending on where you’re deploying, downloads from S3 (and then loading to memory) may take a non-negligible amount of time that can impact the availability of your pods (again, depending on their configuration).

Synchronicity everywhere may cause request loss if your ML pipeline is not very reliable, which in most cases it isn’t. Relying on a message queuing system will also increase system observability because it’s easier to expose metrics, log requests, take advantage of time travelling for debugging, etc.

The one I’m working with _now_ is very low tech: daily Python jobs reading data from GCP and writing back to GCP, plus a handful of scripts that check everything is reasonable. That’s because we serve internal results, mostly read by humans.

The most impressive description that I’ve seen run live is described here:…

I’d love to have feedback from more than Jan because I’m planning on encouraging it internally.

The best structure I’ve seen at scale (at a top 10 company) was:

– a service that hosted all models, written in Python or R and stored as Java objects (built with a public H2O library);

– Data scientists could upload models by drag-and-drop on a bare-bones internal page;

– each model was versioned (data and training not separate) by name, using a basic folders/namespace/default version increment approach;

– all models were run using Kubernetes containers; each model used a standard API call to serve individual inferences;

– models could use other models’ output as input, taking the parent model’s inputs as their own in the API;

– to simplify that, most models were encouraged to use a session ID or user ID as their single input, with most features gathered from an all-encompassing live store connected to that model-serving structure;

– every model had extensive monitoring of input distribution (especially missing values), output distribution, and response latency, to make sure all of these matched expectations from the trained model;

e.g.: if you are running a fraud model and more than 10% of outputs in the last hour were positive, warn the DS team to check, and consider calling security;

e.g.: if more than 5% of “previous page looked at” inputs are empty, there’s probably a pipeline issue;

– there were some issues with feature engineering: I feel the solution chosen was suboptimal because it created two data pipelines, one for live serving and one for storage/training.

For that problem, I’d recommend that architecture instead:…
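The monitoring rules above (the fraud-rate and missing-input examples) boil down to simple threshold checks over a window of events. A minimal sketch, with the thresholds taken from the comment and the function name being my own invention:

```python
def alert_rate(events, predicate, threshold, message):
    """Return `message` if the fraction of events matching `predicate`
    exceeds `threshold`, else None. Empty windows never alert."""
    if not events:
        return None
    rate = sum(1 for e in events if predicate(e)) / len(events)
    return message if rate > threshold else None

# Rule 1: >10% positive fraud predictions in the window -> page the DS team.
outputs = [0, 1, 0, 0, 1, 0, 0, 0, 1, 1]  # 40% positive
print(alert_rate(outputs, lambda o: o == 1, 0.10,
                 "fraud positives above 10%: check model, consider security"))

# Rule 2: >5% of "previous page looked at" inputs empty -> pipeline issue.
inputs = [{"prev_page": "home"}, {"prev_page": None}, {"prev_page": "cart"}]
print(alert_rate(inputs, lambda i: i["prev_page"] is None, 0.05,
                 "missing 'prev_page' above 5%: likely pipeline issue"))
```

In practice these checks would run against a metrics store rather than in-memory lists, but the threshold logic is the same.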

I see this as a very timely question. As ML has proliferated, so has the number of ways to construct machine learning pipelines. This means that from one project to another the tools/libraries change, the codebases start looking very different, and each project gets its own unique deployment and evaluation process.

In my experience, this causes several problems that hinder the expansion of ML within organizations:

1. When an ML project inevitably changes hands, there is a strong desire from the new people working on it to tear things down and start over.

2. It’s hard to leverage and build upon success from previous projects, e.g. “my teammate built a model that worked really well for predicting X, but I can’t use any of that code to predict Y.”

3. Data science managers face challenges tracking progress and managing multiple data science projects simultaneously.

4. People and teams new to machine learning have a hard time charting a single path to success.

While coming up with a single way to build machine learning pipelines may never be possible, consolidation in the approaches would go a long way.

We’ve already seen that happen in the modeling-algorithms part of the pipeline with libraries like scikit-learn. It doesn’t matter which machine learning algorithm you use; the code will be fit/transform/predict.
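That convention is easy to illustrate without scikit-learn itself: a toy estimator (mine, not from the library) that follows the same interface is interchangeable with any other estimator downstream:

```python
class MeanRegressor:
    """A toy estimator following the scikit-learn-style convention:
    fit() learns from the data, predict() applies what was learned."""

    def fit(self, X, y):
        # "Learn" by memorizing the mean of the targets.
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        # Predict the learned mean for every input row.
        return [self.mean_ for _ in X]

# Any estimator with this interface plugs into the same pipeline code:
model = MeanRegressor().fit([[0], [1], [2]], [10.0, 20.0, 30.0])
print(model.predict([[5], [6]]))  # [20.0, 20.0]
```

Swapping in a gradient-boosted tree or a neural network changes the class name but not the calling code, which is exactly the consolidation being argued for.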

Personally, I’ve noticed this problem of multiple approaches and ad-hoc solutions to be most prevalent in the feature engineering step of the process. That is one of the reasons I work on an open source library called Featuretools that aims to use automation to create more reusable abstractions for feature engineering.

If you look at our demos, we cover over a dozen different use cases, but the code you use to construct features stays the same. This means it is easier to reuse previous work and reproduce results in development and production environments.

Ultimately, building successful machine learning pipelines is not about having an approach for the problem you are working on today, but something that generalizes across all the problems you and your team will work on in the future.

Would be interested if anyone in here has a pipeline operating on regulated data (HIPAA, financial, etc). Having a hard time drawing boundaries around what the data science team has access to for development and experimentation vs. where production pipelines would operate. (e.g. where in the process do people/processes get access to live data)

This is great, thanks for the link.
Could you expand on how this workflow would be different from, or better than, sticking to just something like TFX and TensorFlow Serving? Is it easier to use or more scalable?

It is pretty much the same as TFX, but with Spark for both data prep and distributed hyperparameter optimization/training, and a Feature Store. Model serving is slightly more sophisticated than just TensorFlow Serving on Kubernetes: we support serving requests through the Hopsworks REST API to TF Serving on Kubernetes. This gives us both access control (clients have a TLS cert to authenticate and authorize themselves) and logging of all predictions to a Kafka topic. We are adding support to enrich feature vectors using the Feature Store in the serving API, but we're not quite there yet.

We intend to support TFX, as we already support Flink. Flink/Beam on Python 3 is needed for TFX, and it’s almost, but not quite, there yet.

It will be interesting to see which one of Spark or Beam will become the horizontally scalable platform of choice for TensorFlow. (PyTorch people don’t seem as interested, as they mostly come from a background of not wanting complexity).

Here’s a framework we’ve been developing for this purpose, delivered as a python cookiecutter:

The framework makes use of conda environments, Python 3.6+, Makefiles, and Jupyter notebooks, so it’s worth a look if that tooling fits into your data/ML workflow.

We gave a tutorial on the framework at PyData NYC, and it’s still very much in active development; we refine it with every new project we work on. The tutorial is available from:

While it works well for our projects, the more real-world projects we throw at it, the better the tooling gets, so we’d love to hear from anyone who wants to give it a shot.

Depends on what you’re trying to do.

Are you putting a trained inference model into production as a product? Is it an RL system (a completely different architecture from an inference system)? Are you trying to build a model with your application data from scratch? Are you doing NLP or CV?

As a rule of thumb I look at the event diagram of the application/systems you’re trying to implement ML into, which should tell you how to structure your data workflows in line with the existing data flows of the application. If it’s strictly a constrained research effort then pipelines are less important, so go with what’s fast and easy to document/read.

Generally speaking, you want your ML/DS/DE systems to be loosely coupled to your application data structures – with well defined RPC standards informed by the data team. I generally hate data pooling, but if we’re talking about pooling 15 microservices vs pooling 15 monoliths, then the microservices pooling might be necessary.

Realistically this ends up being a business decision based on organizational complexity.

Thanks for the reply. Could you give some more insight into how and what tools you choose for the different sort of tasks (say NLP vs CV vs RL)? Also, how and why are different tools/pipelines better for production and product building?
