Ask HN: What does your ML pipeline look like?
318 points by mallochio on April 21, 2019 | 49 comments
Experienced machine learning professionals - How do you create scalable, deployable and reproducible data/ML pipelines at your work?



As a DevOps Engineer working for an ML-based company (and having worked for others in the past), these are my quick suggestions for production readiness.

DOs:

If you are doing any kind of soft-realtime (i.e. not batch processing) inference by exposing a model through a request-response lifecycle, use TensorFlow Serving for concurrency reasons.
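
For illustration, a minimal sketch of a request-response call against TensorFlow Serving's REST interface; the host, port (8501 is the default REST port), model name and payload shape are placeholders that depend on your deployment and model signature:

    import requests

    # Placeholder host and model name; "instances" must match the model's input signature.
    resp = requests.post(
        "http://tf-serving:8501/v1/models/my_model:predict",
        json={"instances": [[1.0, 2.0, 5.0]]},
        timeout=1.0,
    )
    resp.raise_for_status()
    predictions = resp.json()["predictions"]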

Version your models and track their training. Use something like MLflow for that. Devise a versioning system that makes sense for your organization.
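
A minimal MLflow tracking sketch (the experiment name, params and metrics here are purely illustrative):

    import mlflow

    mlflow.set_experiment("fraud-detector")  # illustrative experiment name

    with mlflow.start_run():
        mlflow.log_param("learning_rate", 1e-3)
        mlflow.log_param("data_version", "2019-04-21")
        mlflow.log_metric("val_auc", 0.91)
        # Log the trained artifact too, e.g. mlflow.sklearn.log_model(model, "model")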

If you are using Kubernetes in production, mount NFS in your containers to serve models. Do not download anything (from S3, for instance) at container startup time unless your models are small (<1 GB).

If you have to write some sort of heavy preprocessing or postprocessing steps, eventually port them to a more efficient language than Python. Say Go, Rust, etc.

DO NOTs:

Do NOT make your ML engineers/researchers write anything above the model stack. Don't make them write queue management logic, webservers, etc. That's not their skillset, they will write poorer and less performant code. Bring in a Backend Engineer EARLY.

Do NOT mix and match if you are working on an asynchronous model, i.e. don't have a callback-based API and then have a mix of queues and synchronous HTTP calls. Use queues EVERYWHERE.
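
As an illustration of the queues-everywhere pattern, here is a minimal sketch using a Redis list as the broker; the queue names and the run_model stub are placeholders, and RabbitMQ, SQS, Kafka, etc. work the same way conceptually:

    import json
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def run_model(request):
        return 0.5  # stand-in for the actual inference call

    def enqueue(payload):
        # API layer: push the request and return immediately; the client gets a callback later.
        r.lpush("inference:requests", json.dumps(payload))

    def worker():
        # Model worker: block until a request arrives, run inference, push the result.
        while True:
            _, raw = r.brpop("inference:requests")
            request = json.loads(raw)
            result = {"request_id": request["request_id"], "score": run_model(request)}
            r.lpush("inference:results", json.dumps(result))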

DO NOT start new projects in Python 2.7. From past experiences, some ML engineers/researchers are quite attached to the older versions of Python. These are ending support in 2020 and it makes no sense to start a project using them now.


+1, tomasdpinho. Yes to everything, notably the queues everywhere, versioning the models, and the issue of mixing sync and async (go for queues).

As a scientist designing risk management systems, I also like to:

. avoid moving the data;

. bring the (ML/stats) code to the data;

. make in-memory computations (when possible) to reduce latency (network+disk);

. work on live data instead of copies that drift out-of-date; and

. write software to keep models up to date because they drift with time too and that's a major, operationally un-noticed, and extremely costly problem.

I'm not yet into Tensor/ML-Flow, but I use R, JS, and Postgres, thereby relying on open-source eco-systems (and packages) that are:

. as standard as possible;

. well-maintained;

. with a long expected support; and

. with as few dependencies as possible.


+2 for bringing the (ML/stats) code to the data instead of the other way around


Could you speak to your experience with this particular list item?


We deal with fairly large volumes of data on a frequent basis so it would not make sense for each data scientist to create a copy within their own environment. Everyone works off a centralized data source and we provide them with Jupyter/Spark in an internal cloud environment.


Can you elaborate on why downloading from S3 at startup is a bad idea? And why not synchronous everywhere as opposed to always queues?

Good points overall that I'd agree with.


Containers are meant to be stateless infrastructure. By downloading something at startup, you're breaking that contract implicitly. Secondly, depending on where you're deploying, downloads from S3 (and then loading to memory) may take a non-negligible amount of time that can impact the availability of your pods (again, depending on their configuration).

Synchronicity everywhere may cause request loss if your ML pipeline is not very reliable, which in most cases it isn't. Relying on a message queuing system will also increase system observability because it's easier to expose metrics, log requests, take advantage of time travelling for debugging, etc.


> Containers are meant to be stateless infrastructure. By downloading something at startup, you're breaking that contract implicitly.

I feel that mounting an NFS partition is a similar break of contract. I.e. you could see the same image behave differently depending on what's in the NFS partition. I feel like to get data in a "reproducible" way you need to pull it from a data versioning system. I think there are different ways to implement data versioning, each with its own trade-offs. NFS and S3, among others, could be used to implement data versioning.

I agree with you that in theory an NFS is more performant because it allows you to load lazily.


Curious about how you'd scale with data versioning.

In any type of realtime, high bandwidth feed, I feel like what you're suggesting isn't cost effective for the benefits it provides.

If you need absolute reproducibility and back-testing, or your feed is lower bandwidth, it may make sense. But not for larger systems.


Interesting topic. :)

This is mainly relevant if your data is used for training.

It seems like you'd want to use a log-based system like kafka to manage versioning and state in this case. I imagine you could:

1. Store incoming training data in a "raw data" topic.

2. A model trainer consumes incoming training data, updates the model's state, and at a pre-determined period writes the model's state as of a given offset in the "raw data" topic to a "model state checkpoint" topic.

3. Then you probably have some "regression testing" workflow that reads from the "model state checkpoint" topic and upon success writes to a "latest best model" topic.

4. Workers that use the model in production read from the "latest best model" topic and update their state upon a change.

I imagine you could add constraints about "model" continuity or gradual release to production that would make the process more complex, but I feel like fundamentally kafka solves a lot of the distributed systems problems.
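
A minimal sketch of steps 3-4 with kafka-python (the topic names, broker address and model URI are illustrative, not from any real setup):

    import json
    from kafka import KafkaConsumer, KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode(),
    )

    # Step 3: once regression tests pass, publish a pointer to the new model along
    # with the raw-data offset it was trained up to.
    producer.send("latest-best-model",
                  {"model_uri": "s3://models/example/v42", "raw_data_offset": 123456})
    producer.flush()

    # Step 4: production workers watch the topic and hot-swap the model on change.
    consumer = KafkaConsumer(
        "latest-best-model",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode()),
    )
    for msg in consumer:
        print("would reload model from", msg.value["model_uri"])  # swap in your real loader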


> By downloading something at startup, you're breaking that contract implicitly.

Nitpicking here, but if you can ensure that a certain version is downloaded, then the contract isn't violated.


Excuse my ignorance, but why is an NFS better than S3? Both are loading from disk to memory of the Tensorflow Serving container, aren't they?


NFS is faster and it looks like a normal filesystem to the app so you don't need any special file I/O code.



Python 2.7 in 2019 is the best Python 2.7 there has ever been -- which is to say, it works very, very well. There's "more heat than light" on this particular topic. Do not start new projects in Python 2.7 -- OK, fine. However, we're not done with Python 2.7 here.


Depends on what you're trying to do.

Are you putting a trained inference model into production as a product? Is it a RL system (completely different architecture than an inference system)? Are you trying to build a model with your application data from scratch? Are you doing NLP or CV?

As a rule of thumb I look at the event diagram of the application/systems you're trying to implement ML into, which should tell you how to structure your data workflows in line with the existing data flows of the application. If it's strictly a constrained research effort then pipelines are less important, so go with what's fast and easy to document/read.

Generally speaking, you want your ML/DS/DE systems to be loosely coupled to your application data structures - with well defined RPC standards informed by the data team. I generally hate data pooling, but if we're talking about pooling 15 microservices vs pooling 15 monoliths, then the microservices pooling might be necessary.

Realistically this ends up being a business decision based on organizational complexity.


Thanks for the reply. Could you give some more insight into how and what tools you choose for the different sort of tasks (say NLP vs CV vs RL)? Also, how and why are different tools/pipelines better for production and product building?


How you parse and manage the inputs is significantly different between those types.

With NLP as one example, you need to determine when you are going to do tokenization - i.e., break the inputs up into "tokens." So do you do this at ingest, in transit, or at rest?

With CV you don't need to do tokenization at all (probably).

So the tools really come out of the use case and how/when you put them into the production chain.
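
For instance, on the tokenization timing question, a toy sketch of the tokenize-at-ingest option (the regex tokenizer is only illustrative; a real pipeline would use spaCy, BPE, etc.):

    import re

    def tokenize(text):
        # Crude word/punctuation tokenizer, purely for illustration.
        return re.findall(r"\w+|[^\w\s]", text.lower())

    def ingest(record):
        # Tokenizing at ingest: pay the cost once and store tokens next to the raw text,
        # so training and serving never tokenize inconsistently.
        record["tokens"] = tokenize(record["text"])
        return record

    print(ingest({"text": "Tokenize at ingest, in transit, or at rest?"}))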


Here are some things that have helped me make shipping models easier.

- Tag and version your datasets, especially your test set, so that you are confident when you compare performance across models or over time.

- Test your training pipeline. Run it on a single batch or a tiny dataset. The whole test should take less than 1 minute to confirm that your training pipeline didn't break. Once one test works, people will write others (see the sketch after this list).

- You should be able to measure the accuracy of your model with a single command, or as part of your CI process. This command should spit out a single report (PDF or html) which is all you need to look at to decide whether to ship a new model or not, including run-time performance (if needed).

- Don't create hermetic training environments that prevent you from doing debugging. Sometimes you just need to ssh in and put in some print statements to track down a problem.
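
A minimal sketch of such a training smoke test (pytest style), using a toy logistic-regression step as a stand-in for the real training code:

    import math
    import numpy as np

    def train_one_step(X, y, w, lr=0.1):
        # One gradient step of logistic regression; stands in for your real pipeline.
        p = 1.0 / (1.0 + np.exp(-X @ w))
        loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
        w -= lr * X.T @ (p - y) / len(y)
        return loss, w

    def test_training_smoke():
        rng = np.random.RandomState(0)
        X = rng.normal(size=(8, 3))            # one tiny, fixed batch
        y = (X[:, 0] > 0).astype(float)
        loss, _ = train_one_step(X, y, w=np.zeros(3))
        assert math.isfinite(loss)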


The one I’m working with _now_ is very low tech: daily Python jobs processing data from GCP and writing back to GCP, plus a handful of scripts that check everything is reasonable. That’s because we serve internal results, mostly read by humans.

The most impressive description that I’ve seen run live is described here: https://towardsdatascience.com/rendezvous-architecture-for-d...

I’d love to have feedback from more than Jan because I’m planning on encouraging it internally.

The best structure that I’ve seen at scale (at a top 10 company) was:

- a service that hosted all models, written in Python or R, stored as Java objects (built with a public H2O library);

- Data scientists could upload models by drag-and-drop on a bare-bones internal page;

- each model was versioned (data and training not separate) by name, using a basic folders/namespace/default version increment approach;

- all models were run using Kubernetes containers; each model used a standard API call to serve individual inferences;

- models could use other models’ output as input, taking the parent-model inputs as their own in the API;

- to simplify that, most models were encouraged to use a session id or user id as a single entry, and most inputs were gathered from an all-encompassing live storage, connected to that model-serving structure;

- every model had extensive monitoring for the distribution of inputs (especially missing values), outputs, and response delay, to make sure everything matched expectations from the trained model;

e.g.: if you are training a fraud model, and more than 10% of output in the last hour was positive, warn the DS to check and consider calling security;

e.g.: if more than 5% of “previous page looked at” are empty, there’s probably a pipeline issue;

- there were some issues with feature engineering: I feel like the solution chosen was suboptimal because it created two data pipelines, one for live and one for storage/training.

For that problem, I’d recommend that architecture instead: https://www.datasciencefestival.com/video/dsf-day-4-trainlin...


What does "model" mean? What kind of data are contained in a machine learning model? Second, how do you decide a model is robust? I'm asking because I'm looking at using ML to more efficiently use some quality assurance tools for a product line. The idea is to develop a model such that product A, B, C can have existing (or supplementary data) QA data plugged into a model, and then an appropriate sampling plan can be output.

An intern showed proof of concept of such a model based on one product, and it's fantastic work that could save thousands of dollars, but we're struggling with how to "qualify" it. How do we know we won't get a "garbage in/garbage out" situation?


So you want to figure out how often you need to sample products for QA?

A model is two things: a description of what's in the black box (could be a linear model, a neural network architecture, etc) and some weights which uniquely define "that specific model". Each model will have some known input (eg image, tabular data) and output (eg number, image, list etc).

You need to store both the structure and weights: for example your model is y = mx + c, but you need to know m and c to uniquely define it.
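
A toy illustration of storing both pieces, using the y = mx + c example (the file name and values are made up):

    import json

    # "Structure": the functional form of the model.  "Weights": the numbers learned
    # from data that make it *this* specific model.  Both are needed to reproduce it.
    structure = {"type": "linear", "formula": "y = m*x + c"}
    weights = {"m": 2.5, "c": -0.3}

    with open("model_v1.json", "w") as f:
        json.dump({"structure": structure, "weights": weights}, f)

    def predict(x, weights):
        return weights["m"] * x + weights["c"]

    print(predict(4.0, weights))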

To answer your second question robustness means a smart test strategy. Train on representative data, validate during training on a second data set and test on hold-out data that the model has never seen.

Unfortunately it's quite hard to prove model robustness (in the case of deep learning anyway), you have to assume that you trained on enough realistic data.

If you really have no idea about robustness, then you should probably do a kind of soft-launch. Run your model in production alongside what you currently use, and see whether the output makes sense.

You could try, for example, sampling with your current strategy as well as the schedule defined by your ML model (so you lose nothing but a bit of time if the ML system is crap). Then compare the two datasets and see whether the ML model is at least performing the same as your current method.

Surely you can make some naive estimates of robustness though? eg if the model says sample 5% of your product, you then have a bound on the chance that you miss something.


1. What I’m working on at the moment is AB-testing, so no real models there; plenty of simulations and tests though.

2. There are several videos of Jan describing his work, including that one, so I’ll let him give examples of what he means by models: https://www.datasciencefestival.com/video/dsf-day-2-jan-teic...

3. At the big company, it’s an e-commerce website with many products along many dimensions, so models about what aspect of the product customers would be interested in, whether they are likely to commit to purchasing now or just browsing; price sensitivity against other factors. They typically have non-authenticated users, so they have to guess a lot about the users, from time of day, country of connection, type of device used, browsing rhythm — the inferences are not perfect, but they inform how the product is presented, and have a meaningful impact on conversion.

4. In the presentation at Trainline, they are not explicit about what models they have in mind, but it’s also an e-commerce company, so a lot of similar decisions.

One unique problem they had talked about openly before (UK train companies are not really reactive, but British people love their festivals, championship matches, protests, horse- and dog-races, and drinking during all of the above): they deal with the occasional crowded train, so they are trying to predict if a train is going to be swamped and if the person booking is going to the event in question. In the latter case, they’d rather avoid the loud fans or drunken top-hatted horse-owners.

For all of the above models, the models are trying to predict something that they can have ground-truth about (typically: buying behaviour), often based on data obtained minutes later. That means all are monitoring the model accuracy, typically off-line. In most cases, they are also monitoring the impact of the use of the model: better recommendations should lead to better conversion, but also, say, a higher MRR.


We're building a deployment system based on the following principles that we've learned:

- Version control of models and datasets

- Code, naming conventions and formatting consistency

- Testing of models before deployment into production (in most cases, this helps gain credibility across engineering)

- Model review process with a senior data scientist

- Kubernetes/Docker for deployment serving as an API or running a scheduled job

- Model performance monitoring when in production to identify degradation


Could you also give us some details about the software you specifically use in the pipeline, other than Kubernetes/Docker? Do you use any available (possibly open source) tools for versioning and monitoring? Or are you building this all up from scratch for your needs?


We use Jenkins for our builds (and soon simple test cases), which then feeds into our system that's built from scratch. However, we are looking to productize the system, as we've had some discussions with other corporate partners who are interested in it as well.

The system we've built in-house is fairly simple to keep development fast - versioning is manual (https://packaging.python.org/guides/hosting-your-own-index/) but there's no reason why you couldn't use a repository manager.

For monitoring, we essentially track any activity related to the model, including inputs, outputs, timestamps, duration, etc., to a database and have JavaScript charts render it. We might put this into Kafka, but that seems overkill at the moment and would likely force us to hire an actual support team.
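
A minimal sketch of that kind of per-prediction logging (SQLite and the column set are placeholders; the setup described above implies a proper database with JS charts on top):

    import json
    import sqlite3
    import time

    conn = sqlite3.connect("model_monitoring.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions "
        "(ts REAL, duration_ms REAL, inputs TEXT, output TEXT)"
    )

    def logged_predict(model, features):
        start = time.time()
        output = model.predict([features])[0]          # assumes an sklearn-style model
        duration_ms = (time.time() - start) * 1000
        conn.execute(
            "INSERT INTO predictions VALUES (?, ?, ?, ?)",
            (start, duration_ms, json.dumps(features), str(output)),
        )
        conn.commit()
        return output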


Would be interested if anyone in here has a pipeline operating on regulated data (HIPAA, financial, etc). Having a hard time drawing boundaries around what the data science team has access to for development and experimentation vs. where production pipelines would operate. (e.g. where in the process do people/processes get access to live data)


I do machine learning work in healthcare, and work for a HIPAA covered entity. The issue of permissions and data access often gets applied in an unnecessarily strict fashion to data scientists in these environments, often due to a lack of understanding from engineering managers who have been brought in from a non-regulated environment (e.g., hiring a salesforce engineering manager into a healthcare system so they can "disrupt" or "solve" healthcare -- they hear "regulation" and immediately clamp down on everything).

HIPAA allows use of clinical data for treatment, payment, and operations. You can also get around consent issues if the data is properly deidentified. If you have a data scientist who is working to further treatment, payment, or operations (i.e., isn't working on purely marketing uses, selling the data, or doing "real" research), then they are allowed to use the data, assuming it's the minimum necessary for their job. For training machine learning models that support operations, "minimum necessary" is probably a lot of data. And, obviously, the production pipelines and training/experimentation/development would need access to the same amount of data if you want to train and deploy models.

Data scientists are also likely to be the first to notice problems with how your product is working, often before the data engineering team. At my company, I've found numerous bugs in our data engineering pipelines and production code because I've seen anomalies in the data and went digging through the data warehouse, replicas of the production databases, and within our actual product. You probably want to support and encourage that kind of sleuthing - but each organization is different, so maybe you have better QA that's more attuned to data issues.

My opinion, from having done this for over a decade, is that the question shouldn't be about how much access you give your data scientists. They should have access to nearly all of the data that's within their domain, assuming they're legally entitled to it. The question you should be solving for is what restrictions should be placed on how they access and process that data: e.g., have EC2 instances and centralized jupyter notebooks available for them to download and process data, and prohibit storing data on a laptop.


I'd just like to point out that "the minimum necessary for their job" is the reason many engineering managers apply unnecessarily strict rules.

It's very difficult to build rules and policies that allow broad access while maintaining minimum necessary. Some project may be completely justified in accessing "all" (waves hands) data at its conception, but slowly morph to focus on only a few key identifiers while still processing "all" data.


Totally, and to extend that further, one employee may work on multiple projects for which the "minimum necessary" is different. Part of my job involves patient matching and reporting patient-level data to partners we have a BAA with. That means I need access to patient names and addresses. However, if I'm working on training some ML model to predict diabetic progression, it's not necessary for me to pull the names of patients.

I think there's an incorrect assumption in here that there exists a technical solution which entirely solves this problem; that we just need to figure out what the right set of rules are, or get the right column-level and row-level security policies in place and we're all set. It's necessary but not sufficient to have those kinds of safeguards in place. You also need to trust somebody in the organization, and you need to give those somebodies training and support to do the right thing.

In my case, I need access to all of the (clinical) data within the organization. I don't really care how that end is achieved: with one account that has every permission, with multiple accounts that are used for different purposes, or whatever. Ultimately, it's in the interest of the organization to make sure that I have the access I need to successfully do my job.


Not the OP, but I'd love to chat more about your experiences in this space. I don't see an email in your profile; feel free to drop me a note using the one in mine.


We're in the financial services space. The data science team has access to a data warehouse that gets updated daily but no access to production data directly so we use this for development and experimentation.

The definition of production environment varies for us. Production could mean running scheduled batch jobs on a separate 'production' pipeline from the data warehouse, and results can be stored in a database. Naturally, all our data is in-house at the moment, so we have had to set everything up ourselves.

We're working on an enterprise-wide deployment system that integrates with the build/testing infrastructure, which all the data science teams across the financial institution can leverage for deploying models and monitoring performance.


I have a dataset management mechanism (delete, copy, duplicate, etc.) where dataset attributes are tagged as PII (personally identifiable information), and where generic filters are then applied to obfuscate PII for datasets that'll be used by non-privileged users, e.g., data science.

- It's not bullet proof, but it achieves what I'm looking for.
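
A minimal sketch of that kind of attribute-level PII obfuscation (the column names, salt, and hashing scheme are illustrative, not the actual mechanism described above):

    import hashlib
    import pandas as pd

    PII_COLUMNS = {"name", "email", "address"}   # attributes tagged as PII in the catalog

    def obfuscate(df):
        out = df.copy()
        for col in PII_COLUMNS & set(out.columns):
            # A salted hash hides the raw value but keeps the column usable as a join key.
            out[col] = out[col].astype(str).map(
                lambda v: hashlib.sha256(("fixed-salt:" + v).encode()).hexdigest()[:16]
            )
        return out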


I see this as a very timely question. As ML has proliferated, so has the number of ways to construct machine learning pipelines. This means that from one project to another the tools/libraries change, the codebases start looking very different, and each project gets its own unique deployment and evaluation process.

In my experience, this causes several problems that hinder the expansion of ML within organizations:

1. When a ML project inevitably changes hands, there is a strong desire from the new people working on it to tear things down and start over

2. It's hard to leverage and build upon success from previous projects, e.g. "my teammate built a model that worked really well for predicting X, but I can't use any of that code to predict Y"

3. Data science managers face challenges tracking progress and managing multiple data science projects simultaneously.

4. People and teams new to machine learning have a hard time charting a single path to success.

While coming up with a single way to build machine learning pipelines may never be possible, consolidation in the approaches would go a long way.

We've already seen that happen in the modeling algorithms part of the pipeline with libraries like scikit-learn. It doesn't matter what machine learning algorithm you use; the code will be fit/transform/predict.

Personally, I've noticed this problem of multiple approaches and ad-hoc solutions to be most prevalent in the feature engineering step of the process. That is one of the reasons I work on an open source library called Featuretools (http://github.com/featuretools/featuretools/) that aims to use automation to create more reusable abstractions for feature engineering.

If you look at our demos (https://www.featuretools.com/demos), we cover over a dozen different use cases, but the code you use to construct features stays the same. This means it is easier to reuse previous work and reproduce results in development and production environments.
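
For a flavor of what that looks like, a minimal Featuretools sketch using the library's bundled mock dataset and the API as of around this time (not any of the linked demos):

    import featuretools as ft

    # Bundled demo data: customers, sessions, transactions.
    es = ft.demo.load_mock_customer(return_entityset=True)

    # Deep Feature Synthesis: the same call regardless of the underlying use case.
    feature_matrix, feature_defs = ft.dfs(
        entityset=es,
        target_entity="customers",
        max_depth=2,
    )
    print(feature_matrix.head())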

Ultimately, building successful machine learning pipelines is not about having an approach for the problem you are working on today, but something that generalizes across all the problems you and your team will work on in the future.


Followup question: How often do you all re-tune your hyperparameters? Automatically whenever your training ETL runs? Only on first deployment? Somewhere in between?


At Logical Clocks, we build a horizontally scalable ML pipeline framework on the open-source Hopsworks platform, based around its feature store and Airflow for orchestration:

* https://hopsworks.readthedocs.io/en/latest/hopsml/hopsML.htm...

* https://www.logicalclocks.com/feature-store/

The choice for the DataPrep stage basically comes down to Spark or Apache Beam, and we currently support Spark, but plan to soon add support for Beam, because of some of the goodies in TFX (TensorFlow Extended).

For distributed hyperparam opt and distributed training, we leverage Apache Spark and our own version of YARN that supports GPUs -

* https://www.youtube.com/watch?v=tx6HyoUYGL0

For model serving, we support Kubernetes:

* https://hopsworks.readthedocs.io/en/0.9/hopsml/hopsML.html#s...

Our platform supports TLS/SSL certificates everywhere and is open-source. Download it and try it; it runs in several large enterprises in Europe. We have a cluster with >1000 users in Sweden here:

* https://www.hops.site

(Edited for formatting)


This is great, thanks for the link. Could you expand on how this workflow would be different/better than sticking to just something like TFX and TensorFlow Serving? Is it easier to use or more scalable?


It is pretty much the same as TFX - but with Spark for both DataPrep and Distributed HyperparamOpt/Training, and a Feature Store. Model serving is slightly more sophisticated than just TensorFlow Serving on Kubernetes. We support serving requests through the Hopsworks REST API to TF Serving/Kubernetes. This gives us both access control (clients have a TLS cert to authenticate and authorize themselves) and we log all predictions to a Kafka topic. We are adding support to enrich feature vectors using the Feature Store in the serving API; not quite there yet.

We intend to support TFX as we already support Flink. Flink/Beam for Python 3 is needed for TFX, but it's not quite there yet, almost.

It will be interesting to see which one of Spark or Beam will become the horizontally scalable platform of choice for TensorFlow. (PyTorch people don't seem as interested, as they mostly come from a background of not wanting complexity).


Here’s a framework we’ve been developing for this purpose, delivered as a python cookiecutter:

https://github.com/hackalog/cookiecutter-easydata

The framework makes use of conda environments, Python 3.6+, Makefiles, and Jupyter notebooks, so it’s worth a look if that tooling fits into your data/ML workflow.

We gave a tutorial on the framework at Pydata NYC, and it’s still very much in active development - we refine it with every new project we work on. The tutorial is available from:

https://github.com/hackalog/bus_number

While it works well for our projects, the more real-world projects we throw at it, the better the tooling gets, so we’d love to hear from anyone who wants to give it a shot.


+1 for mentioning this. Took it out for a test run. Really neat stuff.


We do everything with https://www.deepdetect.com/. Small team and targeted applications, this may not apply to all needs. The time from dev to prod is very low though.


I'm not a professional, but I built a pipeline for Makers Part List - it involves ingesting a video URL, converting the video into images, then storing the images in Google Storage. Once stored, I trigger the model to classify the images. The images are then displayed to annotators who verify/relabel them. Once I get enough new images, the system creates a .csv and uploads it to Google's AutoML, where it retrains my model.

My bottleneck now is splitting the videos into images, as it's a very CPU-intensive process. Implementing a queue here is my best choice, I think.


Interesting. Could you maybe expand on the tools that you utilize inside the pipeline for ETL, model creation, annotation and testing?


I'm running the frontend and backend of the consumer site on Heroku. The meat of the pipeline is hosted on a DigitalOcean High CPU Droplet. I use ffmpeg to extract images from the provided videos. I store everything in Google Cloud Storage and create references to each photo in Firestore. I use Firebase to power the image verifying/labeling app I built. It's a simple app that presents the viewer with the image and the label it was given. If it's not correct, they enter the correct label. I use a cloud function to move the images into an exportable format for AutoML once a new-image threshold has been hit. Testing is me using it and seeing if it is correctly identifying the objects.
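
For reference, the frame-extraction step can be a single ffmpeg call per video; a minimal sketch (the paths and one-frame-per-second rate are placeholders), and queuing it would just mean pushing video paths onto a work queue:

    import subprocess
    from pathlib import Path

    def extract_frames(video_path, out_dir, fps=1):
        # Sample `fps` frames per second from the video into numbered JPEGs.
        Path(out_dir).mkdir(parents=True, exist_ok=True)
        subprocess.run(
            ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}", f"{out_dir}/frame_%05d.jpg"],
            check=True,
        )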


I've been referencing this lately: https://github.com/pditommaso/awesome-pipeline


GitHub repo and a Makefile...

Cut C++ binaries, statically linked, test locally against small data, and ship them to production with scp. Run either standalone, on a GPU cluster, or on an MPI cluster.

How the data gets to where it needs to be depends on the environment.

The Makefile contains all the steps: from compile to test to deploy.


That sounds terrible to me. I hope there is at least proper CI which executes stuff from the Makefile, but even then, using `scp` to ship something in 2019 is a step back (https://www.phoronix.com/scan.php?page=news_item&px=OpenSSH-...).


Nah, it’s very easy to know what’s going on as a result. The endpoint is only ever a single binary that uses environment variables. Nothing mutates the environment. And the environment is only ever a place with an interface to hardware and to data.

There is a single threaded CI for each environment that checks prerequisites, builds, tests, copies, executes and saves logs.

So CI is also a simple queue and binaries remove the need for containers.

Experience has made me a believer in making things only as complicated as they need to be and favouring solutions that can fit into one brain. Do it whenever it fits your requirements and is possible.


Obligatory XKCD: https://xkcd.com/2054/



