What Are Best Practices For Collaboration Between Data Scientists?


What are best practices for collaboration between data scientists? originally appeared on Quora - the place to gain and share knowledge, empowering people to learn from others and better understand the world.

Answer by Ben Hamner, Co-founder and CTO of Kaggle, on Quora:

Most data scientists, and most data science teams, have terrible practices for collaboration. The current default workflows have grown organically and are bad. You need to be really intentional to do a lot better, and doing so yields large gains in productivity and reduces painful friction.

There’s a single principle you should be aiming for, and everything else will follow. This principle is: make everything you do easily reproducible, both for yourself and others.

Making your work reproducible will make it far easier for your future self, your teammates, and everyone else with access to it to understand it, update it, and build on top of it.

Here are some steps you can take to make your work more reproducible:

Work off the same data

The data may be stored in flat files or accessed through other systems (such as a relational database or HDFS).

The unfortunate default most people fall into is to start with the same raw data as others, then download or query it and do everything downstream locally, without pushing intermediate data versions back out.

Sharing intermediate datasets by default is super valuable: it enables your coworkers to take advantage of the data cleaning work you’ve done (or vice versa), to understand the context behind the results you create, and to surface potential caveats.

There are a variety of ways to do this, depending on the scale of the data and any limitations in your current environment. They include (a minimal sketch follows the list):

  • Use a shared server with a large SSD for your team.
  • Have a step that automatically pushes intermediate data files back to the appropriate cloud container.
  • Take advantage of syncing abilities in your team’s favorite file sharing tool (Google Drive, Dropbox, Box, etc.)
  • Store intermediate results back in your analytics database as new tables (BigQuery, Postgres, Redshift, etc.)
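As one minimal sketch of what pushing intermediate data back out can look like, assuming a hypothetical events dataset cleaned with pandas and a placeholder S3 bucket as the shared cloud container:

```python
# Sketch: after cleaning the raw data locally, write the intermediate
# dataset back to shared storage so teammates can start from it.
# Requires boto3 (and pyarrow for Parquet) plus configured AWS
# credentials; the bucket, paths, and cleaning steps are placeholders.
import pandas as pd
import boto3

raw = pd.read_csv("data/raw/events.csv")

# Whatever cleaning you would otherwise keep only on your laptop.
clean = raw.dropna(subset=["user_id"]).drop_duplicates()

local_path = "data/intermediate/events_clean.parquet"
clean.to_parquet(local_path)

# Push the intermediate version back out for the rest of the team.
s3 = boto3.client("s3")
s3.upload_file(
    local_path,
    "my-team-bucket",  # placeholder bucket name
    "project-x/intermediate/events_clean.parquet",
)
```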

Version your code

All your code should be versioned and pushed back to a non-local system that others can view in a browser and pull down to their own systems as needed.

Git, along with any of the git hosts that offer a nice web UI (whether running in a private datacenter or the public cloud), is a great tool for this.

Explicitly setting random seeds in your code also makes debugging much easier.
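For example, a minimal sketch of pinning the seeds up front so a rerun produces identical results (the seed value and libraries are just illustrative):

```python
# Sketch: pin every source of randomness so a rerun reproduces the
# same results exactly.
import random
import numpy as np

SEED = 42  # arbitrary, but committed alongside the code

random.seed(SEED)
np.random.seed(SEED)

# Many libraries also accept an explicit seed, e.g. scikit-learn's
# random_state parameter: train_test_split(X, y, random_state=SEED).
sample = np.random.normal(size=5)
print(sample)  # identical on every run with the same SEED
```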

Use a data pipeline

Anything complex (where the output of one data transformation is used as an input to another) should be formalized into a data pipeline. This lets you run the entire set of transformations with a single command.

If your current workflow involves manually running two or more commands to go from the raw data to the end result, then you need this.

The advantages are threefold:

  • It codifies precisely how your data transformations are structured.
  • It makes inspecting and debugging intermediate outputs easy.
  • It reduces the time it takes to rebuild your work from scratch (or the necessary intermediate starting point) when you make a change.

A single bash script that runs everything start-to-end isn’t a bad starting point. Using a formal pipeline tool (Make, Luigi, Airflow, Drake, etc.) takes this to the next level by making it easy to rebuild specific targets and recreate only the necessary subgraph. There are a lot of workable pipeline tools, but I believe the great one hasn’t been created yet: one that establishes the right set of primitives, is both super easy to use and supports the full necessary complexity, and really sets the standard for how we should approach data pipelines.
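As a rough sketch of what a formalized pipeline can look like, here is a two-stage example using Luigi (the file names and cleaning logic are placeholders, assuming a hypothetical events dataset):

```python
# Sketch of a two-stage Luigi pipeline: clean the raw data, then
# summarize the cleaned file. Running the final task rebuilds only
# the pieces whose outputs are missing.
import luigi
import pandas as pd


class CleanData(luigi.Task):
    def output(self):
        return luigi.LocalTarget("data/intermediate/events_clean.csv")

    def run(self):
        raw = pd.read_csv("data/raw/events.csv")  # placeholder input
        clean = raw.dropna(subset=["user_id"])
        with self.output().open("w") as f:
            clean.to_csv(f, index=False)


class SummarizeData(luigi.Task):
    def requires(self):
        return CleanData()

    def output(self):
        return luigi.LocalTarget("data/output/daily_counts.csv")

    def run(self):
        clean = pd.read_csv(self.input().path)
        summary = clean.groupby("date").size().rename("events")
        with self.output().open("w") as f:
            summary.to_csv(f)


if __name__ == "__main__":
    # A single command (python pipeline.py) runs the whole graph.
    luigi.build([SummarizeData()], local_scheduler=True)
```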

Share a computational environment

Massive amounts of time are wasted in data science collaboration because computational environments aren’t shared by default. You should never have to check whether you’re running the same version of a Python library as your colleague just to run her code.

Some ways to improve this are:

  • Have a single Docker container shared for the project that runs everyone’s code.
  • Work off a shared server with a standardized computational environment (many data science teams have found a large shared server running JupyterHub, used by everyone, to be very productive).
  • Leverage any tools in your analytics languages that facilitate shared computational environments (e.g. Conda environments); a small dependency-check sketch follows this list.
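One small sketch of catching environment drift, assuming the team commits a list of pinned package versions (the packages and versions below are placeholders):

```python
# Sketch: fail fast if this machine's installed packages drift from
# the versions the team has pinned. In practice the pins would live
# in a committed file such as requirements.txt or environment.yml.
from importlib.metadata import version

PINNED = {
    "numpy": "1.26.4",   # placeholder versions
    "pandas": "2.2.2",
}

mismatches = {}
for pkg, expected in PINNED.items():
    installed = version(pkg)
    if installed != expected:
        mismatches[pkg] = (installed, expected)

if mismatches:
    raise RuntimeError(f"Environment drift detected: {mismatches}")
```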

Make a one-line code update as fast and painless as possible

A corollary to the reproducibility principle is that it should be as fast as possible to make a one-line fix to the code. This enables you and your colleagues to quickly make corrections, or change a parameter and see how that impacts the results.

Ideally, after making the one-line update, a single command (e.g. a save in a web UI or a “git push”) kicks off a full build of the downstream data pipeline and shows you the impact that change had.
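A toy sketch of that loop, assuming the pipeline is launched by a hypothetical pipeline.py and the code lives under src/:

```python
# Sketch: watch the source files and rerun the full pipeline whenever
# one of them changes, so a one-line edit immediately triggers a
# rebuild. "pipeline.py" and "src/" are placeholders.
import subprocess
import time
from pathlib import Path

WATCHED = list(Path("src").rglob("*.py"))

def snapshot():
    return {path: path.stat().st_mtime for path in WATCHED}

last = snapshot()
while True:
    time.sleep(2)
    current = snapshot()
    if current != last:
        last = current
        subprocess.run(["python", "pipeline.py"], check=False)
```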

How our work at Kaggle fits in

At Kaggle, we’re aiming to make all of these practices simple defaults, so you don’t have to do anything special to follow them. We’ll be expanding Kaggle Kernels (our collaborative data science environment) to support all the work a typical data science team does and make it reproducible in a fun, collaborative environment.

This question originally appeared on Quora. You can follow Quora on Twitter, Facebook, and Google+.