Setting up Docker for Kaggle

Here are some notes on setting up Docker for Kaggle (especially on installing and enabling nbextensions). I've had to do this from time to time and wanted to write the steps down for the record; I'm putting them here in case they're useful for someone else.

Why use Docker for Kaggle

Kaggle is a good place to learn machine learning and data science. I think its Docker image is a good option for a data science development environment in two scenarios:

  • if I want to do local experiments for Kaggle's kernel-only competitions, this is exactly the same environment as Kaggle kernels, and it can be kept up to date by simply rebuilding the latest image.
  • for everything else, the list of packages included in the image is curated by the Kaggle community and therefore includes most of the useful tools that I know (or don't yet know) about; this is much easier than maintaining your own list of packages.

There are two images, CPU-only and GPU. The first is sufficient if you don’t have or need a GPU.

First, install Docker if you haven't already.
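A quick way to check that Docker is installed and that the daemon is running:

```bash
# Print the installed Docker version and confirm the daemon responds
docker --version
docker info
```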

CPU-only image

It is easier than the GPU image partly because a pre-built image is already stored on Google Container Registry, so I can simply pull it from the terminal.
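Assuming the image name Kaggle publishes for the CPU image, the pull looks like this:

```bash
# Pull the pre-built CPU image from Google Container Registry
docker pull gcr.io/kaggle-images/python
```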

Put the following in .bash_profile and run source .bash_profile:
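Something along these lines; treat it as a sketch, since the exact image name, mount point, and options may differ from what you need:

```bash
# Sketch of a kjupyter helper for the CPU image (paths and ports are illustrative)
kjupyter() {
    docker run -v "$PWD:/tmp/working" -w /tmp/working \
        -p 8888:8888 --rm -it gcr.io/kaggle-images/python \
        bash -c "pip install jupyter_contrib_nbextensions && \
                 jupyter contrib nbextension install --user && \
                 jupyter notebook --no-browser --ip=0.0.0.0 --port=8888 \
                     --notebook-dir=/tmp/working --allow-root"
}
```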

This command basically sets up the working directory and port forwarding, and runs Jupyter Notebook from within Docker. (There are more details about the docker run command in this blog post from Kaggle.) Now I can run kjupyter from the terminal and go to http://localhost:8888/ for the notebook.

About nbextensions

One issue I had with Kaggle's Docker image is that it does not have nbextensions out of the box, and these extensions (e.g., table of contents) are very useful in the notebook environment. These lines in the definition of kjupyter above solve that:
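Specifically, the install commands in the sketch above, which set up the extensions inside the container before the notebook starts:

```bash
# Install and register nbextensions inside the container
pip install jupyter_contrib_nbextensions && \
jupyter contrib nbextension install --user
```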

GPU image

Unlike the CPU image, there is no pre-built GPU image in the registry, so you can't run docker pull like before. Instead, follow the instructions here and clone Kaggle's docker repository by running:
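Assuming the repository is Kaggle/docker-python on GitHub:

```bash
git clone https://github.com/Kaggle/docker-python.git
cd docker-python
```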

Under the project root docker-python/, run ./build --gpu.

Like before, put the following in .bash_profile and run source .bash_profile:
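A sketch of the GPU variant; here I assume the locally built image is tagged kaggle/python-gpu-build and that --runtime=nvidia is the GPU-related flag in question:

```bash
# Sketch of a kjupyter helper for the locally built GPU image (tag and flags are assumptions)
kjupyter() {
    docker run --runtime=nvidia \
        -v "$PWD:/tmp/working" -w /tmp/working \
        -p 8888:8888 --rm -it kaggle/python-gpu-build \
        bash -c "pip install jupyter_contrib_nbextensions && \
                 jupyter contrib nbextension install --user && \
                 jupyter notebook --no-browser --ip=0.0.0.0 --port=8888 \
                     --notebook-dir=/tmp/working --allow-root"
}
```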

This line in the above definition of kjupyter solves a problem with using GPU. (This link has some background.)

Now you can run kjupyter from the terminal and go to http://localhost:8888/ for the notebook. Nbextensions are taken care of in the same way as in the CPU version.

Update-1: an example where I enable an extension by default; notice the additional jupyter nbextension enable toc2/main --user in the definition below:
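A sketch of that version of the definition, with the extra enable line added:

```bash
kjupyter() {
    docker run -v "$PWD:/tmp/working" -w /tmp/working \
        -p 8888:8888 --rm -it gcr.io/kaggle-images/python \
        bash -c "pip install jupyter_contrib_nbextensions && \
                 jupyter contrib nbextension install --user && \
                 jupyter nbextension enable toc2/main --user && \
                 jupyter notebook --no-browser --ip=0.0.0.0 --port=8888 \
                     --notebook-dir=/tmp/working --allow-root"
}
```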

Update-2: under some recent versions of conda, the notebook fails to connect to the kernel and hangs. I've found that the solution described here works: downgrade tornado with pip install tornado==4.5.3:
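For example, the downgrade can be run inside the container right before starting the notebook (again, just a sketch of where the line goes):

```bash
# Downgrade tornado, then launch the notebook as usual
pip install tornado==4.5.3 && \
jupyter notebook --no-browser --ip=0.0.0.0 --port=8888 \
    --notebook-dir=/tmp/working --allow-root
```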
