Training Deep Learning Models

We strive to make our work as reproducible as possible. Thus we established the following rules for developing deep learning models within our team:

  • when training on shared infrastructure, make sure you don't use all the resources (for example, restrict the CPU cores your job may use)
  • when training on shared infrastructure, please train your models inside Docker containers
  • connect your training with our MLflow instance
  • the code for your training should be in our repo: https://scm.cms.hu-berlin.de/berlinunited/tools/naoth-deeplearning
  • datasets used for training should be published to datasets.naoth.de
  • trained models should be published at models.naoth.de
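
The MLflow rule above could look roughly like the following in a training script. This is a minimal sketch, not the team's actual setup: the fallback URI http://localhost:5000, the experiment name ball_detection_tf, and the helper names tracking_uri and log_training_run are assumptions made for illustration. The address of the real MLflow instance is not given here, so the sketch reads it from the MLFLOW_TRACKING_URI environment variable.

```python
import os


def tracking_uri(default="http://localhost:5000"):
    """Return the MLflow tracking URI.

    The default is a placeholder (assumption); set MLFLOW_TRACKING_URI
    to the address of the team's MLflow instance.
    """
    return os.environ.get("MLFLOW_TRACKING_URI", default)


def log_training_run(params, losses, experiment="ball_detection_tf"):
    """Log hyperparameters and per-epoch losses to MLflow.

    `experiment` is a hypothetical experiment name.
    """
    # Imported here so that tracking_uri() can be used without MLflow installed.
    import mlflow

    mlflow.set_tracking_uri(tracking_uri())
    mlflow.set_experiment(experiment)
    with mlflow.start_run():
        mlflow.log_params(params)
        for epoch, loss in enumerate(losses):
            mlflow.log_metric("loss", loss, step=epoch)
```

Calling log_training_run({"lr": 0.001, "epochs": 3}, [0.9, 0.5, 0.3]) at the end of train.py would record one run with three loss values, provided the mlflow package is installed and the tracking server is reachable.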

Train Models inside Docker

Note

TODO: fix user permissions, add restrictions for CPU, how to train headless?
TODO: setting a user gives a weird error that can perhaps be ignored

As an example, we show how you can train a TensorFlow model. Let's assume you have the following folder structure inside the deep learning repo:

ball_detection_tf/
├── my_cool_dataset.pkl
└── train.py

Change into the ball_detection_tf folder and run:

docker run -it --privileged -v ${PWD}:/mycode -u $(id -u):$(id -g) --cpuset-cpus="0-3" --gpus all --ipc host -w /mycode tensorflow/tensorflow:latest-gpu /bin/bash
or, if you don't have a GPU available, run:
docker run -it --privileged -v ${PWD}:/mycode --cpuset-cpus="0-3" --ipc host -w /mycode tensorflow/tensorflow:latest /bin/bash
After that you will have a bash shell inside the container. The current directory on your machine is mounted at /mycode. Typing ls should give you:
my_cool_dataset.pkl  train.py

Now you can run your Python script as usual, for example with python train.py.
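
As a starting point, train.py might begin by loading the dataset from the mounted working directory. The following is only a sketch under assumptions: the content and layout of my_cool_dataset.pkl are not specified here, and load_dataset is a hypothetical helper name. Only unpickle files you trust, since loading a pickle can execute arbitrary code.

```python
import pickle
from pathlib import Path


def load_dataset(path="my_cool_dataset.pkl"):
    """Load a pickled dataset; the path is relative to the working directory.

    Inside the container the working directory is /mycode, so the default
    resolves to /mycode/my_cool_dataset.pkl.
    """
    with Path(path).open("rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    dataset_path = Path("my_cool_dataset.pkl")
    if dataset_path.exists():
        data = load_dataset(dataset_path)
        print(f"Loaded dataset of type {type(data).__name__}")
        # Model definition and training would follow here.
    else:
        print(f"{dataset_path} not found; run this from the ball_detection_tf folder")
```

Because the folder is mounted with -v, anything the script writes to the working directory (checkpoints, logs) also appears in ball_detection_tf on the host.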