My Python-Data Science Environment Setup

Setting up a good Data Science Environment is first step before starting any project. In this post, I will summarize my current python-Data Science Environment set up.

Note

I am passionate about good practices that allow you to work in an organized way and facilitate team work.

Python virtual environment

Python virtual environments allows you to install different versions of python and python packages into different directories on your computer. The idea is to set up a new python environment for each new data scientist project. This is defenitly a best practice.

Create a new python virtual environment

To create a new python environment called env, go to your project’s directory and run the bellow command.

python3 -m venv env

Activate a python virtual environment

In order to use the new python virual environment you need to activate it.

source env/bin/activate

Close a python virtual environment

Finally, in oder to leave the virual environmente run:

deactivate

Installing python libraries

Once your python vertual environment has been created and activated, you can start installing packages. Suppose you want to install the NumPy library:

pip install numpy

In case you want to see the list of installed packages:

pip list

Create/use requirements.txt file

You can also export the list of all the installed packages and versions by running:

pip freeze > requirements.txt

This is a good practice since it will allow collaborators to clone your python virtual environment with all its dependencies running the following command.

pip install -r requirements.txt

References

Version control with Git and GitHub

Git allows you to keep track of the changes in your source code. GitHub is a hosting service for Git repositories. You will be able to store your tracked code on the cloud using GitHub.

Initializing a Git repository in an existing directory

For start controlling a project directory with Git run the following command

git init
git add .
git commit -m 'Initial porject version'

Ignoring files

You can force Git to ignore some files by creating a file .gitignore listing them. You can create the file running

echo > ".gitignore"

and then list the desired files that you want to ignore.

Adding a remote repository using GitHub

Remote repositories are versions of your project that are hosted on the Internet. GitHub enables you to collaborate with others by managing remote repositories.

Firstly, you will need to create a new GitHub repository.

Afterwards, you can push your locally hosted repository to GitHub from the command line.

git remote add origin <url>
git push -u origin main

References

VS Code

VS code is a great code editor that comes with lots of functionalities and built-in extensions. I highly recommend to check out its built-in source control system. I include a short list of the extensions that I have installed.

References

Jupyter notebooks

Jupyter notebooks are a great tool for data analysis, exploration and visualization. JupyterLab is the lastest web-based interactive development environment for notebooks, code, and data.

You can install it using pip with

pip install jupyterlab

and then launch jupyterab with

jupyter-lab

The IPython kernel is the Python execution backened for Jupyter. For using a kernel in your virtual environment env run the following commands

pip install ipykernel
python -m ipykernel install --name=env

Afterwards, jupyterLab automatically ensures that the IPython kernel is available, and you will be able to choose your virtual environment as a kernel in jupyterLab when opening a new notebook.

References

Data Science Project Structure

You can check out the recommended project directory structure from Cookiecutter Data Science.

“A logical, reasonably standardized, but flexible project structure for doing and sharing data science work”

Written on September 28th, 2022 by Guillermo Villanueva