Setting up a good data science environment is the first step before starting any project. In this post, I will summarize my current Python data science environment setup.
I am passionate about good practices that let you work in an organized way and make teamwork easier.
Python virtual environment
Python virtual environments allow you to install Python packages (and, if you have several interpreters installed, use different Python versions) in separate directories on your computer. The idea is to set up a new Python environment for each new data science project, so that the dependencies of one project never interfere with those of another. This is definitely a best practice.
Create a new Python virtual environment
To create a new Python environment called env, go to your project’s directory and run the command below.
python3 -m venv env
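The same step can also be scripted with the standard library's venv module, which is what the command above runs. This is only a minimal sketch: the helper name create_env and its parameters are my own, not part of any tool.

```python
import os
import venv
from pathlib import Path


def create_env(project_dir: Path, name: str = "env", with_pip: bool = True) -> Path:
    """Create a virtual environment called `name` inside `project_dir`.

    `create_env` is a hypothetical helper; only `venv.create` comes from
    the standard library.
    """
    env_dir = Path(project_dir) / name
    # Mirror the command-line default: symlink the interpreter everywhere
    # except on Windows, where it is copied instead.
    venv.create(env_dir, with_pip=with_pip, symlinks=(os.name != "nt"))
    return env_dir
```

Calling create_env(Path.cwd()) from your project's directory should give the same result as the command above.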
Activate a Python virtual environment
In order to use the new Python virtual environment, you need to activate it. The command below works on Linux and macOS; on Windows, run env\Scripts\activate instead.
source env/bin/activate
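If you are not sure whether the activation worked, you can ask the interpreter itself. The snippet below is a small sketch (the function name is my own) that relies on the fact that, inside a virtual environment, sys.prefix differs from sys.base_prefix.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtual_environment())
print("interpreter in use:", sys.executable)
```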
Close a Python virtual environment
Finally, in order to leave the virtual environment, run:
deactivate
Installing Python libraries
Once your Python virtual environment has been created and activated, you can start installing packages. Suppose you want to install the NumPy library:
pip install numpy
In case you want to see the list of installed packages:
pip list
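The same information is available from inside Python through the standard library's importlib.metadata module. A minimal sketch, assuming Python 3.9 or later (the function name is my own):

```python
from importlib import metadata


def installed_packages() -> dict[str, str]:
    """Map each installed distribution name to its version."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions whose metadata is incomplete
            packages[name] = dist.version
    return packages


for name, version in sorted(installed_packages().items()):
    print(f"{name}=={version}")
```

This can be handy inside a notebook, where you want to record which versions produced a result.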
Create/use requirements.txt file
You can also export the list of all the installed packages and versions by running:
pip freeze > requirements.txt
This is a good practice since it allows collaborators to recreate your Python virtual environment with all its dependencies by running the following command.
pip install -r requirements.txt
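To check that an environment really matches a requirements.txt file, you can compare the pinned versions with what is installed. The sketch below is an illustration only: both function names are my own, and the parser only understands simple name==version lines such as those written by pip freeze, skipping everything else.

```python
from importlib import metadata


def parse_pins(text: str) -> dict[str, str]:
    """Extract name==version pins from the text of a requirements file."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # blank lines, URLs, and other specifiers are ignored
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins: dict[str, str]) -> dict[str, tuple]:
    """Return {name: (pinned, installed)} for packages that differ or are missing."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

For example, find_mismatches(parse_pins(open("requirements.txt").read())) returns an empty dictionary when the environment matches the file.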
Version control with Git and GitHub
Git allows you to keep track of the changes in your source code. GitHub is a hosting service for Git repositories, so you can use it to store your tracked code in the cloud.
Initializing a Git repository in an existing directory
To start tracking a project directory with Git, run the following commands:
git init
git add .
git commit -m 'Initial project version'
Ignoring files
You can tell Git to ignore some files by listing them in a file called .gitignore. You can create the file by running
echo > ".gitignore"
and then add the files or patterns that you want to ignore, one per line.
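As an illustration, here is a small sketch that writes a starting .gitignore for the kind of project described in this post. The three patterns are my own suggestions (the env virtual environment created earlier, Python bytecode caches, and Jupyter checkpoint folders), and the helper name is hypothetical.

```python
from pathlib import Path

# Suggested patterns only; adjust them to your own project.
IGNORE_PATTERNS = ["env/", "__pycache__/", ".ipynb_checkpoints/"]


def write_gitignore(project_dir: Path, patterns=IGNORE_PATTERNS) -> Path:
    """Write a .gitignore with one pattern per line (overwrites an existing file)."""
    path = Path(project_dir) / ".gitignore"
    path.write_text("\n".join(patterns) + "\n", encoding="utf-8")
    return path
```

Ignoring the virtual environment directory is usually what you want, since collaborators can rebuild it from requirements.txt.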
Adding a remote repository using GitHub
Remote repositories are versions of your project that are hosted on the Internet. GitHub enables you to collaborate with others by managing remote repositories.
Firstly, you will need to create a new GitHub repository.
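Afterwards, you can push your local repository to GitHub from the command line. Replace <url> with the URL of your new repository. If your local branch is not called main, rename it first with git branch -M main.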
git remote add origin <url>
git push -u origin main
VS Code
VS Code is a great code editor that comes with lots of functionality and a large collection of extensions. I highly recommend checking out its built-in source control system. Here is a short list of the extensions that I have installed.
- Atom Material Theme
- GitHub Markdown Preview
- gitignore
- Python
- Tabnine AI Autocomplete
- Excel Viewer
Jupyter notebooks
Jupyter notebooks are a great tool for data analysis, exploration and visualization. JupyterLab is the latest web-based interactive development environment for notebooks, code, and data.
You can install it using pip with
pip install jupyterlab
and then launch JupyterLab with
jupyter-lab
The IPython kernel is the Python execution backend for Jupyter. To use your virtual environment env as a kernel, activate it and run the following commands
pip install ipykernel
python -m ipykernel install --user --name=env
Afterwards, JupyterLab will detect the new IPython kernel automatically, and you will be able to choose your virtual environment as a kernel when opening a new notebook.
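It helps to know what the registration step actually stores. As far as I understand it, a kernel is described by a small kernel.json file whose argv entry points at the Python interpreter of the environment, which is how Jupyter knows which environment to start. The sketch below builds roughly that content; the function name is my own, and the real file written by ipykernel may contain additional fields.

```python
import json
import sys


def kernel_spec(display_name: str, python_executable: str = sys.executable) -> dict:
    """Build an approximate kernel.json payload for an IPython kernel."""
    return {
        # Jupyter replaces {connection_file} when it starts the kernel.
        "argv": [python_executable, "-m", "ipykernel_launcher", "-f", "{connection_file}"],
        "display_name": display_name,
        "language": "python",
    }


print(json.dumps(kernel_spec("env"), indent=2))
```

Run from inside the activated environment, the first argv entry is the interpreter of that environment, so notebooks using this kernel see the packages installed there.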
Data Science Project Structure
You can check out the recommended project directory structure from Cookiecutter Data Science.
“A logical, reasonably standardized, but flexible project structure for doing and sharing data science work”
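If you only want a lightweight starting point, a few lines of Python can create a skeleton by hand. The directory names below are my own selection, loosely modeled on that kind of layout; they are not the official template, so treat them as an assumption and adapt them freely.

```python
from pathlib import Path

# Hypothetical subset of directories for a data science project.
PROJECT_DIRS = [
    "data/raw",         # original, immutable data
    "data/processed",   # cleaned data ready for modeling
    "notebooks",        # Jupyter notebooks
    "models",           # trained models and predictions
    "reports/figures",  # generated graphics for reports
    "src",              # source code used by the project
]


def create_skeleton(root: Path, dirs=PROJECT_DIRS) -> list[Path]:
    """Create each directory under `root`, leaving existing ones untouched."""
    created = []
    for relative in dirs:
        path = Path(root) / relative
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling create_skeleton(Path.cwd()) in an empty project directory sets up the folders in one go.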