Overview of Python Visualization Toolkit¶
These two talks provide pretty good overview of the Python visualization landscape.
- Jake VanderPlas at PyCon 2017: The Python Visualization Landscape.
- James Bednar at AnacondaCon 2018: PyViz: Dashboards for Visualizing 1 Billion Datapoints in 30 Lines of Python.
Python environment setup instruction¶
When you work on data visualizations, there are two main choices: local and cloud. Working locally means installing various packages in your computer and working on the cloud means using services like Google Colaboratory.
Local installation can be nice and convenient (when it works) because everything is in your computer. You can work without network connection and everything can be configured exactly as you see fit. However, local installation can be a huge source of pain because there are multiple ways to install packages (e.g., Anaconda, System Python, etc.) and they can interfere and break. This "dependency hell" can be extremely frustrating. At the same time, working locally is limited by the power of your computer. For instance, you may not be able to run a heavy computational task because your laptop is 10 years old or because you refuse to delete a game for a huge dataset.
On the other hand, working on a cloud environment can also be nice and convenient (when it works). You don't need to worry about different installation and dependency hell (usually), as long as the environment that the cloud provider provides is adequate. All computation is done on the cloud so your computer will not suffer. But that also means you can only work within the boundary set by the cloud provider. For instance the free Google Colaboratory has a certain limit on memory and disk usage, and to go beyond you have to pay.
My recommendation is the following: if you are comfortable with managing your computer and multiple Python installations and packages, or want to learn system management, go with local option. You can either use Anaconda (recommended method and much easier to manage dependency hell) or just vanilla Python (with virtuanenv and pip). If you don't want to spend time on this, use Google Colaboratory.
Local setup¶
If you don't have much experience with Python ecosystem, we recommend using Anaconda, which makes most data analysis tasks easier. But it is totally fine not to use Anaconda.
Setting up Anaconda environment¶
Anaconda is a popular, all-in-one Python package distribution system. If your work is focused on data analysis, machine learning, visualization, etc., then just sticking with Anaconda is probably the most convenient way to manage multiple Python versions, packages, and virtual environments. Anaconda comes with a package manager called conda that replaces pip and virtualenv. It allows you to install packages, create virtual environments, and more. It even allows you to set a different Python version for each virtual environment. If you use Anaconda, you want to stick to it as much as possible by exclusively using conda (use pip only when there is no way to install with conda).
First, download Anaconda for your system from here. Follow the instructions and reopen your shell (or use the Anaconda GUI system). By default, Anaconda activates base environment.
We usually want to use a virtual environment for each of your project. By using virtual environments, you can isolate each environment from the others and maintain separate sets (versions) of packages. conda has a built-in support for virtual environments.
$ conda create -n dviz python=3.8
This command creates a virtual environment named (-n) dviz with Python 3.8. You can activate the environment (whenever you begins to work on this course) by running
$ conda activate dviz
You can install packages by running
$ conda install packagename
Let's install core Python data packages.
$ conda install numpy scipy pandas scikit-learn matplotlib seaborn pillow jupyter jupyterlab
and deactivate (when you're done) by running
$ conda deactivate
For the full documentation, see https://conda.io/docs/user-guide/tasks/manage-environments.html
When the environment is activated, your prompt will show the name of the environment. Now if you install a package, it will be installed into this environment, but not in the other virtual environment. In other words, each virtual environment maintains its own "world" and totally different packages can be installed in each "world".
But even if you have created a virtual environment and installed all necessary packages, you can't use this environment in the Jupyter notebook/Lab yet. Jupyter needs a special program called "IPython kernel" to execute code. So, for each virtual environment we need to also create a corresponding kernel. We can do manually by using ipykernel package like following:
$ conda activate dviz
$ conda install ipykernel
$ python -m ipykernel install --user --name=dviz
or we can use nb_conda package by installing it.
$ conda install nb_conda
Once it's installed, kernels for all virtual environments created by conda are automatically created and show up in Jupyter. So this is a much more convenient option.
pip + virtualenv¶
If you are not using Anaconda, you can use Python's virtualenv to create virtual environments and pip to install packages to the virtual environments. See https://docs.python.org/3/tutorial/venv.html For instance, you can run the following to create a virtual environment in the current directory (it will create dviz-env folder).
$ python3 -m venv dviz-env
and then activate and deactivate it by
$ source dviz-env/bin/activate
$ deactivate
Once you activate a virtual environment, you can install packages into it with pip.
$ pip install packagename
Like the case with conda, we need to install a kernel for each virtual environment to use it in Jupyter.
$ source dviz-env/bin/activate
$ pip install ipykernel
$ python -m ipykernel install --user --name=dviz
Jupyter¶
Once you have setup your local environment, you can run
$ jupyter lab
or use Jupyter notebook.
jupyter notebook
Jupyter lab is the 'next generation' notebook system with many powerful features. Some packages that we use work more nicely with Jupyter lab (for some lab assignments you may need to use jupyter notebook instead of the lab).
Jupyter also has a desktop app: https://github.com/jupyterlab/jupyterlab-desktop
nteract¶
Another convenient way to use Jupyter is using the nteract app. This is essentially a desktop Jupyter app. Instead of running Jupyter server and using a web browser, you can simply open a notebook file with nteract app and you'll be able to run the code as if you're using the Jupyter on the web. If you use Atom editor, you can make it interactive by using https://nteract.io/atom.
VS code¶
Visual Studio code is a powerful editor and it supports notebooks and interactivity for Python. If you open a Jupyter notebook file, it works seamlessly within VS code. With the Remote - SSH plugin, you can maintain almost identical experience using a computing server.
Cloud setup¶
There are many cloud-based Jupyter notebook services. The best option is probably the Google Colaboratory. It allows installation of packages and you can even use GPU/TPUs!
Google colaboratory¶
Google Colaboratory is Google's collaborative Jupyter notebook service. This is the recommended cloud option. Most packages that we use are already installed and you can install additioanl packages by running
!pip install packagename
in the computing cell.
Other options¶
- Azure notebooks: Microsoft also has a cloud notebook service called Azure notebooks. This service also allows installing new packages through
!pip install .... - CoCalc (https://cocalc.com/) is a service by SageMath. You can use it freely but the free version is slow and can be turned off without warning. Most of the packages that we use are pre-installed. We may be able to provide a subscription through the school.
- Kaggle Kernels: The famous machine learning / data science competition service Kaggle offers cloud-based notebooks called Kaggle kernels. Because you can directly use all the Kaggle datasets, it is an excellent option to do your project if you use one of the Kaggle datasets. It allows uploading your own dataset and install some packages, but not all packages are supported.
Lab assignment¶
- Set up your local Python environment following the instructions. You should be using a virtual environment on your local machine.
- Install Jupyter notebook and Jupyter lab.
- Launch jupyter notebook (lab)
- Create a new notebook and play with it. Print "Hello world!".
If you want to use a cloud environment,
- Try out the cloud environments listed above. (Google colaboratory is recommended)
- Try installing/importing the following packages.
- Create a cell that prints "Hello world!".
- Submit the notebook.
Finally, these are the packages that we plan to use. So check out their homepages and figure out what they are about.
- Jupyter Notebook and Lab: https://jupyter.org/
- numpy: http://www.numpy.org/
- scipy: http://www.scipy.org/
- matplotlib: http://matplotlib.org/
- seaborn: http://seaborn.pydata.org/
- pandas: http://pandas.pydata.org/
- scikit-learn: http://scikit-learn.org/stable/
- altair: https://github.com/altair-viz/altair
- vega_datasets: https://github.com/altair-viz/vega_datasets
- bokeh: http://bokeh.pydata.org/en/latest/
- datashader: http://datashader.org/
- holoviews: http://holoviews.org/
- wordcloud: https://github.com/amueller/word_cloud
- spacy: https://spacy.io/
Install them using your package manager (conda or pip).
Once you have installed the Jupyter locally or succeeded with a cloud environment, run the following import cell to make sure that every package is installed successfully. Submit the notebook on the canvas.
!pip install numpy scipy matplotlib seaborn pandas altair vega_datasets pillow sklearn bokeh datashader holoviews wordcloud spacy
import numpy
import scipy
import matplotlib
import seaborn
import pandas
import altair
import vega_datasets
import sklearn
import bokeh
import datashader
import holoviews
import wordcloud
import spacy
Run in Google Colab
View on Github