Module 10: Logscale
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
import scipy.stats as ss
import vega_datasets
Ratio and logarithm
If you use a linear scale to visualize ratios, the result can be quite misleading.
Let's first create some ratios.
x = np.array([1, 1, 1, 1, 10, 100, 1000])
y = np.array([1000, 100, 10, 1, 1, 1, 1 ])
ratio = x/y
print(ratio)
[1.e-03 1.e-02 1.e-01 1.e+00 1.e+01 1.e+02 1.e+03]
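To see why a log scale suits ratios, here is a minimal sketch that reuses the same x and y as above and takes the base-10 logarithm of each ratio. Reciprocal pairs such as 1/1000 and 1000/1 end up at equal distances from zero, whereas on a linear scale every ratio below 1 is squeezed into the interval between 0 and 1.

```python
import numpy as np

x = np.array([1, 1, 1, 1, 10, 100, 1000])
y = np.array([1000, 100, 10, 1, 1, 1, 1])
ratio = x / y

# log10 maps a ratio r and its reciprocal 1/r to the same distance from 0,
# on opposite sides
log_ratio = np.log10(ratio)
print(log_ratio)
```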
Q: Plot the ratios on a linear scale using the scatter() function. Also draw a horizontal line at ratio=1 as a reference.
X = np.arange(len(ratio))
# Implement
Text(0, 0.5, 'Ratio')
Q: Is this a good visualization of the ratio data? Why? Why not? Explain.
Q: Can you fix it?
# Implement
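One possible approach, offered as a sketch rather than the intended solution, is to keep the scatter plot and the reference line but switch the vertical axis to a log scale. The `Agg` backend line is only an assumption for running this as a standalone script; it is not needed inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

ratio = np.array([1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3])
X = np.arange(len(ratio))

fig, ax = plt.subplots()
ax.scatter(X, ratio)
ax.axhline(1, color="gray", linestyle="--")  # reference line at ratio = 1
ax.set_yscale("log")  # equal multiplicative steps now take equal vertical space
ax.set_ylabel("Ratio")
```

On the log axis the points fall on a straight line that is symmetric around the reference line, which matches how the ratios were constructed.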
Log-binning
Let's first see what happens if we do not use the log scale for a dataset with a heavy tail.
Q: Load the movies dataset from vega_datasets and remove the rows that have NaN in any of the following three columns: IMDB_Rating, IMDB_Votes, Rotten_Tomatoes_Rating.
# TODO: Implement the functionality mentioned above
# The following code is just a dummy. You should load the correct dataset from the vega_datasets package.
movies = pd.DataFrame({"Worldwide_Gross": np.random.sample(200), "IMDB_Rating": np.random.sample(200)})
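The NaN-removal step can be sketched on a small, made-up table. The rows and titles below are hypothetical stand-ins; with the real data, the table would presumably come from something like `vega_datasets.data.movies()` instead. The key point is the `subset` argument of `dropna`, which restricts the NaN check to the three named columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the real movies table
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D", "E"],
    "IMDB_Rating": [7.1, np.nan, 6.3, 8.0, 5.5],
    "IMDB_Votes": [1200.0, 300.0, np.nan, 45000.0, 800.0],
    "Rotten_Tomatoes_Rating": [80.0, 55.0, 60.0, 91.0, np.nan],
    "Worldwide_Gross": [np.nan, 1.0e6, 2.0e7, 3.0e8, 5.0e5],
})

cols = ["IMDB_Rating", "IMDB_Votes", "Rotten_Tomatoes_Rating"]
# Drop a row only if it has NaN in at least one of the listed columns
cleaned = toy.dropna(subset=cols)
print(cleaned)
```

Row A survives even though its Worldwide_Gross is missing, because that column is not part of `subset`.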
If you simply call the hist() method on a DataFrame object, it identifies all the numeric columns and draws a histogram for each.
Q: Draw all possible histograms of the movies dataframe. Adjust the size of the plots if needed.
# Implement
As we can see, a majority of the columns are not normally distributed. In particular, the histogram of the worldwide gross variable shows only a couple of visible bars, so it reveals very little about the data. Is this a problem of resolution? How about increasing the number of bins?
ax = movies["Worldwide_Gross"].hist(bins=200)
ax.set_xlabel("World wide gross")
ax.set_ylabel("Frequency")
Text(0, 0.5, 'Frequency')
This is maybe a bit more useful, but it still tells us nothing about the distribution of the data above a certain point. How about changing the vertical scale to a logarithmic scale?
ax = movies["Worldwide_Gross"].hist(bins=200)
ax.set_yscale('log')
ax.set_xlabel("World wide gross")
ax.set_ylabel("Frequency")
Text(0, 0.5, 'Frequency')
Now, let's try log-binning. Recall that when plotting histograms we can specify the edges of the bins through the bins parameter. For example, we can set the bin edges to [0, 1, 2, ..., 10] as follows.
movies["IMDB_Rating"].hist(bins=range(0,11))
<matplotlib.axes._subplots.AxesSubplot at 0x7f89e3b38990>
Here, we can specify the bin edges in a similar way. Instead of spacing them evenly on the linear scale, we space them evenly in log space.
Hint: since $10^{\text{start}} = \text{min(Worldwide\_Gross)}$, $\text{start} = \log_{10}(\text{min(Worldwide\_Gross)})$
min(movies["Worldwide_Gross"])
0.0
Because there seem to be movies that made $0, and because log(0) is undefined while log(1) = 0, let's add 1 to the variable.
movies["Worldwide_Gross"] = movies["Worldwide_Gross"]+1.0
Replace the dummy values for bins with 20 log bins using the numpy.geomspace function.
# TODO: Replace the dummy value of bins using np.geomspace.
# Create 20 bins that cover the whole range of the dataset.
bins = [1.0, 2.0, 4.0]
bins
array([1.00000000e+00, 3.14018485e+00, 9.86076088e+00, 3.09646119e+01,
9.72346052e+01, 3.05334634e+02, 9.58807191e+02, 3.01083182e+03,
9.45456845e+03, 2.96890926e+04, 9.32292387e+04, 2.92757043e+05,
9.19311230e+05, 2.88680720e+06, 9.06510822e+06, 2.84661155e+07,
8.93888645e+07, 2.80697558e+08, 8.81442219e+08, 2.76789150e+09])
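As a hedged illustration of how such edges can be produced, the sketch below calls `np.geomspace` on synthetic heavy-tailed values (a stand-in, not the movies data). Consecutive edges differ by a constant factor, which is what "evenly spaced in log space" means.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic heavy-tailed values standing in for Worldwide_Gross + 1
values = rng.lognormal(mean=10.0, sigma=3.0, size=1000) + 1.0

# 20 edges from the smallest to the largest value, evenly spaced in log space
edges = np.geomspace(values.min(), values.max(), num=20)

# The ratio between consecutive edges is constant
ratios = edges[1:] / edges[:-1]
print(ratios[0])
```

Note that `num` counts edges, so `num=20` defines 19 bins, as in the array printed above; `num=21` would give exactly 20 bins.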
Now we can plot a histogram with log bins. Set both axes to log scale.
# Worldwide_Gross was already shifted by 1 above, so it is not shifted again here.
ax = movies["Worldwide_Gross"].hist(bins=bins)
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_xlabel("World wide gross")
ax.set_ylabel("Frequency")
Text(0, 0.5, 'Frequency')
What is going on? Is this the right plot?
Hint: Look at the previous frequency plots before we used log-bin. Why are the shapes different?
Q: Explain and fix.
# Implement
Text(0, 0.5, 'Probability density')
Q: Can you explain the plot? Why are there gaps?
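One common reading of the hint is that log bins have very different widths, so raw counts are not comparable across bins: a wide bin collects more points simply because it covers more of the axis. Dividing each count by its bin width (and by the total count) gives a probability density. The sketch below checks this on synthetic data, which is an assumption standing in for the movies table.

```python
import numpy as np

rng = np.random.default_rng(1)
values = rng.lognormal(mean=10.0, sigma=2.0, size=5000)  # synthetic stand-in data
edges = np.geomspace(values.min(), values.max(), num=20)

counts, _ = np.histogram(values, bins=edges)
widths = np.diff(edges)

# Manual normalization: count per unit of x, scaled so the total area is 1
density_manual = counts / (counts.sum() * widths)

# np.histogram performs the same normalization when density=True
density, _ = np.histogram(values, bins=edges, density=True)
print(np.allclose(density, density_manual))
```

matplotlib's `hist` accepts the same `density=True` argument, so the normalization does not have to be done by hand when plotting.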
CCDF
The CCDF (complementary cumulative distribution function) is a nice alternative for examining distributions with heavy tails. The idea is the same as for the CDF, but the direction of aggregation is the opposite. For a given value x, CCDF(x) is the number (or fraction) of data points that are equal to or larger than x. Before writing code to draw a CCDF, it helps to draw one by hand using a very small toy dataset. Draw it by hand and then think about how each point in the CCDF plot can be computed.
Q: Draw a CCDF of the worldwide gross data on a log-log scale.
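The definition can be checked on a toy dataset. The sketch below is one way to compute it, not necessarily the intended solution: for each distinct value x it counts the points strictly smaller than x and subtracts that fraction from 1, which leaves the fraction of points that are equal to or larger than x.

```python
import numpy as np

data = np.array([1, 2, 2, 3, 5])  # toy dataset
n = len(data)

data_sorted = np.sort(data)
xs = np.unique(data_sorted)

# searchsorted(..., side="left") counts the points strictly smaller than each x
n_smaller = np.searchsorted(data_sorted, xs, side="left")
ccdf = 1.0 - n_smaller / n

for x, c in zip(xs, ccdf):
    print(x, c)
```

For example, four of the five points are at least 2, so CCDF(2) is 0.8, and only one point is at least 5, so CCDF(5) is 0.2.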
# TODO: Implement functionality mentioned above
# You must replace the dummy values with the correct code.
worldgross_sorted = np.random.sample(200)
Y = np.random.sample(200)
We can also try a semilog scale (only one axis is on a log scale), where the horizontal axis is linear.
plt.xlabel("World wide gross")
plt.ylabel("CCDF")
plt.plot(worldgross_sorted,Y)
plt.yscale('log')
A straight line on a semilog scale means exponential decay (cf. a straight line on a log-log scale means power-law decay). So it seems that the amount of money a movie makes across the world roughly follows an exponential distribution, with some outliers that make an insane amount of money.
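The straight-line claim can be verified numerically. In this sketch, the rate `lam` and the exponent `alpha` are arbitrary illustrative values: log(exp(-lam * x)) = -lam * x is linear in x, while log(x ** -alpha) = -alpha * log(x) is linear in log(x). Fitting a line in the matching coordinates recovers each parameter as the slope.

```python
import numpy as np

lam = 0.5    # illustrative exponential decay rate
alpha = 2.0  # illustrative power-law exponent
x = np.linspace(1.0, 10.0, 50)

ccdf_exp = np.exp(-lam * x)  # exponential decay
ccdf_pow = x ** (-alpha)     # power-law decay

# Semilog coordinates (x, log y): the exponential curve is a line with slope -lam
slope_semilog = np.polyfit(x, np.log(ccdf_exp), 1)[0]

# Log-log coordinates (log x, log y): the power law is a line with slope -alpha
slope_loglog = np.polyfit(np.log(x), np.log(ccdf_pow), 1)[0]

print(slope_semilog, slope_loglog)
```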
Q: Which is the most successful movie in our dataset?
You can use the following
# Implement
Title                                  Avatar
US Gross                          7.60168e+08
Worldwide Gross                   2.76789e+09
US DVD Sales                      1.46154e+08
Production Budget                    2.37e+08
Release Date                      Dec 18 2009
MPAA Rating                             PG-13
Running Time min                          NaN
Distributor                  20th Century Fox
Source                    Original Screenplay
Major Genre                            Action
Creative Type                 Science Fiction
Director                        James Cameron
Rotten Tomatoes Rating                     83
IMDB Rating                               8.3
IMDB Votes                             261439
Name: 1234, dtype: object
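One way to look up such a row, sketched here on a made-up table with hypothetical titles and index labels, is to combine `idxmax`, which returns the index label of the largest value in a column, with `loc`, which selects the row carrying that label.

```python
import pandas as pd

# Hypothetical toy table; titles, values, and index labels are made up
toy = pd.DataFrame(
    {
        "Title": ["Film A", "Film B", "Film C"],
        "Worldwide_Gross": [2.5e6, 9.1e8, 4.0e7],
    },
    index=[10, 11, 12],
)

best_label = toy["Worldwide_Gross"].idxmax()  # index label of the maximum
best_row = toy.loc[best_label]                # full row as a Series
print(best_row)
```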