Module 10: Logscale

Submission instructions

Your final submission should contain:

A .ipynb file of the entire notebook
An html file of the entire notebook (see here to convert .ipynb to .html)

In [ ]:

            
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
import scipy.stats as ss
import vega_datasets

Ratio and logarithm

If you use linear scale to visualize ratios, it can be quite misleading.

Let's first create some ratios.

In [ ]:

            
x = np.array([1,    1,   1,  1, 10, 100, 1000])
y = np.array([1000, 100, 10, 1, 1,  1,   1   ])
ratio = x/y
print(ratio)

[1.e-03 1.e-02 1.e-01 1.e+00 1.e+01 1.e+02 1.e+03]
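As a quick numerical sketch (not the expected answer to the questions that follow), taking the base-10 logarithm shows why a log scale suits ratios: a ratio and its reciprocal end up equally far from 0, the log of ratio 1. The name `log_ratio` is introduced here only for illustration.

```python
import numpy as np

x = np.array([1, 1, 1, 1, 10, 100, 1000])
y = np.array([1000, 100, 10, 1, 1, 1, 1])
ratio = x / y

# On a log scale, a ratio r and its reciprocal 1/r sit symmetrically
# around log10(1) = 0.
log_ratio = np.log10(ratio)
print(log_ratio)

# On a linear scale, every ratio below 1 is squeezed into the interval (0, 1),
# while the ratios above 1 stretch all the way up to 1000.
print(1 - ratio.min(), ratio.max() - 1)
```

The two printed distances from the reference value 1 differ by roughly three orders of magnitude, even though 1/1000 and 1000 describe changes of the same size.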

Q: Plot the ratios on a linear scale using the scatter() function. Also draw a horizontal line at ratio=1 for reference.

In [ ]:

            
X = np.arange(len(ratio))

# Implement

Out[ ]:

Text(0, 0.5, 'Ratio')

Q: Is this a good visualization of the ratio data? Why or why not? Explain.

In [ ]:

Q: Can you fix it?

In [ ]:

            
# Implement

Log-binning

Let's first see what happens if we do not use the log scale for a dataset with a heavy tail.

Q: Load the movie dataset from vega_datasets and remove the NaN rows based on the following three columns: IMDB_Rating, IMDB_Votes, Rotten_Tomatoes_Rating.

In [ ]:

            
# TODO: Implement the functionality mentioned above 
# The following code is just a dummy. You should load the correct dataset from vega_datasets package. 
movies = pd.DataFrame({"Worldwide_Gross": np.random.sample(200), "IMDB_Rating": np.random.sample(200)})

If you simply call the hist() method on a dataframe object, it identifies all the numeric columns and draws a histogram for each.

Q: Draw all possible histograms of the movie dataframe. Adjust the size of the plots if needed.

In [ ]:

            
# Implement

As we can see, most of the columns are not normally distributed. In particular, the histogram of the worldwide gross variable shows only a couple of visible bars, because almost all movies fall into the first few bins. Is this a resolution problem? What if we increase the number of bins?

In [ ]:

            
ax = movies["Worldwide_Gross"].hist(bins=200)
ax.set_xlabel("World wide gross")
ax.set_ylabel("Frequency")

Out[ ]:

Text(0, 0.5, 'Frequency')

This is a bit more useful, but it still tells us nothing about the distribution above a certain point. What if we change the vertical axis to a logarithmic scale?

In [ ]:

            
ax = movies["Worldwide_Gross"].hist(bins=200)
ax.set_yscale('log')
ax.set_xlabel("World wide gross")
ax.set_ylabel("Frequency")

Out[ ]:

Text(0, 0.5, 'Frequency')

Now, let's try log-binning. Recall that when plotting histograms we can specify the bin edges through the bins parameter. For example, we can set the bin edges to [0, 1, 2, ..., 10] as follows.

In [ ]:

            
movies["IMDB_Rating"].hist(bins=range(0,11))

Out[ ]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f89e3b38990>

Here, we can specify the bin edges in a similar way. Instead of spacing them evenly on a linear scale, we space them evenly in log space. Some useful resources:

Hint: since $10^{\text{start}} = \text{min(Worldwide\_Gross)}$, $\text{start} = \log_{10}(\text{min(Worldwide\_Gross)})$
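The hint can be sketched on a small, made-up array (the `values` below are hypothetical and are not the movie data). The sketch builds the same geometric edges in two ways: by choosing exponents with np.logspace, and by passing the end points directly to np.geomspace.

```python
import numpy as np

# Hypothetical strictly positive values standing in for a heavy-tailed column.
values = np.array([2.0, 15.0, 300.0, 4_000.0, 50_000.0])

# Option 1: pick the exponents on a linear grid, then exponentiate.
start = np.log10(values.min())
stop = np.log10(values.max())
edges_logspace = np.logspace(start, stop, num=6)

# Option 2: give the end points directly; np.geomspace spaces them geometrically.
edges_geomspace = np.geomspace(values.min(), values.max(), num=6)

# Consecutive edges share a constant ratio, i.e. they are equally wide in log space.
step_ratios = edges_geomspace[1:] / edges_geomspace[:-1]
print(edges_geomspace)
print(step_ratios)
```

Note that `num` counts edges, so six edges define five bins.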

In [ ]:

            
min(movies["Worldwide_Gross"])

Out[ ]:

0.0

Because at least one movie appears to have made $0, and because log(0) is undefined while log(1) = 0, let's add 1 to the variable.
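A small sketch of why the zero matters, using a made-up array (`gross` is hypothetical, not the movie data): log10(0) evaluates to negative infinity, and np.geomspace rejects a zero end point, while shifting by 1 makes the edges well defined.

```python
import numpy as np

# Hypothetical toy values that include a zero, like a movie with no recorded gross.
gross = np.array([0.0, 9.0, 99.0, 999.0])

# log10(0) evaluates to -inf, so a zero cannot be placed on a log axis.
with np.errstate(divide="ignore"):
    print(np.log10(gross))

# np.geomspace also refuses a zero end point.
try:
    np.geomspace(gross.min(), gross.max(), num=4)
    zero_rejected = False
except ValueError:
    zero_rejected = True
print(zero_rejected)

# After shifting by 1, the minimum maps to log10(1) = 0 and the edges exist.
shifted = gross + 1.0
edges = np.geomspace(shifted.min(), shifted.max(), num=4)
print(edges)
```

For values in the millions or billions, a shift of 1 is negligible, which is why this workaround is acceptable here.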

In [ ]:

            
movies["Worldwide_Gross"] = movies["Worldwide_Gross"]+1.0

Replace the dummy value of bins with 20 log bins using the numpy.geomspace function.

In [ ]:

            
# TODO: Replace the dummy value of bins using np.geomspace.  
# Create 20 bins that cover the whole range of the dataset. 
bins = [1.0, 2.0, 4.0]
bins

Out[ ]:

array([1.00000000e+00, 3.14018485e+00, 9.86076088e+00, 3.09646119e+01,
       9.72346052e+01, 3.05334634e+02, 9.58807191e+02, 3.01083182e+03,
       9.45456845e+03, 2.96890926e+04, 9.32292387e+04, 2.92757043e+05,
       9.19311230e+05, 2.88680720e+06, 9.06510822e+06, 2.84661155e+07,
       8.93888645e+07, 2.80697558e+08, 8.81442219e+08, 2.76789150e+09])

Now we can plot a histogram with log-binning. Set both axes to log scale.

In [ ]:

            
# 1 was already added to Worldwide_Gross above, so it is not added again here.
ax = movies["Worldwide_Gross"].hist(bins=bins)
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_xlabel("World wide gross")
ax.set_ylabel("Frequency")

Out[ ]:

Text(0, 0.5, 'Frequency')

What is going on? Is this the right plot?

Hint: Look at the previous frequency plots before we used log-bin. Why are the shapes different?

Q: Explain and fix.

In [ ]:

            
# Implement

Out[ ]:

Text(0, 0.5, 'Probability density')
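The difference between a frequency axis and a probability-density axis can be sketched with np.histogram on synthetic data (the Pareto sample below is an assumption for illustration, not the movie data). With log bins the widths grow geometrically, so raw counts and width-normalized densities have different shapes.

```python
import numpy as np

# Synthetic heavy-tailed sample (an assumption for illustration only).
rng = np.random.default_rng(0)
sample = rng.pareto(1.5, size=10_000) + 1.0

edges = np.geomspace(sample.min(), sample.max(), num=20)
widths = np.diff(edges)

# Raw counts: wider bins collect more points partly because they are wider.
counts, _ = np.histogram(sample, bins=edges)

# Density: counts divided by (total count * bin width), so bar areas sum to 1.
density, _ = np.histogram(sample, bins=edges, density=True)

print(np.allclose(density, counts / (counts.sum() * widths)))
print((density * widths).sum())
```

The same `density=True` option is accepted by matplotlib's hist(), which pandas' hist() builds on.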

Q: Can you explain the plot? Why are there gaps?

In [ ]:

CCDF

The CCDF (complementary cumulative distribution function) is a nice alternative for examining distributions with heavy tails. The idea is the same as the CDF, but the direction of aggregation is the opposite. For a given value x, CCDF(x) is the number (or fraction) of data points that are greater than or equal to x. Before writing code to draw a CCDF, it helps to draw one by hand using a very small toy dataset, and then think about how each point in the plot can be computed.
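The by-hand exercise can be mirrored on a five-point toy dataset. This is a minimal sketch that assumes all values are distinct; the names `data`, `data_sorted` and `ccdf` are illustrative only.

```python
import numpy as np

# Toy dataset with distinct values (ties would need extra care).
data = np.array([3.0, 1.0, 10.0, 2.0, 4.0])

data_sorted = np.sort(data)
n = len(data_sorted)

# The i-th smallest value (0-based) has n - i points greater than or equal to it.
ccdf = (n - np.arange(n)) / n

for value, frac in zip(data_sorted, ccdf):
    print(value, frac)
```

The smallest value gets CCDF 1.0 (every point is at least that large) and the largest gets 1/n, so the curve never reaches 0 and can be drawn on a log axis.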

Q: Draw a CCDF of the worldwide gross data on a log-log scale.

In [ ]:

            
# TODO: Implement functionality mentioned above
# You must replace the dummy values with the correct code. 
worldgross_sorted = np.random.sample(200)
Y = np.random.sample(200)

We can also try a semilog scale (only one axis is on a log scale), where the horizontal axis is linear.

In [ ]:

            
plt.xlabel("World wide gross")
plt.ylabel("CCDF")
plt.plot(worldgross_sorted,Y)
plt.yscale('log')

A straight line on a semilog scale means exponential decay (cf. a straight line on a log-log scale means power-law decay). So the amount of money a movie makes worldwide seems to follow roughly an exponential distribution, with a few outliers that make an enormous amount of money.
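The straight-line claims can be checked numerically on idealized curves. In this sketch, `rate` and `alpha` are hypothetical parameters chosen for illustration; a degree-1 fit recovers them as slopes in the matching coordinates.

```python
import numpy as np

x = np.linspace(1.0, 50.0, 200)

# Exponential decay: log(y) is linear in x, so it is straight on a semilog plot.
rate = 0.2  # hypothetical decay rate
y_exp = np.exp(-rate * x)
slope_semilog, _ = np.polyfit(x, np.log(y_exp), 1)

# Power-law decay: log(y) is linear in log(x), so it is straight on a log-log plot.
alpha = 1.5  # hypothetical exponent
y_pow = x ** (-alpha)
slope_loglog, _ = np.polyfit(np.log(x), np.log(y_pow), 1)

print(slope_semilog, slope_loglog)
```

Real data will only approximate either line, so the fitted slope is a rough summary rather than proof of a particular distribution.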

Q: Which is the most successful movie in our dataset?

You can use the following

In [ ]:

            
# Implement

Out[ ]:

Title                                  Avatar
US Gross                          7.60168e+08
Worldwide Gross                   2.76789e+09
US DVD Sales                      1.46154e+08
Production Budget                    2.37e+08
Release Date                      Dec 18 2009
MPAA Rating                             PG-13
Running Time min                          NaN
Distributor                  20th Century Fox
Source                    Original Screenplay
Major Genre                            Action
Creative Type                 Science Fiction
Director                        James Cameron
Rotten Tomatoes Rating                     83
IMDB Rating                               8.3
IMDB Votes                             261439
Name: 1234, dtype: object

In [ ]: