Practical Workshop¶
Project Description¶
Introduction¶
Whenever we perform an analysis, it is essential to formulate hypotheses that we can then validate or refute. Sometimes, we accept them; other times, we reject them. To make good decisions, a company must understand whether its assumptions are correct.
In this project, we will compare the music preferences of the cities of Tangananica and Tanganana. We will analyze the data to test the hypotheses and compare the behavior of users in both cities.
Data Description¶
The data is stored in the file music_project.xlsx
, separated into the tabs Monday
, Wednesday
and Friday
.
Each tab contains the following information:
Column | Description | Data Type |
---|---|---|
userID |
User ID | object |
Track |
Song title | object |
artist |
Name of singer or group | object |
genre |
Music genre | object |
City |
User's city | object |
time |
Exact time the song was played | object |
Day |
Day of the week | object |
Exercises¶
Hypothesis to be tested¶
- User activity differs depending on the day of the week and the city.
- On Monday mornings, residents of Tangananica and Tangananá listen to different genres. The same is true on Friday nights.
- Listeners in Tangananica and Tangananá have different preferences. In Tangananica they prefer pop while in Tangananá there are more rap fans.
Troubleshooting¶
Here are a series of step-by-step steps to solve the use case.
- To do this, you have to add the corresponding code where it says
#FIXME
. - Follow the instructions that will be provided.
Reading data
- Print the name of each sheet in the file
music_project.xlsx
on the screen. - Read each sheet as a data frame and join them into a single data frame called
df
.
# librerias
import pandas as pd
# Especificar la ruta del archivo Excel
excel_file = 'https://raw.githubusercontent.com/fralfaro/MADS-Workshops/main/docs/TallerPandas/data/music_project_cl.xlsx'
# Lee los nombres de las hojas
nombres_hojas = #FIXME
# Imprimir la lista con el nombre de las hojas en excel
nombres_hojas
# Leer archivos
df_monday = #FIXME (usar 'pd.read_excel')
df_wednesday = #FIXME (usar 'pd.read_excel')
df_friday = #FIXME (usar 'pd.read_excel')
# Juntar los DataFrames en uno solo
df = pd.concat([df_monday, df_wednesday, df_friday], ignore_index=True)
# Renombrar Columnas
df.columns = df.columns.str.lower().str.strip()
df = df.rename(columns = {
'userid':'user_id',
})
# Mostrar las primeras filas del DataFrame
df.head(10)
# Obtener las dimensiones del DataFrame
print("Dimensiones del DataFrame:")
print(df.shape)
# Mostrar información del DataFrame
print("Información del DataFrame:")
df.info()
We can see three problems with the style in the column names:
- Some names are uppercase, others are lowercase.
- There are spaces in some names.
Please detect the third problem yourself and describe it here
.
The number of column values is different. This means that the data contains missing values.
Conclusions: ...
Missing Values
First, find the number of missing values in the table. To do this, use two pandas
methods:
> Exercise: Calculate missing values per column
#FIXME (usar '.isnull().sum()')
Not all missing values affect the research. For example, missing values in track
and artist
are not critical. You can simply replace them with clear markers.
But missing values in genre
can affect the comparison of music preferences in Tangananica and Tangananá. In real life, it would be useful to know the reasons for missing data and try to compensate for them. But we don't have that opportunity in this project. So you will have to:
- Fill in these missing values with markers
- Evaluate how much the missing values can affect your calculations
Replace missing values in track
, artist
, and genre
with the string unknown
. To do this, create the list columns_to_replace
, loop over it with for
, and replace the missing values in each of the columns:
# Recorrer los nombres de las columnas y reemplazar los valores faltantes con 'unknown'
columns_to_replace = ['track', 'artist', 'genre']
for col in columns_to_replace:
df[col] = #FIXME (usar '.fillna()')
> Exercise: Count the missing values again
#FIXME (usar '.isnull().sum()')
Duplicates
Find the number of obvious duplicates in the table using a command:
> Exercise: Count duplicates.
#FIXME (usar '.duplicated().sum()')
Call the pandas
method to get rid of obvious duplicates:
> Exercise: Remove duplicates.
#FIXME (usar '.drop_duplicates().sum()')
> Exercise: Count duplicates again.
#FIXME (usar '.duplicated().sum()')
Now get rid of implicit duplicates in the genre
column. For example, the name of a genre can be written in different ways. Such errors will also affect the result.
Print a list of unique genre names, sorted in alphabetical order. To do so:
- Retrieve the desired DataFrame column
- Apply a sorting method to it
- For the sorted column, call the method that will return all unique column values
> Exercise: Display and sort unique 'genre' names. (hint
: use sorted()
and .unique()
commands)
#FIXME (usar 'sorted()' y '.unique()')
Look through the list to find implicit duplicates of the genre hiphop
. These can be misspelled names or alternative names of the same genre.
You will see the following implicit duplicates:
hip
hop
hip-hop
To get rid of them, declare the function replace_wrong_genres()
with two parameters:
wrong_genres=
— the list of duplicatescorrect_genre=
— the string with the correct value
The function should correct the names in the genre
column of the df
table, i.e. replace each value in the wrong_genres
list with the value in correct_genre
.
# Funcion para reemplazar duplicados implicitos
def replace_wrong_genres(df,wrong_genres,correct_genre):
for string in wrong_genres:
df.loc[lambda x: x['genre']==string,'genre'] = correct_genre
return df
Call replace_wrong_genres()
and pass it arguments to remove implicit duplicates (hip
, hop
, and hip-hop
) and replace them with hiphop
:
# Eliminar duplicados implicitos
wrong_genres = ['hip','hop','hip-hop']
correct_genre = 'hiphop'
df = #FIXME (aplicar la funcion 'replace_wrong_genres' a 'df')
> Exercise: Display and sort unique 'genre' names. (hint
: use sorted()
and .unique()
commands)
#FIXME (usar 'sorted()' y '.unique()')
Conclusions: ...
Hypothesis¶
Hypothesis 1: Comparing user behavior in two cities
According to the first hypothesis, users in Tangananica and Tangananá listen to music differently. Test this using data from three days of the week: Monday, Wednesday, and Friday.
- Divide the users into groups by city.
- Compare how many tracks each group played on Monday, Wednesday, and Friday.
For the sake of practice, perform each calculation separately.
Evaluate user activity in each city. Group the data by city and find the number of songs played in each group.
Counting tracks played in each city
Tangananica has more tracks played than Tangananá. But that doesn't imply that citizens of Tangananica listen to music more often. This city is simply larger and there are more users.
Now group the data by day of the week and find the number of tracks played on Monday, Wednesday, and Friday.
> Exercise: Calculate the number of tracks played on each of the three days. (hint
: use groupby()
command)
#FIXME (usar '.groupby()' sobre 'day' y aplicar a 'genre' )
Wednesday is the quietest day overall. But if we consider the two cities separately, we might come to a different conclusion.
You've seen how grouping by city or day works. Now write a function that groups by both.
Create the number_tracks()
function to calculate the number of songs played on a given day and city. It will require two parameters:
- day of the week
- city name
In the function, use a variable to store the rows from the original table, where:
- The value of the
'day'
column is equal to theday
parameter - The value of the
'city'
column is equal to thecity
parameter
Apply consecutive filtering with logical indexing.
Then calculate the values of the 'user_id'
column in the resulting table. Store the result in a new variable. Return this variable from the function.
# Agrupar los datos por 'city' y 'day', y contar el número de ocurrencias de 'genre' en cada grupo
df_city_day = df.groupby(['city', 'day'])['genre'].count().reset_index()
# Pivotar la tabla para tener 'city' como índice, 'day' como columnas, y la suma de 'genre' como valores
pivot_df = df_city_day.pivot_table(index='city', columns='day', values='genre', aggfunc='sum').reset_index()
# Reordenar las columnas según el orden especificado
pivot_df = pivot_df[['city', 'Monday', 'Wednesday', 'Friday']]
# Imprimir el DataFrame resultante
pivot_df
Conclusions: ...
Hypothesis 2: Music at the beginning and end of the week
According to the second hypothesis, on Monday morning and Friday night, citizens of Tangananica listen to different genres than those enjoyed by users in Tangananá.
Get the dataframes (make sure the name of your combined table matches the dataframe provided in the two code blocks below):
- For Tangananica —
tangananica_general
- For Tangananá —
tanganana_general
# Definir 'tangananica_general', filtrar por ciudad igual a Tangananica
tangananica_general = #FIXME (usar '.loc()' y filtrar por ciudad objetivo )
# Definir 'tanganana_general', filtrar por ciudad igual a Tanganana
tanganana_general = #FIXME (usar '.loc()' y filtrar por ciudad objetivo )
Write the function genre_weekday()
with four parameters:
- A table for data
- The day of the week
- The first timestamp, in 'hh:mm' format
- The last timestamp, in 'hh:mm' format
The function should return information about the 15 most popular genres on a given day within the period between the two timestamps.
# Declarar la funcion 'genero_weekday'
def genre_weekday(df, day, time1, time2):
genre_list = df[(df['day'] == day) & (df['time'] > time1) & (df['time'] < time2)]
genre_list_sorted = genre_list.groupby('genre')['genre'].count().sort_values(ascending=False).head(10)
return genre_list_sorted
> Exercise: Call the function for Monday morning in Tanganyika ('07:00:00', '11:00:00').
#FIXME (ejemplo directo: 'genre_weekday(tangananica_general, 'Monday', '07:00:00', '11:00:00')')
> Exercise: Call the function for Monday morning in Tangananá ('07:00:00', '11:00:00').
#FIXME (aplicar lo mismo que el caso anterior)
> Exercise: Call the function for Friday night in Tanganyika ('17:00:00', '23:00:00')
#FIXME (aplicar lo mismo que el caso anterior)
> Exercise: Call the function for Friday night in Tangananá ('17:00:00', '23:00:00')
#FIXME (aplicar lo mismo que el caso anterior)
Conclusions: ...
Hypothesis 3: Gender preferences in Tangananica and Shelbyville
Hypothesis: Tangananica loves rap music. Tangananica citizens like pop music more.
Group the tangananica_general
table by genre and find the number of songs played for each genre using the count()
method. Then sort the result in descending order and store it in tangananica_genres
.
# Contar cuántas veces aparece cada género musical en el DataFrame 'tangananica_general'
# y ordenar los resultados de forma descendente.
tangananica_genres = tangananica_general.groupby('genre')['genre'].count().sort_values(ascending=False)
> Exercise: Display the first 10 rows of 'tangananica_genres'
#FIXME (usar '.head(10)')
Now do the same with the Tangananá data.
Group the tanganana_general
table by genre and find the number of songs played for each genre. Then sort the result in descending order and store it in the tanganana_genres
table:
# Contar cuántas veces aparece cada género musical en el DataFrame 'tanganana_general'
# y ordenar los resultados de forma descendente.
tanganana_genres = tanganana_general.groupby('genre')['genre'].count().sort_values(ascending=False)
> Exercise: Display the first 10 rows of 'tanganana_genres'
#FIXME (usar '.head(10)')
Conclusions: ...
General Conclusions of the case study¶
Conclusions should contain information from data cleaning to validation or refutation of the three hypotheses