Ejemplo Práctico 01¶
Descripción del conjunto de datos¶
Introducción¶
A continuación analizaremos el conjunto de datos denominado player_data.csv, el cual contiene información básica de los jugadores de la NBA.
Descripción de las Columnas¶
| columna | tipo | descripción | |------------ |--------- |-------------------------------------- | | name | object | nombre del jugador | | year_start | int64 | año de inicio de carrera en la NBA | | year_end | int64 | año de término de carrera en la NBA | | position | object | posición del jugador | | height | object | altura del jugador | | weight | float64 | peso del jugador | | birth_date | object | fecha de nacimiento | | college | object | universidad antes de entrar a la NBA |
Resolución del Problema¶
# load data
import pandas as pd
path = "https://raw.githubusercontent.com/fralfaro/python_data_manipulation/main/docs/pandas/data/player_data.csv"
player_data = pd.read_csv(
path, # path
sep="," # separation
)
player_data
name | year_start | year_end | position | height | weight | birth_date | college | |
---|---|---|---|---|---|---|---|---|
0 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | 240.0 | June 24, 1968 | Duke University |
1 | Zaid Abdul-Aziz | 1969 | 1978 | C-F | 6-9 | 235.0 | April 7, 1946 | Iowa State University |
2 | Kareem Abdul-Jabbar | 1970 | 1989 | C | 7-2 | 225.0 | April 16, 1947 | University of California, Los Angeles |
3 | Mahmoud Abdul-Rauf | 1991 | 2001 | G | 6-1 | 162.0 | March 9, 1969 | Louisiana State University |
4 | Tariq Abdul-Wahad | 1998 | 2003 | F | 6-6 | 223.0 | November 3, 1974 | San Jose State University |
... | ... | ... | ... | ... | ... | ... | ... | ... |
4545 | Ante Zizic | 2018 | 2018 | F-C | 6-11 | 250.0 | January 4, 1997 | NaN |
4546 | Jim Zoet | 1983 | 1983 | C | 7-1 | 240.0 | December 20, 1953 | Kent State University |
4547 | Bill Zopf | 1971 | 1971 | G | 6-1 | 170.0 | June 7, 1948 | Duquesne University |
4548 | Ivica Zubac | 2017 | 2018 | C | 7-1 | 265.0 | March 18, 1997 | NaN |
4549 | Matt Zunic | 1949 | 1949 | G-F | 6-3 | 195.0 | December 19, 1919 | George Washington University |
4550 rows × 8 columns
Módulos Básicos¶
Existen módulos para comprender rápidamente la naturaleza del dataframe.
# primeras silas
print("first 5 rows:")
player_data.head(5)
first 5 rows:
name | year_start | year_end | position | height | weight | birth_date | college | |
---|---|---|---|---|---|---|---|---|
0 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | 240.0 | June 24, 1968 | Duke University |
1 | Zaid Abdul-Aziz | 1969 | 1978 | C-F | 6-9 | 235.0 | April 7, 1946 | Iowa State University |
2 | Kareem Abdul-Jabbar | 1970 | 1989 | C | 7-2 | 225.0 | April 16, 1947 | University of California, Los Angeles |
3 | Mahmoud Abdul-Rauf | 1991 | 2001 | G | 6-1 | 162.0 | March 9, 1969 | Louisiana State University |
4 | Tariq Abdul-Wahad | 1998 | 2003 | F | 6-6 | 223.0 | November 3, 1974 | San Jose State University |
# ultimas filas
print("\nlast 5 rows:")
player_data.tail(5)
last 5 rows:
name | year_start | year_end | position | height | weight | birth_date | college | |
---|---|---|---|---|---|---|---|---|
4545 | Ante Zizic | 2018 | 2018 | F-C | 6-11 | 250.0 | January 4, 1997 | NaN |
4546 | Jim Zoet | 1983 | 1983 | C | 7-1 | 240.0 | December 20, 1953 | Kent State University |
4547 | Bill Zopf | 1971 | 1971 | G | 6-1 | 170.0 | June 7, 1948 | Duquesne University |
4548 | Ivica Zubac | 2017 | 2018 | C | 7-1 | 265.0 | March 18, 1997 | NaN |
4549 | Matt Zunic | 1949 | 1949 | G-F | 6-3 | 195.0 | December 19, 1919 | George Washington University |
# tipo
print("\ntype of dataframe:")
type(player_data)
type of dataframe:
pandas.core.frame.DataFrame
# tipo por columnas
print("\ntype of columns:")
player_data.dtypes
type of columns:
name object year_start int64 year_end int64 position object height object weight float64 birth_date object college object dtype: object
# dimension
print("\nshape:")
player_data.shape
shape:
(4550, 8)
# nombre de las columnas
print("\ncols:")
player_data.columns
cols:
Index(['name', 'year_start', 'year_end', 'position', 'height', 'weight', 'birth_date', 'college'], dtype='object')
# indice
print("\nindex:")
player_data.index
index:
RangeIndex(start=0, stop=4550, step=1)
# acceder a la columna posicion
print("\ncolumn 'position': ")
player_data['position'].head()
column 'position':
0 F-C 1 C-F 2 C 3 G 4 F Name: position, dtype: object
# cambiar nombre de una o varias columnas
player_data = player_data.rename(columns={"birth_date": "Birth", "college": "College"})
player_data.head()
name | year_start | year_end | position | height | weight | Birth | College | |
---|---|---|---|---|---|---|---|---|
0 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | 240.0 | June 24, 1968 | Duke University |
1 | Zaid Abdul-Aziz | 1969 | 1978 | C-F | 6-9 | 235.0 | April 7, 1946 | Iowa State University |
2 | Kareem Abdul-Jabbar | 1970 | 1989 | C | 7-2 | 225.0 | April 16, 1947 | University of California, Los Angeles |
3 | Mahmoud Abdul-Rauf | 1991 | 2001 | G | 6-1 | 162.0 | March 9, 1969 | Louisiana State University |
4 | Tariq Abdul-Wahad | 1998 | 2003 | F | 6-6 | 223.0 | November 3, 1974 | San Jose State University |
# fijar columna especifica como indice
player_data = player_data.set_index(["name"])
player_data.head()
year_start | year_end | position | height | weight | Birth | College | |
---|---|---|---|---|---|---|---|
name | |||||||
Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | 240.0 | June 24, 1968 | Duke University |
Zaid Abdul-Aziz | 1969 | 1978 | C-F | 6-9 | 235.0 | April 7, 1946 | Iowa State University |
Kareem Abdul-Jabbar | 1970 | 1989 | C | 7-2 | 225.0 | April 16, 1947 | University of California, Los Angeles |
Mahmoud Abdul-Rauf | 1991 | 2001 | G | 6-1 | 162.0 | March 9, 1969 | Louisiana State University |
Tariq Abdul-Wahad | 1998 | 2003 | F | 6-6 | 223.0 | November 3, 1974 | San Jose State University |
# ordenar dataframe por columna especifica
player_data = player_data.sort_values("weight")
player_data.head()
year_start | year_end | position | height | weight | Birth | College | |
---|---|---|---|---|---|---|---|
name | |||||||
Penny Early | 1969 | 1969 | G | 5-3 | 114.0 | May 30, 1943 | NaN |
Spud Webb | 1986 | 1998 | G | 5-6 | 133.0 | July 13, 1963 | North Carolina State University |
Earl Boykins | 1999 | 2012 | G | 5-5 | 135.0 | June 2, 1976 | Eastern Michigan University |
Muggsy Bogues | 1988 | 2001 | G | 5-3 | 136.0 | January 9, 1965 | Wake Forest University |
Chet Aubuchon | 1947 | 1947 | G | 5-10 | 137.0 | May 18, 1916 | Michigan State University |
# resumen de la información
player_data.describe(include='all')# player_data.describe()
year_start | year_end | position | height | weight | Birth | College | |
---|---|---|---|---|---|---|---|
count | 4550.000000 | 4550.000000 | 4549 | 4549 | 4544.000000 | 4519 | 4248 |
unique | NaN | NaN | 7 | 28 | NaN | 4161 | 473 |
top | NaN | NaN | G | 6-7 | NaN | February 23, 1945 | University of Kentucky |
freq | NaN | NaN | 1574 | 473 | NaN | 3 | 99 |
mean | 1985.076264 | 1989.272527 | NaN | NaN | 208.908011 | NaN | NaN |
std | 20.974188 | 21.874761 | NaN | NaN | 26.268662 | NaN | NaN |
min | 1947.000000 | 1947.000000 | NaN | NaN | 114.000000 | NaN | NaN |
25% | 1969.000000 | 1973.000000 | NaN | NaN | 190.000000 | NaN | NaN |
50% | 1986.000000 | 1992.000000 | NaN | NaN | 210.000000 | NaN | NaN |
75% | 2003.000000 | 2009.000000 | NaN | NaN | 225.000000 | NaN | NaN |
max | 2018.000000 | 2018.000000 | NaN | NaN | 360.000000 | NaN | NaN |
Módulos Avanzados¶
Cuando se trabaja con un conjunto de datos, se crea una dinámica de preguntas y respuestas, en donde a medida que necesito información, se va accediendo al dataframe. En algunas ocaciones es directo, basta un simple módulo, aunque en otras será necesaria realizar operaciones un poco más complejas.
Por ejemplo, del conjunto de datos en estudio, se esta interesado en responder las siguientes preguntas:
a) Determine si el dataframe tiene valores nulos
# veamos las columnas con datos nulos
for col in player_data.columns:
temp = player_data[player_data[col].isnull()]
print(col)
print(f"valores nulos: {len(temp)}\n")
year_start valores nulos: 0 year_end valores nulos: 0 position valores nulos: 1 height valores nulos: 1 weight valores nulos: 6 Birth valores nulos: 31 College valores nulos: 302
# ocupar comando .notnull().all(axis=1) para ver todos los valores que NO son nulos para todas las columnas
player_data.notnull().all(axis=1).head(10)
name Penny Early False Spud Webb True Earl Boykins True Muggsy Bogues True Chet Aubuchon True Greg Grant True Angelo Musi True Ernie Calverley True Tyler Ulis True Lionel Malamed True dtype: bool
b) Elimine los valores nulos del dataframe
# ocupar masking
mask = lambda df: df.notnull().all(axis=1)
player_data = player_data[mask]
player_data.head()
year_start | year_end | position | height | weight | Birth | College | |
---|---|---|---|---|---|---|---|
name | |||||||
Spud Webb | 1986 | 1998 | G | 5-6 | 133.0 | July 13, 1963 | North Carolina State University |
Earl Boykins | 1999 | 2012 | G | 5-5 | 135.0 | June 2, 1976 | Eastern Michigan University |
Muggsy Bogues | 1988 | 2001 | G | 5-3 | 136.0 | January 9, 1965 | Wake Forest University |
Chet Aubuchon | 1947 | 1947 | G | 5-10 | 137.0 | May 18, 1916 | Michigan State University |
Greg Grant | 1990 | 1996 | G | 5-7 | 140.0 | August 29, 1966 | Trenton State University |
c) Determinar el tiempo (en años) de cada jugador en su posición
player_data['duration'] = player_data['year_end'] - player_data['year_start']
player_data.head()
year_start | year_end | position | height | weight | Birth | College | duration | |
---|---|---|---|---|---|---|---|---|
name | ||||||||
Spud Webb | 1986 | 1998 | G | 5-6 | 133.0 | July 13, 1963 | North Carolina State University | 12 |
Earl Boykins | 1999 | 2012 | G | 5-5 | 135.0 | June 2, 1976 | Eastern Michigan University | 13 |
Muggsy Bogues | 1988 | 2001 | G | 5-3 | 136.0 | January 9, 1965 | Wake Forest University | 13 |
Chet Aubuchon | 1947 | 1947 | G | 5-10 | 137.0 | May 18, 1916 | Michigan State University | 0 |
Greg Grant | 1990 | 1996 | G | 5-7 | 140.0 | August 29, 1966 | Trenton State University | 6 |
d) Fecha de str
a objeto datetime
player_data['birth_date_dt'] = pd.to_datetime(player_data['Birth'], format="%B %d, %Y")
player_data.head()
year_start | year_end | position | height | weight | Birth | College | duration | birth_date_dt | |
---|---|---|---|---|---|---|---|---|---|
name | |||||||||
Spud Webb | 1986 | 1998 | G | 5-6 | 133.0 | July 13, 1963 | North Carolina State University | 12 | 1963-07-13 |
Earl Boykins | 1999 | 2012 | G | 5-5 | 135.0 | June 2, 1976 | Eastern Michigan University | 13 | 1976-06-02 |
Muggsy Bogues | 1988 | 2001 | G | 5-3 | 136.0 | January 9, 1965 | Wake Forest University | 13 | 1965-01-09 |
Chet Aubuchon | 1947 | 1947 | G | 5-10 | 137.0 | May 18, 1916 | Michigan State University | 0 | 1916-05-18 |
Greg Grant | 1990 | 1996 | G | 5-7 | 140.0 | August 29, 1966 | Trenton State University | 6 | 1966-08-29 |
e) Determinar todas las posiciones
positions = player_data['position'].unique()
positions
array(['G', 'G-F', 'F-G', 'F', 'F-C', 'C-F', 'C'], dtype=object)
f) Iterar sobre cada posición y encontrar el mayor valor de la columna weight
# Iterar sobre cada posición y encontrar el mayor valor.
nba_position_duration = dict()
# iterar
for position in positions:
# filtrar
df_aux = player_data.loc[lambda x: x['position'] == position]
# encontrar maximo de la columna objetivo
max_duration = df_aux['weight'].max()
# guardar en pd.Series
nba_position_duration[position] = max_duration
# retornar serie
nba_position_duration
{'G': 235.0, 'G-F': 240.0, 'F-G': 245.0, 'F': 284.0, 'F-C': 290.0, 'C-F': 280.0, 'C': 360.0}