Pandas I¶
Introducción¶
Pandas es un paquete de Python que proporciona estructuras de datos rápidas, flexibles y expresivas diseñadas para que trabajar con datos "relacionales" o "etiquetados" sea fácil e intuitivo.
Su objetivo es ser el bloque de construcción fundamental de alto nivel para hacer análisis de datos prácticos del mundo real en Python. Además, tiene el objetivo más amplio de convertirse en la herramienta de análisis/manipulación de datos de código abierto más potente y flexible disponible en cualquier idioma. Ya está en camino hacia este objetivo.
Series y DataFrames
- Las series son arreglos unidimensionales con etiquetas. Se puede pensar como una generalización de los diccionarios de Python.
- Los dataframe son arreglos bidimensionales y una extensión natural de las series. Se puede pensar como la generalización de un numpy.array.
Pandas Dataframes¶
Como se mencina anteriormente, los dataframes son arreglos de series, los cuales pueden ser de distintos tipos (numéricos, string, etc.). En esta parte mostraremos un ejemplo aplicado de las distintas funcionalidades de los dataframes.
Creación de Dataframes¶
La creación se puede hacer de variadas formas con listas, dictionarios , numpy array , entre otros.
import pandas as pd
import numpy as np
# empty dataframe
df_empty = pd.DataFrame()
df_empty
# dataframe with list
df_list = pd.DataFrame(
[
["nombre_01", "apellido_01", 60],
["nombre_02", "apellido_02", 14]
],
columns = ["nombre", "apellido", "edad"]
)
df_list
nombre | apellido | edad | |
---|---|---|---|
0 | nombre_01 | apellido_01 | 60 |
1 | nombre_02 | apellido_02 | 14 |
# dataframe with dct
df_dct = pd.DataFrame(
{
"nombre": ["nombre_01", "nombre_02",],
"apellido": ["apellido_01", "apellido_02"],
"edad": np.array([60,14]),
}
)
df_dct
nombre | apellido | edad | |
---|---|---|---|
0 | nombre_01 | apellido_01 | 60 |
1 | nombre_02 | apellido_02 | 14 |
Lectura de Datos¶
En general, cuando se trabajan con datos, estos se almacenan en algún lugar y en algún tipo de formato, por ejemplo:
.txt
.csv
.xlsx
.db
- etc.
🏀 Ejemplo: El conjunto de datos "player_info.csv" es una colección exhaustiva que proporciona información detallada sobre los jugadores de la NBA desde el año 1947 hasta el 2018.
En conjunto, estas columnas ofrecen una visión completa de los jugadores de la NBA durante más de medio siglo, permitiendo análisis detallados sobre su trayectoria, características físicas y antecedentes educativos.
📋Descripción de las columnas
Columna | Descripción |
---|---|
name | El nombre completo de cada jugador en el formato "apellido, nombre". |
year_start | El año en el que cada jugador comenzó su carrera profesional en la NBA. |
year_end | El año en el que la carrera profesional de cada jugador en la NBA llegó a su fin. |
position | La posición principal en la que cada jugador se desempeña en el campo de juego (base, escolta, alero, ala-pívot, pívot). |
height | La altura de cada jugador en pulgadas. |
weight | El peso de cada jugador en libras. |
birth_date | La fecha de nacimiento de cada jugador. |
college | La universidad a la que asistió cada jugador antes de ingresar a la NBA. |
# cargar datos
path = 'https://raw.githubusercontent.com/fralfaro/MAT281_2024/main/docs/lectures/data_manipulation/data/player_info.csv'
df = pd.read_csv(path, sep="," )
# mostar resultados
df.head()
name | year_start | year_end | position | height | weight | birth_date | college | |
---|---|---|---|---|---|---|---|---|
0 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | NaN | June 24, 1968 | NaN |
1 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | NaN | June 24, 1968 | NaN |
2 | Zaid Abdul-Aziz | 1969 | 1978 | C-F | 6-9 | 235.0 | April 7, 1946 | Iowa State University |
3 | Kareem Abdul-Jabbar | 1970 | 1989 | C | 7-2 | 225.0 | April 16, 1947 | University of California, Los Angeles |
4 | Mahmoud Abdul-Rauf | 1991 | 2001 | G | 6-1 | 162.0 | March 9, 1969 | Louisiana State University |
Objetos de Pandas¶
En un nivel muy básico, los objetos de Pandas se pueden considerar como versiones mejoradas de matrices de numpy
en las que las filas y columnas se identifican con etiquetas en lugar de simples índices enteros.
DataFrame
# valores
df.values
array([['Alaa Abdelnaby', 1991, 1995, ..., nan, 'June 24, 1968', nan], ['Alaa Abdelnaby', 1991, 1995, ..., nan, 'June 24, 1968', nan], ['Zaid Abdul-Aziz', 1969, 1978, ..., 235.0, 'April 7, 1946', 'Iowa State University'], ..., ['Bill Zopf', 1971, 1971, ..., 170.0, 'June 7, 1948', 'Duquesne University'], ['Ivica Zubac', 2017, 2018, ..., 265.0, 'March 18, 1997', nan], ['Matt Zunic', 1949, 1949, ..., 195.0, 'December 19, 1919', 'George Washington University']], dtype=object)
# indice
df.index
RangeIndex(start=0, stop=4551, step=1)
# columnas
df.columns
Index(['name', 'year_start', 'year_end', 'position', 'height', 'weight', 'birth_date', 'college'], dtype='object')
# tipo
type(df)
pandas.core.frame.DataFrame
Series
# elegir columna (pueden ser una o varias)
pd_series = df['name']
# valores
pd_series.values
array(['Alaa Abdelnaby', 'Alaa Abdelnaby', 'Zaid Abdul-Aziz', ..., 'Bill Zopf', 'Ivica Zubac', 'Matt Zunic'], dtype=object)
# indice
pd_series.index
RangeIndex(start=0, stop=4551, step=1)
# tipo
type(pd_series)
pandas.core.series.Series
Operaciones en Pandas¶
Las operaciones en Pandas se refieren a las acciones que se pueden realizar sobre los objetos de Pandas, como Series y DataFrames, para manipular, transformar y analizar datos.
Operaciones Básicas¶
# priemras filas
df.head()
name | year_start | year_end | position | height | weight | birth_date | college | |
---|---|---|---|---|---|---|---|---|
0 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | NaN | June 24, 1968 | NaN |
1 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | NaN | June 24, 1968 | NaN |
2 | Zaid Abdul-Aziz | 1969 | 1978 | C-F | 6-9 | 235.0 | April 7, 1946 | Iowa State University |
3 | Kareem Abdul-Jabbar | 1970 | 1989 | C | 7-2 | 225.0 | April 16, 1947 | University of California, Los Angeles |
4 | Mahmoud Abdul-Rauf | 1991 | 2001 | G | 6-1 | 162.0 | March 9, 1969 | Louisiana State University |
# ultimas filas
df.tail()
name | year_start | year_end | position | height | weight | birth_date | college | |
---|---|---|---|---|---|---|---|---|
4546 | Ante Zizic | 2018 | 2018 | F-C | 6-11 | 250.0 | January 4, 1997 | NaN |
4547 | Jim Zoet | 1983 | 1983 | C | 7-1 | 240.0 | December 20, 1953 | Kent State University |
4548 | Bill Zopf | 1971 | 1971 | G | 6-1 | 170.0 | June 7, 1948 | Duquesne University |
4549 | Ivica Zubac | 2017 | 2018 | C | 7-1 | 265.0 | March 18, 1997 | NaN |
4550 | Matt Zunic | 1949 | 1949 | G-F | 6-3 | 195.0 | December 19, 1919 | George Washington University |
# informacion del dataframe
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 4551 entries, 0 to 4550 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 name 4551 non-null object 1 year_start 4551 non-null int64 2 year_end 4551 non-null int64 3 position 4550 non-null object 4 height 4550 non-null object 5 weight 4543 non-null float64 6 birth_date 4520 non-null object 7 college 4247 non-null object dtypes: float64(1), int64(2), object(5) memory usage: 284.6+ KB
# tipo de datos por columnas
df.dtypes
name object year_start int64 year_end int64 position object height object weight float64 birth_date object college object dtype: object
# filas y columnas
df.shape
(4551, 8)
# estadisticas basicas
df.describe()
year_start | year_end | weight | |
---|---|---|---|
count | 4551.000000 | 4551.000000 | 4543.000000 |
mean | 1985.077565 | 1989.273786 | 208.901167 |
std | 20.972067 | 21.872522 | 26.267502 |
min | 1947.000000 | 1947.000000 | 114.000000 |
25% | 1969.000000 | 1973.000000 | 190.000000 |
50% | 1986.000000 | 1992.000000 | 210.000000 |
75% | 2003.000000 | 2009.000000 | 225.000000 |
max | 2018.000000 | 2018.000000 | 360.000000 |
# cantidad de objetos unicos - columna especifica
df['year_start'].nunique()
72
# objetos unicos por columna especifica
df['year_start'].unique()
array([1991, 1969, 1970, 1998, 1997, 1977, 1957, 1947, 2017, 2006, 1954, 1988, 1968, 2013, 1976, 1971, 1973, 2007, 2015, 1986, 2014, 1987, 2018, 2011, 2008, 1982, 2009, 1967, 1960, 1985, 2016, 1996, 2001, 1994, 1992, 2012, 2002, 1989, 2005, 1972, 1981, 1995, 2000, 1999, 2010, 1975, 1983, 1993, 1979, 1990, 1955, 2004, 1962, 2003, 1951, 1949, 1978, 1956, 1964, 1961, 1984, 1974, 1953, 1980, 1950, 1952, 1965, 1963, 1966, 1959, 1948, 1958], dtype=int64)
# cantidad de objetos unicos - todas las columnas
df.nunique()
name 4500 year_start 72 year_end 72 position 7 height 28 weight 143 birth_date 4161 college 473 dtype: int64
# numero de ocurrencias de cada valor en una columna
df['year_start'].value_counts()#.sort_index()
1968 173 1947 161 1950 120 1971 100 1949 94 ... 1964 23 1966 23 1958 21 1961 20 1960 19 Name: year_start, Length: 72, dtype: int64
# ordenar valores - por columna especifica, menor a mayor
df.sort_values('year_start', ascending = True).head()
name | year_start | year_end | position | height | weight | birth_date | college | |
---|---|---|---|---|---|---|---|---|
3483 | Irv Rothenberg | 1947 | 1949 | C | 6-7 | 215.0 | December 31, 1921 | Long Island University |
3008 | Buddy O'Grady | 1947 | 1949 | G | 5-11 | 160.0 | January 19, 1920 | Georgetown University |
1008 | Bob Dille | 1947 | 1947 | F | 6-3 | 190.0 | July 2, 1917 | Valparaiso University |
3020 | Garland O'Shields | 1947 | 1947 | G-F | 6-1 | 195.0 | May 23, 1921 | University of Tennessee |
3726 | Belus Smawley | 1947 | 1952 | G-F | 6-1 | 195.0 | March 20, 1918 | Appalachian State University |
# ordenar valores - por columna especifica, mayor a menor
df.sort_values('year_start', ascending = False).head()
name | year_start | year_end | position | height | weight | birth_date | college | |
---|---|---|---|---|---|---|---|---|
1227 | Terrance Ferguson | 2018 | 2018 | G-F | 6-7 | 184.0 | May 17, 1998 | NaN |
2837 | Monte Morris | 2018 | 2018 | G | 6-3 | 175.0 | June 27, 1995 | Iowa State University |
1334 | Markelle Fultz | 2018 | 2018 | G | 6-4 | 195.0 | May 29, 1998 | University of Washington |
2676 | Alfonzo McKinnie | 2018 | 2018 | F | 6-8 | 215.0 | September 17, 1992 | University of Wisconsin-Green Bay |
3088 | Marcus Paige | 2018 | 2018 | G | 6-1 | 175.0 | September 11, 1993 | University of North Carolina |
Operaciones Matemáticas¶
# operaciones aritmeticas
s1 = df['year_end']
s2 = df['year_start']
suma = s1+s2
resta = s1-s2
multiplicacion = s1*s2
division = s1/s2
# suma
print(f"suma: \n{suma.head()}\n")
# resta
print(f"resta: \n{resta.head()}\n")
# multiplicacion
print(f"multiplicacion: \n{multiplicacion.head()}\n")
# division
print(f"division: \n{division.head()}")
suma: 0 3986 1 3986 2 3947 3 3959 4 3992 dtype: int64 resta: 0 4 1 4 2 9 3 19 4 10 dtype: int64 multiplicacion: 0 3972045 1 3972045 2 3894682 3 3918330 4 3983991 dtype: int64 division: 0 1.002009 1 1.002009 2 1.004571 3 1.009645 4 1.005023 dtype: float64
# operaciones estadisticas
s1 = df['year_start']
print(f"mean: {s1.mean()}") # mean
print(f"std: {s1.std()}") # std
print(f"min: {s1.min()}") # min
print(f"max: {s1.max()}") # max
mean: 1985.0775653702483 std: 20.97206734360449 min: 1947 max: 2018
# correlaciones lineales
cols = ['year_start', 'year_end']
df[cols].corr()
year_start | year_end | |
---|---|---|
year_start | 1.000000 | 0.978784 |
year_end | 0.978784 | 1.000000 |
Operaciones Avanzadas¶
# crear una columna constante
df['label'] = 'NBA'
df.head()
name | year_start | year_end | position | height | weight | birth_date | college | label | |
---|---|---|---|---|---|---|---|---|---|
0 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | NaN | June 24, 1968 | NaN | NBA |
1 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | NaN | June 24, 1968 | NaN | NBA |
2 | Zaid Abdul-Aziz | 1969 | 1978 | C-F | 6-9 | 235.0 | April 7, 1946 | Iowa State University | NBA |
3 | Kareem Abdul-Jabbar | 1970 | 1989 | C | 7-2 | 225.0 | April 16, 1947 | University of California, Los Angeles | NBA |
4 | Mahmoud Abdul-Rauf | 1991 | 2001 | G | 6-1 | 162.0 | March 9, 1969 | Louisiana State University | NBA |
# Eliminar columna
df = df.drop('label',axis=1)
df.head()
name | year_start | year_end | position | height | weight | birth_date | college | |
---|---|---|---|---|---|---|---|---|
0 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | NaN | June 24, 1968 | NaN |
1 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | NaN | June 24, 1968 | NaN |
2 | Zaid Abdul-Aziz | 1969 | 1978 | C-F | 6-9 | 235.0 | April 7, 1946 | Iowa State University |
3 | Kareem Abdul-Jabbar | 1970 | 1989 | C | 7-2 | 225.0 | April 16, 1947 | University of California, Los Angeles |
4 | Mahmoud Abdul-Rauf | 1991 | 2001 | G | 6-1 | 162.0 | March 9, 1969 | Louisiana State University |
# nueva columna a partir de otras dos
df['duration'] = df['year_end']-df['year_start']
df.head()
name | year_start | year_end | position | height | weight | birth_date | college | duration | |
---|---|---|---|---|---|---|---|---|---|
0 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | NaN | June 24, 1968 | NaN | 4 |
1 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | NaN | June 24, 1968 | NaN | 4 |
2 | Zaid Abdul-Aziz | 1969 | 1978 | C-F | 6-9 | 235.0 | April 7, 1946 | Iowa State University | 9 |
3 | Kareem Abdul-Jabbar | 1970 | 1989 | C | 7-2 | 225.0 | April 16, 1947 | University of California, Los Angeles | 19 |
4 | Mahmoud Abdul-Rauf | 1991 | 2001 | G | 6-1 | 162.0 | March 9, 1969 | Louisiana State University | 10 |
# mediante funciones 'apply'
df['greater_than_10'] = df['duration'].apply(lambda x: 1 if x>10 else 0)
df.head()
name | year_start | year_end | position | height | weight | birth_date | college | duration | greater_than_10 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | NaN | June 24, 1968 | NaN | 4 | 0 |
1 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | NaN | June 24, 1968 | NaN | 4 | 0 |
2 | Zaid Abdul-Aziz | 1969 | 1978 | C-F | 6-9 | 235.0 | April 7, 1946 | Iowa State University | 9 | 0 |
3 | Kareem Abdul-Jabbar | 1970 | 1989 | C | 7-2 | 225.0 | April 16, 1947 | University of California, Los Angeles | 19 | 1 |
4 | Mahmoud Abdul-Rauf | 1991 | 2001 | G | 6-1 | 162.0 | March 9, 1969 | Louisiana State University | 10 | 0 |
df['greater_than_10'].value_counts()
0 3999 1 552 Name: greater_than_10, dtype: int64
Muchas veces, en un Dataframe se necesita realizar operaciones entre la fila actual y la fila anterior, lo cual puede ser complejo si no se utilizan las funciones correctas. A continuación se trabajan algunas de estas funciones:
shift()
: Se utiliza para mover hacia arriba o hacia abajo los valores de una columna o serie de datos.cumsum()
: es una función que se utiliza para calcular la suma acumulativa de valores a lo largo de un eje en un DataFrame o una Serie.pct_change()
: es una función que se utiliza para calcular el cambio porcentual entre los elementos de una serie o columna en un DataFrame.rank()
: es una función que se utiliza para asignar un rango a los elementos de una serie o columna en un DataFrame.
# aplicar funciones
df['shift'] = df['duration'].shift() # se muestra el valor de la fila anterior (la primera fila en este caso es NaN)
df['cumsum'] = df['duration'].cumsum() # suma acumulada entre la fila actual y todas las anteriores
df['pct_change'] = df['duration'].pct_change() # porcentaje de cambio entre la fila actual y la anterior
df['rank'] = df['duration'].rank() # ranking de los valores (donde 1 es el menor valor)
df.head()
name | year_start | year_end | position | height | weight | birth_date | college | duration | greater_than_10 | shift | cumsum | pct_change | rank | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | NaN | June 24, 1968 | NaN | 4 | 0 | NaN | 4 | NaN | 2714.5 |
1 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | NaN | June 24, 1968 | NaN | 4 | 0 | 4.0 | 8 | 0.000000 | 2714.5 |
2 | Zaid Abdul-Aziz | 1969 | 1978 | C-F | 6-9 | 235.0 | April 7, 1946 | Iowa State University | 9 | 0 | 4.0 | 17 | 1.250000 | 3720.0 |
3 | Kareem Abdul-Jabbar | 1970 | 1989 | C | 7-2 | 225.0 | April 16, 1947 | University of California, Los Angeles | 19 | 1 | 9.0 | 36 | 1.111111 | 4545.0 |
4 | Mahmoud Abdul-Rauf | 1991 | 2001 | G | 6-1 | 162.0 | March 9, 1969 | Louisiana State University | 10 | 0 | 19.0 | 46 | -0.473684 | 3912.5 |
cols = ['shift', 'cumsum','pct_change', 'rank']
df = df.drop(cols,axis=1)
df.head()
name | year_start | year_end | position | height | weight | birth_date | college | duration | greater_than_10 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | NaN | June 24, 1968 | NaN | 4 | 0 |
1 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | NaN | June 24, 1968 | NaN | 4 | 0 |
2 | Zaid Abdul-Aziz | 1969 | 1978 | C-F | 6-9 | 235.0 | April 7, 1946 | Iowa State University | 9 | 0 |
3 | Kareem Abdul-Jabbar | 1970 | 1989 | C | 7-2 | 225.0 | April 16, 1947 | University of California, Los Angeles | 19 | 1 |
4 | Mahmoud Abdul-Rauf | 1991 | 2001 | G | 6-1 | 162.0 | March 9, 1969 | Louisiana State University | 10 | 0 |
Filtrar Datos¶
Para filtrar datos en Pandas, se utiliza el método loc()
o iloc()
, dependiendo de si queremos filtrar por etiquetas de índice o por posición. Para efectos prácticos, utilizaremos solo loc()
.
# 'fecha' mayor a 2000
df_new = df.loc[df['year_start'] >= 2000]
df_new.head()
name | year_start | year_end | position | height | weight | birth_date | college | duration | greater_than_10 | |
---|---|---|---|---|---|---|---|---|---|---|
10 | Alex Abrines | 2017 | 2018 | G-F | 6-6 | 190.0 | August 1, 1993 | NaN | 1 | 0 |
11 | Alex Acker | 2006 | 2009 | G | 6-5 | 185.0 | January 21, 1983 | Pepperdine University | 3 | 0 |
15 | Quincy Acy | 2013 | 2018 | F | 6-7 | 240.0 | October 6, 1990 | Baylor University | 5 | 0 |
19 | Hassan Adams | 2007 | 2009 | G | 6-4 | 220.0 | June 20, 1984 | University of Arizona | 2 | 0 |
20 | Jordan Adams | 2015 | 2016 | G | 6-5 | 209.0 | July 8, 1994 | University of California, Los Angeles | 1 | 0 |
# crear condicion
valor_objetivo = 2000
condicion = (df['year_start'] >= valor_objetivo)
condicion.head()
0 False 1 False 2 False 3 False 4 False Name: year_start, dtype: bool
# aplicar condicion
df_new = df.loc[condicion]
df_new.head()
name | year_start | year_end | position | height | weight | birth_date | college | duration | greater_than_10 | |
---|---|---|---|---|---|---|---|---|---|---|
10 | Alex Abrines | 2017 | 2018 | G-F | 6-6 | 190.0 | August 1, 1993 | NaN | 1 | 0 |
11 | Alex Acker | 2006 | 2009 | G | 6-5 | 185.0 | January 21, 1983 | Pepperdine University | 3 | 0 |
15 | Quincy Acy | 2013 | 2018 | F | 6-7 | 240.0 | October 6, 1990 | Baylor University | 5 | 0 |
19 | Hassan Adams | 2007 | 2009 | G | 6-4 | 220.0 | June 20, 1984 | University of Arizona | 2 | 0 |
20 | Jordan Adams | 2015 | 2016 | G | 6-5 | 209.0 | July 8, 1994 | University of California, Los Angeles | 1 | 0 |
Veamos distintos tipos de filtros:
# mayor o igual a 2000
df_new = df.loc[df['year_start'] >= 2000]
df_new.head()
name | year_start | year_end | position | height | weight | birth_date | college | duration | greater_than_10 | |
---|---|---|---|---|---|---|---|---|---|---|
10 | Alex Abrines | 2017 | 2018 | G-F | 6-6 | 190.0 | August 1, 1993 | NaN | 1 | 0 |
11 | Alex Acker | 2006 | 2009 | G | 6-5 | 185.0 | January 21, 1983 | Pepperdine University | 3 | 0 |
15 | Quincy Acy | 2013 | 2018 | F | 6-7 | 240.0 | October 6, 1990 | Baylor University | 5 | 0 |
19 | Hassan Adams | 2007 | 2009 | G | 6-4 | 220.0 | June 20, 1984 | University of Arizona | 2 | 0 |
20 | Jordan Adams | 2015 | 2016 | G | 6-5 | 209.0 | July 8, 1994 | University of California, Los Angeles | 1 | 0 |
# entre 2005-2015
df_new = df.loc[df['year_start'].between(2005,2015)]
df_new.head()
name | year_start | year_end | position | height | weight | birth_date | college | duration | greater_than_10 | |
---|---|---|---|---|---|---|---|---|---|---|
11 | Alex Acker | 2006 | 2009 | G | 6-5 | 185.0 | January 21, 1983 | Pepperdine University | 3 | 0 |
15 | Quincy Acy | 2013 | 2018 | F | 6-7 | 240.0 | October 6, 1990 | Baylor University | 5 | 0 |
19 | Hassan Adams | 2007 | 2009 | G | 6-4 | 220.0 | June 20, 1984 | University of Arizona | 2 | 0 |
20 | Jordan Adams | 2015 | 2016 | G | 6-5 | 209.0 | July 8, 1994 | University of California, Los Angeles | 1 | 0 |
22 | Steven Adams | 2014 | 2018 | C | 7-0 | 255.0 | July 20, 1993 | University of Pittsburgh | 4 | 0 |
# solo 2000
df_new = df.loc[df['year_start']==2000]
df_new.head()
name | year_start | year_end | position | height | weight | birth_date | college | duration | greater_than_10 | |
---|---|---|---|---|---|---|---|---|---|---|
67 | Rafer Alston | 2000 | 2010 | G | 6-2 | 171.0 | July 24, 1976 | California State University, Fresno | 10 | 0 |
143 | Chucky Atkins | 2000 | 2010 | G | 5-11 | 160.0 | August 14, 1974 | University of South Florida | 10 | 0 |
154 | William Avery | 2000 | 2002 | G | 6-2 | 197.0 | August 8, 1979 | Duke University | 2 | 0 |
286 | Jonathan Bender | 2000 | 2010 | F | 6-11 | 202.0 | January 30, 1981 | NaN | 10 | 0 |
386 | Calvin Booth | 2000 | 2009 | C | 6-11 | 230.0 | May 7, 1976 | Pennsylvania State University | 9 | 0 |
# varias condiciones condiciones
df_new = df.loc[(df['year_start']==2000) & (df['duration']>5)]
df_new.head()
name | year_start | year_end | position | height | weight | birth_date | college | duration | greater_than_10 | |
---|---|---|---|---|---|---|---|---|---|---|
67 | Rafer Alston | 2000 | 2010 | G | 6-2 | 171.0 | July 24, 1976 | California State University, Fresno | 10 | 0 |
143 | Chucky Atkins | 2000 | 2010 | G | 5-11 | 160.0 | August 14, 1974 | University of South Florida | 10 | 0 |
286 | Jonathan Bender | 2000 | 2010 | F | 6-11 | 202.0 | January 30, 1981 | NaN | 10 | 0 |
386 | Calvin Booth | 2000 | 2009 | C | 6-11 | 230.0 | May 7, 1976 | Pennsylvania State University | 9 | 0 |
403 | Ryan Bowen | 2000 | 2010 | F | 6-7 | 215.0 | November 20, 1975 | University of Iowa | 10 | 0 |
# Filtrar por patrón de texto
df_new = df.loc[df['name'].str.contains('Michael')]
df_new.head()
name | year_start | year_end | position | height | weight | birth_date | college | duration | greater_than_10 | |
---|---|---|---|---|---|---|---|---|---|---|
21 | Michael Adams | 1986 | 1996 | G | 5-10 | 162.0 | January 19, 1963 | Boston College | 10 | 0 |
93 | Michael Anderson | 1989 | 1989 | G | 5-11 | 184.0 | March 23, 1966 | Drexel University | 0 | 0 |
104 | Michael Ansley | 1990 | 1992 | F | 6-7 | 225.0 | February 8, 1967 | University of Alabama | 2 | 0 |
261 | Michael Beasley | 2009 | 2018 | F | 6-9 | 235.0 | January 9, 1989 | Kansas State University | 9 | 0 |
430 | Michael Bradley | 2002 | 2006 | F-C | 6-10 | 245.0 | April 18, 1979 | Villanova University | 4 | 0 |
Valores Nulos y Duplicados¶
Valores Nulos¶
Un valor nulo (o faltante) representa la ausencia de un valor en una celda específica de un DataFrame o una Serie. Los valores nulos pueden ocurrir por varias razones, como datos perdidos o no disponibles, errores de medición o problemas de entrada de datos.
Los valores nulos se representan en Pandas mediante el objeto NaN
(acrónimo de "Not a Number"). NaN
es un valor especial de punto flotante definido en el estándar IEEE para representar valores no definidos o indefinidos. En Pandas, los valores nulos se representan como NaN
para las Series y DataFrames que utilizan datos numéricos, y como None
para las Series y DataFrames que utilizan datos no numéricos.
# valores nulos
df.isnull().sum()
name 0 year_start 0 year_end 0 position 1 height 1 weight 8 birth_date 31 college 304 duration 0 greater_than_10 0 dtype: int64
# Eliminar las filas que contienen valores nulos
df_new = df.dropna()
df_new.head()
name | year_start | year_end | position | height | weight | birth_date | college | duration | greater_than_10 | |
---|---|---|---|---|---|---|---|---|---|---|
2 | Zaid Abdul-Aziz | 1969 | 1978 | C-F | 6-9 | 235.0 | April 7, 1946 | Iowa State University | 9 | 0 |
3 | Kareem Abdul-Jabbar | 1970 | 1989 | C | 7-2 | 225.0 | April 16, 1947 | University of California, Los Angeles | 19 | 1 |
4 | Mahmoud Abdul-Rauf | 1991 | 2001 | G | 6-1 | 162.0 | March 9, 1969 | Louisiana State University | 10 | 0 |
5 | Tariq Abdul-Wahad | 1998 | 2003 | F | 6-6 | 223.0 | November 3, 1974 | San Jose State University | 5 | 0 |
6 | Shareef Abdur-Rahim | 1997 | 2008 | F | 6-9 | 225.0 | December 11, 1976 | University of California | 11 | 1 |
# Rellenar los valores nulos con un valor específico, por ejemplo cero
df_new = df.fillna(0)
df_new.head()
name | year_start | year_end | position | height | weight | birth_date | college | duration | greater_than_10 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | 0.0 | June 24, 1968 | 0 | 4 | 0 |
1 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | 0.0 | June 24, 1968 | 0 | 4 | 0 |
2 | Zaid Abdul-Aziz | 1969 | 1978 | C-F | 6-9 | 235.0 | April 7, 1946 | Iowa State University | 9 | 0 |
3 | Kareem Abdul-Jabbar | 1970 | 1989 | C | 7-2 | 225.0 | April 16, 1947 | University of California, Los Angeles | 19 | 1 |
4 | Mahmoud Abdul-Rauf | 1991 | 2001 | G | 6-1 | 162.0 | March 9, 1969 | Louisiana State University | 10 | 0 |
Datos Duplicados¶
En Pandas, se pueden manejar los datos duplicados utilizando el método duplicated()
y drop_duplicates()
.
- El método
duplicated()
devuelve un booleano que indica si una fila es duplicada o no, es decir, si existe otra fila con los mismos valores. - El método
drop_duplicates()
elimina las filas duplicadas de un DataFrame.
# Comprobar filas duplicadas
df.duplicated().sum()
1
# Eliminar filas duplicadas
df_new = df.drop_duplicates()
df_new.head()
name | year_start | year_end | position | height | weight | birth_date | college | duration | greater_than_10 | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | NaN | June 24, 1968 | NaN | 4 | 0 |
2 | Zaid Abdul-Aziz | 1969 | 1978 | C-F | 6-9 | 235.0 | April 7, 1946 | Iowa State University | 9 | 0 |
3 | Kareem Abdul-Jabbar | 1970 | 1989 | C | 7-2 | 225.0 | April 16, 1947 | University of California, Los Angeles | 19 | 1 |
4 | Mahmoud Abdul-Rauf | 1991 | 2001 | G | 6-1 | 162.0 | March 9, 1969 | Louisiana State University | 10 | 0 |
5 | Tariq Abdul-Wahad | 1998 | 2003 | F | 6-6 | 223.0 | November 3, 1974 | San Jose State University | 5 | 0 |
# Comprobar filas duplicadas
df_new.duplicated().sum()
0
Manipulación de Fechas¶
Pandas se desarrolló en el contexto del modelado financiero, por lo que, contiene varias herramientas para trabajar con fechas, horas y datos indexados por tiempo.
Comenzaremos por entendender las herramientas para manejar fechas y horas en Python, antes de pasar más específicamente a las herramientas proporcionadas por Pandas.
import datetime
now = datetime.datetime.now()
print(now)
2024-05-02 10:28:54.404800
Puedes acceder a partes específicas de un objeto datetime.datetime
, como el año, el mes, el día, la hora, el minuto y el segundo. Algunos de los atributos más comunes son:
year
: representa el año de la fecha y la hora.month
: representa el mes de la fecha y la hora, como un número entre 1 y 12.day
: representa el día del mes de la fecha y la hora, como un número entre 1 y 31.hour
: representa la hora del día de la fecha y la hora, como un número entre 0 y 23.minute
: representa los minutos de la hora de la fecha y la hora, como un número entre 0 y 59.second
: representa los segundos de la hora de la fecha y la hora, como un número entre 0 y 59.microsecond
: representa los microsegundos de la hora de la fecha y la hora, como un número entre 0 y 999999.
Veamos todo lo anterior, aplicado al dataframe que estamos trabajando:
# cambiar formato de la fecha
df['new_birth_date'] = pd.to_datetime(df['birth_date'])
df.head()
name | year_start | year_end | position | height | weight | birth_date | college | duration | greater_than_10 | new_birth_date | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | NaN | June 24, 1968 | NaN | 4 | 0 | 1968-06-24 |
1 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | NaN | June 24, 1968 | NaN | 4 | 0 | 1968-06-24 |
2 | Zaid Abdul-Aziz | 1969 | 1978 | C-F | 6-9 | 235.0 | April 7, 1946 | Iowa State University | 9 | 0 | 1946-04-07 |
3 | Kareem Abdul-Jabbar | 1970 | 1989 | C | 7-2 | 225.0 | April 16, 1947 | University of California, Los Angeles | 19 | 1 | 1947-04-16 |
4 | Mahmoud Abdul-Rauf | 1991 | 2001 | G | 6-1 | 162.0 | March 9, 1969 | Louisiana State University | 10 | 0 | 1969-03-09 |
# ver atributos
df.dtypes
name object year_start int64 year_end int64 position object height object weight float64 birth_date object college object duration int64 greater_than_10 int64 new_birth_date datetime64[ns] dtype: object
# calcular atributos como año, mes, dia, ...
df['year'] = df['new_birth_date'].dt.year
df['month'] = df['new_birth_date'].dt.month
df['day'] = df['new_birth_date'].dt.day
df['hour'] = df['new_birth_date'].dt.hour
df['minute'] = df['new_birth_date'].dt.minute
df['second'] = df['new_birth_date'].dt.second
df['microsecond'] = df['new_birth_date'].dt.microsecond
df.head()
name | year_start | year_end | position | height | weight | birth_date | college | duration | greater_than_10 | new_birth_date | year | month | day | hour | minute | second | microsecond | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | NaN | June 24, 1968 | NaN | 4 | 0 | 1968-06-24 | 1968.0 | 6.0 | 24.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | NaN | June 24, 1968 | NaN | 4 | 0 | 1968-06-24 | 1968.0 | 6.0 | 24.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | Zaid Abdul-Aziz | 1969 | 1978 | C-F | 6-9 | 235.0 | April 7, 1946 | Iowa State University | 9 | 0 | 1946-04-07 | 1946.0 | 4.0 | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | Kareem Abdul-Jabbar | 1970 | 1989 | C | 7-2 | 225.0 | April 16, 1947 | University of California, Los Angeles | 19 | 1 | 1947-04-16 | 1947.0 | 4.0 | 16.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | Mahmoud Abdul-Rauf | 1991 | 2001 | G | 6-1 | 162.0 | March 9, 1969 | Louisiana State University | 10 | 0 | 1969-03-09 | 1969.0 | 3.0 | 9.0 | 0.0 | 0.0 | 0.0 | 0.0 |
# lista de columnas a eliminar
columnas_a_eliminar = ['year', 'month', 'day', 'hour', 'minute', 'second', 'microsecond']
# eliminar las columnas
df = df.drop(columnas_a_eliminar, axis=1)
# mostrar dataframe
df.head()
name | year_start | year_end | position | height | weight | birth_date | college | duration | greater_than_10 | new_birth_date | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | NaN | June 24, 1968 | NaN | 4 | 0 | 1968-06-24 |
1 | Alaa Abdelnaby | 1991 | 1995 | F-C | 6-10 | NaN | June 24, 1968 | NaN | 4 | 0 | 1968-06-24 |
2 | Zaid Abdul-Aziz | 1969 | 1978 | C-F | 6-9 | 235.0 | April 7, 1946 | Iowa State University | 9 | 0 | 1946-04-07 |
3 | Kareem Abdul-Jabbar | 1970 | 1989 | C | 7-2 | 225.0 | April 16, 1947 | University of California, Los Angeles | 19 | 1 | 1947-04-16 |
4 | Mahmoud Abdul-Rauf | 1991 | 2001 | G | 6-1 | 162.0 | March 9, 1969 | Louisiana State University | 10 | 0 | 1969-03-09 |