Merge y Concat¶
Concat¶
concat()
es una función de Pandas que se utiliza para combinar DataFrames y Series en un solo objeto. Permite unir varios objetos a lo largo de un eje en particular con opciones para configurar la forma en que se realiza la concatenación.
Observación: Se define la función display_side_by_side
para poder imprimir dos o más dataframe lados a lado.
from IPython.display import display_html
def display_side_by_side(*args):
html_str = ''
for df in args:
html_str += df.to_html()
html_str += ' '
display_html(
html_str.replace('table','table style="display:inline"'),
raw=True
)
Concatenar dos DataFrames verticalmente¶
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [7, 8, 9], 'B': [10, 11, 12]})
result = pd.concat([df1, df2])
display_side_by_side(df1,df2,result)
A | B | |
---|---|---|
0 | 1 | 4 |
1 | 2 | 5 |
2 | 3 | 6 |
A | B | |
---|---|---|
0 | 7 | 10 |
1 | 8 | 11 |
2 | 9 | 12 |
A | B | |
---|---|---|
0 | 1 | 4 |
1 | 2 | 5 |
2 | 3 | 6 |
0 | 7 | 10 |
1 | 8 | 11 |
2 | 9 | 12 |
Concatenar dos DataFrames horizontalmente¶
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'C': [7, 8, 9], 'D': [10, 11, 12]})
result = pd.concat([df1, df2], axis=1)
display_side_by_side(df1,df2,result)
A | B | |
---|---|---|
0 | 1 | 4 |
1 | 2 | 5 |
2 | 3 | 6 |
C | D | |
---|---|---|
0 | 7 | 10 |
1 | 8 | 11 |
2 | 9 | 12 |
A | B | C | D | |
---|---|---|---|---|
0 | 1 | 4 | 7 | 10 |
1 | 2 | 5 | 8 | 11 |
2 | 3 | 6 | 9 | 12 |
Concatenar dos DataFrames verticalmente con distintas columnas¶
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [7, 8, 9], 'C': [10, 11, 12]})
result = pd.concat([df1, df2])
display_side_by_side(df1,df2,result)
A | B | |
---|---|---|
0 | 1 | 4 |
1 | 2 | 5 |
2 | 3 | 6 |
A | C | |
---|---|---|
0 | 7 | 10 |
1 | 8 | 11 |
2 | 9 | 12 |
A | B | C | |
---|---|---|---|
0 | 1 | 4.0 | NaN |
1 | 2 | 5.0 | NaN |
2 | 3 | 6.0 | NaN |
0 | 7 | NaN | 10.0 |
1 | 8 | NaN | 11.0 |
2 | 9 | NaN | 12.0 |
Concatenar DataFrames con diferentes índices¶
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=[0, 1, 2])
df2 = pd.DataFrame({'A': [7, 8, 9], 'B': [10, 11, 12]}, index=[3, 4, 5])
result = pd.concat([df1, df2])
display_side_by_side(df1,df2,result)
A | B | |
---|---|---|
0 | 1 | 4 |
1 | 2 | 5 |
2 | 3 | 6 |
A | B | |
---|---|---|
3 | 7 | 10 |
4 | 8 | 11 |
5 | 9 | 12 |
A | B | |
---|---|---|
0 | 1 | 4 |
1 | 2 | 5 |
2 | 3 | 6 |
3 | 7 | 10 |
4 | 8 | 11 |
5 | 9 | 12 |
Merge¶
merge()
es una función de Pandas que se utiliza para combinar dos o más DataFrames en un solo DataFrame. La función merge()
combina los datos basándose en las columnas compartidas, también conocidas como claves.
Combinar dos DataFrames en función de una columna común¶
import pandas as pd
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
result = pd.merge(left, right, on='key')
display_side_by_side(left,right,result)
key | A | B | |
---|---|---|---|
0 | K0 | A0 | B0 |
1 | K1 | A1 | B1 |
2 | K2 | A2 | B2 |
3 | K3 | A3 | B3 |
key | C | D | |
---|---|---|---|
0 | K0 | C0 | D0 |
1 | K1 | C1 | D1 |
2 | K2 | C2 | D2 |
3 | K3 | C3 | D3 |
key | A | B | C | D | |
---|---|---|---|---|---|
0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | A1 | B1 | C1 | D1 |
2 | K2 | A2 | B2 | C2 | D2 |
3 | K3 | A3 | B3 | C3 | D3 |
Combinar dos DataFrames en función de una columna común - con valores duplicados¶
import pandas as pd
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K0', 'K2', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
result = pd.merge(left, right, on='key')
display_side_by_side(left,right,result)
key | A | B | |
---|---|---|---|
0 | K0 | A0 | B0 |
1 | K1 | A1 | B1 |
2 | K2 | A2 | B2 |
3 | K3 | A3 | B3 |
key | C | D | |
---|---|---|---|
0 | K0 | C0 | D0 |
1 | K0 | C1 | D1 |
2 | K2 | C2 | D2 |
3 | K3 | C3 | D3 |
key | A | B | C | D | |
---|---|---|---|---|---|
0 | K0 | A0 | B0 | C0 | D0 |
1 | K0 | A0 | B0 | C1 | D1 |
2 | K2 | A2 | B2 | C2 | D2 |
3 | K3 | A3 | B3 | C3 | D3 |
Combinar dos DataFrames en función de múltiples columnas comunes¶
import pandas as pd
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
result = pd.merge(left, right, on=['key1', 'key2'])
display_side_by_side(left,right,result)
key1 | key2 | A | B | |
---|---|---|---|---|
0 | K0 | K0 | A0 | B0 |
1 | K0 | K1 | A1 | B1 |
2 | K1 | K0 | A2 | B2 |
3 | K2 | K1 | A3 | B3 |
key1 | key2 | C | D | |
---|---|---|---|---|
0 | K0 | K0 | C0 | D0 |
1 | K0 | K1 | C1 | D1 |
2 | K1 | K0 | C2 | D2 |
3 | K2 | K1 | C3 | D3 |
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K0 | K1 | A1 | B1 | C1 | D1 |
2 | K1 | K0 | A2 | B2 | C2 | D2 |
3 | K2 | K1 | A3 | B3 | C3 | D3 |
Unir dos DataFrames utilizando diferentes columnas en cada DataFrame¶
import pandas as pd
left = pd.DataFrame({'key1': ['A', 'B', 'C', 'D'], 'value1': [1, 2, 3, 4]})
right = pd.DataFrame({'key2': ['A', 'B', 'C', 'D'], 'value2': [5, 6, 7, 8]})
result = pd.merge(left, right, left_on='key1', right_on='key2')
display_side_by_side(left,right,result)
key1 | value1 | |
---|---|---|
0 | A | 1 |
1 | B | 2 |
2 | C | 3 |
3 | D | 4 |
key2 | value2 | |
---|---|---|
0 | A | 5 |
1 | B | 6 |
2 | C | 7 |
3 | D | 8 |
key1 | value1 | key2 | value2 | |
---|---|---|---|---|
0 | A | 1 | A | 5 |
1 | B | 2 | B | 6 |
2 | C | 3 | C | 7 |
3 | D | 4 | D | 8 |
Combinar Dataframes utilizando el comando how
¶
Existen distintos tipos de merge, para ello se utiliza la opción how
, el cual especificica el tipo de cruce que se realizará.
- left: usa las llaves solo de la tabla izquierda
- right: usa las llaves solo de la tabla derecha
- outer: usa las llaves de la unión de ambas tablas.
- inner: usa las llaves de la intersección de ambas tablas.
import pandas as pd
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
# left merge
result = pd.merge(left, right, how= 'left', on=['key1', 'key2'])
display_side_by_side(left,right,result)
key1 | key2 | A | B | |
---|---|---|---|---|
0 | K0 | K0 | A0 | B0 |
1 | K0 | K1 | A1 | B1 |
2 | K1 | K0 | A2 | B2 |
3 | K2 | K1 | A3 | B3 |
key1 | key2 | C | D | |
---|---|---|---|---|
0 | K0 | K0 | C0 | D0 |
1 | K1 | K0 | C1 | D1 |
2 | K1 | K0 | C2 | D2 |
3 | K2 | K0 | C3 | D3 |
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K0 | K1 | A1 | B1 | NaN | NaN |
2 | K1 | K0 | A2 | B2 | C1 | D1 |
3 | K1 | K0 | A2 | B2 | C2 | D2 |
4 | K2 | K1 | A3 | B3 | NaN | NaN |
# right merge
result = pd.merge(left, right, how='right', on=['key1', 'key2'])
display_side_by_side(left,right,result)
key1 | key2 | A | B | |
---|---|---|---|---|
0 | K0 | K0 | A0 | B0 |
1 | K0 | K1 | A1 | B1 |
2 | K1 | K0 | A2 | B2 |
3 | K2 | K1 | A3 | B3 |
key1 | key2 | C | D | |
---|---|---|---|---|
0 | K0 | K0 | C0 | D0 |
1 | K1 | K0 | C1 | D1 |
2 | K1 | K0 | C2 | D2 |
3 | K2 | K0 | C3 | D3 |
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | K0 | A2 | B2 | C1 | D1 |
2 | K1 | K0 | A2 | B2 | C2 | D2 |
3 | K2 | K0 | NaN | NaN | C3 | D3 |
# outer merge
result = pd.merge(left, right, how='outer', on=['key1', 'key2'])
display_side_by_side(left,right,result)
key1 | key2 | A | B | |
---|---|---|---|---|
0 | K0 | K0 | A0 | B0 |
1 | K0 | K1 | A1 | B1 |
2 | K1 | K0 | A2 | B2 |
3 | K2 | K1 | A3 | B3 |
key1 | key2 | C | D | |
---|---|---|---|---|
0 | K0 | K0 | C0 | D0 |
1 | K1 | K0 | C1 | D1 |
2 | K1 | K0 | C2 | D2 |
3 | K2 | K0 | C3 | D3 |
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K0 | K1 | A1 | B1 | NaN | NaN |
2 | K1 | K0 | A2 | B2 | C1 | D1 |
3 | K1 | K0 | A2 | B2 | C2 | D2 |
4 | K2 | K0 | NaN | NaN | C3 | D3 |
5 | K2 | K1 | A3 | B3 | NaN | NaN |
# inner merge
result = pd.merge(left, right, how='inner', on=['key1', 'key2'])
display_side_by_side(left,right,result)
key1 | key2 | A | B | |
---|---|---|---|---|
0 | K0 | K0 | A0 | B0 |
1 | K0 | K1 | A1 | B1 |
2 | K1 | K0 | A2 | B2 |
3 | K2 | K1 | A3 | B3 |
key1 | key2 | C | D | |
---|---|---|---|---|
0 | K0 | K0 | C0 | D0 |
1 | K1 | K0 | C1 | D1 |
2 | K1 | K0 | C2 | D2 |
3 | K2 | K0 | C3 | D3 |
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | K0 | A2 | B2 | C1 | D1 |
2 | K1 | K0 | A2 | B2 | C2 | D2 |
Problemas de llaves duplicadas¶
Cuando se quiere realizar el cruce de dos tablas, pero an ambas tablas existe una columna (key
) con el mismo nombre, para diferenciar la información entre la columna de una tabla y otra, pandas devulve el nombre de la columna con un guión bajo x (key_x
) y otra con un guión bajo y (key_y
)
left = pd.DataFrame({'A': [1, 2, 3], 'B': [1,2,3]})
right = pd.DataFrame({'A': [4, 5, 6], 'B': [1, 2, 3]})
result = pd.merge(left, right, on='B')
display_side_by_side(left,right,result)
A | B | |
---|---|---|
0 | 1 | 1 |
1 | 2 | 2 |
2 | 3 | 3 |
A | B | |
---|---|---|
0 | 4 | 1 |
1 | 5 | 2 |
2 | 6 | 3 |
A_x | B | A_y | |
---|---|---|---|
0 | 1 | 1 | 4 |
1 | 2 | 2 | 5 |
2 | 3 | 3 | 6 |