Data Wrangling#
Cargo Bay#
One day you decide to inspect the cargo bay of your spaceship.
What a mess!
Many of the containers are badly labeled. The radioactive waste is next to the ice cream. Some containers are not labeled at all.
Time for a proper cleanup of the cargo docs in cargo.csv.
Sort Rows#
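Rows are usually sorted with `sort_values()`, while `sort_index()` sorts by the index labels. A minimal sketch, assuming a small penguin-style table; the sample values are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data; the values are made up for illustration.
df = pd.DataFrame({
    "species": ["Gentoo", "Adelie", "Chinstrap", "Adelie"],
    "body_mass_g": [5000.0, 3700.0, 3800.0, 3450.0],
})

# Sort by a single column, smallest value first.
by_mass = df.sort_values("body_mass_g")

# Sort by several columns: species A-Z, then heaviest first within a species.
by_species_mass = df.sort_values(
    ["species", "body_mass_g"], ascending=[True, False]
)

# Sort by the index labels instead of by column values.
by_index = by_mass.sort_index()
```

Like most pandas methods, `sort_values()` returns a new DataFrame and leaves `df` untouched.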
Change Column Data Type#
# convert to string
df['mass'] = df['body_mass_g'].astype(str) + 'g'
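The opposite direction, from text to numbers, tends to need more care, because a single unparseable entry makes `astype(int)` fail. A possible sketch, assuming a hypothetical `units` column with made-up values; the choice of `0` as a fill value is only a placeholder:

```python
import pandas as pd

# Hypothetical column with numbers stored as text; the values are made up.
df = pd.DataFrame({"units": ["3", "12", "seven", "5"]})

# errors="coerce" turns every entry that cannot be parsed into NaN.
df["units_num"] = pd.to_numeric(df["units"], errors="coerce")

# NaN cannot be stored in a plain int column, so fill it first.
# The fill value 0 is an arbitrary placeholder for this example.
df["units_int"] = df["units_num"].fillna(0).astype(int)
```

Check `df.dtypes` before and after to confirm that the conversion did what you expected.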
Set the Index Column#
The index is a special column of a DataFrame, because it is treated differently by many operations in pandas.
# put the species column in the index
df_species = df.set_index('species')
# now you can select by species easily:
df_species.loc['Gentoo']
Note that the inplace=True parameter modifies the DataFrame instead of returning a new one:
df.set_index('species', inplace=True)
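This notation looks more memory-efficient, although pandas often still creates a copy internally. It is also trickier in Jupyter notebooks: if you run that line twice, the second run raises a KeyError, because species is already in the index and no longer a column.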
To move the index to a regular column, use:
df_reset = df.reset_index() # inserts a numerical index
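The round trip between a column and the index can be sketched as follows, assuming a tiny made-up table with the same `species` and `body_mass_g` columns:

```python
import pandas as pd

# Hypothetical sample data; the values are made up for illustration.
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3700.0, 5000.0, 5400.0],
})

# Without inplace=True, set_index() returns a new DataFrame; df is unchanged.
df_species = df.set_index("species")

# Label-based selection: all rows whose index label is 'Gentoo'.
gentoo = df_species.loc["Gentoo"]

# reset_index() turns species back into a regular column.
df_reset = df_species.reset_index()
```

Because every step returns a new object, the cells can be re-run in any order without changing `df`.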
Missing Values#
Missing values are a common phenomenon. A quick way to diagnose missing values is:
df.isna().sum().plot.bar()
Often, you might simply want to kick out all rows in which a None or NaN occurs:
df_dropped = df.dropna(inplace=False) # same logic as with set_index()
Alternatively, you might want to fill in a best guess value:
df_fixed = df.fillna(42)
# or
df_fixed = df.fillna(df.median(numeric_only=True))  # median of the numeric columns only
There are many, many strategies to fix missing values (imputation methods).
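One such strategy is to fill each gap with the median of its own group rather than the median of the whole column. A minimal sketch, assuming made-up values and grouping by `species`; whether this is a sensible guess depends on your data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with two missing measurements.
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "body_mass_g": [3400.0, np.nan, 3800.0, 5000.0, 5400.0, np.nan],
})

# Median per species, broadcast back to the original row order.
# Missing values are skipped when the median is calculated.
group_median = df.groupby("species")["body_mass_g"].transform("median")

# Fill each gap with the median of the matching species.
df["body_mass_filled"] = df["body_mass_g"].fillna(group_median)
```

Writing the result to a new column keeps the original values available for comparison.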
Swap Rows and Columns#
Some operations (especially plotting) are easier to implement if you turn a DataFrame by 90°:
df.transpose()
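To see what happens to the labels, here is a small sketch with made-up values; the `flipper_length_mm` column is an assumption added for illustration:

```python
import pandas as pd

# Hypothetical sample data: one row per species, one column per measurement.
df = pd.DataFrame(
    {"body_mass_g": [3700.0, 5000.0], "flipper_length_mm": [190.0, 217.0]},
    index=["Adelie", "Gentoo"],
)

# Row labels become column labels and vice versa.
# df.T is a shorthand for the same operation.
df_t = df.transpose()
```

After transposing, a value that was at `df.loc['Gentoo', 'body_mass_g']` is found at `df_t.loc['body_mass_g', 'Gentoo']`.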
Iterate#
Usually, it is possible to write one-liners or concise expressions that
get the job done. If this is not possible (or you are still learning
this stuff and can’t figure out a better way yet), you may want to fall
back to a for loop over all the rows.
for index, row in df.iterrows():
    print(index, row['body_mass_g'])
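For comparison, here is the same calculation written once as a loop and once as a column expression. This is only a sketch with made-up values, and the `body_mass_kg` column name is an assumption:

```python
import pandas as pd

# Hypothetical sample data; the values are made up for illustration.
df = pd.DataFrame({"body_mass_g": [3700.0, 5000.0, 3800.0]})

# Row-by-row version: collect one result per row in a list.
mass_kg_loop = []
for index, row in df.iterrows():
    mass_kg_loop.append(row["body_mass_g"] / 1000)

# Vectorized version: the division is applied to the whole column at once.
df["body_mass_kg"] = df["body_mass_g"] / 1000
```

Both versions produce the same numbers. The column expression is shorter and usually much faster on large tables.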
Challenge#
Take care of the following clean-ups in the cargo docs cargo.csv:
for the radioactive waste, replace the words in the units column by numbers
convert the units column to the type int
fill the missing values in the category column for the bamboo ice cream
fill the missing values in the units column
sort the crates by type and by identifier in ascending order