Data Wrangling#

Cargo Bay#

../_images/containers.jpeg

One day you decide to inspect the cargo bay of your spaceship.

What a mess!

Many of the containers are badly labeled. The radioactive waste is next to the ice cream. Some containers are not labeled at all.

Time for a proper cleanup of the cargo docs in cargo.csv.


Sort Rows#
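
Rows are typically sorted with sort_values(). The following is a minimal sketch, not the original example of this page: the small DataFrame and its values are made up for illustration, and only the column names species and body_mass_g follow the other snippets here.

```python
import pandas as pd

# hypothetical sample data, for illustration only
df = pd.DataFrame({
    'species': ['Gentoo', 'Adelie', 'Chinstrap', 'Adelie'],
    'body_mass_g': [5000.0, 3750.0, 3800.0, 3250.0],
})

# sort by one column in ascending order (the default)
df_sorted = df.sort_values('body_mass_g')

# sort by several columns: species A-Z, heaviest first within each species
df_multi = df.sort_values(['species', 'body_mass_g'], ascending=[True, False])

# sort by the index labels instead of by column values
df_by_index = df_multi.sort_index()
```

Like most pandas methods, sort_values() returns a new DataFrame and leaves df untouched.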

Change Column Data Type#

# convert to string
df['mass'] = df['body_mass_g'].astype(str) + 'g'
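
The opposite direction, from text to numbers, usually needs a little more care because some entries may not be parseable. A possible sketch, assuming a made-up column of strings (the data and the new column names mass_num and mass_int are hypothetical):

```python
import pandas as pd

# hypothetical sample data: numbers stored as text
df = pd.DataFrame({'body_mass_g': ['3750', '3800', 'unknown']})

# errors='coerce' turns unparseable entries into NaN instead of raising
df['mass_num'] = pd.to_numeric(df['body_mass_g'], errors='coerce')

# astype(int) fails on NaN, so fill (or drop) missing values first
df['mass_int'] = df['mass_num'].fillna(0).astype(int)
```

Whether 0 is a sensible replacement depends on your data; it is only a placeholder here.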

Set the Index Column#

The index is a special column of a DataFrame, because it is treated differently by many operations in pandas.

# put the species column in the index
df_species = df.set_index('species')

# now you can select by species easily:
df_species.loc['Gentoo']

Note that the inplace=True parameter modifies the DataFrame instead of returning a new one:

df.set_index('species', inplace=True)

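This notation is often assumed to be more memory-efficient (pandas may still copy the data internally), and it is trickier in Jupyter notebooks: when you run that line twice, the second run fails with a KeyError because species is no longer a column.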

To move the index to a regular column, use:

df_reset = df.reset_index()  # inserts a numerical index
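
To see the round trip in one place, here is a small sketch without inplace=True. The sample data is invented for illustration; drop=True is a standard reset_index() parameter that discards the old index instead of turning it into a column.

```python
import pandas as pd

# hypothetical sample data
df = pd.DataFrame({
    'species': ['Adelie', 'Gentoo', 'Gentoo'],
    'body_mass_g': [3750.0, 5000.0, 5400.0],
})

df_species = df.set_index('species')   # 'species' is now the index
gentoo = df_species.loc['Gentoo']      # all rows labeled 'Gentoo'

df_reset = df_species.reset_index()    # 'species' is a regular column again
df_no_species = df_species.reset_index(drop=True)  # discard the old index instead
```

Because every step returns a new DataFrame, you can re-run these lines in a notebook as often as you like.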

Missing Values#

Missing values are a common phenomenon. A quick way to diagnose missing values is:

df.isna().sum().plot.bar()

Often, you might simply want to kick out all rows in which a None or NaN occurs:

df_dropped = df.dropna(inplace=False)  # same logic as with set_index()

Alternatively, you might want to fill in a best guess value:

df_fixed = df.fillna(42)
# or
df_fixed = df.fillna(df.median(numeric_only=True))  # median of each numeric column

There are many, many strategies to fix missing values (imputation methods).
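
One simple strategy is to choose a different fill value for each column. The sketch below is only one option among many and uses invented sample data; the placeholder label 'unknown' is an assumption, not a recommendation.

```python
import numpy as np
import pandas as pd

# hypothetical sample data with gaps
df = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo', None],
    'body_mass_g': [3700.0, np.nan, 5000.0, 5400.0, 4000.0],
})

# count missing values per column
missing = df.isna().sum()

# fill each column with its own strategy
df_fixed = df.fillna({
    'species': 'unknown',                        # placeholder label for text
    'body_mass_g': df['body_mass_g'].median(),   # median of the observed values
})
```

Passing a dictionary to fillna() maps column names to fill values, so text and numeric columns can be treated separately.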

Swap Rows and Columns#

Some operations (especially plotting) are easier to implement if you turn a DataFrame by 90°:

df.transpose()
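
A tiny example makes the effect visible. The data and the column flipper_length_mm are made up for illustration:

```python
import pandas as pd

# hypothetical sample data
df = pd.DataFrame(
    {'body_mass_g': [3750.0, 5000.0], 'flipper_length_mm': [181.0, 217.0]},
    index=['Adelie', 'Gentoo'],
)

df_t = df.transpose()   # df.T is a shorthand for the same operation

# the former columns are now row labels, and the former index labels are columns
print(df_t)
```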

Iterate#

Usually, it is possible to write one-liners or concise expressions that get the job done. If this is not possible (or you are still learning this stuff and can’t figure out a better way yet), you may want to fall back to a for loop over all the rows.

for index, row in df.iterrows():
    print(index, row['body_mass_g'])
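
For comparison, here is a sketch of the same calculation written once as a loop and once as a concise column-wise expression. The sample data and the new column body_mass_kg are hypothetical.

```python
import pandas as pd

# hypothetical sample data
df = pd.DataFrame({'body_mass_g': [3750.0, 5000.0, 3250.0]})

# row-by-row version
masses_kg_loop = []
for index, row in df.iterrows():
    masses_kg_loop.append(row['body_mass_g'] / 1000)

# column-wise version: one expression, no explicit loop
df['body_mass_kg'] = df['body_mass_g'] / 1000
```

Both give the same numbers; the column-wise version is shorter and usually much faster on large tables.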

../_images/bamboo.jpg

Challenge#

Take care of the following clean-ups in the cargo docs cargo.csv:

  • for the radioactive waste, replace the words in the units column by numbers

  • convert the units column to the type int

  • fill the missing values in the category column for the bamboo ice cream

  • fill the missing values in the units column

  • sort the crates by type and by identifier in ascending order