Last Letters#
🎯 Analyze which last letters appear most frequently in first names.
Task 1#
Read all files from the Baby Name Dataset into a single
DataFrame
. The DataFrame
should have the columns name, gender, number and year.
The following code snippets might be useful (although the order and indentation is wrong):
df['year'] = year
df = pd.concat([df1, df2, df3, ...])
for year in range(1880, 2015):
dataframes = []
Task 2#
Create an extra column with the last letter of each name.
Create string indexes for an entire column with the expression
df[col].str[i]
Alternatively, write a function that returns the last letter and use it
to create a pd.Series
:
df[col].apply(my_func)
Task 3#
Create a bar plot showing the count of each last letter.
Task 4#
Now, create a timeline for one last letter.
First, select that letter over all years. Second, group by the year and calculate the count for each year. The resulting table should look like this:
year count
1880 300
1881 317
1882 342
...
Finally, create a line plot from this data.
Task 5#
Let’s plot multiple timelines.
Count the names grouped by last letter and year. Use the expression:
df.groupby([col1, col2])[col3].count()
The resulting table should look something like this:
last year
a 1880 31446
1881 31581
1882 36536
...
b 1880 5432
This DataFrame
has a hierarchical index.
Convert the DataFrame to a crosstable that has the year in the row index and the letters in the column index. You can do this with the expression
df.unstack(0)
Draw a line plot showing the frequency of the letters d, n and y. Try other ones if you like.
Task 6#
Finally, let’s look for frequent first/last letter combinations.
Add an extra column containing the first letter.
Cross-tabulate by grouping by first and last letter and count the names (over all years).
Now you should have a table with first letters in columns an last letters in rows (or vice versa).
Plot a heatmap (check the Seaborn Example Gallery).
Hints:#
To make the plot nicer convert the names to upper or lower case at the very beginning with:
df['name'].str.uppper()
You also might sort the table by rows:
df.sort_values(by=col, axis=0)
For sorting by columns, set axis=0
Task 7#
Save your plots to .png
files with 150 dpi.
Task 8#
What visualization(s) would you use to compare the last letters of girls and boys?
Hint
Instead of the count, you can use the sum instead.
You might also try to log-transform the data with
np.log
before plotting.It might be a good idea to normalize the data before plotting.
Of course, the entire analysis also can be done for first letters, but for the last letters a research paper exists that had quite an impact.
Translated withwww.DeepL.com