Lesson 3 extension: Data wrangling, t-tests and ANOVA

Course website: https://hannahmetzler.eu/R_intro/

Author

Hannah Metzler

Published

February 20, 2025

1 Project for extension of lesson 3: Comparison of baby names in Austria and the US

What about baby names in Austria?

Below, we read in data from Austria, make it comparable to the US data set, and then run a t-test to compare the popularity of baby names in the two countries.

2 Reading in and looking at data

2.1 Data from Austria

  • I downloaded data about given names in Austria from 1984 to 2022 from data.bv.at. Download the first file from there and put it into your data folder.

Continue working in your cleaning script from lesson 3, in the # READ DATA #### section.

  • To read the Austrian data, try the function read_csv(). Add after the lines that read in the data from the USA.
data_aut <- readr::read_csv('data/OGDEXT_VORNAMEN_1.csv')
head(data_aut)
C-JAHR-0;C-WOHNBEZIRK-0;C-GESCHLECHT-0;F-VORNAME_NORMALISIERT;F-ANZAHL_LGEB
1984;101;1;Andreas;4
1984;101;1;Bernd;1
1984;101;1;Christian;2
1984;101;1;Christoph;2
1984;101;1;Clemens;1
1984;101;1;Daniel;3

You can see that the data was not read in correctly. Why?

  • Cells in CSV-files are most commonly separated by a comma, with the point used for the decimal point.
  • In some European countries (including Austria) cells are instead separated by a semicolon, and a comma used as the decimal point.
  • Here, you can see that these data from Austria are indeed separated by a semicolon.
  • If you check the help section for ?read_csv, you see that you can use the function read_csv2() to read in CSV-files with the field separator ;.
data_aut <- readr::read_csv2('data/OGDEXT_VORNAMEN_1.csv')
head(data_aut)
C-JAHR-0 C-WOHNBEZIRK-0 C-GESCHLECHT-0 F-VORNAME_NORMALISIERT F-ANZAHL_LGEB
1984 101 1 Andreas 4
1984 101 1 Bernd 1
1984 101 1 Christian 2
1984 101 1 Christoph 2
1984 101 1 Clemens 1
1984 101 1 Daniel 3

2.2 Data from the US

Here, I load the baby names data set, a default package included in the library “babynames”. To run the code below, you first need to install the package (via the console), and then load it by adding library(babynames) in your section for loading packages. This dataset, as opposed to the smaller one we used in lesson 3, includes all given names in the US, not just the ones from our course participants. Assign the dataset to data_usa.

# load the dataset
data(babynames)
# assign it to data_usa
data_usa = babynames

3 Get an overview of the dataset

To get an overview of this new data set, you could use any of the following functions:

dplyr::glimpse(data_aut)
Rows: 1,265,460
Columns: 5
$ `C-JAHR-0`               <dbl> 1984, 1984, 1984, 1984, 1984, 1984, 1984, 198…
$ `C-WOHNBEZIRK-0`         <dbl> 101, 101, 101, 101, 101, 101, 101, 101, 101, …
$ `C-GESCHLECHT-0`         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ `F-VORNAME_NORMALISIERT` <chr> "Andreas", "Bernd", "Christian", "Christoph",…
$ `F-ANZAHL_LGEB`          <dbl> 4, 1, 2, 2, 1, 3, 2, 1, 3, 1, 1, 1, 1, 1, 1, …
  • summary() shows us that the earliest year in the data is 1984, the latest one 2022.
    • It also shows us that the column “Geschlecht” (sex) is not yet coded as a categorical variable.
summary(data_aut)
    C-JAHR-0    C-WOHNBEZIRK-0 C-GESCHLECHT-0  F-VORNAME_NORMALISIERT
 Min.   :1984   Min.   :101    Min.   :1.000   Length:1265460        
 1st Qu.:1996   1st Qu.:319    1st Qu.:1.000   Class :character      
 Median :2007   Median :506    Median :2.000   Mode  :character      
 Mean   :2005   Mean   :563    Mean   :1.535                         
 3rd Qu.:2015   3rd Qu.:804    3rd Qu.:2.000                         
 Max.   :2022   Max.   :923    Max.   :2.000                         
 F-ANZAHL_LGEB  
 Min.   : 1.00  
 1st Qu.: 1.00  
 Median : 1.00  
 Mean   : 2.36  
 3rd Qu.: 2.00  
 Max.   :76.00  
  • The function View() opens the data in an extra window.
View(data_aut)
  • The function str() shows you the structure of the data set.
str(data_aut) 
spc_tbl_ [1,265,460 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ C-JAHR-0              : num [1:1265460] 1984 1984 1984 1984 1984 ...
 $ C-WOHNBEZIRK-0        : num [1:1265460] 101 101 101 101 101 101 101 101 101 101 ...
 $ C-GESCHLECHT-0        : num [1:1265460] 1 1 1 1 1 1 1 1 1 1 ...
 $ F-VORNAME_NORMALISIERT: chr [1:1265460] "Andreas" "Bernd" "Christian" "Christoph" ...
 $ F-ANZAHL_LGEB         : num [1:1265460] 4 1 2 2 1 3 2 1 3 1 ...
 - attr(*, "spec")=
  .. cols(
  ..   `C-JAHR-0` = col_double(),
  ..   `C-WOHNBEZIRK-0` = col_double(),
  ..   `C-GESCHLECHT-0` = col_double(),
  ..   `F-VORNAME_NORMALISIERT` = col_character(),
  ..   `F-ANZAHL_LGEB` = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

4 Cleaning and organizing the data

4.1 Renaming columns

Let’s use more useful column names for coding, and choose the same names as in the US baby names data. Check the names of the US data with names():

names(data_usa)
[1] "year" "sex"  "name" "n"    "prop"

We can use the dplyr::rename()function to change column names.

  • You simply assign each old variable name (in quotation marks) to the new variable name.
  • Add the following in your # CLEAN DATA #### section:
data_aut_clean <- data_aut %>%
  rename(year = 'C-JAHR-0', 
         sex = 'C-GESCHLECHT-0',
         living_district = 'C-WOHNBEZIRK-0',
         name = 'F-VORNAME_NORMALISIERT',
         n = 'F-ANZAHL_LGEB') 
head(data_aut_clean)
year living_district sex name n
1984 101 1 Andreas 4
1984 101 1 Bernd 1
1984 101 1 Christian 2
1984 101 1 Christoph 2
1984 101 1 Clemens 1
1984 101 1 Daniel 3

4.2 Selecting rows

If we compare the US and the Austrian data set, we can see that they cover different time spans.

summary(data_usa)
      year          sex                name                 n          
 Min.   :1880   Length:1924665     Length:1924665     Min.   :    5.0  
 1st Qu.:1951   Class :character   Class :character   1st Qu.:    7.0  
 Median :1985   Mode  :character   Mode  :character   Median :   12.0  
 Mean   :1975                                         Mean   :  180.9  
 3rd Qu.:2003                                         3rd Qu.:   32.0  
 Max.   :2017                                         Max.   :99686.0  
      prop          
 Min.   :2.260e-06  
 1st Qu.:3.870e-06  
 Median :7.300e-06  
 Mean   :1.363e-04  
 3rd Qu.:2.288e-05  
 Max.   :8.155e-02  
summary(data_aut_clean)
      year      living_district      sex            name          
 Min.   :1984   Min.   :101     Min.   :1.000   Length:1265460    
 1st Qu.:1996   1st Qu.:319     1st Qu.:1.000   Class :character  
 Median :2007   Median :506     Median :2.000   Mode  :character  
 Mean   :2005   Mean   :563     Mean   :1.535                     
 3rd Qu.:2015   3rd Qu.:804     3rd Qu.:2.000                     
 Max.   :2022   Max.   :923     Max.   :2.000                     
       n        
 Min.   : 1.00  
 1st Qu.: 1.00  
 Median : 1.00  
 Mean   : 2.36  
 3rd Qu.: 2.00  
 Max.   :76.00  

Let’s keep only years we have in both datasets. First, filtering the Austrian dataset:

data_aut_clean <- data_aut %>%
  dplyr::rename(year = 'C-JAHR-0', 
                sex = 'C-GESCHLECHT-0',
                living_district = 'C-WOHNBEZIRK-0',
                name = 'F-VORNAME_NORMALISIERT',
                n = 'F-ANZAHL_LGEB') %>%
  
  #keep only years up to 2017
  filter(year <= 2017)

# Check the new data frame
summary(data_aut_clean)
      year      living_district      sex            name          
 Min.   :1984   Min.   :101     Min.   :1.000   Length:1047654    
 1st Qu.:1994   1st Qu.:319     1st Qu.:1.000   Class :character  
 Median :2004   Median :506     Median :2.000   Mode  :character  
 Mean   :2002   Mean   :560     Mean   :1.541                     
 3rd Qu.:2011   3rd Qu.:804     3rd Qu.:2.000                     
 Max.   :2017   Max.   :923     Max.   :2.000                     
       n         
 Min.   : 1.000  
 1st Qu.: 1.000  
 Median : 1.000  
 Mean   : 2.458  
 3rd Qu.: 2.000  
 Max.   :76.000  
  • Next, filter out all years before 1984 from the USA data set:
data_usa_clean = data_usa %>% 
  #keep only years from 1984 on
  filter(year > 1983)

summary(data_usa_clean)
      year          sex                name                 n          
 Min.   :1984   Length:982833      Length:982833      Min.   :    5.0  
 1st Qu.:1994   Class :character   Class :character   1st Qu.:    7.0  
 Median :2003   Mode  :character   Mode  :character   Median :   11.0  
 Mean   :2002                                         Mean   :  129.1  
 3rd Qu.:2010                                         3rd Qu.:   29.0  
 Max.   :2017                                         Max.   :67736.0  
      prop          
 Min.   :2.260e-06  
 1st Qu.:3.290e-06  
 Median :5.640e-06  
 Mean   :6.431e-05  
 3rd Qu.:1.438e-05  
 Max.   :3.610e-02  

4.3 Changing the format of columns

To run tests comparing both countries, we need to join both data frames into one. There are several steps to prepare our data sets for this.

  • First, columns in both data frames need the same format.
  • The US data had the following formats (see second column):
dplyr::glimpse(data_usa_clean)
Rows: 982,833
Columns: 5
$ year <dbl> 1984, 1984, 1984, 1984, 1984, 1984, 1984, 1984, 1984, 1984, 1984,…
$ sex  <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", …
$ name <chr> "Jennifer", "Jessica", "Ashley", "Amanda", "Sarah", "Stephanie", …
$ n    <int> 50561, 45855, 38770, 33907, 25882, 23022, 22264, 21889, 21491, 20…
$ prop <dbl> 0.02804442, 0.02543416, 0.02150436, 0.01880703, 0.01435584, 0.012…

So, in the Austrian data, we need sex to be a character vector with levels “M” and “F” instead of “1” and “2”.

  • We will use the ifelse() function, within a mutate() call to transform the variable sex.
  • If you check the help for ?ifelse, the Usage section shows you how to use it: ifelse(test, yes, no).
  • The arguments for the ifelse() function start with a condition (rows where the column sex is equal to “1”), then specifies what happens if the condition is met (use the label “M”), and then lists what should happen otherwise/else (use the label “F”).
  • Now, add this as a last line to where you created data_aut_clean:
data_aut_clean <- data_aut %>% 
  dplyr::rename(year = 'C-JAHR-0', 
         sex = 'C-GESCHLECHT-0',
         living_district = 'C-WOHNBEZIRK-0',
         name = 'F-VORNAME_NORMALISIERT',
         n = 'F-ANZAHL_LGEB') %>% 
  #keep only years up to 2017
  filter(year <= 2017) %>% 
  
  # Recode the levels of sex from 1 and 2, to M and F
  mutate(sex = ifelse(sex == "1", "M", "F"))

# check if there is now F instead of 1, and M instead of M: 
glimpse(data_aut_clean)
Rows: 1,047,654
Columns: 5
$ year            <dbl> 1984, 1984, 1984, 1984, 1984, 1984, 1984, 1984, 1984, …
$ living_district <dbl> 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101,…
$ sex             <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M",…
$ name            <chr> "Andreas", "Bernd", "Christian", "Christoph", "Clemens…
$ n               <dbl> 4, 1, 2, 2, 1, 3, 2, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

4.4 Create a new variable

Next, we need to tell statistics from Austria and the US apart, once they are in a joint data frame. Therefore, let’s add a variable that identifies the country.

  • We use rep() to repeat “Austria” n times, and set nrow() to the number of rows of the Austrian data.
data_aut_clean <- data_aut %>% 
  dplyr::rename(year = 'C-JAHR-0', 
         sex = 'C-GESCHLECHT-0',
         living_district = 'C-WOHNBEZIRK-0',
         name = 'F-VORNAME_NORMALISIERT',
         n = 'F-ANZAHL_LGEB') %>% 
  #keep only years up to 2017
  filter(year <= 2017) %>% 
  # Recode the levels of sex from 1 and 2, to M and F
  mutate(sex = ifelse(sex == "1", "M", "F")) %>%
  
  #create a country variable for Austria
  mutate(country = rep("Austria", n=nrow(data_aut)))

Now, let’s repeat this step for the USA data. Just add a line to where you started to clean the US data:

data_usa_clean = data_usa %>%
  #keep only years from 1984 on
  filter(year > 1983) %>% 

  #Create a country variable for the US
  mutate(country = rep("USA", n=nrow(babynames)))

4.5 Grouping data and summarising statistics

The US data did not separate names per living district.

  • To make our Austrian data similar, let’s group names across districts, so that we get the total number of names per year and sex.
  • Using group_by() first group the data by year, sex and name, because we want the total n for each of these combined categories.
    • We also add the column country, because we want to keep it, although it does not create any additional grouping.
  • We then summarise() the data by calculating the sum() of the number of names (n).
    • summarise() is similar to mutate(), but creates one line of output per group. Here, it collapses across districts, reducing the row number for each name.
    • mutate() creates as many lines of output as input.
  • Finally, we ungroup() the data again.
data_aut_clean <- data_aut %>% 
  rename(year = 'C-JAHR-0', 
         sex = 'C-GESCHLECHT-0',
         living_district = 'C-WOHNBEZIRK-0',
         name = 'F-VORNAME_NORMALISIERT',
         n = 'F-ANZAHL_LGEB') %>% 
  #keep only years up to 2017
  filter(year <= 2017) %>% 
  # Recode the levels of sex from 1 and 2, to M and F
  mutate(sex = ifelse(sex == "1", "M", "F")) %>%
  #create a country variable for Austria
  mutate(country = rep("Austria", n=nrow(data_aut))) %>%
  
  #group by year, sex and name and calculate the new sum for each category
  group_by(year, sex, name, country) %>% 
  summarise(n = sum(n)) %>%
  ungroup()

Finally, we also want to calculate the proportion variable that exists in the US data set (propotion of babies getting each name).

  • N, the total number of babies, is obviously different because the US is larger.
  • Using the proportion allows us to meaningfully compare the popularity of names between countries.
  • We first want to calculate the sum of all babies born in each year (n_total), and then divide n of each name by the years’ total.
  • To calculate the sum per year, we again use group_by(), grouping only by year.
    • We again add country, also this does not create a group, to keep country variable.
data_aut_clean <- data_aut %>% 
  rename(year = 'C-JAHR-0', 
         sex = 'C-GESCHLECHT-0',
         living_district = 'C-WOHNBEZIRK-0',
         name = 'F-VORNAME_NORMALISIERT',
         n = 'F-ANZAHL_LGEB') %>% 
  #keep only years up to 2017
  filter(year <= 2017) %>% 
  # Recode the levels of sex from 1 and 2, to M and F
  mutate(sex = ifelse(sex == "1", "M", "F")) %>%
  #create a country variable for Austria
  mutate(country = rep("Austria", n=nrow(data_aut))) %>%
  #group by year, sex and name and calculate the new sum for each category
  group_by(year, sex, name, country) %>% 
  summarise(n = sum(n)) %>%  # see lesson 3 page for documentation
  ungroup()%>%
  
  #calculate total number of birth per year, and then proportion for each name out of this total
  group_by(year, country) %>% 
  mutate(n_total = sum(n), 
         prop = n/n_total) %>% 
  ungroup()

head(data_aut_clean)
year sex name country n n_total prop
1984 F Abigail Austria 1 83856 0.0000119
1984 F Adele Austria 2 83856 0.0000239
1984 F Adelheid Austria 23 83856 0.0002743
1984 F Adina Austria 2 83856 0.0000239
1984 F Adriane Austria 1 83856 0.0000119
1984 F Agata Austria 1 83856 0.0000119

4.6 Filter out one name

Once this is done, you can again filter one name. I recommend taking a name that occurs in both men and women (like “Kim”), so the results of the ANOVA below make more sense.

Add another filter() call to each of the data sets for this. After this, the cleaning and data organizing is finished.

Austrian data set complete data cleaning and organizing:

data_aut_clean <- data_aut %>% 
  rename(year = 'C-JAHR-0', 
         sex = 'C-GESCHLECHT-0',
         living_district = 'C-WOHNBEZIRK-0',
         name = 'F-VORNAME_NORMALISIERT',
         n = 'F-ANZAHL_LGEB') %>% 
  #keep only years up to 2017
  filter(year <= 2017) %>% 
  # Re-code the levels of sex from 1 and 2, to M and F
  mutate(sex = ifelse(sex == "1", "M", "F")) %>%
  #create a country variable for Austria
  mutate(country = rep("Austria", n=nrow(data_aut))) %>%
  #group by year, sex and name and calculate the new sum for each category
  group_by(year, sex, name, country) %>% 
  summarise(n = sum(n)) %>%  # see lesson 3 page for documentation
  ungroup()%>%
  #calculate total number of birth per year, and then proportion for each name out of this total
  group_by(year, country) %>% 
  mutate(n_total = sum(n), 
         prop = n/n_total) %>% 
  ungroup() %>% 
  
  #filter out your name
  filter(name == "Kim")

USA data set complete data cleaning and organizing:

data_usa_clean = data_usa %>%
  #keep only years from 1984 on
  filter(year > 1983) %>% 
  #Create a country variable for the US
  mutate(country = rep("USA", n=nrow(data_usa))) %>% 
      
  #filter out your name
  filter(name == "Kim")

Have a look at both organized data sets:

head(data_usa_clean)
year sex name n prop country
1984 F Kim 387 0.0002147 USA
1984 M Kim 92 0.0000490 USA
1985 F Kim 348 0.0001885 USA
1985 M Kim 92 0.0000478 USA
1986 F Kim 281 0.0001523 USA
1986 M Kim 81 0.0000422 USA
head(data_aut_clean)
year sex name country n n_total prop
1984 F Kim Austria 1 83856 1.19e-05
1984 M Kim Austria 2 83856 2.39e-05
1985 F Kim Austria 3 82379 3.64e-05
1986 F Kim Austria 6 82056 7.31e-05
1987 F Kim Austria 2 81351 2.46e-05
1988 F Kim Austria 7 82545 8.48e-05

5 Combine two data frames

Now we can combine the 2 data frames. We’ll learn about dplyr verbs that work on two data frames. Until now, all verbs (filter, mutate, summarise, rename, recode) manipulated only one data frame.

  • We will use the function full_join: it
    • R looks for column they have in common
    • then looks for rows that have the same matching values
    • combines them into a new data frame, including all the other columns from both data frames
  • full_join() is one of 4 Mutating joins. It keeps all observations from both data frames.
  • Check out inner_join, right_join, and left_join with the Examples provided in the help on ?full_join when you have a moment. It takes a bit of time to understand them.
# join the two data frames, keeping all columns and entries from both data frames
data_combined <- data_usa_clean %>% 
  full_join(data_aut_clean) 
head(data_combined)
year sex name n prop country n_total
1984 F Kim 387 0.0002147 USA NA
1984 M Kim 92 0.0000490 USA NA
1985 F Kim 348 0.0001885 USA NA
1985 M Kim 92 0.0000478 USA NA
1986 F Kim 281 0.0001523 USA NA
1986 M Kim 81 0.0000422 USA NA
  • To see data points from both countries, you could sort values by year using arrange().
  • In addition, lets transform the columns sex and country into categorical variables (factors).
data_combined <- data_usa_clean %>%
  # join the two data frames
  full_join(data_aut_clean) %>% 
  # sort by year
  arrange(year)%>%
  # transform sex into a factor
  mutate(sex = factor(sex)) %>% 
  # transform country into a factor with 2 levels:
  mutate(country = factor(country))
head(data_combined)
year sex name n prop country n_total
1984 F Kim 387 0.0002147 USA NA
1984 M Kim 92 0.0000490 USA NA
1984 F Kim 1 0.0000119 Austria 83856
1984 M Kim 2 0.0000239 Austria 83856
1985 F Kim 348 0.0001885 USA NA
1985 M Kim 92 0.0000478 USA NA
summary(data_combined)
      year      sex        name                 n               prop          
 Min.   :1984   F:68   Length:127         Min.   :  1.00   Min.   :3.210e-06  
 1st Qu.:1993   M:59   Class :character   1st Qu.:  7.00   1st Qu.:1.510e-05  
 Median :2001          Mode  :character   Median : 18.00   Median :3.701e-05  
 Mean   :2001                             Mean   : 50.66   Mean   :7.947e-05  
 3rd Qu.:2009                             3rd Qu.: 70.50   3rd Qu.:1.295e-04  
 Max.   :2017                             Max.   :387.00   Max.   :3.562e-04  
                                                                              
    country      n_total     
 Austria:59   Min.   :64431  
 USA    :68   1st Qu.:67375  
              Median :76058  
              Mean   :74895  
              3rd Qu.:82366  
              Max.   :85333  
              NA's   :68     

You might want to drop the column n_total, which only has values for Austria. You can do this with select():

  • It selects the columns you list.
  • if you add a minus, it deletes the columns you list
data_combined <- data_usa_clean %>%
  # join the two data frames
  full_join(data_aut_clean) %>% 
  # sort by year
  arrange(year)%>%
  # transform sex into a factor
  mutate(sex = factor(sex)) %>% 
  # transform country into a factor with 2 levels:
  mutate(country = factor(country)) %>% 
  
  # delete the column n_total
  select(-n_total) #to delete multiple columns: -c("column_1", "column_2")

data_combined
year sex name n prop country
1984 F Kim 387 0.0002147 USA
1984 M Kim 92 0.0000490 USA
1984 F Kim 1 0.0000119 Austria
1984 M Kim 2 0.0000239 Austria
1985 F Kim 348 0.0001885 USA
1985 M Kim 92 0.0000478 USA
1985 F Kim 3 0.0000364 Austria
1986 F Kim 281 0.0001523 USA
1986 M Kim 81 0.0000422 USA
1986 F Kim 6 0.0000731 Austria
1987 F Kim 280 0.0001494 USA
1987 M Kim 76 0.0000390 USA
1987 F Kim 2 0.0000246 Austria
1988 F Kim 252 0.0001311 USA
1988 M Kim 89 0.0000445 USA
1988 F Kim 7 0.0000848 Austria
1989 F Kim 259 0.0001300 USA
1989 M Kim 72 0.0000344 USA
1989 F Kim 15 0.0001818 Austria
1989 M Kim 6 0.0000727 Austria
1990 F Kim 257 0.0001251 USA
1990 M Kim 66 0.0000307 USA
1990 F Kim 12 0.0001441 Austria
1990 M Kim 1 0.0000120 Austria
1991 F Kim 207 0.0001018 USA
1991 M Kim 60 0.0000283 USA
1991 F Kim 20 0.0002348 Austria
1992 F Kim 182 0.0000908 USA
1992 M Kim 45 0.0000214 USA
1992 F Kim 20 0.0002407 Austria
1992 M Kim 1 0.0000120 Austria
1993 F Kim 185 0.0000938 USA
1993 M Kim 39 0.0000189 USA
1993 F Kim 17 0.0002080 Austria
1994 F Kim 160 0.0000821 USA
1994 M Kim 46 0.0000226 USA
1994 F Kim 20 0.0002526 Austria
1995 F Kim 135 0.0000703 USA
1995 M Kim 31 0.0000154 USA
1995 F Kim 8 0.0001055 Austria
1995 M Kim 2 0.0000264 Austria
1996 F Kim 130 0.0000678 USA
1996 M Kim 47 0.0000235 USA
1996 F Kim 15 0.0001972 Austria
1996 M Kim 2 0.0000263 Austria
1997 F Kim 110 0.0000576 USA
1997 M Kim 20 0.0000100 USA
1997 F Kim 18 0.0002487 Austria
1997 M Kim 1 0.0000138 Austria
1998 F Kim 102 0.0000526 USA
1998 M Kim 24 0.0000118 USA
1998 F Kim 15 0.0002148 Austria
1998 M Kim 1 0.0000143 Austria
1999 F Kim 133 0.0000683 USA
1999 M Kim 29 0.0000142 USA
1999 F Kim 16 0.0002393 Austria
1999 M Kim 2 0.0000299 Austria
2000 F Kim 116 0.0000581 USA
2000 M Kim 18 0.0000086 USA
2000 F Kim 10 0.0001493 Austria
2000 M Kim 2 0.0000299 Austria
2001 F Kim 100 0.0000505 USA
2001 M Kim 26 0.0000126 USA
2001 F Kim 12 0.0001855 Austria
2001 M Kim 1 0.0000155 Austria
2002 F Kim 111 0.0000562 USA
2002 M Kim 14 0.0000068 USA
2002 F Kim 12 0.0001781 Austria
2002 M Kim 2 0.0000297 Austria
2003 F Kim 83 0.0000414 USA
2003 M Kim 17 0.0000081 USA
2003 F Kim 14 0.0002098 Austria
2003 M Kim 4 0.0000600 Austria
2004 F Kim 91 0.0000451 USA
2004 M Kim 18 0.0000085 USA
2004 F Kim 23 0.0003349 Austria
2004 M Kim 1 0.0000146 Austria
2005 F Kim 73 0.0000360 USA
2005 M Kim 13 0.0000061 USA
2005 F Kim 14 0.0002070 Austria
2005 M Kim 1 0.0000148 Austria
2006 F Kim 85 0.0000407 USA
2006 M Kim 10 0.0000046 USA
2006 F Kim 24 0.0003562 Austria
2006 M Kim 2 0.0000297 Austria
2007 F Kim 104 0.0000492 USA
2007 M Kim 10 0.0000045 USA
2007 F Kim 23 0.0003504 Austria
2007 M Kim 2 0.0000305 Austria
2008 F Kim 63 0.0000303 USA
2008 M Kim 7 0.0000032 USA
2008 F Kim 21 0.0003161 Austria
2008 M Kim 5 0.0000753 Austria
2009 F Kim 48 0.0000237 USA
2009 M Kim 17 0.0000080 USA
2009 F Kim 14 0.0002173 Austria
2009 M Kim 1 0.0000155 Austria
2010 F Kim 52 0.0000266 USA
2010 M Kim 9 0.0000044 USA
2010 F Kim 15 0.0001946 Austria
2010 M Kim 1 0.0000130 Austria
2011 F Kim 50 0.0000258 USA
2011 M Kim 7 0.0000034 USA
2011 F Kim 21 0.0002746 Austria
2011 M Kim 1 0.0000131 Austria
2012 F Kim 91 0.0000470 USA
2012 M Kim 9 0.0000044 USA
2012 F Kim 19 0.0002458 Austria
2012 M Kim 2 0.0000259 Austria
2013 F Kim 62 0.0000322 USA
2013 M Kim 8 0.0000040 USA
2013 F Kim 14 0.0001802 Austria
2014 F Kim 69 0.0000354 USA
2014 M Kim 19 0.0000093 USA
2014 F Kim 17 0.0002128 Austria
2015 F Kim 72 0.0000370 USA
2015 M Kim 11 0.0000054 USA
2015 F Kim 11 0.0001336 Austria
2015 M Kim 2 0.0000243 Austria
2016 F Kim 52 0.0000270 USA
2016 M Kim 7 0.0000035 USA
2016 F Kim 11 0.0001290 Austria
2016 M Kim 1 0.0000117 Austria
2017 F Kim 46 0.0000245 USA
2017 M Kim 8 0.0000041 USA
2017 F Kim 4 0.0000469 Austria
2017 M Kim 1 0.0000117 Austria

6 Figures script

In the figures script, add the following figure:

# Boxplot of name propotion by sex and country
ggplot(data_combined, aes(country, prop, fill = sex)) +
  geom_boxplot() +
  scale_fill_manual(values = c("orange", "dodgerblue")) +
  labs(x = "Country", y = "Proportion") +
  theme(text = element_text(size = 20, family = "Times"))+
  theme_bw()+
  ggtitle("Popularity of the name \"Kim\" by country and sex")

Figure 1. Scores by pet type and country.
  • In both countries, there are more women than men with the name Kim (main effect of sex).
  • There is also a main effect of country: in the USA, the name “Kim” is less popular than in Austria.
  • In addition, there is an interaction: The difference in women named “Kim” between the two countries is larger than the difference for men.

7 Statistics script

In the statistics script, rerun the line with source() to read in the new data from the cleaning script.

7.1 T-test by country

Is your name more popular in Austria or the USA? Although this data is not normally distributed, I want to show you how to do a t-test:

# t-test between countries
t.test(prop ~ country, data = data_combined)

    Welch Two Sample t-test

data:  prop by country
t = 5.0499, df = 76.579, p-value = 2.914e-06
alternative hypothesis: true difference in means between group Austria and group USA is not equal to 0
95 percent confidence interval:
 4.558866e-05 1.049556e-04
sample estimates:
mean in group Austria     mean in group USA 
         1.197764e-04          4.450426e-05 
  • You can see that “Kim” is significantly more popular in Austria compared to the US.
# t-test between men and women
t.test(prop ~ sex, data = data_combined)

    Welch Two Sample t-test

data:  prop by sex
t = 9.6265, df = 71.905, p-value = 1.458e-14
alternative hypothesis: true difference in means between group F and group M is not equal to 0
95 percent confidence interval:
 8.733425e-05 1.329523e-04
sample estimates:
mean in group F mean in group M 
   1.306421e-04    2.049884e-05 
  • You can see that “Kim” is much more frequent in women than men.

7.2 Multi-factorial ANOVA: Difference in name popularity by sex and country

If you want to investigate if your name is more popular for men or women, and if this difference varies across countries, you could run an ANOVA (analysis of variance) with two factors (sex and country).

I recommend using the package “ez” for this. Load it at the top of your statistics script.

# Install in the console
install.packages("ez")

# In your script: 
library(ez)

After your other analyses (the linear regression and t-test), you can now add the ANOVA in your statistics script, using the function ezANOVA(). It’s arguments:

  • the data frame

  • dv: the dependent variable (proportion)

  • wid: the columns that identify unique rows in your data frame. Here these are 2 columns: each name exists for each year. The function ezANOVA requires that you enter multiple columns as a list: .("column 1", "column 2").

    • A list in R is a very flexible object type that can store all kinds of objects of different dimensions. The first object could be a string, the second a data frame, the third a vector.
  • between: the predictors. ezANOVA requires that you enter multiple predictors as a list: .(x, y, z)

  • Type 3 defines which sum of squares are calculated for the ANOVA. This is detailed statistics knowledge, and only slightly changes your result. In any case, SPSS by default gives you type 3, so we use it here.

country_sex.aov = ezANOVA(data = data_combined,
        dv = prop, 
        wid = .("year", "name"),
        # predictors
        between = .("country", "sex"),
        type = 3)
country_sex.aov
$ANOVA
       Effect DFn DFd         F            p p<.05       ges
2     country   1 123  42.13132 1.886914e-09     * 0.2551383
3         sex   1 123 127.77207 9.391571e-21     * 0.5095148
4 country:sex   1 123  30.59984 1.817898e-07     * 0.1992179

$`Levene's Test for Homogeneity of Variance`
  DFn DFd          SSn          SSd        F           p p<.05
1   3 123 6.930065e-08 1.709389e-07 16.62188 3.96263e-09     *

7.3 Make a table with statistics and export it to a csv-file

# Create a table with number of babyes with the name Kim per country and sex, with mean and sd: 
summary_table <- data_combined %>%
  group_by(country, sex) %>%
  #summarise: calculate one number/statistic per group created above
  summarise(
    n = n(), 
    # calculate mean and standard deviation
    mean_proportion = mean(prop), 
    sd_proportion = sd(prop), 
    # transform to percent for readibility
    mean_percent = mean_proportion*100, 
    sd_percent = sd_proportion*100
  ) %>% 
  ungroup()
# print
summary_table
country sex n mean_proportion sd_proportion mean_percent sd_percent
Austria F 34 0.0001888 8.94e-05 0.0188840 0.0089409
Austria M 25 0.0000259 1.80e-05 0.0025850 0.0017965
USA F 34 0.0000724 4.96e-05 0.0072444 0.0049594
USA M 34 0.0000166 1.45e-05 0.0016564 0.0014458
# To write results to an excel table, in order to include tables into papers, for example: 
readr::write_csv(summary_table, 'write_up/descriptives_table.csv')

8 Transforming data from long to wide format

In this section, you’ll learn about a few functions that you will regularly need - although we did not need them for the analyses above.

Sometimes you need your data in a wide format, that is, multiple columns for the dependent variable, as here separate columns for Austria and the US and this for each sex.

  • SPSS and other statistics software need data in this format by default.
  • In R, you can change the format of the data frame using the function pivot_wider() from the tidyverse package tidyr.
  • Install the package, and load the library at the start of your statistics script.
  • Arguments of pivot_wider():
    • the data frame
    • id_cols: the column or columns that identify a single unique row in your data frame. This could be you participant ID. Here, it is each name per year.
    • names_from: the variables for which you want to create new columns (or column/variable names)
    • values_from: the value/variable you want to fill these columns with, here proportion (log transformed)
  • Write the following code in your section for organizing data in the statistics script.
## ORGANIZE DATA ####

# create data frame in wide format
data_combined_wide = tidyr::pivot_wider(data_combined, 
                                        id_cols = c("year", "name"), 
                                        names_from = c("country", "sex"), 
                                        values_from = "prop")
head(data_combined_wide)
year name USA_F USA_M Austria_F Austria_M
1984 Kim 0.0002147 4.90e-05 0.0000119 2.39e-05
1985 Kim 0.0001885 4.78e-05 0.0000364 NA
1986 Kim 0.0001523 4.22e-05 0.0000731 NA
1987 Kim 0.0001494 3.90e-05 0.0000246 NA
1988 Kim 0.0001311 4.45e-05 0.0000848 NA
1989 Kim 0.0001300 3.44e-05 0.0001818 7.27e-05
  • This will create an extra column per country and gender (4 columns in total)
  • The function that transforms data frames in the reverse direction, is called pivot_longer(). Try to back transform your wide data frame to a long one as an exercise!

You can see that the wide data frame has empty values (NA = not available) for men in the USA. our pivot_wider call added a row for each year where the name Kim occurred in any country or sex, this is why cells with no data for Kim now have an NA.You could also check how many NAs are in the column with the line of code below.

# are there any missing values? How many?
sum(is.na(data_combined_wide$Austria_M))
[1] 9

We actually know that in this case, NA simply means there were no babies with that name. So let’s replace the NAs with 0 using ifelse(). It’s arguments:

  • The condition for if is that Austria_M is an NA, you can check this with the function is.na() which will give TRUE for missing values (NAs).
  • if this condition is met (TRUE), then replace with a 0
  • else, keep the value of Austria_M.
data_combined_wide = data_combined_wide %>% 
    # replace NAs with zero
  mutate(Austria_M = ifelse(is.na(Austria_M), 0, Austria_M))
data_combined_wide
year name USA_F USA_M Austria_F Austria_M
1984 Kim 0.0002147 4.90e-05 0.0000119 2.39e-05
1985 Kim 0.0001885 4.78e-05 0.0000364 0.00e+00
1986 Kim 0.0001523 4.22e-05 0.0000731 0.00e+00
1987 Kim 0.0001494 3.90e-05 0.0000246 0.00e+00
1988 Kim 0.0001311 4.45e-05 0.0000848 0.00e+00
1989 Kim 0.0001300 3.44e-05 0.0001818 7.27e-05
1990 Kim 0.0001251 3.07e-05 0.0001441 1.20e-05
1991 Kim 0.0001018 2.83e-05 0.0002348 0.00e+00
1992 Kim 0.0000908 2.14e-05 0.0002407 1.20e-05
1993 Kim 0.0000938 1.89e-05 0.0002080 0.00e+00
1994 Kim 0.0000821 2.26e-05 0.0002526 0.00e+00
1995 Kim 0.0000703 1.54e-05 0.0001055 2.64e-05
1996 Kim 0.0000678 2.35e-05 0.0001972 2.63e-05
1997 Kim 0.0000576 1.00e-05 0.0002487 1.38e-05
1998 Kim 0.0000526 1.18e-05 0.0002148 1.43e-05
1999 Kim 0.0000683 1.42e-05 0.0002393 2.99e-05
2000 Kim 0.0000581 8.60e-06 0.0001493 2.99e-05
2001 Kim 0.0000505 1.26e-05 0.0001855 1.55e-05
2002 Kim 0.0000562 6.80e-06 0.0001781 2.97e-05
2003 Kim 0.0000414 8.10e-06 0.0002098 6.00e-05
2004 Kim 0.0000451 8.50e-06 0.0003349 1.46e-05
2005 Kim 0.0000360 6.10e-06 0.0002070 1.48e-05
2006 Kim 0.0000407 4.60e-06 0.0003562 2.97e-05
2007 Kim 0.0000492 4.50e-06 0.0003504 3.05e-05
2008 Kim 0.0000303 3.20e-06 0.0003161 7.53e-05
2009 Kim 0.0000237 8.00e-06 0.0002173 1.55e-05
2010 Kim 0.0000266 4.40e-06 0.0001946 1.30e-05
2011 Kim 0.0000258 3.40e-06 0.0002746 1.31e-05
2012 Kim 0.0000470 4.40e-06 0.0002458 2.59e-05
2013 Kim 0.0000322 4.00e-06 0.0001802 0.00e+00
2014 Kim 0.0000354 9.30e-06 0.0002128 0.00e+00
2015 Kim 0.0000370 5.40e-06 0.0001336 2.43e-05
2016 Kim 0.0000270 3.50e-06 0.0001290 1.17e-05
2017 Kim 0.0000245 4.10e-06 0.0000469 1.17e-05

9 Feedback welcome!

If you find any errors or things that are not explained in enough detail in this lesson, please let me know: metzler[@]csh.ac.at.