Introduction

1. Preparation

Please make sure to complete the following preparatory steps before the training workshop:

  1. If you haven’t already, install R and RStudio. This is a useful tutorial: https://rstudio-education.github.io/hopr/starting.html
  2. Install the following packages in R:
install.packages("dplyr")
install.packages("tidyr")
install.packages("ggplot2")
install.packages("readr")
install.packages("knitr")
install.packages("data.table")
install.packages("Matrix")
# DemoKin is available on [CRAN](https://cran.r-project.org/web/packages/DemoKin/index.html), 
# but we'll use the development version on [GitHub](https://github.com/IvanWilli/DemoKin):
install.packages("remotes")
remotes::install_github("IvanWilli/DemoKin")

Now, check if the packages can be loaded:

library(dplyr)
library(tidyr)
library(ggplot2)
library(readr)
library(knitr)
library(DemoKin)
  1. Download the workshop’s GitHub repository to your computer: https://github.com/alburezg/kinship_workshop_NUS. The easiest way is by clicking on < > Code and then Download ZIP, as shown below:

  1. Unzip the files and keep them somewhere at hand.

2. Agenda (preliminary)

3. Acknowledgements

This workshop includes material from the course Matrix Approaches to Modelling Kinship: Theory and Applications prepared by Ivan Williams and adapted by Amanda Martins for the European Doctoral School of Demography and for the ALAP 2024 Conference. The multistate code was adapted from the corresponding vignette in DemoKin contributed by Joe Butterick and adapted to the case of Singapore by Sha Jiang. Saroja Adhikari provided support with the Wittgenstein educational data.

For a full list of DemoKin contributors, see https://github.com/IvanWilli/DemoKin.

Part 1: Simple models

Kinship is a fundamental property of human populations and a key form of social structure. Demographers have long been interested in the interplay between demographic change and family configuration. This has led to the development of sophisticated methodological and conceptual approaches for the study of kinship, some of which are reviewed in this course.

1. Installation

Load packages that will be useful:

library(dplyr)
library(tidyr)
library(ggplot2)
library(readr)
library(knitr)
library(DemoKin)

Load a couple of functions we prepared:

source("functions.R")

2. Demographic data

We will use data for Singapore from the 2024 Revision of the World Population Prospects, which we pre-processed for this workshop.

#Load data for women
rates_female <- read.csv("data/sgp_female.csv", stringsAsFactors = F) #female rates
pop_female <- read.csv("data/sgp_popfemale.csv", stringsAsFactors = F) #female population (N)
# For data for men
rates_male <- read.csv("data/sgp_male.csv", stringsAsFactors = F) # male rates
pop_male <- read.csv("data/sgp_popmale.csv", stringsAsFactors = F) # male population (N)

Let’s see what data we will be using. The first object is rates_female, a long-format data frame with year- and age-specific fertility rates (‘fx’) and survival probabilities from a life table (‘px’).

For women:

head(rates_female)
##   year age        px fx
## 1 1950   0 0.9149421  0
## 2 1950   1 0.9836904  0
## 3 1950   2 0.9884776  0
## 4 1950   3 0.9918538  0
## 5 1950   4 0.9942263  0
## 6 1950   5 0.9958862  0

Let us visualise some of our demographic data. We start with the probability of dying between ages \(x\) and \(x+1\),\(q_x\) in a life table for women:

rates_female %>%
    filter(year %in% seq(1950, 2020, 30)) %>% 
    mutate(qx = 1-px) %>%
    ggplot() +
    geom_line(aes(x = age, y = qx, col = as.character(year)), linewidth = 1) +
    scale_y_log10() +
    labs(y = "Probability of dying, qx", col = NULL) +
    theme_bw() +
    theme(legend.position = "bottom")

Next, we have age-specific fertility rates for multiple years:

rates_female %>%
  filter(year %in% seq(1950, 2020, 30)) %>% 
  ggplot() + 
  geom_line(aes(x = age, y = fx, col = as.character(year)), linewidth = 1) +
  labs(y = "Age-specific fertility rate, fx", col = NULL) +
  theme_bw() +
  theme(legend.position = "bottom")

Finally, we have the population counts:

pop_female %>%
  mutate(age = 0:100) %>% 
  pivot_longer(-age, names_to = "year", values_to = "pop") %>% 
  mutate(year = gsub("X", "", year)) %>% 
  filter(year %in% seq(1950, 2020, 30)) %>% 
  ggplot() + 
  geom_line(aes(x = age, y = pop, col = as.character(year)), linewidth = 1) +
  labs(y = "Population alive in thousands", col = NULL) +
  theme_bw() +
  theme(legend.position = "bottom")

3. The DemoKin package

DemoKin can be used to compute the number and age distribution of Focal’s relatives under a range of assumptions, including living and deceased kin. Let’s explore the main functions.

3.1. The function kin()

The function DemoKin::kin() currently does most of the heavy lifting in terms of implementing matrix kinship models.

For this example, we will run the simplest model, with the following assumptions:

  1. Rates are time-invariant; i.e., the same set of rates apply at all times. For this example, these will be the 2020 rates.
  2. The population has one one-sex; i.e., we only use female data as input and trace descent through female lines.

First, we need to extract the data for women in 2020 and save it as a matrix:

# First, reshape fertility and survival for a given year

asfr_2020 <- rates_female %>%
  filter(year == 2020) %>% 
  select(age, year, fx) %>%
  pivot_wider(names_from = year, values_from = fx) %>%
  select(-age) %>%
  as.matrix()

surv_2020 <- rates_female %>%
  filter(year == 2020) %>% 
  select(age, year, px) %>%
  pivot_wider(names_from = year, values_from = px) %>%
  select(-age) %>%
  as.matrix()

Now we can run the kinship models:

kin_2020 <- kin(p = surv_2020, f = asfr_2020, time_invariant = TRUE)

3.2. Arguments of kin()

  • p numeric. A vector (atomic) or matrix of survival probabilities with rows as ages (and columns as years in case of matrix).
  • f numeric. Same as U but for fertility rates.
  • time_invariant logical. Assume time-invariant rates. Default TRUE.
  • output_kin character. kin types to return: “m” for mother, “d” for daughter, …

3.3 Relative types

Relatives for the output_kin argument are identified by a unique code. Note that the relationship codes used in DemoKin differ from those in Caswell (2019). The equivalence between the two set of codes is given in the following table:

demokin_codes
##    DemoKin Caswell               Labels_female                   Labels_male
## 1      coa       t    Cousins from older aunts     Cousins from older uncles
## 2      cya       v  Cousins from younger aunts   Cousins from younger uncles
## 3        c    <NA>                     Cousins                       Cousins
## 4        d       a                   Daughters                          Sons
## 5       gd       b             Grand-daughters                    Grand-sons
## 6      ggd       c       Great-grand-daughters              Great-grand-sons
## 7      ggm       h          Great-grandmothers            Great-grandfathers
## 8       gm       g                Grandmothers                  Grandfathers
## 9        m       d                      Mother                        Father
## 10     nos       p   Nieces from older sisters   Nephews from older brothers
## 11     nys       q Nieces from younger sisters Nephews from younger brothers
## 12       n    <NA>                      Nieces                       Nephews
## 13      oa       r     Aunts older than mother     Uncles older than fathers
## 14      ya       s   Aunts younger than mother    Uncles younger than father
## 15       a    <NA>                       Aunts                        Uncles
## 16      os       m               Older sisters                Older brothers
## 17      ys       n             Younger sisters              Younger brothers
## 18       s    <NA>                     Sisters                      Brothers
##                          Labels_2sex
## 1    Cousins from older aunts/uncles
## 2  Cousins from younger aunts/uncles
## 3                            Cousins
## 4                           Children
## 5                    Grand-childrens
## 6              Great-grand-childrens
## 7                Great-grandfparents
## 8                       Grandparents
## 9                            Parents
## 10      Niblings from older siblings
## 11    Niblings from younger siblings
## 12                          Niblings
## 13   Aunts/Uncles older than parents
## 14 Aunts/Uncles younger than parents
## 15                      Aunts/Uncles
## 16                    Older siblings
## 17                  Younger siblings
## 18                          Siblings

3.4. Value

DemoKin::kin() returns a list containing two data frames: kin_full and kin_summary.

str(kin_2020)
## List of 2
##  $ kin_full   : tibble [142,814 × 7] (S3: tbl_df/tbl/data.frame)
##   ..$ kin      : chr [1:142814] "d" "d" "d" "d" ...
##   ..$ age_kin  : int [1:142814] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ age_focal: int [1:142814] 0 1 2 3 4 5 6 7 8 9 ...
##   ..$ living   : num [1:142814] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ dead     : num [1:142814] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ cohort   : logi [1:142814] NA NA NA NA NA NA ...
##   ..$ year     : logi [1:142814] NA NA NA NA NA NA ...
##  $ kin_summary: tibble [1,414 × 9] (S3: tbl_df/tbl/data.frame)
##   ..$ age_focal     : int [1:1414] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ kin           : chr [1:1414] "coa" "cya" "d" "gd" ...
##   ..$ year          : logi [1:1414] NA NA NA NA NA NA ...
##   ..$ count_living  : num [1:1414] 0.0773 0.0243 0 0 0 ...
##   ..$ mean_age      : num [1:1414] 7.84 3.84 NaN NaN NaN ...
##   ..$ sd_age        : num [1:1414] 5.83 3.57 NaN NaN NaN ...
##   ..$ count_dead    : num [1:1414] 0.00001534 0.00000902 0 0 0 ...
##   ..$ count_cum_dead: num [1:1414] 0.00001534 0.00000902 0 0 0 ...
##   ..$ mean_age_lost : num [1:1414] 0 0 NaN NaN NaN 0 0 0 0 NaN ...
kin_full

This data frame contains expected kin counts by year (or cohort), age of Focal, type of kin and, age of kin, including living and dead kin at that age.

head(kin_2020$kin_full)
## # A tibble: 6 × 7
##   kin   age_kin age_focal living  dead cohort year 
##   <chr>   <int>     <int>  <dbl> <dbl> <lgl>  <lgl>
## 1 d           0         0      0     0 NA     NA   
## 2 d           0         1      0     0 NA     NA   
## 3 d           0         2      0     0 NA     NA   
## 4 d           0         3      0     0 NA     NA   
## 5 d           0         4      0     0 NA     NA   
## 6 d           0         5      0     0 NA     NA
kin_summary

This is a ‘summary’ data frame derived from kin_full. To produce it, we sum over all ages of kin to produce a data frame of expected kin counts by year or cohort and age of Focal (but not by age of kin).

head(kin_2020$kin_summary)
## # A tibble: 6 × 9
##   age_focal kin   year  count_living mean_age sd_age count_dead count_cum_dead
##       <int> <chr> <lgl>        <dbl>    <dbl>  <dbl>      <dbl>          <dbl>
## 1         0 coa   NA          0.0773     7.84   5.83 0.0000153      0.0000153 
## 2         0 cya   NA          0.0243     3.84   3.57 0.00000902     0.00000902
## 3         0 d     NA          0        NaN    NaN    0              0         
## 4         0 gd    NA          0        NaN    NaN    0              0         
## 5         0 ggd   NA          0        NaN    NaN    0              0         
## 6         0 ggm   NA          0.226     88.5    6.07 0.0257         0.0257    
## # ℹ 1 more variable: mean_age_lost <dbl>

3.2. Kinship diagrams

We can visualize the kinship structure we just computed using a network or ‘Keyfitz’ kinship diagram (Keyfitz and Caswell 2005) with the plot_diagram function. Let’s see the expected number of living kin for a 65 yo woman in Singapore according to our model:

kin_2020$kin_summary %>% 
  filter(age_focal == 65) %>% 
  select(kin, count = count_living) %>% 
  plot_diagram(rounding = 2)

3.3. Number of living kin

Let’s run the model again with the same parameters, selecting specific kin types (note the output_kin argument:

kin_2020 <- 
  kin(
    p = surv_2020
    , f = asfr_2020
    , output_kin = c("c", "d", "gd", "ggd", "gm", "m", "n", "a", "s")
    , time_invariant = TRUE
    )

Now, let’s visualize how the expected number of daughters, siblings, cousins, etc., changes over the life course of Focal (now, with full names to identify each relative type using the function DemoKin::rename_kin()).

kin_2020$kin_summary %>%
  rename_kin() %>%
  ggplot() +
  geom_line(aes(age_focal, count_living))  +
  theme_bw() +
  labs(x = "Age of focal", y= "Number of living female relatives") +
  facet_wrap(~kin_label)

Note that we are working in a time invariant framework. You can think of the results as analogous to life expectancy (i.e., expected years of life for a synthetic cohort experiencing a given set of period mortality rates).

How does overall family size (and family composition) vary over life for an average woman who survives to each age?

counts <- 
  kin_2020$kin_summary %>%
  group_by(age_focal) %>% 
  summarise(count_living = sum(count_living)) %>% 
  ungroup()

kin_2020$kin_summary %>%
  select(age_focal, kin, count_living) %>% 
  rename_kin() %>% 
  ggplot(aes(x = age_focal, y = count_living)) +
  geom_area(aes(fill = kin_label), colour = "black") +
  geom_line(data = counts, size = 2) +
  labs(x = "Age of focal",
       y = "Number of living female relatives",
       fill = "Kin") +
  theme_bw() +
  theme(legend.position = "bottom")

3.4 Age distribution of living kin

How old are Focal’s relatives? Using the kin_full data frame, we can visualize the age distribution of Focal’s relatives throughout Focal’s life. For example when Focal is 65 (the red vertical line in the plot), what are the ages of her relatives:

kin_2020$kin_full %>%
  rename_kin() %>%
  filter(age_focal == 65) %>%
  ggplot(aes(age_kin, living)) +
  geom_line() +
  geom_vline(xintercept = 65, colour = "red") +
  labs(x = "Age of kin",
       y = "Expected number of living relatives") +
  theme_bw() +
  facet_wrap(~kin_label)

Part 2: 2-sex time-variant models

Human males generally live shorter and reproduce later than females. These sex-specific processes affect kinship dynamics in a number of ways. For example, the degree to which an average member of the population, call her Focal, has a living grandparent is affected by differential mortality affecting the parental generation at older ages. We may also be interested in considering how kinship structures vary by Focal’s sex: a male Focal may have a different number of grandchildren than a female Focal given differences in fertility by sex. Documenting these differences matters since women often face greater expectations to provide support and informal care to relatives. As they live longer, they may find themselves at greater risk of being having no living kin. The function kin2sex implements two-sex kinship models as introduced by Caswell (2022).

1. Setup

Load packages:

# Packages
library(dplyr)
library(tidyr)
library(ggplot2)
library(DemoKin)

Load male and female data (restricting it to data in the 1950-2023 period):

#Load data for women
rates_female <- read.csv("data/sgp_female.csv", stringsAsFactors = F) #female rates
pop_female <- read.csv("data/sgp_popfemale.csv", stringsAsFactors = F) #female population (N)
# For data for men
rates_male <- read.csv("data/sgp_male.csv", stringsAsFactors = F) # male rates
pop_male <- read.csv("data/sgp_popmale.csv", stringsAsFactors = F) # male population (N)

Since we’re not running projections (yet), we’ll keep only data in the 1950-2023 period:

rates_female <- rates_female %>% filter(year <= 2023)
pop_female <- pop_female[ , 1:(2024-1950)]
rates_male <- rates_male %>% filter(year <= 2023)
pop_male <- pop_male[ , 1:(2024-1950)]

Note that for men there is no fx (fertility) column because the WPP does not provide it:

head(rates_male)
##   year age        px
## 1 1950   0 0.8989109
## 2 1950   1 0.9800395
## 3 1950   2 0.9872590
## 4 1950   3 0.9917858
## 5 1950   4 0.9945839
## 6 1950   5 0.9962990

2. Run kinship model

2.1. Convert data to matrices

Before running the model, we need to reshape the data to have matrices as input.

# Female fertility
asfr_females <- 
  rates_female %>%
  select(age, year, fx) %>%
  pivot_wider(names_from = year, values_from = fx) %>%
  select(-age) %>%
  as.matrix()

# Female survival
surv_females <- 
  rates_female %>%
  select(age, year, px) %>%
  pivot_wider(names_from = year, values_from = px) %>%
  select(-age) %>%
  as.matrix()

# Male survival
surv_males <- 
  rates_male %>%
  select(age, year, px) %>%
  pivot_wider(names_from = year, values_from = px) %>%
  select(-age) %>%
  as.matrix()

The data now has years in columns and ages in rows. For example, male survival probabilities:

surv_males[1:10, 1:6]
##            1950      1951      1952      1953      1954      1955
##  [1,] 0.8989109 0.9081987 0.9168968 0.9248786 0.9321124 0.9383029
##  [2,] 0.9800395 0.9836895 0.9867753 0.9891777 0.9908956 0.9919618
##  [3,] 0.9872590 0.9892537 0.9910078 0.9924139 0.9935382 0.9944831
##  [4,] 0.9917858 0.9928559 0.9938370 0.9946455 0.9953902 0.9961763
##  [5,] 0.9945839 0.9951549 0.9957015 0.9961628 0.9966740 0.9972937
##  [6,] 0.9962990 0.9966080 0.9969170 0.9971827 0.9975558 0.9980201
##  [7,] 0.9973456 0.9975199 0.9977021 0.9978603 0.9981553 0.9984851
##  [8,] 0.9979803 0.9980872 0.9982033 0.9983044 0.9985583 0.9987756
##  [9,] 0.9983634 0.9984369 0.9985192 0.9985906 0.9988243 0.9989514
## [10,] 0.9985861 0.9986440 0.9987099 0.9987666 0.9989925 0.9990482

2.2. Run model

Be patient: this can take around 5-10 minutes to run…

kin_out <- 
  kin2sex(
    # Survival probabilities of women
    pf = surv_females
    # Survival probabilities of men
    , pm = surv_males
    # Fertility rates for women
    , ff = asfr_females
    # Fertility rates for men (in this example, we use female rates)
    , fm = asfr_females
    # Female population counts
    , nf = pop_female
    # Male population counts
    , nm = pop_male
    # If FALSE, rates vary over time
    , time_invariant = FALSE
    # Which kin you want estimates for (see demokin_codes)
    , output_kin = c("c", "d", "gd", "ggd", "ggm", "gm", "m", "n", "a", "s")
    # The sex of the Focal or anchor invididual
    , sex_focal = "f"
    ) 

2.3. Model assumptions

Our model assumes that:

  1. Rates vary over time (time-variant)
  2. There are two sexes in the population (two-sex)
  3. *However, since we don’t have male fertility data from the UNWPP, we assume that male fertility is equivalent to female fertility (androgynous assumption)
  4. Demographic rates are stable (time-invariant) before 1950 (the earliest observed data point)

3. Kinship structures

The value returned by kin2sex is equivalent to the one returned by kin, with a couple new columns:

  • sex_kin that distinguishes between male and female kin
  • year: period (calendar) year
  • cohort: birth cohort of Focal
head(kin_out$kin_summary)
## # A tibble: 6 × 11
##   age_focal kin   sex_kin  year cohort count_living mean_age sd_age count_dead
##       <int> <chr> <chr>   <int>  <int>        <dbl>    <dbl>  <dbl>      <dbl>
## 1         0 a     f        1950   1950         5.03     24.8   10.7     0.0169
## 2         0 a     f        1951   1951         5.02     25.7   10.8     0.0169
## 3         0 a     f        1952   1952         5.02     25.6   10.8     0.0166
## 4         0 a     f        1953   1953         5.02     25.5   10.8     0.0158
## 5         0 a     f        1954   1954         5.02     25.5   10.8     0.0151
## 6         0 a     f        1955   1955         5.03     25.4   10.8     0.0139
## # ℹ 2 more variables: count_cum_dead <dbl>, mean_age_lost <dbl>

A note on terminology: The kin2sex function uses the same codes as kin to identify relatives (see demokin_codes() for reference).
Please note that when running a two-sex model, the code “m” refers to either mothers or fathers!
Use the sex_kin column to filter by the sex of a specific relative.
For example, to consider only sons and ignore daughters, use:

kin_out$kin_summary %>%
  filter(kin == "d", sex_kin == "m") %>%
  head()
## # A tibble: 6 × 11
##   age_focal kin   sex_kin  year cohort count_living mean_age sd_age count_dead
##       <int> <chr> <chr>   <int>  <int>        <dbl>    <dbl>  <dbl>      <dbl>
## 1         0 d     m        1950   1950            0      NaN    NaN          0
## 2         0 d     m        1951   1951            0      NaN    NaN          0
## 3         0 d     m        1952   1952            0      NaN    NaN          0
## 4         0 d     m        1953   1953            0      NaN    NaN          0
## 5         0 d     m        1954   1954            0      NaN    NaN          0
## 6         0 d     m        1955   1955            0      NaN    NaN          0
## # ℹ 2 more variables: count_cum_dead <dbl>, mean_age_lost <dbl>

The two sex model provides kinship estimates by age, period, and birth cohort (of Focal) (Caswell 2019):

3.1. Cohort approach

Let’s take a look at the resulting kin counts for a Focal born in 1960. First, we plot total family size over the life course of Focal for all kin and irrespective of the sex of kin:

kin_out$kin_summary %>%
  filter(cohort == 1960) %>% 
  rename_kin(sex = "2sex") %>% 
  summarise(count=sum(count_living), .by = c(kin_label, age_focal)) %>%
  ggplot(aes(age_focal, count, fill=kin_label)) +
  geom_area(colour = "black") +
  labs(y = "Expected number of living kin by Focal's age",
       x = "Age of a Focal born 1960") +
  theme_bw() +
    theme(legend.position = "bottom")

Now, we see the distribution of kin by sex of kin for a subset of kin:

kin_out$kin_summary %>%
  filter(cohort == 1960) %>% 
  filter(kin%in% c("c", "d", "gd", "ggd", "gm", "m", "n", "a", "s")) %>% 
  rename_kin(sex = "2sex") %>% 
  summarise(count=sum(count_living), .by = c(kin_label, age_focal, sex_kin)) %>%
  ggplot(aes(age_focal, count, fill=sex_kin)) +
  geom_area() +
  labs(y = "Expected number of living kin by sex and Focal's age",
       x = "Age of a Focal born 1960",
       fill = "Sex of Kin") +
  facet_wrap(~kin_label, scales = "free_y") +
  theme_bw() +
  theme(legend.position = "bottom")

We now focus on the age distribution of kin. Note that we use the object kin_full for this:

kin_out$kin_full %>%
  filter(age_focal == 60, cohort == 1960) %>% 
  filter(kin%in% c("d", "gd", "m", "s")) %>% 
  rename_kin(sex = "2sex") %>% 
  ggplot(aes(x = age_kin, y = living, colour = sex_kin)) +
  geom_line() +
  theme_bw() +
  labs(y = "Expected number of living kin for a 65 yo Focal born in 1960",
       x = "Age of kin",
       colour = "Sex of Kin") +
  facet_wrap(.~kin_label)

3.2. Period Approach

Let us take a snapshot of the kin distribution in 2023. We can plot total family size over the life course of Focal for all kin and irrespective of the sex of kin:

kin_out$kin_summary %>%
  filter(year == 2023) %>% 
  rename_kin(sex = "2sex") %>% 
  summarise(count=sum(count_living), .by = c(kin_label, age_focal)) %>%
  ggplot(aes(age_focal, count, fill=kin_label)) +
  geom_area(colour = "black") +
  labs(y = "Expected number of living kin by sex and Focal's age",
       x = "Age of Focal") +
  theme_bw() +
    theme(legend.position = "bottom")

Next, let’s plot the number of kin over the life of Focal including the sex of kin for a selected number of kin:

kin_out$kin_summary %>%
  filter(year == 2023) %>% 
  filter(kin%in% c("c", "d", "gd", "ggd", "gm", "m", "n", "a", "s")) %>% 
  rename_kin(sex = "2sex") %>% 
  summarise(count=sum(count_living), .by = c(kin_label, age_focal, sex_kin)) %>%
  ggplot(aes(age_focal, count, fill=sex_kin)) +
  geom_area() +
  labs(y = "Expected number of living kin by sex and Focal's age",
       x = "Age of Focal",
       fill = "Sex of Kin") +
  facet_wrap(~kin_label, scales = "free_y") +
  theme_bw() +
  theme(legend.position = "bottom")

Finally, we see the age distribution of kin for a 65 yo woman in calendar year 2023. Note that we use the object kin_full for this:

kin_out$kin_full %>%
  filter(age_focal == 65, year == 2023) %>% 
  filter(kin%in% c("d", "gd", "m", "s")) %>% 
  rename_kin(sex = "2sex") %>% 
  ggplot(aes(x = age_kin, y = living, colour = sex_kin)) +
  geom_line() +
  theme_bw() +
  labs(y = "Expected number of living kin for a 65 yo Focal born in 1960",
       x = "Age of kin",
       colour = "Sex of Kin") +
  facet_wrap(.~kin_label)

Part 3: Projections of kinship by education

1. Setup

Load packages:

# Packages
library(dplyr)
library(tidyr)
library(ggplot2)
library(Matrix)
library(DemoKin)

2 Multistate models in DemoKin

Since Caswell’s (Caswell 2019) one-sex time-invariant age-structured matrix model of kinship, the framework has expanded to incorporate time-varying vital rates (2021), two-sex of kin and focal (2022) and multi-state kin populations (2020).
Here, we provide an R functionkin_multi_stage_time_variant_2sex, which combines the three aforementioned models and provide a two-sex time-varying multi-state model.

This function computes stage-specific kinship networks across both sexes for an average member of a population (focal) under time-varying demographic rates. It estimates the number, age, sex, and stage distribution of Focal’s relatives, for each age of Focal’s life, and as a function of the year in which Focal is born.

3. Data

In this example we use education as an example stage. Singaporean data ranging from 2020 - 2090 is sourced from the Wittgenstein Center. The data is aggregated into 5-year age group and 5-year time interval.

In order to implement the model, the function kin_multi_stage_time_variant_2sex expects the following 7 inputs of vital rates, fed in as lists:

  1. U_list_females A list of female age-and-education specific survival probabilities over the timescale (in matrix forms). This input list has length = the timescale, and each entry represents the rates of a specific period in matrix form: stage columns, age rows.

  2. U_list_males A list in the same format as U_list_females, but for males.

  3. F_list_females A list of female age-and-education specific fertility rates over the timescale (in matrix forms). This input list has length = the timescale, and each entry represents the rates of a specific period in matrix form: stage columns, age rows.

  4. F_list_males A list in the same format as F_list_females, but for males.

  5. T_list_females A list of lists of female age-specific probabilities of moving up education over the timescale (in matrix forms). The outer list has length = the timescale. The inner list has length = number of ages. Each outer list entry is comprised of a list of matrices (stage*stage dimensional), each matrix describes age-specific probabilities of moving stage. Thus for each year, we have a list of age-specific probabilities of moving from one stage to the next.

  6. T_list_males A list in the same format as T_list_females, but for males.

  7. H_list A list of length = timescale, in which each element is a matrix which assigns the offspring of individuals in some stage to the appropriate age class (age in rows and states in columns). This list of matrices applies to both male and female offspring.

We pre-processed the data for this workshop:

load("data/singapore_edu.rdata")

4. Run multistate kinship model

We feed the above inputs into the matrix model, along with other arguments:

this run takes some time (ca. 5 min)

# Run kinship model for a female Focal over a timescale of no_years
time_range <- seq(2020,2090,5)
no_years <- length(time_range)-1
# For 5-y age and year intervals
output_year_select <- seq(1, no_years+1, 1)

# since we use 5-year time frame, we are projecting from 2020 to 2020+5*14 = 2090
# We decide here to count accumulated kin by age of Focal, and not distributions of kin
kin_out_2020_2090 <-
  kin_multi_stage_time_variant_2sex(U_list_females = U_mat_fem_edu[1:(1+no_years)],
                                    U_list_males = U_mat_male_edu[1:(1+no_years)],
                                    F_list_females = F_mat_fem_edu[1:(1+no_years)],
                                    F_list_males = F_mat_male_edu[1:(1+no_years)],
                                    T_list_females = T_mat_fem_edu[1:(1+no_years)],
                                    T_list_males = T_mat_fem_edu[1:(1+no_years)],
                                    H_list = H_mat_edu[1:(1+no_years)],
                                    birth_female = 1/(1.06+1), ## Sex ratio at birth -- Singapore value (1.06)
                                    parity = FALSE,
                                    summary_kin = TRUE,
                                    sex_Focal = "Female", ##  define Focal's sex at birth
                                    initial_stage_Focal = 1, ## Define Focal's stage at birth
                                    output_years = output_year_select # it seems tricky to set up output years when the time interval is not 1
                                    # now I select every 10 year for output
)
## 81.69 sec elapsed

Now we have to recode year and age related variables to show the correct years and educational labels:

# also need to change the age_focal and year variables accordingly
kin_out_2020_2090$kin_summary <- 
  kin_out_2020_2090$kin_summary %>%
  mutate(year = (year-1)*5+min(time_range),
         age_focal = age_focal*5,
         cohort = year - age_focal,
         stage_kin = factor(stage_kin, levels = c(1, 2, 3, 4, 5, 6),
                       labels = c(
                         "no education",
                         "incomplete primary",
                         "primary",
                         "lower secondary",
                         "upper secondary",
                         "post-secondary"
                       )))

kin_out_2020_2090$kin_full <- kin_out_2020_2090$kin_full%>%
  mutate(year = (year-1)*5+min(time_range),
         age_focal = age_focal*5,
         age_kin = age_kin*5,
         cohort = year - age_focal,
         stage_kin = factor(stage_kin, levels = c(1, 2, 3, 4, 5, 6),
                       labels = c(
                         "no education",
                         "incomplete primary",
                         "primary",
                         "lower secondary",
                         "upper secondary",
                         "post-secondary"
                       )))

5. Number of living kin by education of kin

Notice the structure of the output data kin_summary. We have columns age_focal and kin_stage because we sum over all ages of kin, and produce the marginal stage distribution given age of Focal. We have a column corresponding to sex of kin sex_kin, a column showing which year we are considering, and a column headed group which selects the kin type. Finally, we have columns showing Focal’s cohort of birth cohort (e.g., year - age of Focal).

Let’s take a look at kin_summary by selecting 10 rows, from row 1000 to row 1010.

kin_out_2020_2090$kin_summary[1000:1010, ]
## # A tibble: 11 × 8
##    age_focal stage_kin            count sex_kin  year group cohort cohort_factor
##        <dbl> <fct>                <dbl> <chr>   <dbl> <chr>  <dbl> <fct>        
##  1       100 incomplete primary 1.72e-1 Male     2020 ggd     1920 -19          
##  2       100 primary            1.05e-1 Female   2020 ggd     1920 -19          
##  3       100 primary            1.11e-1 Male     2020 ggd     1920 -19          
##  4       100 lower secondary    3.71e-2 Female   2020 ggd     1920 -19          
##  5       100 lower secondary    3.93e-2 Male     2020 ggd     1920 -19          
##  6       100 upper secondary    1.14e-2 Female   2020 ggd     1920 -19          
##  7       100 upper secondary    1.20e-2 Male     2020 ggd     1920 -19          
##  8       100 post-secondary     1.14e-4 Female   2020 ggd     1920 -19          
##  9       100 post-secondary     1.21e-4 Male     2020 ggd     1920 -19          
## 10         0 no education       0       Female   2020 m       2020 1            
## 11         0 no education       0       Male     2020 m       2020 1

5.1. Cohort approach

We start by showing the number of living kin for an average woman born in 2020 in Singapore by level of educational attainment of her kin:

kin_out_2020_2090$kin_summary %>%
  # Multistate models also include an (educational) state for the Focal individual
  # which is not of interest here
  filter(group != "Focal") %>% 
  filter(cohort == 2020) %>%
  rename(kin = group) %>% 
  rename_kin(sex = "2sex") %>% 
  summarise(count=sum(count), .by = c(stage_kin, age_focal)) %>%
  ggplot(aes(x = age_focal, y = count, fill=stage_kin)) +
  geom_area(colour = "black") +
  labs(y = "Expected number of living kin",
       x = "Age of a woman born in 2020") +
  theme_bw() +
    theme(legend.position = "bottom")

Show the same values, but separately for selected kin:

kin_out_2020_2090$kin_summary %>%
  filter(group %in% c("d", "gd", "m", "gm")) %>% 
  filter(cohort == 2020) %>%
  rename(kin = group) %>% 
  rename_kin(sex = "2sex") %>% 
  summarise(count=sum(count), .by = c(kin_label, stage_kin, age_focal)) %>%
  ggplot(aes(x = age_focal, y = count, fill=stage_kin)) +
  geom_area(colour = "black") +
  labs(y = "Expected number of living kin",
       x = "Age of a woman born in 2020") +
  facet_wrap(. ~ kin_label) +
  theme_bw() +
  theme(legend.position = "bottom")

Next, we see the age distribution of kin (by education of kin) for a 60 yo woman born in the year 2020:

kin_out_2020_2090$kin_full %>%
  dplyr::filter(
    group %in%  c("d", "gd", "m", "gm")
    ,  age_focal == 60
    , cohort == 2020
    ) %>%
  rename(kin = group) %>% 
  rename_kin("2sex") %>% 
  ggplot2::ggplot(ggplot2::aes(x = age_kin, y = count, color = stage_kin, fill = stage_kin)) +
  ggplot2::geom_bar(position = "stack", stat = "identity") +
  ggplot2::scale_x_continuous(breaks = c(0,10,20,30,40,50,60,70,80,90,100)) +
  labs(x  ="Age of kin for a 65 yo woman born in 2020") +
  ggplot2::facet_wrap(. ~ kin_label) +
  ggplot2::theme_bw() +
  ggplot2::theme(
    axis.text.x = ggplot2::element_text(angle = 90, vjust = 0.5)
    , legend.position = "bottom"
    )

5.2. Period approach

Let’s see the expected number of living children and grandchildren for a 60 yo woman over time by the level of educational attainment of the children and grandchildren:

kin_out_2020_2090$kin_summary %>%
  filter(
    group %in% c("d", "gd")
    , age_focal == 60
    ) %>% 
  rename(kin = group) %>% 
  rename_kin(sex = "2sex") %>% 
  summarise(count=sum(count), .by = c(stage_kin, kin_label, year)) %>%
  ggplot(aes(x = year, y = count, fill=stage_kin)) +
  geom_area(colour = "black") +
  labs(y = "Expected number of descendants a for woman aged 60",
       x = "Year") +
  facet_grid( . ~ kin_label) +
  theme_bw() +
    theme(legend.position = "bottom")

And, finally, the level of education of the parents and grandparents of a newborn woman over time:

kin_out_2020_2090$kin_summary %>%
  filter(
    group %in% c("m", "gm")
    , age_focal == 0
    ) %>% 
  rename(kin = group) %>% 
  rename_kin(sex = "2sex") %>% 
  summarise(count=sum(count), .by = c(stage_kin, kin_label, year)) %>%
  ggplot(aes(x = year, y = count, fill=stage_kin)) +
  geom_area(colour = "black") +
  labs(y = "Expected number of descendants a for newborn woman (aged 0)",
       x = "Year") +
  facet_grid( . ~ kin_label) +
  theme_bw() +
    theme(legend.position = "bottom")

We note that there are following simplifying assumptions due to data limitations:

  1. Fertility rates vary over time and by education level but are identical for both sexes (the androgynous approximation).
  2. Age-specific education transition probabilities vary over time but not by sex (also an androgynous approximation).
  3. Since no transition data exist beyond age 10, we assume that at ages 0-4, individuals transition from no education to incomplete primary in the next 5-year interval, and at ages 5-9, they transition from incomplete primary to primary (based on Singapore’s Compulsory Education Act).
  4. Demographic rates remain stable (time-invariant) before 2020, the earliest observed data point.

Part 4: Other countries

Here, we show how to download fertility, mortality, and population data for other countries and years.

First, we will load the download_wpp24() function, which automates the process of downloading these datasets. Note that this process may take a couple of minutes.

These packages will be used:

# Load necessary libraries
library(dplyr)
library(tidyr)
library(data.table)

Load functions to download and process:

# Load function to filter data
source("functions.R")

Run the function to download data. In this case, we’re downloading data for Guatemala, but you can chose any country you like. To select the country of interest, refer to the country names available here.

NOTE: This function will take a really long time to run since it downloads a bunch of large files, so make sure to have good and stable internet connection. The good thing is you only need to download the files once!

# Download the data
# Use include_projections = T to include data after 2023
download_wpp24(include_projections = T)

Once the data is downloaded, we can filter the information of interest—for example, by country. To do this, we will load two functions:

  • UNWPP_data: This function filters mortality data (px, qx, mx) and fertility data (fx) by the country of interest and the desired time period.

  • UNWPP_pop: This function filters population size (N) data.

Now, we read the data and reformat the data from the csv files we downloaded in the earlier step.

## Example output for DemoKin using female rates
# Select country, year and sex to obtain px and fx

fert_female <-
  UNWPP_data(country = "Guatemala", 
             start_year = 1950, 
             end_year = 2100,
             sex = c("Female"),
             indicators = c("fx"), #indicators of interest
             output_format = "csv", #format you want to save (csv or rdata)
             output_file = "Guatemala") 

mort_both <-
  UNWPP_data(country = "Guatemala", 
             start_year = 1950, 
             end_year = 2100,
             sex = c("Male", "Female"),
             indicators = c("px"), #indicators of interest
             output_format = "csv", #format you want to save (csv or rdata)
             output_file = "Guatemala") 

pop_female <-
  UNWPP_pop(country = "Guatemala", 
            start_year = 1950, 
            end_year = 2100, 
            sex = "Female")

pop_male <-
  UNWPP_pop(country = "Guatemala", 
            start_year = 1950, 
            end_year = 2100, 
            sex = "Male")

# Join rate objects by sex

rates_female <- 
  left_join(fert_female, mort_both) %>% 
  filter(Sex == "Female")

rates_male <- 
  mort_both %>% 
  filter(Sex == "Male")

Finally, you can save this data to reuse it in your own project:

write.csv(rates_female, "data/gtm_female.csv", row.names = F)
write.csv(rates_male, "data/gtm_male.csv", row.names = F)
write.csv(pop_female, "data/gtm_popfemale.csv", row.names = F)
write.csv(pop_male, "data/gtm_popmale.csv", row.names = F)

Now we have all the input data needed to run the 2-sex time-variant model (see “1. Setup” in Part 2). We can simply repeat the analysis in Part 2 (starting from “2. Run kinship model”) and get the kin estimates for Guatemala.

References

Caswell, Hal. 2019. “The Formal Demography of Kinship: A Matrix Formulation.” Demographic Research 41 (September): 679–712. https://doi.org/10.4054/DemRes.2019.41.24.
———. 2020. “The Formal Demography of Kinship II: Multistate Models, Parity, and Sibship.” Demographic Research 42 (June): 1097–1146. https://doi.org/10.4054/DemRes.2020.42.38.
———. 2022. “The Formal Demography of Kinship IV: Two-Sex Models and Their Approximations.” Demographic Research 47 (September): 359–96. https://doi.org/10.4054/DemRes.2022.47.13.
Caswell, Hal, and Xi Song. 2021. “The Formal Demography of Kinship. III. Kinship Dynamics with Time-Varying Demographic Rates.” Demographic Research 45: 517–46.
Keyfitz, Nathan, and Hal Caswell. 2005. Applied Mathematical Demography. New York: Springer.