An introduction to the analysis of population-level kinship structures

Introduction

1. Preparation

Please make sure to complete the following preparatory steps before the training workshop:

If you haven’t already, install R and RStudio. This is a useful tutorial: https://rstudio-education.github.io/hopr/starting.html
Install the following packages in R:

install.packages("dplyr")
install.packages("tidyr")
install.packages("ggplot2")
# install.packages("knitr")
install.packages("data.table")
install.packages("Matrix")
# DemoKin is available on [CRAN](https://cran.r-project.org/web/packages/DemoKin/index.html), 
# but we'll use the development version on [GitHub](https://github.com/IvanWilli/DemoKin):
# install.packages("remotes")
remotes::install_github("IvanWilli/DemoKin")

Now, check if the packages can be loaded:

library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
library(DemoKin)

Download the workshop’s GitHub repository to your computer: https://github.com/alburezg/kinship_workshop_iips. The easiest way is by clicking on < > Code and then Download ZIP, as shown below:

Unzip the files and keep them somewhere at hand.

2. Agenda

3. Acknowledgements

This workshop includes material from the course Matrix Approaches to Modelling Kinship: Theory and Applications prepared by Ivan Williams and adapted by Amanda Martins for the European Doctoral School of Demography and for the ALAP 2024 Conference. The multistate code was adapted from the corresponding vignette in DemoKin contributed by Joe Butterick and adapted to the case Singapore by Sha Jiang. and for India by Saroja Adhikari using the Wittgenstein educational data.

For a full list of DemoKin contributors, see https://ivanwilli.github.io/DemoKin/.

Day 1: One-sex time-invariant models

Kinship is a fundamental property of human populations and a key form of social structure. Demographers have long been interested in the interplay between demographic change and family configuration. This has led to the development of sophisticated methodological and conceptual approaches for the study of kinship, some of which are reviewed in this course.

1. Installation

Load packages that will be useful:

library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
library(DemoKin)

Load a couple of functions we prepared:

source("functions.R")

2. Demographic data

We will use data for India from the 2024 Revision of the World Population Prospects, which we pre-processed for this workshop.

#Load data for women
rates_female <- read.csv("data/ind_female.csv", stringsAsFactors = F) #female rates
pop_female <- read.csv("data/ind_popfemale.csv", stringsAsFactors = F) #female population (N)
# For data for men
rates_male <- read.csv("data/ind_male.csv", stringsAsFactors = F) # male rates
pop_male <- read.csv("data/ind_popmale.csv", stringsAsFactors = F) # male population (N)

Let’s see what data we will be using. The first object is rates_female, a long-format data frame with year- and age-specific fertility rates (‘fx’) and survival probabilities from a life table (‘px’).

For women:

head(rates_female)

##      Sex year age        px fx
## 1 Female 1950   0 0.8263349  0
## 2 Female 1950   1 0.9161987  0
## 3 Female 1950   2 0.9573440  0
## 4 Female 1950   3 0.9763298  0
## 5 Female 1950   4 0.9856816  0
## 6 Female 1950   5 0.9912545  0

Let us visualise some of our demographic data. We start with the probability of dying between ages \(x\) and \(x+1\),\(q_x\) in a life table for women:

rates_female %>%
    filter(year %in% seq(1950, 2020, 30)) %>% 
    mutate(qx = 1-px) %>%
    ggplot() +
    geom_line(aes(x = age, y = qx, col = as.character(year)), linewidth = 1) +
    scale_y_log10() +
    labs(y = "Probability of dying, qx", col = NULL) +
    theme_bw() +
    theme(legend.position = "bottom")

Next, we have age-specific fertility rates for multiple years:

rates_female %>%
  filter(year %in% seq(1950, 2020, 30)) %>% 
  ggplot() + 
  geom_line(aes(x = age, y = fx, col = as.character(year)), linewidth = 1) +
  labs(y = "Age-specific fertility rate, fx", col = NULL) +
  theme_bw() +
  theme(legend.position = "bottom")

Finally, we have the population counts:

pop_female %>%
  mutate(age = 0:100) %>% 
  pivot_longer(-age, names_to = "year", values_to = "pop") %>% 
  mutate(year = gsub("X", "", year)) %>% 
  filter(year %in% seq(1950, 2020, 30)) %>% 
  ggplot() + 
  geom_line(aes(x = age, y = pop, col = as.character(year)), linewidth = 1) +
  labs(y = "Population alive in thousands", col = NULL) +
  theme_bw() +
  theme(legend.position = "bottom")

3. The DemoKin package

DemoKin can be used to compute the number and age distribution of Focal’s relatives under a range of assumptions, including living and deceased kin. Let’s explore the main functions.

3.1. The function `kin()`

The function DemoKin::kin() currently does most of the heavy lifting in terms of implementing matrix kinship models.

For this example, we will run the simplest model, with the following assumptions:

Rates are time-invariant; i.e., the same set of rates apply at all times. For this example, these will be the 2020 rates.
The population has one one-sex; i.e., we only use female data as input and trace descent through female lines.

First, we need to extract the data for women in 2020 and save it as a matrix:

# First, reshape fertility and survival for a given year

asfr_2020 <- rates_female %>%
  filter(year == 2020) %>% 
  select(age, year, fx) %>%
  pivot_wider(names_from = year, values_from = fx) %>%
  select(-age) %>%
  as.matrix()

surv_2020 <- rates_female %>%
  filter(year == 2020) %>% 
  select(age, year, px) %>%
  pivot_wider(names_from = year, values_from = px) %>%
  select(-age) %>%
  as.matrix()

Now we can run the kinship models:

kin_2020 <- kin(p = surv_2020, f = asfr_2020, time_invariant = TRUE)

3.2. Arguments of `kin()`

p numeric. A vector (atomic) or matrix of survival probabilities with rows as ages (and columns as years in case of matrix).
f numeric. Same as U but for fertility rates.
time_invariant logical. Assume time-invariant rates. Default TRUE.
output_kin character. kin types to return: “m” for mother, “d” for daughter, …

3.3 Relative types

Relatives for the output_kin argument are identified by a unique code. Note that the relationship codes used in DemoKin differ from those in Caswell (2019). The equivalence between the two set of codes is given in the following table:

demokin_codes

##    DemoKin Caswell               Labels_female                   Labels_male
## 1      coa       t    Cousins from older aunts     Cousins from older uncles
## 2      cya       v  Cousins from younger aunts   Cousins from younger uncles
## 3        c    <NA>                     Cousins                       Cousins
## 4        d       a                   Daughters                          Sons
## 5       gd       b             Grand-daughters                    Grand-sons
## 6      ggd       c       Great-grand-daughters              Great-grand-sons
## 7      ggm       h          Great-grandmothers            Great-grandfathers
## 8       gm       g                Grandmothers                  Grandfathers
## 9        m       d                      Mother                        Father
## 10     nos       p   Nieces from older sisters   Nephews from older brothers
## 11     nys       q Nieces from younger sisters Nephews from younger brothers
## 12       n    <NA>                      Nieces                       Nephews
## 13      oa       r     Aunts older than mother     Uncles older than fathers
## 14      ya       s   Aunts younger than mother    Uncles younger than father
## 15       a    <NA>                       Aunts                        Uncles
## 16      os       m               Older sisters                Older brothers
## 17      ys       n             Younger sisters              Younger brothers
## 18       s    <NA>                     Sisters                      Brothers
##                          Labels_2sex
## 1    Cousins from older aunts/uncles
## 2  Cousins from younger aunts/uncles
## 3                            Cousins
## 4                           Children
## 5                    Grand-childrens
## 6              Great-grand-childrens
## 7                Great-grandfparents
## 8                       Grandparents
## 9                            Parents
## 10      Niblings from older siblings
## 11    Niblings from younger siblings
## 12                          Niblings
## 13   Aunts/Uncles older than parents
## 14 Aunts/Uncles younger than parents
## 15                      Aunts/Uncles
## 16                    Older siblings
## 17                  Younger siblings
## 18                          Siblings

3.4. Value

DemoKin::kin() returns a list containing two data frames: kin_full and kin_summary.

str(kin_2020)

## List of 2
##  $ kin_full   : tibble [142,814 × 7] (S3: tbl_df/tbl/data.frame)
##   ..$ kin      : chr [1:142814] "d" "d" "d" "d" ...
##   ..$ age_kin  : int [1:142814] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ age_focal: int [1:142814] 0 1 2 3 4 5 6 7 8 9 ...
##   ..$ living   : num [1:142814] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ dead     : num [1:142814] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ cohort   : logi [1:142814] NA NA NA NA NA NA ...
##   ..$ year     : logi [1:142814] NA NA NA NA NA NA ...
##  $ kin_summary: tibble [1,414 × 9] (S3: tbl_df/tbl/data.frame)
##   ..$ age_focal     : int [1:1414] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ kin           : chr [1:1414] "coa" "cya" "d" "gd" ...
##   ..$ year          : logi [1:1414] NA NA NA NA NA NA ...
##   ..$ count_living  : num [1:1414] 0.309 0.103 0 0 0 ...
##   ..$ mean_age      : num [1:1414] 8.4 4.28 NaN NaN NaN ...
##   ..$ sd_age        : num [1:1414] 6.43 4.04 NaN NaN NaN ...
##   ..$ count_dead    : num [1:1414] 0.000772 0.000549 0 0 0 ...
##   ..$ count_cum_dead: num [1:1414] 0.000772 0.000549 0 0 0 ...
##   ..$ mean_age_lost : num [1:1414] 0 0 NaN NaN NaN 0 0 0 0 NaN ...

`kin_full`

This data frame contains expected kin counts by year (or cohort), age of Focal, type of kin and, age of kin, including living and dead kin at that age.

head(kin_2020$kin_full)

## # A tibble: 6 × 7
##   kin   age_kin age_focal living  dead cohort year 
##   <chr>   <int>     <int>  <dbl> <dbl> <lgl>  <lgl>
## 1 d           0         0      0     0 NA     NA   
## 2 d           0         1      0     0 NA     NA   
## 3 d           0         2      0     0 NA     NA   
## 4 d           0         3      0     0 NA     NA   
## 5 d           0         4      0     0 NA     NA   
## 6 d           0         5      0     0 NA     NA

`kin_summary`

This is a ‘summary’ data frame derived from kin_full. To produce it, we sum over all ages of kin to produce a data frame of expected kin counts by year or cohort and age of Focal (but not by age of kin).

head(kin_2020$kin_summary)

## # A tibble: 6 × 9
##   age_focal kin   year  count_living mean_age sd_age count_dead count_cum_dead
##       <int> <chr> <lgl>        <dbl>    <dbl>  <dbl>      <dbl>          <dbl>
## 1         0 coa   NA           0.309     8.40   6.43   0.000772       0.000772
## 2         0 cya   NA           0.103     4.28   4.04   0.000549       0.000549
## 3         0 d     NA           0       NaN    NaN      0              0       
## 4         0 gd    NA           0       NaN    NaN      0              0       
## 5         0 ggd   NA           0       NaN    NaN      0              0       
## 6         0 ggm   NA           0.384    75.7    6.66   0.0252         0.0252  
## # ℹ 1 more variable: mean_age_lost <dbl>

3.5. Kinship diagrams

We can visualize the kinship structure we just computed using a network or ‘Keyfitz’ kinship diagram (Keyfitz and Caswell 2005) with the plot_diagram function. Let’s see the expected number of living kin for a 65 yo woman in India according to our model:

kin_2020$kin_summary %>% 
  filter(age_focal == 65) %>% 
  select(kin, count = count_living) %>% 
  plot_diagram(rounding = 2)

3.6. Number of living kin

Let’s run the model again with the same parameters, selecting specific kin types (note the output_kin argument:

kin_2020 <- 
  kin(
    p = surv_2020
    , f = asfr_2020
    , output_kin = c("c", "d", "gd", "ggd", "gm", "m", "n", "a", "s")
    , time_invariant = TRUE
    )

Now, let’s visualize how the expected number of daughters, siblings, cousins, etc., changes over the life course of Focal (now, with full names to identify each relative type using the function DemoKin::rename_kin()).

kin_2020$kin_summary %>%
  rename_kin() %>%
  ggplot() +
  geom_line(aes(age_focal, count_living))  +
  theme_bw() +
  labs(x = "Age of focal", y= "Number of living female relatives") +
  facet_wrap(~kin_label)

Note that we are working in a time invariant framework. You can think of the results as analogous to life expectancy (i.e., expected years of life for a synthetic cohort experiencing a given set of period mortality rates).

How does overall family size (and family composition) vary over life for an average woman who survives to each age?

counts <- 
  kin_2020$kin_summary %>%
  group_by(age_focal) %>% 
  summarise(count_living = sum(count_living)) %>% 
  ungroup()

kin_2020$kin_summary %>%
  select(age_focal, kin, count_living) %>% 
  rename_kin() %>% 
  ggplot(aes(x = age_focal, y = count_living)) +
  geom_area(aes(fill = kin_label), colour = "black") +
  geom_line(data = counts, size = 2) +
  labs(x = "Age of focal",
       y = "Number of living female relatives",
       fill = "Kin") +
  theme_bw() +
  theme(legend.position = "bottom")

3.7 Age distribution of living kin

How old are Focal’s relatives? Using the kin_full data frame, we can visualize the age distribution of Focal’s relatives throughout Focal’s life. For example when Focal is 65 (the red vertical line in the plot), what are the ages of her relatives:

kin_2020$kin_full %>%
  rename_kin() %>%
  filter(age_focal == 65) %>%
  ggplot(aes(age_kin, living)) +
  geom_line() +
  geom_vline(xintercept = 65, colour = "red") +
  labs(x = "Age of kin",
       y = "Expected number of living relatives") +
  theme_bw() +
  facet_wrap(~kin_label)

4. Exercise

4.1 We saw the ‘Keyfitz’ kinship diagram for a Focal girl aged 5 yo, but what about an older Focal, for example, a woman 65 yo? Visualise the kinship diagram for an average 65yo Indian woman and discuss the results with your group.

4.2. In class, you ran a time-invariant model with the rates for 2020. How are the results different if you run a time-invariant model with 1950 rates? Run the models and discuss interesting findings.

Now share with the class what you have discussed in your groups!

5. Homework

We will download fertility, mortality and population data for all countries from 1950 to 2023. For this, you will use the download_wpp24() function, which automates the process of downloading these data. Please note that this process may take a couple of minutes, so make sure to do it before day 2 of the workshop!

# Load the function to download data
source("UNWPP_download.R")

# Download the data
download_wpp24()

Day 2: 2-sex time-variant models

Human males generally live shorter and reproduce later than females. These sex-specific processes affect kinship dynamics in a number of ways. For example, the degree to which an average member of the population, call her Focal, has a living grandparent is affected by differential mortality affecting the parental generation at older ages. We may also be interested in considering how kinship structures vary by Focal’s sex: a male Focal may have a different number of grandchildren than a female Focal given differences in fertility by sex. Documenting these differences matters since women often face greater expectations to provide support and informal care to relatives. As they live longer, they may find themselves at greater risk of being having no living kin. The function kin2sex implements two-sex kinship models as introduced by Caswell (2022).

1. Setup

Load packages:

# Packages
library(dplyr)
library(tidyr)
library(ggplot2)
library(DemoKin)

Load male and female data (restricting it to data in the 1950-2023 period):

#Load data for women
rates_female <- read.csv("data/ind_female.csv", stringsAsFactors = F) #female rates
pop_female <- read.csv("data/ind_popfemale.csv", stringsAsFactors = F) #female population (N)
# For data for men
rates_male <- read.csv("data/ind_male.csv", stringsAsFactors = F) # male rates
pop_male <- read.csv("data/ind_popmale.csv", stringsAsFactors = F) # male population (N)

Since we’re not running projections (yet), we’ll keep only data in the 1950-2023 period:

rates_female <- rates_female %>% filter(year <= 2023)
pop_female <- pop_female[ , 1:(2024-1950)]
rates_male <- rates_male %>% filter(year <= 2023)
pop_male <- pop_male[ , 1:(2024-1950)]

Note that for men there is no fx (fertility) column because the WPP does not provide it:

head(rates_male)

##    Sex year age        px
## 1 Male 1950   0 0.8206210
## 2 Male 1950   1 0.9282634
## 3 Male 1950   2 0.9656006
## 4 Male 1950   3 0.9803572
## 5 Male 1950   4 0.9875404
## 6 Male 1950   5 0.9921338

2. Run kinship model

2.1. Convert data to matrices

Before running the model, we need to reshape the data to have matrices as input.

# Female fertility
asfr_females <- 
  rates_female %>%
  select(age, year, fx) %>%
  pivot_wider(names_from = year, values_from = fx) %>%
  select(-age) %>%
  as.matrix()

# Female survival
surv_females <- 
  rates_female %>%
  select(age, year, px) %>%
  pivot_wider(names_from = year, values_from = px) %>%
  select(-age) %>%
  as.matrix()

# Male survival
surv_males <- 
  rates_male %>%
  select(age, year, px) %>%
  pivot_wider(names_from = year, values_from = px) %>%
  select(-age) %>%
  as.matrix()

The data now has years in columns and ages in rows. For example, male survival probabilities:

surv_males[1:10, 1:6]

##            1950      1951      1952      1953      1954      1955
##  [1,] 0.8206210 0.8234099 0.8263398 0.8289078 0.8323192 0.8349812
##  [2,] 0.9282634 0.9309625 0.9336908 0.9359920 0.9389950 0.9412540
##  [3,] 0.9656006 0.9669560 0.9683531 0.9694582 0.9711562 0.9723360
##  [4,] 0.9803572 0.9809870 0.9816732 0.9821535 0.9831458 0.9837511
##  [5,] 0.9875404 0.9878156 0.9881476 0.9883282 0.9889400 0.9892518
##  [6,] 0.9921338 0.9922197 0.9923563 0.9923799 0.9927620 0.9929089
##  [7,] 0.9949742 0.9949693 0.9950079 0.9949594 0.9952121 0.9952757
##  [8,] 0.9964943 0.9964654 0.9964738 0.9964090 0.9966080 0.9966462
##  [9,] 0.9971270 0.9971121 0.9971292 0.9970802 0.9972688 0.9973145
## [10,] 0.9973052 0.9973138 0.9973496 0.9973247 0.9975137 0.9975742

2.2. Run model

Be patient: this can take around 5-10 minutes to run…

kin_out <- 
  kin2sex(
    # Survival probabilities of women
    pf = surv_females
    # Survival probabilities of men
    , pm = surv_males
    # Fertility rates for women
    , ff = asfr_females
    # Fertility rates for men (in this example, we use female rates)
    , fm = asfr_females
    # Female population counts
    , nf = pop_female
    # Male population counts
    , nm = pop_male
    # If FALSE, rates vary over time
    , time_invariant = FALSE
    # Which kin you want estimates for (see demokin_codes)
    , output_kin = c("c", "d", "gd", "ggd", "ggm", "gm", "m", "n", "a", "s")
    # The sex of the Focal or anchor invididual
    , sex_focal = "f"
    )

2.3. Model assumptions

Our model assumes that:

Rates vary over time (time-variant)
There are two sexes in the population (two-sex)
*However, since we don’t have male fertility data from the UNWPP, we assume that male fertility is equivalent to female fertility (androgynous assumption)
Demographic rates are stable (time-invariant) before 1950 (the earliest observed data point)

3. Kinship structures

The value returned by kin2sex is equivalent to the one returned by kin, with a couple new columns:

sex_kin that distinguishes between male and female kin
year: period (calendar) year
cohort: birth cohort of Focal

head(kin_out$kin_summary)

## # A tibble: 6 × 11
##   age_focal kin   sex_kin  year cohort count_living mean_age sd_age count_dead
##       <int> <chr> <chr>   <int>  <int>        <dbl>    <dbl>  <dbl>      <dbl>
## 1         0 a     f        1950   1950         3.25     23.7   11.3     0.0329
## 2         0 a     f        1951   1951         3.25     23.8   11.3     0.0329
## 3         0 a     f        1952   1952         3.25     23.8   11.3     0.0325
## 4         0 a     f        1953   1953         3.25     23.8   11.3     0.0320
## 5         0 a     f        1954   1954         3.25     23.7   11.3     0.0319
## 6         0 a     f        1955   1955         3.25     23.7   11.3     0.0307
## # ℹ 2 more variables: count_cum_dead <dbl>, mean_age_lost <dbl>

A note on terminology: The kin2sex function uses the same codes as kin to identify relatives (see demokin_codes() for reference).
Please note that when running a two-sex model, the code “m” refers to either mothers or fathers!
Use the sex_kin column to filter by the sex of a specific relative.
For example, to consider only sons and ignore daughters, use:

kin_out$kin_summary %>%
  filter(kin == "d", sex_kin == "m") %>%
  head()

## # A tibble: 6 × 11
##   age_focal kin   sex_kin  year cohort count_living mean_age sd_age count_dead
##       <int> <chr> <chr>   <int>  <int>        <dbl>    <dbl>  <dbl>      <dbl>
## 1         0 d     m        1950   1950            0      NaN    NaN          0
## 2         0 d     m        1951   1951            0      NaN    NaN          0
## 3         0 d     m        1952   1952            0      NaN    NaN          0
## 4         0 d     m        1953   1953            0      NaN    NaN          0
## 5         0 d     m        1954   1954            0      NaN    NaN          0
## 6         0 d     m        1955   1955            0      NaN    NaN          0
## # ℹ 2 more variables: count_cum_dead <dbl>, mean_age_lost <dbl>

The two sex model provides kinship estimates by age, period, and birth cohort (of Focal) (Caswell 2019):

3.1. Cohort approach

Let’s take a look at the resulting kin counts for a Focal born in 1960. First, we plot total family size over the life course of Focal for all kin and irrespective of the sex of kin:

kin_out$kin_summary %>%
  filter(cohort == 1960) %>% 
  rename_kin(sex = "2sex") %>% 
  summarise(count=sum(count_living), .by = c(kin_label, age_focal)) %>%
  ggplot(aes(age_focal, count, fill=kin_label)) +
  geom_area(colour = "black") +
  labs(y = "Expected number of living kin by Focal's age",
       x = "Age of a Focal born 1960") +
  theme_bw() +
    theme(legend.position = "bottom")

Now, we see the distribution of kin by sex of kin for a subset of kin:

kin_out$kin_summary %>%
  filter(cohort == 1960) %>% 
  filter(kin%in% c("c", "d", "gd", "ggd", "gm", "m", "n", "a", "s")) %>% 
  rename_kin(sex = "2sex") %>% 
  summarise(count=sum(count_living), .by = c(kin_label, age_focal, sex_kin)) %>%
  ggplot(aes(age_focal, count, fill=sex_kin)) +
  geom_area() +
  labs(y = "Expected number of living kin by sex and Focal's age",
       x = "Age of a Focal born 1960",
       fill = "Sex of Kin") +
  facet_wrap(~kin_label, scales = "free_y") +
  theme_bw() +
  theme(legend.position = "bottom")

We now focus on the age distribution of kin. Note that we use the object kin_full for this:

kin_out$kin_full %>%
  filter(age_focal == 60, cohort == 1960) %>% 
  filter(kin%in% c("d", "gd", "m", "s")) %>% 
  rename_kin(sex = "2sex") %>% 
  ggplot(aes(x = age_kin, y = living, colour = sex_kin)) +
  geom_line() +
  theme_bw() +
  labs(y = "Expected number of living kin for a 65 yo Focal born in 1960",
       x = "Age of kin",
       colour = "Sex of Kin") +
  facet_wrap(.~kin_label)

3.2. Period Approach

Let us take a snapshot of the kin distribution in 2023. We can plot total family size over the life course of Focal for all kin and irrespective of the sex of kin:

kin_out$kin_summary %>%
  filter(year == 2023) %>% 
  rename_kin(sex = "2sex") %>% 
  summarise(count=sum(count_living), .by = c(kin_label, age_focal)) %>%
  ggplot(aes(age_focal, count, fill=kin_label)) +
  geom_area(colour = "black") +
  labs(y = "Expected number of living kin by sex and Focal's age",
       x = "Age of Focal") +
  theme_bw() +
    theme(legend.position = "bottom")

Next, let’s plot the number of kin over the life of Focal including the sex of kin for a selected number of kin:

kin_out$kin_summary %>%
  filter(year == 2023) %>% 
  filter(kin%in% c("c", "d", "gd", "ggd", "gm", "m", "n", "a", "s")) %>% 
  rename_kin(sex = "2sex") %>% 
  summarise(count=sum(count_living), .by = c(kin_label, age_focal, sex_kin)) %>%
  ggplot(aes(age_focal, count, fill=sex_kin)) +
  geom_area() +
  labs(y = "Expected number of living kin by sex and Focal's age",
       x = "Age of Focal",
       fill = "Sex of Kin") +
  facet_wrap(~kin_label, scales = "free_y") +
  theme_bw() +
  theme(legend.position = "bottom")

Finally, we see the age distribution of kin for a 65 yo woman in calendar year 2023. Note that we use the object kin_full for this:

kin_out$kin_full %>%
  filter(age_focal == 65, year == 2023) %>% 
  filter(kin%in% c("d", "gd", "m", "s")) %>% 
  rename_kin(sex = "2sex") %>% 
  ggplot(aes(x = age_kin, y = living, colour = sex_kin)) +
  geom_line() +
  theme_bw() +
  labs(y = "Expected number of living kin for a 65 yo Focal born in 1960",
       x = "Age of kin",
       colour = "Sex of Kin") +
  facet_wrap(.~kin_label)

Information on kin availability by sex allows us to consider sex ratios, a traditional measure in demography, with females often in denominator. The following figure, for example, shows that a 25yo French woman in our hypothetical population can expect to have 0.5 grandfathers for every grandmother. Is always the case that the sex ratio will decrease by Focal´s age?

kin_out$kin_summary %>%
  rename_kin(sex = "2sex") %>% 
  group_by(kin_label, age_focal) %>%
  summarise(sex_ratio = sum(count_living[sex_kin=="m"], na.rm=T)/sum(count_living[sex_kin=="f"], na.rm=T)) %>%
  ggplot(aes(age_focal, sex_ratio))+
  geom_line()+
  theme_bw() +
  labs(y = "Sex ratio",
       x = "Age of Focal") +
  facet_wrap(~kin_label, scales = "free")

4. Deceased kin

We have focused on living kin, but what about relatives who have died during Focal’s life? The output of kin also includes information of kin deaths experienced by Focal.

We start by considering the number of kin deaths that can expect to experience at each age. In other words, the non-cumulative number of deaths in the family that Focal experiences at a given age in the year 2000.

loss1 <- 
  kin_out$kin_summary %>%
  filter(age_focal<100, year == 2020) %>%
  group_by(age_focal) %>% 
  summarise(count_dead = sum(count_dead)) %>% 
  ungroup()

kin_out$kin_summary %>%
  rename_kin(sex = "2sex") %>%
  filter(age_focal<100, year == 2020) %>%
  group_by(age_focal, kin_label) %>%
  summarise(count_dead = sum(count_dead)) %>%
  ungroup() %>%
  ggplot(aes(x = age_focal, y = count_dead)) +
    geom_area(aes(fill = kin_label), colour = "black") +
    geom_line(data = loss1, size = 2) +
    labs(x = "Age of focal",
         y = "Number of kin deaths experienced at each age",
         fill = "Kin") +
    theme_bw() +
    theme(legend.position = "bottom")

Now, we combine all kin types to show the cumulative burden of kin death for an average member of the population surviving to each age:

loss2 <- 
  kin_out$kin_summary %>%
  filter(year == 2020) %>% 
  group_by(age_focal) %>% 
  summarise(count_cum_dead = sum(count_dead)) %>% 
  mutate(count_cum_dead = cumsum(count_cum_dead)) %>% 
  ungroup()

kin_out$kin_summary %>%
  rename_kin(sex = "2sex") %>%
  filter(year == 2020) %>% 
  group_by(age_focal, kin_label) %>% 
  summarise(count_cum_dead = sum(count_dead)) %>% 
  ungroup() %>% 
  group_by(kin_label) %>% 
  mutate(count_cum_dead = cumsum(count_cum_dead)) %>%
  ungroup() %>% 
 ggplot(aes(x = age_focal, y = count_cum_dead)) +
  geom_area(aes(fill = kin_label), colour = "black") +
  # geom_line(data = loss2, aes(y = count_cum_dead), size = 2) +
  labs(x = "Age of focal",
       y = "Number of kin deaths experienced (cumulative)",
       fill = "Kin") +
  theme_bw() +
  theme(legend.position = "bottom")

In 2020, a member of the population aged 15, 50, and 65 years old will have experienced, on average, the death of relatives, respectively:

loss2 %>%
  filter(age_focal %in% c(15, 50, 65)) %>%
  select(count_cum_dead) %>%
  pull(count_cum_dead) %>%
  round(1) %>%
  paste(collapse = ", ")

## [1] "3.6, 13.6, 21.4"

5. Exercise

5,1. Load data for your chosen country

yesterday, we asked you to use the function download_wpp24() to download UNWPP data. Now, we will show you how to lead the data into DemoKin for your chosen country.

In this example, we’re focusing on Nepal, but you can chose any country you like. To select the country of interest, refer to the country names available here.

Once the data is downloaded, we can filter the information of interest—for example, by country. To do this, we will load two functions:

UNWPP_data: This function filters mortality data (px, qx, mx) and fertility data (fx) by the country of interest and the desired time period.
UNWPP_pop: This function filters population size (N) data.

# Load function to filter data
source("functions.R")

Now, we read the data and reformat the data from the csv files we downloaded in the earlier step.

fert_female <-
  UNWPP_data(country = "Nepal", 
             start_year = 1950, 
             end_year = 2023,
             sex = c("Female"),
             indicators = c("fx"), #indicators of interest
             output_format = "csv", #format you want to save (csv or rdata)
             output_file = "Nepal") 

mort_both <-
  UNWPP_data(country = "Nepal", 
             start_year = 1950, 
             end_year = 2023,
             sex = c("Male", "Female"),
             indicators = c("px"), #indicators of interest
             output_format = "csv", #format you want to save (csv or rdata)
             output_file = "Nepal") 

pop_female <-
  UNWPP_pop(country = "Nepal", 
            start_year = 1950, 
            end_year = 2023, 
            sex = "Female")

pop_male <-
  UNWPP_pop(country = "Nepal", 
            start_year = 1950, 
            end_year = 2023, 
            sex = "Male")

# Join rate objects by sex

rates_female <- 
  left_join(fert_female, mort_both) %>% 
  filter(Sex == "Female")

rates_male <- 
  mort_both %>%  
  filter(Sex == "Male")

Finally, you can save this pre-processed data to reuse it in your own project:

write.csv(rates_female, "data/npl_female.csv", row.names = F)
write.csv(rates_male, "data/npl_male.csv", row.names = F)
write.csv(pop_female, "data/npl_popfemale.csv", row.names = F)
write.csv(pop_male, "data/npl_popmale.csv", row.names = F)

Now we have all the input data needed to run the 2-sex time-variant model (see “1. Setup” in Part 2). We can simply repeat the analysis in day 2 (starting from “2. Run kinship model”) and get the kin estimates for Nepal.

Start by reading the data into R:

#Load data for women
rates_female <- read.csv("data/npl_female.csv", stringsAsFactors = F) #female rates
pop_female <- read.csv("data/npl_popfemale.csv", stringsAsFactors = F) #female population (N)
# For data for men
rates_male <- read.csv("data/npl_male.csv", stringsAsFactors = F) # male rates
pop_male <- read.csv("data/npl_popmale.csv", stringsAsFactors = F) # male population (N)

5.2. Living kin by sex

E1 On the first day, we asked you to download data for your country of choice. Using this data, run a time-variant two-sex model (using data from 1950-2020 as input) and answer:

E1.1 How many living grand-mothers and grand-fathers can a woman expect to have at age 15 in 1950 and in 2020? Extract a conclusion based on the results.

E1.2 Compare ‘kin sex ratios’ of grandparents, parents, daughters and siblings for the cohort 1950, at each age of Focal.

5.3. Time-variant and time-invariant models

E2 For your country of choice, run a time-variant two-sex model (the same one as in E1) and a new time-invariant two-sex model, using 2020 rates.

E.2.1 Compare the average number of living kin for a 5-year-old woman in 2020 from both models (the time-variant and the time-invariant one).

E.2.2 Which kin are similar and which are different and why?

5.4. Sex ratios at birth

E3 DemoKin assumes a constant sex ratio at birth of 1/2.04 = 0.49. How can different sex ratios at borth influence kinship structures?

Day 3: Projections of kinship by education

1. Setup

Load packages:

# Packages
library(dplyr)
library(tidyr)
library(ggplot2)
library(Matrix)
library(DemoKin)

2 Multistate models in `DemoKin`

Since Caswell’s (Caswell 2019) one-sex time-invariant age-structured matrix model of kinship, the framework has expanded to incorporate time-varying vital rates (2021), two-sex of kin and focal (2022) and multi-state kin populations (2020).
Here, we provide an R functionkin_multi_stage_time_variant_2sex, which combines the three aforementioned models and provide a two-sex time-varying multi-state model.

This function computes stage-specific kinship networks across both sexes for an average member of a population (focal) under time-varying demographic rates. It estimates the number, age, sex, and stage distribution of Focal’s relatives, for each age of Focal’s life, and as a function of the year in which Focal is born.

2.1. Why education?

Q1:Why is it important to study the education levels of both the Focal and their kin in the context of India?

Q2:What input data are required to estimate the number of kin by education using a multistate model in the Demokin package?

Q3:Where can you find the input data needed to incorporate education into a multistate model?

3. Data

In this example we use education as an example stage. Indian data ranging from 2020 - 2090 is sourced from the Wittgenstein Center. The data is aggregated into 5-year age group and 5-year time interval.

In order to implement the model, the function kin_multi_stage_time_variant_2sex expects the following 7 inputs of vital rates, fed in as lists:

U_list_females A list of female age-and-education specific survival probabilities over the timescale (in matrix forms). This input list has length = the timescale, and each entry represents the rates of a specific period in matrix form: stage columns, age rows.
U_list_males A list in the same format as U_list_females, but for males.
F_list_females A list of female age-and-education specific fertility rates over the timescale (in matrix forms). This input list has length = the timescale, and each entry represents the rates of a specific period in matrix form: stage columns, age rows.
F_list_males A list in the same format as F_list_females, but for males.
T_list_females A list of lists of female age-specific probabilities of moving up education over the timescale (in matrix forms). The outer list has length = the timescale. The inner list has length = number of ages. Each outer list entry is comprised of a list of matrices (stage*stage dimensional), each matrix describes age-specific probabilities of moving stage. Thus for each year, we have a list of age-specific probabilities of moving from one stage to the next.
T_list_males A list in the same format as T_list_females, but for males.
H_list A list of length = timescale, in which each element is a matrix which assigns the offspring of individuals in some stage to the appropriate age class (age in rows and states in columns). This list of matrices applies to both male and female offspring.

We pre-processed the data for this workshop:

load("data/india_edu.rdata")

3.1 Input data

Before running the model, let’s examine some trends in the data:

3.1.1 Fertility rates

# Calculate and plot Total Fertility Rate by education level over time
tfr_data <- lapply(seq_along(F_mat_fem_edu), function(i) {
  col_sums <- colSums(F_mat_fem_edu[[i]])
  data.frame(
    year = 2020 + (i - 1) * 5,
    education = factor(colnames(F_mat_fem_edu[[i]]), 
                      levels = c("e1", "e2", "e3", "e4", "e5", "e6"),
                      labels = c("no education", "incomplete primary", 
                                "primary", "lower secondary", 
                                "upper secondary", "post-secondary")),
    tfr = col_sums
  )
})

tfr_df <- do.call(rbind, tfr_data)

# Plot TFR trends
ggplot(tfr_df%>%filter(year<2095), aes(x = year, y = tfr, color = education, group = education)) +
  geom_line(size = 1) +
  geom_point() +
  theme_bw() +theme(legend.position = "bottom")+

  labs(title = "Total fertility rate by education over time",
    x = "Year",
    y = "TFR",
    color = "Education"
  )

Because expert-based assumptions were applied to fertility in 2050, the projected trend shows a discontinuity, especially for lower education. Details about assumption is given in https://pure.iiasa.ac.at/id/eprint/19487/

3.1.2 Survival rates

Let’s look at survival ratio

# Calculate and plot survival ratio for female by education level over time
surv_data <- lapply(seq_along(U_mat_fem_edu), function(i) {
#extract survival ratio for ith year
  survdt <- U_mat_fem_edu[[i]]

  data.frame(
    year = 2020 + (i - 1) * 5,
    age  = as.numeric(rownames(survdt)),
    education = rep(
      factor(colnames(survdt),
             levels = c("e1","e2","e3","e4","e5","e6"),
             labels = c("no education","incomplete primary","primary",
                        "lower secondary","upper secondary","post-secondary")),
      each = nrow(survdt)
    ),
    survival = as.vector(survdt)
  )
})

surv_df <- do.call(rbind, surv_data)

# Plot Survival ratio trends
ggplot(surv_df %>% 
         filter(age %in% c(0,15, 40,65)),
       aes(x = year, y = survival, color = education, group = education)) +
  geom_line(size = 1) +
  geom_point() +
  theme_bw() + theme(legend.position = "bottom")+
  facet_wrap(~age) +
  labs(title = "Survival ratio by education",
    x = "Year",
    y = "Survival ratio",
    color = "Education"
  )

Because the COVID-19 pandemic reduced survival ratios in 2020, we observe pronounced recovery in 2025. Education differential in mortality is only applied from the age 15 in input data.

3.1.3 Education transition rates

Now let’s plot education transition probabilities

Education transition probability= Probability that a person moves from one level of education to another (only higher level or only upward transition)level of education as they age.

# Education labels
edu_labels <- c("e1","e2","e3","e4","e5","e6")
edu_names  <- c("no education","incomplete primary","primary",
                "lower secondary","upper secondary","post-secondary")

trans_df <- lapply(seq_along(T_mat_fem_edu), function(i) {
  
  year_val <- 2020 + (i - 1) * 5
  
  lapply(seq_along(T_mat_fem_edu[[i]]), function(a) {
    mat <- T_mat_fem_edu[[i]][[a]]
    
    expand.grid(
      to_stage   = 1:6,   # <-- rows
      from_stage = 1:6    # <-- columns
    ) %>%
      mutate(
        transition_prob = as.vector(mat),     
        age  = 5+ (a - 1) * 5,
        year = year_val,
        cohort=year-age,
        from_stage = factor(from_stage, levels = 1:6, labels = edu_labels),
        to_stage   = factor(to_stage,   levels = 1:6, labels = edu_labels),
        from_name  = factor(from_stage, labels = edu_names),
        to_name    = factor(to_stage,   labels = edu_names)
      )
  }) %>% bind_rows()
  
}) %>% bind_rows()


# Education transition plot ages 10-15 and 15-20 and 20-25
ggplot(trans_df%>%filter(age%in%c(15,20,25) & cohort<2065),aes(x = year, y = transition_prob, fill = to_name))+
    geom_area() +
    facet_grid(age ~ from_name) +
    labs(
      title = "Education transition probabilities from origin to destination education across 
age groups",
      x = "Year",
      y = "Transition probability",
      fill = "Destination education"
    ) +
    theme_bw()+theme(axis.text.x = element_text(angle = 90),legend.position = "bottom")

Over time, the probability of transitioning from lower to higher education levels is expected to increase.

Assumptions of education transition probabilities from Wittgenstein Center:

1.Individuals younger than 10 are still in basic schooling and are not included in the model. Therefore, education transitions are considered only from age 10 onward.

2.People are assumed to complete their education by age 30, meaning no further transitions occur after that age.

3.Education transitions depend on age. For example, a person who is 20 years old and at the primary level will not move to lower secondary. However, if someone at the same age is in upper secondary, they will have a probability of moving to post-secondary education.

4. Run multistate kinship model

We feed the above inputs into the matrix model, along with other arguments:

this run takes some time (ca. 5 min)

# Run kinship model for a female Focal over a timescale of no_years
time_range <- seq(2020,2090,5)
no_years <- length(time_range)-1
# For 5-y age and year intervals
output_year_select <- seq(1, no_years+1, 1)

# since we use 5-year time frame, we are projecting from 2020 to 2020+5*14 = 2090
# We decide here to count accumulated kin by age of Focal, and not distributions of kin
kin_out_2020_2090 <-
  kin_multi_stage_time_variant_2sex(U_list_females = U_mat_fem_edu[1:(1+no_years)],
                                    U_list_males = U_mat_male_edu[1:(1+no_years)],
                                    F_list_females = F_mat_fem_edu[1:(1+no_years)],
                                    F_list_males = F_mat_male_edu[1:(1+no_years)],
                                    T_list_females = T_mat_fem_edu[1:(1+no_years)],
                                    T_list_males = T_mat_fem_edu[1:(1+no_years)],
                                    H_list = H_mat_edu[1:(1+no_years)],
                                    birth_female = 1/(1.06+1), ## Sex ratio at birth --  (1.06)
                                    parity = FALSE,
                                    summary_kin = TRUE,
                                    sex_Focal = "Female", ##  define Focal's sex at birth
                                    initial_stage_Focal = 1, ## Define Focal's stage at birth
                                    output_years = output_year_select # it seems tricky to set up output years when the time interval is not 1
                                    # now I select every 10 year for output
)

## 45.53 sec elapsed

Now we have to recode year and age related variables to show the correct years and educational labels:

# also need to change the age_focal and year variables accordingly
kin_out_2020_2090$kin_summary <- 
  kin_out_2020_2090$kin_summary %>%
  mutate(year = (year-1)*5+min(time_range),
         age_focal = age_focal*5,
         cohort = year - age_focal,
          stage_kin = factor(stage_kin, levels = c(1, 2, 3, 4, 5, 6),
                       labels = c(
                         "no education",
                         "incomplete primary",
                         "primary",
                         "lower secondary",
                         "upper secondary & above",
                         "upper secondary & above"
                       )))

kin_out_2020_2090$kin_full <- kin_out_2020_2090$kin_full%>%
  mutate(year = (year-1)*5+min(time_range),
         age_focal = age_focal*5,
         age_kin = age_kin*5,
         cohort = year - age_focal,
         stage_kin = factor(stage_kin, levels = c(1, 2, 3, 4, 5, 6),
                       labels = c(
                         "no education",
                         "incomplete primary",
                         "primary",
                         "lower secondary",
                         "upper secondary & above",
                         "upper secondary & above"
                       )))

5. Number of living kin by education of kin

Notice the structure of the output data kin_summary. We have columns age_focal and kin_stage because we sum over all ages of kin, and produce the marginal stage distribution given age of Focal. We have a column corresponding to sex of kin sex_kin, a column showing which year we are considering, and a column headed group which selects the kin type. Finally, we have columns showing Focal’s cohort of birth cohort (e.g., year - age of Focal).

Let’s take a look at kin_summary by selecting 10 rows, from row 1000 to row 1010.

kin_out_2020_2090$kin_summary[1000:1010, ]

## # A tibble: 11 × 8
##    age_focal stage_kin            count sex_kin  year group cohort cohort_factor
##        <dbl> <fct>                <dbl> <chr>   <dbl> <chr>  <dbl> <fct>        
##  1       100 incomplete primary 7.44e-1 Male     2020 ggd     1920 -19          
##  2       100 primary            2.38e+0 Female   2020 ggd     1920 -19          
##  3       100 primary            2.52e+0 Male     2020 ggd     1920 -19          
##  4       100 lower secondary    5.45e-1 Female   2020 ggd     1920 -19          
##  5       100 lower secondary    5.78e-1 Male     2020 ggd     1920 -19          
##  6       100 upper secondary &… 3.79e-2 Female   2020 ggd     1920 -19          
##  7       100 upper secondary &… 4.01e-2 Male     2020 ggd     1920 -19          
##  8       100 upper secondary &… 2.91e-6 Female   2020 ggd     1920 -19          
##  9       100 upper secondary &… 3.08e-6 Male     2020 ggd     1920 -19          
## 10         0 no education       0       Female   2020 m       2020 1            
## 11         0 no education       0       Male     2020 m       2020 1

5.1. Cohort approach

We start by showing the number of living kin for an average woman born in 2020 in India by level of educational attainment of her kin:

kin_out_2020_2090$kin_summary %>%
  # Multistate models also include an (educational) state for the Focal individual
  # which is not of interest here
  filter(group != "Focal") %>% 
  filter(cohort == 2020) %>%
  rename(kin = group) %>% 
  rename_kin(sex = "2sex") %>% 
  summarise(count=sum(count), .by = c(stage_kin, age_focal)) %>%
  ggplot(aes(x = age_focal, y = count, fill=stage_kin)) +
  geom_area(colour = "black") +
  labs(y = "Expected number of living kin",
       x = "Age of a woman born in 2020") +
  theme_bw() +
    theme(legend.position = "bottom")

Show the same values, but separately for selected kin:

kin_out_2020_2090$kin_summary %>%
  filter(group %in% c("d", "gd", "m", "gm")) %>% 
  filter(cohort == 2020) %>%
  rename(kin = group) %>% 
  rename_kin(sex = "2sex") %>% 
  summarise(count=sum(count), .by = c(kin_label, stage_kin, age_focal)) %>%
  ggplot(aes(x = age_focal, y = count, fill=stage_kin)) +
  geom_area(colour = "black") +
  labs(y = "Expected number of living kin",
       x = "Age of a woman born in 2020") +
  facet_wrap(. ~ kin_label) +
  theme_bw() +
  theme(legend.position = "bottom")

Next, we see the age distribution of kin (by education of kin) for a 60 year old woman born in the year 2020:

kin_out_2020_2090$kin_full %>%
  dplyr::filter(
    group %in%  c("d", "gd", "m", "gm")
    ,  age_focal == 60
    , cohort == 2020
    ) %>%
  rename(kin = group) %>% 
  rename_kin("2sex") %>% 
  ggplot2::ggplot(ggplot2::aes(x = age_kin, y = count, color = stage_kin, fill = stage_kin)) +
  ggplot2::geom_bar(position = "stack", stat = "identity") +
  ggplot2::scale_x_continuous(breaks = c(0,10,20,30,40,50,60,70,80,90,100)) +
  labs(x  ="Age of kin for a 60 yo woman born in 2020") +
  ggplot2::facet_wrap(. ~ kin_label) +
  ggplot2::theme_bw() +
  ggplot2::theme(
    axis.text.x = ggplot2::element_text(angle = 90, vjust = 0.5)
    , legend.position = "bottom"
    )

5.2. Period approach

Let’s see the expected number of living children and grandchildren for a 60 year old woman over time by the level of educational attainment of the children and grandchildren:

kin_out_2020_2090$kin_summary %>%
  filter(
    group %in% c("d", "gd")
    , age_focal == 60
    ) %>% 
  rename(kin = group) %>% 
  rename_kin(sex = "2sex") %>% 
  summarise(count=sum(count), .by = c(stage_kin, kin_label, year)) %>%
  ggplot(aes(x = year, y = count, fill=stage_kin)) +
  geom_area(colour = "black") +
  labs(y = "Expected number of descendants a for woman aged 60",
       x = "Year") +
  facet_grid( . ~ kin_label) +
  theme_bw() +
    theme(legend.position = "bottom")

And, finally, the level of education of the parents and grandparents of a newborn woman over time:

kin_out_2020_2090$kin_summary %>%
  filter(
    group %in% c("m", "gm")
    , age_focal == 0
    ) %>% 
  rename(kin = group) %>% 
  rename_kin(sex = "2sex") %>% 
  summarise(count=sum(count), .by = c(stage_kin, kin_label, year)) %>%
  ggplot(aes(x = year, y = count, fill=stage_kin)) +
  geom_area(colour = "black") +
  labs(y = "Expected number of descendants a for newborn woman (aged 0)",
       x = "Year") +
  facet_grid( . ~ kin_label) +
  theme_bw() +
    theme(legend.position = "bottom")

We note that there are following simplifying assumptions due to data limitations: 1. Fertility rates vary over time and by education level but are identical for both sexes (the androgynous approximation). 2. Age-specific education transition probabilities vary over time but not by sex (also an androgynous approximation). 3. Since no transition data exist before age 10, we assume that at ages 0-4, individuals transition from no education to incomplete primary in the next 5-year interval, and at ages 5-9, they transition from incomplete primary to primary (based on India’s compulsory education act 2009).Everyone above age 10 will thus have at least primary education. 4. Demographic rates remain stable (time-invariant) before 2020, the earliest observed data point. Therefore every cohort before 2020 experience the same education specific fertility, mortality and transition rates.

6.Exercise

6.1 Educatuion composition of selected kin over the life course of Focal

E1.1 For a newborn in 2020, how many grandmothers and grandfathers with at least a primary education are they expected to have ?

E1.2 How does this number change when focal grow older?

E1.3 what patterns did you observe? Also, compare this result with the Focal born in 2050.

6.2 Transitions in education and fertility across time

E2.1 In which year can a woman born in 2020 expect that at least half of her children will reach at least a secondary level of education?

E2.2 What does the change in the education composition of children over the life course of women born in 2020 tell us about the demographic processes behind fertility timing and education progression?

E2.2.1 compare the number of children women born in 2020 had in 2050 and in 2060. what difference you observed? now compare the education composition of children in 2050 and 2060.

6.3 Education composition over age and time

E3.1 Compare the educational composition of children in 2020, 2050, and 2080 for a Focal of age 65. What trends do you observe in the number of primary vs. upper secondary & above educated children over time?

E3.2 Plot the age and education distribution of the children and grandchildren of a Focal born in 1950, 2000 and 2050. Then look at how the age distribution of children with upper secondary & above education changes over time. What pattern did you observe?

Appendix: Other countries

Here, we show how to download fertility, mortality, and population data for other countries and years.

First, we will load the download_wpp24() function, which automates the process of downloading these datasets. Note that this process may take a couple of minutes.

These packages will be used:

# Load necessary libraries
library(dplyr)
library(tidyr)
library(data.table)

Load functions to download and process:

# Load function to filter data
source("functions.R")

Run the function to download data. In this case, we’re downloading data for Nepal, but you can chose any country you like. To select the country of interest, refer to the country names available here.

NOTE: This function will take a really long time to run since it downloads a bunch of large files, so make sure to have good and stable internet connection. The good thing is you only need to download the files once!

# Download the data
# Use include_projections = T to include data after 2023
download_wpp24(include_projections = T)

Once the data is downloaded, we can filter the information of interest—for example, by country. To do this, we will load two functions:

UNWPP_data: This function filters mortality data (px, qx, mx) and fertility data (fx) by the country of interest and the desired time period.
UNWPP_pop: This function filters population size (N) data.

Now, we read the data and reformat the data from the csv files we downloaded in the earlier step.

## Example output for DemoKin using female rates
# Select country, year and sex to obtain px and fx

fert_female <-
  UNWPP_data(country = "Nepal", 
             start_year = 1950, 
             end_year = 2100,
             sex = c("Female"),
             indicators = c("fx"), #indicators of interest
             output_format = "csv", #format you want to save (csv or rdata)
             output_file = "Nepal") 

mort_both <-
  UNWPP_data(country = "Nepal", 
             start_year = 1950, 
             end_year = 2100,
             sex = c("Male", "Female"),
             indicators = c("px"), #indicators of interest
             output_format = "csv", #format you want to save (csv or rdata)
             output_file = "Nepal") 

pop_female <-
  UNWPP_pop(country = "Nepal", 
            start_year = 1950, 
            end_year = 2100, 
            sex = "Female")

pop_male <-
  UNWPP_pop(country = "Nepal", 
            start_year = 1950, 
            end_year = 2100, 
            sex = "Male")

# Join rate objects by sex

rates_female <- 
  left_join(fert_female, mort_both) %>% 
  filter(Sex == "Female")

rates_male <- 
  mort_both %>%  
  filter(Sex == "Male")

Finally, you can save this data to reuse it in your own project:

write.csv(rates_female, "data/npl_female.csv", row.names = F)
write.csv(rates_male, "data/npl_male.csv", row.names = F)
write.csv(pop_female, "data/npl_popfemale.csv", row.names = F)
write.csv(pop_male, "data/npl_popmale.csv", row.names = F)

Now we have all the input data needed to run the 2-sex time-variant model (see “1. Setup” in Part 2). We can simply repeat the analysis in Part 2 (starting from “2. Run kinship model”) and get the kin estimates for Nepal.

References

Caswell, Hal. 2019. “The Formal Demography of Kinship: A Matrix Formulation.” Demographic Research 41 (September): 679–712. https://doi.org/10.4054/DemRes.2019.41.24.

———. 2020. “The Formal Demography of Kinship II: Multistate Models, Parity, and Sibship.” Demographic Research 42 (June): 1097–1146. https://doi.org/10.4054/DemRes.2020.42.38.

———. 2022. “The Formal Demography of Kinship IV: Two-Sex Models and Their Approximations.” Demographic Research 47 (September): 359–96. https://doi.org/10.4054/DemRes.2022.47.13.

Caswell, Hal, and Xi Song. 2021. “The Formal Demography of Kinship. III. Kinship Dynamics with Time-Varying Demographic Rates.” Demographic Research 45: 517–46.

Keyfitz, Nathan, and Hal Caswell. 2005. Applied Mathematical Demography. New York: Springer.

An introduction to the analysis of population-level kinship structures

IIPS, Mumbai, India, November 24-27, 2025

Instructors: Diego Alburez-Gutierrez & Saroja Adhikari

Updated: 2025-11-26

Introduction

1. Preparation

2. Agenda

3. Acknowledgements

Day 1: One-sex time-invariant models

1. Installation

2. Demographic data

3. The DemoKin package

3.1. The function kin()

3.2. Arguments of kin()

3.3 Relative types

3.4. Value

kin_full

kin_summary

3.5. Kinship diagrams

3.6. Number of living kin

3.7 Age distribution of living kin

4. Exercise

4.1 We saw the ‘Keyfitz’ kinship diagram for a Focal girl aged 5 yo, but what about an older Focal, for example, a woman 65 yo? Visualise the kinship diagram for an average 65yo Indian woman and discuss the results with your group.

4.2. In class, you ran a time-invariant model with the rates for 2020. How are the results different if you run a time-invariant model with 1950 rates? Run the models and discuss interesting findings.

5. Homework

Day 2: 2-sex time-variant models

1. Setup

2. Run kinship model

2.1. Convert data to matrices

2.2. Run model

2.3. Model assumptions

3. Kinship structures

3.1. Cohort approach

3.2. Period Approach

4. Deceased kin

5. Exercise

5,1. Load data for your chosen country

5.2. Living kin by sex

5.3. Time-variant and time-invariant models

5.4. Sex ratios at birth

Day 3: Projections of kinship by education

1. Setup

2 Multistate models in DemoKin

2.1. Why education?

3. Data

3.1 Input data

3.1.1 Fertility rates

3.1.2 Survival rates

3.1.3 Education transition rates

4. Run multistate kinship model

5. Number of living kin by education of kin

5.1. Cohort approach

5.2. Period approach

6.Exercise

6.1 Educatuion composition of selected kin over the life course of Focal

6.2 Transitions in education and fertility across time

6.3 Education composition over age and time

Appendix: Other countries

References

3.1. The function `kin()`

3.2. Arguments of `kin()`

`kin_full`

`kin_summary`

2 Multistate models in `DemoKin`