Please make sure to complete the following preparatory steps before the training workshop:
install.packages("dplyr")
install.packages("tidyr")
install.packages("ggplot2")
# install.packages("knitr")
install.packages("data.table")
install.packages("Matrix")
# DemoKin is available on [CRAN](https://cran.r-project.org/web/packages/DemoKin/index.html),
# but we'll use the development version on [GitHub](https://github.com/IvanWilli/DemoKin):
# install.packages("remotes")
remotes::install_github("IvanWilli/DemoKin")Now, check if the packages can be loaded:
< > Code and then
Download ZIP, as shown below:This workshop includes material from the course Matrix
Approaches to Modelling Kinship: Theory and Applications prepared by
Ivan Williams and adapted by
Amanda
Martins for the European
Doctoral School of Demography and for the ALAP 2024
Conference. The multistate code was adapted from the corresponding
vignette
in DemoKin contributed by Joe
Butterick and adapted to the case Singapore by Sha
Jiang. and for India by Saroja
Adhikari using the Wittgenstein educational data.
For a full list of DemoKin contributors, see https://ivanwilli.github.io/DemoKin/.
Kinship is a fundamental property of human populations and a key form of social structure. Demographers have long been interested in the interplay between demographic change and family configuration. This has led to the development of sophisticated methodological and conceptual approaches for the study of kinship, some of which are reviewed in this course.
Load packages that will be useful:
Load a couple of functions we prepared:
We will use data for India from the 2024 Revision of the World Population Prospects, which we pre-processed for this workshop.
#Load data for women
rates_female <- read.csv("data/ind_female.csv", stringsAsFactors = F) #female rates
pop_female <- read.csv("data/ind_popfemale.csv", stringsAsFactors = F) #female population (N)
# For data for men
rates_male <- read.csv("data/ind_male.csv", stringsAsFactors = F) # male rates
pop_male <- read.csv("data/ind_popmale.csv", stringsAsFactors = F) # male population (N)Let’s see what data we will be using. The first object is
rates_female, a long-format data frame with year- and
age-specific fertility rates (‘fx’) and survival probabilities from a
life table (‘px’).
For women:
## Sex year age px fx
## 1 Female 1950 0 0.8263349 0
## 2 Female 1950 1 0.9161987 0
## 3 Female 1950 2 0.9573440 0
## 4 Female 1950 3 0.9763298 0
## 5 Female 1950 4 0.9856816 0
## 6 Female 1950 5 0.9912545 0
Let us visualise some of our demographic data. We start with the probability of dying between ages \(x\) and \(x+1\),\(q_x\) in a life table for women:
rates_female %>%
filter(year %in% seq(1950, 2020, 30)) %>%
mutate(qx = 1-px) %>%
ggplot() +
geom_line(aes(x = age, y = qx, col = as.character(year)), linewidth = 1) +
scale_y_log10() +
labs(y = "Probability of dying, qx", col = NULL) +
theme_bw() +
theme(legend.position = "bottom")Next, we have age-specific fertility rates for multiple years:
rates_female %>%
filter(year %in% seq(1950, 2020, 30)) %>%
ggplot() +
geom_line(aes(x = age, y = fx, col = as.character(year)), linewidth = 1) +
labs(y = "Age-specific fertility rate, fx", col = NULL) +
theme_bw() +
theme(legend.position = "bottom")Finally, we have the population counts:
pop_female %>%
mutate(age = 0:100) %>%
pivot_longer(-age, names_to = "year", values_to = "pop") %>%
mutate(year = gsub("X", "", year)) %>%
filter(year %in% seq(1950, 2020, 30)) %>%
ggplot() +
geom_line(aes(x = age, y = pop, col = as.character(year)), linewidth = 1) +
labs(y = "Population alive in thousands", col = NULL) +
theme_bw() +
theme(legend.position = "bottom")DemoKin can be used to compute the number and age
distribution of Focal’s relatives under a range of assumptions,
including living and deceased kin. Let’s explore the main functions.
kin()The function DemoKin::kin() currently does most of the
heavy lifting in terms of implementing matrix kinship models.
For this example, we will run the simplest model, with the following assumptions:
First, we need to extract the data for women in 2020 and save it as a matrix:
# First, reshape fertility and survival for a given year
asfr_2020 <- rates_female %>%
filter(year == 2020) %>%
select(age, year, fx) %>%
pivot_wider(names_from = year, values_from = fx) %>%
select(-age) %>%
as.matrix()
surv_2020 <- rates_female %>%
filter(year == 2020) %>%
select(age, year, px) %>%
pivot_wider(names_from = year, values_from = px) %>%
select(-age) %>%
as.matrix()Now we can run the kinship models:
kin()Relatives for the output_kin argument are identified by
a unique code. Note that the relationship codes used in
DemoKin differ from those in Caswell (2019). The equivalence between the two set of
codes is given in the following table:
## DemoKin Caswell Labels_female Labels_male
## 1 coa t Cousins from older aunts Cousins from older uncles
## 2 cya v Cousins from younger aunts Cousins from younger uncles
## 3 c <NA> Cousins Cousins
## 4 d a Daughters Sons
## 5 gd b Grand-daughters Grand-sons
## 6 ggd c Great-grand-daughters Great-grand-sons
## 7 ggm h Great-grandmothers Great-grandfathers
## 8 gm g Grandmothers Grandfathers
## 9 m d Mother Father
## 10 nos p Nieces from older sisters Nephews from older brothers
## 11 nys q Nieces from younger sisters Nephews from younger brothers
## 12 n <NA> Nieces Nephews
## 13 oa r Aunts older than mother Uncles older than fathers
## 14 ya s Aunts younger than mother Uncles younger than father
## 15 a <NA> Aunts Uncles
## 16 os m Older sisters Older brothers
## 17 ys n Younger sisters Younger brothers
## 18 s <NA> Sisters Brothers
## Labels_2sex
## 1 Cousins from older aunts/uncles
## 2 Cousins from younger aunts/uncles
## 3 Cousins
## 4 Children
## 5 Grand-childrens
## 6 Great-grand-childrens
## 7 Great-grandfparents
## 8 Grandparents
## 9 Parents
## 10 Niblings from older siblings
## 11 Niblings from younger siblings
## 12 Niblings
## 13 Aunts/Uncles older than parents
## 14 Aunts/Uncles younger than parents
## 15 Aunts/Uncles
## 16 Older siblings
## 17 Younger siblings
## 18 Siblings
DemoKin::kin() returns a list containing two data
frames: kin_full and kin_summary.
## List of 2
## $ kin_full : tibble [142,814 × 7] (S3: tbl_df/tbl/data.frame)
## ..$ kin : chr [1:142814] "d" "d" "d" "d" ...
## ..$ age_kin : int [1:142814] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ age_focal: int [1:142814] 0 1 2 3 4 5 6 7 8 9 ...
## ..$ living : num [1:142814] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ dead : num [1:142814] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ cohort : logi [1:142814] NA NA NA NA NA NA ...
## ..$ year : logi [1:142814] NA NA NA NA NA NA ...
## $ kin_summary: tibble [1,414 × 9] (S3: tbl_df/tbl/data.frame)
## ..$ age_focal : int [1:1414] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ kin : chr [1:1414] "coa" "cya" "d" "gd" ...
## ..$ year : logi [1:1414] NA NA NA NA NA NA ...
## ..$ count_living : num [1:1414] 0.309 0.103 0 0 0 ...
## ..$ mean_age : num [1:1414] 8.4 4.28 NaN NaN NaN ...
## ..$ sd_age : num [1:1414] 6.43 4.04 NaN NaN NaN ...
## ..$ count_dead : num [1:1414] 0.000772 0.000549 0 0 0 ...
## ..$ count_cum_dead: num [1:1414] 0.000772 0.000549 0 0 0 ...
## ..$ mean_age_lost : num [1:1414] 0 0 NaN NaN NaN 0 0 0 0 NaN ...
kin_fullThis data frame contains expected kin counts by year (or cohort), age of Focal, type of kin and, age of kin, including living and dead kin at that age.
## # A tibble: 6 × 7
## kin age_kin age_focal living dead cohort year
## <chr> <int> <int> <dbl> <dbl> <lgl> <lgl>
## 1 d 0 0 0 0 NA NA
## 2 d 0 1 0 0 NA NA
## 3 d 0 2 0 0 NA NA
## 4 d 0 3 0 0 NA NA
## 5 d 0 4 0 0 NA NA
## 6 d 0 5 0 0 NA NA
kin_summaryThis is a ‘summary’ data frame derived from kin_full. To
produce it, we sum over all ages of kin to produce a data frame of
expected kin counts by year or cohort and age of Focal (but not
by age of kin).
## # A tibble: 6 × 9
## age_focal kin year count_living mean_age sd_age count_dead count_cum_dead
## <int> <chr> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 coa NA 0.309 8.40 6.43 0.000772 0.000772
## 2 0 cya NA 0.103 4.28 4.04 0.000549 0.000549
## 3 0 d NA 0 NaN NaN 0 0
## 4 0 gd NA 0 NaN NaN 0 0
## 5 0 ggd NA 0 NaN NaN 0 0
## 6 0 ggm NA 0.384 75.7 6.66 0.0252 0.0252
## # ℹ 1 more variable: mean_age_lost <dbl>
We can visualize the kinship structure we just computed using a
network or ‘Keyfitz’ kinship diagram (Keyfitz and
Caswell 2005) with the plot_diagram function. Let’s
see the expected number of living kin for a 65 yo woman in India
according to our model:
kin_2020$kin_summary %>%
filter(age_focal == 65) %>%
select(kin, count = count_living) %>%
plot_diagram(rounding = 2)Let’s run the model again with the same parameters, selecting
specific kin types (note the output_kin argument:
kin_2020 <-
kin(
p = surv_2020
, f = asfr_2020
, output_kin = c("c", "d", "gd", "ggd", "gm", "m", "n", "a", "s")
, time_invariant = TRUE
)Now, let’s visualize how the expected number of daughters, siblings,
cousins, etc., changes over the life course of Focal (now, with full
names to identify each relative type using the function
DemoKin::rename_kin()).
kin_2020$kin_summary %>%
rename_kin() %>%
ggplot() +
geom_line(aes(age_focal, count_living)) +
theme_bw() +
labs(x = "Age of focal", y= "Number of living female relatives") +
facet_wrap(~kin_label)Note that we are working in a time invariant framework. You can think of the results as analogous to life expectancy (i.e., expected years of life for a synthetic cohort experiencing a given set of period mortality rates).
How does overall family size (and family composition) vary over life for an average woman who survives to each age?
counts <-
kin_2020$kin_summary %>%
group_by(age_focal) %>%
summarise(count_living = sum(count_living)) %>%
ungroup()
kin_2020$kin_summary %>%
select(age_focal, kin, count_living) %>%
rename_kin() %>%
ggplot(aes(x = age_focal, y = count_living)) +
geom_area(aes(fill = kin_label), colour = "black") +
geom_line(data = counts, size = 2) +
labs(x = "Age of focal",
y = "Number of living female relatives",
fill = "Kin") +
theme_bw() +
theme(legend.position = "bottom")How old are Focal’s relatives? Using the kin_full data
frame, we can visualize the age distribution of Focal’s relatives
throughout Focal’s life. For example when Focal is 65 (the red vertical
line in the plot), what are the ages of her relatives:
kin_2020$kin_full %>%
rename_kin() %>%
filter(age_focal == 65) %>%
ggplot(aes(age_kin, living)) +
geom_line() +
geom_vline(xintercept = 65, colour = "red") +
labs(x = "Age of kin",
y = "Expected number of living relatives") +
theme_bw() +
facet_wrap(~kin_label)Now share with the class what you have discussed in your groups!
We will download fertility, mortality and population data for all
countries from 1950 to 2023. For this, you will use the
download_wpp24() function, which automates the process of
downloading these data. Please note that this process may take a couple
of minutes, so make sure to do it before day 2 of the workshop!
Human males generally live shorter and reproduce later than females.
These sex-specific processes affect kinship dynamics in a number of
ways. For example, the degree to which an average member of the
population, call her Focal, has a living grandparent is affected by
differential mortality affecting the parental generation at older ages.
We may also be interested in considering how kinship structures vary by
Focal’s sex: a male Focal may have a different number of grandchildren
than a female Focal given differences in fertility by sex. Documenting
these differences matters since women often face greater expectations to
provide support and informal care to relatives. As they live longer,
they may find themselves at greater risk of being having no living kin.
The function kin2sex implements two-sex kinship models as
introduced by Caswell (2022).
Load packages:
Load male and female data (restricting it to data in the 1950-2023 period):
#Load data for women
rates_female <- read.csv("data/ind_female.csv", stringsAsFactors = F) #female rates
pop_female <- read.csv("data/ind_popfemale.csv", stringsAsFactors = F) #female population (N)
# For data for men
rates_male <- read.csv("data/ind_male.csv", stringsAsFactors = F) # male rates
pop_male <- read.csv("data/ind_popmale.csv", stringsAsFactors = F) # male population (N)Since we’re not running projections (yet), we’ll keep only data in the 1950-2023 period:
rates_female <- rates_female %>% filter(year <= 2023)
pop_female <- pop_female[ , 1:(2024-1950)]
rates_male <- rates_male %>% filter(year <= 2023)
pop_male <- pop_male[ , 1:(2024-1950)]Note that for men there is no fx (fertility) column
because the WPP does not provide it:
## Sex year age px
## 1 Male 1950 0 0.8206210
## 2 Male 1950 1 0.9282634
## 3 Male 1950 2 0.9656006
## 4 Male 1950 3 0.9803572
## 5 Male 1950 4 0.9875404
## 6 Male 1950 5 0.9921338
Before running the model, we need to reshape the data to have matrices as input.
# Female fertility
asfr_females <-
rates_female %>%
select(age, year, fx) %>%
pivot_wider(names_from = year, values_from = fx) %>%
select(-age) %>%
as.matrix()
# Female survival
surv_females <-
rates_female %>%
select(age, year, px) %>%
pivot_wider(names_from = year, values_from = px) %>%
select(-age) %>%
as.matrix()
# Male survival
surv_males <-
rates_male %>%
select(age, year, px) %>%
pivot_wider(names_from = year, values_from = px) %>%
select(-age) %>%
as.matrix()The data now has years in columns and ages in rows. For example, male survival probabilities:
## 1950 1951 1952 1953 1954 1955
## [1,] 0.8206210 0.8234099 0.8263398 0.8289078 0.8323192 0.8349812
## [2,] 0.9282634 0.9309625 0.9336908 0.9359920 0.9389950 0.9412540
## [3,] 0.9656006 0.9669560 0.9683531 0.9694582 0.9711562 0.9723360
## [4,] 0.9803572 0.9809870 0.9816732 0.9821535 0.9831458 0.9837511
## [5,] 0.9875404 0.9878156 0.9881476 0.9883282 0.9889400 0.9892518
## [6,] 0.9921338 0.9922197 0.9923563 0.9923799 0.9927620 0.9929089
## [7,] 0.9949742 0.9949693 0.9950079 0.9949594 0.9952121 0.9952757
## [8,] 0.9964943 0.9964654 0.9964738 0.9964090 0.9966080 0.9966462
## [9,] 0.9971270 0.9971121 0.9971292 0.9970802 0.9972688 0.9973145
## [10,] 0.9973052 0.9973138 0.9973496 0.9973247 0.9975137 0.9975742
Be patient: this can take around 5-10 minutes to run…
kin_out <-
kin2sex(
# Survival probabilities of women
pf = surv_females
# Survival probabilities of men
, pm = surv_males
# Fertility rates for women
, ff = asfr_females
# Fertility rates for men (in this example, we use female rates)
, fm = asfr_females
# Female population counts
, nf = pop_female
# Male population counts
, nm = pop_male
# If FALSE, rates vary over time
, time_invariant = FALSE
# Which kin you want estimates for (see demokin_codes)
, output_kin = c("c", "d", "gd", "ggd", "ggm", "gm", "m", "n", "a", "s")
# The sex of the Focal or anchor invididual
, sex_focal = "f"
) Our model assumes that:
The value returned by kin2sex is equivalent to the one
returned by kin, with a couple new columns:
sex_kin that distinguishes between male and female
kinyear: period (calendar) yearcohort: birth cohort of Focal## # A tibble: 6 × 11
## age_focal kin sex_kin year cohort count_living mean_age sd_age count_dead
## <int> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0 a f 1950 1950 3.25 23.7 11.3 0.0329
## 2 0 a f 1951 1951 3.25 23.8 11.3 0.0329
## 3 0 a f 1952 1952 3.25 23.8 11.3 0.0325
## 4 0 a f 1953 1953 3.25 23.8 11.3 0.0320
## 5 0 a f 1954 1954 3.25 23.7 11.3 0.0319
## 6 0 a f 1955 1955 3.25 23.7 11.3 0.0307
## # ℹ 2 more variables: count_cum_dead <dbl>, mean_age_lost <dbl>
A note on terminology: The
kin2sexfunction uses the same codes askinto identify relatives (seedemokin_codes()for reference).
Please note that when running a two-sex model, the code “m” refers to either mothers or fathers!
Use thesex_kincolumn to filter by the sex of a specific relative.
For example, to consider only sons and ignore daughters, use:
## # A tibble: 6 × 11
## age_focal kin sex_kin year cohort count_living mean_age sd_age count_dead
## <int> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0 d m 1950 1950 0 NaN NaN 0
## 2 0 d m 1951 1951 0 NaN NaN 0
## 3 0 d m 1952 1952 0 NaN NaN 0
## 4 0 d m 1953 1953 0 NaN NaN 0
## 5 0 d m 1954 1954 0 NaN NaN 0
## 6 0 d m 1955 1955 0 NaN NaN 0
## # ℹ 2 more variables: count_cum_dead <dbl>, mean_age_lost <dbl>
The two sex model provides kinship estimates by age, period, and birth cohort (of Focal) (Caswell 2019):
Let’s take a look at the resulting kin counts for a Focal born in 1960. First, we plot total family size over the life course of Focal for all kin and irrespective of the sex of kin:
kin_out$kin_summary %>%
filter(cohort == 1960) %>%
rename_kin(sex = "2sex") %>%
summarise(count=sum(count_living), .by = c(kin_label, age_focal)) %>%
ggplot(aes(age_focal, count, fill=kin_label)) +
geom_area(colour = "black") +
labs(y = "Expected number of living kin by Focal's age",
x = "Age of a Focal born 1960") +
theme_bw() +
theme(legend.position = "bottom")Now, we see the distribution of kin by sex of kin for a subset of kin:
kin_out$kin_summary %>%
filter(cohort == 1960) %>%
filter(kin%in% c("c", "d", "gd", "ggd", "gm", "m", "n", "a", "s")) %>%
rename_kin(sex = "2sex") %>%
summarise(count=sum(count_living), .by = c(kin_label, age_focal, sex_kin)) %>%
ggplot(aes(age_focal, count, fill=sex_kin)) +
geom_area() +
labs(y = "Expected number of living kin by sex and Focal's age",
x = "Age of a Focal born 1960",
fill = "Sex of Kin") +
facet_wrap(~kin_label, scales = "free_y") +
theme_bw() +
theme(legend.position = "bottom")We now focus on the age distribution of kin. Note that we use the
object kin_full for this:
kin_out$kin_full %>%
filter(age_focal == 60, cohort == 1960) %>%
filter(kin%in% c("d", "gd", "m", "s")) %>%
rename_kin(sex = "2sex") %>%
ggplot(aes(x = age_kin, y = living, colour = sex_kin)) +
geom_line() +
theme_bw() +
labs(y = "Expected number of living kin for a 65 yo Focal born in 1960",
x = "Age of kin",
colour = "Sex of Kin") +
facet_wrap(.~kin_label)Let us take a snapshot of the kin distribution in 2023.
We can plot total family size over the life course of Focal for all kin
and irrespective of the sex of kin:
kin_out$kin_summary %>%
filter(year == 2023) %>%
rename_kin(sex = "2sex") %>%
summarise(count=sum(count_living), .by = c(kin_label, age_focal)) %>%
ggplot(aes(age_focal, count, fill=kin_label)) +
geom_area(colour = "black") +
labs(y = "Expected number of living kin by sex and Focal's age",
x = "Age of Focal") +
theme_bw() +
theme(legend.position = "bottom")Next, let’s plot the number of kin over the life of Focal including the sex of kin for a selected number of kin:
kin_out$kin_summary %>%
filter(year == 2023) %>%
filter(kin%in% c("c", "d", "gd", "ggd", "gm", "m", "n", "a", "s")) %>%
rename_kin(sex = "2sex") %>%
summarise(count=sum(count_living), .by = c(kin_label, age_focal, sex_kin)) %>%
ggplot(aes(age_focal, count, fill=sex_kin)) +
geom_area() +
labs(y = "Expected number of living kin by sex and Focal's age",
x = "Age of Focal",
fill = "Sex of Kin") +
facet_wrap(~kin_label, scales = "free_y") +
theme_bw() +
theme(legend.position = "bottom")Finally, we see the age distribution of kin for a 65 yo woman in
calendar year 2023. Note that we use the object kin_full
for this:
kin_out$kin_full %>%
filter(age_focal == 65, year == 2023) %>%
filter(kin%in% c("d", "gd", "m", "s")) %>%
rename_kin(sex = "2sex") %>%
ggplot(aes(x = age_kin, y = living, colour = sex_kin)) +
geom_line() +
theme_bw() +
labs(y = "Expected number of living kin for a 65 yo Focal born in 1960",
x = "Age of kin",
colour = "Sex of Kin") +
facet_wrap(.~kin_label)Information on kin availability by sex allows us to consider sex ratios, a traditional measure in demography, with females often in denominator. The following figure, for example, shows that a 25yo French woman in our hypothetical population can expect to have 0.5 grandfathers for every grandmother. Is always the case that the sex ratio will decrease by Focal´s age?
kin_out$kin_summary %>%
rename_kin(sex = "2sex") %>%
group_by(kin_label, age_focal) %>%
summarise(sex_ratio = sum(count_living[sex_kin=="m"], na.rm=T)/sum(count_living[sex_kin=="f"], na.rm=T)) %>%
ggplot(aes(age_focal, sex_ratio))+
geom_line()+
theme_bw() +
labs(y = "Sex ratio",
x = "Age of Focal") +
facet_wrap(~kin_label, scales = "free")We have focused on living kin, but what about relatives who have died
during Focal’s life? The output of kin also includes
information of kin deaths experienced by Focal.
We start by considering the number of kin deaths that can expect to experience at each age. In other words, the non-cumulative number of deaths in the family that Focal experiences at a given age in the year 2000.
loss1 <-
kin_out$kin_summary %>%
filter(age_focal<100, year == 2020) %>%
group_by(age_focal) %>%
summarise(count_dead = sum(count_dead)) %>%
ungroup()
kin_out$kin_summary %>%
rename_kin(sex = "2sex") %>%
filter(age_focal<100, year == 2020) %>%
group_by(age_focal, kin_label) %>%
summarise(count_dead = sum(count_dead)) %>%
ungroup() %>%
ggplot(aes(x = age_focal, y = count_dead)) +
geom_area(aes(fill = kin_label), colour = "black") +
geom_line(data = loss1, size = 2) +
labs(x = "Age of focal",
y = "Number of kin deaths experienced at each age",
fill = "Kin") +
theme_bw() +
theme(legend.position = "bottom")Now, we combine all kin types to show the cumulative burden of kin death for an average member of the population surviving to each age:
loss2 <-
kin_out$kin_summary %>%
filter(year == 2020) %>%
group_by(age_focal) %>%
summarise(count_cum_dead = sum(count_dead)) %>%
mutate(count_cum_dead = cumsum(count_cum_dead)) %>%
ungroup()
kin_out$kin_summary %>%
rename_kin(sex = "2sex") %>%
filter(year == 2020) %>%
group_by(age_focal, kin_label) %>%
summarise(count_cum_dead = sum(count_dead)) %>%
ungroup() %>%
group_by(kin_label) %>%
mutate(count_cum_dead = cumsum(count_cum_dead)) %>%
ungroup() %>%
ggplot(aes(x = age_focal, y = count_cum_dead)) +
geom_area(aes(fill = kin_label), colour = "black") +
# geom_line(data = loss2, aes(y = count_cum_dead), size = 2) +
labs(x = "Age of focal",
y = "Number of kin deaths experienced (cumulative)",
fill = "Kin") +
theme_bw() +
theme(legend.position = "bottom")In 2020, a member of the population aged 15, 50, and 65 years old will have experienced, on average, the death of relatives, respectively:
loss2 %>%
filter(age_focal %in% c(15, 50, 65)) %>%
select(count_cum_dead) %>%
pull(count_cum_dead) %>%
round(1) %>%
paste(collapse = ", ")## [1] "3.6, 13.6, 21.4"
yesterday, we asked you to use the function
download_wpp24() to download UNWPP data. Now, we will show
you how to lead the data into DemoKin for your chosen country.
In this example, we’re focusing on Nepal, but you can chose any country you like. To select the country of interest, refer to the country names available here.
Once the data is downloaded, we can filter the information of interest—for example, by country. To do this, we will load two functions:
UNWPP_data: This function filters mortality data
(px, qx, mx) and fertility data (fx) by the country of interest and the
desired time period.
UNWPP_pop: This function filters population size (N)
data.
Now, we read the data and reformat the data from the csv files we downloaded in the earlier step.
fert_female <-
UNWPP_data(country = "Nepal",
start_year = 1950,
end_year = 2023,
sex = c("Female"),
indicators = c("fx"), #indicators of interest
output_format = "csv", #format you want to save (csv or rdata)
output_file = "Nepal")
mort_both <-
UNWPP_data(country = "Nepal",
start_year = 1950,
end_year = 2023,
sex = c("Male", "Female"),
indicators = c("px"), #indicators of interest
output_format = "csv", #format you want to save (csv or rdata)
output_file = "Nepal")
pop_female <-
UNWPP_pop(country = "Nepal",
start_year = 1950,
end_year = 2023,
sex = "Female")
pop_male <-
UNWPP_pop(country = "Nepal",
start_year = 1950,
end_year = 2023,
sex = "Male")
# Join rate objects by sex
rates_female <-
left_join(fert_female, mort_both) %>%
filter(Sex == "Female")
rates_male <-
mort_both %>%
filter(Sex == "Male")Finally, you can save this pre-processed data to reuse it in your own project:
write.csv(rates_female, "data/npl_female.csv", row.names = F)
write.csv(rates_male, "data/npl_male.csv", row.names = F)
write.csv(pop_female, "data/npl_popfemale.csv", row.names = F)
write.csv(pop_male, "data/npl_popmale.csv", row.names = F)Now we have all the input data needed to run the 2-sex time-variant model (see “1. Setup” in Part 2). We can simply repeat the analysis in day 2 (starting from “2. Run kinship model”) and get the kin estimates for Nepal.
Start by reading the data into R:
#Load data for women
rates_female <- read.csv("data/npl_female.csv", stringsAsFactors = F) #female rates
pop_female <- read.csv("data/npl_popfemale.csv", stringsAsFactors = F) #female population (N)
# For data for men
rates_male <- read.csv("data/npl_male.csv", stringsAsFactors = F) # male rates
pop_male <- read.csv("data/npl_popmale.csv", stringsAsFactors = F) # male population (N)E1 On the first day, we asked you to download data for your country of choice. Using this data, run a time-variant two-sex model (using data from 1950-2020 as input) and answer:
E1.1 How many living grand-mothers and grand-fathers can a woman expect to have at age 15 in 1950 and in 2020? Extract a conclusion based on the results.
E1.2 Compare ‘kin sex ratios’ of grandparents, parents, daughters and siblings for the cohort 1950, at each age of Focal.
E2 For your country of choice, run a time-variant two-sex model (the same one as in E1) and a new time-invariant two-sex model, using 2020 rates.
E.2.1 Compare the average number of living kin for a 5-year-old woman in 2020 from both models (the time-variant and the time-invariant one).
E.2.2 Which kin are similar and which are different and why?
E3 DemoKin assumes a constant sex ratio at birth of 1/2.04 = 0.49. How can different sex ratios at borth influence kinship structures?
Load packages:
DemoKinSince Caswell’s (Caswell 2019) one-sex
time-invariant age-structured matrix model of kinship, the framework has
expanded to incorporate time-varying vital rates (2021), two-sex of kin and focal (2022) and multi-state kin populations (2020).
Here, we provide an R
functionkin_multi_stage_time_variant_2sex, which combines
the three aforementioned models and provide a two-sex time-varying
multi-state model.
This function computes stage-specific kinship networks across both sexes for an average member of a population (focal) under time-varying demographic rates. It estimates the number, age, sex, and stage distribution of Focal’s relatives, for each age of Focal’s life, and as a function of the year in which Focal is born.
Q1:Why is it important to study the education levels of both the Focal and their kin in the context of India?
Q2:What input data are required to estimate the number of kin by education using a multistate model in the Demokin package?
Q3:Where can you find the input data needed to incorporate education into a multistate model?
In this example we use education as an example stage. Indian data ranging from 2020 - 2090 is sourced from the Wittgenstein Center. The data is aggregated into 5-year age group and 5-year time interval.
In order to implement the model, the function
kin_multi_stage_time_variant_2sex expects the following 7
inputs of vital rates, fed in as lists:
U_list_females A list of female age-and-education
specific survival probabilities over the timescale (in matrix forms).
This input list has length = the timescale, and each entry represents
the rates of a specific period in matrix form: stage columns, age
rows.
U_list_males A list in the same format as
U_list_females, but for males.
F_list_females A list of female age-and-education
specific fertility rates over the timescale (in matrix forms). This
input list has length = the timescale, and each entry represents the
rates of a specific period in matrix form: stage columns, age
rows.
F_list_males A list in the same format as
F_list_females, but for males.
T_list_females A list of lists of female
age-specific probabilities of moving up education over the timescale (in
matrix forms). The outer list has length = the timescale. The inner list
has length = number of ages. Each outer list entry is comprised of a
list of matrices (stage*stage dimensional), each matrix describes
age-specific probabilities of moving stage. Thus for each year, we have
a list of age-specific probabilities of moving from one stage to the
next.
T_list_males A list in the same format as
T_list_females, but for males.
H_list A list of length = timescale, in which each
element is a matrix which assigns the offspring of individuals in some
stage to the appropriate age class (age in rows and states in columns).
This list of matrices applies to both male and female
offspring.
We pre-processed the data for this workshop:
Before running the model, let’s examine some trends in the data:
# Calculate and plot Total Fertility Rate by education level over time
tfr_data <- lapply(seq_along(F_mat_fem_edu), function(i) {
col_sums <- colSums(F_mat_fem_edu[[i]])
data.frame(
year = 2020 + (i - 1) * 5,
education = factor(colnames(F_mat_fem_edu[[i]]),
levels = c("e1", "e2", "e3", "e4", "e5", "e6"),
labels = c("no education", "incomplete primary",
"primary", "lower secondary",
"upper secondary", "post-secondary")),
tfr = col_sums
)
})
tfr_df <- do.call(rbind, tfr_data)
# Plot TFR trends
ggplot(tfr_df%>%filter(year<2095), aes(x = year, y = tfr, color = education, group = education)) +
geom_line(size = 1) +
geom_point() +
theme_bw() +theme(legend.position = "bottom")+
labs(title = "Total fertility rate by education over time",
x = "Year",
y = "TFR",
color = "Education"
)
Because expert-based assumptions were applied to fertility in 2050, the
projected trend shows a discontinuity, especially for lower education.
Details about assumption is given in https://pure.iiasa.ac.at/id/eprint/19487/
Let’s look at survival ratio
# Calculate and plot survival ratio for female by education level over time
surv_data <- lapply(seq_along(U_mat_fem_edu), function(i) {
#extract survival ratio for ith year
survdt <- U_mat_fem_edu[[i]]
data.frame(
year = 2020 + (i - 1) * 5,
age = as.numeric(rownames(survdt)),
education = rep(
factor(colnames(survdt),
levels = c("e1","e2","e3","e4","e5","e6"),
labels = c("no education","incomplete primary","primary",
"lower secondary","upper secondary","post-secondary")),
each = nrow(survdt)
),
survival = as.vector(survdt)
)
})
surv_df <- do.call(rbind, surv_data)
# Plot Survival ratio trends
ggplot(surv_df %>%
filter(age %in% c(0,15, 40,65)),
aes(x = year, y = survival, color = education, group = education)) +
geom_line(size = 1) +
geom_point() +
theme_bw() + theme(legend.position = "bottom")+
facet_wrap(~age) +
labs(title = "Survival ratio by education",
x = "Year",
y = "Survival ratio",
color = "Education"
)Because the COVID-19 pandemic reduced survival ratios in 2020, we observe pronounced recovery in 2025. Education differential in mortality is only applied from the age 15 in input data.
Now let’s plot education transition probabilities
Education transition probability= Probability that a person moves from one level of education to another (only higher level or only upward transition)level of education as they age.
# Education labels
edu_labels <- c("e1","e2","e3","e4","e5","e6")
edu_names <- c("no education","incomplete primary","primary",
"lower secondary","upper secondary","post-secondary")
trans_df <- lapply(seq_along(T_mat_fem_edu), function(i) {
year_val <- 2020 + (i - 1) * 5
lapply(seq_along(T_mat_fem_edu[[i]]), function(a) {
mat <- T_mat_fem_edu[[i]][[a]]
expand.grid(
to_stage = 1:6, # <-- rows
from_stage = 1:6 # <-- columns
) %>%
mutate(
transition_prob = as.vector(mat),
age = 5+ (a - 1) * 5,
year = year_val,
cohort=year-age,
from_stage = factor(from_stage, levels = 1:6, labels = edu_labels),
to_stage = factor(to_stage, levels = 1:6, labels = edu_labels),
from_name = factor(from_stage, labels = edu_names),
to_name = factor(to_stage, labels = edu_names)
)
}) %>% bind_rows()
}) %>% bind_rows()
# Education transition plot ages 10-15 and 15-20 and 20-25
ggplot(trans_df%>%filter(age%in%c(15,20,25) & cohort<2065),aes(x = year, y = transition_prob, fill = to_name))+
geom_area() +
facet_grid(age ~ from_name) +
labs(
title = "Education transition probabilities from origin to destination education across
age groups",
x = "Year",
y = "Transition probability",
fill = "Destination education"
) +
theme_bw()+theme(axis.text.x = element_text(angle = 90),legend.position = "bottom")
Over time, the probability of transitioning from lower to higher
education levels is expected to increase.
Assumptions of education transition probabilities from Wittgenstein Center:
1.Individuals younger than 10 are still in basic schooling and are not included in the model. Therefore, education transitions are considered only from age 10 onward.
2.People are assumed to complete their education by age 30, meaning no further transitions occur after that age.
3.Education transitions depend on age. For example, a person who is 20 years old and at the primary level will not move to lower secondary. However, if someone at the same age is in upper secondary, they will have a probability of moving to post-secondary education.
We feed the above inputs into the matrix model, along with other arguments:
this run takes some time (ca. 5 min)
# Run kinship model for a female Focal over a timescale of no_years
time_range <- seq(2020,2090,5)
no_years <- length(time_range)-1
# For 5-y age and year intervals
output_year_select <- seq(1, no_years+1, 1)
# since we use 5-year time frame, we are projecting from 2020 to 2020+5*14 = 2090
# We decide here to count accumulated kin by age of Focal, and not distributions of kin
kin_out_2020_2090 <-
kin_multi_stage_time_variant_2sex(U_list_females = U_mat_fem_edu[1:(1+no_years)],
U_list_males = U_mat_male_edu[1:(1+no_years)],
F_list_females = F_mat_fem_edu[1:(1+no_years)],
F_list_males = F_mat_male_edu[1:(1+no_years)],
T_list_females = T_mat_fem_edu[1:(1+no_years)],
T_list_males = T_mat_fem_edu[1:(1+no_years)],
H_list = H_mat_edu[1:(1+no_years)],
birth_female = 1/(1.06+1), ## Sex ratio at birth -- (1.06)
parity = FALSE,
summary_kin = TRUE,
sex_Focal = "Female", ## define Focal's sex at birth
initial_stage_Focal = 1, ## Define Focal's stage at birth
output_years = output_year_select # it seems tricky to set up output years when the time interval is not 1
# now I select every 10 year for output
)## 45.53 sec elapsed
Now we have to recode year and age related variables to show the correct years and educational labels:
# also need to change the age_focal and year variables accordingly
kin_out_2020_2090$kin_summary <-
kin_out_2020_2090$kin_summary %>%
mutate(year = (year-1)*5+min(time_range),
age_focal = age_focal*5,
cohort = year - age_focal,
stage_kin = factor(stage_kin, levels = c(1, 2, 3, 4, 5, 6),
labels = c(
"no education",
"incomplete primary",
"primary",
"lower secondary",
"upper secondary & above",
"upper secondary & above"
)))
kin_out_2020_2090$kin_full <- kin_out_2020_2090$kin_full%>%
mutate(year = (year-1)*5+min(time_range),
age_focal = age_focal*5,
age_kin = age_kin*5,
cohort = year - age_focal,
stage_kin = factor(stage_kin, levels = c(1, 2, 3, 4, 5, 6),
labels = c(
"no education",
"incomplete primary",
"primary",
"lower secondary",
"upper secondary & above",
"upper secondary & above"
)))Notice the structure of the output data kin_summary. We
have columns age_focal and kin_stage because
we sum over all ages of kin, and produce the marginal stage distribution
given age of Focal. We have a column corresponding to sex of kin
sex_kin, a column showing which year we are
considering, and a column headed group which selects the
kin type. Finally, we have columns showing Focal’s cohort of birth
cohort (e.g., year - age of Focal).
Let’s take a look at kin_summary by selecting 10 rows,
from row 1000 to row 1010.
## # A tibble: 11 × 8
## age_focal stage_kin count sex_kin year group cohort cohort_factor
## <dbl> <fct> <dbl> <chr> <dbl> <chr> <dbl> <fct>
## 1 100 incomplete primary 7.44e-1 Male 2020 ggd 1920 -19
## 2 100 primary 2.38e+0 Female 2020 ggd 1920 -19
## 3 100 primary 2.52e+0 Male 2020 ggd 1920 -19
## 4 100 lower secondary 5.45e-1 Female 2020 ggd 1920 -19
## 5 100 lower secondary 5.78e-1 Male 2020 ggd 1920 -19
## 6 100 upper secondary &… 3.79e-2 Female 2020 ggd 1920 -19
## 7 100 upper secondary &… 4.01e-2 Male 2020 ggd 1920 -19
## 8 100 upper secondary &… 2.91e-6 Female 2020 ggd 1920 -19
## 9 100 upper secondary &… 3.08e-6 Male 2020 ggd 1920 -19
## 10 0 no education 0 Female 2020 m 2020 1
## 11 0 no education 0 Male 2020 m 2020 1
We start by showing the number of living kin for an average woman born in 2020 in India by level of educational attainment of her kin:
kin_out_2020_2090$kin_summary %>%
# Multistate models also include an (educational) state for the Focal individual
# which is not of interest here
filter(group != "Focal") %>%
filter(cohort == 2020) %>%
rename(kin = group) %>%
rename_kin(sex = "2sex") %>%
summarise(count=sum(count), .by = c(stage_kin, age_focal)) %>%
ggplot(aes(x = age_focal, y = count, fill=stage_kin)) +
geom_area(colour = "black") +
labs(y = "Expected number of living kin",
x = "Age of a woman born in 2020") +
theme_bw() +
theme(legend.position = "bottom")Show the same values, but separately for selected kin:
kin_out_2020_2090$kin_summary %>%
filter(group %in% c("d", "gd", "m", "gm")) %>%
filter(cohort == 2020) %>%
rename(kin = group) %>%
rename_kin(sex = "2sex") %>%
summarise(count=sum(count), .by = c(kin_label, stage_kin, age_focal)) %>%
ggplot(aes(x = age_focal, y = count, fill=stage_kin)) +
geom_area(colour = "black") +
labs(y = "Expected number of living kin",
x = "Age of a woman born in 2020") +
facet_wrap(. ~ kin_label) +
theme_bw() +
theme(legend.position = "bottom")Next, we see the age distribution of kin (by education of kin) for a 60 year old woman born in the year 2020:
kin_out_2020_2090$kin_full %>%
dplyr::filter(
group %in% c("d", "gd", "m", "gm")
, age_focal == 60
, cohort == 2020
) %>%
rename(kin = group) %>%
rename_kin("2sex") %>%
ggplot2::ggplot(ggplot2::aes(x = age_kin, y = count, color = stage_kin, fill = stage_kin)) +
ggplot2::geom_bar(position = "stack", stat = "identity") +
ggplot2::scale_x_continuous(breaks = c(0,10,20,30,40,50,60,70,80,90,100)) +
labs(x ="Age of kin for a 60 yo woman born in 2020") +
ggplot2::facet_wrap(. ~ kin_label) +
ggplot2::theme_bw() +
ggplot2::theme(
axis.text.x = ggplot2::element_text(angle = 90, vjust = 0.5)
, legend.position = "bottom"
)Let’s see the expected number of living children and grandchildren for a 60 year old woman over time by the level of educational attainment of the children and grandchildren:
kin_out_2020_2090$kin_summary %>%
filter(
group %in% c("d", "gd")
, age_focal == 60
) %>%
rename(kin = group) %>%
rename_kin(sex = "2sex") %>%
summarise(count=sum(count), .by = c(stage_kin, kin_label, year)) %>%
ggplot(aes(x = year, y = count, fill=stage_kin)) +
geom_area(colour = "black") +
labs(y = "Expected number of descendants a for woman aged 60",
x = "Year") +
facet_grid( . ~ kin_label) +
theme_bw() +
theme(legend.position = "bottom")And, finally, the level of education of the parents and grandparents of a newborn woman over time:
kin_out_2020_2090$kin_summary %>%
filter(
group %in% c("m", "gm")
, age_focal == 0
) %>%
rename(kin = group) %>%
rename_kin(sex = "2sex") %>%
summarise(count=sum(count), .by = c(stage_kin, kin_label, year)) %>%
ggplot(aes(x = year, y = count, fill=stage_kin)) +
geom_area(colour = "black") +
labs(y = "Expected number of descendants a for newborn woman (aged 0)",
x = "Year") +
facet_grid( . ~ kin_label) +
theme_bw() +
theme(legend.position = "bottom")We note that there are following simplifying assumptions due to data limitations: 1. Fertility rates vary over time and by education level but are identical for both sexes (the androgynous approximation). 2. Age-specific education transition probabilities vary over time but not by sex (also an androgynous approximation). 3. Since no transition data exist before age 10, we assume that at ages 0-4, individuals transition from no education to incomplete primary in the next 5-year interval, and at ages 5-9, they transition from incomplete primary to primary (based on India’s compulsory education act 2009).Everyone above age 10 will thus have at least primary education. 4. Demographic rates remain stable (time-invariant) before 2020, the earliest observed data point. Therefore every cohort before 2020 experience the same education specific fertility, mortality and transition rates.
E1.1 For a newborn in 2020, how many grandmothers and grandfathers with at least a primary education are they expected to have ?
E1.2 How does this number change when focal grow older?
E1.3 what patterns did you observe? Also, compare this result with the Focal born in 2050.
E2.1 In which year can a woman born in 2020 expect that at least half of her children will reach at least a secondary level of education?
E2.2 What does the change in the education composition of children over the life course of women born in 2020 tell us about the demographic processes behind fertility timing and education progression?
E2.2.1 compare the number of children women born in 2020 had in 2050 and in 2060. what difference you observed? now compare the education composition of children in 2050 and 2060.
E3.1 Compare the educational composition of children in 2020, 2050, and 2080 for a Focal of age 65. What trends do you observe in the number of primary vs. upper secondary & above educated children over time?
E3.2 Plot the age and education distribution of the children and grandchildren of a Focal born in 1950, 2000 and 2050. Then look at how the age distribution of children with upper secondary & above education changes over time. What pattern did you observe?
Here, we show how to download fertility, mortality, and population data for other countries and years.
First, we will load the download_wpp24() function, which
automates the process of downloading these datasets. Note that this
process may take a couple of minutes.
These packages will be used:
Load functions to download and process:
Run the function to download data. In this case, we’re downloading data for Nepal, but you can chose any country you like. To select the country of interest, refer to the country names available here.
NOTE: This function will take a really long time to run since it downloads a bunch of large files, so make sure to have good and stable internet connection. The good thing is you only need to download the files once!
# Download the data
# Use include_projections = T to include data after 2023
download_wpp24(include_projections = T)Once the data is downloaded, we can filter the information of interest—for example, by country. To do this, we will load two functions:
UNWPP_data: This function filters mortality data
(px, qx, mx) and fertility data (fx) by the country of interest and the
desired time period.
UNWPP_pop: This function filters population size (N)
data.
Now, we read the data and reformat the data from the csv files we downloaded in the earlier step.
## Example output for DemoKin using female rates
# Select country, year and sex to obtain px and fx
fert_female <-
UNWPP_data(country = "Nepal",
start_year = 1950,
end_year = 2100,
sex = c("Female"),
indicators = c("fx"), #indicators of interest
output_format = "csv", #format you want to save (csv or rdata)
output_file = "Nepal")
mort_both <-
UNWPP_data(country = "Nepal",
start_year = 1950,
end_year = 2100,
sex = c("Male", "Female"),
indicators = c("px"), #indicators of interest
output_format = "csv", #format you want to save (csv or rdata)
output_file = "Nepal")
pop_female <-
UNWPP_pop(country = "Nepal",
start_year = 1950,
end_year = 2100,
sex = "Female")
pop_male <-
UNWPP_pop(country = "Nepal",
start_year = 1950,
end_year = 2100,
sex = "Male")
# Join rate objects by sex
rates_female <-
left_join(fert_female, mort_both) %>%
filter(Sex == "Female")
rates_male <-
mort_both %>%
filter(Sex == "Male")Finally, you can save this data to reuse it in your own project:
write.csv(rates_female, "data/npl_female.csv", row.names = F)
write.csv(rates_male, "data/npl_male.csv", row.names = F)
write.csv(pop_female, "data/npl_popfemale.csv", row.names = F)
write.csv(pop_male, "data/npl_popmale.csv", row.names = F)Now we have all the input data needed to run the 2-sex time-variant model (see “1. Setup” in Part 2). We can simply repeat the analysis in Part 2 (starting from “2. Run kinship model”) and get the kin estimates for Nepal.