Background
In order to estimate frequency-dependent selection (FDS) using the framework of selection gradient analysis, we have proposed a regression model that takes genotype similarity into account as a covariate (Sato et al. 2023 Evolution). The purpose of this vignette is to show how to use the proposed method with toy data. Throughout this vignette, we will analyze toy data with a single-locus or GWAS-style structure under either a split or continuous subpopulation structure (see ‘Usage’ below). The Rmarkdown source and input data for this vignette are available at the following Github repository: https://github.com/yassato/RegressionFDS
Before analyzing the toy data, let us review the proposed model. The proposed method is given by either a linear regression model: \[y_i = \beta_0 + \beta_1x_i + \frac{\beta_2}{N_k}\sum^{N_{k}}_{j=1}{x_ix_j} + e_i~~~~~(1)\] or a multiplicative regression model: \[y_i = \beta_0 + \beta_1x_i + \frac{\beta_2}{N_k}\sum^{N_{k}}_{j=1}{x_ix_j} + \frac{\beta_{12}x_i}{N_k}\sum^{N_{k}}_{j=1}{x_ix_j} + e_i~~~~~(2)\] In these equations \(y_i\) is the fitness component of individual \(i\); \(x_i\) is the allelic status of individual \(i\) at a given locus; \(\beta_0\) is the intercept; and \(\beta_1\) is the coefficient of directional selection on the locus. The term \((\sum^{N_{k}}_{j=1}{x_ix_j}) / N_k\) represents the mean genotype similarity between the focal individual \(i\) and its counterpart individuals \(j\). As described in Sato et al. (2022), a positive effect of genotype similarity (\(\beta_2\)) on fitness component represents positive frequency-dependent selection (FDS) on relative fitness, while a negative effect represents negative FDS. The the interaction term coefficient \(\beta_{12}\) represents asymmetric FDS on absolute fitness between two alleles.
Usage
Here, we will first examine the structure of the toy data and then analyze these data using the proposed regression model (Eq. 2). Four types of toy data are prepared for: (1) a single-locus example with split subpopulations; (2) a single-locus example with a continuous space; (3) a GWAS example with split subpopulations; and (4) a GWAS example with continuous subpopulations.
Single-locus examples
The data for the single-locus examples are as simple as a single data frame can store. They are saved in R Data style (.rds), but CSV and TSV formats are also possible. Using the toy data, we will estimate \(\beta_1\), \(\beta_2\), and \(\beta_{12}\) and depict the fitness functions as follows.
(1) split subpopulations
For each individual ID
, data from split subpopulations
should include the fitness component measure y
, morph-type
‘gi’ (A or a), and its affiliated subpopulation group
.
d1 = readRDS("simulation/output/toy1.rds")
head(d1)
## ID y gi group
## 1 i1307 22.67740 a 1
## 2 i1871 22.65558 A 1
## 3 i1017 22.57326 a 1
## 4 i1941 22.58306 a 1
## 5 i1124 22.67228 a 1
## 6 i1003 22.66084 a 1
Before calculating genotype similarity, we convert genotypes into digits. Assuming complete dominance of the A morph over the a morph under Mendelian inheritance, we convert the A and a morphs into +1 and -1, respectively.
d1$gi[d1$gi=="A"] = 1
d1$gi[d1$gi=="a"] = -1
d1$gi = as.numeric(d1$gi) # make gi numeric
head(d1)
## ID y gi group
## 1 i1307 22.67740 -1 1
## 2 i1871 22.65558 1 1
## 3 i1017 22.57326 -1 1
## 4 i1941 22.58306 -1 1
## 5 i1124 22.67228 -1 1
## 6 i1003 22.66084 -1 1
Then we calculate the mean allelic similarity at the focal locus. The following implements the calculation of the mean allelic similarity within a subpopulation based on \((\sum^{N_{k}}_{j=1}{x_ix_j}) / N_k = x_i(\sum^{N_{k}}_{j=1}{x_j}) / N_k = x_i\bar{x_j}\).
gigj = c()
for(i in 1:nrow(d1)) {
ds = d1[-i,]
ds = subset(ds,group==d1$group[i])
gigj = c(gigj,d1$gi[i]*mean(ds$gi))
}
d1 = data.frame(d1,gigj)
hist(d1$gigj,xlim=c(-1,1))
Figure 1. Histogram of the term \((\sum^{N_{k}}_{j=1}{x_ix_j}) / N_k\) ranging from -1 to +1.
To estimate \(\beta_1\), \(\beta_2\), and \(\beta_{12}\), we use a linear mixed model with the subpopulation ID considered a random effect, as follows.
res = lme4::lmer(y~gi*gigj+(1|group),data=d1)
summary(res)
## Linear mixed model fit by REML ['lmerMod']
## Formula: y ~ gi * gigj + (1 | group)
## Data: d1
##
## REML criterion at convergence: -2769.1
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.90894 -0.68002 -0.01201 0.65199 2.97945
##
## Random effects:
## Groups Name Variance Std.Dev.
## group (Intercept) 2.765e-05 0.005258
## Residual 2.570e-03 0.050696
## Number of obs: 900, groups: group, 90
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 22.6012359 0.0018257 12379.695
## gi 0.0003922 0.0017442 0.225
## gigj -0.2057721 0.0051378 -40.051
## gi:gigj 0.0096332 0.0053303 1.807
##
## Correlation of Fixed Effects:
## (Intr) gi gigj
## gi 0.077
## gigj 0.022 0.215
## gi:gigj 0.215 0.051 0.122
Based on \(\hat{\beta}_1\), \(\hat{\beta}_2\), and \(\hat{\beta}_{12}\), we finally visualize fitness functions of FDS. According to Appendix S2 Sato et al. (2022), the fitness function of the A morph and the a morph is given by \(y_\mathrm{A} = \beta_0 + \beta_1 + (\beta_{12}+\beta_2)(2f_\mathrm{A}-1)\) and \(y_\mathrm{a} = \beta_0 - \beta_1 + (\beta_{12}-\beta_2)(2f_\mathrm{A}-1)\), respectively. The mean fitness is given by \(\bar{y} = f_\mathrm{A}y_\mathrm{A} + (1-f_\mathrm{A})y_\mathrm{a}\), where \(f_A\) is the frequency of the A morph.
freq = c() # calculate allele frequency within a subpopulation
for(i in 1:nrow(d1)) {
ds = subset(d1,group==d1$group[i])
freq = c(freq,mean(ds$gi))
}
freq = (freq/2) + 0.5
d1 = data.frame(d1,freq)
library(ggplot2)
plt = function(b0,b1,b2,b12) {
f_star = 0.5-(b1/(2*b2)) # f_star: equilibrium frequency
p = ggplot(d1, aes(x=freq,y=y)) + geom_jitter(pch=d1$gi+2,colour="grey",width=0.05) + # jitter plots for data points
theme_classic() + ylab("Fitness") + xlab("phenotype-level frequency of A") + xlim(0,1) +
geom_function(aes(x=1,y=1),fun=function(x,b0,b1,b2,b12) { (b12+b2)*(2*x-1)+b0+b1 }, args=list(b0,b1,b2,b12)) + # fitness function for A morph
geom_function(aes(x=1,y=1),fun=function(x,b0,b1,b2,b12) { (b12-b2)*(2*x-1)+b0-b1 }, args=list(b0,b1,b2,b12), colour=grey(0.0,0.33)) + # fitness function for a morph
geom_function(aes(x=1,y=1),fun=function(x,b0,b1,b2,b12) { 2*b2*x*(2*x-1)+(b12-b2)*(2*x-1)+b0-b1+2*b1*x }, args=list(b0,b1,b2,b12), lty=2) + # mean fitness
geom_point(aes(x=f_star,y=(b12+b2)*(2*f_star-1)+b0+b1),pch=16,size=3)
return(p)
}
p = plt(b0=coef(summary(res))[1,1],
b1=coef(summary(res))[2,1],
b2=coef(summary(res))[3,1],
b12=coef(summary(res))[4,1])
p
Figure 2. Negative frequency-dependent selection at a single locus in split subpopulations. A black line and plus marks indicate the fitness of the A morph. A grey line and open circles indicate the fitness of the a morph. A dashed curve indicates the population-level mean fitness.
(2) continuous space
Instead of the subpopulation ID, data from continuous space should
include the spatial positions of individuals in a two dimensional space
along the x- and y-axes (X
and Y
). Both
continuous values (e.g., GPS locality) and relative distance (e.g.,
positions in a lattice space) can be used to record the values of
X
and Y
.
d2 = readRDS("simulation/output/toy2.rds")
head(d2)
## ID y gi X Y
## 1 i1498 22.57069 a 1 1
## 2 i212 22.71575 A 1 2
## 3 i430 22.66228 A 1 3
## 4 i631 22.61170 a 1 4
## 5 i1653 22.53516 a 1 5
## 6 i1341 22.63255 a 1 6
Similarly to the split subpopulation case, we convert genotypes into
digit in the case of continuous space. The genotype similarity in a
continuous space can be calculated using the nei_coval
function in the rNeighborGWAS
package. For detailed usage
of the nei_coval
function, see also “GWAS examples”
below.
d2$gi[d2$gi=="A"] = 1
d2$gi[d2$gi=="a"] = -1
d2$gi = as.numeric(d2$gi)
smap = cbind(d2$X,d2$Y)
geno = as.matrix(d2$gi)
gigj = rNeighborGWAS::nei_coval(geno,smap,scale=sqrt(2+0.01))
d2 = data.frame(d2,gigj)
where scale=sqrt(2+0.01)
corresponds to the nearest
neighbors (i.e.,\(N_k\)=8). Then, we
can use a standard regression model since there are no grouping
covariates in the case of a continuous space.
res = lm(y~gi*gigj,data=d2)
summary(res)
##
## Call:
## lm(formula = y ~ gi * gigj, data = d2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.153648 -0.035457 -0.001249 0.034773 0.180099
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.5994027 0.0018361 12308.292 <2e-16 ***
## gi 0.0007062 0.0018361 0.385 0.701
## gigj -0.1932443 0.0050256 -38.452 <2e-16 ***
## gi:gigj 0.0043318 0.0050256 0.862 0.389
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05346 on 896 degrees of freedom
## Multiple R-squared: 0.6364, Adjusted R-squared: 0.6352
## F-statistic: 522.8 on 3 and 896 DF, p-value: < 2.2e-16
As with split subpopulations, we can depict the fitness functions in response to the local frequency of the A morph.
freq = (d2$gi*d2$gigj/2) + 0.5
d2 = data.frame(d2,freq)
plt = function(b0,b1,b2,b12) {
f_star = 0.5-(b1/(2*b2)) # f_star: equilibrium frequency
p = ggplot(d2, aes(x=freq,y=y)) + geom_jitter(pch=d2$gi+2,colour="grey",width=0.05) + # jitter plots for data points
theme_classic() + ylab("Fitness") + xlab("phenotype-level local frequency of A") + xlim(0,1) +
geom_function(aes(x=1,y=1),fun=function(x,b0,b1,b2,b12) { (b12+b2)*(2*x-1)+b0+b1 }, args=list(b0,b1,b2,b12)) + # fitness function for A morph
geom_function(aes(x=1,y=1),fun=function(x,b0,b1,b2,b12) { (b12-b2)*(2*x-1)+b0-b1 }, args=list(b0,b1,b2,b12), colour=grey(0.0,0.33)) + # fitness function for a morph
geom_function(aes(x=1,y=1),fun=function(x,b0,b1,b2,b12) { 2*b2*x*(2*x-1)+(b12-b2)*(2*x-1)+b0-b1+2*b1*x }, args=list(b0,b1,b2,b12), lty=2) + # mean fitness
geom_point(aes(x=f_star,y=(b12+b2)*(2*f_star-1)+b0+b1),pch=16,size=3)
return(p)
}
p = plt(b0=coef(summary(res))[1,1],
b1=coef(summary(res))[2,1],
b2=coef(summary(res))[3,1],
b12=coef(summary(res))[4,1])
p
## Warning: Removed 3 rows containing missing values (geom_point).
Figure 3. Negative frequency-dependent selection at a single locus in a continuous space. A black line and plus marks indicate the fitness of the A morph. A grey line and open circles indicate the fitness of the a morph. A dashed curve indicates the population-level mean fitness.
GWAS examples
Genotype data in a variant call format (.vcf) can be imported using
the gaston
package. Its select.snps
function
can cut off SNPs with minor allele frequency (MAF). The
gaston2neiGWAS
in the rNeighborGWAS package converts the
genotypes in an additive manner (AA, Aa, and aa) into +1, 0, -1,
respectively. The rows and columns in g$geno
contain
individuals and loci, respectively.
g = gaston::read.vcf("simulation/output/1_NFDS.vcf.gz")
## ped stats and snps stats have been set.
## 'p' has been set.
## 'mu' and 'sigma' have been set.
g = gaston::select.snps(g,g@snps$maf>0.01)
g = rNeighborGWAS::gaston2neiGWAS(g)
print(g$geno[1:10,1:10])
## . . . . . . . . . .
## i0 -1 -1 -1 -1 -1 -1 0 -1 1 -1
## i1 -1 -1 -1 -1 -1 -1 1 -1 -1 -1
## i2 -1 -1 -1 -1 -1 -1 0 -1 0 -1
## i3 -1 -1 -1 -1 -1 -1 1 -1 0 -1
## i4 -1 -1 -1 -1 -1 -1 0 -1 0 -1
## i5 -1 -1 -1 -1 -1 -1 0 -1 0 -1
## i6 -1 -1 -1 -1 -1 -1 0 -1 0 -1
## i7 -1 -1 -1 -1 -1 -1 -1 -1 1 -1
## i8 -1 -1 -1 -1 -1 -1 1 -1 1 -1
## i9 -1 -1 -1 -1 -1 -1 1 -1 0 -1
To analyze FDS, we combine these genotypes with phenotype records as follows.
(3) split subpopulations
As with the single-locus example, phenotype data from split subpopulations should include the subpopulation ID ‘group’.
d3 = readRDS("simulation/output/toy3.rds")
head(d3)
## ID y group
## 1 i690 3.032313 1
## 2 i188 2.786912 1
## 3 i731 2.663417 1
## 4 i938 2.724622 1
## 5 i445 2.519885 1
## 6 i1642 2.595124 1
After applying the MAF cut-off to the genotype data, we extract the individuals and assume complete dominance of the A alleles over the a alleles.
geno = g$geno[d3$ID,] # select the same individuals as phenotype files
geno[geno==0] = 1 # skipping this line encodes genotypes in an additive way
We then calculate the genotype similarity using the
nei_coval
function in the rNeighborGWAS
package. Note that the scale=10^6
and grouping
options allow us to assign individuals to subpopulations and calculate
the genotype similarity within subpopulations.
smap = cbind(runif(nrow(d3),0,1),runif(nrow(d3),0,1)) # random dummy map in a small space
g_nei = rNeighborGWAS::nei_coval(geno, # matrix of individual genotypes
smap, # spatial map including positions at x- and y-axis (dummy)
scale=10^6, # scale=LARGE VALUE calculates the genotype similarity all over a subpopulation
grouping=d3$group) # grouping splits data into subpopulations
Association tests are performed using the nei_lmm
function in the rNeighborGWAS
package. The
asym=TRUE
option is selected to test the asymmetric effects
\(\beta_{12}\) as well as \(\beta_2\) and \(\beta_1\).
res = rNeighborGWAS::nei_lmm(geno, # matrix of individual genotypes
g_nei, # matrix of genotype similarities
d3$y, # phenotype vector
addcovar=model.matrix(~d3$group), # covariate can be given by a matrix
asym=TRUE) # asym=TRUE tests beta_12 as well
## PVE_self = 0.92217485125694
## PVE_nei = 0.0184920826388306
head(res)
## beta_self beta_nei beta_sxn p_self p_nei p_sxn
## 1 0.046881625 0.10912691 0.14889194 0.1765667 0.8403981 0.06597602
## 2 -0.005213942 0.01665381 -0.01913179 0.4707472 0.4670357 0.62055628
## 3 0.067605699 0.10268371 0.25744636 0.2799402 0.1037798 0.46680893
## 4 -0.185982431 -0.20429596 -0.21744986 0.9014824 0.9233724 0.21839482
## 5 -0.935841546 -0.95108108 -1.06530626 0.8001786 0.4598263 0.03108385
## 6 0.046881625 0.10912691 0.14889194 0.1765667 0.8403981 0.06597602
Manhattan plots are depicted using the gaston
package.
x = data.frame(g$gmap,res$p_nei)
colnames(x) = c("chr","pos","p")
gaston::manhattan(x,las=1,thinning=FALSE)
abline(h=-log10(0.05/nrow(g_nei)),lty=2,col="grey") # Bonferroni correction
Figure 4. Manhattan plot showing the association score of -log10(p) against genomic positions in the case of split subpopulations. A horizontal dashed line indicate the genome-wide Bonferroni threshold at p < 0.05.
(4) continuous space
As with the single-locus example, phenotype data from continuous
space should include the spatial positions along the x- and y-axes
(X
and Y
).
d4 = readRDS("simulation/output/toy4.rds")
head(d4)
## ID y X Y
## 1 i382 2.634093 1 1
## 2 i365 2.461118 1 2
## 3 i1757 3.226431 1 3
## 4 i1509 2.654648 1 4
## 5 i1685 2.900422 1 5
## 6 i1517 2.627776 1 6
As in the case of split subpopulations, we extract the individuals and assume complete dominance of the A alleles over the a alleles.
geno = g$geno[d4$ID,] # select the same individuals as phenotype files
geno[geno==0] = 1 # skipping this line encodes genotypes in an additive way
For a continuous space, we calculate the genotype similarity using
the nei_coval
function in the rNeighborGWAS
package. The option scale=sqrt(2+0.01)
corresponds to the
nearest neighbors (i.e.,\(N_k\)=8).
smap = cbind(d4$X,d4$Y)
g_nei = rNeighborGWAS::nei_coval(geno, # matrix of individual genotypes
smap, # spatial map including positions at x- and y-axis
scale=sqrt(2)+0.01) # scale=sqrt(2)+0.01 refers to the nearest neighbors
As in the case of split subpopulations, association tests are
performed using the nei_lmm
function in the
rNeighborGWAS
package. The asym=TRUE
option is
selected to test the asymmetric effects \(\beta_{12}\) as well as \(\beta_2\) and \(\beta_1\).
res = rNeighborGWAS::nei_lmm(geno, # matrix of individual genotypes
g_nei, # matrix of genotype similarities
d4$y, # phenotype vector
asym=TRUE) # asym=TRUE tests beta_12 as well
## PVE_self = 0.741886006467102
## PVE_nei = 0.178102451197045
## EM step failed to improve likelihood (this should not happen)
head(res)
## beta_self beta_nei beta_sxn p_self p_nei p_sxn
## 1 0.135293942 0.13333649 0.137079051 0.5212056 0.30820716 0.03204436
## 2 0.001689895 -0.03905978 -0.007133374 0.5338668 0.29617026 0.85914127
## 3 -0.459117295 -0.46484055 -0.315251047 0.5568374 0.05413914 0.29810376
## 4 0.098067450 0.11714256 0.012754219 0.6274098 0.23835098 0.97353821
## 5 0.064310110 0.04507351 -0.011163624 0.5561146 0.54849599 0.96508883
## 6 0.135293942 0.13333649 0.137079051 0.5212056 0.30820716 0.03204436
Manhattan plots are depicted using the gaston
package.
x = data.frame(g$gmap,res$p_nei)
colnames(x) = c("chr","pos","p")
gaston::manhattan(x,las=1,thinning=FALSE)
abline(h=-log10(0.05/nrow(g_nei)),lty=2,col="grey")
Figure 5. Manhattan plot showing the association score of -log10(p) against genomic positions in the case of continuous space. A horizontal dashed line indicate the genome-wide Bonferroni threshold at p < 0.05.
Notes
When the interaction term is significant:
If \(\beta_{12}\) is significant, the estimates of \(\beta_1\) and \(\beta_2\) may not be correct in Equation (2). Practically, we should first test \(\beta_{12}\) using the multiplicative model Equation (2). If \(\beta_{12}\) is significant in Equation (2), we then estimate \(\beta_2\) using the linear model Equation (1). Then \(\hat{\beta}_{12}\) from Equation (2), \(\hat{\beta}_1\) and \(\hat{\beta}_2\) from Equation (1) may be used to depict the fitness functions.
How to analyze multi-year/site data on continous space:
We can jointly use the scale
and grouping
option in the nei_coval
function in the
rNeighborGWAS
package. If data on a continuous space are
collected from multiple years or sites, individuals in different years
or sites should be separated using the grouping
option. The
years or sites should be made numeric and assigned to the
grouping
option. Then, the neighbor space to be referred
should be specified using the scale
option. In addition to
these options, the addcover
option in the
nei_lmm
function allows us to consider fixed effects of
years or sites when performing association tests. See the vignette of
the rNeighborGWAS package for this usage at https://cran.r-project.org/web/packages/rNeighborGWAS/vignettes/rNeighborGWAS.html.
How to determine the spatial scale in a continuous space:
This is not easy to determine but feasible by calculating the proportion of fitness variation explained (PVE) by the polygenic effects from the genotype similarity. If PVE peaks at a particular spatial scale, this scale would be optimal. See also Sato et al. (2021) Heredity for details. The vignette of the rNeighborGWAS package (https://cran.r-project.org/web/packages/rNeighborGWAS/vignettes/rNeighborGWAS.html) also provides more information on this topic.
Citation
When you need to cite this instruction, please cite the following publication: Sato Y, Takahashi Y, Xu C, Shimizu KK. (2023) Detecting frequency-dependent selection through the effects of genotype similarity on fitness components. Evolution. https://doi.org/10.1093/evolut/qpad028. This instruction file was developed as a part of the supplementary materials of Sato et al., (2023) Evolution during its peer-review.
References
- Bates D, Maechler M, Bolker B, Walker S. (2015) Fitting Linear
Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1),
1-48. doi:10.18637/jss.v067.i01.
- Perdry H, Dandine-Roulland C. (2020) gaston: Genetic Data Handling
(QC, GRM, LD, PCA) & Linear Mixed Models. R package version 1.5.7.
https://CRAN.R-project.org/package=gaston
- Sato Y, Yamamoto E, Shimizu KK, Nagano AJ. (2021) Neighbor GWAS:
incorporating neighbor genotypic identity into genome-wide association
studies of field herbivory. Heredity 126(4):597-614. https://doi.org/10.1038/s41437-020-00401-w
- Sato Y, Takahashi Y, Xu C, Shimizu KK. (2023) Detecting frequency-dependent selection through the effects of genotype similarity on fitness components. Evolution. https://doi.org/10.1093/evolut/qpad028
- Wickham H. (2016) ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.