Vignette: Regression model of FDS

Background

In order to estimate frequency-dependent selection (FDS) using the framework of selection gradient analysis, we have proposed a regression model that takes genotype similarity into account as a covariate (Sato et al. 2023 Evolution). The purpose of this vignette is to show how to use the proposed method with toy data. Throughout this vignette, we will analyze toy data with a single-locus or GWAS-style structure under either a split or continuous subpopulation structure (see ‘Usage’ below). The Rmarkdown source and input data for this vignette are available at the following Github repository: https://github.com/yassato/RegressionFDS

Before analyzing the toy data, let us review the proposed model. The proposed method is given by either a linear regression model: \[y_i = \beta_0 + \beta_1x_i + \frac{\beta_2}{N_k}\sum^{N_{k}}_{j=1}{x_ix_j} + e_i~~~~~(1)\] or a multiplicative regression model: \[y_i = \beta_0 + \beta_1x_i + \frac{\beta_2}{N_k}\sum^{N_{k}}_{j=1}{x_ix_j} + \frac{\beta_{12}x_i}{N_k}\sum^{N_{k}}_{j=1}{x_ix_j} + e_i~~~~~(2)\] In these equations \(y_i\) is the fitness component of individual \(i\); \(x_i\) is the allelic status of individual \(i\) at a given locus; \(\beta_0\) is the intercept; and \(\beta_1\) is the coefficient of directional selection on the locus. The term \((\sum^{N_{k}}_{j=1}{x_ix_j}) / N_k\) represents the mean genotype similarity between the focal individual \(i\) and its counterpart individuals \(j\). As described in Sato et al. (2022), a positive effect of genotype similarity (\(\beta_2\)) on fitness component represents positive frequency-dependent selection (FDS) on relative fitness, while a negative effect represents negative FDS. The the interaction term coefficient \(\beta_{12}\) represents asymmetric FDS on absolute fitness between two alleles.

Usage

Here, we will first examine the structure of the toy data and then analyze these data using the proposed regression model (Eq. 2). Four types of toy data are prepared for: (1) a single-locus example with split subpopulations; (2) a single-locus example with a continuous space; (3) a GWAS example with split subpopulations; and (4) a GWAS example with continuous subpopulations.

Single-locus examples

The data for the single-locus examples are as simple as a single data frame can store. They are saved in R Data style (.rds), but CSV and TSV formats are also possible. Using the toy data, we will estimate \(\beta_1\), \(\beta_2\), and \(\beta_{12}\) and depict the fitness functions as follows.

(1) split subpopulations

For each individual ID, data from split subpopulations should include the fitness component measure y, morph-type ‘gi’ (A or a), and its affiliated subpopulation group.

d1 = readRDS("simulation/output/toy1.rds")
head(d1)
##      ID        y gi group
## 1 i1307 22.67740  a     1
## 2 i1871 22.65558  A     1
## 3 i1017 22.57326  a     1
## 4 i1941 22.58306  a     1
## 5 i1124 22.67228  a     1
## 6 i1003 22.66084  a     1

Before calculating genotype similarity, we convert genotypes into digits. Assuming complete dominance of the A morph over the a morph under Mendelian inheritance, we convert the A and a morphs into +1 and -1, respectively.

d1$gi[d1$gi=="A"] = 1
d1$gi[d1$gi=="a"] = -1
d1$gi = as.numeric(d1$gi) # make gi numeric 
head(d1)
##      ID        y gi group
## 1 i1307 22.67740 -1     1
## 2 i1871 22.65558  1     1
## 3 i1017 22.57326 -1     1
## 4 i1941 22.58306 -1     1
## 5 i1124 22.67228 -1     1
## 6 i1003 22.66084 -1     1

Then we calculate the mean allelic similarity at the focal locus. The following implements the calculation of the mean allelic similarity within a subpopulation based on \((\sum^{N_{k}}_{j=1}{x_ix_j}) / N_k = x_i(\sum^{N_{k}}_{j=1}{x_j}) / N_k = x_i\bar{x_j}\).

gigj = c()
for(i in 1:nrow(d1)) {
  ds = d1[-i,]
  ds = subset(ds,group==d1$group[i])
  gigj = c(gigj,d1$gi[i]*mean(ds$gi))
}

d1 = data.frame(d1,gigj)
hist(d1$gigj,xlim=c(-1,1))

Figure 1. Histogram of the term \((\sum^{N_{k}}_{j=1}{x_ix_j}) / N_k\) ranging from -1 to +1.

To estimate \(\beta_1\), \(\beta_2\), and \(\beta_{12}\), we use a linear mixed model with the subpopulation ID considered a random effect, as follows.

res = lme4::lmer(y~gi*gigj+(1|group),data=d1)
summary(res)
## Linear mixed model fit by REML ['lmerMod']
## Formula: y ~ gi * gigj + (1 | group)
##    Data: d1
## 
## REML criterion at convergence: -2769.1
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.90894 -0.68002 -0.01201  0.65199  2.97945 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev.
##  group    (Intercept) 2.765e-05 0.005258
##  Residual             2.570e-03 0.050696
## Number of obs: 900, groups:  group, 90
## 
## Fixed effects:
##               Estimate Std. Error   t value
## (Intercept) 22.6012359  0.0018257 12379.695
## gi           0.0003922  0.0017442     0.225
## gigj        -0.2057721  0.0051378   -40.051
## gi:gigj      0.0096332  0.0053303     1.807
## 
## Correlation of Fixed Effects:
##         (Intr) gi    gigj 
## gi      0.077             
## gigj    0.022  0.215      
## gi:gigj 0.215  0.051 0.122

Based on \(\hat{\beta}_1\), \(\hat{\beta}_2\), and \(\hat{\beta}_{12}\), we finally visualize fitness functions of FDS. According to Appendix S2 Sato et al. (2022), the fitness function of the A morph and the a morph is given by \(y_\mathrm{A} = \beta_0 + \beta_1 + (\beta_{12}+\beta_2)(2f_\mathrm{A}-1)\) and \(y_\mathrm{a} = \beta_0 - \beta_1 + (\beta_{12}-\beta_2)(2f_\mathrm{A}-1)\), respectively. The mean fitness is given by \(\bar{y} = f_\mathrm{A}y_\mathrm{A} + (1-f_\mathrm{A})y_\mathrm{a}\), where \(f_A\) is the frequency of the A morph.

freq = c() # calculate allele frequency within a subpopulation
for(i in 1:nrow(d1)) {
  ds = subset(d1,group==d1$group[i])
  freq = c(freq,mean(ds$gi))
}
freq = (freq/2) + 0.5
d1 = data.frame(d1,freq)

library(ggplot2)
plt = function(b0,b1,b2,b12) {
  f_star = 0.5-(b1/(2*b2)) # f_star: equilibrium frequency
  p = ggplot(d1, aes(x=freq,y=y)) + geom_jitter(pch=d1$gi+2,colour="grey",width=0.05) + # jitter plots for data points
    theme_classic() + ylab("Fitness") + xlab("phenotype-level frequency of A") + xlim(0,1) +
    geom_function(aes(x=1,y=1),fun=function(x,b0,b1,b2,b12) { (b12+b2)*(2*x-1)+b0+b1 }, args=list(b0,b1,b2,b12)) + # fitness function for A morph
    geom_function(aes(x=1,y=1),fun=function(x,b0,b1,b2,b12) { (b12-b2)*(2*x-1)+b0-b1 }, args=list(b0,b1,b2,b12), colour=grey(0.0,0.33)) + # fitness function for a morph
    geom_function(aes(x=1,y=1),fun=function(x,b0,b1,b2,b12) { 2*b2*x*(2*x-1)+(b12-b2)*(2*x-1)+b0-b1+2*b1*x }, args=list(b0,b1,b2,b12), lty=2) + # mean fitness
    geom_point(aes(x=f_star,y=(b12+b2)*(2*f_star-1)+b0+b1),pch=16,size=3)
  
  return(p)
}

p = plt(b0=coef(summary(res))[1,1], 
        b1=coef(summary(res))[2,1],
        b2=coef(summary(res))[3,1],
        b12=coef(summary(res))[4,1])
p

Figure 2. Negative frequency-dependent selection at a single locus in split subpopulations. A black line and plus marks indicate the fitness of the A morph. A grey line and open circles indicate the fitness of the a morph. A dashed curve indicates the population-level mean fitness.

(2) continuous space

Instead of the subpopulation ID, data from continuous space should include the spatial positions of individuals in a two dimensional space along the x- and y-axes (X and Y). Both continuous values (e.g., GPS locality) and relative distance (e.g., positions in a lattice space) can be used to record the values of X and Y.

d2 = readRDS("simulation/output/toy2.rds")
head(d2)
##      ID        y gi X Y
## 1 i1498 22.57069  a 1 1
## 2  i212 22.71575  A 1 2
## 3  i430 22.66228  A 1 3
## 4  i631 22.61170  a 1 4
## 5 i1653 22.53516  a 1 5
## 6 i1341 22.63255  a 1 6

Similarly to the split subpopulation case, we convert genotypes into digit in the case of continuous space. The genotype similarity in a continuous space can be calculated using the nei_coval function in the rNeighborGWAS package. For detailed usage of the nei_coval function, see also “GWAS examples” below.

d2$gi[d2$gi=="A"] = 1
d2$gi[d2$gi=="a"] = -1
d2$gi = as.numeric(d2$gi)

smap = cbind(d2$X,d2$Y)
geno = as.matrix(d2$gi)
gigj = rNeighborGWAS::nei_coval(geno,smap,scale=sqrt(2+0.01))
d2 = data.frame(d2,gigj)

where scale=sqrt(2+0.01) corresponds to the nearest neighbors (i.e.,\(N_k\)=8). Then, we can use a standard regression model since there are no grouping covariates in the case of a continuous space.

res = lm(y~gi*gigj,data=d2)
summary(res)
## 
## Call:
## lm(formula = y ~ gi * gigj, data = d2)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.153648 -0.035457 -0.001249  0.034773  0.180099 
## 
## Coefficients:
##               Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 22.5994027  0.0018361 12308.292   <2e-16 ***
## gi           0.0007062  0.0018361     0.385    0.701    
## gigj        -0.1932443  0.0050256   -38.452   <2e-16 ***
## gi:gigj      0.0043318  0.0050256     0.862    0.389    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05346 on 896 degrees of freedom
## Multiple R-squared:  0.6364, Adjusted R-squared:  0.6352 
## F-statistic: 522.8 on 3 and 896 DF,  p-value: < 2.2e-16

As with split subpopulations, we can depict the fitness functions in response to the local frequency of the A morph.

freq = (d2$gi*d2$gigj/2) + 0.5
d2 = data.frame(d2,freq)

plt = function(b0,b1,b2,b12) {
  f_star = 0.5-(b1/(2*b2)) # f_star: equilibrium frequency
  p = ggplot(d2, aes(x=freq,y=y)) + geom_jitter(pch=d2$gi+2,colour="grey",width=0.05) + # jitter plots for data points
    theme_classic() + ylab("Fitness") + xlab("phenotype-level local frequency of A") + xlim(0,1) + 
    geom_function(aes(x=1,y=1),fun=function(x,b0,b1,b2,b12) { (b12+b2)*(2*x-1)+b0+b1 }, args=list(b0,b1,b2,b12)) + # fitness function for A morph
    geom_function(aes(x=1,y=1),fun=function(x,b0,b1,b2,b12) { (b12-b2)*(2*x-1)+b0-b1 }, args=list(b0,b1,b2,b12), colour=grey(0.0,0.33)) + # fitness function for a morph
    geom_function(aes(x=1,y=1),fun=function(x,b0,b1,b2,b12) { 2*b2*x*(2*x-1)+(b12-b2)*(2*x-1)+b0-b1+2*b1*x }, args=list(b0,b1,b2,b12), lty=2) + # mean fitness
    geom_point(aes(x=f_star,y=(b12+b2)*(2*f_star-1)+b0+b1),pch=16,size=3)
  
  return(p)
}

p = plt(b0=coef(summary(res))[1,1], 
        b1=coef(summary(res))[2,1],
        b2=coef(summary(res))[3,1],
        b12=coef(summary(res))[4,1])
p
## Warning: Removed 3 rows containing missing values (geom_point).

Figure 3. Negative frequency-dependent selection at a single locus in a continuous space. A black line and plus marks indicate the fitness of the A morph. A grey line and open circles indicate the fitness of the a morph. A dashed curve indicates the population-level mean fitness.

GWAS examples

Genotype data in a variant call format (.vcf) can be imported using the gaston package. Its select.snps function can cut off SNPs with minor allele frequency (MAF). The gaston2neiGWAS in the rNeighborGWAS package converts the genotypes in an additive manner (AA, Aa, and aa) into +1, 0, -1, respectively. The rows and columns in g$geno contain individuals and loci, respectively.

g = gaston::read.vcf("simulation/output/1_NFDS.vcf.gz")
## ped stats and snps stats have been set. 
## 'p' has been set. 
## 'mu' and 'sigma' have been set.
g = gaston::select.snps(g,g@snps$maf>0.01)
g = rNeighborGWAS::gaston2neiGWAS(g)
print(g$geno[1:10,1:10])
##     .  .  .  .  .  .  .  .  .  .
## i0 -1 -1 -1 -1 -1 -1  0 -1  1 -1
## i1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1
## i2 -1 -1 -1 -1 -1 -1  0 -1  0 -1
## i3 -1 -1 -1 -1 -1 -1  1 -1  0 -1
## i4 -1 -1 -1 -1 -1 -1  0 -1  0 -1
## i5 -1 -1 -1 -1 -1 -1  0 -1  0 -1
## i6 -1 -1 -1 -1 -1 -1  0 -1  0 -1
## i7 -1 -1 -1 -1 -1 -1 -1 -1  1 -1
## i8 -1 -1 -1 -1 -1 -1  1 -1  1 -1
## i9 -1 -1 -1 -1 -1 -1  1 -1  0 -1

To analyze FDS, we combine these genotypes with phenotype records as follows.

(3) split subpopulations

As with the single-locus example, phenotype data from split subpopulations should include the subpopulation ID ‘group’.

d3 = readRDS("simulation/output/toy3.rds")
head(d3)
##      ID        y group
## 1  i690 3.032313     1
## 2  i188 2.786912     1
## 3  i731 2.663417     1
## 4  i938 2.724622     1
## 5  i445 2.519885     1
## 6 i1642 2.595124     1

After applying the MAF cut-off to the genotype data, we extract the individuals and assume complete dominance of the A alleles over the a alleles.

geno = g$geno[d3$ID,] # select the same individuals as phenotype files
geno[geno==0] = 1 # skipping this line encodes genotypes in an additive way

We then calculate the genotype similarity using the nei_coval function in the rNeighborGWAS package. Note that the scale=10^6 and grouping options allow us to assign individuals to subpopulations and calculate the genotype similarity within subpopulations.

smap = cbind(runif(nrow(d3),0,1),runif(nrow(d3),0,1)) # random dummy map in a small space
g_nei = rNeighborGWAS::nei_coval(geno, # matrix of individual genotypes
                                 smap, # spatial map including positions at x- and y-axis (dummy)
                                 scale=10^6, # scale=LARGE VALUE calculates the genotype similarity all over a subpopulation
                                 grouping=d3$group) # grouping splits data into subpopulations

Association tests are performed using the nei_lmm function in the rNeighborGWAS package. The asym=TRUE option is selected to test the asymmetric effects \(\beta_{12}\) as well as \(\beta_2\) and \(\beta_1\).

res = rNeighborGWAS::nei_lmm(geno, # matrix of individual genotypes
                             g_nei, # matrix of genotype similarities
                             d3$y, # phenotype vector
                             addcovar=model.matrix(~d3$group), # covariate can be given by a matrix
                             asym=TRUE) # asym=TRUE tests beta_12 as well
## PVE_self = 0.92217485125694
## PVE_nei = 0.0184920826388306
head(res)
##      beta_self    beta_nei    beta_sxn    p_self     p_nei      p_sxn
## 1  0.046881625  0.10912691  0.14889194 0.1765667 0.8403981 0.06597602
## 2 -0.005213942  0.01665381 -0.01913179 0.4707472 0.4670357 0.62055628
## 3  0.067605699  0.10268371  0.25744636 0.2799402 0.1037798 0.46680893
## 4 -0.185982431 -0.20429596 -0.21744986 0.9014824 0.9233724 0.21839482
## 5 -0.935841546 -0.95108108 -1.06530626 0.8001786 0.4598263 0.03108385
## 6  0.046881625  0.10912691  0.14889194 0.1765667 0.8403981 0.06597602

Manhattan plots are depicted using the gaston package.

x = data.frame(g$gmap,res$p_nei)
colnames(x) = c("chr","pos","p")
gaston::manhattan(x,las=1,thinning=FALSE)
abline(h=-log10(0.05/nrow(g_nei)),lty=2,col="grey") # Bonferroni correction

Figure 4. Manhattan plot showing the association score of -log10(p) against genomic positions in the case of split subpopulations. A horizontal dashed line indicate the genome-wide Bonferroni threshold at p < 0.05.

(4) continuous space

As with the single-locus example, phenotype data from continuous space should include the spatial positions along the x- and y-axes (X and Y).

d4 = readRDS("simulation/output/toy4.rds")
head(d4)
##      ID        y X Y
## 1  i382 2.634093 1 1
## 2  i365 2.461118 1 2
## 3 i1757 3.226431 1 3
## 4 i1509 2.654648 1 4
## 5 i1685 2.900422 1 5
## 6 i1517 2.627776 1 6

As in the case of split subpopulations, we extract the individuals and assume complete dominance of the A alleles over the a alleles.

geno = g$geno[d4$ID,] # select the same individuals as phenotype files
geno[geno==0] = 1 # skipping this line encodes genotypes in an additive way

For a continuous space, we calculate the genotype similarity using the nei_coval function in the rNeighborGWAS package. The option scale=sqrt(2+0.01) corresponds to the nearest neighbors (i.e.,\(N_k\)=8).

smap = cbind(d4$X,d4$Y)
g_nei = rNeighborGWAS::nei_coval(geno, # matrix of individual genotypes
                                 smap, # spatial map including positions at x- and y-axis
                                 scale=sqrt(2)+0.01) # scale=sqrt(2)+0.01 refers to the nearest neighbors

As in the case of split subpopulations, association tests are performed using the nei_lmm function in the rNeighborGWAS package. The asym=TRUE option is selected to test the asymmetric effects \(\beta_{12}\) as well as \(\beta_2\) and \(\beta_1\).

res = rNeighborGWAS::nei_lmm(geno, # matrix of individual genotypes
                             g_nei, # matrix of genotype similarities
                             d4$y, # phenotype vector
                             asym=TRUE) # asym=TRUE tests beta_12 as well
## PVE_self = 0.741886006467102
## PVE_nei = 0.178102451197045
## EM step failed to improve likelihood (this should not happen)
head(res)
##      beta_self    beta_nei     beta_sxn    p_self      p_nei      p_sxn
## 1  0.135293942  0.13333649  0.137079051 0.5212056 0.30820716 0.03204436
## 2  0.001689895 -0.03905978 -0.007133374 0.5338668 0.29617026 0.85914127
## 3 -0.459117295 -0.46484055 -0.315251047 0.5568374 0.05413914 0.29810376
## 4  0.098067450  0.11714256  0.012754219 0.6274098 0.23835098 0.97353821
## 5  0.064310110  0.04507351 -0.011163624 0.5561146 0.54849599 0.96508883
## 6  0.135293942  0.13333649  0.137079051 0.5212056 0.30820716 0.03204436

Manhattan plots are depicted using the gaston package.

x = data.frame(g$gmap,res$p_nei)
colnames(x) = c("chr","pos","p")
gaston::manhattan(x,las=1,thinning=FALSE)
abline(h=-log10(0.05/nrow(g_nei)),lty=2,col="grey")

Figure 5. Manhattan plot showing the association score of -log10(p) against genomic positions in the case of continuous space. A horizontal dashed line indicate the genome-wide Bonferroni threshold at p < 0.05.

Notes

When the interaction term is significant:

If \(\beta_{12}\) is significant, the estimates of \(\beta_1\) and \(\beta_2\) may not be correct in Equation (2). Practically, we should first test \(\beta_{12}\) using the multiplicative model Equation (2). If \(\beta_{12}\) is significant in Equation (2), we then estimate \(\beta_2\) using the linear model Equation (1). Then \(\hat{\beta}_{12}\) from Equation (2), \(\hat{\beta}_1\) and \(\hat{\beta}_2\) from Equation (1) may be used to depict the fitness functions.

How to analyze multi-year/site data on continous space:

We can jointly use the scale and grouping option in the nei_coval function in the rNeighborGWAS package. If data on a continuous space are collected from multiple years or sites, individuals in different years or sites should be separated using the grouping option. The years or sites should be made numeric and assigned to the grouping option. Then, the neighbor space to be referred should be specified using the scale option. In addition to these options, the addcover option in the nei_lmm function allows us to consider fixed effects of years or sites when performing association tests. See the vignette of the rNeighborGWAS package for this usage at https://cran.r-project.org/web/packages/rNeighborGWAS/vignettes/rNeighborGWAS.html.

How to determine the spatial scale in a continuous space:

This is not easy to determine but feasible by calculating the proportion of fitness variation explained (PVE) by the polygenic effects from the genotype similarity. If PVE peaks at a particular spatial scale, this scale would be optimal. See also Sato et al. (2021) Heredity for details. The vignette of the rNeighborGWAS package (https://cran.r-project.org/web/packages/rNeighborGWAS/vignettes/rNeighborGWAS.html) also provides more information on this topic.

Citation

When you need to cite this instruction, please cite the following publication: Sato Y, Takahashi Y, Xu C, Shimizu KK. (2023) Detecting frequency-dependent selection through the effects of genotype similarity on fitness components. Evolution. https://doi.org/10.1093/evolut/qpad028. This instruction file was developed as a part of the supplementary materials of Sato et al., (2023) Evolution during its peer-review.

References

  1. Bates D, Maechler M, Bolker B, Walker S. (2015) Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1), 1-48. doi:10.18637/jss.v067.i01.
  2. Perdry H, Dandine-Roulland C. (2020) gaston: Genetic Data Handling (QC, GRM, LD, PCA) & Linear Mixed Models. R package version 1.5.7. https://CRAN.R-project.org/package=gaston
  3. Sato Y, Yamamoto E, Shimizu KK, Nagano AJ. (2021) Neighbor GWAS: incorporating neighbor genotypic identity into genome-wide association studies of field herbivory. Heredity 126(4):597-614. https://doi.org/10.1038/s41437-020-00401-w
  4. Sato Y, Takahashi Y, Xu C, Shimizu KK. (2023) Detecting frequency-dependent selection through the effects of genotype similarity on fitness components. Evolution. https://doi.org/10.1093/evolut/qpad028
  5. Wickham H. (2016) ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.