Background
In order to estimate frequency-dependent selection (FDS) using the framework of selection gradient analysis, we have proposed a regression model that takes genotype similarity into account as a covariate (Sato et al. 2023 Evolution). The purpose of this vignette is to show how to use the proposed method with toy data. Throughout this vignette, we will analyze toy data with a single-locus or GWAS-style structure under either a split or continuous subpopulation structure (see ‘Usage’ below). The Rmarkdown source and input data for this vignette are available at the following Github repository: https://github.com/yassato/RegressionFDS
Before analyzing the toy data, let us review the proposed model. The proposed method is given by either a linear regression model: \[y_i = \beta_0 + \beta_1x_i + \frac{\beta_2}{N_k}\sum^{N_{k}}_{j=1}{x_ix_j} + e_i~~~~~(1)\] or a multiplicative regression model: \[y_i = \beta_0 + \beta_1x_i + \frac{\beta_2}{N_k}\sum^{N_{k}}_{j=1}{x_ix_j} + \frac{\beta_{12}x_i}{N_k}\sum^{N_{k}}_{j=1}{x_ix_j} + e_i~~~~~(2)\] In these equations \(y_i\) is the fitness component of individual \(i\); \(x_i\) is the allelic status of individual \(i\) at a given locus; \(\beta_0\) is the intercept; and \(\beta_1\) is the coefficient of directional selection on the locus. The term \((\sum^{N_{k}}_{j=1}{x_ix_j}) / N_k\) represents the mean genotype similarity between the focal individual \(i\) and its counterpart individuals \(j\). As described in Sato et al. (2022), a positive effect of genotype similarity (\(\beta_2\)) on fitness component represents positive frequency-dependent selection (FDS) on relative fitness, while a negative effect represents negative FDS. The the interaction term coefficient \(\beta_{12}\) represents asymmetric FDS on absolute fitness between two alleles.
Usage
Here, we will first examine the structure of the toy data and then analyze these data using the proposed regression model (Eq. 2). Four types of toy data are prepared for: (1) a single-locus example with split subpopulations; (2) a single-locus example with a continuous space; (3) a GWAS example with split subpopulations; and (4) a GWAS example with continuous subpopulations.
Single-locus examples
The data for the single-locus examples are as simple as a single data frame can store. They are saved in R Data style (.rds), but CSV and TSV formats are also possible. Using the toy data, we will estimate \(\beta_1\), \(\beta_2\), and \(\beta_{12}\) and depict the fitness functions as follows.
(1) split subpopulations
For each individual ID, data from split subpopulations
should include the fitness component measure y, morph-type
‘gi’ (A or a), and its affiliated subpopulation group.
d1 = readRDS("simulation/output/toy1.rds")
head(d1)##      ID        y gi group
## 1 i1307 22.67740  a     1
## 2 i1871 22.65558  A     1
## 3 i1017 22.57326  a     1
## 4 i1941 22.58306  a     1
## 5 i1124 22.67228  a     1
## 6 i1003 22.66084  a     1Before calculating genotype similarity, we convert genotypes into digits. Assuming complete dominance of the A morph over the a morph under Mendelian inheritance, we convert the A and a morphs into +1 and -1, respectively.
d1$gi[d1$gi=="A"] = 1
d1$gi[d1$gi=="a"] = -1
d1$gi = as.numeric(d1$gi) # make gi numeric 
head(d1)##      ID        y gi group
## 1 i1307 22.67740 -1     1
## 2 i1871 22.65558  1     1
## 3 i1017 22.57326 -1     1
## 4 i1941 22.58306 -1     1
## 5 i1124 22.67228 -1     1
## 6 i1003 22.66084 -1     1Then we calculate the mean allelic similarity at the focal locus. The following implements the calculation of the mean allelic similarity within a subpopulation based on \((\sum^{N_{k}}_{j=1}{x_ix_j}) / N_k = x_i(\sum^{N_{k}}_{j=1}{x_j}) / N_k = x_i\bar{x_j}\).
gigj = c()
for(i in 1:nrow(d1)) {
  ds = d1[-i,]
  ds = subset(ds,group==d1$group[i])
  gigj = c(gigj,d1$gi[i]*mean(ds$gi))
}
d1 = data.frame(d1,gigj)
hist(d1$gigj,xlim=c(-1,1))
Figure 1. Histogram of the term \((\sum^{N_{k}}_{j=1}{x_ix_j}) / N_k\)
ranging from -1 to +1.
To estimate \(\beta_1\), \(\beta_2\), and \(\beta_{12}\), we use a linear mixed model with the subpopulation ID considered a random effect, as follows.
res = lme4::lmer(y~gi*gigj+(1|group),data=d1)
summary(res)## Linear mixed model fit by REML ['lmerMod']
## Formula: y ~ gi * gigj + (1 | group)
##    Data: d1
## 
## REML criterion at convergence: -2769.1
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.90894 -0.68002 -0.01201  0.65199  2.97945 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev.
##  group    (Intercept) 2.765e-05 0.005258
##  Residual             2.570e-03 0.050696
## Number of obs: 900, groups:  group, 90
## 
## Fixed effects:
##               Estimate Std. Error   t value
## (Intercept) 22.6012359  0.0018257 12379.695
## gi           0.0003922  0.0017442     0.225
## gigj        -0.2057721  0.0051378   -40.051
## gi:gigj      0.0096332  0.0053303     1.807
## 
## Correlation of Fixed Effects:
##         (Intr) gi    gigj 
## gi      0.077             
## gigj    0.022  0.215      
## gi:gigj 0.215  0.051 0.122Based on \(\hat{\beta}_1\), \(\hat{\beta}_2\), and \(\hat{\beta}_{12}\), we finally visualize fitness functions of FDS. According to Appendix S2 Sato et al. (2022), the fitness function of the A morph and the a morph is given by \(y_\mathrm{A} = \beta_0 + \beta_1 + (\beta_{12}+\beta_2)(2f_\mathrm{A}-1)\) and \(y_\mathrm{a} = \beta_0 - \beta_1 + (\beta_{12}-\beta_2)(2f_\mathrm{A}-1)\), respectively. The mean fitness is given by \(\bar{y} = f_\mathrm{A}y_\mathrm{A} + (1-f_\mathrm{A})y_\mathrm{a}\), where \(f_A\) is the frequency of the A morph.
freq = c() # calculate allele frequency within a subpopulation
for(i in 1:nrow(d1)) {
  ds = subset(d1,group==d1$group[i])
  freq = c(freq,mean(ds$gi))
}
freq = (freq/2) + 0.5
d1 = data.frame(d1,freq)
library(ggplot2)
plt = function(b0,b1,b2,b12) {
  f_star = 0.5-(b1/(2*b2)) # f_star: equilibrium frequency
  p = ggplot(d1, aes(x=freq,y=y)) + geom_jitter(pch=d1$gi+2,colour="grey",width=0.05) + # jitter plots for data points
    theme_classic() + ylab("Fitness") + xlab("phenotype-level frequency of A") + xlim(0,1) +
    geom_function(aes(x=1,y=1),fun=function(x,b0,b1,b2,b12) { (b12+b2)*(2*x-1)+b0+b1 }, args=list(b0,b1,b2,b12)) + # fitness function for A morph
    geom_function(aes(x=1,y=1),fun=function(x,b0,b1,b2,b12) { (b12-b2)*(2*x-1)+b0-b1 }, args=list(b0,b1,b2,b12), colour=grey(0.0,0.33)) + # fitness function for a morph
    geom_function(aes(x=1,y=1),fun=function(x,b0,b1,b2,b12) { 2*b2*x*(2*x-1)+(b12-b2)*(2*x-1)+b0-b1+2*b1*x }, args=list(b0,b1,b2,b12), lty=2) + # mean fitness
    geom_point(aes(x=f_star,y=(b12+b2)*(2*f_star-1)+b0+b1),pch=16,size=3)
  
  return(p)
}
p = plt(b0=coef(summary(res))[1,1], 
        b1=coef(summary(res))[2,1],
        b2=coef(summary(res))[3,1],
        b12=coef(summary(res))[4,1])
p
Figure 2. Negative frequency-dependent selection at a
single locus in split subpopulations. A black line and plus marks
indicate the fitness of the A morph. A grey line and open circles
indicate the fitness of the a morph. A dashed curve indicates the
population-level mean fitness.
(2) continuous space
Instead of the subpopulation ID, data from continuous space should
include the spatial positions of individuals in a two dimensional space
along the x- and y-axes (X and Y). Both
continuous values (e.g., GPS locality) and relative distance (e.g.,
positions in a lattice space) can be used to record the values of
X and Y.
d2 = readRDS("simulation/output/toy2.rds")
head(d2)##      ID        y gi X Y
## 1 i1498 22.57069  a 1 1
## 2  i212 22.71575  A 1 2
## 3  i430 22.66228  A 1 3
## 4  i631 22.61170  a 1 4
## 5 i1653 22.53516  a 1 5
## 6 i1341 22.63255  a 1 6Similarly to the split subpopulation case, we convert genotypes into
digit in the case of continuous space. The genotype similarity in a
continuous space can be calculated using the nei_coval
function in the rNeighborGWAS package. For detailed usage
of the nei_coval function, see also “GWAS examples”
below.
d2$gi[d2$gi=="A"] = 1
d2$gi[d2$gi=="a"] = -1
d2$gi = as.numeric(d2$gi)
smap = cbind(d2$X,d2$Y)
geno = as.matrix(d2$gi)
gigj = rNeighborGWAS::nei_coval(geno,smap,scale=sqrt(2+0.01))
d2 = data.frame(d2,gigj)where scale=sqrt(2+0.01) corresponds to the nearest
neighbors (i.e.,\(N_k\)=8). Then, we
can use a standard regression model since there are no grouping
covariates in the case of a continuous space.
res = lm(y~gi*gigj,data=d2)
summary(res)## 
## Call:
## lm(formula = y ~ gi * gigj, data = d2)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.153648 -0.035457 -0.001249  0.034773  0.180099 
## 
## Coefficients:
##               Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 22.5994027  0.0018361 12308.292   <2e-16 ***
## gi           0.0007062  0.0018361     0.385    0.701    
## gigj        -0.1932443  0.0050256   -38.452   <2e-16 ***
## gi:gigj      0.0043318  0.0050256     0.862    0.389    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05346 on 896 degrees of freedom
## Multiple R-squared:  0.6364, Adjusted R-squared:  0.6352 
## F-statistic: 522.8 on 3 and 896 DF,  p-value: < 2.2e-16As with split subpopulations, we can depict the fitness functions in response to the local frequency of the A morph.
freq = (d2$gi*d2$gigj/2) + 0.5
d2 = data.frame(d2,freq)
plt = function(b0,b1,b2,b12) {
  f_star = 0.5-(b1/(2*b2)) # f_star: equilibrium frequency
  p = ggplot(d2, aes(x=freq,y=y)) + geom_jitter(pch=d2$gi+2,colour="grey",width=0.05) + # jitter plots for data points
    theme_classic() + ylab("Fitness") + xlab("phenotype-level local frequency of A") + xlim(0,1) + 
    geom_function(aes(x=1,y=1),fun=function(x,b0,b1,b2,b12) { (b12+b2)*(2*x-1)+b0+b1 }, args=list(b0,b1,b2,b12)) + # fitness function for A morph
    geom_function(aes(x=1,y=1),fun=function(x,b0,b1,b2,b12) { (b12-b2)*(2*x-1)+b0-b1 }, args=list(b0,b1,b2,b12), colour=grey(0.0,0.33)) + # fitness function for a morph
    geom_function(aes(x=1,y=1),fun=function(x,b0,b1,b2,b12) { 2*b2*x*(2*x-1)+(b12-b2)*(2*x-1)+b0-b1+2*b1*x }, args=list(b0,b1,b2,b12), lty=2) + # mean fitness
    geom_point(aes(x=f_star,y=(b12+b2)*(2*f_star-1)+b0+b1),pch=16,size=3)
  
  return(p)
}
p = plt(b0=coef(summary(res))[1,1], 
        b1=coef(summary(res))[2,1],
        b2=coef(summary(res))[3,1],
        b12=coef(summary(res))[4,1])
p## Warning: Removed 3 rows containing missing values (geom_point).
Figure 3. Negative frequency-dependent selection at a
single locus in a continuous space. A black line and plus marks indicate
the fitness of the A morph. A grey line and open circles indicate the
fitness of the a morph. A dashed curve indicates the population-level
mean fitness.
GWAS examples
Genotype data in a variant call format (.vcf) can be imported using
the gaston package. Its select.snps function
can cut off SNPs with minor allele frequency (MAF). The
gaston2neiGWAS in the rNeighborGWAS package converts the
genotypes in an additive manner (AA, Aa, and aa) into +1, 0, -1,
respectively. The rows and columns in g$geno contain
individuals and loci, respectively.
g = gaston::read.vcf("simulation/output/1_NFDS.vcf.gz")## ped stats and snps stats have been set. 
## 'p' has been set. 
## 'mu' and 'sigma' have been set.g = gaston::select.snps(g,g@snps$maf>0.01)
g = rNeighborGWAS::gaston2neiGWAS(g)
print(g$geno[1:10,1:10])##     .  .  .  .  .  .  .  .  .  .
## i0 -1 -1 -1 -1 -1 -1  0 -1  1 -1
## i1 -1 -1 -1 -1 -1 -1  1 -1 -1 -1
## i2 -1 -1 -1 -1 -1 -1  0 -1  0 -1
## i3 -1 -1 -1 -1 -1 -1  1 -1  0 -1
## i4 -1 -1 -1 -1 -1 -1  0 -1  0 -1
## i5 -1 -1 -1 -1 -1 -1  0 -1  0 -1
## i6 -1 -1 -1 -1 -1 -1  0 -1  0 -1
## i7 -1 -1 -1 -1 -1 -1 -1 -1  1 -1
## i8 -1 -1 -1 -1 -1 -1  1 -1  1 -1
## i9 -1 -1 -1 -1 -1 -1  1 -1  0 -1To analyze FDS, we combine these genotypes with phenotype records as follows.
(3) split subpopulations
As with the single-locus example, phenotype data from split subpopulations should include the subpopulation ID ‘group’.
d3 = readRDS("simulation/output/toy3.rds")
head(d3)##      ID        y group
## 1  i690 3.032313     1
## 2  i188 2.786912     1
## 3  i731 2.663417     1
## 4  i938 2.724622     1
## 5  i445 2.519885     1
## 6 i1642 2.595124     1After applying the MAF cut-off to the genotype data, we extract the individuals and assume complete dominance of the A alleles over the a alleles.
geno = g$geno[d3$ID,] # select the same individuals as phenotype files
geno[geno==0] = 1 # skipping this line encodes genotypes in an additive wayWe then calculate the genotype similarity using the
nei_coval function in the rNeighborGWAS
package. Note that the scale=10^6 and grouping
options allow us to assign individuals to subpopulations and calculate
the genotype similarity within subpopulations.
smap = cbind(runif(nrow(d3),0,1),runif(nrow(d3),0,1)) # random dummy map in a small space
g_nei = rNeighborGWAS::nei_coval(geno, # matrix of individual genotypes
                                 smap, # spatial map including positions at x- and y-axis (dummy)
                                 scale=10^6, # scale=LARGE VALUE calculates the genotype similarity all over a subpopulation
                                 grouping=d3$group) # grouping splits data into subpopulationsAssociation tests are performed using the nei_lmm
function in the rNeighborGWAS package. The
asym=TRUE option is selected to test the asymmetric effects
\(\beta_{12}\) as well as \(\beta_2\) and \(\beta_1\).
res = rNeighborGWAS::nei_lmm(geno, # matrix of individual genotypes
                             g_nei, # matrix of genotype similarities
                             d3$y, # phenotype vector
                             addcovar=model.matrix(~d3$group), # covariate can be given by a matrix
                             asym=TRUE) # asym=TRUE tests beta_12 as well## PVE_self = 0.92217485125694## PVE_nei = 0.0184920826388306head(res)##      beta_self    beta_nei    beta_sxn    p_self     p_nei      p_sxn
## 1  0.046881625  0.10912691  0.14889194 0.1765667 0.8403981 0.06597602
## 2 -0.005213942  0.01665381 -0.01913179 0.4707472 0.4670357 0.62055628
## 3  0.067605699  0.10268371  0.25744636 0.2799402 0.1037798 0.46680893
## 4 -0.185982431 -0.20429596 -0.21744986 0.9014824 0.9233724 0.21839482
## 5 -0.935841546 -0.95108108 -1.06530626 0.8001786 0.4598263 0.03108385
## 6  0.046881625  0.10912691  0.14889194 0.1765667 0.8403981 0.06597602Manhattan plots are depicted using the gaston
package.
x = data.frame(g$gmap,res$p_nei)
colnames(x) = c("chr","pos","p")
gaston::manhattan(x,las=1,thinning=FALSE)
abline(h=-log10(0.05/nrow(g_nei)),lty=2,col="grey") # Bonferroni correction
Figure 4. Manhattan plot showing the association score
of -log10(p) against genomic positions in the case
of split subpopulations. A horizontal dashed line indicate the
genome-wide Bonferroni threshold at p < 0.05.
(4) continuous space
As with the single-locus example, phenotype data from continuous
space should include the spatial positions along the x- and y-axes
(X and Y).
d4 = readRDS("simulation/output/toy4.rds")
head(d4)##      ID        y X Y
## 1  i382 2.634093 1 1
## 2  i365 2.461118 1 2
## 3 i1757 3.226431 1 3
## 4 i1509 2.654648 1 4
## 5 i1685 2.900422 1 5
## 6 i1517 2.627776 1 6As in the case of split subpopulations, we extract the individuals and assume complete dominance of the A alleles over the a alleles.
geno = g$geno[d4$ID,] # select the same individuals as phenotype files
geno[geno==0] = 1 # skipping this line encodes genotypes in an additive wayFor a continuous space, we calculate the genotype similarity using
the nei_coval function in the rNeighborGWAS
package. The option scale=sqrt(2+0.01) corresponds to the
nearest neighbors (i.e.,\(N_k\)=8).
smap = cbind(d4$X,d4$Y)
g_nei = rNeighborGWAS::nei_coval(geno, # matrix of individual genotypes
                                 smap, # spatial map including positions at x- and y-axis
                                 scale=sqrt(2)+0.01) # scale=sqrt(2)+0.01 refers to the nearest neighborsAs in the case of split subpopulations, association tests are
performed using the nei_lmm function in the
rNeighborGWAS package. The asym=TRUE option is
selected to test the asymmetric effects \(\beta_{12}\) as well as \(\beta_2\) and \(\beta_1\).
res = rNeighborGWAS::nei_lmm(geno, # matrix of individual genotypes
                             g_nei, # matrix of genotype similarities
                             d4$y, # phenotype vector
                             asym=TRUE) # asym=TRUE tests beta_12 as well## PVE_self = 0.741886006467102## PVE_nei = 0.178102451197045## EM step failed to improve likelihood (this should not happen)head(res)##      beta_self    beta_nei     beta_sxn    p_self      p_nei      p_sxn
## 1  0.135293942  0.13333649  0.137079051 0.5212056 0.30820716 0.03204436
## 2  0.001689895 -0.03905978 -0.007133374 0.5338668 0.29617026 0.85914127
## 3 -0.459117295 -0.46484055 -0.315251047 0.5568374 0.05413914 0.29810376
## 4  0.098067450  0.11714256  0.012754219 0.6274098 0.23835098 0.97353821
## 5  0.064310110  0.04507351 -0.011163624 0.5561146 0.54849599 0.96508883
## 6  0.135293942  0.13333649  0.137079051 0.5212056 0.30820716 0.03204436Manhattan plots are depicted using the gaston
package.
x = data.frame(g$gmap,res$p_nei)
colnames(x) = c("chr","pos","p")
gaston::manhattan(x,las=1,thinning=FALSE)
abline(h=-log10(0.05/nrow(g_nei)),lty=2,col="grey")
Figure 5. Manhattan plot showing the association score
of -log10(p) against genomic positions in the case
of continuous space. A horizontal dashed line indicate the genome-wide
Bonferroni threshold at p < 0.05.
Notes
When the interaction term is significant:
If \(\beta_{12}\) is significant, the estimates of \(\beta_1\) and \(\beta_2\) may not be correct in Equation (2). Practically, we should first test \(\beta_{12}\) using the multiplicative model Equation (2). If \(\beta_{12}\) is significant in Equation (2), we then estimate \(\beta_2\) using the linear model Equation (1). Then \(\hat{\beta}_{12}\) from Equation (2), \(\hat{\beta}_1\) and \(\hat{\beta}_2\) from Equation (1) may be used to depict the fitness functions.
How to analyze multi-year/site data on continous space:
We can jointly use the scale and grouping
option in the nei_coval function in the
rNeighborGWAS package. If data on a continuous space are
collected from multiple years or sites, individuals in different years
or sites should be separated using the grouping option. The
years or sites should be made numeric and assigned to the
grouping option. Then, the neighbor space to be referred
should be specified using the scale option. In addition to
these options, the addcover option in the
nei_lmm function allows us to consider fixed effects of
years or sites when performing association tests. See the vignette of
the rNeighborGWAS package for this usage at https://cran.r-project.org/web/packages/rNeighborGWAS/vignettes/rNeighborGWAS.html.
How to determine the spatial scale in a continuous space:
This is not easy to determine but feasible by calculating the proportion of fitness variation explained (PVE) by the polygenic effects from the genotype similarity. If PVE peaks at a particular spatial scale, this scale would be optimal. See also Sato et al. (2021) Heredity for details. The vignette of the rNeighborGWAS package (https://cran.r-project.org/web/packages/rNeighborGWAS/vignettes/rNeighborGWAS.html) also provides more information on this topic.
Citation
When you need to cite this instruction, please cite the following publication: Sato Y, Takahashi Y, Xu C, Shimizu KK. (2023) Detecting frequency-dependent selection through the effects of genotype similarity on fitness components. Evolution. https://doi.org/10.1093/evolut/qpad028. This instruction file was developed as a part of the supplementary materials of Sato et al., (2023) Evolution during its peer-review.
References
- Bates D, Maechler M, Bolker B, Walker S. (2015) Fitting Linear
Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1),
1-48. doi:10.18637/jss.v067.i01.
 
- Perdry H, Dandine-Roulland C. (2020) gaston: Genetic Data Handling
(QC, GRM, LD, PCA) & Linear Mixed Models. R package version 1.5.7.
https://CRAN.R-project.org/package=gaston
 
- Sato Y, Yamamoto E, Shimizu KK, Nagano AJ. (2021) Neighbor GWAS:
incorporating neighbor genotypic identity into genome-wide association
studies of field herbivory. Heredity 126(4):597-614. https://doi.org/10.1038/s41437-020-00401-w
 
- Sato Y, Takahashi Y, Xu C, Shimizu KK. (2023) Detecting frequency-dependent selection through the effects of genotype similarity on fitness components. Evolution. https://doi.org/10.1093/evolut/qpad028
- Wickham H. (2016) ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.