-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pi cluster within-pool #50
Comments
Greetings! First, I am a little confused by "characterize each sample by its nucleotide diversity (i.e. pi, piS, piN, pi_coding, pi_noncoding) as function of variant frequency". But all measures of π are determined entirely by the variant numbers and frequencies — so what exactly are you comparing and why? I do not well understand your distance method, but it seems you are characterizing differences between the π of samples, i.e., you are asking which are the most (or least) diverse? My first thought is that, depending on your system, pi_coding may simply be too low (e.g., too few variants) for any meaning to be extracted. This will depend on how polymorphic your system is. What do you mean by "running SNPGenie_sliding_windows.R on my codon_results.txt to get relationship between samples?" I think I need to know more what you're trying to compare between samples. piN/piS and dN/dS are (often) calculated identically; their major difference is 'philosophical'. pi refers to variation within a population or species; d refers to divergence between different species. Because species are much more different than members of the same population, d (between-species) is sometimes further 'corrected' for the possibility of multiple mutations having occurred at the same site ('multiple hits'), e.g., using a Jukes-Cantor correction. You're exactly right that this is totally different than the MK test. The MK test uses total counts of sites with that type of difference (the number of sites that differ), that is, it's count data, and does not factor in their frequencies. Thus, "dN" and "dS" for MK are not the same as the most common definition of dN and dS, and in my opinion should not be called dN and dS because it's so confusing.. Their original 1991 paper does not use those names. Perhaps calling them fN and fS (numbers of sites with fixed nonsynoynmous and synonymous differences between species) and pN and pS (numbers of sites with polymorphic nonsynoynmous and synonymous differences within species) would be less confusing. For all of these questions, the answer about what approach to use totally depends on exactly what hypothesis you want to test. I suggest carefully defining your hypothesis/question first, then choosing the one test that is best suited to address it. Otherwise, there are infinite possibilities, and one is just aimlessly "fishing" for results. Let me know if that helps! |
Hi Chase,
I run SNPGenie using different --minfreq, with the aim to see how the different calculated output differ as function of the min freq. I would expect that at determined value of freq, there is not difference between samples, because the samples are from a short period of time and hence the pathogen should not have time to move strongly ahead in the evolution. So basically, which is the minimum freq by which the samples starts to be different? Using only the major variants they are very similar. See pic, in which the colored dots are the variants. A1/A2 are identical between them at %5 but different with A3. After 5% they are all identical.
The idea is that using π diversity (distance matrix) I would find clusters between samples, so basically some samples could have similar diversity. Then this similar diversity background could be related to the metadata (i.e. household, contact direct) or outcome?
From #48 and https://github.com/chasewnelson/SNPGenie#sliding-windows-bootstrapping my idea is to get also the πN/πS per sample and then compare πN/πS per each sample. Does it make sense?
My hypotheses are: One of the read inspiring is this: Best |
Greetings, @aspitaleri ! Thanks for sharing this fascinating idea. Indeed, if there has not been enough time for new variants to fix, it makes perfect sense that the consensus (major allele) sequences of all the lineages would be identical. Now I understand better. Yes, I think it makes sense to calculate πN and πS individually for each sample, and then see how samples compare. You will just want to be sure that the differences are not an artifact of your SNP calling method. Finally, I think the within vs. between host diversity distinction you make is extremely valuable and important. I'd definitely think that low-frequency diversity within host offers valuable insight into evolution, even when the consensus sequences are identical. I think there is a literature on the concept of "Beyond the Consensus" you may find valuable. Let me know if that helps! |
Hi @cwnelson88 thanks for your feedback! Now, if I want to compare samples I could fall in the cluster problem to get different genomes in the same cluster since they have similar pi_nocoding but the pi_nocoding arises from different contribution along the genome. For instance: |
Hello @aspitaleri ! These are interesting questions. They're technically on the user end (this is something the user could do with SNPGenie results), but one possibility is to "manually" calculate noncoding pi yourself, for custom sites/regions/sliding windows. Note that π = (N_diffs+S_diffs)/(N_sites+S_sites). For example it might be convenient or meaningful to split your genome into genes, or overlapping windows, or non-overlapping windows. You could even go so far as to cluster samples using each site as a different attribute (variable). Each site could be assigned its own pi as a value. But these are just ideas — all of this really depends on exactly what hypothesis you're trying to test. For example, do you already know other characteristics of these samples? Could you attempt to predict those characteristics based on your clustering-by-diversity method? Just ideas! |
Hi @cwnelson88 indeed I was thinking to use the data from product_results.txt, which contains information per product. I will then use π = (N_diffs+S_diffs)/(N_sites+S_sites) and test my idea over those products, pairwise. |
That sounds very reasonable! Then, if certain products are showing similarities, you can get more specific and examine individual sites too. Good luck! |
As follow up questions regarding the MK-test:
|
Hi @pascalangst ! Unfortunately, I do not think it is possible to make MK calculations directly from either SNPGenie file. But it may be possible to derive it from one of the outputs. Please allow me to 'think aloud' as we consider just For For Thus, I think the best bet is the Please keep in mind there are many variations of the MK test; I never use it and am not an expert in its applications; and SNPGenie is not designed for it. One limitation of the MK test is that it does not consider the frequencies of the different types of variants (it just counts sites), which (unlike dN/dS and piN/piS) eliminates a critical source of information. Another limitation is that the relaxation of purifying selection during a population bottleneck can mimic selection. Given those caveats, the 'hack' suggested above can probably yield the counts you want if this is the version of MK you're interested in using. Hope that helps — |
Hi Chase. Many thanks for your thoughts on that. Mapping to a closely related species is what we did. The results of the MK calculations differ only slightly from the original idea of creating them from the product_results, compared to creating them from the site_results. However, I also think making them from the site_results is probably more correct and I'll move forward using calculations thereof. |
Thanks @pascalangst ! That sounds like a good plan. It makes sense that using product_results would yield a close approximation for a closely related species. However, that approximation considers a codon to be at most 1 site rather than 3, and therefore would increasingly underestimate the numbers of fixed or polymorphic sites with (1) increased divergence or (2) increased polymorphism, because more and more codons would contain more than one fixed diff or polymorphism (yet still are only counted as 1). For these reasons, I affirm your choice to use the site_results — it's more accurate and ultimately more flexible, i.e., can be used for data/circumstances with different amounts of diversity. Chase |
Hi Chase -
I have performed the within-pool using the vcf report as input. Now I have several output and I am trying to understand (after reading your papers and references reported in the README.md) how to use them for my purpose. I have been reading previous issues #48 and #49 since they are strongly correlated to my purpose.
My aim is to characterize each sample by its nucleotide diversity (i.e. pi, piS, piN, pi_coding, pi_noncoding) as function of variant frequency and then compare between samples. At given frequency, I built a pairwise distance matrix between samples using as metric Euclidean distance between pi, pi_coding and pi_noncoding. On that distance matrix I have carried out either a linkage clustering or a dimension reduction as UMAP, to perform a clustering analysis. Interestingly, pi_coding is giving few clusters.
Said so, now I'd like to link those cluster to the effect of the mutations, i.e. piN/piS which is identical to dN/dS as far as I understood in my case. I will make an average/variance of piN/piS per cluster.
Any comment on that?
The thread #48 could be useful in my case too, running SNPGenie_sliding_windows.R on my codon_results.txt to get relationship between samples?
Then, reading further I read the McDonald-Kreitman test:
[pN/pS] / [dN/dS]
where
dS: the number of synonymous substitutions per gene
dN: the number of non-synonymous substitutions per gene
pS: the number of synonymous polymorphisms per gene
pN: the number of non-synonymous polymorphisms per gene
Now I got lost/confused and I ask your kind help.
pS and pN are not piS and piN, right? I am asking because the definition are very similar between them.
thank for any comment on that! very appreciate
Best
The text was updated successfully, but these errors were encountered: