-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
problem with minfreq #74
Comments
Greetings, Wanting! Thanks for this question. In fact, this is the desired behavior. For example, suppose there is a polymorphic site with C=9999 and T=1. The frequency of T is 1 / (9999 + 1) = 0.0001, or 0.01%, and it doesn't matter whether it's the REF or ALT allele. If C is the REF allele, then the VCF will report ALT as T with AF=0.0001; alternatively, if T if the REF allele, then the VCF will report ALT as C with AF=0.9999. However, in both cases there is an allele (T) with a frequency below your cut-off, and it will be eliminated. Put another way, the frequency filter will eliminate an allele regardless of whether it is REF or ALT. Note that which allele is REF is often arbitrary (e.g., a standard reference sequence against which variants were called). Let me know if this makes sense. If there's a reason you wanted to filter out low-frequency ALT alleles only (but not low-frequency REF alleles), you could filter your VCF file on your own before submitting it to SNPGenie, and then simply not use --minfreq. However, in most circumstances I don't think it makes sense to give low-frequency REF alleles a 'free pass' but eliminate low-frequency ALT alleles. But, it's possible your situation is an exception! Hope this helps, let me know! |
Thank you for response so fast. Just want to confirm that I understand correctly if C is the REF with 0.99% and T is ALT 0.01%. The T will be filtered out but C will remain in the calculation. |
My pleasure! It's correct that, in this example, T will be filtered out and C will not be filtered out. However, regarding calculation, it depends what you mean. If you mean π, there is no longer a 'calculation', i.e., the site will no longer be polymorphic so its π = 0. Let me know! |
Gotcha. I am trying to compare piN/piS to access selective pressure for my sample. I am sequencing omicron samples, that call variants based on wuhan-1, in that case should I consider those "fixed or 99%" mutation. Or it would be more reasonable to filter the vcf without lower variant and using default minfreq in the program. |
Understood! If you're interested in constraint then πn/πs is the way to go. To me, your question is more about how to make sure to quality filter variants — using minfreq is just one overly simple way to do that. Indeed, when you use a minfreq cutoff of 0.01, you are effectively saying you believe that anything with AF < 1% isn't real, i.e., is a sequencing error. I see you're working with Tom — if you have questions about study design and QC that's probably beyond the ambit of SNPGenie and GitHub, but feel free to send an email to us! Of interest, we just published a program called VCFgenie to perform QC filtering of iSNVs that you may wish to use:
Let me know! |
I am recently running snpgenie within pool analysis, using command
snpgenie.pl --minfreq=0.01 --snpreport=${sample}_snpgenie.vcf --vcfformat=2 --slidingwindow=30 --fastafile=./seqs/MN908947_3.fasta -- gtffile=./seqs/MN908947_3.gff.txt.
I want to exclude all the variants below 1% frequency so using minfreq=0.01, however I realized all the mutations with frequency >0.99 have been excluded. I am not sure why this is happening, can someone help me figure it out. I attached the screen shot of the script out put and my vcf file.The text was updated successfully, but these errors were encountered: