Recommend repeat mask? #64

Open
sheinasim opened this issue Dec 6, 2024 · 3 comments
Comments

@sheinasim

Hello there!

I'm running the latest version of EGAPx alpha and I was wondering what the recommendation is for repeat masking the assembly before annotation. Is there a recommended repeat masker, or is masking necessary at all?

Thanks!
Sheina

@pstrope
Contributor

pstrope commented Dec 9, 2024

Hi @sheinasim
Masking is not needed, and not used in EGAPx at this time.

Pooja

@xinghui-guo

Why doesn't the NCBI EGAPx pipeline use the masked FASTA (masked.fasta)? I don't understand the reasoning. Can you explain it in more detail? Thanks!

@murphyte

EGAPx is predominantly an evidence-based predictor, using RNA-seq and protein alignments as the primary basis for nearly all models. Most aligners, including STAR and miniprot, ignore soft-masking and recommend against hard-masking, so there's no need for it. lncRNAs and the 3' UTRs of coding genes also often include repeats, which are valid to include in a model.
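
For readers unfamiliar with the two terms: by common convention, soft-masking marks repeats by writing them in lowercase, while hard-masking replaces them with `N`. The sketch below only illustrates that convention on a toy sequence; the function names are made up for this example and are not part of EGAPx, STAR, or miniprot.

```python
def hard_mask(soft_masked: str) -> str:
    """Turn a soft-masked sequence into a hard-masked one.

    Lowercase (soft-masked) bases become N, so the original
    sequence in those regions is lost.
    """
    return "".join("N" if base.islower() else base for base in soft_masked)


def unmask(soft_masked: str) -> str:
    """Recover the plain sequence from a soft-masked one.

    Soft-masking only changes case, so uppercasing restores every base.
    """
    return soft_masked.upper()


# Toy sequence: the lowercase stretch stands in for an annotated repeat.
seq = "ACGTacgtacgtTTGA"

print(hard_mask(seq))  # ACGTNNNNNNNNTTGA
print(unmask(seq))     # ACGTACGTACGTTTGA
```

Soft-masking is reversible and an aligner that ignores case sees the full sequence, whereas hard-masking removes bases that an alignment through a repeat-containing UTR or lncRNA would need.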

A carefully vetted masking library can be useful for identifying gene predictions on transposons and other repeats; however, without curation it can over-filter real genes (e.g. high-copy-number gene families like histones can get masked). EGAPx includes some alternative logic to identify gene predictions that are predominantly transposon-derived, based on protein hits, and we pre-filter our protein evidence sets to remove repeat-based proteins. It is an area that we've been exploring for improvements, but I think focusing on protein characteristics (e.g. domain analysis) will be more suitable for the purpose. We've also set up EGAPx to require at least some alignment evidence for all models, whereas in RefSeq EGAP we include some models that are entirely based on ab initio prediction. That ab initio path can find a few more real genes, but it is the major source of transposon noise in RefSeq XP models, so the EGAPx settings help improve precision.
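
To make the over-filtering risk concrete, here is a minimal sketch of a naive mask-based filter. It is not how EGAPx works (EGAPx does not use masking); the function names, the 0.5 threshold, and the toy coordinates are all assumptions chosen for illustration.

```python
def masked_fraction(soft_masked: str, start: int, end: int) -> float:
    """Fraction of lowercase (soft-masked) bases in the half-open span [start, end)."""
    region = soft_masked[start:end]
    if not region:
        return 0.0
    return sum(base.islower() for base in region) / len(region)


def flag_repeat_like(models, soft_masked, threshold=0.5):
    """Return names of models whose masked fraction exceeds the threshold.

    `models` is a list of (name, start, end) tuples. The threshold is a
    hypothetical value, not a recommended setting.
    """
    return [
        name
        for name, start, end in models
        if masked_fraction(soft_masked, start, end) > threshold
    ]


# Toy genome: 8 unmasked bases, 12 soft-masked bases, 4 unmasked bases.
genome = "ACGTACGT" + "acgtacgtacgt" + "GGCC"

# Hypothetical models; "histone_like" stands in for a real gene from a
# high-copy-number family that an uncurated library has masked.
models = [("geneA", 0, 8), ("histone_like", 8, 20)]

print(flag_repeat_like(models, genome))  # ['histone_like']
```

A filter like this cannot tell a transposon from a real multi-copy gene, which is why the comment above points to curation, or to protein-level evidence such as domain analysis, instead.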
