Size training set? #5
Hello, and thanks for your interest in the project. You are asking very good questions, but unfortunately it is difficult to give simple answers: much depends on your sample prep and the aims of your project. It's worth repeating that taiyaki is a tool for research. For "normal" DNA or RNA samples, most users should just use the models that ship with Guppy. We think two main groups of people will benefit from taiyaki:
I can offer some general guidelines, but it would help me to offer more specific advice if you can answer a few questions about your project:
If you don't want to give too many details publicly, you can drop me an email and we can discuss it further in private ([email protected]). Now follows some generic advice on training set size.

**Training set size**

There are two important measures of training set size: the total amount of reference sequence covered by your reads, and the amount of unique reference sequence.
The first measure is always bigger than the second: for example, if you have 1000x coverage of a 1 kbase amplicon, then you have 1 Mbase of total sequence but only 1 kbase of unique reference sequence. We think the second measure (unique reference sequence) is more important for training a general basecaller. If you have too little data, there is a risk that the model will overfit to the reference sequence. We are not sure what the lower limit is, and there is a lot of literature on techniques to avoid overfitting, but we can say that you are probably OK if you have at least 1 Mbase.

Another kind of overfitting can occur when the composition of your training set is biased in some way (e.g. extreme GC content); the model might then not generalise very well. For comparison, the models released in Guppy are trained with hundreds of Mbases of unique reference sequence from a variety of organisms.

If you have only a small amount of training data to start with, then you might have to employ some other method to bootstrap your way to a larger training set. If that is the case, we can offer further guidance.
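To make the two measures concrete, they can be computed from a mapping of read id to the reference sequence each read was aligned to. This is a minimal illustrative sketch, not taiyaki code; the `training_set_sizes` helper and the plain dict input are assumptions for the example:

```python
def training_set_sizes(read_refs):
    """Compute the two measures of training set size.

    read_refs: dict mapping read id -> reference sequence (str)
               that the read was aligned to.
    Returns (total_bases, unique_reference_bases).
    """
    # Total sequence: reference bases counted once per read.
    total = sum(len(ref) for ref in read_refs.values())
    # Unique reference sequence: each distinct reference counted once.
    unique = sum(len(ref) for ref in set(read_refs.values()))
    return total, unique

# 1000 reads covering the same 1 kbase amplicon:
reads = {f"read_{i}": "A" * 1000 for i in range(1000)}
print(training_set_sizes(reads))  # (1000000, 1000): 1 Mbase total, 1 kbase unique
```

This reproduces the amplicon example above: high coverage inflates the first measure while the second stays small, which is why the second is the better guide against overfitting.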
Dear,

Say I fall into category 1: it is DNA, I can already basecall the data, and the accuracy is "good", i.e. the same as people report. \o/ Thank you for your explanations. Is the list of organisms used to train Guppy available anywhere? Say the Guppy architecture uses a window of a given size: some k-mers are more represented in some organisms than in others. Because the company will not release the basecaller architecture (and I understand that), knowing the list of organisms used to train Guppy would help when trying to understand the origin of errors.
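To illustrate the k-mer representation point: genomes differ in composition, so a basecaller trained on few organisms may see some k-mers rarely. A minimal sketch of a composition check (the `kmer_frequencies` helper is hypothetical, not part of Guppy or taiyaki):

```python
from collections import Counter

def kmer_frequencies(seq, k=5):
    """Return the relative frequency of each k-mer in a sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

# Toy example: AT-rich and GC-rich sequences have disjoint k-mer profiles,
# so a model trained on one would rarely have seen the other's k-mers.
print(sorted(kmer_frequencies("ATATATATATATATAT", k=2)))
print(sorted(kmer_frequencies("GCGCGCGCGCGCGCGC", k=2)))
```

Comparing such profiles between your sample and the (unknown) training organisms is one way to reason about where systematic errors might come from.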
Dear,
I am leaving the Nanopore Day in Bordeaux, where Stephen Rudd presented the release of this promising tool.
What size should the training set be? How many reads of the same reference?
I have a couple of experiments where I am able to link the reads 1-to-1 to the real expected sequence. Before writing some code to clean my data and format it to be acceptable by taiyaki, I would like to know whether I can expect any improvement. :-) Thank you in advance for any comments.