Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorporate bug fixes and update version number #46

Merged
merged 2 commits into from
Aug 15, 2023

Conversation

SPPearce
Copy link
Contributor

@SPPearce SPPearce commented Jul 6, 2023

This pull request incorporates the bug fixes included in #31, #37, as well as some further changes:
I have been using this version "in production" for the last 2 years, running over 3000 samples through NGSCheckMate in batches. The tool is extremely useful, and I am ready to include it into several Nextflow workflows in the nf-core community, but can not without the fix that #31 provides to prevent divide by zero crashes in particular.

@SooLee , please can you take a look at this PR, and release a new version of the tool.

  • BAM mode no longer crashes if samples with few reads (i.e. failed samples) is provided, causing a divide by zero error (incorporating Fix dataset reading and pearson correlation is 0 #31)
  • split the directory only once if the dir contains "=" in the name (incorporating Update ncm.py #37)
  • The output_tag prefix is added to the output_corr_matrix.txt file
  • Making the size of the dendrogram pdf scale better as more (100+) samples are added
  • Ensured a dendrogram is generated if only two samples are compared
  • Changed the dendrogram background colour to white from grey
  • Updated the version information from v1.0 to v1.0.1

@SPPearce
Copy link
Contributor Author

SPPearce commented Jul 6, 2023

@SooLee , I see that you have been active on the repository, it would be great if we can get a new tagged version of the tool especially with the divide by zero fix, so that I can update the bioconda container (which will update the docker/singularity images on biocontainers) and then I can incorporate the tool properly into pipelines in the nf-core project.

@SPPearce SPPearce changed the title Update ncm.py Incorporate bug fixes and update version number Jul 6, 2023
@SPPearce
Copy link
Contributor Author

@SooLee , please can you look at this?

@SPPearce
Copy link
Contributor Author

@SooLee , are you able to look at this please?

@SooLee
Copy link
Member

SooLee commented Aug 13, 2023

@sejooning Can you please look at this and merge?

@SPPearce
Copy link
Contributor Author

Hi @SooLee , there has been no activity on github from @sejooning for several years, have you been able to contact them outside of github?

Thanks

@aleelab
Copy link
Collaborator

aleelab commented Aug 14, 2023 via email

@SPPearce
Copy link
Contributor Author

Thank you, that is great!

@sejooning sejooning merged commit ef7a38c into parklab:master Aug 15, 2023
@sejooning
Copy link
Collaborator

Hi Simon,

Thank you for addressing the bug and offering valuable suggestions for version updates. The mentioned changes have been committed. It would be fantastic if NGSCheckMate could be incorporated into Nextflow. Thank you very much.

Best regards,
Sejoon Lee

@SPPearce
Copy link
Contributor Author

Thank you both! Can you please release a version with the 1.0.1 tag, to make it clear what that version number relates to? Unless you are planning on making any other changes to the tool in the near future?

@ImNotaGit
Copy link

ImNotaGit commented Nov 29, 2023

Hi, @SPPearce, I am a regular user of NGSCheckMate, thanks a lot for the effort! There is still one very minor issue though:

The output_tag prefix is added to the output_corr_matrix.txt file

You made changes corresponding to the point above only in ncm.py but not in vaf_ncm.py, thus make the naming of the output files inconsistent between fastq mode and bam/vcf mode. Besides, the "output" in "output_corr_matrix.txt" is supposed to be the default prefix, and when providing a custom prefix with -N the output should be <prefix>_corr_matrix.txt; however, in your code the output would become <prefix>_output_corr_matrix.txt (also, the default output changed from output_corr_matrix.txt to output_output_corr_matrix.txt.) It happened that I have been using a snakemake pipeline that I wrote which wraps NGSCheckMate, and the above change and esp. inconsistency between fastq and bam/vcf modes broke my pipeline:)

Although I can make tweak in my pipeline to allow fastq mode and bam/vcf mode outputs being handled differently, it would be great if you could help address the above inconsistency issues upstream. Basically, 1. remove the "output" part in the file name in ncm.py; 2. do the same for vaf_ncm.py.

Thanks a lot.

@SPPearce
Copy link
Contributor Author

Hi @ImNotaGit ,
I'm glad it has been helpful to you too.
Yes, I spotted those issues when incorporating the tool into the Nextflow community organisation nf-core. I generally use the bam mode so hadn't noticed that vaf_ncm.py needed to be synced up with the ncm.py.
Do you also find that the last .vaf file is missed when using vaf_ncm.py?

@SPPearce
Copy link
Contributor Author

(to be precise, that the last file is missing from the correlation matrix, but is present in the text file)

@ImNotaGit
Copy link

ImNotaGit commented Nov 30, 2023

Do you also find that the last .vaf file is missed when using vaf_ncm.py?
(to be precise, that the last file is missing from the correlation matrix, but is present in the text file)

Honestly, I did not notice that, since (now that you reminded me) in my pipeline I do not use the correlation matrix (output_corr_matrix.txt), but generate a report using results in the output_all.txt instead -- I would regenerate the correlation matrix manually by reshaping the output_all.txt matrix, with a tiny loss of numerical precision (I noticed that the values in output_corr_matrix.txt have more digits after the decimal point). I do it this way because:

  1. I noticed that the predicted unmatched sample pairs have their raw correlation values replaced with zeros in the output_corr_matrix.txt, but I would want to check the raw correlation matrix.
  2. Specifically in fastq mode the output_corr_matrix.txt is often not made symmetric, that only one of the two off-diagonal values is filled, but in a random way (i.e. sometimes in the upper triangle and sometimes in the lower...)

Oh, BTW, I recall another inconsistency between fastq and bam/vcf modes: in bam/vcf mode a output_matched.txt table is generated from NGSCheckMate, but in fastq mode this file is not automatically generated...

Yeah, so it seems that there are additional things to be fixed:P

@SPPearce
Copy link
Contributor Author

Yes, I'm also aware that the correlation matrix is only half filled in in the fastq mode too, and the lack of the matched file.
I think I have fixed all those bits except for the missing last sample by copying code over from ncm.py, I'll push to a new PR for a new version soon. Can't immediately see why the last sample is missing from the correlation matrix at the moment.
Is there anything else that could do with fixing?

@ImNotaGit
Copy link

Thanks a lot again for all the efforts! I cannot think of other things that need to be fixed for the moment.

@SPPearce
Copy link
Contributor Author

SPPearce commented Nov 30, 2023

@ImNotaGit , can you test out this vaf_ncm.py file before I make a PR. I'm struggling to test it outside a container on my laptop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants