Missing cnv's in cnv_data.txt #134

Open
ardydavari opened this issue Mar 2, 2021 · 4 comments


ardydavari commented Mar 2, 2021

I am having an issue where the number of CNVs in cnv_data.txt is much smaller than the number produced by parse_cnvs.py.

When I run parse_cnvs.py on my tumor sample, I get approximately 534 lines, many of which describe regions with non-diploid copy numbers.

biopsy_cnvs.txt

chromosome      start   end     copy_number     minor_cn        major_cn        cellular_prevalence
1       770502  7955500 4       0       4       0.76228
1       7969259 19193393        5       1       4       0.76228
1       19199091        37788546        4       0       4       0.76228
1       37848458        40188029        5       1       4       0.76228
1       40255487        47175369        4       0       4       0.76228
1       47182705        47737136        4       0       4       0.0843364667695
1       47182705        47737136        5       1       4       0.677943533231
1       47738318        48888191        5       1       4       0.436941290323
1       47738318        48888191        6       2       4       0.325338709677
1       48888622        79171641        4       0       4       0.76228
1       79196750        79316969        6       2       4       0.76228
1       79336373        83974480        4       0       4       0.76228
1       84227477        84735545        6       2       4       0.76228
1       84737421        85233983        4       0       4       0.76228

After running create_phylowgs_inputs.py (command below), I get only 16 variants in my final file.

create_phylowgs_inputs.py \
    -s 5000 \
    --cnvs biopsy=/data/biopsy_cnvs.txt \
    --vcf-type biopsy=sanger \
    biopsy=/data/biopsy.muts.vcf

cnv_data.txt

cnv     a       d       ssms    physical_cnvs
c0      135530  219000  s1699,3,3;s1700,3,3;s1701,3,3;s1702,3,3;s1703,3,3;s1704,3,3;s1705,3,3;s1706,3,3;s1707,3,3;s1708,3,3;s1709,3,3;s1710,3,3;s1711,0,3;s
c1      16253   25479   s1655,0,1;s1656,0,1     chrom=18,start=4518557,end=5017167,major_cn=1,minor_cn=0,cell_prev=0.724137931034
c2      120207  186989  s996,0,1        chrom=8,start=16302361,end=19961643,major_cn=1,minor_cn=0,cell_prev=0.714285714286
c3      55739   84363   s997,0,1        chrom=8,start=22409353,end=24060296,major_cn=1,minor_cn=0,cell_prev=0.678571428571
c4      151214  219000  s1527,1,2;s1528,1,2;s1529,1,2;s1530,1,2;s1531,1,2;s1532,1,2;s1533,1,2;s1534,1,2;s1535,1,2;s1536,1,2;s1537,1,2;s1538,1,2;s1539,1,2;s
c5      74672   92697           chrom=8,start=25388663,end=27202687,major_cn=2,minor_cn=1,cell_prev=0.388888888889
c6      40686   49728   s981,1,2        chrom=8,start=1511992,end=2485144,major_cn=2,minor_cn=1,cell_prev=0.363636363636
c7      2174    2652            chrom=8,start=2500261,end=2552162,major_cn=1,minor_cn=0,cell_prev=0.360396039604
c8      2390    2900            chrom=18,start=5018374,end=5075134,major_cn=2,minor_cn=1,cell_prev=0.351635514019
c9      37378   45136   s1625,1,2;s1626,1,2     chrom=17,start=21854462,end=22737746,major_cn=2,minor_cn=1,cell_prev=0.34375
c10     194113  219000  s259,1,2;s260,1,2;s261,1,2;s262,1,2;s263,1,2;s264,1,2;s265,1,2;s266,1,2;s267,1,2;s268,1,2;s269,1,2;s270,1,2;s271,1,2;s272,1,2;s273,
c11     195195  219000  s1263,1,2;s1264,1,2;s1265,1,2;s1266,1,2;s1267,1,2;s1268,1,2;s1269,1,2;s1270,1,2;s1271,1,2;s1272,1,2;s1273,1,2;s1274,1,2;s1275,1,2;s
c12     16957   18743   s1208,0,1       chrom=10,start=46165506,end=46532287,major_cn=1,minor_cn=0,cell_prev=0.190476190476
c13     168974  173802          chrom=8,start=27206516,end=30607739,major_cn=2,minor_cn=1,cell_prev=0.0555555555556
c14     40427   41520           chrom=8,start=13435023,end=14247547,major_cn=2,minor_cn=1,cell_prev=0.0526315789474
c15     19905   20430           chrom=8,start=212218,end=612014,major_cn=2,minor_cn=1,cell_prev=0.0513149454779
c16     39563   40578           chrom=8,start=9695182,end=10489271,major_cn=1,minor_cn=0,cell_prev=0.05

What happened to the other copy number variants?
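To put a number on the gap, it can help to count how many distinct regions in the parse_cnvs.py output are non-diploid before comparing against cnv_data.txt. The following is a minimal sketch, assuming the whitespace-separated layout shown in biopsy_cnvs.txt above and treating major_cn = minor_cn = 1 as normal (which ignores sex chromosomes). The function name and the last sample row are hypothetical.

```python
def count_abnormal_regions(text):
    """Count distinct (chromosome, start, end) regions whose state is not 1/1."""
    header, *rows = [line.split() for line in text.strip().splitlines()]
    col = {name: i for i, name in enumerate(header)}
    abnormal = set()
    for fields in rows:
        major = int(fields[col["major_cn"]])
        minor = int(fields[col["minor_cn"]])
        if (major, minor) != (1, 1):
            abnormal.add((fields[col["chromosome"]],
                          fields[col["start"]],
                          fields[col["end"]]))
    return len(abnormal)

# First three data rows are copied from the excerpt above; the last is made up.
sample = """chromosome start end copy_number minor_cn major_cn cellular_prevalence
1 770502 7955500 4 0 4 0.76228
1 47182705 47737136 4 0 4 0.0843364667695
1 47182705 47737136 5 1 4 0.677943533231
2 100 200 2 1 1 1.0
"""
print(count_abnormal_regions(sample))  # 2
```

Regions that appear twice with different states, such as 47182705-47737136 here, are counted once.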

@shaghayeghsoudi

May I ask how you ran parse_cnvs.py? Did you have Battenberg CNVs? Did you do any specific filtering before running parse_cnvs.py on your CNV data?
I am getting a strange error message when I run:

python ./parse_cnvs.py -f battenberg -c 0.27 data.test.battenberg.txt

Error:
  File "./parse_cnvs.py", line 195, in <module>
    main()
  File "./parse_cnvs.py", line 191, in main
    regions = parser.parse()
  File "./parse_cnvs.py", line 111, in parse
    end = int(fields[3 + self._field_offset])
ValueError: invalid literal for int() with base 10: '0.610923189999321'

Did you run it in a different way? I would appreciate your answer.
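The traceback shows int() receiving '0.610923189999321' where an end coordinate was expected, which suggests that the columns in the input file are shifted relative to what the parser reads, for example because of a missing or extra leading column. That is only a guess from the error message. Below is a small diagnostic sketch, independent of parse_cnvs.py, that reports which fields of a data row look like integers. The sample row and the function name are made up for illustration.

```python
def integer_like_columns(line, sep="\t"):
    """Return the 0-based indices of the fields that int() can parse."""
    indices = []
    for i, field in enumerate(line.rstrip("\n").split(sep)):
        try:
            int(field)
        except ValueError:
            continue  # not an integer, e.g. a fraction or a text label
        indices.append(i)
    return indices

# Hypothetical row: chromosome, start, end, then a fractional value.
row = "1\t770502\t7955500\t0.610923189999321"
print(integer_like_columns(row))  # [0, 1, 2]
```

If the start and end positions do not sit where fields[3 + self._field_offset] points, the file layout and the format chosen with -f probably do not match.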

@ardydavari

I noticed that on lines 409-423 the CNVs are separated into two types.

The ones handled in the else block are included in cnv_data.txt, but I still don't understand what happens to the others. Are they being treated like SNVs?

@ardydavari

I also took a closer look at my cnv_data.txt file. Many of the missing CNVs appear in row c0 under the physical_cnvs column. However, none of the other rows have multiple CNVs associated with them.
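A quick way to check how many physical regions each row carries is to split the physical_cnvs field and count the entries. This is a rough sketch only: the key=value layout is taken from the rows shown earlier, but the ";" separator between regions is an assumption (row c0 is truncated in the listing above, so the real separator is not visible here), and the function name is hypothetical.

```python
def split_physical_cnvs(value, region_sep=";"):
    """Split a physical_cnvs field into one dict of key/value strings per region."""
    regions = []
    for chunk in value.split(region_sep):
        if not chunk:
            continue  # tolerate empty fields and trailing separators
        regions.append(dict(pair.split("=", 1) for pair in chunk.split(",")))
    return regions

# Row c1 from the listing above.
single = "chrom=18,start=4518557,end=5017167,major_cn=1,minor_cn=0,cell_prev=0.724137931034"
# Illustrative only: rows c1 and c2 glued together with the assumed separator.
merged = single + ";chrom=8,start=16302361,end=19961643,major_cn=1,minor_cn=0,cell_prev=0.714285714286"

print(len(split_physical_cnvs(single)))  # 1
print(len(split_physical_cnvs(merged)))  # 2
```

Running this over every row of cnv_data.txt would show whether c0 is the only row that aggregates several regions.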


ardydavari commented Mar 4, 2021

I found that if I change line 726 from return None to continue, the missing CNVs show up.

for sampidx, cell_prev, major, minor in zip(cnv['sampidx'], cnv['cell_prev'], cnv['major_cn'], cnv['minor_cn']):
    # Region may be (clonal or subclonal) normal in a sample, so ignore such records.
    if self._is_region_normal_cn(chrom, major, minor):
        continue
    # Either we haven't observed an abnormal CN state in this region before,
    # or the observed abnormal state matches what we've already seen.
    if abnormal_state is None or abnormal_state == (major, minor):
        abnormal_state = (major, minor)
        filtered.append({'sampidx': sampidx, 'cell_prev': cell_prev, 'major_cn': major, 'minor_cn': minor})
        continue
    # The abnormal state (i.e., major & minor alleles) is *different* from
    # what we've seen before. The PWGS model doesn't currently account for
    # such cases, so ignore the region.
    else:
        return None
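To see what the change does, here is a self-contained sketch that mimics the loop above on toy data. It is not the PhyloWGS code: the function name, the on_conflict switch, and the simplified 1/1 normality test (standing in for _is_region_normal_cn) are assumptions made for illustration. The two records reuse the states and prevalences reported for region 47182705-47737136 in biopsy_cnvs.txt.

```python
def filter_states(records, on_conflict="drop_region"):
    """records: list of (sampidx, cell_prev, major_cn, minor_cn) for one region."""
    abnormal_state = None
    filtered = []
    for sampidx, cell_prev, major, minor in records:
        if (major, minor) == (1, 1):  # simplified normality check
            continue
        if abnormal_state is None or abnormal_state == (major, minor):
            abnormal_state = (major, minor)
            filtered.append({"sampidx": sampidx, "cell_prev": cell_prev,
                             "major_cn": major, "minor_cn": minor})
        elif on_conflict == "drop_region":
            return None  # behaviour of the loop as posted: discard the region
        # Otherwise ("skip_record"): act like `continue` and ignore this record.
    return filtered

records = [
    (0, 0.0843364667695, 4, 0),
    (0, 0.677943533231, 4, 1),  # a second, different abnormal state
]

print(filter_states(records))  # None
print(len(filter_states(records, on_conflict="skip_record")))  # 1
```

With the continue variant the region survives, but only its first abnormal state is kept and the conflicting one is silently dropped. Since the comment in the loop says the PWGS model does not account for regions with more than one abnormal state, the extra CNVs that appear after the change may not be modelled correctly.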
