update of genes.refGene files #26
Comments
The format of genes.refGene is almost the same as the refGene table from UCSC (hg19 schema, hg19 dump). I bet that table is the data source, except that genes.refGene carries a version in each accession name while UCSC's table does not. UCSC provides another table, gbCdnaInfo, with the accession name and version (schema, dump). You can load the two tables into a MySQL database and join them to get the required data, e.g.:
SELECT r.bin,
CONCAT(r.name, '.', g.version) AS name,
r.chrom,
r.strand,
r.txStart,
r.txEnd,
r.cdsStart,
r.cdsEnd,
r.exonCount,
r.exonStarts,
r.exonEnds,
r.score,
r.name2,
r.cdsStartStat,
r.cdsEndStat,
r.exonFrames
INTO OUTFILE '/tmp/genes.refGene'
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
FROM refGene r, gbCdnaInfo g
WHERE r.name = g.acc;
If you have difficulty setting up a database, you can query UCSC's public database instance directly, since the resulting data is only around 20 MB (for now):
mysql -ugenomep --password=password -hgenome-mysql.cse.ucsc.edu -ACD hg19 -BNe "SELECT r.bin,
CONCAT(r.name, '.', g.version) AS name,
r.chrom,
r.strand,
r.txStart,
r.txEnd,
r.cdsStart,
r.cdsEnd,
r.exonCount,
r.exonStarts,
r.exonEnds,
r.score,
r.name2,
r.cdsStartStat,
r.cdsEndStat,
r.exonFrames
FROM hg19.refGene r, hgFixed.gbCdnaInfo g
WHERE r.name = g.acc" > genes.refGene
As for LRG_RefSeqGene, the latest file is simply available from NCBI's FTP: ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/RefSeqGene/LRG_RefSeqGene |
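For readers who would rather not set up MySQL at all, the same join can be sketched in plain Python over the two downloaded dump files. This is only a sketch under stated assumptions: the function names are hypothetical, the refGene accession is taken to be the second column (matching the SELECT order above), and the gbCdnaInfo dump is assumed to hold the accession in its second column and the version in its third. Check both against the schema pages before relying on it.

```python
import gzip


def split_tsv(lines):
    """Split tab-separated lines into lists of fields."""
    for line in lines:
        yield line.rstrip("\n").split("\t")


def join_versions(refgene_rows, gbcdnainfo_rows,
                  name_col=1, acc_col=1, version_col=2):
    """Return refGene rows with the name rewritten as ACCESSION.VERSION.

    Column positions are assumptions (see the lead-in). Rows without a
    matching accession are dropped, mirroring the SQL inner join.
    """
    refgene_rows = [list(row) for row in refgene_rows]
    wanted = {row[name_col] for row in refgene_rows}
    # gbCdnaInfo is large, so keep only the accessions refGene needs.
    versions = {
        row[acc_col]: row[version_col]
        for row in gbcdnainfo_rows
        if row[acc_col] in wanted
    }
    joined = []
    for row in refgene_rows:
        version = versions.get(row[name_col])
        if version is not None:
            versioned = f"{row[name_col]}.{version}"
            joined.append(row[:name_col] + [versioned] + row[name_col + 1:])
    return joined


def build_genes_refgene(refgene_gz, gbcdnainfo_gz, out_path):
    """Join two gzipped dump files and write a tab-separated output file."""
    with gzip.open(refgene_gz, "rt", encoding="utf-8", errors="replace") as ref, \
            gzip.open(gbcdnainfo_gz, "rt", encoding="utf-8", errors="replace") as gb:
        rows = join_versions(split_tsv(ref), split_tsv(gb))
    with open(out_path, "w") as out:
        for row in rows:
            out.write("\t".join(row) + "\n")
```

With the two dumps saved locally under whatever names you chose, calling `build_genes_refgene(refgene_path, gbcdnainfo_path, "genes.refGene")` would write the joined file.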
Is this "genes.refGene" file for hg19 or hg18? |
@anopperl The gbCdnaInfo table is the same for all assemblies. But the links to the refGene table posted above were for hg19 only. For hg18, you may simply change the
Similarly for hg38:
On the other hand, if you prefer querying the public UCSC database directly, simply use the same command with the assembly name changed, e.g. for hg18:
mysql -ugenomep --password=password -hgenome-mysql.cse.ucsc.edu -ACD hg18 -BNe "SELECT r.bin,
CONCAT(r.name, '.', g.version) AS name,
r.chrom,
r.strand,
r.txStart,
r.txEnd,
r.cdsStart,
r.cdsEnd,
r.exonCount,
r.exonStarts,
r.exonEnds,
r.score,
r.name2,
r.cdsStartStat,
r.cdsEndStat,
r.exonFrames
FROM hg18.refGene r, hgFixed.gbCdnaInfo g
WHERE r.name = g.acc" > genes.refGene
Finally, you can concatenate the genes.refGene files for different assemblies into a single file. |
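The concatenation step mentioned above could look roughly like the following. It is a minimal sketch: the function name and the per-assembly file names in the usage note are hypothetical, and the only care taken is to make sure every row ends with a newline so that the last row of one file does not run into the first row of the next.

```python
def concatenate_refgene(paths, out_path):
    """Append several genes.refGene-style files into one output file."""
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as src:
                for line in src:
                    # Guard against a missing trailing newline at end of file.
                    out.write(line if line.endswith("\n") else line + "\n")
```

For example, `concatenate_refgene(["genes.hg18.refGene", "genes.hg19.refGene"], "genes.refGene")` would merge two per-assembly files, assuming you saved them under those names.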
thanks lacek |
UCSC doesn't keep track of all versions of accessions, so we only get the latest version of each transcript from their database. For this particular transcript, you will need a data source with all transcript versions. One I can think of is UTA. You need to either write a query to produce data in the format of the genes.refGene file, or write a Python adapter function. On the other hand, you may also want to take a look at hgvs, an HGVS parser that is based on UTA. It seems more robust but less easy to use. |
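As a rough illustration of the adapter idea, the sketch below builds a lookup function over a source that holds every transcript version. It assumes the adapter is simply a callable that takes a transcript name and returns a transcript object or None. `make_get_transcript` and the dictionary layout are hypothetical, and how the dictionary gets filled (from UTA or elsewhere) is left open. Falling back to the highest known version for an unversioned name is a design choice of this sketch, not something the thread prescribes.

```python
def make_get_transcript(transcripts):
    """Build a lookup function over versioned transcripts.

    `transcripts` maps 'ACCESSION.VERSION' strings to transcript objects
    (hypothetical layout; fill it from any source holding all versions).
    """
    latest = {}
    for versioned_name in transcripts:
        accession, _, version = versioned_name.rpartition(".")
        if not accession or not version.isdigit():
            continue  # skip keys that are not of the form ACCESSION.VERSION
        if int(version) > latest.get(accession, -1):
            latest[accession] = int(version)

    def get_transcript(name):
        # An exact versioned match wins, so older versions stay reachable.
        if name in transcripts:
            return transcripts[name]
        # An unversioned name falls back to the highest known version.
        if name in latest:
            return transcripts[f"{name}.{latest[name]}"]
        return None

    return get_transcript
```

The returned function can then be handed to whatever parser expects such a callable.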
I've made a Python package that provides ~800k transcripts (both RefSeq and Ensembl) for PyHGVS. You can either download a JSON.gz file or use a REST service. To use it:
|
Thank you all for the information on how to generate refGene files. Is there a way to get a refGene file for hs1? Using the solution from @lacek with 'hs1' does not work. |
@simzep By hs1, are you referring to HCLS1? Anyhow, this library is unmaintained and comes with a number of problems (e.g. bugs in parsing dup and ins, lack of reference base checking, no support for inversions or mitochondrial variants, etc.). You're better off with another similar library (e.g. https://github.com/biocommons/hgvs) or a tool (e.g. https://asia.ensembl.org/info/docs/tools/vep/recoder/index.html) for parsing HGVS. |
@lacek Thanks for the reply. |
@simzep I wasn't aware that you were referring to the CHM13 reference genome. For UCSC data, you can search on its Table Browser page. E.g. for refGene of hg38: https://genome.ucsc.edu/cgi-bin/hgTables?db=hg38&hgta_track=refSeqComposite&hgta_table=refGene&hgta_doSchema=data%20format%20description At the moment, I cannot find the refGene table for hs1 on UCSC. The closest one may be https://genome.ucsc.edu/cgi-bin/hgTables?db=hs1&hgta_track=hub_3671779_refSeqComposite&hgta_table=hub_3671779_ncbiRefSeq&hgta_doSchema=data%20format%20description. It's in bigBed format, though, and requires conversion if you find it appropriate for your use case. |
Hi, I added T2T support to cdot, so you should be able to convert to/from HGVS (using the Biocommons HGVS) reasonably easily, see example code here: https://github.com/SACGF/cdot/wiki/Biocommons-T2T-CHM13v2.0-example-code |
I need to use an updated version of RefSeq. Is there a script available to download the current version of the file 'genes.refGene', or should I build it by hand? Thank you. Angela