backslash added to metabolite names #23

Open
tabbassidaloii opened this issue Aug 28, 2023 · 14 comments

@tabbassidaloii
Collaborator

tabbassidaloii commented Aug 28, 2023

@DeniseSl22
In the metabolite files, a \ has been added wherever a metabolite name contains ' or "; see the example below:

Q425039	"bromcresol purple"	"5',5\"-dibromo-o-cresolsulfophthalein"
tabbassidaloii added the "bug" label on Aug 28, 2023
@DeniseSl22
Collaborator

Yes, chemical names can be difficult... But this is the actual name in the Wikidata entry:
https://www.wikidata.org/wiki/Q425039

I believe the backslash is there, so that R can read the data, right? Or is that not working?

@DeniseSl22
Collaborator

I've now changed the entry, so the name is 5',5'' (using the single quote twice instead of a double quote).
That might resolve things for this entry; I'll run the Wikidata GitHub action again. Are any others causing issues, @tabbassidaloii?

@tabbassidaloii
Collaborator Author

It messes up in R when there is a /
It reads with no issue when I remove the unnecessary / manually.
Also when there is a / in the name, an extra / would be added.

@DeniseSl22
Collaborator

Forward or backward slash, which one is causing the issue?

@tabbassidaloii
Collaborator Author

Backslashes are added when there is a ', a ", or a backslash (/).

@tabbassidaloii
Collaborator Author

tabbassidaloii commented Aug 28, 2023

I removed them manually in the file, to keep the app running correctly for now.

@DeniseSl22
Collaborator

DeniseSl22 commented Aug 28, 2023

How many are there? Could you give me some more examples?
Then I can try to construct a regex to solve this. Using TSV should already solve issues with commas within a name.

@tabbassidaloii
Collaborator Author

It is around 20 to 30, I think.
Here are some more examples:

5',5\\"-dibromo-o-cresolsulfophthalein
Germa-Medica \\"Mg\\"
2,5-O,O-BIS-(3',3\\"-AMIDINOPHENYL)-1,4:3,6-DIANHYDRO-D-SORBITOL
diinosine-5',5\\"-pentaphosphate
3-O-beta-D-glucopyranosylpresenegenin 28-O-{[beta-D-apiofuranosyl(1-3)][beta-D-galactopyranosyl(1-4)-beta-D-xylopyranosyl(1-4)]-alpha-L-rhamnopyranosyl(1-3)}{4-O-(E)-4\\"-methoxycinnamoyl}-beta-D-fucopyranoside
3-O-beta-D-glucopyranosylpresenegenin 28-O-{[beta-D-apiofuranosyl(1-3)][beta-D-galactopyranosyl(1-4)-beta-D-xylopyranosyl(1-4)]-alpha-L-rhamnopyranosyl(1-3)}{4-O-(Z)-4\\"-methoxycinnamoyl}-beta-D-fucopyranosid
alpha-GalCer-6\\"-(pyridin-4-yl)carbamate
alpha-GalCer-6\\"-(4-pyridyl)carbamate
2'-[2\\"-(5'''-phosphoribosyl)-5\\"-phosphoribosyl]adenosine 5'-monophosphate
2'-[2\\"-(1'''-ribosyl)-1\\"-ribosyl]adenosine 5',5\\",5'''-tris(phosphate)
alpha-GalCer-6\\"(1-naphthyl)carbamate
2'-(5\\"-phosphoribosyl)adenosine 5'-monophosphate
2'-(1\\"-ribosyl)adenosine 5',5\\"-bis(phosphate)
alpha-GalCer-6\\"-(4-chlorophenyl)carbamate
N,N',N\\"-trimethyl-1,4,7-triazacyclononane
12,24-Dihydro-5H-naphtho[2,3-h]naphth[2\\",3\\":6,7]anthra[2,1,9-mna]acridine-5,10,13,18,25-pentone

@DeniseSl22
Collaborator

When I filter on the names within the query itself, using the regex FILTER(REGEX(?name, "['"\/]", "i")), I get 96963 results in 29005 ms.

We could either filter these out before we obtain the data (by changing the SPARQL query), or find a way for R to read them (e.g. like this). I don't think we should replace these characters with others (besides changing the double quote to two single quotes).

What's your preference @tabbassidaloii ?

@tabbassidaloii
Collaborator Author

tabbassidaloii commented Aug 28, 2023

@DeniseSl22
I don't understand why you filter the results.
The backslashes are added to the outputs of the queries (and I am not sure how to avoid that), so I would try to solve it before reading the file in R.
For example, the query output should contain Germa-Medica "Mg" instead of Germa-Medica \\"Mg\\".

There are options in R to remove them, but I am concerned about causing unintended changes in other values (e.g. there are metabolites with a backslash (/) in their names, and we should keep those). So I would fix it in the bash script.
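
A minimal sketch of what such a bash-side fix could look like, assuming the unwanted escaping always takes the form of one or more backslashes directly in front of a ' or " (as in the examples listed earlier). The function name unescape_quotes and the file names in the usage comment are hypothetical; slashes and backslashes elsewhere in a name are left untouched.

```shell
# Hypothetical helper: strip backslashes that directly precede ' or ".
# Backslashes or slashes anywhere else in a name are not modified.
unescape_quotes() {
  # sed script after shell quoting: s/\\+(["'])/\1/g
  sed -E 's/\\+(["'"'"'])/\1/g'
}

# Example usage (file names are placeholders):
#   unescape_quotes < metabolites_raw.tsv > metabolites_clean.tsv
printf '%s\n' 'Germa-Medica \\"Mg\\"' | unescape_quotes
# prints: Germa-Medica "Mg"
```

Restricting the pattern to backslashes that sit immediately before a quote character is what keeps other values intact; a blanket removal of every backslash would not give that guarantee.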

@DeniseSl22
Collaborator

When I look at the raw GitHub data, I don't see this issue.
I think the file is not being read correctly in the code of the Shiny App; something like this line might need to be adapted:

dataset <- data.table::fread("processed_mapping_files/HGNC_secID2priID.tsv")
We could maybe have a look at this together (I have some time on Wednesday morning) @tabbassidaloii
Just to get a clearer idea of where the issue is coming from...
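
One way to narrow down where the backslashes come from is to inspect the downloaded file itself, before any R code touches it. A small sketch under that assumption; the function name count_escaped_quotes and the path in the usage comment are placeholders, not names from the project.

```shell
# Hypothetical check: count the lines in a file that contain a backslash
# directly followed by ' or ". A non-zero count means the escaping is
# already present in the file, independent of how R reads it.
count_escaped_quotes() {
  # grep -c prints the number of matching lines; it exits with status 1
  # when there are none, so "|| true" keeps "set -e" scripts running.
  grep -c -E '\\["'"'"']' "$1" || true
}

# Example usage (path is a placeholder):
#   count_escaped_quotes metabolites.tsv
```

If the count is zero for the raw file but the names still look wrong in the app, the reading step is the more likely culprit.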

@tabbassidaloii
Collaborator Author

This is weird because I see it in the files I downloaded from GitHub, before opening them in R.
Yeah, let's meet either at 9 or at 11 am.

@DeniseSl22
Collaborator

DeniseSl22 commented Aug 30, 2023

As discussed:

  • Do not push Wikidata data to GitHub, since this adds backslashes, which breaks reading the data in R.
  • Create a logfile, counting the lines or (unique) wordcount for each file (bash), and push the logfile to GitHub.
  • Compare the logfile of the previous release to the new one, and print the results to the terminal (bash).
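
The logfile steps above could be sketched roughly as follows. This is only an illustration: the function names, the .tsv glob, and the "name TAB count" log layout are assumptions, and it counts lines only (not unique words).

```shell
# Hypothetical helpers for the logfile idea; names and layout are assumptions.

# Print "<file name><TAB><line count>" for every .tsv file in a directory.
write_log() {
  for f in "$1"/*.tsv; do
    [ -e "$f" ] || continue            # skip when no .tsv files exist
    printf '%s\t%s\n' "$(basename "$f")" "$(wc -l < "$f" | tr -d ' ')"
  done
}

# Print the files whose line count differs between two logfiles.
# Files that appear in only one of the logs are not reported here.
compare_logs() {
  tab=$(printf '\t')
  old_sorted=$(mktemp) && new_sorted=$(mktemp) || return 1
  # join requires both inputs to be sorted on the join field (the file name).
  sort -t "$tab" -k1,1 "$1" > "$old_sorted"
  sort -t "$tab" -k1,1 "$2" > "$new_sorted"
  join -t "$tab" "$old_sorted" "$new_sorted" |
    awk -F '\t' '$2 != $3 { printf "%s: %s -> %s lines\n", $1, $2, $3 }'
  rm -f "$old_sorted" "$new_sorted"
}

# Example usage (paths are placeholders):
#   write_log processed_files > new.log
#   compare_logs previous.log new.log
```

Keeping the log as plain tab-separated text means it can be pushed to GitHub and diffed there as well.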

@tabbassidaloii
Collaborator Author

@DeniseSl22 if #9 is solved, we can work on #106, taking the points below into account:

As discussed:

  • Do not push Wikidata data to GitHub, since this adds backslashes, which breaks reading the data in R.
  • Create a logfile, counting the lines or (unique) wordcount for each file (bash), and push the logfile to GitHub.
  • Compare the logfile of the previous release to the new one, and print the results to the terminal (bash).
