Add task 1160 from MRS #375

Draft: wants to merge 1 commit into master


Palipoor
Contributor

@Palipoor Palipoor commented Oct 5, 2021

This task is created from the MRS dataset from issue #283.
However, I am not sure whether this is a good addition. The data is derived from Reddit replies, and they are not good-quality examples to learn from. I've cleaned them as much as I could, but there is still a lot of nonsense in there. I'm submitting the English task to get other people's opinions. If it's good enough, or if there's a good way to filter out the nonsense, I will go on and add other languages too.
@swarooprm @danyaljj

@danyaljj
Contributor

danyaljj commented Oct 7, 2021

Yeah, the data is quite noisy. I am leaning towards a "no", unless we can somehow clean it up around a particular subject.
@swarooprm feel free to share your thoughts.

@swarooprm
Contributor

I like this task, but I agree that noise is a concern.
From what I see, longer instances (both longer inputs and longer outputs) are less prone to noise, though I only checked a few instances.
Could you check whether this holds? If it does, we can filter instances based on length.

@Palipoor
Contributor Author

I checked again and I think there's noisy data in short instances too.

@swarooprm
Contributor

> I checked again and I think there's noisy data in short instances too.

What I meant above was to retain only the longer instances. Longer instances seem to contain less noise.
If you still see noise in the longer instances, that's OK; in that case I am fine with dropping this task.
Subject-wise filtering may be another option (some subjects, e.g. scientific discussions, may contain less noise).
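
A minimal sketch of the length-based filtering suggested here, in case it helps. It assumes each instance is a dict with an `input` string and an `output` list of strings; the function name `keep_long_instances` and both word-count thresholds are hypothetical placeholders, since no threshold has been agreed on in this thread.

```python
from typing import Dict, List


def keep_long_instances(
    instances: List[Dict],
    min_input_words: int = 15,   # hypothetical threshold, not from the thread
    min_output_words: int = 8,   # hypothetical threshold, not from the thread
) -> List[Dict]:
    """Keep only instances whose input and every output are long enough.

    Length is measured in whitespace-separated words.
    """
    kept = []
    for inst in instances:
        input_len = len(inst["input"].split())
        output_lens = [len(out.split()) for out in inst["output"]]
        # Drop instances with no outputs, a short input, or any short output.
        if not output_lens:
            continue
        if input_len >= min_input_words and min(output_lens) >= min_output_words:
            kept.append(inst)
    return kept
```

The thresholds would need tuning by inspecting a sample of what survives, because longer instances may still contain noise.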

@Palipoor
Contributor Author

Oh sorry, I didn't read it carefully!
I will check it tomorrow and update the task if it's ok.

@Palipoor
Contributor Author

Sorry for being late. I think it makes sense to keep the longer instances (not sure about the threshold though). Should I add other languages too?

@swarooprm
Contributor

> Sorry for being late. I think it makes sense to keep the longer instances (not sure about the threshold though). Should I add other languages too?

Yes, if you have time, feel free to add. It's also fine if you skip this and decide to focus on other ToDos we have in this project.

@danyaljj
Contributor

I agree with Swaroop. If cleaning up this PR will take more than an hour, I would say it's not worth it.

@danyaljj danyaljj added the onhold label Nov 4, 2021
@danyaljj danyaljj marked this pull request as draft November 4, 2021 21:44