Recover Reddit text using get_text.py fails for gdtb/disrpt #197

Open
GuifuLiu opened this issue Dec 9, 2024 · 2 comments


GuifuLiu commented Dec 9, 2024

I ran get_text.py. The output is:

o Restored text for DISRPT format in 6 .tok files and 6 .rels files in .../gum//rst/disrpt/
o Restored text for DISRPT format in 0 .tok files and 235 .rels files in .../gum//rst/gdtb/disrpt/
o Processing 18 files in .../gum//rst/gdtb/pdtb/raw/00/...
o Processing 18 files in .../gum//const/...

I can see the Reddit text recovered in the eng.pdtb.gum_... files in .../gum//rst/disrpt/, but not in .../gum//rst/gdtb/disrpt/, even though the files there changed after I ran get_text.py. The files in that directory still contain dashes.
However, the text is recovered in the full-text files in .../gum//rst/gdtb/pdtb/raw

@amir-zeldes
Owner

Hm, you're right, I can reproduce this bug - thanks for reporting it! I should be able to push a fix to the dev branch soon. A new stable release of GUM is expected in early winter, so I would probably wait with merging the fix for a little while longer.

amir-zeldes added a commit that referenced this issue Dec 13, 2024
  * was failing if running from root before process_reddit.py has been run since src/dep/ is not yet restored
  * now checks for restored dep files and uses top level dep/ instead if running from get_text.py
  * fixes #197
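
The fallback described in the commit message can be pictured roughly as follows. This is only an illustrative sketch, not the code from the actual commit: the function name `choose_dep_dir`, the `looks_restored` callback, and the `*.conllu` file pattern are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable


def choose_dep_dir(
    src_dep: Path,
    top_dep: Path,
    looks_restored: Callable[[Path], bool],
) -> Path:
    """Pick the dependency directory to read restored text from.

    Hypothetical helper: prefers src_dep when it exists and every
    dependency file in it passes the caller-supplied looks_restored
    check; otherwise falls back to the top-level top_dep directory.
    """
    # Assumption: dependency files use the .conllu extension.
    candidates = sorted(src_dep.glob("*.conllu")) if src_dep.is_dir() else []

    # An empty or missing src_dep is treated as "not yet restored".
    if candidates and all(looks_restored(path) for path in candidates):
        return src_dep
    return top_dep
```

How a file is judged to be restored is left to the caller here, since the thread does not say how the real script performs that check.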
@amir-zeldes
Owner

OK, you should be able to use this fix. Either pull from the dev branch or just patch the file _build/utils/get_reddit/underscores_disrpt.py based on dev.

Leaving this issue open until the next stable release.
