This repository is part of my thesis study on detecting negation in a biomedical corpus with deep learning (DL) methods.

The repository has two Python scripts, which are responsible for:

Collecting the articles by simple web scraping of a specific website, https://dergipark.org.tr/tr/.
Converting the collected PDFs into text files for the pre-processing stage, in order to build a corpus.

The website is a subproject of an official scientific institution in Turkey. The medical/biomedical articles were collected from it by using the sub-branch URLs for the different departments of medicine.

Unfortunately, the URL logic does not make it straightforward to download the PDFs directly. This is solved by appending a constant URL part, "&section=articles&aggs%5BarticleType.id%5D%5B0%5D=55", in collect_articles.py.
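
As a rough illustration of the constant URL part described above, the sketch below appends it to a listing URL and decodes it to show what the filter contains. This is not the code from collect_articles.py: the base URL, the names `HYPOTHETICAL_BRANCH_URL`, `ARTICLES_FILTER`, `build_listing_url` and `decode_query` are assumptions made for this example, and the real sub-branch URLs may look different.

```python
from urllib.parse import parse_qs, urlsplit

# Constant URL part quoted in this README.
ARTICLES_FILTER = "&section=articles&aggs%5BarticleType.id%5D%5B0%5D=55"

# Hypothetical placeholder: the real sub-branch URLs are not listed here.
HYPOTHETICAL_BRANCH_URL = "https://dergipark.org.tr/tr/search?q=placeholder"


def build_listing_url(branch_url):
    """Append the constant filter part to a sub-branch URL.

    Assumes branch_url already contains a query string, because the
    constant part starts with '&'.
    """
    return branch_url + ARTICLES_FILTER


def decode_query(url):
    """Return the percent-decoded query parameters of a URL as a dict."""
    return parse_qs(urlsplit(url).query)


if __name__ == "__main__":
    url = build_listing_url(HYPOTHETICAL_BRANCH_URL)
    print(url)
    # The encoded key decodes to aggs[articleType.id][0].
    print(decode_query(url))
```

Decoding the part shows that it sets `section` to `articles` and `aggs[articleType.id][0]` to `55`, which presumably restricts the listing to one article type.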

To convert the PDFs into text files for post-editing, the PyPDF2 library is used. The files are not always converted into a clean format, either because of properties of the PDFs or because more advanced converter features would be needed. For this reason, the text files should be reviewed and corrected by hand or with some code snippets.
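
The conversion step could look roughly like the sketch below. It is not the repository's own script: it assumes a recent PyPDF2 release (2.x or 3.x) that provides `PdfReader` and `extract_text()`, and the function names `txt_path_for`, `pdf_to_text` and `convert_folder` as well as the folder names are hypothetical.

```python
from pathlib import Path


def txt_path_for(pdf_path, out_dir):
    """Return the output .txt path for a given PDF (same file stem)."""
    return Path(out_dir) / (Path(pdf_path).stem + ".txt")


def pdf_to_text(pdf_path):
    """Extract the text of every page and join the pages with newlines."""
    # Imported here so that only this function depends on PyPDF2.
    from PyPDF2 import PdfReader

    reader = PdfReader(str(pdf_path))
    # extract_text() may return an empty result for scanned or image-only pages.
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def convert_folder(pdf_dir, out_dir):
    """Convert every *.pdf file in pdf_dir into a UTF-8 text file in out_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        text = pdf_to_text(pdf_path)
        txt_path_for(pdf_path, out_dir).write_text(text, encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical folder names for illustration.
    convert_folder("pdfs", "texts")
```

Writing the output as UTF-8 keeps Turkish characters intact, but the extracted text will usually still need the manual or regex-based review described above.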

For the automated corrections:

The clean_text(input_text) function handles the correction of typos that can be fixed automatically with regular expressions from the 're' library.
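
The exact rules inside clean_text(input_text) are not listed here, so the following is only a sketch of the kind of regex clean-up that is often useful for text extracted from PDFs. The function name `clean_text_sketch` and all four rules are assumptions for illustration, not the repository's implementation.

```python
import re


def clean_text_sketch(input_text):
    """Apply a few generic regex fixes to text extracted from a PDF."""
    # Re-join words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", input_text)
    # Turn single line breaks inside a paragraph into spaces.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    # Normalise two or more line breaks to exactly one blank line.
    text = re.sub(r" ?\n{2,} ?", "\n\n", text)
    return text.strip()


if __name__ == "__main__":
    sample = "nega-\ntion  detec-\ntion in\ntext\n\n\nnext  paragraph "
    print(clean_text_sketch(sample))
```

The hyphen rule also joins compounds that were legitimately hyphenated at a line end, so the output still benefits from a manual check.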

zanasgt/python-web-scraping-for-dataset-preparation

Folders and files

Latest commit

History

Repository files navigation

Repisotory has two py script which is responsible to

For the automated corrections:

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages