Full HTML website #9

Open
alexandruvesa opened this issue Sep 11, 2023 · 1 comment

Comments

@alexandruvesa

Hello!
First of all, thanks for sharing the code with us!

Do you have any idea how to extend your code to process a whole website?
For example, to extract the content of a website that has ~107,000 tokens.

Thanks,
Alex

@GianfrancoCorrea
Collaborator

GianfrancoCorrea commented Sep 20, 2023

Hi @mediflux95!
I have thought about this many times. There are 2 main steps:
1- The first input takes an example of the HTML to extract from the website, for example an item from an Amazon store. From it, the GPT bot creates an expected output format (a JSON with the relevant values of the example item) and then generates the scraping code.
2- The second input takes the whole HTML and tests the generated scraping code against it.

The first input is hard to process automatically because of the number of tokens, but we can replace the second input with a field for pasting the URL, so you can run the code without copying and pasting the whole HTML.
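
As a rough illustration of the URL idea, here is a minimal Python sketch that downloads the page and hands the HTML to a scraper function. It is not the project's code: `fetch_html`, `scrape_url` and `example_scraper` are hypothetical names, and `example_scraper` is only a stand-in for whatever code the bot generates (here it just collects the text of every `<h2>`).

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10.0):
    """Download a page and return its HTML as text (no JavaScript is executed)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class _H2Collector(HTMLParser):
    """Stand-in for the generated scraper: collects the text of every <h2>."""

    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self.titles[-1] += data


def example_scraper(html):
    """Hypothetical generated scraper: HTML text in, list of titles out."""
    parser = _H2Collector()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


def scrape_url(url, scraper, fetch=fetch_html):
    """Second step driven by a URL: fetch the page, then run the scraper on it."""
    return scraper(fetch(url))
```

Calling `scrape_url("https://example.com", example_scraper)` would fetch the page over the network. The `fetch` argument is there so the download step can be swapped out, for example for a browser-based fetch on client-side rendered pages.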

The problem we may face is pages with client-side rendering. I know there are some paid services for this, but I think packages like Selenium can also help.
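
For the client-side rendering case, a minimal Selenium sketch could look like the following. It assumes the third-party `selenium` package and a local Chrome install; `render_html` and `_headless_chrome` are hypothetical helper names, not part of this project. Reading `page_source` right after `get` may still miss content that loads later, so a real version would probably need an explicit wait for a known element.

```python
def _headless_chrome():
    """Create a headless Chrome driver (assumes `pip install selenium` and Chrome)."""
    from selenium import webdriver  # imported lazily so the module loads without it

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)


def render_html(url, driver_factory=_headless_chrome):
    """Load the URL in a real browser and return the HTML after scripts have run."""
    driver = driver_factory()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()
```

The result of `render_html(url)` is plain HTML text, so it can be passed to the generated scraping code in the same way as HTML pasted by hand.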

Anyway, I would like the whole process to work from just the URL, but I haven't figured out yet how to handle the first step.
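
One possible direction for the token problem, only as an untested idea: shrink the HTML before it goes to the model by dropping `<script>`, `<style>` and similar blocks and keeping only a few attributes. The sketch below uses Python's standard `html.parser`; `slim_html`, `rough_token_count`, and the choice of tags and attributes to keep are all assumptions, and the 4-characters-per-token figure is a rough rule of thumb, not a real tokenizer.

```python
import html
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "noscript", "svg"}  # assumed to be irrelevant for scraping
KEEP_ATTRS = {"id", "class"}                        # assumed to be enough to build selectors


class _Slimmer(HTMLParser):
    """Rebuilds the document without skipped blocks and without most attributes."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return
        kept = "".join(
            f' {name}="{html.escape(value)}"'
            for name, value in attrs
            if name in KEEP_ATTRS and value
        )
        self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
            return
        if not self._skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self._skip_depth:
            self.parts.append(html.escape(text, quote=False))


def slim_html(page):
    """Return a smaller version of the page that keeps tags, ids, classes and text."""
    slimmer = _Slimmer()
    slimmer.feed(page)
    slimmer.close()
    return "".join(slimmer.parts)


def rough_token_count(text):
    """Very rough estimate (about 4 characters per token); not a real tokenizer."""
    return len(text) // 4
```

Comparing `rough_token_count(page)` with `rough_token_count(slim_html(page))` gives a quick feel for how much a given site shrinks. Whether the result fits in the model's context still depends on the page.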

Feel free to propose ideas to implement it if you are interested.

Regards,
Gian
