Why scrape? #8

Open
gavanderhoorn opened this issue Apr 6, 2023 · 11 comments

@gavanderhoorn

Hi guys. Interesting project.

I was curious as to why you're using web scraping to get ROS Answers content? IIRC, there is support for exporting/dumping the database (using a web API) in a relatively usable format. That would seem to allow more convenient processing of it.

The dump / API access was used by @DLu to create the ROS Answers section of metrics.ros.org (source).

Perhaps he could say something as to whether that could also be made available for scientific research purposes.

@DLu

DLu commented Apr 10, 2023

Gladly: http://metrorobots.com/answers.db

@gavanderhoorn
Author

gavanderhoorn commented Apr 10, 2023

@DLu: does that also contain Q&A content? 170MB seems small for the entirety of ROS Answers?


Edit: looks like it does.

@DLu

DLu commented Apr 10, 2023

Sorta, the database structure is here: https://github.com/DLu/ros_metrics/blob/main/data/answers.yaml

The question title/summary is included.

The answer text is not.
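
To check which fields a local copy of the dump actually contains, a small sketch like the one below may help. It assumes the downloaded `.db` file is a SQLite database (the thread does not state the format explicitly), and `list_columns` is a hypothetical helper name, not part of ros_metrics.

```python
import sqlite3


def list_columns(db_path):
    """Return {table_name: [column names]} for every table in a SQLite file.

    Assumption: db_path points at a SQLite database, e.g. a downloaded
    copy of the dump discussed in this thread.
    """
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master lists every schema object; keep only the tables.
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        return {
            table: [col[1] for col in conn.execute(f'PRAGMA table_info("{table}")')]
            for table in tables
        }
    finally:
        conn.close()
```

Calling `list_columns("answers.db")` on a local copy should show whether a column for answer bodies exists, without having to read the YAML description.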

@gavanderhoorn
Author

@DLu wrote:

The answer text is not.

ah, hm.

So that might still need scraping then.

Would you know of a way to retrieve the answer bodies as well, without scraping? This must exist, right?

@DLu

DLu commented Apr 10, 2023

ASKBOT/askbot-devel#828

Been there, done that.

@DLu

DLu commented Apr 10, 2023

https://answers.ros.org/api/v1/answers/13122/
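
For anyone who wants to try this without scraping, here is a minimal sketch of fetching a single post through that endpoint. It assumes the endpoint is reachable and returns JSON; the response field names are not confirmed in this thread, so the code returns the decoded payload as-is. `api_url` and `fetch_post` are hypothetical helper names.

```python
import json
import urllib.request

# URL prefix taken from the link posted in this thread.
API_ROOT = "https://answers.ros.org/api/v1"


def api_url(kind, post_id):
    """Build the URL for one post; kind is 'questions' or 'answers'."""
    if kind not in ("questions", "answers"):
        raise ValueError(f"unsupported post kind: {kind!r}")
    return f"{API_ROOT}/{kind}/{int(post_id)}/"


def fetch_post(kind, post_id, timeout=10):
    """Download one post and decode it, assuming the body is JSON."""
    with urllib.request.urlopen(api_url(kind, post_id), timeout=timeout) as response:
        return json.load(response)
```

`fetch_post("answers", 13122)` would request the URL above. Inspecting the keys of the returned object is probably the quickest way to find where the answer body lives.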

@pcanelas
Member

Hello everyone,

I had no idea that this API existed, thank you so much @gavanderhoorn and @DLu!

@DLu I was wondering, I noticed in the database structure that it provides a summary of the question content and not the entire content of the question, and also the comments seem to be missing. Is it possible to also obtain this information using the API?

pcanelas self-assigned this Apr 12, 2023

@gavanderhoorn
Author

@DLu wrote:

Gladly: http://metrorobots.com/answers.db

@DLu: when was that .db created/copied/downloaded? I tried some toy SQL queries, but I can't get them to return the same numbers answers.ros.org shows.

Either my SQL is incorrect (very much possible) or the .db is not up to date?

@DLu

DLu commented May 19, 2023

@pcanelas wrote:

@DLu I was wondering, I noticed in the database structure that it provides a summary of the question content and not the entire content of the question

I think the field is just named summary, but it's actually the whole text. See https://answers.ros.org/api/v1/questions/408502/

@pcanelas wrote:

and also the comments seem to be missing. Is it possible to also obtain this information using the API?

Last I checked, no

@gavanderhoorn wrote:

@DLu: when was that .db created/copied/downloaded? Trying some toy SQL queries and I can't get it to return the same numbers answers.ros.org shows.

I would have guessed the beginning of April. How off are the numbers you're getting?

@gavanderhoorn
Author

Somewhat off-topic perhaps, but the following query (5184 is my user id):

select id from answers where user_id == 5184

returns 3479 rows for me. ROS Answers says (as of today) 3517.

I also can't get the total karma to match what ROS Answers shows, but that's not really important.
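
For reference, the same check can be run from Python as a sketch. It assumes the dump is a SQLite file with the `answers` table and `user_id` column used in the query above; `count_answers_by_user` is a hypothetical helper name. Using `COUNT(*)` returns the number directly instead of a list of ids.

```python
import sqlite3


def count_answers_by_user(conn, user_id):
    """Count rows in the answers table that belong to one user.

    Assumption: the connection points at a database with an `answers`
    table that has a `user_id` column, as in the query in this thread.
    """
    row = conn.execute(
        "SELECT COUNT(*) FROM answers WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0]
```

With a local copy, `count_answers_by_user(sqlite3.connect("answers.db"), 5184)` gives the figure to compare against what the ROS Answers profile page shows.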

@DLu

DLu commented May 19, 2023

My local copy says 3506, so it doesn't seem that far off. I'll believe that you have posted 11 answers since I updated the database.
