download internal hyperlinks #498
I don't remember the UI being able to perform this, no. But this command line tool here (namely the
Thanks :) I tried to use it and downloaded the csv file, but this file is not complete the way the gexf format is: I would have to convert it to gexf and it would not match that format, especially when I want to use it with NetworkX, where it would not look like the original format. Can I download it as gexf format?
I don't think you can. But the CSV file should be very easy to load as a graph using networkx with some data wrangling. Otherwise you can also try this tool: https://medialab.github.io/table2net/
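If it helps, here is a minimal sketch of that data wrangling, assuming a CSV export with hypothetical `source` and `target` columns (the column names are assumptions, adjust them to whatever your export actually contains):

```python
import csv
import networkx as nx

# Build a directed graph from a CSV of links.
# "source" and "target" are placeholder column names:
# replace them with the actual headers of your export.
G = nx.DiGraph()
with open("pages.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        G.add_edge(row["source"], row["target"])

# Write the result as GEXF so it can be opened in Gephi or reloaded later.
nx.write_gexf(G, "pages.gexf")
```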
Thanks. Well, I tried to load the csv with networkx, but the main problem is that the csv file only has the source links that were crawled, not the targets that each web page points to, so it is just the webpage nodes without the paths or targets. Do you know how I can download them or include them in the csv file? Thanks :)
Hello there, I'm not sure I understand exactly your needs: are you trying to build a graph of links between webpages and not between websites?
Thank you. Well, I would like to download these webpage corpora one by one, the internal hyperlinks corpora. For instance, I have a website that I crawled, so the internal network within that website can be downloaded manually as a gexf file; I mean the internal hyperlinks can be downloaded one by one. I need to download all these gexf files, but I have too many webpages to crawl, and downloading each gexf file manually takes time. Thanks :)
It might be better to put it like this: I can't download the internal hyperlinks. With minet as well, I tried to set ignore_internal_links=True, but it didn't work. I'm trying to download the internal hyperlinks and nodes in 'gexf' format, but for a large amount of data I can't do this manually.
I mean the data inside '.../webentityPagesNetwork/...': I can't download it all, I can only get the csv for the nodes (pages), but not the hyperlinks between them.
I repeat my question: what is your methodological need? Using Hyphe to work with inner links is like using an anvil instead of a hammer. If you're looking to build a network of webpages in general, Hyphe is not the right tool for this; there are ways to collect that data, but it would be way more straightforward for you to do it with other tools (minet has great crawl and link extraction tools for this, for instance, and issuecrawler or socscibot might also be good options with graphical interfaces). Although if you have needs relative to the aggregation of those webpages into groups such as Hyphe's webentities, then Hyphe might be adapted, but in that case I'm not sure what you expect from gathering the whole detailed links page by page and I'd be curious to understand.
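Just to give a rough idea of the page-level alternative, here is a sketch, not Hyphe's own export, assuming you only need internal links starting from a list of seed pages (requests and BeautifulSoup are my own picks here; minet's crawler would be the more robust option):

```python
from urllib.parse import urljoin, urlparse

import networkx as nx
import requests
from bs4 import BeautifulSoup

# Hypothetical seed list: the pages you would otherwise crawl in Hyphe.
seeds = ["https://example.com/"]

G = nx.DiGraph()
for page in seeds:
    html = requests.get(page, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        target = urljoin(page, a["href"])
        # Keep only internal links, i.e. links staying on the same host.
        if urlparse(target).netloc == urlparse(page).netloc:
            G.add_edge(page, target)

# One GEXF file for the whole page-to-page network instead of one file per page.
nx.write_gexf(G, "internal_links.gexf")
```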
Is there any access point or button to download all the internal hyperlinks in a gexf file at the same time?
For instance, I have 100 urls to crawl, so for each of them I would have to go manually and download its own gexf file containing its internal hyperlinks. So I wonder, is there any way to download the internal hyperlinks of all these 100 urls at the same time?
Thanks :)