-
-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Networks from Documents #509
Conversation
Codecov Report
@@ Coverage Diff @@
## master #509 +/- ##
==========================================
+ Coverage 63.84% 64.59% +0.75%
==========================================
Files 59 61 +2
Lines 6322 6598 +276
Branches 829 870 +41
==========================================
+ Hits 4036 4262 +226
- Misses 2151 2187 +36
- Partials 135 149 +14 |
This is cool! 🕸 I was just wondering whether it would make sense to add 'Title' as an additional string attribute to the Node Data? Currently, if title doesn't exist, it is generated and perhaps we should use this in the output. What do you think, @PrimozGodec. |
Oh, I hadn't noticed that title attribute is being automatically generated. We can definitely add it. I'm just wondering what should we name it because naming it 'Title' might cause ambiguity in cases where 'Title' is already part of corpus metas (e.g. in grimm-tales corpus)? |
I think Title, if already present in the corpus, will be title of the corpus anyway. I think the heuristic is the following: if something similar to 'Title' is in the corpus, use that, else figure out something remotely appropriate. |
Title is a property of the attribute, if I remember correctly and Corpus widget normally sets this, right @PrimozGodec? So just use whichever attribute is the 'title' attribute. |
@ajdapretnar you are correct title attribute is set by the corpus widget. By default, it checks if there is a title attribute in the domain, if not something else is set (heuristic). In the Corpus widget, the title attribute can be manually changed by the user - so it is not guaranteed that the Title attribute will be the title in the corpus. So we have the following two possible cases:
The document titles are retrieved from I do not understand where exactly do you want to add it to the network? In case when we have documents - shouldn't it be the name of the vertex (currently names are numbers)? |
orangecontrib/text/__init__.py
Outdated
@@ -4,6 +4,7 @@ | |||
os.environ['NLTK_DATA'] = nltk_data_dir() | |||
|
|||
from .corpus import Corpus | |||
from .corpus_to_network import CorpusToNetwork |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When someone does not have network addon installed this line fails and the complete addon is not shown in Orange. Can we avoid having this line here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a line from Louvain. I think that widget would only be missing an output, but we can easily hide the widget if Network is not installed. I think also Pubmed was hidden if one if its dependencies was missing.
try:
from orangecontrib.network.network import Network
except ImportError:
Network = None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh and yes, I think it should not be in init.
from Orange.widgets.utils.concurrent import ConcurrentWidgetMixin, TaskState | ||
from Orange.data import Table | ||
|
||
from orangecontrib.network import Network |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would make that widget does not show up when someone does not have a Network addon installed. Is it ok? @ajdapretnar do we have any similar case of widget that needs the other addon anywhere else? what is the practice in this case? Is it ok that widget just does not shows up or should it shows up and show an error - since it needs a network addon.
ticks = iter(np.linspace(0., 90., num_ticks)) | ||
for i in range(len(self.corpus)): | ||
for j in range(i + 1, len(self.corpus)): | ||
w = set(self.corpus.tokens[i]).intersection( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it ok that you use tokens here or should we use ngrams
? tokens
include preprocessed corpus without Ingram preprocessing step, while ngrams
includes all preprocessing.
if not isinstance(document_nodes, bool): | ||
raise ValueError("Document_nodes must be bool.") | ||
|
||
self.state = state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not nice to store a parameter that appears on every call as a class attribute. I would suggest passing it to functions as an additional parameter. Since I see that state is used only for interesting and progress I suggest not passing the complete state to this function but preferably a callback function, such it is done in embedding - it is the more standard way. It is nicer that the callback function is created at the widget side (in runner).
Btw. Vesna learned me an even better way to interrupt the process. Before I was always making that with passion true/false out of the callback. An easier way is that callback directly raises an error and that way stops the processing.
""" | ||
return self.document_items if document_nodes else self.word_items | ||
|
||
def _set_progress_or_stop(self, progress): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function should go to the widget and be passed to the script as a callback. This way script is more flexible for future use and changes
requirements.txt
Outdated
@@ -17,3 +17,4 @@ docx2txt>=0.6 | |||
lxml | |||
biopython # Enables Pubmed widget. | |||
ufal.udpipe >=1.2.0.3 | |||
Orange3-Network |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok if we add this requirement to the add-on then two of the above comments can be ignored. :P Do we really want to push the user to install the other addon just because of one widget? I would prefer one of the other options:
- the widget does not show when Orange3-network not installed (current situation) - but then users would write issues why this widget not shown
- the widget shows error and warns the user to install Orange3-network.
What do you think @ajdapretnar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The widget does not show. At least that was the current practice so far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So then this must be removed from requirements
While testing the widget I have noticed that button text does not change to stop when the input change, and then if I click on it, it restarts the network creation. One another question (I do not know network widget well). I was checking the network in the Network explorer widget. And I saw that there is an option to set labels, colors, shapes of nodes. Is it possible to add the attributes from the Corpus to the Network (when network from documents) that they are used for coloring, shapes, labels? |
@ajdapretnar I see that the heuristic exists but I asked the question because for example on 'grimm-tales' corpus, attribute Title is part of domain but @PrimozGodec The widget has two outputs, Network and Node Data so if you connect Node Data to the corresponding input to Network Explorer widget you'll be able to use all other corpus features and that's where we can also add Title attribute. |
@djukicn For me, grimm-tales shows the correct title... 🤷 |
The reason for having titles Document 1, Document2, ... is that in the dropdown of the Corpus widget Title is set to (none).
Ok, I didn't notice that. :) I have another idea but I do not know whether it is possible. The document network is created like this:
And nodes are just numbers. Is it possible to pass document tiles as nodes? |
4f57ca0
to
160acd7
Compare
@ajdapretnar I've added Title as node label as @PrimozGodec suggested. Titles property didn't work for me in terminal but works fine in widget. I also removed import from init and network add-on from requirements. @PrimozGodec The button should work properly on input change now. Script now works on ngrams instead of tokens. |
@lanzagar Is suggesting to handle the dependency on Network add-on differently. Similar to SQL widget in Orange. First, there's a try-except block for imports. If no Network add-on is installed, the widget remembers this. Then the GUI of the widget loads and when it does, it checks the above parameter, say Network=False.
If Network=None, the widget reports an error ('Install Network add-on to enable the widget.'), if it exists, it works. |
This would also solve the failing tests. |
9e65a44
to
4fabfca
Compare
4fabfca
to
7bd93ad
Compare
@ajdapretnar I made it that way. The widget is now shown even without network add-on. I'm not sure how could that solve failing tests so I added |
Issue
Implements #228
Description of changes
Includes