Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Networks from Documents #509

Merged
merged 1 commit into from
Apr 23, 2020
Merged

Conversation

djukicn
Copy link
Collaborator

@djukicn djukicn commented Mar 30, 2020

Issue

Implements #228

Description of changes
  • Script and widget for creating networks from corpus
  • Tests
Includes
  • Code changes
  • Tests
  • Documentation

@codecov-io
Copy link

Codecov Report

Merging #509 into master will increase coverage by 0.75%.
The diff coverage is 82.07%.

@@            Coverage Diff             @@
##           master     #509      +/-   ##
==========================================
+ Coverage   63.84%   64.59%   +0.75%     
==========================================
  Files          59       61       +2     
  Lines        6322     6598     +276     
  Branches      829      870      +41     
==========================================
+ Hits         4036     4262     +226     
- Misses       2151     2187      +36     
- Partials      135      149      +14     

@ajdapretnar
Copy link
Collaborator

This is cool! 🕸

I was just wondering whether it would make sense to add 'Title' as an additional string attribute to the Node Data? Currently, if title doesn't exist, it is generated and perhaps we should use this in the output. What do you think, @PrimozGodec.

@djukicn
Copy link
Collaborator Author

djukicn commented Mar 31, 2020

Oh, I hadn't noticed that title attribute is being automatically generated. We can definitely add it. I'm just wondering what should we name it because naming it 'Title' might cause ambiguity in cases where 'Title' is already part of corpus metas (e.g. in grimm-tales corpus)?

@ajdapretnar
Copy link
Collaborator

I think Title, if already present in the corpus, will be title of the corpus anyway. I think the heuristic is the following: if something similar to 'Title' is in the corpus, use that, else figure out something remotely appropriate.

@ajdapretnar
Copy link
Collaborator

Title is a property of the attribute, if I remember correctly and Corpus widget normally sets this, right @PrimozGodec? So just use whichever attribute is the 'title' attribute.

@PrimozGodec
Copy link
Collaborator

PrimozGodec commented Apr 14, 2020

@ajdapretnar you are correct title attribute is set by the corpus widget. By default, it checks if there is a title attribute in the domain, if not something else is set (heuristic). In the Corpus widget, the title attribute can be manually changed by the user - so it is not guaranteed that the Title attribute will be the title in the corpus.

So we have the following two possible cases:

  • the document titles are set as values of one of the meta attributes (in the case when it is set by the corpus widget)
  • the document titles are not set (option (no title) ) in corpus widget - in this case document titles are generated in the corpus, they will be (Document 1, Document 2, ...)

The document titles are retrieved from corpus.titles property - it returns a list with title for each document.

I do not understand where exactly do you want to add it to the network? In case when we have documents - shouldn't it be the name of the vertex (currently names are numbers)?

@@ -4,6 +4,7 @@
os.environ['NLTK_DATA'] = nltk_data_dir()

from .corpus import Corpus
from .corpus_to_network import CorpusToNetwork
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When someone does not have network addon installed this line fails and the complete addon is not shown in Orange. Can we avoid having this line here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a line from Louvain. I think that widget would only be missing an output, but we can easily hide the widget if Network is not installed. I think also Pubmed was hidden if one if its dependencies was missing.

try:
    from orangecontrib.network.network import Network
except ImportError:
    Network = None

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh and yes, I think it should not be in init.

from Orange.widgets.utils.concurrent import ConcurrentWidgetMixin, TaskState
from Orange.data import Table

from orangecontrib.network import Network
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would make that widget does not show up when someone does not have a Network addon installed. Is it ok? @ajdapretnar do we have any similar case of widget that needs the other addon anywhere else? what is the practice in this case? Is it ok that widget just does not shows up or should it shows up and show an error - since it needs a network addon.

ticks = iter(np.linspace(0., 90., num_ticks))
for i in range(len(self.corpus)):
for j in range(i + 1, len(self.corpus)):
w = set(self.corpus.tokens[i]).intersection(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok that you use tokens here or should we use ngrams? tokens include preprocessed corpus without Ingram preprocessing step, while ngrams includes all preprocessing.

if not isinstance(document_nodes, bool):
raise ValueError("Document_nodes must be bool.")

self.state = state
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not nice to store a parameter that appears on every call as a class attribute. I would suggest passing it to functions as an additional parameter. Since I see that state is used only for interesting and progress I suggest not passing the complete state to this function but preferably a callback function, such it is done in embedding - it is the more standard way. It is nicer that the callback function is created at the widget side (in runner).

Btw. Vesna learned me an even better way to interrupt the process. Before I was always making that with passion true/false out of the callback. An easier way is that callback directly raises an error and that way stops the processing.

"""
return self.document_items if document_nodes else self.word_items

def _set_progress_or_stop(self, progress):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function should go to the widget and be passed to the script as a callback. This way script is more flexible for future use and changes

requirements.txt Outdated
@@ -17,3 +17,4 @@ docx2txt>=0.6
lxml
biopython # Enables Pubmed widget.
ufal.udpipe >=1.2.0.3
Orange3-Network
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok if we add this requirement to the add-on then two of the above comments can be ignored. :P Do we really want to push the user to install the other addon just because of one widget? I would prefer one of the other options:

  • the widget does not show when Orange3-network not installed (current situation) - but then users would write issues why this widget not shown
  • the widget shows error and warns the user to install Orange3-network.

What do you think @ajdapretnar

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The widget does not show. At least that was the current practice so far.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So then this must be removed from requirements

@PrimozGodec
Copy link
Collaborator

While testing the widget I have noticed that button text does not change to stop when the input change, and then if I click on it, it restarts the network creation.

One another question (I do not know network widget well). I was checking the network in the Network explorer widget. And I saw that there is an option to set labels, colors, shapes of nodes. Is it possible to add the attributes from the Corpus to the Network (when network from documents) that they are used for coloring, shapes, labels?

@djukicn
Copy link
Collaborator Author

djukicn commented Apr 15, 2020

@ajdapretnar I see that the heuristic exists but I asked the question because for example on 'grimm-tales' corpus, attribute Title is part of domain but corpus.titles property still outputs list of generic names (i.e. Document i).

@PrimozGodec The widget has two outputs, Network and Node Data so if you connect Node Data to the corresponding input to Network Explorer widget you'll be able to use all other corpus features and that's where we can also add Title attribute.

@ajdapretnar
Copy link
Collaborator

@djukicn For me, grimm-tales shows the correct title... 🤷

@PrimozGodec
Copy link
Collaborator

@ajdapretnar I see that the heuristic exists but I asked the question because for example on 'grimm-tales' corpus, attribute Title is part of domain but corpus.titles property still outputs list of generic names (i.e. Document i).

The reason for having titles Document 1, Document2, ... is that in the dropdown of the Corpus widget Title is set to (none).

@PrimozGodec The widget has two outputs, Network and Node Data so if you connect Node Data to the corresponding input to Network Explorer widget you'll be able to use all other corpus features and that's where we can also add Title attribute.

Ok, I didn't notice that. :) I have another idea but I do not know whether it is possible. The document network is created like this:

    self.document_network = Network(nodes=np.arange(len(self.corpus)),
                                    edges=csr_matrix(edges),
                                    name='Document Network')

And nodes are just numbers. Is it possible to pass document tiles as nodes?

@djukicn
Copy link
Collaborator Author

djukicn commented Apr 22, 2020

@ajdapretnar I've added Title as node label as @PrimozGodec suggested. Titles property didn't work for me in terminal but works fine in widget. I also removed import from init and network add-on from requirements.

@PrimozGodec The button should work properly on input change now. Script now works on ngrams instead of tokens.

@ajdapretnar
Copy link
Collaborator

@lanzagar Is suggesting to handle the dependency on Network add-on differently. Similar to SQL widget in Orange.

First, there's a try-except block for imports. If no Network add-on is installed, the widget remembers this.

Then the GUI of the widget loads and when it does, it checks the above parameter, say Network=False.

try:
    from orangecontrib.network.network import Network
except ImportError:
    Network = None

If Network=None, the widget reports an error ('Install Network add-on to enable the widget.'), if it exists, it works.

@ajdapretnar
Copy link
Collaborator

This would also solve the failing tests.

@djukicn djukicn force-pushed the document-network branch 2 times, most recently from 9e65a44 to 4fabfca Compare April 22, 2020 16:48
@djukicn
Copy link
Collaborator Author

djukicn commented Apr 22, 2020

@ajdapretnar I made it that way. The widget is now shown even without network add-on. I'm not sure how could that solve failing tests so I added pip install Orange3-Network to .travis.yml as @PrimozGodec suggested.

@ajdapretnar ajdapretnar merged commit 07197be into biolab:master Apr 23, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants