TopicModeling: Change output for topic visualization #483

Merged · 4 commits · Jan 6, 2020
Binary file removed doc/widgets/images/Topic-Modelling-DataTable.png
Binary file removed doc/widgets/images/Topic-Modelling-Example2.png
34 changes: 22 additions & 12 deletions doc/widgets/topicmodelling-widget.md
@@ -11,14 +11,20 @@ Topic modelling with Latent Dirichlet Allocation, Latent Semantic Indexing or Hierarchical Dirichlet Process.

- Corpus: Corpus with topic weights appended.
- Topics: Selected topics with word weights.
- - All Topics: Topic weights by tokens.
+ - All Topics: Token weights per topic.

**Topic Modelling** discovers abstract topics in a corpus based on clusters of words found in each document and their respective frequency. A document typically contains multiple topics in different proportions, thus the widget also reports on the topic weight per document.

+ The widget wraps gensim's topic models ([LSI](https://radimrehurek.com/gensim/models/lsimodel.html), [LDA](https://radimrehurek.com/gensim/models/ldamodel.html), [HDP](https://radimrehurek.com/gensim/models/hdpmodel.html)).

+ The first, LSI, can return both positive and negative words (words that belong to a topic and words that do not), along with topic weights, which can likewise be positive or negative. As stated by gensim's lead developer, Radim Řehůřek: *"LSI topics are not supposed to make sense; since LSI allows negative numbers, it boils down to delicate cancellations between topics and there's no straightforward way to interpret a topic."*

+ LDA is easier to interpret, but slower than LSI. HDP has many parameters; the parameter that corresponds to the number of topics is *Top level truncation level (T)*. The smallest number of topics one can retrieve is 10.
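To see the difference concretely, here is a minimal gensim sketch (editor's illustration, not part of the documentation; the toy corpus is made up):

```python
# Minimal sketch of the gensim models the widget wraps; toy corpus only.
from gensim import corpora, models

docs = [["little", "tailor", "king"],
        ["king", "queen", "castle"],
        ["little", "frog", "well"]]
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

# LSI: word and topic weights may be negative
lsi = models.LsiModel(bow, id2word=dictionary, num_topics=2)
print(lsi.show_topics())

# LDA: word weights are non-negative probabilities, easier to interpret
lda = models.LdaModel(bow, id2word=dictionary, num_topics=2)
print(lda.show_topics())
```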

![](images/Topic-Modelling-stamped.png)

1. Topic modelling algorithm:
- - [Latent Semantic Indexing](https://en.wikipedia.org/wiki/Latent_semantic_analysis)
+ - [Latent Semantic Indexing](https://en.wikipedia.org/wiki/Latent_semantic_analysis). Returns both negative and positive words and topic weights.
- [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
- [Hierarchical Dirichlet Process](https://en.wikipedia.org/wiki/Hierarchical_Dirichlet_process)
2. Parameters for the algorithm. LSI and LDA accept only the number of topics modelled, with the default set to 10. HDP, however, has more parameters. As this algorithm is computationally very demanding, we recommend trying it on a subset of the data, or setting all the required parameters in advance and only then running the algorithm (connecting the input to the widget).
@@ -32,8 +38,10 @@ Topic modelling with Latent Dirichlet Allocation, Latent Semantic Indexing or Hierarchical Dirichlet Process.
3. Produce a report.
4. If *Commit Automatically* is on, changes are communicated automatically. Alternatively press *Commit*.

- Example
- -------
+ Examples
+ --------

+ #### Exploring Individual Topics

In the first example, we present a simple use of the **Topic Modelling** widget. First we load the *grimm-tales-selected.tab* data set and use [Preprocess Text](preprocesstext.md) to tokenize by words only and remove stopwords. Then we connect **Preprocess Text** to **Topic Modelling**, where we use simple *Latent Semantic Indexing* to find 10 topics in the text.

@@ -45,18 +53,20 @@ We then select the first topic and display the most frequent words in the topic

Now we can observe all the documents containing the word *little* in [Corpus Viewer](corpusviewer.md).

- In the second example, we will look at the correlation between topics and words/documents. Connect **Topic Modelling** to **Heat Map**. Ensure the link is set to *All Topics* - *Data*. **Topic Modelling** will output a matrix of topic weights by words from text (more precisely, tokens).
+ #### Topic Visualization

+ In the second example, we will look at the correlation between topics and words/documents. We are still using the *grimm-tales-selected.tab* corpus. In **Preprocess Text** we are using the default preprocessing, with an additional filter by *document frequency* (0.1–0.9). In **Topic Modelling** we are using an LDA model with 5 topics.

- We can observe the output in a **Data Table**. Tokens are in rows and retrieved topics in columns. Values represent how much a word is represented in a topic.
+ Connect **Topic Modelling** to **MDS**. Ensure the link is set to *All Topics* - *Data*. **Topic Modelling** will output a matrix of word weights by topic.

- ![](images/Topic-Modelling-DataTable.png)
+ In **MDS**, the points are now topics. We have set the size of the points to *Marginal topic probability*, which is an additional column of *All Topics*; it reports the marginal probability of the topic in the corpus (how strongly the topic is represented in the corpus).

- To visualize this matrix, open **Heat Map**. Select *Merge by k-means* and *Cluster* - *Rows* to merge similar rows into one and sort them by similarity, which makes the visualization more compact.
+ ![](images/Topic-Modelling-Example2-MDS.png)

- In the upper part of the visualization, we have words that highly define topics 1-3 and in the lower part those that define topics 5 and 10.
+ We can now explore which words are representative of the topic. Select, say, Topic 5 from the plot and connect **MDS** to **Box Plot**. Make sure the output is set to *Data* - *Data* (not *Selected Data* - *Data*).

- We can similarly observe topic representation across documents. We connect another **Heat Map** to **Topic Modelling** and set link to *Corpus* - *Data*. We set *Merge* and *Cluster* as above.
+ In **Box Plot**, set the subgroup to *Selected* and check the *Order by relevance to subgroups* box. This option sorts the variables by how well they distinguish between the subgroup values; in our case, that means which words are most representative of the topic selected in the plot (the subgroup *Yes* means selected).

- In this visualization we see how much a topic is represented in a document. It looks like Topic 1 is represented almost across the entire corpus, while other topics are more specific. To observe a specific set of documents, select either a clustering node or a row in the visualization, then pass the data to [Corpus Viewer](corpusviewer.md).
+ We can see that *little*, *children* and *kings* are the most representative words for Topic 5, with good separation between the word frequencies for this topic and all the others. Select other topics in **MDS** and see how the Box Plot changes.

- ![](images/Topic-Modelling-Example2.png)
+ ![](images/Topic-Modelling-Example2-BoxPlot.png)
22 changes: 20 additions & 2 deletions orangecontrib/text/tests/test_topic_modeling.py
@@ -24,9 +24,9 @@ def test_get_topic_table_by_id(self):
        self.assertFalse(any(topic1.W == np.nan))

    def test_get_all_topics(self):
-        self.model.fit(self.corpus)
+        self.model.fit_transform(self.corpus)
        topics = self.model.get_all_topics_table()
-        self.assertEqual(len(topics.domain), self.model.num_topics)
+        self.assertEqual(len(topics), self.model.actual_topics)

    def test_top_words_by_topic(self):
        self.model.fit(self.corpus)
@@ -59,6 +59,24 @@ def test_get_top_words(self):
        self.model.fit(self.corpus)
        self.assertRaises(ValueError, self.model.get_topics_table_by_id, 1000)

+    def test_marginal_probability(self):
+        tokens = [['a', 'b', 'c', 'd'],
+                  ['a', 'd', 'e'],
+                  ['e', 'c']]
+        doc_topics = np.array([[0.6, 0.1, 0.3],
+                               [0.2, 0.6, 0.2],
+                               [0.2, 0.3, 0.5]])
+        np.testing.assert_allclose(self.model._marginal_probability(
+            tokens, doc_topics),
+            [[0.37777778], [0.31111111], [0.31111111]])
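The expected values above can be verified by hand (editor's note): the three documents have 4, 3 and 2 tokens (9 in total), so the length weights are 4/9, 3/9 and 2/9:

```python
# Hand check of the expected column in the test above:
#   topic 1: 0.6*4/9 + 0.2*3/9 + 0.2*2/9 = 3.4/9 ≈ 0.37777778
#   topic 2: 0.1*4/9 + 0.6*3/9 + 0.3*2/9 = 2.8/9 ≈ 0.31111111
#   topic 3: 0.3*4/9 + 0.2*3/9 + 0.5*2/9 = 2.8/9 ≈ 0.31111111
```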

+    def test_existing_attributes(self):
+        """ doc_topic should not include existing X of corpus, just topics """
+        corpus = Corpus.from_file('election-tweets-2016')[:100]
+        self.model.fit_transform(corpus)
+        self.assertEqual(self.model.doc_topic.shape[1],
+                         self.model.actual_topics)


class LDATests(unittest.TestCase, BaseTests):
    def setUp(self):
36 changes: 29 additions & 7 deletions orangecontrib/text/topics/topics.py
@@ -34,6 +34,9 @@ def __init__(self, **kwargs):
        self.topic_names = []
        self.n_words = 0
        self.running = False
+        self.doc_topic = None
+        self.tokens = None
+        self.actual_topics = None

    def fit(self, corpus, **kwargs):
        """ Train the model with the corpus.
@@ -70,10 +73,13 @@ def update(self, documents):
    def transform(self, corpus):
        """ Create a table with topics representation. """
        topics = self.model[corpus.ngrams_corpus]
+        self.actual_topics = self.model.get_topics().shape[0]
        matrix = matutils.corpus2dense(topics, num_docs=len(corpus),
                                       num_terms=self.num_topics).T

-        corpus.extend_attributes(matrix[:, :len(self.topic_names)], self.topic_names)
+        corpus.extend_attributes(matrix[:, :self.actual_topics],
+                                 self.topic_names[:self.actual_topics])
+        self.doc_topic = matrix[:, :self.actual_topics]
+        self.tokens = corpus.tokens
        return corpus
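For context on `actual_topics` (editor's note): gensim can return fewer topics than requested, e.g. LSI when the rank of the term-document matrix is below `num_topics`. A minimal sketch with a made-up, low-rank corpus:

```python
# Illustrative only: LSI cannot return more topics than the matrix rank allows.
from gensim import corpora, models

docs = [["a", "b"], ["b", "c"], ["c", "a"]]
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

lsi = models.LsiModel(bow, id2word=dictionary, num_topics=10)
print(lsi.get_topics().shape[0])  # at most 3 here, not the 10 requested
```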

    def fit_transform(self, corpus, **kwargs):

@@ -105,6 +111,18 @@ def get_topics_table_by_id(self, topic_id):
        t.name = 'Topic {}'.format(topic_id + 1)
        return t

+    @staticmethod
+    def _marginal_probability(tokens, doc_topic):
+        """
+        Compute the marginal probability of a topic, that is, the
+        probability of a topic across all documents.
+
+        :return: np.array of marginal topic probabilities
+        """
+        doc_length = [len(i) for i in tokens]
+        doc_length[:] = [x / sum(doc_length) for x in doc_length]
+        return np.reshape(np.sum(doc_topic.T * doc_length, axis=1), (-1, 1))
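Equivalently (editor's restatement, not repo code): the returned column is a document-length-weighted average of the per-document topic weights:

```python
import numpy as np

def marginal_probability(tokens, doc_topic):
    # p(topic) = sum over documents d of (len(d) / total tokens) * doc_topic[d, topic]
    weights = np.array([len(t) for t in tokens], dtype=float)
    weights /= weights.sum()
    return (doc_topic.T @ weights).reshape(-1, 1)  # (n_topics, 1) column
```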

    def get_all_topics_table(self):
        """ Transform all topics from gensim model to table. """
        all_words = self._topics_words(self.n_words)

@@ -116,15 +134,19 @@ def get_all_topics_table(self):
        for words, weights in zip(all_words, all_weights):
            weights = [we for wo, we in sorted(zip(words, weights))]
            X.append(weights)
-        X = np.array(X).T
+        X = np.array(X)

-        attrs = [ContinuousVariable(n)
-                 for n in self.topic_names[:n_topics]]
+        # take only first n_topics; e.g. when user requested 10, but gensim
+        # returns only 9 — when the rank is lower than num_topics requested
+        names = np.array(self.topic_names[:n_topics])[:, None]
+        attrs = [ContinuousVariable(w) for w in sorted_words]
+        metas = [StringVariable('Topics'),
+                 ContinuousVariable('Marginal Topic Probability')]
+
+        topic_proba = self._marginal_probability(self.tokens, self.doc_topic)

-        t = Table.from_numpy(Domain(attrs, metas=[StringVariable('Word')]),
-                             X=X, metas=np.array(sorted_words)[:, None])
+        t = Table.from_numpy(Domain(attrs, metas=metas), X=X,
+                             metas=np.hstack((names, topic_proba)))
        t.name = 'All topics'
        return t
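The reshaped *All topics* output therefore has one row per topic rather than one row per word (editor's sketch of the layout; token names illustrative):

```python
# 'All topics' layout after this PR, for 3 topics over a 4-token vocabulary:
#
#   attributes (tokens):   cat    dog    king   queen
#   Topic 1                0.12   0.40   0.01   0.02
#   Topic 2                0.05   0.03   0.33   0.30
#   Topic 3                0.20   0.10   0.08   0.07
#
# metas: 'Topics' (the topic name) and 'Marginal Topic Probability'
```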

9 changes: 8 additions & 1 deletion orangecontrib/text/widgets/owtopicmodeling.py
@@ -8,7 +8,7 @@

from Orange.widgets import settings
from Orange.widgets import gui
-from Orange.widgets.widget import OWWidget, Input, Output
+from Orange.widgets.widget import OWWidget, Input, Output, Msg
from Orange.data import Table
from Orange.widgets.data.contexthandlers import DomainContextHandler
from orangecontrib.text.corpus import Corpus
@@ -134,6 +134,9 @@ class Outputs:

    control_area_width = 300

+    class Warning(OWWidget.Warning):
+        less_topics_found = Msg('Less topics found than requested.')

    def __init__(self):
        super().__init__()
        self.corpus = None
@@ -173,6 +176,7 @@ def __init__(self):

    @Inputs.corpus
    def set_data(self, data=None):
+        self.Warning.less_topics_found.clear()
        self.corpus = data
        self.apply()

@@ -208,6 +212,7 @@ def learning_task(self):

    @learning_task.on_start
    def on_start(self):
+        self.Warning.less_topics_found.clear()
        self.progressBarInit()
        self.topic_desc.clear()

@@ -224,6 +229,8 @@ def on_result(self, corpus):
        if self.__pending_selection:
            self.topic_desc.select(self.__pending_selection)
            self.__pending_selection = None
+        if self.model.actual_topics != self.model.num_topics:
+            self.Warning.less_topics_found()
        self.Outputs.all_topics.send(self.model.get_all_topics_table())

    @learning_task.callback
13 changes: 13 additions & 0 deletions orangecontrib/text/widgets/tests/test_owtopicmodeling.py
@@ -54,6 +54,19 @@ def until(widget=self.widget):
        np.testing.assert_allclose(m1[:, 1].astype(float),
                                   m2[:, 1].astype(float))

+    def test_all_topics_output(self):
+        # LSI produces 9 topics for deerwester, output should be 9
+        def until(widget=self.widget):
+            return bool(self.get_output(widget.Outputs.selected_topic,
+                                        widget=widget))
+
+        self.send_signal(self.widget.Inputs.corpus, self.corpus)
+        self.process_events(until)
+        output = self.get_output(self.widget.Outputs.all_topics)
+        self.assertEqual(len(output), self.widget.model.actual_topics)
+        self.assertEqual(output.metas.shape[1],
+                         self.widget.corpus.metas.shape[1] + 1)


if __name__ == "__main__":
    unittest.main()