To play:
python main.py --csv-file raw_dataframe.csv
Then add flags to explore the parameter space:
--source-thresh SOURCE_THRESH Min % of events a news source must cover to be included.
Default 0.5; lowering this would include a broader set of news sources.
--min-article-length MIN_ARTICLE_LENGTH Min # words in an article (pre-parsing)
Set to 250. Are longer articles more biased?
--min-vocab-length MIN_VOCAB_LENGTH Min # words in an article (post-lemmatization and vectorization)
Set to 100. Are longer articles more biased?
--lda-min-appearances LDA_MIN_APPEARANCES Min # appearances of a word for it to be included in the vocabulary
Set to 2. Could raise this to focus on the most common words.
--lda-vectorization-type {count,tfidf} Type of vectorization to apply when converting articles to word counts.
Set to count. Not 100% sure tfidf is working, but if it is, we should use it (see the vectorization sketch after this list).
--lda-groupby {source,article} Run LDA on text separated by article, or by news source?
Set to article right now. This just means: what are the "documents" (sets of words) sent into LDA? Could be by article, or could aggregate over source (see the groupby sketch after this list).
--lda-topics LDA_TOPICS # of LDA topics
Set to 10. Clusters indicate that maybe a higher number could be helpful.
--lda-iters LDA_ITERS # of LDA iterations
Set to 1500. Probably could be lowered for larger datasets.
--truth-frequency-thresh TRUTH_FREQUENCY_THRESH % of articles in a news event that must mention a word for it to be treated as "truth" and removed.
Set to 0.5. Could be higher (e.g. 1.1, which would force no words to be removed) or lower (e.g. 0.1, which would remove most words and leave only infrequent words as bias candidates). Could also be implemented as a range, to say: bias words appear often, but not as often as truth words, and not as infrequently as random garbage. The filtering idea is sketched after this list.
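On the vectorization question: the two options correspond to plain term counts versus tf-idf weighting. Here's a minimal sketch of the difference, using scikit-learn purely for illustration (the app's actual vectorization code may differ):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the senator denied the allegations",
    "the senator angrily denied the shocking allegations",
]

# count: raw term frequencies per document.
count_matrix = CountVectorizer().fit_transform(docs)

# tfidf: term frequencies downweighted by how many documents contain the word,
# so words common across the whole corpus contribute less.
tfidf_matrix = TfidfVectorizer().fit_transform(docs)
```

One caveat: LDA is a generative model over raw word counts, so feeding it tf-idf weights is nonstandard, which may be part of why tfidf seems flaky here.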
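On the groupby question, here's what the two document definitions look like in pandas terms (column names are hypothetical, not necessarily the app's):

```python
import pandas as pd

df = pd.DataFrame({
    "source": ["cnn", "cnn", "fox"],
    "text": ["first article text", "second article text", "third article text"],
})

# --lda-groupby article: each article is its own LDA "document".
docs_by_article = df["text"].tolist()

# --lda-groupby source: concatenate each source's articles into one "document".
docs_by_source = df.groupby("source")["text"].apply(" ".join).tolist()
```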
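And a minimal sketch of the truth-word filtering described above (function and variable names are hypothetical, not the app's internals):

```python
from collections import Counter

def remove_truth_words(event_articles, thresh=0.5):
    """event_articles: list of token lists, all covering one news event.
    Drops any word mentioned by at least `thresh` of the articles, on the
    theory that widely shared words are "truth" and the rest carry bias."""
    n = len(event_articles)
    # Within-event document frequency: how many articles contain each word.
    doc_freq = Counter()
    for tokens in event_articles:
        doc_freq.update(set(tokens))
    truth_words = {w for w, c in doc_freq.items() if c / n >= thresh}
    # The range variant would instead keep only words whose frequency falls
    # between a low cutoff (drop garbage) and a high cutoff (drop truth).
    return [[w for w in tokens if w not in truth_words] for tokens in event_articles]
```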
Note that the internal app caching may over-cache. If you think that's happening, just run with the --force command-line flag to force the app to re-run all steps.
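Putting it together, a run that casts a wider net over sources, asks for more topics, and bypasses the cache might look like this (flag values are illustrative, not recommendations):
python main.py --csv-file raw_dataframe.csv --source-thresh 0.3 --lda-topics 20 --truth-frequency-thresh 0.5 --force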