Explore the parameter space to try and improve results. #11

Open
bcipolli opened this issue Jun 25, 2017 · 1 comment

@bcipolli (Owner)

To play: run python main.py --csv-file raw_dataframe.csv, then add flags to explore the parameter space (a sample sweep script follows the list below):

  • --source-thresh SOURCE_THRESH Min % of events a news source must cover, to be included.
    Default 0.5; lowering this would include a broader set of news sources.

  • --min-article-length MIN_ARTICLE_LENGTH Min # words in an article (pre-parsing)
    Set to 250. Are longer articles more biased?

  • --min-vocab-length MIN_VOCAB_LENGTH Min # words in an article (post-lemmatizing, vectorizing)
    Set to 100. Are longer articles more biased?

  • --lda-min-appearances LDA_MIN_APPEARANCES Min # appearances of a word to be included in the vocabulary.
    Set to 2. Could raise this to focus on the most common words.

  • --lda-vectorization-type {count,tfidf} Type of vectorization to apply when converting articles to word counts.
    Set to count. Not 100% sure tfidf is working, but if it is, we should use it.

  • --lda-groupby {source,article} Run LDA on text grouped by article, or by news source?
    Set to article right now. This just determines what the "documents" (sets of words) sent into LDA are: each article on its own, or all articles aggregated by source.

  • --lda-topics LDA_TOPICS # of LDA topics
    Set to 10. The clusters suggest that a higher number could be helpful.

  • --lda-iters LDA_ITERS # of LDA iterations
    Set to 1500. Could probably be lowered for larger datasets.

  • --truth-frequency-thresh TRUTH_FREQUENCY_THRESH % of articles in a news event that must mention a word for it to be considered "truth" and removed.
    Set to 0.5. Could be higher (e.g. 1.1 - force no words to be removed) or lower (e.g. 0.1 - remove most words and leave only infrequent words as bias candidates). Could also be implemented as a range, to say: bias words appear often, but not as often as truth words, and not as infrequently as random garbage.
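
One way to explore a few of these at once is a small sweep script that invokes main.py once per parameter combination. This is only an illustrative sketch: the flag names come from the list above, but the value grids and the log-file naming are arbitrary choices, not part of the app.

    import itertools
    import subprocess

    # Sweep two of the flags described above; the values are arbitrary examples.
    lda_topics = [10, 20, 30]
    truth_threshs = [0.1, 0.5, 1.1]

    for topics, thresh in itertools.product(lda_topics, truth_threshs):
        cmd = [
            "python", "main.py",
            "--csv-file", "raw_dataframe.csv",
            "--lda-topics", str(topics),
            "--truth-frequency-thresh", str(thresh),
        ]
        print("Running:", " ".join(cmd))
        # Keep each run's output in its own log file, for later comparison.
        log_name = "run_topics{}_thresh{}.log".format(topics, thresh)
        with open(log_name, "w") as log:
            subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=False)

Swapping in other flags (e.g. --lda-vectorization-type or --lda-groupby) works the same way.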

@bcipolli (Owner, Author)

Note that the internal app caching may over-cache. If you think that's happening, just run with the --force command-line flag to force the app to re-run all steps.
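
For example, the same invocation as in the first comment, with the cache bypassed:

    python main.py --csv-file raw_dataframe.csv --force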
