-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[backend] Adding a keyword filter #37
Conversation
…nce we don't use rustup...
(got error `error: error validating "k8s/overlays/local": error validating data: ValidationError(Deployment.spec.template.spec.containers): invalid type for io.k8s.api.core.v1.PodSpec.containers: got "map", expected "array"; if you choose to ignore these errors, turn validation off with --validate=false`)
WIP kustomize
to reduce the size of the DB and avoid saving all commits and changelogs
}) | ||
}); | ||
|
||
parse_texts(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if I understand correctly you're re-using the UpdateMetadata
field instead of creating a new field for this analysis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this is not about analysis yet, this is about saving storage space.
I thought the analysis would come later in the prioritization engine.
fn parse_texts(input: Result<UpdateMetadata>) -> Result<UpdateMetadata> { | ||
if input.is_err() { | ||
return input; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be done on the caller side. You can re-write the caller side as:
let data: UpdateMetadata = serde_json::from_slice(&output.stdout).map_err(|e| {
error!("{}", String::from_utf8_lossy(&output.stdout));
anyhow::Error::msg(e)
})?;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, better be foolproof in case someone changes the code later, no? I could do the check on the caller side as well, to exit earlier, agreed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how making the check here makes it foolproof. Accepting a Result
as argument is not really idiomatic in Rust : o
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I should change it to accept an UpdateMetadata directly, yeah
if std::env::var("RETAIN_ALL").is_ok() && std::env::var("RETAIN_ALL").unwrap() != "" { | ||
println!("DISABLED TEXT PARSING, RETAINING ALL DATA. $RETAIN_ALL={}", std::env::var("RETAIN_ALL").unwrap()); | ||
return input; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so two things:
- do we really want this RETAIN feature? Is it ever going to be useful if we don't really display that information to the user anyway?
- I think this
if
should be on the caller side. Actually, I think it shouldn't even live in this file as this priority engine logic, and this file is just about bindings to dependabot.
wdyt?
commit.message = "".to_string(); | ||
} | ||
} | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete the comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's left here for the review to have some food for thoughts:
- do we want to remove the entire commits, or do we want to just truncate uninteresting messages to the null string?
The first method is more effective to save storage space.
The second would allow us to keep the html_url
field of all commits... But since I'm not filtering the commits_url: Option<String>,
field, we already have all commits urls even these of the commits that were not "flagged".
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the first question we should ask ourselves is: do we want to keep these in? I know we can reduce storage by removing them, but we're still 10MB away from mongodb limit on documents so we can still ask ourselves if it's worth keeping. I don't necessary like having devs make decisions based on changelog and commit because these can be faked, and don't necessarily reflect the reality of what's on crates.io. But it can still be useful to have if we want to allow the dev to make a more informed decision on prioritization. wdyt?
If we do want to display it to the user, we can simply display the thing in the create PR/review page.
If we don't want to display it to the user, we can add serde::skip
to these fields to avoid storing in storage, but have still have the fields to do priority analysis on it (but save the analysis in another field).
|
||
fn flagged_text(text: &String) -> bool { | ||
for word in FLAGGED_WORDS { | ||
if text.contains(word) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's better if we use regexes, otherwise you'll get words that include the letters "rce" without being the actual "rce" word.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same for "sec". I predict we will have a lot of false positive : D
actually, it's fine to have false positives, but it'd be good to see how useful these keywords are. I'm wondering if it would be a good idea to show the user context around the words we grepped. I guess this is what you're doing already by only retaining commits or changelog that passed this test...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I thought "false positives aren't an issue". But we could move to regexes to be me accurate. As you want?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we could have a latter prioritization step that would give different weights to different words, since RCE is prolly worse than "bugfix"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we could have a latter prioritization step that would give different weights to different words, since RCE is prolly worse than "bugfix"
let's keep it simple for now : o but you can that add the idea for later
Yeah I thought "false positives aren't an issue". But we could move to regexes to be me accurate. As you want?
we can keep this for now and see how much false positives we get, from past experience I predict we'll too much but no need to optimize this now anyway
} | ||
} | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't know if we want to care about EOF, we don't allow a file that doesn't end with a linebreak in diem/diem
@@ -23,6 +27,7 @@ services: | |||
environment: | |||
- "GITHUB_TOKEN=$GITHUB_TOKEN" # an optional PAT for Github | |||
- "CARGO_HOME=/cargo" # used with a volume to persist cargo stuff | |||
- "RETAIN_ALL=$RETAIN_ALL" # to disable parsing of the commit and changelog messages |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the problem with this is that currently I imagine that we can prioritize things on the frontend side by checking if there are changelog/commits present in the UpdateMetadata
field. But if we enable this, then every update has these fields so we can't prioritize anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is good as a first pass. We should set some time to figure out how to show that on the frontend. I think we need to figure out these before merging:
- either remove
RETAIN_ALL
or figure out a way to keep the prioritization whenRETAIN_ALL
is enabled. I think using a different field thanUpdateMetadata
is a good idea: we could have acommit_words: Vec<String>
containing the words that were flagged. - move the filtering out of
dependanbot
. Perhaps directly in thepriority()
function - use regexes to parse commits/changelogs
I was thinking that we might have 2 different things:
This tries to address the latter as it's easy and it's better to save some storage from scratch, no? |
This is meant to reduce the size of the DB and avoid saving all commits and changelogs:
Fixes #33