Hotfix #27 #28
api/src/helperClasses.js
deduplicateData (rows) {
  const dedupe = rows.filter((row, index) => {
    const _row = JSON.stringify(row);
    return index === rows.findIndex(obj => JSON.stringify(obj) === _row);
  });
  return dedupe;
}
This is an odd implementation: deduping should be an O(N) operation and should use a hash table somewhere, either a Set, an Object, or a Map, using the keys to deduplicate.
That being said, I thought we saw a clear fix? Why do you think changing the query would be so much slower? Generally speaking, it's always faster to do this kind of thing at the database level.
Also, as a consumer, getting the first 1000 rows slightly faster but getting the whole bucket slower doesn't help! I want the whole thing as fast as possible.
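The O(N) approach suggested above can be sketched with a Set keyed on each row's JSON serialization. This is a minimal illustration, not the PR's actual code; the standalone function name and row shape are assumptions for the example:

```javascript
// O(N) dedupe: one pass over the rows, with a Set as the hash table.
// Note: JSON.stringify keys assume consistent property order across rows.
function deduplicateData (rows) {
  const seen = new Set();
  return rows.filter(row => {
    const key = JSON.stringify(row);
    if (seen.has(key)) return false; // duplicate of a row we already kept
    seen.add(key);
    return true;
  });
}
```

Compared with the `filter` + `findIndex` version, which rescans the array for every row (O(N²)), each row here costs one Set lookup and one insert.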
@eric-gt - any thoughts on this?
So my original clear fix turned out to be unsupported in BigQuery: you can't simply SELECT DISTINCT or GROUP BY on a single column there. SELECT DISTINCT does not work on repeated columns (structs, arrays, etc.), so we would have to group on every single column being returned, which would make the query much more verbose and harder to modify (e.g. removing or adding a column would require edits in 4-5 different places in the query).
I can reimplement the deduplication function to use a hash table or take the plunge and rewrite the query to do the much more complex grouping operations.
@nandanrao let me know your thoughts on which course to take with this last change
@nandanrao the change to use an Object in the application code was only about an hour of work, so I went ahead and made the change. Unit and Postman checks of deduplication are passing.
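For reference, an Object-keyed version of the same O(N) idea might look like the sketch below. This is illustrative only, not the code actually merged in this PR:

```javascript
// O(N) dedupe using a plain Object as the hash table.
// Object.create(null) avoids collisions with prototype keys like "toString".
// Note: JSON.stringify keys assume consistent property order across rows.
function deduplicateData (rows) {
  const seen = Object.create(null);
  return rows.filter(row => {
    const key = JSON.stringify(row);
    if (key in seen) return false; // duplicate of a row we already kept
    seen[key] = true;
    return true;
  });
}
```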
deduplicate data received from database prior to formatting payload
closes #27