Hotfix #27 #28

Closed
wants to merge 4 commits into from

Conversation

eric-gt (Contributor) commented on Oct 18, 2021

deduplicate data received from the database prior to formatting the payload
closes #27

eric-gt added the bug label on Oct 18, 2021
eric-gt requested a review from nandanrao on October 18, 2021 at 13:47
eric-gt self-assigned this on Oct 18, 2021
eric-gt marked this pull request as draft on October 18, 2021 at 13:49
eric-gt marked this pull request as ready for review on October 18, 2021 at 14:05
deduplicateData (rows) {
  // O(N^2): findIndex rescans the whole array for every row
  const dedupe = rows.filter((row, index) => {
    const _row = JSON.stringify(row);
    return index === rows.findIndex(obj => JSON.stringify(obj) === _row);
  });
  return dedupe;
}

Collaborator:

This is an odd implementation: deduping should be an O(N) operation and should use a hash table somewhere, either a set, an object, or a map, using the keys to deduplicate.

That being said, I thought we saw a clear fix? Why do you think changing the query would be so much slower? Generally speaking, it's always faster to do this kind of thing at the database level.

Also, as a consumer, getting the first 1000 slightly faster but getting the whole bucket slower doesn't help! I want the whole thing as fast as possible.
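
For reference, a minimal sketch of the O(N) hash-table approach suggested above, using a Set keyed on the serialized row (the standalone function and its name are illustrative, not from the PR):

// Single pass: keep a row only the first time its serialized form is seen.
// Note: JSON.stringify keys assume consistent property order across rows,
// the same assumption the original filter/findIndex version makes.
function deduplicateRows (rows) {
  const seen = new Set();
  return rows.filter(row => {
    const key = JSON.stringify(row);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}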

Collaborator:

@eric-gt - any thoughts on this?

eric-gt (Author):

So my original clear fix turned out to be unsupported in BigQuery. Its SQL implementation doesn't let you SELECT DISTINCT on a single column or GROUP BY on a single column: SELECT DISTINCT does not work on repeated columns (structs, arrays, etc.), so we would have to group on every single column being returned, which would make the query much more verbose and hard to modify (e.g. removing or adding a column would require edits in 4-5 different places in the query).
I can reimplement the deduplication function to use a hash table, or take the plunge and rewrite the query to do the much more complex grouping operations.

eric-gt (Author):

@nandanrao, let me know your thoughts on which course to take with this last change.

eric-gt (Author):

@nandanrao the change to use an Object in the application code was only about an hour of work, so I went ahead and made it. Unit and Postman checks of the deduplication are passing.
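
The final diff isn't quoted in this conversation, so this is only a sketch of what the Object-based version might look like, assuming the same JSON-serialized row keys as the original filter:

deduplicateData (rows) {
  // Plain Object as a hash table: later duplicates overwrite earlier
  // entries under the same key, so each distinct row survives exactly once.
  const seen = {};
  for (const row of rows) {
    seen[JSON.stringify(row)] = row;
  }
  return Object.values(seen);
}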

eric-gt closed this on Jan 6, 2022
Labels: bug (Something isn't working)

Successfully merging this pull request may close these issues:
Not getting the same number of events for a set of users with different attribution_ids