Hotfix #27 #28

Closed
wants to merge 4 commits into from

Conversation

eric-gt (Contributor) commented on Oct 18, 2021

deduplicate data received from the database prior to formatting the payload
closes #27

eric-gt added the bug label on Oct 18, 2021
eric-gt requested a review from nandanrao on October 18, 2021 at 13:47
eric-gt self-assigned this on Oct 18, 2021
eric-gt marked this pull request as draft on October 18, 2021 at 13:49
eric-gt marked this pull request as ready for review on October 18, 2021 at 14:05
deduplicateData (rows) {
  // O(N^2): findIndex rescans the whole array for every row
  const dedupe = rows.filter((row, index) => {
    const _row = JSON.stringify(row);
    return index === rows.findIndex(obj => JSON.stringify(obj) === _row);
  });
  return dedupe;
}

Collaborator:

This is an odd implementation: deduping should be an O(N) operation and should use a hash table somewhere, either a set, an object, or a map, using the keys to deduplicate.

That being said, I thought we saw a clear fix? Why do you think changing the query would be so much slower? Generally speaking, it's always faster to do this kind of thing at the database level.

Also, as a consumer, getting the first 1000 slightly faster but getting the whole bucket slower doesn't help! I want the whole thing as fast as possible.
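
For reference, a minimal sketch of the O(N) hash-table approach suggested above, using a Set keyed on the serialized row (the standalone function and its name are illustrative, not from the PR):

// Single pass: keep a row only the first time its serialized form is seen.
// Note: JSON.stringify keys assume consistent property order across rows,
// the same assumption the original filter/findIndex version makes.
function deduplicateRows (rows) {
  const seen = new Set();
  return rows.filter(row => {
    const key = JSON.stringify(row);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}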

Collaborator:

@eric-gt - any thoughts on this?

eric-gt (Author):

So my original clear fix turned out to be unsupported in BigQuery. Its SQL implementation doesn't let you SELECT DISTINCT on a single column or GROUP BY on a single column: SELECT DISTINCT does not work on repeated columns (structs, arrays, etc.), so we would have to group on every single column being returned, which would make the query much more verbose and hard to modify (e.g. removing or adding a column would require edits in 4-5 different places in the query).
I can reimplement the deduplication function to use a hash table, or take the plunge and rewrite the query to do the much more complex grouping operations.

eric-gt (Author):

@nandanrao, let me know your thoughts on which course to take with this last change.

eric-gt (Author):

@nandanrao the change to use an Object in the application code was only about an hour of work, so I went ahead and made it. Unit and Postman checks of the deduplication are passing.
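
The final diff isn't quoted in this conversation, so this is only a sketch of what the Object-based version might look like, assuming the same JSON-serialized row keys as the original filter:

deduplicateData (rows) {
  // Plain Object as a hash table: later duplicates overwrite earlier
  // entries under the same key, so each distinct row survives exactly once.
  const seen = {};
  for (const row of rows) {
    seen[JSON.stringify(row)] = row;
  }
  return Object.values(seen);
}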

eric-gt closed this on Jan 6, 2022
Labels: bug (Something isn't working)

Successfully merging this pull request may close these issues:
Not getting the same number of events for a set of users with different attribution_ids