feat: Enhance bulk insert speed using copy command #370
base: main
Conversation
@visch did we document somewhere the steps needed to regenerate the SSL cert files so CI can pass?
We didn't :/ We should have at least linked to https://www.postgresql.org/docs/current/ssl-tcp.html#SSL-CERTIFICATE-CREATION, which has the steps. Really, our steps should set them up to be valid for 250 years or something.
I've benchmarked the time savings on my local machine, loading 1,000,000 rows.

Average running time of three invocations using COPY (kinghuang:bulk-insert-copy): Timing Results

Average running time of three invocations using INSERT (MeltanoLabs:main): Timing Results

A noticeable improvement, which I imagine would be even more pronounced with a more complicated data set (my benchmark used only a few columns). That said, I do notice that the current COPY setup fails to handle certain types of data; see the attached Log Output for an example.
    '"{'
    + ",".join('""' + v.replace('"', r'\""') + '""' for v in value)
    + '}"'
)
I think once the tests are passing, we're good to go here. This method of escaping arrays stands out as awkward, but I think it's necessary. I tried putting together an implementation that avoids it using a custom dialect of csv writer. But I think that anything I come up with there is going to be even more janky.
Thoughts @visch ?
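For reference, a minimal sketch of what the escaping above produces for a list of strings, and of what PostgreSQL's CSV parser then sees (the helper name and example values are illustrative, not the PR's exact code):

# Sketch only: the function returns the already-CSV-quoted field text, so it
# is meant to go into the COPY payload without any further quoting.
def array_to_csv_field(value):
    return (
        '"{'
        + ",".join('""' + v.replace('"', r'\""') + '""' for v in value)
        + '}"'
    )

field = array_to_csv_field(["foo", 'say "hi"'])
print(field)
# "{""foo"",""say \""hi\""""}"
# After CSV decoding, PostgreSQL's array parser receives the literal:
#   {"foo","say \"hi\""}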
That makes sense. Surprising, but if this is better, it works for me!
CREATE TABLE public.testing_table (
    id SERIAL PRIMARY KEY,
    jsonb_data JSONB[]
);
INSERT INTO public.testing_table --succeeds, but breaks backward compatibility
(jsonb_data)
VALUES('{"[\"string1\", \"string2\", 1, 2, 3]"}'::jsonb[]);
INSERT INTO public.testing_table --succeeds
(jsonb_data)
VALUES('{\"string1\", \"string2\", 1, 2, 3}'::jsonb[]);
INSERT INTO public.testing_table --fails
(jsonb_data)
VALUES('[\"string1\", \"string2\", 1, 2, 3]'::jsonb[]);
--ERROR: malformed array literal: "[\"string1\", \"string2\", 1, 2, 3]"
-- Detail: "[" must introduce explicitly-specified array dimensions.
INSERT INTO public.testing_table --fails
(jsonb_data)
VALUES('"[\"string1\", \"string2\", 1, 2, 3]"'::jsonb[]);
--ERROR: malformed array literal: ""[\"string1\", \"string2\", 1, 2, 3]""
-- Detail: Array value must start with "{" or dimension information.
I'll check those out. The csv.QUOTE_NOTNULL and csv.QUOTE_STRINGS would both be more convenient, but they are new in Python 3.12.
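For illustration, on Python 3.12+ csv.QUOTE_NOTNULL would let the csv writer distinguish NULLs from empty strings in the COPY payload (a hedged sketch, not code from this PR):

# Sketch only (Python 3.12+): QUOTE_NOTNULL quotes every field except None,
# so None becomes an unquoted empty field, which COPY ... WITH (FORMAT CSV)
# reads as NULL, while a real empty string stays quoted and loads as ''.
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_NOTNULL)
writer.writerow([1, None, ""])
print(buf.getvalue())  # "1",,""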
@kinghuang are you interested in getting this PR to the finish line? Is there anything I can do to help?
I'll check those out. The csv.QUOTE_NOTNULL and csv.QUOTE_STRINGS would both be more convenient, but they are new in Python 3.12.
Would it be worth using those to make the tests pass on Python 3.12? If they significantly simplify the implementation, we can think about backporting the csv library from Python 3.12+ and publishing the wheels to PyPI.
Refer to @sebastianswms's comments.
Force-pushed from c94e3f5 to b609e27
Create a variation of generate_insert_statement that returns a PostgreSQL COPY statement, suitable for bulk loading of data formatted as CSV.
Use COPY instead of INSERT to bulk insert records. In PostgreSQL, COPY is the fastest way to insert bulk data.
The override of generate_insert_statement is no longer used.
Use the type bind processors to generate values, ensuring that values are represented correctly to PostgreSQL, especially for types like ARRAY and JSONB. Also, handle null values.
Always quote strings and handle array values.
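A rough sketch of how the COPY statement generation and bind-processor-based value rendering described above could fit together (the names generate_copy_statement and render_value, and the SQLAlchemy usage, are assumptions rather than the PR's exact code):

# Illustrative sketch only; names and signatures are assumptions.
import sqlalchemy as sa

def generate_copy_statement(full_table_name: str, columns: list[sa.Column]) -> str:
    # Build a COPY ... FROM STDIN statement that accepts CSV input.
    column_list = ", ".join(f'"{c.name}"' for c in columns)
    return (
        f"COPY {full_table_name} ({column_list}) "
        "FROM STDIN WITH (FORMAT CSV)"
    )

def render_value(column: sa.Column, value, dialect):
    # Run the column type's bind processor (if any) so types such as JSONB
    # and ARRAY are serialized the way the dialect expects; None stays NULL.
    if value is None:
        return None
    processor = column.type.bind_processor(dialect)
    return processor(value) if processor else value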
Force-pushed from fbfe133 to ed4838f
@kinghuang thanks for this PR! Is there anything you need to get it across the finish line? We're motivated to get this feature in 😄. I see some comments/feedback, but I'm not sure what is needed. @visch and @sebastianswms said they're able to assist with getting this merged, if that's something of interest.
@edgarrmondragon @pnadolny13 Yes, we are definitely interested in getting this released as well!
FYI, to get a massive speed increase, turn off
Sorry, I haven't had any time to finish this off. It's been quite busy at work! There are some failing tests that indicate an escaping/encoding error somewhere. One thing I've thought about trying is changing the delimiters used. The blog post "CSVs Are Kinda Bad. DSVs Are Kinda Good" makes me think it might be simpler and better to use the record and unit separator characters instead of trying to deal with escaping quotes and commas. The escape and quote characters are configurable on the COPY command.
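If we went that route, COPY's CSV options already allow swapping in other single-byte characters; a hedged sketch of what that could look like (the table and column names are made up, and this isn't code from the PR):

# Sketch only: DELIMITER, QUOTE and ESCAPE in CSV format can each be any
# single one-byte character, so the ASCII unit separator (0x1F) could replace
# the comma. COPY's row terminator is always a newline, though, so the record
# separator (0x1E) cannot be swapped in for it.
copy_statement = (
    "COPY my_table (col_a, col_b) FROM STDIN "
    "WITH (FORMAT CSV, DELIMITER E'\\x1F')"
)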
@kinghuang Thoughts on using the copy functionality in psycopg3? https://www.psycopg.org/articles/2020/11/15/psycopg3-copy/ It should avoid many of the issues mentioned above.
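For reference, a minimal sketch of psycopg 3's COPY support as described in the linked article (the connection string, table, and columns here are made up):

# Sketch only: copy.write_row() serializes each value, including None -> NULL,
# so no manual CSV quoting or escaping is needed.
import psycopg

rows = [(1, "alpha"), (2, None)]

with psycopg.connect("dbname=example") as conn:
    with conn.cursor() as cur:
        with cur.copy("COPY my_table (id, name) FROM STDIN") as copy:
            for row in rows:
                copy.write_row(row)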
I've been doing very large loads into PostgreSQL, and found target-postgres to be slow at bulk inserts. In PostgreSQL, the COPY command is preferable for high-performance loading of data.

This PR changes the implementation of bulk_insert_records() by using COPY with CSV data instead of an INSERT statement with values. When combined with larger batches (batch_size_rows), I am able to achieve 2 to 10× bulk insert speeds compared to the existing implementation.

Here's an example difference for moving 1 million rows via tap-snowflake to target-postgres.

Useful Links: