feat: Add support for `pgvector` [NAIVE] #251

amotl · 2023-12-13T03:02:49Z

Note: This PR is just for educational purposes and is not to be taken seriously, unless otherwise endorsed.

About

This patch aims to add very basic and naive support for populating data into a vector()-type column as provided by pgvector. A vector() is actually an array of floating point numbers, so the implementation tries to follow that.
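
As a rough illustration of "an array of floating point numbers" with a fixed dimension, the sketch below renders a Python sequence in the bracketed text form that pgvector accepts for `vector` values. The helper name `to_vector_literal` is hypothetical and not part of this patch. The patch itself relies on pgvector's SQLAlchemy type instead of building literals by hand.

```python
from typing import Sequence


def to_vector_literal(values: Sequence[float], dim: int) -> str:
    """Render `values` as a bracketed text literal for a `vector(dim)` column.

    Hypothetical helper for illustration only.
    """
    # A `vector(dim)` column has a fixed number of dimensions.
    if len(values) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(values)}")
    # Coerce every element to float, mirroring "an array of floating point numbers".
    return "[" + ",".join(str(float(v)) for v in values) + "]"


print(to_vector_literal([1, 2, 3, 4], dim=4))  # [1.0,2.0,3.0,4.0]
```

The dimension check mirrors the `dim` attribute used in the schema annotation of this patch.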

Details

As outlined below, my implementation is very naive and doesn't take any Singer SCHEMA standardization processes into account, effectively just hacking in an "additionalProperties": {"storage": {"type": "vector", "dim": 4}} extra attribute.

I am sure this will not be appropriate, so I will take it as an opportunity to learn how it actually works to integrate new types to the type system, at the same time asking for your patience with me.

The patch is stacked on top of GH-250, which is why the diff is hard to read. Merging GH-250 and rebasing this branch will improve the situation. In the meantime, the diff can be inspected by visiting the commit ea08740.

Usage

Install package in development mode including the pgvector extra.

poetry install --extras=pgvector

Invoke all array tests, including test_array_float_vector.

pytest -vvv -k array

Thoughts

Let me know if you think it is a good idea to explore this feature within target-postgres, or if you think it should be approached in a different target implementation, like target-pgvector, which builds upon the former but separates concerns.

target_postgres/connector.py

amotl · 2023-12-13T23:34:09Z

        if "array" in jsonschema_type["type"]:
+            # Select between different kinds of `ARRAY` data types.
+            #
+            # This currently leverages an unspecified definition for the Singer SCHEMA,
+            # using the `additionalProperties` attribute to convey additional type
+            # information, agnostic of the target database.
+            #
+            # In this case, it is about telling different kinds of `ARRAY` types apart:
+            # Either it is a vanilla `ARRAY`, to be stored into a `jsonb[]` type, or,
+            # alternatively, it can be a "vector" kind `ARRAY` of floating point numbers,
+            # effectively what pgvector is storing in its `VECTOR` type.
+            #
+            # Still, `type: "vector"` is only a surrogate label here, because other
+            # database systems may use different types for implementing the same thing,
+            # and need to translate accordingly.
+            """
+            Schema override rule in `meltano.yml`:
+
+            type: "array"
+            items:
+              type: "number"
+            additionalProperties:
+              storage:
+                type: "vector"
+                dim: 4
+
+            Produced schema annotation in `catalog.json`:
+
+            {"type": "array",
+             "items": {"type": "number"},
+             "additionalProperties": {"storage": {"type": "vector", "dim": 4}}}
+            """
+            if (
+                "additionalProperties" in jsonschema_type
+                and "storage" in jsonschema_type["additionalProperties"]
+            ):
+                storage_properties = jsonschema_type["additionalProperties"]["storage"]
+                if (
+                    "type" in storage_properties
+                    and storage_properties["type"] == "vector"
+                ):
+                    # On PostgreSQL/pgvector, use the corresponding type definition
+                    # from its SQLAlchemy dialect.
+                    from pgvector.sqlalchemy import (
+                        Vector,  # type: ignore[import-untyped]
+                    )
+
+                    return Vector(storage_properties["dim"])


After learning more details about the Singer specification, I've now used the additionalProperties schema slot to convey additional type information, agnostic of the target database.

It might not be what you had in mind, but it seems to work well. 🤷
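
The lookup in the diff above can be sketched as a standalone function that runs without pgvector installed. The name `vector_dimension` is hypothetical. In the patch, the resulting dimension is what gets passed to `Vector(...)` from `pgvector.sqlalchemy`. Unlike the diff, this sketch also tolerates a boolean `additionalProperties`, which JSON Schema permits.

```python
from typing import Any, Dict, Optional


def vector_dimension(jsonschema_type: Dict[str, Any]) -> Optional[int]:
    """Return `dim` if the schema describes a "vector"-annotated array, else None.

    Hypothetical helper that mirrors the checks made in the patch.
    """
    # `type` may be a string ("array") or a list (["array", "null"]).
    if "array" not in jsonschema_type.get("type", []):
        return None
    # JSON Schema also allows `additionalProperties: true/false`, so guard the type.
    additional = jsonschema_type.get("additionalProperties")
    if not isinstance(additional, dict):
        return None
    storage = additional.get("storage")
    if not isinstance(storage, dict) or storage.get("type") != "vector":
        return None
    # None if the annotation does not specify a dimension.
    return storage.get("dim")


schema = {
    "type": "array",
    "items": {"type": "number"},
    "additionalProperties": {"storage": {"type": "vector", "dim": 4}},
}
print(vector_dimension(schema))  # 4
```

Any schema for which this returns `None` would fall through to the regular `ARRAY` handling.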

Hi Andreas!

Reading through the JSON Schema Draft 7 spec and thinking about our own use of additionalProperties in the SDK, I think I have a different understanding of its use cases:

This keyword determines how child instances validate for objects, and does not directly validate the immediate instance itself.
Validation with "additionalProperties" applies only to the child values of instance names that do not match any names in "properties", and do not match any regular expression in "patternProperties".

For all such properties, validation succeeds if the child instance validates against the "additionalProperties" schema.

So I think it's used to constrain the schema of extra fields that are not included in an object's properties mapping.
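
To make this reading of the spec concrete, here is a minimal sketch of how "additionalProperties" behaves for objects. It is not a real validator: it only models the "type" keyword, ignores "patternProperties", and the function names are made up for illustration.

```python
from typing import Any, Dict, List

# Only the JSON Schema "type" keyword is modelled in this sketch.
_PYTHON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "object": dict,
    "array": list,
    "null": type(None),
}


def _matches_type(value: Any, type_name: str) -> bool:
    if type_name == "boolean":
        return isinstance(value, bool)
    if isinstance(value, bool):
        # bool is an int subclass in Python, but not a JSON number.
        return False
    return isinstance(value, _PYTHON_TYPES[type_name])


def failing_extra_properties(schema: Dict[str, Any], instance: Dict[str, Any]) -> List[str]:
    """List property names that violate the schema's "additionalProperties"."""
    declared = schema.get("properties", {})
    additional = schema.get("additionalProperties", True)
    failing = []
    for name, value in instance.items():
        # Declared properties are not subject to "additionalProperties".
        if name in declared or additional is True:
            continue
        if additional is False:
            failing.append(name)
        elif "type" in additional and not _matches_type(value, additional["type"]):
            failing.append(name)
    return failing


schema = {
    "type": "object",
    "properties": {"id": {"type": "integer"}},
    "additionalProperties": {"type": "string"},
}
print(failing_extra_properties(schema, {"id": 1, "note": "ok"}))  # []
print(failing_extra_properties(schema, {"id": 1, "note": 42}))    # ['note']
```

Under this reading, the keyword only takes effect on object instances, so placing it on a schema of type "array" has no validation effect and merely carries the annotation along.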

What do you think?

Hi Edgar,

thanks for your response. I am absolutely sure I failed to respect the specifications. 💥 😇

Please bear with me, I have not yet made friends with the baseline specification, and apologies for taking up your time.

It is clear that I've abused the additionalProperties field, and I will be happy to wait until you unlock other attributes for conveying additional type information, as you suggested at meltano/sdk#2102.

My patch was merely meant to exercise a few other details which may be needed to make this fly, just on the level of the target, and to see if the details could be re-assembled to make it actually runnable ¹, also as a learning exercise.

With kind regards,
Andreas.

Footnotes

¹ https://github.com/singer-contrib/meltano-examples/tree/main/to-database

Let me know if we should rather close this PR and let the topic sit until the corresponding improvements have made it into the SDK.

I probably can't help there, as I believe it needs more intrinsic knowledge and discussions amongst you and your colleagues.

Alternatively, if you think there are other, more feasible workarounds in the same style as my hack, I will also be happy to receive your guidance.

docker-compose.yml

In PostgreSQL, it all boils down to the `jsonb[]` type, but arrays are reflected as `sqlalchemy.dialects.postgresql.ARRAY` instead of `sqlalchemy.dialects.postgresql.JSONB`. To prepare for more advanced type mangling and validation, and to better support databases pretending to be compatible with PostgreSQL, the new test cases exercise arrays with different kinds of inner values, because on other databases ARRAYs may need to have uniform content. Along the way, it adds a `verify_schema` utility function in the spirit of the `verify_data` function, refactored and generalized from the `test_anyof` test case.

By wrapping them into a container class `AssertionHelper`, it is easy to parameterize them and to provide them to the test functions using a pytest fixture. This way, they are reusable from database adapter implementations which derive from PostgreSQL. The motivation is that the metadata column prefix `_sdc` needs to be adjusted for other database systems, as they reject such columns, which are reserved for system purposes. In the specific case of CrateDB, it is enough to rename it to `__sdc`. Sad but true.
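
A minimal sketch of what such a parameterizable container might look like. Only the class name `AssertionHelper` and the two prefixes come from the text above. The field and method names are assumptions for illustration and not the actual implementation. In the test suite, a pytest fixture would return a configured instance.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class AssertionHelper:
    # Hypothetical field name: prefix of the Singer metadata columns.
    metadata_column_prefix: str = "_sdc"

    def is_metadata_column(self, name: str) -> bool:
        """Tell whether `name` is a metadata column, like `_sdc_extracted_at`."""
        return name.startswith(self.metadata_column_prefix + "_")

    def strip_metadata(self, row: Dict[str, Any]) -> Dict[str, Any]:
        """Drop metadata columns, so rows can be compared across databases."""
        return {k: v for k, v in row.items() if not self.is_metadata_column(k)}


# PostgreSQL keeps the default prefix, a CrateDB adapter would override it.
postgres_helper = AssertionHelper()
cratedb_helper = AssertionHelper(metadata_column_prefix="__sdc")

print(postgres_helper.strip_metadata({"id": 1, "_sdc_extracted_at": "2023-12-13"}))
print(cratedb_helper.strip_metadata({"id": 1, "__sdc_extracted_at": "2023-12-13"}))
```

Keeping the prefix as a constructor argument means a derived adapter only has to supply its own fixture, while the assertion logic stays shared.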

amotl force-pushed the pgvector branch from 312395d to bfa0f6c Compare December 13, 2023 03:03

amotl commented Dec 13, 2023

amotl mentioned this pull request Dec 13, 2023

feat: Add support for pgvector's vector data type singer-contrib/meltanolabs-target-postgres#1

Draft

amotl changed the title ~~[NAIVE] feat: Add support for pgvector~~ feat: Add support for pgvector [NAIVE] Dec 13, 2023

amotl force-pushed the pgvector branch 2 times, most recently from 292c9f2 to 4c0f5a7 Compare December 13, 2023 03:38

amotl mentioned this pull request Dec 13, 2023

test: Add test cases for arrays and objects, and introduce verify_schema #250

Open

amotl force-pushed the pgvector branch 2 times, most recently from abcb147 to bb99a40 Compare December 13, 2023 06:17

amotl mentioned this pull request Dec 13, 2023

Struggling to define schema overrides (on taps) crate-workbench/meltano-target-cratedb#6

Closed

amotl force-pushed the pgvector branch 2 times, most recently from 039ef87 to c14a169 Compare December 13, 2023 23:31

amotl commented Dec 13, 2023

amotl force-pushed the pgvector branch from c14a169 to c15e8aa Compare December 13, 2023 23:35

amotl commented Dec 13, 2023

amotl force-pushed the pgvector branch 2 times, most recently from ea08740 to 0f0f4fa Compare December 14, 2023 21:27

amotl mentioned this pull request Dec 14, 2023

Add support for container types ARRAY, OBJECT, and FLOAT_VECTOR crate-workbench/meltano-target-cratedb#8

Merged

amotl force-pushed the pgvector branch from 7b0c6f8 to 9ce04b1 Compare December 14, 2023 23:41

amotl added 9 commits December 19, 2023 16:16

chore: Add .idea to .gitignore

fe92e45

test: Fix FATAL: sorry, too many clients already

8b3ea4f

Dispose the SQLAlchemy engine object after use within test utility functions.

test: Fix FATAL: sorry, too many clients already

a9d1796

Within `BasePostgresSDKTests`, new database connections via SQLAlchemy haven't been closed, and started filling up the connection pool, eventually saturating it.

test: Fix FATAL: sorry, too many clients already

723e1fa

Dispose the SQLAlchemy engine object after use within `PostgresConnector`.
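
The three "too many clients" fixes share one pattern: call `dispose()` on the engine once the work is done, even when an assertion fails midway. The sketch below shows that pattern with a hypothetical `disposing` context manager. `FakeEngine` is a stand-in so the example runs without a database. With SQLAlchemy, the object passed in would be the result of `sqlalchemy.create_engine(...)`, whose `Engine.dispose()` closes the pooled connections.

```python
from contextlib import contextmanager
from typing import Iterator, Protocol, TypeVar


class Disposable(Protocol):
    def dispose(self) -> None: ...


E = TypeVar("E", bound=Disposable)


@contextmanager
def disposing(engine: E) -> Iterator[E]:
    """Yield `engine` and always dispose it afterwards (hypothetical helper)."""
    try:
        yield engine
    finally:
        # Runs on success and on failure, so pooled connections do not pile up.
        engine.dispose()


class FakeEngine:
    """Stand-in for a SQLAlchemy Engine, used only to make the sketch runnable."""

    def __init__(self) -> None:
        self.disposed = False

    def dispose(self) -> None:
        self.disposed = True


engine = FakeEngine()
try:
    with disposing(engine):
        raise RuntimeError("simulated test failure")
except RuntimeError:
    pass
print(engine.disposed)  # True
```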

chore: Fix parameter names in docstrings

9e2f6db

feat: Add support for pgvector's vector data type

f007377

chore: Re-add Python 3.8 compatibility

fb320f1

amotl force-pushed the pgvector branch from 9ce04b1 to fb320f1 Compare December 19, 2023 21:30
