feat: Add support for pgvector [NAIVE] #251
base: main
Conversation
Review comment on `target_postgres/connector.py` (outdated):
```python
if "array" in jsonschema_type["type"]:
    # Select between different kinds of `ARRAY` data types.
    #
    # This currently leverages an unspecified definition for the Singer SCHEMA,
    # using the `additionalProperties` attribute to convey additional type
    # information, agnostic of the target database.
    #
    # In this case, it is about telling different kinds of `ARRAY` types apart:
    # Either it is a vanilla `ARRAY`, to be stored into a `jsonb[]` type, or,
    # alternatively, it can be a "vector" kind of `ARRAY` of floating point
    # numbers, effectively what pgvector is storing in its `VECTOR` type.
    #
    # Still, `type: "vector"` is only a surrogate label here, because other
    # database systems may use different types for implementing the same thing,
    # and need to translate accordingly.
    """
    Schema override rule in `meltano.yml`:

        type: "array"
        items:
          type: "number"
        additionalProperties:
          storage:
            type: "vector"
            dim: 4

    Produced schema annotation in `catalog.json`:

        {"type": "array",
         "items": {"type": "number"},
         "additionalProperties": {"storage": {"type": "vector", "dim": 4}}}
    """
    if (
        "additionalProperties" in jsonschema_type
        and "storage" in jsonschema_type["additionalProperties"]
    ):
        storage_properties = jsonschema_type["additionalProperties"]["storage"]
        if (
            "type" in storage_properties
            and storage_properties["type"] == "vector"
        ):
            # On PostgreSQL/pgvector, use the corresponding type definition
            # from its SQLAlchemy dialect.
            from pgvector.sqlalchemy import (
                Vector,  # type: ignore[import-untyped]
            )

            return Vector(storage_properties["dim"])
```
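In isolation, the selection logic above can be sketched as a small standalone function. This is a minimal sketch for illustration only: the `Vector` class here is a stand-in for `pgvector.sqlalchemy.Vector`, and `resolve_array_type` is a hypothetical name, not an actual function of the target.

```python
from typing import Optional


class Vector:
    """Stand-in for `pgvector.sqlalchemy.Vector`, used here for illustration."""

    def __init__(self, dim: int) -> None:
        self.dim = dim


def resolve_array_type(jsonschema_type: dict) -> Optional[Vector]:
    """Return a `Vector` type when the schema carries the pgvector annotation.

    Returns `None` for vanilla arrays, which would map to `jsonb[]`.
    """
    if "array" not in jsonschema_type.get("type", []):
        return None
    storage = jsonschema_type.get("additionalProperties", {}).get("storage", {})
    if storage.get("type") == "vector":
        return Vector(storage["dim"])
    return None


# A schema annotated as produced in `catalog.json`:
annotated = {
    "type": "array",
    "items": {"type": "number"},
    "additionalProperties": {"storage": {"type": "vector", "dim": 4}},
}
print(resolve_array_type(annotated).dim)       # 4
print(resolve_array_type({"type": ["array"]}))  # None: plain array, jsonb[]
```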
After learning more details about the Singer specification, I've now used the `additionalProperties` schema slot to convey additional type information, agnostic of the target database. It might not be what you had in mind, but it seems to work well. 🤷
Hi Andreas!

Reading through the JSON Schema Draft 7 spec, and thinking about our own use of `additionalProperties` in the SDK, I think I have a different understanding of its use cases:

> This keyword determines how child instances validate for objects, and does not directly validate the immediate instance itself. Validation with "additionalProperties" applies only to the child values of instance names that do not match any names in "properties", and do not match any regular expression in "patternProperties".
>
> For all such properties, validation succeeds if the child instance validates against the "additionalProperties" schema.

So I think it's used to constrain the schema that extra fields, not included in an object's `properties` mapping, should have.

What do you think?
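For illustration, the Draft 7 behavior quoted above can be mimicked with a tiny hand-rolled check. This is a sketch, not the SDK's validation code, and it only handles a `type` constraint of `"number"` or `"string"`: keys named in `properties` are validated by their own subschemas, and only the remaining keys fall under `additionalProperties`.

```python
def check_additional_properties(instance: dict, schema: dict) -> bool:
    """Minimal sketch of Draft 7 `additionalProperties` semantics.

    Only keys absent from `properties` (and, in the full spec, those not
    matching `patternProperties`) are validated against the
    `additionalProperties` subschema.
    """
    declared = schema.get("properties", {})
    extra_schema = schema.get("additionalProperties")
    if extra_schema is None:  # keyword absent: extra keys are unconstrained
        return True
    type_map = {"number": (int, float), "string": str}
    for key, value in instance.items():
        if key in declared:
            continue  # validated by its own subschema, not by additionalProperties
        if extra_schema is False:
            return False  # `additionalProperties: false` forbids extra keys
        if not isinstance(value, type_map[extra_schema["type"]]):
            return False
    return True


schema = {
    "properties": {"name": {"type": "string"}},
    "additionalProperties": {"type": "number"},
}
print(check_additional_properties({"name": "x", "score": 1.5}, schema))   # True
print(check_additional_properties({"name": "x", "score": "high"}, schema))  # False
```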
Hi Edgar,

thanks for your response. I am absolutely sure I failed to respect the specification. 💥 😇 Please bear with me, as I have not become well acquainted with the baseline specification yet, and apologies for taking up your time.

It is clear that I've abused the `additionalProperties` field, and I will be happy to wait until you unlock other corresponding attributes for conveying additional type information, as you suggested at meltano/sdk#2102.

My patch was merely meant to exercise a few other details which may be needed to make this fly, just on the level of the target, and to see whether the details could be re-assembled to make it actually runnable [1], also as a learning exercise.

With kind regards,
Andreas.
Footnotes
Let me know if we should rather close this PR and let the topic sit until corresponding improvements have made it into the SDK. I probably can't help there, as I believe it needs more intrinsic knowledge and discussion amongst you and your colleagues. Alternatively, if you think there are other, more feasible workarounds in the same style as I've hacked it, I will also be happy to receive your guidance.
In PostgreSQL, everything boils down to the `jsonb[]` type, but arrays are reflected as `sqlalchemy.dialects.postgresql.ARRAY` instead of `sqlalchemy.dialects.postgresql.JSONB`. In order to prepare for more advanced type mangling & validation, and to better support databases pretending to be compatible with PostgreSQL, the new test cases exercise arrays with different kinds of inner values, because, on other databases, ARRAYs may need to have uniform content. Along the way, this adds a `verify_schema` utility function in the spirit of the `verify_data` function, refactored and generalized from the `test_anyof` test case.
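The idea behind a `verify_schema` helper can be sketched roughly as follows. This is illustrative only: the reflected table is represented here as a plain column-name-to-type-name mapping, whereas the real helper would inspect SQLAlchemy type objects obtained from reflection.

```python
def verify_schema(actual_columns: dict, expected_columns: dict) -> None:
    """Compare reflected column types against expectations.

    `actual_columns` stands in for what SQLAlchemy reflection would yield,
    e.g. {"id": "BIGINT", "fruits": "ARRAY"}.
    """
    for name, expected_type in expected_columns.items():
        assert name in actual_columns, f"missing column: {name}"
        assert actual_columns[name] == expected_type, (
            f"column {name!r}: expected {expected_type}, "
            f"got {actual_columns[name]}"
        )


# Arrays reflect as ARRAY rather than JSONB, even though storage is jsonb[]:
reflected = {"id": "BIGINT", "fruits": "ARRAY"}
verify_schema(reflected, {"fruits": "ARRAY"})
print("schema verified")
```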
Dispose the SQLAlchemy engine object after use within test utility functions. Within `BasePostgresSDKTests`, new database connections via SQLAlchemy weren't being closed, and started filling up the connection pool, eventually saturating it.

Likewise, dispose the SQLAlchemy engine object after use within `PostgresConnector`.
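The disposal pattern described in these commits is, in spirit, the following. This sketch uses a dummy engine class so it runs standalone; the real code calls `dispose()` on a SQLAlchemy engine instead.

```python
import contextlib


class DummyEngine:
    """Stand-in for a SQLAlchemy engine, tracking how many remain open."""

    open_engines = 0

    def __init__(self) -> None:
        DummyEngine.open_engines += 1

    def dispose(self) -> None:
        DummyEngine.open_engines -= 1


@contextlib.contextmanager
def engine_for_test():
    """Yield an engine and guarantee disposal, so pools do not saturate."""
    engine = DummyEngine()
    try:
        yield engine
    finally:
        engine.dispose()  # the missing step that let connections pile up


for _ in range(100):  # repeated use no longer accumulates open engines
    with engine_for_test():
        pass
print(DummyEngine.open_engines)  # 0
```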
By wrapping them into a container class `AssertionHelper`, it is easy to parameterize the test helpers and to provide them to the test functions using a pytest fixture. This way, they are reusable from database adapter implementations which derive from PostgreSQL. The motivation for this is that the metadata column prefix `_sdc` needs to be adjusted for other database systems, as they reject such columns, which are reserved for system purposes. In the specific case of CrateDB, it is enough to rename it to `__sdc`. Sad but true.
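Parameterizing the helpers on the metadata prefix might look roughly like this. This is a sketch, not the actual `AssertionHelper`: the method names and the `verify_data` signature here are illustrative.

```python
class AssertionHelper:
    """Bundle test assertion helpers, parameterized on the metadata prefix.

    PostgreSQL uses `_sdc`; a derived adapter such as CrateDB would pass
    `__sdc`, since leading-underscore columns are reserved there.
    """

    def __init__(self, metadata_column_prefix: str = "_sdc") -> None:
        self.prefix = metadata_column_prefix

    def metadata_column(self, name: str) -> str:
        return f"{self.prefix}_{name}"

    def verify_data(self, rows: list, expected_count: int) -> None:
        assert len(rows) == expected_count, f"expected {expected_count} rows"


# With pytest, a fixture would yield a configured instance, e.g.:
#
#   @pytest.fixture
#   def helper():
#       return AssertionHelper(metadata_column_prefix="_sdc")
#
helper = AssertionHelper()
print(helper.metadata_column("extracted_at"))  # _sdc_extracted_at
cratedb_helper = AssertionHelper(metadata_column_prefix="__sdc")
print(cratedb_helper.metadata_column("extracted_at"))  # __sdc_extracted_at
```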
Note: This PR is just for educational purposes and is not to be taken seriously, unless otherwise endorsed.
About

This patch aims to add very basic and naive support for populating data into a `vector()`-type column as provided by pgvector. A `vector()` is actually an array of floating point numbers, so the implementation tries to follow that.

Details

As outlined below, my implementation is very naive and doesn't take any Singer SCHEMA standardization processes into account, effectively just hacking in an `"additionalProperties": {"storage": {"type": "vector", "dim": 4}}` extra attribute. I am sure this will not be appropriate, so I will take it as an opportunity to learn how it actually works to integrate new types into the type system, at the same time asking for your patience with me.
The patch is stacked on top of GH-250, which is why the diff does not read well. Merging GH-250 and rebasing this branch will improve the situation. In the meantime, the diff can be inspected by visiting the commit ea08740.
Usage

- Install the package in development mode, including the `pgvector` extra.
- Invoke all array tests, including `test_array_float_vector`.
Thoughts

Let me know if you think it is a good idea to explore this feature within `target-postgres`, or if you think it should be approached in a different target implementation, like `target-pgvector`, which builds upon the former but separates concerns.