sync with open source how #118

lesterhaynes · 2024-03-14T08:15:08Z

Please add a meaningful description for your change here

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
Update CHANGES.md with noteworthy changes.
If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

See CI.md for more information about GitHub Actions CI.

* This is a follow-up PR to #31953, and part of the issue #31905. This PR adds the actual writer functionality, and some additional testing, including integration testing. This should be final PR for the SolaceIO write connector to be complete. * Use static imports for Preconditions * Remove unused method * Logging has builtin formatting support * Use TypeDescriptors to check the type used as input * Fix parameter name * Use interface + utils class for MessageProducer * Use null instead of optional * Avoid using ByteString just to create an empty byte array. * Fix documentation, we are not using ByteString now. * Not needed anymore, we are not using ByteString * Defer transforming latency from nanos to millis. The transform into millis is done at the presentation moment, when the metric is reported to Beam. * Avoid using top level classes with a single inner class. A couple of DoFns are moved to their own files too, as the abstract class forthe UnboundedSolaceWriter was in practice a "package". This commits addresses a few comments about the structure of UnboundedSolaceWriter and some base classes of that abstract class. * Remove using a state variable, there is already a timer. This DoFn is a stateful DoFn to force a shuffling with a given input key set cardinality. * Properties must always be set. The warnings are only shown if the user decided to set the properties that are overriden by the connector. This was changed in one of the previous commits but it is actually a bug. I am reverting that change and changing this to a switch block, to make it more clear that the properties need to be set always by the connector. * Add a new custom mode so no JCSMP property is overridden. This lets the user to fully control all the properties used by the connector, instead of making sensible choices on its behalf. This also adds some logging to be more explicit about what the connector is doing. This does not add too much logging pressure, this only adds logging at the producer creation moment. * Add some more documentation about the new custom submission mode. * Fix bug introduced with the refactoring of code for this PR. I forgot to pass the submission mode when the write session is created, and I called the wrong method in the base class because it was defined as public. This makes sure that the submission mode is passed to the session when the session is created for writing messages. * Remove unnecessary Serializable annotation. * Make the PublishResult class for handling callbacks non-static to handle pipelines with multiple write transforms. * Rename maxNumOfUsedWorkers to numShards * Use RoundRobin assignment of producers to process bundles. * Output results in a GlobalWindow * Add ErrorHandler * Fix docs * Remove PublishResultHandler class that was just a wrapper around a Queue * small refactors * Revert CsvIO docs fix * Add withErrorHandler docs * fix var scope --------- Co-authored-by: Bartosz Zablocki <[email protected]>

…#32928)

* managed bigqueryio * spotless * move managed dependency to test only * cleanup after merging snake_case PR * choose write method based on boundedness and pipeline options * rename bigquery write config class * spotless * change read output tag to 'output' * spotless * revert logic that depends on DataflowServiceOptions. switching BQ methods can instead be done in Dataflow service side * spotless * fix typo * separate BQ write config to a new class * fix doc * resolve after syncing to HEAD * spotless * fork on batch/streaming * cleanup * spotless * portable bigquery destinations * move forking logic to BQ schematransform side * add file loads translation and tests; add test checks that the correct transform is chosen * set top-level wrapper to be the underlying managed BQ transform urn; change tests to verify underlying transform name * move unit tests to respectvie schematransform test classes * expose to Python SDK as well * cleanup * address comment * set enable_streaming_engine option; add to CHANGES

Co-authored-by: Claude <[email protected]>

Beam Yaml's error handling framework returns per-record errors as a schema'd PCollection with associated error metadata (e.g. error messages, tracebacks). Currently there is no way to "unnest" the nested rececords (except for field by field) back to the top level if one wants to re-process these records (or otherwise ignore the metadata). Even if there was a way to do this "up-one-level" unnesting it's not clear that this would be obvious to users to find. Worse, various forms of error handling are not consistent in what the "bad records" schema is, or even where the original record is found (though we do have a caveat in the docs that this is still not set in stone). This adds a simple, easy to identify transform that abstracts all of these complexities away for the basic usecase.

* set enable_streaming_engine option * trigger test * trigger test * revert test trigger

Update dataframes to PEP 585 typing

* Make AvroUtils compatible with older versions of Avro * Create beam_PostCommit_Java_Avro_Versions.json * Update AvroUtils.java * Fix nullness

* fix JDBC providers Signed-off-by: Jeffrey Kinard <[email protected]> * fix test failures Signed-off-by: Jeffrey Kinard <[email protected]> * fix typo Signed-off-by: Jeffrey Kinard <[email protected]> --------- Signed-off-by: Jeffrey Kinard <[email protected]>

Co-authored-by: Naireen <[email protected]>

Update container image to pick up recent changes.

…#33108) * Adds detailed instructions on how to execute ExampleEchoPipeline.java * spacing in exampleechopipeline * revert header * spacing * spacing * spacing * license * spotless

* Update build_release_candidate.yml * Add back maven workaround

Bumps [github.com/aws/aws-sdk-go-v2/feature/s3/manager](https://github.com/aws/aws-sdk-go-v2) from 1.17.33 to 1.17.38. - [Release notes](https://github.com/aws/aws-sdk-go-v2/releases) - [Commits](aws/aws-sdk-go-v2@credentials/v1.17.33...credentials/v1.17.38) --- updated-dependencies: - dependency-name: github.com/aws/aws-sdk-go-v2/feature/s3/manager dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…hour) (#32939)

- Fix custom coder not being used in Reshuffle (global window) (#33339) - Fix custom coders not being used in Reshuffle (non global window) #33363 - Add missing to_type_hint to WindowedValueCoder #33403

… releases from snapshot repo

* [YAML] Better docs for Filter and MapToFields. * Remove redundant optional indicators. * Update sdks/python/apache_beam/yaml/yaml_mapping.py Co-authored-by: Jeff Kinard <[email protected]> --------- Co-authored-by: Jeff Kinard <[email protected]>

* Fix env variable loading in Cost Benchmark workflow * fix output file for tf mnist * add load test requirements file arg * update mnist args * revert how args are passed * assign result correctly

Revert three commits related to supporting custom coder in reshuffle

* improve python multi-lang examples * minor adjustments

Bumps [github.com/docker/docker](https://github.com/docker/docker) from 27.3.1+incompatible to 27.4.1+incompatible. - [Release notes](https://github.com/docker/docker/releases) - [Commits](moby/moby@v27.3.1...v27.4.1) --- updated-dependencies: - dependency-name: github.com/docker/docker dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Bumps [google.golang.org/api](https://github.com/googleapis/google-api-go-client) from 0.212.0 to 0.214.0. - [Release notes](https://github.com/googleapis/google-api-go-client/releases) - [Changelog](https://github.com/googleapis/google-api-go-client/blob/main/CHANGES.md) - [Commits](googleapis/google-api-go-client@v0.212.0...v0.214.0) --- updated-dependencies: - dependency-name: google.golang.org/api dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [yaml] add RunInference support with VertexAI Signed-off-by: Jeffrey Kinard <[email protected]> * address comments and fix tests Signed-off-by: Jeffrey Kinard <[email protected]> * add more docs Signed-off-by: Jeffrey Kinard <[email protected]> * fix failing tests Signed-off-by: Jeffrey Kinard <[email protected]> * fix errors Signed-off-by: Jeffrey Kinard <[email protected]> * fix lint Signed-off-by: Jeffrey Kinard <[email protected]> --------- Signed-off-by: Jeffrey Kinard <[email protected]>

* Remove static GCP credentials from workflow * Remove workflow_dispatch blocking input * Remove conditional from Python SDK source step * Fix parenthesis error * Remove redundant Dataflow test

Signed-off-by: Jeffrey Kinard <[email protected]>

* Update .GitHub/workflows README.md * Update .github/workflows/README.md Co-authored-by: Danny McCormick <[email protected]> * Add screenshot showing how to run workflow against branch. * Remove trailing whitespace --------- Co-authored-by: Danny McCormick <[email protected]>

…ance (#33392) * use avro file format * add comment * add unit test

* Support Iceberg partition identity transform * remove uneeded avro dep * Trigger icerberg integration tests * Revert "remove uneeded avro dep" This reverts commit 0b075af.

…g before write on the number of streams set for BigQueryIO.write() using .withNumStorageWriteApiStreams(numStorageWriteApiStreams) (#32805) * BigQueryIO : control StorageWrite parallelism in batch, by reshuffling before write on the number of streams set for BigQueryIO.write() using .withNumStorageWriteApiStreams(numStorageWriteApiStreams) * fix unused dep and comment * spotlessApply * spotlessApply * fix typo

github-actions bot added build java python examples go infra kotlin learning model labels Mar 14, 2024

iht and others added 21 commits November 13, 2024 11:11

fix imports

c643645

Share AvgRecordSize and KafkaLatestOffsetEstimator caches among DoFns (…

306c6d7

…#32928)

Disable gradle cache for gcp expansion service (#33099)

7f268ac

Co-authored-by: Claude <[email protected]>

Set streaming engine option to fix V1 tests (#33100)

cb06b1b

* set enable_streaming_engine option * trigger test * trigger test * revert test trigger

Moving to 2.62.0-SNAPSHOT on master branch.

3c664e9

Merge pull request #33098 from jrmccluskey/dataframesUpdate

965c3c6

Update dataframes to PEP 585 typing

Make AvroUtils compatible with older versions of Avro (#33102)

38415b8

* Make AvroUtils compatible with older versions of Avro * Create beam_PostCommit_Java_Avro_Versions.json * Update AvroUtils.java * Fix nullness

Add logging to see which topic each split is reading from (#33031)

f4d07c4

Co-authored-by: Naireen <[email protected]>

Update names.py (#33107)

bff3eac

Update container image to pick up recent changes.

[yaml] Fix examples catalog tests (#33027)

ab5c069

Merge branch 'master' into update-type-hints

2d69dde

Add 2.62.0 to changes (#33116)

c694701

Fix small typos in release email (#33118)

7accd8d

Adds detailed instructions on how to execute ExampleEchoPipeline.java (…

f38edb7

…#33108) * Adds detailed instructions on how to execute ExampleEchoPipeline.java * spacing in exampleechopipeline * revert header * spacing * spacing * spacing * license * spotless

Add temp location to batch file loads (#33115)

667269a

Build docker RC on self-hosted (#33114)

0cafb46

* Update build_release_candidate.yml * Add back maven workaround

ahmedabu98 and others added 30 commits December 17, 2024 16:30

Python ExternalTransformProvider improvements (#33359)

a9f50fa

[Managed Iceberg] Support partitioning time types (year, month, day, …

286e29c

…hour) (#32939)

Add missing imports.

d704422

Validate circular reference for yaml (#33208)

e68a79c

Deserialization fix.

71a5ced

Revert three commits related to supporting custom coder in reshuffle

4cbf257

- Fix custom coder not being used in Reshuffle (global window) (#33339) - Fix custom coders not being used in Reshuffle (non global window) #33363 - Add missing to_type_hint to WindowedValueCoder #33403

Added support for SparkRunner streaming stateful processing (#33267)

f476417

Update lineage query function.

14f7caf

Merge pull request #33386: In contributor docs, don't try to download…

d17f77a

… releases from snapshot repo

Fix env variable loading in Cost Benchmark workflow (#33404)

116df9f

* Fix env variable loading in Cost Benchmark workflow * fix output file for tf mnist * add load test requirements file arg * update mnist args * revert how args are passed * assign result correctly

Merge pull request #33414 from shunping/revert-reshuffle-changes

e9424b9

Revert three commits related to supporting custom coder in reshuffle

fix error on start-build-env.sh (#33401)

5eed396

Moving to 2.63.0-SNAPSHOT on master branch.

ec2e150

Bump Iceberg to v1.6.1 (#33294)

ca8a35a

Add 2.63.0 section to CHANGES.md

55b0ce9

Merge pull request #33423: Add 2.63.0 section to CHANGES.md

fe6b7aa

Improve existing Python multi-lang SchemaTransform examples (#33361)

8fee3ca

* improve python multi-lang examples * minor adjustments

Merge pull request #33381 Migrate lineage counters to bounded tries.

e8df26f

Move Dataflow Python Managed tests to prod (#33426)

a51a0e1

Clean up Python Tests workflow (#33396)

c1b9fac

* Remove static GCP credentials from workflow * Remove workflow_dispatch blocking input * Remove conditional from Python SDK source step * Fix parenthesis error * Remove redundant Dataflow test

[yaml] allow logging bytes in LogForTesting (#33433)

74bcba1

Signed-off-by: Jeffrey Kinard <[email protected]>

[Managed BigQuery] use file loads with Avro format for better perform…

bf7e317

…ance (#33392) * use avro file format * add comment * add unit test

Support Iceberg partition identity transform (#33332)

1abc6c8

* Support Iceberg partition identity transform * remove uneeded avro dep * Trigger icerberg integration tests * Revert "remove uneeded avro dep" This reverts commit 0b075af.

python multi-lang with schematransforms guide (#33362)

067deed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sync with open source how #118

sync with open source how #118

lesterhaynes commented Mar 14, 2024

sync with open source how #118

Are you sure you want to change the base?

sync with open source how #118

Conversation

lesterhaynes commented Mar 14, 2024

GitHub Actions Tests Status (on master branch)