Since porting to 2.1.0, Dataflow is leaving Datasets/Tables behind in BigQuery #609
Comments
I haven't seen anything like this in 2.0.0; we run batch jobs on a daily basis and have restarted our streaming pipelines a few times now. Is this in streaming, batch, or both? |
Only seen it in batch so far, and cannot reproduce yet. |
Still happening in 2.2.0 templated batch jobs on our side. We're currently managing it with cleanup scripts but it's a PITA. |
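For anyone handling it the same way, here is a minimal sketch of such a cleanup script. It assumes the orphaned datasets share a recognizable name prefix and that anything older than a day is safe to drop; the project id, prefix, and age threshold below are placeholders for illustration, not values Beam guarantees.

```python
"""Sketch of a cleanup script for leftover Dataflow temp datasets in BigQuery.

Assumptions (not from this thread): leftover datasets share the TEMP_PREFIX
below, and anything older than MAX_AGE can be deleted. Adjust before use.
"""
import datetime
from google.cloud import bigquery

PROJECT = "my-project"          # hypothetical project id
TEMP_PREFIX = "temp_dataset_"   # assumed prefix of the orphaned datasets
MAX_AGE = datetime.timedelta(days=1)

client = bigquery.Client(project=PROJECT)
cutoff = datetime.datetime.now(datetime.timezone.utc) - MAX_AGE

for item in client.list_datasets():
    if not item.dataset_id.startswith(TEMP_PREFIX):
        continue
    dataset = client.get_dataset(item.reference)
    if dataset.created < cutoff:
        # delete_contents also removes any leftover temp tables inside it
        client.delete_dataset(dataset, delete_contents=True, not_found_ok=True)
        print(f"Deleted {dataset.dataset_id}")
```

Running this on a schedule (cron, Cloud Scheduler, etc.) keeps the project tidy until the underlying bug is fixed. |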
I was just thinking about this today because it happened yet again. Agree, auto expire on the datasets makes sense. |
So I did a little investigation, and it does look like that's actually implemented... not sure why it's still happening, though.
I think I'll try to do a bit more debugging of my own... p.s. is this the correct forum to be discussing this? |
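For reference, the dataset-level expiration being discussed can also be set directly with the BigQuery client, independently of whatever Beam does internally. A sketch using the google-cloud-bigquery Python client, with a hypothetical project and dataset name:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project id
dataset = client.get_dataset("my-project.temp_dataset_example")  # hypothetical dataset

# Any table created in this dataset after the update expires automatically
# once it is older than the default table expiration (24 hours here).
dataset.default_table_expiration_ms = 24 * 60 * 60 * 1000
client.update_dataset(dataset, ["default_table_expiration_ms"])
```

Note this only causes the tables to expire; the empty dataset itself still has to be deleted separately.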
[email protected] is a good place and also by opening a tracking issue on https://issues.apache.org/jira/projects/BEAM so people can follow the bug. |
I am also facing this issue. When a job failed, I observed that the table got deleted after 1 day, but the dataset still remains. Can we have an option to clean up the temp dataset and tables immediately if the job fails? Does anyone have a better idea? |
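Nothing in this thread suggests such an option exists in BigQueryIO, but a similar effect can be approximated on the orchestration side by dropping the temp dataset in a finally block around the job submission. A sketch, where run_pipeline, PROJECT, and TEMP_DATASET are hypothetical placeholders:

```python
from google.cloud import bigquery

PROJECT = "my-project"                 # hypothetical project id
TEMP_DATASET = "temp_dataset_example"  # hypothetical leftover dataset name

def run_pipeline():
    """Placeholder for whatever launches the Dataflow job and waits for it."""
    raise NotImplementedError

client = bigquery.Client(project=PROJECT)
try:
    run_pipeline()
finally:
    # Runs whether the job succeeds, fails, or is cancelled locally.
    client.delete_dataset(TEMP_DATASET, delete_contents=True, not_found_ok=True)
```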
Check out https://beam.apache.org/community/contact-us/ for ways to reach the Beam community with bug reports and questions. |
Since porting to 2.1.0, Dataflow is leaving Datasets/Tables behind in BigQuery when the pipeline is cancelled or when it fails. We were on 1.8.0/1.9.0 prior to this, and we never saw this before. We skipped 2.0.0, so we're unsure which version it was actually introduced in.
I cancelled a job (2017-10-08_18_35_30-13495977675828673253), and it left behind a dataset and table in BigQuery: