Merge pull request #4375 from szarnyasg/arrow-faq
Arrow FAQ
  • Loading branch information
szarnyasg authored Dec 16, 2024
2 parents 84e8728 + e84a644 commit b776027
Showing 14 changed files with 33 additions and 17 deletions.
2 changes: 1 addition & 1 deletion _posts/2021-12-03-duck-arrow.md
@@ -6,7 +6,7 @@ excerpt: The zero-copy integration between DuckDB and Apache Arrow allows for ra
tags: ["using DuckDB"]
---

-This post is a collaboration with and cross-posted on [the Arrow blog](https://arrow.apache.org/blog/2021/12/03/arrow-duckdb/).
+This post is a collaboration with and cross-posted on the [Arrow blog](https://arrow.apache.org/blog/2021/12/03/arrow-duckdb/).

Part of [Apache Arrow](https://arrow.apache.org) is an in-memory data format optimized for analytical libraries. Like Pandas and R Dataframes, it uses a columnar data model. But the Arrow project contains more than just the format: The Arrow C++ library, which is accessible in Python, R, and Ruby via bindings, has additional features that allow you to compute efficiently on datasets. These additional features are on top of the implementation of the in-memory format described above. The datasets may span multiple files in Parquet, CSV, or other formats, and files may even be on remote or cloud storage like HDFS or Amazon S3. The Arrow C++ query engine supports the streaming of query results, has an efficient implementation of complex data types (e.g., Lists, Structs, Maps), and can perform important scan optimizations like Projection and Filter Pushdown.

2 changes: 1 addition & 1 deletion docs/api/cpp.md
@@ -54,7 +54,7 @@ std::unique_ptr<PreparedStatement> prepare = con.Prepare("SELECT count(*) FROM a
std::unique_ptr<QueryResult> result = prepare->Execute(12);
```
-> Warning Do **not** use prepared statements to insert large amounts of data into DuckDB. See [the data import documentation]({% link docs/data/overview.md %}) for better options.
+> Warning Do **not** use prepared statements to insert large amounts of data into DuckDB. See the [data import documentation]({% link docs/data/overview.md %}) for better options.
### UDF API
2 changes: 1 addition & 1 deletion docs/api/java.md
@@ -121,7 +121,7 @@ try (PreparedStatement stmt = conn.prepareStatement("INSERT INTO items VALUES (?
}
```

-> Warning Do *not* use prepared statements to insert large amounts of data into DuckDB. See [the data import documentation]({% link docs/data/overview.md %}) for better options.
+> Warning Do *not* use prepared statements to insert large amounts of data into DuckDB. See the [data import documentation]({% link docs/data/overview.md %}) for better options.
### Arrow Methods

2 changes: 1 addition & 1 deletion docs/api/r.md
@@ -109,7 +109,7 @@ print(res)

> DuckDB keeps a reference to the R data frame after registration. This prevents the data frame from being garbage-collected. The reference is cleared when the connection is closed, but can also be cleared manually using the `duckdb_unregister()` method.
-Also refer to [the data import documentation]({% link docs/data/overview.md %}) for more options of efficiently importing data.
+Also refer to the [data import documentation]({% link docs/data/overview.md %}) for more options for efficiently importing data.

## dbplyr

2 changes: 1 addition & 1 deletion docs/extensions/arrow.md
@@ -5,6 +5,7 @@ github_repository: https://github.com/duckdb/arrow
---

The `arrow` extension implements features for using [Apache Arrow](https://arrow.apache.org/), a cross-language development platform for in-memory analytics.
+See the [announcement blog post]({% post_url 2021-12-03-duck-arrow %}) for more details.

## Installing and Loading

@@ -18,7 +19,6 @@ LOAD arrow;

## Functions


| Function | Type | Description |
|--|----|-------|
| `to_arrow_ipc` | Table in-out function | Serializes a table into a stream of blobs containing Arrow IPC buffers |
2 changes: 1 addition & 1 deletion docs/extensions/azure.md
@@ -123,7 +123,7 @@ The Azure extension has two ways to configure the authentication. The preferred

Multiple [Secret Providers]({% link docs/configuration/secrets_manager.md %}#secret-providers) are available for the Azure extension:

-* If you need to define different secrets for different storage accounts, use [the `SCOPE` configuration]({% link docs/configuration/secrets_manager.md %}#creating-multiple-secrets-for-the-same-service-type). Note that the `SCOPE` requires a trailing slash (`SCOPE 'azure://some_container/'`).
+* If you need to define different secrets for different storage accounts, use the [`SCOPE` configuration]({% link docs/configuration/secrets_manager.md %}#creating-multiple-secrets-for-the-same-service-type). Note that the `SCOPE` requires a trailing slash (`SCOPE 'azure://some_container/'`).
* If you use a fully qualified path, the `ACCOUNT_NAME` attribute is optional.

#### `CONFIG` Provider
2 changes: 1 addition & 1 deletion docs/guides/data_viewers/tableau.md
@@ -114,7 +114,7 @@ do shell script "\"/Applications/Tableau Desktop 2023.2.app/Contents/MacOS/Table
quit
```

-Create this file with [the Script Editor](https://support.apple.com/guide/script-editor/welcome/mac)
+Create this file with the [Script Editor](https://support.apple.com/guide/script-editor/welcome/mac)
(located in `/Applications/Utilities`)
and [save it as a packaged application](https://support.apple.com/guide/script-editor/save-a-script-as-an-app-scpedt1072/mac):

2 changes: 1 addition & 1 deletion docs/guides/python/relational_api_pandas.md
@@ -29,4 +29,4 @@ output_df = transformed_rel.df()

Relational operators can also be used to group rows, aggregate, find distinct combinations of values, join, union, and more. They are also able to directly insert results into a DuckDB table or write to a CSV.

-Please see [these additional examples](https://github.com/duckdb/duckdb/blob/main/examples/python/duckdb-python.py) and [the available relational methods on the `DuckDBPyRelation` class]({% link docs/api/python/reference/index.md %}#duckdb.DuckDBPyRelation).
+Please see [these additional examples](https://github.com/duckdb/duckdb/blob/main/examples/python/duckdb-python.py) and the [available relational methods on the `DuckDBPyRelation` class]({% link docs/api/python/reference/index.md %}#duckdb.DuckDBPyRelation).
2 changes: 1 addition & 1 deletion docs/sql/data_types/numeric.md
@@ -63,7 +63,7 @@ For more complex mathematical operations, however, floating-point arithmetic is
In general, we advise that:

* If you require exact storage of numbers with a known number of decimal digits and require exact additions, subtractions, and multiplications (such as for monetary amounts), use the [`DECIMAL` data type](#fixed-point-decimals) or its `NUMERIC` alias instead.
-* If you want to do fast or complicated calculations, the floating-point data types may be more appropriate. However, if you use the results for anything important, you should evaluate your implementation carefully for corner cases (ranges, infinities, underflows, invalid operations) that may be handled differently from what you expect and you should familiarize yourself with common floating-point pitfalls. The article [“What Every Computer Scientist Should Know About Floating-Point Arithmetic” by David Goldberg](https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html) and [the floating point series on Bruce Dawson's blog](https://randomascii.wordpress.com/2017/06/19/sometimes-floating-point-math-is-perfect/) provide excellent starting points.
+* If you want to do fast or complicated calculations, the floating-point data types may be more appropriate. However, if you use the results for anything important, you should evaluate your implementation carefully for corner cases (ranges, infinities, underflows, invalid operations) that may be handled differently from what you expect and you should familiarize yourself with common floating-point pitfalls. The article [“What Every Computer Scientist Should Know About Floating-Point Arithmetic” by David Goldberg](https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html) and the [floating point series on Bruce Dawson's blog](https://randomascii.wordpress.com/2017/06/19/sometimes-floating-point-math-is-perfect/) provide excellent starting points.

On most platforms, the `FLOAT` type has a range of at least 1E-37 to 1E+37 with a precision of at least 6 decimal digits. The `DOUBLE` type typically has a range of around 1E-307 to 1E+308 with a precision of at least 15 digits. Positive numbers outside of these ranges (and negative numbers outside the mirrored ranges) may cause errors on some platforms but will usually be converted to zero or infinity, respectively.

2 changes: 1 addition & 1 deletion docs/sql/data_types/union.md
@@ -106,7 +106,7 @@ So how do we disambiguate if we want to create a `UNION` with multiple members o

## Comparison and Sorting

-Since `UNION` types are implemented on top of `STRUCT` types internally, they can be used with all the comparison operators as well as in both `WHERE` and `HAVING` clauses with [the same semantics as `STRUCT`s]({% link docs/sql/data_types/struct.md %}#comparison-operators). The “tag” is always stored as the first struct entry, which ensures that the `UNION` types are compared and ordered by “tag” first.
+Since `UNION` types are implemented on top of `STRUCT` types internally, they can be used with all the comparison operators as well as in both `WHERE` and `HAVING` clauses with the [same semantics as `STRUCT`s]({% link docs/sql/data_types/struct.md %}#comparison-operators). The “tag” is always stored as the first struct entry, which ensures that the `UNION` types are compared and ordered by “tag” first.

## Functions

2 changes: 1 addition & 1 deletion docs/sql/dialect/postgresql_compatibility.md
@@ -201,7 +201,7 @@ Both DuckDB and PostgreSQL return `current_timestamp` as `TIMESTAMPTZ`. DuckDB a

DuckDB does not currently offer `current_localdate()`, though it can be computed via `current_timestamp::DATE` or `current_localtimestamp()::DATE`.

-> See [the DuckDB blog entry on time zones]({% post_url 2022-01-06-time-zones %}) for more information on timestamps and timezones and DuckDB's handling thereof.
+> See the [DuckDB blog entry on time zones]({% post_url 2022-01-06-time-zones %}) for more information on timestamps and timezones and DuckDB's handling thereof.
## Resolution of Type Names in the Schema

2 changes: 1 addition & 1 deletion docs/sql/expressions/in.md
@@ -8,7 +8,7 @@ railroad: expressions/in.js

## `IN`

-The `IN` operator checks containment of the left expression inside the collection the right hand side (RHS). The `IN` operator returns `true` if the expression is present in the RHS, `false` if the expression is not in the RHS and the RHS has no `NULL` values, or `NULL` if the expression is not in the RHS and the RHS has `NULL` values. Supported collections on the RHS are tuples, lists, maps and subqueries that return a single column (see [the subqueries page]({% link docs/sql/expressions/subqueries.md %})). For maps, the `IN` operator checks for containment in the keys, not in the values.
+The `IN` operator checks containment of the left expression inside the collection on the right-hand side (RHS). The `IN` operator returns `true` if the expression is present in the RHS, `false` if the expression is not in the RHS and the RHS has no `NULL` values, or `NULL` if the expression is not in the RHS and the RHS has `NULL` values. Supported collections on the RHS are tuples, lists, maps and subqueries that return a single column (see the [subqueries page]({% link docs/sql/expressions/subqueries.md %})). For maps, the `IN` operator checks for containment in the keys, not in the values.

```sql
SELECT 'Math' IN ('CS', 'Math');
```

10 changes: 5 additions & 5 deletions docs/sql/statements/select.md
@@ -97,31 +97,31 @@ The [`FROM` clause]({% link docs/sql/query_syntax/from.md %}) specifies the *sou

<div id="rrdiagram10"></div>

-[The `SAMPLE` clause]({% link docs/sql/query_syntax/sample.md %}) allows you to run the query on a sample from the base table. This can significantly speed up processing of queries, at the expense of accuracy in the result. Samples can also be used to quickly see a snapshot of the data when exploring a data set. The `SAMPLE` clause is applied right after anything in the `FROM` clause (i.e., after any joins, but before the where clause or any aggregates). See the [Samples]({% link docs/sql/samples.md %}) page for more information.
+The [`SAMPLE` clause]({% link docs/sql/query_syntax/sample.md %}) allows you to run the query on a sample from the base table. This can significantly speed up processing of queries, at the expense of accuracy in the result. Samples can also be used to quickly see a snapshot of the data when exploring a data set. The `SAMPLE` clause is applied right after anything in the `FROM` clause (i.e., after any joins, but before the `WHERE` clause or any aggregates). See the [Samples]({% link docs/sql/samples.md %}) page for more information.

## `WHERE` Clause

<div id="rrdiagram5"></div>

-[The `WHERE` clause]({% link docs/sql/query_syntax/where.md %}) specifies any filters to apply to the data. This allows you to select only a subset of the data in which you are interested. Logically the `WHERE` clause is applied immediately after the `FROM` clause.
+The [`WHERE` clause]({% link docs/sql/query_syntax/where.md %}) specifies any filters to apply to the data. This allows you to select only a subset of the data in which you are interested. Logically the `WHERE` clause is applied immediately after the `FROM` clause.

## `GROUP BY` and `HAVING` Clauses

<div id="rrdiagram6"></div>

-[The `GROUP BY` clause]({% link docs/sql/query_syntax/groupby.md %}) specifies which grouping columns should be used to perform any aggregations in the `SELECT` clause. If the `GROUP BY` clause is specified, the query is always an aggregate query, even if no aggregations are present in the `SELECT` clause.
+The [`GROUP BY` clause]({% link docs/sql/query_syntax/groupby.md %}) specifies which grouping columns should be used to perform any aggregations in the `SELECT` clause. If the `GROUP BY` clause is specified, the query is always an aggregate query, even if no aggregations are present in the `SELECT` clause.

## `WINDOW` Clause

<div id="rrdiagram7"></div>

-[The `WINDOW` clause]({% link docs/sql/query_syntax/window.md %}) allows you to specify named windows that can be used within window functions. These are useful when you have multiple window functions, as they allow you to avoid repeating the same window clause.
+The [`WINDOW` clause]({% link docs/sql/query_syntax/window.md %}) allows you to specify named windows that can be used within window functions. These are useful when you have multiple window functions, as they allow you to avoid repeating the same window clause.

## `QUALIFY` Clause

<div id="rrdiagram11"></div>

-[The `QUALIFY` clause]({% link docs/sql/query_syntax/qualify.md %}) is used to filter the result of [`WINDOW` functions]({% link docs/sql/functions/window_functions.md %}).
+The [`QUALIFY` clause]({% link docs/sql/query_syntax/qualify.md %}) is used to filter the result of [`WINDOW` functions]({% link docs/sql/functions/window_functions.md %}).

## `ORDER BY`, `LIMIT` and `OFFSET` Clauses

16 changes: 16 additions & 0 deletions faq.md
@@ -166,6 +166,22 @@ DuckDB can make use of available memory for caching, it also fully supports disk

<div class="qa-wrap" markdown="1">

### Is DuckDB built on Arrow?

<div class="answer" markdown="1">

DuckDB does not use the [Apache Arrow format](https://arrow.apache.org/) internally.
However, DuckDB supports reading from / writing to Arrow using the [`arrow` extension]({% link docs/extensions/arrow.md %}).
It can also run SQL queries directly on Arrow using [`pyarrow`]({% link docs/guides/python/sql_on_arrow.md %}).

</div>

</div>

<!-- ----- ----- ----- ----- ----- ----- Q&A entry ----- ----- ----- ----- ----- ----- -->

<div class="qa-wrap" markdown="1">

### Are DuckDB's database files portable between different DuckDB versions and clients?

<div class="answer" markdown="1">