Merge pull request #4341 from szarnyasg/json2
JSON pages rework
szarnyasg authored Dec 10, 2024
2 parents 253ba7d + 4d81040 commit 5e029d3
Showing 9 changed files with 19 additions and 11 deletions.
2 changes: 1 addition & 1 deletion _posts/2023-02-13-announcing-duckdb-070.md
@@ -21,7 +21,7 @@ The new release contains many improvements to the JSON support, new SQL features

## Data Ingestion/Export Improvements

-**JSON Ingestion.** This version introduces the [`read_json` and `read_json_auto`](https://github.com/duckdb/duckdb/pull/5992) methods. These can be used to ingest JSON files into a tabular format. Similar to `read_csv`, the `read_json` method requires a schema to be specified, while the `read_json_auto` automatically infers the schema of the JSON from the file using sampling. Both [new-line delimited JSON](http://ndjson.org) and regular JSON are supported.
+**JSON Ingestion.** This version introduces the [`read_json` and `read_json_auto`](https://github.com/duckdb/duckdb/pull/5992) methods. These can be used to ingest JSON files into a tabular format. Similar to `read_csv`, the `read_json` method requires a schema to be specified, while the `read_json_auto` automatically infers the schema of the JSON from the file using sampling. Both [new-line delimited JSON](https://github.com/ndjson/ndjson-spec) and regular JSON are supported.
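
As a quick sketch of the two readers (the file name `people.ndjson` and its columns are invented for illustration):

```sql
-- people.ndjson (hypothetical contents):
--   {"name": "Alice", "age": 42}
--   {"name": "Bob", "age": 35}

-- Infer the schema by sampling:
SELECT * FROM read_json_auto('people.ndjson');

-- Or specify the schema explicitly:
SELECT * FROM read_json('people.ndjson',
    columns = {name: 'VARCHAR', age: 'INTEGER'});
```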

```sql
FROM 'data/json/with_list.json';
2 changes: 1 addition & 1 deletion _posts/2023-03-03-json.md
@@ -86,7 +86,7 @@ DuckDB will read multiple files in parallel.
## Newline Delimited JSON

Not all JSON adheres to the format used in `todos.json`, which is an array of 'records'.
-Newline-delimited JSON, or [NDJSON](http://ndjson.org), stores each row on a new line.
+Newline-delimited JSON, or [NDJSON](https://github.com/ndjson/ndjson-spec), stores each row on a new line.
DuckDB also supports reading (and writing!) this format.
First, let's write our TODO list as NDJSON:
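
The export step elided from this hunk might look like the following sketch (assuming a `todos` table as in the post; with `FORMAT json`, DuckDB's `COPY` writes one object per line, i.e., NDJSON, unless the `ARRAY` option is enabled):

```sql
COPY todos TO 'todos-nd.json' (FORMAT json);
```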

2 changes: 1 addition & 1 deletion docs/archive/0.10/extensions/json.md
@@ -163,7 +163,7 @@ With `'unstructured'`, the top-level JSON is read, e.g.:

will result in two objects being read.

-With `'newline_delimited'`, [NDJSON](http://ndjson.org) is read, where each JSON is separated by a newline (`\n`), e.g.:
+With `'newline_delimited'`, [NDJSON](https://github.com/ndjson/ndjson-spec) is read, where each JSON is separated by a newline (`\n`), e.g.:

```json
{"duck": 42}
2 changes: 1 addition & 1 deletion docs/archive/0.8/extensions/json.md
@@ -64,7 +64,7 @@ With `'unstructured'`, the top-level JSON is read, e.g.:
```
Will result in two objects being read.

-With `'newline_delimited'`, [NDJSON](http://ndjson.org) is read, where each JSON is separated by a newline (`\n`), e.g.:
+With `'newline_delimited'`, [NDJSON](https://github.com/ndjson/ndjson-spec) is read, where each JSON is separated by a newline (`\n`), e.g.:
```json
{"duck": 42}
{"goose": [1,2,3]}
2 changes: 1 addition & 1 deletion docs/archive/0.9/extensions/json.md
@@ -87,7 +87,7 @@ With `'unstructured'`, the top-level JSON is read, e.g.:
```
Will result in two objects being read.

-With `'newline_delimited'`, [NDJSON](http://ndjson.org) is read, where each JSON is separated by a newline (`\n`), e.g.:
+With `'newline_delimited'`, [NDJSON](https://github.com/ndjson/ndjson-spec) is read, where each JSON is separated by a newline (`\n`), e.g.:
```json
{"duck": 42}
{"goose": [1, 2, 3]}
2 changes: 1 addition & 1 deletion docs/archive/1.0/extensions/json.md
@@ -161,7 +161,7 @@ With `'unstructured'`, the top-level JSON is read, e.g.:

will result in two objects being read.

-With `'newline_delimited'`, [NDJSON](http://ndjson.org) is read, where each JSON is separated by a newline (`\n`), e.g.:
+With `'newline_delimited'`, [NDJSON](https://github.com/ndjson/ndjson-spec) is read, where each JSON is separated by a newline (`\n`), e.g.:

```json
{"duck": 42}
11 changes: 7 additions & 4 deletions docs/data/json/json_functions.md
@@ -15,13 +15,16 @@ These functions support the same two location notations as [JSON Scalar functio
| `json_extract_string(json, path)` | `json_extract_path_text` | `->>` | Extracts `VARCHAR` from `json` at the given `path`. If `path` is a `LIST`, the result will be a `LIST` of `VARCHAR`. |
| `json_value(json, path)` | | | Extracts `JSON` from `json` at the given `path`. If the `json` at the supplied path is not a scalar value, it will return `NULL`. |
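
As an illustrative sketch of the difference between `json_extract` and `json_value` (values invented for the example):

```sql
SELECT json_extract('{"duck": {"name": "quack"}}', '$.duck');  -- {"name":"quack"}
SELECT json_value('{"duck": {"name": "quack"}}', '$.duck');    -- NULL: not a scalar value
SELECT json_value('{"duck": 42}', '$.duck');                   -- 42
```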

-Note that the equality comparison operator (`=`) has a higher precedence than the `->` JSON extract operator. Therefore, surround the uses of the `->` operator with parentheses when making equality comparisons. For example:
+Note that the arrow operator `->`, which is used for JSON extracts, has a low precedence as it is also used in [lambda functions]({% link docs/sql/functions/lambda.md %}).
+
+Therefore, you need to surround the `->` operator with parentheses when expressing operations such as equality comparisons (`=`).
+For example:

```sql
SELECT ((JSON '{"field": 42}')->'field') = 42;
```

-> Warning DuckDB's JSON data type uses [0-based indexing](#indexing).
+> Warning DuckDB's JSON data type uses [0-based indexing]({% link docs/data/json/overview.md %}#indexing).
Examples:

@@ -127,7 +130,7 @@ SELECT j->'species'->>['0','1'] FROM example;
[duck, goose]
```

-Note that DuckDB's JSON data type uses [0-based indexing](#indexing).
+Note that DuckDB's JSON data type uses [0-based indexing]({% link docs/data/json/overview.md %}#indexing).

If multiple values need to be extracted from the same JSON, it is more efficient to extract a list of paths:
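
The example elided from this hunk presumably resembles the following sketch (data invented), where each path in the list yields one element of the resulting `LIST`:

```sql
SELECT json_extract('{"family": "anatidae", "species": ["duck", "goose"]}',
                    ['$.family', '$.species[0]']);
```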

@@ -199,7 +202,7 @@ SELECT json_extract('{"duck": [1, 2, 3]}', '$.duck[0]');
1
```

-Note that DuckDB's JSON data type uses [0-based indexing](#indexing).
+Note that DuckDB's JSON data type uses [0-based indexing]({% link docs/data/json/overview.md %}#indexing).

JSONPath is more expressive, and can also access from the back of lists:
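
A brief sketch of that back-of-list access, using the `[#-k]` notation from DuckDB's JSONPath support:

```sql
SELECT json_extract('{"duck": [1, 2, 3]}', '$.duck[#-1]');  -- 3
```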

3 changes: 2 additions & 1 deletion docs/data/json/loading_json.md
@@ -106,7 +106,7 @@ will result in two objects being read:
└──────────────────────────────┘
```

-With `'newline_delimited'`, [NDJSON](http://ndjson.org) is read, where each JSON is separated by a newline (`\n`), e.g., for `birds-nd.json`:
+With `'newline_delimited'`, [NDJSON](https://github.com/ndjson/ndjson-spec) is read, where each JSON is separated by a newline (`\n`), e.g., for `birds-nd.json`:

```json
{"duck": 42}
@@ -184,6 +184,7 @@ Besides the `maximum_object_size`, `format`, `ignore_errors` and `compression`,
| `timestampformat` | Specifies the date format to use when parsing timestamps. See [Date Format]({% link docs/sql/functions/dateformat.md %}) | `VARCHAR` | `'iso'`|
| `union_by_name` | Whether the schemas of multiple JSON files should be [unified]({% link docs/data/multiple_files/combining_schemas.md %}) | `BOOL` | `false` |
| `map_inference_threshold` | Controls the threshold for number of columns whose schema will be auto-detected; if JSON schema auto-detection would infer a `STRUCT` type for a field that has _more_ than this threshold number of subfields, it infers a `MAP` type instead. Set to `-1` to disable `MAP` inference. | `BIGINT` | `24` |
+| `field_appearance_threshold` | The JSON reader divides the number of appearances of each JSON field by the auto-detection sample size. If the average over the fields of an object is less than this threshold, it defaults to a `MAP` type whose value type is the merged type of the fields. | `DOUBLE` | `0.1` |

Note that DuckDB can convert JSON arrays directly to its internal `LIST` type, and missing keys become `NULL`:
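
For instance, in this sketch with an invented file `birds.json`, the `tags` array becomes a `LIST` column, and rows that lack a key get `NULL` in the corresponding column:

```sql
-- birds.json (hypothetical contents):
--   {"duck": 42, "tags": ["water", "bird"]}
--   {"goose": 7}
SELECT * FROM read_json_auto('birds.json');
```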

4 changes: 4 additions & 0 deletions docs/data/json/overview.md
@@ -15,6 +15,10 @@ If you would like to install or load it manually, please consult the [“Install
JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other serializable values).
While it is not a very efficient format for tabular data, it is very commonly used, especially as a data interchange format.

+> Bestpractice DuckDB implements multiple interfaces for JSON extraction: [JSONPath](https://goessner.net/articles/JsonPath/) and [JSON Pointer](https://datatracker.ietf.org/doc/html/rfc6901). Both of them work with the arrow operator (`->`) and the `json_extract` function call. It's best to pick one syntax and use it in your entire application.
+<!-- DuckDB mostly uses the PostgreSQL syntax, some functions from SQLite, and a few functions from other SQL systems -->
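
A sketch of the two notations side by side, each extracting the same element (sample data invented):

```sql
-- JSONPath:
SELECT json_extract('{"duck": [1, 2, 3]}', '$.duck[0]');
-- JSON Pointer:
SELECT json_extract('{"duck": [1, 2, 3]}', '/duck/0');
```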

## Indexing

> Warning Following [PostgreSQL's conventions]({% link docs/sql/dialect/postgresql_compatibility.md %}), DuckDB uses 1-based indexing for its [`ARRAY`]({% link docs/sql/data_types/array.md %}) and [`LIST`]({% link docs/sql/data_types/list.md %}) data types but [0-based indexing for the JSON data type](https://www.postgresql.org/docs/17/functions-json.html#FUNCTIONS-JSON-PROCESSING).
