Merge pull request #4341 from szarnyasg/json2
JSON pages rework
szarnyasg authored Dec 10, 2024
2 parents 253ba7d + 4d81040 commit 5e029d3
Showing 9 changed files with 19 additions and 11 deletions.
2 changes: 1 addition & 1 deletion _posts/2023-02-13-announcing-duckdb-070.md
@@ -21,7 +21,7 @@ The new release contains many improvements to the JSON support, new SQL features

## Data Ingestion/Export Improvements

-**JSON Ingestion.** This version introduces the [`read_json` and `read_json_auto`](https://github.com/duckdb/duckdb/pull/5992) methods. These can be used to ingest JSON files into a tabular format. Similar to `read_csv`, the `read_json` method requires a schema to be specified, while the `read_json_auto` automatically infers the schema of the JSON from the file using sampling. Both [new-line delimited JSON](http://ndjson.org) and regular JSON are supported.
+**JSON Ingestion.** This version introduces the [`read_json` and `read_json_auto`](https://github.com/duckdb/duckdb/pull/5992) methods. These can be used to ingest JSON files into a tabular format. Similar to `read_csv`, the `read_json` method requires a schema to be specified, while the `read_json_auto` automatically infers the schema of the JSON from the file using sampling. Both [new-line delimited JSON](https://github.com/ndjson/ndjson-spec) and regular JSON are supported.
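
As a quick sketch of the two readers (the file name `people.ndjson` and its columns are invented for illustration):

```sql
-- people.ndjson (hypothetical contents):
--   {"name": "Alice", "age": 42}
--   {"name": "Bob", "age": 35}

-- Infer the schema by sampling:
SELECT * FROM read_json_auto('people.ndjson');

-- Or specify the schema explicitly:
SELECT * FROM read_json('people.ndjson',
    columns = {name: 'VARCHAR', age: 'INTEGER'});
```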

```sql
FROM 'data/json/with_list.json';
2 changes: 1 addition & 1 deletion _posts/2023-03-03-json.md
@@ -86,7 +86,7 @@ DuckDB will read multiple files in parallel.
## Newline Delimited JSON

Not all JSON adheres to the format used in `todos.json`, which is an array of 'records'.
-Newline-delimited JSON, or [NDJSON](http://ndjson.org), stores each row on a new line.
+Newline-delimited JSON, or [NDJSON](https://github.com/ndjson/ndjson-spec), stores each row on a new line.
DuckDB also supports reading (and writing!) this format.
First, let's write our TODO list as NDJSON:
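
The export step elided from this hunk might look like the following sketch (assuming a `todos` table as in the post; with `FORMAT json`, DuckDB's `COPY` writes one object per line, i.e., NDJSON, unless the `ARRAY` option is enabled):

```sql
COPY todos TO 'todos-nd.json' (FORMAT json);
```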

2 changes: 1 addition & 1 deletion docs/archive/0.10/extensions/json.md
@@ -163,7 +163,7 @@ With `'unstructured'`, the top-level JSON is read, e.g.:

will result in two objects being read.

-With `'newline_delimited'`, [NDJSON](http://ndjson.org) is read, where each JSON is separated by a newline (`\n`), e.g.:
+With `'newline_delimited'`, [NDJSON](https://github.com/ndjson/ndjson-spec) is read, where each JSON is separated by a newline (`\n`), e.g.:

```json
{"duck": 42}
2 changes: 1 addition & 1 deletion docs/archive/0.8/extensions/json.md
@@ -64,7 +64,7 @@ With `'unstructured'`, the top-level JSON is read, e.g.:
```
Will result in two objects being read.

-With `'newline_delimited'`, [NDJSON](http://ndjson.org) is read, where each JSON is separated by a newline (`\n`), e.g.:
+With `'newline_delimited'`, [NDJSON](https://github.com/ndjson/ndjson-spec) is read, where each JSON is separated by a newline (`\n`), e.g.:
```json
{"duck": 42}
{"goose": [1,2,3]}
2 changes: 1 addition & 1 deletion docs/archive/0.9/extensions/json.md
@@ -87,7 +87,7 @@ With `'unstructured'`, the top-level JSON is read, e.g.:
```
Will result in two objects being read.

-With `'newline_delimited'`, [NDJSON](http://ndjson.org) is read, where each JSON is separated by a newline (`\n`), e.g.:
+With `'newline_delimited'`, [NDJSON](https://github.com/ndjson/ndjson-spec) is read, where each JSON is separated by a newline (`\n`), e.g.:
```json
{"duck": 42}
{"goose": [1, 2, 3]}
2 changes: 1 addition & 1 deletion docs/archive/1.0/extensions/json.md
@@ -161,7 +161,7 @@ With `'unstructured'`, the top-level JSON is read, e.g.:

will result in two objects being read.

-With `'newline_delimited'`, [NDJSON](http://ndjson.org) is read, where each JSON is separated by a newline (`\n`), e.g.:
+With `'newline_delimited'`, [NDJSON](https://github.com/ndjson/ndjson-spec) is read, where each JSON is separated by a newline (`\n`), e.g.:

```json
{"duck": 42}
11 changes: 7 additions & 4 deletions docs/data/json/json_functions.md
@@ -15,13 +15,16 @@ These functions support the same two location notations as [JSON Scalar functio
| `json_extract_string(json, path)` | `json_extract_path_text` | `->>` | Extracts `VARCHAR` from `json` at the given `path`. If `path` is a `LIST`, the result will be a `LIST` of `VARCHAR`. |
| `json_value(json, path)` | | | Extracts `JSON` from `json` at the given `path`. If the `json` at the supplied path is not a scalar value, it will return `NULL`. |
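
As an illustrative sketch of the difference between `json_extract` and `json_value` (values invented for the example):

```sql
SELECT json_extract('{"duck": {"name": "quack"}}', '$.duck');  -- {"name":"quack"}
SELECT json_value('{"duck": {"name": "quack"}}', '$.duck');    -- NULL: not a scalar value
SELECT json_value('{"duck": 42}', '$.duck');                   -- 42
```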

-Note that the equality comparison operator (`=`) has a higher precedence than the `->` JSON extract operator. Therefore, surround the uses of the `->` operator with parentheses when making equality comparisons. For example:
+Note that the arrow operator `->`, which is used for JSON extracts, has a low precedence as it is also used in [lambda functions]({% link docs/sql/functions/lambda.md %}).
+
+Therefore, you need to surround the `->` operator with parentheses when expressing operations such as equality comparisons (`=`).
+For example:

```sql
SELECT ((JSON '{"field": 42}')->'field') = 42;
```

-> Warning DuckDB's JSON data type uses [0-based indexing](#indexing).
+> Warning DuckDB's JSON data type uses [0-based indexing]({% link docs/data/json/overview.md %}#indexing).
Examples:

@@ -127,7 +130,7 @@ SELECT j->'species'->>['0','1'] FROM example;
[duck, goose]
```

-Note that DuckDB's JSON data type uses [0-based indexing](#indexing).
+Note that DuckDB's JSON data type uses [0-based indexing]({% link docs/data/json/overview.md %}#indexing).

If multiple values need to be extracted from the same JSON, it is more efficient to extract a list of paths:
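
The example elided from this hunk presumably resembles the following sketch (data invented), where each path in the list yields one element of the resulting `LIST`:

```sql
SELECT json_extract('{"family": "anatidae", "species": ["duck", "goose"]}',
                    ['$.family', '$.species[0]']);
```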

@@ -199,7 +202,7 @@ SELECT json_extract('{"duck": [1, 2, 3]}', '$.duck[0]');
1
```

-Note that DuckDB's JSON data type uses [0-based indexing](#indexing).
+Note that DuckDB's JSON data type uses [0-based indexing]({% link docs/data/json/overview.md %}#indexing).

JSONPath is more expressive, and can also access from the back of lists:
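
A brief sketch of that back-of-list access, using the `[#-k]` notation from DuckDB's JSONPath support:

```sql
SELECT json_extract('{"duck": [1, 2, 3]}', '$.duck[#-1]');  -- 3
```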

3 changes: 2 additions & 1 deletion docs/data/json/loading_json.md
@@ -106,7 +106,7 @@ will result in two objects being read:
└──────────────────────────────┘
```

-With `'newline_delimited'`, [NDJSON](http://ndjson.org) is read, where each JSON is separated by a newline (`\n`), e.g., for `birds-nd.json`:
+With `'newline_delimited'`, [NDJSON](https://github.com/ndjson/ndjson-spec) is read, where each JSON is separated by a newline (`\n`), e.g., for `birds-nd.json`:

```json
{"duck": 42}
@@ -184,6 +184,7 @@ Besides the `maximum_object_size`, `format`, `ignore_errors` and `compression`,
| `timestampformat` | Specifies the date format to use when parsing timestamps. See [Date Format]({% link docs/sql/functions/dateformat.md %}) | `VARCHAR` | `'iso'`|
| `union_by_name` | Whether the schemas of multiple JSON files should be [unified]({% link docs/data/multiple_files/combining_schemas.md %}) | `BOOL` | `false` |
| `map_inference_threshold` | Controls the threshold for number of columns whose schema will be auto-detected; if JSON schema auto-detection would infer a `STRUCT` type for a field that has _more_ than this threshold number of subfields, it infers a `MAP` type instead. Set to `-1` to disable `MAP` inference. | `BIGINT` | `24` |
+| `field_appearance_threshold` | The JSON reader divides the number of appearances of each JSON field by the auto-detection sample size. If the average over the fields of an object is less than this threshold, it defaults to a `MAP` type whose value type is the merged type of the fields. | `DOUBLE` | `0.1` |

Note that DuckDB can convert JSON arrays directly to its internal `LIST` type, and missing keys become `NULL`:
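
For instance, in this sketch with an invented file `birds.json`, the `tags` array becomes a `LIST` column, and rows that lack a key get `NULL` in the corresponding column:

```sql
-- birds.json (hypothetical contents):
--   {"duck": 42, "tags": ["water", "bird"]}
--   {"goose": 7}
SELECT * FROM read_json_auto('birds.json');
```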

4 changes: 4 additions & 0 deletions docs/data/json/overview.md
@@ -15,6 +15,10 @@ If you would like to install or load it manually, please consult the [“Install
JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other serializable values).
While it is not a very efficient format for tabular data, it is very commonly used, especially as a data interchange format.

+> Bestpractice DuckDB implements multiple interfaces for JSON extraction: [JSONPath](https://goessner.net/articles/JsonPath/) and [JSON Pointer](https://datatracker.ietf.org/doc/html/rfc6901). Both of them work with the arrow operator (`->`) and the `json_extract` function call. It's best to pick one syntax and use it in your entire application.
+<!-- DuckDB mostly uses the PostgreSQL syntax, some functions from SQLite, and a few functions from other SQL systems -->
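
A sketch of the two notations side by side, each extracting the same element (sample data invented):

```sql
-- JSONPath:
SELECT json_extract('{"duck": [1, 2, 3]}', '$.duck[0]');
-- JSON Pointer:
SELECT json_extract('{"duck": [1, 2, 3]}', '/duck/0');
```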

## Indexing

> Warning Following [PostgreSQL's conventions]({% link docs/sql/dialect/postgresql_compatibility.md %}), DuckDB uses 1-based indexing for its [`ARRAY`]({% link docs/sql/data_types/array.md %}) and [`LIST`]({% link docs/sql/data_types/list.md %}) data types but [0-based indexing for the JSON data type](https://www.postgresql.org/docs/17/functions-json.html#FUNCTIONS-JSON-PROCESSING).
