diff --git a/NEWS.md b/NEWS.md index fa384dc10..b56a3956f 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1013,4 +1013,7 @@ rowwiseDT( 20. Some clarity is added to `?GForce` for the case when subtle changes to `j` produce different results because of differences in locale. Because `data.table` _always_ uses the "C" locale, small changes to queries which activate/deactivate GForce might cause confusingly different results when sorting is involved, [#5331](https://github.com/Rdatatable/data.table/issues/5331). The inspirational example compared `DT[, .(max(a), max(b)), by=grp]` and `DT[, .(max(a), max(tolower(b))), by=grp]` -- in the latter case, GForce is deactivated owing to the _ad-hoc_ column, so the result for `max(a)` might differ for the two queries. An example is added to `?GForce`. As always, there are several options to guarantee consistency, for example (1) use namespace qualification to deactivate GForce: `DT[, .(base::max(a), base::max(b)), by=grp]`; (2) turn off all optimizations with `options(datatable.optimize = 0)`; or (3) set your R session to always sort in C locale with `Sys.setlocale("LC_COLLATE", "C")` (or temporarily with e.g. `withr::with_locale()`). Thanks @markseeto for the example and @michaelchirico for the improved documentation. +Merge.data.table: Improved Error Handling +Argument by in merge.data.table() now provides more informative error messages. When columns specified in by are missing from either of the data tables (x or y), the error will now clearly list which columns are missing from each table, making debugging easier. This improvement ensures better diagnostics for users when performing merges with invalid column names. This change will help users resolve errors more efficiently during data table merges. + # data.table v1.14.10 (Dec 2023) back to v1.10.0 (Dec 2016) has been moved to [NEWS.1.md](https://github.com/Rdatatable/data.table/blob/master/NEWS.1.md) diff --git a/R/merge.R b/R/merge.R index ab93d5498..0b03084d6 100644 --- a/R/merge.R +++ b/R/merge.R @@ -50,11 +50,30 @@ merge.data.table = function(x, y, by = NULL, by.x = NULL, by.y = NULL, all = FAL by = intersect(nm_x, nm_y) if (length(by) == 0L || !is.character(by)) stopf("A non-empty vector of column names for `by` is required.") - if (!all(by %chin% intersect(nm_x, nm_y))) - stopf("Elements listed in `by` must be valid column names in x and y") - by = unname(by) - by.x = by.y = by + # UPDATED PART STARTS HERE + if (!all(by %chin% intersect(nm_x, nm_y))) { + # Identify which keys are missing from each data table + missing_x = setdiff(by, nm_x) + missing_y = setdiff(by, nm_y) + + # Construct a more detailed error message + error_message = "Elements listed in `by` must be valid column names in x and y." + + if (length(missing_x) > 0) { + error_message = paste(error_message, "\nMissing columns in 'x':", paste(missing_x, collapse = ", ")) + } + if (length(missing_y) > 0) { + error_message = paste(error_message, "\nMissing columns in 'y':", paste(missing_y, collapse = ", ")) + } + + # Raise the error with the detailed message + stopf(error_message) } + # UPDATED PART ENDS HERE + + by = unname(by) + by.x = by.y = by +} # warn about unused arguments #2587 if (length(list(...))) {