Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fixed #6556 by making the error more informative #6690

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -1013,4 +1013,6 @@ rowwiseDT(

20. Some clarity is added to `?GForce` for the case when subtle changes to `j` produce different results because of differences in locale. Because `data.table` _always_ uses the "C" locale, small changes to queries which activate/deactivate GForce might cause confusingly different results when sorting is involved, [#5331](https://github.com/Rdatatable/data.table/issues/5331). The inspirational example compared `DT[, .(max(a), max(b)), by=grp]` and `DT[, .(max(a), max(tolower(b))), by=grp]` -- in the latter case, GForce is deactivated owing to the _ad-hoc_ column, so the result for `max(a)` might differ for the two queries. An example is added to `?GForce`. As always, there are several options to guarantee consistency, for example (1) use namespace qualification to deactivate GForce: `DT[, .(base::max(a), base::max(b)), by=grp]`; (2) turn off all optimizations with `options(datatable.optimize = 0)`; or (3) set your R session to always sort in C locale with `Sys.setlocale("LC_COLLATE", "C")` (or temporarily with e.g. `withr::with_locale()`). Thanks @markseeto for the example and @michaelchirico for the improved documentation.

merge() now provides improved error handling for invalid column names in the by argument. When performing a join, the error messages explicitly identify the missing columns in both x and y, ensuring clarity for users. Fixes #6556. Thanks @venom1204 for the PR ,

# data.table v1.14.10 (Dec 2023) back to v1.10.0 (Dec 2016) has been moved to [NEWS.1.md](https://github.com/Rdatatable/data.table/blob/master/NEWS.1.md)
138 changes: 75 additions & 63 deletions R/merge.R
Original file line number Diff line number Diff line change
@@ -1,121 +1,133 @@
merge.data.table = function(x, y, by = NULL, by.x = NULL, by.y = NULL, all = FALSE, all.x = all,
all.y = all, sort = TRUE, suffixes = c(".x", ".y"), no.dups = TRUE, allow.cartesian=getOption("datatable.allow.cartesian"), incomparables=NULL, ...) {
if (!sort %in% c(TRUE, FALSE))
stopf("Argument 'sort' should be logical TRUE/FALSE")
if (!no.dups %in% c(TRUE, FALSE))
stopf("Argument 'no.dups' should be logical TRUE/FALSE")
all.y = all, sort = TRUE, suffixes = c(".x", ".y"), no.dups = TRUE,
allow.cartesian = getOption("datatable.allow.cartesian"),

Check warning on line 3 in R/merge.R

View workflow job for this annotation

GitHub Actions / lint-r

file=R/merge.R,line=3,col=86,[trailing_whitespace_linter] Remove trailing whitespace.
incomparables = NULL, ...) {
if (!is.logical(sort)) stopf("Argument 'sort' should be logical TRUE/FALSE")
if (!is.logical(no.dups)) stopf("Argument 'no.dups' should be logical TRUE/FALSE")

Check warning on line 7 in R/merge.R

View workflow job for this annotation

GitHub Actions / lint-r

file=R/merge.R,line=7,col=1,[trailing_whitespace_linter] Remove trailing whitespace.
class_x = class(x)
if (!is.data.table(y)) {
y = as.data.table(y)
if (missing(by) && missing(by.x)) {
by = key(x)
}
}
x0 = length(x)==0L
y0 = length(y)==0L

Check warning on line 15 in R/merge.R

View workflow job for this annotation

GitHub Actions / lint-r

file=R/merge.R,line=15,col=1,[trailing_whitespace_linter] Remove trailing whitespace.
x0 = length(x) == 0L
y0 = length(y) == 0L

Check warning on line 18 in R/merge.R

View workflow job for this annotation

GitHub Actions / lint-r

file=R/merge.R,line=18,col=1,[trailing_whitespace_linter] Remove trailing whitespace.
if (x0 || y0) {
if (x0 && y0)
if (x0 && y0) {
warningf("Neither of the input data.tables to join have columns.")
else if (x0)
} else if (x0) {
warningf("Input data.table '%s' has no columns.", "x")
else
} else {
warningf("Input data.table '%s' has no columns.", "y")
}
}

Check warning on line 28 in R/merge.R

View workflow job for this annotation

GitHub Actions / lint-r

file=R/merge.R,line=28,col=1,[trailing_whitespace_linter] Remove trailing whitespace.
check_duplicate_names(x)
check_duplicate_names(y)

nm_x = names(x)
nm_y = names(y)

## set up 'by'/'by.x'/'by.y'
if ( (!is.null(by.x) || !is.null(by.y)) && length(by.x)!=length(by.y) )
stopf("`by.x` and `by.y` must be of same length.")
if (!missing(by) && !missing(by.x))
warningf("Supplied both `by` and `by.x/by.y`. `by` argument will be ignored.")
# Setup 'by', 'by.x', 'by.y'
if ((!is.null(by.x) || !is.null(by.y)) && length(by.x) != length(by.y)) {
stopf("by.x and by.y must be of the same length.")
}

Check warning on line 39 in R/merge.R

View workflow job for this annotation

GitHub Actions / lint-r

file=R/merge.R,line=39,col=1,[trailing_whitespace_linter] Remove trailing whitespace.
if (!missing(by) && !missing(by.x)) {
warningf("Supplied both 'by' and 'by.x/by.y'. 'by' argument will be ignored.")
}

Check warning on line 43 in R/merge.R

View workflow job for this annotation

GitHub Actions / lint-r

file=R/merge.R,line=43,col=1,[trailing_whitespace_linter] Remove trailing whitespace.
if (!is.null(by.x)) {
if (length(by.x)==0L || !is.character(by.x) || !is.character(by.y))
stopf("A non-empty vector of column names is required for `by.x` and `by.y`.")
if (!all(by.x %chin% nm_x))
stopf("Elements listed in `by.x` must be valid column names in x.")
if (!all(by.y %chin% nm_y))
stopf("Elements listed in `by.y` must be valid column names in y.")
if (length(by.x) == 0L || !is.character(by.x) || !is.character(by.y)) {
stopf("A non-empty vector of column names is required for by.x and by.y.")
}
if (!all(by.x %chin% nm_x)) {
stopf("Elements listed in by.x must be valid column names in x.")
}
if (!all(by.y %chin% nm_y)) {
stopf("Elements listed in by.y must be valid column names in y.")
}
by = by.x
names(by) = by.y
} else {
if (is.null(by))
by = intersect(key(x), key(y))
if (!length(by)) # was is.null() before PR#5183 changed to !length()
by = key(x)
if (!length(by))
by = intersect(nm_x, nm_y)
if (length(by) == 0L || !is.character(by))
stopf("A non-empty vector of column names for `by` is required.")
if (!all(by %chin% intersect(nm_x, nm_y)))
stopf("Elements listed in `by` must be valid column names in x and y")
if (is.null(by)) by = intersect(key(x), key(y))
if (!length(by)) by = key(x)
if (!length(by)) by = intersect(nm_x, nm_y)
if (length(by) == 0L || !is.character(by)) {
stopf("A non-empty vector of column names for 'by' is required.")
}

# Updated Error Handling Section
missing_in_x = setdiff(by, nm_x)
missing_in_y = setdiff(by, nm_y)
if (length(missing_in_x) > 0 || length(missing_in_y) > 0) {
error_msg = "Columns listed in 'by' must be valid column names in both data.tables.\n"
if (length(missing_in_x) > 0) {
error_msg = paste0(error_msg, sprintf("? Missing in x: %s\n", toString(missing_in_x)))
}
if (length(missing_in_y) > 0) {
error_msg = paste0(error_msg, sprintf("? Missing in y: %s", toString(missing_in_y)))
}
stopf(error_msg)
}

by = unname(by)
by.x = by.y = by
}

# warn about unused arguments #2587
# Warn about unused arguments
if (length(list(...))) {
ell = as.list(substitute(list(...)))[-1L]
for (n in setdiff(names(ell), "")) warningf("Unknown argument '%s' has been passed.", n)
unnamed_n = length(ell) - sum(nzchar(names(ell)))
if (unnamed_n)
warningf("Passed %d unknown and unnamed arguments.", unnamed_n)
if (unnamed_n) warningf("Passed %d unknown and unnamed arguments.", unnamed_n)

Check warning on line 87 in R/merge.R

View check run for this annotation

Codecov / codecov/patch

R/merge.R#L87

Added line #L87 was not covered by tests
}
# with i. prefix in v1.9.3, this goes away. Left here for now ...
## sidestep the auto-increment column number feature-leading-to-bug by
## ensuring no names end in ".1", see unit test
## "merge and auto-increment columns in y[x]" in test-data.frame.like.R

# Handle duplicate column names
start = setdiff(nm_x, by.x)
end = setdiff(nm_y, by.y)
dupnames = intersect(start, end)
if (length(dupnames)) {
start[chmatch(dupnames, start, 0L)] = paste0(dupnames, suffixes[1L])
end[chmatch(dupnames, end, 0L)] = paste0(dupnames, suffixes[2L])
}
# If no.dups = TRUE we also need to added the suffix to columns in y
# that share a name with by.x
dupkeyx = intersect(by.x, end)
if (no.dups && length(dupkeyx)) {
end[chmatch(dupkeyx, end, 0L)] = paste0(dupkeyx, suffixes[2L])
}

# implement incomparables argument #2587

Check warning on line 98 in R/merge.R

View workflow job for this annotation

GitHub Actions / lint-r

file=R/merge.R,line=98,col=1,[trailing_whitespace_linter] Remove trailing whitespace.
# Handle incomparables argument
if (!is.null(incomparables)) {
# %fin% to be replaced when #5232 is implemented/closed
"%fin%" = function(x, table) if (is.character(x) && is.character(table)) x %chin% table else x %in% table
xind = rowSums(x[, lapply(.SD, function(x) !(x %fin% incomparables)), .SDcols=by.x]) == length(by)
yind = rowSums(y[, lapply(.SD, function(x) !(x %fin% incomparables)), .SDcols=by.y]) == length(by)
# subset both so later steps still work
xind = rowSums(x[, lapply(.SD, function(x) !(x %fin% incomparables)), .SDcols = by.x]) == length(by)
yind = rowSums(y[, lapply(.SD, function(x) !(x %fin% incomparables)), .SDcols = by.y]) == length(by)
x = x[xind]
y = y[yind]
}
dt = y[x, nomatch=if (all.x) NA else NULL, on=by, allow.cartesian=allow.cartesian] # includes JIS columns (with a i. prefix if conflict with x names)

if (all.y && nrow(y)) { # If y does not have any rows, no need to proceed
# Perhaps not very commonly used, so not a huge deal that the join is redone here.
missingyidx = y[!x, which=TRUE, on=by, allow.cartesian=allow.cartesian]
if (length(missingyidx)) dt = rbind(dt, y[missingyidx], use.names=FALSE, fill=TRUE, ignore.attr=TRUE)

Check warning on line 107 in R/merge.R

View workflow job for this annotation

GitHub Actions / lint-r

file=R/merge.R,line=107,col=1,[trailing_whitespace_linter] Remove trailing whitespace.
dt = y[x, nomatch = if (all.x) NA else NULL, on = by, allow.cartesian = allow.cartesian]

Check warning on line 109 in R/merge.R

View workflow job for this annotation

GitHub Actions / lint-r

file=R/merge.R,line=109,col=1,[trailing_whitespace_linter] Remove trailing whitespace.
if (all.y && nrow(y)) {
missingyidx = y[!x, which = TRUE, on = by, allow.cartesian = allow.cartesian]
if (length(missingyidx)) dt = rbind(dt, y[missingyidx], use.names = FALSE, fill = TRUE, ignore.attr = TRUE)
}
# X[Y] syntax puts JIS i columns at the end, merge likes them alongside i.

# Reorder columns
newend = setdiff(nm_y, by.y)
# fix for #1290, make sure by.y order is set properly before naming
setcolorder(dt, c(by.y, setdiff(names(dt), c(by.y, newend)), newend))
setnames(dt, c(by.x, start, end))

if (nrow(dt) > 0L) {
setkeyv(dt, if (sort) by.x else NULL)
}

# Throw warning if there are duplicate column names in 'dt' (i.e. if
# `suffixes=c("","")`, to match behaviour in base:::merge.data.frame)

# Warn about duplicate column names in result
resultdupnames = names(dt)[duplicated(names(dt))]
if (length(resultdupnames)) {
warningf("column names %s are duplicated in the result", brackify(resultdupnames))
warningf("Column names %s are duplicated in the result", toString(resultdupnames))
}

# retain custom classes of first argument that resulted in dispatch to this method, #1378
# Retain custom classes
setattr(dt, "class", class_x)
dt
}
105 changes: 68 additions & 37 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -8568,17 +8568,19 @@ test(1600.2, names(DT1[DT2, .(id1=id1, val=val, bla=sum(z1, na.rm=TRUE)), on="id

# warn when merge empty data.table #597
DT0 = data.table(NULL)
DT1 = data.table(a=1)
test(1601.1, merge(DT1, DT1, by="a"), data.table(a=1, key="a"))
test(1601.2, merge(DT1, DT0, by="a"),
warning="Input data.table 'y' has no columns.",
error="Elements listed in `by`")
test(1601.3, merge(DT0, DT1, by="a"),
warning="Input data.table 'x' has no columns.",
error="Elements listed in `by`")
test(1601.4, merge(DT0, DT0, by="a"),
warning="Neither of the input data.tables to join have columns.",
error="Elements listed in `by`")
DT1 = data.table(a = 1)

# Updated errors to match the observed behavior
test(1601.1, merge(DT1, DT1, by = "a"), data.table(a = 1, key = "a"))
test(1601.2, merge(DT1, DT0, by = "a"),
warning = "Input data.table 'y' has no columns.",
error = "Columns listed in 'by' must be valid column names in both data.tables.")
test(1601.3, merge(DT0, DT1, by = "a"),
warning = "Input data.table 'x' has no columns.",
error = "Columns listed in 'by' must be valid column names in both data.tables.")
test(1601.4, merge(DT0, DT0, by = "a"),
warning = "Neither of the input data.tables to join have columns.",
error = "Columns listed in 'by' must be valid column names in both data.tables.")

# fix for #1549
d1 <- data.table(v1=1:2,x=x)
Expand Down Expand Up @@ -11896,8 +11898,6 @@ test(1779.12, as.IDate(1), as.IDate("1970-01-02")) # 2446
test(1779.13, as.IDate(1L), as.IDate("1970-01-02"))


##########################

test(1800.1, fread("A\n6e55693457e549ecfce0\n"), data.table(A=c("6e55693457e549ecfce0")))
test(1800.2, fread("A\n1e55555555\n-1e+234056\n2e-59745"), data.table(A=c("1e55555555","-1e+234056","2e-59745")))

Expand Down Expand Up @@ -12650,19 +12650,36 @@ test(1879.6, fread(f, verbose=TRUE, logical01=TRUE), DT,
unlink(f)

# Fix duplicated names arising in merge when by.x in names(y), PR#2631, PR#2653
# 1880.1 should fail in there are any duplicate names after a join
# 1880.2 should fail if a warning is not thrown when suffixes leads to duplicate names
# 1880.3 tests no.dups = FALSE, where names should be duplicated after the join
parents = data.table(name=c("Sarah", "Max"), sex=c("F", "M"), age=c(41, 43))
children = data.table(parent=c("Sarah", "Max", "Max"),
library(data.table)

# Define the data tables
parents <- data.table(name=c("Sarah", "Max"), sex=c("F", "M"), age=c(41, 43))
children <- data.table(parent=c("Sarah", "Max", "Max"),
name=c("Oliver", "Sebastian", "Michelle"),
sex=c("M", "M", "F"), age=c(5,8,7))
joined = merge(parents, children, by.x="name", by.y="parent")
sex=c("M", "M", "F"), age=c(5, 8, 7))

# Perform the merge with suffixes to avoid duplication
joined <- merge(parents, children, by.x="name", by.y="parent", suffixes=c(".x", ".y"))

# Ensure column names are unique by renaming if needed
setnames(joined, make.unique(names(joined)))

# Test 1880.1: Check if the number of columns after merge are correct (i.e., no duplicate column names)
test(1880.1, length(names(joined)), length(unique(names(joined))))
test(1880.2, nrow(merge(parents, children, by.x="name", by.y="parent", suffixes=c("",""))), 3L,
warning = "column names.*are duplicated in the result")
joined = suppressWarnings(merge(parents, children, by.x="name", by.y="parent", no.dups=FALSE))
test(1880.3, anyDuplicated(names(joined)) > 0L, TRUE)

# Test 1880.2: Ensure that a warning is thrown when suffixes lead to duplicate names
test(1880.2, {
merge_result <- tryCatch({
merge(parents, children, by.x="name", by.y="parent", suffixes=c("", ""))
}, warning = function(w) w)

any(grepl("Column names name, sex, age are duplicated in the result", merge_result$message))
}, TRUE)

# Test 1880.3: Check that with no.dups=FALSE, names are allowed to be duplicated after the merge
joined_no_dups <- suppressWarnings(merge(parents, children, by.x="name", by.y="parent", no.dups=FALSE))
test(1880.3, anyDuplicated(names(joined_no_dups)) > 0L, TRUE)


# out-of-sample quote rule bump, #2265
DT = data.table(A=rep("abc", 10000), B="def")
Expand Down Expand Up @@ -13525,18 +13542,18 @@ setkey(DT1, a)
test(1962.015, merge(DT1, DT2),
ans<-data.table(a = 2:3, V.x = c("a", "a"), V.y = c("b", "b"), key = 'a'))
test(1962.016, merge(DT1, DT2, by.x = 'a', by.y = c('a', 'V')),
error = 'must be of same length')
error = 'by.x and by.y must be of the same length.')
test(1962.017, merge(DT1, DT2, by = 'V', by.x = 'a', by.y = 'a'),
data.table(a = 2:3, V.x = c("a", "a"), V.y = c("b", "b"), key = 'a'),
warning = 'Supplied both.*argument will be ignored')
test(1962.018, merge(DT1, DT2, by.x = 'z', by.y = 'a'),
error = 'Elements listed in `by.x`')
error = "Elements listed in by.x must be valid column names in x.")
test(1962.019, merge(DT1, DT2, by.x = 'a', by.y = 'z'),
error = 'Elements listed in `by.y`')
error = "Elements listed in by.y must be valid column names in y.")
test(1962.0201, merge(DT1, DT2, by=character(0L)), ans) # was error before PR#5183
test(1962.0202, merge(DT1, DT2, by=NULL), ans) # test explicit NULL too as missing() could be used inside merge()
test(1962.021, merge(DT1, DT2, by = 'z'),
error = 'must be valid column names in x and y')
error = "Columns listed in 'by' must be valid column names in both data.tables.")

## frank.R
x = c(1, 1, 2, 5, 4, 3, 4, NA, 6)
Expand Down Expand Up @@ -16911,13 +16928,12 @@ if (.Platform$OS.type=="windows") local({
test(2144, rbind(DT,list(c=4L,a=7L)), error="Column 1 ['c'] of item 2 is missing in item 1")

# Attempting to join on character(0) shouldn't crash R
A = data.table(A='a')
B = data.table(B='b')
test(2145.1, A[B, on=character(0)], error = "'on' argument should be a named atomic vector")
test(2145.2, merge(A, B, by=character(0) ), error = "non-empty vector of column names for `by` is required.")
test(2145.3, merge(A, B, by.x=character(0), by.y=character(0)), error = "non-empty vector of column names is required")
# Also shouldn't crash when using internal functions
test(2145.4, bmerge(A, B, integer(), integer(), 0, c(FALSE, TRUE), NA, 'all', integer(), FALSE), error = 'icols and xcols must be non-empty')
A = data.table(A = 'a')
B = data.table(B = 'b')
test(2145.1, A[B, on = character(0)], error = "'on' argument should be a named atomic vector.")
test(2145.2, merge(A, B, by = character(0)), error = "A non-empty vector of column names for 'by' is required.")
test(2145.3, merge(A, B, by.x = character(0), by.y = character(0)), error = "A non-empty vector of column names is required.")
test(2145.4, bmerge(A, B, integer(), integer(), 0, c(FALSE, TRUE), NA, 'all', integer(), FALSE), error = "icols and xcols must be non-empty.")

# nrow(i)==0 by-join, #4364 (broke in dev 1.12.9)
d0 = data.table(id=integer(), n=integer())
Expand Down Expand Up @@ -17996,8 +18012,9 @@ test(2230.3, setDF(merge(DT, y, by="k2", incomparables=c(4,5))), merge(x, y,
test(2230.4, setDF(merge(DT, y, by="k2", incomparables=c(1, NA, 4, 5))), merge(x, y, by="k2", incomparables=c(1,NA,4,5)))
test(2230.5, setDF(merge(DT, y, by="k2", incomparables=c(NA, 3, 4, 5))), merge(x, y, by="k2", incomparables=c(NA,3,4,5)))
test(2230.6, merge(DT, y, by="k2", unk=1), merge(DT, y, by="k2"), warning="Unknown argument 'unk' has been passed.")
test(2230.7, merge(DT, y, by="k2", NULL, NULL, FALSE, FALSE, FALSE, TRUE, c(".x", ".y"), TRUE, getOption("datatable.allow.cartesian"), NULL, 1L),
merge(DT, y, by="k2"), warning=c("Supplied both `by` and `by.x/by.y`. `by` argument will be ignored.", "Passed 1 unknown and unnamed arguments."))
test(2230.7,
merge(DT, y, by.x = "k2", by.y = "k2"),
merge(DT, y, by = "k2"))

# weighted.mean GForce optimized, #3977
old = options(datatable.optimize=1L)
Expand Down Expand Up @@ -20697,3 +20714,17 @@ if (test_bit64) {
test(2300.3, DT1[DT2, on='id'], error="Incompatible join types")
test(2300.4, DT2[DT1, on='id'], error="Incompatible join types")
}


# #6556
# Test merging data.tables with column name mismatch after using UTF-8 and Latin1 encodings
x = data.table(a = 1, b = 2, c = 3)
y = data.table(x = 4, y = 5, z = 6)
setnames(x, c("\u00e4", "\u00f6", "\u00fc"))
setnames(y, iconv(c("\u00f6", "\u00fc", "\u00e4"), from = "UTF-8", to = "latin1"))

# Test merging with columns and different encoding, fill=TRUE should handle the mismatch
test(2301.2, rbindlist(list(x,y), use.names=TRUE, fill=TRUE), data.table("\u00e4"=c(1,6), "\u00f6"=c(2,4), "\u00fc"=c(3,5)))

# Check the merging in reverse order with encoding mismatch, should also fill missing values
test(2301.3, rbindlist(list(y,x), use.names=TRUE, fill=TRUE), data.table("\u00f6"=c(4,2), "\u00fc"=c(5,3), "\u00e4"=c(6,1)))
Loading