I believe there’s a bug in the varchar() helper when used with SQL Server.
varchar <- function(x, type = "varchar") {
  # at least 255 characters, use max if more than 8000:
  max_length <- max(c(255, nchar(as.character(x))), na.rm = TRUE)

  if (max_length > 8000) {
    max_length <- "max"
  }

  paste0(type, "(", max_length, ")")
}
The SQL Server documentation for VARCHAR(n) defines n as the number of bytes, not characters.
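Currently, varchar() determines n using the nchar() default, type = "chars", which underestimates the required width for strings containing multibyte UTF-8 characters. This can truncate strings on insert whenever their byte length exceeds the declared width, which becomes possible once the data is longer than 255 bytes.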
Reprex
string <- paste0("\xe2\x80\x99", paste(rep("a", 255), collapse = ""))
nchar(string)
#> [1] 256
nchar(string, type = "bytes")
#> [1] 258
odbc:::varchar(string) # not 258
#> [1] "varchar(256)"
Created on 2025-10-29 with reprex v2.1.1
Expected
varchar() should use the byte length of the input (nchar(x, type = "bytes")) rather than character count.
That aligns with SQL Server’s definition and avoids truncation with UTF-8 input.
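Using byte length instead of character count should not adversely affect other database backends: a string's byte length is never smaller than its character count, so the declared width can only stay the same or grow.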
Proposed change
I think that this would be the cleanest fix, given nchar() supports returning the byte count.
# before
max_length <- max(c(255, nchar(as.character(x))), na.rm = TRUE)
# after
max_length <- max(c(255, nchar(as.character(x), type = "bytes")), na.rm = TRUE)
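To illustrate the effect, here is a minimal sketch of the helper with the one-line change applied, run against the reprex input. The name varchar_bytes is hypothetical and used only so the sketch can run outside the package; it is not part of odbc, and the actual patch may differ.

```r
# Hypothetical standalone copy of the helper with the proposed change applied.
# `varchar_bytes` is an illustrative name, not an odbc function.
varchar_bytes <- function(x, type = "varchar") {
  # at least 255 bytes, use max if more than 8000:
  max_length <- max(c(255, nchar(as.character(x), type = "bytes")), na.rm = TRUE)

  if (max_length > 8000) {
    max_length <- "max"
  }

  paste0(type, "(", max_length, ")")
}

# Same input as the reprex: one 3-byte character followed by 255 ASCII letters.
string <- paste0("\xe2\x80\x99", strrep("a", 255))
varchar_bytes(string)
#> [1] "varchar(258)"

# ASCII-only input has equal byte and character counts, so the 255 floor still applies.
varchar_bytes("abc")
#> [1] "varchar(255)"
```

Note that nchar(type = "bytes") counts bytes in whatever encoding the string currently has, so this sketch assumes the input is already UTF-8, as in the reprex.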
I plan to open a PR with this proposed change.
For reference, the helper quoted above is defined in odbc/R/aaa-odbc-data-type.R (lines 119 to 128 at commit c8a2e26).