目标
我想将一行的四列合并为一个值,命名为
COMBINED_VALUE
.
这些严格的规则适用:
-
按相同顺序具有相同列值的两行必须产生相同的结果
组合值
-
没有两行可以产生相同的结果
组合值
-
列的顺序有区别,尾随空格有区别,不可见字符有区别,任何区别都有区别
-
所有列都可以为空,其中一列定义为
INT
,三列定义为
NVARCHAR(100)
笔记
我尝试的第一件事是
BINARY_CHECKSUM
。它失败了,导致不同行的值相同:
SELECT
a, b,
binary_checksum_failing = BINARY_CHECKSUM(a,b)
FROM
(VALUES ('i01', CONVERT(DECIMAL(9,2), 100)),
('i01', CONVERT(DECIMAL(9,2), 10)),
('i01', CONVERT(DECIMAL(9,2), 1))) v(a, b)
HASHBYTES
使用“SHA2_256”效果更好,但它需要一个准备充分的
组合值
作为输入。如果确保了这一点,它就有机会产生“非常”独特的哈希值。我发现连接列应该是可行的,但需要替换NULL并分隔列。
在下面找到两段代码,显示我当前的进度和进一步的注释。第一部分显示了基于转换的解决方案
NVARCHAR
,第二部分使用
VARBINARY
示例行显示了边缘情况,并试图引发失败。
问题
-
你能让呈现的脚本失败吗?你能创建两个不同的行,产生相同的结果吗
组合值
通过验证时?
-
有更好的办法吗
组合值
? 有更快的方法吗?一般来说,最佳实践是什么?
脚本#1
:
DROP TABLE IF EXISTS #t
CREATE TABLE #t
(
id INT IDENTITY(1,1),
DESCRIPTION NVARCHAR(100),
number INT,
word1 NVARCHAR(100),
word2 NVARCHAR(100),
word3 NVARCHAR(100)
)
INSERT INTO #t(DESCRIPTION, number, word1, word2, word3)
VALUES
('base row' , 1, 'aa', 'aa', 'aa')
,('trailing space in different places', 1, 'aa', 'aa', 'aa ')
,('trailing space in different places', 1, 'aa', 'aa ', 'aa')
,('NULLs in different places' , 1, NULL, 'aa', 'aa')
,('NULLs in different places' , 1, 'aa', NULL, 'aa')
,('different rows, screwery with NULL', 1, 'aa', 'aa', NULL)
,('different rows, screwery with NULL', 1, 'aa', 'aa', '*NULL*')
,('different rows, screwery with sep.', 1, 'aa' , 'aa/SEP\aa', 'aa')
,('different rows, screwery with sep.', 1, 'aa/SEP\aa', 'aa', 'aa')
,('longer values' , 1111111111, 'abcdefghijklmn', 'abcdefghijklmn', 'abcdefghijklmn')
DECLARE @NULL_replacement NVARCHAR(10) = '*NULL*'
DECLARE @separator NVARCHAR(10) = '/SEP\'
;WITH cte AS
(
SELECT
id, DESCRIPTION, number,
word1, word2, word3,
COMBINED_VALUE,
valid_f
FROM
#t
OUTER APPLY
(SELECT valid_f = IIF(COUNT(*) = 0, 1, 0)
FROM (VALUES (word1), (word2), (word3)) v(v)
JOIN (VALUES (@NULL_replacement), (@separator)) w (w)
ON v Like '%'+w+'%' COLLATE Czech_BIN2 )validator
OUTER APPLY
(SELECT COMBINED_VALUE = CONCAT_WS(
@separator
,COALESCE(CONVERT(NVARCHAR(100), number), @NULL_replacement)
,COALESCE(CONVERT(NVARCHAR(100), word1), @NULL_replacement)
,COALESCE(CONVERT(NVARCHAR(100), word2), @NULL_replacement)
,COALESCE(CONVERT(NVARCHAR(100), word3), @NULL_replacement)) )combiner
)
SELECT
id, DESCRIPTION, number,
word1_encapsulated = '>' + word1 + '<',
word2_encapsulated = '>' + word2 + '<',
word3_encapsulated = '>' + word3 + '<',
COMBINED_VALUE_encapsulated = '>' + COMBINED_VALUE + '<',
valid_f, repeats,
COMBINED_VALUE_dlen = DATALENGTH(COMBINED_VALUE)
FROM
cte o
OUTER APPLY
(SELECT repeats = COUNT(*)
FROM cte i
WHERE o.COMBINED_VALUE LIKE i.COMBINED_VALUE COLLATE Czech_BIN2 ) repeatfinder
ORDER BY
id
我认为上面的例子证明了我需要替换NULL并使用分隔符。如果这些列被连接起来,请参见第4行和第5行
CONCAT
,你会得到相同的输出。请参阅该列
DESCRIPTION
了解更多详情。
NULL替换和分隔符
-
防止不同的行形成相同的行
组合值
-
必须禁止作为输入的一部分。这必须经过验证,输入必须经过消毒
-
不太可能成为输入的一部分,应该是“罕见的”
脚本#2
在上述操作之后,我提出了一个基本相似的查询,它使用
VARBINARY
使输出更小,特别是对于数字数据。空替换和分隔符“非常罕见”。它们的ASCII码在ASCII表中标记为“未使用”。
DECLARE
@separator_binary VARBINARY(5) = CONVERT(VARBINARY(5), CHAR(129)+CHAR(141)+CHAR(143)+CHAR(144)+CHAR(157)),
@NULL_replacement_binary VARBINARY(5) = CONVERT(VARBINARY(5), CHAR(157)+CHAR(144)+CHAR(143)+CHAR(141)+CHAR(129))
DECLARE
@separator_varchar VARCHAR(10) = CONVERT(VARCHAR(10), @separator_binary, 2),
@NULL_replacement_varchar VARCHAR(10) = CONVERT(VARCHAR(10), @NULL_replacement_binary, 2)
;WITH cte AS
(
SELECT
id, DESCRIPTION, number,
word1, word2, word3,
COMBINED_VALUE,
valid_f, word1_binary
FROM
#t
OUTER APPLY
(SELECT
number_binary = CONVERT(VARBINARY(8000), number),
word1_binary = CONVERT(VARBINARY(8000), word1),
word2_binary = CONVERT(VARBINARY(8000), word2),
word3_binary = CONVERT(VARBINARY(8000), word3)) binaries
OUTER APPLY
(SELECT valid_f = IIF(COUNT(*) = 0, 1, 0)
FROM (VALUES (CONVERT(VARCHAR(8000), number_binary, 2)),
(CONVERT(VARCHAR(8000), word1_binary , 2)),
(CONVERT(VARCHAR(8000), word2_binary , 2)),
(CONVERT(VARCHAR(8000), word3_binary , 2))) v(v)
JOIN (VALUES (@separator_varchar ),
(@NULL_replacement_varchar)) w(w) ON CHARINDEX(w, v COLLATE Latin1_General_BIN2) > 0) validator
OUTER APPLY
(SELECT
COMBINED_VALUE = COALESCE(number_binary, @NULL_replacement_binary)
+ @separator_binary + COALESCE(word1_binary , @NULL_replacement_binary)
+ @separator_binary + COALESCE(word2_binary , @NULL_replacement_binary)
+ @separator_binary + COALESCE(word3_binary , @NULL_replacement_binary)) combiner
)
SELECT
id, DESCRIPTION, number,
word1_encapsulated = '>' + word1 + '<',
word2_encapsulated = '>' + word2 + '<',
word3_encapsulated = '>' + word3 + '<',
COMBINED_VALUE, valid_f, repeats,
COMBINED_VALUE_dlen = DATALENGTH(COMBINED_VALUE),
word1_varchar = CONVERT(VARCHAR(8000), word1_binary, 2),
separator_varchar = @separator_varchar
FROM
cte o
OUTER APPLY
(SELECT repeats = COUNT(*)
FROM cte i
WHERE o.COMBINED_VALUE = i.COMBINED_VALUE) repeatfinder
ORDER BY
id