我不想说这个解决方案很漂亮,它不是。它能完成这项工作,但我不指望它能快速工作。它利用了艾伦·伯斯坦的
NGrams8k
,因此您还需要确保在服务器上有一个副本。
如果您不理解,请随意提问(除了NGRAMS8K的工作原理外,这是本文的目的):
CREATE TABLE T (string varchar(400));
INSERT INTO T
VALUES ('KRA!NTHI@#KUMAR, KU%^?MAR GO~()$U.BigGrin'),
('(SCARF.) / (WRAPS)');
GO
WITH Replacements AS (
SELECT *,
CASE WHEN NG.token NOT LIKE '[ /,~@#$%&*().!^?:]' THEN NG.token ELSE '-' END AS TokenR
FROM T
CROSS APPLY dbo.NGrams8k ('-'+T.String+'-',1) NG),
Repeating AS (
SELECT *,
CASE WHEN TokenR = '-' AND TokenR = LAG(TokenR) OVER (PARTITION BY string ORDER BY position) THEN NULL ELSE TokenR END AS TokenRR
FROM Replacements),
Cleaned AS (
SELECT string,
STUFF((SELECT sq.TokenRR + ''
FROM Repeating sq
WHERE sq.TokenRR IS NOT NULL
AND sq.string = R.string
ORDER BY position
FOR XML PATH('')),1,1,'') AS CleanString
FROM Repeating R
GROUP BY string)
SELECT string, LEFT(CleanString, LEN(CleanString)-1) AS CleanedString
FROM Cleaned;
GO
DROP TABLE T;
通过提供的示例数据,可以得到以下输出:
string | CleanedString
-------------------------------------------|---------------------------------
(SCARF.) / (WRAPS) | SCARF-WRAPS
KRA!NTHI@#KUMAR, KU%^?MAR GO~()$U.BigGrin | KRA-NTHI-KUMAR-KU-MAR-GO-U-BigGrin