代码之家 › 专栏 › 技术社区 › fabmilo

使用位运算符的最大16个字符字符串的Strlen

string algorithm c c++

fabmilo · 技术社区 · 16 年前

面临的挑战是找到最快的方法来确定C/C++中使用C.中的逐位运算的C字符串的长度。

char thestring[16];

c字符串的最大大小为16个字符,并且位于缓冲区内如果字符串等于16个字符,则结尾没有空字节。

我肯定可以做到,但还没做好。

我现在正在处理这个问题,但是假设字符串是在 零填充

len =   buff[0] != 0x0 +
            buff[1] != 0x0 +
            buff[2] != 0x0 +
            buff[3] != 0x0 +
            buff[4] != 0x0 +
            buff[5] != 0x0 +
            buff[6] != 0x0 +
            buff[7] != 0x0 +
            buff[8] != 0x0 +
            buff[9] != 0x0 +
            buff[10] != 0x0 +
            buff[11] != 0x0 +
            buff[12] != 0x0 +
            buff[13] != 0x0 +
            buff[14] != 0x0 +
            buff[15] != 0x0;

注意缓冲区是 零填充

10 回复 | 直到 16 年前

N 1.1 16 年前

这样就可以了 buf 初始化为零。你的解决方案 !=

len = !!buf[0] +
      !!buf[1] +
      //...
      !!buf[15]

更新 :上面的代码和OP的代码生成 相同的汇编代码 由GCC编译时 -O3 旗帜(不同(如果没有提供优化标志)

sbi 16 年前

你的代码不能正常工作。例如,考虑一个包含如下内容的缓冲区:

"\0123456789abcde";

根据您的代码,它的长度是15,但实际上它的长度是0,因为它的首字母是“\0”。

MSN 16 年前

我在《黑客之乐》中读到一个叫做SWAR(SIMD-within-a-register)的小技巧,假设每个字符8位:

#define CHAR_BITS 8
uint_fast_16_t all_character_bits[CHAR_BITS]= { 0 };

for (int bit_index= 0; bit_index<CHAR_BITS; ++bit_index)
{
    for (int character_index= 0; character_index<16; ++character_index)
    {
        all_character_bits[bit_index]|= ((buff[character_index] >> bit_index) & 1) << character_index;
    }
}

uint_fast_32_t zero_byte_character_mask= ~0;

for (int bit_index= 0; bit_index<CHAR_BITS; ++bit_index)
{
    zero_byte_character_mask&= (0xffff0000 | ~all_character_bits[bit_index]);
}

uint_fast_8_t first_null_byte= first_bit_set(zero_byte_character_mask);

其中first_bit_set是在整数中查找第一个位集的任意数量的流行且快速的实现。

AND

el.pescado - Ð½ÐµÑ Ð²Ð¾Ð¹Ð½Ðµ 16 年前

位运算。。。可能是这样的:

// TODO: optimize for 64-bit architectures
uint32_t *a = (uint32_t*)thestring;

for (int i = 0; i < 4; i++) // will be unwound
    for (int j = 0; j < 4; j++)
        if (a[i] & 0xff << j == 0)
           return 4*i+j;
return 16;

Sparky 16 年前

请参考Paul Hsieh在。。。

http://www.azillionmonkeys.com/qed/asmexample.html

虽然不完全是你要找的,但只要稍加调整,它就可以满足你的需要。

kennytm 16 年前

你可以从

template <typename T>
bool containsANull(T n) {
   return (n  - ((T) -1)/255) & ((T) -1)/255*128) & ~n;
}

它是如何工作的?

(T) -1/255位模式0x01010101是否根据需要重复

if n is                        0x0123456789ABCDEF
n - 0x1111..1 is               0xF0123456789ABCDE
(n-0x1111...1) & 0x8888...8 is 0x8000000008888888
~n is                          0xFEDCBA9876543210 
so the result is               0x8000000000000000

Seth 16 年前

你可以随心所欲地玩,但你可能无法战胜这一点:

int fast1(const char *s)
{ 
    if (!*s++) return 0; 
    if (!*s++) return 1; 
    if (!*s++) return 2; 
    if (!*s++) return 3; 
    if (!*s++) return 4; 
    if (!*s++) return 5; 
    if (!*s++) return 6; 
    if (!*s++) return 7; 
    if (!*s++) return 8; 
    if (!*s++) return 9; 
    if (!*s++) return 10; 
    if (!*s++) return 11; 
    if (!*s++) return 12; 
    if (!*s++) return 13; 
    if (!*s++) return 14; 
    if (!*s++) return 15; 
}

(这是否更快取决于处理器和编译器)。

int fast2(const char *s)
{ 
    if (!s[0]) return 0; 
    if (!s[1]) return 1; 
    if (!s[2]) return 2; 
    if (!s[3]) return 3; 
    if (!s[4]) return 4; 
    if (!s[5]) return 5; 
    if (!s[6]) return 6; 
    if (!s[7]) return 7; 
    if (!s[8]) return 8; 
    if (!s[9]) return 9; 
    if (!s[10]) return 10; 
    if (!s[11]) return 11; 
    if (!s[12]) return 12; 
    if (!s[13]) return 13; 
    if (!s[14]) return 14; 
    if (!s[15]) return 15; 
}

我在core2duot7200@2.0ghz、windowsxppro和visualstudio2008上分析了这两个函数,并关闭了优化(打开优化器会导致VS注意到在我的计时循环中没有输出,所以它会完全删除它)。

我在循环2中调用了每个函数 ²² 次,然后取平均值超过8次。

快速1 每次函数调用大约需要87.20 ns。

快速2

所以在我的CPU上,数组索引版本的速度几乎是指针版本的两倍。

更新2

这个函数也相当快,每次调用大约60纳秒。我猜指针解引用是由地址单元和整数单元的乘法执行的,所以操作是流水线的。在我的其他例子中,所有的工作都是由地址单元完成的。

int fast5(const char *s)
{
    return  /* 0 * (s[0] == 0) + don't need to test 1st byte */
            1 * (s[1] == 0)  +
            2 * (s[2] == 0)  +
            3 * (s[3] == 0)  +
            4 * (s[4] == 0)  +
            5 * (s[5] == 0)  +
            6 * (s[6] == 0)  +
            7 * (s[7] == 0)  +
            8 * (s[8] == 0)  +
            9 * (s[9] == 0)  +
            10 * (s[10] == 0) +
            11 * (s[11] == 0) +
            12 * (s[12] == 0) +
            13 * (s[13] == 0) +
            14 * (s[14] == 0) +
            15 * (s[15] == 0);
}

nategoose 16 年前

我很确定你发布的代码看起来很流畅,但在为许多处理器编译时实际上并不是那么好,尽管它可能在你的处理器上。据我所知,大多数处理器实际上并没有一种简单的方法从比较中得到1,因此很可能最终是一个条件跳转或以下形式的条件操作:

set R1, 0
test R2+0, 0
cinc R1                   ; conditional increment
test R2+1, 0
cinc R1
...

如果编译器做得很出色,在许多处理器上,这最终会是这样的:

set R1, 0
test R2+0, 0
jz end  ; jump if zero
inc R1
test R2+1, 0
jz end
inc R1
...

既然你说你的目标是一个GPU和那些往往是非常数学友好,你也许可以做到:

int acc = 0;
acc += str[0]/str[0];
acc += str[1]/str[1];
...

如果你能在不花费太多成本的情况下将陷阱除以零,并且处理陷阱带来的混乱。不过,那最终可能会很贵。

如果您的机器的寄存器可能包含字符串的一个以上八位字节,那么您可以尝试执行有限次数的跳转,一次测试0个以上的字节,然后在字节级别检查最后一个非零字。

你应该退房 Bit Twiddling Hacks 一个很酷的方法来加速strlen,它适用于大寄存器大小。

kennytm 16 年前

在假设的C++语言中,假设2的补码和小endian,

int128_t v = *reinterpret_cast<int128_t*>(thestring);
const int bit_count = 128;
int eight = ((1 << 64) - 1 - v) >> (bit_count - 4) & 8;
v >>>= 8 * eight;
int four  = ((1 << 32) - 1 - v) >> (bit_count - 3) & 4;
v >>>= 8 * four;
int two   = ((1 << 16) - 1 - v) >> (bit_count - 2) & 2;
v >>>= 8 * two;
int one   = ((1 <<  8) - 1 - v) >> (bit_count - 1) & 1;
return (one | two | four | eight) + !!v;

(修改自 http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog

drawnonward 16 年前

假设64位长和小端系统:

long a = ((long *)string)[0];
long b = ((long *)string)[1];

a = (a - 0x0101010101010101UL) & ~a & 0x8080808080808080UL;
b = (b - 0x0101010101010101UL) & ~b & 0x8080808080808080UL;

return a ? count_trailing_zeros( a ) / 8 : b ? 8 + count_trailing_zeros( b ) / 8 : 16;