代码之家  ›  专栏  ›  技术社区  ›  Mike

是否可以以查找和插入'o(log(n))'的方式使用Perl哈希?

  •  3
  • Mike  · 技术社区  · 15 年前

    是否可以使用Perl哈希 O(log(n)) 查找和插入?

    默认情况下,我假设查找是 O(n) 因为它是由一个未排序的列表表示的。

    我知道我可以创建一个数据结构来满足这个要求(例如,树等),但是,如果它是内置的,并且可以用作普通散列(例如,with%)。

    3 回复  |  直到 14 年前
        1
  •  17
  •   Chas. Owens    15 年前

    Perl5中的关联数组是用 hash tables 已摊销o(1)(即固定时间)插入和查找。这就是为什么我们倾向于称它们为散列而不是关联数组。

    很难找到说明Perl5使用哈希表实现关联数组的文档(除了我们称关联数组为“hashes”),但至少在 perldoc perlfaq4

    What happens if I add or remove keys from a hash while iterating over it?
       (contributed by brian d foy)
    
       The easy answer is "Don't do that!"
    
       If you iterate through the hash with each(), you can delete the key
       most recently returned without worrying about it.  If you delete or add
       other keys, the iterator may skip or double up on them since perl may
       rearrange the hash table.  See the entry for "each()" in perlfunc.
    

    更好的引述 perldoc perldata :

       If you evaluate a hash in scalar context, it returns false if the hash
       is empty.  If there are any key/value pairs, it returns true; more
       precisely, the value returned is a string consisting of the number of
       used buckets and the number of allocated buckets, separated by a slash.
       This is pretty much useful only to find out whether Perl's internal
       hashing algorithm is performing poorly on your data set.  For example,
       you stick 10,000 things in a hash, but evaluating %HASH in scalar
       context reveals "1/16", which means only one out of sixteen buckets has
       been touched, and presumably contains all 10,000 of your items.  This
       isn't supposed to happen.  If a tied hash is evaluated in scalar
       context, a fatal error will result, since this bucket usage information
       is currently not available for tied hashes.
    

    当然,O(1)只是理论性能。在现实世界中,我们没有完美的散列函数,因此散列函数在变大时会变慢,并且有一些退化的情况会将散列转换为O(N),但是Perl尽其所能防止这种情况发生。下面是一个Perl哈希的基准,其中有10、100、1000、10000、100000个键:

    Perl version 5.012000
                   Rate 10^5 keys 10^4 keys 10^3 keys 10^2 keys 10^1 keys
    10^5 keys 5688029/s        --       -1%       -4%       -7%      -12%
    10^4 keys 5748771/s        1%        --       -3%       -6%      -11%
    10^3 keys 5899429/s        4%        3%        --       -4%       -9%
    10^2 keys 6116692/s        8%        6%        4%        --       -6%
    10^1 keys 6487133/s       14%       13%       10%        6%        --
    

    以下是基准代码:

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    use Benchmark;
    
    print "Perl version $]\n";
    
    my %subs;
    for my $n (1 .. 5) {
        my $m = 10 ** $n;
        keys(my %h) = $m; #preallocated the hash so it doesn't have to keep growing
        my $k = "a";
        %h = ( map { $k++ => 1 } 1 .. $m );
        $subs{"10^$n keys"} = sub {
            return @h{"a", $k};
        }
    };
    
    Benchmark::cmpthese -1, \%subs;
    
        2
  •  7
  •   tzaman    15 年前

    Perl哈希是 哈希表 ,所以它已经有了O(1)插入和查找。

        3
  •  2
  •   Community CDub    8 年前

    任何认为哈希插入或查找时间为0(1)的人 现代硬件 非常天真。测量相同值的GET很简单 wrong . 下面的结果将使您更好地了解正在发生的事情。

    Perl version 5.010001
                Rate 10^6 keys 10^5 keys 10^1 keys 10^4 keys 10^3 keys 10^2 keys
    10^6 keys 1.10/s        --      -36%      -64%      -67%      -68%      -69%
    10^5 keys 1.73/s       57%        --      -43%      -49%      -50%      -52%
    10^1 keys 3.06/s      177%       76%        --      -10%      -12%      -15%
    10^4 keys 3.40/s      207%       96%       11%        --       -3%       -5%
    10^3 keys 3.49/s      216%      101%       14%        3%        --       -3%
    10^2 keys 3.58/s      224%      107%       17%        6%        3%        --
    

    以上结果在具有5MB CPU缓存的系统上进行测量。请注意,性能从3.5 m/s显著下降到1 m/s查找。不管怎样,它仍然非常快,在某些情况下,如果你知道你在做什么,你甚至可以击败像RDBMS这样的系统。您可以使用以下代码测量系统:

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    use Benchmark;
    
    print "Perl version $]\n";
    
    my %subs;
    for my $n ( 1 .. 6 ) {
        my $m = 10**$n;
        keys( my %h ) = $m;    #preallocated the hash so it doesn't have to keep growing
        my $k = "a";
        %h = ( map { $k++ => 1 } 1 .. $m );
        my $l = 10**( 6 - $n );
        my $a;
        $subs{"10^$n keys"} = sub {
            for ( 1 .. $l ) {
                $a = $h{$_} for keys %h;
            }
        };
    }
    
    Benchmark::cmpthese -3, \%subs;
    

    您也不应该忘记哈希查找时间取决于密钥长度。简单地说,没有真正的O(1)访问时间的技术。每个已知的真实技术都有最佳的O(logn)访问时间。只有系统具有O(1)访问时间,因为限制了它们的最大N值,并且降低了它的低N值的性能。这就是现实世界中事情是如何工作的,这也是为什么有人会制造像这样的算法的原因。 Judy Array 进化也越来越糟。