
Persisting hashlib state

  •  4
  • anthony  · Tech Community  · 15 years ago

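    I want to create a hashlib instance, update() it, and then persist its state in some way. Later, I want to recreate the object from that state data and continue to update() it. Finally, I want to get the hexdigest() of the total cumulative run of data. The persisted state has to survive across multiple runs.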

    Example:

    import hashlib
    m = hashlib.sha1()
    m.update('one')
    m.update('two')
    # somehow, persist the state of m here
    
    #later, possibly in another process
    # recreate m from the persisted state
    m.update('three')
    m.update('four')
    print m.hexdigest()
    # at this point, m.hexdigest() should be equal to hashlib.sha1('onetwothreefour').hexdigest()
    

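    Edit:

    I did not find a good way to do this in Python back in 2010 and ended up writing a small helper app in C to accomplish it. However, there are some great answers below that were not available or known to me at the time.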

    5 answers  |  last active 7 years ago
        1
  •  2
  •   Devesh Saini    8 years ago

    You can do it this way with ctypes; no helper application in C is needed:

    rehash.py

    #! /usr/bin/env python
    
    ''' A resumable implementation of SHA-256 using ctypes with the OpenSSL crypto library
    
        Written by PM 2Ring 2014.11.13
    '''
    
    import os
    from ctypes import *
    
    SHA_LBLOCK = 16
    SHA256_DIGEST_LENGTH = 32
    
    class SHA256_CTX(Structure):
        # Mirrors OpenSSL's SHA256_CTX, whose SHA_LONG fields are unsigned int
        _fields_ = [
            ("h", c_uint * 8),
            ("Nl", c_uint),
            ("Nh", c_uint),
            ("data", c_uint * SHA_LBLOCK),
            ("num", c_uint),
            ("md_len", c_uint)
        ]
    
    HashBuffType = c_ubyte * SHA256_DIGEST_LENGTH
    
    #crypto = cdll.LoadLibrary("libcrypto.so")
    crypto = cdll.LoadLibrary("libeay32.dll" if os.name == "nt" else "libssl.so")
    
    class sha256(object):
        digest_size = SHA256_DIGEST_LENGTH
    
        def __init__(self, datastr=None):
            self.ctx = SHA256_CTX()
            crypto.SHA256_Init(byref(self.ctx))
            if datastr:
                self.update(datastr)
    
        def update(self, datastr):
            # SHA256_Update() takes its length argument as a size_t
            crypto.SHA256_Update(byref(self.ctx), datastr, c_size_t(len(datastr)))
    
        #Clone the current context
        def _copy_ctx(self):
            ctx = SHA256_CTX()
            pointer(ctx)[0] = self.ctx
            return ctx
    
        def copy(self):
            other = sha256()
            other.ctx = self._copy_ctx()
            return other
    
        def digest(self):
            #Preserve context in case we get called before hashing is
            # really finished, since SHA256_Final() clears the SHA256_CTX
            ctx = self._copy_ctx()
            hashbuff = HashBuffType()
            crypto.SHA256_Final(hashbuff, byref(self.ctx))
            self.ctx = ctx
            return str(bytearray(hashbuff))
    
        def hexdigest(self):
            return self.digest().encode('hex')
    
    #Tests
    def main():
        import cPickle
        import hashlib
    
        data = ("Nobody expects ", "the spammish ", "imposition!")
    
        print "rehash\n"
    
        shaA = sha256(''.join(data))
        print shaA.hexdigest()
        print repr(shaA.digest())
        print "digest size =", shaA.digest_size
        print
    
        shaB = sha256()
        shaB.update(data[0])
        print shaB.hexdigest()
    
        #Test pickling
        sha_pickle = cPickle.dumps(shaB, -1)
        print "Pickle length:", len(sha_pickle)
        shaC = cPickle.loads(sha_pickle)
    
        shaC.update(data[1])
        print shaC.hexdigest()
    
        #Test copying. Note that copy can be pickled
        shaD = shaC.copy()
    
        shaC.update(data[2])
        print shaC.hexdigest()
    
    
        #Verify against hashlib.sha256()
        print "\nhashlib\n"
    
        shaD = hashlib.sha256(''.join(data))
        print shaD.hexdigest()
        print repr(shaD.digest())
        print "digest size =", shaD.digest_size
        print
    
        shaE = hashlib.sha256(data[0])
        print shaE.hexdigest()
    
        shaE.update(data[1])
        print shaE.hexdigest()
    
        #Test copying. Note that hashlib copy can NOT be pickled
        shaF = shaE.copy()
        shaF.update(data[2])
        print shaF.hexdigest()
    
    
    if __name__ == '__main__':
        main()
    

    resumable_SHA-256.py

    #! /usr/bin/env python
    
    ''' Resumable SHA-256 hash for large files using the OpenSSL crypto library
    
        The hashing process may be interrupted by Control-C (SIGINT) or SIGTERM.
        When a signal is received, hashing continues until the end of the
        current chunk, then the current file position, total file size, and
        the sha object is saved to a file. The name of this file is formed by
        appending '.hash' to the name of the file being hashed.
    
        Just re-run the program to resume hashing. The '.hash' file will be deleted
        once hashing is completed.
    
        Written by PM 2Ring 2014.11.14
    '''
    
    import cPickle as pickle
    import os
    import signal
    import sys
    
    import rehash
    
    quit = False
    
    blocksize = 1<<16   # 64kB
    blocksperchunk = 1<<8
    
    chunksize = blocksize * blocksperchunk
    
    def handler(signum, frame):
        global quit
        print "\nGot signal %d, cleaning up." % signum
        quit = True
    
    
    def do_hash(fname, filesize):
        hashname = fname + '.hash'
        if os.path.exists(hashname):
            with open(hashname, 'rb') as f:
                pos, fsize, sha = pickle.load(f)
            if fsize != filesize:
                print "Error: file size of '%s' doesn't match size recorded in '%s'" % (fname, hashname)
                print "%d != %d. Aborting" % (fsize, filesize)
                exit(1)
        else:
            pos, fsize, sha = 0, filesize, rehash.sha256()
    
        finished = False
        with open(fname, 'rb') as f:
            f.seek(pos)
            while not (quit or finished):
                for _ in xrange(blocksperchunk):
                    block = f.read(blocksize)
                    if block == '':
                        finished = True
                        break
                    sha.update(block)
    
                # Use the real file position; the final chunk may be short
                pos = f.tell()
                sys.stderr.write(" %6.2f%% of %d\r" % (100.0 * pos / fsize, fsize))
                if finished or quit:
                    break
    
        if quit:
            with open(hashname, 'wb') as f:
                pickle.dump((pos, fsize, sha), f, -1)
        elif os.path.exists(hashname):
            os.remove(hashname)
    
        return (not quit), pos, sha.hexdigest()
    
    
    def main():
        if len(sys.argv) != 2:
            print "Resumable SHA-256 hash of a file."
            print "Usage:\npython %s filename\n" % sys.argv[0]
            exit(1)
    
        fname = sys.argv[1]
        filesize = os.path.getsize(fname)
    
        signal.signal(signal.SIGINT, handler)
        signal.signal(signal.SIGTERM, handler)
    
        finished, pos, hexdigest = do_hash(fname, filesize)
        if finished:
            print "%s  %s" % (hexdigest, fname)
        else:
            print "sha-256 hash of '%s' incomplete" % fname
            print "%s" % hexdigest
            print "%d / %d bytes processed." % (pos, filesize)
    
    
    if __name__ == '__main__':
        main()
    

    Demo

    import rehash
    import pickle
    sha=rehash.sha256("Hello ")
    s=pickle.dumps(sha.ctx)
    sha=rehash.sha256()
    sha.ctx=pickle.loads(s)
    sha.update("World")
    print sha.hexdigest()
    

    Output

    a591a6d40bf420404a011733cfb7b190d62c65bf0bcda32b57b277d9ad9f146e
    

    Note: I would like to thank PM 2Ring for the wonderful code.

        2
  •  1
  •   John La Rooy    15 years ago

    hashlib.sha1 is a wrapper around a C library, so you won't be able to pickle it.
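
    A quick way to check this for yourself. This is a minimal sketch in Python 3 (the question's code is Python 2), and is_picklable is a helper name made up for this illustration:

```python
import hashlib
import pickle


def is_picklable(obj):
    """Return True if pickle can serialise obj, False if it refuses."""
    try:
        pickle.dumps(obj)
    except (TypeError, pickle.PicklingError):
        return False
    return True


m = hashlib.sha1(b"one")
print(is_picklable(m))         # the C-backed hash object
print(is_picklable(m.copy()))  # copy() returns another C-backed object
print(is_picklable(b"one"))    # plain bytes, for comparison
```

    On CPython the first two calls are expected to report False, because the hash state lives inside the C object and is not exposed to pickle.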

    It would need to implement the __getstate__ and __setstate__ methods for Python to be able to access its internal state.

    You could use a pure Python implementation of sha1 if it is fast enough for your requirements.
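
    To illustrate the pure-Python route, here is a minimal sketch in Python 3. It is neither vetted nor fast. The class name PureSha1, its get_state() method and its state argument are all made up for this example. The point is only that the whole SHA-1 state is three plain values (five chaining words, the unprocessed tail bytes and the total length), which pickle handles without trouble.

```python
import hashlib
import pickle
import struct

_INITIAL = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)


def _rotl(x, n):
    """Rotate a 32-bit integer left by n bits."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


class PureSha1:
    """Illustrative pure-Python SHA-1 whose state is plain, picklable data."""

    def __init__(self, data=b"", state=None):
        # state is (chaining words, unprocessed tail bytes, total byte count)
        h, self._buf, self._length = state or (_INITIAL, b"", 0)
        self._h = list(h)
        self.update(data)

    def get_state(self):
        return (tuple(self._h), self._buf, self._length)

    def _compress(self, block):
        w = list(struct.unpack(">16I", block))
        for i in range(16, 80):
            w.append(_rotl(w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16], 1))
        a, b, c, d, e = self._h
        for i in range(80):
            if i < 20:
                f, k = (b & c) | (~b & d), 0x5A827999
            elif i < 40:
                f, k = b ^ c ^ d, 0x6ED9EBA1
            elif i < 60:
                f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
            else:
                f, k = b ^ c ^ d, 0xCA62C1D6
            temp = (_rotl(a, 5) + f + e + k + w[i]) & 0xFFFFFFFF
            a, b, c, d, e = temp, a, _rotl(b, 30), c, d
        self._h = [(x + y) & 0xFFFFFFFF
                   for x, y in zip(self._h, (a, b, c, d, e))]

    def update(self, data):
        self._length += len(data)
        buf = self._buf + data
        full = len(buf) - len(buf) % 64
        for i in range(0, full, 64):
            self._compress(buf[i:i + 64])
        self._buf = buf[full:]   # keep the incomplete block for later

    def digest(self):
        # Pad and finalise a clone, so this object can keep receiving data.
        clone = PureSha1(state=self.get_state())
        padding = b"\x80" + b"\x00" * ((55 - self._length) % 64)
        clone.update(padding + struct.pack(">Q", self._length * 8))
        return struct.pack(">5I", *clone._h)

    def hexdigest(self):
        return self.digest().hex()


m = PureSha1()
m.update(b"one")
m.update(b"two")
saved = pickle.dumps(m.get_state())   # a plain tuple; could go to a file

# Later, possibly in another process:
m2 = PureSha1(state=pickle.loads(saved))
m2.update(b"three")
m2.update(b"four")
resumed = m2.hexdigest()
expected = hashlib.sha1(b"onetwothreefour").hexdigest()
print(resumed == expected)
```

    Only the state tuple is pickled here rather than the object itself, so the saved data does not depend on where the class is defined.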

        3
  •  1
  •   weaver    7 years ago

    I faced this problem too and found no existing solution, so I ended up writing a library that does something very similar to what Devesh Saini described: https://github.com/kislyuk/rehash . Example:

    import pickle, rehash
    hasher = rehash.sha256(b"foo")
    state = pickle.dumps(hasher)
    
    hasher2 = pickle.loads(state)
    hasher2.update(b"bar")
    
    assert hasher2.hexdigest() == rehash.sha256(b"foobar").hexdigest()
    
        5
  •  -1
  •   jsbueno    15 years ago

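    You can easily build a wrapper object around the hash object that transparently persists the data.

    The obvious drawback is that it has to retain all of the hashed data in order to restore the state, so depending on the size of the data you are dealing with, this may not suit your needs. But it should work fine up to a few tens of megabytes.

    Unfortunately, hashlib does not expose the hash algorithms as proper classes. Instead it provides factory functions that construct the hash objects, so we can't properly subclass them without loading reserved symbols, a situation I would rather avoid. That only means you have to build your wrapper class from scratch, which is not much of an overhead in Python anyway.

    Here is some sample code that might even fulfill your needs: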

    import hashlib
    from cStringIO import StringIO
    
    class PersistentSha1(object):
        def __init__(self, salt=""):
            self.__setstate__(salt)
    
        def update(self, data):
            self.__data.write(data)
            self.hash.update(data)
    
        def __getattr__(self, attr):
            return getattr(self.hash, attr)
    
        def __setstate__(self, salt=""):
            self.__data = StringIO()
            self.__data.write(salt)
            self.hash = hashlib.sha1(salt)
    
        def __getstate__(self):
            return self.data
    
        def _get_data(self):
            self.__data.seek(0)
            return self.__data.read()
    
        data = property(_get_data, __setstate__)
    

    You can access the "data" member itself to get and set the state directly, or you can use the Python pickle functions:

    >>> a = PersistentSha1()
    >>> a
    <__main__.PersistentSha1 object at 0xb7d10f0c>
    >>> a.update("lixo")
    >>> a.data
    'lixo'
    >>> a.hexdigest()
    '6d6332a54574aeb35dcde5cf6a8774f938a65bec'
    >>> import pickle
    >>> b = pickle.dumps(a)
    >>>
    >>> c = pickle.loads(b)
    >>> c.hexdigest()
    '6d6332a54574aeb35dcde5cf6a8774f938a65bec'
    
    >>> c.data
    'lixo'
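
    For Python 3, where cStringIO no longer exists, the same replay idea might look like the sketch below. ReplaySha1 is a hypothetical name. The state is returned as a non-empty dict because pickle skips __setstate__ when __getstate__ returns a false value such as an empty string.

```python
import hashlib
import io
import pickle


class ReplaySha1:
    """SHA-1 wrapper that keeps every byte it was fed, so it can be rebuilt."""

    def __init__(self, data=b""):
        self.__setstate__({"data": data})

    def update(self, data):
        self._data.write(data)
        self._hash.update(data)

    def hexdigest(self):
        return self._hash.hexdigest()

    def __getstate__(self):
        # The persisted state is simply all input seen so far.
        return {"data": self._data.getvalue()}

    def __setstate__(self, state):
        # Rebuild the hashlib object by replaying the saved input.
        self._data = io.BytesIO()
        self._data.write(state["data"])
        self._hash = hashlib.sha1(state["data"])


if __name__ == "__main__":
    a = ReplaySha1()
    a.update(b"one")
    a.update(b"two")
    saved = pickle.dumps(a)   # could be written to a file here

    b = pickle.loads(saved)   # later, possibly in another process
    b.update(b"three")
    b.update(b"four")
    print(b.hexdigest() == hashlib.sha1(b"onetwothreefour").hexdigest())
```

    The memory drawback described above still applies: the pickled state grows with the total amount of data hashed.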