How can I detect if a file is binary (non-text) in Python?

binary file python

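grieve · Tech Community · 16 years ago

How can I tell if a file is binary (non-text) in Python? I am searching through a large set of files in Python and keep getting matches in binary files. This makes the output look incredibly messy.

I know I could use grep -I, but I am doing more with the data than grep allows for.

In the past, I would have just searched for characters greater than 0x7f, but UTF-8 and the like make that impossible on modern systems. Ideally the solution would be fast, but any solution will do.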

18 replies | latest 7 years ago

knia KamilCuk 7 years ago

You can also use the mimetypes module:

import mimetypes
...
mime = mimetypes.guess_type(file)

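Compiling a list of binary MIME types is fairly easy. For example, Apache is distributed with a mime.types file that you could parse into a pair of lists (binary and text), and then check whether the guessed MIME type is in your text list or your binary list.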
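
As a rough sketch of that idea: the helper name `looks_like_text_by_name` and the small `TEXT_LIKE_TYPES` allow-list below are made up for illustration and are not part of `mimetypes`. Keep in mind that `guess_type` only looks at the file name, never at the contents, and returns `(None, None)` for extensions it does not know.

```python
import mimetypes

# Hypothetical allow-list: MIME types outside "text/" that are still plain text.
TEXT_LIKE_TYPES = {"application/json", "application/xml", "application/javascript"}

def looks_like_text_by_name(path):
    """Return True/False based on the guessed MIME type, or None if unknown."""
    mime, encoding = mimetypes.guess_type(path)
    if encoding is not None:
        # e.g. "notes.txt.gz": the payload is compressed, so treat it as binary
        return False
    if mime is None:
        # Unknown extension; the caller has to fall back to a content-based test
        return None
    return mime.startswith("text/") or mime in TEXT_LIKE_TYPES
```

Returning None for unknown names keeps "don't know" separate from "binary", so a content-based check can take over in that case.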

jfs 10 years ago

Yet another method, based on file(1) behavior:

>>> textchars = bytearray({7,8,9,10,12,13,27} | set(range(0x20, 0x100)) - {0x7f})
>>> is_binary_string = lambda bytes: bool(bytes.translate(None, textchars))

Example:

>>> is_binary_string(open('/usr/bin/python', 'rb').read(1024))
True
>>> is_binary_string(open('/usr/bin/dh_python3', 'rb').read(1024))
False

Jorge Orpinel 15 years ago

Try this:

def is_binary(filename):
    """Return true if the given filename is binary.
    @raise EnvironmentError: if the file does not exist or cannot be accessed.
    @attention: found @ http://bytes.com/topic/python/answers/21222-determine-file-type-binary-text on 6/08/2010
    @author: Trent Mick <TrentM@ActiveState.com>
    @author: Jorge Orpinel <jorge@orpinel.com>"""
    fin = open(filename, 'rb')
    try:
        CHUNKSIZE = 1024
        while 1:
            chunk = fin.read(CHUNKSIZE)
            if b'\0' in chunk: # found null byte (bytes literal, since the file is opened in 'rb' mode)
                return True
            if len(chunk) < CHUNKSIZE:
                break # done
    # Look, Python doesn't need the "except:" clause here. How clever.
    finally:
        fin.close()

    return False

Shane C. Mason 16 years ago

If it helps, many binary types begin with a magic number. Here is a list of file signatures.
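
A minimal sketch of a magic-number check might look like this. The `MAGIC_NUMBERS` table and the `identify_by_magic` helper are illustrative names, and the table holds only a handful of well-known signatures; a real list would be far longer.

```python
# A few well-known file signatures (the leading bytes of each format).
MAGIC_NUMBERS = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP archive",
    b"\x7fELF": "ELF executable",
    b"GIF87a": "GIF image",
    b"GIF89a": "GIF image",
    b"\x1f\x8b": "gzip data",
}

def identify_by_magic(path):
    """Return a format name if the file starts with a known signature, else None."""
    longest = max(len(signature) for signature in MAGIC_NUMBERS)
    with open(path, "rb") as f:
        head = f.read(longest)  # only the first few bytes are needed
    for signature, name in MAGIC_NUMBERS.items():
        if head.startswith(signature):
            return name
    return None
```

A None result only means the signature is not in the table, so it cannot prove that a file is text.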

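Calaelen skyking 7 years ago

If you're using Python 3 with UTF-8, it is straightforward: just open the file in text mode and stop processing if you get a UnicodeDecodeError. Python 3 uses unicode when handling files in text mode (and bytes in binary mode); if your encoding can't decode arbitrary files, it's quite likely that you will get a UnicodeDecodeError.

Example: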

try:
    with open(filename, "r") as f:
        for l in f:
             process_line(l)
except UnicodeDecodeError:
    pass # Found non-text data
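
If you only want a yes/no answer without processing the whole file, the same idea can be applied to a sample. This is a sketch under a few assumptions: the helper name `looks_like_utf8_text` and the 8192-byte sample size are arbitrary choices, and files in other encodings (Latin-1, UTF-16 and so on) will be reported as non-text.

```python
import codecs

def looks_like_utf8_text(path, sample_size=8192):
    """Return True if the first sample_size bytes of the file are valid UTF-8."""
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte character cut off at the sample
        # boundary; genuinely invalid byte sequences still raise.
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True
```

Since NUL bytes are valid UTF-8, combining this with a null-byte check is probably a good idea in practice.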

Jacob Gabrielson 16 years ago

Here's a suggestion that uses the Unix file command:

import re
import subprocess

def istext(path):
    return (re.search(br':.* text',  # bytes pattern: the pipe yields bytes on Python 3
                      subprocess.Popen(["file", '-L', path], 
                                       stdout=subprocess.PIPE).stdout.read())
            is not None)

Example usage:

>>> istext('/etc/motd') 
True
>>> istext('/vmlinuz') 
False
>>> open('/tmp/japanese').read()
'\xe3\x81\x93\xe3\x82\x8c\xe3\x81\xaf\xe3\x80\x81\xe3\x81\xbf\xe3\x81\x9a\xe3\x81\x8c\xe3\x82\x81\xe5\xba\xa7\xe3\x81\xae\xe6\x99\x82\xe4\xbb\xa3\xe3\x81\xae\xe5\xb9\x95\xe9\x96\x8b\xe3\x81\x91\xe3\x80\x82\n'
>>> istext('/tmp/japanese') # works on UTF-8
True

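It has the downsides of not being portable to Windows (unless you have something like the file command there), and of having to spawn an external process for each file, which might not be acceptable.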
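
One way to soften the per-file process cost is to hand `file` many paths at once. The sketch below assumes a `file` implementation that supports `-b` and `--mime-encoding` and prints one line per path in argument order; the helper name `text_flags` is made up. It also assumes every path exists and is readable, since an error message would otherwise be mistaken for an encoding name.

```python
import subprocess

def text_flags(paths):
    """Map each path to True (text) or False (binary) using a single `file` call."""
    paths = list(paths)
    if not paths:
        return {}
    output = subprocess.run(
        ["file", "-b", "-L", "--mime-encoding", "--"] + paths,
        stdout=subprocess.PIPE, check=True,
    ).stdout.decode("utf-8", "replace")
    encodings = output.splitlines()
    if len(encodings) != len(paths):
        raise ValueError("unexpected output from file(1)")
    # file reports the encoding "binary" for anything it does not consider text
    return {p: enc.strip() != "binary" for p, enc in zip(paths, encodings)}
```

For very large sets of files you would still need to split the list into chunks to stay under the command-line length limit.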

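kenorb 10 years ago

Use the binaryornot library (GitHub).

It is very simple and based on the code found in this Stack Overflow question.

You can actually write this in two lines of code, but this package saves you from having to write and thoroughly test those two lines against all sorts of weird file types, cross-platform.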

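Douglas Leeder 16 years ago

Usually you have to guess.

You can look at the extensions as one clue, if the files have them.

You can also recognise known binary formats and ignore those.

Otherwise, see what proportion of non-printable ASCII bytes you have and take a guess from that.

You can also try decoding from UTF-8 and see if that produces sensible output.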
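
The "proportion of non-printable bytes" guess could be sketched like this for Python 3. The function names and the 0.30 threshold are assumptions chosen for illustration (another answer in this thread uses the same cut-off); tune them to your data.

```python
# Control bytes that commonly appear in text files: BS, TAB, LF, FF, CR.
_TEXT_CONTROLS = frozenset(b"\b\t\n\f\r")

def nonprintable_ratio(sample):
    """Fraction of bytes that are ASCII control characters (0x00-0x1F or 0x7F)
    other than the usual whitespace controls."""
    if not sample:
        return 0.0
    bad = sum(1 for byte in sample
              if (byte < 0x20 or byte == 0x7F) and byte not in _TEXT_CONTROLS)
    return bad / len(sample)

def looks_binary(sample, threshold=0.30):
    """Guess that a bytes sample is binary if too many bytes are control characters."""
    return nonprintable_ratio(sample) > threshold
```

Bytes above 0x7F are deliberately not counted here, so UTF-8 or Latin-1 text is not penalised.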

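Kamil Kisiel 16 years ago

If you're not on Windows, you can use Python Magic to determine the file type. Then you can check whether it is a text/ MIME type.

Kieee 7 years ago

A shorter solution, with a UTF-16 warning: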

def is_binary(filename):
    """ 
    Return true if the given filename appears to be binary.
    File is considered to be binary if it contains a NULL byte.
    FIXME: This approach incorrectly reports UTF-16 as binary.
    """
    with open(filename, 'rb') as f:
        for block in f:
            if b'\0' in block:
                return True
    return False

Caco 7 years ago

We can use Python itself to check whether a file is binary, because reading a binary file in text mode fails.

def is_binary(file_name):
    try:
        with open(file_name, 'tr') as check_file:  # try open file in text mode
            check_file.read()
            return False
    except UnicodeDecodeError:  # decoding failed, so the file is non-text (binary)
        return True

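rsaw 13 years ago

I came here looking for exactly the same thing: a comprehensive solution provided by the standard library to detect binary or text. After reviewing the options people suggested, the *nix file command looks like the best choice (I'm only developing for Linux boxen). Some others posted solutions using file, but they are unnecessarily complicated in my opinion, so here's what I came up with: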

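import subprocess

def test_file_isbinary(filename):
    # Pass the arguments as a list so quotes or spaces in the filename are safe.
    cmd = ["file", "-b", "-e", "soft", filename]
    # check_output returns bytes on Python 3, so compare against bytes prefixes.
    if subprocess.check_output(cmd)[:4] in {b'ASCI', b'UTF-'}:
        return False
    return True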

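It should go without saying, but the code that calls this function should make sure the file is readable before testing it; otherwise the file will be mistakenly detected as binary.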
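
A small guard along those lines could use `os.access`. This is only a sketch, and the helper name `is_readable_file` is made up for illustration. It reports what the OS believes at the time of the call, so a later open can still fail.

```python
import os

def is_readable_file(path):
    """Return True if path is an existing regular file that we may read."""
    return os.path.isfile(path) and os.access(path, os.R_OK)
```

Call it before the binary test and skip, or report, anything for which it returns False.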

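kenorb 10 years ago

I guess the best solution is to use the guess_type function. It holds a list of several MIME types, and you can also add your own types. Here is the script I wrote to solve my problem: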

from mimetypes import guess_type
from mimetypes import add_type
from os import listdir  # needed by __listDir below

def __init__(self):
        self.__addMimeTypes()

def __addMimeTypes(self):
        add_type("text/plain",".properties")

def __listDir(self,path):
        try:
            return listdir(path)
        except IOError:
            print ("The directory {0} could not be accessed".format(path))

def getTextFiles(self, path):
        asciiFiles = []
        for files in self.__listDir(path):
            if (guess_type(files)[0] or "").split("/")[0] == "text":  # guess_type returns None for unknown types
                asciiFiles.append(files)
        try:
            return asciiFiles
        except NameError:
            print ("No text files in directory: {0}".format(path))
        finally:
            del asciiFiles

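It lives inside a class, as you can see from the structure of the code, but you can change it however you need to fit it into your application. It is quite simple to use. The method getTextFiles returns a list object with all the text files that reside in the directory you pass in the path variable.

roskakori 8 years ago

Here's a function that first checks whether the file starts with a BOM and, if not, looks for a zero byte within the initial 8192 bytes: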

import codecs


#: BOMs to indicate that a file is a text file even if it contains zero bytes.
_TEXT_BOMS = (
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF8,
)


def is_binary_file(source_path):
    with open(source_path, 'rb') as source_file:
        initial_bytes = source_file.read(8192)
    return not any(initial_bytes.startswith(bom) for bom in _TEXT_BOMS) \
           and b'\0' in initial_bytes

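Technically the check for the UTF-8 BOM is unnecessary, because for all practical purposes such a file should not contain zero bytes. But as it is a very common encoding, it is quicker to check for the BOM at the beginning than to scan all 8192 bytes for a zero.

fortran 13 years ago

Are you on Unix? If so, then try: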

isBinary = os.system("file -b " + name + " | grep text > /dev/null")

Shell return values are inverted (0 means success), so if it finds "text" it returns 0, which in Python is a false value.
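
A variant that avoids the shell altogether, so that file names with spaces or quotes are safe, might look like the sketch below. The helper name `is_text_according_to_file` is made up, and it assumes a `file` command is on the PATH. It returns a plain boolean rather than the inverted exit status.

```python
import subprocess

def is_text_according_to_file(path):
    """Return True if `file -b` describes the path with the word "text"."""
    result = subprocess.run(["file", "-b", "--", path],
                            stdout=subprocess.PIPE, check=True)
    return b"text" in result.stdout
```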

kenorb 10 years ago

A simpler way is to check whether the file contains a NULL character ( \x00 ) by using the in operator, for instance:

b'\x00' in open("foo.bar", 'rb').read()

See the complete example below:

#!/usr/bin/env python3
import argparse
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('file', nargs=1)
    args = parser.parse_args()
    with open(args.file[0], 'rb') as f:
        if b'\x00' in f.read():
            print('The file is binary!')
        else:
            print('The file is not binary!')

Sample usage:

$ ./is_binary.py /etc/hosts
The file is not binary!
$ ./is_binary.py `which which`
The file is binary!

Rob Truxal 8 years ago

On *NIX:

If you have access to the `file` shell command, shlex can help make the subprocess module more usable:

from os.path import realpath
from subprocess import check_output
from shlex import split

filepath = realpath('rel/or/abs/path/to/file')
assert b'ascii' in check_output(split('file {}'.format(filepath))).lower()

Or you could put that in a for loop to get output for all files in the current directory:

import os
for afile in [x for x in os.listdir('.') if os.path.isfile(x)]:
    assert b'ascii' in check_output(split('file {}'.format(afile))).lower()

or for all subdirectories:

for curdir, _subdirs, filelist in os.walk('.'):
    for afile in filelist:
        assert b'ascii' in check_output(split('file {}'.format(os.path.join(curdir, afile)))).lower()

umläute 7 years ago

Most programs consider a file to be binary if it contains a NULL character.

Here is Perl's version of pp_fttext() ( pp_sys.c ) implemented in Python:

import sys
PY3 = sys.version_info[0] == 3

# A function that takes an integer in the 8-bit range and returns
# a single-character byte object in py3 / a single-character string
# in py2.
#
int2byte = (lambda x: bytes((x,))) if PY3 else chr

_text_characters = (
        b''.join(int2byte(i) for i in range(32, 127)) +
        b'\n\r\t\f\b')

def istextfile(fileobj, blocksize=512):
    """ Uses heuristics to guess whether the given file is text or binary,
        by reading a single block of bytes from the file.
        If more than 30% of the chars in the block are non-text, or there
        are NUL ('\x00') bytes in the block, assume this is a binary file.
    """
    block = fileobj.read(blocksize)
    if b'\x00' in block:
        # Files with null bytes are binary
        return False
    elif not block:
        # An empty file is considered a valid text file
        return True

    # Use translate's 'deletechars' argument to efficiently remove all
    # occurrences of _text_characters from the block
    nontext = block.translate(None, _text_characters)
    return float(len(nontext)) / len(block) <= 0.30

Note also that this code was written to run on both Python 2 and Python 3 without changes.

Source: Perl's "guess if file is text or binary" implemented in Python
