
Error when converting string and array data from a CSV file to TFRecords

  •  0
  • SantoshGupta7  · 6 years ago

    I am following these examples to convert a CSV file to TFRecords.

    This is the code I tried:

    csv = pd.read_csv("ehealth.csv").values
    with tf.python_io.TFRecordWriter("ehealth.tfrecords") as writer:
        for row in csv:
            question, answer, question_bert, answer_bert = row[0], row[1] , row[1], row[2]
            example = tf.train.Example()
            example.features.feature["question"].bytes_list.value.extend(question.encode("utf8"))
            example.features.feature["answer"].bytes_list.value.extend(answer.encode("utf8"))
            example.features.feature["question_bert"].float_list.value.extend(question_bert)
            example.features.feature["answer_bert"].float_list.value.append(answer_bert)
            writer.write(example.SerializeToString())
    

    This is the error I get:

    TypeError                                 Traceback (most recent call last) <ipython-input-36-0a8c5e073d84> in <module>()
          4         question, answer, question_bert, answer_bert = row[0], row[1] , row[1], row[2]
          5         example = tf.train.Example()
    ----> 6         example.features.feature["question"].bytes_list.value.extend(question.encode("utf8"))
          7         example.features.feature["answer"].bytes_list.value.extend(answer.encode("utf8"))
          8         example.features.feature["question_bert"].float_list.value.extend(question_bert)
    
    TypeError: 104 has type int, but expected one of: bytes
    

    There seems to be a problem with encoding the strings. I commented out those two lines to check that everything else works:

    csv = pd.read_csv("ehealth.csv").values
    with tf.python_io.TFRecordWriter("ehealth.tfrecords") as writer:
        for row in csv:
            question, answer, question_bert, answer_bert = row[0], row[1] , row[1], row[2]
            example = tf.train.Example()
    #         example.features.feature["question"].bytes_list.value.extend(question)
    #         example.features.feature["answer"].bytes_list.value.extend(answer)
            example.features.feature["question_bert"].float_list.value.extend(question_bert)
            example.features.feature["answer_bert"].float_list.value.append(answer_bert)
            writer.write(example.SerializeToString())
    

    But then I get this error:

    TypeError                                 Traceback (most recent call last) <ipython-input-13-565b43316ef5> in <module>()
          6 #         example.features.feature["question"].bytes_list.value.extend(question)
          7 #         example.features.feature["answer"].bytes_list.value.extend(answer)
    ----> 8         example.features.feature["question_bert"].float_list.value.extend(question_bert)
          9         example.features.feature["answer_bert"].float_list.value.append(answer_bert)
         10         writer.write(example.SerializeToString())
    
    TypeError: 's' has type str, but expected one of: int, long, float
    

    It turns out the problem is that pandas interprets my arrays as strings rather than arrays:

    type( csv[0][2])
    
    ->str
    

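    Also, it looks like I have to use example.SerializeToString() because I have an array, but I am not sure how to do that.

    Below is the full code to reproduce the error, including the code that downloads the CSV file from Google Drive.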

    import pandas as pd
    import numpy as np
    import requests
    import tensorflow as tf
    
    def download_file_from_google_drive(id, destination):
        URL = "https://docs.google.com/uc?export=download"
    
        session = requests.Session()
    
        response = session.get(URL, params = { 'id' : id }, stream = True)
        token = get_confirm_token(response)
    
        if token:
            params = { 'id' : id, 'confirm' : token }
            response = session.get(URL, params = params, stream = True)
    
        save_response_content(response, destination)    
    
    def get_confirm_token(response):
        for key, value in response.cookies.items():
            if key.startswith('download_warning'):
                return value
    
        return None
    
    def save_response_content(response, destination):
        CHUNK_SIZE = 32768
    
        with open(destination, "wb") as f:
            for chunk in response.iter_content(CHUNK_SIZE):
                if chunk: # filter out keep-alive new chunks
                    f.write(chunk)
    
    # download_file_from_google_drive('1rMjqKkMnt6_vROrGmlTGStNGmwPO4YFX', 'model.zip') #
    
    file_id = '1anbEwfViu9Rzu7tWKgPb_We1EwbA4x1-'
    destination = 'ehealth.csv'
    download_file_from_google_drive(file_id, destination)
    
    healthdata=pd.read_csv('ehealth.csv')
    healthdata.head()
    
    csv = pd.read_csv("ehealth.csv").values
    with tf.python_io.TFRecordWriter("ehealth.tfrecords") as writer:
        for row in csv:
            question, answer, question_bert, answer_bert = row[0], row[1] , row[1], row[2]
            example = tf.train.Example()
            example.features.feature["question"].bytes_list.value.extend(question)
            example.features.feature["answer"].bytes_list.value.extend(answer)
            example.features.feature["question_bert"].float_list.value.extend(question_bert)
            example.features.feature["answer_bert"].float_list.value.append(answer_bert)
            writer.write(example.SerializeToString())
    
    
    csv = pd.read_csv("ehealth.csv").values
    with tf.python_io.TFRecordWriter("ehealth.tfrecords") as writer:
        for row in csv:
            question, answer, question_bert, answer_bert = row[0], row[1] , row[1], row[2]
            example = tf.train.Example()
    #         example.features.feature["question"].bytes_list.value.extend(question)
    #         example.features.feature["answer"].bytes_list.value.extend(answer)
            example.features.feature["question_bert"].float_list.value.extend(question_bert)
            example.features.feature["answer_bert"].float_list.value.append(answer_bert)
            writer.write(example.SerializeToString())
    
  •  2
  •   Joocheol Kim    6 years ago

    Try:

    example.features.feature["question"].bytes_list.value.extend([bytes(question, 'utf-8')])
    

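    This fixes the error on line 6: extend iterates over its argument, and iterating over a bytes object yields integers, so the encoded string has to be wrapped in a list. The same change applies to line 7.

    Also check your indices: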
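
    The `104 has type int` message can be reproduced without TensorFlow. In Python 3, iterating over a `bytes` object yields integers, and `extend` iterates over whatever it is given. The following is a minimal illustration that uses plain lists in place of the protobuf field; the sample string and variable names are made up for the example.

```python
question = "hello"
encoded = question.encode("utf8")

# extend() iterates over its argument; iterating bytes yields ints.
as_ints = []
as_ints.extend(encoded)
print(as_ints)  # [104, 101, 108, 108, 111]

# Wrapping the bytes object in a list adds it as a single element.
as_bytes = []
as_bytes.extend([encoded])
print(as_bytes)  # [b'hello']
```

    The first form is what the original line 6 does, which is why the protobuf field complains about receiving an int.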

    question, answer, question_bert, answer_bert = row[0], row[1] , row[1], row[2]
    

    I think they should be 0, 1, 2, and 3.

    Even after correcting the indices, you still get an error. So add

    print(type(question_bert))
    

    It reports that the value is a string. If it is a string, you need to change

    float_list.value.append

    to

    bytes_list.value.extend
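
    If the column is meant to hold numbers, an alternative to storing the text as bytes is to parse it back into floats before writing. The sketch below assumes each cell looks like NumPy's printed form of an array, for example `"[0.1 0.2 0.3]"`. The actual format in `ehealth.csv` is not shown in this thread, so the helper `parse_array_string` is hypothetical and may need adjusting.

```python
import numpy as np

def parse_array_string(text):
    """Parse a string such as '[0.1 0.2 0.3]' into a 1-D float array.

    Assumes whitespace- or comma-separated numbers, optionally wrapped
    in square brackets. Nested brackets are flattened.
    """
    cleaned = text.replace("[", " ").replace("]", " ").replace(",", " ")
    return np.array(cleaned.split(), dtype=np.float64)

values = parse_array_string("[0.1 0.2 0.3]")
print(values.dtype, values.shape)  # float64 (3,)
```

    With values parsed this way, something like `float_list.value.extend(values.tolist())` should work, since the field then receives floats instead of characters.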
    

    If you have an array, you need to use

    tf.serialize_tensor
    

    Here is a simple example of tf.serialize_tensor:

    a = np.array([[1.0, 2, 46], [0, 0, 1]])
    b=tf.serialize_tensor(a)
    b
    

    The output is:

    <tf.Tensor: id=25, shape=(), dtype=string, numpy=b'\x08\x02\x12\x08\x12\x02\x08\x02\x12\x02\x08\x03"0\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00G@\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xf0?'>
    

    You need to save the result as bytes, that is, in a bytes_list feature.
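
    Putting the pieces together, the following is a rough sketch of storing a serialized array in a bytes feature and reading it back. It assumes TensorFlow 2.x with eager execution, where the function is exposed as `tf.io.serialize_tensor`. The sample question and array are invented stand-ins for one CSV row.

```python
import numpy as np
import tensorflow as tf

# Hypothetical in-memory row standing in for one line of the CSV.
question = "what is a fever?"
question_bert = np.array([[1.0, 2.0, 46.0], [0.0, 0.0, 1.0]])

# Serialize the array to a scalar string tensor, then take its raw bytes.
serialized = tf.io.serialize_tensor(question_bert).numpy()

example = tf.train.Example()
example.features.feature["question"].bytes_list.value.extend(
    [question.encode("utf-8")])
example.features.feature["question_bert"].bytes_list.value.extend(
    [serialized])
record = example.SerializeToString()

# Reading back: parse the Example, then decode the tensor bytes.
parsed = tf.io.parse_single_example(
    record,
    {
        "question": tf.io.FixedLenFeature([], tf.string),
        "question_bert": tf.io.FixedLenFeature([], tf.string),
    },
)
restored = tf.io.parse_tensor(parsed["question_bert"], out_type=tf.float64)
print(restored.numpy())
```

    Here `record` is the byte string that would be passed to `writer.write(...)`. The `out_type` given to `tf.io.parse_tensor` has to match the dtype of the array that was serialized, which is float64 in this sketch.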
