
Two Python loops look like they should do the same thing, but produce different results?

  •  2
  • Aaraeus  · 6 years ago

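    Yesterday I was trying to complete Lesson 11 of the Udacity course, on text vectorization. I went over my code carefully and everything seemed fine: I take some emails, open them, remove a few signature words, and append the stemmed text of each email to a list.

    Here is loop 1: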

    for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
        for path in from_person:
            ### only look at first 200 emails when developing
            ### once everything is working, remove this line to run over full dataset
    #        temp_counter += 1
            if temp_counter < 200:
                path = os.path.join('/xxx', path[:-1])
                email = open(path, "r")
    
                ### use parseOutText to extract the text from the opened email
    
                email_stemmed = parseOutText(email)
    
                ### use str.replace() to remove any instances of the words
                ### ["sara", "shackleton", "chris", "germani"]
    
                email_stemmed.replace("sara","")
                email_stemmed.replace("shackleton","")
                email_stemmed.replace("chris","")
                email_stemmed.replace("germani","")
    
                ### append the text to word_data
    
                word_data.append(email_stemmed.replace('\n', ' ').strip())
    
                ### append a 0 to from_data if email is from Sara, and 1 if email is from Chris
                if from_person == "sara":
                    from_data.append(0)
                elif from_person == "chris":
                    from_data.append(1)
    
                email.close()
    

    Here is loop 2:

    for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
        for path in from_person:
            ### only look at first 200 emails when developing
            ### once everything is working, remove this line to run over full dataset
    #        temp_counter += 1
            if temp_counter < 200:
                path = os.path.join('/xxx', path[:-1])
                email = open(path, "r")
    
                ### use parseOutText to extract the text from the opened email
                stemmed_email = parseOutText(email)
    
                ### use str.replace() to remove any instances of the words
                ### ["sara", "shackleton", "chris", "germani"]
                signature_words = ["sara", "shackleton", "chris", "germani"]
                for each_word in signature_words:
                    stemmed_email = stemmed_email.replace(each_word, '')         #careful here, dont use another variable, I did and broke my head to solve it
    
                ### append the text to word_data
                word_data.append(stemmed_email)
    
                ### append a 0 to from_data if email is from Sara, and 1 if email is from Chris
                if name == "sara":
                    from_data.append(0)
                else: # its chris
                    from_data.append(1)
    
    
                email.close()
    

    The next part of the code works as expected:

    print("emails processed")
    from_sara.close()
    from_chris.close()
    
    pickle.dump( word_data, open("/xxx/your_word_data.pkl", "wb") )
    pickle.dump( from_data, open("xxx/your_email_authors.pkl", "wb") )
    
    
    print("Answer to Lesson 11 quiz 19: ")
    print(word_data[152])
    
    
    ### in Part 4, do TfIdf vectorization here
    
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_extraction import stop_words
    print("SKLearn has this many Stop Words: ")
    print(len(stop_words.ENGLISH_STOP_WORDS))
    
    vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
    vectorizer.fit_transform(word_data)
    
    feature_names = vectorizer.get_feature_names()
    
    print('Number of different words: ')
    print(len(feature_names))
    

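    But when I count the total number of distinct words using loop 1, I get the wrong result. When I use loop 2, I get the correct result.

    I have been staring at this code for too long and I cannot spot the difference. What did I do wrong in loop 1?

    For the record, the wrong answer I keep getting is 38825. The correct answer should be 38757.

    Thank you very much for your help, kind stranger!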

    1 reply  |  as of 6 years ago
  •  3
  •   Primusa    6 years ago

    These lines have no effect:

    email_stemmed.replace("sara","")
    email_stemmed.replace("shackleton","")
    email_stemmed.replace("chris","")
    email_stemmed.replace("germani","")
    

    replace returns a new string; it does not modify email_stemmed. Instead, you should assign the return value back to email_stemmed:

    email_stemmed = email_stemmed.replace("sara", "")
    

    And so on for the other three words.
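    As a quick illustration of why the bare calls do nothing, here is a minimal sketch. The sample string and the names `unchanged` and `cleaned` are made up for this example and do not come from the question's dataset.

```python
# Python strings are immutable: str.replace() builds and returns a NEW string
# and leaves the original object untouched.
text = "sara shackleton sent this email"  # made-up sample text

text.replace("sara", "")            # return value is discarded
unchanged = text                    # text still holds the original string

cleaned = text.replace("sara", "")  # return value is kept this time

print(repr(unchanged))
print(repr(cleaned))
```

    Only the second call has a visible effect, because its result is bound to a name.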

    Loop 2 does assign the return value inside its for loop:

    for each_word in signature_words:
        stemmed_email = stemmed_email.replace(each_word, '')
    

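    The two snippets are not equivalent: at the end of the first one, email_stemmed is completely unchanged because replace is not being used correctly, while at the end of the second one, stemmed_email has actually had each of the signature words removed.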
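    The difference between the two loops can be reproduced in isolation. The sketch below is only an illustration: the helper names `strip_discarding_result` and `strip_keeping_result` and the sample text are hypothetical, and plain whitespace splitting stands in for the vectorizer's tokenization.

```python
SIGNATURE_WORDS = ["sara", "shackleton", "chris", "germani"]


def strip_discarding_result(text):
    """Mimics loop 1: calls replace() but throws every result away."""
    for word in SIGNATURE_WORDS:
        text.replace(word, "")
    return text


def strip_keeping_result(text):
    """Mimics loop 2: rebinds the name to each replace() result."""
    for word in SIGNATURE_WORDS:
        text = text.replace(word, "")
    return text


sample = "meet tomorrow sara shackleton"  # made-up stemmed email text

# Distinct words that survive each approach.
loop1_words = set(strip_discarding_result(sample).split())
loop2_words = set(strip_keeping_result(sample).split())

print(sorted(loop1_words))
print(sorted(loop2_words))
```

    With the loop 1 pattern the signature words stay in the text, so they remain in the vocabulary. That is consistent with loop 1 reporting the larger word count in the question, though this sketch does not reproduce the exact figures.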