What is a good Python profanity filter library? [closed]

  •  32
  • Paul D. Waite  · Tech Community  · 15 years ago

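    Like https://stackoverflow.com/questions/1521646/best-profanity-filter , but for Python, and I'm looking for libraries that I can run and control myself locally, rather than web services.

    (While it's always great to hear your fundamental objections to the principle of profanity filtering, I'm not specifically looking for them here. I know profanity filtering can't catch every hurtful thing being said. I know swearing, in the grand scheme of things, isn't a particularly big issue. I know you need some human input to deal with content issues. I'd just like to find a good library and see what use I can make of it.)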

    6 answers  |  last activity 8 years ago
        1
  •  44
  •   leoluk    15 years ago

    I didn't find any Python profanity library, so I made one myself.


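    filterlist

    A list of regular expressions that match forbidden words. Do not put \b in them; it is added automatically depending on inside_words.

    Example: ['bad', r'un\w+']

    ignore_case

    Default: True

    replacements

    Default: "$@%-?!"

    The characters from which the replacement string is randomly drawn. Examples: "%&$?!", "-", etc.

    complete

    Default: True

    Controls whether the entire match is replaced or its first and last characters are kept.

    inside_words

    Default: False

    Controls whether words are also searched for inside other words. When this is disabled, the pattern is wrapped in \b, so only whole words match.


    (examples at the end)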

    """
    Module that provides a class that filters profanities
    
    """
    
    __author__ = "leoluk"
    __version__ = '0.0.1'
    
    import random
    import re
    
    class ProfanitiesFilter(object):
        def __init__(self, filterlist, ignore_case=True, replacements="$@%-?!", 
                     complete=True, inside_words=False):
            """
            Inits the profanity filter.
    
            filterlist -- a list of regular expressions that
            matches words that are forbidden
            ignore_case -- ignore capitalization
            replacements -- string with characters to replace the forbidden word
            complete -- completely remove the word or keep the first and last char?
            inside_words -- search inside other words?
    
            """
    
            self.badwords = filterlist
            self.ignore_case = ignore_case
            self.replacements = replacements
            self.complete = complete
            self.inside_words = inside_words
    
        def _make_clean_word(self, length):
            """
            Generates a random replacement string of a given length
            using the chars in self.replacements.
    
            """
            return ''.join([random.choice(self.replacements) for i in
                      range(length)])
    
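        def __replacer(self, match):
            value = match.group()
            # Matches shorter than 3 chars have no interior to mask,
            # so they are replaced completely.
            if self.complete or len(value) < 3:
                return self._make_clean_word(len(value))
            else:
                return value[0]+self._make_clean_word(len(value)-2)+value[-1]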
    
        def clean(self, text):
            """Cleans a string from profanity."""
    
            regexp_insidewords = {
                True: r'(%s)',
                False: r'\b(%s)\b',
                }
    
            regexp = (regexp_insidewords[self.inside_words] % 
                      '|'.join(self.badwords))
    
            r = re.compile(regexp, re.IGNORECASE if self.ignore_case else 0)
    
            return r.sub(self.__replacer, text)
    
    
    if __name__ == '__main__':
    
        # Raw strings keep regex escapes such as \w intact.
        f = ProfanitiesFilter([r'bad', r'un\w+'], replacements="-")
        example = "I am doing bad ungood badlike things."
    
        print(f.clean(example))
        # Prints "I am doing --- ------ badlike things."
    
        f.inside_words = True
        print(f.clean(example))
        # Prints "I am doing --- ------ ---like things."
    
        f.complete = False
        print(f.clean(example))
        # Prints "I am doing b-d u----d b-dlike things."
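
    Because the filterlist entries are treated as regular expressions, a literal entry such as f*ck would be read as a pattern rather than as text. The following is a minimal sketch, assuming you start from a plain word list, that escapes each entry with re.escape before building the same \b(...)\b alternation that clean() uses. The helper name literal_patterns is hypothetical and not part of the class above.

```python
import re


def literal_patterns(words):
    """Escape plain words so regex metacharacters are matched literally."""
    return [re.escape(w) for w in words]


# Hypothetical word list; "f*ck" contains the regex metacharacter "*".
patterns = literal_patterns(["f*ck", "bad"])

# Same whole-word template that clean() uses when inside_words is False.
regex = re.compile(r"\b(%s)\b" % "|".join(patterns), re.IGNORECASE)

masked = regex.sub(lambda m: "-" * len(m.group()), "That f*ck is BAD, not badly.")
print(masked)
# Prints "That ---- is ---, not badly."
```

    The escaped list could also be passed as filterlist to ProfanitiesFilter, since escaped strings are still valid regular expressions.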
    
        2
  •  20
  •   user2592414    12 years ago
    arrBad = [
    '2g1c',
    '2 girls 1 cup',
    'acrotomophilia',
    'anal',
    'anilingus',
    'anus',
    'arsehole',
    'ass',
    'asshole',
    'assmunch',
    'auto erotic',
    'autoerotic',
    'babeland',
    'baby batter',
    'ball gag',
    'ball gravy',
    'ball kicking',
    'ball licking',
    'ball sack',
    'ball sucking',
    'bangbros',
    'bareback',
    'barely legal',
    'barenaked',
    'bastardo',
    'bastinado',
    'bbw',
    'bdsm',
    'beaver cleaver',
    'beaver lips',
    'bestiality',
    'bi curious',
    'big black',
    'big breasts',
    'big knockers',
    'big tits',
    'bimbos',
    'birdlock',
    'bitch',
    'black cock',
    'blonde action',
    'blonde on blonde action',
    'blow j',
    'blow your l',
    'blue waffle',
    'blumpkin',
    'bollocks',
    'bondage',
    'boner',
    'boob',
    'boobs',
    'booty call',
    'brown showers',
    'brunette action',
    'bukkake',
    'bulldyke',
    'bullet vibe',
    'bung hole',
    'bunghole',
    'busty',
    'butt',
    'buttcheeks',
    'butthole',
    'camel toe',
    'camgirl',
    'camslut',
    'camwhore',
    'carpet muncher',
    'carpetmuncher',
    'chocolate rosebuds',
    'circlejerk',
    'cleveland steamer',
    'clit',
    'clitoris',
    'clover clamps',
    'clusterfuck',
    'cock',
    'cocks',
    'coprolagnia',
    'coprophilia',
    'cornhole',
    'cum',
    'cumming',
    'cunnilingus',
    'cunt',
    'darkie',
    'date rape',
    'daterape',
    'deep throat',
    'deepthroat',
    'dick',
    'dildo',
    'dirty pillows',
    'dirty sanchez',
    'dog style',
    'doggie style',
    'doggiestyle',
    'doggy style',
    'doggystyle',
    'dolcett',
    'domination',
    'dominatrix',
    'dommes',
    'donkey punch',
    'double dong',
    'double penetration',
    'dp action',
    'eat my ass',
    'ecchi',
    'ejaculation',
    'erotic',
    'erotism',
    'escort',
    'ethical slut',
    'eunuch',
    'faggot',
    'fecal',
    'felch',
    'fellatio',
    'feltch',
    'female squirting',
    'femdom',
    'figging',
    'fingering',
    'fisting',
    'foot fetish',
    'footjob',
    'frotting',
    'fuck',
    'fucking',
    'fuck buttons',
    'fudge packer',
    'fudgepacker',
    'futanari',
    'g-spot',
    'gang bang',
    'gay sex',
    'genitals',
    'giant cock',
    'girl on',
    'girl on top',
    'girls gone wild',
    'goatcx',
    'goatse',
    'gokkun',
    'golden shower',
    'goo girl',
    'goodpoop',
    'goregasm',
    'grope',
    'group sex',
    'guro',
    'hand job',
    'handjob',
    'hard core',
    'hardcore',
    'hentai',
    'homoerotic',
    'honkey',
    'hooker',
    'hot chick',
    'how to kill',
    'how to murder',
    'huge fat',
    'humping',
    'incest',
    'intercourse',
    'jack off',
    'jail bait',
    'jailbait',
    'jerk off',
    'jigaboo',
    'jiggaboo',
    'jiggerboo',
    'jizz',
    'juggs',
    'kike',
    'kinbaku',
    'kinkster',
    'kinky',
    'knobbing',
    'leather restraint',
    'leather straight jacket',
    'lemon party',
    'lolita',
    'lovemaking',
    'make me come',
    'male squirting',
    'masturbate',
    'menage a trois',
    'milf',
    'missionary position',
    'motherfucker',
    'mound of venus',
    'mr hands',
    'muff diver',
    'muffdiving',
    'nambla',
    'nawashi',
    'negro',
    'neonazi',
    'nig nog',
    'nigga',
    'nigger',
    'nimphomania',
    'nipple',
    'nipples',
    'nsfw images',
    'nude',
    'nudity',
    'nympho',
    'nymphomania',
    'octopussy',
    'omorashi',
    'one cup two girls',
    'one guy one jar',
    'orgasm',
    'orgy',
    'paedophile',
    'panties',
    'panty',
    'pedobear',
    'pedophile',
    'pegging',
    'penis',
    'phone sex',
    'piece of shit',
    'piss pig',
    'pissing',
    'pisspig',
    'playboy',
    'pleasure chest',
    'pole smoker',
    'ponyplay',
    'poof',
    'poop chute',
    'poopchute',
    'porn',
    'porno',
    'pornography',
    'prince albert piercing',
    'pthc',
    'pubes',
    'pussy',
    'queaf',
    'raghead',
    'raging boner',
    'rape',
    'raping',
    'rapist',
    'rectum',
    'reverse cowgirl',
    'rimjob',
    'rimming',
    'rosy palm',
    'rosy palm and her 5 sisters',
    'rusty trombone',
    's&m',
    'sadism',
    'scat',
    'schlong',
    'scissoring',
    'semen',
    'sex',
    'sexo',
    'sexy',
    'shaved beaver',
    'shaved pussy',
    'shemale',
    'shibari',
    'shit',
    'shota',
    'shrimping',
    'slanteye',
    'slut',
    'smut',
    'snatch',
    'snowballing',
    'sodomize',
    'sodomy',
    'spic',
    'spooge',
    'spread legs',
    'strap on',
    'strapon',
    'strappado',
    'strip club',
    'style doggy',
    'suck',
    'sucks',
    'suicide girls',
    'sultry women',
    'swastika',
    'swinger',
    'tainted love',
    'taste my',
    'tea bagging',
    'threesome',
    'throating',
    'tied up',
    'tight white',
    'tit',
    'tits',
    'titties',
    'titty',
    'tongue in a',
    'topless',
    'tosser',
    'towelhead',
    'tranny',
    'tribadism',
    'tub girl',
    'tubgirl',
    'tushy',
    'twat',
    'twink',
    'twinkie',
    'two girls one cup',
    'undressing',
    'upskirt',
    'urethra play',
    'urophilia',
    'vagina',
    'venus mound',
    'vibrator',
    'violet blue',
    'violet wand',
    'vorarephilia',
    'voyeur',
    'vulva',
    'wank',
    'wet dream',
    'wetback',
    'white power',
    'women rapping',
    'wrapping men',
    'wrinkled starfish',
    'xx',
    'xxx',
    'yaoi',
    'yellow showers',
    'yiffy',
    'zoophilia']
    
    def profanityFilter(text):
        # Split on whitespace and check each token against the list.
        brokenStr1 = text.split()
        badWordMask = '!@#$%!@#$%^~!@%^~@#$%!@#$%^~!'
        for word in brokenStr1:
            if word in arrBad:
                print(word + ' <--Bad word!')
                # Replace the word with a mask of the same length.
                text = text.replace(word, badWordMask[:len(word)])
        return text
    
    print(profanityFilter("this thing sucks sucks sucks fucking stuff"))
    

    You can add to or remove from the bad-word list, arrBad, as you like.
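
    The split-based check above only matches single, lowercase, whitespace-delimited tokens, so list entries that are phrases, or words followed by punctuation such as "sucks,", are not caught. A possible variation, sketched here under the assumption that a single compiled regular expression is acceptable, is shown below. The names sample_bad, build_pattern and mask_profanity are illustrative only; sample_bad is a small stand-in for arrBad.

```python
import re

# Small stand-in list for illustration; in practice this would be arrBad.
sample_bad = ["sucks", "suck", "bad word", "darn"]


def build_pattern(words):
    """Compile one case-insensitive pattern that matches any listed entry."""
    # Longest entries first, so phrases win over their shorter prefixes.
    ordered = sorted(words, key=len, reverse=True)
    alternation = "|".join(re.escape(w) for w in ordered)
    # The lookarounds act like \b but also work for entries that
    # begin or end with a non-word character.
    return re.compile(r"(?<!\w)(?:%s)(?!\w)" % alternation, re.IGNORECASE)


def mask_profanity(text, pattern, mask_char="*"):
    """Replace each match with mask_char repeated to the same length."""
    return pattern.sub(lambda m: mask_char * len(m.group()), text)


pattern = build_pattern(sample_bad)
result = mask_profanity("This SUCKS, what a bad word. Suckling pigs are fine.", pattern)
print(result)
# Prints "This *****, what a ********. Suckling pigs are fine."
```

    Compiling the pattern once and reusing it also avoids rescanning the list for every word of the input.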

        3
  •  5
  •   Matt user129975    10 years ago

    WebPurify is a profanity filter library for Python.

        5
  •  2
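  •   Aaron Digulla    15 years ago

    Profanity? What's that? ;-)

    It will still take a couple of years before computers can really recognize swearing and cursing, and I sincerely hope that by then people will have understood that profanity is human, not "dangerous".

    Instead of a dumb filter, have a smart human moderator who can balance the tone of the discussion as appropriate. A moderator who can detect abuse such as:

    "If you were my husband, I'd poison your tea." "If you were my wife, I'd drink it."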

        6
  •  0
  •   Glenn Maynard    15 years ago

    Users can work around this, of course, but it should do a fairly thorough job of removing profanity:

    import re
    def remove_profanity(s):
        def repl(word):
            m = re.match(r"(\w+)(.*)", word)
            if not m:
                return word
            word = "Bork" if m.group(1)[0].isupper() else "bork"
            word += m.group(2)
            return word
        return " ".join([repl(w) for w in s.split(" ")])
    
    print(remove_profanity("You just come along with me and have a good time. The Galaxy's a fun place. You'll need to have this fish in your ear."))
    