How to add additional currency characters in spaCy

spacy python

Abhishek Ram · 7 years ago

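I have documents in which \u0080 is used as the euro sign. I want to add this character and a few others to the list of currency symbols so that money entities are picked up by spaCy's NER. What is the best way to handle this?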

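I also have cases where money is written as CAD 5,000, and the NER does not pick these up as money either. What is the best way to handle this: training the NER, or adding CAD as a currency symbol?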

1 answer | 7 years ago

Jacques Gaudin · 7 years ago

1. The '\u0080' problem

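First things first: how the '\u0080' character is displayed seems to depend on the platform you are using. It doesn't print on a Windows 7 machine, but it works on a Linux machine...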

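For completeness, I'll assume you get your text from an HTML document that contains '&#x80;' escape sequences (which should print as € in the browser), the '\u0080' character, and some other arbitrary symbols that we want to treat as currencies.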

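Before passing the text content to spaCy, we can call html.unescape, which takes care of translating &#x80; to €. The default configuration then recognizes € as a currency.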

text_html = ("I just found out that CAD 1,000 is about 641.3 &#x80. "
             "Some people call it 641.3 \u0080. "
             "Fantastic! But in the U.K. I'd rather pay 344\U0001F385 or \U0001F33B56.")

text = html.unescape(text_html)
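As a minimal sketch of this clean-up step using only the standard library: html.unescape follows the HTML5 rule that maps the numeric reference &#x80; to the euro sign, but it leaves a literal '\u0080' character untouched. The helper normalize_currency_chars below is hypothetical (it is not part of spaCy) and assumes that every stray U+0080 in your documents really stands for a euro sign, as byte 0x80 does in Windows-1252.

```python
import html

# Assumption: a literal U+0080 in the input always means the euro sign
# (Windows-1252 stores the euro sign at byte 0x80).
C1_TO_CURRENCY = str.maketrans({"\u0080": "\u20ac"})


def normalize_currency_chars(raw: str) -> str:
    """Unescape HTML character references, then map a stray U+0080 to the euro sign."""
    return html.unescape(raw).translate(C1_TO_CURRENCY)


sample = "641.3 &#x80; or 641.3 \u0080"
print(normalize_currency_chars(sample))
```

If that assumption does not hold for your data, it is probably safer to leave the text alone and teach spaCy about '\u0080' through a custom currency check instead.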

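Second, if there are symbols that are not recognized as currencies, such as 🎅 and 🌻, we can change the Defaults of the language we use so that they qualify as currencies.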

This consists of replacing the lex_attr_getters[IS_CURRENCY] function with a custom one that holds a list of symbols describing a currency.

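def is_currency_custom(text):
    # Strip punctuation so that tokens such as "CAD," still match the list
    table = str.maketrans({key: None for key in string.punctuation})
    stripped = text.translate(table)

    all_currencies = ["\U0001F385", "\U0001F33B", "\u0080", "CAD"]
    if stripped in all_currencies:
        return True
    # Fall back to the built-in check on the unmodified text; a token made
    # only of punctuation would otherwise be passed on as an empty string
    return is_currency_original(text)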

# Keep a reference to the original is_currency function
is_currency_original = EnglishDefaults.lex_attr_getters[IS_CURRENCY]
# Assign a new function for IS_CURRENCY
EnglishDefaults.lex_attr_getters[IS_CURRENCY] = is_currency_custom
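To see why these symbols are missed by default, here is a spaCy-free sketch of the same wrapping idea. It assumes that the built-in check accepts a token only when every character is in the Unicode category Sc (currency symbol); default_is_currency is a stand-in written for this illustration, not spaCy's actual function, and EXTRA_CURRENCIES is a hypothetical name.

```python
import string
import unicodedata

# Symbols and codes we want to treat as currencies in addition to the default ones
EXTRA_CURRENCIES = {"\U0001F385", "\U0001F33B", "\u0080", "CAD"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def default_is_currency(text: str) -> bool:
    # Stand-in (assumption): non-empty and every character is a currency symbol (Sc)
    return bool(text) and all(unicodedata.category(ch) == "Sc" for ch in text)


def is_currency_custom(text: str) -> bool:
    # Check the custom list first, ignoring attached punctuation
    if text.translate(PUNCT_TABLE) in EXTRA_CURRENCIES:
        return True
    # Otherwise defer to the default check on the unmodified text
    return default_is_currency(text)


for tok in ["€", "$", "\u0080", "\U0001F33B", "CAD", "cat", ","]:
    print(repr(tok), unicodedata.category(tok[0]), is_currency_custom(tok))
```

The loop prints the Unicode category of each token's first character: '\u0080' is a control character (Cc) and the emoji are "other symbols" (So), which would explain why a category-based check skips them unless they are listed explicitly.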

2. The CAD 5,000 problem

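For this one, a simple solution is to define a special case. We tell the tokenizer that wherever it encounters CAD, this is a special case and it should do as we instruct. Among other things, we can set the IS_CURRENCY flag.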

from spacy.symbols import ORTH, TAG, IS_CURRENCY

special_case = [{
        ORTH: u'CAD',
        TAG: u'$',
        IS_CURRENCY: True}]

nlp.tokenizer.add_special_case(u'CAD', special_case)

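Note that this is not perfect, because you may get false positives. Imagine a document from a Canadian company selling CAD drawing services... So this is good, but not great.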

If we want to be more precise, we can create a Matcher object that looks for patterns such as CURRENCY[SPACE]NUMBER or NUMBER[SPACE]CURRENCY and associates the MONEY entity with each match.

matcher = Matcher(nlp.vocab)

MONEY = nlp.vocab.strings['MONEY']

# This is the matcher callback that sets the MONEY entity
def add_money_ent(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    doc.ents += ((MONEY, start, end),)

# spaCy 2.x signature: matcher.add(key, on_match, *patterns)
matcher.add(
    'MoneyRedefined',
    add_money_ent,
    [{'IS_CURRENCY': True}, {'IS_SPACE': True, 'OP': '?'}, {'LIKE_NUM': True}],
    [{'LIKE_NUM': True}, {'IS_SPACE': True, 'OP': '?'}, {'IS_CURRENCY': True}]
)

Then you apply it to your doc object with matcher(doc). The 'OP': '?' key lets a token pattern match 0 or 1 times, which makes that token optional.
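The matching logic can be sketched without spaCy to show what the two patterns accept and how the optional whitespace token behaves. Everything below is hypothetical illustration code: the predicates only approximate IS_CURRENCY, IS_SPACE and LIKE_NUM (like_num in particular is much cruder than spaCy's), and the input is assumed to be an already tokenized list of strings.

```python
CURRENCIES = {"$", "€", "CAD"}


def is_currency(tok):
    return tok in CURRENCIES


def is_space(tok):
    return tok.isspace()


def like_num(tok):
    # Crude approximation: digits with optional thousands commas and one decimal point
    return tok.replace(",", "").replace(".", "", 1).isdigit()


# Each pattern is a list of (predicate, optional) pairs; optional mirrors 'OP': '?'
PATTERNS = [
    [(is_currency, False), (is_space, True), (like_num, False)],
    [(like_num, False), (is_space, True), (is_currency, False)],
]


def match_at(tokens, start, pattern):
    """Return the end index if the pattern matches at start, otherwise None."""
    i = start
    for pred, optional in pattern:
        if i < len(tokens) and pred(tokens[i]):
            i += 1          # token consumed
        elif not optional:
            return None     # a required token is missing
    return i


def find_money(tokens):
    """Collect (start, end) spans matched by any pattern."""
    spans = []
    for pattern in PATTERNS:
        for start in range(len(tokens)):
            end = match_at(tokens, start, pattern)
            if end is not None:
                spans.append((start, end))
    return sorted(spans)


print(find_money(["costs", "CAD", "5,000", "or", "641.3", "€", "."]))
print(find_money(["CAD", "drawing", "services"]))
```

Because a number must sit next to the currency token, "CAD drawing services" yields no span, which is the false positive that a bare special case cannot rule out.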

3. Full code

import spacy
from spacy.symbols import IS_CURRENCY
from spacy.lang.en import EnglishDefaults
from spacy.matcher import Matcher
from spacy import displacy
import html
import string


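def is_currency_custom(text):
    # Strip punctuation so that tokens such as "CAD," still match the list
    table = str.maketrans({key: None for key in string.punctuation})
    stripped = text.translate(table)

    all_currencies = ["\U0001F385", "\U0001F33B", "\u0080", "CAD"]
    if stripped in all_currencies:
        return True
    # Fall back to the built-in check on the unmodified text; a token made
    # only of punctuation would otherwise be passed on as an empty string
    return is_currency_original(text)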

# Keep a reference to the original is_currency function
is_currency_original = EnglishDefaults.lex_attr_getters[IS_CURRENCY]
# Assign a new function for IS_CURRENCY
EnglishDefaults.lex_attr_getters[IS_CURRENCY] = is_currency_custom

nlp = spacy.load('en')

matcher = Matcher(nlp.vocab)

MONEY = nlp.vocab.strings['MONEY']

# This is the matcher callback that sets the MONEY entity
def add_money_ent(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    doc.ents += ((MONEY, start, end),)

# spaCy 2.x signature: matcher.add(key, on_match, *patterns)
matcher.add(
    'MoneyRedefined',
    add_money_ent,
    [{'IS_CURRENCY': True}, {'IS_SPACE': True, 'OP': '?'}, {'LIKE_NUM': True}],
    [{'LIKE_NUM': True}, {'IS_SPACE': True, 'OP': '?'}, {'IS_CURRENCY': True}]
)

text_html = ("I just found out that CAD 1,000 is about 641.3 &#x80. "
             "Some people call it 641.3 \u0080. "
             "Fantastic! But in the U.K. I'd rather pay 344\U0001F385 or \U0001F33B56.")

text = html.unescape(text_html)

doc = nlp(text)

matcher(doc)

displacy.serve(doc, style='ent')

This gives the expected result.