
AWK: extract roots and suffixes

  •  3
  • Firefly  · 9 years ago

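    I have a CSV file delimited by semicolons. The file contains a Danish dictionary, and I need to extract the stems and suffixes from it. I need to do this in AWK!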

    File:

    adelig;adelig;adj.;1
    adelig;adelige;adj.;2
    adelig;adeligt;adj.;3
    adelig;adeligst;adj.;5
    voksen;voksen;adj.;1
    voksen;voksne;adj.;2
    voksen;voksent;adj.;3
    voksen;voksnest;adj.;5
    virkemiddel;virkemiddel;sb.;1
    virkemiddel;virkemidlet;sb.;2
    virkemiddel;virkemidlets;sb.;3
    virkemiddel;virkemiddels;sb.;4
    virkemiddel;virkemidlerne;sb.;5
    virkemiddel;virkemidlernes;sb.;6
    virkemiddel;virkemiddel;sb.;7
    virkemiddel;virkemidler;sb.;7
    virkemiddel;virkemiddels;sb.;8
    virkemiddel;virkemidlers;sb.;8
    

    Expected output:

    adelig;adelig; ,e,t,*,st
    voksen;voks; ,ne,ent,*,nest
    virkemiddel;virkemid ,let,lets,dels,lerne,lernes,del;ler,dels;lers
    

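    The fourth column gives the form number. When a form is missing, its suffix is replaced by an asterisk, as in adelig;adelig; ,e,t,*,st. When a form (number) is repeated, its suffixes are separated by a semicolon, as in virkemiddel;virkemid ,let,lets,dels,lerne,lernes,del;ler,dels;lers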
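
    The formatting rule above can be sketched in Python, assuming the suffixes have already been collected per form number. The function name format_forms and the dict layout are hypothetical, chosen only for illustration. Form 1 is rendered as a blank, as in the expected output.

```python
def format_forms(suffixes_by_number):
    """Render suffixes per form number: '*' for a gap, ';' between repeats."""
    highest = max(suffixes_by_number)
    slots = [" "]  # form 1 is shown as a blank in the expected output
    for number in range(2, highest + 1):
        suffixes = suffixes_by_number.get(number)
        # A missing form number becomes '*'; repeated numbers are joined by ';'.
        slots.append(";".join(suffixes) if suffixes else "*")
    return ",".join(slots)


# Form numbers and suffixes for "adelig" (form 4 is missing).
print(format_forms({1: [""], 2: ["e"], 3: ["t"], 5: ["st"]}))
# Forms 7 and 8 each occur twice for "virkemiddel".
print(format_forms({1: ["del"], 2: ["let"], 3: ["lets"], 4: ["dels"],
                    5: ["lerne"], 6: ["lernes"],
                    7: ["del", "ler"], 8: ["dels", "lers"]}))
```

    The two calls print " ,e,t,*,st" and " ,let,lets,dels,lerne,lernes,del;ler,dels;lers", which are the third fields of the expected output.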

    I started writing this code, but I cannot work out an algorithm that handles more than one possible stem, as in the case of virkemiddel.

    BEGIN{
    FS=";"
    }
    
    {
    
        lemm=$1;
        form=$2;
    
        if(match(form, lemm) > 0)
        {
            root=lemm;
            sub(root,"",form);
            suf[$1]=suf[$1]","form;
        }
        else
        {
            split($1,a,"");
            split($2,b,"");
    
    
            s="";
            for(i in a)
            { 
                if(b[i]!=a[i])
                {
                    break;
                }
                s = s "" a[i];
            }
        }
        root=s;
    
    }
    
    4 answers  |  as of 9 years ago
        1
  •  4
  •   glenn jackman    9 years ago

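    Here is some awk code that finds the common prefix length and builds the suffix list. I have not handled missing forms or repeated numbers, but it should give you a start.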

    #!/usr/bin/gawk -f
    
    BEGIN { FS = OFS = ";" }
    { words[$1] = words[$1] FS $2 }
    END {
        for (word in words) {
            sub("^"FS, "", words[word])
            num_words = split(words[word], these_words)
            prefix_length = common_prefix_length(these_words, num_words)
    
            suffixes = ""
            sep = ""
            for (i=1; i<=num_words; i++) {
                suffixes = suffixes sep substr(these_words[i],prefix_length+1)
                sep = ","
            }
            print word, substr(these_words[1], 1, prefix_length), suffixes
        }
    }
    
    function common_prefix_length(w, n                 ,i,j,minlen, char) {
        minlen = length(w[1])
        for (i=2; i<=n; i++) 
            if (length(w[i]) < minlen)
                minlen = length(w[i])
    
        for (i=1; i <= minlen; i++) {
            char = substr(w[1], i, 1)
            for (j=2; j <= n; j++)
                if (substr(w[j], i, 1) != char)
                    return i-1
        }
        return minlen
    }
    

    Given your input, the output is

    voksen;voks;en,ne,ent,nest
    virkemiddel;virkemid;del,let,lets,dels,lerne,lernes,del,ler,dels,lers
    adelig;adelig;,e,t,st
    
        2
  •  2
  •   fedorqui    9 years ago

    This could be a good starting point in Python. It uses os.path.commonprefix to get the stem from the list of words.

    import csv
    import os
    
    path = "a"  # name of the input file
    prev_word = ""
    words = []
    data = dict()  # lemma -> list of inflected forms
    
    with open(path, newline="") as f:
        csv_reader = csv.DictReader(
            f,
            delimiter=";",
            fieldnames=['common', 'word', 'type', 'num']
            )
    
        for row in csv_reader:
            word = row['common']
            if not prev_word or word == prev_word:
                words.append(row['word'])
            else:
                data[prev_word] = words
                # Start the new group with its first form instead of dropping it.
                words = [row['word']]
            prev_word = word
    
    data[prev_word] = words
    # Python 3.7+ dicts keep insertion order, so lemmas print in input order.
    for word, values in data.items():
        common = os.path.commonprefix(values)
        suffixes = [i[len(common):] for i in values]
        suffixes = [i if len(i) else '*' for i in suffixes]
        print("%s;%s;%s" % (word, common, ','.join(suffixes)))
    

    It returns:

    adelig;adelig;*,e,t,st
    voksen;voks;en,ne,ent,nest
    virkemiddel;virkemid;del,let,lets,dels,lerne,lernes,del,ler,dels,lers
    
        3
  •  2
  •   Kaz    3 years ago

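    Three solutions in TXR. The first uses the extraction language to build an explicit, structure-based data model and then processes those structures: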

    @(do
       (defstruct inflection ()
         word type index)
    
       (defstruct dict-entry ()
         root variants max-index))
    @(collect :vars (dict))
    @  (all)
    @word;@(skip)
    @  (and)
    @    (collect :gap 0 :vars (infl))
    @word;@variant;@type;@index
    @      (bind infl @(new inflection word variant type type index (toint index)))
    @    (end)
    @    (bind dict @(new dict-entry root word variants infl
                          max-index [find-max-key infl > .index]))
    @  (end)
    @(end)
    @(do (each ((d dict))
           (let* ((vs (mapcar .word d.variants))
                  (prefix (reduce-left (ret [@1 0..(mismatch @1 @2)]) vs))
                  (plen (len prefix))
                  (prefix [(first vs) 0..plen]))
             (put-string `@{d.root};@prefix; `)
             (each ((i (range 2 d.max-index)))
               (let ((vlist [keepql i d.variants .index]))
                 (put-char #\,)
                 (put-string
                   (if (null vlist)
                     "*"
                     [cat-str (mapcar (ret [@1.word plen..:]) vlist) ";"]))))
             (put-line))))
    

    Run:

    $ txr stems.txr data
    adelig;adelig; ,e,t,*,st
    voksen;voks; ,ne,ent,*,nest
    virkemiddel;virkemid; ,let,lets,dels,lerne,lernes,del;ler,dels;lers
    

    Note the subtle difference:

    virkemiddel;virkemid; ,let,lets,dels,lerne,lernes,del;ler,dels;lers
                        ^
    

    That semicolon is absent from the originally requested output. No rationale was given for omitting it, so for now it is treated as a typo.

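    The expression (ret [@1 0..(mismatch @1 @2)]) produces a two-argument function that returns the common prefix of a pair of strings. To get the common prefix of a list of strings, we pass this function as the reducing function to reduce-left .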
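
    For readers who do not know TXR, the same reduction can be sketched in Python. This is only an illustration of the idea, not a translation of the TXR code: the helper names common_prefix_pair and common_prefix are made up here, and functools.reduce plays the role of reduce-left.

```python
from functools import reduce


def common_prefix_pair(a, b):
    """Return the longest common prefix of two strings."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return a[:length]


def common_prefix(words):
    """Fold the pairwise function over a non-empty list, left to right."""
    return reduce(common_prefix_pair, words)


print(common_prefix(["voksen", "voksne", "voksent", "voksnest"]))
print(common_prefix(["virkemiddel", "virkemidlet", "virkemidlers"]))
```

    This prints voks and virkemid. Because the running result only ever gets shorter, the fold ends with the prefix shared by every word in the list.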

    The second version, without data structures. It produces the same output on data :

    @(repeat)
    @  (all)
    @word;@(skip)
    @  (and)
    @    (collect :gap 0)
    @word;@variant;@type;@strindex
    @      (bind index @(toint strindex))
    @    (end)
    @    (do
           (let* ((prefix (reduce-left (ret [@1 0..(mismatch @1 @2)]) variant))
                  (plen (len prefix))
                  (max-index [find-max index])
                  (v-i-pairs (zip variant index)))
            (put-string `@word;@prefix; `)
            (each ((i (range 2 max-index)))
              (let ((vlist [keepql i v-i-pairs second]))
                (put-char #\,)
                (put-string
                  (cat-str (or (mapcar (aret [@1 plen..:]) vlist)
                               '("*"))
                           ";"))))
            (put-line)))
    @  (end)
    @(end)
    

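    A pure TXR Lisp solution that does not use the extraction language. It is one big expression that reads the input lines, splits them, converts the fourth field to an integer, groups the entries by root, and so on: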

    (flow
      (get-lines)
      (keep-matches (`@a;@b;@c;@d` @1)
        (list a b c (toint d)))
      (partition-by first)
      (mapcar transpose)
      (mapdo (tb ((word variant type index))
               (let* ((prefix (reduce-left (ret [@1 0..(mismatch @1 @2)]) variant))
                      (plen (len prefix))
                      (max-index [find-max index])
                      (v-i-pairs (zip variant index)))
                 (put-string `@(first word);@prefix; `)
                 (each ((i (range 2 max-index)))
                   (let ((vlist [keepql i v-i-pairs second]))
                     (put-char #\,)
                     (put-string
                       (cat-str (or (mapcar (aret [@1 plen..:]) vlist)
                                    '("*"))
                                ";"))))
                 (put-line)))))
    

    Run:

    $ txr stems3.tl < data
    adelig;adelig; ,e,t,*,st
    voksen;voks; ,ne,ent,*,nest
    virkemiddel;virkemid; ,let,lets,dels,lerne,lernes,del;ler,dels;lers
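
    The same pipeline shape (split each line, convert the fourth field to an integer, group adjacent rows by root, print the slots from 2 up to the highest form number) can be sketched in Python. The function name stem_lines is hypothetical, and the sketch assumes, as the sample data suggests, that all rows of one lemma are adjacent in the input.

```python
from itertools import groupby
from os.path import commonprefix


def stem_lines(lines):
    """Yield one 'lemma;stem; suffixes' line per run of rows sharing a lemma."""
    rows = [line.strip().split(";") for line in lines if line.strip()]
    # groupby only merges adjacent rows, so the input must be grouped by lemma.
    for lemma, group in groupby(rows, key=lambda row: row[0]):
        forms = [(row[1], int(row[3])) for row in group]
        stem = commonprefix([form for form, _ in forms])
        slots = []
        for number in range(2, max(n for _, n in forms) + 1):
            found = [form[len(stem):] for form, n in forms if n == number]
            # '*' marks a missing form number; ';' separates repeated numbers.
            slots.append(";".join(found) if found else "*")
        yield ";".join([lemma, stem, " " + "".join("," + s for s in slots)])


sample = [
    "adelig;adelig;adj.;1",
    "adelig;adelige;adj.;2",
    "adelig;adeligt;adj.;3",
    "adelig;adeligst;adj.;5",
]
for line in stem_lines(sample):
    print(line)
```

    On the four adelig rows this prints adelig;adelig; ,e,t,*,st.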
    
        4
  •  0
  •   Firefly    9 years ago

    Here is the code that gave me the expected result. The comments in the code point out the main changes from glenn's code.

    BEGIN {
    FS=OFS=";"
    }
    
    { 
        words[$1";"$3] = words[$1";"$3] FS $2;
        num[$1";"$3]=num[$1";"$3] $4 FS; #Array to store numbers in the fourth column by two ID's
    }
    
    END {
        for (item in words) {
            sub("^"FS, "", words[item]);
            words_n = split(words[item], extrac);
            split(num[item],numbers); #Extract numbers one by one, in order to compare them.
            split(item,cab,";");
            long = extract_stem(extrac, words_n);
    
            suffix = "";
            sep = ",";
    
            for (i=1; i<=words_n; i++)
            {
                suf=substr(extrac[i],long+1)
                if(suf!="") #Avoid null values from suffixes.
                {
                    suffix = suffix sep suf;
                }
    
                if(numbers[i]==numbers[i+1]) #Compare numbers with the next number
                {
                    sep=";";
                }
                else if((numbers[i+1]-numbers[i])!= 1) #Subtract numbers to its previous number
                {
                    sep=",*,";
                }
                else
                {
                    sep=",";
                }
            }
            print cab[1], substr(extrac[1], 1, long), " "suffix
        }
    }
    
    
    function extract_stem(wrd, nmr ,i,j,min, chr) { #This is the magic of glenn jackman!
        min = length(wrd[1])
        for (i=2; i<=nmr; i++)
        {
            if (length(wrd[i]) < min)
            {
                min = length(wrd[i]);
            }
        }
    
        for (i=1; i <= min; i++)
        {
            chr = substr(wrd[1], i, 1)
            for (j=2; j <= nmr; j++)
            {
                if (substr(wrd[j], i, 1) != chr)
                {
                    return i-1;
                }
            }
        }
        return min
    }
    

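    I had to modify the code because there was a case I had not considered: the same lemma can occur with more than one part of speech, as with abe (sb. and vb.) below.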

    abe;abe;sb.;1
    abe;aben;sb.;2
    abe;abens;sb.;3
    abe;abes;sb.;4
    abe;aberne;sb.;5
    abe;abernes;sb.;6
    abe;aber;sb.;7
    abe;abers;sb.;8
    abe;abe;vb.;1
    abe;ab;vb.;2
    abe;abet;vb.;3
    abe;aber;vb.;4
    abe;abede;vb.;6
    abe;abes;vb.;7
    abe;abedes;vb.;8
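
    Why the part of speech has to be part of the grouping key can be shown with a short Python sketch. It mirrors the words[$1";"$3] keying in the AWK code above; the function name stems_by_lemma_and_pos is made up for this illustration.

```python
from collections import defaultdict
from os.path import commonprefix


def stems_by_lemma_and_pos(lines):
    """Map (lemma, part of speech) to the stem shared by that group's forms."""
    groups = defaultdict(list)
    for line in lines:
        lemma, form, pos, _number = line.strip().split(";")
        groups[(lemma, pos)].append(form)
    return {key: commonprefix(forms) for key, forms in groups.items()}


sample = [
    "abe;abe;sb.;1",
    "abe;aben;sb.;2",
    "abe;abe;vb.;1",
    "abe;ab;vb.;2",
    "abe;abet;vb.;3",
]
print(stems_by_lemma_and_pos(sample))
```

    For these rows the noun group gets the stem abe and the verb group gets ab. Keyed on the lemma alone, both groups would be merged and every form would be cut against the shorter stem ab.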