代码之家 › 专栏 › 技术社区 › Thilo

如何在Perl中从文本文件中提取/解析表格数据?

data-extraction text-parsing parsing perl

6

Thilo · 技术社区 · 14 年前

我在找类似的东西 HTML::TableExtract ,只是不用于HTML输入,而是用于包含用缩进和间距格式化的“表格”的纯文本输入。

数据可能如下所示:

Here is some header text.

Column One       Column Two      Column Three
a                                           b
a                    b                      c


Some more text

Another Table     Another Column
abdbdbdb          aaaa

2 回复 | 直到 14 年前

1

DVK 14 年前

不知道有什么打包的解决方案,但不太灵活的方法是相当简单的,假设您可以对文件进行两次传递:(下面是部分Perlish伪代码示例)

假设:数据可能包含空格,如果有空格,则不引用ALACV-如果不是这样,只需使用 Text::CSV(_XS)
逻辑定义了一个“列分隔符”,以使其成为以空格填充100%的任何连续的垂直行集。
如果意外地每一行都有一个空格,这是偏移量M字符处的数据的一部分,那么逻辑将把偏移量M视为一个列分隔符,因为它无法更好地识别。 唯一能让它更好地了解的方法是,是否要求列分隔至少为X个空格,其中X>1

样本代码:

my $INFER_FROM_N_LINES = 10; # Infer columns from this # of lines
                             # 0 means from entire file
my $lines_scanned = 0;
my @non_spaces=[];
# First pass - find which character columns in the file have all spaces and which don't
my $fh = open(...) or die;
while (<$fh>) {
    last if $INFER_FROM_N_LINES && $lines_scanned++ == $INFER_FROM_N_LINES;
    chomp;
    my $line = $_;
    my @chars = split(//, $line); 
    for (my $i = 0; $i < @chars; $i++) { # Probably can be done prettier via map?
        $non_spaces[$i] = 1 if $chars[$i] ne " ";
    }
}
close $fh or die;

# Find columns, defined as consecutive "non-spaces" slices.
my @starts, @ends; # Index at which columns start and end
my $state = " "; # Not inside a column
for (my $i = 0; $i < @non_spaces; $i++) {
    next if $state eq " " && !$non_spaces[$i];
    next if $state eq "c" && $non_spaces[$i];
    if ($state eq " ") { # && $non_spaces[$i] of course => start column
        $state = "c";
        push @starts, $i;
    } else { # meaning $state eq "c" && !$non_spaces[$i] => end column
        $state = " ";
        push @ends, $i-1;
    }
}
if ($state eq "c") { # Last char is NOT a space - produce the last column end
    push @ends, $#non_spaces;
}

# Now split lines
my $fh = open(...) or die;
my @rows = ();
while (<$fh>) {
    my @columns = ();
    push @rows, \@columns;
    chomp;
    my $line = $_;
    for (my $col_num = 0; $col_num < @starts; $col_num++) {
        $columns[$col_num] = substr($_, $starts[$col_num], $ends[$col_num]-$starts[$col_num]+1);
    }
}
close $fh or die;

现在,如果你 要求列分隔至少为X个空格,其中X>1

# Find columns, defined as consecutive "non-spaces" slices separated by at least 3 spaces.
my $min_col_separator_is_X_spaces = 3;
my @starts, @ends; # Index at which columns start and end
my $state = "S"; # inside a separator
NEXT_CHAR: for (my $i = 0; $i < @non_spaces; $i++) {
    if ($state eq "S") { # done with last column, inside a separator
        if ($non_spaces[$i]) { # start a new column
            $state = "c";
            push @starts, $i;
        }
        next;
    }
    if ($state eq "c") { # Processing a column
        if (!$non_spaces[$i]) { # First space after non-space
                                # Could be beginning of separator? check next X chars!
            for (my $j = $i+1; $j < @non_spaces
                            || $j < $i+$min_col_separator_is_X_spaces; $j++) {
                 if ($non_spaces[$j]) {
                     $i = $j++; # No need to re-scan again
                     next NEXT_CHAR; # OUTER loop
                 }
                 # If we reach here, next X chars are spaces! Column ended!
                 push @ends, $i-1;
                 $state = "S";
                 $i = $i + $min_col_separator_is_X_spaces;
            }
         }
        next;
    }
}

2

1

Jon Purdy 14 年前

这是一个非常快速的解决方案,并给出了概述。(很抱歉我的长度)基本上,如果一个“单词”出现在列标题的开头 ,然后在列中结束 n 除非它的身体大部分都撞到柱子上 +1,在这种情况下,它会在那里结束。整理这些,扩展它以支持多个不同的表等等,这些都是一个练习。也可以使用列标题的左偏移量以外的值作为边界标记,例如中心,或由列编号确定的某个值。

#!/usr/bin/perl


use warnings;
use strict;


# Just plug your headers in here...
my @headers = ('Column One', 'Column Two', 'Column Three');

# ...and get your results as an array of arrays of strings.
my @result = ();


my $all_headers = '(' . (join ').*(', @headers) . ')';
my $found = 0;
my @header_positions;
my $line = '';
my $row = 0;
push @result, [] for (1 .. @headers);


# Get lines from file until a line matching the headers is found.

while (defined($line = <DATA>)) {

    # Get the positions of each header within that line.

    if ($line =~ /$all_headers/) {
        @header_positions = @-[1 .. @headers];
        $found = 1;
        last;
    }

}


$found or die "Table not found! :<\n";


# For each subsequent nonblank line:

while (defined($line = <DATA>)) {
    last if $line =~ /^$/;

    push @{$_}, "" for (@result);
    ++$row;

    # For each word in line:

    while ($line =~ /(\S+)/g) {

        my $word = $1;
        my $position = $-[1];
        my $length = $+[1] - $position;
        my $column = -1;

        # Get column in which word starts.

        while ($column < $#headers &&
            $position >= $header_positions[$column + 1]) {
            ++$column;
        }

        # If word is not fully within that column,
        # and more of it is in the next one, put it in the next one.

        if (!($column == $#headers ||
            $position + $length < $header_positions[$column + 1]) &&
            $header_positions[$column + 1] - $position <
            $position + $length - $header_positions[$column + 1]) {

            my $element = \$result[$column + 1]->[$row];
            $$element .= " $word";

        # Otherwise, put it in the one it started in.

        } else {

            my $element = \$result[$column]->[$row];
            $$element .= " $word";

        }

    }

}


# Output! Eight-column tabs work best for this demonstration. :P

foreach my $i (0 .. $#headers) {
    print $headers[$i] . ": ";
    foreach my $c (@{$result[$i]}) {
        print "$c\t";
    }
    print "\n";
}


__DATA__

This line ought to be ignored.

Column One       Column Two      Column Three
These lines are part of the tabular data to be processed.
The data are split based on how much words overlap columns.

This line ought to be ignored also.

样本输出:

Column One:      These lines are         The data are split
Column Two:      part of the tabular     based on how
Column Three:    data to be processed.   much words overlap columns.