
How to specify column coordinates on the Tabula command line

  • Abhi Thakkar  · Tech Community  · 7 years ago

    java -jar tabula-java.jar -a 301.95,14.85,841.0500000000001,695.25 -t example.pdf
    

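    But with this command, the data from two columns gets mixed together in some rows, so I want to specify column coordinates to get clean output. I don't know how to get the column coordinates, so it would be helpful if anyone could guide me to the right command.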

    Thanks in advance!

    1 reply  |  as of 7 years ago
  •  Pants  · 7 years ago

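    You can specify column coordinates with the -c or --columns argument. The coordinates you give are those of the dividing lines between columns. So if one column runs from 10.5 to 13.5 and the next from 13.5 to 17.5, you list only 13.5. You also need to make sure guessing (-g/--guess) is turned off. You did not provide a sample PDF, so I cannot give you the correct coordinates, but your command would look like this: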

    java -jar tabula-java.jar -a 301.95,14.85,841.0500000000001,695.25 -c 15.7,17.3,19.2,33.2,70.1,100.7,200.6,300.7 -t example.pdf
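
As a rough illustration of the rule above (list only the dividing lines, not each column's start and end), here is a minimal Python sketch. The names `column_boundaries` and `columns_argument` are made up for this example and are not part of Tabula, and taking the midpoint when two columns do not touch is an assumption of the sketch.

```python
def column_boundaries(ranges):
    """Return the x coordinates that separate adjacent columns.

    ranges: list of (x_start, x_end) tuples, one per column.
    """
    ordered = sorted(ranges)
    boundaries = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Touching columns share an edge, so the midpoint is that edge.
        # If there is a gap, the midpoint of the gap is used (assumption).
        boundaries.append((prev_end + next_start) / 2)
    return boundaries


def columns_argument(ranges):
    """Format the boundaries as a comma-separated value for -c/--columns."""
    return ",".join(str(x) for x in column_boundaries(ranges))


# The two columns from the answer: 10.5 to 13.5 and 13.5 to 17.5.
print(columns_argument([(10.5, 13.5), (13.5, 17.5)]))  # → 13.5
```

The printed string is what would follow `-c` on the command line. For N columns you end up with N - 1 values.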
    

    You can read more about the available options for getting the command right in the help output:

    $ java -jar target/tabula-1.0.1-jar-with-dependencies.jar --help
    usage: tabula [-a <AREA>] [-b <DIRECTORY>] [-c <COLUMNS>] [-d] [-f
           <FORMAT>] [-g] [-h] [-i] [-l] [-n] [-o <OUTFILE>] [-p <PAGES>] [-r]
           [-s <PASSWORD>] [-t] [-u] [-v]
    
    Tabula helps you extract tables from PDFs
    
     -a,--area <AREA>           Portion of the page to analyze
                                (top,left,bottom,right). Example: --area
                                269.875,12.75,790.5,561. Default is entire
                                page
     -b,--batch <DIRECTORY>     Convert all .pdfs in the provided directory.
     -c,--columns <COLUMNS>     X coordinates of column boundaries. Example
                                --columns 10.1,20.2,30.3
     -d,--debug                 Print detected table areas instead of
                                processing.
     -f,--format <FORMAT>       Output format: (CSV,TSV,JSON). Default: CSV
     -g,--guess                 Guess the portion of the page to analyze per
                                page.
     -h,--help                  Print this help text.
     -i,--silent                Suppress all stderr output.
     -l,--lattice               Force PDF to be extracted using lattice-mode
                                extraction (if there are ruling lines
                                separating each cell, as in a PDF of an Excel
                                spreadsheet)
     -n,--no-spreadsheet        [Deprecated in favor of -t/--stream] Force PDF
                                not to be extracted using spreadsheet-style
                                extraction (if there are no ruling lines
                                separating each cell)
     -o,--outfile <OUTFILE>     Write output to <file> instead of STDOUT.
                                Default: -
     -p,--pages <PAGES>         Comma separated list of ranges, or all.
                                Examples: --pages 1-3,5-7, --pages 3 or
                                --pages all. Default is --pages 1
     -r,--spreadsheet           [Deprecated in favor of -l/--lattice] Force
                                PDF to be extracted using spreadsheet-style
                                extraction (if there are ruling lines
                                separating each cell, as in a PDF of an Excel
                                spreadsheet)
     -s,--password <PASSWORD>   Password to decrypt document. Default is empty
     -t,--stream                Force PDF to be extracted using stream-mode
                                extraction (if there are no ruling lines
                                separating each cell)
     -u,--use-line-returns      Use embedded line returns in cells. (Only in
                                spreadsheet mode.)
     -v,--version               Print version and exit.
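
The help text gives --area as top,left,bottom,right and --columns as X coordinates, so one quick sanity check before running the command is that every -c value falls between the area's left and right edges. Below is a minimal Python sketch of that check, assuming both options are expressed in the same page coordinates. `parse_area` and `columns_outside_area` are hypothetical helper names, not part of Tabula.

```python
def parse_area(area):
    """Split an --area value into (top, left, bottom, right), the order in the help text."""
    top, left, bottom, right = (float(v) for v in area.split(","))
    return top, left, bottom, right


def columns_outside_area(area, columns):
    """Return the -c values that lie outside the area's left..right span."""
    _, left, _, right = parse_area(area)
    xs = [float(v) for v in columns.split(",")]
    return [x for x in xs if not left <= x <= right]


# Values taken from the command in the answer above.
area = "301.95,14.85,841.0500000000001,695.25"
columns = "15.7,17.3,19.2,33.2,70.1,100.7,200.6,300.7"
print(columns_outside_area(area, columns))  # → []
```

An empty list means every boundary sits inside the selected area. Any value that is returned is a boundary that probably cannot split text within that area and is worth double-checking.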