
Merging two PDF/A files should also produce a valid PDF/A (pdfbox)

pdfa pdfbox merge java

LiefLayer · 1 year ago

I am using pdfbox to merge two PDF/A files.

Right now my code looks like this:

    PDFMergerUtility mergerUtility = new PDFMergerUtility();

    File file = new File("example/c.pdf");

    mergerUtility.addSource(new File("example/a.pdf"));
    mergerUtility.addSource(new File("example/b.pdf"));

    mergerUtility.setDestinationFileName(file.getAbsolutePath());

    try {
        mergerUtility.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
    } catch (IOException ex) {
        throw new RuntimeException("Unable to merge", ex);
    }

    File inputFile = new File("example/c.pdf");
    PDDocument doc = PDDocument.load(inputFile);

    File orig = new File("example/a.pdf");
    PDDocument origDoc = PDDocument.load(orig);
    File orig2 = new File("example/b.pdf");
    PDDocument orig2Doc = PDDocument.load(orig2);
    PDStructureTreeRoot treeRoot = origDoc.getDocumentCatalog().getStructureTreeRoot();
    PDStructureTreeRoot treeRoot2 = orig2Doc.getDocumentCatalog().getStructureTreeRoot();
    treeRoot.setKids(treeRoot2.getKids());
    doc.getDocumentCatalog().setStructureTreeRoot(treeRoot);

    List<PDOutputIntent> outputIntents=new ArrayList<>();
    outputIntents.add(doc.getDocumentCatalog().getOutputIntents().get(0));
    doc.getDocumentCatalog().setOutputIntents(outputIntents);
    doc.save("example/d.pdf");
    doc.close();

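By setting the OutputIntent to the first one only (so that d.pdf has exactly one), I solved a lot of problems... The last problem I get (only on this validator: https://avepdf.com/pdfa-validation ) is reported 4 times:

"Non-standard structure type is not mapped to any functionally equivalent standard type."

I was able to identify the cause: the "THead" and "TBody" structure types that appear in my result (generated by pdfbox).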
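For context, the wording of the message suggests the validator looks at the structure tree's role map (the /RoleMap entry of the StructTreeRoot), which maps non-standard type names to standard ones. The following is only a sketch of that idea in plain Java, with no pdfbox involved. The class name `RoleMapCheck`, the method `unmappedTypes`, and the small `STANDARD` set are hypothetical and illustrative; the real list of standard types depends on the PDF version the PDF/A flavor is based on, and the validator's actual logic is not known here.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class RoleMapCheck {
    // Assumption: an illustrative subset of types treated as standard, not the full list.
    static final Set<String> STANDARD =
            Set.of("Document", "Sect", "P", "Table", "TR", "TH", "TD");

    /** Returns the used types that are neither standard nor mapped (possibly via several hops) to a standard type. */
    static Set<String> unmappedTypes(Collection<String> usedTypes, Map<String, String> roleMap) {
        Set<String> unmapped = new TreeSet<>();
        for (String type : usedTypes) {
            String current = type;
            Set<String> seen = new HashSet<>();
            // Follow the role map until a standard type, a dead end, or a cycle is reached.
            while (current != null && !STANDARD.contains(current) && seen.add(current)) {
                current = roleMap.get(current);
            }
            if (current == null || !STANDARD.contains(current)) {
                unmapped.add(type);
            }
        }
        return unmapped;
    }

    public static void main(String[] args) {
        List<String> used = List.of("Table", "THead", "TBody", "TR", "TD");
        // Without a role map, THead and TBody are reported under the assumed STANDARD set.
        System.out.println(unmappedTypes(used, Map.of()));
        // With both names mapped to P, nothing is reported.
        System.out.println(unmappedTypes(used, Map.of("THead", "P", "TBody", "P")));
    }
}
```

Under this reading, there would be two ways to silence the message: stop using the non-standard names, or keep them and supply a mapping for each.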

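By reusing the StructureTreeRoot of the original file (this is the "final" code you can see above, which still does not produce a valid PDF/A from merging two PDF/A files), I was able to halve the errors (now I only get 2 "Non-standard structure type is not mapped to any functionally equivalent standard type.")... but I do not know how to merge the two StructureTreeRoot objects, nor whether this is the real solution (maybe there is a way to tell pdfbox to avoid using THead and TBody).

The result is already quite good, since it passes most PDF/A validators. I just need it to pass that one as well (because it is the one used by the company I work for)... Also, I do not think the validator is at fault, because both input files pass it as valid PDF/A.

Any ideas?

PS. I found a way to merge the two PDStructureTreeRoot objects, which still does not work... but I updated the code above, adding: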

PDStructureTreeRoot treeRoot2 = orig2Doc.getDocumentCatalog().getStructureTreeRoot();
treeRoot.setKids(treeRoot2.getKids());


LiefLayer · 1 year ago

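OK... this was really tough. For the first time I had to get help from an AI chat assistant, and even that was not enough: the AI of course cannot produce large amounts of working code, and I had to make corrections every time. Still, it was a great help for turning my ideas into code samples (a kind of proto-code). Anyway, the most important thing I had to understand is that cloning the StructureTreeRoot (the approach from my edit to the question) could not solve my problem, so I removed that part of the code; there was no reason to keep it. I could not figure out how to prevent THead and TBody from being produced, so the only thing I could do was replace each of them with a standard substitute such as P.

The first problem was how to get the root in a form I could iterate over. To understand this, I started from the tree in the debugger and walked it manually until I reached THead... I got some direction from the structure I could read by opening the generated PDF in Visual Studio Code. The AI helped here with accessing some protected fields that I could otherwise only see in debug mode.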

COSDictionary catalogDict = doc2.getDocumentCatalog().getCOSObject();
COSObject structTreeRootRef = (COSObject) catalogDict.getItem(COSName.STRUCT_TREE_ROOT);
COSDictionary structTreeRootDict = (COSDictionary) structTreeRootRef.getObject();
COSBase result = structTreeRootDict.getItem(COSName.K);
COSDictionary dict1 = result instanceof COSObject ? (COSDictionary) ((COSObject) result).getObject() : null;
// either: /K of the root is already the array of children
COSArray array = result instanceof COSArray ? (COSArray) result : new COSArray();
// or: /K of the root is a single element (dict1), and that element's own /K holds the array
COSBase result1 = dict1 != null ? dict1.getItem(COSName.K) : null;
COSArray nestedArray = result1 instanceof COSArray ? (COSArray) result1 : new COSArray();

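I no longer have the original code, but basically after this first part I just called get(i) repeatedly, since I knew each node I needed to reach, and at the end: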

subDict.setItem(COSName.S, COSName.getPDFName("P"));

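This was already working code (which was good, since to get that far I had to learn how to access the PDF tree, and pdfbox is not intuitive at all in this respect), but I was not done, because the solution only worked for my own two PDF/A samples. So I decided to turn the get(i) calls into loops. Better, but still not a good solution: when I tried another PDF, my THead and TBody were one level deeper, and I had to add another inner for loop to make it work again... and of course it did not perform well either. This is where chatgpt helped again by suggesting a recursive alternative (the solution is trivial, but honestly I had not thought of it at all)... I still had to make a lot of corrections to the code, but in the end the correct solution is: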

    COSDictionary catalogDict = doc2.getDocumentCatalog().getCOSObject();
    COSObject structTreeRootRef = (COSObject) catalogDict.getItem(COSName.STRUCT_TREE_ROOT);
    COSDictionary structTreeRootDict = (COSDictionary) structTreeRootRef.getObject();
    COSName newName = COSName.getPDFName("P");

    updateStructureTree(structTreeRootDict, newName);

The recursive method:

private static void updateStructureTree(COSDictionary dict, COSName newName) {
    COSBase result = dict.getItem(COSName.K);
    COSDictionary dict1 = result instanceof COSObject ? (COSDictionary) ((COSObject) result).getObject() : null;

    // If /K is a reference to a single element, iterate that element's children;
    // otherwise /K itself is expected to be the array of children.
    COSBase kids = dict1 != null ? dict1.getItem(COSName.K) : result;
    COSArray array = kids instanceof COSArray ? (COSArray) kids : new COSArray();

    for (COSBase resultItem : array) {
        COSDictionary subDict = resultItem instanceof COSObject ?
                (COSDictionary) ((COSObject) resultItem).getObject() :
                new COSDictionary();

        // Replace the non-standard THead/TBody structure types with the given name.
        if (subDict.getItem(COSName.S) != null &&
                (subDict.getItem(COSName.S).equals(COSName.getPDFName("THead")) ||
                        subDict.getItem(COSName.S).equals(COSName.getPDFName("TBody")))) {
            subDict.setItem(COSName.S, newName);
        }

        // Descend so that matches at any depth are handled.
        updateStructureTree(subDict, newName);
    }
}
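To look at the recursion on its own, without pdfbox, here is a minimal sketch that models structure elements as plain Java maps. It is not the pdfbox API: the class `RenameTypes`, the helper `elem`, and the use of the string keys "S" and "K" in a `Map` are assumptions made purely for illustration. It shows why recursion handles THead/TBody at any depth, where nested loops only handle a fixed number of levels.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RenameTypes {
    /** Builds a hypothetical structure element with a type ("S") and a list of children ("K"). */
    static Map<String, Object> elem(String type, Object... kids) {
        Map<String, Object> e = new HashMap<>();
        e.put("S", type);
        e.put("K", new ArrayList<>(Arrays.asList(kids)));
        return e;
    }

    /** Recursively replaces every type contained in 'from' with 'to'; returns the number of replacements. */
    @SuppressWarnings("unchecked")
    static int rename(Map<String, Object> node, Set<String> from, String to) {
        int count = 0;
        Object type = node.get("S");
        if (type != null && from.contains(type)) {
            node.put("S", to);
            count++;
        }
        // "K" may hold a single child or a list of children; normalize to a list.
        Object kids = node.get("K");
        List<Object> list = new ArrayList<>();
        if (kids instanceof List) {
            list.addAll((List<Object>) kids);
        } else if (kids != null) {
            list.add(kids);
        }
        for (Object kid : list) {
            if (kid instanceof Map) { // skip kids that are not elements (e.g. plain numbers)
                count += rename((Map<String, Object>) kid, from, to);
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Map<String, Object> root = elem("Document",
                elem("Table",
                        elem("THead", elem("TR", elem("TH"))),
                        elem("TBody", elem("TR", elem("TD")))));
        System.out.println(rename(root, Set.of("THead", "TBody"), "P"));
    }
}
```

Unlike the pdfbox method above, this sketch also checks the node it is called on, so a match sitting directly under the root would not be skipped. Whether that matters for a real document depends on how its structure tree is laid out.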

PS: I found an alternative. Just create a list of the input files, then:

mergerUtility.setDocumentMergeMode(PDFMergerUtility.DocumentMergeMode.OPTIMIZE_RESOURCES_MODE);
for (int i = 1; i < lists.size(); i++) {
    PDDocument currentDoc = PDDocument.load(lists.get(i));

    mergerUtility.appendDocument(docC, currentDoc);
}

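and save your new document... Using appendDocument instead of mergeDocuments does not invalidate the PDF/A, even though the result still contains THead and TBody... and it does not change the header version either.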