Converting XML to plain text

hl7 xslt xml c#

David Walker · 15 years ago

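My goal is to build an engine that takes the newer HL7 3.0 CDA documents and makes them backward compatible with HL7 2.5, which is a radically different beast.

A CDA document is an XML file which, when paired with a matching XSL file, renders an HTML document fit for display to the end user.

For HL7 2.5 I need to take the rendered text, stripped of all markup, and fold it into a text stream (or similar) that I can write out in 80-character lines to populate the HL7 2.5 message.

So far I am taking an approach that uses XslCompiledTransform to transform my XML document with XSLT and produce a resulting HTML document.

My next step is to take that document (or perhaps the output of an earlier step) and render the HTML as text. I have searched for a while but can't figure out how to do it. I'm hoping this is something easy that I'm just overlooking, or that I simply haven't found the magic search terms. Can anyone help?

FWIW, I have read the 5 or 10 other questions on SO that either embrace or admonish using regex for this, and I don't think I want to go down that road. I need the rendered text.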

using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;
using System.Xml.XPath;

public class TransformXML
{

    public static void Main(string[] args)
    {
        try
        {

            string sourceDoc = "C:\\CDA_Doc.xml";
            string resultDoc = "C:\\Result.html";
            string xsltDoc = "C:\\CDA.xsl";

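            // Load the CDA source document and compile the stylesheet.
            XPathDocument myXPathDocument = new XPathDocument(sourceDoc);
            XslCompiledTransform myXslTransform = new XslCompiledTransform();
            myXslTransform.Load(xsltDoc);

            // Write the transformed HTML to disk; the using block closes the
            // writer even if Transform throws.
            using (XmlTextWriter writer = new XmlTextWriter(resultDoc, null))
            {
                myXslTransform.Transform(myXPathDocument, null, writer);
            }

            // Open question: read the HTML back and render it as plain text.
            using (StreamReader stream = new StreamReader(resultDoc))
            {
            }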

        }

        catch (Exception e)
        {
            Console.WriteLine ("Exception: {0}", e.ToString());
        }
    }
}

6 replies | latest 15 years ago

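Scott Baker 15 years ago

Since you have the XML source, consider writing an XSL that gives you the output you want without the intermediate HTML step. That would be far more reliable than trying to convert the HTML.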
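To illustrate the idea, here is a minimal sketch of a stylesheet that writes text directly, run through the same XslCompiledTransform class the question already uses. The element name `paragraph`, the class name `DirectTextTransform`, and the stylesheet itself are assumptions for illustration only; a real CDA stylesheet would have to match the actual CDA elements (which live in the `urn:hl7-org:v3` namespace).

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class DirectTextTransform
{
    // Hypothetical stylesheet: method='text' suppresses all markup, and each
    // <paragraph> element (an assumed name) becomes one line of output.
    const string Stylesheet = @"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='text'/>
  <xsl:template match='/'>
    <xsl:for-each select='//paragraph'>
      <xsl:value-of select='normalize-space(.)'/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>";

    // Transforms an XML string straight to plain text, with no HTML in between.
    public static string ToText(string xml)
    {
        var xslt = new XslCompiledTransform();
        using (var styleReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(styleReader);
        }

        using (var source = XmlReader.Create(new StringReader(xml)))
        using (var output = new StringWriter())
        {
            // Writing to a TextWriter honors the stylesheet's xsl:output settings.
            xslt.Transform(source, null, output);
            return output.ToString();
        }
    }
}
```

Calling `DirectTextTransform.ToText(xml)` on a document containing `paragraph` elements should return their whitespace-normalized text, one paragraph per line.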

David Silva Smith 15 years ago

This will leave you with only the text:

using System;
using System.IO;
using System.Text;
using System.Xml;

class Program
{
    static string sourceDoc = "<html><body><p>this is a paragraph</p><p>another paragraph</p></body></html>";

    static void Main(string[] args)
    {
        StringBuilder result = new StringBuilder();

        // XmlReader requires well-formed markup.
        using (var reader = XmlReader.Create(new StringReader(sourceDoc)))
        {
            while (reader.Read())
            {
                // Value is empty for element nodes, so only text content accumulates.
                result.Append(reader.Value);
            }
        }
        Console.WriteLine(result);
    }
}

ProKiner 15 years ago

Or you could use a regular expression:

using System.Text.RegularExpressions;

public static string StripHtml(string htmlText)
{
    // Replace all tags with spaces...
    htmlText = Regex.Replace(htmlText, @"<(.|\n)*?>", " ");

    // ...decode non-breaking spaces and ampersands
    // (&amp; last, so that "&amp;nbsp;" is not decoded twice)...
    htmlText = htmlText.Replace("&nbsp;", " ");
    htmlText = htmlText.Replace("&amp;", "&");

    // ...then collapse runs of spaces, including any introduced by &nbsp;.
    while (htmlText.Contains("  "))
    {
        htmlText = htmlText.Replace("  ", " ");
    }

    return htmlText;
}
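The two hard-coded replacements only handle `&nbsp;` and `&amp;`. As a rough sketch of a variant, the framework's `System.Net.WebUtility.HtmlDecode` can decode the remaining character entities. The class name `HtmlText` and the simplified tag pattern are assumptions for illustration, and this remains a regex approach with the usual brittleness (for example, a `>` inside a comment or script will confuse it).

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    public static string Strip(string html)
    {
        // Remove tags first, so that a decoded "&lt;" is never mistaken for a tag.
        string text = Regex.Replace(html, "<[^>]*>", " ");

        // Decode character entities (&amp;, &lt;, &nbsp;, numeric references, ...).
        text = WebUtility.HtmlDecode(text);

        // &nbsp; decodes to U+00A0; treat it as an ordinary space.
        text = text.Replace('\u00A0', ' ');

        // Collapse every run of whitespace into a single space and trim the ends.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

For example, `HtmlText.Strip("<p>Fish&nbsp;&amp; chips</p>")` should give `Fish & chips`.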

yrral 15 years ago

Could you use something like this, which uses Lynx and Perl to render the HTML and then convert it to plain text?

Community CDub 8 years ago

See the answers to a similar question on SO:

How can I Convert HTML to Text in C#

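Chris Scott 15 years ago

This is a good use case for XSL-FO and FOP. FOP is not just for PDF output; one of the other major supported outputs is text. You should be able to construct a simple XSLT + FO stylesheet that has the specifications (i.e. line width) you want.

This solution is a bit more heavyweight than just using XML -> XSLT -> text as Scottesa suggested, but if you have more complex formatting requirements (e.g. indentation), they are much easier to express in FO than to mock up in XSLT.

I would avoid using regexes to extract the text. That is too low-level and guaranteed to be brittle. If you only need the text in 80-character lines, the default XSLT templates will print only the element text. Once you have only the text, you can apply whatever text processing is necessary.

By the way, I work for a company that produces CDA as part of our product (speech recognition for dictation). I would look into an XSLT that transforms 3.0 directly into 2.5. Depending on the fidelity you want to keep between the two versions, the full XSLT route is probably your easiest bet if what you really want to achieve is conversion between formats. That is what XSLT was built to do.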
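As a rough sketch of that suggestion: a stylesheet with no templates of its own falls back on the XSLT built-in rules, which output only text nodes, and the result can then be folded into fixed-width lines. The class and method names (`PlainTextFolder`, `ExtractText`, `Wrap`) are made up for illustration, and the greedy word wrap is just one possible text-processing step.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Xsl;

public static class PlainTextFolder
{
    // No templates are defined, so the built-in XSLT rules apply:
    // elements are traversed and only their text nodes are written.
    const string TextOnlyStylesheet = @"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='text'/>
</xsl:stylesheet>";

    public static string ExtractText(string xml)
    {
        var xslt = new XslCompiledTransform();
        using (var styleReader = XmlReader.Create(new StringReader(TextOnlyStylesheet)))
        {
            xslt.Load(styleReader);
        }

        using (var source = XmlReader.Create(new StringReader(xml)))
        using (var output = new StringWriter())
        {
            xslt.Transform(source, null, output);
            return output.ToString();
        }
    }

    // Greedy word wrap: lines are at most 'width' characters long, except that
    // a single word longer than 'width' is left on a line of its own, unsplit.
    public static List<string> Wrap(string text, int width)
    {
        var lines = new List<string>();
        var current = new StringBuilder();

        // A null separator array splits on any whitespace.
        foreach (string word in text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            if (current.Length > 0 && current.Length + 1 + word.Length > width)
            {
                lines.Add(current.ToString());
                current.Clear();
            }
            if (current.Length > 0) current.Append(' ');
            current.Append(word);
        }

        if (current.Length > 0) lines.Add(current.ToString());
        return lines;
    }
}
```

Usage would be along the lines of `PlainTextFolder.Wrap(PlainTextFolder.ExtractText(xml), 80)`. Note that the built-in rules concatenate adjacent text nodes with no separator, so paragraphs run together unless the stylesheet adds explicit line breaks.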