
Read the first 3 paragraphs of a long string [C#, HTML Agility Pack]

asp.net c#

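Churchill · 15 years ago

I want to read from a long string and output only its first 3 paragraphs. How can I do that? I have been using the code below, which limits the summary by word count, but I now want to limit it by paragraphs instead.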

public string MySummary(string html, int max)
{
    string summaryHtml = string.Empty;

    // load our html document
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);

    int wordCount = 0;

    foreach (var element in htmlDoc.DocumentNode.ChildNodes)
    {
        // inner text will strip out all html, and give us plain text
        string elementText = element.InnerText;

        // we split by space to get all the words in this element
        string[] elementWords = elementText.Split(new char[] { ' ' });

        // and if we haven't used too many words ...
        if (wordCount <= max)
        {
            // add the *outer* HTML (which will have proper 
            // html formatting for this fragment) to the summary
            summaryHtml += element.OuterHtml;
            wordCount += elementWords.Count() + 1;

        }
        else
        {
            break;
        }
    }

    return summaryHtml;
}

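5 replies

Andrew Bullock 15 years ago

If by paragraphs you mean <p> tags, why not get all the <p> nodes in the document and pull out the inner text of the first 3?

Edit, in response to the comment: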

RTFM?

http://htmlagilitypack.codeplex.com/wikipage?title=Examples&referringTitle=Home

Something like:

string.Join(" ", doc.DocumentNode.SelectNodes("//p").Take(3).Select(n => n.InnerText).ToArray());

rlee923 15 years ago

Why not just use a string tokenizer and read up to the point where the fourth <p> is found?
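
One way to read the tokenizer suggestion is to scan the string for opening <p> tags and cut it just before the fourth one. The sketch below assumes that reading; `ParagraphScanner` and `TakeUntilParagraph` are hypothetical names, and the scan is plain string matching rather than real HTML parsing, so it can be fooled by comments, scripts, or nested markup.

```csharp
using System;

public static class ParagraphScanner
{
    // Hypothetical helper: returns everything before the (count + 1)-th
    // opening <p> tag, or the whole string if there are not that many.
    public static string TakeUntilParagraph(string html, int count)
    {
        int seen = 0;
        int pos = 0;
        while (true)
        {
            int idx = html.IndexOf("<p", pos, StringComparison.OrdinalIgnoreCase);
            if (idx < 0)
                return html; // no further paragraph: keep everything

            int next = idx + 2;
            // Count only "<p>" or "<p ...>", not "<pre>", "<param>", etc.
            bool isParagraphTag = next < html.Length
                && (html[next] == '>' || char.IsWhiteSpace(html[next]));

            if (isParagraphTag)
            {
                seen++;
                if (seen > count)
                    return html.Substring(0, idx); // cut before this tag
            }
            pos = next;
        }
    }
}
```

Calling `ParagraphScanner.TakeUntilParagraph(html, 3)` would then keep whatever precedes the fourth <p>, including any non-paragraph markup that sits between the first three.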

Derek Ekins 12 years ago

I just needed to do this myself and came up with a very simple but forgiving approach that works well for our particular scenario:

    public string GetParagraphs(string html, int numberOfParagraphs)
    {
        const string paragraphSeparator = "</p>";
        var paragraphs = html.Split(new[] { paragraphSeparator }, StringSplitOptions.RemoveEmptyEntries);
        return string.Join("", paragraphs.Take(numberOfParagraphs).Select(paragraph => paragraph + paragraphSeparator));
    }

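I know this is naive about the structure of the document, and it will also pick up anything that sits between the <p> tags, but in my use case that is exactly what I want. Maybe it will work for you too?

Pino 10 years ago

This is the better answer. But what would the code be if we wanted paragraphs 2 through 5 instead?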

public string GetParagraphs(string html, int numberOfParagraphs) {
    const string paragraphSeparator = "</p>";
    var paragraphs = html.Split(new[] { paragraphSeparator }, StringSplitOptions.RemoveEmptyEntries);
    return string.Join("", paragraphs.Take(numberOfParagraphs).Select(paragraph => paragraph + paragraphSeparator));
}
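
A possible answer to the range question, as a minimal sketch rather than a definitive implementation: keep the same naive "</p>" split and add `Skip` before `Take`. The class name `ParagraphSlicer`, the method name `GetParagraphRange`, and the 1-based inclusive `first`/`last` parameters are assumptions made for illustration.

```csharp
using System;
using System.Linq;

public static class ParagraphSlicer
{
    // Hypothetical helper: returns paragraphs first..last (1-based, inclusive),
    // using the same naive "</p>" split as GetParagraphs above.
    public static string GetParagraphRange(string html, int first, int last)
    {
        if (first < 1 || last < first)
            throw new ArgumentOutOfRangeException(nameof(first), "Expected 1 <= first <= last.");

        const string paragraphSeparator = "</p>";
        var paragraphs = html.Split(new[] { paragraphSeparator }, StringSplitOptions.RemoveEmptyEntries);

        return string.Concat(
            paragraphs
                .Skip(first - 1)          // drop the paragraphs before the range
                .Take(last - first + 1)   // keep the requested number of paragraphs
                .Select(paragraph => paragraph + paragraphSeparator));
    }
}
```

With this, `ParagraphSlicer.GetParagraphRange(html, 2, 5)` returns the second through fifth chunks. As with the original method, any markup between or after the <p> elements is treated as part of a chunk.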

StefanM MarkOwen320 8 years ago

You have to use HtmlAgilityPack.

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(HtmlContent);

string Html = string.Join(" ", doc.DocumentNode.SelectNodes("//p").Take(2).Select(n => n.OuterHtml).ToArray());

