You want to remove HTML tags from a string. This is useful for displaying HTML as plain text: formatting such as bold and italics is stripped, while the actual textual content is kept. The methods available for this should be tested for both performance and correctness.
These C# example programs show how to remove HTML tags from strings.
Removing HTML tags from strings

Input:  <p>The <b>dog</b> is <i>cute</i>.</p>
Output: The dog is cute.

Performance test for HTML removal

HtmlRemoval.StripTagsRegex:         2404 ms
HtmlRemoval.StripTagsRegexCompiled: 1366 ms
HtmlRemoval.StripTagsCharArray:      287 ms [fastest]

File length test for HTML removal

File tested:                        Real-world HTML file
File length before:                 8085 chars
HtmlRemoval.StripTagsRegex:         4382 chars
HtmlRemoval.StripTagsRegexCompiled: 4382 chars
HtmlRemoval.StripTagsCharArray:     4382 chars
Examples
First, here is a static class that contains three different ways of removing HTML tags. Each method receives a string argument, processes it, and returns a new string that does not contain the HTML tags; the text between the tags is kept. The methods have different performance characteristics. As a reminder, HTML tags start with < and end with >.
HtmlRemoval static class [C#]

using System;
using System.Text.RegularExpressions;

/// <summary>
/// Methods to remove HTML from strings.
/// </summary>
public static class HtmlRemoval
{
    /// <summary>
    /// Remove HTML from string with Regex.
    /// </summary>
    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    /// <summary>
    /// Compiled regular expression for performance.
    /// </summary>
    static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

    /// <summary>
    /// Remove HTML from string with compiled Regex.
    /// </summary>
    public static string StripTagsRegexCompiled(string source)
    {
        return _htmlRegex.Replace(source, string.Empty);
    }

    /// <summary>
    /// Remove HTML tags from string using char array.
    /// </summary>
    public static string StripTagsCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;

        for (int i = 0; i < source.Length; i++)
        {
            char let = source[i];
            if (let == '<')
            {
                inside = true;
                continue;
            }
            if (let == '>')
            {
                inside = false;
                continue;
            }
            if (!inside)
            {
                array[arrayIndex] = let;
                arrayIndex++;
            }
        }
        return new string(array, 0, arrayIndex);
    }
}
Notes. This is a public static class written in the C# language that does not save state. You can call into the class using the code HtmlRemoval.StripTags*. Normally, you can put this class in a separate file named HtmlRemoval.cs. Because it is not project-specific, it is useful for many programs.
StripTagsRegex. This method uses the static Regex.Replace method, so the expression is not created with RegexOptions.Compiled. For this reason, the method could be optimized by pulling the Regex out of the method, as in the second method. The regular expression <.*?> matches every sequence that starts with <, ends with >, and has any number of characters in between, but as few as possible. Each match is replaced with string.Empty (removed).
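To illustrate why the "minimal number" of characters matters, here is a small sketch that contrasts the lazy pattern with a greedy variant. The class name LazyQuantifierDemo and the greedy method are not part of the original code; they are included only for comparison.

```csharp
using System;
using System.Text.RegularExpressions;

static class LazyQuantifierDemo
{
    // Lazy pattern from the article: each match stops at the first '>'.
    public static string StripLazy(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Greedy variant, shown only for contrast: a match runs to the last '>'.
    public static string StripGreedy(string source)
    {
        return Regex.Replace(source, "<.*>", string.Empty);
    }

    static void Main()
    {
        const string html = "<p>The <b>dog</b> is <i>cute</i>.</p>";

        // Prints: Lazy:   [The dog is cute.]
        Console.WriteLine("Lazy:   [" + StripLazy(html) + "]");

        // Prints: Greedy: []
        Console.WriteLine("Greedy: [" + StripGreedy(html) + "]");
    }
}
```

With the greedy pattern, a single match spans from the first < to the last > on the line, so the text between the tags is removed as well.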
StripTagsRegexCompiled. This method does the exact same thing as the previous method, but its regular expression is pulled out of the method call and stored in the static class. I recommend this method for most programs, as it is very simple to inspect and considerably faster than the first method. The static Regex will only be created once in your program.
StripTagsCharArray. This method is a heavily optimized version of an approach that could instead use StringBuilder. In most benchmarks, this method is faster and is appropriate for when you need to strip lots of HTML files. A detailed description of the method's body is available below.
Tests
Here we look at a program that runs these methods through a very simple test. The three methods work identically on valid HTML. One difference shows up on malformed input: the char array method strips everything that follows a <, even if no > ever closes it, but the Regex methods require a closing > before they strip anything.
Program that tests HTML removal [C#]

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        const string html = "<p>There was a <b>.NET</b> programmer " +
            "and he stripped the <i>HTML</i> tags.</p>";

        Console.WriteLine(HtmlRemoval.StripTagsRegex(html));
        Console.WriteLine(HtmlRemoval.StripTagsRegexCompiled(html));
        Console.WriteLine(HtmlRemoval.StripTagsCharArray(html));
    }
}

Output

There was a .NET programmer and he stripped the HTML tags.
There was a .NET programmer and he stripped the HTML tags.
There was a .NET programmer and he stripped the HTML tags.
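The difference on an unclosed < can be seen with a short sketch. The class name UnclosedTagDemo and the input string are made up for illustration; the two methods are copies of the Regex and char array approaches, included so the example compiles on its own.

```csharp
using System;
using System.Text.RegularExpressions;

static class UnclosedTagDemo
{
    // Same lazy pattern as the Regex methods.
    public static string StripWithRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Same flag-based loop as the char array method.
    public static string StripWithCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;
        foreach (char c in source)
        {
            if (c == '<') { inside = true; continue; }
            if (c == '>') { inside = false; continue; }
            if (!inside) { array[arrayIndex++] = c; }
        }
        return new string(array, 0, arrayIndex);
    }

    static void Main()
    {
        // Hypothetical input: a '<' that is never closed by a '>'.
        const string text = "5 < 10 is true";

        // Prints: [5 < 10 is true] because the pattern never matches.
        Console.WriteLine("[" + StripWithRegex(text) + "]");

        // Prints: [5 ] because the flag stays set after the '<'.
        Console.WriteLine("[" + StripWithCharArray(text) + "]");
    }
}
```

If the input may contain a literal < in its text, the Regex methods are the more forgiving choice.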
First, from my performance research I know regular expressions in the C# language are usually not the fastest way to process text. I wrote an algorithm that uses a combination of char arrays and the new string constructor to strip HTML tags, meeting the requirement and often performing better.
The benchmark for these methods stripped 10000 HTML files of around 8000 characters in tight loops. The file was read in with File.ReadAllText. The result was that the char array method was considerably faster. This could be worthwhile if you have to strip many files in a script, such as one that preprocesses a large website in memory.
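The original benchmark harness is not shown, so the following is only a rough sketch of how such a measurement could be set up with Stopwatch. It assumes each method is run 10000 times on one string of about 8000 characters. The class name StripBenchmark, the helper TimeMs, and the synthetic document built by BuildSampleHtml are assumptions standing in for the real-world file. Timings will vary by machine and runtime.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

static class StripBenchmark
{
    static readonly Regex Compiled = new Regex("<.*?>", RegexOptions.Compiled);

    public static string StripRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    public static string StripRegexCompiled(string source)
    {
        return Compiled.Replace(source, string.Empty);
    }

    public static string StripCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;
        foreach (char c in source)
        {
            if (c == '<') { inside = true; continue; }
            if (c == '>') { inside = false; continue; }
            if (!inside) { array[arrayIndex++] = c; }
        }
        return new string(array, 0, arrayIndex);
    }

    // Assumption: a synthetic document of about 8000 characters stands in
    // for the real-world HTML file, which is not available here.
    public static string BuildSampleHtml()
    {
        StringBuilder builder = new StringBuilder();
        while (builder.Length < 8000)
        {
            builder.Append("<p>The <b>dog</b> is <i>cute</i>.</p>\n");
        }
        return builder.ToString();
    }

    // Runs one method repeatedly and returns the elapsed milliseconds.
    static long TimeMs(Func<string, string> strip, string html, int iterations)
    {
        strip(html); // Warm-up call so one-time setup is not measured.
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            strip(html);
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    static void Main()
    {
        const int iterations = 10000;
        string html = BuildSampleHtml();

        Console.WriteLine("StripRegex:         " + TimeMs(StripRegex, html, iterations) + " ms");
        Console.WriteLine("StripRegexCompiled: " + TimeMs(StripRegexCompiled, html, iterations) + " ms");
        Console.WriteLine("StripCharArray:     " + TimeMs(StripCharArray, html, iterations) + " ms");
    }
}
```

The warm-up call keeps one-time costs, such as building the compiled Regex, out of the timed loop.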
Iterative method
The char array method, which is dramatically faster than the other two methods, uses a neat algorithm for parsing the HTML quickly. It iterates through all characters, flipping a Boolean flag depending on whether it is inside a tag. It adds a character to the array buffer only if that character is not part of a tag. For performance, it uses a char array and the new string constructor that accepts a char array and a range. This is faster than using StringBuilder.
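For comparison, here is a sketch of what the StringBuilder version of the same loop might look like. The method name StripTagsStringBuilder is hypothetical and does not appear in the HtmlRemoval class; only the buffer type differs from the char array method.

```csharp
using System;
using System.Text;

static class StringBuilderStrip
{
    // Hypothetical StringBuilder counterpart of the char array method.
    public static string StripTagsStringBuilder(string source)
    {
        // Reserve capacity up front; the result is never longer than the input.
        StringBuilder builder = new StringBuilder(source.Length);
        bool inside = false;
        foreach (char c in source)
        {
            if (c == '<') { inside = true; continue; }
            if (c == '>') { inside = false; continue; }
            if (!inside)
            {
                builder.Append(c);
            }
        }
        return builder.ToString();
    }

    static void Main()
    {
        // Prints: The dog is cute.
        Console.WriteLine(StripTagsStringBuilder("<p>The <b>dog</b> is <i>cute</i>.</p>"));
    }
}
```

The char array version replaces each Append call with a direct array write and builds the final string in a single constructor call.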
Using RegexOptions.Compiled and a separate Regex results in better performance than using the Regex static method. RegexOptions.Compiled has some drawbacks, however. It can make startup as much as 10x slower in some cases. More material is available on making Regexes simpler and faster to run.
Self-closing tags
In XHTML, certain elements such as BR and IMG have no separate closing tag, and instead use "/>" at the end of the first tag. The test file noted above includes these self-closing tags, and the methods handle them correctly. Here are some HTML tags supported.
Supported tags

<img src="" />
<img src=""/>
<br />
<br/>
< div >
<!-- -->
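The tags listed above can be checked with a short sketch. The class name SupportedTagsDemo and the surrounding words "before" and "after" are made up for illustration; the two methods are copies of the Regex and char array approaches, so the example stands on its own.

```csharp
using System;
using System.Text.RegularExpressions;

static class SupportedTagsDemo
{
    // The tag forms listed in the article.
    public static readonly string[] Tags =
    {
        "<img src=\"\" />",
        "<img src=\"\"/>",
        "<br />",
        "<br/>",
        "< div >",
        "<!-- -->"
    };

    public static string StripWithRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    public static string StripWithCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;
        foreach (char c in source)
        {
            if (c == '<') { inside = true; continue; }
            if (c == '>') { inside = false; continue; }
            if (!inside) { array[arrayIndex++] = c; }
        }
        return new string(array, 0, arrayIndex);
    }

    static void Main()
    {
        foreach (string tag in Tags)
        {
            // Wrap each tag in plain text so the kept content is visible.
            string input = "before" + tag + "after";

            // Both calls print: beforeafter
            Console.WriteLine(StripWithRegex(input));
            Console.WriteLine(StripWithCharArray(input));
        }
    }
}
```

Every form starts with < and ends at its first >, which is all that either approach looks for.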
Source: dotnetperls.com