Reading a web page

There is often a need to read a web page programmatically, that is, to fetch its HTML through code and parse it to find what you want. HTML Agility Pack (HAP), a library hosted on CodePlex, comes to our rescue.
Download the DLL and add a reference to it in your application. The library fetches the HTML from the URL you provide and exposes a DOM that you can navigate. With HAP you can read, search, modify and save web pages. Of course, modifying and saving only produces a local file; the changes are not persisted to the web page itself.
In this post, we will create two button click handlers: one that lists all the meta tags on a web page and one that lists all the links on the page. The code is shown below:
// Requires: using System.Text; using System.Windows; using HtmlAgilityPack;
// tbUrl and tbResult are TextBox controls on the window.

// Lists the name/content pair of every <meta> tag on the page.
private void btnGetMetaTags_Click(object sender, RoutedEventArgs e)
{
    tbResult.Text = string.Empty;

    // Download the page and parse it into a DOM.
    var webGet = new HtmlWeb();
    var document = webGet.Load(tbUrl.Text);

    var result = new StringBuilder();

    // Descendants returns an empty sequence (never null) when nothing matches.
    foreach (var tag in document.DocumentNode.Descendants("meta"))
    {
        var name = tag.Attributes["name"];
        var content = tag.Attributes["content"];

        // Skip tags such as <meta charset="..."> that lack either attribute.
        if (name != null && content != null)
        {
            result.AppendLine("Name: " + name.Value);
            result.AppendLine("Content: " + content.Value);
            result.AppendLine();
        }
    }

    tbResult.Text = result.ToString();
}

// Lists the URL and visible text of every <a> tag on the page.
private void btnGetAllLinks_Click(object sender, RoutedEventArgs e)
{
    tbResult.Text = string.Empty;

    var webGet = new HtmlWeb();
    var document = webGet.Load(tbUrl.Text);

    var result = new StringBuilder();

    foreach (var lnk in document.DocumentNode.Descendants("a"))
    {
        var href = lnk.Attributes["href"];
        var text = lnk.InnerText.Trim();

        // Skip anchors without a target or without visible text.
        if (href != null && text.Length > 0)
        {
            result.AppendLine("URL: " + href.Value);
            result.AppendLine("InnerText: " + text);
            result.AppendLine();
        }
    }

    tbResult.Text = result.ToString();
}
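In the code, we first create an HtmlWeb object and call its Load method with the supplied URL, which returns the parsed HTML document. Then we access the document's DocumentNode property and call Descendants, which returns a sequence of matching nodes that can be enumerated directly or queried further with LINQ.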
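Since the descendants come back as an ordinary enumerable sequence, the same link filter can also be written as a LINQ query. The following is only a minimal sketch, not the code from the application above: the LinkQueryDemo class, the ExtractLinks helper and the sample markup are made up for illustration, and the HTML is loaded from a string with LoadHtml so that the example runs without network access.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LinkQueryDemo
{
    // Hypothetical helper: returns (URL, text) pairs for every anchor
    // that has an href attribute and non-empty visible text.
    public static List<KeyValuePair<string, string>> ExtractLinks(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html); // parse from a string instead of downloading

        return document.DocumentNode
            .Descendants("a")
            .Where(a => a.Attributes["href"] != null
                        && a.InnerText.Trim().Length > 0)
            .Select(a => new KeyValuePair<string, string>(
                a.Attributes["href"].Value,
                a.InnerText.Trim()))
            .ToList();
    }

    public static void Main()
    {
        // Sample markup, assumed for illustration only.
        const string html =
            "<html><body>" +
            "<a href=\"/first\">First</a>" +
            "<a>No href</a>" +
            "<a href=\"/second\"> Second </a>" +
            "</body></html>";

        foreach (var link in ExtractLinks(html))
        {
            Console.WriteLine("URL: " + link.Key);
            Console.WriteLine("InnerText: " + link.Value);
        }
    }
}
```

Keeping the query in a method that takes the HTML as a string makes it easy to try out against small snippets before pointing it at a live page.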
Once you run the application and click the "Get Meta Tags" button, the result box lists the name and content of each meta tag on the page.
When you click the "Get all links on the page" button, the result box lists the URL and inner text of each link.
You can also modify the document's nodes in a similar fashion and then write the result to a local file by calling the Save method of the document.
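As a rough sketch of the modify-and-save step, the example below adds a rel attribute to every link and writes the document to a local file. The SaveDemo class, the AddNoFollowAndSave helper, the choice of the rel="nofollow" change and the output file name are all assumptions made for illustration; it also assumes the HAP build you reference exposes the Save overload that takes a file path.

```csharp
using System;
using System.IO;
using System.Linq;
using HtmlAgilityPack;

public static class SaveDemo
{
    // Hypothetical helper: marks every link as rel="nofollow"
    // and saves the modified document to the given local path.
    public static void AddNoFollowAndSave(string html, string path)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // ToList() takes a snapshot so the DOM is not enumerated while it changes.
        foreach (var link in document.DocumentNode.Descendants("a").ToList())
        {
            link.SetAttributeValue("rel", "nofollow");
        }

        // Only the local file changes; the original web page is untouched.
        document.Save(path);
    }

    public static void Main()
    {
        // Sample markup and file name, assumed for illustration only.
        const string html = "<html><body><a href=\"/first\">First</a></body></html>";
        var path = Path.Combine(Path.GetTempPath(), "modified.html");

        AddNoFollowAndSave(html, path);
        Console.WriteLine(File.ReadAllText(path));
    }
}
```

The same pattern works on a document returned by HtmlWeb.Load; the string input here just keeps the sketch independent of any particular URL.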
