# TagSoup

[Hackage](https://hackage.haskell.org/package/tagsoup) [Stackage](https://www.stackage.org/package/tagsoup) [CI](https://github.com/ndmitchell/tagsoup/actions)

TagSoup is a library for parsing HTML/XML. It supports the HTML 5 specification, and can be used to parse either well-formed XML, or unstructured and malformed HTML from the web. The library also provides useful functions to extract information from an HTML document, making it ideal for screen-scraping.

The library provides a basic data type for a list of unstructured tags, a parser to convert HTML into this tag type, and useful functions and combinators for finding and extracting information. This document gives two particular examples of scraping information from the web, while a few more may be found in the [Sample](https://github.com/ndmitchell/tagsoup/blob/master/test/TagSoup/Sample.hs) file from the source repository. The examples we give are:

* Obtaining the last modified date of the Haskell wiki
* Obtaining a list of Simon Peyton Jones' latest papers
* A brief overview of some other examples

The initial version of this library was written in Javascript and has been used for various commercial projects involving screen scraping. In the examples, general hints on screen scraping are included, learnt from bitter experience. It should be noted that if you depend on data which someone else may change at any given time, you may be in for a shock!
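To make the "list of unstructured tags" idea concrete, here is a minimal sketch of parsing some malformed HTML (the input string is invented for illustration; it assumes the `tagsoup` package is installed):

```haskell
import Text.HTML.TagSoup

-- parseTags turns raw (possibly malformed) HTML into a flat list of tags.
-- Note the unclosed <b>: TagSoup does not repair or nest tags, it just
-- reports them in the order they appear.
main :: IO ()
main = print $ parseTags "<p class=intro>Hello <b>world</p>"
```

The result is a flat `[Tag String]` containing `TagOpen`, `TagText` and `TagClose` values, with no nesting information.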
This library was written without knowledge of the Java version of [TagSoup](https://github.com/jukka/tagsoup). They have made a very different design decision: to ensure default attributes are present and to properly nest parsed tags. We do not do this - tags are merely a list devoid of nesting information.

#### Acknowledgements

Thanks to Mike Dodds for persuading me to write this up as a library. Thanks to many people for debugging and code contributions, including: Gleb Alexeev, Ketil Malde, Conrad Parker, Henning Thielemann, Dino Morelli, Emily Mitchell, Gwern Branwen.

## Potential Bugs

There are two things that may go wrong with these examples:

* _The websites being scraped may change._ There is nothing I can do about this, but if you suspect this is the case let me know, and I'll update the examples and tutorials. I have already done so several times; it's only a few minutes' work.
* _The `openURL` method may not work._ This happens quite regularly, and depending on your server, proxies and direction of the wind, it may not work. The solution is to use `wget` to download the page locally, then use `readFile` instead. Hopefully a decent Haskell HTTP library will emerge, and that can be used instead.

## Last modified date of Haskell wiki

Our goal is to develop a program that displays the date that the wiki at [`wiki.haskell.org`](http://wiki.haskell.org/Haskell) was last modified. This example covers all the basics in designing a basic web-scraping application.

### Finding the Page

We first need to find where the information is displayed and in what format. Taking a look at the [front web page](http://wiki.haskell.org/Haskell), when not logged in, we see:

```html
<li id="footer-info-lastmod"> This page was last modified on 9 September 2013, at 22:38.</li>
```

So, we see that the last modified date is available. This leads us to rule 1:

**Rule 1:** Scrape from what the page returns, not what a browser renders, or what view-source gives.

Some web servers will serve different content depending on the user agent, some browsers will have scripting modify their displayed HTML, some pages will display differently depending on your cookies. Before you can start to figure out how to start scraping, first decide what the input to your program will be. There are two ways to get the page as it will appear to your program.

#### Using the HTTP package

We can write a simple HTTP downloader using the [HTTP package](http://hackage.haskell.org/package/HTTP):

```haskell
module Main where

import Network.HTTP

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

main :: IO ()
main = do
    src <- openURL "http://wiki.haskell.org/Haskell"
    writeFile "temp.htm" src
```

Now open `temp.htm`, find the fragment of HTML containing the last modified date, and examine it.

### Finding the Information

Now we examine both the fragment that contains our snippet of information, and the wider page. What does the fragment have that nothing else has? What algorithm would we use to obtain that particular element? How can we still return the element as the content changes? What if the design changes? But wait, before going any further:

**Rule 2:** Do not be robust to design changes; do not even consider the possibility when writing the code.

If the user changes their website, they will do so in unpredictable ways. They may move the page, they may put the information somewhere else, they may remove the information entirely. If you want something robust, talk to the site owner, or buy the data from someone. If you try and think about design changes, you will complicate your design, and it still won't work. It is better to write an extraction method quickly, and happily rewrite it when things change.
So now, let's consider the fragment from above. It is useful to find a tag which is unique just above your snippet - something with a nice `id` or `class` attribute - something which is unlikely to occur multiple times. In the above example, an `id` with value `footer-info-lastmod` seems perfect.

```haskell
module Main where

import Data.Char
import Network.HTTP
import Text.HTML.TagSoup

openURL :: String -> IO String
openURL x = getResponseBody =<< simpleHTTP (getRequest x)

haskellLastModifiedDateTime :: IO ()
haskellLastModifiedDateTime = do
    src <- openURL "http://wiki.haskell.org/Haskell"
    let lastModifiedDateTime = fromFooter $ parseTags src
    putStrLn $ "wiki.haskell.org was last modified on " ++ lastModifiedDateTime
    where fromFooter = unwords . drop 6 . words . innerText . take 2 . dropWhile (~/= "<li id=footer-info-lastmod>")

main :: IO ()
main = haskellLastModifiedDateTime
```
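The `(~/=)` operator used in `fromFooter` (and its positive counterpart `(~==)`) performs loose matching against a tag pattern, so `"<li id=footer-info-lastmod>"` matches any `<li>` whose `id` has that value. As a self-contained sketch of the same combinators on a different task, here is one way to pull the `href` of every link out of a string (the input HTML is invented; a real page would come from `openURL` as above):

```haskell
import Text.HTML.TagSoup

-- Keep only the <a> tags that carry an href attribute, then read it off.
-- (~== "<a href>") matches any TagOpen "a" with an href of any value.
hrefs :: String -> [String]
hrefs = map (fromAttrib "href") . filter (~== "<a href>") . parseTags

main :: IO ()
main = mapM_ putStrLn $ hrefs "<ul><li><a href=a.html>A</a><li><a href=b.html>B</a></ul>"
```

The same pattern - `parseTags`, a structural filter with `(~==)`/`(~/=)`, then an extraction function such as `fromAttrib` or `innerText` - covers most simple scraping tasks.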