I'm thinking of trying Beautiful Soup, a Python package for HTML scraping. Are there any other HTML scraping packages I should be looking at? Python is not a requirement; I'm actually interested in hearing about other languages as well.
The story so far:
- Python
- Ruby
- .NET
- Perl
- Java
- JavaScript
- PHP
- Most of them
The Ruby world's equivalent to Beautiful Soup is why_the_lucky_stiff's Hpricot.
In the .NET world, I recommend the HTML Agility Pack. It's not nearly as simple as some of the other options (like HTMLSQL), but it's very flexible. It lets you manipulate poorly formed HTML as if it were well-formed XML, so you can use XPath or just iterate over nodes.
http://www.codeplex.com/htmlagilitypack
BeautifulSoup is a great way to go for HTML scraping. My previous job had me doing a lot of scraping, and I wish I had known about BeautifulSoup when I started. It's like the DOM with a lot more useful options, and it's a lot more pythonic. If you want to try Ruby, they ported Beautiful Soup as RubyfulSoup, but it hasn't been updated in a while.
Other useful tools are HTMLParser and sgmllib.SGMLParser, which are part of the standard Python library. These work by calling methods every time you enter or exit a tag and encounter HTML text. They're like Expat, if you're familiar with that. These libraries are especially useful if you are going to parse very large files, where building a DOM tree would be slow and expensive.
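For illustration, here's a minimal sketch of that event-driven style in modern Python (HTMLParser now lives in html.parser; sgmllib was removed in Python 3):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href attributes without ever building a tree."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag the parser encounters.
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

parser = LinkCollector()
parser.feed('<p>See <a href="/docs">the docs</a> and <a href="/faq">FAQ</a>.</p>')
print(parser.links)  # -> ['/docs', '/faq']
```

Because no tree is kept in memory, you can feed() the document in chunks, which is exactly what makes this style attractive for very large files.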
Regular expressions aren't really necessary on their own: BeautifulSoup accepts regular expressions in its queries, so if you need their power you can use it there. I say go with BeautifulSoup unless you need speed and a smaller memory footprint. If you find a better HTML parser for Python, let me know.
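To show what that regex support looks like in practice, here's a small sketch using the modern bs4 package (which post-dates this answer; the markup is invented):

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="nav-top">x</div><div id="nav-side">y</div><div id="main">z</div>',
    "html.parser",
)

# find_all() accepts a compiled regular expression anywhere it accepts a
# string, so the regex matches attribute values rather than raw markup.
for div in soup.find_all("div", id=re.compile(r"^nav-")):
    print(div["id"])  # -> nav-top, nav-side
```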
I found HTMLSQL to be a ridiculously simple way to screen-scrape. It takes literally minutes to get results with it.
The queries are super-intuitive, along the lines of: `SELECT title FROM img WHERE $class == 'userpic'`.
There are now some other alternatives that take the same approach.
The Python lxml library acts as a Pythonic binding for the libxml2 and libxslt libraries. I particularly like its XPath support and the pretty-printing of the in-memory XML structure. It also supports parsing broken HTML, and I don't think you'll find another Python library/binding that parses XML faster than lxml.
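A minimal sketch of both points, assuming a made-up broken fragment:

```python
from lxml import etree, html

broken = "<ul><li>one<li>two"          # unclosed tags everywhere
tree = html.fromstring(broken)          # lxml repairs the structure on parse

# XPath runs directly against the recovered tree.
print(tree.xpath("//li/text()"))        # -> ['one', 'two']

# Pretty-print the repaired in-memory structure.
print(etree.tostring(tree, pretty_print=True).decode())
```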
For Perl, there's WWW::Mechanize.
Python has several options for HTML scraping in addition to Beautiful Soup. Here are some others:
- WWW:Mechanize: gives you a browser-like object to interact with web pages
- libwww: supports various options to traverse and select elements (e.g. XPath and CSS selection)

'Simple HTML DOM Parser' is a good option for PHP. If you're familiar with jQuery or JavaScript selectors, you will find yourself at home.
Why has no one mentioned JSOUP yet for Java? http://jsoup.org/
The templatemaker utility from Adrian Holovaty (of Django fame) uses a very interesting approach: You feed it variations of the same page and it "learns" where the "holes" for variable data are. It's not HTML specific, so it would be good for scraping any other plaintext content as well. I've used it also for PDFs and HTML converted to plaintext (with pdftotext and lynx, respectively).
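From memory of the announcement post, usage looks roughly like this; treat the exact method names as an assumption and check the project's source:

```python
from templatemaker import Template

t = Template()
t.learn('<b>this and that</b>')
t.learn('<b>alex and sue</b>')

# The template now has "holes" where the two samples differed.
print(t.as_text('!'))                        # -> '<b>! and !</b>'

# extract() pulls the hole values out of a new page with the same shape.
print(t.extract('<b>larry and curly</b>'))   # -> ('larry', 'curly')
```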
I would first find out whether the site(s) in question provide an API or RSS feeds for accessing the data you require.
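For example, a quick sketch of the feed route using the feedparser package (the URL is a placeholder):

```python
import feedparser

feed = feedparser.parse("https://example.com/rss")  # hypothetical feed URL
for entry in feed.entries:
    print(entry.title, entry.link)  # structured data, no HTML parsing needed
```

If the data is already exposed that way, you skip the fragility of parsing markup entirely.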
I know and love Screen-Scraper.

Screen-Scraper is a tool for extracting data from websites. It automates the crawling and data-extraction work, and it comes in three editions.
Another option for Perl would be Web::Scraper, which is based on Ruby's Scrapi. In a nutshell: with nice and concise syntax, you get a robust scraper that extracts pages directly into data structures.
Scraping Stack Overflow is especially easy with Shoes and Hpricot.
I've had some success with HtmlUnit, in Java. It's a simple framework for writing unit tests on web UIs, but it's equally useful for HTML scraping.
Yahoo! Query Language, or YQL, can be used along with jQuery, AJAX, and JSONP to screen-scrape web pages.
Another tool for .NET is MhtBuilder
There is this solution too: netty HttpClient
I use Hpricot on Ruby. As an example this is a snippet of code that I use to retrieve all book titles from the six pages of my HireThings account (as they don't seem to provide a single page with this information):
It's pretty much complete. All that comes before this are library imports and the settings for my proxy.
I've used Beautiful Soup a lot with Python. It is much better than regular expression checking, because it works like using the DOM, even if the HTML is poorly formatted. You can quickly find HTML tags and text with simpler syntax than regular expressions. Once you find an element, you can iterate over it and its children, which is more useful for understanding the contents in code than it is with regular expressions. I wish Beautiful Soup existed years ago when I had to do a lot of screenscraping -- it would have saved me a lot of time and headache since HTML structure was so poor before people started validating it.
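A small sketch of that find-then-iterate style, using today's bs4 package on some invented markup:

```python
from bs4 import BeautifulSoup

markup = """
<ul class="books">
  <li><a href="/b/1">Dune</a></li>
  <li><a href="/b/2">Hyperion</a></li>
</ul>
"""
soup = BeautifulSoup(markup, "html.parser")

# Find the element once, then walk its children like a DOM tree --
# far more readable than the equivalent regular expression would be.
for li in soup.find("ul", class_="books").find_all("li"):
    print(li.a["href"], li.a.get_text())
```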
Although it was designed for .NET web-testing, I've been using the WatiN framework for this purpose. Since it is DOM-based, it is pretty easy to capture HTML, text, or images. Recently, I used it to dump a list of links from a MediaWiki All Pages namespace query into an Excel spreadsheet. The following VB.NET code fragment is pretty crude, but it works.
Implementations of the HTML5 parsing algorithm: html5lib (Python, Ruby), Validator.nu HTML Parser (Java, JavaScript; C++ in development), Hubbub (C), Twintsam (C#; upcoming).
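As a sketch of the Python one (html5lib), which recovers malformed markup exactly the way a browser would:

```python
import html5lib

# namespaceHTMLElements=False keeps the ElementTree tag names plain.
doc = html5lib.parse("<p>unclosed <b>tags", namespaceHTMLElements=False)
print(doc.find(".//b").text)  # -> 'tags', recovered just as a browser would
```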
You would be a fool not to use Perl... Here come the flames...
Bone up on modules like LWP, HTML::TreeBuilder, and WWW::Mechanize and you can ginsu any scrape around.
I have used LWP and HTML::TreeBuilder with Perl and have found them very useful.
LWP (short for libwww-perl) lets you connect to websites and scrape the HTML; the module is available on CPAN, and the O'Reilly book appears to be readable online as well.
TreeBuilder allows you to construct a tree from the HTML; documentation and source are available in HTML::TreeBuilder - Parser that builds a HTML syntax tree.
There might still be too much heavy lifting to do with this kind of approach, though. I have not looked at the Mechanize module suggested by another answer, so I may well do that.
In Java, you can use TagSoup.
Well, if you want it done from the client side using only a browser, there's jcrawl.com. After designing your scraping service in the web application (http://www.jcrawl.com/app.html), you only need to add the generated script to an HTML page to start using/presenting your data.
All the scraping logic happens in the browser via JavaScript. I hope you find it useful. There is a live example that extracts the latest news from Yahoo tennis.
You probably have as much already, but I think this is what you are trying to do:
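(A sketch of the idea using urllib and Beautiful Soup; the URL and tag choices are placeholders.)

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

page = urlopen("https://example.com/")            # fetch the raw HTML
soup = BeautifulSoup(page.read(), "html.parser")  # parse it into a tree

print(soup.title.string)                  # the page title
for a in soup.find_all("a", href=True):   # every link on the page
    print(a["href"])
```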
I've had mixed results in .NET using SgmlReader which was originally started by Chris Lovett and appears to have been updated by MindTouch.
I like Google Spreadsheets' ImportXML(URL, XPath) function. For example, something like =ImportXML("http://example.com", "//a/@href") would pull every link's href into the sheet. It will repeat cells down the column if your XPath expression returns more than one value, and you can have up to 50 importxml() functions on one spreadsheet.

RapidMiner's Web Plugin is also pretty easy to use. It can do POSTs, accepts cookies, and can set the user-agent.
I've also had great success using Aptana's Jaxer + jQuery to parse pages. It's not as fast or 'script-like' in nature, but jQuery selectors + real JavaScript/DOM is a lifesaver on more complicated (or malformed) pages.