There are many open source Java HTML parsers. In this blog post I will try to give some information about Jsoup, an open source Java HTML parser that I have been working with recently. Instead of Jsoup, you can use HTML Parser, Jericho HTML Parser, or any other parser library you prefer.
jsoup is a Java library for working with real-world HTML.
By using this library,
- You can parse HTML from a URL, file, or string.
- You can find and extract data, using DOM traversal or CSS selectors.
- You can manipulate the HTML elements, attributes, and text.
- You can clean user-submitted content against a safe white-list.
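To make the list above a little more concrete, here is a minimal sketch of parsing a string, finding a link with a CSS selector, and changing an attribute. The class name `JsoupIntro`, the helper methods, and the sample markup are made up for this illustration; only `Jsoup.parse`, `select`, `attr`, `title`, and `body().html()` come from the library, and the sketch assumes the jsoup jar is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupIntro {
    // Hypothetical sample markup, invented for this illustration.
    static final String HTML =
        "<html><head><title>Sample page</title></head>"
      + "<body><p>Visit <a href=\"http://example.com/\">the example site</a>.</p></body></html>";

    // Parse the string and return the href of the first link matched by a CSS selector,
    // or null when the document contains no link with an href attribute.
    static String firstLinkHref(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.select("a[href]").first();
        return link == null ? null : link.attr("href");
    }

    // Parse the string, set rel="nofollow" on every link, and return the body HTML.
    static String addNofollow(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a").attr("rel", "nofollow");
        return doc.body().html();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(doc.title());          // text of the <title> element
        System.out.println(firstLinkHref(HTML));  // href of the first link
        System.out.println(addNofollow(HTML));    // body markup after the attribute change
    }
}
```

Parsing from a URL or a file follows the same pattern through `Jsoup.connect(url).get()` and `Jsoup.parse(file, charsetName)`. Cleaning user-submitted content goes through `Jsoup.clean`, which in the 1.5.x releases takes a `Whitelist` such as `Whitelist.basic()`; I left it out of the sketch because recent jsoup releases renamed that class to `Safelist`.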
Getting Source Code.
Download the library and use it in your project. The current release version is 1.5.2.
Then take a look at the examples and start using jsoup to parse HTML.
If you use Maven to manage the dependencies in your Java project, you do not need to download anything; just declare jsoup (groupId org.jsoup, artifactId jsoup, version 1.5.2) in the dependencies section of your POM.
This post is an introduction to Jsoup. In later posts I will give some examples that I use in a real project. They will be about getting elements out of HTML.
I hope this is useful.