Showing posts with label java. Show all posts

Saturday, January 28, 2012

Web Crawler Tools


What are the best Java-based web crawler tools?

Crawler4j

Crawler4j is an open source Java crawler which provides a simple interface for crawling the Web.

Heritrix

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix is designed to respect the robots.txt exclusion directives and META robots tags.

WebSPHINX

WebSPHINX (Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for web crawlers. A web crawler (also called a robot or spider) is a program that browses and processes Web pages automatically. WebSPHINX consists of two parts: the Crawler Workbench and the WebSPHINX class library.

Nutch

Apache Nutch is an open source web-search software project. Stemming from Apache Lucene, it now builds on Apache Solr, adding web-specifics such as a crawler, a link-graph database, and parsing support handled by Apache Tika for HTML and an array of other document formats.

WebLech

WebLech is a fully featured web site download/mirror tool in Java, which supports many features required to download websites and emulate standard web-browser behaviour as much as possible. WebLech is multithreaded and will feature a GUI console.

Arale

While many bots are focused on page indexing, Arale is primarily designed for personal use. It fits the needs of advanced web surfers and web developers.

HyperSpider

HyperSpider (Java app) collects the link structure of a website. It imports and exports data from/to databases and CSV files, and exports to Graphviz DOT, Resource Description Framework (RDF/DC), XML Topic Maps (XTM), Prolog, and HTML. It visualizes the structure as a hierarchy and as a map.

Arachnid

Arachnid is a Java-based web spider framework. It includes a simple HTML parser object that parses an input stream containing HTML content. Simple Web spiders can be created by sub-classing Arachnid and adding a few lines of code called after each page of a Web site is parsed.

Spindle

Spindle is a web indexing/search tool built on top of the Lucene toolkit. It includes an HTTP spider that is used to build the index, and a search class that is used to search the index. In addition, support is provided for the Bitmechanic listlib JSP TagLib, so that a search can be added to a JSP-based site without writing any Java classes.

Spider

Spider is a complete standalone Java application designed to easily integrate varied data sources. It is an XML-driven framework for data retrieval from network-accessible sources with scheduled pulling. It is highly extensible, provides hooks for custom post-processing and configuration, and is implemented as an Avalon/Keel framework data feed service.

LARM

LARM is a 100% Java search solution for end-users of the Jakarta Lucene search engine framework. It contains methods for indexing files, database tables, and a crawler for indexing web sites. Well, it will be. At the moment we only have some specifications. It's up to you to turn this into a working program. Its predecessor was an experimental crawler called larm-web crawler available from the Jakarta project.

Metis

Metis is a tool to collect information from the content of web sites. This was written for the Ideahamster Group for finding the competitive intelligence weight of a web server and assists in satisfying the CI Scouting portion of the Open Source Security Testing Methodology Manual (OSSTMM).

Aperture

Aperture crawls information systems such as file systems, websites, mailboxes and mail servers. It can extract full text and metadata from many common file formats. Aperture has a flexible architecture that can be extended with custom file formats, data sources, etc., with support for deployment on OSGi platforms.

Smart and Simple Web Crawler

A framework that crawls a web site, with integrated Lucene support. It supports two crawling modes, Max Iterations and Max Depth, and provides a filter interface to limit the links to be crawled. Filters can be combined with AND, OR and NOT.

Web Harvest

Web-Harvest collects Web pages and extracts useful data from them. It leverages technologies for text/XML manipulation such as XSLT, XQuery and regular expressions. Web-Harvest mainly focuses on HTML/XML-based web sites. However, it can be extended with custom Java libraries to augment its extraction capabilities.

Criteria for Selecting a Tool

  1. Multi-Threaded Structure.
  2. Control for Depth.
  3. Control for Redundant Links.
  4. Parameters to manage the crawler: maximum page size to be crawled, maximum number of pages to be crawled, and maximum running time.
  5. Documentation.
I use crawler4j for crawling the web.
With it, you can set up a multi-threaded web crawler in 5 minutes!
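
To make criteria 2 to 4 more concrete, here is a minimal sketch of how a crawler might enforce a depth limit, skip redundant links, and stop after a maximum number of pages. It does not use crawler4j or any other tool from the list: the class name, the method names and the in-memory "web" (a map from a page to the links found on it) are all hypothetical, and multi-threading is left out to keep the example short.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the "web" is an in-memory map from a URL to the links on that page.
class CrawlLimitsSketch {

    // Breadth-first crawl that honours a depth limit and a page budget.
    static List<String> crawl(Map<String, List<String>> web, String seed, int maxDepth, int maxPages) {
        List<String> visited = new ArrayList<>();
        // Depth at which each URL was first discovered; also serves as the "already seen" set.
        Map<String, Integer> depthOf = new HashMap<>();
        Deque<String> frontier = new ArrayDeque<>();
        depthOf.put(seed, 0);
        frontier.add(seed);

        // Control for the maximum number of pages.
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            visited.add(url);
            int depth = depthOf.get(url);
            if (depth >= maxDepth) {
                continue; // control for depth: do not follow links from this page
            }
            for (String link : web.getOrDefault(url, List.of())) {
                // Control for redundant links: enqueue a URL only the first time it is seen.
                if (depthOf.putIfAbsent(link, depth + 1) == null) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    // A tiny made-up site used only for the demonstration.
    static Map<String, List<String>> sampleWeb() {
        return Map.of(
                "a", List.of("b", "c", "a"),
                "b", List.of("c", "d"),
                "c", List.of("a"),
                "d", List.of("e"));
    }

    public static void main(String[] args) {
        System.out.println(crawl(sampleWeb(), "a", 1, 10)); // prints [a, b, c]
        System.out.println(crawl(sampleWeb(), "a", 5, 2));  // prints [a, b]
        System.out.println(crawl(sampleWeb(), "a", 5, 10)); // prints [a, b, c, d, e]
    }
}
```

A real tool adds page fetching, politeness delays and worker threads on top of this, but the same three checks are what the depth, redundancy and page-count settings usually control.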

Saturday, July 23, 2011

Java Send e-mail

In this post, I will use the JavaMail API to write a Java program that sends an e-mail to a given e-mail address.
The JavaMail API provides a platform-independent and protocol-independent framework for building mail and messaging applications.
Download the library and add it to your classpath. Then start your development environment and use this code to send an e-mail.


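import java.util.Properties;
import javax.mail.Message;
import javax.mail.MessagingException;
import javax.mail.PasswordAuthentication;
import javax.mail.Session;
import javax.mail.Transport;
import javax.mail.internet.InternetAddress;
import javax.mail.internet.MimeMessage;

public class SendEmail {

    public static void main(String[] args) {
        // Placeholder values: replace them with your own account details and message.
        final String sender = "sender@gmail.com";
        final String password = "yourpassword";
        final String receiver = "receiver@gmail.com";
        final String subject = "Test subject";
        final String body = "Test message body";

        // SMTP over SSL on port 465, with authentication enabled.
        Properties props = new Properties();
        props.put("mail.smtp.host", "smtp.gmail.com");
        props.put("mail.smtp.socketFactory.port", "465");
        props.put("mail.smtp.socketFactory.class", "javax.net.ssl.SSLSocketFactory");
        props.put("mail.smtp.auth", "true");
        props.put("mail.smtp.port", "465");
        props.put("mail.debug", "true");

        // The authenticator supplies the sender's credentials when the server asks for them.
        Session session = Session.getDefaultInstance(props,
                new javax.mail.Authenticator() {
                    protected PasswordAuthentication getPasswordAuthentication() {
                        return new PasswordAuthentication(sender, password);
                    }
                });
        try {
            Message message = new MimeMessage(session);
            message.setFrom(new InternetAddress(sender));
            message.setRecipients(Message.RecipientType.TO, InternetAddress.parse(receiver));
            message.setSubject(subject);
            message.setText(body);

            Transport.send(message);
            System.out.println("Email was sent!");
        } catch (MessagingException e) {
            throw new RuntimeException(e);
        }
    }
}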


Change the values to match your own accounts, for example the sender, password and receiver values.
I hope this is useful.

Saturday, July 2, 2011

Java Mysql Database Connection

This Java program tries to connect to the named database on your local MySQL server.


import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class JavaMysqlDatabaseConnection {

    public static void main(String[] args) {

        Connection connection = null;
        // Placeholder values: replace them with your own database settings.
        String databaseName = "yourDatabaseName";
        String userName = "yourUserName";
        String userPassword = "yourUserPassword";
        try {
            // Load the MySQL JDBC driver class.
            Class.forName("com.mysql.jdbc.Driver");
            // The JDBC URL names the host first, then the database; localhost is the local MySQL server.
            connection = DriverManager.getConnection("jdbc:mysql://localhost/" + databaseName,
                    userName, userPassword);

            if (!connection.isClosed()) {
                System.out.println("Successfully connected to Database");
            } else {
                System.out.println("Not Connected to Database");
            }
        } catch (Exception e) {
            System.err.println("Exception: " + e.getMessage());
        } finally {
            try {
                if (connection != null) {
                    connection.close();
                }
            } catch (SQLException e) {
                // Nothing more can be done if closing the connection fails.
            }
        }
    }
}


Getting a connection, step by step:

1. Open your development platform.
2. Create a Java class.
3. Copy this code to your class.
4. Change the parameters according to your database, and get a connection.

In the above example, "com.mysql.jdbc.Driver" is the name of the JDBC driver class that you want to load.
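
For step 4, it may help to see how the connection parameters fit together. The sketch below is only an illustration under some assumptions: the class and method names are hypothetical, the host is assumed to be localhost, and 3306 is the default MySQL port. It builds the JDBC URL from its parts and uses try-with-resources so that the connection is closed automatically.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical helper class, not part of the program above.
class MysqlConnectionSketch {

    // Builds a URL of the form jdbc:mysql://host:port/databaseName.
    static String jdbcUrl(String host, int port, String databaseName) {
        return "jdbc:mysql://" + host + ":" + port + "/" + databaseName;
    }

    // Returns true if a connection could be opened; the connection is closed on exit.
    static boolean canConnect(String url, String userName, String userPassword) {
        try (Connection connection = DriverManager.getConnection(url, userName, userPassword)) {
            return !connection.isClosed();
        } catch (SQLException e) {
            // Also reached when no MySQL driver is on the classpath.
            System.err.println("Connection failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder values: replace them with your own database settings.
        String url = jdbcUrl("localhost", 3306, "yourDatabaseName");
        System.out.println(url); // prints jdbc:mysql://localhost:3306/yourDatabaseName
        System.out.println(canConnect(url, "yourUserName", "yourUserPassword"));
    }
}
```

Keeping the URL construction in one small method makes it easier to switch the host, port or database name without touching the rest of the code.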
I hope this is useful.

Tuesday, May 17, 2011

Jsoup HTML Parser


There are many open source Java HTML parsers. In this blog post I will give some information about jsoup, an open source Java HTML parser that I have been working with recently. Instead of jsoup, you can use HTML Parser, Jericho HTML Parser, or any other parser library you prefer.

jsoup is a Java library for working with real-world HTML.

By using this library,

  • You can parse HTML from a URL, file, or string.
  • You can find and extract data, using DOM traversal or CSS selectors.
  • You can manipulate the HTML elements, attributes, and text.
  • You can clean user-submitted content against a safe white-list.

Getting Source Code.

Download the library and use it in your project. The current release version is 1.5.2.

Visit the example and get started with jsoup to parse HTML.

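If you use Maven to manage the dependencies in your Java project, you do not need to download it; just place the following into your POM's <dependencies> section:

<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.5.2</version>
</dependency>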

This post is an introduction to jsoup. In other posts I will give some examples that I use in my real project. The examples will be about getting elements from HTML.

I hope this is useful.

Sunday, May 1, 2011

Java Primitive Data Types

In this post, I will give some information about Java primitive data types, their ranges, and their default values.

The Java programming language is statically-typed, which means that all variables must first be declared before they can be used. Before using any variable in your program, you must declare the variable with its type and name.

For example:

int data = 1;

This declaration tells your program that a variable named "data" exists, holds numerical data, and has an initial value of 1.

Primitive data types and their range:

boolean: 1 bit of information (the actual storage size is not precisely defined)

range – only the values true and false

byte: 1 byte

range – from -128 to 127

short: 2 bytes

range – from -32,768 to 32,767

int: 4 bytes

range – from -2,147,483,648 to 2,147,483,647

long: 8 bytes

range – from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

float: 4 bytes

range – from 1.40129846432481707e-45 to 3.40282346638528860e+38 (positive or negative)

double: 8 bytes

range – from 4.94065645841246544e-324 to 1.79769313486231570e+308 (positive or negative)

char: 2 bytes, unsigned, Unicode

range – from 0 to 65,535

String: a sequence of characters (not a primitive type, but the class java.lang.String)
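
You do not have to memorize these limits. As a small sketch (the class name and the range helper are hypothetical), the wrapper classes in java.lang expose them as MIN_VALUE and MAX_VALUE constants:

```java
// Hypothetical demo class that prints the limits of the numeric primitive types.
class PrimitiveRanges {

    // Formats a minimum/maximum pair as "from X to Y".
    static String range(Object min, Object max) {
        return "from " + min + " to " + max;
    }

    public static void main(String[] args) {
        System.out.println("byte:   " + range(Byte.MIN_VALUE, Byte.MAX_VALUE)); // from -128 to 127
        System.out.println("short:  " + range(Short.MIN_VALUE, Short.MAX_VALUE));
        System.out.println("int:    " + range(Integer.MIN_VALUE, Integer.MAX_VALUE));
        System.out.println("long:   " + range(Long.MIN_VALUE, Long.MAX_VALUE));
        // For float and double, MIN_VALUE is the smallest positive value, not the most negative one.
        System.out.println("float:  " + range(Float.MIN_VALUE, Float.MAX_VALUE));
        System.out.println("double: " + range(Double.MIN_VALUE, Double.MAX_VALUE));
        // char is unsigned; the cast to int prints the numeric code instead of the character.
        System.out.println("char:   " + range((int) Character.MIN_VALUE, (int) Character.MAX_VALUE));
    }
}
```

Java prints the floating-point limits in its own shortest notation, so they appear with fewer digits than in the list.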

We must know the ranges of the types to use them effectively in our programs. For example, suppose we have a counter that starts at 0 and increases continuously. Initially, the primitive type int is sufficient for this counter. But if the counter eventually grows past the maximum int value (2,147,483,647), it overflows and wraps around to a negative number, so the program no longer works correctly.
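
The counter problem can be reproduced in a few lines. This is a minimal sketch with hypothetical class and method names: it shows the silent wrap-around of int, the same step done with a long, and Math.addExact, which throws an ArithmeticException instead of wrapping.

```java
// Hypothetical demo class for the counter example.
class CounterOverflow {

    // Plain int arithmetic wraps around silently when the range is exceeded.
    static int incrementWrapping(int counter) {
        return counter + 1;
    }

    // Math.addExact throws ArithmeticException instead of wrapping around.
    static int incrementChecked(int counter) {
        return Math.addExact(counter, 1);
    }

    // A long counter has room for values far beyond the int range.
    static long incrementLong(long counter) {
        return counter + 1;
    }

    public static void main(String[] args) {
        int counter = Integer.MAX_VALUE;                // 2147483647
        System.out.println(incrementWrapping(counter)); // prints -2147483648
        System.out.println(incrementLong(counter));     // prints 2147483648
        try {
            incrementChecked(counter);
        } catch (ArithmeticException e) {
            System.out.println("Overflow detected: " + e.getMessage());
        }
    }
}
```

Choosing long from the start, or using the checked method where an overflow must never go unnoticed, avoids the silent wrap-around.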

If you declare a field without assigning a value to it, it is assigned the default value of its type. Local variables are different: they have no default value and must be assigned before they are used.

Default values for the data types:

Data Type Default Value

byte :0

short :0

int :0

long :0L

float :0.0f

double :0.0d

char :'\u0000'

boolean :false

String :null (String is a class, not a primitive type; every object reference defaults to null)
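
These defaults are easy to check. In the sketch below (the class and field names are hypothetical), none of the fields is assigned a value, so each one holds the default of its type:

```java
// Hypothetical demo class: fields that are never assigned keep their default values.
class DefaultValues {
    byte byteField;
    short shortField;
    int intField;
    long longField;
    float floatField;
    double doubleField;
    char charField;
    boolean booleanField;
    String stringField;

    public static void main(String[] args) {
        DefaultValues values = new DefaultValues();
        System.out.println("byte:    " + values.byteField);
        System.out.println("short:   " + values.shortField);
        System.out.println("int:     " + values.intField);
        System.out.println("long:    " + values.longField);
        System.out.println("float:   " + values.floatField);
        System.out.println("double:  " + values.doubleField);
        // The default char is the character with code 0, so print its numeric code.
        System.out.println("char:    " + (int) values.charField);
        System.out.println("boolean: " + values.booleanField);
        System.out.println("String:  " + values.stringField);

        // A local variable gets no default. The two lines below would not compile:
        // int local;
        // System.out.println(local);
    }
}
```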

I hope this is useful.