Oct 11

Growing dependence on the internet has led many people to rely on the World Wide Web for all kinds of information. Be it product prices, weather forecasts, market information, job postings or competitor research, almost everyone now looks for it online. This requires collecting data from various websites.

Although different websites contain the same type of data, they present it in different styles. To search for information on a particular topic, one has to browse through search engines, read the data on each website, then copy and paste it into a document. This manual way of extracting data is time-consuming and inefficient, which has led to the automation of the whole data extraction process.

Data can be extracted from websites more easily and efficiently by a method known as data scraping. Data scraping is a technique in which a program or script is written to extract data from websites in the form of text, images or any other output. A data scraper is designed to work the way a person would collect data, i.e. visit the website by entering the URL, request the web pages, copy the data and paste it into another document. Data scraping tools do this entire job in less time and with more accuracy.
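
The steps described above (request a page, pick out the text and images, keep them for later) can be sketched with Python's standard library. This is only a minimal illustration, not how any particular scraping tool works: the class name TextAndImageScraper, the helper functions and the sample page are all made up for the example, and fetch() is defined but not called so that the sketch runs without a network connection.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextAndImageScraper(HTMLParser):
    """Collects visible text and image URLs from an HTML page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.image_urls = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(src)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style content.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def scrape_html(html):
    """Return (text fragments, image URLs) found in an HTML string."""
    parser = TextAndImageScraper()
    parser.feed(html)
    parser.close()
    return parser.text_parts, parser.image_urls


def fetch(url):
    """Request a page by URL and return its HTML as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


# Made-up sample page standing in for a downloaded product page.
SAMPLE_PAGE = """
<html><body>
  <h1>Laptop X</h1>
  <p>Price: $499</p>
  <img src="/images/laptop-x.png">
  <script>var ignored = 1;</script>
</body></html>
"""

text, images = scrape_html(SAMPLE_PAGE)
print(text)
print(images)
```

For a live page, scrape_html(fetch(url)) would chain the two steps. Real sites vary widely in layout, so a production scraper usually needs site-specific rules on top of something like this.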

The process of capturing data from target websites through software programs is known as web harvesting. Web harvesting uses a web crawler, directs it to specific URLs, and enables web search applications and search engines to locate the useful data. Search engines built on web harvesting index only the URLs to which they are directed, and hence can be faster than general-purpose search engines.
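
As a rough sketch of the idea, the crawler below follows only links that stay under a chosen URL prefix and builds a small word-to-URL index. The function name harvest, its parameters, and the in-memory FAKE_WEB dictionary are assumptions made for illustration; a real harvester would download pages over HTTP and would also need to respect robots.txt and rate limits.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Collects link targets and lowercase words from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.words.extend(re.findall(r"[a-z0-9]+", data.lower()))


def harvest(seed_urls, fetch, allowed_prefix, max_pages=50):
    """Crawl breadth-first from seed_urls, staying under allowed_prefix.

    fetch is any callable that returns the HTML for a URL, or None.
    Returns a dict mapping each word to the set of URLs containing it.
    """
    index = {}
    seen = set(seed_urls)
    queue = deque(seed_urls)
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages += 1
        parser = LinkAndTextParser()
        parser.feed(html)
        for word in parser.words:
            index.setdefault(word, set()).add(url)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target.startswith(allowed_prefix) and target not in seen:
                seen.add(target)
                queue.append(target)
    return index


# Made-up pages standing in for the web, so the sketch runs offline.
FAKE_WEB = {
    "http://shop.example/": '<a href="/laptops">Laptops</a> '
                            '<a href="http://other.example/">Ads</a>',
    "http://shop.example/laptops": "<p>Laptop X price 499</p>",
    "http://other.example/": "<p>unrelated laptop news</p>",
}

index = harvest(["http://shop.example/"], FAKE_WEB.get, "http://shop.example/")
print(sorted(index["laptop"]))
```

Because the link to other.example falls outside the allowed prefix, that page is never fetched or indexed, which is what keeps a directed index small.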

A website scraper is software used to collect data from target URLs. Website scrapers eliminate the need for people to handle the web data by hand. The extracted data can be exported in different formats such as text files, XML files, MS Access, Microsoft SQL Server, MySQL or CSV files.
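
To give a feel for the export step, here is a small sketch that writes the same scraped records as CSV and as XML using only Python's standard library. The records, the field names and the element names products and item are invented for the example; exporting to a database such as MySQL or SQL Server would need the corresponding driver and is not shown.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up records standing in for data a scraper has already extracted.
records = [
    {"product": "Laptop X", "price": "499"},
    {"product": "Phone Y", "price": "299"},
]


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["product", "price"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_xml(rows):
    """Serialize rows to an XML string, one <item> element per row."""
    root = ET.Element("products")
    for row in rows:
        item = ET.SubElement(root, "item")
        for key, value in row.items():
            ET.SubElement(item, key).text = value
    return ET.tostring(root, encoding="unicode")


csv_text = to_csv(records)
xml_text = to_xml(records)
print(csv_text)
print(xml_text)
```

Writing either string to a file with open(path, "w", newline="") gives a file that spreadsheet or database import tools can read.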

Data on websites is stored in HTML, which is machine-readable. The process of capturing data from these HTML files is known as screen scraping. The term comes from terminal applications: the data is extracted from what the terminal screen displays rather than directly from the underlying database, and is then made available in human-readable form. Classic screen scrapers are software scripts that read the terminal's memory through its auxiliary port.

Data extraction and web scraping tools have made data much easier to obtain. One no longer needs to spend hours or days gathering data from websites.


