Definition:
Web mining is the process of using data mining techniques and algorithms to extract information directly from the Web: from Web documents and Web services, hyperlinks, and server logs. The goal of Web mining is to find patterns in Web data by collecting and analyzing information, in order to gain insights into trends, the industry, and users in general.
Types of web mining:
- Web content mining: The process of extracting useful information from the contents of Web pages and Web documents, which are mostly text, images, and audio or video files.
- Web structure mining: The process of analyzing the nodes and connections of a website using graph theory. Two kinds of structure can be studied this way: how a website connects to other sites, and the document structure of the website itself, that is, how each page connects to the others.
- Web usage mining: The process of extracting patterns and information from server logs to gain insights into user activity: where users come from, how many of them have clicked on an item on the site, and what types of activity take place on the site.
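The three types above can be illustrated on toy data. The following is a minimal sketch, not a reference implementation: it uses only the Python standard library, the names `PageParser`, `mine_page`, and `count_hits` are made up for this example, the sample HTML and log lines are invented, and the log lines are assumed to follow the Common Log Format.

```python
from collections import Counter
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects text (content mining) and outgoing links (structure mining)."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Each <a href="..."> is an edge from this page to another node.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Keep non-empty text. A real miner would also skip <script>/<style>.
        if data.strip():
            self.text_parts.append(data.strip())


def mine_page(html):
    """Return (text, links) for one HTML document."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


def count_hits(log_lines):
    """Usage mining: count requests per path (assumes Common Log Format)."""
    hits = Counter()
    for line in log_lines:
        try:
            request = line.split('"')[1]  # e.g. 'GET /index.html HTTP/1.1'
            path = request.split()[1]
        except IndexError:
            continue  # skip lines that do not match the assumed format
        hits[path] += 1
    return hits


# Invented sample data, for illustration only.
SAMPLE_HTML = (
    "<html><body><h1>Shop</h1><p>New arrivals</p>"
    '<a href="/cart">Cart</a><a href="https://example.org/">Partner</a>'
    "</body></html>"
)
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '203.0.113.9 - - [10/Oct/2023:13:56:01 +0000] "GET /cart HTTP/1.1" 200 512',
    '203.0.113.5 - - [10/Oct/2023:13:57:12 +0000] "GET /index.html HTTP/1.1" 200 2326',
]

text, links = mine_page(SAMPLE_HTML)
hits = count_hits(SAMPLE_LOG)
```

Here `text` feeds content mining, `links` gives the edges of the link graph for structure mining, and `hits` is a first usage statistic. A real project would likely use a more robust HTML parser and log format handling.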
Web Mining vs. Data Mining
When comparing web mining to traditional data mining, there are three main differences to consider:
- Scale: In traditional data mining, processing 1 million records from a database would be considered a large job. In web mining, even 10 million pages would not be considered a very large number.
- Access: When mining corporate data, the data is private and often requires access rights to read it. For web mining, the data is public and rarely requires access rights. However, web mining has additional limitations, due to the implicit agreement with webmasters regarding automated access to this data. Under this implicit agreement, a webmaster allows crawlers to access useful data on the website, and in return the crawler promises not to overload the site and has the potential to drive more traffic to the website once the search index is published. With web mining, there is often no such index, so the webmaster gets nothing in return, which means that the crawler has to be very careful during the crawling process so as not to cause any problems for the webmaster.
- Structure: A traditional data mining task gets information from a database, which provides a certain level of explicit structure. A typical web mining task processes unstructured or semi-structured data from web pages. Even when the underlying information for a web page comes from a database, that structure is often obscured by the HTML markup.
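The access point above, being careful not to burden the webmaster, is commonly put into practice by honoring `robots.txt` and pausing between requests. The sketch below shows one way this might look with Python's standard `urllib.robotparser`. The user agent name, the sample `robots.txt` text, the URLs, and the helpers `build_rules`, `plan_crawl`, and `crawl` are all assumptions made for illustration. The rules are parsed from an inline string so the example runs without network access.

```python
import time
from urllib import robotparser

USER_AGENT = "example-miner"  # hypothetical crawler name

# Invented robots.txt content. A real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def plan_crawl(rules, urls, default_delay=1.0):
    """Keep only URLs the site allows, and pick a delay between requests."""
    # crawl_delay() returns None when the site does not specify one.
    delay = rules.crawl_delay(USER_AGENT) or default_delay
    allowed = [u for u in urls if rules.can_fetch(USER_AGENT, u)]
    return allowed, delay


def crawl(urls, delay, fetch):
    """Fetch URLs one at a time, sleeping between requests.

    `fetch` is any callable taking a URL. It is passed in so this sketch
    does not assume a particular HTTP client.
    """
    for i, url in enumerate(urls):
        if i:
            time.sleep(delay)
        yield url, fetch(url)


rules = build_rules(ROBOTS_TXT)
allowed, delay = plan_crawl(
    rules,
    [
        "https://example.com/products.html",
        "https://example.com/private/data.html",
    ],
)
```

With the sample rules, only the products page remains in `allowed`, and `delay` follows the site's `Crawl-delay`. Passing `fetch` as a parameter keeps the politeness logic separate from whichever HTTP client is used.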