Data mining. Textbook - Pavel Minakov

The example of creating two datasets, one for implicit clustering and one for managed clustering, makes the difference between the two methods easy to see. In the first example, the results may match in one case and differ in another. If the method is good at finding interesting relationships (as it usually is), it tells us something useful about the overall structure of the data. If it is poor at identifying relationships, it tells us very little.

Let’s say we are developing a system for determining the direction of a new product and want to identify similar products. Since the direction of a product cannot be measured outside the system, we have to find relationships between products based on their names. If there is a good rule for relating similar products, this information is very useful, because it lets us find interesting relationships by identifying similar products that appear close to each other. However, if the relationship between two products is not obvious, the products are probably just unrelated, in which case the choice of feature detection method may not matter much. On the other hand, if the relationship is not obvious but extremely useful (as in the example above), we can start to learn how the product name relates to the process the product went through. This is an example of how different methods can produce very different results.
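To make the idea of a name-based rule concrete, here is a minimal sketch in Python. It assumes one possible rule, which the text does not prescribe: two products are treated as similar when their names share enough words, measured by Jaccard similarity. The function names, the 0.5 threshold, and the product names are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def name_tokens(name: str) -> set[str]:
    """Lowercase a product name and split it into a set of word tokens."""
    return set(name.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_pairs(names: list[str], threshold: float = 0.5):
    """Return (name_a, name_b, score) for pairs at or above threshold, best first."""
    pairs = []
    for x, y in combinations(names, 2):
        score = jaccard(name_tokens(x), name_tokens(y))
        if score >= threshold:
            pairs.append((x, y, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Hypothetical product names, invented for this illustration.
products = [
    "Acme Steel Water Bottle",
    "Acme Steel Water Flask",
    "Orbit Wireless Mouse",
    "Orbit Wireless Keyboard",
    "Garden Hose 20m",
]

for a, b, score in similar_pairs(products):
    print(f"{score:.2f}  {a}  <->  {b}")
```

The threshold plays the role of the rule here. When names carry little information about how products relate, no threshold separates related pairs from unrelated ones, which is the case where the choice of method matters little.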

Beyond the characteristics of different methods, there are also different possible techniques. For example, when I say that my system uses image recognition, it doesn’t necessarily mean that the process the product goes through uses image recognition. If there are product images that we have taken in the past, or if we have captured some input from a product image, the resulting system will probably not use image recognition. It could be something completely different and much more complex. Each of these methods is capable of identifying very different things, and the result may depend on the characteristics of the data actually used. This means it’s not enough to look at a specific type of tool: we also need to look at which type of tool will be used for a particular type of process. This is an example of why data analysis should not focus only on the problem being solved. Most likely, the system goes through many different processes, so we need to look at how different tools will be used to create a relationship between two points, and then decide which type of data to consider.

Often, we will be more concerned with how the method will be applied. For example, we might want to see what type of data is most likely to be useful for finding a relationship. There is not much difference in how natural language processing is applied from case to case, which makes it a good choice when we want to find a relationship. However, natural language processing does not capture every possible relationship. It is often useful when we want to take a huge number of small steps, but it does little when we want to go really deep. Natural language processing can establish relationships between data that other methods cannot. This is one of the reasons why it can be useful but is not necessary.
