Data Mining Methods. Cluster analysis is a family of methods for partitioning data into groups of objects with similar characteristics.

Modern practical methods of data analysis and recognition are in demand in technical fields and the humanities, in science and industry, in business and finance. This overview presents the algorithmic essence of the main approaches; understanding it helps to use recognition and classification methods in data analysis more effectively.

1. The recognition task (supervised classification) and the state of the art in practical methods for solving it. The main stages in the development of recognition theory and practice are the creation of heuristic algorithms, recognition models and model optimization, and an algebraic approach to model correction. The main approaches are based on the construction of separating surfaces, potential functions, statistical and neural network models, decision trees, and others.

The main approaches and algorithms here are combinatorial-logical recognition methods (estimate-calculation models, or algorithms based on the principle of partial precedence) developed at the Dorodnicyn Computing Centre of the Russian Academy of Sciences. These models are based on the idea of searching for important partial precedents in the feature descriptions of the original data (informative fragments of feature values, or representative sets). For real-valued features, optimal neighborhoods of informative fragments are found. In other terminology, these partial precedents are called knowledge, or logical patterns, that relate the values of the original features to the value being recognized or predicted. The knowledge found constitutes important information about the studied classes (images) of objects. It is used directly in solving recognition or forecasting problems, and it gives a visual representation of the interdependencies present in the data, which is of independent value to researchers and can serve as the basis for subsequently building accurate models of the objects, situations, phenomena or processes under study. From the body of knowledge found, such useful quantities as the degree of importance (informativeness) of features and objects, logical correlations of features, and logical descriptions of object classes are also calculated, and the problem of feature-space minimization is solved.

2. Methods for solving the main problem of cluster analysis (unsupervised classification): finding groupings of objects (clusters) in a given sample of multidimensional data. A brief review of the main approaches to cluster analysis is given, together with a description of the committee method for synthesizing collective solutions.

3. RECOGNITION, a software system for data mining, recognition and forecasting. The requirements for the system are based on the ideas of universality and intelligence. Universality is understood as applicability to the widest possible range of tasks (in terms of the dimensions, type, quality and structure of the data and of the quantities to be calculated). Intelligence is understood as the presence of self-tuning elements and the ability of an unskilled user to solve problems successfully and automatically. Within the RECOGNITION system, a library of programs has been developed that implements linear, combinatorial-logical, statistical, neural network and hybrid methods for forecasting, classification and knowledge extraction from precedents, as well as collective forecasting and classification methods.


1. Recognition algorithms based on the calculation of estimates. Recognition is carried out by comparing the recognized object with reference objects over various sets of features and applying voting procedures. The optimal parameters of the decision rule and the voting procedure are found by solving the problem of optimizing the recognition model: parameter values are determined for which the recognition accuracy (the number of correct answers on the training sample) is maximal.

2. Algorithms for voting over irreducible ("dead-end") tests. The recognized object is compared with the reference objects over various "informative" subsets of features. Irreducible tests (or their analogues for real-valued features) of various random sub-tables of the original training table are used as such feature subsystems.

3. Algorithms for voting over logical patterns of classes. Based on the training sample, sets of logical patterns of each class are computed: sets of features and intervals of their values that are characteristic of the class. When a new object is recognized, the number of logical patterns of each class that hold on the recognized object is counted; each such "hit" counts as a "vote" in favor of the corresponding class. The object is assigned to the class with the maximum normalized sum of "votes". This method also makes it possible to estimate feature weights and logical correlations of features, to build logical descriptions of classes, and to find minimal feature subspaces.

4. Algorithms for statistical weighted voting.

Based on the training sample, statistically justified logical patterns of the classes are found. When a new object is recognized, an estimate of the probability that it belongs to each class is calculated as a weighted sum of "votes".

5. Linear machine.

For each class of objects, a linear function is found. The recognized object is assigned to the class whose function takes the maximum value on that object. The optimal linear functions of the classes are found by solving the problem of finding a maximal consistent subsystem of the system of linear inequalities formed from the training sample. As a result, a special piecewise linear surface is obtained that correctly separates the maximum number of training-sample elements.
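To make the decision rule concrete, here is a minimal sketch (not the System's implementation; the class names and weights are made up for illustration) of assigning an object to the class whose linear function takes the largest value.

```python
import numpy as np

# "Linear machine" decision rule: one linear function per class; the object is
# assigned to the class whose function attains the maximum value.
# The weights are illustrative; in practice they would be fitted, e.g. by maximizing
# the number of correctly separated training objects.
weights = {                      # hypothetical (w_c, b_c) per class
    "class_A": (np.array([1.0, -0.5]), 0.2),
    "class_B": (np.array([-0.3, 0.8]), -0.1),
}

def classify(x):
    scores = {c: w @ x + b for c, (w, b) in weights.items()}
    return max(scores, key=scores.get)   # argmax over the class linear functions

print(classify(np.array([0.4, 0.9])))    # -> "class_B" for these illustrative weights
```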

6. Fisher's linear discriminant.

A classical statistical method for constructing piecewise linear surfaces that separate classes. Favorable conditions for applying Fisher's linear discriminant are linear separability of the classes, a two-class (dichotomous) setting, a "simple structure" of the classes, non-degenerate covariance matrices, and the absence of outliers. The modification created here makes it possible to use the method successfully in "unfavorable" cases as well.
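As an illustration, a short sketch using the standard linear discriminant from scikit-learn on synthetic two-class data; the modification for "unfavorable" cases described above is not reproduced here.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Classical Fisher/linear discriminant on synthetic two-class data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(lda.predict([[2.5, 2.5]]))   # class predicted by the linear discriminant
```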

7. Method of k-nearest neighbors.

A classical statistical method. The recognized object is assigned to the class that contributes the largest number of its nearest neighbors. The optimal number of neighbors and the a priori class probabilities are estimated from the training sample.
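A hedged sketch of the k-nearest-neighbors rule with the number of neighbors chosen by cross-validation over the training sample; the data below are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# k nearest neighbours: the optimal k is estimated from the training sample by cross-validation.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)

search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
search.fit(X, y)
print(search.best_params_["n_neighbors"])      # chosen number of neighbours
print(search.predict([[3.5, 3.8]]))            # class with the most neighbours around the point
```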

8. Neural Network Recognition Model with Back Propagation

A modification of the well-known backpropagation method for training a neural network in pattern recognition has been created. As the quality criterion for the current network parameters, a hybrid criterion is used that takes into account both the sum of squared deviations of the output signals from the required values and the number of misclassifications on the training set.
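A small sketch of such a hybrid quality criterion, assuming a simple weighted combination of the squared output error and the misclassification count; the weight alpha and the numbers are illustrative assumptions, not taken from the source.

```python
import numpy as np

# Hybrid criterion sketch: squared deviation of outputs from targets plus a penalty
# proportional to the number of misclassified objects (one-hot targets assumed).
def hybrid_criterion(outputs, targets, alpha=1.0):
    sq_error = np.sum((outputs - targets) ** 2)
    misclassified = np.sum(np.argmax(outputs, axis=1) != np.argmax(targets, axis=1))
    return sq_error + alpha * misclassified

outputs = np.array([[0.8, 0.2], [0.4, 0.6], [0.3, 0.7]])   # network outputs, one row per object
targets = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # required outputs
print(hybrid_criterion(outputs, targets))                   # 0.98 + 1 misclassification
```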

9. Support vector machine.

A method for constructing a non-linear separating surface using support vectors. In a new feature space (the rectifying space), a separating surface close to linear is built. Constructing this surface reduces to solving a quadratic programming problem.
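A minimal sketch of a support-vector classifier with an RBF kernel (the kernel plays the role of the rectifying space); the dataset and parameter values are illustrative.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# The kernel implicitly maps the data into a feature space where a near-linear
# separating surface is built; fitting solves the underlying quadratic program.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print(clf.n_support_)            # number of support vectors per class
print(clf.predict([[0.5, 0.0]]))
```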

10. Algorithms for solving problems of recognition by teams of various recognition algorithms.

The recognition problem is solved in two stages. First, different algorithms of the System are applied independently. Then an optimal collective solution is found automatically using special methods called "correctors". Various approaches are used as corrector methods.
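The "corrector" methods themselves are not reproduced here; the sketch below only illustrates the two-stage idea with a plain majority-voting ensemble over several independently trained algorithms.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stage 1: several different algorithms; stage 2: combine their answers
# (here by simple majority voting rather than a special corrector).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
ensemble = VotingClassifier(
    estimators=[("linear", LogisticRegression(max_iter=1000)),
                ("knn", KNeighborsClassifier()),
                ("tree", DecisionTreeClassifier(random_state=0))],
    voting="hard",
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```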

11. Methods of cluster analysis (automatic classification or unsupervised learning).

The following known approaches are used:

Hierarchical grouping algorithms;

Clustering with the criterion of minimizing the sum of squared deviations;

k-means method.

It is possible to solve the classification problem both for a given and an unknown number of classes.

12. Algorithm for constructing collective solutions of the classification problem.

The classification problem is solved in two stages. First, a set of different solutions (in the form of coverings or partitions) with a fixed number of classes is obtained using various algorithms of the System. Then the optimal collective classification is found by solving a special discrete optimization problem.


Topic 7. CLASSIFICATION ANALYSIS

Lecture No. 9

1. Exploratory data analysis. Measurement scales

2. Classification trees

3. Discriminant analysis (trained classification)

4. Cluster analysis (classification without training)

5. Canonical correlations

1. Exploratory data analysis. Measurement scales

In the presence of a large number of variables and the absence of information about relationships and patterns, one of the first stages of analyzing the available data is so-called exploratory data analysis. As a rule, exploratory analysis considers and compares a large number of variables, and classification and scaling of the variables are carried out in the course of this search. Variables differ in how well they can be measured, or, in other words, in how much measurable information their measurement scale provides. Another factor that determines the amount of information is the type of scale on which the measurement is made. The following types of measurement scales are usually used: nominal, ordinal, interval and relative (ratio).

Nominal variables are used only for qualitative classification. This means that such variables can be measured only in terms of belonging to some significantly different classes. Typical examples of nominal variables are the manufacturer, the type of product, the indicator of its suitability, etc. Nominal variables are often called categorical.

Ordinal variables allow objects to be ranked, indicating which of them possess the quality expressed by the variable to a greater or lesser extent. However, they do not allow one to judge how much more or how much less of the given quality is present. A typical example is the grading of goods: highest, first, second, third. The same product differs qualitatively between grades, but it is impossible to say that the difference between them is, say, 25%. Categorical and ordinal variables are especially common in questionnaire surveys.

Interval variables allow one not only to rank objects but also to measure and compare the differences between their values numerically. For example, temperature measured in degrees forms an interval scale, since the difference between values can be evaluated numerically (40 degrees is greater than 30 by 10). An interval scale can easily be converted into an ordinal one if certain values of the variable are taken as boundaries between classes (for example, whether it is warm or hot outside, taking the boundary between the classes "warm" and "hot" at some chosen value of the variable). Relative (ratio) variables are similar to interval ones; their distinguishing feature is the presence of an absolute zero point. As a rule, these are continuous variables.

2. Classification trees

Classification trees are a method that allows one to predict whether observations or objects belong to one or another class of a categorical dependent variable, depending on the corresponding values of one or more predictor variables. The process of building a classification tree can be illustrated by a hierarchical coin-sorting device. Let coins roll along a narrow chute in which a slot the size of a one-kopeck coin is cut. If a coin falls into the slot, it is 1 kopeck; otherwise it rolls further along the chute and comes to a slot for a two-kopeck coin; if it falls in there, it is 2 kopecks; if not (which means it is 3 or 5 kopecks), it rolls on, and so on. Thus, we have built a classification tree. The decision rule implemented in this classification tree allows a handful of coins to be sorted efficiently, and it is applicable to a wide range of classification problems.

Classification trees are ideally suited to graphical representation, and therefore the conclusions drawn from them are much easier to interpret than if they were presented only in numerical form. The hierarchical structure of a classification tree is one of its most important properties. The process of building a classification tree consists of four main steps:

    Selection of the forecast accuracy criterion

    Branch type selection

    Determining when to stop branching

    Determining "appropriate" tree sizes

Ultimately, the goal of analysis with classification trees is to obtain the most accurate prediction possible.
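A short sketch of a classification tree on a standard dataset; the tree plays the same role as the coin-sorting device above, and the printed rules show why tree output is easy to interpret. The dataset and depth limit are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Each node tests one predictor and routes the object on, like the coin chute above.
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)   # limited depth ~ "appropriate size"
tree.fit(X, y)
print(export_text(tree))        # readable, easily interpreted form of the decision rule
```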

3. Discriminant analysis (trained classification)

Discriminant analysis is used to decide which class (group) a given object (process) should be assigned to on the basis of the study of its parameters or characteristics. For example, the measured parameters (characteristics) of a product may be known, and the task is to establish which of the parameters contribute to the difference (discrimination) between separately grouped aggregates (grades) of goods that form the general population. After that, a decision is made as to whether the product belongs to a certain group. This kind of statistical analysis is therefore multivariate. The main idea of discriminant analysis is to determine whether populations differ in the mean of some parameter (variable), and then to use this variable to predict, for new objects, which population they belong to. Each of the populations differs from the others in the value of a certain parameter (more precisely, in the value of its mean) or a set of parameters taken as the classification feature. The discrimination rule is chosen in accordance with a certain optimality principle, for example, the minimum probability of false classification.

In practical calculations, one passes from the feature vector to a linear function (the discriminant function), which for two groups (classes) has the form of a multiple linear regression equation in which the coded indicators of group membership act as the dependent variable. If there are more than two groups, more than one discriminant function can be composed. For example, when there are three populations, one can evaluate (1) a function for discriminating between population 1 and populations 2 and 3 taken together, and (2) a function for discriminating between population 2 and population 3. In this sense, discriminant analysis is very similar to multivariate analysis of variance. Once the discriminant functions have been obtained, the question arises of how well they can predict which population a particular sample belongs to. For this, classification indicators (classification functions) are determined, and a new observation or specific sample is assigned to the group for which the classification function has the greatest value.

4. Cluster analysis (classification without training)

Cluster analysis is a statistical method that includes a set of different algorithms for distributing objects into clusters (a cluster is a bunch, an accumulation). A set of N objects is partitioned into an integer number K of clusters so that each object belongs to one and only one subset of the partition. At the same time, objects belonging to the same cluster must be similar, and objects belonging to different clusters must be heterogeneous. The solution of the cluster analysis problem is a partition that satisfies a certain optimality criterion. This criterion is called the objective function; for example, it can be the minimum of the sum of squared deviations of the features of the objects in a group from the group mean:

min Σᵢ (xᵢ − x̄)²
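For concreteness, a small sketch that computes this objective, the sum of squared deviations of objects from their cluster means, for a given partition; the data and labels are made up.

```python
import numpy as np

# Within-cluster sum of squares: the objective min Σ(x_i − x̄)² for a given partition.
def within_cluster_ss(X, labels):
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        total += np.sum((members - members.mean(axis=0)) ** 2)
    return total

X = np.array([[1.0, 2.0], [1.2, 1.8], [5.0, 5.1], [4.8, 5.3]])
labels = np.array([0, 0, 1, 1])
print(within_cluster_ss(X, labels))   # 0.08 for this partition
```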

The similarity and heterogeneity of objects in the groups are characterized by a quantity called the distance function. The larger the distance between two objects, the more heterogeneous they are. Clearly, if this function exceeds a certain limit, the objects should be assigned to different groups (clusters). Depending on the clustering algorithm used, the following distance functions are distinguished:

  • the Euclidean metric (Σ(xᵢ − xⱼ)²)^(1/2);
  • the Manhattan distance Σ|xᵢ − xⱼ|;
  • the Chebyshev distance max|xᵢ − xⱼ|, and others.

In hierarchical algorithms, the individual objects are initially considered as separate clusters. Then, at each step of the algorithm, the two closest clusters are combined and, with the adopted distance function taken into account, all distances are recalculated according to the corresponding formula. The iterations stop when the required value of the objective function is reached.

5. Canonical correlations

Classical correlation analysis makes it possible to find statistical relationships between two variables (pair correlation); to study relationships between two sets of variables, the methods of canonical analysis are used. Canonical analysis, being a generalization of multiple correlation as a measure of the relationship between one random variable and many other random variables, considers the relationship between sets of random variables. At the same time, it is limited to considering a small number of the most correlated linear combinations from each set. The analysis of canonical correlations is based on the use of canonical roots, or canonical variables, which are regarded as "hidden" variables characterizing the observed phenomena. The number of canonical roots is equal to the number of variables in the smaller set.

In practice, when determining the canonical correlation, a special correlation matrix is built, which is the product of the standard correlation matrices characterizing the dependencies between the individual variables. Then as many eigenvalues of the resulting matrix are calculated as there are canonical roots. Taking the square roots of the obtained eigenvalues gives a set of numbers that can be interpreted as correlation coefficients; since they refer to the canonical variables, they are called canonical correlations. Discriminant, cluster and canonical analysis are carried out in practice using special statistical packages that implement these algorithms on a computer.

Cluster analysis is

Good day. You know, I respect people who are passionate about their work.

My friend Maxim belongs to this category. He constantly works with figures, analyzes them, and prepares the relevant reports.

Yesterday we had lunch together, so for almost half an hour he told me about cluster analysis - what it is and in what cases its application is reasonable and expedient. Well, what about me?

I have a good memory, so I will present to you all this information (which, by the way, I already knew about) in its original and most informative form.

Cluster analysis is designed to divide a set of objects into homogeneous groups (clusters or classes). This is a task of multivariate data classification.

There are about 100 different clustering algorithms, however, the most commonly used are hierarchical cluster analysis and k-means clustering.

Where is cluster analysis used? In marketing, this is the segmentation of competitors and consumers.

In management: division of personnel into groups with different levels of motivation, classification of suppliers, identification of similar production situations in which defects occur.

In medicine, the classification of symptoms, patients, drugs. In sociology, the division of respondents into homogeneous groups. In fact, cluster analysis has proven itself well in all spheres of human life.

The charm of this method is that it works even when there is little data and the requirements of normality of the distributions of the random variables and other requirements of classical statistical methods are not met.

Let us explain the essence of cluster analysis without resorting to strict terminology:
Let's say you conducted a survey of employees and want to determine how you can most effectively manage your staff.

That is, you want to divide employees into groups and select the most effective control levers for each of them. At the same time, the differences between groups should be obvious, and within the group, the respondents should be as similar as possible.

To solve the problem, it is proposed to use hierarchical cluster analysis.

As a result, we will get a tree, looking at which we must decide how many classes (clusters) we want to split the staff into.
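A hedged sketch of this hierarchical step using SciPy: made-up questionnaire scores are joined into a tree (dendrogram), from which the number of clusters is then chosen by inspection.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Rows = respondents, columns = questionnaire scores (the numbers are invented).
answers = np.array([
    [9, 8, 2, 7],
    [8, 9, 3, 6],
    [2, 3, 9, 8],
    [3, 2, 8, 9],
    [5, 5, 5, 5],
])
tree = linkage(answers, method="ward")   # agglomerative joining of the closest groups
dendrogram(tree)                         # the tree from which the number of clusters is chosen
plt.show()
```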

Suppose we decide to divide the staff into three groups; then, to study the respondents who fell into each cluster, we get a table with the following content:


Let us explain how the above table is formed. The first column contains the number of the cluster — the group whose data is reflected in the row.

For example, the first cluster is 80% male. 90% of the first cluster fall into the age group from 30 to 50 years old, and 12% of respondents believe that benefits are very important. Etc.

Let's try to make portraits of respondents of each cluster:

  1. The first group consists mostly of middle-aged men holding leadership positions. The social package (MED, LGOTI, TIME - free time) does not interest them; they prefer to receive a good salary rather than help from the employer.
  2. Group two, on the contrary, prefers the social package. It consists mainly of "aged" people occupying low positions. Salary is certainly important for them, but there are other priorities.
  3. The third group is the "youngest". Unlike the previous two, it shows an obvious interest in learning and opportunities for professional growth. This category of employees has a good chance of soon joining the first group.

Thus, when planning a campaign to introduce effective methods of personnel management, it is obvious that in our situation the social package for the second group could be increased at the expense of, for example, wages.

If we talk about which specialists should be sent for training, then we can definitely recommend paying attention to the third group.

Source: http://www.nickart.spb.ru/analysis/cluster.php

Features of cluster analysis

A cluster is the price of an asset in a certain period of time during which transactions were made. The resulting volume of purchases and sales is indicated by a number within the cluster.

The bar of any TF contains, as a rule, several clusters. This allows you to see in detail the volumes of purchases, sales and their balance in each individual bar, for each price level.


A change in the price of one asset inevitably entails a chain of price movements on other instruments as well.

Attention!

In most cases, the understanding of the trend movement occurs already at the moment when it is rapidly developing, and entering the market along the trend is fraught with falling into a corrective wave.

For successful trades, it is necessary to understand the current situation and be able to anticipate future price movements. This can be learned by analyzing the cluster graph.

With the help of cluster analysis, you can see the activity of market participants inside even the smallest price bar. This is the most accurate and detailed analysis, as it shows the point distribution of transaction volumes for each asset price level.

In the market there is a constant confrontation between the interests of sellers and buyers. And every smallest price movement (tick) is a move toward a compromise - a price level - that suits both parties at this moment.

But the market is dynamic, the number of sellers and buyers is constantly changing. If at one point in time the market was dominated by sellers, then the next moment, most likely, there will be buyers.

The number of completed transactions at neighboring price levels is also not the same. And yet, first, the market situation is reflected in the total volume of transactions, and only then on the price.

If you see the actions of the dominant market participants (sellers or buyers), then you can predict the price movement itself.

To successfully apply cluster analysis, you first need to understand what a cluster and a delta are.


A cluster is a price movement divided into levels at which transactions with known volumes were made. The delta shows the difference between the buying and the selling that occur within each cluster.

Each cluster, or group of deltas, allows you to figure out whether buyers or sellers dominate the market at a given time.

It is enough simply to calculate the total delta by summing the sales and purchases. If the delta is negative, the market is oversold and there is an excess of sell transactions. When the delta is positive, buyers clearly dominate the market.

The delta itself can take on a normal or critical value. The value of the delta volume above the normal value in the cluster is highlighted in red.

If the delta is moderate, then this characterizes a flat state in the market. With a normal delta value, a trend movement is observed in the market, but a critical value is always a harbinger of a price reversal.
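A tiny sketch of the delta computation described above for a single cluster; the volumes are invented and the sign interpretation follows the text, while the thresholds for "moderate", "normal" and "critical" are not specified in the source.

```python
# Delta of one cluster (price level): difference between executed buy and sell volume.
def delta(buy_volume, sell_volume):
    return buy_volume - sell_volume

cluster = {"buys": 1250, "sells": 980}           # made-up volumes at one price level
d = delta(cluster["buys"], cluster["sells"])
print(d)                                          # positive -> buyers dominate, negative -> sellers
```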

Forex trading with CA

To get the maximum profit, you need to be able to determine the transition of the delta from a moderate level to a normal one. In this case you can notice the very beginning of the transition from a flat to a trend movement and extract the most profit.

The cluster chart is more visual, on it you can see significant levels of accumulation and distribution of volumes, build support and resistance levels. This allows the trader to find the exact entry to the trade.

Using the delta, one can judge the predominance of sales or purchases in the market. Cluster analysis allows you to observe transactions and track their volumes inside the bar of any TF.

This is especially important when approaching significant support or resistance levels. Cluster judgments are the key to understanding the market.

Source: http://orderflowtrading.ru/analitika-rynka/obemy/klasternyy-analiz/

Areas and features of application of cluster analysis

The term cluster analysis (first introduced by Tryon, 1939) actually includes a set of different classification algorithms.

A common question asked by researchers in many fields is how to organize observed data into visual structures, i.e., to develop taxonomies.

According to the modern system accepted in biology, man belongs to primates, mammals, amniotes, vertebrates and animals.

Note that in this classification, the higher the level of aggregation, the less similarity between members in the corresponding class.

Man has more similarities with other primates (i.e., apes) than with "distant" members of the mammal family (i.e., dogs), and so on.

Note that the previous discussion refers to clustering algorithms, but does not mention anything about testing for statistical significance.

In fact, cluster analysis is not so much an ordinary statistical method as a “set” of various algorithms for “distributing objects into clusters”.

There is a point of view that, unlike many other statistical procedures, cluster analysis methods are used in most cases when you do not have any a priori hypotheses about the classes, but are still in the descriptive stage of the study.

Attention!

It should be understood that cluster analysis determines the most meaningful solution possible.

Therefore, testing for statistical significance is not really applicable here, even in cases where p-levels are known (as, for example, in the K-means method).

The clustering technique is used in a wide variety of fields. Hartigan (1975) has provided an excellent overview of the many published studies containing results obtained by cluster analysis methods.

For example, in the field of medicine, the clustering of diseases, treatment of diseases, or symptoms of diseases leads to widely used taxonomies.

In the field of psychiatry, the correct diagnosis of symptom clusters such as paranoia, schizophrenia, etc. is critical to successful therapy. In archeology, using cluster analysis, researchers are trying to establish taxonomies of stone tools, funeral objects, etc.

Widespread applications of cluster analysis are known in marketing research. In general, whenever it is necessary to classify "mountains" of information into groups suitable for further processing, cluster analysis turns out to be very useful and effective.

Tree Clustering

The example in the Primary Purpose section explains the purpose of the join (tree clustering) algorithm.

The purpose of this algorithm is to combine objects (for example, animals) into sufficiently large clusters using some measure of similarity or distance between objects. A typical result of such clustering is a hierarchical tree.

Consider a horizontal tree diagram. The diagram starts with each object in the class (on the left side of the diagram).

Now imagine that gradually (in very small steps) you "weaken" your criterion for what objects are unique and what are not.

In other words, you lower the threshold related to the decision to combine two or more objects into one cluster.

As a result, you link more and more objects together and aggregate (combine) more and more clusters of increasingly different elements.

Finally, in the last step, all objects are merged together. In these charts, the horizontal axes represent the pooling distance (in vertical dendrograms, the vertical axes represent the pooling distance).

So, for each node in the graph (where a new cluster is formed), you can see the amount of distance for which the corresponding elements are linked into a new single cluster.

When the data has a clear "structure" in terms of clusters of objects that are similar to each other, then this structure is likely to be reflected in the hierarchical tree by various branches.

As a result of successful analysis by the join method, it becomes possible to detect clusters (branches) and interpret them.

The union or tree clustering method is used in the formation of clusters of dissimilarity or distance between objects. These distances can be defined in one-dimensional or multidimensional space.

For example, if you have to cluster the types of food in a cafe, you can take into account the number of calories contained in it, the price, the subjective assessment of taste, etc.

The most direct way to calculate distances between objects in a multidimensional space is to calculate Euclidean distances.

If you have a 2D or 3D space, then this measure is the actual geometric distance between objects in space (as if the distances between objects were measured with a tape measure).

However, the joining algorithm does not "care" whether the distances supplied to it are real geometric distances or some other derived distance measures that are more meaningful to the researcher; the challenge for researchers is to select the right measure for their specific applications.

Euclidean distance. This seems to be the most common type of distance. It is simply the geometric distance in multidimensional space and is calculated as follows:

distance(x, y) = (Σᵢ (xᵢ − yᵢ)²)^(1/2)

Note that the Euclidean distance (and its square) is calculated from the original data, not from the standardized data.

This is the usual way of calculating it, which has certain advantages (for example, the distance between two objects does not change when a new object is introduced into the analysis, which may turn out to be an outlier).

Attention!

However, distances can be greatly affected by differences between the axes from which the distances are calculated. For example, if one of the axes is measured in centimeters, and then you convert it to millimeters (by multiplying the values by 10), then the final Euclidean distance (or the square of the Euclidean distance) calculated from the coordinates will change dramatically, and, as a result, the results of the cluster analysis can be very different from the previous ones.

The square of the Euclidean distance. Sometimes you may want to square the standard Euclidean distance to give more weight to more distant objects.

This distance is calculated as follows:

distance(x, y) = Σᵢ (xᵢ − yᵢ)²

City block distance (Manhattan distance). This distance is simply the average of the differences over the coordinates.

In most cases, this distance measure leads to the same results as the ordinary Euclidean distance.

However, note that for this measure the influence of individual large differences (outliers) decreases (because they are not squared). The Manhattan distance is calculated using the formula:

distance(x, y) = Σᵢ |xᵢ − yᵢ|

Chebyshev distance. This distance can be useful when one wishes to define two objects as "different" if they differ in any one coordinate (any one dimension). The Chebyshev distance is calculated by the formula:

distance(x, y) = maxᵢ |xᵢ − yᵢ|

Power distance. It is sometimes desired to progressively increase or decrease the weight related to a dimension for which the corresponding objects are very different.

This can be achieved using a power distance. The power distance is calculated by the formula:

distance(x, y) = (Σᵢ |xᵢ − yᵢ|^p)^(1/r)

where r and p are user-defined parameters. A few examples of calculations can show how this measure "works".

The p parameter is responsible for the gradual weighting of differences in individual coordinates, the r parameter is responsible for the progressive weighting of large distances between objects. If both parameters - r and p, are equal to two, then this distance coincides with the Euclidean distance.

The percentage of disagreement. This measure is used when the data are categorical. This distance is calculated by the formula:

distance(x, y) = (number of coordinates where xᵢ ≠ yᵢ) / (total number of coordinates)
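For reference, the distance measures listed above written out as small Python functions; x and y are feature vectors, and the percent-disagreement variant assumes categorical vectors. The example values are made up.

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def sq_euclidean(x, y):
    return np.sum((x - y) ** 2)

def city_block(x, y):                      # Manhattan distance
    return np.sum(np.abs(x - y))

def chebyshev(x, y):
    return np.max(np.abs(x - y))

def power_dist(x, y, p=2, r=2):            # p, r are the user-defined parameters
    return np.sum(np.abs(x - y) ** p) ** (1 / r)

def pct_disagreement(x, y):                # for categorical (e.g. string) vectors
    return np.mean(x != y)

x, y = np.array([1.0, 4.0, 2.0]), np.array([2.0, 1.0, 2.0])
print(euclidean(x, y), city_block(x, y), chebyshev(x, y))
```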

Amalgamation or linkage rules

At the first step, when each object is a separate cluster, the distances between these objects are determined by the chosen measure.

However, when several objects are linked together, the question arises, how should the distances between clusters be determined?

In other words, you need a join or link rule for two clusters. There are various possibilities here: for example, you can link two clusters together when any two objects in the two clusters are closer to each other than the corresponding link distance.

In other words, you use the "nearest neighbor rule" to determine the distance between clusters; this method is called the single link method.

This rule builds "fibrous" clusters, i.e. clusters "linked together" only by individual elements that happen to be closer to each other than the others.

Alternatively, you can use neighbors in clusters that are farthest from each other of all other feature pairs. This method is called the full link method.

There are also many other methods for joining clusters, similar to those that have been discussed.

Single connection (nearest neighbor method). As described above, in this method, the distance between two clusters is determined by the distance between the two closest objects (nearest neighbors) in different clusters.

This rule, in a sense, strings objects together to form clusters, and the resulting clusters tend to be represented by long "chains".

Full connection (method of the most distant neighbors). In this method, the distances between clusters are defined as the largest distance between any two objects in different clusters (i.e. "most distant neighbors").

Unweighted pairwise mean. In this method, the distance between two different clusters is calculated as the average distance between all pairs of objects in them.

The method is effective when objects actually form different "groves", but it works equally well in cases of extended ("chain" type) clusters.

Note that in their book Sneath and Sokal (1973) introduce the abbreviation UPGMA to refer to this method as the unweighted pair-group method using arithmetic averages.

Weighted pairwise mean. The method is identical to the unweighted pairwise average method, except that the size of the respective clusters (ie, the number of objects they contain) is used as a weighting factor in the calculations.

Therefore, the proposed method should be used (rather than the previous one) when unequal cluster sizes are assumed.

Sneath and Sokal (1973) introduce the abbreviation WPGMA to refer to this method as the weighted pair-group method using arithmetic averages.

Unweighted centroid method. In this method, the distance between two clusters is defined as the distance between their centers of gravity.

Attention!

Sneath and Sokal (1973) use the acronym UPGMC to refer to this method as the unweighted pair-group method using the centroid average.

Weighted centroid method (median). This method is identical to the previous one, except that weights are used in the calculations to take into account the difference between cluster sizes (i.e., the number of objects in them).

Therefore, if there are (or are suspected) significant differences in cluster sizes, this method is preferable to the previous one.

Sneath and Sokal (1973) used the abbreviation WPGMC to refer to it as the weighted pair-group method using the centroid average.

Ward's method. This method differs from all the others because it uses analysis-of-variance methods to estimate the distances between clusters.

The method minimizes the sum of squares (SS) for any two (hypothetical) clusters that can be formed at each step.

Details can be found in Ward (1963). In general, the method seems to be very efficient, but it tends to create small clusters.
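For reference, a sketch showing how the amalgamation rules above map onto SciPy's linkage methods (the names differ slightly from the textbook terms); the data are synthetic.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# SciPy names for the rules above: "single", "complete", "average" (UPGMA),
# "weighted" (WPGMA), "centroid" (UPGMC), "median" (WPGMC) and "ward".
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])

for method in ["single", "complete", "average", "weighted", "centroid", "median", "ward"]:
    Z = linkage(X, method=method)
    print(method, round(float(Z[-1, 2]), 2))   # distance at which the last two clusters merge
```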

Two-way joining

Earlier, this method was discussed in terms of "objects" that are to be clustered. In all other types of analysis, the question of interest to the researcher is usually expressed in terms of observations or variables.

It turns out that clustering, both by observations and by variables, can lead to quite interesting results.

For example, imagine that a medical researcher is collecting data on various characteristics (variables) of patients' conditions (observations) with heart disease.

The investigator may wish to cluster observations (of patients) to identify clusters of patients with similar symptoms.

At the same time, the researcher may wish to cluster variables to identify clusters of variables that are associated with a similar physical state.

After this discussion regarding whether to cluster observations or variables, one might ask, why not cluster in both directions?

The Cluster Analysis module contains an efficient two-way join procedure to do just that.

However, two-way pooling is used (relatively rarely) in circumstances where both observations and variables are expected to contribute simultaneously to the discovery of meaningful clusters.

So, returning to the previous example, we can assume that a medical researcher needs to identify clusters of patients that are similar in relation to certain clusters of physical condition characteristics.

The difficulty in interpreting the results obtained arises from the fact that the similarities between different clusters may come from (or be the cause of) some difference in the subsets of variables.

Therefore, the resulting clusters are inherently heterogeneous. Perhaps it seems a bit hazy at first; indeed, compared to other cluster analysis methods described, two-way pooling is probably the least commonly used method.

However, some researchers believe that it offers a powerful tool for exploratory data analysis (for more information, see Hartigan's description of this method (Hartigan, 1975)).

K-means method

This clustering method differs significantly from agglomerative methods such as Union (tree clustering) and Two-Way Union. Suppose you already have hypotheses about the number of clusters (by observation or by variable).

You can tell the system to form exactly three clusters so that they are as different as possible.

This is exactly the type of problem that the K-Means algorithm solves. In general, the K-means method builds exactly K distinct clusters spaced as far apart as possible.

In the physical condition example, a medical researcher may have a "hunch" from their clinical experience that their patients generally fall into three different categories.

Attention!

If so, then the means of the various measures of physical parameters for each cluster would provide a quantitative way of representing the investigator's hypotheses (e.g., patients in cluster 1 have a high value of parameter 1, a lower value of parameter 2, etc.).

From a computational point of view, you can think of this method as an analysis of variance "in reverse". The program starts with K randomly selected clusters, and then changes the belonging of objects to them in order to:

  1. minimize variability within clusters,
  2. maximize variability between clusters.

This method is similar to reverse analysis of variance (ANOVA) in that the significance test in ANOVA compares between-group versus within-group variability in testing the hypothesis that group means are different from each other.

In K-means clustering, the program moves objects (i.e., observations) from one group (cluster) to another in order to obtain the most significant result when performing analysis of variance (ANOVA).

Typically, once the results of a K-means cluster analysis are obtained, one can calculate the means for each cluster for each dimension to assess how the clusters differ from each other.

Ideally, you should get very different means for most, if not all, of the measurements used in the analysis.
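A hedged K-means sketch on synthetic data: build K clusters and then inspect the per-cluster means of each dimension, as described above; the number of clusters and the data are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Build K clusters that are as distinct as possible, then compare cluster means per dimension.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (40, 3)), rng.normal(4, 1, (40, 3)), rng.normal(8, 1, (40, 3))])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
for k in range(3):
    print(k, X[km.labels_ == k].mean(axis=0).round(2))   # cluster means for each dimension
```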

Source: http://www.biometrica.tomsk.ru/textbook/modules/stcluan.html

Classification of objects according to their characteristics

Cluster analysis is a set of multidimensional statistical methods for classifying objects according to their characteristics, dividing a collection of objects into homogeneous groups that are close in terms of the defining criteria, and selecting the objects of a certain group.

A cluster is a group of objects identified as a result of cluster analysis based on a given measure of similarity or difference between objects.

The object is the specific subjects of study that need to be classified. The objects in the classification are, as a rule, observations. For example, consumers of products, countries or regions, products, etc.

However, it is also possible to carry out cluster analysis on variables. In multidimensional cluster analysis, objects are classified according to several criteria simultaneously.

These can be both quantitative and categorical variables, depending on the cluster analysis method. Thus, the main objective of cluster analysis is to find groups of similar objects in the sample.

The set of multidimensional statistical methods of cluster analysis can be divided into hierarchical methods (agglomerative and divisive) and non-hierarchical (k-means method, two-stage cluster analysis).

However, there is no generally accepted classification of methods, and sometimes cluster analysis methods also include methods for constructing decision trees, neural networks, discriminant analysis, and logistic regression.

The scope of cluster analysis, due to its versatility, is very wide. Cluster analysis is used in economics, marketing, archeology, medicine, psychology, chemistry, biology, public administration, philology, anthropology, sociology and other fields.

Here are some examples of applying cluster analysis:

  • medicine - classification of diseases, their symptoms, methods of treatment, classification of patient groups;
  • marketing - the tasks of optimizing the company's product line, segmenting the market by groups of goods or consumers, identifying a potential consumer;
  • sociology - division of respondents into homogeneous groups;
  • psychiatry - correct diagnosis of symptom groups is crucial for successful therapy;
  • biology - classification of organisms by group;
  • economy - classification of subjects of the Russian Federation by investment attractiveness.

Source: http://www.statmethods.ru/konsalting/statistics-methody/121-klasternyj-analyz.html

General information about cluster analysis

Cluster analysis includes a set of different classification algorithms. A common question asked by researchers in many fields is how to organize observed data into visual structures.

For example, biologists aim to break animals into different kinds to meaningfully describe the differences between them.

The task of cluster analysis is to divide the initial set of objects into groups of similar, close objects. These groups are called clusters.

In other words, cluster analysis is one of the ways to classify objects according to their features. It is desirable that the classification results have a meaningful interpretation.

The results obtained by cluster analysis methods are used in various fields. In marketing, it is the segmentation of competitors and consumers.

In psychiatry, the correct diagnosis of symptoms such as paranoia, schizophrenia, etc. is crucial for successful therapy.

In management, the classification of suppliers is important, as is the identification of similar production situations in which defects occur. In sociology, the division of respondents into homogeneous groups. In portfolio investing, it is important to group securities by similarity in the trend of their returns in order to compose, on the basis of the information obtained about the stock market, an optimal investment portfolio that maximizes the return on investment for a given degree of risk.

In general, whenever it is necessary to classify a large amount of information of this kind and present it in a form suitable for further processing, cluster analysis turns out to be very useful and effective.

Cluster analysis allows considering a fairly large amount of information and greatly compressing large arrays of socio-economic information, making them compact and visual.

Attention!

Cluster analysis is of great importance for sets of time series characterizing economic development (for example, general economic and commodity market conditions).

Here it is possible to single out the periods when the values of the corresponding indicators were quite close, as well as to determine the groups of time series, the dynamics of which are most similar.

In the problems of socio-economic forecasting, it is very promising to combine cluster analysis with other quantitative methods (for example, with regression analysis).

Advantages and disadvantages

Cluster analysis allows for an objective classification of any objects that are characterized by a number of features. There are a number of benefits to be derived from this:

  1. The resulting clusters can be interpreted, that is, one can describe what kinds of groups actually exist.
  2. Individual clusters can be culled. This is useful when certain errors were made in the data set, as a result of which the indicator values for individual objects deviate sharply. When cluster analysis is applied, such objects fall into a separate cluster.
  3. For further analysis, one can select only those clusters that have the characteristics of interest.

Like any other method, cluster analysis has certain disadvantages and limitations. In particular, the composition and number of clusters depends on the selected partitioning criteria.

When reducing the initial data array to a more compact form, certain distortions may occur, and the individual features of individual objects may also be lost due to their replacement by the characteristics of the generalized values ​​of the cluster parameters.

Methods

Currently, more than a hundred different clustering algorithms are known. Their diversity is explained not only by different computational methods, but also by different concepts underlying clustering.

The Statistica package implements the following clustering methods.

  • Hierarchical algorithms - tree clustering. Hierarchical algorithms are based on the idea of ​​sequential clustering. At the initial step, each object is considered as a separate cluster. At the next step, some of the clusters closest to each other will be combined into a separate cluster.
  • K-means method. This method is the most commonly used. It belongs to the group of so-called reference methods of cluster analysis. The number of clusters K is set by the user.
  • Two way association. When using this method, clustering is carried out simultaneously both by variables (columns) and by observation results (rows).

The two-way join procedure is performed when it can be expected that simultaneous clustering on variables and observations will provide meaningful results.

The results of the procedure are descriptive statistics on variables and cases, as well as a two-dimensional color chart on which data values are color-coded.

By the distribution of color, you can get an idea of homogeneous groups.

Normalization of variables

The division of the initial set of objects into clusters is associated with the calculation of distances between objects and the choice of objects, the distance between which is the smallest of all possible.

The most commonly used is the Euclidean (geometric) distance familiar to all of us. This metric corresponds to intuitive ideas about the proximity of objects in space (as if the distances between objects were measured with a tape measure).

But for a given metric, the distance between objects can be strongly affected by changes in scales (units of measurement). For example, if one of the features is measured in millimeters and then its value is converted to centimeters, the Euclidean distance between objects will change dramatically. This will lead to the fact that the results of cluster analysis may differ significantly from the previous ones.

If the variables are measured in different units of measurement, then their preliminary normalization is required, that is, the transformation of the initial data, which converts them into dimensionless quantities.

Normalization strongly distorts the geometry of the original space, which can change the results of clustering.

In the Statistica package, any variable x is normalized according to the formula:

z = (x − x̄) / s, where x̄ is the sample mean and s is the sample standard deviation.

To do this, right-click on the variable name and select the following sequence of commands from the menu that opens: Fill / Standardize Block / Standardize Columns. The mean values of the normalized variables become equal to zero, and the variances become equal to one.
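A small sketch of the same standardization outside Statistica, assuming the usual z-score formula given above; the raw numbers are made up.

```python
import numpy as np

# Standardize each column: subtract the column mean and divide by the column standard
# deviation, so every normalized variable has mean 0 and variance 1.
X = np.array([[170.0, 65.0], [180.0, 80.0], [160.0, 55.0]])   # made-up raw data
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(Z.mean(axis=0).round(10))          # ~0 per column
print(Z.std(axis=0, ddof=1).round(10))   # 1 per column
```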

K-means method in Statistica

The K-means method splits a set of objects into a given number K of different clusters located at the greatest possible distance from each other.

Typically, once the results of a K-means cluster analysis are obtained, one can calculate the averages for each cluster for each dimension to assess how the clusters differ from each other.

Ideally, you should get very different means for most of the measurements used in the analysis.

The F-statistic values obtained for each dimension are another indicator of how well the corresponding dimension discriminates between clusters.

As an example, consider the results of a survey of 17 employees of an enterprise on satisfaction with career quality indicators. The table contains the answers to the questionnaire questions on a ten-point scale (1 is the minimum score, 10 is the maximum).

The variable names correspond to the answers to the following questions:

  1. SLT - a combination of personal goals and the goals of the organization;
  2. OSO - a sense of fairness in wages;
  3. TBD - territorial proximity to the house;
  4. PEW - a sense of economic well-being;
  5. CR - career growth;
  6. ZhSR - the desire to change jobs;
  7. OSB is a sense of social well-being.

Using this data, it is necessary to divide the employees into groups and select the most effective control levers for each of them.

At the same time, the differences between groups should be obvious, and within the group, the respondents should be as similar as possible.

To date, most sociological surveys report only percentages of votes: the number of positive answers is considered, or the percentage of the dissatisfied, but this question is not examined systematically.

Most often, the survey does not show trends in the situation. In some cases, it is necessary to count not the number of people who are "for" or "against", but the distance, or measure of similarity, that is, to determine groups of people who think alike.

Cluster analysis procedures can be used to identify, on the basis of survey data, some really existing relationships of features and generate their typology on this basis.

Attention!

The presence of any a priori hypotheses of a sociologist when working with cluster analysis procedures is not a necessary condition.

In the Statistica program, cluster analysis is performed as follows.

When choosing the number of clusters, be guided by the following: the number of clusters, if possible, should not be too large.

The distance at which the objects of a given cluster were joined should, if possible, be much less than the distance at which something else joins this cluster.

When choosing the number of clusters, most often there are several correct solutions at the same time.

We are interested, for example, in how the answers to the questionnaire differ between ordinary employees and the management of the enterprise. Therefore, we choose K = 2. For further segmentation, the number of clusters can be increased.

There are several ways to choose the initial cluster centers; for example, you can:

  1. select observations so as to maximize the initial distances between cluster centers;
  2. sort the distances and select observations at regular intervals (the default setting);
  3. take the first K observations as centers and attach the remaining objects to them.

Option 1 is suitable for our purposes.

Many clustering algorithms often "impose" on the data a structure that is not inherent in it and disorient the researcher. It is therefore essential to apply several cluster analysis algorithms and draw conclusions based on an overall assessment of their results.

The results of the analysis can be viewed in the dialog box that appears:

If you select the Graph of means tab, a graph of the coordinates of the cluster centers will be plotted:


Each broken line on this graph corresponds to one of the clusters. Each division of the horizontal axis of the graph corresponds to one of the variables included in the analysis.

The vertical axis corresponds to the average values of the variables for the objects included in each of the clusters.

It can be noted that there are significant differences in the attitude of the two groups of people to a service career on almost all issues. Only in one issue is there complete unanimity - in the sense of social well-being (OSB), or rather, the lack of it (2.5 points out of 10).

It can be assumed that cluster 1 represents workers and cluster 2 represents management. Managers are more satisfied with career development (CR), a combination of personal goals and organizational goals (SOLs).

They have a higher sense of economic well-being (SEW) and a sense of pay equity (SWA).

They are less concerned about proximity to home than workers, probably because of less transportation problems. Also, managers have less desire to change jobs (JSR).

Despite the fact that workers are divided into two categories, they give relatively the same answers to most questions. In other words, if something does not suit the general group of employees, the same does not suit senior management, and vice versa.

The harmonization of the graphs allows us to conclude that the well-being of one group is reflected in the well-being of another.

Cluster 1 is not satisfied with the territorial proximity to the house. This group is the main part of the workers who mainly come to the enterprise from different parts of the city.

Therefore, it is possible to offer the top management to allocate part of the profits to the construction of housing for the employees of the enterprise.

Significant differences are seen in the attitude of the two groups of people to a service career. Those employees who are satisfied with career growth, who have a high coincidence of personal goals and the goals of the organization, do not have a desire to change jobs and feel satisfaction with the results of their work.

Conversely, employees who want to change jobs and are dissatisfied with the results of their work are not satisfied with the above indicators. Top management should pay special attention to the current situation.

The results of the analysis of variance for each attribute are displayed by pressing the Analysis of variance button.

The sums of squared deviations of objects from cluster centers (SS Within), the sums of squared deviations between cluster centers (SS Between), the F-statistic values and the p significance levels are displayed.
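A hedged sketch of this per-variable ANOVA: SS Between, SS Within and the F-statistic for one feature given cluster labels; the numbers are invented, and scipy's f_oneway gives the same F as the ratio of the corresponding mean squares.

```python
import numpy as np
from scipy import stats

x = np.array([2.0, 3.0, 2.5, 8.0, 9.0, 8.5])      # one questionnaire variable (made up)
labels = np.array([0, 0, 0, 1, 1, 1])              # cluster membership

groups = [x[labels == k] for k in np.unique(labels)]
grand_mean = x.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
f_stat, p_value = stats.f_oneway(*groups)          # F = (SS_between/df_b) / (SS_within/df_w)
print(ss_between, ss_within, f_stat, p_value)
```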

Attention!

For our example, the significance levels for the two variables are quite large, which is explained by the small number of observations. In the full version of the study, which can be found in the paper, the hypotheses about the equality of means for cluster centers are rejected at significance levels less than 0.01.

The Save classifications and distances button displays the numbers of objects included in each cluster and the distances of objects to the center of each cluster.

The table shows the case numbers (CASE_NO) that make up the clusters with CLUSTER numbers and the distances from the center of each cluster (DISTANCE).

Information about which objects belong to which clusters can be written to a file and used in further analysis. In this example, comparing the obtained results with the questionnaires showed that cluster 1 consists mainly of ordinary workers, and cluster 2 of managers.
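For readers who prefer code to the Statistica dialogs, a minimal scikit-learn sketch of the same output (cluster numbers per case and distances to the cluster center); the file name and two-cluster setup are assumptions:

```python
# Cluster memberships and distances to centers, as in the "Save classifications
# and distances" output. "survey.csv" with numeric answers only is assumed.
import pandas as pd
from sklearn.cluster import KMeans

X = pd.read_csv("survey.csv")
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

result = pd.DataFrame({
    "CASE_NO": X.index + 1,
    "CLUSTER": km.labels_ + 1,
    "DISTANCE": km.transform(X).min(axis=1),       # distance to the object's own center
})
result.to_csv("classification.csv", index=False)   # reusable in further analysis
```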

Thus, it can be seen that when processing the results of the survey, cluster analysis turned out to be a powerful method that allows drawing conclusions that cannot be reached by constructing a histogram of averages or by calculating the percentage of those satisfied with various indicators of the quality of working life.

Tree clustering is an example of a hierarchical algorithm, whose principle is to merge into clusters first the closest elements and then ever more distant ones.

Most of these algorithms start from a matrix of similarity (distances), and each individual element is considered at first as a separate cluster.

After loading the cluster analysis module and selecting Joining (tree clustering), you can change the following parameters in the clustering parameters entry window:

  • Initial data (Input). They can be in the form of a matrix of the studied data (Raw data) and in the form of a matrix of distances (Distance matrix).
  • Clustering (Cluster) observations (Cases (raw)) or variables (Variable (columns)), describing the state of the object.
  • Distance measures. The following measures can be selected: Euclidean distances, Squared Euclidean distances, City-block (Manhattan) distance, Chebychev distance metric, Power distance, and Percent disagreement.
  • Clustering method (Amalgamation (linkage) rule). The following options are available: Single Linkage, Complete Linkage, Unweighted pair-group average, Weighted pair-group average, Unweighted pair-group centroid, Weighted pair-group centroid (median), and Ward's method.

As a result of clustering, a horizontal or vertical dendrogram is built - a graph showing the distances at which objects and clusters are successively merged.

The tree structure of the graph allows you to define clusters depending on the selected threshold - a given distance between clusters.
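A minimal sketch of the same procedure with scipy - distance matrix, linkage rule, dendrogram, and cutting the tree at a threshold; the file name, metric, and threshold are assumptions, not the settings used in the study:

```python
# Tree clustering of variables: distances, Single Linkage, dendrogram, threshold cut.
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

X = pd.read_csv("survey.csv")                      # columns = variables to cluster
dist = pdist(X.T.values, metric="euclidean")       # pairwise distances between variables
Z = linkage(dist, method="single")                 # Single Linkage amalgamation rule

dendrogram(Z, labels=list(X.columns))
plt.ylabel("linkage distance")
plt.show()

clusters = fcluster(Z, t=5.0, criterion="distance")  # clusters at a chosen threshold
```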

In addition, the matrix of distances between the original objects (Distance matrix) and the means and standard deviations for each original object (Descriptive statistics) are displayed.

For the considered example, we will carry out a cluster analysis of variables with default settings. The resulting dendrogram is shown in the figure.


The vertical axis of the dendrogram plots the distances between objects and between objects and clusters. So, the distance between the variables SEB and OSD is equal to five. These variables at the first step are combined into one cluster.

The horizontal segments of the dendrogram are drawn at levels corresponding to the threshold distances selected for a given clustering step.

It can be seen from the graph that the question about the desire to change jobs (JSR) forms a separate cluster; apparently this urge visits everyone to roughly the same degree. Another separate cluster is the question of territorial proximity to home (LHB).

In terms of importance it comes second, which confirms the conclusion about the need for housing construction drawn from the K-means study.

Feelings of economic well-being (PEW) and pay equity (PWA) are combined - this is a block of economic issues. Career progression (CR) and the combination of personal goals and organization goals (COL) are also combined.

Other clustering methods, as well as the choice of other types of distances, do not lead to a significant change in the dendrogram.

Results:

  1. Cluster analysis is a powerful tool for exploratory data analysis and statistical research in any subject area.
  2. The Statistica package implements both hierarchical and non-hierarchical methods of cluster analysis. A major advantage of this statistical package is its graphical capabilities: it provides two- and three-dimensional representations of the obtained clusters in the space of the studied variables, as well as the results of the hierarchical procedure for grouping objects.
  3. It is necessary to apply several cluster analysis algorithms and draw conclusions based on a general assessment of the results of the algorithms.
  4. Cluster analysis can be considered successful if it is performed in several different ways, the results are compared, common patterns are found, and stable clusters emerge regardless of the clustering method.
  5. Cluster analysis allows you to identify problem situations and outline ways to solve them. Therefore, this method of non-parametric statistics can be considered as an integral part of system analysis.

Last year Avito held a number of competitions, including one on recognizing car makes, whose winner, Evgeny Nizhibitsky, described his solution at a training session.


Formulation of the problem. From the images of the cars, you need to determine the make and model. The metric was the accuracy of predictions, that is, the proportion of correct answers. The sample consisted of three parts: the first part was available for training initially, the second was given later, and the third was required to show the final predictions.


Computing resources. I used the home computer that kept my room warm all the time and the servers provided at work.

Model overview. Since our task is image recognition, the first thing to do is take advantage of the progress in image classification on the well-known ImageNet. As is known, modern architectures achieve even higher quality than humans, so I started by reviewing recent papers and put together a summary table of architectures, implementations, and ImageNet quality.


Note in the table which architectures achieve the best quality.

Fine-tuning networks. Training a deep neural network from scratch is rather time-consuming and, moreover, not always the most effective approach in terms of results. Therefore, the fine-tuning technique is often used: a network already trained on ImageNet is taken, the last layer is replaced with a layer with the required number of classes, and the network is then tuned with a low learning rate, but on the competition data. This scheme lets you train the network faster and to higher quality.
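A minimal PyTorch sketch of this fine-tuning scheme (not the author's actual code; the number of classes is hypothetical):

```python
# Fine-tuning an ImageNet-pretrained network: replace the last layer and tune
# with a low learning rate on the competition data.
import torch.nn as nn
import torch.optim as optim
from torchvision import models

NUM_CLASSES = 87                                          # hypothetical number of car classes

model = models.resnet34(pretrained=True)                  # weights learned on ImageNet
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)   # new classification head

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)  # low LR for tuning
# ...then run the usual training loop on the competition images.
```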

The first approach to GoogLeNet retraining showed approximately 92% accuracy during validation.

Crop predictions. The quality of the predictions on the test sample can be improved further. To do this, fragments of a suitable size are cut out at different places of the original image, and the results are averaged. A 1x10 crop means that the center of the image and the four corners are taken, and then the same five crops reflected horizontally. As you can see, the quality increases, but so does the prediction time.
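A sketch of such test-time crop averaging with torchvision (the competition code used other tooling; the image size, `image`, and `model` are assumptions):

```python
# 10-crop prediction: center + four corners and their horizontal reflections,
# with class probabilities averaged over the crops. `image` is a PIL image and
# `model` a trained classifier - both assumed.
import torch
from torchvision import transforms

ten_crop = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),
    transforms.Lambda(lambda crops: torch.stack(
        [transforms.ToTensor()(c) for c in crops])),
])

model.eval()
crops = ten_crop(image)                              # shape: (10, 3, 224, 224)
with torch.no_grad():
    probs = model(crops).softmax(dim=1).mean(dim=0)  # average over the 10 crops
```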

Validation of results. After the second part of the sample appeared, I split the sample into several parts. All further results are shown on this partition.

ResNet-34 Torch. You can use the ready-made repository of the authors of the architecture, but in order to get predictions on the test in the desired format, you have to fix some scripts. In addition, you need to solve the problems of high memory consumption by dumps. Validation accuracy is about 95%.


Inception-v3 TensorFlow. Here, too, a ready-made implementation was used, but the image preprocessing was changed, and cropping of images during batch generation was also limited. The result is almost 96% accuracy.


Ensemble of models. In the end there were two ResNet models and two Inception-v3 models. What validation quality can be obtained by mixing them? The class probabilities were averaged using a geometric mean; the weights (in this case, the exponents) were selected on a held-out sample.
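A small sketch of such a weighted geometric blend (array names and weights are placeholders):

```python
# Weighted geometric mean of class probabilities from several models.
import numpy as np

def geometric_blend(prob_list, weights):
    """prob_list: list of (n_objects, n_classes) arrays; weights: the exponents."""
    log_mix = sum(w * np.log(p + 1e-15) for p, w in zip(prob_list, weights))
    mix = np.exp(log_mix / sum(weights))
    return mix / mix.sum(axis=1, keepdims=True)      # renormalize each row

blended = geometric_blend([p_resnet_1, p_resnet_2, p_inc_1, p_inc_2],
                          weights=[1.0, 1.0, 1.5, 1.5])   # tuned on held-out data
```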


Results. ResNet training on a GTX 980 took 60 hours, and Inception-v3 on a Titan X took 48 hours. During the competition we managed to try out new frameworks with new architectures.


The task of classifying bank customers

Link to Kaggle.

Stanislav Semyonov tells how he and other top Kaggle participants teamed up and won a prize in the competition on classifying applications from clients of a large bank, BNP Paribas.


Formulation of the problem. From obfuscated insurance-claim data, one has to predict whether an application can be approved without additional manual checks. For the bank this is about automating the processing of applications; for data analysts it is simply a machine learning task of binary classification. There are about 230 thousand objects and 130 features. The metric is LogLoss. It is worth noting that the winning team deciphered the data, which helped them win the competition.

Getting rid of artificial noise in the features. The first step is to look at the data. A few things immediately catch the eye. Firstly, all features take values from 0 to 20. Secondly, if you look at the distribution of any of the features, you can see the following picture:

Why is that? At the anonymization stage, random noise was added to all values and they were then scaled to the segment from 0 to 20. The reverse transformation was done in two steps: first the values were rounded to a certain decimal place, and then the denominator was recovered. Was this necessary if a tree picks up the split threshold anyway? Yes: after the inverse transformation the differences between variables begin to make more sense, and categorical variables can be one-hot encoded.

Removing linearly dependent features. We also noticed that some features are sums of others; these are clearly not needed. To find them, subsets of features were taken, and on each subset a regression was built to predict some other variable. If the predicted values were close to the true ones (allowing for the artificial noise), the feature could be removed. The team did not bother with this, however, and used a ready-made set of filtered features prepared by someone else. One of the features of Kaggle is its forum and public solutions, through which participants share their findings.

How do you know what to use? There is a small hack. Suppose you know that in an old competition someone used a technique that helped them place high (participants usually post short descriptions of their solutions on the forum). If that participant is again among the leaders in the current competition, the same technique will most likely work here too.

Coding of categorical variables. It was striking that a certain variable V22 has a large number of values, but at the same time, if we take a subsample for a certain value, the number of levels (different values) of other variables decreases markedly. There is also a good correlation with the target variable. What can be done? The simplest solution is to build a separate model for each value of V22, but this is the same as splitting over all values ​​of the variable in the first split of the tree.

There is another way to use the received information - coding by the average value of the target variable. In other words, each value of the categorical variable is replaced by the average value of the target for objects in which this feature takes the same value. It is impossible to perform such encoding directly for the entire training set: in the process, we will implicitly introduce information about the target variable into the features. We are talking about information that almost any model is sure to detect.

Therefore, such statistics are counted by folds. Here is an example:

Let's assume that the data is divided into three parts. For each fold of the training sample, we will calculate a new feature for the other two folds, and for the test sample, for the entire training set. Then the information about the target variable will not be included in the sample so explicitly, and the model will be able to use the knowledge gained.
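A sketch of such out-of-fold mean-target encoding in pandas (the data frame, column, and target names are assumptions):

```python
# Out-of-fold target encoding: each fold's feature is computed on the other
# folds; the test set uses the whole training sample.
import pandas as pd
from sklearn.model_selection import KFold

def target_encode(train, test, col, target, n_splits=3):
    train, test = train.copy(), test.copy()
    new_col = col + "_te"
    train[new_col] = 0.0
    global_mean = train[target].mean()
    for fit_idx, enc_idx in KFold(n_splits, shuffle=True, random_state=0).split(train):
        means = train.iloc[fit_idx].groupby(col)[target].mean()
        encoded = train.iloc[enc_idx][col].map(means).fillna(global_mean)
        train.iloc[enc_idx, train.columns.get_loc(new_col)] = encoded.values
    test[new_col] = test[col].map(train.groupby(col)[target].mean()).fillna(global_mean)
    return train, test

train_enc, test_enc = target_encode(train_df, test_df, col="V22", target="target")
```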

Will there be problems with anything else? Yes - with rare categories and cross-validation.

Rare categories. Suppose a certain category occurs only a few times and all the corresponding objects belong to class 0. Then the mean value of the target variable will also be zero, although the situation on the test sample may be quite different. The solution is a smoothed mean (or smoothed likelihood), calculated by the following formula:

Here global mean is the average value of the target variable over the entire sample, nrows is the number of times a particular value of the categorical variable occurs, and alpha is a regularization parameter (for example, 10). Now, if a value occurs rarely, more weight is given to the global mean, and if it occurs often enough, the result will be close to the initial category mean. Incidentally, this formula also makes it possible to handle previously unseen values of the categorical variable.
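The formula itself accompanied an illustration that has not survived; a reconstruction consistent with the description above, written out in code form:

```python
# Smoothed mean (smoothed likelihood) for one value of a categorical variable.
def smoothed_mean(category_mean, nrows, global_mean, alpha=10):
    # rare value: nrows is small, the result is pulled toward global_mean;
    # frequent value: nrows dominates, the result stays near category_mean;
    # unseen value: nrows = 0 gives exactly global_mean.
    return (category_mean * nrows + global_mean * alpha) / (nrows + alpha)
```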

Cross Validation. Let's say we calculated all smoothed means for categorical variables for other folds. Can we evaluate the quality of the model by standard k-fold cross-validation? No. Let's look at an example.

Suppose we want to evaluate the model on the third fold. We train the model on the first two folds, but they contain a new variable, the mean of the target, which was computed using the third (test) fold. This prevents a correct evaluation of the results, but the problem is solved by computing the statistics on folds within folds. Let's look at the example again:

We still want to evaluate the model on the third fold. We split the first two folds (the training sample of our evaluation) into three further folds, compute the new feature in them according to the scheme already described, and for the third fold (the test sample of our evaluation) we compute it on the first two folds taken together. Then no information from the third fold is used when training the model, and the estimate is honest. In the competition under discussion, only this kind of cross-validation made it possible to assess the quality of the model correctly. Of course, the "outer" and "inner" numbers of folds can be anything.

Feature building. We used not only the smoothed means of the target variable already mentioned, but also weights of evidence - almost the same thing, but with a logarithmic transformation. In addition, features such as the difference between the numbers of positive- and negative-class objects in a group turned out to be useful even without any normalization; the intuition is that this scale reflects the degree of confidence in the class. But what about quantitative features? If they are processed the same way, all the values get swamped by the regularization toward the global mean. One option is to split the values into bins, which are then treated as separate categories. Another is simply to build some linear model on the same feature with the same target. In total, about two thousand features were obtained from the 80 filtered ones.
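One common definition of weight of evidence, consistent with "the same but with a logarithm", looks like this (the smoothing constant eps is an assumption):

```python
# Weight of evidence for a categorical feature: log-ratio of the category's
# share among positive objects to its share among negative objects.
# df is a pandas DataFrame with a binary target column.
import numpy as np

def weight_of_evidence(df, col, target, eps=0.5):
    stats = df.groupby(col)[target].agg(["sum", "count"])
    pos = stats["sum"] + eps                       # positives in the category
    neg = stats["count"] - stats["sum"] + eps      # negatives in the category
    woe = np.log((pos / pos.sum()) / (neg / neg.sum()))
    return df[col].map(woe)
```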

Stacking and blending. As in most competitions, model stacking is an important part of the solution. In short, the essence of stacking is that we pass the predictions of one model as a feature to another model. It is important, however, not to overfit yet again. Let's look at an example:


Taken from Alexander Dyakonov's blog

Suppose we decided to split our sample into three folds at the stacking stage. As with computing the statistics, we must train the model on two folds and add the predicted values for the remaining fold. For the test sample, the model predictions from each pair of folds can be averaged. Each level of stacking is the process of adding a group of new features - model predictions - to the existing dataset.
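A minimal sketch of one stacking level under these conventions (the model choice, fold count, and the NumPy arrays X_train, y_train, X_test are placeholders):

```python
# One stacking level: out-of-fold predictions of a base model become a new
# feature; test predictions are averaged over the fold models.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold

def stack_feature(model, X_train, y_train, X_test, n_splits=3):
    oof = np.zeros(len(X_train))
    test_parts = []
    for fit_idx, pred_idx in KFold(n_splits, shuffle=True, random_state=0).split(X_train):
        model.fit(X_train[fit_idx], y_train[fit_idx])
        oof[pred_idx] = model.predict_proba(X_train[pred_idx])[:, 1]
        test_parts.append(model.predict_proba(X_test)[:, 1])
    return oof, np.mean(test_parts, axis=0)        # new train and test columns

train_col, test_col = stack_feature(GradientBoostingClassifier(), X_train, y_train, X_test)
```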

At the first level the team had 200-250 different models, at the second another 20-30, at the third a few more. The result is blending, that is, mixing the predictions of different models. Various algorithms were used: gradient boosting with different parameters, random forests, neural networks. The main idea is to use the most diverse models possible, with different parameters, even if they do not give the highest quality on their own.

Teamwork. Usually participants join into teams towards the end of a competition, when everyone already has their own results. We teamed up with other Kagglers from the very beginning. Each team member had a folder in a shared cloud where datasets and scripts were placed. The common cross-validation procedure was agreed in advance so that results could be compared with each other. The roles were distributed as follows: I came up with new features, the second participant built models, the third selected them, and the fourth managed the whole process.

Where to get computing power. Testing a large number of hypotheses, building multi-level stacking, and training models can take too long on a laptop. Therefore, many participants use computing servers with a large number of cores and plenty of RAM. I usually use AWS servers, and my teammates seem to use work machines for contests while those machines are idle.

Communication with the organizing company. After a successful performance in a competition, communication with the company takes place in the form of a joint conference call: the participants describe their solution and answer questions. At BNP, people were not surprised by multi-level stacking; they were interested, of course, in feature construction, teamwork, validation of results - everything that could be useful for improving their own system.

Did the dataset need to be deciphered? The winning team noticed one peculiarity of the data: some features have missing values and some do not, that is, some characteristics did not depend on specific people. In addition, 360 unique values were observed, so it is logical to assume that these are some kind of time stamps. It turned out that if you take the difference between two such features and sort the entire sample by it, zeros tend to come first and then ones. This is exactly what the winners exploited.

Our team took third place. Almost three thousand teams participated in total.

Ad category recognition task

Link to DataRing.

This is another Avito contest. It took place in several stages, the first of which (as well as the third, by the way) was won by Arthur Kuzin.


Formulation of the problem. From the photos in an ad, you need to determine its category. Each ad came with one to five images. The metric took into account the match of categories at different levels of the hierarchy, from general ones to narrower ones (the last level contains 194 categories). In total there were almost a million images in the training sample, which is close to the size of ImageNet.


Recognition difficulties. It would seem that you just need to learn to distinguish a TV from a car, and a car from shoes. But, for example, there is a "British cats" category and an "other cats" category, and they contain very similar images - though you can still tell them apart. What about tires, rims and wheels? Here even a person cannot cope. These difficulties explain why the results of all participants ran into a certain ceiling.


Resources and framework. I had three computers with powerful video cards at my disposal: one at home, one provided by the laboratory at the Moscow Institute of Physics and Technology, and one at work. So it was possible (and necessary) to train several networks at the same time. MXNet was chosen as the main framework for training neural networks; it was created by the same people who wrote the well-known XGBoost, which by itself was a reason to trust the new product. The advantage of MXNet is that an efficient iterator with standard augmentation is available out of the box, which is enough for most tasks.


Network architectures. Experience from one of the previous competitions showed that Inception-series architectures give the best quality, and I used them here. GoogLeNet was added to the set because it sped up model training. We also used the Inception-v3 and Inception BN architectures from the Model Zoo library, with a dropout added before the last fully connected layer. Due to technical problems it was not possible to train the networks with stochastic gradient descent, so Adam was used as the optimizer.



Data augmentation. To improve the quality of the networks, augmentation was used - adding distorted images to the sample in order to increase the diversity of the data. Transformations such as random cropping, reflection, rotation by a small angle, changes of aspect ratio, and shifts were involved.
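A sketch of these augmentations written with torchvision for brevity (the author used MXNet's built-in iterator; the exact parameters are assumptions):

```python
# Random crop, horizontal reflection, small rotation, aspect-ratio change, shift.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0), ratio=(0.8, 1.25)),  # crop + aspect
    transforms.RandomHorizontalFlip(),
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1)),               # rotation + shift
    transforms.ToTensor(),
])
```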

Accuracy and training speed. At first I divided the sample into three parts, but then dropped one of the validation stages used for mixing models. The second part of the sample was therefore later added to the training set, which improved the quality of the networks. In addition, GoogLeNet was originally trained on a Titan Black, which has half the memory of a Titan X, so this network was retrained with a larger batch size and its accuracy increased. Looking at the training times, one can conclude that under time constraints it is not worth using Inception-v3, since the other two architectures train much faster; the reason is the number of parameters. The fastest to train is Inception BN.

Building Predictions.

Like Evgeny in the car-makes contest, Arthur used crop predictions - but over 24 crops rather than 10: the corners, their reflections, the center, rotations of the central parts, and ten more random crops.

If you save the state of the network after each epoch, you end up with many different models rather than just the final network. Given the time remaining until the end of the competition, I could use the predictions of models from 11 epochs, since building predictions with a network also takes a long time. All these predictions were averaged according to the following scheme: first an arithmetic mean within the crop groups, then a geometric mean across the groups with weights selected on the validation set; the operation is then repeated for all epochs. At the end, the class probabilities of all images of one ad are averaged with an unweighted geometric mean.
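A sketch of this averaging chain (shapes and variable names are assumptions):

```python
# Arithmetic mean inside each crop group, weighted geometric mean across groups,
# and an unweighted geometric mean over the images of one ad.
import numpy as np

def geom_mean(prob_stack, weights=None):
    """prob_stack: (n_items, n_classes); returns (n_classes,)."""
    prob_stack = np.asarray(prob_stack, dtype=float)
    w = np.ones(len(prob_stack)) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    return np.exp(w @ np.log(prob_stack + 1e-15))

# crop_probs[g] is a list of per-crop probability vectors for group g of one image
group_means = [np.mean(g, axis=0) for g in crop_probs]          # arithmetic, per group
image_prob = geom_mean(group_means, weights=group_weights)      # weighted geometric
ad_prob = geom_mean(per_image_probs)                            # over the ad's images
```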


Results. When fitting the weights at the validation stage, the competition metric itself was used, because plain accuracy did not correlate well with it. Predicting on different regions of the image adds only a small amount of quality compared to a single prediction, but it is precisely this increase that makes it possible to show the best result. At the end of the competition it turned out that the top three places differ by mere thousandths. For example, Zhenya Nizhibitsky had a single model that was only slightly inferior to my ensemble of models.


Training from scratch vs. fine-tuning. After the competition ended, it turned out that despite the large sample size, it was better not to train the network from scratch but to fine-tune a pre-trained network: this approach gives better results.

Reinforcement learning problem

The Black Box Challenge was not quite like a typical Kaggle-style competition. The point is that to solve it, it was not enough to label some "test" sample. You had to program an "agent" and upload its code into the system; the agent was placed in an environment unknown to the participant and made decisions in it on its own. Such tasks belong to the field of reinforcement learning.

Mikhail Pavlov from 5vision spoke about approaches to the solution. He took second place in the competition.


Formulation of the problem. For an environment with unknown rules, it was necessary to write an "agent" that would interact with it. Schematically, the agent is a kind of brain that receives the current state and a reward from the black box, decides on an action, and then receives a new state and a reward for that action. Actions are repeated one after another throughout the game. The current state is described by a vector of 36 numbers, the agent can take four actions, and the goal is to maximize the total reward over the whole game.
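Schematically, the interaction can be thought of as the loop below; the environment interface names here are assumptions, not the organizers' actual API:

```python
# The agent-environment loop: observe a 36-dimensional state, pick one of four
# actions, collect the reward, repeat until the episode ends.
import numpy as np

class RandomAgent:
    def act(self, state):
        return np.random.randint(4)          # placeholder policy

def play_episode(env, agent):
    state = env.reset()                      # hypothetical environment interface
    total_reward = 0.0
    while not env.done():
        action = agent.act(state)
        state, reward = env.step(action)
        total_reward += reward               # the quantity to maximize
    return total_reward
```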


Environment analysis. A study of the distributions of the environment's state variables showed that the first 35 components do not depend on the chosen action and only the 36th component changes with it. Different actions affected it differently: some increased or decreased it, some did not change it at all. Still, it cannot be said that the entire environment depends on one component: it may contain hidden variables. In addition, experiments showed that if you perform more than 100 identical actions in a row, the reward becomes negative, so strategies like "do only one action" were ruled out immediately. Some participants noticed that the reward is proportional to that same 36th component. It was suggested on the forum that the black box imitates a financial market, where the 36th component is the portfolio and the actions are buying, selling, and doing nothing. These options correlated with changes in the portfolio, while the meaning of one of the actions remained unclear.


Q-learning. The main goal of participating was to try out different reinforcement learning techniques. One of the simplest and best-known methods is q-learning. Its essence is an attempt to build a function Q of the state and the chosen action that evaluates how "good" it is to choose that action in that state, where "good" includes the reward we will receive not only now but also in the future. The function is trained iteratively: at each iteration we try to bring it closer to its own value at the next step of the game, taking into account the reward received now. Q-learning assumes fully observable Markov processes (in other words, the current state must contain all the information about the environment). Although, according to the organizers, the environment did not meet this requirement, q-learning could still be applied quite successfully.

Adaptation to the black box. Empirically, n-step q-learning turned out to suit the environment best: the reward is taken not for the last action alone but for n actions ahead. The environment allowed saving the current state and rolling back to it, which made collecting a sample easier: every action could be tried from a given state, not just one. At the very beginning of training, when the q-function was not yet able to evaluate actions, the strategy "always perform action 3" was used, on the assumption that it changed nothing and allowed training to start on data without noise.
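A sketch of the n-step target the Q-model would be fit to (gamma is kept at 1 here, since, as noted further below, discounting did not help in this task):

```python
# n-step q-learning target: sum of the next n rewards plus a bootstrap from the
# current Q estimate at the state reached after n steps.
def n_step_target(rewards, bootstrap_q, gamma=1.0):
    """rewards: r_t ... r_{t+n-1}; bootstrap_q: max over actions of Q(s_{t+n}, a)."""
    target = 0.0
    for k, r in enumerate(rewards):
        target += (gamma ** k) * r
    return target + (gamma ** len(rewards)) * bootstrap_q
```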

Learning process. The training went like this: with the current policy (the agent’s strategy), we play the entire episode, accumulating a sample, then using the obtained sample, we update the q-function, and so on - the sequence is repeated for a certain number of epochs. The results were better than when updating the q-function during the game. Other methods - the replay memory technique (with a common data bank for training, where new game episodes are entered) and the simultaneous training of several agents playing asynchronously - also turned out to be less effective.

Models. The solution used three regressions (each one time per action) and two neural networks. Some quadratic features and interactions have been added. The final model is a mixture of all five models (five Q-functions) with equal weights. In addition, online retraining was used: in the process of testing, the weights of the old regressions were mixed with the new weights obtained on the test set. This was done only for regressions, since their solutions can be written analytically and recalculated fairly quickly.


Other ideas. Naturally, not all ideas improved the final result. For example, reward discounting (when we do not simply maximize the total reward but consider each subsequent move less valuable), deeper networks, and the dueling architecture (which evaluates the usefulness of the state and of each action separately) gave no improvement. Due to technical problems it was not possible to apply recurrent networks, although in an ensemble with other models they might have provided some benefit.


Results. Team 5vision took second place, with a very small margin over the bronze medalists.


So, why enter a data analysis competition?

  • Prizes. Successful performance in most competitions is rewarded with cash prizes or other valuable gifts. Over seven million dollars have been awarded on Kaggle in seven years.
  • Career. Sometimes a prize-winning finish helps here as well.
  • Experience. This, of course, is the most important thing: you can explore a new area and start solving problems you have not encountered before.

Currently, machine learning training sessions are held on Saturdays every other week. The venue is the Moscow office of Yandex, the standard number of guests (guests plus Yandexoids) is 60-80 people. The main feature of the trainings is their topicality: each time the competition is sorted out, which ended one or two weeks ago. This makes it difficult to plan everything exactly, but the competition is still fresh in memory and many people gather in the hall who have tried their hand at it. The training is supervised by Emil Kayumov, who, by the way, helped with writing this post.

In addition, there is another format - solution sessions, where novice specialists jointly take part in ongoing competitions. These sessions are held on the Saturdays when there are no trainings. Anyone can come to either type of event; announcements are published in the corresponding groups.