Tightness of the linear relationship between random variables. Determining the closeness of the relationship between random variables

Regression Analysis

Processing the results of an experiment by the method of regression analysis

When studying the functioning of complex systems, one has to deal with a number of simultaneously acting random variables. To understand the mechanism of the phenomena, the cause-and-effect relationships between the elements of the system, and so on, we try to establish the relationships among these quantities on the basis of the observations obtained.

In mathematical analysis the dependence between, for example, two quantities is expressed by the concept of a function

y = f(x),

where each value of one variable corresponds to exactly one value of the other. Such a dependence is called functional.

The situation with the concept of dependence between random variables is much more complicated. Between the random variables (random factors) that determine the functioning of complex systems there usually exists a relationship in which a change in one quantity changes the distribution of another. Such a connection is called stochastic, or probabilistic. The magnitude of the change in the random factor Y corresponding to a change in the value X can be broken down into two components: the first is related to the dependence of Y on X, and the second to the influence of the "own" random components of Y and X. If the first component is missing, the random variables Y and X are independent. If the second component is missing, Y and X depend functionally. When both components are present, the ratio between them determines the strength, or tightness, of the relationship between the random variables Y and X.

There are various indicators that characterize particular aspects of a stochastic relationship. Thus, the linear dependence between the random variables X and Y is characterized by the correlation coefficient

r_XY = M[(X − m_X)(Y − m_Y)] / (σ_X·σ_Y),

where m_X, m_Y are the mathematical expectations of the random variables X and Y, and σ_X, σ_Y are the standard deviations of the random variables X and Y.


The linear probabilistic dependence of random variables consists in the fact that as one random variable increases, the other tends to increase (or decrease) according to a linear law. If the random variables X and Y are connected by a strict linear functional dependence, for example

y = b0 + b1·x,

then the correlation coefficient equals ±1, where the sign corresponds to the sign of the coefficient b1. If the values X and Y are connected by an arbitrary stochastic dependence, the correlation coefficient varies within −1 ≤ r ≤ 1.

It should be emphasized that for independent random variables the correlation coefficient equals zero. However, as an indicator of the dependence between random variables, the correlation coefficient has serious drawbacks. First, the equality r = 0 does not imply the independence of the random variables X and Y (except for random variables following the normal distribution law, for which r = 0 does mean the absence of any dependence). Second, the extreme values r = ±1 are also not very informative, since they correspond not to just any functional dependence but only to a strictly linear one.
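As a small illustration of the first drawback, here is a sketch in plain Python (statistics.correlation requires Python 3.10+; the data are invented): Y is a deterministic function of X, yet r comes out near zero.

```python
# Sketch: r = 0 does not imply independence.
# Y = X**2 depends on X functionally, but for X symmetric around zero
# the (linear) correlation coefficient is close to 0.
import random
import statistics

random.seed(1)
xs = [random.uniform(-1.0, 1.0) for _ in range(100_000)]
ys = [x ** 2 for x in xs]            # Y is fully determined by X

r = statistics.correlation(xs, ys)   # Pearson r (Python 3.10+)
print(f"r = {r:.4f}")                # ~0 despite the strict dependence
```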



A full description of the dependence of Y on X, expressed moreover in exact functional relationships, can be obtained by knowing the conditional distribution function F(y | x).

It should be noted that in this case one of the observed variables is considered non-random. When we fix the values of the two random variables X and Y simultaneously and compare them, we can attribute all errors to the value Y alone. The observation error will then be the sum of the own random error of the quantity Y and the matching error, which arises because the value of Y is matched with a value of X that is not quite the one that actually occurred.

However, finding the conditional distribution function usually turns out to be a very difficult task. The relationship between X and Y is easiest to investigate when Y is normally distributed, since a normal distribution is completely determined by its mathematical expectation and variance. In this case, to describe the dependence of Y on X there is no need to build the conditional distribution function; it is enough to indicate how the mathematical expectation and the variance of Y change as the parameter X changes.

Thus, we come to the need to find only two functions: the conditional mathematical expectation M(Y | x) and the conditional variance D(Y | x).

The dependence of the conditional variance D(Y | x) on the parameter x is called the scedastic dependence. It characterizes the change in the accuracy of the observation technique as the parameter changes and is used quite rarely.

The dependence of the conditional mathematical expectation M(Y | x) on x is called the regression; it gives the true dependence between the quantities X and Y, free of all random overlays. Therefore, the ideal goal of any study of dependent quantities is to find the regression equation, while the variance is used only to assess the accuracy of the result.
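To make this concrete, here is a small simulation (the coefficients, noise level, and bins are all hypothetical): averaging Y within narrow bins of x recovers the regression line M(Y | x) hidden under the random overlays.

```python
# Sketch: the conditional mean M(Y | x) as the "true dependence".
# Simulated model (assumed for illustration): Y = 2x + 1 + noise.
import random

random.seed(2)
pairs = []
for _ in range(50_000):
    x = random.uniform(0, 10)
    pairs.append((x, 2 * x + 1 + random.gauss(0, 3)))

# Average Y inside narrow bins of x; the bin means track the line 2x + 1.
for lo in range(0, 10, 2):
    ys = [y for x, y in pairs if lo <= x < lo + 2]
    mid = lo + 1.0
    print(f"x ~ {mid}: mean(Y) = {sum(ys) / len(ys):.2f}"
          f"  (true M(Y|x) = {2 * mid + 1:.2f})")
```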

Correlation: a statistical relationship between two or more random variables.

The partial correlation coefficient characterizes the degree of linear relationship between two quantities and has all the properties of the pair correlation coefficient, i.e. it varies from −1 to +1. If the partial correlation coefficient equals ±1, the relationship between the two quantities is functional, while its equality to zero indicates the linear independence of these quantities.

The multiple correlation coefficient characterizes the degree of linear dependence between the quantity x1 and the other variables (x2, ..., xs) included in the model; it varies from 0 to 1.

An ordinal variable helps to sort the statistically studied objects according to the degree of manifestation of the analyzed property in them.

Rank correlation is a statistical relationship between ordinal variables (a measure of the statistical relationship between two or more rankings of the same finite set of objects O1, O2, ..., On).

Ranking is the arrangement of objects in descending order of the degree of manifestation of the k-th property under study. In this case, xi(k) is called the rank of the i-th object according to the k-th feature. The rank characterizes the ordinal place occupied by the object Oi in the series of n objects.
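A minimal sketch of rank correlation (Spearman's coefficient, computed as the Pearson correlation of the two rank sequences; the objects and scores are invented, and ties are not handled):

```python
import statistics

def ranks(values):
    """Rank of each object by one feature (1 = strongest manifestation)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    r = [0] * len(values)
    for place, i in enumerate(order, start=1):
        r[i] = place
    return r

feature_1 = [8.1, 6.4, 9.3, 5.0, 7.7]   # hypothetical scores, property 1
feature_2 = [7.9, 6.0, 8.8, 7.2, 5.5]   # hypothetical scores, property 2

rho = statistics.correlation(ranks(feature_1), ranks(feature_2))
print(f"Spearman rho = {rho:.3f}")       # 0.6: the rankings mostly agree
```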

39. Correlation coefficient, coefficient of determination.

The correlation coefficient shows the degree of statistical dependence between two numerical variables. It is calculated as follows:

r = (n·∑xi·yi − ∑xi·∑yi) / √[(n·∑xi² − (∑xi)²)·(n·∑yi² − (∑yi)²)],

where n is the number of observations, x is the input variable, and y is the output variable. The values of the correlation coefficient always lie in the range from −1 to 1 and are interpreted as follows:

    if the correlation coefficient is close to 1, then there is a positive correlation between the variables;

    if the correlation coefficient is close to −1, then there is a negative correlation between the variables;

    intermediate values close to 0 indicate a weak correlation between the variables and, accordingly, a low dependence.

The coefficient of determination (R²) is the proportion of the variance of the deviations of the dependent variable from its mean that is explained by the regression.

The formula for calculating the coefficient of determination:

R² = 1 − ∑i (yi − fi)² / ∑i (yi − ȳ)²

where yi is the observed value of the dependent variable, fi is the value of the dependent variable predicted by the regression equation, and ȳ is the arithmetic mean of the dependent variable.
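A minimal sketch of this formula in Python (the observed values y and the regression predictions f are invented for illustration):

```python
# Sketch: coefficient of determination R^2 = 1 - SS_res / SS_tot.
y = [3.0, 4.1, 5.2, 5.9, 7.1]   # observed values y_i (hypothetical)
f = [3.1, 4.0, 5.0, 6.0, 7.0]   # regression predictions f_i (hypothetical)

y_bar = sum(y) / len(y)
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, f))   # unexplained part
ss_tot = sum((yi - y_bar) ** 2 for yi in y)            # total variation
r2 = 1 - ss_res / ss_tot
print(f"R^2 = {r2:.4f}")        # share of variance explained by the model
```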

Question 16

According to this method, the stocks of the current Supplier are used to satisfy the requests of successive Consumers until those stocks are completely exhausted, after which the stocks of the next Supplier in order are used.

Filling in the table of the transport problem starts from the upper left corner and consists of a number of steps of the same type. At each step, based on the stocks of the current Supplier and the request of the current Consumer, exactly one cell is filled in and, accordingly, one Supplier or one Consumer is excluded from further consideration.

To avoid errors, after constructing the initial basic (reference) solution, one should check that the number of occupied cells equals m + n − 1.
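A sketch of this filling procedure in Python (the supplies and demands are hypothetical, and the problem is assumed balanced; in the degenerate case where a supplier and a consumer are exhausted simultaneously, fewer than m + n − 1 cells are filled):

```python
def northwest_corner(supply, demand):
    """Initial basic solution of a balanced transport problem."""
    supply, demand = supply[:], demand[:]          # work on copies
    plan = [[0] * len(demand) for _ in supply]
    i = j = filled = 0
    while i < len(supply) and j < len(demand):
        amount = min(supply[i], demand[j])         # fill one cell per step
        plan[i][j] = amount
        supply[i] -= amount
        demand[j] -= amount
        filled += 1
        if supply[i] == 0:
            i += 1                                 # Supplier exhausted
        else:
            j += 1                                 # Consumer satisfied
    return plan, filled

plan, filled = northwest_corner([30, 40, 20], [20, 30, 30, 10])
for row in plan:
    print(row)
print(filled == 3 + 4 - 1)   # the m + n - 1 check from the text
```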

To assess the closeness of the relationship between the random variables Y and X, special indicators are used.


As we have already said, one of the main features distinguishing the sequence of observations that forms a time series is that the members of a time series are, generally speaking, statistically interdependent. The degree of tightness of the statistical relationship between the random variables X(t) and X(t + τ) can be measured by the pairwise correlation coefficient.
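A minimal sketch of this lag-τ correlation for a short invented series, measured as the Pearson correlation between the series and its shifted copy (statistics.correlation requires Python 3.10+):

```python
import statistics

def lag_corr(series, tau):
    """Pairwise correlation between X_t and X_(t+tau)."""
    return statistics.correlation(series[:-tau], series[tau:])

x = [28, 30, 29, 33, 34, 32, 36, 38, 37, 41, 40, 43]   # hypothetical levels
print(f"r(tau=1) = {lag_corr(x, 1):.3f}")
print(f"r(tau=2) = {lag_corr(x, 2):.3f}")
```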

An estimate of a general (population) parameter is obtained from a sample indicator, taking the representativeness error into account. Alternatively, a hypothesis is put forward about some property of the general population: the value of the mean or the variance, the nature of the distribution, or the form and closeness of the relationship between the variables. Hypothesis testing is carried out by identifying the consistency of the empirical data with the hypothetical (theoretical) ones. If the discrepancy between the compared values does not go beyond the limits of random errors, the hypothesis is accepted. However, no conclusion is made about the correctness of the hypothesis itself; we speak only of the consistency of the compared data. Statistical hypothesis testing is based on data from random samples, and it makes no difference whether the hypotheses are evaluated against a real or a hypothetical general population. The latter opens the way to applying this method beyond actual sampling, e.g. when analyzing the results of an experiment or data of continuous observation covering only a small number of cases. In such cases it is recommended to check whether an established regularity is due to a coincidence of random circumstances and to what extent it is typical for the complex of conditions in which the population under study exists.

It turns out that the correlation and regression characteristics of the distorted scheme (ξ′, η′) may differ significantly from the corresponding characteristics of the original (undistorted) scheme (ξ, η). The superposition of normal errors on the original two-dimensional normal scheme (ξ, η) always reduces the absolute value of the regression coefficient in relation (B.15), and also weakens the degree of tightness of the relationship between ξ and η (i.e., reduces the absolute value of the correlation coefficient r).

Influence of measurement errors on the value of the correlation coefficient. Suppose we want to estimate the degree of closeness of the correlation between the components of a two-dimensional normal random variable (ξ, η), but we can observe them only with some random measurement errors, εξ and εη respectively (see dependence diagram D2 in the introduction). Therefore, the experimental data (xi, yi), i = 1, 2, ..., n, are in practice sample values of the distorted two-dimensional random variable (ξ′, η′), where ξ′ = ξ + εξ and η′ = η + εη.

The method of regression analysis consists in deriving a regression equation (including the estimation of its parameters), with the help of which the average value of a random variable is found when the value of another variable (or several others, in the case of multiple or multivariate regression) is known. (In contrast, correlation analysis is used to find and express the tightness of the relationship between random variables.)

When studying the correlation of features that are not connected by a consistent change in time, each feature changes under the influence of many causes taken as random. In time series, a change over time is added to these for each series. This change leads to so-called autocorrelation: the influence of changes in the levels of earlier observations on later ones. Therefore, the correlation between the levels of time series correctly shows the tightness of the connection between the phenomena reflected in them only if there is no autocorrelation in either of them. Moreover, autocorrelation distorts the mean square errors of the regression coefficients, which makes it difficult to construct confidence intervals for the regression coefficients and to test their significance.

The theoretical and sample correlation coefficients defined by relations (1.8) and (1.8), respectively, can formally be calculated for any two-dimensional system of observations; they are measures of the degree of closeness of the linear statistical relationship between the analyzed features. However, only in the case of a joint normal distribution of the random variables under study, ξ and η, does the correlation coefficient r have a clear meaning as a characteristic of the degree of closeness of the connection between them. In particular, in this case the equality |r| = 1 confirms a purely functional linear relationship between the quantities under study, and the equality r = 0 indicates their complete mutual independence. In addition, the correlation coefficient, together with the means and variances of the random variables ξ and η, constitutes the five parameters that provide comprehensive information about their joint (two-dimensional normal) distribution.

The relationship that exists between random variables of different nature, for example between a value X and a value Y, is not necessarily a consequence of a direct dependence of one variable on the other (a so-called functional relationship). In some cases both quantities depend on a whole set of factors common to both, as a result of which mutually related patterns are formed. When a relationship between random variables is discovered with the help of statistics, we cannot claim to have discovered the cause of the changes in the parameters; rather, we have only seen two interconnected consequences.

For example, children who watch more American action movies on TV read less. Children who read more learn better. It is not easy to decide which of these are causes and which are effects, but that is not the task of statistics. Statistics can only put forward a hypothesis about the presence of a connection and back it up with numbers. If there is indeed a connection, the two random variables are said to be correlated. If an increase in one random variable is associated with an increase in the second, the correlation is called direct: for example, the number of pages read per year and the average grade (performance). If, on the contrary, an increase in one value is associated with a decrease in another, one speaks of an inverse correlation: for example, the number of action movies watched and the number of pages read.

The mutual relationship of two random variables is called correlation; correlation analysis allows one to determine the presence of such a relationship and to assess how close and significant it is. All of this is quantified.

How do we determine whether there is a correlation between the quantities? In most cases this can be seen on an ordinary chart. For example, for each child in our sample we can determine the values Xi (number of pages) and Yi (annual grade point average) and record these data in a table. Then we draw the X and Y axes and plot the whole series of points on the graph, so that each point has a specific pair of coordinates (Xi, Yi) from our table. Since in this case it is difficult to decide what should be considered the cause and what the consequence, it does not matter which axis is vertical and which is horizontal.
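For instance, such a chart can be sketched with matplotlib (assuming the library is installed; the (Xi, Yi) pairs below are invented):

```python
import matplotlib.pyplot as plt

pages = [120, 340, 560, 800, 1020, 1500, 200, 640]    # X_i: pages per year
gpa   = [3.1, 3.4, 3.8, 4.2, 4.4, 4.8, 3.2, 3.9]      # Y_i: annual GPA

plt.scatter(pages, gpa)                # one point per (X_i, Y_i) pair
plt.xlabel("Pages read per year (X)")
plt.ylabel("Annual GPA (Y)")
plt.title("Correlation field")
plt.show()
```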


If the graph looks like (a), this indicates the presence of a direct correlation; if it looks like (b), the correlation is inverse; a shapeless cloud of points corresponds to a lack of correlation.

Using the correlation coefficient, one can calculate how close a relationship exists between the quantities.

Suppose there is a correlation between the price of a product and the demand for it. The number of units purchased, depending on the price set by different sellers, is shown in the table:

It can be seen that we are dealing with an inverse correlation. To quantify the tightness of the connection, the correlation coefficient defined earlier is used.

We calculate the coefficient r in Excel using the fx button, the Statistical category, and the CORREL function; at the program's prompts we select the two corresponding arrays (X and Y) with the mouse in the two input fields. In our case the correlation coefficient came out as r = −0.988. Note that the closer the correlation coefficient is to 0, the weaker the relationship between the quantities. The closest relationship under a direct correlation corresponds to a coefficient r close to +1. In our case the correlation is inverse but also very close, and the coefficient is close to −1.
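The same number can be obtained outside Excel; here is a Python equivalent of CORREL (a sketch: the price/demand pairs are hypothetical stand-ins for the table, so the resulting r will differ from −0.988):

```python
import statistics

price  = [10, 12, 14, 16, 18, 20]   # X: price set by different sellers
bought = [95, 88, 76, 61, 52, 40]   # Y: units purchased at that price

r = statistics.correlation(price, bought)   # same quantity as =CORREL(X, Y)
print(f"r = {r:.3f}")                       # negative: inverse correlation
```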

What can be said about random variables whose coefficient has an intermediate value, for example r = 0.65? In this case statistics allows us to say that the two random variables are partially related. More precisely, the share of the variation in purchases attributable to the price is given by the coefficient of determination r² ≈ 0.42; the remaining variation is due to other circumstances.

And one more important circumstance should be mentioned. Since we are talking about random variables, there is always a possibility that the connection we noticed is a random coincidence. The probability of finding a connection where there is none is especially high when there are few points in the sample and when, instead of building a graph, you simply computed the value of the correlation coefficient on a computer. Thus, if we leave only two points in any arbitrary sample, the correlation coefficient will equal either +1 or −1: from the school geometry course we know that a straight line can always be drawn through two points. To assess the statistical significance of the connection you have discovered, it is useful to use the so-called correlation correction:

While the task of correlation analysis is to establish whether the given random variables are related, the goal of regression analysis is to describe this relationship by an analytical dependence, i.e. by an equation. We will consider the simplest case, when the connection between the points on the graph can be represented by a straight line. The equation of this straight line is Y = bX + a, where a = Ȳ − bX̄ and the slope is b = ∑(Xi − X̄)(Yi − Ȳ) / ∑(Xi − X̄)².

Knowing a and b, we can find the value of the function from the value of the argument at those points where the value of X is known but Y is not. These estimates are very useful, but they must be used with caution, especially if the relationship between the quantities is not too close.

We also note that a comparison of the formulas for b and r shows that the correlation coefficient does not give the value of the slope of the straight line, but only indicates the very fact of the existence of a connection.
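A sketch of these least-squares formulas in Python (the data points are invented):

```python
def linear_fit(xs, ys):
    """Least squares line Y = b*X + a."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))   # slope
    a = y_bar - b * x_bar                       # intercept
    return a, b

a, b = linear_fit([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(f"Y = {b:.3f}*X + {a:.3f}")
```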

The company employs 10 people. Table 2 shows data on their work experience and monthly salary.

Calculate from these data:

  • - the value of the sample covariance estimate;
  • - the value of the sample Pearson correlation coefficient;
  • - evaluate the direction and strength of the connection according to the values obtained;
  • - determine how legitimate the statement is that this company uses the Japanese management model, which assumes that the more time an employee has spent in the company, the higher his salary should be.

Based on the correlation field, one can hypothesize (for the general population) that the relationship between all possible values ​​of X and Y is linear.

To calculate the regression parameters, we will build a calculation table.

Sample means.

Sample variances:

The estimated regression equation will look like

y = b·x + a + ε,

where the ei are the observed values (estimates) of the errors εi, and a and b are, respectively, the estimates of the parameters α and β of the regression model that are to be found.

To estimate the parameters α and β, we use the least squares method (LSM).

The system of normal equations:

a·n + b·∑x = ∑y
a·∑x + b·∑x² = ∑(y·x)

For our data, the system of equations has the form

  • 10a + 307b = 33300
  • 307 a + 10857 b = 1127700

We multiply equation (1) of the system by (−30.7) and obtain a system that we solve by the method of algebraic addition.

  • -307a -9424.9 b = -1022310
  • 307 a + 10857 b = 1127700

We get:

1432.1b = 105390

From this, b = 73.5912.

Now we find the coefficient "a" from equation (1):

  • 10a + 307b = 33300
  • 10a + 307 * 73.5912 = 33300
  • 10a = 10707.49

We get empirical regression coefficients: b = 73.5912, a = 1070.7492

Regression equation (empirical regression equation):

y = 73.5912 x + 1070.7492
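As a check, the system of normal equations can also be solved directly, e.g. by Cramer's rule (a sketch in plain Python, reproducing the coefficients above):

```python
# Normal equations: 10a + 307b = 33300; 307a + 10857b = 1127700.
det = 10 * 10857 - 307 * 307
a = (33300 * 10857 - 307 * 1127700) / det
b = (10 * 1127700 - 33300 * 307) / det
print(f"a = {a:.4f}, b = {b:.4f}")   # expected: a ~ 1070.7492, b ~ 73.5912
```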

Covariance:

In our example, the relationship between feature Y and factor X is high and direct.

Therefore, we can safely say that the more time an employee works in a given company, the higher his salary.

4. Testing statistical hypotheses. When solving this kind of problem, the first step is to formulate the hypothesis to be tested and an alternative to it.

Checking the equality of general shares.

A study was conducted on student performance at two faculties. The results for the variants are shown in Table 3. Can it be argued that both faculties have the same percentage of excellent students?


We test the hypothesis about the equality of the general shares:

Let's find the experimental value of Student's criterion:

Number of degrees of freedom

f = nx + ny − 2 = 2 + 2 − 2 = 2

We determine the critical value t_cr from the table of critical points of Student's distribution at the significance level α = 0.05 (two-sided, so α/2 = 0.025) and the given number of degrees of freedom:

t_cr = T_table(f; α/2) = T_table(2; 0.025) = 4.303

Since t_obs > t_cr, the null hypothesis is rejected: the general shares of the two populations are not equal.
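For comparison, here is a hedged sketch of the equality-of-shares test with the common normal-approximation z statistic (the counts are invented, since Table 3 is not reproduced here):

```python
import math

n1, m1 = 120, 30   # faculty 1: number of students, number of excellent ones
n2, m2 = 150, 27   # faculty 2 (hypothetical counts)

p1, p2 = m1 / n1, m2 / n2
p = (m1 + m2) / (n1 + n2)                    # pooled share under H0
z = (p1 - p2) / math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
print(f"z = {z:.3f}")
# |z| > 1.96 at significance level 0.05 -> reject equality of the shares
```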

Testing whether the general distribution is uniform.

The university management wants to find out how the popularity of the Faculty of Humanities has changed over time. The number of applicants who applied to this faculty was analyzed relative to the total number of applicants in the corresponding year (the data are given in Table 4). If we consider the number of applicants a representative sample of the total number of school graduates of the year, can it be argued that the interest of schoolchildren in the specialties of this faculty does not change over time?

Option 4

Solution: we build a table for calculating the indicators, with columns: interval midpoint xi; cumulative frequency S; relative frequency fi/n.

To evaluate the distribution series, we find the following indicators:

Weighted average:

The range of variation is the difference between the maximum and minimum values ​​of the attribute of the primary series.

R = 2008 − 1988 = 20.

Variance characterizes the measure of spread of values around the mean (a measure of dispersion, i.e., of deviation from the mean).

Standard deviation (root mean square deviation).

Each value of the series differs from the mean value 2002.66 by 6.32 on average.

Testing the hypothesis about the uniform distribution of the general population.

In order to test the hypothesis that X is distributed uniformly, i.e. according to the law f(x) = 1/(b − a) on the interval (a, b), it is necessary to:

Estimate the parameters a and b, the ends of the interval in which the possible values of X were observed, by the formulas (the * denotes the parameter estimates):

a* = x̄ − √3·σ,  b* = x̄ + √3·σ

Find the probability density of the estimated distribution f(x) = 1/(b* - a*)

Find the theoretical frequencies:

n1 = n·P1 = n·[1/(b* − a*)]·(x1 − a*)

n2 = n3 = ... = n(s−1) = n·[1/(b* − a*)]·(xi − xi−1)

ns = n·[1/(b* − a*)]·(b* − x(s−1))

Compare the empirical and theoretical frequencies using the Pearson criterion, taking the number of degrees of freedom k = s − 3, where s is the number of initial sampling intervals; if, however, small frequencies (and hence the corresponding intervals) were combined, then s is the number of intervals remaining after the combination. Let us find the estimates a* and b* of the parameters of the uniform distribution by the formulas above:

Let's find the density of the supposed uniform distribution:

f(x) = 1/(b* - a*) = 1/(2013.62 - 1991.71) = 0.0456

Let's find the theoretical frequencies:

n1 = n·f(x)·(x1 − a*) = 0.77 · 0.0456 · (1992 − 1991.71) = 0.0102

n5 = n·f(x)·(b* − x4) = 0.77 · 0.0456 · (2013.62 − 2008) = 0.2

ni = n·f(x)·(xi − xi−1) for the intermediate intervals

Since the Pearson statistic measures the discrepancy between the empirical and theoretical distributions, the larger its observed value K_obs, the stronger the argument against the main hypothesis.

Therefore, the critical region for this statistic is always right-sided.
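Putting the whole procedure together, here is a sketch for hypothetical interval data (the boundaries and frequencies are invented; the clamping to [a*, b*] implements the special formulas for the first and last intervals):

```python
import math

edges  = [1988, 1992, 1996, 2000, 2004, 2008]   # interval boundaries
counts = [3, 5, 8, 6, 4]                        # empirical frequencies n_i
n = sum(counts)

# Sample mean and standard deviation from the grouped data.
mids  = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]
mean  = sum(m * c for m, c in zip(mids, counts)) / n
sigma = math.sqrt(sum(c * (m - mean) ** 2 for m, c in zip(mids, counts)) / n)

a_est = mean - math.sqrt(3) * sigma             # a* = xbar - sqrt(3)*sigma
b_est = mean + math.sqrt(3) * sigma             # b* = xbar + sqrt(3)*sigma
density = 1 / (b_est - a_est)                   # f(x) = 1/(b* - a*)

# Theoretical frequencies; the first/last intervals are cut at a*, b*.
theo = [n * density * (min(hi, b_est) - max(lo, a_est))
        for lo, hi in zip(edges, edges[1:])]

chi2 = sum((e - t) ** 2 / t for e, t in zip(counts, theo))
k = len(counts) - 3                             # degrees of freedom: s - 3
print(f"chi2_obs = {chi2:.3f}, df = {k}")       # compare with chi2_cr(k)
```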