Coffee Chat Brewing AI Knowledge


Hopkins Statistic

The Hopkins statistic (Lawson and Jurs 1990) is a measure used to determine the clustering tendency of a dataset. It is particularly useful in assessing whether the data points are uniformly distributed or if there is a significant clustering structure present. Using the Hopkins statistic can help decide whether to proceed with clustering algorithms or not, serving as a preliminary step in many clustering analyses to ensure that the data is suitable for such methods.

How Hopkins Statistic works

The Hopkins statistic is calculated as follows. If $D$ is the original dataset:

  1. A subset $R$ of $n$ data points $(p_1, …, p_n)$ is randomly selected from $D$.
  2. A set $U$ of $n$ points $(q_1, …, q_n)$ is generated from a uniform random distribution within the same range as $D$.
  3. For each point $p_i$ in $R$, the distance to its nearest neighbor $p_j$ in $D$ (with $p_j \neq p_i$) is calculated: $w_i=dist(p_i, p_j)$
  4. For each point $q_i$ in $U$, the distance to its nearest neighbor $p_j$ in $D$ is calculated: $u_i=dist(q_i, p_j)$
  5. The Hopkins statistic $H$ is then defined as the sum of the nearest neighbor distances for $U$ divided by the sum of the nearest neighbor distances for $U$ and $R$ combined, with each distance raised to the power $d$:
\[H={\sum_{i=1}^n u_i^d \over \sum_{i=1}^n u_i^d + \sum_{i=1}^n w_i^d}\]
  • $d$: the dimension of the data
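The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `hopkins_statistic` and its parameters are made up for this example, nearest neighbors are assumed to be searched in the full dataset $D$, and "the same range as $D$" is assumed to mean the axis-aligned bounding box of the data. It uses only the standard library and a brute-force neighbor search, so it is suitable for small datasets only.

```python
import math
import random


def hopkins_statistic(data, n_samples, seed=None):
    """Estimate the Hopkins statistic H for a list of equal-length points.

    Hypothetical helper for illustration; brute-force nearest neighbor search.
    """
    rng = random.Random(seed)
    dim = len(data[0])

    # Step 1: sample n real points from D without replacement. Indices are
    # kept so that a point is never matched with itself as its own neighbor.
    sample_idx = rng.sample(range(len(data)), n_samples)

    # Step 2: draw n artificial points uniformly inside the bounding box of D
    # (assumption: "same range as D" means the per-dimension min and max).
    lows = [min(p[k] for p in data) for k in range(dim)]
    highs = [max(p[k] for p in data) for k in range(dim)]
    uniform_pts = [
        [rng.uniform(lows[k], highs[k]) for k in range(dim)]
        for _ in range(n_samples)
    ]

    # Step 3: w_i = distance from each sampled real point to its nearest
    # other point in D.
    w = [
        min(math.dist(data[i], data[j]) for j in range(len(data)) if j != i)
        for i in sample_idx
    ]

    # Step 4: u_i = distance from each artificial point to its nearest
    # point in D.
    u = [min(math.dist(q, p) for p in data) for q in uniform_pts]

    # Step 5: H = sum(u^d) / (sum(u^d) + sum(w^d)).
    sum_u = sum(x ** dim for x in u)
    sum_w = sum(x ** dim for x in w)
    return sum_u / (sum_u + sum_w)


# Synthetic data for a quick comparison: two tight blobs vs. uniform noise.
data_rng = random.Random(0)
clustered = [
    [data_rng.gauss(cx, 0.05), data_rng.gauss(cy, 0.05)]
    for cx, cy in [(0, 0), (3, 3)]
    for _ in range(100)
]
uniform = [[data_rng.random(), data_rng.random()] for _ in range(200)]

print("clustered:", hopkins_statistic(clustered, 20, seed=1))
print("uniform:  ", hopkins_statistic(uniform, 20, seed=1))
```

For larger datasets, a spatial index such as a KD-tree would likely be a better choice than the brute-force search used here.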


Interpretation of Hopkins Statistic

If $D$ were uniformly distributed, then the summed distances for the real points ($\sum_{i=1}^n w_i^d$) and for the artificial ones ($\sum_{i=1}^n u_i^d$) would be close to each other, so $H$ would be about 0.5. However, if clusters are present in $D$, the real points sit close to their neighbors while the artificial points often fall in the empty space between clusters. In that case $\sum_{i=1}^n u_i^d$ would be substantially larger than $\sum_{i=1}^n w_i^d$ in expectation, and $H$ increases toward 1.

  • If $H$ is close to 0.5, the data is uniformly distributed, indicating no clustering tendency. This is defined as the null hypothesis.
  • If $H$ is close to 0, the data points are regularly spaced.
  • If $H$ is close to 1, the data has a strong clustering tendency. This is defined as the alternative hypothesis.
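As a worked illustration with made-up distances (not taken from any real dataset), assume $d=2$ and $n=3$. If the artificial points lie far from the data, say $u=(3, 4, 5)$, while the sampled real points have close neighbors, $w=(1, 1, 1)$, then:

\[H={3^2+4^2+5^2 \over (3^2+4^2+5^2) + (1^2+1^2+1^2)}={50 \over 53}\approx 0.94\]

If instead both sets of distances are of similar size, for example $u=(1, 2, 2)$ and $w=(2, 1, 2)$, then:

\[H={1^2+2^2+2^2 \over (1^2+2^2+2^2) + (2^2+1^2+2^2)}={9 \over 18}=0.5\]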


In practice, if the value for $H$ is significantly greater than 0.5, it suggests that the dataset is clusterable. A value for $H$ higher than 0.75 indicates a clustering tendency at the 90% confidence level. Conversely, if it is much lower than 0.5, it indicates a regular spacing between data points, which is not typical for clustering applications.

Because $H$ depends on the random samples $R$ and $U$, a single value can be misleading. To determine the dataset’s clustering tendency, you can repeat the Hopkins statistic test several times using 0.5 as the threshold. If the $H$ value stays at or below about 0.5, there is no evidence of a clustering tendency. Conversely, if it is consistently well above 0.5, the null hypothesis is rejected, indicating a clustering tendency.
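One way to run the test repeatedly is sketched below. The helper names `hopkins` and `repeated_hopkins`, the number of runs, and the summary fields are all assumptions chosen for this example. The compact `hopkins` function assumes nearest neighbors are searched in the full dataset and that uniform points are drawn from the bounding box of the data.

```python
import math
import random
import statistics


def hopkins(data, n, rng):
    """Compact Hopkins statistic (hypothetical helper, brute-force search)."""
    d = len(data[0])
    # Per-dimension (min, max) bounds of the data.
    bounds = [(min(p[k] for p in data), max(p[k] for p in data)) for k in range(d)]
    # w: distance from each sampled real point to its nearest other data point.
    idx = rng.sample(range(len(data)), n)
    w = [
        min(math.dist(data[i], data[j]) for j in range(len(data)) if j != i)
        for i in idx
    ]
    # u: distance from each uniform artificial point to its nearest data point.
    qs = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n)]
    u = [min(math.dist(q, p) for p in data) for q in qs]
    su, sw = sum(x ** d for x in u), sum(x ** d for x in w)
    return su / (su + sw)


def repeated_hopkins(data, n, runs=30, seed=0):
    """Repeat the test with fresh random samples and summarize the H values."""
    rng = random.Random(seed)
    values = [hopkins(data, n, rng) for _ in range(runs)]
    return {
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
        # Fraction of runs in which H exceeded the 0.5 threshold.
        "share_above_0.5": sum(v > 0.5 for v in values) / runs,
    }


# Synthetic example: three tight blobs, so H should stay well above 0.5.
data_rng = random.Random(42)
blobs = [
    [data_rng.gauss(cx, 0.1), data_rng.gauss(cy, 0.1)]
    for cx, cy in [(0, 0), (5, 0), (0, 5)]
    for _ in range(50)
]
summary = repeated_hopkins(blobs, n=15)
print(summary)
```

Looking at the spread of values (`min`, `max`) alongside the mean gives a rough sense of how stable the conclusion is across random samples.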


References

  • https://www.datanovia.com/en/lessons/assessing-clustering-tendency/