What is curse of dimensionality? How does it affect distance and similarity measures?

- Refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces.
- Common theme: as the number of dimensions increases, the volume of the space grows so fast that the available data becomes sparse.
- Issue for any method that requires statistical significance: the amount of data needed to support the result grows exponentially with the dimensionality.
- Issue for algorithms that do not scale well to high dimensions, typically those whose cost grows exponentially with the number of dimensions, e.g. on the order of $O(nk^n)$.
- Distances concentrate: every point ends up far from every other point, and the nearest and farthest neighbors become almost equidistant, so distance and similarity measures lose their discriminative power and the data becomes difficult to organize (see the sketch below).
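The distance-concentration effect is easy to check empirically. The short Python sketch below is only an illustration assumed here (the sample size, uniform hypercube data, and query point are arbitrary choices, not part of the original answer); it compares the nearest and farthest Euclidean distances from a random query point as the dimension grows:

```python
# Minimal sketch: as the dimension d grows, the gap between the nearest and
# farthest neighbor of a query point shrinks relative to the nearest distance,
# so Euclidean distance loses contrast (distance concentration).
import numpy as np

rng = np.random.default_rng(0)

for d in [2, 10, 100, 1000]:
    points = rng.uniform(size=(1000, d))  # 1000 random points in the unit hypercube
    query = rng.uniform(size=d)           # one random query point
    dists = np.linalg.norm(points - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:5d}  relative contrast={contrast:.3f}")
# The printed relative contrast drops toward 0 as d increases.
```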

Illustrative example: compare the proportion of an inscribed hypersphere with radius $r$ and dimension $d$ to that of a hypercube with edges of length $2r$.
- Volume of such a sphere: $V_{\text{sphere}} = \dfrac{2 r^d \pi^{d/2}}{d\,\Gamma(d/2)}$
- Volume of the cube: $V_{\text{cube}} = (2r)^d$
As the dimension $d$ increases, the volume of the hypersphere becomes insignificant relative to the volume of the hypercube:

$$\lim_{d \to \infty} \frac{V_{\text{sphere}}}{V_{\text{cube}}} = \frac{\pi^{d/2}}{d\,2^{d-1}\,\Gamma(d/2)} = 0$$

- Nearly all of the high-dimensional space is far away from the center.
- It consists almost entirely of the "corners" of the hypercube, with no middle (a quick numerical check of the ratio appears below)!
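To see how fast the ratio collapses, the following Python sketch (the helper name sphere_to_cube_ratio is assumed here for illustration, not from the original answer) evaluates $\pi^{d/2} / \big(d\,2^{d-1}\,\Gamma(d/2)\big)$ for a few values of $d$:

```python
# Minimal sketch: ratio of the inscribed hypersphere's volume to the hypercube's,
# V_sphere / V_cube = pi^(d/2) / (d * 2^(d-1) * Gamma(d/2)), for growing d.
from math import pi, gamma

def sphere_to_cube_ratio(d: int) -> float:
    return pi ** (d / 2) / (d * 2 ** (d - 1) * gamma(d / 2))

for d in [2, 3, 5, 10, 20, 50]:
    print(f"d={d:3d}  V_sphere/V_cube={sphere_to_cube_ratio(d):.2e}")
# d=2 gives pi/4 ~ 0.785 (circle in a square); by d=20 the ratio is already ~2.5e-8.
```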
