What is curse of dimensionality? How does it affect distance and similarity measures?
What is curse of dimensionality? How does it affect distance and similarity measures?
-Refers to various phenomena that arise when analyzing and organizing data in high dimensional spaces
-Common theme: when number of dimensions increases, the volume of the space increases so fast that the available data becomes sparse
-Issue with any method that requires statistical significance: the amount of data needed to support the result grows exponentially with the dimensionality
-Issue when algorithms don't scale well on high dimensions typically when O(nkn)O(nkn)
-Everything becomes far and difficult to organize
Illustrative example: compare the proportion of an inscribed hypersphere with radius rr and dimension d to that of a hypercube with edges of length 2r2r
- Volume of such a sphere is Vsphere=2rdπd/2dΓ(d/2)Vsphere=2rdπd/2dΓ(d/2)
- The volume of the cube is: Vcube=2rdVcube=2rd
As d increases (space dimension), the volume of hypersphere becomes insignificant relative to the volume of the hypercube:
limd→∞VsphereVcube=πd/2d2d−1Γ(d/2)=0
limd→∞VsphereVcube=πd/2d2d−1Γ(d/2)=0
- Nearly all of the dimensional space is far away from the center
- It consists almost entirely of the corners of the hypercube, with no middle!