Local Outlier Factor

pjaipurk · August 9, 2023, 3:19pm

Local outlier factor (LOF) is an algorithm that identifies outliers in a dataset. But what does "local outlier" mean?

A point is a local outlier when it is judged to be an outlier relative to its local neighborhood rather than the dataset as a whole. LOF identifies such points by comparing the density around a point with the density around its neighbors. It performs well when the density of the data is not the same throughout the dataset.

To understand LOF, we have to learn a few concepts sequentially:

K-distance and K-neighbors
Reachability distance (RD)
Local reachability density (LRD)
Local Outlier Factor (LOF)

K-distance is the distance between a point and its Kᵗʰ nearest neighbor. The K-neighbors of A, denoted by Nₖ(A), are the set of points that lie in or on the circle of radius K-distance centered at A. The number of K-neighbors can be greater than or equal to K. How is this possible?

K-distance of A with K=2

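If K=2, the K-neighbors of A will be C, B, and D. Here K=2, but ||N₂(A)|| = 3. This happens when more than one point is tied at the K-distance: every tied point lies on the circle, so all of them count as neighbors. Therefore, ||Nₖ(point)|| will always be greater than or equal to K.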
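To make the tie case concrete, here is a minimal sketch using only the Python standard library. The coordinates are hypothetical (they are not taken from the figure), chosen so that two points sit at exactly the same distance from A, and the helper name `k_distance_and_neighbors` is made up for this illustration.

```python
import math

def k_distance_and_neighbors(points, a, k):
    """Return (K-distance of `a`, sorted names of its K-neighbors)."""
    # Euclidean distance from `a` to every other point
    dists = {name: math.dist(points[a], p) for name, p in points.items() if name != a}
    k_dist = sorted(dists.values())[k - 1]  # distance to the K-th nearest neighbor
    # Everything in or on the circle of radius K-distance counts as a neighbor
    neighbors = sorted(n for n, d in dists.items() if d <= k_dist)
    return k_dist, neighbors

# Hypothetical coordinates: B is 1 away from A; C and D are both exactly 2 away.
points = {"A": (0, 0), "B": (1, 0), "C": (0, 2), "D": (-2, 0), "E": (5, 5)}
k_dist, neighbors = k_distance_and_neighbors(points, "A", k=2)
print(k_dist, neighbors)  # 2.0 ['B', 'C', 'D']
```

With K=2 the second-nearest distance is 2, and both C and D sit at that distance, so three neighbors come back instead of two.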

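3. REACHABILITY DISTANCE (RD)

The reachability distance of Xi from Xj is defined as the maximum of the K-distance of Xj and the distance between Xi and Xj, i.e. RD(Xi, Xj) = max(K-distance(Xj), distance(Xi, Xj)). The distance measure is problem-specific (Euclidean, Manhattan, etc.).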

Illustration of reachability distance with K=2

In layman’s terms, if a point Xi lies within the K-neighbors of Xj, the reachability distance is the K-distance of Xj (blue line); otherwise, it is the distance between Xi and Xj (orange line).
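Both cases can be checked with a short sketch. It assumes Euclidean distance and the same hypothetical coordinates as before (not the ones in the illustration); the function names are made up for this example.

```python
import math

def k_distance(points, j, k):
    """Distance from point j to its K-th nearest neighbor."""
    return sorted(math.dist(points[j], p) for n, p in points.items() if n != j)[k - 1]

def reachability_distance(points, i, j, k):
    """RD(Xi, Xj) = max(K-distance(Xj), distance(Xi, Xj))."""
    return max(k_distance(points, j, k), math.dist(points[i], points[j]))

# Hypothetical coordinates; the K-distance of A with K=2 is 2.
points = {"A": (0, 0), "B": (1, 0), "C": (0, 2), "D": (-2, 0), "E": (5, 5)}
rd_near = reachability_distance(points, "B", "A", k=2)  # B is inside A's neighborhood
rd_far = reachability_distance(points, "E", "A", k=2)   # E is outside it
print(rd_near, rd_far)
```

For B the result is the K-distance of A (2.0), even though B is only 1 away; for E it is the actual distance between E and A.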

4. LOCAL REACHABILITY DENSITY (LRD)

LRD is the inverse of the average reachability distance of A from its neighbors: LRDₖ(A) = 1 / (Σ RD(A, Xj) / ||Nₖ(A)||), where the sum runs over all Xj in Nₖ(A). Intuitively, the larger the average reachability distance (i.e., the farther the neighbors are from the point), the lower the density of points around that point. LRD therefore tells how far a point is from the nearest cluster of points. A low value of LRD implies that the closest cluster is far from the point.
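As a rough sketch of this definition, the snippet below computes the LRD of every point in a small hypothetical dataset: four points on a unit square plus one point far away. It assumes Euclidean distance and no duplicate points (duplicates would make the average reachability distance zero); the helper names are illustrative only.

```python
import math

def k_neighbors(points, a, k):
    """Return (K-distance of point a, indices of its K-neighbors, ties included)."""
    d = [math.inf if i == a else math.dist(points[a], p) for i, p in enumerate(points)]
    k_dist = sorted(d)[k - 1]
    return k_dist, [i for i, di in enumerate(d) if di <= k_dist]

def lrd(points, a, k):
    """Inverse of the average reachability distance of a from its K-neighbors."""
    _, nbrs = k_neighbors(points, a, k)
    # RD(a, n) = max(K-distance(n), distance(a, n)) for each neighbor n
    rds = [max(k_neighbors(points, n, k)[0], math.dist(points[a], points[n])) for n in nbrs]
    return len(rds) / sum(rds)

# Hypothetical data: a tight square and one isolated point.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
densities = [lrd(points, a, k=2) for a in range(len(points))]
print(densities)
```

The four square points get an LRD of 1.0, while the isolated point gets a much smaller value because its neighbors are all far away.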

5. LOCAL OUTLIER FACTOR (LOF)

The LRD of each point is compared with the average LRD of its K-neighbors. LOF is the ratio of the average LRD of the K-neighbors of A to the LRD of A: LOFₖ(A) = (Σ LRDₖ(Xj) / ||Nₖ(A)||) / LRDₖ(A), where the sum runs over all Xj in Nₖ(A).

Intuitively, if the point is not an outlier (i.e., it is an inlier), the average LRD of its neighbors is approximately equal to the LRD of the point, because the density around the point and the density around its neighbors are roughly equal. In that case, LOF is nearly equal to 1. On the other hand, if the point is an outlier, the LRD of the point is less than the average LRD of its neighbors, and the LOF value will be high.
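Putting the pieces together, here is a minimal from-scratch sketch of the LOF score, again on a hypothetical dataset of a unit square plus one distant point. It assumes Euclidean distance and no duplicate points, and it recomputes everything naively, so it is only meant for reading, not for large datasets.

```python
import math

def k_neighbors(points, a, k):
    """Return (K-distance of point a, indices of its K-neighbors, ties included)."""
    d = [math.inf if i == a else math.dist(points[a], p) for i, p in enumerate(points)]
    k_dist = sorted(d)[k - 1]
    return k_dist, [i for i, di in enumerate(d) if di <= k_dist]

def lrd(points, a, k):
    """Inverse of the average reachability distance of a from its K-neighbors."""
    _, nbrs = k_neighbors(points, a, k)
    rds = [max(k_neighbors(points, n, k)[0], math.dist(points[a], points[n])) for n in nbrs]
    return len(rds) / sum(rds)

def lof(points, a, k):
    """Average LRD of a's K-neighbors divided by the LRD of a."""
    _, nbrs = k_neighbors(points, a, k)
    avg_neighbor_lrd = sum(lrd(points, n, k) for n in nbrs) / len(nbrs)
    return avg_neighbor_lrd / lrd(points, a, k)

# Hypothetical data: a tight square and one isolated point.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
scores = [lof(points, a, k=2) for a in range(len(points))]
print(scores)  # square points score 1.0; the isolated point scores far above 1
```

The inliers come out at 1 because they are exactly as dense as their neighbors, while the isolated point is much less dense than the square it is compared against.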

Generally, a point with LOF > 1 is considered an outlier, but that is not always true. Let’s say we know that there is only one outlier in the data: then we take the maximum of all the LOF values, and the point corresponding to that maximum is considered the outlier.
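The difference between the two rules is easy to see on a handful of scores. The LOF values below are made up for illustration, and `top_outliers` is a hypothetical helper, not part of any library.

```python
def top_outliers(lof_scores, n=1):
    """Indices of the n largest LOF scores, highest first."""
    ranked = sorted(range(len(lof_scores)), key=lambda i: lof_scores[i], reverse=True)
    return ranked[:n]

# Hypothetical LOF scores for five points
scores = [1.02, 0.98, 1.10, 4.70, 1.05]

above_one = [i for i, s in enumerate(scores) if s > 1]
print(above_one)                  # [0, 2, 3, 4]
print(top_outliers(scores, n=1))  # [3]
```

The plain LOF > 1 rule flags four of the five points, most of which are barely above 1, whereas taking the maximum singles out the one point whose score clearly stands apart.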