Data Skew

Data skew primarily refers to a nonuniform distribution of values in a dataset. A skewed dataset can follow any of several common distributions (e.g., Zipfian, Gaussian, Poisson), but many studies use the Zipfian distribution [1] to model skewed datasets. Using a real bibliographic database, [2] provides real-world parameters for the Zipf distribution model. The direct impact of data skew on the parallel execution of complex database queries is poor load balancing, which leads to high response times.
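
As a rough illustration of how a Zipfian distribution translates into poor load balancing, the following Python sketch computes the expected number of tuples each node receives when attribute values are spread across nodes. The function names, the modulo placement rule, and the parameter values (1,000 distinct values, 8 nodes, 1,000,000 tuples) are assumptions chosen for illustration; they are not taken from [1] or [2].

```python
def zipf_frequencies(num_values, s=1.0):
    """Relative frequency of the k-th most common value under a Zipf law.

    The value of rank k gets weight 1 / k**s; s = 0 gives a uniform
    distribution, and larger s gives heavier skew.
    """
    weights = [1.0 / (rank ** s) for rank in range(1, num_values + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def node_loads(frequencies, num_nodes, num_tuples):
    """Expected tuples per node when value k is placed on node k mod num_nodes.

    The modulo rule is an illustrative stand-in for hash partitioning.
    """
    loads = [0.0] * num_nodes
    for value_id, freq in enumerate(frequencies):
        loads[value_id % num_nodes] += freq * num_tuples
    return loads


def imbalance(loads):
    """Busiest node's load divided by the mean load (1.0 = perfect balance)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean


if __name__ == "__main__":
    # Hypothetical workload: 1,000 distinct values, 8 nodes, 1,000,000 tuples.
    for s in (0.0, 0.5, 1.0):
        freqs = zipf_frequencies(num_values=1000, s=s)
        loads = node_loads(freqs, num_nodes=8, num_tuples=1_000_000)
        print(f"s={s}: imbalance={imbalance(loads):.2f}")
```

In this sketch the response time of a parallel operator is bounded by its busiest node, so an imbalance ratio above 1.0 indicates how much longer the skewed execution is expected to take than a perfectly balanced one.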

Key Points

Walton et al. [3] classify the effects of skewed data distribution on parallel execution, distinguishing intrinsic skew from partition skew. Intrinsic skew is inherent in the dataset (e.g., there are more citizens in Paris than in Waterloo) and is thus called attribute value skew (AVS). Partition skew occurs in parallel implementations when the workload is not evenly distributed between nodes, even when input data is uniformly...

