We consider message-efficient continuous random sampling from a distributed stream, where the probability of inclusion of an item in the sample is proportional to a weight associated with the item. The unweighted version, where all weights are equal, is well studied, and admits tight upper and lower bounds on message complexity. For weighted sampling with replacement, there is a simple reduction to unweighted sampling with replacement. However, in many applications the stream may have only a few heavy items, which may dominate a random sample when chosen with replacement. Weighted sampling without replacement (weighted SWOR) avoids this issue, since such heavy items can be sampled at most once. In this work, we present the first message-optimal algorithm for weighted SWOR from a distributed stream. Our algorithm also has optimal space and time complexity. As an application of our algorithm for weighted SWOR, we derive the first distributed streaming algorithms for tracking heavy hitters with residual error. Here the goal is to identify stream items that contribute significantly to the residual stream, once the heaviest items are removed. Residual heavy hitters generalize the notion of $\ell_1$ heavy hitters and are important in streams that have a skewed distribution of weights. In addition to the upper bound, we also provide a lower bound on the message complexity that is tight up to a $\log(1/\epsilon)$ factor. Finally, we use our weighted sampling algorithm to improve the message complexity of distributed $L_1$ tracking, also known as count tracking, which is a widely studied problem in distributed streaming. We also derive a tight message lower bound, which settles the message complexity of this fundamental problem.
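To make the notion of weighted SWOR concrete, here is a minimal single-machine sketch in Python. It uses the classical key-based reservoir technique (each item draws a key $u^{1/w}$ and the $k$ largest keys are kept), which is a swapped-in, centralized method and not the distributed, message-optimal algorithm described in the abstract. The function name `weighted_swor` and its parameters are hypothetical.

```python
import heapq
import random


def weighted_swor(stream, k, rng=None):
    """Sample up to k distinct items, favouring heavier weights.

    `stream` yields (item, weight) pairs with weight > 0.
    Centralized illustration only; names here are hypothetical.
    """
    rng = rng or random.Random()
    heap = []  # min-heap of (key, seq, item) holding the k largest keys so far
    for seq, (item, weight) in enumerate(stream):
        if weight <= 0:
            raise ValueError("weights must be positive")
        u = 1.0 - rng.random()        # uniform in (0, 1]
        key = u ** (1.0 / weight)     # heavier items tend to draw larger keys
        entry = (key, seq, item)      # seq breaks ties without comparing items
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            heapq.heapreplace(heap, entry)  # evict the current smallest key
    return [item for _, _, item in heap]
```

Because every stream element receives exactly one key, a heavy item can occupy at most one slot of the sample, which is the property the abstract contrasts with sampling with replacement.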
Network-wide traffic analytics are often needed for various network monitoring tasks. These measurements are often performed by collecting samples at network switches, which are then sent to the controller for aggregation. However, performing such analytics without "overcounting" flows or packets that traverse multiple measurement switches is challenging. Therefore, existing solutions often simplify the problem by making assumptions on the routing or measurement switch placement. We introduce AROMA, a measurement infrastructure that generates a uniform sample of packets and flows regardless of the topology, workload and routing. Therefore, AROMA can be deployed in many settings, and can also work in the data plane using programmable PISA switches. The AROMA infrastructure includes controller algorithms that approximate a variety of essential measurement tasks while providing formal accuracy guarantees. Using extensive simulations on real-world network traces, we show that our algorithms are competitively accurate compared to the best existing solutions despite the fact that they make no assumptions on the underlying network or the placement of measurement switches.
Many network management applications take as input traffic volumes broken down by attributes such as IP address or port number. IP flow records are commonly collected for this purpose: these enable determination of fine-grained usage of network resources. However, the increasingly large volumes of flow statistics incur concomitant costs in the resources of the measurement infrastructure. This motivates sampling of flow records. This paper addresses sampling strategy for flow records. Recent work has shown that non-uniform sampling is necessary in order to control estimation variance arising from the observed heavy-tailed distribution of flow lengths. However, while this approach controls estimator variance, it does not place hard limits on the number of flows sampled. Such limits are often required during arbitrary downstream sampling, resampling and aggregation operations employed in analysis of the data. This paper proposes a correlated sampling strategy that is able to select an arbitrarily small number of the "best" representatives of a set of flows. We show that usage estimates arising from such selection are unbiased, and show how to estimate their variance, both offline for modeling purposes, and online during the sampling itself. The selection algorithm can be implemented in a queue-like data structure in which memory usage is uniformly bounded during measurement. Finally, we compare the complexity and performance of our scheme with other potential approaches.
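As an illustration of sampling flow records under a hard limit on sample size, the sketch below implements priority sampling, a known scheme that keeps a fixed number of records in a bounded priority queue and yields unbiased volume estimates. Treating it as the strategy this abstract refers to is an assumption, and the function name `priority_sample`, the `(flow_id, size)` record shape, and the requirement that flow IDs be distinct are all hypothetical choices made for the example.

```python
import heapq
import random


def priority_sample(flows, k, rng=None):
    """Keep at most k flows and return {flow_id: estimated_size}.

    `flows` yields (flow_id, size) pairs with size > 0 and distinct IDs.
    Illustrative sketch; names and record shape are assumptions.
    """
    rng = rng or random.Random()
    heap = []  # min-heap of (priority, seq, flow_id, size), at most k + 1 entries
    for seq, (flow_id, size) in enumerate(flows):
        u = 1.0 - rng.random()              # uniform in (0, 1]
        priority = size / u                 # large flows tend to get high priority
        heapq.heappush(heap, (priority, seq, flow_id, size))
        if len(heap) > k + 1:
            heapq.heappop(heap)             # memory stays bounded by k + 1 records
    if len(heap) <= k:
        # Fewer than k + 1 flows were seen: every flow is kept with its exact size.
        return {fid: float(size) for _, _, fid, size in heap}
    threshold = heapq.heappop(heap)[0]      # the (k + 1)-th largest priority
    # Each kept flow is reported as max(size, threshold).
    return {fid: max(float(size), threshold) for _, _, fid, size in heap}
```

Summing the returned estimates over any subset of flow IDs gives an estimate of that subset's total volume, while the sample itself never exceeds `k` records.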