Random sample from a very long iterable, in python -
i have long python generator want "thin out" randomly selecting subset of values. unfortunately, random.sample()
not work arbitrary iterables. apparently, needs supports len()
operation (and perhaps non-sequential access sequence, that's not clear). , don't want build enormous list can thin out.
as matter of fact, possible sample sequence uniformly in 1 pass, without knowing length-- there's nice algorithm in programming perl
(edit: "reservoir sampling", @user2357112!). know of standard python module provides functionality?
demo of problem (python 3)
>>> import itertools, random >>> random.sample(iter("abcd"), 2) ... typeerror: population must sequence or set. dicts, use list(d).
on python 2, error more transparent:
traceback (most recent call last): file "<pyshell#12>", line 1, in <module> random.sample(iter("abcd"), 2) file "/usr/local/cellar/python/2.7.9/frameworks/python.framework/versions/2.7/lib/python2.7/random.py", line 321, in sample n = len(population) typeerror: object of type 'iterator' has no len()
if there's no alternative random.sample()
, i'd try luck wrapping generator object provides __len__
method (i can find out length in advance). i'll accept answer shows how cleanly.
since know length data returned iterable, can use xrange()
generate indices iterable. can run iterable until you've grabbed of data:
import random def sample(it, length, k): indices = random.sample(xrange(length), k) result = [none]*k index, datum in enumerate(it): if index in indices: result[indices.index(index)] = datum return result print sample(iter("abcd"), 4, 2)
in alternative, here implementation of resevior sampleing using "algorithm r":
import random def r(it, k): '''https://en.wikipedia.org/wiki/reservoir_sampling#algorithm_r''' = iter(it) result = [] i, datum in enumerate(it): if < k: result.append(datum) else: j = random.randint(0, i-1) if j < k: result[j] = datum return result print r(iter("abcd"), 2)
note algorithm r doesn't provide random order results. in example given, 'b'
never precede 'a'
in results.
Comments
Post a Comment