Random sample from a very long iterable, in python -


i have long python generator want "thin out" randomly selecting subset of values. unfortunately, random.sample() not work arbitrary iterables. apparently, needs supports len() operation (and perhaps non-sequential access sequence, that's not clear). , don't want build enormous list can thin out.

as matter of fact, possible sample sequence uniformly in 1 pass, without knowing length-- there's nice algorithm in programming perl (edit: "reservoir sampling", @user2357112!). know of standard python module provides functionality?

demo of problem (python 3)

>>> import itertools, random >>> random.sample(iter("abcd"), 2) ... typeerror: population must sequence or set.  dicts, use list(d). 

on python 2, error more transparent:

traceback (most recent call last):   file "<pyshell#12>", line 1, in <module>     random.sample(iter("abcd"), 2)   file "/usr/local/cellar/python/2.7.9/frameworks/python.framework/versions/2.7/lib/python2.7/random.py", line 321, in sample     n = len(population) typeerror: object of type 'iterator' has no len() 

if there's no alternative random.sample(), i'd try luck wrapping generator object provides __len__ method (i can find out length in advance). i'll accept answer shows how cleanly.

since know length data returned iterable, can use xrange() generate indices iterable. can run iterable until you've grabbed of data:

import random  def sample(it, length, k):     indices = random.sample(xrange(length), k)     result = [none]*k     index, datum in enumerate(it):         if index in indices:             result[indices.index(index)] = datum     return result  print sample(iter("abcd"), 4, 2) 

in alternative, here implementation of resevior sampleing using "algorithm r":

import random  def r(it, k):     '''https://en.wikipedia.org/wiki/reservoir_sampling#algorithm_r'''     = iter(it)     result = []     i, datum in enumerate(it):         if < k:             result.append(datum)         else:             j = random.randint(0, i-1)             if j < k:                 result[j] = datum     return result  print r(iter("abcd"), 2) 

note algorithm r doesn't provide random order results. in example given, 'b' never precede 'a' in results.


Comments

Popular posts from this blog

java - Suppress Jboss version details from HTTP error response -

gridview - Yii2 DataPorivider $totalSum for a column -

Sass watch command compiles .scss files before full sftp upload -