Python - concurrent.futures not parallelizing writes
I have a list dataframe_chunk which contains chunks of a large pandas dataframe. I want to write every single chunk to a different CSV file, in parallel. However, I see the files being written sequentially and I'm not sure why this is the case. Here's the code:
import concurrent.futures as cfu
import time

def write_chunk_to_file(chunk, fpath):
    chunk.to_csv(fpath, sep=',', header=False, index=False)

pool = cfu.ThreadPoolExecutor(n_cores)

futures = []
for i in range(n_cores):
    fpath = '/path_to_files_' + str(i) + '.csv'
    futures.append(pool.submit(write_chunk_to_file(dataframe_chunk[i], fpath)))

for f in cfu.as_completed(futures):
    print("finished at", time.time())

Any clues?
One thing that is stated in the Python 2.7.x threading docs, but not in the 3.x docs, is that Python cannot achieve true parallelism using the threading library: because of the Global Interpreter Lock, only one thread will execute at a time.
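If you want to see this for yourself, here is a small self-contained timing sketch (my own illustration, not code from the question; the workload and numbers are arbitrary) that runs a CPU-bound function first sequentially and then on four threads. On CPython the two wall times come out roughly the same:

import time
from concurrent.futures import ThreadPoolExecutor

def burn(n):
    # Pure-Python CPU work; the GIL lets only one thread run this at a time.
    total = 0
    for i in range(n):
        total += i * i
    return total

start = time.time()
for _ in range(4):
    burn(2000000)  # arbitrary workload size, big enough to measure
print("sequential:", time.time() - start)

start = time.time()
with ThreadPoolExecutor(max_workers=4) as ex:
    # Four threads, but the GIL serializes the bytecode execution.
    list(ex.map(burn, [2000000] * 4))
print("4 threads: ", time.time() - start)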
You should try using the concurrent.futures ProcessPoolExecutor, which uses a separate process for each job and can therefore achieve true parallelism on a multi-core CPU.
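A rough sketch of what that could look like is below (the toy DataFrame is just a stand-in for the question's data, and n_cores is assumed as in the question). One detail worth noting: submit() takes the function and its arguments separately. The question's submit(write_chunk_to_file(...)) calls the function in the main thread before anything is submitted, which would force sequential writes even with a process pool.

import time
import concurrent.futures as cfu
import numpy as np
import pandas as pd

n_cores = 4

def write_chunk_to_file(chunk, fpath):
    chunk.to_csv(fpath, sep=',', header=False, index=False)

if __name__ == '__main__':
    # Toy data standing in for the question's dataframe_chunk list.
    df = pd.DataFrame(np.random.rand(100000, 4))
    dataframe_chunk = [df.iloc[i::n_cores] for i in range(n_cores)]

    # The __main__ guard matters: worker processes re-import this module.
    with cfu.ProcessPoolExecutor(n_cores) as pool:
        futures = [
            # Function and arguments passed separately, so the call
            # happens in a worker process, not in the parent.
            pool.submit(write_chunk_to_file, dataframe_chunk[i],
                        './chunk_' + str(i) + '.csv')
            for i in range(n_cores)
        ]
        for f in cfu.as_completed(futures):
            f.result()  # re-raises any exception from the worker
            print("finished at", time.time())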
Update
Here is your program, adapted to use the multiprocessing library instead:
#!/usr/bin/env python3

from multiprocessing import Process

import os
import time

n_cores = 8

def write_chunk_to_file(chunk, fpath):
    with open(fpath, "w") as f:
        for x in range(10000000):
            f.write(str(x))

futures = []

print("my pid:", os.getpid())
input("hit return to start:")

start = time.time()
print("started at:", start)

os.makedirs('./tmp', exist_ok=True)  # make sure the output directory exists

for i in range(n_cores):
    fpath = './tmp/file-' + str(i) + '.csv'
    p = Process(target=write_chunk_to_file, args=(i, fpath))
    futures.append(p)

for p in futures:
    p.start()

print("all jobs started.")

for p in futures:
    p.join()

print("all jobs finished at", time.time())

You can monitor the jobs with this shell command in another window:
while true; do clear; pstree 12345; ls -l tmp; sleep 1; done

(Replace 12345 with the pid emitted by the script.)