python - concurrent.futures not parallelizing write


I have a list dataframe_chunk which contains chunks of a large pandas dataframe. I want to write every single chunk to a different csv file, and to do so in parallel. However, I see the files being written sequentially and I'm not sure why that is the case. Here's the code:

import concurrent.futures as cfu
import time

def write_chunk_to_file(chunk, fpath):
    chunk.to_csv(fpath, sep=',', header=False, index=False)

pool = cfu.ThreadPoolExecutor(n_cores)

futures = []
for i in range(n_cores):
    fpath = '/path_to_files_' + str(i) + '.csv'
    futures.append(pool.submit(write_chunk_to_file(dataframe_chunk[i], fpath)))

for f in cfu.as_completed(futures):
    print("finished at ", time.time())

Any clues?

One thing that is stated in the Python 2.7.x threading docs, but not in the 3.x docs, is that Python cannot achieve true parallelism using the threading library: only one thread will execute Python code at a time, because of the Global Interpreter Lock (GIL).
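
You can see the effect with a quick experiment: running a CPU-bound function on a ThreadPoolExecutor takes roughly as long as running it sequentially. A minimal sketch follows; busy_work here is a made-up stand-in for any CPU-heavy job, such as serializing a dataframe:

import concurrent.futures as cfu
import time

def busy_work(n):
    # CPU-bound loop: it holds the GIL for essentially its whole runtime
    total = 0
    for x in range(n):
        total += x * x
    return total

start = time.time()
for _ in range(4):
    busy_work(5000000)
print("sequential:", time.time() - start)

start = time.time()
with cfu.ThreadPoolExecutor(4) as pool:
    # Four threads, but only one can execute Python bytecode at a time,
    # so the total wall-clock time stays close to the sequential run
    futures = [pool.submit(busy_work, 5000000) for _ in range(4)]
    for f in cfu.as_completed(futures):
        f.result()
print("threaded:", time.time() - start)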

You should try using concurrent.futures with the ProcessPoolExecutor, which uses a separate process for each job and can therefore achieve true parallelism on a multi-core CPU.
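
For example, here is a minimal sketch of your loop adapted to ProcessPoolExecutor, assuming dataframe_chunk and n_cores are defined as in your question. Note that submit() takes the callable and its arguments separately, as pool.submit(fn, *args), rather than the result of calling it:

import concurrent.futures as cfu
import time

def write_chunk_to_file(chunk, fpath):
    chunk.to_csv(fpath, sep=',', header=False, index=False)

if __name__ == '__main__':
    # The __main__ guard matters here: worker processes re-import this module
    with cfu.ProcessPoolExecutor(n_cores) as pool:
        # Pass the function and its arguments to submit() separately so the
        # call runs inside the worker processes
        futures = [pool.submit(write_chunk_to_file, dataframe_chunk[i],
                               '/path_to_files_' + str(i) + '.csv')
                   for i in range(n_cores)]
        for f in cfu.as_completed(futures):
            print("finished at", time.time())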

Update

Here is your program adapted to use the multiprocessing library instead:

#!/usr/bin/env python3

from multiprocessing import Process

import os
import time

n_cores = 8

def write_chunk_to_file(chunk, fpath):
    # chunk is unused in this test version; the loop below is a dummy
    # workload standing in for chunk.to_csv(), so the script is
    # self-contained and each job takes a measurable amount of time
    with open(fpath, "w") as f:
        for x in range(10000000):
            f.write(str(x))

futures = []

print("my pid:", os.getpid())
input("hit return to start:")

start = time.time()
print("started at:", start)

# Assumes the ./tmp directory already exists
for i in range(n_cores):
    fpath = './tmp/file-' + str(i) + '.csv'
    p = Process(target=write_chunk_to_file, args=(i, fpath))
    futures.append(p)

for p in futures:
    p.start()

print("all jobs started.")

for p in futures:
    p.join()

print("all jobs finished at ", time.time())

You can monitor the jobs with a shell command in another window:

while true; do clear; pstree 12345; ls -l tmp; sleep 1; done

(Replace 12345 with the pid emitted by the script.)

