python - concurrent.futures not parallelizing write
I have a list dataframe_chunk which contains chunks of a large pandas dataframe. I want to write every single chunk to a different CSV, and to do so in parallel. However, I see the files being written sequentially and I'm not sure why this is the case. Here's the code:
import concurrent.futures as cfu
import time

def write_chunk_to_file(chunk, fpath):
    chunk.to_csv(fpath, sep=',', header=False, index=False)

pool = cfu.ThreadPoolExecutor(n_cores)
futures = []

for i in range(n_cores):
    fpath = '/path_to_files_' + str(i) + '.csv'
    futures.append(pool.submit(
        write_chunk_to_file(dataframe_chunk[i], fpath)))

for f in cfu.as_completed(futures):
    print("finished at ", time.time())
Any clues?
One thing that is stated in the Python 2.7.x threading docs but not in the 3.x docs is that Python cannot achieve true parallelism using the threading library - only one thread will execute at a time. You should try using concurrent.futures' ProcessPoolExecutor, which uses a separate process for each job and can therefore achieve true parallelism on a multi-core CPU.
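For instance, here is a minimal sketch of the same write loop using ProcessPoolExecutor, assuming dataframe_chunk, n_cores, and the file paths from the question. Note that submit() takes the function and its arguments separately, so the call is scheduled in a worker process rather than executed on the spot:

import concurrent.futures as cfu
import time

def write_chunk_to_file(chunk, fpath):
    chunk.to_csv(fpath, sep=',', header=False, index=False)

# Each job runs in a separate worker process, so the writes are
# not serialized by the GIL.
with cfu.ProcessPoolExecutor(max_workers=n_cores) as pool:
    # Pass the function and its arguments to submit() separately;
    # writing write_chunk_to_file(...) here would call it immediately
    # in the main process instead of in a worker.
    futures = [pool.submit(write_chunk_to_file, dataframe_chunk[i],
                           '/path_to_files_' + str(i) + '.csv')
               for i in range(n_cores)]
    for f in cfu.as_completed(futures):
        f.result()  # surface any exception raised in a worker
        print("finished at ", time.time())

Keep in mind that each chunk has to be pickled and sent to its worker process, so for very large dataframes the transfer cost can eat into the speedup.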
Update
Here is the program adapted to use the multiprocessing library instead:
#!/usr/bin/env python3
from multiprocessing import Process
import os
import time

n_cores = 8

def write_chunk_to_file(chunk, fpath):
    with open(fpath, "w") as f:
        for x in range(10000000):
            f.write(str(x))

futures = []

print("my pid:", os.getpid())
input("hit return to start:")

start = time.time()
print("started at:", start)

for i in range(n_cores):
    fpath = './tmp/file-' + str(i) + '.csv'
    p = Process(target=write_chunk_to_file, args=(i, fpath))
    futures.append(p)

for p in futures:
    p.start()

print("all jobs started.")

for p in futures:
    p.join()

print("all jobs finished at ", time.time())
You can monitor the jobs with this shell command in another window:

while true; do clear; pstree 12345; ls -l tmp; sleep 1; done

(Replace 12345 with the PID emitted by the script.)