The multiprocessing Pool class makes it easy to achieve parallel execution. More details can be found in the official Python documentation [1], which is an excellent resource that everyone should at least skim through. The Pool class manages your worker processes and makes it easy to distribute the workload. This post shows you how to use the Pool inside a module, import that module, and invoke it from a "main" function. You can also implement it without classes, as shown in the Python documentation and sketched below.
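If you prefer the class-free route, a minimal sketch (my own illustration, not copied from the docs; the square helper is hypothetical) could look like this:

# a minimal, class-free sketch of Pool usage
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':      # required on Windows; explained later in this post
    with Pool(2) as p:          # two worker processes
        print(p.map(square, [1, 2, 3, 4]))   # prints [1, 4, 9, 16]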
Suppose we have to run a function, say one that computes squares (\(f(x) = x^2\)), for every element of an array of length \(N\). However, \(N\) is large, and executing the function sequentially takes \(T\) time, which is a long time. Thankfully, you have a modern computer with a multi-core processor, say 2 cores, so you can run the function in parallel. If we process two numbers at a time, we essentially cut the time taken to \(T/2\) (give or take).
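To make that estimate concrete, here is a rough, hypothetical timing sketch; the slow_square function just sleeps to simulate half a second of work, and the real speedup depends on process start-up overhead:

# rough timing sketch: time.sleep stands in for real work
import time
from multiprocessing import Pool

def slow_square(x):
    time.sleep(0.5)              # pretend the computation takes half a second
    return x * x

if __name__ == '__main__':
    numbers = [1, 2, 3, 4]

    start = time.perf_counter()
    [slow_square(n) for n in numbers]                  # sequential: roughly 2 s
    print('sequential:', time.perf_counter() - start)

    start = time.perf_counter()
    with Pool(2) as p:                                 # two workers: roughly 1 s plus start-up cost
        p.map(slow_square, numbers)
    print('parallel:  ', time.perf_counter() - start)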
For that, we first write a class called "Workers", which uses the Pool class and implements the function \(f\) as a @staticmethod [2], because we want to access this function without creating an instance. If you do not want to use the @staticmethod decorator, you can move the function definition outside of the class. Second, we use Pool.imap_unordered, since the order of the output does not matter and we can process each result as soon as it is available. If the order is important, we can use Pool.imap instead. In cases where we want to store the return values and handle them later, we can use Pool.map. The Pool class provides many more methods for different kinds of task management, which can be found in the Python docs [1]. Finally, we save the module as "workers.py".
Here's the code:
## saved as workers.py
from multiprocessing import Pool

class Workers:
    def __init__(self, list_of_tasks):
        self.list_of_tasks = list_of_tasks

    @staticmethod
    def square(x):
        ''' returns the square of a number '''
        return x * x

    def start(self, no_of_parallel_workers):
        # distribute the tasks among the workers and print each
        # result as soon as it becomes available
        with Pool(no_of_parallel_workers) as p:
            results = p.imap_unordered(Workers.square, self.list_of_tasks)
            for individual_result in results:
                print(individual_result)
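If you would rather not use @staticmethod, or you want the ordered variants mentioned above, the alternatives look roughly like this (a standalone sketch with a module-level square function and a hypothetical run helper, not part of workers.py):

# sketch: module-level function instead of a @staticmethod
from multiprocessing import Pool

def square(x):
    return x * x

def run(tasks, no_of_parallel_workers):
    with Pool(no_of_parallel_workers) as p:
        for result in p.imap(square, tasks):   # imap: results come back in input order
            print(result)
        all_results = p.map(square, tasks)     # map: blocks and returns the full list at once
        print(all_results)

if __name__ == '__main__':
    run([1, 2, 3, 4], 2)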
Now, we import our module, create an instance of Workers, and call the public method start within a condition. That is, we should not hastily call this method without checking the name of the scope. The scope is stored in the __name__ global variable [3]. We must check that __name__ is equal to '__main__' (that is, we are in the main scope) and only then call workers.start(). This is because, on Windows, the multiprocessing library spawns the specified number of child processes, and each child re-imports the main module. Every child therefore runs through the same lines of code and would eventually reach the workers.start() line and spawn children of its own, unless a conditional statement prevents it. During the scope check, the child process's __name__ is not '__main__', so our code creates new processes only in the main scope. In addition, checking the main scope is a good way to prevent unwanted code execution when the module is imported [4].
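If you want to see this behaviour for yourself, a tiny sketch like the one below (forcing the spawn start method, which is the default on Windows) shows that module-level code runs again in every child, but with a different __name__:

# demonstration: module-level code runs again in each spawned child process
from multiprocessing import Pool, set_start_method

print('importing, __name__ =', __name__)   # '__main__' in the parent, '__mp_main__' in the children

def square(x):
    return x * x

if __name__ == '__main__':
    set_start_method('spawn')              # use the Windows behaviour on any OS
    with Pool(2) as p:
        print(p.map(square, [1, 2, 3, 4]))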
Here is the final code and its output (since we use imap_unordered, the order of the printed results may vary between runs):
from workers import Workers

numbers_to_square = [1, 2, 3, 4]
workers = Workers(numbers_to_square)

if __name__ == '__main__':
    # use two parallel workers to get the squares of the numbers
    workers.start(2)
1
4
9
16