When you run a custom Python MapReduce job through Hadoop Streaming, the job may fail as soon as your script imports a third-party library.
The logs typically show an import error: the worker node's Python interpreter cannot find the external package.
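The failure is easy to reproduce outside Hadoop: a child Python interpreter (standing in for a streaming task) can only import what is on its module search path. A minimal sketch, where the module name `extlib` and its temporary location are placeholders for your third-party package:

```python
import os
import subprocess
import sys
import tempfile

# Stand-in for a third-party package installed outside the default
# sys.path (the module name "extlib" is a placeholder).
pkg_dir = tempfile.mkdtemp()
with open(os.path.join(pkg_dir, "extlib.py"), "w") as f:
    f.write("VALUE = 42\n")

# Launch a child interpreter the way a streaming task is launched.
# pkg_dir is not on PYTHONPATH, so the import fails.
env = {k: v for k, v in os.environ.items() if k != "PYTHONPATH"}
proc = subprocess.run(
    [sys.executable, "-c", "import extlib"],
    env=env, stderr=subprocess.PIPE, text=True,
)
print(proc.returncode)  # non-zero: this is the point where the task dies
print(proc.stderr)      # the traceback names the missing module
```

This is the same traceback you would find in the task's stderr log on the worker node.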
If the Python package is installed under /usr/lib/python2.6/site-packages, add the following parameter to the MapReduce command:
    -cmdenv PYTHONPATH=$PYTHONPATH:/usr/lib/python2.6/site-packages
This passes the PYTHONPATH environment variable into each MapReduce task's environment, so the task's Python interpreter can locate the library and the job runs normally.
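For context, here is a sketch of where the parameter fits in a complete Hadoop Streaming invocation. The streaming jar location, input/output paths, and the mapper/reducer script names are placeholders for your own:

```shell
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -cmdenv PYTHONPATH=$PYTHONPATH:/usr/lib/python2.6/site-packages \
    -input /user/me/input \
    -output /user/me/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py
```

Note that -cmdenv only sets the variable in the task environment; the library itself must still be installed at that path on every worker node, or shipped alongside the job.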