
Use Third-Party Python Libraries in Hadoop Streaming

When running a custom Python MapReduce job through Hadoop Streaming, you may find that the job fails if your script imports a third-party library.

The task logs usually show an ImportError: the Python interpreter on the worker node cannot find the external package.
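To confirm this diagnosis, it can help to have the script log where Python looked. The following is a minimal sketch, not part of the original job: `import_or_report` is a hypothetical helper, and it is written for Python 3 (the `importlib` module is not available on Python 2.6). Hadoop Streaming captures a task's stderr in the task logs, so the search path shows up next to the failure.

```python
import importlib
import sys


def import_or_report(module_name, log=sys.stderr):
    """Import module_name; on failure, log sys.path and re-raise.

    Hypothetical helper for illustration only.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Show every directory the interpreter searched.
        log.write("cannot import %s; sys.path is:\n" % module_name)
        for entry in sys.path:
            log.write("  %s\n" % entry)
        raise
```

Calling `import_or_report("some_library")` at the top of the mapper, in place of a plain `import`, leaves the job's behavior unchanged when the import succeeds.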

If the Python package is installed under /usr/lib/python2.6/site-packages, add the following parameter to the MapReduce command:

-cmdenv PYTHONPATH=$PYTHONPATH:/usr/lib/python2.6/site-packages

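This sets the PYTHONPATH environment variable inside each MapReduce task, so the Python interpreter can find the library and the job can run normally. Note that the shell that submits the job expands $PYTHONPATH before Hadoop sees it, and that the package must be installed at the same path on every worker node.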
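The effect of PYTHONPATH can be reproduced locally without a cluster. The sketch below is an illustration under stated assumptions: `can_import`, `demo`, and the module name `fake_third_party_lib` are all hypothetical, and the code targets Python 3. It starts a fresh interpreter with and without PYTHONPATH set, which is roughly what a streaming task sees without and with the -cmdenv parameter.

```python
import os
import subprocess
import sys
import tempfile


def can_import(module_name, extra_path=None):
    """Return True if a fresh interpreter can import module_name."""
    env = dict(os.environ)
    env.pop("PYTHONPATH", None)  # start from a clean search path
    if extra_path is not None:
        # Plays the role of -cmdenv PYTHONPATH=... in the task environment.
        env["PYTHONPATH"] = extra_path
    result = subprocess.run(
        [sys.executable, "-c", "import " + module_name],
        env=env,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def demo():
    """Try the import without and then with the extra directory."""
    # Temporary directory standing in for a site-packages directory.
    with tempfile.TemporaryDirectory() as lib_dir:
        module_file = os.path.join(lib_dir, "fake_third_party_lib.py")
        with open(module_file, "w") as handle:
            handle.write("VALUE = 1\n")
        without_path = can_import("fake_third_party_lib")
        with_path = can_import("fake_third_party_lib", lib_dir)
        return without_path, with_path
```

Here `demo()` returns a pair of booleans: whether the import worked without the extra directory, and whether it worked once the directory was on PYTHONPATH.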
