
Use Python to Query Hive Through HiveServer2

HiveServer2 is a Thrift RPC service that lets remote clients execute Hive queries. It supports concurrent access by multiple users as well as authentication.

For Python users, the pyhs2 module can be used to connect to HiveServer2, execute queries, and fetch results.

The pyhs2 project is hosted on GitHub:

https://github.com/BradRuderman/pyhs2

It can be installed with:

easy_install pyhs2

If installation fails, try installing these dependencies first:

yum install cyrus-sasl-plain
yum install cyrus-sasl-devel
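
After installing, a quick import check can confirm that the module is visible to the interpreter before it is wired into a script. This is only a sketch: `module_available` is a hypothetical helper written for this example, not part of pyhs2.

```python
def module_available(name):
    """Return True if the named module can be imported, False otherwise."""
    try:
        __import__(name)
        return True
    except ImportError:
        return False


if module_available('pyhs2'):
    print('pyhs2 is importable')
else:
    print('pyhs2 is missing; check the installation steps above')
```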

Here is a simple test script. pyhs2 covers the basics well, and because query results come back as plain Python lists, it is convenient for daily scheduled scripts and lightweight automation.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

__author__ = 'knktc'
__version__ = '0.1'

import pyhs2


class HiveClient:
    """Thin wrapper around a pyhs2 connection to HiveServer2."""

    def __init__(self, db_host, user, password, database, port=10000, authMechanism="PLAIN"):
        # Open the connection once; it is reused by every query() call.
        self.conn = pyhs2.connect(host=db_host,
                                  port=port,
                                  authMechanism=authMechanism,
                                  user=user,
                                  password=password,
                                  database=database)

    def query(self, sql):
        # The cursor is closed automatically when the with block exits.
        with self.conn.cursor() as cursor:
            cursor.execute(sql)
            return cursor.fetch()

    def close(self):
        self.conn.close()


def main():
    hive_client = HiveClient(db_host='hiveserver2.hadoop',
                             port=10000,
                             user='hdfs',
                             password='mypass',
                             database='test_log',
                             authMechanism='PLAIN')
    result = hive_client.query('select * from t_test limit 10')
    print(result)
    hive_client.close()


if __name__ == '__main__':
    main()
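
Because the results arrive as plain lists, a scheduled script will often want to attach column names before using them. The sketch below shows one possible way to do that. `rows_to_dicts` is a hypothetical helper, not part of pyhs2, and the column names and sample rows are made up for illustration. It assumes each row is itself a list of column values; in a real script the rows would come from `hive_client.query(...)`.

```python
def rows_to_dicts(columns, rows):
    """Pair each row's values with the given column names."""
    return [dict(zip(columns, row)) for row in rows]


# Made-up stand-in for the list returned by hive_client.query(...)
columns = ['id', 'name']
rows = [[1, 'alice'], [2, 'bob']]

records = rows_to_dicts(columns, rows)
for record in records:
    print(record['id'], record['name'])
```

Looking fields up by name rather than by position keeps the script readable when the select list changes, at the cost of keeping the column list in sync with the query.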