HowTo: Use Hive with Apache Logs
In HowTo: Rotate Logs to S3 and HowTo: Rotate Apache Logs to S3 Based on EC2 Tags we rotated Apache logs to S3, writing them into a directory layout designed to make it easy to run a Hive script against them. In this article we demonstrate such a Hive script.
Command used by logrotate to upload the Apache access_log to S3
/usr/bin/s3cmd \
-m text/plain \
sync \
/var/log/apache2/access_log-* \
s3://${BUCKET}/apache/access_log/site=${SITE}/instance=${INSTANCE_ID}/
With BUCKET being logging-bucket, SITE being www.example.com, and INSTANCE_ID being ``, the access_log files end up under paths of the form:
s3://logging-bucket/apache/access_log/site=www.example.com/instance=``/access_log-*
The first thing we will want to do is start up an interactive Hive session; see HowTo: AWS CLI Elastic MapReduce - Interactive Hive for one way of starting an interactive Hive Elastic MapReduce cluster.
Once in Hive, we'll set the input location and load hive_contrib.jar:
hive>
set INPUT=s3n://logging-bucket/apache;
add jar /home/hadoop/hive/lib/hive_contrib.jar;
All Sites
We can create an external table that loads the logs for all of the sites:
hive>
CREATE EXTERNAL TABLE IF NOT EXISTS raw_access_log(
remote_host STRING,
remote_logname STRING,
remote_user STRING,
request_time STRING,
first_line STRING,
http_status STRING,
bytes STRING
)
PARTITIONED BY (site string, instance string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s"
)
LOCATION '${hiveconf:INPUT}/access_log/'
;
ALTER TABLE raw_access_log RECOVER PARTITIONS;
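The input.regex above does all of the parsing. As a quick sanity check outside Hive, the same pattern can be exercised in Python (Java and Python regex syntax agree for this expression; the sample log line is made up for illustration):

```python
import re

# Same pattern as the SerDe's input.regex above.
pattern = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) '
    r'([^ "]*|"[^"]*") (-|[0-9]*) (-|[0-9]*)'
)

# A made-up Apache common-log line.
line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326')

m = pattern.fullmatch(line)
# The groups map, in order, to: remote_host, remote_logname, remote_user,
# request_time, first_line, http_status, bytes.
print(m.groups())
```

Note that request_time keeps its surrounding brackets and bytes stays a string, which is what the next section cleans up.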
One thing to note is that the partition keys (site and instance) become columns available in the table:
hive>
describe raw_access_log;
Output
OK
remote_host string from deserializer
remote_logname string from deserializer
remote_user string from deserializer
request_time string from deserializer
first_line string from deserializer
http_status string from deserializer
bytes string from deserializer
site string None
instance string None
# Partition Information
# col_name data_type comment
site string None
instance string None
Time taken: 0.722 seconds, Fetched: 15 row(s)
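You can also confirm which partitions RECOVER PARTITIONS actually picked up from the S3 layout with a standard Hive statement (not shown in the original session):

```sql
SHOW PARTITIONS raw_access_log;
```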
Change Column Types
Another issue is that all the column types are strings. It would be nice if request_time were a TIMESTAMP and bytes were actually a number.
There are two options we can take: create a view that casts the types, or create a new table and convert those two columns while populating it.
View
hive>
CREATE VIEW access_log_view AS
SELECT
remote_host,
remote_logname,
remote_user,
CAST(FROM_UNIXTIME(UNIX_TIMESTAMP(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]')) AS TIMESTAMP) AS request_time,
first_line,
http_status,
CAST(REGEXP_REPLACE(bytes, '-', '0') AS BIGINT) AS bytes,
site,
instance
FROM
raw_access_log
;
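As a rough cross-check of those two conversions outside Hive: the Java-style pattern '[dd/MMM/yyyy:HH:mm:ss Z]' used by UNIX_TIMESTAMP corresponds to the strptime format below, and the REGEXP_REPLACE handles the '-' Apache logs when no body was sent (the sample values are made up):

```python
from datetime import datetime

# request_time as captured by the SerDe, brackets included (made-up value).
raw_time = '[10/Oct/2000:13:55:36 -0700]'
parsed = datetime.strptime(raw_time, '[%d/%b/%Y:%H:%M:%S %z]')
print(parsed.isoformat())

# bytes is '-' when Apache logged no response body; the view's
# REGEXP_REPLACE turns that into '0' before the cast to BIGINT.
for raw_bytes in ('2326', '-'):
    print(int(raw_bytes.replace('-', '0')))
```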
Table
hive>
CREATE TABLE IF NOT EXISTS access_log_table (
remote_host STRING,
remote_logname STRING,
remote_user STRING,
request_time TIMESTAMP,
first_line STRING,
http_status STRING,
bytes BIGINT,
site STRING,
instance STRING
)
;
INSERT OVERWRITE TABLE access_log_table
SELECT
remote_host,
remote_logname,
remote_user,
CAST(FROM_UNIXTIME(UNIX_TIMESTAMP(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]')) AS TIMESTAMP),
first_line,
http_status,
CAST(REGEXP_REPLACE(bytes, '-', '0') AS BIGINT),
site,
instance
FROM
raw_access_log
;
Example queries
Get the logs for www.example.com:
hive>
set SITE=www.example.com;
SELECT
remote_host,
remote_logname,
remote_user,
request_time,
first_line,
http_status,
bytes
FROM
access_log_view
WHERE
site = '${hiveconf:SITE}'
;
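With the types in place, aggregate queries also become straightforward. As one more illustrative example (not from the original session), total bytes served per site:

```sql
SELECT
    site,
    SUM(bytes) AS total_bytes
FROM
    access_log_table
GROUP BY
    site
;
```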