1.首先下载测试数据,数据也可以创建
http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
2. 数据类型与字段名称
movies.csv(电影元数据)movieId,title,genresratings.csv(用户打分数据)userId,movieId,rating,timestamp
3. 先把数据存放到HDFS上
hdfs dfs -mkdir /hive_operatehdfs dfs -mkdir /hive_operate/movie_tablehdfs dfs -mkdir /hive_operate/rating_tablehdfs dfs -put movies.csv /hive_operate/movie_tablehdfs dfs -put ratings.csv /hive_operate/rating_table
4. 创建movie_table和rating_table
$ cat create_movie_table.sqlcreate external table movie_table -- 创建外表(movieId STRING,title STRING,genres STRING)row format delimited fields terminated by ','stored as textfilelocation '/hive_operate/movie_table'; -- 指定hdfs的目录结构$ cat create_rating_table.sqlcreate external table rating_table(userId STRING,movieId STRING,rating STRING,ts STRING)row format delimited fields terminated by ','stored as textfilelocation '/hive_operate/rating_table';其中字段名为timestamp为hive的保留字段,执行的时候会报错,需用反引号或者修改字段名,我这边修改的字段名
5. 执行
在命令行执行该语句
6. 查看
hive> show tables;OKmovie_tablerating_tablehive> select * from rating_table limit 10;OK1 31 2.5 12607591441 1029 3.0 12607591791 1061 3.0 12607591821 1129 2.0 12607591851 1172 4.0 12607592051 1263 2.0 12607591511 1287 2.0 12607591871 1293 2.0 12607591481 1339 3.5 12607591251 1343 2.0 1260759131
7. 生成新表(行为表)
create table behavior_table asselect B.userid, A.movieid, B.rating, A.titlefrom movie_table Ajoin rating_table Bon A.movieid == B.movieid;
8. 把Hive表数据导入到本地
导入到本地中,语句中多了local
table->local fileinsert overwrite local directory '/root/hive_test/1.txt' select * from behavior_table;
9. 把Hive表数据导入到HDFS上
table->hdfs fileinsert overwrite directory '/root/hive_test/1.txt' select * from behavior_table;
10. 把本地数据导入到Hive表中
命令行中由LOCAL字段信息
local file -> tableLOAD DATA LOCAL INPATH '/root/hive_test/a.txt' OVERWRITE INTO TABLE behavior_table;
11. 把HDFS上的数导入到HIve表中
hdfs file -> tableLOAD DATA INPATH '/a.txt' OVERWRITE INTO TABLE behavior_table;
12.导入hive表中,需要分区的
LOAD DATA LOCAL INPATH '/home/hadoop/hfxdoc/classmem_Misli.txt' -- 文件名INTO TABLE classmem -- 表名partition (teacher = 'Mis.li') -- 分区名
