Study Notes: Pig Basics
Published: 2019-06-30


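I. Introduction to Pig

1. Origins

One drawback of MapReduce is its long development cycle. Writing the mapper and reducer, compiling and packaging the code, submitting the job, and retrieving the results all take a lot of time. In fact, Yahoo designed Pig precisely so that its researchers and engineers could mine large data sets more easily.

2. Basics

Pig provides a scripting language, Pig Latin, for exploring large data sets.

The appeal of Pig is that a few lines of Pig code typed at the console are enough to process terabytes of data.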

II. Pig Experiment

The file below is the access log of a website. Use Pig to compute the number of hits for each IP.

1. Data source

119.146.220.12 - - [31/Jan/2012:23:59:44 +0800] "POST /forum.php?mod=post&action=reply&fid=53&tid=69&extra=page%3D1&replysubmit=yes&infloat=yes&handlekey=fastpost&inajax=1 HTTP/1.1" 200 397 "http://f.dataguru.cn/thread-69-1-1.html" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:45 +0800] "GET /forum.php?mod=viewthread&tid=69&viewpid=677&from=&inajax=1&ajaxtarget=post_new HTTP/1.1" 200 4794 "http://f.dataguru.cn/thread-69-1-1.html" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:45 +0800] "GET /static/js/common_extra.js?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/thread-69-1-1.html" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /static/js/jquery-1.6.js HTTP/1.1" 404 299 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /static/js/floating-jf.js HTTP/1.1" 404 300 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /static/js/common.js?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /data/cache/style_2_forum_forumdisplay.css?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /forum.php?mod=forumdisplay&fid=53&page=1 HTTP/1.1" 200 49334 "http://f.dataguru.cn/thread-69-1-1.html" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /data/cache/style_2_widthauto.css?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /static/js/forum.js?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /popwin_js.php?fid=53 HTTP/1.1" 404 289 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:49 +0800] "GET /static/js/seditor.js?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:49 +0800] "GET /home.php?mod=spacecp&ac=pm&op=checknewpm&rand=1328025588 HTTP/1.1" 200 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
220.181.94.221 - - [31/Jan/2012:23:59:49 +0800] "GET /home.php?mod=spacecp&ac=pm&op=showmsg&handlekey=showmsg_11&touid=11&pmid=0&daterange=2&pid=77&tid=26 HTTP/1.1" 200 10074 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
119.146.220.12 - - [31/Jan/2012:23:59:48 +0800] "GET /data/cache/style_2_common.css?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:51 +0800] "GET /static/js/jquery-1.6.js HTTP/1.1" 404 299 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:52 +0800] "GET /static/js/floating-jf.js HTTP/1.1" 404 300 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:55 +0800] "GET /popwin_js.php?fid=53 HTTP/1.1" 404 289 "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:55 +0800] "GET /static/js/smilies.js?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
119.146.220.12 - - [31/Jan/2012:23:59:55 +0800] "GET /data/cache/common_smilies_var.js?AZH HTTP/1.1" 304 - "http://f.dataguru.cn/forum.php?mod=forumdisplay&fid=53&page=1" "Mozilla/5.0 (Windows NT 5.1; rv:8.0.1) Gecko/20100101 Firefox/8.0.1"
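
Each entry appears to follow the usual web-server access-log layout, with the client IP as the first space-separated field. A minimal Python sketch, using one entry copied from the log above, shows what splitting on spaces yields. The variable names are illustrative only.

```python
# One entry copied from the sample log above.
line = ('220.181.94.221 - - [31/Jan/2012:23:59:49 +0800] '
        '"GET /home.php?mod=spacecp&ac=pm&op=showmsg&handlekey=showmsg_11'
        '&touid=11&pmid=0&daterange=2&pid=77&tid=26 HTTP/1.1" 200 10074 "-" '
        '"Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"')

# Split on single spaces: the first field is the client IP, which is
# all that a per-IP hit count needs. The other fields can be ignored.
fields = line.split(' ')
ip = fields[0]
print(ip)  # 220.181.94.221
```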

 

2. Pig commands

-- Load the access log from HDFS, split each line on spaces, and keep only the ip column
records = LOAD 'hdfs://hadoop:9000/class7/input/website_log.txt' USING PigStorage(' ') AS (ip:chararray);

-- Group by ip and count the hits for each ip
records_b = GROUP records BY ip;
records_c = FOREACH records_b GENERATE group, COUNT(records) AS click;

-- Sort by hit count in descending order and keep the 10 ips with the most hits
records_d = ORDER records_c BY click DESC;
top10 = LIMIT records_d 10;

-- Store the result in the class7 directory on HDFS
STORE top10 INTO 'hdfs://hadoop:9000/class7/out';
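
To sanity-check the logic on a small sample without a cluster, the same load, group, count, sort, and limit steps can be sketched in plain Python. This only illustrates what the script computes, not how Pig executes it. The function name `top_ips` and the shortened sample lines are assumptions made for the example.

```python
from collections import Counter

def top_ips(lines, n=10):
    """Return the n most frequent client IPs as (ip, clicks) pairs."""
    # Like LOAD ... PigStorage(' ') AS (ip): keep only the first field.
    ips = (line.split(' ', 1)[0] for line in lines if line.strip())
    # Like GROUP BY ip + COUNT, then ORDER BY click DESC + LIMIT n.
    return Counter(ips).most_common(n)

# Hypothetical, shortened log lines; only the leading IP matters here.
sample = [
    '119.146.220.12 - - [31/Jan/2012:23:59:44 +0800] "POST /forum.php HTTP/1.1" 200 397',
    '119.146.220.12 - - [31/Jan/2012:23:59:45 +0800] "GET /forum.php HTTP/1.1" 200 4794',
    '220.181.94.221 - - [31/Jan/2012:23:59:49 +0800] "GET /home.php HTTP/1.1" 200 10074',
]

for ip, clicks in top_ips(sample):
    print(ip, clicks)
```

Applied to the 20 entries listed under the data source, the same logic should report 19 hits for 119.146.220.12 and 1 for 220.181.94.221.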

 

Reposted from: https://www.cnblogs.com/FrankZhou2017/p/9145419.html
