# mapreduce-python

**Repository Path**: mbw030714/mapreduce-python

## Basic Information

- **Project Name**: mapreduce-python
- **Description**: Uses Python with the Hadoop Streaming API to run a simple MapReduce computation.
- **Primary Language**: Python
- **License**: GPL-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 2
- **Forks**: 0
- **Created**: 2022-04-02
- **Last Updated**: 2022-04-03

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# mapreduce-python

#### Introduction

Uses Python with the Hadoop Streaming API to run a simple MapReduce computation. Based on the official documentation.

#### Architecture

1. `mapper.py` performs the map phase, splitting the input into key/value records.
2. `reducer.py` performs the reduce phase, aggregating the mapper output.
3. In `testdata.csv`, `gender=0` denotes female and `gender=1` denotes male.

#### Installation (based on Python 3.8 & Hadoop 3.3.2)

1. Install the Python 3 environment on your Linux system:

   ```
   sudo apt-get update -y
   sudo apt-get install python3 -y
   ```

2. Clone this project:

   ```
   cd ~
   git clone https://gitee.com/mbw030714/mapreduce-python.git
   ```

3. Upload `testdata.csv` to the HDFS root directory:

   ```
   hdfs dfs -put ~/mapreduce-python/testdata.csv /
   ```

4. Configure `mapred-site.xml` so jobs can run on the cluster (using the nano text editor here):

   ```
   nano ${HADOOP_HOME}/etc/hadoop/mapred-site.xml
   ```

   Add the following between the `<configuration>` tags and save:

   ```xml
   <property>
     <name>yarn.app.mapreduce.am.env</name>
     <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
   </property>
   <property>
     <name>mapreduce.map.env</name>
     <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
   </property>
   <property>
     <name>mapreduce.reduce.env</name>
     <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
   </property>
   ```

5. Remember to **restart** Hadoop!

#### Usage

1. Run `hadoop-streaming-3.3.2.jar` in a terminal with the following arguments (note: `-files` takes a single comma-separated list with no spaces, and the paths match the clone location from the installation steps):

   ```
   hadoop jar ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-3.3.2.jar \
       -D stream.non.zero.exit.is.failure=false \
       -files $HOME/mapreduce-python/mapper.py,$HOME/mapreduce-python/reducer.py \
       -input /testdata.csv -output /output \
       -mapper "python3 mapper.py" \
       -reducer "python3 reducer.py"
   ```

2. If the job runs successfully, you should see output like the following:
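For reference, here is a minimal sketch of the logic that `mapper.py` and `reducer.py` implement. The exact column layout of `testdata.csv` is an assumption here (year in the first field, gender in the second); the real scripts each read `sys.stdin` line by line and `print` their output, as Hadoop Streaming requires.

```python
def mapper(lines):
    """Map phase sketch: emit "year<TAB>gender" for each valid CSV record.

    Assumes testdata.csv has the year in column 1 and gender (0/1) in
    column 2; header rows and malformed lines are skipped.
    """
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) < 2:
            continue
        year, gender = fields[0], fields[1]
        if year.isdigit() and gender in ("0", "1"):
            yield f"{year}\t{gender}"


def reducer(lines):
    """Reduce phase sketch: aggregate per-year gender counts.

    Hadoop Streaming delivers mapper output sorted by key, so all
    records for one year arrive consecutively.
    """
    current, females, males = None, 0, 0
    for line in lines:
        year, _, gender = line.strip().partition("\t")
        if year != current:
            if current is not None:
                yield f"{current} year:{females} female,{males} male"
            current, females, males = year, 0, 0
        if gender == "0":
            females += 1
        elif gender == "1":
            males += 1
    if current is not None:
        yield f"{current} year:{females} female,{males} male"
```

In the repository these live in two separate scripts, with `mapper.py` running `mapper(sys.stdin)` and `reducer.py` running `reducer(sys.stdin)`; the `sorted()` step between them is performed by Hadoop's shuffle phase.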
   ```
   packageJobJar: [/tmp/hadoop-unjar212095934203121354/] [] /tmp/streamjob2467242584601343980.jar tmpDir=null
   2022-04-01 16:55:15,579 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at master/192.168.182.131:8032
   2022-04-01 16:55:15,726 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at master/192.168.182.131:8032
   2022-04-01 16:55:16,044 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1648832063677_0001
   2022-04-01 16:55:16,365 INFO mapred.FileInputFormat: Total input files to process : 1
   2022-04-01 16:55:16,441 INFO mapreduce.JobSubmitter: number of splits:2
   2022-04-01 16:55:16,549 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1648832063677_0001
   2022-04-01 16:55:16,549 INFO mapreduce.JobSubmitter: Executing with tokens: []
   2022-04-01 16:55:16,683 INFO conf.Configuration: resource-types.xml not found
   2022-04-01 16:55:16,683 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
   2022-04-01 16:55:16,824 INFO impl.YarnClientImpl: Submitted application application_1648832063677_0001
   2022-04-01 16:55:16,874 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1648832063677_0001/
   2022-04-01 16:55:16,875 INFO mapreduce.Job: Running job: job_1648832063677_0001
   2022-04-01 16:55:22,956 INFO mapreduce.Job: Job job_1648832063677_0001 running in uber mode : false
   2022-04-01 16:55:22,957 INFO mapreduce.Job:  map 0% reduce 0%
   2022-04-01 16:55:29,025 INFO mapreduce.Job:  map 100% reduce 0%
   2022-04-01 16:55:34,059 INFO mapreduce.Job:  map 100% reduce 100%
   2022-04-01 16:55:34,066 INFO mapreduce.Job: Job job_1648832063677_0001 completed successfully
   2022-04-01 16:55:34,124 INFO mapreduce.Job: Counters: 54
   	File System Counters
   		FILE: Number of bytes read=240
   		FILE: Number of bytes written=838673
   		FILE: Number of read operations=0
   		FILE: Number of large read operations=0
   		FILE: Number of write operations=0
   		HDFS: Number of bytes read=1000
   		HDFS: Number of bytes written=224
   		HDFS: Number of read operations=11
   		HDFS: Number of large read operations=0
   		HDFS: Number of write operations=2
   		HDFS: Number of bytes read erasure-coded=0
   	Job Counters
   		Launched map tasks=2
   		Launched reduce tasks=1
   		Data-local map tasks=2
   		Total time spent by all maps in occupied slots (ms)=7513
   		Total time spent by all reduces in occupied slots (ms)=2286
   		Total time spent by all map tasks (ms)=7513
   		Total time spent by all reduce tasks (ms)=2286
   		Total vcore-milliseconds taken by all map tasks=7513
   		Total vcore-milliseconds taken by all reduce tasks=2286
   		Total megabyte-milliseconds taken by all map tasks=7693312
   		Total megabyte-milliseconds taken by all reduce tasks=2340864
   	Map-Reduce Framework
   		Map input records=27
   		Map output records=26
   		Map output bytes=182
   		Map output materialized bytes=246
   		Input split bytes=196
   		Combine input records=0
   		Combine output records=0
   		Reduce input groups=8
   		Reduce shuffle bytes=246
   		Reduce input records=26
   		Reduce output records=8
   		Spilled Records=52
   		Shuffled Maps =2
   		Failed Shuffles=0
   		Merged Map outputs=2
   		GC time elapsed (ms)=203
   		CPU time spent (ms)=1250
   		Physical memory (bytes) snapshot=806965248
   		Virtual memory (bytes) snapshot=7824416768
   		Total committed heap usage (bytes)=635437056
   		Peak Map Physical memory (bytes)=296779776
   		Peak Map Virtual memory (bytes)=2606022656
   		Peak Reduce Physical memory (bytes)=217452544
   		Peak Reduce Virtual memory (bytes)=2612670464
   	Shuffle Errors
   		BAD_ID=0
   		CONNECTION=0
   		IO_ERROR=0
   		WRONG_LENGTH=0
   		WRONG_MAP=0
   		WRONG_REDUCE=0
   	File Input Format Counters
   		Bytes Read=804
   	File Output Format Counters
   		Bytes Written=224
   2022-04-01 16:55:34,124 INFO streaming.StreamJob: Output directory: /user/mengbowen/output
   ```

3. View the results. If the job ran successfully, you can inspect the computed output:

   ```
   hdfs dfs -cat /output/part-00000
   ```

   My run produced the following:

   ```
   2006 year:0 female,1 male
   2008 year:2 female,2 male
   2009 year:2 female,1 male
   2010 year:2 female,0 male
   2011 year:3 female,3 male
   2012 year:2 female,2 male
   2013 year:1 female,2 male
   2014 year:0 female,2 male
   ```

#### Give it a try!

#### If you find it useful, please give it a star!

#### Contributing

1. Fork this repository
2. Create a Feat_xxx branch
3. Commit your code
4. Open a Pull Request
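#### Tip: testing locally

Before submitting the job to YARN, the mapper/reducer pair can be smoke-tested without a cluster by emulating the Streaming pipeline with a shell pipe, with `sort` standing in for Hadoop's shuffle phase. The paths below assume the clone location from the installation steps:

```shell
# Emulate Hadoop Streaming locally: map -> shuffle (sort) -> reduce.
cat ~/mapreduce-python/testdata.csv \
  | python3 ~/mapreduce-python/mapper.py \
  | sort \
  | python3 ~/mapreduce-python/reducer.py
```

If this pipe prints the expected per-year counts, the same scripts should behave identically under `hadoop-streaming`, since Streaming feeds each script's stdin and reads its stdout in exactly the same way.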