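Putting big data technology into practice depends on two capabilities: command of the toolchain and the ability to apply it to concrete scenarios. From Python web crawlers and Hive data analysis to Flink real-time computing and data warehouse architecture design, the ability to combine these skills has become a core hiring criterion for employers. This book takes "training driven by real projects" as its central approach. It selects four representative training projects to build a graded sequence of exercises that covers core scenarios such as offline processing, real-time computing, and data warehouse design, and that strengthens engineering thinking. It integrates mainstream tools such as Python web crawlers, Hive, Flink, and Kafka across the full pipeline of data collection, cleaning, storage, analysis, and visualization. It also incorporates topics tested in big data competitions and aligns with the skills required on the job. The book is suitable as a practical training textbook for big-data-related majors at colleges and universities, and can also serve as a hands-on reference for data engineering practitioners.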
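Zhang Zhiwei is an associate professor and director of the Software Engineering Teaching and Research Section at the School of Information Engineering, Suzhou University (宿州学院). He received his PhD in Computer Science and Technology from South China University of Technology. His research interests are data science and big data technology, and artificial intelligence. He has led multiple projects funded by the National Natural Science Foundation of China as well as provincial-level projects, and has written three books.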
Chapter 1 Historical Weather Data Analysis Project
Task 1 Requirements Analysis
Task 2 Technical Architecture Analysis and Design
Task 3 Historical Weather Data Collection
Task 4 Importing Weather Data into Hive
Task 5 Historical Weather Data Analysis
Task 6 Exporting the Result Metric Tables
Task 7 Data Visualization
Chapter 2 Music Recommendation System
Task 1 Requirements Analysis
Task 2 Technical Architecture Analysis and Design
Task 3 Dataset and Project Overview
Task 4 Data Loading Module
Task 5 Data Statistics Module
Task 6 Offline Recommendation Module
Task 7 Real-Time Recommendation Module
Chapter 3 E-Commerce Offline Data Warehouse
Task 1 Requirements Analysis
Task 2 Data Warehouse Overview and Architecture Analysis
Task 3 Data Sources
Task 4 Data Warehouse Construction
Task 5 Workflow Scheduling
Task 6 Data Visualization
Chapter 4 Smart Community Real-Time Data Warehouse
Task 1 Requirements Analysis
Task 2 Technical Architecture Analysis and Design
Task 3 Data Sources and Preprocessing
Task 4 Real-Time Computing Framework Configuration
Task 5 Building the DIM Layer
Task 6 Building the ODS Layer
Task 7 Building the DWD Layer
Task 8 Building the DWS Layer
Task 9 Data Visualization and Application
Appendix A Hadoop Deployment and Configuration
Appendix B MySQL Deployment
Appendix C Hive Deployment and Configuration
Appendix D DataX Deployment and Configuration
Appendix E Zookeeper Deployment and Configuration
Appendix F Kafka Deployment and Configuration
Appendix G Flume Deployment and Configuration
Appendix H DolphinScheduler Deployment and Configuration
Appendix I Superset Deployment and Configuration