Download Spark
[root@localhost ~]# wget https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
Extract the archive and rename the directory
[root@localhost ~]# tar -xzf spark-3.5.1-bin-hadoop3.tgz
[root@localhost ~]# mv spark-3.5.1-bin-hadoop3 spark
Edit .bashrc
# spark path
export SPARK_HOME=/root/spark
export PATH=$SPARK_HOME/bin:$PATH
[root@localhost ~]# source ~/.bashrc
Launch the Spark shell
[root@localhost ~]# cd spark
[root@localhost spark]# spark-shell
Spark hands-on
scala> var myRange = spark.range(1000).toDF("number")
myRange: org.apache.spark.sql.DataFrame = [number: bigint]
scala> myRange.show()
+------+
|number|
+------+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
| 10|
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
+------+
only showing top 20 rows
var divisBy2 = myRange.where("number % 2 = 0")
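The `where` call above keeps only the even numbers. A minimal, self-contained sketch of the same idea (it builds its own local SparkSession instead of relying on the `spark` value the shell predefines):

```scala
import org.apache.spark.sql.SparkSession

object EvenFilterDemo {
  def main(args: Array[String]): Unit = {
    // Local session standing in for spark-shell's predefined `spark`
    val spark = SparkSession.builder()
      .appName("EvenFilterDemo")
      .master("local[*]")
      .getOrCreate()

    val myRange = spark.range(1000).toDF("number")
    val divisBy2 = myRange.where("number % 2 = 0")

    // 0, 2, 4, ..., 998 -> 500 rows
    println(divisBy2.count())

    spark.stop()
  }
}
```

Note that nothing is computed until `count()` is called; `where` only records the transformation.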
Reading a csv file in Spark
Copy the csv file downloaded locally over to the Linux machine
C:\Users\Playdata>scp C:\Users\Playdata\Downloads\2015-summary.csv root@192.168.111.100:/root
[root@localhost ~]# mv 2015-summary.csv spark/2015-summary.csv
Read the csv file
var flightData = spark.read.option("inferSchema","true").option("header","true").csv("2015-summary.csv")
option("inferSchema","true") : detects column types automatically
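To see what `inferSchema` actually changes, here is a self-contained sketch that writes a tiny made-up csv (same column layout as 2015-summary.csv) to a temp file and reads it back both ways:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object InferSchemaDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("InferSchemaDemo")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical sample rows in the shape of 2015-summary.csv
    val path = Files.createTempFile("summary", ".csv")
    Files.write(path,
      "DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count\nUnited States,Romania,15\n".getBytes)

    // Without inferSchema every column comes back as string
    val asStrings = spark.read.option("header", "true").csv(path.toString)
    // With inferSchema the `count` column is detected as an integer type
    val inferred = spark.read
      .option("inferSchema", "true")
      .option("header", "true")
      .csv(path.toString)

    asStrings.printSchema()
    inferred.printSchema()

    spark.stop()
  }
}
```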
Create a view
flightData.createOrReplaceTempView("flightData")
Run SQL
var flightData_sql = spark.sql("""
| select DEST_COUNTRY_NAME, count(*)
| from flightData
| group by DEST_COUNTRY_NAME
| """)
var flightData_gb = flightData.groupBy("DEST_COUNTRY_NAME").count()
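The SQL query and the `groupBy` call express the same aggregation. A self-contained sketch on made-up rows (column names follow the csv above) showing both paths side by side:

```scala
import org.apache.spark.sql.SparkSession

object GroupByDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("GroupByDemo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical rows standing in for 2015-summary.csv
    val flightData = Seq(
      ("United States", "Romania", 15L),
      ("United States", "Croatia", 1L),
      ("Egypt", "United States", 24L)
    ).toDF("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME", "count")
    flightData.createOrReplaceTempView("flightData")

    // SQL path
    val bySql = spark.sql(
      "select DEST_COUNTRY_NAME, count(*) from flightData group by DEST_COUNTRY_NAME")
    // DataFrame API path
    val byDf = flightData.groupBy("DEST_COUNTRY_NAME").count()

    // Both yield the same grouped row counts
    bySql.show()
    byDf.show()

    spark.stop()
  }
}
```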
Find the maximum value
import org.apache.spark.sql.functions.max
var result = flightData.select(max("count")).take(1)
print(result(0))
[370002]
SQL
var max_sql = spark.sql("""
| select DEST_COUNTRY_NAME, sum(count) as des_total
| from flightData
| group by DEST_COUNTRY_NAME
| order by sum(count) desc
| limit 5
| """)
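The same top-5 query can also be written with the DataFrame API. A sketch on made-up rows (same column layout as the csv; the `des_total` alias matches the SQL above):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{sum, desc}

object TopDestDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TopDestDemo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical rows in the shape of 2015-summary.csv
    val flightData = Seq(
      ("United States", "Romania", 15L),
      ("United States", "Croatia", 1L),
      ("Egypt", "United States", 24L)
    ).toDF("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME", "count")

    // DataFrame equivalent of the SQL: group, sum, sort desc, limit 5
    val top5 = flightData
      .groupBy("DEST_COUNTRY_NAME")
      .agg(sum("count").as("des_total"))
      .orderBy(desc("des_total"))
      .limit(5)

    // On this toy data, Egypt (24) ranks above United States (15 + 1 = 16)
    top5.show()

    spark.stop()
  }
}
```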