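This article walks through an example analysis of stream processing in Spark SQL, following the source code from the point where a streaming query is started.
Spark SQL supports stream processing. A stream has a Source and a Sink: the Source defines where the stream comes from, the Sink defines where it goes, and execution of the stream is triggered from the Sink.
writeStream on a Dataset returns the writer that defines the stream's destination, and calling start on that writer triggers the actual execution, so the analysis starts from writeStream.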
writeStream = new DataStreamWriter[T](this)
DataStreamWriter
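DataStreamWriter writes the Dataset passed to its constructor to external storage, such as Kafka, a database, or text files.
The main entry point is the start method, which returns a StreamingQuery object. The code: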
def start(): StreamingQuery = {
  if (source == "memory") {
    assertNotPartitioned("memory")
    val (sink, resultDf) = trigger match {
      case _: ContinuousTrigger =>
        val s = new MemorySinkV2()
        val r = Dataset.ofRows(df.sparkSession, new MemoryPlanV2(s, df.schema.toAttributes))
        (s, r)
      case _ =>
        val s = new MemorySink(df.schema, outputMode)
        val r = Dataset.ofRows(df.sparkSession, new MemoryPlan(s))
        (s, r)
    }
    val chkpointLoc = extraOptions.get("checkpointLocation")
    val recoverFromChkpoint = outputMode == OutputMode.Complete()
    val query = df.sparkSession.sessionState.streamingQueryManager.startQuery(
      extraOptions.get("queryName"),
      chkpointLoc,
      df,
      extraOptions.toMap,
      sink,
      outputMode,
      useTempCheckpointLocation = true,
      recoverFromCheckpointLocation = recoverFromChkpoint,
      trigger = trigger)
    resultDf.createOrReplaceTempView(query.name)
    query
  } else if (source == "foreach") {
    assertNotPartitioned("foreach")
    val sink = new ForeachSink[T](foreachWriter)(ds.exprEnc)
    df.sparkSession.sessionState.streamingQueryManager.startQuery(
      extraOptions.get("queryName"),
      extraOptions.get("checkpointLocation"),
      df,
      extraOptions.toMap,
      sink,
      outputMode,
      useTempCheckpointLocation = true,
      trigger = trigger)
  } else {
    val ds = DataSource.lookupDataSource(source, df.sparkSession.sessionState.conf)
    val disabledSources = df.sparkSession.sqlContext.conf.disabledV2StreamingWriters.split(",")
    val sink = ds.newInstance() match {
      case w: StreamWriteSupport if !disabledSources.contains(w.getClass.getCanonicalName) => w
      case _ =>
        val ds = DataSource(
          df.sparkSession,
          className = source,
          options = extraOptions.toMap,
          partitionColumns = normalizedParCols.getOrElse(Nil))
        ds.createSink(outputMode)
    }
    df.sparkSession.sessionState.streamingQueryManager.startQuery(
      extraOptions.get("queryName"),
      extraOptions.get("checkpointLocation"),
      df,
      extraOptions.toMap,
      sink,
      outputMode,
      useTempCheckpointLocation = source == "console",
      recoverFromCheckpointLocation = true,
      trigger = trigger)
  }
}
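Here we look at the last branch. ds is the class resolved for the requested data source. When an instance of it implements StreamWriteSupport and is not among the disabled V2 streaming writers, that instance itself serves as the sink; otherwise a sink is created through DataSource.createSink. Finally, startQuery on streamingQueryManager starts the streaming computation and returns the running StreamingQuery object.
Inside startQuery, streamingQueryManager mainly calls createQuery to build a StreamingQueryWrapper object. createQuery is a private method: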
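The sink choice in that last branch can be pictured with a small, self-contained sketch. The traits and classes below (`BaseStreamingSink`, `StreamWriteSupport`, `LegacySink`, `V2Source`, `V1Source`) are simplified stand-ins written for illustration, not Spark's real classes, and the sketch compares `getClass.getName` where the Spark code uses `getCanonicalName`.

```scala
// Hypothetical, simplified stand-ins for Spark's internal types.
object SinkSelection {
  trait BaseStreamingSink
  // Marker for data sources that can write a stream themselves (the V2 path).
  trait StreamWriteSupport extends BaseStreamingSink
  // Fallback sink built from the source name when the V2 path is not taken.
  final case class LegacySink(className: String) extends BaseStreamingSink

  def chooseSink(
      instance: AnyRef,
      sourceName: String,
      disabledWriters: Set[String]): BaseStreamingSink =
    instance match {
      // The data source instance itself is used as the sink.
      case w: StreamWriteSupport if !disabledWriters.contains(w.getClass.getName) => w
      // Otherwise a sink is created from the source description.
      case _ => LegacySink(sourceName)
    }

  // Two hypothetical data sources for illustration.
  final class V2Source extends StreamWriteSupport
  final class V1Source
}
```

With this model, a `V2Source` instance is returned unchanged unless its class name is listed in `disabledWriters`, in which case it is treated like a `V1Source` and wrapped in a `LegacySink`.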
private def createQuery(
    userSpecifiedName: Option[String],
    userSpecifiedCheckpointLocation: Option[String],
    df: DataFrame,
    extraOptions: Map[String, String],
    sink: BaseStreamingSink,
    outputMode: OutputMode,
    useTempCheckpointLocation: Boolean,
    recoverFromCheckpointLocation: Boolean,
    trigger: Trigger,
    triggerClock: Clock): StreamingQueryWrapper = {
  var deleteCheckpointOnStop = false
  val checkpointLocation = userSpecifiedCheckpointLocation.map { userSpecified =>
    new Path(userSpecified).toUri.toString
  }.orElse {
    df.sparkSession.sessionState.conf.checkpointLocation.map { location =>
      new Path(location, userSpecifiedName.getOrElse(UUID.randomUUID().toString)).toUri.toString
    }
  }.getOrElse {
    if (useTempCheckpointLocation) {
      // Delete the temp checkpoint when a query is being stopped without errors.
      deleteCheckpointOnStop = true
      Utils.createTempDir(namePrefix = s"temporary").getCanonicalPath
    } else {
      throw new AnalysisException(
        "checkpointLocation must be specified either " +
          """through option("checkpointLocation", ...) or """ +
          s"""SparkSession.conf.set("${SQLConf.CHECKPOINT_LOCATION.key}", ...)""")
    }
  }

  // If offsets have already been created, we trying to resume a query.
  if (!recoverFromCheckpointLocation) {
    val checkpointPath = new Path(checkpointLocation, "offsets")
    val fs = checkpointPath.getFileSystem(df.sparkSession.sessionState.newHadoopConf())
    if (fs.exists(checkpointPath)) {
      throw new AnalysisException(
        s"This query does not support recovering from checkpoint location. " +
          s"Delete $checkpointPath to start over.")
    }
  }

  val analyzedPlan = df.queryExecution.analyzed
  df.queryExecution.assertAnalyzed()

  if (sparkSession.sessionState.conf.isUnsupportedOperationCheckEnabled) {
    UnsupportedOperationChecker.checkForStreaming(analyzedPlan, outputMode)
  }

  if (sparkSession.sessionState.conf.adaptiveExecutionEnabled) {
    logWarning(s"${SQLConf.ADAPTIVE_EXECUTION_ENABLED.key} " +
      "is not supported in streaming DataFrames/Datasets and will be disabled.")
  }

  (sink, trigger) match {
    case (v2Sink: StreamWriteSupport, trigger: ContinuousTrigger) =>
      UnsupportedOperationChecker.checkForContinuous(analyzedPlan, outputMode)
      new StreamingQueryWrapper(new ContinuousExecution(
        sparkSession,
        userSpecifiedName.orNull,
        checkpointLocation,
        analyzedPlan,
        v2Sink,
        trigger,
        triggerClock,
        outputMode,
        extraOptions,
        deleteCheckpointOnStop))
    case _ =>
      new StreamingQueryWrapper(new MicroBatchExecution(
        sparkSession,
        userSpecifiedName.orNull,
        checkpointLocation,
        analyzedPlan,
        sink,
        trigger,
        triggerClock,
        outputMode,
        extraOptions,
        deleteCheckpointOnStop))
  }
}
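createQuery chooses between ContinuousExecution and MicroBatchExecution depending on whether the query runs in continuous mode or in micro-batch mode. Both are subclasses of StreamExecution, the abstract class for stream processing. The class structure of StreamExecution is analyzed later.
The code structure and functionality of ContinuousExecution and MicroBatchExecution are quite similar, so we take ContinuousExecution as the example first.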
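Two decisions in createQuery are easy to lose in the long listing: the order in which the checkpoint location is resolved, and the pattern match that picks the execution class. The sketch below models both in plain Scala. All names in it (`CreateQuerySketch`, `Checkpoint`, `V2WriteSink`, `makeTempDir`, and so on) are hypothetical simplifications; it returns an `Either` where the Spark code throws an `AnalysisException`, and it joins paths with a plain `/` instead of Hadoop's `Path`.

```scala
import java.util.UUID

// Hypothetical, simplified stand-ins for Spark's internal types.
object CreateQuerySketch {
  sealed trait Trigger
  case object ContinuousTrigger extends Trigger
  case object ProcessingTimeTrigger extends Trigger

  sealed trait Sink
  case object V2WriteSink extends Sink // plays the role of a StreamWriteSupport sink
  case object OtherSink extends Sink

  sealed trait Execution
  case object Continuous extends Execution
  case object MicroBatch extends Execution

  // Resolved checkpoint location plus whether to delete it when the query stops.
  final case class Checkpoint(location: String, deleteOnStop: Boolean)

  // Precedence: explicit option, then session default + query name, then a temp dir.
  def resolveCheckpoint(
      userSpecified: Option[String],
      sessionDefault: Option[String],
      queryName: Option[String],
      useTemp: Boolean,
      makeTempDir: () => String): Either[String, Checkpoint] =
    userSpecified
      .map(loc => Checkpoint(loc, deleteOnStop = false))
      .orElse(sessionDefault.map { base =>
        // Without a query name, a random UUID names the sub-directory.
        val child = queryName.getOrElse(UUID.randomUUID().toString)
        Checkpoint(s"$base/$child", deleteOnStop = false)
      })
      .orElse(if (useTemp) Some(Checkpoint(makeTempDir(), deleteOnStop = true)) else None)
      .toRight("checkpointLocation must be specified")

  // Only a V2 write sink combined with a continuous trigger runs continuously.
  def chooseExecution(sink: Sink, trigger: Trigger): Execution =
    (sink, trigger) match {
      case (V2WriteSink, ContinuousTrigger) => Continuous
      case _ => MicroBatch
    }
}
```

Note that only the temporary directory is marked for deletion on stop, which mirrors the `deleteCheckpointOnStop` flag in the listing, and that every sink and trigger combination other than a V2 write sink with a continuous trigger falls through to micro-batch execution.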
ContinuousExecution
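First, a ContinuousExecution has no end: it is an unbounded stream. When the stream temporarily has no data, ContinuousExecution blocks the thread and waits for new data to arrive, which is controlled through the awaitEpoch method.
Second, the commit method is triggered once the data of an epoch has been processed, and it records the completed progress (the offset) in commitLog.
Next, logicalPlan: in ContinuousExecution the incoming logical plan contains relations of type StreamingRelationV2, which are converted into LogicalPlan nodes of type ContinuousExecutionRelation: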
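The blocking wait described here follows a common lock-and-condition pattern. The class below is a minimal sketch of that pattern only, not the Spark implementation: `EpochTracker`, its fields, and the 100 ms re-check interval are assumptions chosen for illustration.

```scala
import java.util.concurrent.TimeUnit
import java.util.concurrent.locks.ReentrantLock

// Hypothetical helper: tracks the latest committed epoch and lets callers wait for one.
final class EpochTracker {
  private val lock = new ReentrantLock()
  private val progressed = lock.newCondition()
  @volatile private var latestCommitted: Long = -1L

  // Record that an epoch has been committed and wake up any waiters.
  def commit(epoch: Long): Unit = {
    lock.lock()
    try {
      if (epoch > latestCommitted) latestCommitted = epoch
      progressed.signalAll()
    } finally lock.unlock()
  }

  // Block the caller until `epoch` (or a later one) has been committed.
  def awaitEpoch(epoch: Long): Unit = {
    lock.lock()
    try {
      while (latestCommitted < epoch) {
        // Wake up periodically so the condition is re-checked.
        progressed.await(100, TimeUnit.MILLISECONDS)
      }
    } finally lock.unlock()
  }

  def latest: Long = latestCommitted
}
```

A caller of `awaitEpoch(2L)` stays parked until some other thread calls `commit` with epoch 2 or higher. The loop around `await` guards against spurious wake-ups.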
analyzedPlan.transform {
  case r @ StreamingRelationV2(
      source: ContinuousReadSupport, _, extraReaderOptions, output, _) =>
    toExecutionRelationMap.getOrElseUpdate(r, {
      ContinuousExecutionRelation(source, extraReaderOptions, output)(sparkSession)
    })
}
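The getOrElseUpdate call in this transform means that a relation occurring more than once in the plan is converted only once, and every occurrence then shares the same converted node. A minimal sketch of that memoization, using hypothetical `StreamingRelation` and `ExecutionRelation` classes in place of the Spark plan nodes:

```scala
import scala.collection.mutable

// Hypothetical stand-ins for the plan nodes; only the caching behavior is modeled.
object RelationCache {
  final case class StreamingRelation(name: String)
  final class ExecutionRelation(val source: String)

  private val cache = mutable.Map.empty[StreamingRelation, ExecutionRelation]

  // Convert a relation once; later lookups with an equal key reuse the same instance.
  def toExecutionRelation(r: StreamingRelation): ExecutionRelation =
    cache.getOrElseUpdate(r, new ExecutionRelation(r.name))
}
```

Because `ExecutionRelation` is a plain class compared by reference, two lookups with equal keys returning the very same object is what shows the cache at work.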
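There is also the addOffset method: each time the offsets for an epoch have been read, it writes them to offsetLog, so that a restarted query knows where to resume. Together, addOffset and commit provide the bookkeeping behind the fault-tolerance guarantee. Note that the Spark documentation describes continuous processing as at-least-once, while exactly-once is what the default micro-batch engine can achieve.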
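How the two logs cooperate on restart can be shown with a small model. This is a sketch under simplifying assumptions, not Spark's log classes: `EpochRecovery` and `Logs` are hypothetical, an offset is reduced to a single `Long`, and the logs live in memory instead of in the checkpoint directory.

```scala
// Hypothetical in-memory model of offsetLog and commitLog.
object EpochRecovery {
  final case class Logs(offsetLog: Map[Long, Long], commitLog: Set[Long]) {
    // Record the offset reached for an epoch.
    def addOffset(epoch: Long, offset: Long): Logs =
      copy(offsetLog = offsetLog + (epoch -> offset))

    // Mark an epoch as fully processed; its offset must have been recorded first.
    def commit(epoch: Long): Logs = {
      require(offsetLog.contains(epoch), s"epoch $epoch has no recorded offset")
      copy(commitLog = commitLog + epoch)
    }

    // Offset to resume from: the one recorded for the latest committed epoch, if any.
    def resumeFrom: Option[Long] =
      if (commitLog.isEmpty) None else offsetLog.get(commitLog.max)
  }

  val empty: Logs = Logs(Map.empty, Set.empty)
}
```

In this model, an epoch whose offset was logged but never committed does not move the resume point, so its data is read again after a restart.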