
Stateful stream processing made easy with Apache Flink

By Alberto Mancini & Francesca Tosi



  1. Stateful stream processing made easy with Apache Flink. A. Mancini, F. Tosi. ROME, APRIL 13/14 2018
  2. STATEFUL STREAM PROCESSING MADE EASY WITH APACHE FLINK. Alberto Mancini, alberto@k-teq.com; Francesca Tosi, francesca@k-teq.com
  3. WHO WE ARE: Alberto Mancini, Software Architect, K-TEQ Srls, alberto@k-teq.com, #Flink #Java #GWT #Web; Francesca Tosi, Software Architect, K-TEQ Srls, francesca@k-teq.com, #Java #Flink #Web #GWT
  4. Apache Flink
  5. TL;DR: Apache Flink is an open source stream processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming dataflow engine written in Java and Scala.
  6. TL;DR: Flink's pipelined runtime system enables the execution of both bulk/batch and stream processing programs. Main features: native streaming, low latency, high throughput, stateful, exactly-once guarantees, distributed, expressive APIs, …
  7. TL;DR: https://github.com/apache/flink: #contributors +377, #commits +13,565, #committers over 25, #linesOfCode +1,257,949. Flink conferences: at the beginning of this week, Flink Forward in San Francisco, the 5th international Flink conference.
  8. TL;DR: 2017, Year in Review: #contributors, #stars, #forks
  9. TL;DR: 2017, Year in Review: #linesOfCode
  10. TL;DR: https://github.com/apache/flink: #contributors +25, #commits +13,565, #committers over 25, #linesOfCode +1,257,949. Flink conferences: at the beginning of this week, Flink Forward in San Francisco, the 5th international Flink conference.
  11. TL;DR: #FlinkForward
  12. TL;DR: A bit of history. Flink was actually born as a sort of spin-off of the Stratosphere project: "Stratosphere is a research project whose goal is to develop the next generation Big Data Analytics platform. The project includes universities from the area of Berlin, namely TU Berlin, Humboldt University and the Hasso Plattner Institute."
  13. TL;DR (Flink use cases): uses Flink for real-time process monitoring and ETL. Telefónica NEXT's TÜV-certified Data Anonymization Platform is powered by Flink. uses a fork of Flink called Blink to optimize search rankings in real time.
  14. TL;DR (Flink use cases): Ericsson used Flink to build a real-time anomaly detector over large infrastructures. MediaMath uses Flink to power its real-time reporting infrastructure. uses Flink to surface near real-time intelligence from SaaS application activity. https://flink.apache.org/poweredby.html
  15. TL;DR (dataflow diagram: co-flatMap, keyBy, map)
  16. TL;DR (code snippet):
     StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
     // create a kafka source
     FlinkKafkaConsumer010<Event> kafkaConsumer =
         new FlinkKafkaConsumer010<Event>("topic", new MessageDeserializer(), properties);
     DataStreamSource<Event> kafkaStream = env.addSource(kafkaConsumer);
     ZookeeperSource zookeeperSource = new ZookeeperSource("Quorum", "/path");
     DataStreamSource<Map<String, String>> zookeeperStream = env.addSource(zookeeperSource);
     ConnectedStreams<Event, Map<String, String>> connectedStream = kafkaStream.connect(zookeeperStream);
     CoFlatMapFunction<Event, Map<String, String>, Event> coFlatMap = new DoSomethingFlatMap();
     SingleOutputStreamOperator<Event> flatmappedStream = connectedStream.flatMap(coFlatMap);
     KeyedStream<Event, String> keyedStream = flatmappedStream.keyBy(new MyKeySelector());
     SingleOutputStreamOperator<OutputEvent> transformed =
         keyedStream.map(new AnotherMapThatIsExecutedInKeyedContext());
     transformed.writeUsingOutputFormat(new MyHbaseOutputFormat());
     env.execute("Codemotion Sample");
  18. TL;DR (code snippet, fluent style):
     StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
     FlinkKafkaConsumer010<Event> kafkaConsumer =
         new FlinkKafkaConsumer010<>("topic", new MessageDeserializer(), properties);
     // start from the latest message
     kafkaConsumer.setStartFromLatest();
     ZookeeperSource zookeeperSource = new ZookeeperSource("Quorum", "/path");
     env.addSource(kafkaConsumer)
        .connect(env.addSource(zookeeperSource))
        .flatMap(new DoSomethingFlatMap())
        .keyBy(new MyKeySelector())
        .map(new AnotherMapThatIsExecutedInKeyedContext())
        .writeUsingOutputFormat(new MyHbaseOutputFormat());
     env.execute("Codemotion Sample");
  19. TL;DR (Plan Visualizer)
  20. TL;DR (Plan Visualizer)
  21. That's a lot of fun, but let's start from the very beginning …
  22. BigData is … well, big! IBM (2014): every day about 2.5 quintillion (2.5 × 10^18) bytes of data are created, and 90% of the world's data has been created in the last two years alone. In other words, exabytes (an exabyte is 10^18 ≈ 2^60 bytes) of data pile up every single day.
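The orders of magnitude on this slide are easy to sanity-check. A minimal sketch, assuming IBM's 2.5 × 10^18 bytes/day figure (the class and variable names are just for illustration):

```java
// Quick sanity check of the slide's data-volume figures.
public class DataVolumeCheck {

    // Daily data creation as bytes, per IBM's 2014 estimate.
    static final double BYTES_PER_DAY = 2.5e18;

    // 1 exabyte = 10^18 bytes; 1 zettabyte = 10^21 bytes.
    static final double EXABYTE = 1e18;
    static final double ZETTABYTE = 1e21;

    public static double exabytesPerDay() {
        return BYTES_PER_DAY / EXABYTE; // 2.5 EB every day
    }

    public static double zettabytesPerYear() {
        return BYTES_PER_DAY * 365 / ZETTABYTE; // just under 1 ZB a year
    }

    public static void main(String[] args) {
        System.out.println(exabytesPerDay());
        System.out.println(zettabytesPerYear());
    }
}
```

So at this rate a year of data lands in zettabyte territory, which is why "process it as it arrives" beats "store it all, then process".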
  23. BigData still grows. From: https://dvmobile.io/dvmobile-blog/feeling-overwhelmed-by-a-deluge-of-iot-data-iot-data-analytics-dashboards-can-help
  24. Processing model. From: https://data-artisans.com/what-is-stream-processing
  25. Computing model (Google docet). From: http://dme.rwth-aachen.de/en/research/projects/mapreduce
  26. … that eventually can evolve as: From: http://sqlknowledgebank.blogspot.it/2018/01/azure-data-lake-introductory.html
  27. BigData still grows. From: https://dvmobile.io/dvmobile-blog/feeling-overwhelmed-by-a-deluge-of-iot-data-iot-data-analytics-dashboards-can-help
  28. Not just big: big and often raw, often senseless.
     1,60,0,1,60,0,1,60,0,1,60,0,1,60,0, … (raw IoT sensor readings, roughly 40k lines of comma-separated values)
     https://archive.ics.uci.edu/ml/machine-learning-databases/00442/Danmini_Doorbell/
  29. Areas of application (cf. use-cases-streaming-analytics):
     - Government, smarter surveillance: analyze data from vehicles and cameras to alert law enforcement of potential issues
     - Healthcare, proactive treatment: continuously improve care based on personalized data streams
     - Finance, manage risk: continuously monitor trades and calculate derivative values in real time
     - Automotive, improved quality and functionality: detect problems sooner and predict breakdowns
     - Telco, processing call data: predictive spam and fraud detection
  31. Stream processing
  32. Process the data ASAP. From: http://7blog.7host.com/?page=ConfluentInc/kafka-and-stream-processing-taking-analytics-realtime-mike-spicer
  33. Long-running (distributed) application. From: https://www.flickr.com/photos/sheila_sund/15221366308
  34. Challenges:
     - state management
     - fault tolerance and recovery
     - performance and scalability
     - programming model
     - ecosystem
  35. State management: only a few problems can be solved without keeping some sort of application state; state is needed for any kind of aggregation or counting.
  36. State management: only a few problems can be solved without keeping some sort of application state.
     - On a single node (and neglecting threading), keeping state seems easy: just keep it in local memory. With threads in the picture, though, even on a single node, keeping state consistent requires careful synchronization.
     - In a multi-node, multi-threaded, long-running application it can easily end up in a mess.
  37. State management: only a few problems can be solved without keeping some sort of application state. Classical solution: keep state in an external database (KV stores are frequently the best fit). But that means:
     - yet another system to manage
     - yet another bottleneck to avoid
     - yet another synchronization point to care about
  38. State management: only a few problems can be solved without keeping some sort of application state. Flink keeps state for your application: it synchronizes, distributes, and even rescales it.
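The idea above can be made concrete with a small conceptual model. This is not Flink's API, just a sketch of what "keyed state" means: the stream is partitioned by key, and each key gets its own state slot that only events with that key ever touch (class and method names are invented for illustration):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual model of keyed state (not Flink's actual API): after a
// keyBy(...), each key owns a private state slot, so the framework can
// distribute keys across nodes without cross-key synchronization.
public class KeyedStateSketch {

    // One state slot per key, as a keyBy(...) would partition it.
    private final Map<String, Long> countPerKey = new HashMap<>();

    // The equivalent of a stateful map over a KeyedStream: each event
    // reads and updates only the state slot of its own key.
    public long onEvent(String key) {
        long updated = countPerKey.getOrDefault(key, 0L) + 1;
        countPerKey.put(key, updated);
        return updated;
    }

    public static void main(String[] args) {
        KeyedStateSketch op = new KeyedStateSketch();
        for (String k : List.of("user-a", "user-b", "user-a")) {
            System.out.println(k + " -> " + op.onEvent(k));
        }
    }
}
```

In real Flink the per-key slot lives behind the state backend, so the same program works unchanged whether state sits on the heap or in RocksDB.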
  39. State management: Flink state comes in two flavors, keyed state and operator state. http://7blog.7host.com/?page=dataArtisans/apache-flink-training-working-with-state
  40. State management: Flink ships with three state backends:
     - MemoryStateBackend holds data internally as objects on the Java heap; on checkpoint the state is collected in the JobManager (master)
     - FsStateBackend holds in-flight data in the TaskManager's memory and checkpoints it to a filesystem (HDFS or S3, for instance)
     - RocksDBStateBackend holds in-flight data in a RocksDB database per task; on checkpoint the whole RocksDB database is stored on disk
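The backend can also be selected cluster-wide rather than in code. A sketch of the relevant flink-conf.yaml entries, assuming the configuration key names used by Flink 1.x (the HDFS path is a placeholder):

```yaml
# Choose the state backend: jobmanager (MemoryStateBackend),
# filesystem (FsStateBackend) or rocksdb (RocksDBStateBackend).
state.backend: filesystem

# Where checkpoint data is written for the filesystem/rocksdb backends.
state.checkpoints.dir: hdfs://namenode:8020/flink/checkpoints
```

Setting this in configuration keeps the job code backend-agnostic, which is handy when the same pipeline runs against different clusters.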
  41. Long-running, distributed, stateful application. From: https://www.flickr.com/photos/sheila_sund/15221366308
  42. From: https://www.wikihow.com/Untangle-a-Newton%27s-Cradle
  43. Fault tolerance and recovery: checkpoints
  44. Fault tolerance and recovery: savepoints. "Savepoints are externally stored self-contained checkpoints that you can use to stop-and-resume or update your Flink programs. They use Flink's checkpointing mechanism to create a (non-incremental) snapshot of the state of your streaming program and write the checkpoint data and metadata out to an external file system." Caveat: you must consider serializer evolution for objects stored in state.
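The "self-contained snapshot" idea, and the serializer-evolution caveat, can be sketched in plain Java. This is not Flink's savepoint format, just a toy model (names invented) in which state is serialized to bytes, as a savepoint writes it to a file system, and later deserialized to resume:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

// Toy model of a savepoint: a self-contained, serialized snapshot of
// operator state that can be written out and later restored.
public class SavepointSketch {

    // "Stop": serialize the state, as a savepoint persists it externally.
    public static byte[] snapshot(HashMap<String, Long> state) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(state);
            out.flush();
            return bytes.toByteArray();
        } catch (Exception e) {
            throw new RuntimeException("snapshot failed", e);
        }
    }

    // "Resume": deserialize into fresh state. This is where the slide's
    // caveat bites: the serialized form must stay readable even as the
    // classes stored in state evolve between job versions.
    @SuppressWarnings("unchecked")
    public static HashMap<String, Long> restore(byte[] savepoint) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(savepoint))) {
            return (HashMap<String, Long>) in.readObject();
        } catch (Exception e) {
            throw new RuntimeException("restore failed", e);
        }
    }

    public static void main(String[] args) {
        HashMap<String, Long> state = new HashMap<>();
        state.put("sensor-1", 42L);
        byte[] savepoint = snapshot(state);                 // stop ...
        HashMap<String, Long> resumed = restore(savepoint); // ... and resume
        System.out.println(resumed.get("sensor-1"));
    }
}
```

Java's built-in serialization is used here only for brevity; Flink uses its own type serializers, which is exactly why their evolution has to be managed.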
  45. … and there's a lot more to tell:
     - event time semantics
     - flexible windowing
     - metrics and counters
     - queryable state
     - DataSet API
     - Complex Event Processing (CEP)
     - graphs: Gelly
     - machine learning
     - Table API
     - StreamSQL
     Operators:
     - Map: DataStream → DataStream
     - FlatMap: DataStream → DataStream
     - Filter: DataStream → DataStream
     - KeyBy: DataStream → KeyedStream
     - Reduce: KeyedStream → DataStream
     - Fold: KeyedStream → DataStream
     - Union: DataStream* → DataStream
     - Connect: DataStream, DataStream → ConnectedStreams
     - CoMap: ConnectedStreams → DataStream
     - CoFlatMap: ConnectedStreams → DataStream
     - Split: DataStream → SplitStream
     - Select: SplitStream → DataStream
     - Iterate: DataStream → IterativeStream → DataStream
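The KeyBy → Reduce pair from the operator list is the workhorse pattern. A minimal sketch of its semantics using plain java.util.stream instead of Flink's DataStream API (the class name is invented): partition elements by a key, then aggregate within each partition.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Conceptual model of keyBy -> reduce on a bounded input:
// a word count, the "hello world" of stream processing.
public class KeyByReduceSketch {

    public static Map<String, Long> wordCount(List<String> words) {
        return words.stream()
                // keyBy: group elements that share the same key
                .collect(Collectors.groupingBy(
                        word -> word,
                        // reduce: aggregate per key (here, a running count)
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("flink", "kafka", "flink")));
    }
}
```

The difference in Flink is that the input never ends, so the per-key aggregate lives in keyed state and is updated incrementally instead of being computed in one pass.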
  46. The big picture
  47. Q & A
  48. 48. Q. Isn’t easier to use just Kafka ? A. Well, No. Kafka, or exactly Kafka Streams API is a library that any standard Java application can embed and hence does not attempt to dictate a deployment method; whereas Flink is a cluster framework, which means that the framework takes care of deploying the application https://www.confluent.io/blog/apache-flink-apache-kafka-streams-comparison-guideline-users/ 48
  49. Q. Did you write "real time"? A. Well, yes … actually, I meant … Rigorously, "real-time programs must guarantee response within specified time constraints." Here, real-time means (as in the Cambridge Dictionary) "communicated, shown, presented, etc. at the same time as events actually happen."
  50. Q. What about Apache Storm? A. There is actually a compatibility layer that lets you:
     - run unmodified Storm topologies
     - embed Storm code (spouts and bolts) as operators inside Flink DataStream programs
  51. Q. Any kind of comparison chart? https://www.gmv.com/blog_gmv/future-streaming-technologies-apache-flink/
