Decentralization and ZooKeeper
Published: 2021-04-21 10:21:42
1. Yes, it's not too late. I suggest you register on the OKEx exchange; it is a first-tier domestic platform and its security is well guaranteed. If you are satisfied with my answer, please accept it.
2. We need a place to store metadata, and ZooKeeper is itself distributed and well suited to configuration management, so I used it.
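As a minimal sketch of that idea (using the kazoo client for Python; the server address and the znode path /myapp/config are invented for illustration), a service can publish a small configuration value in ZooKeeper and every node can read the same value back:

# Hedged sketch: keep shared configuration in ZooKeeper with the kazoo client.
# The server address and the znode path /myapp/config are illustrative only.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Publish (or update) a small configuration value under a well-known path.
zk.ensure_path("/myapp/config")
zk.set("/myapp/config", b"max_connections=100")

# Any node in the cluster can read the same value back.
data, stat = zk.get("/myapp/config")
print("config version %d: %s" % (stat.version, data.decode()))

# A DataWatch would let every node react whenever the value changes;
# in a real service the client stays alive for as long as the process runs.
zk.stop()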
3. Many Kafka users say they don't need to install ZooKeeper separately because they use the ZooKeeper bundled with Kafka.
As for why Kafka uses ZooKeeper, you first need to understand the role ZooKeeper plays in a decentralized cluster: consumers need to know which producers are available (from the consumer's point of view, the Kafka broker acts as the producer).
Without ZooKeeper, how would a consumer find out? If every consumer had to probe each producer before consuming, just to test whether the connection works, how efficient would that be?
Therefore Kafka needs ZooKeeper, and Kafka's design relies on it. (source: yunxiu.com)
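In practice clients use Kafka's own libraries rather than talking to ZooKeeper directly, but a minimal sketch of the discovery idea looks like this (kazoo client; the ZooKeeper address is illustrative, and the /brokers/ids layout applies to Kafka's ZooKeeper-based mode):

# Hedged sketch: how a client could discover live Kafka brokers from ZooKeeper.
# Kafka (in its ZooKeeper-based mode) registers each broker as an ephemeral
# znode under /brokers/ids; the ZooKeeper address below is illustrative only.
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

broker_ids = zk.get_children("/brokers/ids")   # ephemeral nodes: gone if a broker dies
for broker_id in broker_ids:
    data, _ = zk.get("/brokers/ids/%s" % broker_id)
    info = json.loads(data.decode())
    print(broker_id, info.get("host"), info.get("port"))

zk.stop()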
4. To build a big data system, we need to trace the flow of data from its sources all the way to the final, valuable output, and select and integrate the appropriate components from the existing Hadoop and big data ecosystem according to actual needs, so as to build a platform that can support a variety of query and analysis functions. This involves not only the choice of data storage, but also the consideration and trade-off of separating online and offline data processing. In addition, no commercial application that adopts a big data solution can afford potential security risks in the production environment.
1. Computing frameworks
The value of big data
Data only reflects its value when it can guide people to make valuable decisions, so big data technology is meaningful only when it serves practical purposes. Generally speaking, big data can guide valuable decisions in the following three areas:
report generation (e.g. tracking and analyzing users' historical click behavior, computing application activity and user stickiness);
diagnostic analysis (e.g. analyzing why user stickiness is declining, using logs to analyze why system performance has degraded, detecting the characteristics of spam and viruses);
decision making (e.g. personalized news reading or song recommendation, predicting which features will increase user stickiness, helping advertisers target their ads precisely, setting spam and virus interception policies).
Figure 1
Furthermore, big data technology achieves goals that traditional technology cannot, in the following three respects (as shown in Figure 1):
low-latency (interactive) queries on historical data, aimed at shortening the decision-making process, for example analyzing why a site is slow and trying to fix it;
low-latency queries on real-time data, aimed at helping users and applications make decisions on live data, for example detecting and blocking worms in real time (a worm can attack one million hosts within 1.3 seconds);
more sophisticated and advanced data processing algorithms that help users make "better" decisions, such as graph processing, outlier detection, trend analysis and other machine learning algorithms.
The cake model
From the perspective of turning data into value, YARN and Spark can be regarded as the milestone events of the Hadoop ecosystem's ten years of vigorous growth. The emergence of YARN separated cluster resource management from the data processing pipeline, which greatly spurred innovation in the frameworks at the application layer of big data (SQL-on-Hadoop frameworks, stream processing, graph processing, machine learning).
Users are no longer constrained by the MapReduce development model; they can build more diverse distributed applications and run them all on a unified architecture, eliminating the cost of maintaining dedicated resources for each framework. It is like a multi-layer cake: the two lower layers are HDFS and YARN, and MapReduce is just one candle on top of the cake, alongside candles of every other kind.
In this architecture, the overall data processing and analysis work is divided into three parts (Figure 2): interactive queries on HBase (Apache Phoenix, Cloudera Impala, etc.); MapReduce programs over historical data sets, or batch jobs written with Hive; and Apache Storm as the standard choice for real-time stream analysis.
Although the emergence of YARN has greatly enriched the application scenarios of the Hadoop ecosystem, two obvious challenges remain: first, three separate development stacks have to be maintained on one platform; second, it is difficult to share data across frameworks, for example to interactively query data produced by another framework. This also means that we need a more unified and abstract computing framework.
Figure 2
Unifying the world
The emergence of Spark integrates batch processing, interactive queries and real-time stream processing into one unified framework (Figure 3). At the same time, Spark integrates well with the existing open source ecosystem (Hadoop, HDFS, YARN, Hive, Flume). By introducing in-memory distributed datasets and optimizing for iterative workloads, it lets users manipulate data more easily and develop more sophisticated algorithms, such as machine learning and graph algorithms.
There are three main reasons why Spark has become the most popular big data open source project (with more than 800 contributors from more than 200 companies):
Spark can scale out to more than 8,000 nodes and process PB-scale data, and it provides good tools for developers to manage and deploy it;
Spark provides an interactive shell that lets developers experiment with different features in real time in Scala or Python;
Spark provides many built-in operators, which make it easier to write loosely coupled, concurrent code, so developers can focus on delivering business functionality instead of spending time optimizing parallel code (see the sketch below).
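As a minimal sketch of that last point, the PySpark example below counts requests per URL with a handful of built-in operators and no explicit thread or partition management; the file name access.log and the log layout are invented for illustration.

# Minimal PySpark sketch: count requests per URL from a plain-text log.
# Assumes a local Spark installation; file name and log format are made up.
from pyspark import SparkContext

sc = SparkContext("local[*]", "log-count")

counts = (sc.textFile("access.log")                 # one partition per file block
            .map(lambda line: line.split(" "))      # naive whitespace split
            .filter(lambda parts: len(parts) > 6)   # drop malformed lines
            .map(lambda parts: (parts[6], 1))       # (url, 1) pairs
            .reduceByKey(lambda a, b: a + b))       # parallel aggregation

for url, n in counts.take(10):
    print(url, n)

sc.stop()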
Of course, like MapReduce, Spark is not a panacea. For example, Apache Storm is still the mainstream choice for stream processing with strict real-time requirements, because Spark Streaming is really a micro-batch system (it cuts the stream into batches along time slices and submits a job for each batch) rather than an event-driven real-time system. So although its proponents argue that micro-batching adds little to system latency, compared with Apache Storm it cannot fully satisfy production scenarios with very strict low-latency requirements.
For example, in practice it is easy to reach millisecond-level average processing time per message, but once you measure service-level guarantees (ensuring that essentially every message is processed within milliseconds), the system's bottlenecks sometimes cannot be avoided.
At the same time, we have to notice that in many use cases stream data must be processed together with static datasets, for example training a classifier model on a static dataset and then applying that model to the real-time stream to classify incoming records.
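A hedged sketch of this combination, assuming a local PySpark installation: the socket source on localhost:9999 and the tiny "blacklist" model are invented, but the structure shows how each 2-second micro-batch becomes an RDD to which a statically built model is applied.

# Hedged sketch of Spark Streaming's micro-batch model combined with a static model.
# The socket source and the blacklist contents are illustrative only.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "microbatch-demo")
ssc = StreamingContext(sc, batchDuration=2)        # each micro-batch covers 2 seconds

# A "model" built offline from static data, broadcast to the executors.
blacklist = sc.broadcast({"spam.example.com", "worm.example.net"})

lines = ssc.socketTextStream("localhost", 9999)    # one RDD is produced per batch
flagged = lines.filter(lambda host: host in blacklist.value)
flagged.pprint()                                   # print suspicious hosts per batch

ssc.start()
ssc.awaitTermination()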
Spark's system design provides a common abstraction for all kinds of workloads (batch, streaming and interactive), and the ecosystem extends it with many rich libraries (the MLlib machine learning library, the SQL API, GraphX), so users can run flexible Spark operations on every batch of stream data, which is very convenient for development.
With the maturing of Spark, the Hadoop ecosystem has changed dramatically in just one year: Cloudera and Hortonworks have joined the Spark camp one after another, and among the Hadoop projects little besides YARN remains indispensable (Mesos has even replaced YARN in some deployments), since HDFS and Spark can run independently of each other. Most of the time, however, we still need an MPP solution such as Impala, which relies on a distributed file system and uses Hive to manage the mapping from files to tables, so the traditional Hadoop ecosystem still has strong vitality.
In addition, this article briefly compares the various SQL-on-Hadoop frameworks on interactive analysis tasks, because this is a question we often face in real projects. We focus mainly on Spark SQL, Impala and Hive on Tez, of which Spark SQL has the shortest history; its paper was published at SIGMOD 2015 and compares the performance of different types of data-warehouse queries on Shark, Spark SQL and Impala.
The result is that although Spark SQL adds the Catalyst optimizer on top of Shark and does a great deal of code-generation optimization, its overall performance is still not as good as Impala's, especially for join operations, where Impala can use predicate pushdown to filter tables earlier and improve performance.
However, Spark SQL's Catalyst optimizer keeps improving, and there should be further progress in the future. In Cloudera's benchmark evaluations Impala consistently beats the other SQL-on-Hadoop frameworks, but Hortonworks' evaluation points out that while Impala completes a single data-warehouse query quickly, the advantage of Hive on Tez shows once multiple queries run concurrently. In addition, Hive on Tez has stronger SQL expressiveness than Impala (mainly due to Impala's nested storage model), so different solutions should be chosen for different scenarios.
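As a rough illustration of predicate pushdown (the paths, table layout and column names are invented), the physical plan printed by explain() shows whether a filter has been pushed into the Parquet scan:

# Hedged Spark SQL sketch: inspect whether a filter is pushed down to the scan.
# Paths, tables and column names are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pushdown-demo").getOrCreate()

orders = spark.read.parquet("/data/orders")        # columnar source: pushdown-friendly
users = spark.read.parquet("/data/users")

result = (orders.filter(orders.amount > 100)        # selective predicate
                .join(users, orders.user_id == users.id)
                .select(users.country, orders.amount))

# The physical plan lists PushedFilters on the Parquet scan when pushdown applies.
result.explain(True)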
Figure 3
Will one framework keep the lead, or will each generation bring new champions?
Apache Flink (like Spark, it is about five years old; it began as a research project at the Technical University of Berlin, and its supporters praise it as the fourth-generation big data analysis and processing framework after MapReduce, YARN and Spark). In contrast to Spark, Flink is a genuine real-time stream processing system that treats batch processing as a special case of streaming. Like Spark, it also tries to build a unified platform for batch, streaming, interactive jobs, machine learning, graph algorithms and other applications.
Flink has some design ideas that differ markedly from Spark's. A typical example is memory management: Flink has insisted from the beginning on precisely controlling memory usage itself and operating directly on binary data, whereas Spark relied on Java's memory management to cache data until version 1.5, which leaves Spark vulnerable to OOM errors and the performance cost of JVM garbage collection.
From another point of view, storing RDDs as Java objects at runtime also greatly lowers the programming barrier for users. Meanwhile, with the introduction of the Tungsten project, Spark is gradually moving to its own memory management; concretely, development in the Spark ecosystem is gradually shifting from the RDD (a distributed collection of Java objects) to the DataFrame (a distributed collection of row objects) as the core abstraction.
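That API-level shift can be sketched as follows; the data and column names are invented, and a local PySpark installation is assumed.

# Hedged sketch of the RDD -> DataFrame shift described above.
# Column names and values are invented.
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()
sc = spark.sparkContext

# RDD style: a distributed collection of plain language objects.
rdd = sc.parallelize([("alice", 3), ("bob", 5), ("alice", 2)])
totals_rdd = rdd.reduceByKey(lambda a, b: a + b)

# DataFrame style: rows with a schema, stored in Tungsten's binary format
# and optimized by Catalyst rather than manipulated as raw objects.
df = spark.createDataFrame([Row(user="alice", clicks=3),
                            Row(user="bob", clicks=5),
                            Row(user="alice", clicks=2)])
totals_df = df.groupBy("user").sum("clicks")

print(totals_rdd.collect())
totals_df.show()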
Generally speaking, the two ecosystems are learning from each other. Flink's design genes are more advanced, but the Spark community is much more active and, so far, undoubtedly the more mature choice, with richer data source support (HBase, Cassandra, Parquet, JSON, ORC) and a more unified and concise computing representation. On the other hand, as a project initiated in continental Europe, Apache Flink now has many contributors from North America, Europe and Asia; whether this can change Europe's traditionally passive role in the open source world remains to be seen.
2. NoSQL databases
The mainstream NoSQL choices are still MongoDB, HBase and Cassandra. Among all the NoSQL options, MongoDB, written in C++, is probably the fastest to get running and the easiest for developers to deploy. MongoDB is a document-oriented database.
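As a minimal, hedged illustration of the document model (assuming a local MongoDB server and the pymongo driver; the database, collection and field names are invented):

# Hedged sketch of MongoDB's document model using the pymongo driver.
# Assumes a MongoDB server on localhost:27017; names and fields are invented.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["demo"]["click_events"]

# Documents are schemaless JSON-like records; nested fields are stored as-is.
events.insert_one({
    "user": "alice",
    "page": "/home",
    "meta": {"device": "mobile", "duration_ms": 420},
})

# Query by a nested field and project only what we need.
doc = events.find_one({"meta.device": "mobile"}, {"_id": 0, "user": 1, "page": 1})
print(doc)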
