Big data is a broad term for data sets that are so large and complex that they require specially designed hardware and software tools to process. Such data sets are usually measured in terabytes (TB) or exabytes (EB). They are collected from a variety of sources: sensors, climate information, and public information such as magazines, newspapers, and articles. Other sources of big data include purchase transaction records, web logs, medical records, military surveillance, video and image archives, and large-scale e-commerce.

Interest in big data and big data analytics comes mainly from their impact on business. Big data analytics is the process of examining large amounts of data to find patterns, correlations, and other useful information that help companies adapt to change and make more informed decisions.


First, Hadoop
Hadoop is a software framework for the distributed processing of large amounts of data, and it does this processing in a reliable, efficient, and scalable way. Hadoop is reliable because it assumes that compute elements and storage will fail, so it maintains multiple copies of the working data and can redistribute processing away from failed nodes. Hadoop is efficient because it works in parallel, which speeds up processing. Hadoop is also scalable and can handle petabytes of data. In addition, Hadoop runs on commodity servers, so its cost is low and anyone can use it.

Hadoop is a distributed computing platform that is easy to set up and use. Users can easily develop and run applications that process massive amounts of data on Hadoop. It has the following main advantages:

1. High reliability. Hadoop’s bit-by-bit storage and processing of data can be trusted.

2. High scalability. Hadoop distributes data and computational tasks across the available computer cluster, which can easily be scaled to thousands of nodes.

3. High efficiency. Hadoop can move data dynamically between nodes and keep the load on each node balanced, so processing is very fast.

4. High fault tolerance. Hadoop automatically saves multiple copies of the data and automatically redistributes failed tasks.
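The replication idea behind the reliability and fault tolerance points above can be sketched in a few lines of Python. This is only a toy model, not Hadoop's API: the function names, the node and block names, and the round-robin placement are all assumptions made for illustration (a real cluster uses its own placement policy), and the replication factor of 3 is simply a commonly used setting.

```python
def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (toy round-robin policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for i, block in enumerate(blocks):
        # Consecutive nodes starting at offset i, so the copies land on distinct nodes.
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement


def surviving_replica(placement, failed):
    """For each block, pick the first replica on a node that has not failed."""
    return {
        block: next((node for node in replicas if node not in failed), None)
        for block, replicas in placement.items()
    }


# Hypothetical four-node cluster holding five data blocks.
nodes = ["node1", "node2", "node3", "node4"]
blocks = ["blk_0", "blk_1", "blk_2", "blk_3", "blk_4"]

placement = place_replicas(blocks, nodes, replication=3)
after_one_failure = surviving_replica(placement, failed={"node2"})
after_two_failures = surviving_replica(placement, failed={"node1", "node2"})
```

With three copies spread over four nodes, every block stays readable even when two nodes fail, which is why work assigned to a failed node can be handed to another node that holds a copy of the same data.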

The Hadoop framework itself is written in Java, so it is well suited to running on Linux production platforms. Applications on Hadoop can also be written in other languages, such as C++.

Second, HPCC
HPCC stands for High Performance Computing and Communications. In 1993, the US Federal Coordinating Council for Science, Engineering, and Technology submitted to Congress a report titled “Grand Challenges: High Performance Computing and Communications,” known as the HPCC program report, which set out the US President’s science strategy project. Its aim is to solve a number of important scientific and technological challenges by strengthening research and development. HPCC is the US plan for implementing the information superhighway, and carrying it out will cost tens of billions of dollars. Its main goals are to develop scalable computing systems and related software that support terabit-level network transmission performance, to develop gigabit network technology, and to expand research and educational institutions and their network connectivity.

The project consists of five main components:

1. High Performance Computing Systems (HPCS), including research on future generations of computer systems, system design tools, advanced prototype systems, and evaluation of early systems;

2. Advanced Software Technology and Algorithms (ASTA), including software support for the grand challenges, new algorithm design, software components and tools, computational techniques, and high-performance computing research centers;

3. National Research and Education Network (NREN), including research and development on relay stations and gigabit-level transmission;

4. Basic Research and Human Resources (BRHR), including basic research, training, education, and course materials. It is designed to support investigator-initiated, long-term research that increases the flow of innovation in scalable high-performance computing; to enlarge the pool of skilled and trained personnel by improving education and training in high-performance computing and communications; and to provide the infrastructure needed to support these research activities;

5. Information Infrastructure Technology and Applications (IITA), which aims to ensure the United States’ leading position in advanced information technology development.

Third, Storm
Storm is free, open source software: a distributed, fault-tolerant real-time computing system. Storm can process huge data streams very reliably, doing for real-time processing what Hadoop does for batch processing. Storm is simple, supports many programming languages, and is fun to use. Storm was open sourced by Twitter. Other well-known companies that use it include Groupon, Taobao, Alipay, Alibaba, Music, Admaster and more.

Storm has many application areas: real-time analytics, online machine learning, continuous computation, distributed RPC (remote procedure call, a protocol for requesting services from a program on a remote computer over a network), ETL (short for Extraction-Transformation-Loading, that is, extracting, transforming, and loading data), and so on. Storm’s processing speed is impressive: in testing, each node processed 1 million data tuples per second. Storm is scalable, fault tolerant, and easy to set up and operate.
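The tuple-at-a-time processing described above can be illustrated with a small Python sketch. It borrows Storm's terms "spout" (a source of tuples) and "bolt" (a processing step), but everything else is an assumption for illustration: the function names are hypothetical, the pipeline runs in a single process, and none of this is Storm's actual API.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: emit one tuple per incoming sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: split each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keep a running count and emit (word, count) after every tuple."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_topology(sentences):
    """Wire spout -> split -> count and collect the emitted updates."""
    return list(count_bolt(split_bolt(sentence_spout(sentences))))


updates = run_topology(["storm processes streams", "storm is fast"])
```

Each update is emitted as soon as its input tuple arrives, rather than after the whole input has been read. That is the difference between this style of continuous computation and a batch job.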

Fourth, Apache Drill

To help business users find more effective ways to speed up queries on Hadoop data, the Apache Software Foundation recently launched an open source project called “Drill.” Apache Drill is an open source implementation of Google’s Dremel.

According to Tomer Shiran, product manager at Hadoop vendor MapR Technologies, “Drill” is being run as an Apache Incubator project and will continue to be promoted to software engineers around the world.

The project will create an open source version of Google’s Dremel tool, which Google uses to speed up the data analysis behind its Internet applications. “Drill” will in turn help Hadoop users query massive data sets faster.

The “Drill” project is inspired by Google’s Dremel project, which helps Google analyze massive data sets. Its uses include analyzing crawled web documents, tracking installation data for applications on the Android Market, analyzing spam, and analyzing test results from Google’s distributed build system.

With the “Drill” Apache open source project, organizations will be able to build on Drill’s APIs and its flexible, powerful architecture to support a wide range of data sources, data formats, and query languages.
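The goal of supporting many data sources and formats behind one query layer can be sketched as follows. This is a toy illustration in Python, not Drill's API or architecture: the reader functions, the `READERS` registry, the `query` function, and the sample data are all hypothetical.

```python
import csv
import io
import json


def read_csv(text):
    """Parse CSV text into a list of dict records (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json_lines(text):
    """Parse one JSON object per line into a list of dict records."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Registry of format readers; supporting a new format means adding one entry.
READERS = {"csv": read_csv, "jsonl": read_json_lines}


def query(sources, where, select):
    """Run the same filter and projection over every source, whatever its format."""
    for fmt, text in sources:
        for record in READERS[fmt](text):
            if where(record):
                yield {field: record.get(field) for field in select}


# Hypothetical sample data in two different formats.
sources = [
    ("csv", "user,clicks\nann,12\nbob,3\n"),
    ("jsonl", '{"user": "cy", "clicks": 20}\n'),
]

# int() normalizes the CSV strings and the JSON numbers to one type.
results = list(query(sources, where=lambda r: int(r["clicks"]) > 10, select=["user"]))
```

The caller writes one filter and one projection, and the format-specific parsing stays behind the registry of readers.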

Fifth, RapidMiner

RapidMiner is a world-leading data mining solution built on advanced technology. It covers a wide range of data mining tasks and techniques, and it simplifies the design and evaluation of data mining processes.

Functions and features

Free data mining technology and libraries

Written 100% in Java (so it runs on the common operating systems)

Data mining processes are simple, powerful and intuitive

Internal XML ensures a standardized format for representing and exchanging data mining processes

Large-scale processes can be automated with a simple scripting language

Multi-level data views ensure efficient and transparent data handling

Graphical user interface for interactive prototyping

Command line (batch mode) for automated large-scale applications

Java API (application programming interface)

Simple plugin and extension mechanism

Powerful visualization engine with many state-of-the-art visualizations of high-dimensional data

More than 400 data mining operators

RapidMiner, formerly known as YALE (Yet Another Learning Environment), has been successfully applied in many different areas, including text mining, multimedia mining, feature engineering, data stream mining, development of ensemble methods, and distributed data mining.

Sixth, Pentaho BI

Unlike traditional BI products, the Pentaho BI platform is a process-centric, solution-oriented framework. Its purpose is to integrate a range of enterprise-level BI products, open source software, APIs and other components to make business intelligence applications easier to develop. Its emergence has allowed a series of independent business intelligence products, such as Jfree and Quartz, to be integrated into complex, complete business intelligence solutions.

The Pentaho BI platform, the core architecture and foundation of the Pentaho Open BI suite, is process-centric because its central controller is a workflow engine. The workflow engine uses process definitions to define the business intelligence processes that execute on the BI platform. Processes can easily be customized, and new processes can be added. The BI platform includes components and reports for analyzing the performance of these processes. Currently, Pentaho’s main components cover report generation, analysis, data mining, and workflow management. These components are integrated into the Pentaho platform through technologies such as J2EE, web services, SOAP, HTTP, Java, JavaScript, and portals. Pentaho is released mainly in the form of the Pentaho SDK.
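The process-centric design described above, where a workflow engine executes declarative process definitions, can be sketched in Python. This is only a minimal illustration under stated assumptions: the definition format, the `run_process` function, and the three actions are hypothetical and do not reflect Pentaho's actual process definition format or APIs.

```python
def load_rows(context, rows):
    """Action: put raw (region, amount) rows into the shared context."""
    return {**context, "rows": rows}


def total_by_region(context):
    """Action: aggregate amounts per region."""
    totals = {}
    for region, amount in context["rows"]:
        totals[region] = totals.get(region, 0) + amount
    return {**context, "totals": totals}


def render_report(context, title):
    """Action: turn the totals into a plain-text report."""
    lines = [title] + [f"{region}: {total}" for region, total in sorted(context["totals"].items())]
    return {**context, "report": "\n".join(lines)}


def run_process(definition, actions):
    """Toy workflow engine: run each step named in the definition, in order."""
    context, executed = {}, []
    for step in definition["steps"]:
        action = actions[step["action"]]
        context = action(context, **step.get("params", {}))
        executed.append(step["action"])
    return context, executed


# The process is data, so it can be customized without changing the engine.
definition = {
    "name": "sales_report",
    "steps": [
        {"action": "load_rows", "params": {"rows": [("north", 10), ("south", 5), ("north", 7)]}},
        {"action": "total_by_region"},
        {"action": "render_report", "params": {"title": "Sales by region"}},
    ],
}
actions = {"load_rows": load_rows, "total_by_region": total_by_region, "render_report": render_report}

context, executed = run_process(definition, actions)
```

Because the engine only interprets the definition, adding a new process means writing a new definition (and, if needed, registering a new action) rather than modifying the controller itself.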

The Pentaho SDK consists of five parts: the Pentaho platform, the Pentaho sample database, the stand-alone Pentaho platform, the Pentaho solution example, and a pre-configured Pentaho web server. The Pentaho platform is the core part and contains the main body of the Pentaho platform source code. The Pentaho sample database provides the data services the platform needs for normal operation, including configuration information and solution-related information; it is not strictly required by the platform and can be replaced by other database services through configuration. The stand-alone Pentaho platform is an example of the platform’s stand-alone mode, demonstrating how to run the Pentaho platform independently, without application server support. The Pentaho solution example is an Eclipse project that demonstrates how to develop a business intelligence solution for the Pentaho platform.

The Pentaho BI platform is built on top of servers, engines and components. These provide the system’s J2EE server, security, portal, workflow, rules engine, charting, collaboration, content management, data integration, analysis and modeling capabilities. Most of these components are standards based and can be replaced with other products.