Manage Big Data: How to do it with large volumes of data?

BIG DATA

Pedro Bonillo

ML and Big Data Consultant

2 years ago

Publication date

One of the big mistakes we make when dealing with a subject is to see only the part that concerns us and not try to acquire a broader vision of it. Systemic thinking calls this broader vision a forest view, as opposed to a tree view. It is especially important when you manage big data.

Eighteen years ago, the first article published by Doug Cutting defined most of the ideas behind managing big data: distributed file systems, map and reduce algorithms, and search engines. Since then, companies and universities have approached the subject from isolated perspectives: databases, data science, data analytics, and data governance.

The truth is that today we can talk about Big Data as a new paradigm that covers all of these topics under a common set of requirements: volume, variability, speed and accuracy (often called volume, variety, velocity and veracity). When we talk about Big Data management, it is necessary to consider four aspects equally: (1) Databases; (2) Big Data Analytics or BI; (3) Data Science; and (4) Governance (see Figure 1).

Figure 1: Data Management software

Optimization of database management

The first aspect concerns databases. The relational model presented by Edgar Frank Codd stores data as a matrix of rows and columns, where each new attribute requires a new column even though many rows have no value for it. According to research by the Gartner Group, this model has only allowed organizations to manage 30% of their data.

This is why a new concept for storing and managing data has emerged, called key-value, in which different rows can have different attributes. There are three main types of key-value databases in Big Data, and for each there is at least one free software product that I have successfully tested, installed and configured for production clients over the last 5 years:

1.- Transactional with high availability = Apache Cassandra
2.- Document management = MongoDB
3.- Graph and georeferenced relationships = Neo4j
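
The difference between a fixed matrix of columns and rows that carry only their own attributes can be illustrated with a minimal Python sketch. The customer records, column names and helper functions below are invented for illustration, and plain dictionaries stand in for a real store such as Cassandra or MongoDB, whose actual storage layouts are more elaborate.

```python
# Hypothetical customer records: all names and attributes are invented for illustration.
COLUMNS = ["id", "name", "email", "loyalty_tier"]

# Relational style: every row carries every column, so missing values become None.
relational_rows = [
    (1, "Ana", "ana@example.com", None),
    (2, "Luis", None, None),
    (3, "Marta", None, "gold"),
]

# Key-value style: each row stores only the attributes it actually has.
key_value_rows = {
    1: {"name": "Ana", "email": "ana@example.com"},
    2: {"name": "Luis"},
    3: {"name": "Marta", "loyalty_tier": "gold"},
}


def count_empty_cells(rows):
    """Count the None placeholders that a fixed schema forces us to store."""
    return sum(value is None for row in rows for value in row)


def count_stored_values(rows_by_key):
    """Count the attribute values actually stored in the key-value layout."""
    return sum(len(attributes) for attributes in rows_by_key.values())


empty_cells = count_empty_cells(relational_rows)
stored_values = count_stored_values(key_value_rows)
print("empty cells in the fixed schema:", empty_cells)
print("values stored in the key-value layout:", stored_values)
```

Adding a new attribute to one key-value row does not touch the others, whereas the fixed schema would need a new column that is empty for most rows.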

Big Data Analytics

The term Business Intelligence has evolved, in the context of Big Data, into Big Data Analytics, where you do not necessarily work with a database but rather with a distributed file system.

This file system is still populated through data extraction, transformation, cleaning and loading tools, but these tools now write in file formats designed for distributed processing (Avro, Parquet, ORC). The forerunner of these distributed file systems was the Google File System (GFS), whose design inspired what is now known as the Hadoop Distributed File System (HDFS), the storage layer of Apache Hadoop.

Apache Hadoop

This new type of file system uses distributed computing to process large amounts of data (more than a million records, with high variability and with queries answered in less than 5 seconds) across several servers.

The first concept that Hadoop introduces is the creation of an active high-availability cluster, with a file name server (NameNode) and several servers where the files reside (DataNodes). Hadoop splits a file into chunks (blocks) and replicates each of them across as many DataNodes as configured. This also allows the architecture to scale horizontally according to the needs of the client.
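
The split-and-replicate idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names, the node names and the round-robin placement are hypothetical, and real HDFS placement also takes racks and node load into account.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut the file content into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, data_nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes using simple round-robin.

    Assumption: round-robin is a stand-in for the real placement policy.
    """
    if replication > len(data_nodes):
        raise ValueError("replication factor cannot exceed the number of DataNodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            data_nodes[(block_id + offset) % len(data_nodes)]
            for offset in range(replication)
        ]
    return placement


# A tiny 10-byte "file" and a 4-byte block size keep the example readable.
file_content = b"x" * 10
blocks = split_into_blocks(file_content, block_size=4)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)

for block_id, nodes in placement.items():
    print("block", block_id, "size", len(blocks[block_id]), "stored on", nodes)
```

Adding a fifth node to the list spreads the same blocks over more servers without changing the code, which is the horizontal scalability described above.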

Once the data is stored in HDFS, the next component is the MapReduce algorithm. The map step turns each record into key-value pairs and the reduce step aggregates the values for each key, so that what gets written is the reduction and not the data as such (in that sense only, it resembles compressing with zip or rar). Queries can then be made on the reduced data. In Hadoop, MapReduce jobs are scheduled on the cluster by YARN (Yet Another Resource Negotiator), the resource manager that ships with Hadoop.
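
A word count is the usual way to illustrate the map and reduce steps. The sketch below runs in a single Python process and only mirrors the shape of the algorithm; the function names and the sample lines are invented, and nothing here uses the Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


lines = ["big data big clusters", "data nodes store data"]

mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(word_counts)
```

In a cluster, each map call would run on the node holding that block of data, and only the much smaller reduced results would travel across the network.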

Apache Spark

Spark is an optimized engine for this kind of processing on distributed computing clusters, and it can run under cluster managers such as Apache Mesos or YARN. Currently, most organizations use Spark to schedule programs and divide summarization, grouping and selection tasks over the data stored in HDFS. However, to use Spark it is necessary to write a program in Java, Python, Scala, etc.

This program commonly loads the data from HDFS into a resilient distributed dataset (RDD), which is nothing more than a collection of elements that are partitioned across the cluster nodes and can be processed in parallel, or into a data frame (rows and columns). It then performs SQL queries on these structures and finally writes the result to the same HDFS, to a relational database or to a CSV file, for example to be consumed by a data visualization and reporting tool such as Tableau, Hadoop User Experience (HUE) or Looker.
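
The idea of a collection that is partitioned and processed in parallel can be sketched with the Python standard library alone. This is not the Spark API: worker threads stand in for cluster nodes, and the sales records and function names are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_partitions):
    """Split the records into num_partitions slices (round-robin)."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def partial_sum_by_key(part):
    """Work done independently on one partition: sum amounts per region."""
    totals = {}
    for region, amount in part:
        totals[region] = totals.get(region, 0) + amount
    return totals


def merge(partials):
    """Combine the per-partition results into one final result."""
    merged = {}
    for totals in partials:
        for region, amount in totals.items():
            merged[region] = merged.get(region, 0) + amount
    return merged


# Hypothetical (region, amount) records.
sales = [("north", 10), ("south", 5), ("north", 7), ("east", 3), ("south", 8)]

partitions = partition(sales, 2)
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(partial_sum_by_key, partitions))

totals_by_region = merge(partials)
print(totals_by_region)
```

In a real Spark program the equivalent grouping would typically be expressed as a single SQL query or data frame operation, and Spark would decide how to partition and schedule the work.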

Hive

Since querying the data in HDFS would otherwise require writing a program, the Hadoop community needed another mechanism that lets database administrators and functional analysts keep writing SQL queries. This mechanism is Hive.

Hive translates SQL statements into MapReduce jobs that are scheduled with YARN or Spark over the data in HDFS, so that queries can be made directly on the data in Hadoop.

Hive has a metastore, which can reside in an embedded relational database or in an object-relational database such as PostgreSQL. The metastore stores the link between the metadata with which the tables are created and the data in Hadoop that populates these tables through partitions; the data itself still resides in Hadoop.
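
A minimal Python sketch can make the role of the metastore more concrete: it holds only metadata that points at where the data lives. The table name, columns, dates and the /warehouse prefix below are hypothetical, and a dictionary stands in for the relational database that would really hold this information.

```python
# Hypothetical metastore content: a table definition plus the location of each partition.
metastore = {
    "sales": {
        "columns": ["region", "amount"],
        "partition_column": "sale_date",
        "partitions": {
            "2023-01-01": "/warehouse/sales/sale_date=2023-01-01",
            "2023-01-02": "/warehouse/sales/sale_date=2023-01-02",
        },
    }
}


def add_partition(table, value):
    """Register a new partition: only metadata changes, no data is moved."""
    entry = metastore[table]
    path = "/warehouse/{}/{}={}".format(table, entry["partition_column"], value)
    entry["partitions"][value] = path
    return path


def paths_to_scan(table, partition_values):
    """Return only the storage paths a query restricted to these partitions must read."""
    partitions = metastore[table]["partitions"]
    return [partitions[value] for value in partition_values if value in partitions]


new_path = add_partition("sales", "2023-01-03")
scan = paths_to_scan("sales", ["2023-01-02", "2023-01-03"])
print(scan)
```

A query filtered on the partition column only needs to read the listed directories, which is why partitioning matters so much when the data is not indexed.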

When it is necessary to automate a series of queries (in the style of a stored procedure), Apache Pig can also be used; it provides a procedural language for automating queries in Hadoop. It is important to mention that the data in Hadoop is not indexed, so the Lucene API is also used, through Solr or Elasticsearch, to place the indexed data in information cubes that can later be queried through web services by reporting applications or by other applications in the organization.
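
The kind of index that Lucene-based tools maintain is, at its core, an inverted index: a map from each term to the documents that contain it. The following Python sketch shows the idea with invented documents and function names; it does not use the Lucene, Solr or Elasticsearch APIs and ignores scoring, stemming and tokenization details.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map every term to the set of document ids in which it appears."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of the documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents keyed by id.
documents = {
    1: "hadoop stores data in blocks",
    2: "spark processes data in memory",
    3: "hive queries hadoop data",
}

index = build_inverted_index(documents)
matches = search(index, "hadoop data")
print(sorted(matches))
```

Because the lookup goes from term to documents, a query does not need to scan every file, which is what the unindexed data in Hadoop would otherwise require.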

Finally, Big Data Analytics requires multidimensional report presentation and analysis tools, such as Superset, Tableau, Looker and Hue (Hadoop User Experience).

Data Science

Data Science is the identification of patterns in large volumes of data through statistical models, either to establish the probability of an event occurring (regression model) or to establish a characterization (classification model).

To perform Data Science, the RStudio environment is commonly used with the R language, an evolution of the S language that was used in the 90s in most computer-assisted statistical studies.

RStudio allows structural analysis and transformations to be performed on a data set:

Perform quantitative analysis using descriptive statistics
Identify the variables to predict and the type of model (regression or classification)
Separate the data set into train (data for model training) and test (data for testing the model)
Perform analysis using decision trees and regression to indicate the importance of attributes
Perform attribute (feature) engineering
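
Two of the steps above, descriptive statistics and the train/test separation, can be sketched with the Python standard library. The article's own tool is R, so this is only an illustration in a different language; the sample values, the 20% test ratio and the fixed seed are assumptions chosen for the example.

```python
import random
import statistics


def describe(values):
    """Basic descriptive statistics for one numeric attribute."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows and cut it into train and test subsets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    test_size = int(len(shuffled) * test_ratio)
    return shuffled[test_size:], shuffled[:test_size]


# Hypothetical attribute values (for example, customer ages).
ages = [23, 35, 41, 29, 52, 38, 47, 31, 26, 44]

summary = describe(ages)
train, test = train_test_split(ages, test_ratio=0.2)

print(summary)
print("train size:", len(train), "test size:", len(test))
```

Fixing the seed makes the split reproducible, which helps when comparing models trained on the same data.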

When a model must be applied to a large amount of data, it is necessary to store the data in Apache Hadoop and write a program that runs the model with Spark, which offers much of the functionality of RStudio through SparkR and machine learning through the MLlib library.

Governance

In Big Data, Governance is achieved through the modeling and implementation of eight processes: (1) Data operations management; (2) Master and reference data management; (3) Documentation and content management; (4) Big Data security management; (5) Big Data development; (6) Big Data quality management; (7) Big Data metadata management; and (8) Data architecture, Data Warehousing and BI management.

Figure 2: Big Data Quality Governance

In another publication I will address each of these processes and their automation.

In conclusion, when dealing with the topic of Big Data, it is necessary to address all of its aspects systemically and not neglect any of them.
