Managing Big Data: how do you work with large volumes of data?

BIG DATA

Pedro Bonillo
ML and Big Data Consultant


One of the big mistakes we make when approaching a subject is to see only the part that concerns us, instead of trying to acquire a broader view of it; this is what systems thinking calls seeing the forest rather than just a tree. It is especially important when you manage big data.

Eighteen years ago, the first article published by Doug Cutting defined most of the ideas behind how to manage big data: distributed file systems, mapping and reduction algorithms, and search engines. Companies and universities then began to approach the subject from isolated perspectives: databases, data science, data analytics, data governance.

The truth is that today we can talk about Big Data as a new paradigm that covers all of these topics under a set of common requirements: volume, variety, velocity, and veracity. When we talk about Big Data management, it is necessary to consider four aspects equally: (1) Database; (2) Big Data Analytics or BI; (3) Data Science; and (4) Governance (see Figure 1).

 

Figure 1: Data Management software

 

Optimization of database management

The first aspect addresses databases. The model presented by Edgar Frank Codd stores data as a matrix of rows and columns, where each new attribute creates a new column even though many rows have no value for that attribute. According to research by the Gartner Group, this model has allowed organizations to manage only 30% of their data.

This is why a new concept for storing and managing data has emerged, called key-value. In this way, different rows can have different attributes. There are mainly three types of key-value databases in Big Data, and for each there is at least one free software product that, over the last five years, I have successfully tested, installed, and configured for production clients.
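To make the contrast concrete, here is a minimal sketch in plain Python (the records are hypothetical): a relational table must reserve a column slot in every row for every attribute, while a key-value store keeps only the attributes each record actually has.

```python
# Relational style: every row carries every column, NULL-filled when absent.
relational_rows = [
    {"id": 1, "name": "Ana", "email": None, "phone": None},
    {"id": 2, "name": "Luis", "email": "luis@example.com", "phone": None},
]

# Key-value style: each key maps to just the attributes that exist, so a new
# attribute never forces a schema change on existing records.
key_value_store = {
    1: {"name": "Ana"},
    2: {"name": "Luis", "email": "luis@example.com"},
}

# Reading a possibly-absent attribute is explicit rather than NULL-filled.
print(key_value_store[1].get("email", "<no email>"))
```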

 

Big Data Analytics

The term Business Intelligence has evolved, in the context of Big Data, into Big Data Analytics, where you do not necessarily work with the database concept but rather with a distributed file system.

This file system continues to be populated through data extraction, transformation, cleaning, and loading tools. But now these tools write in the new key-value formats (Avro, Parquet, ORC). The forerunner of these distributed file systems was the Google File System (GFS), which evolved into what is now known as the Hadoop Distributed File System (HDFS), or simply Apache Hadoop.
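As an illustration, here is a minimal sketch of the final load step, assuming pandas and pyarrow are installed; the file name and columns are hypothetical, and the same call can target an HDFS path when a compatible filesystem driver is available.

```python
import pandas as pd

# Extract: a small, hypothetical batch of raw records.
raw = pd.DataFrame({"customer_id": [1, 2, 3], "amount": ["10.5", "3.2", None]})

# Transform/clean: drop incomplete rows and cast types.
clean = raw.dropna().astype({"customer_id": "int64", "amount": "float64"})

# Load: write one of the columnar formats mentioned above.
clean.to_parquet("sales.parquet", engine="pyarrow")
```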

Apache Hadoop

This new type of file system uses the concept of distributed computing in order to process large amounts of data (more than a million records, with high variability, and with queries answered in less than 5 seconds) across several servers.

The first concept that Hadoop introduces is the creation of an active high-availability cluster, with a file name server (the Name Node) and several servers where the files reside (the Data Nodes). Hadoop then splits a file into chunks (blocks) and replicates them across as many Data Nodes as configured. This also allows horizontal scaling of the architecture according to the client's needs.

Once the data is stored in HDFS, the next component is the MapReduce algorithm. This algorithm maps the data and reduces it, writing the reduction rather than the data itself (much like compressing with the zip or rar algorithm), in such a way that queries can be run on the reduced (compressed) data. The MapReduce engine that ships with Hadoop is scheduled by YARN (Yet Another Resource Negotiator).
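To show the two phases on a single machine, here is a minimal word-count sketch in plain Python (the documents are hypothetical); real Hadoop runs the same map and reduce steps distributed across the cluster.

```python
from collections import Counter

documents = ["big data big ideas", "data data everywhere"]

# Map phase: each document is independently turned into (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Reduce phase: pairs sharing a key are combined, so only the small reduced
# summary (the counts) needs to be written out, not the raw pairs.
counts = Counter()
for word, one in mapped:
    counts[word] += one

print(counts)  # Counter({'data': 3, 'big': 2, 'ideas': 1, 'everywhere': 1})
```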

Apache Spark

An optimized version of the algorithm for handling distributed computing clusters, through Apache Mesos, is Spark. Currently, most organizations use Spark to schedule programs and divide tasks of summarization, grouping, and selection over the data stored in HDFS. Apache Spark allows you to manage MapReduce tasks on top of HDFS. However, to use Spark it is necessary to write a program in Java, Python, Scala, etc.

This program commonly loads the data in HDFS into a resilient distributed dataset (RDD), which is nothing more than a collection of elements partitioned across the cluster nodes that can be processed in parallel, or into a data frame (rows and columns). It then performs SQL queries on these structures and finally writes the result back to HDFS, to another relational database, or to a CSV file, for example to be consumed by data visualization and reporting tools such as Tableau, Hadoop User Experience (HUE), or Looker.
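A minimal PySpark sketch of that flow, assuming a running cluster; the HDFS paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sales-summary").getOrCreate()

# Load data from HDFS into a DataFrame (the underlying RDD is df.rdd).
df = spark.read.parquet("hdfs://namenode:8020/data/sales.parquet")

# Run an SQL query over the distributed structure.
df.createOrReplaceTempView("sales")
summary = spark.sql(
    "SELECT customer_id, SUM(amount) AS total FROM sales GROUP BY customer_id"
)

# Write the (much smaller) result back to HDFS as CSV for a reporting tool.
summary.write.mode("overwrite").csv("hdfs://namenode:8020/out/sales_summary")

spark.stop()
```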

Hive

Considering that running a query on the data in HDFS would require writing a program, the Hadoop community needed another mechanism so that database administrators and functional analysts could keep writing SQL queries. This mechanism is Hive.

Hive is a translator: an interpreter of SQL statements into MapReduce jobs that are scheduled with YARN or Spark over the data in HDFS, in such a way that queries can be run directly on the data in Hadoop.

Hive has a metastore, which can reside in an in-memory relational database or in an object-relational database such as PostgreSQL. The metastore stores the link between the metadata with which tables are created and the data in Hadoop that populates those tables through partitions; the data itself still resides in Hadoop.
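A minimal sketch, assuming a Spark build with Hive support and a reachable metastore; the table name and columns are hypothetical.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hive-demo")
    .enableHiveSupport()  # connect to the Hive metastore
    .getOrCreate()
)

# The metastore records the schema and partition layout; the files stay in HDFS.
spark.sql(
    """
    CREATE TABLE IF NOT EXISTS sales (customer_id INT, amount DOUBLE)
    PARTITIONED BY (sale_date STRING)
    STORED AS PARQUET
    """
)

# A plain SQL query is translated into distributed jobs over the HDFS files.
spark.sql("SELECT sale_date, SUM(amount) FROM sales GROUP BY sale_date").show()
```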

In the event that a series of queries needs to be automated (in the style of a stored procedure), Apache Pig can also be used, a tool with a procedural language for automating queries in Hadoop. It is important to mention that the data in Hadoop is not indexed, so the Lucene API is also used, through Solr or Elasticsearch, to place the indexed data in information cubes that can later be queried through web services, by reporting applications, or by other applications in the organization.
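As an illustration of that indexing step, here is a minimal sketch with the official Elasticsearch Python client (8.x-style calls), assuming a local cluster; the index name and document fields are hypothetical.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a summarized record so it becomes searchable through Lucene indexes.
es.index(
    index="sales-summary",
    id="customer-1",
    document={"customer_id": 1, "total": 13.7},
)

# Other applications can then query over HTTP instead of scanning HDFS files.
hits = es.search(index="sales-summary", query={"match": {"customer_id": 1}})
print(hits["hits"]["hits"])
```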

Finally, Big Data Analytics requires multidimensional report presentation and analysis tools, such as Superset, Tableau, Looker, and HUE (Hadoop User Experience).

 

Data Science

Data Science is the identification of patterns in large volumes of data through statistical models, either to establish the probability of an event occurring (a regression model) or to establish a characterization (a classification model).

To perform Data Science, the tool RStudio is commonly used, with an implementation of the R language, an evolution of the S language used in the 1990s in most computer-assisted statistical studies.

RStudio allows performing structural analysis and transformations on a data set (see the sketch after this list):

  • Quantitative analysis using descriptive statistics,
  • Identifying the variable to predict and the type of model (regression or classification),
  • Separating the data set into train (data for training the model) and test (data for testing the model),
  • Analysis using decision trees and regressions to indicate the importance of attributes,
  • Attribute (feature) engineering.
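The article's workflow lives in R/RStudio; purely as an illustration, here is the train/test split and decision-tree step sketched in Python with scikit-learn on a standard sample dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Separate the data set into train and test portions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# A decision tree both classifies and reports attribute importance.
tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print("accuracy on test:", tree.score(X_test, y_test))
print("attribute importance:", tree.feature_importances_)
```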

When a model must be applied to a large amount of data, it is necessary to store the data in Apache Hadoop and write a program to execute it with Spark, which offers much of the functionality of RStudio through SparkR and machine learning through the MLlib library.
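The article points to SparkR; as a minimal equivalent sketch, here is Spark's Python MLlib API (pyspark.ml) fitting a model on a tiny in-memory dataset (in production the DataFrame would be read from HDFS).

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Hypothetical training data; real jobs would read this from HDFS instead.
df = spark.createDataFrame(
    [(0.0, 1.0, 0), (1.0, 0.0, 1), (0.9, 0.1, 1), (0.1, 0.8, 0)],
    ["f1", "f2", "label"],
)

# MLlib models expect the attributes assembled into a single vector column.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
train = assembler.transform(df)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
print(model.coefficients)

spark.stop()
```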

 

Governance

In Big Data, Governance is achieved through the modeling and implementation of eight processes: (1) data operations management; (2) master data and reference data management; (3) documentation and content management; (4) Big Data security management; (5) Big Data development; (6) Big Data quality management; (7) Big Data metadata management; and (8) data architecture management (data warehousing and BI management).

Figure 2: Big Data Quality Governance

In another publication I will address each of these processes and their automation. In conclusion, when dealing with Big Data it is necessary to systematically address all of these aspects and not neglect any of them.
