A common mistake in software development is to think that a Big Data methodology gets in the way, that it does not work, that it only slows things down, and that, since life is chaos, it is best to build the software without any plan at all, running projects with the attitude of "we'll figure it out when we get there." The truth is that no quality software product can ignore either the process by which it is built or the product that results.
Although agile methodologies are in fashion, they are ultimately a response to the speed with which the market demands that software go into production. It is important, however, to distinguish between tools, techniques, methods, and methodologies.
A tool is something used in the workplace, for example Apache Hadoop. A technique is the efficient or effective way of using the tool, for example using Apache Hadoop with a columnar format and writing the files compressed with Snappy. A method is a path, a way to get from one point to another, for example MapReduce with YARN, or Spark on top of Apache Hadoop. A methodology is a set of methods that has been reflected upon after being applied and that, when put into practice under the same premises, yields similar results.
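To make the "technique" example above concrete, here is a minimal sketch assuming PySpark running on a Hadoop cluster; the paths and options shown are hypothetical placeholders, not part of the methodology itself:

```python
# Minimal sketch: store data on Hadoop (HDFS) in a columnar format (Parquet)
# compressed with Snappy, using PySpark. Paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snappy-parquet-example").getOrCreate()

# Read a raw CSV extract from HDFS (hypothetical location).
raw_df = spark.read.option("header", True).csv("hdfs:///data/raw/transactions.csv")

# Write it back as columnar Parquet files compressed with Snappy.
(raw_df.write
    .mode("overwrite")
    .option("compression", "snappy")
    .parquet("hdfs:///data/curated/transactions_parquet"))
```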
In Big Data, many have tried to apply orthodox methodologies, swinging between extremes such as "now everything is Big Data" or "Big Data does not work for me." The reality is that a Big Data methodology must be agile and must start from the selection of a use case that solves a problem which cannot be solved with the current tools, techniques, methods, and methodologies.
Accordingly, the Big Data project methodology proposed here is an adaptation of Scrum (an agile, flexible methodology for managing software development), with 4 sprints: the first two of 2 weeks, the third of 4 weeks, and the last of 6 weeks, for a total of 14 weeks, with daily reviews (see Figure 1).
Figure 1: Methodology of Big Data projects
Sprint 1: Ingredients Search
In the first sprint, which we will call Ingredients Search and which lasts 2 weeks, we identify the candidate use cases that we seek to improve through Big Data management. The selection criteria for a use case are at least three: volume (more than a million records), variety (many columns, some with null or NaN values), and velocity (queries that traditional relational database systems either cannot answer or take more than 5 minutes to answer), as sketched below.
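As an illustration of how these three criteria might be checked on a candidate dataset, the following rough sketch assumes PySpark and hypothetical table and column names:

```python
# Hedged sketch: check the three selection criteria for a candidate use case.
# Paths and column names (customer_id, amount) are hypothetical.
import time
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("use-case-criteria").getOrCreate()
df = spark.read.parquet("hdfs:///data/curated/transactions_parquet")

# Volume: more than one million records.
record_count = df.count()

# Variety: total columns, and how many of them contain nulls.
null_counts = df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
).first().asDict()
columns_with_nulls = sum(1 for v in null_counts.values() if v > 0)

# Velocity: time an aggregation that the traditional system struggles with.
start = time.time()
df.groupBy("customer_id").agg(F.sum("amount").alias("total")).count()
elapsed_minutes = (time.time() - start) / 60

print(record_count > 1_000_000, len(df.columns), columns_with_nulls, elapsed_minutes)
```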
Sprint 2: Prepare Food
In the second sprint (Prepare Food), which lasts 2 weeks, we need to identify, for each of the use cases described in the previous sprint, the variables we want to calculate (key performance indicators) or predict (the Distillation Tier) in each case.
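A minimal sketch of this distinction, assuming a hypothetical payments use case in PySpark (all column names are assumptions): one variable is calculated as a KPI, another is marked as the target to be predicted in a later step.

```python
# Hedged sketch: a KPI we calculate vs. a variable we want to predict.
# Column names (created_at, approved, charged_back, ...) are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kpi-vs-prediction").getOrCreate()
transactions = spark.read.parquet("hdfs:///data/curated/transactions_parquet")

# KPI to calculate: daily approval rate (a descriptive indicator).
approval_rate = (transactions
    .groupBy(F.to_date("created_at").alias("day"))
    .agg(F.avg(F.col("approved").cast("double")).alias("approval_rate")))

# Variable to predict (the target for the Distillation Tier): whether a
# transaction will be charged back, to be modeled later.
labeled = transactions.withColumn("label", F.col("charged_back").cast("double"))
feature_columns = ["amount", "merchant_category", "hour_of_day"]  # assumed features
```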
Sprint 3: Food Plating and Presentation
The third sprint (Food Plating and Presentation) is about selecting the most relevant use case for the organization, what we will call the Golden Goal: a use case that could not be solved with traditional relational database systems or with the available business intelligence tools, and that can be addressed more quickly, at lower cost, and with less effort using Big Data.
For the selected use case it will be necessary to obtain the support of senior management, which means evangelizing them on the new architecture and the management of large volumes of data. In addition, the investment and expense budget must be secured in order to execute the next sprint and develop the use case with the necessary Big Data components.
The word necessary is emphasized, since it is a common mistake to try to implement every component of the Big Data architecture for this one use case. This third sprint lasts 4 weeks.
Sprint 4: Food Delivery
The last sprint (Food Delivery) is divided into 3 stages:
Packaging
In this stage, the components of the proposed Big Data architecture for the selected use case are installed and configured, and the extraction, transformation, and loading of the data is carried out (according to the types of information acquisition defined: real-time ingestion, micro-batch ingestion, batch ingestion), together with its storage and use. Opportunities for improvement must also be identified, whether by increasing sales or by optimizing processes. This stage lasts 4 weeks.
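The ingestion types mentioned above could look roughly like the sketch below, assuming PySpark with Structured Streaming; broker addresses, topic names, and paths are hypothetical.

```python
# Hedged sketch of batch vs. micro-batch ingestion with PySpark.
# Kafka broker, topic, and HDFS paths are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingestion-types").getOrCreate()

# Batch ingestion: load a full extract on a schedule and append it.
batch_df = spark.read.parquet("hdfs:///landing/daily_extract/")
batch_df.write.mode("append").parquet("hdfs:///data/curated/events")

# Micro-batch ingestion: consume from Kafka in small fixed intervals.
stream_df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load())

query = (stream_df.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/curated/events_stream")
    .option("checkpointLocation", "hdfs:///checkpoints/events_stream")
    .trigger(processingTime="1 minute")   # micro-batch every minute
    .start())

# Real-time ingestion would use a lower-latency trigger, e.g. Structured
# Streaming's .trigger(continuous="1 second") on supported sinks, or a
# dedicated streaming engine.
```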
Proof of Delivery
This second stage consists of verifying that the specifications of the use case were met, in order to correct the assumptions in the strategy and refine the return on investment. It lasts one week. Between this stage and the next, it is customary to consolidate the lessons learned and select another use case in which to replicate them in a new cycle of the methodology (Sprint 5, Figure 1).
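As a simple illustration of refining the return on investment at this point, the sketch below uses purely hypothetical figures; the only fixed piece is the standard ROI formula.

```python
# Minimal sketch: refine the estimated ROI with measured figures.
# All monetary amounts are illustrative assumptions, not real results.
def roi(benefit: float, cost: float) -> float:
    """ROI = (benefit - cost) / cost."""
    return (benefit - cost) / cost

estimated = roi(benefit=180_000, cost=120_000)   # assumption made in Sprint 3
measured = roi(benefit=150_000, cost=135_000)    # figures measured after delivery

print(f"Estimated ROI: {estimated:.0%}, measured ROI: {measured:.0%}")
```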
Disposable Serviceware Waste
Finally, it is necessary to measure the results and get rid of the waste: document everything that worked, and discard everything that did not work or could not be applied. This stage lasts 1 week.
Final Thoughts
This methodology promotes the adaptation of the organization in an evolutionary and incremental way through the use of agile methods. This is why iterating through a phase in which the next use case is selected is suggested.
Through this methodology we have implemented successful use cases in international banks, in payment processors for Uber and Google, and in fintechs, demonstrating that a period of roughly three months is enough to implement a successful Big Data use case. As an illustration, the AWS team runs its well-known Data Lab using a methodology very similar to the one described above.