Quality Assurance and Big Data

Reading Time: 4 minutes

“Big Data” is a trending topic that keeps gaining prominence as the digital universe continues to expand. Big data isn’t a new term in the technology market – Roger Magoulas coined it more than ten years ago – but lately we’re seeing a surge in its popularity as companies recognize its many uses.

Many companies around the world are re-evaluating how they manage the growing volume of data coming from social media, legacy databases, mobile devices, bar codes, and everything else connected to the internet. Organizations now see the real value of that data and the new opportunities to collect and understand it.

Quality assurance is a sensitive topic when speaking on the subject of big data, not only because of the high demand for software development skills in the market but also because traditional methods fall short. Some tools are inadequate for large data sets, and data validation can be a headache. You might ask yourself: “How can we validate data in an Excel spreadsheet, or record by record? How do we decide what to test? How much data do we test? What should the sample size be? When is the appropriate time to collect data?”

If we are to implement testing for a big data system, a change of perspective is necessary. Usually, when planning test scenarios, we focus on the minute details, but for big data it’s wise to think on a larger scale. We should think in partitions, not records, and focus on metadata, not just data. Thinking in clusters rather than individual machines simply produces more accurate results. As quality assurance engineers we should target processes and system behavior, not just data volume.

Below are some points to keep in mind when we want to test big data systems:

  • Data validation: in big data, we are evaluating a lot of files. We can’t test every individual record in the system, so we should work with partitions.
  • Manipulating data: data can be split into smaller chunks so it is easy to manipulate; attempting to manipulate the whole data set at once becomes difficult.
  • Backing up data: duplicating data for testing purposes is not necessary – we don’t need a recovery copy as in Oracle or SQL. Big data frameworks already replicate data across different nodes as a backup.
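
The partition-first idea above can be sketched in plain Python. This is a hypothetical, framework-agnostic example (the function names and the per-partition statistics are my own choices, not from any specific tool): instead of diffing records one by one, we compare summary statistics per partition between the source extract and the loaded data.

```python
# Hypothetical partition-level validation: compare per-partition row counts
# and missing-key rates between a source extract and the loaded data,
# instead of diffing individual records.

def partition_stats(rows, key_field):
    """Summarize a partition: row count and rate of rows missing the key."""
    total = len(rows)
    missing = sum(1 for r in rows if r.get(key_field) is None)
    return {"count": total, "missing_rate": missing / total if total else 0.0}

def validate_partitions(source, loaded, key_field, tolerance=0.0):
    """Return the ids of partitions whose summary statistics disagree."""
    failures = []
    for pid in source:
        s = partition_stats(source[pid], key_field)
        l = partition_stats(loaded.get(pid, []), key_field)
        if s["count"] != l["count"] or abs(s["missing_rate"] - l["missing_rate"]) > tolerance:
            failures.append(pid)
    return failures

source = {"2024-01": [{"id": 1}, {"id": 2}], "2024-02": [{"id": 3}, {"id": None}]}
loaded = {"2024-01": [{"id": 1}, {"id": 2}], "2024-02": [{"id": 3}]}
print(validate_partitions(source, loaded, "id"))  # the 2024-02 partition disagrees
```

A real pipeline would compute the same statistics with the cluster’s own aggregation engine, but the principle is identical: validate partitions and metadata, not records.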

As the market offers several tools to manage big data, quality engineers face additional challenges, such as testing big data protocols and processes while maintaining the validity, accuracy, conformity, consistency, and uniqueness of the data involved in the testing. For those reasons, having a clear test strategy is paramount.

Given this information, we’re going to mention some techniques to practice, and in future blogs we will go more in-depth on how to apply these testing techniques to a big data system.

Non-functional Testing

Keeping in mind big data’s characteristics (the three Vs detailed in a previous blog), we will approach performance testing. As we know, one of the goals of performance testing is to check response time and bring the system to an acceptable level. As another part of the performance test plan, we can cover areas such as partition size or the number of partitions, and add checks that help identify conditions likely to cause issues later.

We can also apply failover testing to learn how the application reacts and how it switches between nodes in case of failure.
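
As a small illustration of the performance angle, here is a minimal timing harness sketch (the `process_partition` stand-in and the shape of the report are assumptions; a real test would invoke the actual job against the cluster). It records how long each partition takes, which is the raw data needed to study the effect of partition size or count.

```python
import time

def process_partition(records):
    # Stand-in for a real job stage; a real performance test would call
    # the system under test here rather than summing in-process.
    return sum(records)

def measure(partitions):
    """Time each partition and return (partition_id, seconds) pairs."""
    results = []
    for pid, records in partitions.items():
        start = time.perf_counter()
        process_partition(records)
        results.append((pid, time.perf_counter() - start))
    return results

timings = measure({"p0": list(range(1_000)), "p1": list(range(100_000))})
for pid, seconds in timings:
    print(f"{pid}: {seconds:.6f}s")
```

Re-running the same harness with different partition counts or sizes is one simple way to surface the conditions that will cause trouble at production scale.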

Functional testing

For now, we’re going to mention some scenarios commonly used for big data systems, but note that these scenarios depend on the technologies and algorithms used in a specific project. Below are some suggestions:

  • Validate the MapReduce process or any equivalent (e.g., Spark) – this way we can be certain the information is gathered correctly, i.e., that the process matches the requirements.
  • Data storage validation, in this case, corresponds to verifying data quality using sample data.
  • Report validation can be tested using a pair of techniques: checking the visualization approach and validating the attributes.
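
The first scenario above is commonly done by running a trusted reference implementation over a small sample and comparing it with the job’s output. This is a minimal sketch with a classic word count (the sample data and `job_output` values are made up for illustration):

```python
from collections import Counter

def expected_word_count(lines):
    """Reference implementation run in-process over a small input sample."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)

sample = ["big data big", "data quality"]
job_output = {"big": 2, "data": 2, "quality": 1}  # what the real job produced

assert expected_word_count(sample) == job_output
print("map-reduce output matches the reference for the sample")
```

The reference only needs to be correct on the sample, not scalable; the distributed job is what handles the full volume.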

Automation Solution

Automation is absolutely key to simplifying big data testing and compressing the discovery life cycle (understand data / prepare data / model / evaluate / deploy / monitor). Automation also creates an opportunity to collect data on what the system is doing right now, and we can compare the results to see whether the team is achieving the project’s goals – it stores statistics that help the team.
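
One way to use those stored statistics is a drift check between runs: keep a baseline of key metrics and flag anything that moved too far. This is a hypothetical sketch (metric names and the 10% threshold are assumptions, not from any particular tool):

```python
def compare_runs(baseline, current, max_drift=0.1):
    """Flag metrics whose relative drift from the baseline exceeds max_drift."""
    drifted = {}
    for metric, base in baseline.items():
        value = current.get(metric, 0)
        if base and abs(value - base) / base > max_drift:
            drifted[metric] = (base, value)
    return drifted

baseline = {"rows_loaded": 1_000_000, "rejects": 120}
current = {"rows_loaded": 1_050_000, "rejects": 400}
print(compare_runs(baseline, current))  # rejects drifted well past 10%
```

Wired into the automated suite, a check like this turns the collected statistics into an early warning rather than a report read after the fact.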

Another important part of automation is ensuring that the correct automation environment configuration is in place so that it can simulate a real big data environment.

Regarding automation tools, the market offers several options for testing structured and unstructured data sets, but we will cover them in a later article.

As a final comment, we’ve noticed that more companies are requesting big data testing skills, but in order to see the benefits, we should apply the correct test strategy, including the right tools to diagnose and evaluate the project’s stability.