Testing Data: Episode 2

Published on November 2, 2024
“He was learning to fly, but did not know how to walk”
[Diagram: Data Testing Methodologies]

[Data Sources] --> [Data Ingestion] --> [Data Transformation] --> [Data Storage] --> [Data Consumption]

Applied along the way: [Schema Validation], [Data Profiling], [Data Lineage Testing], [Performance Testing with K6], plus supporting [Tools]

In Episode 1, we introduced the fundamentals of data testing, emphasizing its importance in maintaining data quality and reliability. But in reality, data will continue to grow in volume and complexity, and so will the need for more sophisticated reviews and testing approaches.

In this episode, we want to focus on data testing methodologies, tools, and best practices that elevate our data quality approach. So, take a deep breath, 1, 2, 3, and let's dive in >>>

Data Testing Methodologies

While basic data testing focuses on verifying data accuracy and completeness, advanced methodologies address more complex aspects such as data consistency, integrity, and performance. One of the first approaches is basic schema validation.

Ensuring that data conforms to a predefined schema is crucial, especially when dealing with structured data formats like JSON. Schema validation checks for correct data types, required fields, and structural integrity.

For instance, using JSON Schema to validate API responses ensures that the data received matches the expected format, preventing downstream errors. But what does that even mean? Are we testing the same data retrieval over and over again? And most importantly, how do you test dynamic, ever-changing data?

To illustrate the concept of Schema Validation, let’s consider a simple scenario where we have a predefined schema for user profiles in an application. While the actual user data (such as names, emails, and ages) will vary, the structure of the data remains consistent across all records.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "User Profile",
  "type": "object",
  "properties": {
    "user_id": {
      "type": "integer"
    },
    "username": {
      "type": "string",
      "minLength": 3,
      "maxLength": 30
    },
    "email": {
      "type": "string",
      "format": "email"
    },
    "age": {
      "type": "integer",
      "minimum": 13
    },
    "is_active": {
      "type": "boolean"
    },
    "created_at": {
      "type": "string",
      "format": "date-time"
    }
  },
  "required": ["user_id", "username", "email", "is_active", "created_at"],
  "additionalProperties": false
}

Explanation of the Schema:

  • user_id: Must be an integer.
  • username: Must be a string between 3 and 30 characters.
  • email: Must be a valid email format.
  • age: Must be an integer, minimum value of 13.
  • is_active: Must be a boolean indicating if the user is active.
  • created_at: Must be a string in date-time format.
  • required: Specifies that certain fields must be present.
  • additionalProperties: Disallows any properties not defined in the schema.

Variable Data Samples

Now, let's look at several data samples that conform to the above schema. Notice that while the values change, the structure remains consistent.

{
  "user_id": 101,
  "username": "john_green",
  "email": "[email protected]",
  "age": 28,
  "is_active": true,
  "created_at": "2024-01-15T08:30:00Z"
}
{
  "user_id": 102,
  "username": "jane_smith",
  "email": "[email protected]",
  "age": 34,
  "is_active": false,
  "created_at": "2024-02-20T12:45:30Z"
}
{
  "user_id": 103,
  "username": "alex99",
  "email": "[email protected]",
  "is_active": true,
  "created_at": "2024-03-10T16:20:15Z"
}
{
  "user_id": "104",              // integer?
  "username": "sam",
  "email": "[email protected]",  // email?
  "age": 10,                     // age below minimum?
  "is_active": "yes",            // boolean?
  "created_at": "2024-04-05"     // time?
}

Explanation of Samples:

  • Samples 1–3: These are valid user profiles that adhere to the defined schema. They include all required fields with appropriate data types and formats. Note that Sample 3 omits the optional age field, which is acceptable as it is not listed under required.
  • Sample 4: This is an invalid user profile that violates multiple schema rules:
      ◦ user_id is a string instead of an integer.
      ◦ email does not follow a valid email format.
      ◦ age is below the minimum allowed value of 13.
      ◦ is_active is a string instead of a boolean.
      ◦ created_at does not include the time component required by the date-time format.
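
To make this concrete, here is a minimal sketch of how these checks could be automated in JavaScript with the Ajv library (our choice for illustration; any JSON Schema validator would do). It assumes Node.js, the schema above saved as user_profile.schema.json, and the ajv and ajv-formats packages installed:

// validate_profile.js - a minimal JSON Schema validation sketch (assumes Node.js + ajv + ajv-formats)
const Ajv = require('ajv');
const addFormats = require('ajv-formats');

const schema = require('./user_profile.schema.json'); // the schema defined above
const ajv = new Ajv({ allErrors: true });              // report every violation, not just the first
addFormats(ajv);                                       // enables the "email" and "date-time" format checks

const validate = ajv.compile(schema);

// A record shaped like Sample 4 above (the email value is a placeholder for an invalid address)
const sample4 = {
  user_id: '104',
  username: 'sam',
  email: 'not-a-valid-email',
  age: 10,
  is_active: 'yes',
  created_at: '2024-04-05',
};

if (!validate(sample4)) {
  // Each error reports the offending path and the violated keyword (type, minimum, format, ...)
  console.log(validate.errors);
}

In a test suite, the same compiled validator can be run against every API response or every record in a batch, so the structure is checked even though the values keep changing.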

Data Profiling

Data profiling involves analyzing datasets to understand their structure, content, and relationships. This methodology helps identify anomalies, outliers, and patterns that may indicate data quality issues.

The focus here is Statistical Analysis, Pattern Recognition, and Relationship Analysis.

To illustrate the concept of Data Profiling, let’s consider a simple but solid scenario where we analyze a dataset containing customer information for an online retail company. The goal is to assess the quality and characteristics of the data to identify potential issues and areas for improvement.

Understanding the Dataset

Suppose we have a dataset named customers.csv with 10 records, containing fields such as customer_id, email, age, and total_purchases.

Descriptive Statistics

Calculate basic statistics for numerical fields to understand their distribution. For instance:

Age:

  • Count: 9 (1 missing)
  • Mean: 33.11
  • Median: 30
  • Minimum: 22
  • Maximum: 52
  • Standard Deviation: 9.73

Total Purchases:

  • Count: 9 (1 missing)
  • Mean: 6.89
  • Median: 7
  • Minimum: 2
  • Maximum: 12
  • Standard Deviation: 3.48
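
Figures like these can be produced with a few lines of code once the CSV has been parsed. Below is a small JavaScript sketch; the placeholder rows and field names (age, total_purchases) are illustrative and would be replaced by the real customers.csv data:

// profile_stats.js - descriptive statistics sketch (placeholder data for illustration)
const customers = [
  // In practice, parse customers.csv with a CSV library; these rows are placeholders
  { customer_id: 1, email: 'a@example.com', age: 28, total_purchases: 5 },
  { customer_id: 2, email: 'b@example.com', age: 34, total_purchases: 9 },
  { customer_id: 3, email: 'c@example.com', age: null, total_purchases: 2 },
];

function describe(values) {
  const present = values.filter((v) => v !== null && v !== undefined && v !== '');
  const nums = present.map(Number).sort((a, b) => a - b);
  const count = nums.length;
  const mean = nums.reduce((sum, v) => sum + v, 0) / count;
  const median = count % 2 ? nums[(count - 1) / 2] : (nums[count / 2 - 1] + nums[count / 2]) / 2;
  const variance = nums.reduce((sum, v) => sum + (v - mean) ** 2, 0) / (count - 1); // sample variance
  return {
    count,
    missing: values.length - count,
    mean: +mean.toFixed(2),
    median,
    min: nums[0],
    max: nums[count - 1],
    stdDev: +Math.sqrt(variance).toFixed(2),
  };
}

console.log('age:', describe(customers.map((c) => c.age)));
console.log('total_purchases:', describe(customers.map((c) => c.total_purchases)));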

Data Completeness

Here we assess the completeness of each field by identifying missing values. For instance, age and total_purchases each have one missing value (9 of 10 records populated), while customer_id and email are fully populated.

Data Uniqueness

Evaluate the uniqueness of records to identify potential duplicates.

  • customer_id: All values are unique (10 unique out of 10).
  • email: All values are unique (10 unique out of 10).

Data Consistency

Check for consistency in categorical fields.
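
Completeness, uniqueness, and a basic consistency check can be sketched in the same style, reusing the parsed customers array from the previous snippet (the country field here is hypothetical):

// profile_quality_checks.js - completeness, uniqueness, and consistency sketch
// Completeness: count missing values per field
function missingPerField(rows, fields) {
  const missing = {};
  for (const field of fields) {
    missing[field] = rows.filter((r) => r[field] === null || r[field] === undefined || r[field] === '').length;
  }
  return missing;
}

// Uniqueness: number of distinct values for a field (compare against rows.length to spot duplicates)
function distinctCount(rows, field) {
  return new Set(rows.map((r) => r[field])).size;
}

// Consistency: list the distinct spellings of a categorical field, e.g. "US" vs "USA" vs "United States"
function categories(rows, field) {
  return [...new Set(rows.map((r) => r[field]))];
}

console.log(missingPerField(customers, ['customer_id', 'email', 'age', 'total_purchases']));
console.log(`unique customer_id: ${distinctCount(customers, 'customer_id')} of ${customers.length}`);
console.log('country values:', categories(customers, 'country')); // hypothetical categorical field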

… And the list goes on; these checks can be extended based on your business model and requirements. For instance, we could also cover interpreting profiling results or handling inconsistent data formats.

Data Lineage Testing

Understanding the origin and transformation of data is essential for tracking its journey through various systems. Data lineage testing verifies that data transformations maintain integrity and traceability from source to destination.

Benefits:

  • Impact Analysis: Assessing how changes in data sources affect downstream systems.
  • Compliance: Ensuring data handling meets regulatory and consumption requirements.
A typical lineage in such a setup might look like this:

[Regional Databases] --> [ETL Processes] --> [Data Transformation Scripts] --> [Central Data Warehouse] --> [PowerBI Dashboards]

We will elaborate further on this one in a later episode, as concepts like observability and logging fall outside the scope of this chapter.
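
Even so, a lightweight check already adds value. One hedged example: after each ETL run, reconcile record counts between a source system and the warehouse to confirm nothing was dropped or duplicated along the way (the countRows helper and table name below are hypothetical):

// lineage_count_check.js - a minimal source-to-warehouse reconciliation sketch
// countRows(system, table) is a hypothetical async helper that runs SELECT COUNT(*) on the given system
async function checkLineageCounts(countRows) {
  const sourceCount = await countRows('regional_db', 'orders');           // hypothetical table
  const warehouseCount = await countRows('central_warehouse', 'orders');  // same table downstream

  if (sourceCount !== warehouseCount) {
    throw new Error(`Lineage check failed: source=${sourceCount}, warehouse=${warehouseCount}`);
  }
  console.log(`Lineage check passed: ${sourceCount} records in both systems`);
}

// usage: checkLineageCounts(myCountRowsImplementation);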

Performance Testing for Data Pipelines

As data volumes increase, ensuring that data pipelines can handle the load without performance degradation becomes vital. Performance testing evaluates the scalability and efficiency of data processing workflows.

To illustrate the concept of Performance Testing for Data Pipelines, let's consider a scenario where we evaluate the scalability and efficiency of a data pipeline responsible for processing real-time sensor data from an IoT (Internet of Things) application. The goal is to ensure that the pipeline can handle increasing data volumes without performance degradation. So let's break it down…

Understanding Performance Testing for Data Pipelines

Performance Testing assesses how a data pipeline performs under various conditions, focusing on aspects such as throughput, latency, and resource utilization. It helps identify bottlenecks, ensure scalability, and guarantee that the pipeline meets performance requirements.

Key Performance Metrics:

  • Throughput: The amount of data processed per unit of time (e.g., records per second).
  • Latency: The time taken for data to traverse through the pipeline from ingestion to storage.
  • Resource Utilization: CPU, memory, and network usage during data processing.
  • Error Rates: Frequency of errors encountered under load.

Scenario: Testing an IoT Data Pipeline

Imagine an organization that collects real-time sensor data from thousands of IoT devices deployed across various locations. This data flows through a pipeline comprising data ingestion, transformation, and storage stages. As the number of devices scales up, it’s crucial to ensure that the pipeline can handle the increased load without compromising performance.

Pipeline Components

  • Data Ingestion (Kafka): Serves as the message broker to ingest real-time sensor data.
  • Data Transformation (Apache Spark): Processes and transforms the incoming data.
  • Data Storage (Amazon S3): Stores the processed data for further analysis.
  • Data Consumption (Amazon Redshift): Enables data analytics and reporting.

Setting Up Performance Testing with K6

K6 is an open-source, developer-centric performance testing tool that is primarily used for load testing web applications and APIs. However, with its scripting capabilities, K6 can be adapted to test various components of a data pipeline, especially the data ingestion and API endpoints.

Steps to Perform Performance Testing Using K6:

a. Install K6

First, install K6 on your testing environment. You can download it from the official website https://k6.io/ or use package managers like brew for macOS:

brew install k6

b. Define Performance Test Objectives

Before writing the test script, define what you aim to achieve:

  • Simulate Load: Emulate the expected number of IoT devices sending data concurrently.
  • Measure Throughput and Latency: Assess how the pipeline handles the data volume.
  • Identify Bottlenecks: Determine stages in the pipeline where performance degrades.

c. Create a K6 Test Script

Below is a sample K6 script (iot_pipeline_test.js) that simulates multiple IoT devices sending data to the data ingestion endpoint (e.g., an API that feeds into Kafka).

import http from 'k6/http';
import { sleep, check } from 'k6';
import { Counter } from 'k6/metrics';

// Define custom metrics
export let errorCount = new Counter('errors');

export let options = {
  stages: [
    { duration: '2m', target: 100 }, // Ramp-up to 100 users
    { duration: '5m', target: 100 }, // Stay at 100 users
    { duration: '2m', target: 200 }, // Ramp-up to 200 users
    { duration: '5m', target: 200 }, // Stay at 200 users
    { duration: '2m', target: 0 },   // Ramp-down to 0 users
  ],
  thresholds: {
    'http_req_duration': ['p(95)<500'], // 95% of requests should be below 500ms
    'errors': ['count<10'],             // Less than 10 errors
  },
};

export default function () {
  const url = 'https://api.yourdomain.com/iot/data';

  // Sample sensor data payload
  const payload = JSON.stringify({
    device_id: `device_${Math.floor(Math.random() * 1000)}`,
    temperature: (20 + Math.random() * 15).toFixed(2),
    humidity: (30 + Math.random() * 50).toFixed(2),
    timestamp: new Date().toISOString(),
  });

  const params = {
    headers: {
      'Content-Type': 'application/json',
    },
  };

  let res = http.post(url, payload, params);

  // Check if the response status is 200
  let result = check(res, {
    'is status 200': (r) => r.status === 200,
  });

  if (!result) {
    errorCount.add(1);
  }

  sleep(1); // Wait for 1 second between iterations
}

Explanation of the Script:

  • Stages: Defines the load pattern, ramping up to 100 users, maintaining for 5 minutes, ramping up to 200 users, maintaining, and then ramping down.
  • Thresholds: Sets performance expectations, such as 95% of requests should complete within 500ms, and error counts should remain below 10.
  • Payload: Simulates sensor data with randomized values for temperature and humidity.
  • Checks: Verifies that each HTTP POST request receives a 200 OK response. If not, increments the errorCount metric.
  • Sleep: Introduces a 1-second pause between iterations to mimic realistic device behavior.
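
As a side note, k6 thresholds can also be configured to abort the run as soon as they are crossed, which is useful for long tests. A small variation of the thresholds block above (not used in our script) would be:

export let options = {
  // ...same stages as above...
  thresholds: {
    http_req_duration: [{ threshold: 'p(95)<500', abortOnFail: true }], // stop the test as soon as the SLO is breached
  },
};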

d. Execute the Performance Test

Run the K6 test script using the following command:
k6 run iot_pipeline_test.js

K6 will execute the script, simulating the defined load pattern and collecting performance metrics.

e. Analyze Test Results

After the test completes, K6 provides a summary of the performance metrics:


execution: local
script: iot_pipeline_test.js
output: -

scenarios: (100.00%) 1 scenario, 200 max VUs, 9m30s max duration (incl. graceful stop):
* default: Up to 200 looping VUs for 9m30s over 21 stages (gracefulRampDown: 30s)

running (9m30.0s), 00/200 VUs, 15000 complete and 15 interrupted iterations
default ✓ [======================================] 200 VUs 9m30s/9m30s 15m0s

✓ is status 200

checks.........................: 99.90% ✓ 14985 ✗ 15
data_received..................: 2.5 GB 4.36 MB/s
data_sent......................: 1.2 GB 2.11 MB/s
http_req_blocked...............: avg=10ms min=1µs med=2µs max=5s
http_req_connecting............: avg=5ms min=0s med=0s max=4s
http_req_duration..............: avg=300ms min=100ms med=250ms max=800ms
http_req_receiving.............: avg=50ms min=20ms med=40ms max=200ms
http_req_sending...............: avg=20ms min=10ms med=15ms max=100ms
http_reqs......................: 15000 26.32/s
iteration_duration.............: avg=1.2s min=1s med=1.1s max=2s
iterations.....................: 15000 26.32/s
vus_max........................: 200
vus_min........................: 0
vus_mean.......................: 150

Key Metrics to Analyze:

  • Throughput (http_reqs): Number of requests per second. In this example, 26.32 requests per second.
  • Latency (http_req_duration): Average request duration. Here, the average is 300ms, with 95% expected below 500ms.
  • Error Rates (errors): Total number of failed requests. The threshold was set at fewer than 10, and in this example 15 errors occurred, slightly exceeding it, so this threshold would be reported as failed.
  • Resource Utilization: While K6 does not directly measure CPU or memory usage of the pipeline components, integrating it with monitoring tools (e.g., Grafana, Prometheus) can provide deeper insights.
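
For that kind of correlation, k6 can also stream its results to an external output instead of only printing a summary; for example, writing every metric sample to a JSON file for later analysis alongside infrastructure metrics:

k6 run --out json=results.json iot_pipeline_test.js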

… Yes, Episode 3, more to come, read, share, learn.

Conclusion

Data testing, coupled with the right approach, tools, and best practices, is essential for maintaining high data quality in today’s data-driven landscape. By implementing a robust but simple data testing strategy, organizations can ensure data integrity, support reliable decision-making, and enhance overall operational efficiency.

In Episode 3, we’ll explore data testing in machine learning workflows, focusing on ensuring data quality for model training and deployment. Stay tuned!

If you found this episode helpful, feel free to share your thoughts and experiences in the comments below!