
Iceberg Data Generation
Data Generation Series
Generate millions of production-like records and load them into Iceberg tables with ease.

You may have a use case such as testing your job or application, replicating a performance issue, determining the optimal data model or simply debugging a production bug. In this article, we will look at how to seamlessly generate any number of records that follow your production data patterns, all through the open-source tool Data Caterer.
Choice of Interface
You have four choices of interface for generating data in Data Caterer:
- UI
- Scala
- Java
- YAML
In this article, we will use Scala as the interface, but feel free to use the following page as a reference if you prefer the UI, Java or YAML interfaces.
Data Generation
First, check out the data-caterer-example repository on your local machine. This repository will help you do the following:
- Define a data generation task
- Generate records
Now let’s try to generate data for a scenario like bank accounts. We can do this by creating a generation task in Scala or Java within the data-caterer-example repository. Below is an example of the details required to set up the task.
import io.github.datacatering.datacaterer.api.PlanRun
import io.github.datacatering.datacaterer.api.model.{DateType, DecimalType, TimestampType}
import java.sql.Date

class IcebergPlan extends PlanRun {
  val accountTask = iceberg("customer_accounts", "account.accounts", "/opt/app/data/customer/iceberg")
    .schema(
      field.name("account_number").regex("[0-9]{10}").unique(true),
      field.name("balance").`type`(new DecimalType(5, 2)).min(1).max(1000),
      field.name("name").expression("#{Name.name}"),
      // derived field: value depends on the generated status field
      field.name("created_by").sql("CASE WHEN status IN ('open', 'closed') THEN 'eod' ELSE 'event' END"),
      field.name("open_time").`type`(TimestampType).min(Date.valueOf("2022-01-01")),
      field.name("status").oneOf("open", "closed", "suspended", "pending")
    )

  execute(accountTask)
}
Each field has specific configuration that helps make the generated values closer to the data that would exist in production.
Field Metadata
ACCOUNT_NUMBER
- account_number follows a particular pattern: it is a 10-digit number. This can be defined via a regex like [0-9]{10}. We also mark the field as unique so that no duplicate values are generated.
BALANCE
- balance should not be too large, so we define a min and max so that the generated numbers fall between 1 and 1000.
NAME
- name is a string that also follows a certain pattern, so we could define a regex, but here we choose to leverage the DataFaker library and use an expression to generate real-looking names. All possible faker expressions can be found here.
CREATED_BY
- created_by is derived from the status field and follows this logic: if status is open or closed, then created_by is eod, otherwise it is event. This can be achieved by defining a SQL expression as shown above.
OPEN_TIME
- open_time is a timestamp that we want to be greater than a specific date. We can define a min by using either java.sql.Date or java.sql.Timestamp (see the short sketch after this list).
STATUS
- status is a field that can only take one of four values: open, closed, suspended or pending.
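As a small aside on the open_time option, here is a minimal sketch of the same field constrained with java.sql.Timestamp instead of java.sql.Date, which lets you include a time-of-day component; the specific timestamp value is only an example and not part of the original task:
// minimal sketch: use a java.sql.Timestamp for the minimum open_time
// the timestamp value below is only an illustrative example
field.name("open_time")
  .`type`(TimestampType)
  .min(java.sql.Timestamp.valueOf("2022-01-01 09:00:00"))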
We can now try to run it via the following command to see what happens:
cd data-caterer-example/
./run.sh IcebergPlan

You can see that the generated data looks reasonable. There are some fields without values that you can populate yourself as an exercise, to explore what other data generation options are available to you (a couple of illustrative options are sketched below).
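For example, here is a rough sketch of that exercise, adding a couple of extra fields to the schema using other DataFaker expressions. The field names (email, city) and the specific expressions are purely illustrative and not part of the original task:
// illustrative only: extra fields populated via other DataFaker expressions
field.name("email").expression("#{Internet.emailAddress}"),
field.name("city").expression("#{Address.city}")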
Now we can move on to other capabilities that exist when generating data. Let's say we wanted to generate some additional data related to accounts. For example, we want to generate transactions per account. Our transactions schema may look something like this:
account_number: string
full_name: string
amount: decimal
time: timestamp
date: date
We note that this dataset also has a column called account_number. What if we wanted the same account_number values to show up in both the accounts and transactions datasets to test out all the functionality? This is where we can define foreign keys in Data Caterer to help.
Foreign Keys
We can define our foreign keys like below:
val accountTask = ...

val transactionTask = iceberg("customer_transactions", "account.transactions", "/opt/app/data/customer/iceberg")
  .schema(
    field.name("account_number"),
    field.name("full_name"),
    field.name("amount").`type`(new DecimalType(4, 2)).min(1).max(100),
    field.name("time").`type`(TimestampType).min(java.sql.Date.valueOf("2022-01-01")),
    field.name("date").`type`(DateType).sql("DATE(time)")
  )

val config = ...

val myPlan = plan.addForeignKeyRelationship(
  accountTask, List("account_number", "name"),
  List(transactionTask -> List("account_number", "full_name"))
)

execute(myPlan, config, accountTask, transactionTask)
Running again via:
./run.sh IcebergPlan

This allows us to test any jobs or applications that rely on both the accounts and transactions Iceberg tables being populated for a given account_number. One thing you may have noticed is that we haven't defined how many records to generate. By default, 1,000 records are generated. If we want to change this or other count-related options, we can include the below at the task level:
// generate 10,000 records
val accountTask = iceberg("customer_accounts", "account.accounts", "/opt/app/data/customer/iceberg")
  ...
  .count(count.records(10000))

// generate between 1 and 5 records per account_number and full_name
val transactionTask = iceberg("customer_transactions", "account.transactions", "/opt/app/data/customer/iceberg")
  ...
  .count(count.recordsPerColumnGenerator(generator.min(1).max(5), "account_number", "full_name"))
We can also check out the report to see a summary of what was generated under docker/sample/report/index.html. A sample report can also be seen here.

Partitioned
If you require your Iceberg tables to be partitioned, you can control this via the partitionBy option in the connection definition:
val transactionTask = iceberg(
  "customer_transactions",
  "account.transactions",
  "/opt/app/data/customer/iceberg",
  "hadoop", // catalog type
  "",       // warehouse URI
  Map("partitionBy" -> "account_number,full_name")
)
Automated Generation
We know that in the real world, new or altered schemas will continue to appear as new use cases arise, business requirements change, data sizes grow, and so on. So how can we automate this data generation process to keep up with these changes? This is where the core principle of Data Caterer being a metadata-driven tool comes into play. All we need to do is define our Iceberg schema to come from a metadata source (such as a data catalog like Open Metadata or a data contract like ODCS) and enable the enableGeneratePlanAndTasks flag:
class AdvancedIcebergPlan extends PlanRun {
  val accountTask = iceberg("customer_accounts", "account.accounts", "/opt/app/data/customer/iceberg")
    .schema(metadataSource.openDataContractStandard("/opt/app/mount/odcs/full-example.odcs.yaml"))
    .count(count.records(100))

  val config = configuration
    .generatedReportsFolderPath("/opt/app/data/report")
    .enableGeneratePlanAndTasks(true)
    .enableRecordTracking(true)

  execute(config, accountTask)
}
We have also enabled a flag enableRecordTracking that will be useful later. Now let’s try to run it and see what happens.
./run.sh AdvancedIcebergPlan
Delete Generated Data
One often overlooked part of data generation is cleaning up the generated data. This is important as we should look to clean up after ourselves and reduce the burden of data and infrastructure management in our test environments. We can set it to delete the generated records by enabling enableDeleteGeneratedRecords and disabling enableGenerateData.
val config = configuration
  .generatedReportsFolderPath("/opt/app/data/report")
  .enableGeneratePlanAndTasks(true)
  .enableRecordTracking(true)
  .enableDeleteGeneratedRecords(true)
  .enableGenerateData(false)
Running the job again will now delete the records that we generated before, keeping intact any other existing data that was there previously. It will also ensure that the deletion happens according to the order defined implicitly by foreign keys.
./run.sh AdvancedIcebergPlan
Conclusion
Nice! Now we have the full lifecycle of generating data in Iceberg format. If you would like to find out what else Data Caterer is capable of, more details can be found at data.catering. If you want to read other guides that take you through generating data in data sources such as Kafka or Postgres, check the list here.
Thanks for reading!