
Efficient File Counting in AWS S3 with Go Concurrency
While working on a task assigned to me at work, I needed to search through a range of folders in AWS S3 and count the total number of files. At first glance, this seemed straightforward.
The Initial Approach: Bash and AWS CLI
Initially, I tried using a bash script with the AWS CLI to count files by listing each folder one by one. However, this approach quickly became inefficient due to its sequential nature: processing each folder individually took a very long time, especially for folders containing many files.
Moving to Go for Faster Processing
To streamline the process, I turned to Go (Golang), which offers robust concurrency support. By leveraging Go’s goroutines, I was able to count files across multiple folders in parallel, dramatically reducing the overall time needed for the task.
Go Implementation with Concurrency
I developed a tool called countS3 in Go that uses the AWS SDK for Go, goroutines, and sync.WaitGroup to manage concurrent tasks. The code is available on GitHub: countS3.
Here’s a breakdown of the main components of the Go implementation:
- Goroutines: A pool of worker goroutines counts folders in parallel.
- WaitGroup: Ensures that the main function waits until all file-counting goroutines have completed.
- Channel: Carries the folder paths that need to be counted, allowing multiple worker goroutines to pick up folders concurrently (see the sketch below).
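Before walking through the actual implementation, here is a minimal, self-contained sketch of the worker-pool pattern on its own, with no S3 involved, just to show how the channel, the goroutines, and the WaitGroup fit together. The folder names and pool size are placeholders.

package main

import (
    "fmt"
    "sync"
)

func main() {
    jobs := make(chan string)
    var wg sync.WaitGroup

    // Start a small pool of workers; each one pulls jobs off the channel until it is closed.
    for i := 1; i <= 3; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            for job := range jobs {
                fmt.Printf("worker %d processing %s\n", id, job)
            }
        }(i)
    }

    // Send the jobs, then close the channel so the workers' range loops terminate.
    for _, folder := range []string{"folder-a/", "folder-b/", "folder-c/"} {
        jobs <- folder
    }
    close(jobs)

    // Block until every worker has finished.
    wg.Wait()
}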
Code Structure and Functions
1. Counting Files in a Single S3 Folder
The CountFilesInS3Folder function counts the files in a single folder by listing objects under a specific prefix (folder path). It excludes the folder itself (often represented by an object of size 0) and, because a single ListObjectsV2 call returns at most 1,000 objects, pages through the results with a continuation token.
// internal/count/count.go
func CountFilesInS3Folder(client *s3.Client, bucket string, prefix string) {
    var count int
    var continuationToken *string
    for {
        input := &s3.ListObjectsV2Input{
            Bucket:            aws.String(bucket),
            Prefix:            aws.String(prefix),
            ContinuationToken: continuationToken,
        }
        result, err := client.ListObjectsV2(context.TODO(), input)
        if err != nil {
            log.Fatal(err)
        }
        for _, object := range result.Contents {
            // Exclude the folder itself (usually represented as an object with size 0)
            if *object.Size > 0 {
                count++
            }
        }
        if !*result.IsTruncated {
            break // No more objects to retrieve
        }
        // Fetch the next page of results
        continuationToken = result.NextContinuationToken
    }
    fmt.Printf("Total files in folder %s: %d\n", prefix, count)
}
2. Queueing Folder Paths as Jobs
The QueueJob function reads a file containing folder paths (one per line) and sends each path to the jobs channel.
// internal/queue/queue.go
func QueueJob(fileName string, jobs chan string) {
    file, err := os.Open(fileName)
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        job := scanner.Text()
        jobs <- job
    }
    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
}
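For illustration, the input file is just a plain-text list of S3 prefixes, one per line. The folder names below are hypothetical:

team-a/raw/
team-a/processed/
team-b/2024/reports/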
3. Worker Pool for Concurrent Execution
The Worker function retrieves folder paths from the jobs channel and runs CountFilesInS3Folder for each one. The number of concurrent workers is set by workerPool.
// internal/worker/worker.go
func Worker(s3c *s3.Client, bucket string, jobs chan string, wg *sync.WaitGroup) {
    defer wg.Done()
    for job := range jobs {
        count.CountFilesInS3Folder(s3c, bucket, job)
    }
}
4. Starting the Workers
In the main function, workerPool is the maximum number of goroutines that will be created to run the count jobs. This value is passed in via the -w flag.
// cmd/counts3/main.go
for i := 1; i <= workerPool; i++ {
    wg.Add(1)
    go worker.Worker(s3Client, bucketName, jobs, &wg)
}
Benefits and Results
Using Go’s concurrency model, I reduced the file-counting time significantly. Instead of waiting for each folder to be processed one by one, the worker pool counts many folders at once, so the overall run finishes far sooner.
Efficient File Counting in AWS S3 with Go Concurrency was originally published in Government Digital Products, Singapore on Medium.