Using Bulk Processor

What is Bulk Processor?

Bulk Processor is a high-performance application built to handle bulk audio processing tasks. It efficiently retrieves audio files from S3/S3-compatible storage systems, forwards them to a Data Flow project, and manages execution through a robust, parallel job-processing architecture.

It is designed to work in harmony with the Data Flow system, enabling seamless audio file analysis, centralized credential management, and dynamic job control, while ensuring reliability and performance in production environments.

It can be accessed from the Bulk Processor menu under Data Flow:

Main Concepts

Job

A job in Bulk Processor is a self-contained unit of work responsible for the complete lifecycle of processing a batch of audio files. It covers:

  • Fetching audio files from S3-compatible storage (e.g., AWS S3, MinIO)
  • Passing those files to Data Flow for analysis
  • Tracking progress, status, and results
  • Managing retries, prioritization, and recovery if the system is interrupted

Multiple jobs can run concurrently, allowing efficient parallel processing.
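
To make the lifecycle concrete, here is a hypothetical sketch in Python of what a job record might track; the field names are illustrative and are not Bulk Processor's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Job:
    job_id: str
    storage_address: str          # S3-compatible source of the audio files
    project_name: str             # target Data Flow project
    status: str = "created"       # e.g. created / running / paused / terminated
    processed: int = 0            # files successfully sent for analysis
    failed: int = 0               # files pending retry or reported as failed
    total: Optional[int] = None   # discovered while fetching
```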

Job Controls

Bulk Processor supports internal control operations that allow the system to manage the execution state of jobs during their lifecycle. These include:

  • Start: Initiates a job.
  • Pause: Temporarily halts an active job. Processing can be resumed later without restarting.
  • Continue: Resumes a previously paused job from the point it was paused.
  • Terminate: Stops the job immediately and marks it as terminated. Remaining tasks are not processed.

These controls enable flexible and safe handling of job execution based on system behavior, resource availability, or operational needs.
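
The four controls imply a simple state machine. Below is a minimal Python sketch assuming illustrative state names; the actual internal states are not part of the public documentation:

```python
from enum import Enum

class JobState(Enum):
    CREATED = "created"
    RUNNING = "running"
    PAUSED = "paused"
    TERMINATED = "terminated"

# Allowed (control, current state) -> next state transitions
TRANSITIONS = {
    ("start", JobState.CREATED): JobState.RUNNING,
    ("pause", JobState.RUNNING): JobState.PAUSED,
    ("continue", JobState.PAUSED): JobState.RUNNING,
    ("terminate", JobState.RUNNING): JobState.TERMINATED,
    ("terminate", JobState.PAUSED): JobState.TERMINATED,
}

def apply_control(state: JobState, control: str) -> JobState:
    try:
        return TRANSITIONS[(control, state)]
    except KeyError:
        raise ValueError(f"Cannot {control} a job in state {state.value}")
```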

Job Flows

Each job has two main parallel flows, sketched below:

  1. Fetch
    • Downloads audio files from S3-compatible storage
    • Stores a subset of files in memory for quick access

  2. Process
    • Forwards the in-memory audio files to a Data Flow project
    • Uses an intelligent strategy to optimize throughput and responsiveness
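
A minimal Python sketch of this producer/consumer pattern, using a bounded queue as the in-memory buffer; this illustrates the concept rather than Bulk Processor's actual implementation, and `download` and `send_to_data_flow` stand in for the real transfer functions:

```python
import queue
import threading

buffer = queue.Queue(maxsize=32)  # holds a subset of files in memory

def fetch_flow(keys, download):
    """Producer: download audio files and place them in the buffer."""
    for key in keys:
        buffer.put((key, download(key)))  # blocks when the buffer is full
    buffer.put(None)                      # sentinel: no more files

def process_flow(send_to_data_flow):
    """Consumer: forward buffered audio files to the Data Flow project."""
    while (item := buffer.get()) is not None:
        key, audio_bytes = item
        send_to_data_flow(key, audio_bytes)

def run_job(keys, download, send_to_data_flow):
    """Run both flows in parallel within a single job."""
    f = threading.Thread(target=fetch_flow, args=(keys, download))
    p = threading.Thread(target=process_flow, args=(send_to_data_flow,))
    f.start(); p.start()
    f.join(); p.join()
```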

Fetcher

The Fetcher is the component responsible for discovering and retrieving audio files from your S3‑compatible storage so that they can be handed off to the Processor. It sits at the front of the Fetch flow and drives your job’s Fetch → Process pipeline.
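
For intuition, the following Python sketch shows how such a fetch step could be done with boto3 against an S3-compatible endpoint; it is an illustration under those assumptions, not the Fetcher's actual code:

```python
import boto3

def fetch_audio_keys(endpoint_url, bucket, prefix, access_key, secret_key):
    """List audio object keys under a prefix in S3-compatible storage."""
    s3 = boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith((".wav", ".mp3", ".flac")):
                yield obj["Key"]
```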

Processor

The Processor is the component responsible for sending audio files to a Data Flow project for analysis.

It requires the following configuration:

  • Project Name: The target Data Flow project to send the audio to.
  • Project Version: The specific version of the Data Flow project to use.
  • Project Parameters (optional): A JSON object of custom parameters to tailor the behavior of the project during processing.

These parameters allow fine-tuned control over how the Data Flow project handles the incoming audio.

For more details on supported parameters and their usage, refer to the Data Flow Documentation.
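
An illustrative Processor configuration, expressed here as a Python dict; the project name and parameter keys below are hypothetical placeholders, and the real parameter set is defined by your Data Flow project:

```python
processor_config = {
    "project_name": "call-quality-analysis",  # hypothetical project
    "project_version": "1.2.0",
    "project_parameters": {                   # optional custom parameters
        "language": "en-US",                  # placeholder keys; see the
        "diarization": True,                  # Data Flow Documentation
    },
}
```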

High Availability

Bulk Processor is designed with high availability in mind and can be deployed as multiple pods within the same Kubernetes cluster.

Key behaviors:

  • Horizontal scaling: Multiple Bulk Processor pods can run concurrently to handle large processing loads.
  • Load distribution: New jobs are automatically forwarded to the most suitable (least loaded) pod.
  • Failover safety: If a pod crashes, its unfinished jobs are quickly recovered by another available instance, thanks to the presence-awareness mechanism.

This architecture ensures resilience, efficient resource utilization, and minimal disruption during updates or failures.
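
As a toy illustration of least-loaded placement (the real presence-awareness and failover logic is internal to Bulk Processor, and the pod names below are made up):

```python
def pick_pod(pods: dict) -> str:
    """Return the pod with the fewest active jobs.

    pods maps pod name -> current active job count.
    """
    return min(pods, key=pods.get)

print(pick_pod({"bulk-proc-0": 5, "bulk-proc-1": 2, "bulk-proc-2": 7}))
# -> bulk-proc-1
```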

Storage Address

The storage address is parsed by the web application into its endpoint, bucket name, and key prefix, as sketched after the examples below.

Examples of storage addresses:

  • http://minioaddress.com/bucketname/prefix/some/folder/names
  • https://minioaddress.com/bucketname/prefix/some/folder/names
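
A sketch of how such an address can be split into endpoint, bucket, and key prefix using Python's standard library; the web application's actual parsing rules may differ:

```python
from urllib.parse import urlparse

def parse_storage_address(address: str):
    url = urlparse(address)
    endpoint = f"{url.scheme}://{url.netloc}"
    bucket, _, prefix = url.path.lstrip("/").partition("/")
    return endpoint, bucket, prefix

print(parse_storage_address(
    "https://minioaddress.com/bucketname/prefix/some/folder/names"))
# -> ('https://minioaddress.com', 'bucketname', 'prefix/some/folder/names')
```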

AWS Signature Credential

AWS Signature Credential refers to the authentication mechanism used by AWS to secure and authorize API requests, most commonly using Signature Version 4 (SigV4).

Bulk Processor uses these credentials to securely access S3-compatible storage. It only supports credentials that are securely created and managed within the Data Flow system.
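
For reference, this is how a SigV4-signed boto3 client for S3-compatible storage is typically configured; the endpoint and credential values below are placeholders, since in practice Bulk Processor obtains them from Data Flow:

```python
import boto3
from botocore.config import Config

s3 = boto3.client(
    "s3",
    endpoint_url="https://minioaddress.com",           # placeholder endpoint
    aws_access_key_id="<access-key-from-data-flow>",   # placeholder values
    aws_secret_access_key="<secret-key-from-data-flow>",
    config=Config(signature_version="s3v4"),           # Signature Version 4
)
s3.list_objects_v2(Bucket="bucketname", Prefix="prefix/")
```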
