Dhravani is a web-based application developed under the "Center of Indian Language Data" project for creating speech corpora for Automatic Speech Recognition (ASR). The platform streamlines the creation and management of audio datasets: users record, manage, and organize voice recordings together with their transcriptions.
Users record audio from provided transcripts, and the application stores both transcripts and recording metadata in PostgreSQL tables. Moderators then verify recordings for quality control, after which validated content is transferred to HuggingFace, either through a manual trigger or at a scheduled synchronization time. Only moderator-verified recordings reach the published dataset.
git clone
cd dataset-preparation-tool
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
docker build -t dataset-preparation .
docker run -p 7860:7860 -v dataset_volume:/app/datasets dataset-preparation
Note: The -v dataset_volume:/app/datasets option mounts a volume for persistent datasets, preserving your data.
Create a .env file in the root directory with the following configuration:
# Security
FLASK_SECRET_KEY=your_secure_secret_key
JWT_SECRET_KEY=${FLASK_SECRET_KEY} # Defaults to FLASK_SECRET_KEY
SUPER_ADMIN_PASSWORD=your_secure_admin_password
SUPER_USER_EMAILS=admin1@example.com,admin2@example.com
ENABLE_AUTH=true
# Database and Services
POSTGRES_URL=postgresql://user:password@localhost:5432/dataset_db
POCKETBASE_URL=http://localhost:8090
HF_TOKEN=your_huggingface_token
HF_REPO_ID=your_username/your_dataset
# Storage Configuration
SAVE_LOCALLY=true
DATASET_BASE_DIR=/app/datasets
TEMP_FOLDER=./temp
# Batch Processing
TRANSCRIPT_BATCH_SIZE=100
SYNC_MEMORY_LIMIT_MB=1024
UPLOAD_CHUNK_SIZE=8388608 # 8MB in bytes
UPLOAD_BATCH_SIZE=10
MAX_UPLOAD_WORKERS=4
MAX_UPLOAD_RETRIES=3
# Network Settings
NETWORK_TIMEOUT=30 # seconds
FLASK_PORT=7860
# Sync Schedule (UTC)
SYNC_HOUR=2
SYNC_MINUTE=0
SYNC_TIMEZONE=UTC
Important: Replace the placeholder values with your actual configuration parameters. Never commit sensitive credentials to version control.
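As an illustration of how these variables might be consumed, here is a minimal sketch in Python. The `Settings` class and `load_settings` function are hypothetical names, not part of Dhravani, and the fallback values simply reuse the sample values shown above. The only behavior taken from the configuration itself is that JWT_SECRET_KEY falls back to FLASK_SECRET_KEY.

```python
import os
from dataclasses import dataclass


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings such as 'true', '1', or 'yes'."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for a subset of the variables above."""
    flask_secret_key: str
    jwt_secret_key: str
    enable_auth: bool
    transcript_batch_size: int
    upload_chunk_size: int
    sync_hour: int
    sync_minute: int


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping; uses os.environ when none is given."""
    env = os.environ if env is None else env
    secret = env["FLASK_SECRET_KEY"]  # required: raises KeyError if missing
    return Settings(
        flask_secret_key=secret,
        # Falls back to FLASK_SECRET_KEY, as the sample configuration notes.
        jwt_secret_key=env.get("JWT_SECRET_KEY", secret),
        # The defaults below are assumptions copied from the sample values.
        enable_auth=_as_bool(env.get("ENABLE_AUTH", "true")),
        transcript_batch_size=int(env.get("TRANSCRIPT_BATCH_SIZE", "100")),
        upload_chunk_size=int(env.get("UPLOAD_CHUNK_SIZE", str(8 * 1024 * 1024))),
        sync_hour=int(env.get("SYNC_HOUR", "2")),
        sync_minute=int(env.get("SYNC_MINUTE", "0")),
    )
```

Passing a plain dictionary instead of relying on `os.environ` keeps the loader easy to test. Note that the numeric parsing assumes the values arrive without the trailing inline comments shown in the sample file.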
flask run --host=0.0.0.0 --port=7860
Open your preferred web browser and navigate to http://localhost:7860 (or the appropriate Docker address, if applicable).
The application adopts a three-tier architecture:
User Authentication (A): Authentication supports a four-tier role hierarchy. The Super Admin has complete system access and can add Admins after verifying the SUPER_ADMIN_PASSWORD. Admins manage moderators and system processes. Moderators validate content, and regular users contribute recordings through the platform.
Data Processing (B): At the core of the system, PostgreSQL tables store both transcripts and metadata. The application organizes audio files in language-specific structures, implementing a comprehensive quality control workflow managed by moderators. This phase also handles preparation for HuggingFace synchronization, ensuring data integrity throughout the process.
Dataset Publishing (C): The final stage involves organizing validated recordings in structured, language-specific directories. The system generates and maintains metadata parquet files for efficient data management. Content synchronization with HuggingFace occurs either through scheduled automated processes or manual triggers, making the verified datasets publicly accessible.
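The role hierarchy from stage A can be sketched as an ordered enumeration. This is a minimal illustration, not Dhravani's actual access-control code: the `Role` enum, `has_access`, and `can_assign_role` are hypothetical names, and the sketch assumes that a higher role inherits the permissions of every lower one.

```python
from enum import IntEnum


class Role(IntEnum):
    """Hypothetical ordering of the four roles, lowest to highest."""
    USER = 1
    MODERATOR = 2
    ADMIN = 3
    SUPER_ADMIN = 4


def has_access(user_role: Role, required: Role) -> bool:
    """Assumption: a higher role inherits all permissions of lower roles."""
    return user_role >= required


def can_assign_role(actor: Role, target: Role) -> bool:
    """Only admins and above assign roles, and only roles below their own.

    This matches the description: the Super Admin adds Admins,
    and Admins manage moderators.
    """
    return actor >= Role.ADMIN and actor > target
```

Under these assumptions, `can_assign_role(Role.ADMIN, Role.MODERATOR)` is allowed while `can_assign_role(Role.ADMIN, Role.ADMIN)` is not, so only the Super Admin can create Admins.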
Authentication routes (auth_middleware.py, PocketBase)
/auth/callback (POST): PocketBase authentication callback endpoint, responsible for storing user sessions and tokens.
/login (GET): Renders the login page.
/logout (GET): Logs out the current user, clearing all session and authentication cookies.
/token/refresh (GET): Refreshes the access token for continued authenticated access.

Main application routes (app.py)
/start_session (POST): Starts a new recording session, initializing the AudioDatasetPreparator. CSRF protected.
/next_transcript (GET): Retrieves the next transcription from the LazyTranscriptLoader.
/prev_transcript (GET): Retrieves the previous transcription.
/skip_transcript (GET): Skips the current transcription and retrieves the subsequent one.
/save_recording (POST): Saves the audio recording and associated metadata. CSRF protected. Performs necessary audio processing and storage.
/languages (GET): Retrieves a list of supported languages (defined in language_config.py).

Validation routes (validation_route.py, moderator access required)
/validation/ (GET): Renders the validation interface for moderators.
/validation/api/recordings (GET): Retrieves recordings for validation, supporting pagination and filtering options.
/validation/api/verify/ (POST): Verifies or rejects a specific recording. CSRF protected.
/validation/api/audio/ (GET): Serves a specific audio file.
/validation/api/delete/ (DELETE): Deletes a recording. CSRF protected.
/validation/api/next_recording (GET): Retrieves the next recording for validation, utilizing the assign_recording function.

Admin routes (admin_routes.py, admin access required)
/admin/ (GET): Renders the admin interface.
/admin/submit (POST): Submits transcriptions from either a file upload or direct text input.
/admin/users/moderators (GET): Retrieves a list of all moderators.
/admin/users/search (GET): Allows searching for a user by email address.
/admin/users//role (POST): Updates the role of a specific user.
/admin/sync/status (GET): Checks the current status of the dataset synchronization process.
/admin/sync (POST): Manually triggers a dataset synchronization.

Super admin routes (super_admin.py, super admin access required)
/admin/super/ (GET): Renders the super admin interface.
/admin/super/verify (POST): Verifies the super admin password for sensitive operations.
/admin/super/admins (GET): Retrieves a list of all admin users.
/admin/super/users/search (GET): Allows searching for a user by email address.
/admin/super/users//role (POST): Updates the role of a specific user.

{
"id": "string",
"email": "string",
"name": "string",
"role": "user" | "moderator" | "admin",
"is_moderator": boolean,
"gender": "M" | "F" | "O" | null,
"age_group": "Teenagers" | "Adults" | "Elderly" | null,
"country": "string" | null,
"state_province": "string" | null,
"city": "string" | null,
"accent": "Rural" | "Urban" | null,
"language": "string" | null
}
# List/Search rule - Only admins can list all users, users can only see their own record
(@request.auth.role = "admin") || (@request.auth.id = id)
# View rule - Only admins can view any user, users can only view their own record
(@request.auth.role = "admin") || (@request.auth.id = id)
# Update rule - Admins can update any user, users/moderators can only update their own record without changing role
(
@request.auth.role = "admin"
) || (
(@request.auth.role = "user" || @request.auth.role = "moderator") &&
@request.auth.id = id &&
role = role
)
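To make the intent of the update rule easier to follow, here is a minimal Python sketch of the same decision. The function `can_update_user` and its dictionary arguments are hypothetical. It models the intent described in the rule's comment (admins update anyone; users and moderators update only their own record and cannot change their role), not PocketBase's own evaluation of the expression.

```python
def can_update_user(auth: dict, record: dict, changes: dict) -> bool:
    """Return True if the authenticated user may apply `changes` to `record`.

    auth:    the requester, with "id" and "role" keys
    record:  the stored user record, with "id" and "role" keys
    changes: the fields the request wants to modify
    """
    # Admins can update any user.
    if auth.get("role") == "admin":
        return True

    # Users and moderators may only touch their own record...
    is_self = auth.get("id") == record.get("id")
    # ...and the role must stay exactly as stored.
    requested_role = changes.get("role", record.get("role"))
    role_unchanged = requested_role == record.get("role")

    return auth.get("role") in {"user", "moderator"} and is_self and role_unchanged
```

For example, a moderator changing their own `city` passes the check, while the same moderator submitting `{"role": "admin"}` is refused.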
{
"id": integer,
"user_id": "string",
"audio_filename": "string",
"transcription_id": integer,
"speaker_name": "string",
"speaker_id": "string",
"audio_path": "string",
"sampling_rate": integer,
"duration": float,
"language": "string",
"gender": "string",
"country": "string",
"state": "string",
"city": "string",
"status": "pending" | "verified" | "rejected",
"verified_by": "string" | null,
"username": "string",
"age_group": "string",
"accent": "string",
"transcription": "string"
}
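The status field drives the moderation workflow, which the following sketch illustrates with a subset of the fields above. The `Recording` dataclass and the `review` and `ready_for_sync` helpers are hypothetical names, and recording the moderator in `verified_by` on rejection as well as approval is an assumption.

```python
from dataclasses import dataclass, replace
from typing import List, Optional

VALID_STATUSES = {"pending", "verified", "rejected"}


@dataclass(frozen=True)
class Recording:
    """Hypothetical model holding a subset of the recording fields."""
    id: int
    user_id: str
    audio_filename: str
    language: str
    duration: float
    status: str = "pending"
    verified_by: Optional[str] = None

    def __post_init__(self):
        # Reject any status outside the three values in the schema.
        if self.status not in VALID_STATUSES:
            raise ValueError(f"invalid status: {self.status!r}")


def review(rec: Recording, moderator_id: str, approve: bool) -> Recording:
    """Return a copy marked verified or rejected by the given moderator."""
    if rec.status != "pending":
        raise ValueError("only pending recordings can be reviewed")
    new_status = "verified" if approve else "rejected"
    return replace(rec, status=new_status, verified_by=moderator_id)


def ready_for_sync(records: List[Recording]) -> List[Recording]:
    """Keep only verified recordings, the ones eligible for publishing."""
    return [r for r in records if r.status == "verified"]
```

Keeping the record immutable and returning a reviewed copy makes each status change explicit, which suits a workflow where only verified recordings should ever be synchronized.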
AudioDatasetPreparator (prepare_dataset.py): Manages local audio storage, processes audio files, and handles metadata operations.
LazyTranscriptLoader (lazy_loader.py): Loads transcriptions in batches to optimize memory usage, especially with large datasets.
DatasetSynchronizer (dataset_sync.py): Orchestrates the dataset synchronization process with Hugging Face Hub, ensuring data integrity.
update_parquet_files (prepare_parquet.py): Updates Parquet files with the latest verified records for each language.
store_metadata (database_manager.py): Persists recording metadata in the PostgreSQL database.
assign_recording (database_manager.py): Assigns a recording to a moderator for validation purposes.
verify_password_secure (super_admin.py): Securely verifies the super admin password, preventing timing attacks.
set_security_headers (security_middleware.py): Sets security headers to protect against common web vulnerabilities.
csrf_protect (security_middleware.py): Provides CSRF protection for data-modifying routes.
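
As an illustration of the timing-attack protection mentioned for verify_password_secure, here is a minimal sketch of a constant-time comparison using Python's standard library. It is not the project's implementation: the function name `verify_password` is hypothetical, and hashing both values before comparing is one common approach assumed here.

```python
import hashlib
import hmac


def verify_password(supplied: str, expected: str) -> bool:
    """Compare two passwords without leaking where they first differ.

    Hashing both values first yields equal-length digests, so the
    comparison does not reveal the length of the expected password.
    """
    supplied_digest = hashlib.sha256(supplied.encode("utf-8")).digest()
    expected_digest = hashlib.sha256(expected.encode("utf-8")).digest()
    # hmac.compare_digest takes time independent of the position
    # of the first mismatching byte, unlike the == operator.
    return hmac.compare_digest(supplied_digest, expected_digest)
```

A plain `==` comparison can return as soon as it finds a differing character, which lets an attacker infer correct prefixes from response times. `hmac.compare_digest` avoids that shortcut.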