Load Dataset
Load datasets from JSON, CSV, or URLs.
Signature
def load_dataset(source: Union[str, Path, Dict[str, Any]], **kwargs) -> Dataset:
...
Parameters
- source (Union[str, Path, Dict[str, Any]]): File path, URL, or dictionary data for the dataset.
- kwargs: Additional parameters for dataset creation (e.g.,
name,metadata).
Supported Formats
Overview of the various file and data formats supported for dataset loading.
- JSON files (
.json) - CSV files (
.csv) - URLs (http/https)
- Python dictionaries (passed directly as
source)
Usage
Examples demonstrating how to load datasets from different sources.
from benchwise import load_dataset
# From JSON file
dataset = load_dataset("data/qa_dataset.json")
# From CSV file
dataset = load_dataset("data/qa_dataset.csv")
# From URL
dataset = load_dataset("https://example.com/dataset.json")
# From Python dictionary
dataset = load_dataset(
source={
"name": "my_dict_dataset",
"data": [
{"question": "Dict Q1", "answer": "Dict A1"},
{"question": "Dict Q2", "answer": "Dict A2"}
]
}
)
JSON Format
Example of the expected JSON structure for loading datasets.
{
"name": "my_dataset",
"data": [
{"question": "What is AI?", "answer": "Artificial Intelligence"}
],
"metadata": {"version": "1.0"}
}
CSV Format
Example of the expected CSV structure for loading datasets.
question,answer
What is AI?,Artificial Intelligence
What is ML?,Machine Learning
Error Handling
Strategies for handling potential errors during dataset loading.
Handle potential errors when loading datasets:
from benchwise import load_dataset
from benchwise.exceptions import DatasetError
try:
dataset = load_dataset("data/qa_dataset.json")
except FileNotFoundError:
print("Dataset file not found")
except DatasetError as e:
print(f"Error loading dataset: {e}")
except Exception as e:
print(f"Unexpected error: {e}")
Validating Loaded Data
Check dataset integrity after loading:
dataset = load_dataset("data/qa_dataset.json")
# Validate dataset has data
if not dataset.data:
raise ValueError("Dataset is empty")
# Check for required fields
if hasattr(dataset, 'prompts') and not dataset.prompts:
raise ValueError("Dataset has no prompts")
print(f"Loaded {len(dataset.data)} samples")