AI-generated financial datasets

Want to build apps with LLMs? Then you need realistic, high-quality datasets to evaluate them. Obtaining such datasets, however, can be challenging and time-consuming. Why not use LLMs to build the datasets for you?

Enter financial-datasets, an open-source Python library that simplifies generating financial question-and-answer datasets using LLMs. With just a few lines of code, you can generate realistic datasets from various financial texts, including SEC filings such as 10-Ks and 10-Qs, as well as PDFs and other documents.

To demonstrate the power and simplicity of financial-datasets, let's take a look at an example of how to use it:

from financial_datasets.generator import DatasetGenerator

# Create dataset generator
generator = DatasetGenerator(model="gpt-3.5-turbo-0125", api_key="your-openai-key")

# Generate dataset from 10-K
dataset = generator.generate_from_10K(
    ticker="AAPL",
    year=2023,
    max_questions=100,
)

Generate up to 100 questions about Apple's 10-K

In this example, we start by importing the DatasetGenerator class from the financial_datasets.generator module. We then create an instance of the generator, specifying the desired LLM model and providing our OpenAI API key.

Next, we use the generate_from_10K method to generate a dataset based on Apple Inc.'s (AAPL) 10-K filing for the year 2023. We set the max_questions parameter to 100, indicating that we want to generate up to 100 question-answer pairs from the filing.

And that's it! With just a few lines of code, we have generated a synthetic financial dataset that can be used for testing, training, or analysis purposes.
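As a rough illustration of the testing use case, the sketch below scores an application against a list of question-answer pairs. It assumes the generated items have been converted to plain dictionaries with "question" and "answer" keys. That layout, along with the names exact_match_score and toy_app, is hypothetical and not part of financial-datasets.

```python
from typing import Callable, Dict, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count as misses."""
    return " ".join(text.lower().split())


def exact_match_score(pairs: List[Dict[str, str]], answer_fn: Callable[[str], str]) -> float:
    """Return the fraction of questions whose predicted answer matches the reference."""
    if not pairs:
        return 0.0
    hits = sum(
        1
        for pair in pairs
        if normalize(answer_fn(pair["question"])) == normalize(pair["answer"])
    )
    return hits / len(pairs)


# Hand-written placeholder pairs, standing in for generated ones.
sample_pairs = [
    {"question": "What form is filed annually?", "answer": "Form 10-K"},
    {"question": "What form is filed quarterly?", "answer": "Form 10-Q"},
]


def toy_app(question: str) -> str:
    """Stand-in for the LLM app under test; it always gives the same answer."""
    return "form 10-k"


score = exact_match_score(sample_pairs, toy_app)
print(score)
```

Exact match is a deliberately crude metric. For free-form financial answers you would likely want a more forgiving comparison, but the overall loop stays the same.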

Note that generate_from_10K is a convenience method for 10-K filings. Under the hood, it uses edgartools. For other SEC forms, you can use edgartools directly to fetch the filing text and then generate questions from it.
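If you fetch filing text yourself, it may be too long to work with as a single string, so one plausible preparation step is to split it into smaller passages first. The sketch below is an assumption about how you might do that, not the library's own chunking logic. The names split_into_chunks and max_chars are hypothetical, and the sample text is a placeholder rather than real filing content.

```python
from typing import List


def split_into_chunks(text: str, max_chars: int = 2000) -> List[str]:
    """Group paragraphs into chunks of at most roughly max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than cut mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow, so close the current chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Placeholder text standing in for a filing fetched elsewhere.
filing_text = (
    "Item 1. Business\n\n"
    + "Paragraph about products. " * 10
    + "\n\nItem 1A. Risk Factors"
)
chunks = split_into_chunks(filing_text, max_chars=100)
print(len(chunks))
```

The resulting list of strings can then serve as the source texts for question generation.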

Moreover, financial-datasets offers customizable data generation options, enabling you to fine-tune the generated datasets to suit your specific needs. You can control parameters such as the number of questions, the complexity of the generated data, and more.

In conclusion, financial-datasets is a powerful and user-friendly Python library that revolutionizes the way we generate financial datasets using LLMs. Its ease of use, compatibility with various financial documents, and customizable options make it an invaluable tool for anyone working with financial data and LLMs. If you're looking to streamline your financial data generation process and unlock new possibilities in your projects, give financial-datasets a try today!

pip install financial-datasets
Dwight Gunning
