Making dummy data with Python for dummies

Quick tutorial on how I make dummy data with Python & the OpenAI API.

Part I: Situation

My goal with dummy data is to share my work publicly without exposing any client or personal data. It’s usually pretty easy to make dummy data myself, but my social media reports used a lot of data.

In this case, I needed to create over a dozen different files of test data in both JSON & CSV formats. The JSON was nested & pretty complex, and the CSV was complex too.

Instead of using numpy, asking ChatGPT to make it, or just editing the data myself, I needed a way to automate generating that much data.

The answer was simple: iterate over a directory of real files, and make a dummy file from each one using the OpenAI API.

Part II: Python

Below is the Python code! It doesn’t include the locations of the directories (source_directory and dummy_directory), and chatgpt is my own module that holds the make_dummy function from Part III.

import os
import time

import chatgpt  # local module that defines make_dummy (see Part III)


def bulk_dummy():
    """Makes new dummy files using the source_files directory"""
    # Ensure the dummy_files directory exists
    os.makedirs(dummy_directory, exist_ok=True)

    # Iterate over files in the source directory
    for filename in os.listdir(source_directory):
        # Get the full file path
        file_path = os.path.join(source_directory, filename)

        # Check the file extension to determine the file type
        if filename.endswith('.csv'):
            file_type = 'csv'
        elif filename.endswith('.json'):
            file_type = 'json'
        else:
            print(f"Skipping {filename}: Not an accepted file type.")
            continue  # Skip files that are not CSV or JSON

        # Create a dummy file
        # Retry once after 5 seconds, because ChatGPT sometimes returns broken data.
        try:
            dummy_data = chatgpt.make_dummy(file_path, file_type)
        except Exception as e:
            print(e)
            print(file_path)
            print('trying again in 5 sec')
            time.sleep(5)
            try:
                dummy_data = chatgpt.make_dummy(file_path, file_type)
            except Exception as e:
                print(e)
                print(file_path)
                continue  # Give up on this file so missing or stale data is never written

        # Define the output path
        dummy_filename = f"dummy_{filename}"
        dummy_file_path = os.path.join(dummy_directory, dummy_filename)

        # Export the dummy data back to the appropriate format
        if file_type == 'csv':
            dummy_data.to_csv(dummy_file_path, index=False)
        elif file_type == 'json':
            dummy_data.to_json(dummy_file_path, orient='records')

        print(f"Dummy file created: {dummy_file_path}")

    print('\nDone!')

Comments

It’s pretty dumb, but it works.

Sometimes ChatGPT would return JSON that was broken or incorrect, which would break the code. To fix this, I just wait 5 sec and then re-request the dummy data. This somehow worked by the 2nd or 3rd try.
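
The wait-and-retry idea above can also be pulled into a small helper, so the number of attempts is not hard-coded. This is only a sketch: call_with_retries, attempts, and delay are names I made up for illustration, not part of the original script, and it assumes any exception is worth retrying.

```python
import time


def call_with_retries(func, *args, attempts=3, delay=5, **kwargs):
    """Call func(*args, **kwargs), waiting `delay` seconds between failed tries.

    Returns the first successful result. If every attempt fails, the last
    exception is re-raised so the caller can decide to skip the file.
    (Hypothetical helper, not part of the original script.)
    """
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception as error:
            print(f"Attempt {attempt} of {attempts} failed: {error}")
            if attempt == attempts:
                raise  # out of attempts: let the caller handle it
            time.sleep(delay)
```

Inside the loop this would probably look like dummy_data = call_with_retries(chatgpt.make_dummy, file_path, file_type), wrapped in a try/except that skips the file if it still fails.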

Part III: OpenAI API

Behold my ChatGPT instructions code. I got here through trial & error until it returned data that worked for me.

Take note at the end how I use StringIO to convert the string response into a dataframe.

from io import StringIO

import pandas as pd

# client (an OpenAI client instance) and model are defined elsewhere.


def make_dummy(source, file_type):
    """Ask ChatGPT to make a dummy file based on existing data."""
    # source is a file path, so read the file to put its contents in the prompt
    with open(source, encoding="utf-8") as f:
        source_data = f.read()

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Your job is to make dummy sample data. Never provide anything other than the data."},
            {"role": "user",
             "content": f"Please create a dummy file of type {file_type} that resembles the below source data. Do NOT create new columns and do NOT use any data from the source file. Make sure to provide the data as {file_type}. NOTE: Provide only the data and nothing else in your response. NOTE: If type JSON, do NOT use triple quotes. Just return the json starting with the opening bracket.\n\n"
                        f"{source_data}"},
        ],
        temperature=0.75,
        max_tokens=300,  # responses longer than this are cut off
    )

    dummy_data = response.choices[0].message.content.strip()

    # StringIO wraps the response string so pandas can read it like a file
    if file_type == 'csv':
        return pd.read_csv(StringIO(dummy_data))
    elif file_type == 'json':
        return pd.read_json(StringIO(dummy_data))
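
The prompt asks for no triple quotes, but the response can still come back wrapped in a code fence or cut off partway. A check like the one below could run on the raw string before it goes to pandas. It is a sketch under assumptions: clean_response is a hypothetical helper, it assumes the wrapper is a Markdown-style fence, and it only uses the standard library (json, csv, and the same StringIO trick).

```python
import csv
import json
from io import StringIO

FENCE = "`" * 3  # three backticks, the Markdown code fence marker


def clean_response(text, file_type):
    """Strip a surrounding code fence, then check that the text parses.

    Returns the cleaned text, or raises ValueError if it is not valid JSON
    or is a CSV whose rows have different lengths.
    (Hypothetical helper, not part of the original script.)
    """
    text = text.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines).strip()

    if file_type == "json":
        try:
            json.loads(text)
        except json.JSONDecodeError as error:
            raise ValueError(f"Response is not valid JSON: {error}") from error
    elif file_type == "csv":
        # StringIO lets csv.reader treat the string like an open file
        rows = list(csv.reader(StringIO(text)))
        if not rows or any(len(row) != len(rows[0]) for row in rows):
            raise ValueError("Response is not a consistent CSV table")
    return text
```

Raising ValueError here means a bad response fails the same way as any other error, so the retry in bulk_dummy would pick it up without extra handling.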

The End

Worked for me. Hope it helps someone.

