Quick tutorial on how I make dummy data with Python & the OpenAI API.
Part I: Situation
My goal with dummy data is to share my work publicly without exposing any client or personal data. It’s usually pretty easy to make dummy data myself, but my social media reports used a lot of data.
In this case, I needed to create over a dozen different files of test data in both JSON & CSV formats. The JSON was nested & pretty complex, and the CSV was complex too.
Instead of using numpy, asking ChatGPT to make each file by hand, or just editing the data myself, I needed a way to automate generating that much data.
The answer was simple: iterate over a directory of real files, and make a dummy file from each one using the OpenAI API.
Part II: Python
Below is the Python code! It doesn’t include the locations of the directories: source_directory and dummy_directory are set elsewhere, and chatgpt is where make_dummy from Part III lives.

import os
import time

# source_directory, dummy_directory and chatgpt (which holds make_dummy
# from Part III) are defined elsewhere.

def bulk_dummy():
    """Makes new dummy files using the source_files directory"""
    # Ensure the dummy_files directory exists
    os.makedirs(dummy_directory, exist_ok=True)
    # Iterate over files in the source directory
    for filename in os.listdir(source_directory):
        # Get the full file path
        file_path = os.path.join(source_directory, filename)
        # Check the file extension to determine the file type
        if filename.endswith('.csv'):
            file_type = 'csv'
        elif filename.endswith('.json'):
            file_type = 'json'
        else:
            print(f"Skipping {filename}: Not an accepted file type.")
            continue  # Skip files that are not CSV or JSON
        # Create a dummy file
        # Try & except because sometimes ChatGPT returns data that won't parse.
        try:
            dummy_data = chatgpt.make_dummy(file_path, file_type)
        except Exception as e:
            print(e)
            print(file_path)
            print('trying again in 5 sec')
            time.sleep(5)
            try:
                dummy_data = chatgpt.make_dummy(file_path, file_type)
            except Exception as e:
                # Second failure: report it and skip this file, so we never
                # export a missing or stale dummy_data from an earlier file.
                print(e)
                print(file_path)
                continue
        # Define the output path
        dummy_filename = f"dummy_{filename}"
        dummy_file_path = os.path.join(dummy_directory, dummy_filename)
        # Export the dummy data back to the appropriate format
        if file_type == 'csv':
            dummy_data.to_csv(dummy_file_path, index=False)
        elif file_type == 'json':
            dummy_data.to_json(dummy_file_path, orient='records')
        print(f"Dummy file created: {dummy_file_path}")
    print('\nDone!')
Comments
It’s pretty dumb, but it works.
Sometimes ChatGPT would return JSON that was broken or incorrect, which would break the code. To fix this, I just wait 5 sec, then re-request the dummy data. This somehow worked by the 2nd or 3rd try.
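The wait-and-retry idea can also be pulled into a small helper so the number of attempts isn't hard-coded. This is only a sketch: call_with_retries, attempts, delay and sleep are made-up names for illustration, not part of the code above.

```python
import time


def call_with_retries(func, attempts=3, delay=5, sleep=time.sleep):
    """Call func() until it succeeds or the attempts run out.

    Hypothetical helper: returns func()'s result, or re-raises the last
    exception if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as e:  # broad on purpose: any API or parsing failure triggers a retry
            last_error = e
            print(f"Attempt {attempt} failed: {e}")
            if attempt < attempts:
                sleep(delay)  # wait before asking again
    raise last_error
```

Inside bulk_dummy this could wrap the request as call_with_retries(lambda: chatgpt.make_dummy(file_path, file_type)), with one try/except around it to skip the file if every attempt fails.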
Part III: OpenAI API
Behold my ChatGPT instructions code. The prompt came out of trial & error until it returned data that worked for me.
Take note at the end how I use StringIO to convert the string response into a DataFrame.
from io import StringIO

import pandas as pd

# client (the OpenAI client) and model are defined elsewhere.

def make_dummy(source, file_type):
    """Ask chatGPT to make a dummy file based on existing data."""
    # source is a file path, so read the file to put the data itself
    # (not just the path) into the prompt.
    with open(source, encoding="utf-8") as f:
        source_data = f.read()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Your job is to make dummy sample data. Never provide anything other than the data."},
            {"role": "user",
             "content": f"Please create a dummy file of type {file_type} that resembles the below source data. Do NOT create new columns and do NOT use any data from the source file. Make sure to provide the data as {file_type}. NOTE: Provide only the data and nothing else in your response. NOTE: If type JSON, do NOT use triple quotes. Just return the json starting with the opening bracket.\n\n"
                        f"{source_data}"}
        ],
        temperature=0.75,
        max_tokens=300)
    dummy_data = response.choices[0].message.content.strip()
    # StringIO wraps the string so pandas can read it like a file
    if file_type == 'csv':
        return pd.read_csv(StringIO(dummy_data))
    elif file_type == 'json':
        return pd.read_json(StringIO(dummy_data))
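Since the reply comes back as plain text, it can help to clean and sanity-check it before handing it to pandas. The sketch below strips a Markdown code fence if the model added one anyway, then checks that the text parses as JSON or CSV using only the standard library. clean_response and is_parseable are hypothetical names, and the fence handling is an assumption about what a bad reply might look like, not something taken from the code above.

```python
import csv
import json
from io import StringIO

FENCE = "`" * 3  # the Markdown code-fence marker


def clean_response(text):
    """Remove a surrounding Markdown code fence, if the reply has one."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]  # drop the opening fence (with or without a language tag)
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]  # drop the closing fence
    return "\n".join(lines).strip()


def is_parseable(text, file_type):
    """Return True if text parses as the given type ('json' or 'csv')."""
    if file_type == "json":
        try:
            json.loads(text)
            return True
        except json.JSONDecodeError:
            return False
    if file_type == "csv":
        # StringIO lets csv.reader treat the string like an open file
        rows = [row for row in csv.reader(StringIO(text)) if row]
        # Require a header plus one data row, all with the same number of columns
        return len(rows) >= 2 and all(len(row) == len(rows[0]) for row in rows)
    return False
```

In make_dummy this could run on dummy_data right before the pandas call, raising an error when is_parseable returns False so the retry in bulk_dummy kicks in.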
The End
Worked for me. Hope it helps someone.