Automating Semi-Structured Data Creation: Generating JSON and XML Data

This guide provides a simple way to create and manage semi-structured data on Windows, Linux, and macOS systems. We’ll generate car data with dependent makes and models, save it in JSON and XML formats, and ensure the script works across different operating systems.

What We’re Doing

We’ll write a Python script that generates one semi-structured car record every 15 seconds and saves it in both JSON and XML formats, each in its own folder.

Why is Semi-Structured Data Important?

  1. Flexibility: Semi-structured data like JSON and XML adapts to various data structures, which is useful in many applications.
  2. Interoperability: These formats are commonly used for data exchange between different systems and applications.

Steps to Create the Program

1. Set Up Your Project

Create a Project Folder

Windows: Open Command Prompt and run:

mkdir data_generator
cd data_generator

Linux/macOS: Open Terminal and run:

mkdir data_generator
cd data_generator

Create Subfolders

Windows/Linux/macOS: Inside the data_generator folder, create folders for JSON and XML files:

mkdir json_data xml_data
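
If you would rather not depend on shell syntax at all, the same folders can be created from Python. The following is a minimal sketch, not part of the original script: the helper name ensure_data_dirs and its base parameter are hypothetical, and it relies only on the standard pathlib module.

```python
from pathlib import Path


def ensure_data_dirs(base="."):
    """Create json_data and xml_data under base and return both paths."""
    base_path = Path(base)
    json_dir = base_path / "json_data"
    xml_dir = base_path / "xml_data"
    for folder in (json_dir, xml_dir):
        # parents=True creates any missing parent folders;
        # exist_ok=True makes repeated calls harmless.
        folder.mkdir(parents=True, exist_ok=True)
    return json_dir, xml_dir
```

Calling ensure_data_dirs() from inside data_generator gives the same layout as the mkdir command above on Windows, Linux, and macOS.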

Create the Python Script

Windows/Linux/macOS: Create a file named generate_data.py inside the data_generator folder.

2. Write the Code

Open the File: Open generate_data.py in your favorite text editor (e.g., Notepad, VS Code, Sublime Text) and add the following code:

import os
import json
import xml.etree.ElementTree as ET
import random
import time
from datetime import datetime

# Create the output directories; exist_ok=True does nothing if they already exist
os.makedirs('json_data', exist_ok=True)
os.makedirs('xml_data', exist_ok=True)

# Define makes and corresponding models
car_data = {
    'Tata': ['Nexon', 'Harrier', 'Altroz', 'Tiago', 'Safari'],
    'Maruti': ['Alto', 'Swift', 'Baleno', 'Dzire', 'Vitara Brezza'],
    'Hyundai': ['i10', 'i20', 'Creta', 'Verna', 'Venue'],
    'Honda': ['City', 'Amaze', 'WR-V', 'Jazz', 'BR-V'],
    'Toyota': ['Corolla', 'Innova', 'Fortuner', 'Yaris', 'Glanza'],
    'Ford': ['Ecosport', 'Figo', 'Endeavour', 'Mustang', 'Aspire'],
    'Chevrolet': ['Beat', 'Cruze', 'Trailblazer', 'Enjoy', 'Aveo']
}

def generate_car_data():
    make = random.choice(list(car_data.keys()))
    model = random.choice(car_data[make])
    years = ['2020', '2021', '2022', '2023']
    engine_types = ['Petrol', 'Diesel', 'Electric', 'Hybrid']
    
    data = {
        'car_id': str(random.randint(10000, 99999)),
        'make': make,
        'model': model,
        'year': random.choice(years),
        'engine': {
            'type': random.choice(engine_types),
            'displacement': f"{random.randint(1000, 3000)}cc"
        },
        'features': random.sample(['ABS', 'Airbags', 'Sunroof', 'Leather Seats', 'Navigation System'], k=random.randint(1, 5)),
        'color': random.choice(['Red', 'Blue', 'Black', 'White', 'Gray']),
        'price': f"${random.randint(15000, 30000)}",
        'owner': f"Owner{random.randint(1, 100)}",
        'service_history': f"Service record {random.randint(1, 10)}",
        'warranty': f"{random.randint(1, 5)} years",
        'insurance': f"{random.choice(['Yes', 'No'])}",
        'mileage': f"{random.randint(5000, 50000)} km",
        'location': f"City{random.randint(1, 10)}",
        'accidents': f"{random.randint(0, 3)}",
        'previous_owners': f"{random.randint(1, 3)}",
        'registration': f"Reg-{random.randint(1000, 9999)}"
    }

    # Randomly remove optional fields
    optional_fields = ['color', 'price', 'owner', 'service_history', 'warranty', 'insurance', 'mileage', 'location', 'accidents', 'previous_owners', 'registration']
    for field in optional_fields:
        if random.random() < 0.5:  # each optional field is dropped with 50% probability
            data.pop(field, None)
    
    return data

def save_json(data):
    timestamp = datetime.now().strftime('%Y%m%d%H%M%S')
    file_path = os.path.join('json_data', f'car_data_{timestamp}.json')
    with open(file_path, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=4)

def save_xml(data):
    car = ET.Element('car')
    for key, value in data.items():
        if isinstance(value, dict):
            sub_elem = ET.SubElement(car, key)
            for sub_key, sub_value in value.items():
                ET.SubElement(sub_elem, sub_key).text = str(sub_value)
        elif isinstance(value, list):
            features = ET.SubElement(car, key)
            for item in value:
                ET.SubElement(features, 'feature').text = item
        else:
            ET.SubElement(car, key).text = str(value)
    
    timestamp = datetime.now().strftime('%Y%m%d%H%M%S')
    file_path = os.path.join('xml_data', f'car_data_{timestamp}.xml')
    tree = ET.ElementTree(car)
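    # Write UTF-8 with an XML declaration so other tools read the encoding reliably
    tree.write(file_path, encoding='utf-8', xml_declaration=True)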

def main():
    while True:
        data = generate_car_data()
        save_json(data)
        save_xml(data)
        time.sleep(15)  # Wait for 15 seconds

if __name__ == "__main__":
    main()
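
Because generate_car_data() drops each optional field at random, any code that reads these records has to tolerate missing keys. The sketch below shows one way to do that with dict.get. The function summarize_car and the two sample records are hypothetical and written by hand for illustration; they are not output captured from the script.

```python
def summarize_car(record):
    """Build a one-line summary from a record that may lack optional keys."""
    # make, model and year are never removed by the generator,
    # so direct indexing is safe for them.
    summary = f"{record['year']} {record['make']} {record['model']}"
    # color and price are optional; dict.get returns the fallback
    # instead of raising KeyError when the key is absent.
    color = record.get("color", "unknown color")
    price = record.get("price", "price not listed")
    return f"{summary} ({color}, {price})"


# Hand-written sample records (illustrative only)
full = {"make": "Tata", "model": "Nexon", "year": "2022",
        "color": "Red", "price": "$20000"}
sparse = {"make": "Honda", "model": "City", "year": "2021"}

print(summarize_car(full))    # 2022 Tata Nexon (Red, $20000)
print(summarize_car(sparse))  # 2021 Honda City (unknown color, price not listed)
```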

3. Run the Script

python generate_data.py

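Check the Files: The script writes the first pair of files immediately and a new pair every 15 seconds, so look in the json_data and xml_data folders for new files. Press Ctrl+C to stop the script. If the python command on your system does not start Python 3, use python3 instead.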
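
Instead of opening the files by hand, you can read the newest ones back from Python. This is a rough sketch under the assumption that the folder names match the script above; the helper names newest_file, load_json_record, and xml_top_level_tags are hypothetical.

```python
import json
import xml.etree.ElementTree as ET
from pathlib import Path


def newest_file(folder, pattern):
    """Return the most recently modified file matching pattern, or None."""
    files = sorted(Path(folder).glob(pattern), key=lambda p: p.stat().st_mtime)
    return files[-1] if files else None


def load_json_record(path):
    """Parse one generated JSON file into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def xml_top_level_tags(path):
    """Return the tag names directly under the root <car> element."""
    root = ET.parse(path).getroot()
    return [child.tag for child in root]
```

For example, load_json_record(newest_file("json_data", "*.json")) returns the latest record as a dict once at least one file exists, and xml_top_level_tags(newest_file("xml_data", "*.xml")) lists which fields that XML record kept.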

Why This Work is Useful

  1. Understanding Semi-Structured Data:
    • Flexibility: Records in semi-structured data do not all need the same fields. In this script each car keeps a different subset of the optional fields, and both JSON and XML represent that without any change to a fixed schema.
    • Interoperability: JSON and XML are commonly used for data exchange between different systems, making them valuable for web development and data integration.
  2. Practical Applications:
    • Web Development: JSON is used to send data between web servers and clients, making it crucial for building web applications.
    • Data Integration: Semi-structured data helps in combining information from various sources, making it easier to manage and analyze data.
  3. Learning and Testing:
    • Hands-On Practice: Working with semi-structured data improves your understanding of data formats and processing.
    • System Testing: Generating and using semi-structured data helps test systems and applications to ensure they can handle real-world scenarios.
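
As a small illustration of the system-testing point, a consumer can check that every generated record still carries the fields the generator never removes. This is a sketch only: REQUIRED_FIELDS is derived from the script above (every key that is not in optional_fields), while the function names missing_required and is_valid are hypothetical.

```python
# Keys that generate_car_data() always keeps (everything outside optional_fields)
REQUIRED_FIELDS = ("car_id", "make", "model", "year", "engine", "features")


def missing_required(record):
    """Return the required keys absent from record, in a stable order."""
    return [key for key in REQUIRED_FIELDS if key not in record]


def is_valid(record):
    """Check required keys plus the shape of the engine and features fields."""
    if missing_required(record):
        return False
    engine = record["engine"]
    if not isinstance(engine, dict):
        return False
    if not all(key in engine for key in ("type", "displacement")):
        return False
    # features is written as a list of strings by the generator
    return isinstance(record["features"], list)
```

Running is_valid over each loaded JSON file is a quick way to confirm that a downstream system receives well-formed records even when optional fields are missing.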

Conclusion

Generating semi-structured data, such as JSON and XML, is important because it provides flexibility and adaptability for various applications. It prepares you for real-world tasks involving data processing and integration, enhancing your ability to work with diverse data formats.
