User Manual - MechaDataCleaner

Introduction

MechaDataCleaner is a web-based data preparation tool designed to help you clean and optimize your data effortlessly. With its intuitive interface, you can prepare your data for analysis, reporting, and business intelligence tools like Power BI.

Key Features

Automated Schema Detection - AI-powered column type inference
Data Cleaning - Standardization, validation, deduplication
Batch Processing - Clean multiple files at once
AI Assistance - Interactive help for data cleaning tasks
Dark Mode Support - Comfortable interface for extended use

Who Is This For?

Data Analysts preparing files for dashboards
Business Users needing clean, standardized datasets
Data Engineers building ETL pipelines

Getting Started

Accessing the App

Visit the MechaDataCleaner website and log in to your account.
Once logged in, you will be directed to the main dashboard.
Choose between cleaning a single file or processing multiple files in batch mode.

Quick Start

Click Browse files or drag and drop a CSV/Excel file.
Review the auto-detected column types in the grid.
Check boxes in the grid to enable cleaning operations per column.
Adjust sidebar settings as needed.
Click Clean Data to process your file.
Download the cleaned file and schema.

Interface Overview

Main Tabs

Single File Tab

Upload and clean individual files.
Review column types and configure transformations.
Preview data before and after cleaning.
Download cleaned data, schemas, and audit logs.

Batch Upload Tab

Upload multiple files at once.
Process all files with the same settings.
Download all results as a ZIP file.
View quality metrics for each file.

Sidebar Sections

Settings - Core cleaning configuration.
Cleaning Options - Deduplication and validation.
Dates & Formatting - Date handling preferences.
Data Transformations - Pre/post-processing options.
Custom Rules - Create custom transformation rules.
Account Info - Usage limits and profile.

Core Features

File Upload

Supported Formats:

CSV (.csv)
Excel (.xlsx, .xls)

Features:

Automatic encoding detection.
Mojibake (text corruption) fixing.
Large file support with sampling.
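Mojibake usually appears when UTF-8 bytes are decoded with a single-byte encoding such as Windows-1252. The following is a minimal Python sketch of one common repair, not the app's actual implementation; the function name fix_mojibake is hypothetical.

```python
def fix_mojibake(text: str) -> str:
    """Try to undo UTF-8 text that was wrongly decoded as Windows-1252.

    If the round trip fails, the text was probably fine already,
    so it is returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text
```

Text that is already correct normally fails the round trip and passes through untouched, which makes the repair reasonably safe to apply to a whole column.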

Column Type Detection

The app automatically detects these types:

Basic: str, int, float, bool, category
Dates: date, datetime
Validation: email, phone, url
Advanced: ipv4, ipv6, uuid, domain, and more

Inference Modes:

Strict - Conservative type detection (fewer false positives).
Relaxed - Aggressive type detection (catches more patterns).

AI type preview: With AI-enhance schema enabled, AI-inferred types appear in the type table before they are applied, so you can review and override any suggestion.

Data Cleaning

Automated Operations:

Remove exact duplicates.
Standardize text (trim, case normalization).
Trim suffixes (drop the last N characters) to remove trailing IDs or noise.
Validate emails, phones, URLs.
Handle missing values.
Detect and handle outliers with a quick IQR-based summary in the Data Quality Profile.
Normalize headers and column names.

Sidebar Settings - Complete Guide

This section provides detailed explanations of every setting available in the sidebar. Each option is explained with its purpose, behavior, and use cases.

Cleaning Options

AI-enhance schema

What it does: When enabled, this feature uses OpenAI's GPT model to detect semantic types that simple pattern matching misses. The AI examines the content and context of each column to infer its type.

How it works: The system sends sample data to the AI model, which analyzes patterns, context, and relationships to identify column types like email addresses, phone numbers, dates in unusual formats, currency values, postal codes, and other semantic types that basic regex patterns might miss.

When to use it:

Your data has inconsistent formatting (e.g., phone numbers with/without country codes, mixed date formats)
You need to detect semantic types that aren't obvious from patterns alone
You want the highest accuracy in type detection and are willing to wait longer
Your data contains domain-specific fields that benefit from contextual understanding

When NOT to use it:

You're processing batch files (it's automatically disabled for batch mode)
You need fast processing times
Your data is already well-structured with consistent formats
You're working with simple data types that don't require AI analysis

Performance impact: Enabling AI enhancement increases processing time by 5-15 seconds depending on dataset size, as it requires API calls to OpenAI.

Deduplication

Controls how duplicate rows are identified and removed from your dataset. Choose the strategy that best fits your data's structure and your requirements.

None

What it does: Keeps all rows in your dataset, even if they are perfect duplicates.

When to use: Use this when you deliberately want to keep duplicate records (e.g., transaction logs where the same transaction might legitimately appear multiple times, or when duplicates represent separate events that happened to have identical values).

Exact (all columns)

What it does: Removes rows where every single column has identical values to another row. The system compares all columns simultaneously and removes any row that is a perfect duplicate of another.

How it works: For each row, the system creates a hash of all column values combined. If two rows produce the same hash, they are considered duplicates, and all but the first occurrence are removed.

When to use:

You want to remove only perfect duplicates where everything matches
Your dataset has been merged from multiple sources and may contain exact duplicates
You want the simplest, fastest deduplication method
Every column is meaningful for determining uniqueness

Example: If you have rows [John, Doe, 25, USA] appearing three times, this will keep only one copy and remove the other two.
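The hash-and-keep-first approach described above can be sketched in Python as follows. This is an illustration under assumptions, not the app's code: the helper names, the SHA-256 hash, and the field separator are all hypothetical choices.

```python
import hashlib

def row_hash(row):
    """Hash all column values of a row together."""
    # repr() keeps 25 and "25" distinct; "\x1f" separates the fields.
    joined = "\x1f".join(repr(value) for value in row)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def drop_exact_duplicates(rows):
    """Keep the first occurrence of each row and drop later copies."""
    seen = set()
    kept = []
    for row in rows:
        digest = row_hash(row)
        if digest not in seen:
            seen.add(digest)
            kept.append(row)
    return kept

rows = [
    ["John", "Doe", 25, "USA"],
    ["Jane", "Roe", 31, "UK"],
    ["John", "Doe", 25, "USA"],
    ["John", "Doe", 25, "USA"],
]
deduped = drop_exact_duplicates(rows)
```

Here the three [John, Doe, 25, USA] rows collapse to one, and the original row order is preserved.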

Exact (selected columns)

What it does: Removes rows that have matching values in specific columns you choose, ignoring other columns. This allows you to define uniqueness based on business logic rather than strict equality.

How it works: After selecting this option, you'll be prompted to choose which columns should be used for comparison. The system then identifies duplicates based only on those selected columns.

When to use:

You have a natural key (e.g., email address, employee ID, customer number) that should be unique
Some columns contain timestamps or metadata that differ but shouldn't affect uniqueness
You want to keep the most recent record when duplicates exist based on certain key fields
Your business rules define uniqueness by specific fields only

Example: If you select "Email" and "Date" as key columns, then rows with matching email+date combinations will be considered duplicates, even if other columns like "LoginTime" or "SessionID" differ.
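As a rough Python sketch of key-based deduplication (the function name and sample data are hypothetical, and keeping the first occurrence is an assumption):

```python
def drop_duplicates_on(rows, key_columns):
    """Keep the first row for each combination of values in key_columns."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

logins = [
    {"Email": "ana@example.com", "Date": "2025-01-15", "SessionID": "s1"},
    {"Email": "ana@example.com", "Date": "2025-01-15", "SessionID": "s2"},
    {"Email": "ana@example.com", "Date": "2025-01-16", "SessionID": "s3"},
]
unique_logins = drop_duplicates_on(logins, ["Email", "Date"])
```

The second row is dropped because its Email and Date match the first, even though SessionID differs.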

Fuzzy match

What it does: Identifies and removes rows that are similar but not exactly identical, using string similarity algorithms. This catches near-duplicates caused by typos, formatting differences, or data entry variations.

How it works: The system calculates a similarity score between rows using algorithms like Levenshtein distance or Jaccard similarity. You set a threshold (e.g., 80%), and any rows exceeding this similarity percentage are considered duplicates.

Configurable options:

Similarity threshold (50-100%): Higher values require closer matches. 80-90% is recommended for most use cases. 95%+ matches only near-identical records; 60-70% is more aggressive and may produce false positives.
Columns to compare: Select which columns should be analyzed for similarity.
Auto-pick canonical row: When enabled, automatically keeps the most complete/highest quality version. When disabled, you can manually review and choose which duplicate to keep.

When to use:

Your data contains typos, misspellings, or inconsistent formatting
Multiple data sources have entered the same entity slightly differently
You need to merge customer records that might have variations in name spelling or address formatting
Data entry errors have created near-duplicate records

Example scenarios:

"John Smith" vs "Jon Smith" vs "John Smyth" (all referring to same person)
"123 Main St" vs "123 Main Street" vs "123 Main St."
"ABC Corp" vs "ABC Corporation" vs "A.B.C. Corp"
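To make the threshold concrete, here is a minimal Python sketch of similarity-based deduplication on a single text column. It assumes a Levenshtein-based percentage score and a keep-first policy; the function names are hypothetical and the app's actual scoring may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity(a: str, b: str) -> float:
    """Similarity as a percentage of the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)

def fuzzy_dedupe(values, threshold=80.0):
    """Drop values that are at least `threshold` percent similar to a kept one."""
    kept = []
    for value in values:
        normalized = value.strip().lower()
        if any(similarity(normalized, k.strip().lower()) >= threshold for k in kept):
            continue
        kept.append(value)
    return kept

names = ["John Smith", "Jon Smith", "John Smyth", "Mary Jones"]
unique_names = fuzzy_dedupe(names, threshold=80.0)
```

Each value is compared against every kept value, so the work grows roughly quadratically with the number of rows. That is why fuzzy matching becomes slow on large datasets.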

Warning: Fuzzy matching is computationally expensive and can significantly increase processing time for large datasets. It's recommended to use it on datasets under 10,000 rows or with selected columns only.

Handle invalid rows

Determines what happens when the system encounters data that doesn't meet validation rules (e.g., malformed emails, invalid phone numbers, out-of-range dates).

Keep as-is

What it does: Leaves invalid data exactly as it is in the output file. No modifications, no deletions, no flagging.

When to use:

You want to see ALL data, including invalid entries, for manual review
You're running a data quality assessment and need to preserve original values
Invalid data might contain useful information despite not meeting strict validation
You plan to fix validation issues manually in a separate step

Flag (add column)

What it does: Adds a new column called "_INVALID" to your dataset. This column contains a boolean value (TRUE/FALSE) or descriptive text indicating whether each row contains invalid data and why.

How it works: For each row, the system runs validation checks based on the detected or assigned column types. If any field fails validation, the _INVALID column is marked TRUE and may include details about which fields failed and why.

When to use:

You want to keep invalid data but clearly identify it for downstream processing
You need to generate data quality reports showing which records are problematic
Your workflow includes a manual review step for flagged records
You want to filter invalid records in your BI tool or database after import

Example output:
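As an illustration only, the following Python sketch shows how a boolean _INVALID column could be derived. The email pattern, the function names, and the sample rows are assumptions, not the app's actual validation rules.

```python
import re

# Deliberately simple email pattern, used here only for illustration.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def is_valid_email(value) -> bool:
    return bool(EMAIL_RE.match(str(value)))

def flag_invalid(rows, validators):
    """Return copies of rows with an added boolean _INVALID column.

    validators maps a column name to a function that returns True
    when the value in that column is valid.
    """
    flagged = []
    for row in rows:
        failed = [column for column, check in validators.items()
                  if not check(row.get(column, ""))]
        flagged.append({**row, "_INVALID": bool(failed)})
    return flagged

contacts = [
    {"Name": "Ana", "Email": "ana@example.com"},
    {"Name": "Ben", "Email": "not-an-email"},
]
flagged_contacts = flag_invalid(contacts, {"Email": is_valid_email})
```

In this sketch the first row gets _INVALID = False and the second gets _INVALID = True, so a downstream tool can filter on that column.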

Remove

What it does: Completely deletes rows that contain any invalid data from the output file. These rows will not appear in your cleaned dataset.

How it works: During processing, the system validates each row. Any row with one or more fields that fail validation checks is excluded from the final output.

When to use:

You only want clean, validated data in your output
Invalid records have no business value and would cause errors downstream
Your data pipeline requires strict data quality standards
You're feeding data into a system that would fail or malfunction with invalid entries

Warning: Removed rows are permanently excluded from the output. Make sure to review the audit log or metrics to understand how much data was removed and why. Consider using "Flag" mode first to assess the impact before switching to "Remove".

Dates & Formatting

This section controls how date columns are processed, standardized, and transformed. Date handling is critical for time-series analysis, reporting, and database imports.

Date standardization (Preprocessing)

What it does: When enabled, this scans your entire dataset for date columns and converts all dates to a consistent, standardized format BEFORE any other cleaning operations occur.

How it works: The system uses intelligent date parsing libraries that can recognize hundreds of date formats (MM/DD/YYYY, DD-MM-YYYY, YYYY/MM/DD, "January 15, 2025", "15-Jan-2025", etc.) and converts them all to a single consistent format (typically ISO 8601: YYYY-MM-DD).

Why it's a preprocessing step: Date standardization happens BEFORE cleaning because inconsistent date formats can cause downstream validation and transformation failures. By standardizing dates first, all subsequent operations work with consistent data.
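A minimal Python sketch of this kind of standardization is shown below. It tries a short, hypothetical list of formats in order and returns ISO 8601 text; the real parser recognizes far more formats. Month names are parsed with the default English locale here.

```python
from datetime import datetime

# Hypothetical candidate formats, tried in order. Order matters for
# ambiguous inputs such as 01/02/2025 (month-first wins here).
CANDIDATE_FORMATS = [
    "%Y-%m-%d",   # 2025-01-15
    "%Y/%m/%d",   # 2025/01/15
    "%m/%d/%Y",   # 01/15/2025
    "%d-%m-%Y",   # 15-01-2025
    "%B %d, %Y",  # January 15, 2025
    "%d-%b-%Y",   # 15-Jan-2025
]

def to_iso(value: str):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Values that match none of the formats come back as None, so they can be reported instead of being silently changed.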

When to enable:

Your data contains mixed date formats (some MM/DD, some DD/MM, some spelled out)
You're merging data from multiple sources with different regional formats
You need dates in a specific format for database import or BI tools
Your downstream systems require ISO format dates

When to disable:

Your dates are already in a consistent format
You want to preserve original date formats
You're using the "Date input format" setting to handle a specific format

Date input format

Specifies the expected format of dates in your source data. This helps the parser correctly interpret ambiguous dates.

Auto-detect

What it does: The system automatically attempts to determine the date format by analyzing date patterns in your data.

How it works: Examines multiple date columns and looks for consistent patterns. Uses heuristics such as: if the first number in a date never exceeds 12, the format is likely MM/DD; if it frequently exceeds 12, it cannot be a month, so the format is likely DD/MM.

When to use: Use this when you're unsure of the format or have well-formed, unambiguous dates (e.g., "2025-01-15" or spelled-out dates like "January 15, 2025").

Caution: Auto-detect can misinterpret ambiguous dates. For example, "01/02/2025" could be January 2nd or February 1st. If you know your format, specify it explicitly.
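The heuristic, and the ambiguity it cannot resolve, can be sketched in Python as follows. The function name and return values are hypothetical, and the sketch is stricter than the description above: it reports "ambiguous" whenever no value exceeds 12 in either position.

```python
def guess_day_month_order(values):
    """Guess 'dmy', 'mdy', or 'ambiguous' from numeric dates like 15/01/2025."""
    first_over_12 = False
    second_over_12 = False
    for value in values:
        parts = value.strip().replace("-", "/").split("/")
        if len(parts) != 3 or len(parts[0]) == 4:
            continue  # not a three-part date, or it starts with a year
        try:
            first, second = int(parts[0]), int(parts[1])
        except ValueError:
            continue
        if first > 12:
            first_over_12 = True   # first field cannot be a month
        if second > 12:
            second_over_12 = True  # second field cannot be a month
    if first_over_12 and not second_over_12:
        return "dmy"
    if second_over_12 and not first_over_12:
        return "mdy"
    return "ambiguous"
```

A column containing only values such as "01/02/2025" gives no evidence either way, which is why specifying the format explicitly is safer.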

MM/DD/YYYY (mdy) - American format

What it does: Interprets dates with month first, then day, then year. This is the standard format used in the United States.

Example interpretations:

01/15/2025 → January 15, 2025
12/31/2025 → December 31, 2025
03/04/2025 → March 4, 2025

When to use: Your data comes from U.S. systems, American data entry, or you know dates follow month-first convention.

DD/MM/YYYY (dmy) - European format

What it does: Interprets dates with day first, then month, then year. This is the standard format used in most of Europe, Australia, and many other countries.

Example interpretations:

15/01/2025 → January 15, 2025
31/12/2025 → December 31, 2025
04/03/2025 → March 4, 2025

When to use: Your data comes from European systems, international sources, or you know dates follow day-first convention.

YYYY/MM/DD (ymd) - ISO format

What it does: Interprets dates with year first, then month, then day. This is the field order of the ISO 8601 international standard, which writes dates as YYYY-MM-DD.

Example interpretations:

2025/01/15 → January 15, 2025
2025/12/31 → December 31, 2025
2025/03/04 → March 4, 2025

When to use: Your data already uses ISO format, comes from Asian systems (common in China, Japan, Korea), or you need unambiguous date parsing.

Advantage: This format is never ambiguous and sorts correctly as text, making it ideal for databases and data exchange.

Additional Date Options

Date only (no time)

What it does: Removes time components (hours, minutes, seconds) from datetime columns, leaving only the date portion.

Example transformation:

"2025-01-15 14:30:45" → "2025-01-15"
"01/15/2025 3:45 PM" → "01/15/2025"

When to use:

Time components are irrelevant for your analysis (e.g., daily aggregations)
You want to reduce file size by removing unnecessary precision
Your target system only supports date types, not datetime
You're grouping or joining on dates and don't want time differences to prevent matches

Add date keys (YYYYMMDD)

What it does: Creates NEW columns next to each date column containing integer representations of dates in YYYYMMDD format.

Example: If you have a column "OrderDate" with value "2025-01-15", this creates a new column "OrderDate_Key" with value 20250115 (an integer).

Benefits of date keys:

Faster filtering: Integer comparisons are faster than date parsing in databases and BI tools
Sortable: Sorting these integers chronologically orders dates correctly
Storage efficient: Integers take less space than formatted date strings
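Easy range filters: A date range becomes a simple integer comparison (e.g., between 20250101 and 20250131). Note that subtracting two keys does not give the number of days between them.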
BI tool friendly: Many BI tools perform better with integer date keys

When to use:

You're building data warehouses or star schemas
Your BI tool or dashboard benefits from integer date keys
You need fast date filtering in large datasets
You want both human-readable dates AND optimized integer keys

Note: Original date columns are preserved; date keys are added as additional columns.
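A small Python sketch of how a YYYYMMDD key relates to a date (the helper names are hypothetical):

```python
from datetime import date

def date_key(d: date) -> int:
    """Encode a date as an integer in YYYYMMDD form."""
    return d.year * 10000 + d.month * 100 + d.day

def key_to_date(key: int) -> date:
    """Decode a YYYYMMDD integer back into a date."""
    return date(key // 10000, key // 100 % 100, key % 100)

order_key = date_key(date(2025, 1, 15))  # 20250115

# Keys sort and compare correctly, but they are not day counts.
# Convert back to dates before computing a difference in days.
days_apart = (key_to_date(20250201) - key_to_date(20250131)).days  # 1
key_gap = 20250201 - 20250131                                      # 70
```

Keys work well for sorting and for range filters such as 20250101 <= key <= 20250131. For the number of days between two dates, use the original date column.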

Replace dates with keys

What it does: Completely replaces original date columns with integer YYYYMMDD values. Unlike "Add date keys" which creates new columns, this REMOVES the original date columns.

Example transformation:

Before: OrderDate = "2025-01-15"
After: OrderDate = 20250115

When to use:

You only need dates for sorting, filtering, or joining
You want maximum storage efficiency
Your target system prefers or requires integer date representations
You're optimizing for query performance over readability

Warning: This is a destructive transformation. Human-readable dates are lost and replaced with integers. You cannot easily convert back to formatted dates without recalculation. Use "Add date keys" if you want to keep both formats.

Text normalization

What it does: Cleans text columns by trimming whitespace, normalizing case, and optionally trimming the last N characters (suffix trim) to remove trailing IDs or noise.

When to use: Your columns contain extra spaces, mixed casing, or trailing tokens you want removed before validation and downstream joins.

Notes: Configure suffix length to drop a fixed number of characters; leave it off to only trim and normalize case.
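A minimal Python sketch of these steps follows. The parameter names and the order of operations (trim whitespace, then drop the suffix, then change case) are assumptions, not the app's documented behavior.

```python
def normalize_text(value: str, case="lower", suffix_len=0) -> str:
    """Trim whitespace, optionally drop the last N characters, normalize case."""
    text = value.strip()
    if suffix_len > 0:
        # Drop a fixed number of trailing characters, then re-trim.
        text = text[:-suffix_len].rstrip()
    if case == "lower":
        text = text.lower()
    elif case == "upper":
        text = text.upper()
    elif case == "proper":
        text = " ".join(word.capitalize() for word in text.split(" "))
    return text

cleaned = normalize_text("  Widget-A #123 ", case="lower", suffix_len=5)
```

Here the trailing " #123" (five characters) is removed, leaving "widget-a". With suffix_len left at 0, only trimming and case normalization apply.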

Categorical cleanup

What it does: Merges case variations and normalizes category labels (e.g., "Apple", "apple" → "apple") to reduce duplicate categories.

When to use: Survey responses, free-text categories, or any column where spelling/case drift creates fragmented categories.

Configure missing values

What it does: Lets you choose how to handle nulls before cleaning—leave as-is, fill with defaults, or apply targeted strategies per column.

When to use: Datasets with sparse fields, optional attributes, or columns that need default placeholders for analytics/joins.
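As a sketch of the "fill with defaults" strategy in Python (the function name is hypothetical, and treating None and empty strings as missing is an assumption):

```python
def fill_missing(rows, defaults):
    """Return copies of rows with missing values replaced per column.

    defaults maps a column name to its fill value. Columns without
    an entry are left as-is.
    """
    filled = []
    for row in rows:
        new_row = dict(row)
        for column, default in defaults.items():
            if new_row.get(column) in (None, ""):
                new_row[column] = default
        filled.append(new_row)
    return filled

customers = [
    {"Name": "Ana", "Country": "", "Orders": None},
    {"Name": "Ben", "Country": "UK", "Orders": 0},
]
filled_customers = fill_missing(customers, {"Country": "Unknown", "Orders": 0})
```

A real zero, as in Ben's Orders, is not treated as missing.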

Validation actions

What it does: Defines actions for invalid values (emails, URLs, phones, etc.): keep, flag in _INVALID, or remove.

When to use: Control how strict you are with data quality depending on downstream tolerance for bad records.

Outlier detection

What it does: IQR-based outlier identification with two sensitivities: Standard (1.5x IQR) and Extreme (3x IQR). Shows a quick summary in the Data Quality Profile.

When to use: To flag anomalous numeric values before modeling or exporting; choose Standard for most cases, Extreme to catch only the most distant points.
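The two sensitivities can be illustrated with a short Python sketch. The quartile method used here (the inclusive method of statistics.quantiles) is an assumption; the app may compute quartiles differently.

```python
from statistics import quantiles

def iqr_outliers(values, multiplier=1.5):
    """Return values outside [Q1 - m*IQR, Q3 + m*IQR]."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low = q1 - multiplier * iqr
    high = q3 + multiplier * iqr
    return [v for v in values if v < low or v > high]

amounts = [10, 11, 12, 12, 13, 14, 15, 20, 100]
standard = iqr_outliers(amounts, multiplier=1.5)  # Standard sensitivity
extreme = iqr_outliers(amounts, multiplier=3.0)   # Extreme sensitivity
```

With this sample, Q1 is 12 and Q3 is 15, so the IQR is 3. Standard flags both 20 and 100, while Extreme flags only 100.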

Convert word numbers

What it does: Finds text representations of numbers (like "thirty", "five hundred", "two thousand") and converts them to numeric digits (30, 500, 2000).

Supported formats:

Simple numbers: "one" → 1, "twenty" → 20, "hundred" → 100
Compound numbers: "twenty-five" → 25, "one hundred fifty" → 150
Large numbers: "two thousand" → 2000, "five million" → 5000000
Mixed: "three hundred and forty-two" → 342

When to use:

Your data contains survey responses or form entries where numbers were written as words
You're processing transcribed speech or interview data
Data was entered by humans who spelled out numbers
You need numeric columns for mathematical operations or aggregations

Example scenarios:

Age column: "thirty-five" → 35
Quantity: "two hundred" → 200
Revenue: "five million dollars" → 5000000

Limitations: Only English number words are supported. Fractions and decimals written as words (e.g., "one half", "three point five") are not converted and remain unchanged.
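A simplified Python sketch of word-to-number conversion is shown below. It handles only phrases made up entirely of number words; the app's handling of surrounding text such as "dollars" is not reproduced here. The function name is hypothetical.

```python
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 1_000, "million": 1_000_000}

def words_to_number(text: str):
    """Convert an English number phrase to an int, or None if unrecognized."""
    tokens = text.lower().replace("-", " ").split()
    total = 0      # completed thousand/million groups
    current = 0    # value of the group being built
    seen_number = False
    for token in tokens:
        if token == "and" and seen_number:
            continue  # "three hundred and forty-two"
        if token in UNITS:
            current += UNITS[token]
        elif token in TENS:
            current += TENS[token]
        elif token == "hundred":
            current = max(current, 1) * 100
        elif token in SCALES:
            total += max(current, 1) * SCALES[token]
            current = 0
        else:
            return None  # unknown word: leave the value unchanged
        seen_number = True
    return total + current if seen_number else None
```

Phrases such as "one half" return None in this sketch, which matches the limitation above: unsupported text is left as it is.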

Custom Rules

Create conditional transformations (Starter: 3 rules max, Pro: unlimited):

Column: Age
Operator: >
Value: 120
Action: Remove Row
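Conceptually, the rule above behaves like the following Python sketch. The function name and the set of operators are assumptions; the operators the app actually supports may differ.

```python
import operator

# Hypothetical operator table for illustration.
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}

def apply_remove_rule(rows, column, op, value):
    """Remove every row where `row[column] <op> value` is true."""
    compare = OPERATORS[op]
    return [row for row in rows if not compare(row[column], value)]

people = [
    {"Name": "Ana", "Age": 34},
    {"Name": "Ben", "Age": 150},
    {"Name": "Cy", "Age": 120},
]
plausible = apply_remove_rule(people, "Age", ">", 120)
```

Only Ben's row is removed. An age of exactly 120 stays, because the rule uses > rather than >=.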

Data Cleaning Workflow

Two Ways to Configure Cleaning

MechaDataCleaner offers two complementary methods to configure your data cleaning. You can use either one or both together:

Grid-Based Configuration

Configure cleaning operations per column directly in an interactive grid. Best for column-specific transformations like case changes, trimming, and character removal.

Sidebar Settings

Configure global options that apply to all columns or the entire file. Best for deduplication, date formatting, validation rules, and custom rules.

Tip: Both methods work together. Grid settings handle per-column operations while sidebar settings handle file-wide operations. When you click Clean Data, all settings from both are applied.

Step 1: Upload File

Click "Browse files" or drag and drop a CSV or Excel file.
File loads with automatic encoding detection.
Preview rows are displayed based on your plan (Free: 5, Starter: 50, Pro: 100).

Step 2: Configure Cleaning in the Grid (Per-Column Settings)

The grid appears in the main area after uploading a file. Each row represents a column from your data.

Type: Select the data type for each column from the dropdown (str, int, float, date, email, phone, etc.).
Text transformations: Check boxes to enable:
- Proper Case - Capitalize first letter of each word
- UPPER - Convert to uppercase
- lower - Convert to lowercase
Trimming options: Check boxes to enable:
- Trim Lead - Remove leading spaces
- Trim Trail - Remove trailing spaces
- Rm Non-Print - Remove non-printable characters
Character removal: Check boxes to enable:
- Rm Spaces - Remove all spaces
- Rm Letters - Remove all letters
- Rm Numbers - Remove all numbers
- Nullify 0s - Replace zeros with NULL
Bulk apply: Use the "Apply to All Columns" expander above the grid to apply the same settings to every column at once.
Advanced columns: Click the Advanced Columns multiselect to add extra operations:
- Deduplicate (Exact) - Mark exact duplicate rows
- Deduplicate (Fuzzy) - Mark similar rows using similarity matching
- Remove Emoji - Remove emoji characters
- Convert Word Numbers - Convert "thirty" to 30

Step 3: Configure Sidebar Settings (Global Options)

The sidebar on the left contains settings that apply to the entire file or multiple columns at once.

Cleaning Options:
- AI-enhance schema - Use AI for smarter type detection
- Deduplication mode - None, Exact (all columns), Exact (selected), or Fuzzy match
- Handle invalid rows - Keep as-is, Flag (add column), or Remove
Dates and Formatting:
- Date standardization - Normalize all date formats
- Date input format - Specify MM/DD/YYYY, DD/MM/YYYY, YYYY/MM/DD, or auto-detect
- Date-only mode, date keys, and other date options
Data Transformations:
- Header normalization - Clean column names
- Outlier detection - Flag values outside IQR thresholds
- Schema validation settings
Custom Rules: Add conditional transformations (Starter: up to 3, Pro: unlimited).

Step 4: Run Cleaning

When you run cleaning, both grid settings and sidebar settings are applied together.

Option A: Create Schema Only

Generates schema JSON without cleaning. Useful for validation or reuse.

Option B: Clean Data

Applies all grid settings and sidebar options, validates and cleans data, generates quality metrics.

Step 5: Review Results

Check quality metrics (completeness, duplicates removed).
Review before/after preview.
Download: Cleaned CSV/Excel, Schema JSON, Audit log.

Batch Processing

When to Use Batch Mode

Multiple tables with the same structure.
Consistent transformations across files.
ETL pipelines requiring bulk processing.

Workflow

Switch to Batch Upload tab.
Upload multiple CSV/Excel files.
Configure sidebar settings (applied to all files).
Preview individual files from dropdown.
Click "Process All Files".
Download ZIP with all results.

ZIP Contents:

cleaned_*.csv - Cleaned files.
schemas/*.json - Schema definitions.
audit_logs/*.json - Processing logs.
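If you post-process batch results in a script, the archive layout above can be grouped with Python's standard zipfile module. This is a sketch: the function name and the file names in the demo archive are hypothetical.

```python
import io
import zipfile
from fnmatch import fnmatch

def summarize_results(zip_source):
    """Group the archive's entries by the layout described above."""
    groups = {"cleaned": [], "schemas": [], "audit_logs": []}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if fnmatch(name, "cleaned_*.csv"):
                groups["cleaned"].append(name)
            elif fnmatch(name, "schemas/*.json"):
                groups["schemas"].append(name)
            elif fnmatch(name, "audit_logs/*.json"):
                groups["audit_logs"].append(name)
    return groups

# Build a tiny in-memory archive to demonstrate. In practice, pass
# the path of the ZIP file you downloaded.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("cleaned_sales.csv", "id,amount\n1,10\n")
    archive.writestr("schemas/sales.json", "{}")
    archive.writestr("audit_logs/sales.json", "{}")
demo.seek(0)
demo_groups = summarize_results(demo)
```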

AI Features

AI-Enhanced Schema Detection

Requirements: AI assistance enabled in your account settings.

What It Does:

Detects semantic types (email, phone, currency).
Validates data patterns.
Suggests appropriate data types.
Adds validation rules to schema.

AI Chat Bot (BETA)

Features:

Ask about your dataset.
Get cleaning recommendations.
Troubleshoot data issues.

Usage Limits:

Free: 10 messages/month.
Starter: 100 messages/month.
Pro: 500 messages/month.

Troubleshooting

Common Issues

File Upload Fails

Check file size (large files may take longer to process).
Verify file format (CSV, XLSX, XLS only).
Try a smaller sample if the file is very large.

Type Detection Incorrect

Switch between Strict/Relaxed inference modes.
Manually override types in the type selection table.
Enable AI-enhance schema for better detection.

Cleaning Takes Too Long

Disable AI-enhance schema for faster processing.
Reduce sample size in Processing Options.
Disable expensive operations (fuzzy deduplication, outlier detection).

AI Features Not Working

Verify AI assistance is enabled in your account.
Ensure you have sufficient usage credits.

Error Messages

"You've reached your cleaning limit"

Upgrade to Starter or Pro plan.
Wait for monthly limit reset.
Contact support for assistance.

"Schema validation failed"

Review schema requirements.
Disable "Fail on schema errors" to see issues.
Check validation mode (try Loose instead of Strict).

Best Practices

Data Preparation

Start Small: Test with a sample before processing large files.
Review Types: Always verify auto-detected types are correct.
Incremental Changes: Apply transformations one at a time.
Save Schemas: Reuse schemas for consistent processing.

Performance Optimization

Disable Unused Features: Turn off AI enhancement if not needed.
Adjust Sample Size: Use smaller samples for faster iteration.
Batch Wisely: Group similar files for batch processing.
Use Strict Mode: Faster than Relaxed inference mode.

Data Quality

Flag Before Removing: Use Flag mode to review invalid data.
Check Metrics: Review quality scores before proceeding.
Audit Logs: Download and archive for traceability.
Validate Early: Use schema validation to catch issues.

Need More Help?

For additional assistance or feature requests, contact us at support@mechadatacleaner.com
