Data Deduplication and Cleaning
Remove duplicate records, handle null values and outliers to improve data quality
Overview
Real-world data often contains duplicates, errors, missing values, and other issues. Claude can help you identify and clean these problems, improving data quality for more accurate downstream analysis.
Use Cases
- Clean duplicate records in customer databases
- Handle duplicate form submissions
- Standardize inconsistent data formats
- Fill or remove missing values
Steps
Step 1: Data Quality Assessment
First analyze what issues exist in the data.
Please analyze the data quality of ~/data/customers.csv:
- Total records and field count
- Number of completely duplicate rows
- Missing value percentage for each field
- Fields with inconsistent formats (phone, email, date, etc.)
- Anomalies or obviously incorrect data
Generate a data quality report
Step 2: Remove Exact Duplicates
Remove records where all fields are identical.
Please remove exact duplicate rows from customers.csv:
- Keep the first occurrence of each record
- Count how many were deleted
- Save to ~/data/customers_dedup.csv
- Save deleted duplicates to a separate file for review
Step 3: Fuzzy Deduplication
Identify similar but not exactly identical duplicate records.
Identify near-duplicates based on key fields:
- Determine if same person based on "Name" and "Phone"
- Ignore case, spaces, and punctuation differences
- Group and list potentially duplicate records
- For each group, suggest which record to keep (most complete information)
Let me confirm before deleting
Step 4: Handle Missing Values
Handle null values according to different strategies.
For missing values, please handle according to these rules:
- "Name" field: Delete entire row (required field)
- "Phone" field: Mark as "Not Provided"
- "Age" field: Fill with median value
- "Address" field: Keep null value
- Generate processing report explaining handling method and affected row count for each field
Step 5: Standardize Formats
Unify data formats.
Please standardize the following field formats:
- Phone numbers: Unify to XXX-XXXX-XXXX format, remove +86, parentheses, etc.
- Email: Convert to lowercase
- Date: Unify to YYYY-MM-DD format
- Address: Remove leading/trailing spaces, standardize province name abbreviations
Output cleaned data to ~/data/customers_cleaned.csv
Warning: Data cleaning may result in information loss! Be sure to keep a backup of the original data. For fuzzy deduplication, it's recommended to manually confirm before batch deleting.
Tip: Create a data cleaning log to record each operation and affected record count for traceability and auditing.
FAQ
Q: How to determine which records are truly duplicates? A: Identical records are definitely duplicates. For similar records, judgment is needed based on business logic, such as the same person filling in different ways. It's recommended to display groups first, then delete after manual confirmation.
Q: Will deleting missing values lose too much data? A: Depends on the missing ratio. If a field is missing over 50%, deleting those rows would lose a lot of data. Consider deleting the column or filling with default values.
Q: How to verify cleaning results? A: Claude can generate before/after comparison reports, including record count changes, data distribution changes, etc., to help you verify if the cleaning was reasonable.