Personal Projects · Completed

Art Price Predictor

Market intelligence tool that surfaces undervalued art resale opportunities by aggregating listings from 5 marketplaces into a filterable opportunity board.

Python · Flask · SQLite · Vercel

The Problem

Art resale markets are fragmented. A painting selling for $40 on eBay might have comparable pieces sold on LiveAuctioneers for $180. The gap is real, but no single surface shows it. Buyers who know all five platforms - eBay, Etsy, LiveAuctioneers, Mercari, ShopGoodwill - and check them manually every day have an edge. Everyone else does not.

The information asymmetry is tractable with automation. The question was how to build a pipeline that ingests listings from all five sources, normalizes them into a comparable schema, scores each one for resale opportunity, and surfaces the best candidates in a browsable, filterable board.

Approach

The pipeline runs in five stages: ingest, normalize, value, score, and surface.

Each source has its own adapter module. The adapters handle API calls or scraping, parse the raw response, and emit normalized listing records with a shared field contract: title, art type, asking price, source, listing URL, and available metadata. After normalization, the valuation module queries historical comparables to estimate the range a listing should trade at given its type and characteristics. The scoring module then computes a confidence-weighted opportunity score: high, medium, or low. The opportunity surface filters to listings where asking price sits below the estimated value floor.
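As a sketch of that contract and the value-floor filter (field names follow the shared schema above; the listing values and the `est_low`/`est_high` fields, which stand in for the valuation module's output, are illustrative):

```python
# Illustrative normalized records plus the opportunity filter described above.
# est_low / est_high are hypothetical names for the valuation module's output.

def find_opportunities(listings: list[dict]) -> list[dict]:
    """Keep listings whose asking price sits below the estimated value floor."""
    return [rec for rec in listings if rec["asking_price"] < rec["est_low"]]

listings = [
    {"title": "Mid-century oil landscape", "art_type": "oil_painting",
     "asking_price": 40.0, "source": "ebay",
     "listing_url": "https://example.com/1", "est_low": 120.0, "est_high": 180.0},
    {"title": "Signed lithograph", "art_type": "print",
     "asking_price": 95.0, "source": "etsy",
     "listing_url": "https://example.com/2", "est_low": 80.0, "est_high": 110.0},
]

# Only the first listing sits below its value floor.
```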

I built a sanity_proof.py verification layer that runs on every deploy snapshot. It checks that each source produced records, that valuation did not emit unreasonable price estimates, and that the opportunity set is non-empty. If any check fails, the deploy snapshot is rejected before it replaces the live database. The pipeline does not ship broken state silently.

Architecture

Five source adapters feed a shared normalization layer. Normalized listings go into SQLite via a repository abstraction. The pipeline stages (valuation, scoring, opportunity detection) run against the stored records. A Flask web layer reads the final opportunity set and renders the browse board.

Source Adapters (eBay, Etsy, LiveAuctioneers, Mercari, ShopGoodwill)
    ↓
Normalization Layer (shared listing schema)
    ↓
SQLite (ListingRepository)
    ↓
Valuation (comparables.py) → Scoring (scoring.py) → Opportunities (opportunities.py)
    ↓
Sanity Proof (sanity_proof.py)
    ↓
Flask Web Routes → Browse Board
    ↓
Vercel (Python serverless, cron refresh every 12 hours)

The Vercel cron hits /internal/refresh every 12 hours, rebuilds the snapshot from current listings, and replaces the live database. The 60-second serverless limit is the binding constraint. Snapshot deploy (rebuild everything, swap atomically) stays within that window; incremental sync would not.
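The atomic swap can be done with a file replace; a minimal sketch of the rebuild-then-swap shape (the `LIVE_DB` path, `deploy_snapshot` signature, and `build_fn` hook are assumptions, not the project's actual code):

```python
import os
import sqlite3
import tempfile

LIVE_DB = "listings.db"  # hypothetical live database path

def deploy_snapshot(build_fn) -> None:
    """Rebuild into a temp file, then atomically replace the live database."""
    fd, tmp_path = tempfile.mkstemp(suffix=".db", dir=".")
    os.close(fd)
    try:
        conn = sqlite3.connect(tmp_path)
        build_fn(conn)   # ingest -> normalize -> value -> score
        conn.commit()
        conn.close()
        # os.replace is atomic on POSIX: readers never see a partial database.
        os.replace(tmp_path, LIVE_DB)
    except Exception:
        os.remove(tmp_path)  # failed build: the old snapshot stays live
        raise
```

Because the whole rebuild happens in the temp file, a crash at any point leaves the previous snapshot untouched.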

Key Technical Details

Source adapters share a base interface. Each adapter is responsible for one marketplace and one marketplace only. The shared output is a normalized listing dict with a fixed schema. Divergence in source format is isolated to the adapter; downstream stages see only normalized records.

class BaseListingSource:
    """All source adapters implement this contract."""
    source_name: str

    def fetch_listings(self, art_type: str) -> list[dict]:
        """Return normalized listing records for the given art_type."""
        raise NotImplementedError
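A hypothetical adapter shows the shape of the contract (the base class is restated so the sketch runs standalone; the raw payload, `_fetch_raw`, and `_normalize` helpers are illustrative, and a real adapter would call the marketplace API):

```python
class BaseListingSource:
    """Contract restated from above so this sketch is self-contained."""
    source_name: str

    def fetch_listings(self, art_type: str) -> list[dict]:
        raise NotImplementedError

class EbayListingSource(BaseListingSource):
    source_name = "ebay"

    def fetch_listings(self, art_type: str) -> list[dict]:
        raw_items = self._fetch_raw(art_type)  # API call in a real adapter
        return [self._normalize(item, art_type) for item in raw_items]

    def _fetch_raw(self, art_type: str) -> list[dict]:
        # Stubbed payload standing in for the marketplace response.
        return [{"name": "Oil landscape", "price": "40.00",
                 "url": "https://example.com/1"}]

    def _normalize(self, item: dict, art_type: str) -> dict:
        # Source-specific field names are mapped to the shared schema here,
        # so downstream stages never see eBay's raw shape.
        return {
            "title": item["name"],
            "art_type": art_type,
            "asking_price": float(item["price"]),
            "source": self.source_name,
            "listing_url": item["url"],
            "metadata": {},
        }
```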

Confidence scoring weights opportunity strength by comparable quality. A listing with 10 strong comparables in the same art type and price tier gets a high confidence score. A listing with 2 loose comparables gets low. The browse board exposes this as a filter so the user can narrow to high-confidence picks only.
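The tiering rule can be sketched as a threshold on comparable quality (the thresholds and the `same_type`/`same_tier` flags are assumptions, not the project's actual tuning):

```python
# Illustrative confidence rule: a comparable is "strong" when it matches both
# art type and price tier; tier thresholds are hypothetical.
def confidence(comparables: list[dict]) -> str:
    """Map comparable count and quality to a high/medium/low confidence tier."""
    strong = [c for c in comparables if c.get("same_type") and c.get("same_tier")]
    if len(strong) >= 8:
        return "high"
    if len(strong) >= 3 or len(comparables) >= 5:
        return "medium"
    return "low"
```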

Sanity proof is the last gate before a snapshot goes live. It runs assertions on the produced data:

def run_sanity_proof(repo: ListingRepository) -> SanityProofRecord:
    checks = [
        ("sources_populated", _check_sources_populated(repo)),
        ("valuations_in_range", _check_valuations_in_range(repo)),
        ("opportunities_non_empty", _check_opportunities_non_empty(repo)),
    ]
    passed = all(result for _, result in checks)
    return SanityProofRecord(passed=passed, checks=checks)

If passed is False, the refresh route returns a non-200 status and the old snapshot remains live. Silent failure is not an option.

deploy_snapshot.py is the single entry point for rebuilding the live database. It runs the five-family M003 fixture refresh path in sequence and persists the result to the configured SQLite path. The Vercel cron route calls this path on schedule.

Three design choices drove the architecture. SQLite over PostgreSQL because Vercel serverless with a file-based database is zero infrastructure - no connection pool, no managed instance, no cold start penalty from auth. Flask over FastAPI because the web surface is a simple read-only browse board with one filter endpoint; FastAPI's async machinery adds nothing here. Snapshot deploy over incremental sync because it is simpler to reason about, the 60-second limit requires bounded execution time, and sanity proof on a complete snapshot is stronger than checking partial updates.

Impact

Built a live art market intelligence tool that aggregates 5 marketplace sources into a filterable opportunity board, refreshing every 12 hours on Vercel. The pipeline surfaces undervalued listings by comparing asking prices against historical comparables with high, medium, and low confidence ratings.

Constraints

Vercel serverless 60-second execution limit bounds snapshot rebuild time. Marketplace terms of service rate limits require respectful scraping intervals with backoff. Comparable quality degrades for niche art types with sparse historical sold data.
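The respectful-interval requirement is typically handled with capped exponential backoff plus jitter; a minimal sketch (base, cap, and the full-jitter choice are illustrative, not the project's actual parameters):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Capped exponential backoff with full jitter: on the nth retry, sleep a
    random amount up to min(cap, base * 2**attempt) seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```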

Trade-offs

SQLite over PostgreSQL: zero infrastructure, portable, no cold start penalty. Flask over FastAPI: the browse board is read-only with one filter endpoint - async adds nothing. Snapshot deploy over incremental sync: simpler correctness story, bounded execution time, stronger sanity proof coverage on a complete dataset.