
Python · Intermediate · Scraping

Scrape product prices and parse rupiah with httpx

Asynchronously scrape a list of product URLs, extract each price with a CSS selector, then parse it into an integer rupiah amount. Concurrency is capped to lower the risk of being blocked.

Published 22 May 2026

The days of scraping with requests + BeautifulSoup are behind us. The modern stack is httpx (async HTTP client), selectolax (fast CSS selectors), and asyncio.Semaphore to cap the number of concurrent requests. With only two small dependencies, this snippet can scrape about 100 URLs in ~10 seconds, depending on how fast the target site responds, without the burst pattern that gets scrapers flagged as bots.
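
To make the role of asyncio.Semaphore concrete, here is a minimal sketch that needs no network access. A hypothetical fake_fetch coroutine sleeps instead of sending a request, and a counter records how many coroutines are inside the semaphore at the same time. The names run_demo and fake_fetch and the 0.01 s delay are assumptions for illustration only.

```python
import asyncio


async def run_demo(n_tasks: int = 20, concurrency: int = 5) -> int:
    """Run n_tasks fake fetches and return the peak number running at once."""
    sem = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def fake_fetch(i: int) -> int:
        nonlocal in_flight, peak
        async with sem:  # at most `concurrency` coroutines get past this line
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stands in for the HTTP round trip
            in_flight -= 1
            return i

    await asyncio.gather(*(fake_fetch(i) for i in range(n_tasks)))
    return peak


if __name__ == "__main__":
    print("peak concurrent fetches:", asyncio.run(run_demo()))
```

Note that a semaphore limits how many requests are in flight, not how many are sent per second. If the target responds very quickly, the request rate can still be high.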

Prerequisites

pip install httpx selectolax
# or
uv pip install httpx selectolax

Code

# scrape_harga.py
import asyncio
import re
import csv
from dataclasses import dataclass
from typing import Optional

import httpx
from selectolax.parser import HTMLParser


@dataclass
class ProductPrice:
    url: str
    title: Optional[str]
    price_rupiah: Optional[int]
    error: Optional[str] = None


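def parse_rupiah(s: str) -> Optional[int]:
    """Mini version of the rupiah parser. See the python-parse-rupiah snippet for the full version."""
    if not s:
        return None
    # Drop a trailing decimal part such as ",00" so "Rp 1.250.000,00" is not read as 125000000
    whole = re.sub(r"[.,]\d{1,2}(?!\d)", "", s)
    # Remove everything else that is not a digit ("Rp", spaces, thousands separators)
    cleaned = re.sub(r"\D", "", whole)
    return int(cleaned) if cleaned else None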


async def fetch_product(
    client: httpx.AsyncClient,
    url: str,
    sem: asyncio.Semaphore,
) -> ProductPrice:
    """Fetch one URL and extract its title and price."""
    async with sem:
        try:
            resp = await client.get(
                url,
                timeout=10.0,
                headers={"User-Agent": "Mozilla/5.0 (compatible; PriceBot/1.0)"},
            )
            resp.raise_for_status()
            
            tree = HTMLParser(resp.text)
            
            # Adjust these selectors to the structure of the target site
            title_el = tree.css_first("h1[itemprop='name'], h1.product-title")
            price_el = tree.css_first("[data-testid='price'], .price-value, [itemprop='price']")
            
            title = title_el.text(strip=True) if title_el else None
            price = parse_rupiah(price_el.text(strip=True)) if price_el else None
            
            return ProductPrice(url=url, title=title, price_rupiah=price)
        
        except httpx.HTTPStatusError as e:
            return ProductPrice(url=url, title=None, price_rupiah=None, error=f"HTTP {e.response.status_code}")
        except Exception as e:
            return ProductPrice(url=url, title=None, price_rupiah=None, error=str(e))


async def scrape_all(urls: list[str], concurrency: int = 5) -> list[ProductPrice]:
    """Scrape many URLs in parallel, with at most `concurrency` requests in flight."""
    sem = asyncio.Semaphore(concurrency)
    
    async with httpx.AsyncClient(follow_redirects=True) as client:
        tasks = [fetch_product(client, url, sem) for url in urls]
        return await asyncio.gather(*tasks)


def save_csv(results: list[ProductPrice], path: str):
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "title", "price_rupiah", "error"])
        writer.writeheader()
        for r in results:
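            writer.writerow({
                "url": r.url,
                "title": r.title or "",
                # Compare against None so a legitimate price of 0 is not written as blank
                "price_rupiah": "" if r.price_rupiah is None else r.price_rupiah,
                "error": r.error or "",
            })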


if __name__ == "__main__":
    urls = [
        "https://example.com/product/1",
        "https://example.com/product/2",
        # ... 100 more
    ]
    
    results = asyncio.run(scrape_all(urls, concurrency=5))
    save_csv(results, "harga.csv")
    
    success = sum(1 for r in results if r.error is None)
    print(f"Done: {success}/{len(results)} succeeded")

When to use it

  • Monitoring competitor prices for products you already sell.
  • Syncing prices from suppliers that have no API.
  • Market research: collecting average product prices across several marketplaces.
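
For the market-research case, the CSV written by save_csv can be summarized with the standard library alone. This is a minimal sketch that assumes the column names url, title, price_rupiah, and error used in the code above. summarize_prices is a hypothetical helper, and the sample rows are made up.

```python
import csv
import io
import statistics
from typing import Iterable, Optional


def summarize_prices(rows: Iterable[dict]) -> Optional[dict]:
    """Return count, mean and median of the non-empty price_rupiah values."""
    prices = [int(r["price_rupiah"]) for r in rows if r.get("price_rupiah")]
    if not prices:
        return None  # nothing was scraped successfully
    return {
        "count": len(prices),
        "mean": statistics.mean(prices),
        "median": statistics.median(prices),
    }


if __name__ == "__main__":
    # Made-up sample in the same layout as harga.csv
    sample = io.StringIO(
        "url,title,price_rupiah,error\n"
        "https://example.com/product/1,A,100000,\n"
        "https://example.com/product/2,B,150000,\n"
        "https://example.com/product/3,,,HTTP 404\n"
    )
    print(summarize_prices(csv.DictReader(sample)))
```

To run it on real output, open harga.csv with newline="" and encoding="utf-8" and pass csv.DictReader(f) to the helper. Rows whose price could not be parsed are skipped rather than counted as zero.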

Notes

  • Rate limit: concurrency=5 means at most 5 requests in flight at once. For strict sites (Tokopedia, Shopee), lower it to 2-3. Add asyncio.sleep(0.5) between requests if you want to be more polite.
  • User-Agent string: don't use the default python-httpx/... because some sites block it outright.
  • Selectors must be adapted per site. Inspect the target site with DevTools and look for stable selectors (avoid hashed class names such as .sc-fzKnXr).
  • HTML structure changes: the target site will likely change its markup, and with it your selectors, about once a quarter. This snippet needs maintenance.
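
The asyncio.sleep pause mentioned in the rate-limit note could be sketched as a small wrapper that keeps the semaphore slot occupied for a short time after each call. polite_call, make_request, and the delay parameter are hypothetical names. In fetch_product, the same effect would come from awaiting asyncio.sleep inside the async with sem block.

```python
import asyncio
import time
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def polite_call(
    sem: asyncio.Semaphore,
    make_request: Callable[[], Awaitable[T]],
    delay: float = 0.5,
) -> T:
    """Run one request under the semaphore, then pause before freeing the slot."""
    async with sem:
        try:
            return await make_request()
        finally:
            # The slot stays occupied during the pause, so each slot
            # issues at most one request per `delay` seconds.
            await asyncio.sleep(delay)


async def demo() -> float:
    """Time three fake requests through a single slot."""
    sem = asyncio.Semaphore(1)

    async def fake_request() -> str:
        return "ok"  # stands in for client.get(...)

    start = time.monotonic()
    calls = [polite_call(sem, fake_request, delay=0.05) for _ in range(3)]
    await asyncio.gather(*calls)
    return time.monotonic() - start


if __name__ == "__main__":
    print(f"3 calls took {asyncio.run(demo()):.2f}s")
```

With concurrency=2 and delay=0.5, this caps the scraper at roughly 4 requests per second regardless of how fast the site answers.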

Scraping ethics: check robots.txt, don't scrape data that requires a login, and respect rate limits. For large e-commerce sites that offer an official API (Tokopedia/Shopee Partner API), use the API rather than scraping.
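
The robots.txt check can be automated with the standard library's urllib.robotparser. In this minimal sketch the rules are parsed from an inline string so it runs offline. The ROBOTS_TXT rules and the is_allowed helper are made up for illustration, and the agent name PriceBot is taken from the User-Agent header in the code above.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules; a real site serves its own file at /robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Allow: /product/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/product/1",
        "https://example.com/checkout/cart",
    ):
        print(url, is_allowed(ROBOTS_TXT, "PriceBot", url))
```

Against a live site you would instead call set_url() with the site's robots.txt address followed by read(), then filter the URL list with can_fetch before passing it to scrape_all.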

# tags

httpx, selectolax, scraping, async
