Scrape product prices and parse rupiah with httpx
Scrape a list of product URLs asynchronously, extract prices with CSS selectors, then parse them into integer rupiah. Concurrency is capped to reduce the risk of being blocked.
Published 22 May 2026
The days of scraping with requests + BeautifulSoup are over. The modern stack is httpx (async), selectolax (fast CSS selectors), and asyncio.Semaphore to cap concurrency. With its small dependency footprint, this snippet can scrape 100 URLs in about 10 seconds at the default concurrency of 5, provided the site answers in roughly half a second per request, without hammering the target site.
Prerequisites
pip install httpx selectolax
# or
uv pip install httpx selectolax
Code
# scrape_harga.py
import asyncio
import csv
import re
from dataclasses import dataclass
from typing import Optional

import httpx
from selectolax.parser import HTMLParser

@dataclass
class ProductPrice:
    """Result for one product URL; error is set when the fetch failed."""

    url: str
    title: Optional[str]
    price_rupiah: Optional[int]
    error: Optional[str] = None
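
def parse_rupiah(s: str) -> Optional[int]:
    """Mini version of the rupiah parser. See the python-parse-rupiah snippet for the full version."""
    if not s:
        return None
    # Drop a trailing decimal part such as ",00" so "Rp 1.250.000,00" is not read as 125000000.
    without_decimals = re.sub(r",\d{1,2}$", "", s.strip())
    # Everything left that is not a digit ("Rp", spaces, "." and "," separators) is noise.
    digits = re.sub(r"\D", "", without_decimals)
    return int(digits) if digits else None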

async def fetch_product(
    client: httpx.AsyncClient,
    url: str,
    sem: asyncio.Semaphore,
) -> ProductPrice:
    """Fetch one URL and extract its title and price."""
    async with sem:  # wait for a free slot before sending the request
        try:
            resp = await client.get(
                url,
                timeout=10.0,
                headers={"User-Agent": "Mozilla/5.0 (compatible; PriceBot/1.0)"},
            )
            resp.raise_for_status()
            tree = HTMLParser(resp.text)
            # Adjust these selectors to match the target site's markup
            title_el = tree.css_first("h1[itemprop='name'], h1.product-title")
            price_el = tree.css_first("[data-testid='price'], .price-value, [itemprop='price']")
            title = title_el.text(strip=True) if title_el else None
            price = parse_rupiah(price_el.text(strip=True)) if price_el else None
            return ProductPrice(url=url, title=title, price_rupiah=price)
        except httpx.HTTPStatusError as e:
            return ProductPrice(url=url, title=None, price_rupiah=None, error=f"HTTP {e.response.status_code}")
        except Exception as e:
            # Some exceptions (timeouts, for example) can have an empty message; fall back to the class name
            return ProductPrice(url=url, title=None, price_rupiah=None, error=str(e) or type(e).__name__)

async def scrape_all(urls: list[str], concurrency: int = 5) -> list[ProductPrice]:
    """Scrape many URLs in parallel, with at most `concurrency` requests in flight."""
    sem = asyncio.Semaphore(concurrency)
    async with httpx.AsyncClient(follow_redirects=True) as client:
        tasks = [fetch_product(client, url, sem) for url in urls]
        return await asyncio.gather(*tasks)

def save_csv(results: list[ProductPrice], path: str) -> None:
    """Write one CSV row per result; missing values become empty cells."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "title", "price_rupiah", "error"])
        writer.writeheader()
        for r in results:
            writer.writerow({
                "url": r.url,
                "title": r.title or "",
                # Compare against None so a parsed price of 0 is not blanked out
                "price_rupiah": "" if r.price_rupiah is None else r.price_rupiah,
                "error": r.error or "",
            })

if __name__ == "__main__":
    urls = [
        "https://example.com/product/1",
        "https://example.com/product/2",
        # ... 100 more
    ]
    results = asyncio.run(scrape_all(urls, concurrency=5))
    save_csv(results, "harga.csv")
    success = sum(1 for r in results if r.error is None)
    print(f"Done: {success}/{len(results)} succeeded")
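
To make the price-parsing step concrete, here is a standalone sketch of how a displayed price string might be reduced to an integer. The name parse_rupiah_sketch is hypothetical, and the rule that a trailing comma followed by one or two digits is a decimal part is an assumption about how target sites format prices; the full python-parse-rupiah snippet may handle this differently.

```python
import re
from typing import Optional


def parse_rupiah_sketch(s: str) -> Optional[int]:
    """Reduce a displayed price such as 'Rp 1.250.000,00' to whole rupiah."""
    if not s:
        return None
    # Assumption: a trailing ",d" or ",dd" is a decimal part and can be dropped.
    without_decimals = re.sub(r",\d{1,2}$", "", s.strip())
    # Remove the currency prefix, spaces, and thousands separators.
    digits = re.sub(r"\D", "", without_decimals)
    # Strings with no digits at all (for example "Hubungi penjual") yield None.
    return int(digits) if digits else None
```

Under these assumptions, "Rp 1.250.000", "Rp1.250.000,00", and "Rp 1.250.000,-" all parse to 1250000, while an empty string or a label with no digits gives None.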
When to use it
- Monitoring competitor prices for products you already sell.
- Syncing prices from suppliers that have no API.
- Market research: collecting average product prices across several marketplaces.
Notes
- Rate limit: concurrency=5 means at most 5 requests in parallel. For strict sites (Tokopedia, Shopee), lower it to 2-3. Add asyncio.sleep(0.5) between requests if you need to be more polite.
- User-Agent string: don't use the default python-httpx/... because some sites block it outright.
- Selectors must be adjusted per site. Inspect the target site with DevTools and look for stable selectors (avoid hashed class names such as .sc-fzKnXr).
- HTML structure changes: the target site will likely change its selectors every quarter or so, so this snippet needs maintenance.
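
The pause between requests mentioned in the rate-limit note can be combined with the semaphore. The following is a minimal sketch, not part of the original snippet: run_politely, its fetch argument (any coroutine function that takes a URL), and the jitter parameter are all hypothetical names introduced for illustration.

```python
import asyncio
import random
from typing import Any, Awaitable, Callable


async def run_politely(
    urls: list[str],
    fetch: Callable[[str], Awaitable[Any]],
    concurrency: int = 3,
    delay: float = 0.5,
    jitter: float = 0.2,
) -> list[Any]:
    """Run fetch(url) for every URL with capped concurrency and a pause per request."""
    sem = asyncio.Semaphore(concurrency)

    async def worker(url: str) -> Any:
        async with sem:
            result = await fetch(url)
            # Sleep while still holding the slot, so each slot sends
            # at most one request per `delay` seconds (plus random jitter).
            await asyncio.sleep(delay + random.uniform(0, jitter))
            return result

    # gather returns results in the same order as `urls`
    return await asyncio.gather(*(worker(u) for u in urls))
```

With concurrency=3 and delay=0.5, this caps throughput at roughly 6 requests per second. One way to use it with the snippet above would be to pass a small wrapper around client.get as fetch.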
Scraping ethics: check robots.txt, don't scrape data that requires a login, and respect rate limits. For large e-commerce sites that have an official API (Tokopedia/Shopee Partner API), use the API instead of scraping.
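
The robots.txt check can be automated with the standard library's urllib.robotparser. This is a sketch under stated assumptions: load_rules and filter_allowed are hypothetical helpers, "PriceBot" is an assumed user-agent token, and the rules are parsed from a string here so the example runs offline. In real use you would point RobotFileParser at the site's robots.txt with set_url() and read().

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file into a rule set."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def filter_allowed(urls: list[str], rules: RobotFileParser, user_agent: str = "PriceBot") -> list[str]:
    """Keep only the URLs that the rules allow this user agent to fetch."""
    return [u for u in urls if rules.can_fetch(user_agent, u)]


# Illustrative rules; a real site's robots.txt will differ.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /cart
Disallow: /account
"""

rules = load_rules(SAMPLE_ROBOTS)
allowed = filter_allowed(
    ["https://example.com/product/1", "https://example.com/cart"],
    rules,
)
print(allowed)  # only the product URL remains
```

Running the filter before building the URL list for scrape_all keeps disallowed paths out of the crawl entirely.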
# tags
httpx, selectolax, scraping, async