Feature Overview

Control the full crawl workflow from source review to data delivery.

The platform combines external APIs, governance, scheduling, quality checks, Elasticsearch templates, and a browser console so users can understand data sources, task status, and risk.

[Image: Enterprise data platform overview]
[Image: API integration workflow]

1. External API

External API integration

External systems can query sources, jobs, latest runs, and platform capabilities. Dry-run endpoints simulate crawling, parsing, scheduling, deduplication, and robots checks.
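To make the idea of a robots dry run concrete, here is a minimal sketch of what such a check could compute. It uses Python's standard `urllib.robotparser` on robots.txt text passed in as a string, so it touches neither the network nor any storage. The function name `dry_run_robots_check`, the result fields, and the sample URLs are assumptions for illustration, not the platform's actual endpoint contract.

```python
from urllib.robotparser import RobotFileParser


def dry_run_robots_check(robots_txt: str, user_agent: str, url: str) -> dict:
    """Evaluate a URL against robots.txt text without fetching or writing anything."""
    parser = RobotFileParser()
    # parse() accepts the lines of an already-downloaded robots.txt file.
    parser.parse(robots_txt.splitlines())
    return {
        "url": url,
        "user_agent": user_agent,
        "allowed": parser.can_fetch(user_agent, url),
        "dry_run": True,  # hypothetical flag marking the result as simulated
    }


# Sample robots.txt content (illustrative only).
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""

for candidate in ("https://example.com/news/", "https://example.com/private/report"):
    print(dry_run_robots_check(ROBOTS_TXT, "example-bot", candidate))
```

A real dry-run endpoint would presumably download the robots.txt file itself; passing the text in keeps the sketch deterministic and easy to test.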

  • Consistent response envelopes make error handling predictable across clients.
  • Dry-run endpoints make no production writes, so they are suitable for onboarding and testing.
  • OpenAPI-style JSON helps engineering teams understand endpoints and payloads quickly.
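As an illustration of why a consistent envelope makes error handling predictable, the sketch below unwraps every response in one place. The envelope fields (`ok`, `data`, `error`), the error codes, and the `ApiError` class are assumptions made for this example; the platform's actual envelope may be shaped differently.

```python
import json
from typing import Any


class ApiError(Exception):
    """Raised when an envelope reports a failure (hypothetical error shape)."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def unwrap(raw: str) -> Any:
    """Return the payload of a successful envelope, or raise ApiError."""
    envelope = json.loads(raw)
    if envelope.get("ok"):
        return envelope.get("data")
    error = envelope.get("error") or {}
    raise ApiError(error.get("code", "unknown"), error.get("message", ""))


# Two sample responses using the assumed envelope shape.
success = '{"ok": true, "data": {"sources": 3}, "error": null}'
failure = '{"ok": false, "data": null, "error": {"code": "not_found", "message": "no such job"}}'

print(unwrap(success))
try:
    unwrap(failure)
except ApiError as exc:
    print("request failed:", exc.code)
```

Because every endpoint would share the same wrapper, a client needs only one function like `unwrap` rather than per-endpoint error parsing.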

2. Governance Pipeline

Data governance and compliance checks

Each source can carry legal review status, robots information, scheduling cadence, parsing rules, and crawl results. These records form a traceable workflow for safer data collection.

  • Supports API, RSS, XPath, CSS selector, and headless browser source types.
  • Tracks proxy, user-agent, cookie, rate limit, and CAPTCHA strategies.
  • Run history and alerts help teams respond to operational issues faster.
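
One way to picture a source record and the checks it gates is the sketch below. All class names, field names, and default values are hypothetical and chosen only to mirror the attributes listed above; the platform's actual schema is not shown in this overview.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class SourceType(Enum):
    API = "api"
    RSS = "rss"
    XPATH = "xpath"
    CSS = "css"
    HEADLESS = "headless"


class LegalStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class SourceRecord:
    """Hypothetical governance record for one crawl source."""

    name: str
    source_type: SourceType
    legal_status: LegalStatus = LegalStatus.PENDING
    robots_allowed: bool = False          # result of the latest robots check
    schedule_cron: str = "0 * * * *"      # assumed cron-style cadence
    rate_limit_per_minute: int = 30       # assumed rate-limit strategy
    run_history: List[str] = field(default_factory=list)

    def blockers(self) -> List[str]:
        """List the reasons this source must not be crawled yet."""
        reasons = []
        if self.legal_status is not LegalStatus.APPROVED:
            reasons.append(f"legal review is {self.legal_status.value}")
        if not self.robots_allowed:
            reasons.append("robots check has not passed")
        return reasons


source = SourceRecord(name="example-news", source_type=SourceType.RSS)
print(source.blockers())

source.legal_status = LegalStatus.APPROVED
source.robots_allowed = True
print(source.blockers())
```

Returning the list of blockers, rather than a single boolean, keeps the reason for a refusal available for run history and alerts.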

[Image: Data governance pipeline]
[Image: Web console screen]

3. Web System

Browser-based management console

The web app shows sources, jobs, worker runs, alerts, and Elasticsearch templates. User identity is validated by the central member API, and this project stores no account passwords.

  • The console is deployed at `/app` under the same official subdomain.
  • Protected data can be read only after a central member access token is supplied.
  • The current demo data is suitable for acceptance testing and as a basis for later integration with production storage.
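
The token requirement can be sketched as a small request builder that refuses to proceed without a token. The base URL, the `/api/sources` path, and the use of a `Bearer` authorization header are assumptions for illustration; the overview does not specify how the console transmits the central member access token. The sketch only constructs the request and sends nothing.

```python
from urllib.request import Request


def build_protected_request(base_url: str, path: str, access_token: str) -> Request:
    """Build (but do not send) a request for protected console data."""
    if not access_token:
        raise ValueError("a central member access token is required")
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    return Request(
        url,
        headers={
            "Authorization": f"Bearer {access_token}",  # assumed header scheme
            "Accept": "application/json",
        },
    )


# Hypothetical host and path under the /app deployment.
request = build_protected_request("https://data.example.com/app", "/api/sources", "demo-token")
print(request.full_url)
print(request.get_header("Authorization"))
```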