Feature Overview
Control the full crawl workflow from source review to data delivery.
The platform combines external APIs, governance, scheduling, quality checks, Elasticsearch templates, and a browser console so users can understand data sources, task status, and risk.
1. External API
External API integration
External systems can query sources, jobs, latest runs, and platform capabilities. Dry-run endpoints simulate crawling, parsing, scheduling, deduplication, and robots checks.
- Consistent response envelopes make error handling predictable across clients.
- Dry-run (simulation) endpoints perform no production writes, which makes them safe for onboarding and testing.
- OpenAPI-style JSON helps engineering teams understand endpoints and payloads quickly.
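
A consistent envelope lets a client handle every response through one code path. The sketch below assumes a hypothetical envelope with `ok`, `data`, and `error` fields. The platform's actual field names are not specified here, so treat `unwrap`, `ApiError`, and the field names as placeholders.

```python
from typing import Any


class ApiError(Exception):
    """Raised when an envelope reports a failure."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def unwrap(envelope: dict[str, Any]) -> Any:
    """Return the payload of a successful envelope, or raise ApiError.

    Assumed (hypothetical) envelope shape:
      success: {"ok": True, "data": ...}
      failure: {"ok": False, "error": {"code": "...", "message": "..."}}
    """
    if envelope.get("ok"):
        return envelope.get("data")
    error = envelope.get("error") or {}
    raise ApiError(error.get("code", "unknown"), error.get("message", ""))
```

With a helper like this, calling code only needs one `except ApiError` branch, whichever endpoint produced the response.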
2. Governance Pipeline
Data governance and compliance checks
Each source can carry legal review status, robots information, scheduling cadence, parsing rules, and crawl results. These records form a traceable workflow for safer data collection.
- Supports API, RSS, XPath, CSS selector, and headless browser source types.
- Tracks proxy, user-agent, cookie, rate limit, and CAPTCHA strategies.
- Run history and alerts help teams respond to operational issues faster.
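
One way to picture a governed source record is as a small data structure whose fields gate whether a crawl may be scheduled. The following is a minimal sketch, not the platform's schema: `Source`, `SourceType`, `blocking_reasons`, and the status value `"approved"` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class SourceType(Enum):
    """The source types listed above."""
    API = "api"
    RSS = "rss"
    XPATH = "xpath"
    CSS = "css"
    HEADLESS = "headless"


@dataclass
class Source:
    name: str
    source_type: SourceType
    legal_review: str        # hypothetical values: "pending", "approved", "rejected"
    robots_allowed: bool     # outcome of a robots check for this source
    interval_minutes: int    # scheduling cadence
    rate_limit_per_minute: int = 60


def blocking_reasons(source: Source) -> list[str]:
    """List the governance checks that currently block crawling this source."""
    reasons = []
    if source.legal_review != "approved":
        reasons.append(f"legal review is {source.legal_review}")
    if not source.robots_allowed:
        reasons.append("robots rules disallow crawling")
    if source.interval_minutes <= 0:
        reasons.append("scheduling cadence is not set")
    return reasons
```

Returning the reasons rather than a single boolean keeps the decision traceable: an empty list means the source may be scheduled, and a non-empty list can be stored or shown as an alert.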
3. Web System
Browser-based management console
The web app shows sources, jobs, worker runs, alerts, and Elasticsearch templates. User identity is validated by the central member API, so this project stores no account passwords.
- The console is deployed at `/app` under the same official subdomain.
- Protected data is loaded only after a valid central member access token is supplied.
- The console currently serves demo data, which is useful for acceptance testing and as a reference when integrating production storage later.
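
The token requirement can be sketched from the client side as follows. This is only an illustration under stated assumptions: the base URL, the `/api/sources` path, and the use of a Bearer `Authorization` header are hypothetical, since the text above does not say how the token is transmitted.

```python
from urllib.request import Request

# Hypothetical base URL; replace with the real deployment address.
BASE_URL = "https://crawler.example.com"


def build_protected_request(path: str, access_token: str) -> Request:
    """Build (but do not send) a request for a protected endpoint."""
    if not access_token:
        # Without a token issued by the central member API,
        # protected data should not be requested at all.
        raise ValueError("a central member access token is required")
    return Request(
        BASE_URL + path,
        headers={
            # Assumption: the token is sent as a Bearer credential.
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/json",
        },
    )


# Example: prepare a request for a hypothetical sources endpoint.
request = build_protected_request("/api/sources", "demo-token")
```

The request object can then be passed to `urllib.request.urlopen`. Keeping construction separate from sending makes the token check easy to test without network access.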