Automated job data pipeline for LinkedIn with intelligent skill extraction and real-time analytics.
A production-ready job scraping system that collects job listings from LinkedIn, extracts technical skills using regex-based pattern matching, and provides interactive analytics through a Streamlit dashboard.
| Feature | Description |
|---|---|
| Two-Phase Scraping | Separate URL collection and detail extraction for resilience |
| 3-Layer Skill Extraction | 977 skills with regex patterns, minimal false positives |
| 150 Role Categories | Automatic role normalization with pattern matching |
| Real-Time Analytics | Interactive charts, skill trends, and export capabilities |
| Adaptive Rate Limiting | Circuit breaker with auto-tuning concurrency (2-10 workers) |
| Resume Capability | Checkpoint-based recovery from interruptions |
Job_Scrapper/
├── README.md # This file
├── requirements.txt # Production dependencies
├── requirements-dev.txt # Development dependencies
├── .gitignore # Git ignore rules
│
├── code/ # All source code
│ ├── streamlit_app.py # Main dashboard entry point
│ ├── run_scraper.py # CLI scraper runner
│ ├── save_linkedin_cookies.py # LinkedIn authentication helper
│ ├── setup_playwright.sh # Playwright browser installer (WSL/Linux)
│ │
│ ├── data/
│ │ ├── jobs.db # SQLite database (auto-created)
│ │ └── Analysis_Report/ # Generated analysis reports
│ │ ├── Data_Analyst/
│ │ ├── Data_Engineer/
│ │ └── GenAI_DataScience/
│ │
│ ├── src/
│ │ ├── config/ # Configuration files
│ │ │ ├── skills_reference_2025.json # 977 skills with regex patterns
│ │ │ ├── roles_reference_2025.json # 150 role categories
│ │ │ ├── countries.py # Country/location mappings
│ │ │ └── naukri_locations.py
│ │ │
│ │ ├── db/ # Database layer
│ │ │ ├── connection.py # SQLite connection manager
│ │ │ ├── schema.py # Table schemas
│ │ │ └── operations.py # CRUD operations
│ │ │
│ │ ├── models/
│ │ │ └── models.py # Pydantic data models
│ │ │
│ │ ├── scraper/
│ │ │ ├── unified/
│ │ │ │ ├── linkedin/ # LinkedIn scraper components
│ │ │ │ │ ├── concurrent_detail_scraper.py # Multi-tab scraper (up to 10 tabs)
│ │ │ │ │ ├── sequential_detail_scraper.py # Single-tab scraper
│ │ │ │ │ ├── playwright_url_scraper.py # URL collection
│ │ │ │ │ ├── selector_config.py # CSS selectors
│ │ │ │ │ ├── retry_helper.py # 404/503 handling
│ │ │ │ │ └── job_validator.py # Field validation
│ │ │ │ │
│ │ │ │ ├── naukri/ # Naukri scraper components
│ │ │ │ │ ├── url_scraper.py
│ │ │ │ │ ├── detail_scraper.py
│ │ │ │ │ └── selectors.py
│ │ │ │ │
│ │ │ │ ├── scalable/ # Rate limiting & resilience
│ │ │ │ │ ├── adaptive_rate_limiter.py
│ │ │ │ │ ├── checkpoint_manager.py
│ │ │ │ │ └── progress_tracker.py
│ │ │ │ │
│ │ │ │ ├── linkedin_unified.py # LinkedIn orchestrator
│ │ │ │ └── naukri_unified.py # Naukri orchestrator
│ │ │ │
│ │ │ └── services/ # External service clients
│ │ │ ├── playwright_browser.py
│ │ │ └── session_manager.py
│ │ │
│ │ ├── analysis/
│ │ │ └── skill_extraction/ # 3-layer skill extraction
│ │ │ ├── extractor.py # Main AdvancedSkillExtractor class
│ │ │ ├── layer3_direct.py # Pattern matching from JSON
│ │ │ ├── batch_reextract.py # Re-process existing jobs
│ │ │ └── deduplicator.py # Skill normalization
│ │ │
│ │ ├── ui/
│ │ │ └── components/ # Streamlit UI components
│ │ │ ├── kpi_dashboard.py
│ │ │ ├── link_scraper_form.py
│ │ │ ├── detail_scraper_form.py
│ │ │ └── analytics/
│ │ │ ├── skills_charts.py
│ │ │ └── overview_metrics.py
│ │ │
│ │ ├── utils/
│ │ │ └── cleanup_expired_urls.py
│ │ │
│ │ └── validation/
│ │ ├── validation_pipeline.py
│ │ └── single_job_validator.py
│ │
│ ├── scripts/
│ │ ├── extraction/
│ │ │ └── reextract_skills.py
│ │ │
│ │ └── validation/ # Validation suite
│ │ ├── layer1_syntax_check.sh
│ │ ├── layer2_coverage.sh
│ │ ├── layer3_fp_detection.sh
│ │ ├── layer4_fn_detection.sh
│ │ ├── cross_verify_skills.py
│ │ └── run_all_validations.sh
│ │
│ ├── tests/
│ │ ├── test_skill_validation_comprehensive.py
│ │ └── test_linkedin_selectors.py
│ │
│ └── docs/ # Documentation
│ └── archive/ # Historical docs
│
└── Analysis/ # Downloaded CSVs and notebooks (gitignored)
├── Data Analysis/
│ ├── data_visualizer.ipynb # Analysis notebook (update CSV path for charts)
│ └── csv/ # Add exported CSVs here
│
├── Data Engineering/
│ ├── data_visualizer.ipynb
│ └── csv/
│
└── GenAI & DataScience/
├── data_visualizer.ipynb
└── csv/
- Python 3.11 or higher
- Git
git clone https://github.com/Gaurav-Wankhede/Job-Scrapper.git
cd Job-Scrapper
# Create virtual environment
python -m venv venv-win
# Activate
.\venv-win\Scripts\Activate.ps1
# Install dependencies
pip install -r requirements.txt
# Linux/WSL
git clone https://github.com/Gaurav-Wankhede/Job-Scrapper.git
cd Job-Scrapper
# Create virtual environment
python3 -m venv venv-linux
# Activate
source venv-linux/bin/activate
# Install dependencies
python -m pip install -r requirements.txt
Note for dual-boot users: keep separate venvs (venv-win/ and venv-linux/), as Python virtual environments are not cross-platform compatible.
# Windows
playwright install chromium
# Linux/WSL (use python -m prefix)
python -m playwright install chromium
cd code
# Windows
streamlit run streamlit_app.py
# Linux/WSL (use python -m prefix)
python -m streamlit run streamlit_app.py
The dashboard opens at http://localhost:8501.
Phase 1: URL Collection Phase 2: Detail Scraping
┌─────────────────────┐ ┌─────────────────────┐
│ Search Results │ │ Individual Jobs │
│ ├── Fast scroll │ ──▶ │ ├── Full desc │
│ ├── Extract URLs │ │ ├── Skills parse │
│ └── Store to DB │ │ └── Store details │
└─────────────────────┘ └─────────────────────┘
job_urls table jobs table
Benefits:
- Resilience: If detail scraping fails, URLs are preserved
- Efficiency: Batch process up to 10 jobs concurrently in Phase 2
- Resumable: Pick up exactly where you left off (see the sketch below)
- Deduplication: Skip already-scraped URLs automatically
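A minimal sketch of how Phase 2 can resume from the job_urls table (column names match the schema later in this README, and the default database path comes from the configuration section; this is not the project's actual checkpoint manager):

```python
import sqlite3

DB_PATH = "data/jobs.db"  # default path from the configuration section

def pending_urls(limit: int = 10) -> list[tuple[str, str]]:
    """Return up to `limit` (job_id, url) pairs collected in Phase 1 but not yet scraped."""
    with sqlite3.connect(DB_PATH) as conn:
        return conn.execute(
            "SELECT job_id, url FROM job_urls WHERE scraped = 0 LIMIT ?", (limit,)
        ).fetchall()

def mark_scraped(job_id: str) -> None:
    """Flag a URL as done so an interrupted run resumes exactly where it left off."""
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute("UPDATE job_urls SET scraped = 1 WHERE job_id = ?", (job_id,))
```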
| Approach | Speed | Accuracy | Maintenance |
|---|---|---|---|
| Regex (chosen) | 0.3s/job | 85-90% | Pattern file updates |
| spaCy NER | 3-5s/job | 75-80% | Model retraining |
| GPT-based | 2-10s/job | 90%+ | API costs |
Our 3-layer approach achieves 85-90% accuracy at roughly 10x the speed of NLP-based alternatives:
- Layer 1: Multi-word phrase extraction (priority matching)
- Layer 2: Context-aware extraction (technical context detection)
- Layer 3: Direct pattern matching (977 skill patterns from JSON; see the sketch below)
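As a rough illustration of Layer 3, the sketch below compiles the patterns from skills_reference_2025.json (format shown later in this README) and matches them against a job description. It is a simplified stand-in for the real AdvancedSkillExtractor, and the relative path assumes you run it from code/:

```python
import json
import re

def load_skill_patterns(path: str = "src/config/skills_reference_2025.json") -> dict[str, list[re.Pattern]]:
    """Compile every skill's regex patterns once; the JSON already encodes case variants."""
    with open(path) as f:
        reference = json.load(f)
    return {
        skill["name"]: [re.compile(p) for p in skill["patterns"]]
        for skill in reference["skills"]
    }

def extract_skills(description: str, compiled: dict[str, list[re.Pattern]]) -> list[str]:
    """Return skill names whose patterns match anywhere in the job description."""
    return [
        name
        for name, patterns in compiled.items()
        if any(p.search(description) for p in patterns)
    ]
```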
- KPI Dashboard - View overall statistics
- Link Scraper - Phase 1: Collect job URLs
- Detail Scraper - Phase 2: Extract job details & skills
- Analytics - Analyze skill trends and export data
cd code
# Run validation suite
bash scripts/validation/run_all_validations.sh
# Re-extract skills for existing jobs
python -m src.analysis.skill_extraction.batch_reextract --batch-size 100
For authenticated scraping with higher limits:
cd code
python save_linkedin_cookies.py
This saves cookies to linkedin_cookies.json for subsequent sessions.
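A hedged sketch of reusing those cookies in a Playwright session (it assumes linkedin_cookies.json holds a list of cookie dicts in Playwright's format; the scraper itself handles this internally):

```python
import json
from playwright.sync_api import sync_playwright

# Assumption: linkedin_cookies.json is a list of cookie dicts as saved by save_linkedin_cookies.py.
with open("linkedin_cookies.json") as f:
    cookies = json.load(f)

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)
    context = browser.new_context()
    context.add_cookies(cookies)              # reuse the saved LinkedIn session
    page = context.new_page()
    page.goto("https://www.linkedin.com/jobs/")
    print(page.title())                       # quick check that the session is live
    browser.close()
```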
{
"total_skills": 977,
"skills": [
{
"name": "Python",
"patterns": ["\\bPython\\b", "\\bpython\\b", "\\bPython3\\b"]
}
]
}
Create a .env file in the code/ directory:
# Database path (default: data/jobs.db)
DB_PATH=data/jobs.db
# Playwright browser path (for WSL)
PLAYWRIGHT_BROWSERS_PATH=.playwright-browsers
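A minimal sketch of reading these variables (assumes the python-dotenv package; the project may load its configuration differently):

```python
import os
from dotenv import load_dotenv  # assumption: python-dotenv is installed

load_dotenv()  # reads code/.env when run from the code/ directory

DB_PATH = os.getenv("DB_PATH", "data/jobs.db")           # default matches the docs
BROWSERS_PATH = os.getenv("PLAYWRIGHT_BROWSERS_PATH")    # optional, mainly for WSL
```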
-- Phase 1: URL Collection
CREATE TABLE job_urls (
job_id TEXT PRIMARY KEY,
platform TEXT NOT NULL,
input_role TEXT NOT NULL,
actual_role TEXT NOT NULL,
url TEXT NOT NULL UNIQUE,
scraped INTEGER DEFAULT 0
);
-- Phase 2: Full Details
CREATE TABLE jobs (
job_id TEXT PRIMARY KEY,
platform TEXT NOT NULL,
actual_role TEXT NOT NULL,
url TEXT NOT NULL UNIQUE,
job_description TEXT,
skills TEXT,
company_name TEXT,
posted_date TEXT,
scraped_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
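A small sketch of querying the jobs table for skill counts, e.g. for a quick ad-hoc report outside the dashboard (it assumes the skills column stores a comma-separated string; adjust the split if the actual format differs):

```python
import sqlite3
from collections import Counter

def top_skills(db_path: str = "data/jobs.db", role: str | None = None, n: int = 20):
    """Tally extracted skills across scraped jobs, optionally filtered by role category."""
    query, params = "SELECT skills FROM jobs", ()
    if role:
        query, params = query + " WHERE actual_role = ?", (role,)
    counts: Counter[str] = Counter()
    with sqlite3.connect(db_path) as conn:
        for (skills,) in conn.execute(query, params):
            if skills:  # assumption: comma-separated skill names
                counts.update(s.strip() for s in skills.split(",") if s.strip())
    return counts.most_common(n)
```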
| Metric | Value |
|---|---|
| URL Collection | 200-300 URLs/min |
| Detail Scraping | 15-20 jobs/min (10 workers) |
| Skill Extraction | 0.3s/job |
| Storage per Job | ~2KB |
cd code
chmod +x setup_playwright.sh
./setup_playwright.sh
Use python3 or the python -m prefix:
python3 -m streamlit run streamlit_app.py
python3 -m pip install package_name
The adaptive rate limiter handles LinkedIn rate limiting automatically (a simplified sketch follows the list below):
- Concurrency reduces from 10 → 2
- Circuit breaker triggers 60s pause
- Gradually recovers when stable
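A simplified sketch of that behaviour (the real adaptive_rate_limiter.py is more involved; the error threshold of 5 here is an assumed value):

```python
import time

class SimpleAdaptiveLimiter:
    """Shrink concurrency on errors, pause when the circuit breaker trips, recover gradually."""

    def __init__(self, min_workers: int = 2, max_workers: int = 10, pause_seconds: int = 60):
        self.min_workers = min_workers
        self.max_workers = max_workers
        self.pause_seconds = pause_seconds
        self.workers = max_workers
        self.consecutive_errors = 0

    def record_error(self) -> None:
        self.consecutive_errors += 1
        self.workers = max(self.min_workers, self.workers - 2)  # back off quickly
        if self.consecutive_errors >= 5:                         # assumed trip threshold
            time.sleep(self.pause_seconds)                       # 60s circuit-breaker pause
            self.consecutive_errors = 0

    def record_success(self) -> None:
        self.consecutive_errors = 0
        if self.workers < self.max_workers:
            self.workers += 1                                    # recover one worker at a time
```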
pkill -f streamlit
python -m streamlit run streamlit_app.py
pip install -r requirements-dev.txt
cd code
python -m pytest tests/ -v
cd code
python -m basedpyright src/
MIT License - See LICENSE file for details.