LinkedIn Job Postings ML Pipeline

Full ML pipeline on 123,849 LinkedIn postings (2023–2024). Salary prediction, skills demand analysis (213K pairs), NLP on descriptions. 7 CSV files joined. Pay-period normalization (hourly→yearly).

Dataset

123,849 LinkedIn job postings, 7 relational CSV files

Approach

7-file join → NLP feature extraction → salary regression + skills demand analysis

Tech Stack
Python, Pandas, Scikit-learn, XGBoost, LightGBM, TF-IDF, NLTK
Keywords
NLP, Salary Prediction, XGBoost, LightGBM, Labor Market, TF-IDF
Visualizations: 6 Charts
Deep Dive

End-to-end ML pipeline on a large LinkedIn dataset with rich relational structure.

Dataset (7 files joined)

File            Rows      Info
postings.csv    123,849   Title, company, description, location
companies.csv   24,473    Size, industry, followers
salaries.csv    40,785    Ranges (32.9% posting coverage)
job_skills.csv  213,768   Skill→job mappings
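
A minimal sketch of how the files above might be joined with pandas. The key columns (`job_id`, `company_id`), the other column names, and the toy rows are assumptions for illustration; the real CSV schemas are not shown here. The sketch also assumes at most one salary row per posting.

```python
import pandas as pd

# Tiny in-memory stand-ins for the CSV files; column names are assumed.
postings = pd.DataFrame({
    "job_id": [1, 2, 3],
    "company_id": [10, 10, 20],
    "title": ["Data Scientist", "Analyst", "Operations Manager"],
})
companies = pd.DataFrame({"company_id": [10, 20], "company_size": [5, 2]})
salaries = pd.DataFrame({
    "job_id": [1, 3],
    "max_salary": [150000.0, 30.0],
    "pay_period": ["YEARLY", "HOURLY"],
})
job_skills = pd.DataFrame({
    "job_id": [1, 1, 2],
    "skill": ["Python", "SQL", "SQL"],
})


def build_master(postings, companies, salaries, job_skills):
    """Left-join everything onto postings, keeping one row per posting."""
    # Collapse the one-to-many skills table to one row per posting first,
    # so the merge cannot duplicate posting rows.
    skills = (
        job_skills.groupby("job_id")["skill"]
        .agg(list)
        .rename("skills")
        .reset_index()
    )
    df = postings.merge(companies, on="company_id", how="left", validate="m:1")
    df = df.merge(salaries, on="job_id", how="left", validate="1:1")
    df = df.merge(skills, on="job_id", how="left", validate="1:1")
    return df


master = build_master(postings, companies, salaries, job_skills)
# Share of postings that have a salary row after the join.
coverage = master["max_salary"].notna().mean()
```

Left joins keep postings that have no salary or skills, which is what lets salary coverage be measured after the join. The `validate` argument makes pandas raise an error if a key is unexpectedly duplicated.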

Salary Coverage — Pay Period Normalization

  • Yearly: 23K (direct)
  • Hourly: 16K (× 2,080 → yearly)
  • Monthly: 539 (× 12)
  • Weekly: 180 (× 52)
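
The multipliers above could be applied with a small lookup, sketched below. The pay-period labels (`"YEARLY"`, `"HOURLY"`, and so on) and the function name `to_yearly` are assumptions; the dataset's actual label strings may differ.

```python
# Multipliers from the list above; 2,080 = 40 hours x 52 weeks.
ANNUAL_FACTOR = {"YEARLY": 1, "HOURLY": 2080, "MONTHLY": 12, "WEEKLY": 52}


def to_yearly(amount, pay_period):
    """Return the annualized amount, or None for an unknown pay period."""
    factor = ANNUAL_FACTOR.get(pay_period.strip().upper())
    if factor is None:
        return None  # leave unrecognized periods for manual review
    return amount * factor


hourly_as_yearly = to_yearly(25.0, "hourly")    # 25 x 2,080
monthly_as_yearly = to_yearly(4000, "MONTHLY")  # 4,000 x 12
```

The 2,080 factor assumes a full-time, year-round schedule, so annualized hourly figures may overstate pay for part-time or seasonal roles.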

Task 1: Salary Prediction (Regression)

Features: pay-period normalization, TF-IDF on descriptions, company size, seniority from title. Key predictors: job title, company size, location, required skills, seniority.
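
A minimal sketch of how such features might feed a regressor in scikit-learn. Ridge regression stands in here for the XGBoost/LightGBM models named in the tech stack, to keep the example short. The column names, the seniority keyword lists, the log-salary target, and the toy rows are all assumptions, not the project's actual setup.

```python
import re

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def seniority_from_title(title):
    """Crude keyword rule (assumed); word boundaries avoid 'international'."""
    t = title.lower()
    if re.search(r"\b(senior|sr|lead|principal|staff)\b", t):
        return "senior"
    if re.search(r"\b(junior|jr|intern|entry)\b", t):
        return "junior"
    return "mid"


# Toy rows standing in for the salaried subset; all values are made up.
train = pd.DataFrame({
    "title": ["Senior Data Scientist", "Junior Analyst",
              "Operations Manager", "Lead ML Engineer"],
    "description": ["python machine learning models", "sql reporting excel",
                    "logistics scheduling vendors", "python deep learning mlops"],
    "company_size": [5, 2, 3, 6],
    "yearly_salary": [160000.0, 60000.0, 55000.0, 190000.0],
})
train["seniority"] = train["title"].map(seniority_from_title)

features = ColumnTransformer([
    ("text", TfidfVectorizer(), "description"),  # scalar name -> 1-D text input
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["seniority"]),
    ("num", "passthrough", ["company_size"]),
])
model = Pipeline([("features", features), ("reg", Ridge(alpha=1.0))])

X = train[["description", "company_size", "seniority"]]
# Fit on log salary so large salaries do not dominate the squared error.
model.fit(X, np.log(train["yearly_salary"]))
pred = np.exp(model.predict(X))  # back-transform to dollars
```

Swapping `Ridge` for a gradient-boosted regressor only changes the final pipeline step. The feature transformer stays the same.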

Task 2: Skills Demand Analysis

213,768 skill-job pairs → frequency + TF-IDF weighting. Top in-demand: Python, SQL, Communication, Project Management, Machine Learning. Fast-growing 2023–2024: LLMs, Prompt Engineering, Vector Databases.
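
One way the frequency and IDF-style weighting could be computed from skill-job pairs, sketched with the standard library only. The pairs below are hypothetical, and the exact weighting used in the project is not specified, so the plain `log(N / df)` form here is an assumption.

```python
import math
from collections import Counter

# Hypothetical (job_id, skill) pairs standing in for job_skills.csv.
pairs = [
    (1, "Python"), (1, "SQL"),
    (2, "SQL"),
    (3, "SQL"), (3, "Communication"),
]


def skill_demand(pairs):
    """Per-skill posting count, share of postings, and inverse document frequency."""
    # Only postings that list at least one skill are counted here.
    n_jobs = len({job for job, _ in pairs})
    # set() drops repeated pairs so each posting counts a skill once.
    freq = Counter(skill for _, skill in set(pairs))
    return {
        skill: {
            "count": count,
            "share": count / n_jobs,
            "idf": math.log(n_jobs / count),  # 0 for a skill in every posting
        }
        for skill, count in freq.items()
    }


demand = skill_demand(pairs)
top = sorted(demand, key=lambda s: demand[s]["count"], reverse=True)
```

Raw counts rank the most common skills. The IDF column highlights skills that distinguish postings from one another, since a skill listed everywhere gets a weight near zero.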

Task 3: Market Insights

  • 85%+ of postings concentrated in top US and European cities
  • Data Science premium: 3–4× vs Operations base salary
  • Remote premium: +$12K average for fully remote roles

Key Caveat

Only 32.9% of postings have salary data, so selection bias makes the salary model non-representative of the full market.