NLP

LinkedIn Job Postings ML Pipeline

Full ML pipeline on 123,849 LinkedIn job postings (2023–2024): salary prediction, skills demand analysis (213,768 skill-job pairs), and NLP on job descriptions. Joins 7 relational CSV files and normalizes pay periods to yearly salaries (for example, hourly × 2,080).

Dataset

123,849 LinkedIn job postings, 7 relational CSV files

Approach

7-file join → NLP feature extraction → salary regression + skills demand analysis

Tech Stack

Python, Pandas, Scikit-learn, XGBoost, LightGBM, TF-IDF, NLTK

Keywords

NLP, Salary Prediction, XGBoost, LightGBM, Labor Market, TF-IDF

Deep Dive

End-to-end ML pipeline on a large LinkedIn dataset with rich relational structure.

Dataset (7 files joined)

File	Rows	Info
postings.csv	123,849	Title, company, description, location
companies.csv	24,473	Size, industry, followers
salaries.csv	40,785	Ranges (32.9% posting coverage)
job_skills.csv	213,768	Skill→job mappings
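
A minimal sketch of how such a relational join might look in pandas. The key columns (`job_id`, `company_id`) and the other column names are assumptions for illustration, not confirmed by the dataset description above, and the tiny frames stand in for the real CSV files.

```python
import pandas as pd

# Tiny stand-in frames; all column names here are assumed for illustration.
postings = pd.DataFrame({
    "job_id": [1, 2, 3],
    "company_id": [10, 10, 20],
    "title": ["Data Scientist", "Analyst", "Operations Lead"],
})
companies = pd.DataFrame({"company_id": [10, 20], "company_size": [5, 2]})
salaries = pd.DataFrame({
    "job_id": [1],
    "max_salary": [150000.0],
    "pay_period": ["YEARLY"],
})
job_skills = pd.DataFrame({
    "job_id": [1, 1, 2],
    "skill": ["Python", "SQL", "SQL"],
})

# Collapse the one-to-many skills table to one row per posting before merging,
# so the join does not duplicate postings.
skills_per_job = (
    job_skills.groupby("job_id")["skill"].agg(list).rename("skills").reset_index()
)

# Left joins keep every posting, including those without salary or skills.
joined = (
    postings
    .merge(companies, on="company_id", how="left", validate="many_to_one")
    .merge(salaries, on="job_id", how="left")
    .merge(skills_per_job, on="job_id", how="left", validate="one_to_one")
)

# Share of postings that have a salary row after the join.
salary_coverage = joined["max_salary"].notna().mean()
```

Aggregating skills before merging keeps one row per posting, which matters when a coverage figure is computed as a share of postings.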

Salary Coverage — Pay Period Normalization

▸Yearly: 23K (direct)
▸Hourly: 16K (× 2,080 → yearly)
▸Monthly: 539 (× 12)
▸Weekly: 180 (× 52)
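
The multipliers above can be sketched as a lookup table applied per row. The column names `salary` and `pay_period`, the uppercase period labels, and the helper `to_yearly` are assumptions made for this example; only the factors themselves come from the list above.

```python
import pandas as pd

# Multipliers from the list above; 2,080 = 40 hours x 52 weeks.
ANNUAL_FACTOR = {"YEARLY": 1, "HOURLY": 2080, "MONTHLY": 12, "WEEKLY": 52}

def to_yearly(salaries: pd.DataFrame) -> pd.Series:
    """Return annualized salaries; unknown pay periods become NaN."""
    factor = salaries["pay_period"].str.upper().map(ANNUAL_FACTOR)
    return salaries["salary"] * factor

# Made-up rows, one per pay period plus one unmapped label.
example = pd.DataFrame({
    "salary": [120000.0, 50.0, 6000.0, 1500.0, 100.0],
    "pay_period": ["YEARLY", "HOURLY", "MONTHLY", "WEEKLY", "BIWEEKLY"],
})
example["salary_yearly"] = to_yearly(example)
```

Mapping unknown periods to NaN rather than guessing a factor keeps unexpected labels visible instead of silently mis-scaling them.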

Task 1: Salary Prediction (Regression)

▸Features: pay-period normalization, TF-IDF on descriptions, company size, seniority from title
▸Key predictors: job title, company size, location, required skills, seniority
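
A rough sketch of how the Task 1 features might be combined in a scikit-learn pipeline. `Ridge` is used here only as a lightweight stand-in for the XGBoost and LightGBM models named in the tech stack; the helper `seniority_from_title`, its keyword lists, the column names, and the training rows are all hypothetical.

```python
import re

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

def seniority_from_title(title: str) -> int:
    """Hypothetical helper: 0 = junior, 1 = mid, 2 = senior."""
    lowered = title.lower()
    if re.search(r"\b(senior|sr|lead|principal|staff)\b", lowered):
        return 2
    if re.search(r"\b(junior|jr|intern|entry)\b", lowered):
        return 0
    return 1

# Made-up training rows for illustration only.
train = pd.DataFrame({
    "title": [
        "Senior Data Scientist", "Marketing Intern", "Operations Manager",
        "Lead ML Engineer", "Junior Analyst", "International Sales Manager",
    ],
    "description": [
        "Build machine learning models in Python and SQL",
        "Support social media campaigns",
        "Coordinate warehouse logistics and scheduling",
        "Lead deployment of ML systems in Python",
        "Prepare SQL reports and dashboards",
        "Manage sales accounts across regions",
    ],
    "company_size": [5, 2, 3, 6, 2, 4],
    "salary_yearly": [165000.0, 41600.0, 70000.0, 190000.0, 60000.0, 95000.0],
})
train["seniority"] = train["title"].map(seniority_from_title)

feature_columns = ["description", "company_size", "seniority"]
features = ColumnTransformer([
    # A single column name (not a list) passes 1-D text to the vectorizer.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2)), "description"),
    ("numeric", "passthrough", ["company_size", "seniority"]),
])
model = Pipeline([("features", features), ("regressor", Ridge(alpha=1.0))])
model.fit(train[feature_columns], train["salary_yearly"])
predictions = model.predict(train[feature_columns])
```

Keeping the vectorizer inside the pipeline means the TF-IDF vocabulary is learned only from the training rows, which avoids leakage when the pipeline is cross-validated.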

Task 2: Skills Demand Analysis

▸213,768 skill-job pairs → frequency + TF-IDF weighting
▸Top in-demand: Python, SQL, Communication, Project Management, Machine Learning
▸Fast-growing 2023–2024: LLMs, Prompt Engineering, Vector Databases
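
One possible reading of "frequency + TF-IDF weighting" is to treat each posting's skill list as a document: raw demand is the number of postings that mention a skill, and the inverse document frequency down-weights skills that appear almost everywhere. The sketch below follows that assumption; the pairs are made up and the exact weighting used in the project is not stated above.

```python
import math
from collections import Counter

# Made-up (job_id, skill) pairs standing in for job_skills.csv.
pairs = [
    (1, "Python"), (1, "SQL"),
    (2, "SQL"),
    (3, "SQL"), (3, "Communication"),
]

n_jobs = len({job_id for job_id, _ in pairs})

# Demand = number of distinct postings that list each skill.
demand = Counter(skill for _, skill in set(pairs))

# IDF is 0 for a skill listed by every posting and grows for rarer skills.
idf = {skill: math.log(n_jobs / count) for skill, count in demand.items()}

top_skill = demand.most_common(1)
```

In this toy data SQL is the most demanded skill but has an IDF of zero, which shows why the two views are complementary: frequency ranks what is common, IDF highlights what distinguishes one posting from another.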

Task 3: Market Insights

▸85%+ of postings are concentrated in top US and European cities
▸Data Science premium: 3–4× the Operations base salary
▸Remote premium: +$12K average for fully remote roles
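
A figure like the remote premium can be sketched as a difference of group means. The column names `remote_allowed` and `salary_yearly` are assumptions, and the rows below are invented for illustration, so the resulting number says nothing about the actual dataset.

```python
import pandas as pd

# Made-up rows for illustration only; column names are assumed.
jobs = pd.DataFrame({
    "remote_allowed": [True, True, False, False, False],
    "salary_yearly": [130000.0, 110000.0, 100000.0, 90000.0, 110000.0],
})

is_remote = jobs["remote_allowed"]

# Raw difference in mean yearly salary between remote and non-remote roles.
# This does not control for role mix, seniority, or location.
remote_premium = (
    jobs.loc[is_remote, "salary_yearly"].mean()
    - jobs.loc[~is_remote, "salary_yearly"].mean()
)
```

Because remote roles may skew toward higher-paid job families, a raw difference like this is best read as descriptive rather than as the effect of remote work itself.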

Key Caveat

Only 32.9% of postings have salary data, so selection bias makes the salary model non-representative of the full market.