Problem Statement
This project aims to help recruiters who need to work out appropriate salary ranges to offer candidates, as well as candidates who want clarity about those ranges.
The problem addressed by this project is the estimation of salary ranges for professionals in the data industry, specifically for the roles of Data Analyst, Data Engineer, Data Scientist, and Machine Learning Engineer. The notebook is linked at the bottom of the article.
Dataset
The dataset used is a public salary dataset collected anonymously from professionals worldwide in the ML and Data Science space. The data includes a description at the link. The dataset contains salaries for many different roles, but I filtered it down to four roles, i.e. Data Analyst, Data Engineer, Data Scientist, and Machine Learning Engineer.
EDA
A short description of the dataset:
This is the distribution of experience levels in the data. SE: Senior, MI: Mid-Level, EN: Entry-Level, EX: Executive-Level.
A boxplot of the salary distribution.
Data Preprocessing
I started by converting the employee residence and company location from country to continent using the pycountry-convert library.
!pip install -q pycountry-convert
import pycountry_convert as pc

def get_continent(col):
    # Map a country name or ISO alpha-2 code to its continent name; return None on failure.
    try:
        if len(col) == 2:
            country_code = col
        else:
            country_code = pc.country_name_to_country_alpha2(col.strip('\'"'))
        continent_name = pc.convert_continent_code_to_continent_name(pc.country_alpha2_to_continent_code(country_code))
        return continent_name
    except Exception:
        return None

df['company_location'] = df['company_location'].apply(get_continent)
df['employee_residence'] = df['employee_residence'].apply(get_continent)
Dropping unnecessary columns.
columns_to_drop = ['salary', 'salary_currency','remote_ratio']
new_df=df.drop(columns=columns_to_drop)
Removing outliers in the ‘salary_in_usd’ column to get a better representation and distribution of the data.
def remove_outliers(df, column_name, threshold=1.5):
    # Keep rows within [Q1 - threshold*IQR, Q3 + threshold*IQR] of the given column.
    Q1 = df[column_name].quantile(0.25)
    Q3 = df[column_name].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - threshold * IQR
    upper_bound = Q3 + threshold * IQR
    filtered_df = df[(df[column_name] >= lower_bound) & (df[column_name] <= upper_bound)]
    return filtered_df

new_df = remove_outliers(new_df, 'salary_in_usd')
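To make the rule concrete, here is a minimal dependency-free sketch of the same IQR filter on a handful of made-up salaries. The function name `iqr_bounds` and the sample values are illustrative assumptions, not part of the notebook; `statistics.quantiles` with `method="inclusive"` is used because it interpolates linearly, like the pandas default.

```python
import statistics

def iqr_bounds(values, threshold=1.5):
    # Quartiles with linear interpolation over the sample
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - threshold * iqr, q3 + threshold * iqr

# Made-up salaries in USD; the last one is an obvious outlier
salaries = [40_000, 55_000, 60_000, 75_000, 90_000, 110_000, 600_000]

lower, upper = iqr_bounds(salaries)
kept = [s for s in salaries if lower <= s <= upper]

print(lower, upper)  # -6250.0 163750.0
print(kept)          # the 600,000 entry is dropped
```

Here Q1 is 57,500 and Q3 is 100,000, so the IQR is 42,500 and anything above 163,750 is treated as an outlier.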
Next is to determine the salary ranges for the predictions using the minimum and maximum salary amounts in the dataset.
import numpy as np

max_salary = new_df['salary_in_usd'].max()
min_salary = new_df['salary_in_usd'].min()
num_subranges = 15
# Equal-width bin edges between the minimum and maximum salary
subranges = np.linspace(min_salary, max_salary, num=num_subranges+1, endpoint=True)
range_labels = []
for i in range(len(subranges)-1):
    subrange_min = int(subranges[i])
    subrange_max = int(subranges[i+1])
    range_label = f"{subrange_min:,} - {subrange_max:,}"
    range_labels.append(range_label)
range_labels
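As a worked example of the binning arithmetic, the sketch below builds equal-width ranges without NumPy. The helper name `make_range_labels` is hypothetical. With a minimum of 15,000, a maximum of 286,000 and 8 subranges, it reproduces the eight hard-coded ranges used later in `make_prediction`, which suggests those values came from a run with 8 subranges rather than 15; that is an inference from the numbers, not something the article states.

```python
def make_range_labels(min_salary, max_salary, num_subranges):
    # Width of each equal-sized bin
    step = (max_salary - min_salary) / num_subranges
    # num_subranges + 1 edges, from the minimum to the maximum inclusive
    edges = [min_salary + i * step for i in range(num_subranges + 1)]
    # Pair consecutive edges into "low - high" labels
    return [f"{int(lo):,} - {int(hi):,}" for lo, hi in zip(edges, edges[1:])]

labels = make_range_labels(15_000, 286_000, 8)
for label in labels:
    print(label)
```

Each bin is (286,000 - 15,000) / 8 = 33,875 wide, so the first label is "15,000 - 48,875" and the last is "252,125 - 286,000".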
Performing one-hot encoding on the categorical columns in our dataset.
categorical_cols = ['experience_level','employment_type','job_title','company_size','employee_residence','company_location']
# One indicator column per category, named <column>_<category>
encoded_df = pd.get_dummies(new_df[categorical_cols], prefix=categorical_cols, prefix_sep='_')
df_encoded = pd.concat([new_df.drop(categorical_cols, axis=1), encoded_df], axis=1)
# Encode the work year as indicator columns too (year_2020, year_2021, ...)
df_encoded_ = pd.get_dummies(df_encoded['work_year'], prefix='year')
df_encoded_f = pd.concat([df_encoded, df_encoded_], axis=1)
Model Fitting
I tried several models, which you can see in the GitHub repo linked below, but the best-performing one was ridge regression.
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Search over regularisation strength and solver with 10-fold cross-validation
param_grid = {'alpha': [0.1, 1.0, 10.0],'solver': ['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']}
rig = Ridge()
grid_search = GridSearchCV(rig, param_grid, scoring='r2', cv=10)
grid_search.fit(X_train, y_train)
best_ridge = grid_search.best_estimator_
y_pred = best_ridge.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)
print("Mean Absolute Error (MAE):", mae)
print("R-squared (R2) Score:", r2)
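For readers who want to see what the three metrics measure, the sketch below computes them by hand on three made-up salary predictions. The function name `regression_metrics` and the numbers are illustrative assumptions; in the notebook itself the scikit-learn functions above do this work.

```python
def regression_metrics(y_true, y_pred):
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    # Mean squared error: average of squared errors (in squared USD)
    mse = sum(e ** 2 for e in errors) / n
    # Mean absolute error: average error size, in the same unit as the salary
    mae = sum(abs(e) for e in errors) / n
    # R2: 1 minus the ratio of residual variation to total variation
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    return mse, mae, r2

# Made-up values: every prediction is off by 10,000 USD
y_true = [100_000, 150_000, 200_000]
y_pred = [110_000, 140_000, 190_000]

mse, mae, r2 = regression_metrics(y_true, y_pred)
print(mse, mae, round(r2, 2))
```

MAE is usually the easiest of the three to explain to a recruiter, since it reads directly as "the prediction is off by about this many dollars on average".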
Feature Importances
Plotting the features that influenced the model.
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="whitegrid")
# Use the absolute value of each ridge coefficient as its importance
feature_importances = np.abs(model.coef_)
print(feature_importances.shape)
feature_names = X_train.columns
original_feature_names = ['experience_level','employment_type','job_title','company_size','year','company_location','employee_residence']
# Note: if the lengths differ, only the first len(original_feature_names) coefficients are kept
if feature_importances.shape[0] != len(original_feature_names):
    feature_importances = feature_importances[:len(original_feature_names)]
importance_df = pd.DataFrame({'Feature': original_feature_names, 'Importance': feature_importances})
importance_df = importance_df.groupby('Feature', as_index=False).sum()
importance_df = importance_df.sort_values('Importance', ascending=False)
plt.figure(figsize=(10, 6))
sns.barplot(data=importance_df, x='Importance', y='Feature', palette='viridis')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Feature Importances - Ridge Regression')
plt.show()
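Because the model is trained on one-hot columns, there is one coefficient per encoded column rather than one per original feature. One alternative to truncating the coefficient array is to sum the absolute coefficients of all columns that share a prefix. The sketch below illustrates that idea; the helper name `aggregate_importances` and the coefficient values are made up for illustration, and this is not what the notebook does.

```python
from collections import defaultdict

def aggregate_importances(column_names, coefficients, original_features):
    # Sum |coefficient| over every encoded column belonging to each original feature
    totals = defaultdict(float)
    for name, coef in zip(column_names, coefficients):
        for feature in original_features:
            if name.startswith(feature + "_"):
                totals[feature] += abs(coef)
                break
    # Largest total first
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hypothetical encoded columns and coefficients
columns = ["experience_level_EN", "experience_level_SE",
           "job_title_data analyst", "year_2023"]
coefs = [-3.0, 5.0, 1.5, -0.5]
features = ["experience_level", "job_title", "year"]

ranking = aggregate_importances(columns, coefs, features)
print(ranking)  # [('experience_level', 8.0), ('job_title', 1.5), ('year', 0.5)]
```

With a fitted model, `columns` would be `X_train.columns` and `coefs` would be `model.coef_`, assuming the column names follow the `<feature>_<category>` pattern produced by the encoding step.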
Making Predictions
Making a prediction using the ridge regression model.
import joblib
import numpy as np

# `model` is the fitted ridge regression, loaded elsewhere (e.g. with joblib)
def make_prediction(feature_values):
    feature_names = ['experience_level_EN', 'experience_level_EX', 'experience_level_MI',
                     'experience_level_SE','employment_type_CT', 'employment_type_FL', 'employment_type_FT',
                     'employment_type_PT', 'job_title_data engineer','job_title_data analyst',
                     'job_title_data scientist', 'job_title_machine learning engineer',
                     'company_size_M', 'company_size_S', 'company_size_L','employee_residence_Africa',
                     'employee_residence_Asia', 'employee_residence_Europe',
                     'employee_residence_North America', 'employee_residence_Oceania',
                     'employee_residence_South America', 'company_location_Africa',
                     'company_location_Asia', 'company_location_Europe',
                     'company_location_North America', 'company_location_Oceania',
                     'company_location_South America', 'year_2020', 'year_2021',
                     'year_2022', 'year_2023']
    # Create a numpy array for the input data.
    # NOTE: these columns must be in the same order as the columns the model was trained on.
    # The order here differs from feature_names above, so check it against X_train.columns.
    input_data = np.array([[feature_values['employee_residence'] == 'Africa',
                            feature_values['employee_residence'] == 'Asia',
                            feature_values['employee_residence'] == 'Europe',
                            feature_values['employee_residence'] == 'North America',
                            feature_values['employee_residence'] == 'Oceania',
                            feature_values['employee_residence'] == 'South America',
                            feature_values['company_location'] == 'Africa',
                            feature_values['company_location'] == 'Asia',
                            feature_values['company_location'] == 'Europe',
                            feature_values['company_location'] == 'North America',
                            feature_values['company_location'] == 'Oceania',
                            feature_values['company_location'] == 'South America',
                            feature_values['experience_level'] == 'EN',
                            feature_values['experience_level'] == 'EX',
                            feature_values['experience_level'] == 'MI',
                            feature_values['experience_level'] == 'SE',
                            feature_values['employment_type'] == 'CT',
                            feature_values['employment_type'] == 'FL',
                            feature_values['employment_type'] == 'FT',
                            feature_values['employment_type'] == 'PT',
                            feature_values['job_title'] == 'data analyst',
                            feature_values['job_title'] == 'data engineer',
                            feature_values['job_title'] == 'data scientist',
                            feature_values['job_title'] == 'machine learning engineer',
                            feature_values['company_size'] == 'M',
                            feature_values['company_size'] == 'S',
                            feature_values['company_size'] == 'L',
                            feature_values['year'] == 2020,
                            feature_values['year'] == 2021,
                            feature_values['year'] == 2022,
                            feature_values['year'] == 2023]])
    # predict() returns an array with one value per input row; take the single prediction
    prediction = model.predict(input_data)[0]
    ranges = [(15000 , 48875),
              (48875 , 82750),
              (82750 , 116625),
              (116625 , 150500),
              (150500 , 184375),
              (184375 , 218250),
              (218250 , 252125),
              (252125 , 286000)]
    # Find the salary range that contains the predicted value
    prediction_range = None
    for range_min, range_max in ranges:
        if range_min <= prediction < range_max:
            prediction_range = f"{range_min:,} - {range_max:,}"
            break
    return prediction_range
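A one-hot model only gives sensible predictions if the input columns arrive in exactly the order used for training. One way to avoid writing that order out by hand is to derive the row from the list of column names. The sketch below shows the idea; `encode_row` is a hypothetical helper, and `feature_names` is assumed to be the training columns (for example `list(X_train.columns)`).

```python
def encode_row(feature_values, feature_names):
    # Names of the indicator columns that should be switched on, e.g. "year_2023"
    active = {f"{key}_{value}" for key, value in feature_values.items()}
    # Walk the training columns in order, so the row can never be misaligned
    return [1 if name in active else 0 for name in feature_names]

# A shortened, illustrative column list
feature_names = ["experience_level_EN", "experience_level_EX",
                 "job_title_data analyst", "job_title_data scientist",
                 "year_2022", "year_2023"]

row = encode_row(
    {"experience_level": "EX", "job_title": "data scientist", "year": 2023},
    feature_names,
)
print(row)  # [0, 1, 0, 1, 0, 1]
```

The resulting list can be wrapped as `np.array([row])` before calling `model.predict`. A value that never appeared in training simply leaves all of its indicator columns at zero.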
An example of a prediction.
input_features = {'experience_level': 'EX',
                  'employment_type': 'FL',
                  'job_title': 'data scientist',
                  'year': 2023,
                  'company_size': 'L',
                  'employee_residence': 'Europe',
                  'company_location': 'Europe',
                  }
prediction = make_prediction(input_features)
print("Prediction:", prediction)
The result is ‘Prediction: 15,000 - 48,875’.
Conclusion
The model would perform better with more data and more features. Contributions are welcome.
The live Streamlit app is here.
The hyperlink to the GitHub repo.