10 minutes to xorbits.lightgbm#
This is a short introduction to xorbits.lightgbm, adapted from LightGBM's quickstart.
Let's take LGBMRegressor as an example and walk through how to build a regression model, learn the relationship between the independent variables (features) and the dependent variable (target), and make predictions based on that relationship.
Customarily, we import and init as follows:
In [1]: import xorbits
In [2]: import xorbits.numpy as np
In [3]: from xorbits.lightgbm import LGBMRegressor
In [4]: from xorbits.sklearn.model_selection import train_test_split
In [5]: xorbits.init()
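By default, init() starts a local Xorbits runtime. If you have already deployed a Xorbits cluster, init can instead connect to it by address; the endpoint below is a placeholder, so check your own deployment for the real one:

import xorbits

# Placeholder supervisor web endpoint; replace with your cluster's address.
xorbits.init("http://127.0.0.1:7788")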
Model Creation#
First, we build an LGBMRegressor model and define its parameters.
The model exposes many adjustable hyperparameters, such as the tree depth, the number of leaf nodes, and the learning rate, which you can tune to optimize the model's performance.
In [6]: lgbm_regressor = LGBMRegressor(learning_rate=0.05, n_estimators=100)
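Every hyperparameter shown by get_params() below can also be set at construction time. As an illustrative sketch (the values are arbitrary, not tuned), a more constrained configuration might look like this:

# Illustrative configuration; values are examples, not recommendations.
constrained_regressor = LGBMRegressor(
    learning_rate=0.05,
    n_estimators=200,  # more boosting rounds at a lower learning rate
    max_depth=6,       # cap the depth of each tree
    num_leaves=31,     # cap the number of leaves per tree
    reg_alpha=0.1,     # L1 regularization
    reg_lambda=0.1,    # L2 regularization
)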
The .get_params method returns a dictionary containing all of the model's parameter names along with their current values. You can inspect these values to understand the model's current configuration.
Inspect the parameters of the LightGBM regressor.
In [7]: paras = lgbm_regressor.get_params()
In [8]: paras
Out[8]:
{'boosting_type': 'gbdt',
'class_weight': None,
'colsample_bytree': 1.0,
'importance_type': 'split',
'learning_rate': 0.05,
'max_depth': -1,
'min_child_samples': 20,
'min_child_weight': 0.001,
'min_split_gain': 0.0,
'n_estimators': 100,
'n_jobs': None,
'num_leaves': 31,
'objective': None,
'random_state': None,
'reg_alpha': 0.0,
'reg_lambda': 0.0,
'subsample': 1.0,
'subsample_for_bin': 200000,
'subsample_freq': 0}
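Since get_params() returns a plain Python dictionary, individual settings can be read directly. A small example using the dictionary above:

lr = paras["learning_rate"]  # 0.05, matching the value in the output above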
Set/modify parameters.
The .set_params method lets you modify the model's parameters by specifying parameter names and their new values, without recreating the model object.
In [9]: lgbm_regressor.set_params(learning_rate=0.1, n_estimators=100)
Out[9]:
LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
importance_type='split', learning_rate=0.1, max_depth=-1,
min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
n_estimators=100, n_jobs=None, num_leaves=31, objective=None,
random_state=None, reg_alpha=0.0, reg_lambda=0.0, subsample=1.0,
subsample_for_bin=200000, subsample_freq=0)
In [10]: lgbm_regressor.get_params()
Out[10]:
{'boosting_type': 'gbdt',
'class_weight': None,
'colsample_bytree': 1.0,
'importance_type': 'split',
'learning_rate': 0.1,
'max_depth': -1,
'min_child_samples': 20,
'min_child_weight': 0.001,
'min_split_gain': 0.0,
'n_estimators': 100,
'n_jobs': None,
'num_leaves': 31,
'objective': None,
'random_state': None,
'reg_alpha': 0.0,
'reg_lambda': 0.0,
'subsample': 1.0,
'subsample_for_bin': 200000,
'subsample_freq': 0}
Data Preparation#
We can use real data as input. For the sake of simplicity, we will use randomly generated x and y data as an example.
In [11]: x = np.random.rand(100)
In [12]: y_regression = 2 * x + 1 + 0.1 * np.random.randn(100)
In [13]: x = x.reshape(-1, 1)
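The reshape turns x into a 2-D array with a single feature column, which is the shape the estimator expects. In practice you would load real data instead, for example with xorbits.pandas; the file name and column names in this sketch are hypothetical:

import xorbits.pandas as pd

df = pd.read_csv("data.csv")            # hypothetical input file
x = df[["feature"]].to_numpy()          # 2-D feature matrix
y_regression = df["target"].to_numpy()  # 1-D target vector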
To train and later evaluate the model, we split the dataset into a training set and a test set.
In [14]: X_train, X_test, y_train, y_test = train_test_split(x, y_regression, test_size=0.2)
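Because the split is random, the numbers below will vary from run to run. Assuming the xorbits wrapper mirrors scikit-learn's train_test_split signature, you can fix a seed for a reproducible split:

# Sketch: fixed seed for reproducibility, assuming random_state is supported.
X_train, X_test, y_train, y_test = train_test_split(
    x, y_regression, test_size=0.2, random_state=42
)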
Model Training#
The .fit method takes the training data (the independent variables X_train and the dependent variable y_train) and fits the model to the data.
During training, the model adjusts its parameters to minimize the error between the predicted values and the actual observations.
In [15]: lgbm_regressor.fit(X_train, y_train)
Out[15]:
LGBMRegressor(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
importance_type='split', learning_rate=0.1,
local_listen_port=60380, machines='127.0.0.1:60380', max_depth=-1,
min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
n_estimators=100, n_jobs=None, num_leaves=31, num_machines=1,
objective=None, random_state=None, reg_alpha=0.0, reg_lambda=0.0,
subsample=1.0, subsample_for_bin=200000, subsample_freq=0,
time_out=120, tree_learner='data')
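Note that the fitted estimator's repr above shows extra settings (local_listen_port, machines, num_machines, tree_learner, time_out) that Xorbits fills in to drive LightGBM's distributed training. LightGBM's scikit-learn fit also accepts a validation set for monitoring; assuming the xorbits wrapper forwards these arguments, a sketch would be:

# Sketch: assumes eval_set is forwarded to LightGBM's fit.
lgbm_regressor.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],  # monitor loss on held-out data
)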
Model Prediction#
Once you have trained a model, you can use the .predict method to apply it to new data and generate predictions.
In [16]: y_pred = lgbm_regressor.predict(X_test)
In [17]: y_pred
Out[17]:
array([2.27261527, 2.77106786, 2.14918293, 1.87012671, 2.27261527,
1.87012671, 1.6382904 , 2.77106786, 2.77106786, 1.25305227,
2.77106786, 2.77106786, 2.27261527, 1.91151191, 1.91151191,
1.52306828, 1.25305227, 2.43232226, 2.54141925, 2.14918293])
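Xorbits arrays are distributed objects. To hand the predictions to a library that expects a local array, you can materialize them; the sketch below assumes the standard to_numpy() conversion:

y_pred_local = y_pred.to_numpy()  # fetch into a local numpy.ndarray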
Model Evaluation#
The .score method is typically used to assess the performance of a machine learning model.
For regression problems, it returns the coefficient of determination (R-squared), which measures how well the model explains the variability in the dependent variable.
Calculate the model's R-squared score on the test set.
In [18]: accuracy = lgbm_regressor.score(X_test, y_test)
In [19]: accuracy
Out[19]: 0.9398650113333329
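As a sanity check, R-squared can be computed directly from its definition, R^2 = 1 - SS_res / SS_tot, using the arrays from above (a minimal sketch):

ss_res = ((y_test - y_pred) ** 2).sum()         # residual sum of squares
ss_tot = ((y_test - y_test.mean()) ** 2).sum()  # total sum of squares
r2 = 1 - ss_res / ss_tot                        # should match the .score output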