A basic machine learning linear regression model with Spark (pyspark)
Beginner level

In this post we will see how to build a very basic linear regression model.
Part 1: Read Dataset
Let’s start by importing SparkSession and initializing a Spark session:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('linear_regression').getOrCreate()
and load our CSV file:
df = spark.read.csv("iris.csv", inferSchema=True, header=True)
Here:
- inferSchema: tells Spark to determine the column data types automatically;
- header: tells Spark that the first line contains the column names.
Take a look at our DataFrame:
df.show()
+---+------------+-----------+------------+-----------+-------+
|_c0|Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|Species|
+---+------------+-----------+------------+-----------+-------+
| 1| 5.1| 3.5| 1.4| 0.2| setosa|
| 2| 4.9| 3.0| 1.4| 0.2| setosa|
| 3| 4.7| 3.2| 1.3| 0.2| setosa|
| 4| 4.6| 3.1| 1.5| 0.2| setosa|
| 5| 5.0| 3.6| 1.4| 0.2| setosa|
| 6| 5.4| 3.9| 1.7| 0.4| setosa|
| 7| 4.6| 3.4| 1.4| 0.3| setosa|
| 8| 5.0| 3.4| 1.5| 0.2| setosa|
| 9| 4.4| 2.9| 1.4| 0.2| setosa|
| 10| 4.9| 3.1| 1.5| 0.1| setosa|
| 11| 5.4| 3.7| 1.5| 0.2| setosa|
| 12| 4.8| 3.4| 1.6| 0.2| setosa|
| 13| 4.8| 3.0| 1.4| 0.1| setosa|
| 14| 4.3| 3.0| 1.1| 0.1| setosa|
| 15| 5.8| 4.0| 1.2| 0.2| setosa|
| 16| 5.7| 4.4| 1.5| 0.4| setosa|
| 17| 5.4| 3.9| 1.3| 0.4| setosa|
| 18| 5.1| 3.5| 1.4| 0.3| setosa|
| 19| 5.7| 3.8| 1.7| 0.3| setosa|
| 20| 5.1| 3.8| 1.5| 0.3| setosa|
+---+------------+-----------+------------+-----------+-------+
We’ll train our model on a regression problem: predicting the Petal_Width of new data.
To do that, we must combine all the columns that we’ll use for prediction into a single column. These columns are called features.
Part 2: Create the cake pan
Follow these steps:
Import the libraries needed to assemble our features column:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
Let’s define our assembler (like a cake pan):
assembler = VectorAssembler(
    inputCols=["Sepal_Length", "Sepal_Width", "Petal_Length"],
    outputCol="features")
and create our new dataset with the new column “features”:
transform = assembler.transform(df)
Take a look at the new DataFrame:
transform.show()
+---+------------+-----------+------------+-----------+-------+-------------+
|_c0|Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|Species| features|
+---+------------+-----------+------------+-----------+-------+-------------+
| 1| 5.1| 3.5| 1.4| 0.2| setosa|[5.1,3.5,1.4]|
| 2| 4.9| 3.0| 1.4| 0.2| setosa|[4.9,3.0,1.4]|
| 3| 4.7| 3.2| 1.3| 0.2| setosa|[4.7,3.2,1.3]|
| 4| 4.6| 3.1| 1.5| 0.2| setosa|[4.6,3.1,1.5]|
| 5| 5.0| 3.6| 1.4| 0.2| setosa|[5.0,3.6,1.4]|
| 6| 5.4| 3.9| 1.7| 0.4| setosa|[5.4,3.9,1.7]|
| 7| 4.6| 3.4| 1.4| 0.3| setosa|[4.6,3.4,1.4]|
| 8| 5.0| 3.4| 1.5| 0.2| setosa|[5.0,3.4,1.5]|
| 9| 4.4| 2.9| 1.4| 0.2| setosa|[4.4,2.9,1.4]|
| 10| 4.9| 3.1| 1.5| 0.1| setosa|[4.9,3.1,1.5]|
| 11| 5.4| 3.7| 1.5| 0.2| setosa|[5.4,3.7,1.5]|
| 12| 4.8| 3.4| 1.6| 0.2| setosa|[4.8,3.4,1.6]|
| 13| 4.8| 3.0| 1.4| 0.1| setosa|[4.8,3.0,1.4]|
| 14| 4.3| 3.0| 1.1| 0.1| setosa|[4.3,3.0,1.1]|
| 15| 5.8| 4.0| 1.2| 0.2| setosa|[5.8,4.0,1.2]|
| 16| 5.7| 4.4| 1.5| 0.4| setosa|[5.7,4.4,1.5]|
| 17| 5.4| 3.9| 1.3| 0.4| setosa|[5.4,3.9,1.3]|
| 18| 5.1| 3.5| 1.4| 0.3| setosa|[5.1,3.5,1.4]|
| 19| 5.7| 3.8| 1.7| 0.3| setosa|[5.7,3.8,1.7]|
| 20| 5.1| 3.8| 1.5| 0.3| setosa|[5.1,3.8,1.5]|
+---+------------+-----------+------------+-----------+-------+-------------+
only showing top 20 rows
We can notice that the new column is there.
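To make the "cake pan" idea concrete, here is a plain-Python sketch of what the assembler does to each row. It is only an illustration, not Spark code: the `assemble` helper is a made-up name, and the three rows are copied from the first lines of the DataFrame shown above.

```python
# Plain-Python illustration only; this is not how Spark implements it.
# Rows copied from the first three lines of the DataFrame printed above.
rows = [
    {"Sepal_Length": 5.1, "Sepal_Width": 3.5, "Petal_Length": 1.4, "Petal_Width": 0.2},
    {"Sepal_Length": 4.9, "Sepal_Width": 3.0, "Petal_Length": 1.4, "Petal_Width": 0.2},
    {"Sepal_Length": 4.7, "Sepal_Width": 3.2, "Petal_Length": 1.3, "Petal_Width": 0.2},
]
input_cols = ["Sepal_Length", "Sepal_Width", "Petal_Length"]


def assemble(row, input_cols, output_col="features"):
    """Return a copy of the row with the chosen columns packed into one list."""
    assembled = dict(row)  # keep the original columns, as the assembler does
    assembled[output_col] = [row[name] for name in input_cols]
    return assembled


assembled_rows = [assemble(row, input_cols) for row in rows]
for row in assembled_rows:
    print(row["features"], row["Petal_Width"])
```

The original columns stay in place and one extra column holds the packed values, which matches the table printed by transform.show().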
Part 3: Extract our train and test dataframes
As already mentioned, we want to predict the Petal_Width variable for new data.
To do that, Spark needs only two columns to train the model: the features and the variable to predict (the label).
So, let’s create our simplified DataFrame with these two columns:
transformed_df = transform.select('features','Petal_Width')
transformed_df.show()
+-------------+-----------+
| features|Petal_Width|
+-------------+-----------+
|[5.1,3.5,1.4]| 0.2|
|[4.9,3.0,1.4]| 0.2|
|[4.7,3.2,1.3]| 0.2|
|[4.6,3.1,1.5]| 0.2|
|[5.0,3.6,1.4]| 0.2|
|[5.4,3.9,1.7]| 0.4|
|[4.6,3.4,1.4]| 0.3|
|[5.0,3.4,1.5]| 0.2|
|[4.4,2.9,1.4]| 0.2|
|[4.9,3.1,1.5]| 0.1|
|[5.4,3.7,1.5]| 0.2|
|[4.8,3.4,1.6]| 0.2|
|[4.8,3.0,1.4]| 0.1|
|[4.3,3.0,1.1]| 0.1|
|[5.8,4.0,1.2]| 0.2|
|[5.7,4.4,1.5]| 0.4|
|[5.4,3.9,1.3]| 0.4|
|[5.1,3.5,1.4]| 0.3|
|[5.7,3.8,1.7]| 0.3|
|[5.1,3.8,1.5]| 0.3|
+-------------+-----------+
only showing top 20 rows
and define our train and test variables:
train, test = transformed_df.randomSplit([0.7,0.3])
0.7 and 0.3 stand for 70% and 30% respectively.
In fact, the test data is usually limited to 20-30% of the original dataset, which leaves enough data for training.
Unbalancing the train and test data too much has several disadvantages:
- with too little train data: the model has few examples to learn from, so its accuracy will be low
- with too much train data: too little is left for testing, so a model that fits the training rows well but predicts new values poorly can go unnoticed
- with too little test data: you can’t be sure whether the model works well
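To get a feel for what a weighted random split does, here is a minimal plain-Python sketch. It assumes each row is assigned independently by a random draw, so the 70/30 proportions are only approximate. The `random_split` helper and the 150 stand-in rows (the size of the classic iris dataset) are assumptions made for the illustration, not Spark internals.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row independently to one split, in proportion to weights."""
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative upper bounds, e.g. roughly [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for index, upper in enumerate(bounds):
            if draw < upper:
                splits[index].append(row)
                break
        else:
            splits[-1].append(row)  # guard against rounding in the last bound
    return splits


rows = list(range(150))  # stand-in rows, not real data
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable between runs. Spark's randomSplit takes an optional seed argument for the same purpose, for example transformed_df.randomSplit([0.7, 0.3], seed=42).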
Part 4: Let’s train our model
Import the class for linear regression:
from pyspark.ml.regression import LinearRegression
Create an instance of this class:
lr = LinearRegression(featuresCol='features', labelCol='Petal_Width',
                      predictionCol='prediction')
In our settings, we can find:
- featuresCol: as the name tells us, the column that holds all our features
- labelCol: the column with the values to predict
- predictionCol: the column that will contain the predictions
And finally: let’s train our model!
lr_model = lr.fit(train)
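If you are curious about what "training" means here, the sketch below fits a line with a single feature in plain Python, using the closed-form least-squares solution. It is a simplified illustration under stated assumptions, not what Spark runs internally: it uses only Petal_Length as input, and the 20 value pairs are taken from the test.show() output in Part 5.

```python
# One-feature least squares: Petal_Width ~ intercept + slope * Petal_Length.
# Values copied from the 20 rows of the test.show() output in Part 5.
petal_length = [1.4, 1.3, 1.4, 1.3, 1.6, 3.3, 1.4, 1.4, 3.3, 1.6,
                1.2, 1.6, 1.3, 1.7, 1.5, 3.9, 1.5, 1.5, 1.5, 4.0]
petal_width = [0.2, 0.2, 0.3, 0.2, 0.2, 1.0, 0.2, 0.1, 1.0, 0.2,
               0.2, 0.4, 0.3, 0.5, 0.4, 1.4, 0.2, 0.1, 0.2, 1.3]

n = len(petal_length)
mean_x = sum(petal_length) / n
mean_y = sum(petal_width) / n

# slope = covariance(x, y) / variance(x); the common 1/n factor cancels out.
covariance = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(petal_length, petal_width))
variance = sum((x - mean_x) ** 2 for x in petal_length)
slope = covariance / variance
intercept = mean_y - slope * mean_x


def predict(x):
    """Predict Petal_Width from Petal_Length with the fitted line."""
    return intercept + slope * x


print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
print("prediction for Petal_Length 1.5:", round(predict(1.5), 3))
```

The model trained above does the same kind of fit with three features at once, so it learns one coefficient per feature plus an intercept.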
Part 5: Let’s do the first prediction
Okay, well done! We trained our model; we yelled a “YES” satisfied with the last line (or at least I did).
Now let’s make our first prediction on the test data:
test_features = test.select('features')
predictions = lr_model.transform(test_features)
Now let’s see the differences between the predicted and original values:
predictions.show()
+-------------+-------------------+
| features| prediction|
+-------------+-------------------+
|[4.4,2.9,1.4]| 0.2165543276475982|
|[4.4,3.0,1.3]| 0.1879534561600471|
|[4.6,3.4,1.4]|0.29685042643249226|
|[4.7,3.2,1.3]|0.17199907241125534|
|[4.7,3.2,1.6]| 0.332200886106345|
|[4.9,2.4,3.3]| 0.9979107324895239|
|[4.9,3.0,1.4]| 0.1320976442188283|
|[4.9,3.6,1.4]| 0.2808960426837003|
|[5.0,2.3,3.3]| 0.9512597161107955|
|[5.0,3.0,1.6]|0.21704757004763853|
|[5.0,3.2,1.2]|0.05304461794247628|
|[5.0,3.4,1.6]|0.31624650235755314|
|[5.0,3.5,1.3]| 0.1808444217399422|
|[5.1,3.3,1.7]| 0.3229960905438546|
|[5.1,3.7,1.5]|0.31539381372370967|
|[5.2,2.7,3.9]| 1.3271597092083902|
|[5.2,3.5,1.5]|0.24394306426750245|
|[5.2,4.1,1.5]| 0.3927414627323743|
|[5.4,3.7,1.5]|0.24983996381996032|
|[5.5,2.3,4.0]| 1.215807531559756|
+-------------+-------------------+
only showing top 20 rows
test.show()
+-------------+-----------+
| features|Petal_Width|
+-------------+-----------+
|[4.4,2.9,1.4]| 0.2|
|[4.4,3.0,1.3]| 0.2|
|[4.6,3.4,1.4]| 0.3|
|[4.7,3.2,1.3]| 0.2|
|[4.7,3.2,1.6]| 0.2|
|[4.9,2.4,3.3]| 1.0|
|[4.9,3.0,1.4]| 0.2|
|[4.9,3.6,1.4]| 0.1|
|[5.0,2.3,3.3]| 1.0|
|[5.0,3.0,1.6]| 0.2|
|[5.0,3.2,1.2]| 0.2|
|[5.0,3.4,1.6]| 0.4|
|[5.0,3.5,1.3]| 0.3|
|[5.1,3.3,1.7]| 0.5|
|[5.1,3.7,1.5]| 0.4|
|[5.2,2.7,3.9]| 1.4|
|[5.2,3.5,1.5]| 0.2|
|[5.2,4.1,1.5]| 0.1|
|[5.4,3.7,1.5]| 0.2|
|[5.5,2.3,4.0]| 1.3|
+-------------+-----------+
only showing top 20 rows
Well! As we can see, the predicted values are very similar to the original ones. So can we conclude that the model is OK? Not so fast!
When we have little data, it is easy to see the differences between the two datasets. But as the amount of data grows, the situation changes. In the next part we will see how to measure how reliable the model is.
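Comparing two tables by eye does not scale, so a first step is to compute the differences directly. The sketch below is plain Python and uses the first eight rows of the two outputs above, with predictions rounded to four decimals; the variable names are made up for the illustration.

```python
# First eight rows of predictions.show() and test.show(), in the same order.
# Predictions are rounded to four decimals for readability.
predicted = [0.2166, 0.1880, 0.2969, 0.1720, 0.3322, 0.9979, 0.1321, 0.2809]
actual = [0.2, 0.2, 0.3, 0.2, 0.2, 1.0, 0.2, 0.1]

# Residual = actual value minus predicted value, one per row.
residuals = [a - p for a, p in zip(actual, predicted)]
abs_errors = [abs(r) for r in residuals]

# Find the row where the model is furthest from the real value.
worst_index = max(range(len(abs_errors)), key=lambda i: abs_errors[i])
worst_error = abs_errors[worst_index]

for a, p, r in zip(actual, predicted, residuals):
    print(f"actual={a:.1f} predicted={p:.4f} residual={r:+.4f}")
print("largest error at row", worst_index + 1, "->", round(worst_error, 4))
```

A list of residuals like this is the raw material for the metrics in the next part.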
Part 6: Check the quality of your model
There are several statistical metrics for measuring how reliable a model is.
Some of them are:
- MSE (mean squared error): the average of the squared differences between the predicted and the actual values
- RMSE (root mean squared error): the square root of the MSE, which is roughly the standard deviation of the residuals (prediction errors)
- R2 Score: the proportion of the variance in the dependent variable that is predictable from the independent variables.
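The three metrics above can be computed by hand in a few lines. This plain-Python sketch uses only the 20 test rows printed in Part 5, with predictions rounded to four decimals, so its numbers describe that small sample and will not match the value Spark reports for the whole training set.

```python
import math

# The 20 rows of predictions.show() and test.show(), in the same order.
# Predictions are rounded to four decimals for readability.
predicted = [0.2166, 0.1880, 0.2969, 0.1720, 0.3322, 0.9979, 0.1321,
             0.2809, 0.9513, 0.2170, 0.0530, 0.3162, 0.1808, 0.3230,
             0.3154, 1.3272, 0.2439, 0.3927, 0.2498, 1.2158]
actual = [0.2, 0.2, 0.3, 0.2, 0.2, 1.0, 0.2, 0.1, 1.0, 0.2,
          0.2, 0.4, 0.3, 0.5, 0.4, 1.4, 0.2, 0.1, 0.2, 1.3]

n = len(actual)
residuals = [a - p for a, p in zip(actual, predicted)]

# MSE: average of the squared residuals.
ss_res = sum(r * r for r in residuals)
mse = ss_res / n

# RMSE: square root of the MSE, in the same unit as Petal_Width.
rmse = math.sqrt(mse)

# R2: 1 minus (unexplained variation / total variation around the mean).
mean_actual = sum(actual) / n
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print("mse: ", round(mse, 4))
print("rmse:", round(rmse, 4))
print("r2:  ", round(r2, 4))
```

An r2 close to 1 means the residuals are small compared with how much Petal_Width varies on its own.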
The discussion here becomes more complicated, but for now it is enough to know that:
- as a rule of thumb, if r2 has a value greater than 0.75, we have built a good model
So, let’s find our r2!
training_summary = lr_model.summary
print("r2: {}".format(training_summary.r2))
r2: 0.9284703394101349
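As we can see, the value of r2 is greater than 0.75, so the model fits our problem well 😀 Keep in mind that lr_model.summary is computed on the training data; to score the held-out rows instead, call lr_model.evaluate(test) and read its r2 attribute.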
You will find the complete example here: https://drive.google.com/file/d/1zZE52jD4dqOdlBPvLDt-gMrbm9LH2K-O/view
You can open the file with Google Colaboratory.
Message me if you run into any issues.
Thank you for reading 🙂
Written by

Antonio De Vivo
Software Engineer