Titanic exercise


With BigML you can simply apply and evaluate a selection of machine learning algorithms in a standardized framework. For this you do not need any coding experience. Create an account on www.bigml.com (free forever).

When you enter BigML for the first time you see a blank canvas:

[Screenshot: the BigML landing page]

Please log in, or if you are already logged in, choose Dashboard.

This is the blank canvas. From here you can upload data, create datasets, manipulate data, and create and evaluate models.

[Screenshot: the blank Dashboard canvas]

When you want to organize your projects, hover over the three dots on the right side of the screen:

[Screenshot: the projects menu]

Here you can create new projects or alter existing ones.

The normal procedure is to create a project first and then upload the data inside that project.
Create a new project and name it “Titanic”.

[Screenshot: creating the “Titanic” project]

Now the project shows up in your project list.

[Screenshot: the project list]

Click on the name Titanic to open the project and start working on the Titanic case.

[Screenshot: the empty Titanic project]

The first step is to get data in; that process is called creating data sources. On the right you see icons showing all the ways to get data in. The options that open up here are the direct ways; other ways are connecting to databases, Elasticsearch, etc.


[Screenshot: the source-creation options]

We create a data source via Create Source From URL, with the URL s3://grioml/Titanic.csv.
I created this source file for the exercise: just a plain, normal CSV file. It could also be an Excel file, for example.

[Screenshot: the created Titanic.csv source]

The following shows the process of creating models and evaluating them.

Titanic data

Let’s take the Titanic data that we used before and fit the following models on a training version (80% of cases) of that data set:
  • A logistic regression model
  • A random forest
  • An OptiML run that searches for the optimal model type and hyperparameters
  • Finally, compare the performance of all three techniques on the test version (the 20% of cases not yet used) of that data set.


  • Grab the data set
    We can use the following steps to load the data directly into our workspace:

    Create a project:

    [Screenshot Titanic1: the new Titanic project]

    Data source:
    s3://grioml/Titanic.csv

    [Screenshot Titanic2: the created data source]
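    These GUI steps can also be scripted. Below is a minimal sketch using the official BigML Python bindings (pip install bigml), assuming your BIGML_USERNAME and BIGML_API_KEY are set as environment variables:

        from bigml.api import BigML

        # Authenticates from the BIGML_USERNAME / BIGML_API_KEY environment variables
        api = BigML()

        # Create the project and a source from the S3 URL used in this exercise
        project = api.create_project({"name": "Titanic"})
        source = api.create_source("s3://grioml/Titanic.csv",
                                   {"project": project["resource"]})
        api.ok(source)  # wait until BigML has finished processing the source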


    Prepare the data / create a dataset

    [Screenshot Titanic3: the Titanic dataset]
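    In the bindings this step is a single call (a sketch, continuing from the source created above):

        # Turn the raw source into a dataset: BigML's typed, ready-to-model table
        dataset = api.create_dataset(source)
        api.ok(dataset)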

    Validation set
    Let’s split the Titanic data into a training and validation set (80/20). Before we do so, we fix the random number generator seed in order to allow for reproduction of our results. Any seed value will do. My favorite seed is 123.

    [Screenshot Titanic4: the 80/20 train/test split]
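    The same split can be scripted with two sampled datasets that share a seed: a sample_rate of 0.8 gives the training part, and out_of_bag returns the complementary 20% (a sketch; the seed string "123" is just this exercise's choice):

        # 80% training set, deterministic because of the fixed seed
        train = api.create_dataset(dataset, {
            "name": "Titanic | training (80%)",
            "sample_rate": 0.8,
            "seed": "123",
        })

        # The complementary 20%: same seed, but take the out-of-bag rows
        test = api.create_dataset(dataset, {
            "name": "Titanic | test (20%)",
            "sample_rate": 0.8,
            "seed": "123",
            "out_of_bag": True,
        })
        api.ok(train)
        api.ok(test)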


    Modeling
    We now go through the models, predicting Survived from the other features in the Titanic data - with the exception of Name, naturally. If we used Name, we would fit a zero-residual model, i.e. a model for every row separately.

    We first set the target (objective field) to make the next steps easier:

    [Screenshot Titanic5: setting Survived as the objective field]
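    Setting the objective field can also be scripted by updating the dataset. Field updates go by field ID, so we look that up first (a sketch using the bindings' Fields helper; "Survived" is assumed to be the exact column name in the CSV):

        from bigml.fields import Fields

        # Look up the field ID of the "Survived" column
        fields = Fields(api.get_dataset(train))
        survived_id = fields.field_id("Survived")

        # Make "Survived" the default objective for models built on this dataset
        train = api.update_dataset(train, {"objective_field": {"id": survived_id}})
        api.ok(train)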



    Logistic regression model
    Let’s fit the logistic regression model.
    First, remove Name as a preferred field. BigML encodes all text and categorical fields, so they are automatically taken into account when modeling.
    Try modeling with Name in the set and you will see the model tries to fit to individual names. Keeping it out gives the following result:

    [Screenshot Titanic6: the logistic regression model]

    Evaluating the model gives these results:

    [Screenshot Titanic7: the logistic regression evaluation]
    The simple model achieves an AUC of 0.8194:


    [Screenshot: ROC curve of the logistic regression (AUC 0.8194)]
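    Scripted, this experiment is three calls: mark Name as non-preferred so it is skipped, fit the logistic regression on the training set, and evaluate it on the held-out 20% (a sketch, reusing the field lookups from above; non-preferred fields are ignored automatically when modeling):

        # Exclude "Name" by marking it non-preferred
        name_id = fields.field_id("Name")
        api.ok(api.update_dataset(train, {"fields": {name_id: {"preferred": False}}}))

        # Fit the logistic regression on the 80% training dataset
        logistic = api.create_logistic_regression(train)
        api.ok(logistic)

        # Evaluate on the 20% test dataset
        evaluation = api.create_evaluation(logistic, test)
        api.ok(evaluation)
        print(evaluation["object"]["result"]["model"])  # accuracy, f-measure, etc.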


    Random forest

    Let’s train the random forest:

    [Screenshot Titanic8: the random forest]

    And look at the evaluation:

    [Screenshot Titanic9: the random forest evaluation]



    [Screenshot: ROC curve of the random forest]
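    As a sketch, a random forest in BigML is a decision-tree ensemble with per-split feature randomization turned on (the number of trees, 50, is just an illustrative choice):

        # A random forest: an ensemble of decision trees with
        # random feature selection at each split ("randomize": True)
        forest = api.create_ensemble(train, {
            "number_of_models": 50,  # illustrative; the default is smaller
            "randomize": True,
        })
        api.ok(forest)

        # Evaluate on the same held-out 20%
        forest_eval = api.create_evaluation(forest, test)
        api.ok(forest_eval)
        print(forest_eval["object"]["result"]["model"])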

    OptiML with cross-validation

    Now that we have seen the potential in this dataset, selected features, and created better models ourselves, let's see what an optimized version can do.
    Get yourself a cup of coffee and read a good book on AI/ML statistics while BigML keeps experimenting with models and hyperparameters to find the optimal model, running cross-validations over trees, logistic regressions, neural networks, boosted and randomized forests, etc.

    [Screenshot Titanic10: the OptiML run]
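    OptiML can be launched from the bindings as well; the sketch below caps the search time, since an unconstrained run can take a while (the 30-minute max_training_time is an assumed, illustrative cap, not a required setting):

        # Launch OptiML: BigML's automated search over model types
        # and hyperparameters, validated by cross-validation
        optiml = api.create_optiml(train, {
            "name": "Titanic OptiML",
            "max_training_time": 1800,  # illustrative 30-minute cap, in seconds
        })
        api.ok(optiml)  # this is the coffee-and-a-good-book step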