Leverage the powerful Create ML data structure to ignite the data scientist in you
Data processing, handling, cleaning, and shaping play a crucial role in machine learning. pandas, a powerful Python package, is one of the most sought-after libraries used by data scientists to perform transformations on their datasets.
Despite having an easy syntax, pandas has a steep learning curve, mostly because there are plenty of functions available for a variety of use cases and it takes time to get the hang of them all. Apple, through Create ML, aims to bridge the gap between machine learning and application development by allowing mobile developers to train and deploy models for on-device machine learning.
The introduction of Create ML, an easy-to-use interface for training different types of models with pre-built templates and algorithms, has been a game-changer of sorts. And with the release of MLDataTable, Apple strives to make data processing easier for mobile developers by providing a spreadsheet-like data structure with a gentler learning curve than pandas.
MLDataTable is a useful data structure for managing tabular data. It covers many of the common operations of the pandas library, which makes it the pandas for iOS and macOS developers.
Besides being useful for parsing JSON and CSV datasets, MLDataTable is used in the following Create ML model templates:
In the next few sections, we’ll be exploring the functionalities offered by MLDataTable to make data manipulation easy and fun, especially for mobile developers.
A majority of datasets are in either JSON or CSV file formats. With MLDataTable, parsing such files into a tabular form by using Create ML is quick and easy. To begin with, create a macOS playground in Xcode and import CreateML:
import CreateML
import CreateMLUI
var data = try MLDataTable(contentsOf: URL(fileURLWithPath: "path/to/your/file/movie_metadata.csv"))
On running the above code in the playground, you can view the tabular data in the playground previews as shown below:
To get hold of the column names and types, MLDataTable provides the getter properties columnNames and columnTypes.
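As a minimal sketch of those getters, the snippet below builds a small in-memory table (the sample values and the "Year" column are made up for illustration, standing in for the parsed CSV) and reads both properties:

```swift
import CreateML

// Illustrative data only; any MLDataTable works the same way.
let sample: [String: MLDataValueConvertible] = [
    "Title": ["Titanic", "Shutter Island"],
    "Year": [1997, 2010]
]
let sampleTable = try MLDataTable(dictionary: sample)

// columnNames is a collection of the column name strings.
let names = Array(sampleTable.columnNames)

// columnTypes maps each column name to its MLDataValue.ValueType.
let types = sampleTable.columnTypes

print(names)
print(types)
```

Since the table is built from a dictionary, the order of the column names should not be relied upon.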
Additionally, we can set our own parsing options in the MLDataTable initializer. With options like maxRows, we can limit how much data from the file goes into our table.
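Here is a rough sketch of how that could look. To keep it self-contained, it first writes a tiny CSV to a temporary file (the file name and contents are placeholders, not the movie dataset used above) and then reads back only the first two rows. It assumes MLDataTable.ParsingOptions can be created with its default values and then adjusted:

```swift
import CreateML
import Foundation

// Write a small placeholder CSV so the example does not depend on an external file.
let csvText = """
title,director
Titanic,James Cameron
Shutter Island,Martin Scorsese
Warriors,Gavin O'Connor
"""
let csvURL = FileManager.default.temporaryDirectory
    .appendingPathComponent("movies_sample.csv")
try csvText.write(to: csvURL, atomically: true, encoding: .utf8)

// Start from the default parsing options and cap the number of rows read.
var options = MLDataTable.ParsingOptions()
options.maxRows = 2

let firstRows = try MLDataTable(contentsOf: csvURL, options: options)
print(firstRows.rows.count)
```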
A dictionary of column names and data values (which conform to the MLDataValueConvertible protocol) can be converted to an MLDataTable as well. The following code creates a dummy movie dataset consisting of three rows and two columns:
let movieData: [String: MLDataValueConvertible] = ["Title": ["Titanic", "Shutter Island", "Warriors"],
"Director": ["James Cameron", "Martin Scorsese", "Gavin O'Connor"]]
var movieTable = try MLDataTable(dictionary: movieData)
MLDataTable can be split, merged or transformed to generate an entirely new data table.
Splitting And Sorting Tables
The following code is used to divide an MLDataTable into training and test datasets, which are used for model training and evaluation:
let (trainingData, testingData) = data.randomSplit(by: 0.8, seed: 5)
MLDataTables can be sorted by a particular column to give rise to a new MLDataTable:
data = data.sort(columnNamed: "director_name")
Merging Two Tables
Merging tables is often needed for datasets such as the ones used in recommender systems. To do so, we can use:
func append(contentsOf: MLDataTable): adds the rows of another table at the end of the current table.
func join(with: MLDataTable, on: String..., type: MLDataTable.JoinType = .inner): merges rows based on the matching columns. If the column names are left empty, it assumes all columns in the join.
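To make the two calls concrete, here is a small sketch using made-up tables in the style of the movie dataset above (the "Genre" table is purely illustrative). It appends one table to another and then inner-joins the result with a second table on the "Title" column:

```swift
import CreateML

// Illustrative tables; the values are placeholders.
let firstBatch: [String: MLDataValueConvertible] = [
    "Title": ["Titanic", "Shutter Island"],
    "Director": ["James Cameron", "Martin Scorsese"]
]
let secondBatch: [String: MLDataValueConvertible] = [
    "Title": ["Warriors"],
    "Director": ["Gavin O'Connor"]
]
let genreInfo: [String: MLDataValueConvertible] = [
    "Title": ["Titanic", "Warriors"],
    "Genre": ["Drama", "Drama"]
]

var movies = try MLDataTable(dictionary: firstBatch)
let extraMovies = try MLDataTable(dictionary: secondBatch)
let genres = try MLDataTable(dictionary: genreInfo)

// append(contentsOf:) mutates the table in place, adding the new rows at the end.
movies.append(contentsOf: extraMovies)

// join returns a new table; an inner join keeps only titles present in both tables.
let moviesWithGenres = movies.join(with: genres, on: "Title", type: .inner)
print(moviesWithGenres)
```

Note that append changes the existing table, while join leaves both inputs untouched and hands back a new one.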
Performing operations such as adding, removing, updating columns and data is a fairly common use case in data processing. For that, MLDataTable provides us with the following functions.
Adding, Removing, Renaming Columns
To add a column to the MLDataTable, simply add an MLDataColumn, which consists of row values. The following code extends the movie dataset we created earlier from a dictionary with a new column:
var movieTable = try MLDataTable(dictionary: movieData)
let genreColumn = MLDataColumn(["Drama", "Thriller", "Drama"])
movieTable.addColumn(genreColumn, named: "Genre")
To remove a column, we simply invoke the removeColumn method on the MLDataTable instance with the column name string.
To rename an existing column, simply invoke the func renameColumn(named: String, to: String) function on the MLDataTable instance.
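A short sketch of both calls follows, using a throwaway table shaped like the movie dataset (the new name "DirectorName" is just an example). Both methods modify the table in place, so it has to be declared with var:

```swift
import CreateML

// Illustrative table with three columns.
let sampleMovies: [String: MLDataValueConvertible] = [
    "Title": ["Titanic", "Shutter Island", "Warriors"],
    "Director": ["James Cameron", "Martin Scorsese", "Gavin O'Connor"],
    "Genre": ["Drama", "Thriller", "Drama"]
]
var editableTable = try MLDataTable(dictionary: sampleMovies)

// Drop the "Genre" column entirely.
editableTable.removeColumn(named: "Genre")

// Rename "Director" to "DirectorName"; the values are unchanged.
editableTable.renameColumn(named: "Director", to: "DirectorName")

print(Array(editableTable.columnNames))
```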
Drop Duplicate Rows, Fill Missing Values
While pandas provides functions such as fillna and drop_duplicates for filling missing column values and dropping duplicate rows based on a certain set of conditions, MLDataTable has the following equivalent methods:
// Both methods return a new table, so assign the result back.
movieTable = movieTable.dropDuplicates()
movieTable = movieTable.fillMissing(columnNamed: "Title", with: MLDataValue.string("NA"))
The dropDuplicates function returns a new MLDataTable with the duplicate rows removed. Additionally, the function dropMissing is used to drop rows with missing values.
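One detail worth seeing in action is that dropDuplicates does not change the table it is called on. The sketch below uses a made-up table containing one repeated row and compares the row counts before and after:

```swift
import CreateML

// Illustrative table in which the "Titanic" row appears twice.
let rawMovies: [String: MLDataValueConvertible] = [
    "Title": ["Titanic", "Titanic", "Warriors"],
    "Director": ["James Cameron", "James Cameron", "Gavin O'Connor"]
]
let rawTable = try MLDataTable(dictionary: rawMovies)

// dropDuplicates returns a new table; rawTable itself keeps all three rows.
let uniqueTable = rawTable.dropDuplicates()

print(rawTable.rows.count, uniqueTable.rows.count)
```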
To transform a column to a new one, we can use the map function which allows updating all the rows in a thread-safe manner.
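As a rough illustration of map, the sketch below derives a new column holding the length of each title and adds it to a table. The column name "TitleLength" and the data are assumptions made for this example only:

```swift
import CreateML

// Illustrative column of titles.
let titleValues = ["Titanic", "Shutter Island", "Warriors"]
let titleColumn = MLDataColumn(titleValues)

// map applies the closure to every row and yields a new typed column.
// The explicit type annotation picks the MLDataColumn-returning map.
let titleLengths: MLDataColumn<Int> = titleColumn.map { $0.count }

// Attach the derived column to a table alongside the original titles.
let titleData: [String: MLDataValueConvertible] = ["Title": titleValues]
var lengthTable = try MLDataTable(dictionary: titleData)
lengthTable.addColumn(titleLengths, named: "TitleLength")

print(lengthTable)
```

The original column is left as it is; map always produces a separate column, which we then choose to add under a new name.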
Using the show() function on the MLDataTable instance, we can view the tabular data in a visual manner in our playgrounds as shown below:
Finally, exporting the MLDataTable to a CSV or JSON file is possible using the write function. Many Create ML application templates require a CSV format, so the following function is fairly important:
try trainingData.writeCSV(to: URL(fileURLWithPath: "path/file.csv"))
So, we’ve explored the different use cases of MLDataTable and seen how easy it is for mobile developers looking to join the machine learning bandwagon to use this awesome Create ML structure.
The idea of sharing the importance of MLDataTable in this piece came while I was working on another story. If only time travel were possible, I’d rewrite that piece using MLDataTable instead of pandas for importing datasets and exporting them to Create ML.
That’s it for this one. I hope you enjoyed reading.