{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "SAM001b - Query Storage Pool from SQL Server Master Pool (2 of 3) - Convert data to parquet\n", "===========================================================================================\n", "\n", "Description\n", "-----------\n", "\n", "In this 2nd part of a 3 part tutorial, use Spark to convert a .csv file\n", "into a parquet file.\n", "\n", "### Convert CSV to Parquet using the PySpark kernel\n", "\n", "First open the .csv file and convert it to a data frame object." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results = spark.read.option(\"inferSchema\", \"true\").csv('/tmp/clickstream_data/datasampleCS.csv').toDF(\"NumberID\", \"Name\", \"Name2\", \"Price\", \"Discount\", \"Money\", \"Money2\", \"Company\", \"Type\", \"Space\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Verify the schema using the following command." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results.printSchema()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "View the first 20 lines of this data using the following command." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Turn the .csv file to a parquet file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sc._jsc.hadoopConfiguration().set(\"mapreduce.fileoutputcommitter.marksuccessfuljobs\", \"false\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results.write.mode(\"overwrite\").parquet('/tmp/clickstream_data_parquet')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Verify the parquet file using the following commands." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result_parquet = spark.read.parquet('/tmp/clickstream_data_parquet')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result_parquet.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print('Notebook execution complete.')" ] } ], "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "pyspark3kernel", "display_name": "PySpark3" }, "azdata": { "side_effects": true } } }