Intro To T-SQL For Data Science

Ojash Shrestha
4y
4.3k
0
3

Article

Intro to T-SQL for Data Science

Data Science uses multitudes of scientific methods, algorithms, and processes to extract knowledge and insight from data and uses it across wide range of domains. Data Science helps to extract patterns from raw data which would have been invisible beforehand. In order to access and experiment with these data, a Database is essential. Thus, today we’ll learn about the Database as a whole and use examples and features of T-SQL to learn to find meaning out of raw data.

Check out other articles on SQL.

Transact SQL(T-SQL)

T-SQL is one of the mostly widely used SQL Derivatives. It is known as a transactional language used to define how things are to be done. T SQL is one of the easiest and most effective way to get things done. It is importantly used to create applications and also to add business login into the application the backend systems.

Tools

Cloud VS On-Premise:

What is Cloud?

Cloud or Cloud Computing can be defined as the on-demand availability of the ecosystem of computer resources ranging from data storage (cloud storage) to computer power without the user having to actively manage the system. These are the data centers which are available to users across the internet. We can say, Cloud is the architecture of the system where someone else owns the hardware such as AWS, Azure and Alibaba Cloud and the user can use it for their need at the time of convenience. This can also be called as Off-Premise.

On-Premises

On-premises is the scenario where the user owns the Hardware and the software systems are deployed on the very premises of the user or of the organization instead of remote facility for instance, Cloud Computing or Server Farm.

If you want to install in your own machine,

Download and Install SQL Server Express

-> SQL Server 2019 Express Edition

Editions of SQL Server

Express – which is free
Web
Standard
Enterprise
Developer

Watch SQL Server Conference Videos on T-SQL for Better Performance,

Tools of the Trade (IDE)

SSMS

SQL Server Management Studio (SSMS) is an application software which is widely used to configure, manage and administer different components within the Microsoft SQL Server.

VS Code

Visual Studio (VS) Code is a free software developed by Microsoft for Windows, Linux and macOS which developers and engineering write and edit code, debug, refactor and provides multiple other functionalities.

What is an IDE?

In layman’s terms, this is an environment where you can write your code and/or perform your analysis. For E.g., Microsoft SQL Server Management Studio, VS Code, Azure Data Studio.

We would highly suggest you to try out Azure Data Studio to start out T-SQL journey as it is free and to learn from the experience.

Database

Database can be defined as the collections of information which is organized such that the information could be accessed and worked upon. It is often controlled by a database management system which all together with database is known as a database system, in short form just called database.

Database Management System (DBMS)

A DBMS acts as the interface between the database and its end-users such that the users can access, manage and update the information.

Database Objects

Database objects can be understood as the objects in a database which are used in order to store or refer to the data. Tables, Views, Sequences, Indexes, Clusters and Synonyms are all examples of the database objects.

Table

A set of related data stored in our database

Excel Reference

A sheet in Excel. Just like Excel contains spreadsheets, relational databases are composed of tables.

Let us take sample data (car_data.csv from Kaggle).

What is a Column?

A set of data of a particular type, generally there is a value for each row in our table.

What is a Row?

A collection of fields that make up a record.

What is a Field?

The smallest source of data in a database

Logical Processing Order

The logical processing order of a SELECT statement explains the methods of how the query will be process and the final result will be achieved.

FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP

Database Objects

Working with Data:

Selecting

This is the process of retrieving one or many rows or columns from one or many tables in a database. E.g., Returning 20 records with our 11 selected columns from the Cars Table

Filtering

This is the process of excluding data based on predefined rules. E.g., Returning 12 records with our 11 selected columns from the Cars Table where the cars have a body style of type “convertible”

Ordering

This is the process of sorting a dataset based on a specified sort order. E.g., Returning 12 records with our 11 selected columns from the Cars Table where the cars have a body style of type “convertible” and ordering them from the widest based on the Wheelbase.

Grouping

Grouping data within a dataset typically over 1 or more columns. E.g., Returning All the records from the Cars dataset, grouping the data over the Make to see how many Models each Make has that is of Body_Style Convertible and what is the Average Price for each Model.

Modifying

Permanent VS Temporary data modifications. For Permanent we use the UPDATE statement, for temporary we can use what best suite our needs. E.g., We are correcting the spelling error of ‘Audi’ in the Make the top statement will only display the corrected data, where the bottom will permanently change the underlying data.

Multiple Datasets

Working with Multiple datasets

Joining

Joining 2 or more datasets together on a common “join” key. The SQL join allows us to collect two or more tables using the common identifiers. There are different types of join such as: Inner Join, Outer Join, Cross Join and Semi-join.

If you want to dive deeper into T – SQL, watch this video embedded below,

SET Operations

The SET Operations like Union, Intersect, Minus help us get meaningful outputs from the data stored in table under various special conditions.

Watch this video by AI 42 to dive indepth in T-SQL for Data Science,

Building Processes

Views and Procedures

Creates a virtual table with predefined Logic, this is useful for repetitive data alterations.

Data Modification

CRUD

Create – INSERT

Read – SELECT

Update - UPDATE

Delete – DELETE

Syntax Examples

INSERT

SELECT

UPDATE

DELETE

Common Table Expressions (CTE’s)

It is a temporary result set which can be referenced within another CRUD statement.

Pivot Vs Un-Pivot

Pivot and Un-Pivot can be used to alter the expressions of table into another table. Pivot rotates the expression of the table by transforming the unique values from a column in the expression in numerous columns in output whereas Unpivot does the opposite operation of Pivot.

Stored Procedures

An encapsulated set of T-SQL commands that can be ran repeatedly.

Windows Functions

Rank

The RANK() function is a widely used window function which assigns a rank to each row in a partition of a result set. It returns a rank for each row in our window. 1+ the number in the ranking before it.

Analytical Aggregations

Analytic Functions

It calculates the aggregate value which is based on the group of rows. It returns multiple rows for each group. We can extensively use analytic functions to calculate moving averages, percentages, running total and top – N results within a group.

The analytic functions supported by the Transact- SQL are as follows,

CUME_DIST
FIRST_VALUE
LAG
LAST_VALUE
LEAD
PERCENT_RANK
PERCENTILE_CONT
PERCENTILE_DISC

Aggregate Functions

It calculates a set of values and thus returns back an individual value. It is widely used with the GROUP BY clause of the SELECT statement. Aggregate functions are always deterministic.

The aggregate functions provided by T-SQL is as listed below,

APPROX_COUNT_DISTINCT
AVG
CHECKSUM_AG
COUNT
COUNT_BIG
GROUPIN
GROUPING_ID
MAX
MIN
STDEV
STDEVP
STRING_AG
SUM
VAR
VARP

Conclusion

In this article, we learned different aspects of T-SQL. Understanding the resources itself is critical. We learned about different tools from Cloud Computing to On-premise systems to IDEs to work on. Using these tools, will help us get used to the technologies that are extensively employed in the professional setting. It also helps us experience with the state of art technologies. Furthermore, we learnt database, DBMS, different database objects, logical processing order, data modification methods and numerous other functionalities provided by T-SQL. This is a stepping stone to become able to play with data. All these tools will help us down the line to operate on data as per our needs and help us make progress on our Data Science journey. Learning to extract knowledge and insights from data as per the need for our projects and research works will make us a competent Data Scientist.