- Query processor
- Storage Engine
- Server : It is basically the executor for a given request. Requests are mostly queries; in some cases they set configuration values of the database.
- Query Processor : This component understands the request and creates an execution plan for the Server to follow.
- Storage Engine : This component talks to the filesystem that will eventually hold the data. It serves calls from the Server.
- Filesystem : This holds the data of our database.
Database vs. Filesystem
There are a few advantages a filesystem has that a database can't match :
- Performance (provided you manage how you use the filesystem).
- Easy migration to cloud services like AWS S3 etc.
- Lower cost.
But these are rudimentary compared to the features a DB provides :
- Abstraction of Data structures.
- Searching Capabilities.
- Data integrity.
- Concurrent operations.
- Data Security.
Here we are focusing on Concurrent Operations :
Concurrent access to data means more than one user is accessing the same data at the same time.
With concurrent operations comes the possibility of anomalies, which may leave the database in an inconsistent state.
A filesystem does not provide any mechanism to prevent anomalies, whereas a DBMS provides a locking system to stop anomalies from occurring.
What are these Anomalies?
Let's build up to all such cases from the very basics. In a database we have two main categories of operations :
- Read operation.
- Write Operation.
A database usually has many entities :
- Row : It represents an entry that has related attributes.
- Column : It is the particular attribute for a given record/row.
- Cell : This is a single value of a given attribute in a given record/row.
- Table : Many rows combined under a single meaning/relation. The term "relation" is used interchangeably with "table".
- Database : It is a collection of many tables.
Read operations are of various types, but essentially we are only reading and not making any change, e.g. :
- You may be reading a table.
- You may be reading many tables.
- You may be reading the configs of the database.
Write operations are of much more concern, as every change affects other users: it is a change to a shared resource, i.e. the database. They are of many types :
- You may add a new database.
- You may add a new table to an existing database.
- You may add one or more rows to an existing table.
- You may update existing rows for one or more columns.
- You may delete certain rows.
- You may update a table name.
- You may delete a table.
- You may update a database name.
- You may delete a database.
- You may update config(s) of the database.
- You may undo/redo changes, and so on.
Among these writes, you can see a few categories emerging :
- Writes leading to change in database structure.
- Writes leading to change in values stored in the database.
If we start considering other aspects that a database provides, like data security, then a few more types of operations come into the picture, related to Access Control :
- You may like to revoke/grant access to a table.
- You may like to revoke/grant access to a database.
All this leads to different sections of requests to a database :
- Database defining requests.
- Database reading requests.
- Database manipulation requests.
- Database access control requests.
SQL (Structured Query Language), the most common way of using databases easily for day-to-day tasks, is also structured into such sections :
- DDL — Data Definition Language (CREATE, ALTER, DROP, RENAME, TRUNCATE, COMMENT)
- DQL — Data Query Language (SELECT)
- DML — Data Manipulation Language (INSERT, UPDATE, DELETE, MERGE, CALL, EXPLAIN PLAN, LOCK)
- DCL — Data Control Language (GRANT, REVOKE).
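A quick sketch of how these sections look in practice, using Python's built-in sqlite3 module (table and column names are made up for illustration; note SQLite itself has no DCL, GRANT/REVOKE live in server databases like PostgreSQL or MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# DDL: define structure.
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# DML: manipulate data.
cur.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
cur.execute("UPDATE users SET name = ? WHERE id = ?", ("bob", 1))
conn.commit()

# DQL: query data.
rows = cur.execute("SELECT id, name FROM users").fetchall()
print(rows)  # [(1, 'bob')]
```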
Using the Database :
The main idea of using a database is to store our data and use it in the required manner to carry out a logical unit of work. We have understood the set of operations we can carry out; combining these operations in an ordered manner constitutes "a logical unit of work".
This logical unit of work is what we commonly call a Transaction.
A transaction is a list of operations to be carried out, that's it. There is no inherent constraint that a transaction must complete without affecting other transactions. The only constraint is Atomicity, i.e. either all operations go through or none of them happen. This is mostly limited to write operations, as they affect the DB for other concurrent operations. For read operations, we can't "un-read", to put it simply.
Enabling transactions leads to a new set of operations and new modules becoming part of the database system, which affects how the query plan is defined, how the server carries it out, and how the storage engine acts on it. That component, for our understanding and for abstraction, is the "Transaction Process Manager" on the server (database) side. A similar "Transaction Manager" sits on the user/client side.
How do most Transaction Process Managers work? That will be covered in later posts; it is quite interesting and something I am still exploring.
How does a Transaction Manager work, i.e. how does it manage a transaction? If we were to define an interface for a transaction, it has only 2 operations :
- Mark the transaction as executed, i.e. COMMIT.
- Discard any changes made in the transaction, i.e. ROLLBACK.
To ensure the server side understands we are trying to do a transaction, we have begin() and getStatus() methods too, as seen in JTA (javax.transaction.UserTransaction).
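A minimal sketch of such an interface, mirroring the JTA shape over sqlite3 (the class and status names here are illustrative, not the real JTA API):

```python
import sqlite3

class UserTransaction:
    """Toy JTA-style transaction wrapper: begin/commit/rollback/getStatus."""
    def __init__(self, conn):
        self.conn = conn
        self.status = "NO_TRANSACTION"

    def begin(self):
        self.conn.execute("BEGIN")   # tell the server a transaction starts
        self.status = "ACTIVE"

    def commit(self):
        self.conn.commit()           # all changes become visible at once
        self.status = "COMMITTED"

    def rollback(self):
        self.conn.rollback()         # discard every change since begin()
        self.status = "ROLLED_BACK"

    def get_status(self):
        return self.status

conn = sqlite3.connect(":memory:", isolation_level=None)  # we manage BEGIN ourselves
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")

tx = UserTransaction(conn)
tx.begin()
conn.execute("UPDATE accounts SET balance = 50 WHERE id = 1")
tx.rollback()  # atomicity: it is as if the update never happened
balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
print(balance)  # 100
```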
From the above we were able to define a transaction. But if we enable users to run transactions in a multi-user environment, where users access the database concurrently, atomicity alone can't provide a consistent view of data.
If we desire Consistency, i.e. moving the database from one valid state to another while maintaining database invariants, let's see what can be done to achieve it.
- Can we let go of concurrent execution from users?
This ensures consistency is maintained, as each execution moves the data to a consistent state and no two executions interfere with each other at all.
- Should we use some kind of locking to enable concurrent execution?
This lets executions make changes to two separate areas while ensuring no concurrent changes happen to the same data at the same time, so we get some concurrency while maintaining consistency.
- Should we provide each user its own copy of the data to execute upon?
Here executions make changes independently; the decision is then just which data values to agree upon, and that responsibility may be passed to the database (server side).
- Can we reorder operations to ensure consistency?
How to determine a valid order of operations is an area of study in itself, though in the simple case where all executions happen one after another, consistency is ensured.
- Any mixture of the above solutions?
Most databases use a mix of isolation along with locking to achieve concurrent execution. Isolation is mostly about having your own copy of the data to work on, where, based on the kind of isolation, READ and WRITE actions behave differently.
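The simplest of these options, serializing everything through one lock, can be sketched in a few lines (the shared counter and transaction bodies are illustrative):

```python
import threading

# A single global lock serializes all "transactions": full consistency,
# zero concurrency. Each transaction is a read-modify-write on db["x"].
db = {"x": 0}
db_lock = threading.Lock()

def run_transactions(delta, times):
    for _ in range(times):
        with db_lock:             # whole read-modify-write is one critical section
            current = db["x"]
            db["x"] = current + delta

t1 = threading.Thread(target=run_transactions, args=(+1, 10000))
t2 = threading.Thread(target=run_transactions, args=(-1, 10000))
t1.start(); t2.start()
t1.join(); t2.join()
print(db["x"])  # 0 — updates never interleave, so none are lost
```

Without the lock, the two threads' reads and writes could interleave and the final value would be unpredictable, which is exactly the lost-update problem discussed below.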
Let's talk about what kind of concurrency issues we are dealing with :
- Dirty Write : Transaction T1 modifies a data item. Another transaction T2 then further modifies that data item before T1 performs a COMMIT or ROLLBACK. If T1 or T2 then performs a ROLLBACK, it is unclear what the correct data value should be. It is a Write-Write Conflict.
- Dirty Reads : Transaction T1 modifies a data item. Another transaction T2 then reads that data item before T1 performs a COMMIT or ROLLBACK. If T1 then performs a ROLLBACK, T2 has read a data item that was never committed and so never really existed. It is a Write-Read Conflict.
- Non-Repeatable Reads : Transaction T1 reads a data item. Another transaction T2 then modifies or deletes that data item and commits. If T1 then attempts to reread the data item, it receives a modified value or discovers that the data item has been deleted. It is a Read-Write Conflict.
- Lost Update : Transaction T1 reads a data item. Another transaction T2 then writes that data item. T1 then writes the data item based on its earlier read and commits, overwriting T2's change, so T2's update is lost. It is a Write-Write Conflict.
- Phantom Reads : Transaction T1 reads a set of data items satisfying some search condition. Transaction T2 then creates data items that satisfy T1’s search condition and commits. If T1 then repeats its read with the same search condition, it gets a set of data items different from the first read.
These concurrency issues are commonly called Anomalies.
Understanding these Anomalies better :
Ci : commit by the i-th transaction
Ai : rollback/abort by the i-th transaction
Wi[X] : write of data item X by the i-th transaction
Ri[X] : read of data item X by the i-th transaction
X ~= pred : data items matching predicate "pred"
Dirty Write :
- W1[X], W2[X], ((C1 or A1) and (C2 or A2) in any order)
Dirty Read :
- W1[X], R2[X], ((C1 or A1) and (C2 or A2) in any order)
Non-Repeatable Reads :
- R1[X], W2[X], ((C1 or A1) and (C2 or A2) in any order)
Lost Update :
- R1[X], W2[X], W1[X], C1, C2
Phantom Reads :
- R1[X ~= pred], W2[Y ~= pred], ((C1 or A1) and (C2 or A2) in any order)
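The Lost Update history above can be replayed step by step against a plain dict to see why T2's write vanishes (values are illustrative):

```python
# Replaying the Lost Update history R1[X], W2[X], W1[X], C1, C2
db = {"X": 10}

# R1[X]: T1 reads X, planning to add 5 to what it saw.
t1_read = db["X"]        # T1 sees 10

# W2[X]: T2 writes X = 100 in between.
db["X"] = 100

# W1[X]: T1 writes based on its stale read — T2's update is overwritten.
db["X"] = t1_read + 5

print(db["X"])  # 15, not 105: W2's change is lost
```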
Caveat : Phantom Reads vs. Non-Repeatable Reads :
Phantom Reads : the transaction read some rows, and after a few other operations tried to read the same data again, but got either fewer rows (due to deletion by another transaction) or more rows (due to insertion by another transaction).
Non-Repeatable Reads : the transaction read some data, and after a few other operations tried to read the same data again, but the rows had been updated and now have changed values.
How can one solve a given anomaly with as few requirements as possible?
- Dirty Write : we can solve this if each transaction makes changes to its own copy of the data. If the transaction commits, the value is updated, while if it rolls back/aborts, the original value is retained.
- Dirty Read : we can solve this if each transaction makes changes to its own copy and we prevent transactions from reading changes made under other, uncommitted transactions.
- Non-Repeatable Reads : we can solve this if each transaction keeps track not only of updated records/rows but also of the data it has read, as part of its own copy of the data.
- Lost Update : the aim is to ensure another transaction can't overwrite the changes made by a given transaction. Even if we give each transaction its own copy of the data, the end result is that one update goes through while the other is lost. The only way to ensure serial updates for a certain data block is to lock it. This leads to a strategy called Cursor Stability.
- Phantom Reads : we can solve this if each transaction not only tracks updated records but also the data read, keeps a map of the query used for a given read, and provides the same result for that query.
A copy of the data for each transaction is achieved using multi-version databases, which allow a given data item to have a separate version per transaction.
We see that we are talking about putting in place some kind of limitations, and some form of locking where required. These limitations mostly try to separate the data being operated upon from the original/current data in the database.
This separation is taken up as Isolation, which helps in achieving different levels/degrees of consistency.
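The copy-per-transaction idea can be sketched as a toy multi-version store (illustrative only; real MVCC keeps per-row versions and detects write conflicts at commit):

```python
import copy

class MiniMVCC:
    """Toy multi-version store: each transaction snapshots the committed
    data, works on its private copy, and commit publishes that copy."""
    def __init__(self, data):
        self.committed = data

    def begin(self):
        return copy.deepcopy(self.committed)  # private snapshot for one transaction

    def commit(self, snapshot):
        self.committed = snapshot             # publish the transaction's copy

store = MiniMVCC({"X": 1})
tx1 = store.begin()
tx2 = store.begin()

tx1["X"] = 42             # T1 updates its own copy...
print(tx2["X"])           # 1 — T2 still sees the old value (no dirty read)

store.commit(tx1)
print(store.committed["X"])  # 42 once T1 commits
```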
Most databases use the ANSI SQL Isolation Levels :
- Read Uncommitted : this isolation level allows a transaction to read other transactions' uncommitted updates, though it can't update data changed by another in-flight transaction, i.e. only READ of uncommitted data is allowed.
- Read Committed : this isolation level allows a transaction to read only committed data, and to make its updates based only on that committed data.
- Repeatable Reads : one level higher than Read Committed; it provides the isolation of Read Committed while also keeping hold of whatever data was read during processing and returning the same output for repeated reads of that data.
- Serializable : the highest level of isolation, where the overall effect is as if only one transaction operated on the database at a time, keeping other transactions pending. While the levels above are defined with specific anomalies in mind, this level exists solely to provide "fully serializable execution".
The idea behind each level is not only to rule out a set of anomalies, but also to offer a comparatively stronger degree of consistency guarantee.
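The level-to-anomaly relationship can be written down as a small lookup, following the standard ANSI reading of which anomalies each level still permits (the function and dictionary are illustrative helpers, not a database API):

```python
# Anomalies each ANSI isolation level still permits, per the standard
# reading of the levels discussed above.
PERMITTED = {
    "READ UNCOMMITTED": {"dirty read", "non-repeatable read", "phantom read"},
    "READ COMMITTED":   {"non-repeatable read", "phantom read"},
    "REPEATABLE READ":  {"phantom read"},
    "SERIALIZABLE":     set(),
}

def prevents(level, anomaly):
    """True if the given level rules out the given anomaly."""
    return anomaly not in PERMITTED[level]

print(prevents("READ COMMITTED", "dirty read"))           # True
print(prevents("READ COMMITTED", "non-repeatable read"))  # False
```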
With that settled: most databases are ACID compliant. What is ACID? Atomicity, Consistency, Isolation, Durability.
These are properties of a transaction intended to guarantee validity even in the event of errors, power failures, etc.
(Note : Consistency here is different from what it means in the context of the CAP theorem (Consistency, Availability, Partition Tolerance).)
- Atomicity : Transactions are often composed of multiple statements. Atomicity guarantees that each transaction is treated as a single “unit”, which either succeeds completely, or fails completely: if any of the statements constituting a transaction fails to complete, the entire transaction fails and the database is left unchanged.
- Consistency : Consistency ensures that a transaction can only bring the database from one valid state to another, maintaining database invariants. This does not restrict or talk about concurrent executions.
- Isolation : Transactions are often executed concurrently (e.g., multiple transactions reading and writing to a table at the same time). Isolation ensures that concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially. This property is what makes concurrent execution safe.
- Durability : Durability guarantees that once a transaction has been committed, it will remain committed even in the case of a system failure (e.g., power outage or crash). This usually means that completed transactions (or their effects) are recorded in non-volatile memory.
In the next articles, we will explore the transaction process manager, undo/redo, and maybe more of the transaction data structures.