Querying Data on the Web - 1
Lecture 1&2
Database Management Systems
DBMS
Managing very large volumes of data
Linguistic abstractions - 语言抽象
- A DBMS supports:
- DDL: definition
- DML: manipulation
- DQL: query
- SQL is a DBMS language
- A DBMS supports:
DBMS: Languages
- Database languages VS general-purpose programming languages
- each sub-language of database languages are designed to cater for distinct needs while general-purpose languages are not.
- database languages are designed to operate on collections, without explicit iteration.
- Database languages are limited in some way
They are not Turing-complete.
They are only for operating on large collections and for simple calculations.
These make QLs declarative and can be optimized.
DBMS-Centred Applications
Internal Architecture of Classical DBMSs
- 5 main service types:
- Query processing
- Parse the declarative query
- translate to algebraic expression
- rewrite into a logical query execution plan (QEP)
- select the algorithms and access methods to obtain a quasi-optimal and costed concrete QEP (近似最优的、消耗明确的QEP)
- execute
- Transaction processing
- Concurrency control
- Recovery
- Storage management
- Query processing
- 4 main types of data
- Metadata: schema-level info and some infrastructure description
- Statistics: info about sizes and cardinalities database instances
- Indices
- Data
DBMS: Architectural Variations
- Beyond structured data:
- support un- and semi-structured data in document form
- stored in content repositories
- using keyword-based search and access methods for graph/tree fragments.
- Beyond OLTP:
- pre-process, aggregate, materialize separately
- support OLAP
- using multidimensional denormalized(非规范化的) logical schemas
- Parallelization并行化
- Shared-disk parallelism
在内存和相对较慢的磁盘之间放一个快的interconnect
并行磁盘可以避免辅助内存的抢占
- Shared-memory parallelism
在process和内存之间放一个interconnect
并行内存可以避免主内存抢占
- Shared-nothing parallelism
在整个processing units之间放一个interconnect
使用黑盒模式来并行处理使得本地资源充裕
- Shared-disk parallelism
- Distribution
Lecture 3 The Relational Case
The Relational Model
- relational database is a set of relations.
- schema: 关系名和其属性的集合
- tuple: relation中除属性名那一行之外的其他行
- relations are considered to be a set. (意味着所有的tuple都是唯一的,并且顺序不影响)
- Arity:
- 元数
- 所有属性的数目
- Cardinality:
- 基数
- 所有tuple的数目
- Integrity constraint
- IC
- 完整性约束:某一个属性必须为true
- Domain Constraint
- 属性值需要为schema-specified domain
- Key:
- 不会重复
- 如果一个关系中有多个key,每个key叫做candidate key(关系键)
- primary key从candidate key中选择
- foreign key:
- 在其他关系中为主键
- 可以用来对应其他关系中的tuple
- 作为逻辑连接点
- 表示两个entities之间的关系
Lecture 4 Relational Query Languages
Query Languages
- Relational model defines three expressively-equivalent abstract languages:
- the domain relational calculus DRC
- the tuple relational calculus TRC - SQL是TRC的一种具体语言
- the relational algebra RA
- View: 创建一个view,相当于代替一些sql语句
- View为何有用
- 可以只显示关键信息,隐藏不必要的细节
- views可以被具体化 - 指的是我们可以计算view并储存结果
- 在分布式环境中很有用
The Tuple Relational Calculus
“\” : 返回在第一个参数里不在第二个参数里的行
SQL
常见操作函数
- ⚠️select后面如果使用聚合函数(上面的几个),那必须对select的所有属性都聚合,或者其他属性都用GROUP BY
- HAVING语句:
因为WHERE语句不能和聚合函数一起用
HAVING可以和聚合函数一起用,起到筛选的作用