If there are 200 typographical errors randomly distributed in a 500-page manuscript, find the probability that a given page contains exactly 3 errors.

We can abstract this type of problem as follows:

Suppose there are $n$ distinguishable boxes and $k$ indistinguishable balls. Now, we randomly put the balls into the boxes. For each of the boxes, what is the probability that it contains $m$ balls?

For example, if the first page contains 3 errors, the second page contains 197 errors, and the rest of the pages contain no errors, then this corresponds to the first box containing 3 balls, the second box containing 197 balls, and the rest of the boxes containing no balls. The balls are indistinguishable because we can only determine how many errors are on each page, not which errors they are.

To deal with the problem, we simply need to find these two numbers:

- the number of ways to put $k$ indistinguishable balls into $n$ distinguishable boxes, and
- the number of ways to put $k-m$ indistinguishable balls into $n-1$ distinguishable boxes.

The latter corresponds to the number of ways to put the balls into the boxes provided that we already know that the given box contains $m$ balls. After we find these two numbers, their ratio is the probability in question.

To find the number of ways to put $k$ indistinguishable balls into $n$ distinguishable boxes, we can use the stars and bars method. To see how it works, consider the example $n=4$, $k=6$:

\[{}|{}\star{}\star{}|{}\star{}|{}\star{}\star{}\star{},\]which corresponds to the distribution $0,2,1,3$. We can see that there are $n-1$ bars and $k$ stars. Therefore, the number of ways to put the balls is the same as the number of ways to choose the $k$ positions of the stars among $n+k-1$ positions. Therefore, the number of ways is

\[N_{n,k}=\binom{n+k-1}{k}=\frac{\left(n+k-1\right)!}{k!\left(n-1\right)!}.\]Therefore, the final probability of the given box containing $m$ balls is

\[P_{n,k}(m)=\frac{N_{n-1,k-m}}{N_{n,k}} =\frac{\left(n-1\right)k!\left(n+k-m-2\right)!}{\left(k-m\right)!\left(n+k-1\right)!}.\]Another easy way to derive this result is by using the generating function. The number $N_{n,k}$ is just the coefficient of $x^k$ in the expansion of the generating function $\left(1+x+x^2+\cdots\right)^n$. The generating function is just $\left(1-x\right)^{-n}$, which can be easily expanded by using the binomial theorem.
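Both the stars-and-bars count and the generating-function expansion can be checked by brute force for small $n$ and $k$; here is a sketch in Ruby (the function names are mine):

```ruby
# Two independent checks of N_{n,k} = C(n+k-1, k), using n = 4, k = 6.

# 1. Brute force: count distributions of k indistinguishable balls into
#    n distinguishable boxes by recursing on the first box's content.
def brute_count(n, k)
  return 1 if n == 1
  (0..k).sum { |m| brute_count(n - 1, k - m) }
end

# 2. Generating function: coefficient of x^k in (1 + x + ... + x^k)^n,
#    computed by repeatedly multiplying truncated coefficient arrays.
def gf_coefficient(n, k)
  poly = [1]
  n.times do
    poly = (0..k).map { |d| (0..d).sum { |i| poly[i] || 0 } }
  end
  poly[k]
end

def binomial(n, k)
  (1..k).reduce(1) { |acc, i| acc * (n - k + i) / i }
end

p brute_count(4, 6)      # => 84
p gf_coefficient(4, 6)   # => 84
p binomial(4 + 6 - 1, 6) # => 84
```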

We are now interested in the limit $n,k\to\infty$ with $\lambda:=k/n$ fixed. By Stirling’s approximation, we have

\[P_{n,k}(m)\sim\left(n-1\right) \frac{k^{k+1/2}\left(n+k-m-2\right)^{n+k-m-2+1/2}}{\left(k-m\right)^{k-m+1/2}\left(n+k-1\right)^{n+k-1+1/2} } \mathrm e^{k-m+n+k-1-k-n-k+m+2}.\]The $1/2$’s in the exponents can simply be dropped because, if we extract them into a separate factor, that factor tends to unity. The exponential is just the constant $\mathrm e$. Therefore, we have

\[\begin{align*} P_{n,k}(m)&\sim\left(n-1\right) \frac{\left(\lambda n\right)^{\lambda n}\left(n+\lambda n-m-2\right)^{n+\lambda n-m-2} } {\left(\lambda n-m\right)^{\lambda n-m}\left(n+\lambda n-1\right)^{n+\lambda n-1}}\mathrm e\\ &=\left(\tfrac{n+\lambda n-m-2}{n+\lambda n-1}\right)^n \left(\tfrac{\left(n+\lambda n-m-2\right)\lambda n}{\left(\lambda n-m\right)\left(n+\lambda n-1\right)}\right)^{\lambda n} \left(\tfrac{\lambda n-m}{n+\lambda n-m-2}\right)^m \tfrac{\left(n-1\right)\left(n+\lambda n-1\right)}{\left(n+\lambda n-m-2\right)^2}\mathrm e\\ &\to\mathrm e^{-\frac{m+1}{\lambda+1}}\,\mathrm e^m\, \mathrm e^{-\frac{m+1}{\lambda+1}\lambda}\left(\tfrac\lambda{\lambda+1}\right)^m\tfrac1{\lambda+1}\mathrm e\\ &=\left(\tfrac\lambda{\lambda+1}\right)^m\tfrac1{\lambda+1}. \end{align*}\]This is just the geometric distribution with parameter $p=1/(\lambda+1)=n/(k+n)$.
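To see explicitly why the $1/2$’s in the exponents could be dropped, collect them into a single factor:

```latex
% The factor carrying the 1/2 exponents from Stirling's approximation;
% with k = lambda * n and m fixed, it tends to unity:
\[
  \left(\frac{k\left(n+k-m-2\right)}{\left(k-m\right)\left(n+k-1\right)}\right)^{1/2}
  =\left(\frac{\lambda n\left(\left(1+\lambda\right)n-m-2\right)}
  {\left(\lambda n-m\right)\left(\left(1+\lambda\right)n-1\right)}\right)^{1/2}
  \xrightarrow{n\to\infty}1.
\]
```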

If you want to simulate the number of balls in a box, here is a simple way. First, because all boxes are equivalent, we can focus on the first box without loss of generality. Then, we just need to randomly generate the positions of the $n-1$ bars among the $n+k-1$ positions and return the position of the first bar (which equals the number of balls in the first box).

We can then write the following Ruby code to simulate the number of balls in the first box:

def simulate n, k
  # Minimum of rand(n+k-1), rand(n+k-2), ..., rand(k+1): distributed as the
  # smallest of n-1 distinct bar positions, i.e. the first box's ball count.
  (n - 1).times.inject(npkm1 = n + k - 1) { |bar, i| [rand(npkm1 - i), bar].min }
end

Compare the simulated result with the theoretical result:

def frequency m, n, k, trials
  trials.times.count { simulate(n, k) == m } / trials.to_f
end
def truth m, n, k
  (n-1) * (k-m+1..k).reduce(1,:*) / (n+k-m-1..n+k-1).reduce(1,:*).to_f
end
def approx m, n, k
  n*k**m / ((n+k)**(m+1)).to_f
end
srand 1108
m, n, k = 3, 5000, 8000
p frequency m, n, k, 10000 # => 0.0902
p truth m, n, k # => 0.08965012972626446
p approx m, n, k # => 0.08963271594131858

- Some languages can name loops by providing a label for the loop. In those languages, you can use `break` together with a label to specify which loop to break out of. Examples: Perl, Java, JavaScript, and some others.
- Some languages can specify the number of layers of loops to break out of. In those languages, you can use `break` together with a number to specify how many layers of loops to break out of. The only example that I know of is C#.
- Some languages have `goto` statements. You can easily break from loops to wherever you want by using `goto` (actually, breaking out of nested loops is among the only recommended use cases for `goto`). Examples: C, C++.

However, in most other languages, it is not easy to break out of nested loops. A typical solution is this:

outer_loop do
  break_outer = false
  inner_loop do
    if condition
      break_outer = true
      break
    end
  end
  break if break_outer
end
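As a concrete, runnable instance of this flag pattern (the loops and the search condition are placeholders of my choosing):

```ruby
# Flag-based escape from nested loops: find the first pair (i, j)
# with i * j == 12. The inner loop sets a flag before breaking;
# the outer loop checks the flag immediately afterwards.
found = nil
(1..10).each do |i|
  break_outer = false
  (1..10).each do |j|
    if i * j == 12
      found = [i, j]
      break_outer = true
      break
    end
  end
  break if break_outer
end
p found # => [2, 6]
```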

In languages with exceptions, another possible workaround is to use exceptions (the catch–throw control flow):

catch :outer_loop do
  outer_loop do
    inner_loop do
      throw :outer_loop if condition
    end
  end
end
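In Ruby, `catch` and `throw` are built in, so this pattern runs as-is with real loops (the search condition here is a placeholder of mine):

```ruby
# catch/throw escape from nested loops: throwing :found unwinds both
# `each` blocks at once, and the thrown value becomes the catch's value.
pair = catch(:found) do
  (1..10).each do |i|
    (1..10).each do |j|
      throw :found, [i, j] if i * j == 12
    end
  end
  nil # value of the catch block if nothing was thrown
end
p pair # => [2, 6]
```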

I wrote a simple module to make this workaround easier to use.

class JumpLabel < StandardError
  attr_reader :reason, :arg
  # Define JumpLabel#break, #next, and #redo; only the first two carry a value.
  {break: true, next: true, redo: false}.each do |reason, has_args|
    define_method reason do |*args|
      @reason = reason
      @arg = args.size == 1 ? args.first : args if has_args
      raise self
    end
  end
end
class Module
  def register_label *method_names
    method_names.each do |name|
      old = instance_method name
      define_method name do |*args, **opts, &block|
        return old.bind_call self, *args, **opts unless block
        old.bind_call self, *args, **opts do |*bargs, **bopts, &bblock|
          block.call(*bargs, **bopts, jump_label: label = JumpLabel.new, &bblock)
        rescue JumpLabel => caught_label
          # Re-raise labels belonging to other (outer) registered methods.
          raise caught_label unless caught_label.equal? label
          case label.reason
          when :break then break label.arg
          when :next then next label.arg
          when :redo then redo
          end
        end
      end
    end
  end
end

Example usage:

Integer.register_label :upto, :downto
1.upto 520 do |i, jump_label:|
  print i
  1.downto -1314 do |j|
    print j
    jump_label.break 8 if j == 0
  end
end.tap { puts _1 }
# prints "1108": 1, then 1 and 0 from the inner loop, then the break value 8

The model is as follows. There are $n$ agents (nations); they can trade some type of good, and they use the same currency. Every agent may produce or consume the good. The benefit function of the $j$th agent is $B_j$, and its cost function is $C_j$. The net amount that the $j$th agent imports from the $k$th agent is $T_{j,k}$. The trade cost incurred by the $j$th agent is $S_j$. We want to find the amount $Q_j$ that every agent produces and the amounts $T_{j,k}$ that every agent imports from other agents. Assume that $S$ depends only on $T$ and not on $Q$. Also, assume that there is no externality (i.e., whenever $j\ne k$, $\partial_kB_j=0$ and $\partial_kC_j=0$), and that every agent is rational and has perfect information.

Now, consider the profit $\Pi_j$ of the $j$th agent. Subtract the cost from the benefit, and we have

\[\textstyle \Pi_j=B_j\!\left(Q_j+\sum_kT_{j,k}\right)-C_j\!\left(Q_j\right)-S_j\!\left(T\right).\]According to the fundamental theorem of welfare economics, $T$ and $Q$ are Pareto optimal under market equilibrium. We assume that this happens at a stationary point of the social benefit, where the social benefit is the sum of the profits of all agents. We can then get the equations

\[\begin{align*} &0=\frac{\partial}{\partial Q_l}\sum_j\Pi_j =B_l'\!\left(Q_l+\sum_kT_{l,k}\right)-C_l'\!\left(Q_l\right),\quad\forall l;\\ &0=\frac{\partial}{\partial T_{l,m}}\sum_j\Pi_j =B_l'\!\left(Q_l+\sum_kT_{l,k}\right)-B_m'\!\left(Q_m+\sum_kT_{m,k}\right) -\sum_j\frac{\partial S_j}{\partial T_{l,m}}\!\left(T\right),\quad\forall l<m. \end{align*}\]These are $n+\frac{n\left(n-1\right)}2$ equations, and $Q$ and $T$ have exactly $n+\frac{n\left(n-1\right)}2$ degrees of freedom in total (note that $T$ is anti-symmetric). In principle, we can solve for $Q$ and $T$.

For the case where there is no trade cost, we can see that the domestic prices are all equal, and the price may be called the world price.

However, given $S=0$, the equations above are not independent.
Actually, there are only $2n-1$ independent equations
(all $2n$ components of $B’$ and $C’$ are equal).
This means that, for $n>2$, free trade with zero trade cost is an **indeterminate system**.

This phenomenon looks counter-intuitive, but it is actually understandable: under zero trade cost, any two agents may trade an arbitrary amount of goods at the same world price, which provides extra degrees of freedom to the model. To be specific, if $(Q,T)$ is a solution of the model, then $(Q,T+\Delta T)$ is also a solution, where the anti-symmetric matrix $\Delta T$ satisfies

\[\sum_k\Delta T_{j,k}=0,\quad\forall j,\]of which only $n-1$ of the $n$ equations are independent. Therefore, the total number of degrees of freedom in the solution of the model is

\[n+\frac{n\left(n-1\right)}2-\left(\frac{n\left(n-1\right)}2-\left(n-1\right)\right)=2n-1.\]The useful quantities that we can solve for are the production $Q_j$ and the net import $T_j:=\sum_kT_{j,k}$ of every agent. Note that the net imports actually have only $n-1$ degrees of freedom because of the restriction $\sum_jT_j=0$.
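To make the $2n-1$ count concrete, here is a hypothetical two-agent example (the linear marginal functions $B_j'(x)=a_j-x$ and $C_j'(x)=x$ with $S=0$ are my own assumptions, chosen only for illustration):

```latex
% Stationarity conditions for n = 2 with S = 0 and the assumed marginals
% B_j'(x) = a_j - x, C_j'(x) = x; unknowns Q_1, Q_2, T_1 = -T_2 (2n - 1 = 3).
\begin{align*}
  a_1-\left(Q_1+T_1\right)&=Q_1, &
  a_2-\left(Q_2-T_1\right)&=Q_2, &
  a_1-\left(Q_1+T_1\right)&=a_2-\left(Q_2-T_1\right).
\end{align*}
% The first two equations identify the two sides of the third with Q_1 and
% Q_2, so Q_1 = Q_2 = (a_1 + a_2)/4 and T_1 = (a_1 - a_2)/2: unique for
% n = 2, whereas for n > 2 the analogous system is underdetermined.
```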

It is worth pointing out that the existence of middlemen, or re-exportation, is entirely due to the presence of trade costs. Here we consider a simplified problem: there are three agents acting respectively as the producer, the retailer, and the customer. The producer does not consume (its benefit is $0$); the customer does not produce (its cost and marginal cost are infinite); and the retailer neither produces nor consumes. Assume that the trade between any two of them imposes no cost on the third. Then, the social benefit is

\[\Pi=B\!\left(T_{\mathrm c,\mathrm r}+T_{\mathrm c,\mathrm p}\right) -C\!\left(T_{\mathrm c,\mathrm p}+T_{\mathrm r,\mathrm p}\right) -S_\mathrm c\!\left(T_{\mathrm c,\mathrm r},T_{\mathrm c,\mathrm p}\right) -S_\mathrm r\!\left(T_{\mathrm c,\mathrm r},T_{\mathrm r,\mathrm p}\right) -S_\mathrm p\!\left(T_{\mathrm c,\mathrm p},T_{\mathrm r,\mathrm p}\right).\]

In part 2, I will focus on non-thermal ensembles.

Before I proceed, I need to clarify that almost all ensembles that we actually use in physics are thermal ensembles, including the microcanonical ensemble, the canonical ensemble, and the grand canonical ensemble (the microcanonical ensemble can be considered a special case of thermal ensemble where $\vec W^\parallel$ is trivial).

The theory of thermal ensembles is built by letting the system in question be in thermal contact with a bath. Similarly, if we let the system in question be in non-thermal contact with a bath, we can get the theory of non-thermal ensembles. An example of non-thermal ensembles that is actually used in physics is the isoenthalpic–isobaric ensemble, where we let the system in question be in non-thermal contact with a pressure bath.

However, we will see that it is harder to measure-theoretically develop the theory of non-thermal ensembles if we continue to use the same method as in the theory of thermal ensembles.

A **thermal contact** is a contact between thermal systems that conducts heat
(while exchanging some extensive quantities).
A **non-thermal contact** is a contact between thermal systems that does not conduct heat
(while exchanging some extensive quantities).
For reversible processes, thermodynamically and mathematically,
heat is equivalent to a form of work,
where the entropy is the displacement and where the temperature is the force.
However, this is not true for non-reversible processes because of the Clausius theorem.
This should have something to do with the fact that
entropy is different from other extensive quantities (as is illustrated in
part 1).

First, let me introduce how we cope with reversible processes of two subsystems in non-thermal contact in thermodynamics. As an example, consider a tank of monatomic ideal gas separated into two parts by a thermally non-conductive, massless, incompressible plate in the middle that can move. The two parts can then adiabatically exchange energy ($U$) and volume ($V$) but not number of particles ($N$). For one of the parts, we have

\[0=\delta Q=\mathrm dU+p\,\mathrm dV=\mathrm dU+\frac{2U}{3V}\,\mathrm dV,\]which is good and easy to deal with because it is simply a differential 1-form.
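The relation $p=2U/3V$ used here follows from the standard monatomic ideal-gas formulas:

```latex
% For a monatomic ideal gas:
\[
  U=\tfrac32Nk_{\mathrm B}T,\qquad pV=Nk_{\mathrm B}T
  \quad\Longrightarrow\quad p=\frac{Nk_{\mathrm B}T}{V}=\frac{2U}{3V}.
\]
```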

However, this convenience is not available for non-reversible processes because then we do not have the simple relation $p=2U/3V$. Actually, the pressure is only well-defined for equilibrium states, and it is impossible to define a pressure that makes sense throughout a non-reversible process, which involves non-equilibrium states. Therefore, although it seems that the “thermally non-conductive” condition imposes a stronger restriction on which states the composite system can reach without external sources, it actually does not, because the energy exchanged by the subsystems when they exchange volume is arbitrary (as long as it does not violate the second law of thermodynamics) if the process is not reversible.

The possible states of the non-thermally composite system then cannot be simply described by a vector subspace of $W^{(1)}\times W^{(2)}$. If we try to use the same approach as constructing the thermally composite system to construct the non-thermally composite system, the attempt will fail.

Let us continue with our example of a tank of gas. Although the pressure is not determined during a non-reversible process, one thing is certain: the pressure on the plate by the gas on one side equals the pressure on the plate by the gas on the other side. This is because the plate is massless (otherwise its kinetic energy would be an external source of energy; also, remember that it is incompressible: this means that it cannot be an external source of volume). Therefore, the relation between the volume exchanged and the energy exchanged is determined as long as at least one side of the plate undergoes a reversible process, because then the reversible side has a determined pressure, which determines the pressure of the other side.

This is the key idea of formulating the non-thermal ensembles without formulating the non-thermally composite system. In a thermal or non-thermal ensemble, the composite system consists of two subsystems, one of which is the system in question, and the other is the bath which we are in control of. We can let the bath have zero relaxation time (the time for it to reach thermal equilibrium) so that any process of it is reversible. Then, the pressure (or generally, any other intensive quantities that we are in control of times the temperature) is determined (and actually constant), and we can express the non-conductivity restriction as

\[\mathrm dU+p\,\mathrm dV=0,\]where $p$ is the pressure, which is a constant. This is a homogeneous linear equation on $\vec W^{\parallel(1)}$ (whose vectors are denoted as $(\mathrm dU,\mathrm dV)$ in our case) which defines a vector subspace of $\vec W^{\parallel(1)}$, which we call $\vec W^{\parallel\parallel(1)}$. The dimension of $\vec W^{\parallel\parallel(1)}$ is that of $\vec W^{\parallel(1)}$ minus one. The physical meaning of $\vec W^{\parallel\parallel(1)}$ in this example is the hyperplane of fixed enthalpy.

Note that our bath actually has the fixed intensive quantities $i=\left(1/T,p/T\right)\in\vec W^{\parallel(1)\prime}$, we can rewrite the above equation as

\[\begin{equation} \label{eq: W star parallel} \vec W^{\parallel\parallel(1)} =\left\{s_1\in\vec W^{\parallel(1)}\,\middle|\,i\!\left(s_1\right)=0\right\}. \end{equation}\]Wait! What does $T$ do here? It is supposed to mean the temperature of the bath, but the temperature of the bath should be irrelevant since the contact is non-thermal. Indeed, it is irrelevant: the temperature of the bath serves only as an overall constant factor of $i$, which does not affect $\vec W^{\parallel\parallel(1)}$ as long as it is neither zero nor infinite. This means that the temperature of the bath is not necessarily fixed, so the actual number of fixed intensive quantities is the dimension of $\vec W^{\parallel(1)\prime}$ minus one, which is the same as the dimension of $\vec W^{\parallel\parallel(1)}$. Later we will see that anything relevant to the temperature of the bath will in the end be irrelevant to our problem. This seems magical, but you will see the sense in it after we introduce another way of developing the non-thermal ensembles (one that does not involve baths or non-thermal contact) later.

We can define a complement of $\vec W^{\parallel\parallel(1)}$ in $\vec W^{\parallel(1)}$ as $\vec W^{\parallel\perp(1)}$. Then, we have $\vec W^{\parallel(1)}=\vec W^{\parallel\parallel(1)}+\vec W^{\parallel\perp(1)}$. The space $\vec W^{\parallel\perp(1)}$ is a one-dimensional vector space.

For convenience, define $W^{\star\perp(1)}:=W^{\perp(1)}+\vec W^{\parallel\perp(1)}$. The vector space $\vec W^{\star\perp(1)}$ associated with it is a complement of $\vec W^{\parallel\parallel(1)}$ in $\vec W^{(1)}$. To make the notation look more consistent, we can use $\vec W^{\star\parallel(1)}$ as an alias of $\vec W^{\parallel\parallel(1)}$. They are the same vector space, but $\vec W^{\star\parallel(1)}$ emphasizes that it is a subspace of $\vec W^{(1)}$, and $\vec W^{\parallel\parallel(1)}$ emphasizes that it is a subspace of $\vec W^{\parallel(1)}$. Then, we have $W^{(1)}=W^{\star\perp(1)}+\vec W^{\star\parallel(1)}$. Every point in $W^{(1)}$ can be uniquely written as a sum of a point in $W^{\star\perp(1)}$ and a vector in $\vec W^{\star\parallel(1)}$. We can describe the decomposition by a projection $\pi^{\star(1)}:W^{(1)}\to W^{\star\perp(1)}$.

We will heavily use the “$\star$” in the superscripts of symbols. Any symbol labeled “$\star$” depends on $i$ (but is independent of an overall constant factor of $i$). You can regard those symbols as having an invisible “$i$” in the subscript to keep in mind that they depend on $i$.

*Example.*
Suppose we have a tank of gas with three extensive quantities $U,V,N$.
It is in non-thermal contact with a pressure bath with pressure $p$
so that it can exchange $U$ and $V$ with the bath.
Then, the projection $\pi^{\star(1)}$ projects macrostates
with the same enthalpy and number of particles into the same point.
Because a complement of a vector subspace is not determined,
there are multiple possible ways of constructing the projection.
One possible way is

\[\pi^{\star(1)}\!\left(U,V,N\right)=\left(U+pV,0,N\right).\]Here the fixed intensive quantity $p$ is involved. Note that this projection is still valid for different temperatures of the bath, so an overall constant factor of $i$ does not affect the projection.

Now, after introducing non-thermal contact with an example, we can now formulate the non-thermal contact with a bath.

Suppose we have a system $\left(\mathcal E^{(1)},\mathcal M^{(1)}\right)$. The main approach is to construct a new composite system out of the composite system for the $\vec W^{\parallel(1)}$-ensemble.

The composite system for the $\vec W^{\parallel(1)}$-ensemble was introduced in part 1. We denote the bath that is in contact with our system as $\left(\mathcal E^{(2)},\mathcal M^{(2)}\right)$.

Consider this projection $\pi^\star:W\to W^{\star\perp}$ (where $W^{\star\perp}$ is an affine subspace of $W$ and the range of $\pi^\star$):

\[\begin{equation} \label{eq: pi star} \pi^\star\!\left(e_1,e_2\right) :=\left(\pi^{\star(1)}\!\left(e_1\right), \rho_{\pi(e_1,e_2)}\!\left(\pi^{\star(1)}\!\left(e_1\right)\right)\right). \end{equation}\]To ensure that it is well-defined, we need to guarantee that $\pi^{\star(1)}\!\left(e_1\right)\in W^{\parallel(1)}_{\pi(e_1,e_2)}$ for any $e_1,e_2$, and this is true.

The two spaces $W^{\star\perp}$ and $W^{\perp}$ have no direct relation, except that the dimension of $W^{\star\perp}$ is one plus the dimension of $W^{\perp}$ (if they are finite-dimensional).

What is good about the projection $\pi^\star$ is that it satisfies $\vec W^{\star\parallel(1)}=\vec c^{(1)}\!\left(\vec\pi^\star(0)\right)$. This makes our notation consistent if we construct another composite system out of $\pi^\star$. Now, consider the composite system of $\left(\mathcal E^{(1)},\mathcal M^{(1)}\right)$ and $\left(\mathcal E^{(2)},\mathcal M^{(2)}\right)$ under the projection $\pi^\star$. In the notation of the spaces and mappings that are involved in the newly constructed composite system, we write “$\star$” in the superscript.

Just like how $\vec W^{\star\parallel(1)}$ is a subspace of $\vec W^{(1)}$, $\vec W^{\star\parallel(2)}$ is also a subspace of $\vec W^{(2)}$. This means that both $\vec\rho^{-1}\circ\vec\rho^\star$ and $\vec\rho\circ\vec\rho^{\star-1}$ are well-defined. The former maps $\vec W^{\star\parallel(1)}$ to another subspace of $\vec W^{(1)}$, and the latter maps $\vec W^{\star\parallel(2)}$ to another subspace of $\vec W^{(2)}$.

We can think of the construction of the new composite system as replacing the “plate” between the subsystems in the original composite system: a “thermally conductive plate” is replaced by a “thermally non-conductive plate”. Suppose that in the new situation, the intensive quantities “felt” by subsystem 1 are $i^\star\in\vec W^{\star\parallel(1)\prime}$. Then, because the bath is still the same bath in the two situations, we have

\[-i^\star\circ\vec\rho^{\star-1}=-i\circ\vec\rho^{-1}.\]Therefore,

\[\begin{equation} \label{eq: i star} i^\star:=i\circ\vec\rho^{-1}\circ\vec\rho^\star \end{equation}\]would be a good definition of $i^\star$. However, actually $i^\star$ is trivial:

\[\begin{equation} \label{eq: i star = 0} i^\star=0. \end{equation}\]This is because Equation \ref{eq: pi star} shows that $\rho\!\left(W^{\star\parallel(1)}_e\right)=W^{\star\parallel(2)}_e$, and thus

\[\vec\rho^{-1}\!\left(\vec\rho^\star\!\left(\vec W^{\star\parallel(1)}\right)\right) =\vec W^{\star\parallel(1)},\]which is the kernel of $i$ by definition.

Because $i^\star$ is trivial, it is irrelevant to the temperature of the bath because it is zero no matter what temperature the bath is at.

*Example.*
Suppose a system described by $U_1,V_1,N_1$ is in non-thermal contact with a pressure bath,
and they can exchange energy and volume.
The projection $\pi$ is

\[\pi\!\left(U_1,V_1,N_1,U_2,V_2,N_2\right) =\left(0,0,N_1,U_1+U_2,V_1+V_2,N_2\right).\]

Then, the projection $\pi^\star$ can be

\[\pi^\star\!\left(U_1,V_1,N_1,U_2,V_2,N_2\right) =\left(U_1+pV_1,0,N_1,U_2-pV_1,V_1+V_2,N_2\right).\]By choosing a different $\pi^{\star(1)}$ or a different $\pi$, we can get a different $\pi^\star$. They physically mean the same composite system.

The space $W^\perp$ is four-dimensional, and the space $W^{\star\perp}$ is five-dimensional. We can denote the five degrees of freedom as $U,V,H_1,N_1,N_2$, where $U:=U_1+U_2$ is the total energy, $V:=V_1+V_2$ is the total volume, and $H_1:=U_1+pV_1$ is the enthalpy of subsystem 1. Then, the projection $\pi^\star$ can be written as

\[\pi^\star\!\left(U_1,V_1,N_1,U_2,V_2,N_2\right) =\left(H_1,0,N_1,U-H_1,V,N_2\right).\]We can get $W^{\star\parallel}_e$ by finding the inverse of the projection, where $e:=\left(H_1,0,N_1,U-H_1,V,N_2\right)$:

\[W^{\star\parallel}_e:=\pi^{\star-1}\!\left(e\right) =\left\{\left(H_1-pV_1,V_1,N_1,U-H_1+pV_1,V-V_1,N_2\right)\middle|\,V_1\in\mathbb R\right\}.\]Because it is parameterized by one real parameter $V_1$, it is a one-dimensional affine subspace of $W$. Projecting it under $c^{(1)}$ and $c^{(2)}$ will respectively give us $W^{\star\parallel(1)}_e$ and $W^{\star\parallel(2)}_e$:

\[W^{\star\parallel(1)}_e :=\left\{\left(H_1-pV_1,V_1,N_1\right)\middle|\,V_1\in\mathbb R\right\},\] \[W^{\star\parallel(2)}_e :=\left\{\left(U-H_1+pV_1,V-V_1,N_2\right)\middle|\,V_1\in\mathbb R\right\}.\]The affine isomorphism $\rho^\star_e$ is then naturally

\[\rho^\star_e\!\left(H_1-pV_1,V_1,N_1\right)=\left(U-H_1+pV_1,V-V_1,N_2\right).\]Its vectoric form is then

\[\vec\rho^\star\!\left(-p\,\mathrm dV_1,\mathrm dV_1,0\right) =\left(p\,\mathrm dV_1,-\mathrm dV_1,0\right).\]Our fixed intensive quantities are $i$, which is defined as $i\!\left(\mathrm dU_1,\mathrm dV_1,0\right)=\frac1T\,\mathrm dU_1+\frac pT\,\mathrm dV_1$. We can then get $i^\star$ by

\[i^\star:=i\circ\vec\rho^{-1}\circ\vec\rho^\star =\left(-p\,\mathrm dV_1,\mathrm dV_1,0\right)\mapsto0.\]This is consistent with Equation \ref{eq: i star = 0}.

Now, we can define the non-thermal contact with a bath to be the same as the thermal contact with a bath under $\pi^\star$. Utilizing this definition, we can define the composite system for non-thermal ensembles.

*Definition.*
A **composite system for the non-thermal $\vec W^{\parallel(1)}$-ensemble**
of the system $\left(\mathcal E^{(1)},\mathcal M^{(1)}\right)$
with fixed intensive quantities $i$
is the same as the composite system for the thermal $\vec W^{\star\parallel(1)}$-ensemble
with fixed intensive quantities $i^\star=0$ (given by Equation \ref{eq: i star = 0}),
where $\vec W^{\star\parallel(1)}$ is defined by Equation \ref{eq: W star parallel}.

This definition looks very neat. Also, just like how we define the domain of fixed intensive quantities of a thermal ensemble, we can define the domain of fixed intensive quantities of a non-thermal ensemble to consist of those values that make the integral in the definition of the partition function converge.

Because we already derived the formula of the partition function in part 1 that does not involve information about the bath anymore, we can drop the “$(1)$” in the superscripts. The partition function of the non-thermal ensemble is then

\[Z^\star\!\left(e,i^\star\right)=\int_{s\in\vec E^{\star\parallel}_e} \Omega\!\left(e+s\right) \mathrm e^{-i^\star\left(s\right)}\,\mathrm d\lambda^{\parallel}\!\left(s\right),\quad e\in E^{\star\perp},\quad i^\star\in I^\star_e\subseteq\vec W^{\star\parallel\prime}.\]Here, $i^\star$ is not fixed at the trivial value $0$ (I abuse the notation here) but is an independent variable serving as one of the arguments of the partition function, taking values in $I^\star_e$ (which is not the domain of fixed intensive quantities of the non-thermal ensemble mentioned above). However, the only meaningful information about this non-thermal ensemble is in the behavior of $Z^\star$ at $i^\star=0$ rather than at an arbitrary $i^\star\in I^\star_e$, but we do not know whether $0\in I^\star_e$ or not. This is then a criterion to judge whether $i$ is in the domain of fixed intensive quantities of the non-thermal ensemble. To be clear, we define

\[J:=\left\{i\in\vec W^{\parallel\prime}\,\middle|\, \exists e\in E^{\star\perp}:0\in I^\star_{e}\right\}.\]A problem about this formulation is that it is possible to have two $i$s that share the same thermal equilibrium state. In that case, the non-thermal ensemble is not defined.

Because $i^\star=0$, the observed extensive quantities in thermal equilibrium are just

\[\begin{equation} \label{eq: epsilon^circ} \varepsilon^\circ =e+\left.\frac{\partial\ln Z^\star\!\left(e,i^\star\right)}{\partial i^\star}\right|_{i^\star=0} =e+\frac{\int_{s\in\left(E-e\right)\cap\vec W^{\star\parallel}} s\Omega\!\left(e+s\right)\mathrm d\lambda^{\parallel}\!\left(s\right)} {\int_{s\in\left(E-e\right)\cap\vec W^{\star\parallel}} \Omega\!\left(e+s\right)\mathrm d\lambda^{\parallel}\!\left(s\right)}, \end{equation}\]and the entropy in thermal equilibrium is just

\[\begin{equation} \label{eq: S^circ} S^\circ=\ln Z^\star\!\left(e,0\right) =\ln\int_{s\in\left(E-e\right)\cap\vec W^{\star\parallel}} \Omega\!\left(e+s\right)\mathrm d\lambda^{\parallel}\!\left(s\right). \end{equation}\]We can cancel the parameter $e$ by Equation \ref{eq: epsilon^circ} and \ref{eq: S^circ} to get

\[\begin{equation} \label{eq: S^circ vs epsilon^circ} S^\circ=\ln Z^\star\!\left(\pi^\star\!\left(\varepsilon^\circ\right),0\right) =\ln\int_{s\in\left(E-\varepsilon^\circ\right)\cap\vec W^{\star\parallel}} \Omega\!\left(\varepsilon^\circ+s\right)\mathrm d\lambda^{\parallel}\!\left(s\right). \end{equation}\]What is interesting about Equation \ref{eq: S^circ vs epsilon^circ} is that it actually does not guarantee the intensive variables to be defined in $\vec W^\parallel$. Physically this means that the temperature is not necessarily defined, unlike the case of thermal ensembles (this is because the thermal contact makes the temperature the same as the bath and thus defined). The thing that is guaranteed is that the intensive variables are defined in $\vec W^{\star\parallel}$ and they must be zero. Therefore, whenever the intensive variables are defined in $\vec W^\parallel$, it must be parallel to $i$ (and remains the same if we scale $i$ by an arbitrary non-zero factor). Physically, this means that the system must have the same intensive variables as the bath up to different temperatures.

It may seem surprising that we can define non-thermal ensembles without a bath. How is it possible to fix some features of the intensive variables without a bath? The inspiration comes from looking at Equation \ref{eq: W star parallel}. We can make a guess here: if we contract the system along $\vec W^{\star\parallel}$, the contraction satisfies the equal a priori probability principle. We make this guess because of the following arguments:

- Mathematically, the contraction is a legitimate new system, so it should also satisfy the axioms that we proposed before.
- Physically, because the temperature of the bath is arbitrary, the different accessible macrostates should not be too different, because otherwise the temperature would matter (as it appears in the expression of the partition function).

After finding the equilibrium state of the contraction, we can use the contractional pullback to find the equilibrium state of the original system.

If you do it right, you should get the same answer as Equation \ref{eq: S^circ vs epsilon^circ}.

The only axiom that we used is the equal a priori probability principle. Then, we formulated three types of ensembles: microcanonical, thermal, and non-thermal.

I am writing this blog post mainly because I feel that I am gradually leaving the beginner stage of physics behind, and it is time to summarize my own learning methods. This article may serve as a reference for beginners in the sciences.

In this article, all third-person pronouns referring to people are *they*, without distinguishing gender.

All examples that I have encountered in real life are presented in quote blocks.

The term *science subjects* (理科) originates from China's practice of dividing students into science and humanities tracks, in opposition to *humanities subjects*.
You may read *science subjects* in this article as *science* in the broad sense,
including the formal sciences (such as mathematics and computer science), the natural sciences (such as physics and chemistry), the social sciences (such as economics and political science), and so on.
I avoid the word *science* mainly because, in the Chinese context, *science* usually refers only to the natural sciences, sometimes not even including mathematics.

All science subjects share some common features, so many learning methods apply across different subjects.
However, since my own major is physics and my exposure to other fields is limited, I cannot guarantee that the methods introduced in this article apply to all science subjects.
In fact, each subject has its own **characteristics**, such as particular methodologies and core ideas.
While reading this article, you should keep the characteristics of your own subject in mind.

We need to delimit what the *beginner stage* is. It is a vague concept.
I personally tend to define it as the undergraduate courses of your major, because they are the earliest way you approach the discipline professionally.
The subject knowledge taught in primary and secondary school leans toward general education and exam preparation, so it does not count as learning (in the truly professional sense).
Also, those who study for academic competitions (including competitions for secondary school students) do **not** count as beginners,
because competition implies rivalry, and its intensity is not what a beginner should face.

This stage has many features:

- There is a very systematic body of knowledge. You can divide the field into multiple subfields, and within each subfield there is a fairly fixed order of learning. There is no fixed order among the subfields, but knowledge in different subfields can “unlock” one another.
- Almost all of the academic problems you encounter (homework, exams, etc., excluding things like projects) have **no essential complexity**. Most problems can be solved within a day.
- Learning resources are abundant. You can find plenty of textbooks. Because the literature is ample, Wikipedia usually has fairly complete articles. Because many people study the same material, the questions you will run into have usually already been asked on Stack Exchange, and when you ask one yourself, someone can usually answer it.
- Most knowledge is “memorizable”. Of course, in principle any knowledge is memorizable if you are willing to learn it by rote, but *memorizable* here means memorization through understanding: **in the long term, if you once understood a piece of knowledge, you can still understand it long afterwards; in the short term, if you understand a piece of knowledge, you can probably remember it without deliberate memorization.**
- Essentially all the knowledge you learn is told to you by others rather than created or discovered by yourself.

这些特点看起来是在说这个阶段的学习是非常容易的 (虽然一定程度上也确实如此), 但是并不是所有人都能顺利地度过这个阶段. 或许你可以将其归因于这部分人不够聪明, 但其实我认为主要还是经验和方法上的欠缺 (当然, 肯定有一小部分人是真的不够聪明, 但是这部分人的比例很小, 因为世界上的人都是基本上差不多聪明的, 非常笨和非常聪明的人的比例是极小的). 我帮过很多同学解决学习上遇到的问题, 同时发现了很多不同的学习方法上的问题.

这引出了我想要用这篇文章来总结我自己的学习方法的目的: 为这些初学者提供一些参考. 不过, 这并不是说我所说的方法就不适用于其他阶段的学习, 只是在那之后的你应当已经有了足够的经验, 可以找到最适合自己的学习方法.

All articles about learning methods share a common problem: **they are actually useless**.
I can foresee that reading mine will most likely be useless too
(after all, even articles on learning methods written by professionals in education or other specific fields are mostly useless).
Such articles are like telling you how to put an elephant into a refrigerator: you understand all the reasoning, yet you still cannot put it into practice.

Two things are hard: one is turning your own experience into a theory that can be written down; the other is turning a theory someone else wrote down into your own experience. After years of experience, I may have come to take for granted some processes that are actually nontrivial and omit them from my theoretical account, and since this happens unconsciously, I do not even know what I have omitted. It may be something crucial, perhaps the very core of my learning method, yet I am unaware of it. And when you read an article, you may skim a string of words and assume it is truisms you already knew, when in fact what it expresses is not quite what you imagine, or your understanding of certain expressions does not fully match what the author really meant. This also happens unconsciously, and it is hard to avoid even if you read very carefully.

I point out these two difficulties not to introduce a way to resolve them. There is in fact no effective way to resolve them. I point them out only to remind both author and reader of their existence, so that both can stay alert while writing and reading.

I believe **the most central idea running through all science learning is abstraction**.

**Abstraction makes a theory general.** Here is a simple example.
Suppose we have a rubber ball and want to estimate its volume, so we develop a theory to compute its volume.
But this theory is now specific to the rubber ball and does not apply to other objects, such as a shot put.
A rubber ball has a great many parameters: elasticity, weight, position, color, and so on.
These parameters are all completely different for another object, say some shot put, so the original theory most likely does not apply to the shot put.
However, if from the start we abstract the rubber ball into a sphere, discarding every parameter except the radius,
and we eventually find that the volume of any sphere can be computed from its radius,
then our theory will apply to any object that can be abstracted into a sphere, such as any shot put.
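As a toy illustration of this point (a sketch of my own, with made-up radii): once the theory depends only on the radius, a single function serves every object abstracted as a sphere.

```python
import math

def sphere_volume(radius: float) -> float:
    """Volume of any object abstracted as a sphere of the given radius."""
    return 4 / 3 * math.pi * radius ** 3

# The same theory now applies to a rubber ball and a shot put alike.
rubber_ball = sphere_volume(0.11)  # hypothetical radius in meters
shot_put = sphere_volume(0.06)
print(rubber_ball, shot_put)
```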

Abstract algebra in mathematics (or just *algebra*) is another excellent example, as its very name suggests.
Any mathematical objects satisfying certain operational properties form a certain structure, and we do not care what underlying implementation gives them those properties;
this lets our theory apply to a great many different objects, as long as they all satisfy those properties.

You may feel all this is truism, that you already knew a rubber ball can be abstracted into a sphere when the problem is to estimate volume,
but in practice, when you face a real-world problem, you often fail to see these possibilities for abstraction.
**Sometimes your thinking is completely bound to one specific thing, and it never occurs to you that it can actually be abstracted further.**

Someone once asked me why precession (in plain words, the angular momentum $\mathbf L$ rotating) satisfies $\dot{\mathbf L}=\boldsymbol\Omega\times\mathbf L$. I said this is just the formula for circular motion. He asked what the original formula was, and I said $\dot{\mathbf r}=\boldsymbol\Omega\times\mathbf r$. He said the $\mathbf L$ here is angular momentum, not position; I said I know, but both are circular motion. He still did not understand.

This is a good example of lacking abstract thinking. The circular-motion formula you learned does not only describe circular motion of displacement in the real world; the circular motion of any time-dependent vector, such as the angular momentum, can be described by it. If you cannot understand this, your brain has been pinned to one concrete scenario, your tools are only ever used on problems in that scenario, and you fail to realize that the scenario can be abstracted into a more general one in which your tools solve more general problems.

I have seen someone strongly resist writing a chemical reaction equation as $\sum_i\nu_iA_i=0$, claiming that this form is merely a fancy-looking notation that brings no substantive benefit.

This is an example of resisting abstract thinking. That reaction equation is an abstraction of any single-phase chemical reaction, and a theory developed by studying it applies to any single-phase chemical reaction. Which substances participate, what products form, what the stoichiometric coefficients are: all of these are abstracted away. Even with only this much information left, you can still obtain the law of mass action, chemical affinity, Le Chatelier's principle, and so on. But if you never abstract, you will never know whether what holds in this reaction also holds in another.

Someone once asked me a question (he was studying statistics at the time), phrased like this: "If a person plays the same song three times, is the variance $\operatorname{var}(3X)$?" I asked what $X$ was; he said $X$ is picking a song. I went back and forth countless times explaining why I could not understand his question, and only after 20 minutes did I finally get that $X$ was the duration of the song, and his question was: pick a song at random and repeat it three times; what is the variance of the total duration?

This is an example of incorrect abstraction. In this case "song" is actually a meaningless concept, and so is "duration". The only meaningful concept is "some random variable $X\sim f(X)$", and "repeating a song three times" can be abstracted as $3X$.
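A quick numerical check of this abstraction (a sketch of my own with hypothetical song lengths, not part of the original anecdote): repeating one random draw three times gives $\operatorname{var}(3X)=9\operatorname{var}(X)$, unlike summing three independent draws, which gives $3\operatorname{var}(X)$.

```python
import random
import statistics

random.seed(0)
durations = [120, 180, 240, 300]  # hypothetical song lengths in seconds

# Repeating ONE random song three times: total = 3X, so var(3X) = 9 var(X).
totals_same = [3 * random.choice(durations) for _ in range(100_000)]

# Summing THREE independent random songs instead: var = 3 var(X).
totals_indep = [sum(random.choice(durations) for _ in range(3))
                for _ in range(100_000)]

var_x = statistics.pvariance(durations)  # variance of one uniform pick
ratio_same = statistics.pvariance(totals_same) / var_x
ratio_indep = statistics.pvariance(totals_indep) / var_x
print(ratio_same, ratio_indep)  # close to 9 and 3 respectively
```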

Statistics is full of abstraction (in fact, every word problem in statistics is an abstraction). I have seen many statistics learners ask questions. Sometimes their questions are extremely concrete (to be exact, they paste the entire problem), like "Kat surveyed the apples in Mar's apple store", followed by a wall of background text that makes my eyes glaze over. They should abstract the question; for example, "given samples drawn from two populations, how do we show that the two population means differ?" is a good abstraction. Other times they do abstract their question but get the abstraction wrong, keeping some meaningless concepts while dropping some meaningful ones, for example "how do we compare whether the means of two apple stores are the same", where "apple store" is a meaningless concept (it is even wrong, since an apple store has no mean), and "compare" is an over-abstracted concept (it drops details relevant to the problem: are we merely comparing which of two real numbers is larger? Testing whether the means of two distributions are equal? Testing whether one is greater than the other?).

I think *pace* and *speed* are not quite the same concept. To me, *pace* refers more to how much effort you spend studying each day,
while *speed* emphasizes how much knowledge you actually learn.
Learning speed is significantly affected by learning methods, and (in the long run, basically) the faster the better
(given, of course, that the opportunity cost stays constant), but the pace of learning is not like that.
In fact, pace is itself part of the learning method: **you should find the pace that suits you, not the fastest one**.

The optimal sweet spot depends on your preferences, physical condition, learning goals, environment, and so on, and has little to do with talent. My own preferred pace is rather relaxed and casual: I study when I feel like it and do not when I do not, with no rigorous plan. I believe this pace lets me maintain fairly efficient learning over the long term. For most people, a good pace is one that feels relaxed and easy. Some prefer to put a little pressure on themselves, which is also perfectly fine. Unfortunately, for people who really dislike studying, a pace set by their own preference may yield a learning speed too low to reach their learning goals. For example, a university may require at least 120 credits to graduate, but some people dislike studying so much that even a single credit is a burden (this does happen, though in that case you generally need to ask yourself what the real purpose of your studying is: could you choose not to attend university? This leads to some related social issues, which are not what I want to discuss now). This is what I meant by learning goals and environment also affecting your optimal pace. Learning goals themselves can be good or bad, which I will discuss later.

Speaking of which, I notice a very widespread phenomenon: people review intensively before exams.
This phenomenon is not only widespread but also taken for granted (and even regarded as a good learning method).
However, I believe that for science beginners, this is a **misconception**.
First, note that the Chinese word *复习* and the English word *review* differ in their literal meanings:
the Chinese word literally means *learn again*, while the English word literally means *look again*.
*Learning again* is closer to the misconception I have in mind.
Also, let me clarify: *learning again* presupposes that you learned the material before.
If you did not (for example, you skipped lectures without mastering the material) and only study intensively right before the exam,
that cannot be called *learning again*; it is just ordinary *learning*
(actually, I do not entirely object to this kind of last-minute cramming, because it may be the optimal pace for you).

As I said before, at the beginner stage most knowledge is memorizable: if you once understood it, you can still understand it.
In that sense you do not need to learn again; you only need to look again,
because you only need to refresh your memory, not rebuild your understanding.
In reality, however, people's intensive reviewing is often genuine re-learning:
they never understood certain things in the first place and try to understand them through pre-exam review.
That is the misconception I mean.
**If you did not understand something the first time you learned it, the time you spent on that first pass was wasted, which is a pity**
(you might as well have spent it watching shows or playing games).
If you cannot learn from lectures, stop attending them (do not worry about wasting tuition; that is a sunk cost) and self-study instead;
if attendance is graded (a practice by professors/schools that I strongly oppose), then self-study during the lecture. In any case, do not waste time.

Many people's reason for reviewing is to reinforce memory. But as I have explained, this does not require intensive review (that is, learning again). For the things you cannot remember through understanding, exams usually allow a cheat sheet, and you only need to write down what you cannot recall. If the exam does not allow a cheat sheet, the paper will surely provide all the information you might not be able to recall. If the exam allows no cheat sheet, the paper does not provide the things you understand deeply yet simply cannot memorize, and it actually tests them, then it is not a good exam, and you should blame the exam.

A classmate in a differential equations course once asked me how to do a problem. I walked him through it step by step, but there was one step he just could not see where it came from. After going back and forth for a long while, I realized he did not know how to differentiate $y^2$ with respect to $x$ (where $y$ is a function of $x$). I told him this is called the chain rule, and he said: "The chain rule, that sounds like something from ages ago, learned in high school; I've pretty much forgotten it all."

This is a typical case of forgetting due to not understanding. The chain rule is a typical piece of knowledge that is impossible to forget once understood. I can even forgive you for not remembering $\mathrm d\tan x/\mathrm dx=\sec^2x$ (well, actually even that is not so easily forgiven), but I will absolutely not forgive you for not remembering the chain rule. Similarly, you may fail to recall the product-to-sum formulas, but you must not be unaware that superposing vibrations of nearby frequencies produces beats; you may fail to recall the formulas of the Fourier series expansion, but you must not be unaware that $\mathrm e^{\mathrm ikx}$ form a complete orthogonal basis. There are many such examples, and the core idea is: do not memorize formulas, but never fail to understand.
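A quick numerical illustration of that step (a sketch of my own, not from the original conversation): differentiating $y^2$ with $y=\sin x$ should give $2y\,y'=2\sin x\cos x$, which a central difference confirms.

```python
import math

def d_dx(f, x, h=1e-6):
    """Central-difference approximation of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 0.7
numeric = d_dx(lambda t: math.sin(t) ** 2, x0)  # d/dx of y^2 with y = sin x
chain_rule = 2 * math.sin(x0) * math.cos(x0)    # 2 y dy/dx
print(abs(numeric - chain_rule))  # tiny
```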

Besides reinforcing memory, another reason many people review is to build proficiency, and their way of reviewing is doing problems.
You could say that doing problems is part of studying (jokingly: when your mom asks whether you are studying, you can say yes if you are doing problems),
but strictly speaking, doing problems is not learning.
Or rather, **pure problem-solving does not teach you anything; it only raises your proficiency in applying the knowledge you have already learned**.
So doing problems before an exam cannot strictly be called review.
In fact, **doing problems is a great way to train your abilities**, an activity suitable whether or not an exam is near.

Homework and exams are problem-solving endowed with special significance: how well you do affects your grades.
Although homework and exams are the most reasonable way to evaluate students' learning outcomes, they introduce a very fundamental contradiction into learning:
they easily distort students' motivation, and the distorted motivation is called *exam-oriented learning*.

**Exam-oriented learning is a very unhealthy motivation**: it causes much negative emotion (from stress) and leads to some wrong learning methods.
Exam-oriented learning favors the study of problem-solving tricks over the understanding of the knowledge itself, which almost inevitably leads to forgetting
(I strongly suspect that the classmate above who forgot the chain rule was a victim of exam-oriented learning).
That said, this also depends on how the exam is designed: if scores correlated strongly with understanding, exam-oriented learning would not cause such serious problems,
but unfortunately, designing a good exam is very hard.

In primary and secondary school, exam-oriented learning is hard to avoid.
Fortunately, in university there is a very simple way to avoid it: **forget about exams**.
Because this method cannot work in primary and secondary school, people seem to often forget this rather obvious method,
failing to think of it as a way to avoid exam-oriented learning even after leaving school.

"Forgetting about exams" is not one of those things that are easier said than done.
It is easy to say and easy to do: you simply ignore the existence of "exams".
There are a few sayings repeated to death: **you go to school to learn, not to take exams; exams exist to check learning outcomes, they are not the purpose of learning**.
These sayings are not wrong, and they do hit the point, but everyone has heard them so many times that they now sound hollow and meaningless.
If you are a student trapped in exams, you had better re-examine the meaning of these words and refuse to be driven by the inertia of your school-days mindset.

A restaurant near our school has a weekly discount. A friend invited me to eat there, but for two weeks in a row something suddenly came up for me, so we postponed for two weeks, right into exam week. He said: "We'll have to eat next semester." I said: "Isn't next week still school?" He said: "I have exams Monday through Wednesday." I said: "You don't eat during exams?"

This is a trivial daily matter with no close relation to studying, but it projects a problem in his learning method: he has fallen into exam-oriented learning (saying it this way sounds like labeling, but it is not, because I know him fairly well; I just picked a rather typical small incident to illustrate the problem).

You may feel I am talking from a position of comfort and say sarcastically: "I'm not as smart as you; unlike you, exams are like drinking water to me."
That remark is only half right: exams are indeed like drinking water to me, but that has little to do with being smart.
This brings us to a topic: **understanding knowledge**.
It is actually a very complex and profound topic, and it is also the most important core of learning methods.
The next several sections will revolve around it.

In my experience, **the essence of understanding knowledge is making the knowledge part of your intuition**.
Savor that sentence for a moment and see whether it is right.
I think it neatly captures how knowledge comes to be understood, and I can hardly explain it in more detail,
but let me still try to expand on it a little.
When you acquire a piece of knowledge, you do not necessarily understand it immediately, because there is still a small resistance inside you; you have not turned that corner yet.
But there will be a moment, call it an *epiphany* perhaps, when you suddenly see how simple, how obvious, how natural this knowledge is;
suddenly it becomes part of your worldview, enters your intuition, strikes you, and everything becomes clear.
When you have that feeling, you have understood the knowledge.
Of course, understanding does not require an epiphany; you may understand the moment you first see the knowledge, and that is actually the more common case.

Thus, we may say that **learning is the acquired shaping of intuition**. On this point,
there is a passage in C. N. Yang's short essay *My Experience of Learning and Research*:

I had never taken high-school physics. To take that entrance exam, I borrowed a high-school physics textbook and studied it on my own for a few weeks, and found that physics suited me well, so at the National Southwestern Associated University I chose the physics department. I remember very clearly reading in that textbook that the acceleration in circular motion points toward the center, not along the tangent. At first I felt this differed from my intuition, and only after thinking carefully for a day or two did I understand that velocity is a vector, with not only a magnitude but also a direction. This story greatly enlightened me: at every moment each person has some intuitions, most of which are correct, but some need revising; new ideas must be absorbed to form new, more correct intuitions. I thus came to understand:

on the one hand intuition is very important, but on the other hand one must be able to absorb new ideas in time to revise one's own intuition.

I quoted this passage directly because I find it brilliant: it spells out the role intuition plays in learning. Intuition is both the tool and the subject of learning. It is like a coding AI that modifies itself by writing programs, thereby becoming better at writing programs.

Humans are creatures good at imagining, and we imagine often.
The way humans use intuition is often also through imagination.
For example, if I ask you whether a house is box-shaped, you will imagine a house; if it is box-shaped, you will say yes.
But a house is something you have seen, so it may not require imagination.
As another example, if I ask you to explain the $\mathrm{S_N}2$ reaction, you will imagine a nucleophile slowly pulling a nucleus toward itself from behind
while a leaving group drops off on the other side, even though you have never seen this process with your own eyes.

However, the limitation of imagination is obvious: it depends too heavily on the human senses.
Many people claim to be visual learners. Unfortunately, if that includes you, do **not** take pride in it.
In the sciences, understanding knowledge through vision (or any other sense) is often difficult,
because the information received by the senses is very concrete and limited to the real-world things you have already observed,
while the objects studied in the sciences are often highly abstract, corresponding to nothing directly observable in reality.

Here there is actually a field-dependent issue. For the natural and social sciences, sense-based imagination is still useful in the beginner stage, because those disciplines study real-world things. But in mathematics and computer science, sense-based imagination is less useful and may even mislead you.

**The best imagination is not sense-based imagination, but concept-based imagination.**
When you imagine a thing, you do not try to see it with your eyes; you imagine its existence and grasp its concept with the mind in a direct and abstract way.
For example, suppose I ask you to imagine a set.
If you imagine a pile of round or square objects, or any concrete numbers, or anything that has a visual form for you, then you are imagining through the senses.
When I imagine a set, I do not see anything; instead, a conceptual "clump of stuff" surfaces in my mind.
Likewise, when imagining a countable set I think of a clump of numbered stuff; when imagining a function I think of a correspondence between two clumps of stuff; and so on.
Different imaginings give me different feelings, feelings hard to describe in words, and these feelings come from the intuition accumulated through learning.
To let you feel what I mean, here is a small example, provided you have learned a bit of algebra and functional analysis.
I will now say a string of concepts, and each time I say one, imagine what it looks like:
module, free module, vector space, TVS, Hausdorff TVS, Banach space, Hilbert space, Euclidean space.
Each concept in this string is a generalization of the one after it, so each time you advance to the next concept, what you imagine acquires more structure,
and the extra structure changes how the concept feels to you.
If you know what I am talking about, you will feel that the concepts at the start give a very strong "algebraic feel",
and as you go on, the "algebraic feel" weakens while the "geometric feel" strengthens.

Sometimes, when I think about a problem, I do not pick up a pen; I put all the objects of study in my head, imagine their existence, and think inside my mind. Only when I come up with an idea do I write it down; if I start writing too early, my abstract thinking gets somewhat disturbed by the visual information and becomes too concrete, no longer clear.

People love to go on about "relativity", "quantum mechanics", and "four-dimensional space", making them famously unimaginable things, but that is only because people's imagination is always limited by their senses; what they imagine is always limited by the experience their observations have given them. Once you learn to imagine things conceptually, beyond the senses, you will find that these things are not hard to imagine at all.

Another way of using intuition is natural language.
Language is itself an abstraction of concepts, and it has a huge advantage: humans are extremely familiar with it.
Because natural language arose spontaneously from human communication, its logic always matches closely the logic by which people perceive things.
Thus, you can come to understand abstract concepts by organizing language, making yourself understand knowledge by talking to yourself.
This means that **language is not only a tool for communication but also a tool for thinking**.
People sometimes call learners like this verbal learners,
but that term tends to evoke a stereotype of someone good at listening, speaking, reading, and writing, good at the humanities, yet lacking in understanding of logic and formulas,
so I do not like the term.

A story from my childhood: in second grade we learned the commutative, associative, and distributive laws of addition and multiplication. The only law I failed to understand was the distributive law of multiplication. One day years later, in a middle-school math class, the teacher, while teaching us to solve equations, said "one $x$ plus two $x$ makes three $x$", and I had an epiphany. From then on I could understand the distributive law of multiplication intuitively.

That was my first taste of language as a tool for thinking.
Of course, I believe most people understood the distributive law easily as children, so this example may not show much.
But there is one very typical thing that everyone learns and many fail to understand at first: the $\varepsilon$-$\delta$ definition of limits.
For many people this is the first thing in mathematics they find hard to understand; coincidentally, it is also something I understood in middle school
(I encountered it in middle school reading Tongji's *Advanced Mathematics*, and later finished both volumes).

In ninth grade I went to an interview at Pinghe (an international high school). The interviewer said: "You say you have learned calculus. Please explain what a limit is." I said: "$f(x)$ can be arbitrarily close to $\lim_{x\to x_0}f(x)$, provided $x$ is sufficiently close to $x_0$."

Although I was not admitted in the end, I am confident I did not stumble on that question: my answer was quite perfect.
It explains the $\varepsilon$-$\delta$ definition of limits in extremely concise language,
and that sentence is exactly what came to me, as a natural-language interpretation, when I first read the $\varepsilon$-$\delta$ definition;
it let me understand the concept of a *limit* at once.

Science learning is full of examples of using natural language to understand concepts. Sometimes the textbook directly tells you the sentence that helps you understand; sometimes it does not. That, actually, is the instructor's job: essentially, an instructor's job is to interpret the content of the textbook for the students in natural language. Interpreting it yourself, independently of the textbook and the instructor, is also very useful.

One more point worth discussing: given that not everyone is a native English speaker while learning resources are mostly in English, should a non-native English speaker think in English, in their mother tongue, or in a mixture of several languages? Any of these works, depending on your English level, the language of your reading material, and your personal preference.

The third way of using intuition is notation and symbols. This is a way somewhere between imagination and natural language.
By notation and symbols I mean those of the professional field, not those of natural language.
It is not a natural language but something like professional jargon; I will tentatively call it *symbolic language*.

Symbolic language differs from natural language in several respects:

- Symbolic language is mostly used for reading and writing, natural language for listening and speaking. In other words, symbolic language is non-oral. This key difference means that you cannot "communicate" with yourself through symbolic language.
- Symbolic language is invented by experts, while natural language is spontaneously produced and shaped by all the people in a society communicating with each other.
- The correspondence between signifier and signified in symbolic language is more fixed than in natural language, so its meaning is clearer.
- Symbolic language is written non-linearly, natural language linearly. Here *linear* means arranging the characters on a single line; those who know line notation in chemistry will know what I mean.

The professionalism, clarity, and non-linearity of symbolic language give it an expressive power that vastly exceeds natural language (within the professional field). With symbolic language, you can express your ideas rather precisely with very little ink. Moreover, compared with imagination, the recordability of symbolic language frees up part of your brain's memory, letting you think about more things at once. However, symbolic language has one great drawback: it is non-oral. This means you cannot read or express it the way you speak, that is, you cannot use it in the most natural human way (if you can speak, writing systems are always far less natural than oral language; for instance, you are surely pronouncing this article silently in your head as you read it).

Using intuition through symbolic language relies on **noticing structural similarities between the symbolic expressions and your past experience**.
That is, through symbolic expressions you can more easily discover that different things can be abstracted into the same thing, and thereby transfer the experience you already have.
For example, seeing $x^2+xy+y^2$, in middle school you might think of the law of cosines,
or that dividing it by $x^2$ produces a quadratic in $y/x$;
in high school you might think of ellipses, trigonometric substitution, or cyclic symmetry;
in university you might think that this quadratic form can be orthogonally diagonalized, or of something else related to your major.
These are the intuitions evoked through the meaning of the symbolic language.
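As a small worked instance of the last association (my own illustration, not from the original post): the quadratic form $x^2+xy+y^2$ corresponds to a symmetric matrix that can be orthogonally diagonalized,

\[x^2+xy+y^2=\begin{pmatrix}x&y\end{pmatrix}\begin{pmatrix}1&\frac12\\\frac12&1\end{pmatrix}\begin{pmatrix}x\\y\end{pmatrix},\]with eigenvalues $\frac32$ and $\frac12$, so in rotated coordinates $u,v$ it becomes $\frac32u^2+\frac12v^2$, which is also why the same expression evokes an ellipse.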

This kind of intuition is rather technique-laden and depends heavily on training.

Speaking of applying techniques by using intuition through symbols and notation, we arrive at what I am worst at.
Or rather, at what everyone is bad at, because mastering techniques requires **massive training**; only the exceptionally gifted can learn techniques without training.
If you can, then you are exceptionally gifted. Congratulations!

Technique is not a way of using intuition but a **result** of it.
It resembles a conditioned reflex: during training, you try to build links between things (objects of study, problem types, forms of symbolic expressions, etc.) and techniques,
attempting, every time a technique could be applied, to always think of it and apply it correctly.
Building this linkage requires training, which is why it is hard.

Fortunately, at the beginner stage, the sciences other than mathematics and computer science involve **very little** technique;
and the techniques used in the beginner stage of mathematics and computer science, while not negligible, are **really not many**.

**Almost all** the techniques you learn at the beginner stage are presented as worked examples in class or in textbooks,
and the rest you need to **discover yourself while doing problems**, as a kind of training in technique.
But remember, all the problems you do at the beginner stage are toy problems, so if a problem requires technique,
it must have been carefully designed to require certain techniques (the problem setter's intent),
and those techniques are almost certainly ones taught in class or in the textbook.
This means that at the beginner stage we can avoid the weakness caused by unpracticed technique,
and put the focus of learning on understanding knowledge rather than training technique.

There are some field-dependent differences here. If you are not majoring in mathematics or computer science, you can completely ignore technique training at the beginner stage. Every area of mathematics values technique to some degree; in computer science, only areas like algorithms and software fundamentals do, and the other areas do not. Still, even if you major in math or CS, exams generally will not torment you too much with technique (students scoring badly does the school no good either).

In short, you need not worry at all that failing to think of a technique will make you unable to keep up with your studies. You only need to be proficient with the small number of techniques already presented in class and in textbooks.

Now I can leave the more "ethereal" aspects (the aspects of learning methods that concern only what goes on inside the head) and talk about something more practical: the external form of learning, namely the opposition between self-study and classroom learning.

First, we must admit one thing: **the efficiency of classroom learning can be very high**.
Though it varies by individual, the ceiling of classroom-learning efficiency is generally more than twice that of self-study.
Unfortunately, the time you spend on classroom learning each week is small
(at our university, a 4-credit course typically has only 2.5 hours of lecture per week),
and there is another very important factor: almost no classroom lets students reach (or even approach) the efficiency ceiling of classroom learning
(which is quite normal: students in the same classroom prefer different classroom styles, and, more importantly, the instructor's ability is limited).
These factors lead to one consequence: **the actual efficiency of classroom learning is lower than that of self-study**.

Of course, this comparison is between pure classroom learning and pure self-study. Sometimes you need to combine the two to reach maximal efficiency. But really, trust yourself: self-study at the beginner stage is very fast, and you can easily outpace the course schedule. As for exams, they will then be a piece of cake for you.

At the beginner stage, reading material mainly means textbooks. If you are self-studying for a course at school, you can simply use the textbook that the course uses.

If you are not studying for a particular course (say, studying ahead, or driven by interest), you face the problem of choosing a textbook. Although reading material is abundant at the beginner stage, the textbooks widely adopted by schools are actually not that many, and you can choose according to your own taste. For beginner-stage textbooks there are always comparisons and reviews online; consult them and pick one. I personally care most about two aspects: first, the breadth of knowledge covered; second, how formal the book is (you may read it as: the more rigorous and mathematical the book, the more formal). Neither is simply the higher the better; in fact, my own preferences have kept changing throughout my learning.

Sometimes a book will be too hard or too easy for you. Generally, textbooks presuppose certain prerequisite knowledge
(for example, basically all physics books presuppose linear algebra and calculus), but **the so-called "prerequisites" are really just a suggestion**;
you need not study strictly in a topological order of the knowledge-dependency graph.
Sometimes a book can be easy even if you skipped the prerequisites, or hard even if you learned them.
In fact, the true criterion for whether a book is suitable for your self-study is this:
**if every piece of knowledge in the book can be understood by you after spending some time, and only after spending some time, then the book's difficulty is right for self-study**.
This criterion does not constrain the difficulty much, because "some time" is a vague notion, to be decided by the learning speed you can accept.

Though I have said a bit about choosing textbooks, you should know that **the choice of textbook is not very important**.
Some people obsess over it, but famous textbooks are never bad, and the difference between a decent one and an excellent one will not have a clearly noticeable effect on your learning.
The key is that you actually read it seriously and finish it.

In classroom learning, note-taking usually means copying the board and recording some of the more important things the instructor says, but self-study notes obviously cannot be a straight transcription of the book (would you not copy yourself to death?). My way of taking notes is to present the key conclusions of the book in the form of mathematical proofs or problem solutions, and besides that to record some facts or corollaries. The main benefit of this is that things are easy to look up when referring back later.

Do not stare at only one book while studying.
If you find some point hard to understand, you can check Wikipedia and Stack Exchange, or consult other books for comparison.
Once you understand, remember to **write your understanding in your notes**; this helps memory, and you can refer back to it when you forget in the future
(also, if someone later asks you how to understand something, you can pull out your notes and say: "I studied this before…").

Textbooks generally have exercises. I do exercises in my notes, usually selectively. If a book's exercises are all of high quality, I do them all; but if there are too many, I pick. The choice of exercises is quite free.

As for the form of notes, I recommend digital handwriting, which is both easy to manage and convenient. Writing notes in $\LaTeX$ is fine too, but I find it a bit time-wasting if you are not that fluent in $\LaTeX$; paper notes are fine as well (I used paper before, because I self-studied during class in middle school, where computers were not allowed in class, but I regretted it), though I find them hard to manage: it is hard to find where you wrote something when referring back, and hauling a pile of notebooks to a new home every time you move is a real pain.

I write articles often. The blog you are reading now consists of articles I wrote. Most of my articles are of two kinds: in one kind, I suddenly understood a piece of knowledge and wrote an article to record it (similar to notes); in the other, I found an idea, investigated it, obtained some results, and wrote an article to record them (this is really an activity beyond the beginner stage, since the beginner stage of the sciences generally does not require you to output knowledge, but you can still do as I do, because it is great training). The former is essentially a learning method similar to the Feynman technique (learning by teaching).

There are many articles introducing the Feynman technique; a quick Google search turns up a pile of them, so I will not go into detail.
The reason the Feynman technique works, in my view, is that it **forces you to express your understanding in natural language, thereby building intuition that you can use through natural language**.
But the Feynman technique has a key drawback: it demands a lot from your social circle.
A friend who will listen with relish while you hold forth on science is rare; rarer still is a romantic partner who will listen with relish while you hold forth on science;
and rarest of all is a romantic partner who not only listens with relish to your science talk but also makes you listen with relish to theirs (how romantic that would be).
So, when your social circle is not good enough, writing articles is an excellent way to apply the Feynman technique.

Writing articles differs from taking notes in that notes are for yourself while articles are for others.
This means that when writing an article, you must **keep the article's readability in mind**, which also forces you to express your understanding in natural language that others can understand.

If you grasp everything I have mentioned above and are willing to spend time practicing it, your knowledge will soon outpace the school's course schedule (because beginner-stage courses in the sciences all move very slowly). Once you reach that level, forgetting about exams becomes an entirely natural thing to do: exams are simply not worth your time to prepare for. Of course, you may still miss some knowledge covered in the course, but you need not worry about that: you will surely encounter it while doing homework (which you really should do; university courses generally will not let you skip homework), and you can pick it up then.

As for life and entertainment: strictly speaking, they are not part of learning methods, so I will only say a little.

Students are constantly fed the view that "a student's job is to study", but in fact studying is a part of life, not the whole of it. A typical student life, besides basic daily needs, includes studying, hobbies, and entertainment (and possibly dating). These latter activities may well overlap, but if unfortunately they do not, you need not demand that studying dominate them absolutely.

Suppose $(\Omega,\sigma(\Omega),P)$ is a probability space. Suppose $W$ is an affine space. For some map $f:\Omega\to W$, we define the $P$-expectation of $f$ as

\[\mathrm E_P\!\left[f\right]:=\int_{x\in\Omega}\left(f(x)-e_0\right)\mathrm dP(x)+e_0,\]where $e_0\in W$ is arbitrary. Here the integral is the Pettis integral. The expectation is defined if the Pettis integral is defined, and it is then well-defined in that it is independent of the $e_0$ we choose.
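To check this well-definedness (a short verification filling in the omitted step): for any $e_0,e_1\in W$, since $P(\Omega)=1$,

\[\left(\int_{x\in\Omega}\left(f(x)-e_0\right)\mathrm dP(x)+e_0\right)-\left(\int_{x\in\Omega}\left(f(x)-e_1\right)\mathrm dP(x)+e_1\right)=\int_{x\in\Omega}\left(e_1-e_0\right)\mathrm dP(x)+\left(e_0-e_1\right)=\left(e_1-e_0\right)P(\Omega)+\left(e_0-e_1\right)=0.\]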

Suppose $X,Y$ are Polish spaces. Suppose $(Y,\sigma(Y),\mu),(X,\sigma(X),\nu)$ are measure spaces, where $\mu$ and $\nu$ are σ-finite Borel measures. Suppose $\pi:Y\to X$ is a measurable map so that

\[\forall A\in\sigma(X):\nu(A)=0\Rightarrow\mu\!\left(\pi^{-1}\!\left(A\right)\right)=0.\]Then, for each $x\in X$, there exists a Borel measure $\mu_x$ on the measurable subspace $\left(\pi^{-1}(x),\sigma\!\left(\pi^{-1}(x)\right)\right)$, such that for any integrable function $f$ on $Y$,

\[\int_{y\in Y}f\!\left(y\right)\mathrm d\mu(y) =\int_{x\in X}\mathrm d\nu(x)\int_{y\in\pi^{-1}(x)}f\!\left(y\right)\mathrm d\mu_x(y).\]*Proof.*
Because $\mu$ is σ-finite,
we have a countable covering of $Y$
by pairwise disjoint measurable sets of finite $\mu$-measure,
denoted as $\left\{Y_i\right\}$.
Each $Y_i$ automatically inherits the σ-algebra from $Y$,
and $\left(Y_i,\sigma\!\left(Y_i\right),\mu\right)$ is a measure space.

Define $\pi_i:Y_i\to X$ as the restriction of $\pi$ to $Y_i$, then $\pi_i$ is automatically a measurable map from $Y_i$ to $X$, and for any $x\in X$,

\[\pi^{-1}(x)=\bigcup_i\pi_i^{-1}(x),\]and the terms in the union are pairwise disjoint.

Let $\nu_i$ be a measure on $X$ defined as

\[\nu_i(A):=\mu\!\left(\pi_i^{-1}\!\left(A\right)\right).\]This is a measure because $\pi_i$ is a measurable map. According to the disintegration theorem, for each $x\in X$, there exists a Borel measure $\mu_{i,x}$ on $Y_i$ such that for $\nu$-almost all $x\in X$, $\mu_{i,x}$ is concentrated on $\pi_i^{-1}(x)$ (in other words, $\mu_{i,x}\!\left(Y\setminus\pi_i^{-1}(x)\right)=0$); and for any integrable function $f$ on $Y_i$,

\[\int_{y\in Y_i}f\!\left(y\right)\mathrm d\mu(y) =\int_{x\in X}\mathrm d\nu_i(x)\int_{y\in\pi_i^{-1}(x)}f\!\left(y\right)\mathrm d\mu_{i,x}(y).\]From the condition in the original proposition, we can easily prove that $\nu_i$ is absolutely continuous w.r.t. $\nu$: if $\nu(A)=0$, then $\mu\!\left(\pi^{-1}\!\left(A\right)\right)=0$, and since $\pi_i^{-1}\!\left(A\right)\subseteq\pi^{-1}\!\left(A\right)$, we get $\nu_i(A)=0$. Therefore, we have their Radon–Nikodym derivative

\[\varphi_i(x):=\frac{\mathrm d\nu_i(x)}{\mathrm d\nu(x)}.\]For each $x\in X$, define the measure $\mu_x$ on $\pi^{-1}(x)$ as

\[\mu_x(A):=\sum_i\varphi_i\!\left(x\right)\mu_{i,x}\!\left(A\cap Y_i\right).\]This is a well-defined measure because the sets $A\cap Y_i$ are pairwise disjoint, and each $\mu_{i,x}$ is a well-defined measure on $Y_i$.

Then, for any integrable function $f$ on $Y$,

\[\begin{align*} \int_{y\in Y}f\!\left(y\right)\mathrm d\mu(y) &=\sum_i\int_{y\in Y_i}f\!\left(y\right)\mathrm d\mu(y)\\ &=\sum_i\int_{x\in X}\mathrm d\nu_i(x)\int_{y\in\pi_i^{-1}(x)}f\!\left(y\right)\mathrm d\mu_{i,x}(y)\\ &=\sum_i\int_{x\in X}\varphi_i\!\left(x\right)\mathrm d\nu(x) \int_{y\in\pi_i^{-1}(x)}f\!\left(y\right)\mathrm d\mu_{i,x}(y)\\ &=\int_{x\in X}\mathrm d\nu(x)\sum_i\int_{y\in\pi_i^{-1}(x)}f\!\left(y\right)\mathrm d\mu_x(y)\\ &=\int_{x\in X}\mathrm d\nu(x)\int_{y\in\pi^{-1}(x)}f\!\left(y\right)\mathrm d\mu_x(y).&\square \end{align*}\]Here, the family of measures $\left\{\mu_x\right\}$ is called
the **disintegration** of $\mu$ w.r.t. $\pi$ and $\nu$.
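A familiar special case (my own illustration): take $Y=\mathbb R^2$, $X=\mathbb R$, $\pi(y_1,y_2)=y_1$, with $\mu$ the Lebesgue measure on the plane and $\nu$ the Lebesgue measure on $\mathbb R$. Then $\mu_x$ is the Lebesgue measure on the fiber $\{x\}\times\mathbb R$, and the disintegration formula

\[\int_{\mathbb R^2}f\,\mathrm d\mu=\int_{\mathbb R}\mathrm d\nu(x)\int_{\{x\}\times\mathbb R}f\,\mathrm d\mu_x\]reduces to Fubini's theorem.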

For two vector spaces $\vec W_1,\vec W_2$, we denote by $\vec W_1\times\vec W_2$ their direct sum. Also, rather than calling the new vector space their direct sum, I prefer to call it the product vector space of them (not to be confused with the tensor product) so that it is consistent with the notion of product affine spaces, product measure spaces, product topology, etc. Those product spaces are all denoted by “$\times$” in this article.

Also, “$\vec W_1$” can be an abbreviation of $\vec W_1\times\left\{0_2\right\}$, where $0_2$ is the zero vector in $\vec W_2$.

Suppose $W$ is an affine space associated with the vector space $\vec W$. For any $A\subseteq W$ and $B\subseteq\vec W$, we denote $A+B$ as the Minkowski sum of $A$ and $B$, i.e.,

\[A+B:=\left\{a+b\,\middle|\,a\in A,\,b\in B\right\}.\]This extends the definition of usual Minkowski sums for affine spaces.

By the way, because “$\vec W_1$” abbreviates $\vec W_1\times\left\{0_2\right\}$ as above, we can abuse the notation and write

\[\vec W_1+\vec W_2=\vec W_1\times\vec W_2,\]where “$+$” denotes the Minkowski sum. This is true for any two vector spaces $\vec W_1,\vec W_2$ that do not share a non-trivial vector subspace.

In general, it is not necessarily possible to decompose a topology as a product of two topologies. However, it is always possible for locally convex Hausdorff TVSs. We can always decompose the topology of a locally convex Hausdorff TVS as the product of the topologies on a pair of its complementary vector subspaces, one of which is finite-dimensional. This is true because every finite-dimensional subspace in such a space is topologically complemented. The complete statement is the following:

Let $\vec W$ be a locally convex Hausdorff TVS. For any finite-dimensional subspace $\vec W^\parallel$ of $\vec W$, there is a complement $\vec W^\perp$ of it such that the topology $\tau\!\left(\vec W\right)$ is the product topology of $\tau\!\left(\vec W^\parallel\right)$ and $\tau\!\left(\vec W^\perp\right)$.

This decomposition is also valid for affine spaces. If an affine space $W$ is associated with a locally convex Hausdorff TVS $\vec W$, then for any finite-dimensional vector subspace $\vec W^\parallel$ of $\vec W$, we can topologically decompose $W$ into $W^\perp+\vec W^\parallel$.

Because the product topology of subspace topologies is the same as the subspace topology of the product topology, we can also decompose $E^\perp+\vec W^\parallel$ as the product topological space of $E^\perp$ and $\vec W^\parallel$ if $E^\perp\subseteq W^\perp$.

Such decompositions are useful because they allow us to disintegrate Borel measures. If we already have a σ-finite Borel measure on $E^\perp+\vec W^\parallel$ and we can define a σ-finite Borel measure on $\vec W^\parallel$, then we can define a measure on $E^\perp$ by disintegration, and the disintegration is guaranteed to also be σ-finite and Borel.

When I want to use multi-index notations, I will use “$\bullet$” to denote the indices. For example,

\[\Sigma\alpha_\bullet:=\sum_\bullet\alpha_\bullet.\] \[\alpha_\bullet\beta_\bullet:=\sum_\bullet\alpha_\bullet\beta_\bullet.\] \[\alpha_\bullet^{\beta_\bullet}:=\prod_\bullet\alpha_\bullet^{\beta_\bullet}.\] \[\alpha_\bullet!:=\prod_\bullet\alpha_\bullet!.\]First, I need to point out that the most central state function of a thermal system is not its energy, but its entropy. The energy is regarded as the central state function in thermodynamics, which can be seen from the fundamental equation of thermodynamics

\[\mathrm dU=-p\,\mathrm dV+T\,\mathrm dS+\mu\,\mathrm dN.\]We also always do the Legendre transformations on the potential function $U$ to get other potential functions instead of doing the transformation on other extensive quantities. All such practices make us think that $S$ is just some quantity that is similar to $V$ and $N$, and mathematically we can just regard it as an extensive quantity whose changing is a way of doing work.

However, this is not the case. The entropy $S$ is different from $U,V,N$ in the following sense:

- The entropy is a derived quantity due to a mathematical construction from the second law of thermodynamics, while $U,V,N$ are observable quantities that have solid physical meanings before we introduce anything about thermodynamics.
- The entropy may change in an isolated system, while $U,V,N$ do not.
- We may have an intuitive understanding of how different systems in contact may exchange $U,V,N$ with each other, but $S$ cannot be “exchanged” in such a sense.
- In statistical mechanics, $U,V,N$ restrict what microstates are possible for a thermal system, but $S$ serves as a totally different role: it represents something about the probability distribution over all the possible microstates.

Therefore, I would rather rewrite the fundamental equation of thermodynamics as

\[\begin{equation} \label{eq: fundamental} \mathrm dS=\frac1T\,\mathrm dU+\frac pT\,\mathrm dV-\frac\mu T\,\mathrm dN. \end{equation}\]Equation \ref{eq: fundamental} embodies how different quantities serve different roles more clearly, but it becomes vague in its own physical meaning. Does it mean different ways of changing the entropy in quasi-static processes? Both mathematically and physically, yes, but it is not a useful interpretation. Because what we are doing is mathematical formulation of physical theories, we do not need to try to assign physical meanings to anything we construct. This new equation is purely mathematical, and the only way we use it is to relate intensive variables to derivatives of $S$ w.r.t. extensive quantities.
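Concretely, the only use we make of Equation \ref{eq: fundamental} is to read off the intensive variables as partial derivatives of $S$:

\[\frac1T=\left(\frac{\partial S}{\partial U}\right)_{V,N},\qquad\frac pT=\left(\frac{\partial S}{\partial V}\right)_{U,N},\qquad\frac\mu T=-\left(\frac{\partial S}{\partial N}\right)_{U,V}.\]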

From now on, I will call quantities like $U,V,N$ the **extensive quantities**,
not including $S$.
However, this is not a good statement as part of our mathematical formulation.
Considering that there is a good notion of how different systems
may exchange values of extensive quantities
and that we can scale a system by multiplying the extensive quantities by a factor,
we require that the extensive quantities must support at least linear operations… do we?

Well, actually we will see that if we require the space to be a vector space, things would be a little bit complex, because sometimes we need to construct a new space of extensive quantities out of an affine subspace of an existing one, which is not a vector space by nature. If we require the space to be a vector space, we need to translate that affine subspace to make it pass through the zero element of the vector space, which is possible but gives no insight about the physics and only adds complication to our construction. Therefore, I will not require the space of extensive quantities to be a vector space, but an affine space.

You may ask, OK then, but how do we “add” or “scale” extensive quantities
if they live on an affine space?
First, regarding the addition operation, we will use an abstraction for such operations
so that the actual implementation of how we combine the summands is hidden under this abstraction.
We will see that this abstraction is useful because it also applies to other scenarios or useful operations
that do not necessarily involve any meaningful addition.
Regarding the scaling operation, I would argue that we do not need it now.
I have generalized the notion of extensive quantities so that it now
includes some quantities that are not really extensive quantities in any traditional sense.
They are no longer meant to be scaled because they simply cannot be.
Actually, rather than calling them extensive quantities, I would like to call them
a **macrostate**, the only difference from the general notion of a macrostate being that
it has an affine structure so that I can take the ensemble average of it to get its macroscopic value.
I will stick to the term “extensive quantities” because they are actual extensive quantities in all my examples
and because the name is a good way to understand the physical meaning,
but you need to keep in mind that what I actually refer to is a macrostate.

There is another difficulty. If we look closely, Equation \ref{eq: fundamental} actually does not make much sense in that $N$ is quantized (and so is $U$ if we are doing quantum mechanics). If we work over the real numbers, we can always translate a quantized quantity into a value that is not allowed, which means that we cannot have the full set of operations on the allowed values of the extensive quantities. Therefore, we need to specify a subset of the affine space to represent the allowed values of the extensive quantities.

We also see that Equation \ref{eq: fundamental} is a relation between differentials. Do we need to require a differential structure on the space of extensive quantities? Not yet, because that is actually somewhat difficult. The same difficulty about the quantized quantities applies. The clever way is to just avoid using the differentials. (Mathematicians are always skeptical about differentiating something, while physicists just assume everything is differentiable…) It may seem surprising, but differentials are actually avoidable in our mathematical formulation if you do not require intensive variables to be well-defined inside the system itself (actually, they are indeed not well-defined except when you have a system in thermal equilibrium and take the thermodynamic limit).

If we have to use differentials, we can use the Gateaux derivative. It is general enough to be defined on any locally convex TVS, and it is intuitive when it is linear and continuous.

Although differential structure is not necessary, there is an inevitable structure on the space of extensive quantities. Remember that in canonical and grand canonical ensembles, we allow $U$ or $N$ to fluctuate, so we should be able to describe such fluctuations on our space of extensive quantities. To do this, I think it is safe to assume that we can have some topology on the allowed subset to make it a Polish space, just like how probabilists often assume about the probability space they are working on.

A final point. Here is a difference in how physicists and mathematicians describe probability distributions: physicists would use a probability density function while mathematicians would use a probability measure. Mathematically, to have a probability density function, we need an underlying measure on our space for a notion of “volume” on the space, and then we can define the probability density function as the Radon–Nikodym derivative of the probability measure w.r.t. the underlying volume measure. Also, for the Radon–Nikodym derivative to exist, the probability measure must be absolutely continuous w.r.t. the volume measure, which means that we have to sacrifice all the probability distributions that are not absolutely continuous if we take the probability density function approach. It then seems that the probability density function approach introduces an excess measure structure on the space of extensive quantities and loses some possibilities and generality, but it will turn out that the extra structure is useful. Therefore, I will use the probability density function approach.

Here is our final definition of the space of extensive quantities:

*Definition.*
A **space of extensive quantities** is a tuple $(W,E,\lambda)$, where

- $W$ is an affine space associated with a reflexive vector space $\vec W$ over $\mathbb R$, and it is equipped with topology $\tau(W)$ that is naturally constructed from the topology $\tau\!\left(\vec W\right)$ on $\vec W$;
- $E\subseteq W$ is a topological subspace of $W$, and its topology $\tau(E)$ makes $E$ a Polish space; and
- $\lambda:\sigma(E)\to[0,+\infty]$ is a non-trivial σ-finite Borel measure, where $\sigma(E)\supseteq\mathfrak B(E)$ is a σ-algebra on $E$ that contains the Borel σ-algebra on $E$.

Here, I also added a requirement of σ-finiteness. This is necessary when constructing product measures. At first I also wanted to require that $\lambda$ has some translational invariance, but I then realized that it is not necessary, so I removed it from the definition (but we will see that we need them as a property of baths).

*Example.*
Here is an example of a space of extensive quantities.

Physically we may think of this as the extensive quantities of the system of ideal gas. The three dimensions of $W$ are energy, volume, and number of particles.

*Example.*
Here is another example of a space of extensive quantities.

Physically we may think of this as the extensive quantities of the system of Einstein solid with $\hbar\omega=1$. The two dimensions of $W$ are energy and number of particles.
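As a concrete count (my own addition, assuming the standard Einstein-solid counting; the displayed definition of this example did not survive into this text): with $\hbar\omega=1$, the number of microstates at $e=(U,N)$, i.e., $U$ indistinguishable energy quanta distributed among $N$ distinguishable oscillators, is given by stars and bars as

\[\Omega(U,N)=\binom{U+N-1}{U}.\]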

Remember I said above that, in statistical mechanics, $U,V,N$ restrict what microstates are possible for a thermal system. We can translate this as follows: for each possible value of the extensive quantities, denoted $e\in E$, there is a set of possible microstates, denoted $M_e$ (you can then see why we excluded the entropy from the extensive quantities: otherwise we could not classify microstates in this way).

Now the problem is what structures should we add to $M_e$ for each $e\in E$. Recall that in statistical mechanics, we study probability distribution over all possible microstates. Therefore, we need to be able to have a probability measure on $M_e$. In other words, $M_e$ should be a measurable space. As said before, we can either use a probability measure directly, or use a volume measure together with a probability density function. This time, we seem to have no choice but the probability density function approach because there is a natural notion of volume on $M_e$: the number of microstates.

Wait! There is a problem.
Recall that in microcanonical ensemble,
we allow the energy to fluctuate.
The number of microstates at exactly a certain energy is actually zero in most cases,
so we are actually considering those microstates with some certain small range of energy.
In other words, we are considering the **microstate density**:
the number of microstates inside unit range of energy.
Similarly, we should define a measure on $M_e$ to represent the microstate density,
which is the number of microstates inside unit volume of extensive quantities,
where the “volume” is measured by the measure $\lambda$ in the space of the extensive quantities.

This makes our formulation a little bit different from the microcanonical ensemble: our formulation would allow all extensive quantities to fluctuate while the microcanonical ensemble would only allow the energy to fluctuate. This is inevitable because we are treating extensive quantities like energy, volume, and number of particles as the same kind of quantity. It is not preferable to separate a subspace out from our affine space $W$ to say “these are the quantities that may fluctuate, and those are not.” Therefore, we need to justify why we may allow all extensive quantities to fluctuate. The justification is: mathematically, we are actually not allowing any extensive quantities to fluctuate. There is no actual fluctuation, and we are directly considering the microstate density without involving any change in the extensive quantities. In other words, using the language of microcanonical ensemble, we are considering the area of the surface of the energy shell instead of the volume of the energy shell with a small thickness.

Another important point is that we must make sure that specifying all the extensive quantities is enough to restrict the system to a finite number of microstates. In other words, the total microstate density should be finite for any possible $e\in E$. Also, there should be at least some possible microstates in $M_e$, so the total microstate density should not be zero.

We may then sum up the above discussion to give $M_e$ enough structure to make it the set of microstates of a thermal system with the given extensive quantities $e$. Then, the disjoint union of all of them (a family of measure spaces) is the thermal system.

*Definition.*
A **thermal system** is a pair $\left(\mathcal E,\mathcal M\right)$,
where

- $\mathcal E:=\left(W,E,\lambda\right)$ is a space of extensive quantities;
- $\mathcal M:=\bigsqcup_{e\in E}M_e$ is a family of measure spaces; and
- For each $e\in E$, $M_e$ is a measure space equipped with a measure $\mu_e$ such that $\mu_e\!\left(M_e\right)$ is finite and nonzero.

From now on, I will use a pair $(e,m)\in\mathcal M$ to specify a single microstate, where $e\in E$ and $m\in M_e$.

*Example.*
For the thermal system of a solid consisting of spin-$\frac12$ particles,
where each particle has two possible states with energy $0$ and $1$,
we can construct

This should be the simplest example of a thermal system.
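
Since the construction block above is only sketched, here is a minimal numerical sanity check in Python, assuming (consistently with the contraction example later, where $M_N=\left\{0,1\right\}^N$) that the microstates at extensive quantities $(U,N)$ are the configurations in $\{0,1\}^N$ with total energy $U$, under the counting measure:

```python
from itertools import product
from math import comb

def omega(U, N):
    """Count the microstates of the spin-1/2 solid with N particles and total
    energy U by brute-force enumeration of all configurations in {0,1}^N."""
    return sum(1 for config in product((0, 1), repeat=N) if sum(config) == U)

# The brute-force count agrees with the binomial coefficient C(N, U).
for N in range(1, 8):
    for U in range(N + 1):
        assert omega(U, N) == comb(N, U)

print(omega(3, 6))  # 20 microstates with U = 3, N = 6
```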

*Example.*
We may complete the example of the system of ideal gas.
Suppose we are considering the system of ideal atomic gas inside a cubic box.
The construction of the space of extensive quantities is the same as before.
Denote possible values of extensive quantities in coordinates $e=(U,V,N)$.
Now the measure spaces $M_e$ may be constructed as such:

The “lexicographic order” here means that only those configurations where the particle indices coincide with the lexicographic order are included in $M_e$. This is because the particles are indistinguishable, and the order of particles is irrelevant. The lexicographic order restriction is the same as using the quotient of the $N$-fold Cartesian product by permutation actions, but then defining $\mu_e$ would be difficult. Alternatively, we may keep the particles ordered but divide the result by $N!$ in the definition of $\mu_e$; however, this way is less clear in its physical meaning.

Here $H^d$ is the $d$-dimensional Hausdorff measure. Intuitively, the expression $H^{6N-1}(A)$ is just the $(6N-1)$-dimensional “volume” of $A$.

Since we have a microstate density, why do we not have a true **number of microstates**?
We can define a measure on $\mathcal M$ to represent the number of microstates.

*Definition.*
The **measure of number of microstates** is a measure $\mu:\sigma(\mathcal M)\to\left[0,+\infty\right]$,
where

and the measure is defined by

\[\mu(A):=\iint\limits_{(e,m)\in A}\mathrm d\mu_e(m)\,\mathrm d\lambda(e).\]The uniqueness of $\mu$ is guaranteed by the σ-finiteness of $\lambda$ and $\mu_e$.
The expression $\mu(A)$ is called the **number of microstates** in $A$.

Here is a central idea in statistical ensembles:
a **state** is a probability distribution on the microstates of a thermal system.
It is among the ideas upon which the whole theory of statistical ensembles is built.
I will take this idea, too.

As said before, I have taken the probability density approach of defining a probability distribution. Therefore, a state is just a probability density function.

*Definition.*
A **state** of a thermal system $(\mathcal E,\mathcal M)$ is a function
$p:\mathcal M\to\left[0,+\infty\right]$ such that $(\mathcal M,\sigma(\mathcal M),P)$ is a probability space,
where $P:\sigma\!\left(\mathcal M\right)\to\left[0,1\right]$ is defined by

\[\begin{equation} \label{eq: probability measure} P(A):=\int_{(e,m)\in A}p\!\left(e,m\right)\mathrm d\mu\!\left(e,m\right). \end{equation}\]

Two states are the same if they are equal $\mu$-almost everywhere.

A probability space is just a measure space with a normalized measure, and here the physical meaning of $p$ is the probability density on $\mathcal M$, and $P(A)$ is the probability of finding a microstate in $A$.

Note that a state is not necessarily an equilibrium state (thermal state). We will introduce the concept of equilibrium states later.

Now we may introduce the concept of **entropy**.

I need to clarify that the entropy that we are talking about here is just the entropy in statistical mechanics. The reason I add this clarification is that we may also formally define an entropy in the language of measure theory, which is defined for any probability space and does not depend on any so-called probability density function or a “volume” measure (which is the number of microstates in our case). The definition of this entropy is (if anyone is interested)

\[S^{\mathrm{info}}:=\sup_\Pi\sum_{A\in\Pi}-P(A)\ln P(A),\]where $P$ is the probability measure on the probability space, and the supremum is taken over all $P$-almost partitions $\Pi$ of the probability space ($\Pi$ is a subset of the σ-algebra such that $P(\bigcup_{A\in\Pi}A)=1$ and $P(A\cap B)=0$ for distinct $A,B\in\Pi$). This definition looks intuitive and nice, and not surprisingly it is… not consistent with the entropy in statistical mechanics. The discrepancy happens in classical statistical mechanics because the entropy defined above diverges to infinity for “continuous” probability distributions. A quick check is that the entropy of the uniform distribution over $[0,1]$ is $+\infty$.
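
To make the divergence concrete: under the uniform distribution on $[0,1]$, the partition into $n$ equal bins already has entropy $\ln n$, so refining the partition drives the supremum to $+\infty$. A minimal numerical sketch:

```python
from math import log

def partition_entropy(n):
    """Entropy -sum P(A) ln P(A) of the partition of [0,1] into n equal bins,
    under the uniform distribution: each bin has probability 1/n."""
    p = 1.0 / n
    return -sum(p * log(p) for _ in range(n))

# Refining the partition makes the entropy grow like ln n, without bound,
# so the supremum over all partitions is +infinity.
for n in (10, 100, 1000):
    print(n, partition_entropy(n))
```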

*Definition.*
The **entropy** of a state $p$ is defined by

\[S\!\left[p\right]:=-\int_{(e,m)\in\mathcal M}p\!\left(e,m\right)\ln p\!\left(e,m\right)\,\mathrm d\mu\!\left(e,m\right).\]

Different from extensive quantities, the entropy is a functional of $p$. The entropy here is consistent with the entropy in thermodynamics or statistical mechanics.

This definition of entropy is called the Gibbs entropy formula. It agrees with the entropy defined in thermodynamics, but we are unable to show that at this stage because we have not defined temperature or heat yet.

Note that the base of the logarithm is not important, and it is just a matter of unit system. In SI units, the base would be $\exp k_\mathrm B^{-1}$, where $k_\mathrm B$ is the Boltzmann constant.

Physically, the extensive quantities may be measured macroscopically. The actual values that we get when we measure them are postulated to be the ensemble average. Therefore, for a given state $p$, we can define the measured values of extensive quantities by taking the $P$-expectation of the extensive quantities.

*Definition.*
For a thermal system $(\mathcal E,\mathcal M)$
and a state $p$ of it, the **measured value of extensive quantities** of the state $p$ is
the $P$-expectation of the $E$-valued random variable $(e,m)\mapsto e$.
Explicitly, the definition is

\[\varepsilon\!\left[p\right]:=e_0+\int_{(e,m)\in\mathcal M}\left(e-e_0\right)\mathrm dP\!\left(e,m\right),\]where $e_0\in W$ is an arbitrarily chosen point (the result does not depend on the choice), and the probability measure $P$ on $\mathcal M$ is defined in Equation \ref{eq: probability measure}.

The definition involves taking the $P$-expectation of a $W$-valued function, which is a Pettis integral, and I claim that this integral exists. It exists because the map $(e,m)\mapsto e-e_0$ must be weakly $P$-measurable, and such a function must be Pettis-integrable on a reflexive space.

Note that $\varepsilon[p]\in W$, and it is not necessarily in $E$.

The usage of the measured value of extensive quantities is that we can use it to get the
**fundamental equation** of a thermal system,
which describes the relationship between the extensive quantities and the entropy
at any equilibrium state.
Suppose that we postulate a family of states $p_t^\circ$ of the thermal system
(or its slices, which will be introduced below),
labeled by different $t$’s, and call them the possible equilibrium states.
Then, we can have the following two equations:

\[\varepsilon^\circ=\varepsilon\!\left[p_t^\circ\right],\qquad S^\circ=S\!\left[p_t^\circ\right].\]

By cancelling out the $t$ in the two equations (which may be impossible in general, but we assume it is possible), we can get the fundamental equation in this form:

\[\begin{equation} \label{eq: fundamental equation} S^\circ=S^\circ\!\left(\varepsilon^\circ\right). \end{equation}\]Then, here we get the function $S^\circ:E^\circ\to\mathbb R$, where $E^\circ$ is a subset of $W$ consisting of all possible measured values of extensive quantities among equilibrium states. If we can possibly define some differential structure on $E^\circ$ so that we can possibly take the differential of $S^\circ$ and write something sensible like

\[\mathrm dS^\circ=i^\circ\!\left(\varepsilon^\circ\right)(\mathrm d\varepsilon^\circ),\]where $i^\circ\!\left(\varepsilon^\circ\right)\in\vec W'$ is a continuous linear functional,
then we can define $i^\circ\!\left(\varepsilon^\circ\right)$ to be the **intensive quantities**
at $\varepsilon^\circ$.
A proper comparison with differential geometry is that we may analogously call $i^\circ$
a covector field on $E^\circ$ defined as the differential of the scalar field $S^\circ$.

However, as I have said before, I did not postulate there to be any differential structure on $E^\circ$, so the intensive quantities should not be generally defined in this way.

A good thing about thermal systems is that we can get new thermal systems from existing ones (although they are physically essentially the same system, they have different mathematical structures and contain different amounts of information about it). There are two ways of constructing new thermal systems from existing ones:

- By fixing some extensive quantities. I call this way **slicing**.
- By allowing some extensive quantities to change freely. I call this way **contracting**.

I chose the words “slicing” and “contracting”. They are not present in actual physics textbooks, but I found the notion of them necessary.

Slicing fixes extensive quantities. The way we do it is to pick out a subset of $E$ and make it our new set of accessible values of extensive quantities. I find one special way of picking out such a subset especially useful: picking it from an affine subspace of $W$. In this way, we can use a smaller affine space as the underlying space of our new thermal system. Then we see why I chose the word “slicing”: we are slicing the original affine space into parallel pieces, picking one piece as our new affine space, and picking the corresponding accessible values of extensive quantities and possible microstates within that piece to form our new thermal system.

*Definition.*
A **slicing** of a space of extensive quantities $\left(W,E,\lambda\right)$
is a pair $\left(W^\parallel,\lambda^\parallel\right)$, where

- $W^\parallel\subseteq W$ is an affine subspace of $W$;
- $E^\parallel:=E\cap W^\parallel$ is non-empty, and it is Polish as a topological subspace of $E$; and
- $\lambda^\parallel:\sigma\!\left(E^\parallel\right)\to\left[0,+\infty\right]$ is a non-trivial σ-finite Borel measure on $E^\parallel$, where $\sigma\!\left(E^\parallel\right)\supseteq\mathfrak B\!\left(E^\parallel\right)$ is a σ-algebra on $E^\parallel$ that contains the Borel σ-algebra on $E^\parallel$.

This constructs a new space of extensive quantities $\left(W^\parallel,E^\parallel,\lambda^\parallel\right)$,
called a **slice** of the original space of extensive quantities $\left(W,E,\lambda\right)$.

*Definition.*
A **slice** of a thermal system $\left(\mathcal E,\mathcal M\right)$
defined by the slicing $\left(W^\parallel,\lambda^\parallel\right)$ of $\mathcal E$
is a new thermal system $\left(\mathcal E^\parallel,\mathcal M^\parallel\right)$ constructed as such:

- $\mathcal E^\parallel:=\left(W^\parallel,E^\parallel,\lambda^\parallel\right)$ is the slice of $\mathcal E$ corresponding to the given slicing; and
- $\mathcal M^\parallel:=\bigsqcup_{e\in E^\parallel}M_e$.

The idea behind slicing is to make some extensive quantities become extrinsic parameters and not part of the system itself. It would physically mean fixing some extensive quantities. However, here is a problem: if we fix some extensive quantities, the dimension (“dimension” as in “dimensional analysis”) of the volume element in the space of extensive quantities changes. In other words, the dimension of $\lambda$ does not agree with that of $\lambda^\parallel$. This is physically undesirable because we want to keep the number of microstates dimensionless so that its logarithm does not depend on the units we use. However, this is not a real problem, for the following reason: in any physical construction of a thermal system, it is fine to have a non-dimensionless number of microstates, at the cost that the model is not valid at low temperature; in a mathematical construction, dimension is never a concern, so we do not even need to worry about it. At low temperature, we must use quantum statistical mechanics, where all quantities are quantized so that the number of microstates is literally a count of microstates, which must be dimensionless. At high temperature, we do not need the third law of thermodynamics, which is the only law that restricts how we should choose the zero (ground level) of the entropy, and in this case we may freely change our units because doing so only affects the entropy by an additive constant.

*Example.*
In the example of a system of ideal gas,
we may slice the space of extensive quantities to the slice $V=1$ to fix the volume.

Here is a special type of slicing.
Because a single point is a (zero-dimensional) affine subspace, it may form a slicing.
Such a slicing fixes all of the extensive quantities.
We may call it an **isolating**.

A thermal system with a zero-dimensional space of extensive quantities is called an **isolated system**.
The physical meaning of such a system is that it is isolated from the outside
so that it cannot exchange any extensive quantities with the outside.
We may construct an isolated system out of an existing thermal system by the process of isolating.

*Definition.*
An **isolating** (at $e^\circ$) of a space of extensive quantities $\left(W,E,\lambda\right)$
is a slicing $\left(W^\parallel,\lambda^\parallel\right)$ of it, constructed as

\[W^\parallel:=\left\{e^\circ\right\},\quad\lambda^\parallel:=\text{the counting measure on }W^\parallel,\]
where $e^\circ\in E$.

*Definition.*
An **isolated system** is a thermal system whose underlying affine space of its space of extensive quantities
is a single-element set.

*Definition.*
An **isolation** (at $e^\circ$) of a thermal system $\left(\mathcal E,\mathcal M\right)$
is the slice of it corresponding to the isolating at $e^\circ$ of $\mathcal E$.

Here is an obvious property of isolated systems: the measured value of extensive quantities of any state of an isolated system is $e^\circ$, the only possible value of the extensive quantities.

After introducing isolated systems,
we can now introduce the **equal a priori probability postulate**.
Although we may alternatively use other sets of axioms to develop the theory of statistical ensembles,
using the equal a priori probability postulate is a simple and traditional way to do it.
Most importantly, this is a way that does not require us to define concepts like the temperature
beforehand, which is a good thing for a mathematical formulation because it would require
less mathematical structures or objects that are hard to well define at this stage.

*Axiom* (the **equal a priori probability postulate**).
The equilibrium state of an isolated system is the uniform distribution.

Actually, instead of saying that this is an axiom, we may say that formally this is a definition of equilibrium states. However, I still prefer to call it an axiom because it only defines the equilibrium state of isolated systems rather than any thermal systems.

The equilibrium state of an isolated system $\left(\mathcal E,\mathcal M\right)$ may be written mathematically as

\[p^\circ\!\left(\cdot\right):=\frac1{\mu\!\left(\mathcal M\right)}.\](The circle in the superscript denotes equilibrium state.)
After writing this out, we have successfully derived the **microcanonical ensemble**.
We can then calculate the entropy of the state, which is

\[\begin{equation} \label{eq: microcanonical entropy} S^\circ=\ln\mu\!\left(\mathcal M\right). \end{equation}\]

Mentioning the entropy, a notable feature about the equilibrium state of an isolated system is that it is the state of the system that has the maximum entropy, and any state different from it has a lower entropy.

*Theorem.*
For an isolated system, for any state $p$ of it,

\[S\!\left[p\right]\le S^\circ,\]where $S^\circ$ is the entropy of the equilibrium state of it. The equality holds iff $p$ is the same state as the equilibrium state.

*Proof.*
Define a probability measure $P^\circ$ on $\mathcal M$ by

\[P^\circ\!\left(A\right):=\frac{\mu\!\left(A\right)}{\mu\!\left(\mathcal M\right)};\]

then $\left(\mathcal M,\sigma\!\left(\mathcal M\right),P^\circ\right)$ is a probability space. Any state $p$, as a function on $\mathcal M$, can be regarded as a random variable in the probability space $\left(\mathcal M,\sigma\!\left(\mathcal M\right),P^\circ\right)$.

Define the real function

\[\varphi(x):=\begin{cases} x\ln x,&x\in\left(0,+\infty\right),\\ 0,&x=0. \end{cases}\]It is a convex function, so according to the probabilistic form of Jensen’s inequality,

\[\varphi\!\left(\mathrm E_{P^\circ}\!\left[p\right]\right) \le\mathrm E_{P^\circ}\!\left[\varphi\circ p\right].\]In other words,

\[\frac1{\mu(\mathcal M)}\ln\frac1{\mu(\mathcal M)} \le\int_{m\in\mathcal M}p\!\left(m\right)\ln p\!\left(m\right) \,\frac{\mathrm d\mu\!\left(m\right)}{\mu(\mathcal M)}.\]Then, it follows immediately that $S[p]\le S^\circ$. The equality holds iff $\varphi$ is linear on a convex set $A\subseteq\left[0,+\infty\right)$ such that the value of the random variable $p$ is $P^\circ$-almost surely in $A$. However, because $\varphi$ is non-linear on any convex set with more than one point, the only possibility is that the value of $p$ is $P^\circ$-almost surely a constant, which means that the probability distribution defined by the probability density function $p$ is equal to the uniform distribution $\mu$-almost everywhere. Therefore, the equality holds iff $p$ is the same state as the equilibrium state. $\square$
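
A finite toy version of this theorem can be checked numerically: for an isolated system with $M$ microstates under the counting measure, no state has entropy exceeding $\ln M$, the entropy of the uniform state. A minimal sketch:

```python
import random
from math import log

def entropy(p):
    """Gibbs entropy S[p] = -sum p ln p of a discrete state p
    (with the counting measure as the measure of number of microstates)."""
    return -sum(x * log(x) for x in p if x > 0)

M = 8                       # number of microstates of the isolated system
S_equilibrium = log(M)      # entropy of the uniform (equilibrium) state

random.seed(0)
for _ in range(1000):
    w = [random.random() for _ in range(M)]
    p = [x / sum(w) for x in w]     # a random state (normalized density)
    assert entropy(p) <= S_equilibrium + 1e-12

print(round(entropy([1.0 / M] * M), 12))  # entropy of the uniform state
```

The strict convexity of $x\ln x$ is what forces equality only at the uniform density, exactly as in the Jensen argument above.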

This theorem is the well-known relation between the entropy and the equilibrium state.

By Equation \ref{eq: microcanonical entropy}, we can now derive the relationship between the entropy and the extensive quantities at equilibrium states by the process of isolating. Define a family of states $\left\{p^\circ_e\right\}_{e\in E}$, where each state $p^\circ_e$ is the equilibrium state of the system isolated at $e$. Then, we have the fundamental equation

\[\begin{equation} \label{eq: mce fundamental eq} S^\circ(e)=\ln\Omega(e), \end{equation}\]where $\Omega(e):=\mu_e\!\left(M_e\right)$ is called the **counting function** (I invented the phrase),
which is the **microscopic characteristic function** of microcanonical ensembles.
This defines a function $S^\circ:E\to\mathbb R$,
which may be used to give a fundamental equation in the form of Equation \ref{eq: fundamental equation},
and it is the **macroscopic characteristic function** of microcanonical ensembles.
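
For the spin-$\frac12$ solid (assuming, as in the earlier example, $\binom NU$ microstates at energy $U$ with $N$ particles), the counting function is $\Omega(U,N)=\binom NU$, so $S^\circ(U,N)=\ln\binom NU$. A quick numerical check that the leading-order Stirling approximation of this fundamental equation becomes accurate in the thermodynamic limit:

```python
from math import comb, log

def S_exact(U, N):
    """Exact microcanonical entropy ln C(N, U) of the spin-1/2 solid."""
    return log(comb(N, U))

def S_stirling(U, N):
    """Leading-order Stirling approximation of ln C(N, U)."""
    return N * log(N) - U * log(U) - (N - U) * log(N - U)

# The relative error of the approximation shrinks as N grows.
for N in (10, 100, 1000):
    U = N // 2
    print(N, round(S_exact(U, N), 3), round(S_stirling(U, N), 3))
```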

We will encounter microscopic or macroscopic characteristic functions for other ensembles later.

*Example.*
In the example of a system of a tank of ideal atomic gas, we have the fundamental equation

where $S_n(r)$ is the surface area of an $n$-sphere with radius $r$, which is proportional to $r^n$. Taking its derivative w.r.t. $U,V,N$ and taking the thermodynamic limit will recover familiar results.
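
As a numerical illustration (not the author's derivation: I assume a simplified form $S^\circ(U)=N\ln V+\ln S_{3N-1}\!\left(\sqrt{2mU}\right)$ at fixed $V,N$, dropping additive constants, with made-up parameter values), differentiating w.r.t. $U$ recovers $1/T=\partial S^\circ/\partial U=(3N-1)/(2U)$, i.e. $U=\frac{3N-1}2T\approx\frac32NT$ in the thermodynamic limit:

```python
from math import log, pi, lgamma

def log_sphere_area(n, r):
    """ln of the surface area of an n-sphere of radius r:
    S_n(r) = 2 * pi**((n+1)/2) * r**n / Gamma((n+1)/2), proportional to r**n."""
    return log(2.0) + (n + 1) / 2 * log(pi) + n * log(r) - lgamma((n + 1) / 2)

def S(U, N=50, V=2.0, m=1.0):
    """Hypothetical entropy at fixed V and N, dropping additive constants:
    S(U) = N ln V + ln S_{3N-1}(sqrt(2 m U))."""
    return N * log(V) + log_sphere_area(3 * N - 1, (2.0 * m * U) ** 0.5)

# Numerical derivative 1/T = dS/dU gives U / T = (3N - 1) / 2,
# which approaches the familiar U = (3/2) N T in the thermodynamic limit.
N, U, h = 50, 10.0, 1e-6
inv_T = (S(U + h, N) - S(U - h, N)) / (2 * h)
print(U * inv_T, (3 * N - 1) / 2)  # both approximately 74.5
```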

I have previously mentioned that the other way of deriving a new system out of an existing one is called contracting. Now we should introduce this concept because it is very useful later when we need to define the contact between subsystems of a composite system (whose definition will be given later).

The idea behind contracting is also to reduce the dimension of the space of extensive quantities. However, rather than making some of the extensive quantities extrinsic parameters, it makes them “intrinsic” within the space of microstates. A vivid analogy is this: imagine a thermal system as many boxes of microstates with each box labeled by specific values of extensive quantities; we then partition those boxes to classify them and put all the boxes in each part into one larger box. The new set of larger boxes is labeled by specific values of fewer extensive quantities, and it is the so-called contraction of the original set of boxes.

I call it contracting because it is like contracting the affine space of extensive quantities onto one of its affine subspaces. The way we do this should be described by a projection. A projection in an affine space maps the whole space onto one of its affine subspaces, and the preimage of each point in the subspace is an affine subspace of the original space. The preimages form a family of parallel affine subspaces labeled by their images under the projection. This family of affine subspaces may be used to define a family of slices of the space of extensive quantities or of the thermal system, which are useful when defining the contraction of the space of extensive quantities or of the system.

*Definition.*
A **contracting** of a space of extensive quantities $\left(W,E,\lambda\right)$
is given by a tuple $\left(\pi,\lambda^\perp\right)$, where

- $\pi:W\to W^\perp$ is a projection map from $W$ to an affine subspace $W^\perp$ of $W$;
- $E^\perp:=\pi(E)$, the image of $E$ under $\pi$, is equipped with the minimal topology $\tau\!\left(E^\perp\right)$ so that $\pi$ is continuous, and the topology makes $E^\perp$ Polish;
- $\lambda^\perp:\sigma\!\left(E^\perp\right)\to\left[0,+\infty\right]$ is a non-trivial σ-finite Borel measure on $E^\perp$, where $\sigma\!\left(E^\perp\right)\supseteq\mathfrak B\!\left(E^\perp\right)$ is a σ-algebra of $E^\perp$ that contains the Borel σ-algebra of $E^\perp$; and
- For any $A\in\sigma\!\left(E^\perp\right)$, $\lambda^{\perp}(A)=0$ iff $\lambda\!\left(\pi^{-1}(A)\right)=0$.

This contracting defines a new space of extensive quantities
$\left(W^\perp,E^\perp,\lambda^\perp\right)$, called a **contraction** of
the original space of extensive quantities $\left(W,E,\lambda\right)$.

*Definition.*
The **contractive slicings** of a space of extensive quantities $\left(W,E,\lambda\right)$
defined by a contracting $\left(\pi,\lambda^\perp\right)$ of it is a family of slicings
$\bigsqcup_{e\in W^\perp}\left(W^\parallel_e,\lambda^\parallel_e\right)$, where

- $W^\parallel_e:=\pi^{-1}(e)$ is the preimage of $\left\{e\right\}$ under $\pi$, an affine subspace of $W$; and
- $\lambda_e^\parallel:\sigma\!\left(E_e^\parallel\right)\to\left[0,+\infty\right]$ is a Borel measure; the family of measures is the disintegration of $\lambda$ w.r.t. $\pi$ and $\lambda^\perp$.

*Definition.*
A **contraction** of a thermal system $\left(\mathcal E,\mathcal M\right)$
defined by the contracting $\left(\pi,\lambda^\perp\right)$ of $\mathcal E$
is a new thermal system $\left(\mathcal E^\perp,\mathcal M^\perp\right)$ constructed as such:

- $\mathcal E^\perp:=\left(W^\perp,E^\perp,\lambda^\perp\right)$ is the contraction of $\mathcal E$ corresponding to the given contracting;
- $\mathcal M^\perp:=\bigsqcup_{e\in E^\perp}M_e^\perp$, where for each $e\in E^\perp$, $M_e^\perp:=\mathcal M_e^\parallel$; the family of systems $\left(\mathcal E_e^\parallel,\mathcal M_e^\parallel\right)$ (labeled by $e\in E^\perp$) are slices of $\left(\mathcal E,\mathcal M\right)$ corresponding to the contractive slicings of $\mathcal E$ defined by the contracting $\left(\pi,\lambda^\perp\right)$; the measure equipped on $\mathcal M_e^\parallel$ is the measure of number of microstates of $\left(\mathcal E_e^\parallel,\mathcal M_e^\parallel\right)$.

If the total number of microstates in $\mathcal M^\parallel_e$ is not finite for some $e$, then the contraction is not defined.

*Example.*
For the thermal system of a solid consisting of spin-$\frac12$ particles,
define a contracting $\left(\pi,\lambda^\perp\right)$ by

Then the corresponding contraction of the thermal system may be written as a thermal system $\left(\left(W,E,\lambda\right),\bigsqcup_{e\in E}M_e\right)$, where

\[\begin{align*} W&:=\mathbb R,\\ E&:=\mathbb Z^+,\\ \lambda\!\left(A\right)&:=\operatorname{card}A,\\ M_N&:=\left\{0,1\right\}^N,\\ \mu_N\!\left(A\right)&:=\operatorname{card}A. \end{align*}\]Different from a slice of a system, a contraction of a system does not have the problem about the dimension (“dimension” as in “dimensional analysis”) of the measure on the space of extensive quantities. Although the dimension of $\lambda^\perp$ is different from $\lambda$, the dimension of $\mu^\perp_e$ (the measure on $M^\perp_e$) is also different from $\mu$, and they change together in such a way that the resultant $\mu^\perp$ (the measure of number of microstates on $\mathcal M^\perp$) has the same dimension as $\mu$.
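
A quick numerical check that contracting only pools microstates without creating or destroying any (assuming, as in the earlier example, $\binom NU$ microstates at each energy $U$ in the slice at fixed $N$):

```python
from math import comb

# Microstate counts in the slice of the spin-1/2 system at fixed N:
# one box per energy U in {0, ..., N}, each holding C(N, U) microstates
# (counting measures throughout, so the double integral is a plain sum).
N = 12
pooled = sum(comb(N, U) for U in range(N + 1))   # total over all boxes
direct = 2 ** N                                  # card({0,1}^N) after contracting

print(pooled, direct)
assert pooled == direct  # the contraction has the same number of microstates
```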

This fact suggests that a contraction of a thermal system is essentially the same as the original thermal system, in the sense that the microstates of the two systems are in natural one-to-one correspondence. Indeed, the natural bijection from $\mathcal M$ to $\mathcal M^\perp$ is given by $\left(e,m\right)\mapsto\left(\pi(e),\left(e,m\right)\right)$. It is obvious that for any measurable function $f$ on $\mathcal M^\perp$ we have

\[\int_{\left(e,m\right)\in\mathcal M}f\!\left(\pi(e),(e,m)\right)\mathrm d\mu(e,m) =\int_{\left(e,m\right)\in\mathcal M^\perp}f\!\left(e,m\right)\mathrm d\mu^\perp(e,m).\]Using this map, we can pull back any function $f^\perp$ on $\mathcal M^\perp$ to become a function on $\mathcal M$ by

\[f\!\left(e,m\right):=f^\perp\!\left(\pi(e),\left(e,m\right)\right)\]and the other way around.
I want to call $f$ the **contractional pullback** of $f^\perp$ under $\pi$
and call $f^\perp$ the **contractional pushforward** of $f$ under $\pi$.
Specially, we may pull back any state $p^\perp$ of a contraction
to become a state $p$ on the original thermal system.
We will see that pullbacks of states are rather useful.

Obviously, the family of affine subspaces $\left\{W^\parallel_e\right\}_{e\in W^\perp}$ are parallel to each other. Therefore, their associated vector subspaces are the same vector subspace $\vec W^\parallel$ of $\vec W$, which is a complement of the vector subspace $\vec W^\perp$, the vector space that $W^\perp$ is associated with. We can write

\[\vec W=\vec W^\perp+\vec W^\parallel,\quad W=W^\perp+\vec W^\parallel.\]Each point in $W$ can be written in the form of $e+s$, where $e\in W^\perp$ and $s\in\vec W^\parallel$. Furthermore, for any $e\in W^\perp$, the map $s\mapsto e+s$ is a bijection from $\vec W^\parallel$ to $W^\parallel_e$. This bijection can then push forward linear operations from $\vec W^\parallel$ to $W^\parallel_e$. For example, we can define the action of some continuous linear functional $i\in\vec W^{\parallel\prime}$ on a point $e'\in W^\parallel_e$ as

\[\begin{equation} \label{eq: linear op on affine} i\!\left(e'\right):=i\!\left(e'-\pi\!\left(e'\right)\right), \end{equation}\]where $\pi\!\left(e'\right)$ is just $e$.

However, we need to remember that there is no generally physically meaningful linear structure on $W^\parallel_e$. The linear structure that we have constructed is just for convenience in notations.

An interesting fact about slicing, isolating, and contracting is that: an isolation of a contraction is a contraction of a slice.

Suppose we have a thermal system $\left(\mathcal E,\mathcal M\right)$, and by a contracting $\left(\pi,\lambda^\perp\right)$ we derive its contraction $\left(\mathcal E^\perp,\mathcal M^\perp\right)$.

Now, consider one of its contractive slices $\left(\mathcal E^\parallel_{e^\circ},\mathcal M^\parallel_{e^\circ}\right)$, where $e^\circ\in E^\perp$. Then, we contract this slice by the contracting $\left(\pi,\lambda^{\perp\prime}\right)$, where $\pi$ is the same $\pi$ as used above but whose domain is restricted to $W^\parallel_{e^\circ}$, and $\lambda^{\perp\prime}$ is the counting measure. Because the whole $W^\parallel_{e^\circ}$ is mapped to $e^\circ$ under $\pi$, the contraction becomes an isolated system whose only possible value of extensive quantities is $e^\circ$. Its spaces of microstates consist of only one measure space, which is $\mathcal M^\parallel_{e^\circ}$.

On the other hand, consider isolating $\left(\mathcal E^\perp,\mathcal M^\perp\right)$ at $e^\circ$. Its isolation at $e^\circ$ is an isolated system whose only possible value of extensive quantities is $e^\circ$. Its spaces of microstates consist of only one measure space, which is $M^\perp_{e^\circ}$, which is the same as $\mathcal M^\parallel_{e^\circ}$.

Therefore, an isolation of a contraction is a contraction of a slice.

This fact is useful because it enables us to find the equilibrium state of a slice. Using microcanonical ensemble, we can already find the equilibrium state of any isolated system, so we can find the equilibrium state of an isolation of a contraction. Then, it is the equilibrium state of a contraction of a slice. Then, by the contractional pullback, it is the equilibrium state of a slice.

Composite systems are systems that are composed of other systems. This is a useful concept because it allows us to treat multiple systems as a whole. The motivation for developing this concept is that we will use it to derive the canonical ensemble and the grand canonical ensemble. In those ensembles, the system is not isolated but in contact with a bath. To consider them as a whole system, we need to define composite systems.

The simplest case of a composite system is where
the subsystems are independent of each other.
Physically, this means that the subsystems do not have any thermodynamic contact between each other.
I would like to call the simplest case a **product thermal system**
just as how mathematicians name their product spaces constructed out of existing spaces.

*Definition.*
The **product space of extensive quantities** of two spaces of extensive quantities
$\left(W^{(1)},E^{(1)},\lambda^{(1)}\right)$ and $\left(W^{(2)},E^{(2)},\lambda^{(2)}\right)$
is a space of extensive quantities $\left(W,E,\lambda\right)$ constructed as such:

- $W:=W^{(1)}\times W^{(2)}$ is the product affine space of $W^{(1)}$ and $W^{(2)}$;
- $E:=E^{(1)}\times E^{(2)}$ is the product topological space as well as the product measure space of $E^{(1)}$ and $E^{(2)}$; and
- $\lambda$ is the product measure of $\lambda^{(1)}$ and $\lambda^{(2)}$, whose uniqueness is guaranteed by the σ-finiteness of $\lambda^{(1)}$ and $\lambda^{(2)}$.

*Definition.*
The **product thermal system** of two thermal systems
$\left(\mathcal E^{(1)},\mathcal M^{(1)}\right)$ and $\left(\mathcal E^{(2)},\mathcal M^{(2)}\right)$
is a thermal system $\left(\mathcal E,\mathcal M\right)$ constructed as such:

- $\mathcal E:=\left(W,E,\lambda\right)$ is the product space of extensive quantities of $\mathcal E^{(1)}$ and $\mathcal E^{(2)}$; and
- $\mathcal M:=\bigsqcup_{(e_1,e_2)\in E}M_{e_1,e_2}$, where $M_{e_1,e_2}:=M^{(1)}_{e_1}\times M^{(2)}_{e_2}$ is the product measure space of $M^{(1)}_{e_1}$ and $M^{(2)}_{e_2}$, equipped with measure $\mu_{e_1,e_2}$, the product measure of $\mu^{(1)}_{e_1}$ and $\mu^{(2)}_{e_2}$.

By this definition, $\mathcal M$ is naturally identified with $\mathcal M^{(1)}\times\mathcal M^{(2)}$, and the measure of number of microstates $\mu$ on $\mathcal M$ is in this sense the same as the product measure of $\mu^{(1)}$ and $\mu^{(2)}$ (the measures of number of microstates on $\mathcal M^{(1)}$ and $\mathcal M^{(2)}$). We can project elements in $\mathcal M$ back into $\mathcal M^{(1)}$ and $\mathcal M^{(2)}$ by the map $(e_1,e_2,m_1,m_2)\mapsto(e_1,m_1)$ and the map $(e_1,e_2,m_1,m_2)\mapsto(e_2,m_2)$.
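As a toy illustration of the factorization $\mu=\mu^{(1)}\times\mu^{(2)}$ (a minimal sketch with my own hypothetical discrete degeneracies, not part of the formalism above):

```python
# Toy check (hypothetical discrete numbers) that microstate counts
# multiply in a product thermal system.
Omega1 = {0: 1, 1: 3, 2: 3}   # microstate count of system 1 at each e1
Omega2 = {0: 1, 1: 2}         # microstate count of system 2 at each e2

# A microstate of the product system at (e1, e2) is a pair (m1, m2),
# so the counts multiply, mirroring the product measure.
Omega = {(e1, e2): Omega1[e1] * Omega2[e2]
         for e1 in Omega1 for e2 in Omega2}

assert Omega[(1, 1)] == 6
# The total microstate count also factorizes.
assert sum(Omega.values()) == sum(Omega1.values()) * sum(Omega2.values())
```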

This suggests that a probability distribution on $\mathcal M$
(which may be given by a state $p$ of $(\mathcal E,\mathcal M)$)
can be viewed as a joint probability distribution of the two random variables on $\mathcal M$:
$(e_1,e_2,m_1,m_2)\mapsto(e_1,m_1)$ and $(e_1,e_2,m_1,m_2)\mapsto(e_2,m_2)$.
As we all know, a joint distribution encodes conditional distributions and marginal distributions.
Therefore, given any state of a product thermal system,
we can define its **conditional states** and **marginal states** of the subsystems.
Conditional states are not very useful because they are not physically observed states of subsystems.
The physically observed states of subsystems are marginal states,
so marginal states are of special interest.
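A discrete sketch of taking a marginal state (the state labels, the uniform weights, and the two-valued toy quantities are my own assumptions for illustration):

```python
import itertools

# Hypothetical joint state p[(e1, m1, e2, m2)] of a product system,
# with two macrostate labels and two microstate labels on each side.
states = list(itertools.product([0, 1], repeat=4))
p = {s: 1 / 16 for s in states}  # a uniform joint state

# Marginal state of subsystem 1: sum (integrate) out (e2, m2).
p1 = {}
for (e1, m1, e2, m2), w in p.items():
    key = (e1, m1)
    p1[key] = p1.get(key, 0.0) + w

assert abs(sum(p1.values()) - 1.0) < 1e-12   # still normalized
assert all(abs(v - 0.25) < 1e-12 for v in p1.values())
```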

*Definition.*
Given a state $p$ of the product thermal system $(\mathcal E,\mathcal M)$
of $\left(\mathcal E^{(1)},\mathcal M^{(1)}\right)$ and $\left(\mathcal E^{(2)},\mathcal M^{(2)}\right)$,
its **marginal state** of the subsystem $\left(\mathcal E^{(1)},\mathcal M^{(1)}\right)$
is a state $p^{(1)}$ of the system $\left(\mathcal E^{(1)},\mathcal M^{(1)}\right)$ defined by

\[p^{(1)}\!\left(e_1,m_1\right):=\int_{e_2\in E^{(2)}}\int_{m_2\in M^{(2)}_{e_2}}p\!\left(e_1,e_2,m_1,m_2\right)\,\mathrm d\mu^{(2)}_{e_2}\!\left(m_2\right)\,\mathrm d\lambda^{(2)}\!\left(e_2\right).\]

Physically, if a product thermal system is in equilibrium, then each of its subsystems is in equilibrium as well. Therefore, if $p^\circ$ is an equilibrium state of the product thermal system, then the marginal states of $p^\circ$ are equilibrium states of the subsystems.

Now, we need to consider how to describe the thermodynamic contact between subsystems. In the simplest case, where there is no thermodynamic contact between subsystems, the composite system is just the product thermal system of the subsystems, and the dimension of its space of extensive quantities is the sum of those of the subsystems. If there is some thermal contact between subsystems, then the dimension of the space of extensive quantities of the composite system will be less than that of the product thermal system. For example, if the subsystems are allowed to exchange energy, then two original extensive quantities (the energy of the first subsystem and that of the second subsystem) will be replaced by a single extensive quantity (the total energy of the composite system). Such a reduction in the dimension of the space of extensive quantities is the same as the contracting that we defined above. Therefore, we can define a thermally composite system as a contraction of the product thermal system. Denote the projection map of the contracting as $\pi:W\to W^\perp:(e_1,e_2)\mapsto e$. (From now on in this section, composite systems refer to thermally composite systems. I will introduce non-thermally composite systems later (in part 2), which describe non-thermal contacts between subsystems and are more complicated.)

Besides being the contraction of the product thermal system, there is an additional requirement. Given the extensive quantities of the composite system and those of one of the subsystems, we should be able to deduce those of the other subsystem. For example, if the subsystems are allowed to exchange energy, then the total energy of the composite system minus the energy of one of the subsystems should be the energy of the other subsystem, which is uniquely determined (if this is an allowed energy). Mathematically, this means that for any $e_1\in W^{(1)}$ and $e_2\in W^{(2)}$, the two maps $\pi\!\left(e_1,\cdot\right)$ and $\pi\!\left(\cdot,e_2\right)$ are both injections.

*Definition.*
A **(thermally) composite thermal system** of two thermal systems
is the contraction of their product thermal system
corresponding to a contracting $(\pi,\lambda^\perp)$, where
$\pi:W\to W^\perp:(e_1,e_2)\mapsto e$ satisfies that
for any $e_1\in W^{(1)}$ and $e_2\in W^{(2)}$,
the two maps $\pi\!\left(e_1,\cdot\right)$ and $\pi\!\left(\cdot,e_2\right)$
are both injections.

We may define projection maps to get the extensive quantities of the subsystems from those of the composite system:

\[c^{(1)}:W\to W^{(1)}:(e_1,e_2)\mapsto e_1,\quad c^{(2)}:W\to W^{(2)}:(e_1,e_2)\mapsto e_2.\]Then, for each $e\in W^\perp$, the two spaces

\[W^{\parallel(1)}_e:=c^{(1)}\!\left(W_e^\parallel\right),\quad W^{\parallel(2)}_e:=c^{(2)}\!\left(W_e^\parallel\right)\]are respectively affine subspaces of $W^{(1)}$ and $W^{(2)}$, where $W_e^\parallel:=\pi^{-1}\!\left(e\right)$. The two affine subspaces are actually isomorphic to each other because of our additional requirement on the projection map $\pi$. Because $\pi\!\left(e_1,\cdot\right)$ is an injection, for any $e_1\in W^{\parallel(1)}_e$ there is a unique $e_2\in W^{\parallel(2)}_e$ such that $\pi\!\left(e_1,e_2\right)=e$, and vice versa. This gives a correspondence between the two affine subspaces. In other words, for each $e\in W^\perp$, there is a unique bijection $\rho_e:W^{\parallel(1)}_e\to W^{\parallel(2)}_e$ such that

\[\begin{equation} \label{eq: pi and rho_e} \forall e_1\in W^{\parallel(1)}_e: \pi\!\left(e_1,e_2\right)=e\Leftrightarrow e_2=\rho_e\!\left(e_1\right). \end{equation}\]The bijection $\rho_e$ is an affine isomorphism from $W^{\parallel(1)}_e$ to $W^{\parallel(2)}_e$.

What is more, $c^{(1)}$ is an affine isomorphism from $W^{\parallel}_e$ to $W^{\parallel(1)}_e$, and $c^{(2)}$ is an affine isomorphism from $W^{\parallel}_e$ to $W^{\parallel(2)}_e$. The three affine spaces $W^{\parallel}_e,W^{\parallel(1)}_e,W^{\parallel(2)}_e$ are then mutually isomorphic.

*Example.*
Suppose we have two thermal systems,
each of them have two extensive quantities called the energy and the number of particles.
We write them as $\left(U_1,N_1\right)$ and $\left(U_2,N_2\right)$.
They are in thermal contact so that they can exchange energy but not particles.
Then, the extensive quantities of the composite system may be written as $\left(U/2,U/2,N_1,N_2\right)$,
with $\pi:\left(U_1,U_2\right)\mapsto\left(U/2,U/2\right)$, where

\[U:=U_1+U_2.\]

The isomorphism $\rho_{U/2,U/2,N_1,N_2}$ is then

\[\rho_{U/2,U/2,N_1,N_2}\!\left(U_1,N_1\right)=\left(U-U_1,N_2\right).\]The contracting is not unique. For example, $\left(U_1,U_2\right)\mapsto\left(3U/4,U/4\right)$ is another valid projection for constructing the composite thermal system, and it has exactly the same physical meaning as the one I constructed above.
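A numeric sketch of this energy-exchange example (the degeneracies are my own toy numbers, and I track only the energy components, holding $N_1,N_2$ fixed): because $\rho_e$ pairs $U_1$ with $U_2=U-U_1$, the microstate count on a slice of fixed total energy is a convolution of the subsystems' counts.

```python
# Toy degeneracies for two systems exchanging energy (hypothetical).
Omega1 = [1, 2, 1]     # counts at U1 = 0, 1, 2
Omega2 = [1, 3, 3, 1]  # counts at U2 = 0, 1, 2, 3

U = 3  # fixed total energy of the composite slice
# rho pairs U1 with U2 = U - U1, so the slice's count convolves:
Omega_slice = sum(Omega1[U1] * Omega2[U - U1]
                  for U1 in range(len(Omega1))
                  if 0 <= U - U1 < len(Omega2))

assert Omega_slice == 1 * 1 + 2 * 3 + 1 * 3  # U1 = 0, 1, 2
```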

The isomorphism $c^{(1)}$ from $W^{\parallel}_e$ to $W^{\parallel(1)}_e$
can push forward the measure $\lambda^\parallel_e$ on $E^\parallel_e$
to a new measure $\lambda^{\parallel(1)}_e$ on $E^{\parallel(1)}_e$.
Then, $\left(W^{\parallel(1)}_e,\lambda^{\parallel(1)}_e\right)$
is a slicing of $\left(W^{(1)},E^{(1)},\lambda^{(1)}\right)$,
and we can get a slice $\left(\mathcal E^{\parallel(1)}_e,\mathcal M^{\parallel(1)}_e\right)$
of $\left(\mathcal E^{(1)},\mathcal M^{(1)}\right)$
out of this slicing.
I would like to call this slice the
**compositing slice** of $\left(\mathcal E^{(1)},\mathcal M^{(1)}\right)$ at $e$.
Similarly, we define compositing slices of $\left(\mathcal E^{(2)},\mathcal M^{(2)}\right)$,
denoted as $\left(\mathcal E^{\parallel(2)}_e,\mathcal M^{\parallel(2)}_e\right)$.

Similarly to how we can define marginal states of subsystems of a product thermal system, we can define marginal states of the compositing slices given a state of a contractive slice of the composite system. However, this time, there is a key difference: the subsystems (compositing slices) have isomorphic and completely dependent (deterministic) extensive quantities instead of completely independent ones. Taking this into account, we can define marginal states of compositing slices as follows:

\[\begin{equation} \label{eq: slice marginal state} p^{\parallel(1)}\!\left(e_1,m_1\right) :=\int_{m_2\in M^{(2)}_{\rho_e(e_1)}}p^\parallel\!\left(e_1,\rho_e(e_1),m_1,m_2\right) \mathrm d\mu^{(2)}_{\rho_e(e_1)}\!\left(m_2\right), \end{equation}\]where $p^{\parallel(1)}$ is a state of $\left(\mathcal E^{\parallel(1)}_e,\mathcal M^{\parallel(1)}_e\right)$, and $p^\parallel$ is a state of $\left(\mathcal E^{\parallel}_e,\mathcal M^{\parallel}_e\right)$ (a contractive slice of the composite system).

There is an additional property that $\rho_e$ has.

As we all know, an affine map is a linear map combined with a translation:

\[\begin{equation} \label{eq: rho_e and vec rho} \rho_e\!\left(e_1\right)=\vec\rho\!\left(e_1-e_0\right)+\rho_e\!\left(e_0\right), \end{equation}\]where $e_0$ is a fixed point in $W^{\parallel(1)}_e$, and $\vec\rho:\vec W^{\parallel(1)}_e\to \vec W^{\parallel(2)}_e$ is a linear map that is independent of the choice of $e_0$. Because $\rho_e$ is a bijection, $\vec\rho$ is also a bijection, and is thus a linear isomorphism from $\vec W^{\parallel(1)}_e$ to $\vec W^{\parallel(2)}_e$.

Because different slices $W^{\parallel(1)}_e$ with different $e$ are parallel to each other, actually $\vec W^{\parallel(1)}_e$ is the same vector subspace of $\vec W^{(1)}$ for any $e\in W^\perp$. We can write it as $\vec W^{\parallel(1)}$. Similarly, $\vec W^{\parallel(2)}_e$ is the same vector subspace $\vec W^{\parallel(2)}$ of $\vec W^{(2)}$ for any $e\in W^\perp$. Therefore, we can say $\vec\rho$ is a linear isomorphism from $\vec W^{\parallel(1)}$ to $\vec W^{\parallel(2)}$.

Then, here is the interesting claim:

*Theorem.*
The linear map $\vec\rho$ defined above is independent of the choice of $e$.

*Proof.*
Because $\pi$ is an affine map, we have

\[\pi\!\left(e_1,e_2\right) =\vec\pi\!\left(e_1-e_0,e_2-\rho_e\!\left(e_0\right)\right) +\pi\!\left(e_0,\rho_e\!\left(e_0\right)\right),\]where $e\in W^\perp$ is fixed, $e_0\in W^{\parallel(1)}_e$ is also fixed, and $\vec\pi:\vec W\to\vec W^\perp$ is a linear map that is independent of the choice of $e$ and $e_0$.

Let $e_2:=\rho_e\!\left(e_1\right)$ in the equation above, and we have

\[\pi\!\left(e_1,\rho_e\!\left(e_1\right)\right) =\vec\pi\!\left(e_1-e_0,\rho_e\!\left(e_1\right)-\rho_e\!\left(e_0\right)\right) +\pi\!\left(e_0,\rho_e\!\left(e_0\right)\right).\]According to Equation \ref{eq: pi and rho_e} and \ref{eq: rho_e and vec rho}, we have

\[e=\vec\pi\!\left(e_1-e_0,\vec\rho\!\left(e_1-e_0\right)\right)+e.\]In other words,

\[\begin{equation} \label{eq: pi(s1, rho(s1))=0} \vec\pi\!\left(s_1,\vec\rho\!\left(s_1\right)\right)=0, \end{equation}\]where $s_1\in\vec W^{\parallel(1)}$ is an arbitrary vector.

We prove by contradiction. Assume that $\vec\rho$ depends on the choice of $e$; then there exist two choices of $e$ that give two different $\vec\rho$'s, denoted as $\vec\rho$ and $\vec\rho'$. Because they are different maps, there exists an $s_1\in\vec W^{\parallel(1)}$ such that $\vec\rho(s_1)\ne\vec\rho'(s_1)$.

On the other hand, we have

\[\vec\pi\!\left(s_1,\vec\rho\!\left(s_1\right)\right)=0,\quad \vec\pi\!\left(s_1,\vec\rho'\!\left(s_1\right)\right)=0.\]Subtract the two equations, and because of the linearity of $\vec\pi$, we have

\[\vec\pi\!\left(0,\delta\right)=0,\]where $\delta:=\vec\rho(s_1)-\vec\rho’(s_1)$ is a nonzero vector. Then, we have

\[\pi\!\left(e_1,e_2+\delta\right)-\pi\!\left(e_1,e_2\right)=\vec\pi(0,\delta)=0,\]which contradicts the requirement that $\pi\!\left(e_1,\cdot\right)$ is injective. $\square$
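A toy check of the relation $\vec\pi\!\left(s_1,\vec\rho\!\left(s_1\right)\right)=0$ in the energy-exchange example (my own reading of the example above: on the energy components, $\vec\pi(s_1,s_2)=((s_1+s_2)/2,(s_1+s_2)/2)$, which forces $\vec\rho=-\mathrm{id}$ independently of $e$):

```python
def pi_vec(s1, s2):
    # Linear part of the projection in the energy-exchange example:
    # both output components are (s1 + s2) / 2.
    return ((s1 + s2) / 2, (s1 + s2) / 2)

def rho_vec(s1):
    # The unique linear map satisfying pi_vec(s1, rho_vec(s1)) == 0.
    return -s1

for s1 in (-2.0, 0.5, 3.0):
    assert pi_vec(s1, rho_vec(s1)) == (0.0, 0.0)
```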

Besides, because $\vec\rho$ is a linear isomorphism from $\vec W^{\parallel(1)}$ to $\vec W^{\parallel(2)}$, the map $i_1\mapsto i_1\circ\vec\rho^{-1}$ is a linear isomorphism from $\vec W^{\parallel(1)\prime}$ to $\vec W^{\parallel(2)\prime}$. The inverse of this isomorphism is $i_2\mapsto i_2\circ\vec\rho$.

As we know, $i_1$ and $i_2$ are actually intensive quantities.
The physical meaning of them being each other’s image/preimage under this isomorphism is that,
if the two subsystems in thermal contact have intensive quantities $-i_1$ and $i_2$ respectively,
then they are in equilibrium with each other.
Therefore, I would like to call this pair of intensive quantities **anticonsistent**.

Since we have a family of slices called the compositing slices of a subsystem, can we make them the contractive slices of some contracting of the subsystem? Well, it depends. The first difficulty is that $W^{\parallel(1)}_e$ may be the same subspace of $W^{(1)}$ for different $e\in W^\perp$, and thus $E^{\parallel(1)}_e$ may be equipped with different measures for the same subspace.

Anyway, ignore this at this stage. Let me first construct a subspace $W^{\perp(1)}$ and a projection $\pi^{(1)}:W^{(1)}\to W^{\perp(1)}$ so that $W^{\parallel(1)}_e$ are preimages of points in $W^{\perp(1)}$, and then see what will happen.

Since any vector subspace has a complement, we can pick a subspace of $\vec W^{(1)}$ that is a complement of $\vec W^{\parallel(1)}$ and call it $\vec W^{\perp(1)}$. Any vector in $\vec W^{(1)}$ can be uniquely decomposed into the sum of a vector in $\vec W^{\perp(1)}$ and a vector in $\vec W^{\parallel(1)}$.

Then, we pick some fixed $e_1\in W^{(1)}$, and it can be used to generate an affine subspace $W^{\perp(1)}:=e_1+\vec W^{\perp(1)}$ of $W^{(1)}$. Then, each point in $W^{(1)}$ can be uniquely decomposed into the sum of a point in $W^{\perp(1)}$ and a vector in $\vec W^{\parallel(1)}$. Such unique decompositions can be encoded into a projection map $\pi^{(1)}:W^{(1)}\to W^{\perp(1)}$.

It seems that we are now halfway to the construction of our contracting. However, before we proceed, I would like to prove a property of $W^{\perp(1)}$ we construct:

*Theorem.*
The map $\pi$ is an affine isomorphism
from the product affine space $W^{\perp(1)}\times W^{(2)}$ to $W^\perp$.

*Proof.*
The map $\pi$ is itself affine, so we just need to prove that it is injective and surjective.

To prove it is injective, suppose that we have two points $(e_1,e_2)$ and $(e_1’,e_2’)$ in $W^{\perp(1)}\times W^{(2)}$, such that

\[\pi\!\left(e_1,e_2\right)=\pi\!\left(e_1',e_2'\right)=:e.\]Then, we have

\[\left(e_1,e_2\right),\left(e_1',e_2'\right)\in W^\parallel_e.\]Therefore, $e_1,e_1’\in W^{\parallel(1)}_e$, so

\[e_1-e_1'\in\vec W^{\parallel(1)}.\]On the other hand, because $e_1,e_1’\in W^{\perp(1)}$, we have

\[e_1-e_1'\in\vec W^{\perp(1)}.\]Because $\vec W^{\perp(1)}$ is a complement of $\vec W^{\parallel(1)}$, the only possible case is that $e_1=e_1’$. Then, due to $\pi\!\left(e_1,\cdot\right)$ being injective, $e_2=e_2’$. Therefore, $\left(e_1,e_2\right)=\left(e_1’,e_2’\right)$. Therefore, $\pi$ is injective if its domain is restricted to $W^{\perp(1)}\times W^{(2)}$.

To prove it is surjective, suppose $e\in W^\perp$. Because $\pi$ is surjective from $W$ to $W^\perp$, there exists some $\left(e_1’,e_2’\right)\in W$ such that

\[\pi\!\left(e_1',e_2'\right)=e.\]According to Equation \ref{eq: pi and rho_e}, this is equivalently

\[e_2'=\rho_e\!\left(e_1'\right).\]We can uniquely decompose $e_1’\in W^{(1)}$ into the sum of a point $e_1\in W^{\perp(1)}$ and a vector $\delta\in\vec W^{\parallel(1)}$. Then, according to Equation \ref{eq: rho_e and vec rho}, we have

\[e_2'=\rho_e\!\left(e_1+\delta\right)=\rho_e\!\left(e_1\right)+\vec\rho\!\left(\delta\right).\]Thus $e_2:=e_2’-\vec\rho\!\left(\delta\right)=\rho_e\!\left(e_1\right)$. According to Equation \ref{eq: pi and rho_e}, this is equivalently

\[\pi\!\left(e_1,e_2\right)=e.\]Therefore, $\left(e_1,e_2\right)\in W^{\perp(1)}\times W^{(2)}$ is the desired point in $W^{\perp(1)}\times W^{(2)}$ that is mapped to $e$ under $\pi$. Therefore, $\pi$ is surjective if its domain is restricted to $W^{\perp(1)}\times W^{(2)}$. $\square$

Then, it seems that if we need a measure on $E^{\perp(1)}$ that is consistent with our theory, the product measure of it and that on $E^{(2)}$ should be equal to that on $E^\perp$. However, it is not always possible to find such a measure. This is our second difficulty.

Therefore, in order to construct a contracting, we need the following assumptions:

- For different $e\in E^\perp$, $\lambda^{\parallel(1)}_e$ is the same measure whenever $W^{\parallel(1)}_e$ is the same subspace.
- There exists a measure $\lambda^{\perp(1)}$ on $E^{\perp(1)}$ so that $\lambda^\perp$ is the pushforward of the product measure of $\lambda^{\perp(1)}$ and $\lambda^{(2)}$ under $\pi$.

Given those assumptions, if we define $\lambda^{\parallel(1)\prime}_{e_1}$ to be the measures from the disintegration of $\lambda^{(1)}$ w.r.t. $\pi^{(1)}$ and $\lambda^{\perp(1)}$ (just the way we constructed the measures in constructive slicings), then we can verify that they are actually the same as $\lambda^{\parallel(1)}_e$ defined before, for any $e$ in the image of $\pi\!\left(e_1,\cdot\right)$. You can verify this easily by the following check (not a rigorous proof), where $\otimes$ denotes product measures or integration:

\[\lambda=\lambda^{\perp}\otimes\left\{\lambda^\parallel_e\right\} =\lambda^{\perp(1)}\otimes\lambda^{(2)}\otimes\left\{\lambda^\parallel_e\right\}.\]On the other hand,

\[\lambda=\lambda^{(1)}\otimes\lambda^{(2)} =\lambda^{\perp(1)}\otimes\left\{\lambda^{\parallel(1)\prime}_{e_1}\right\}\otimes\lambda^{(2)}.\]Comparing them, we have

\[\left\{\lambda^{\parallel(1)\prime}_{e_1}\right\}=\left\{\lambda^\parallel_e\right\} =\left\{\lambda^{\parallel(1)}_e\right\}.\]An explicit verification is more tedious and is omitted here.

Those assumptions are very strong, so we do not want to assume them. Without those assumptions, we still have a well-constructed $W^{\perp(1)}$ and $\pi^{(1)}$ so that $W^{\parallel(1)}_e$ are preimages of points in $W^{\perp(1)}$ under $\pi$. Then, we can use similar tricks as Equation \ref{eq: linear op on affine} to define the action of any continuous linear functional $i_1\in\vec W^{\parallel(1)\prime}$ on a point $e_1\in W^{(1)}$ as

\[i_1\!\left(e_1\right):=i_1\!\left(e_1-\pi^{(1)}\!\left(e_1\right)\right).\]We can also do the same thing on $W^{(2)}$. Then, an interesting thing to notice is that if we have $e_1\in W^{(1)}$ and $e_2\in W^{(2)}$ such that

\[e:=\pi\!\left(e_1,e_2\right) =\pi\!\left(\pi^{(1)}\!\left(e_1\right),\pi^{(2)}\!\left(e_2\right)\right),\]then we have

\[i_1\!\left(e_1\right)=i_2\!\left(e_2\right),\]where $i_1\in\vec W^{\parallel(1)\prime}$ and $i_2\in\vec W^{\parallel(2)\prime}$ are anticonsistent to each other.

*Example.*
In the example of two thermal systems that can exchange energy but not number of particles,
we may choose

\[\pi^{(1)}:\left(U_1,N_1\right)\mapsto\left(0,N_1\right),\quad \pi^{(2)}:\left(U_2,N_2\right)\mapsto\left(0,N_2\right).\]

Such projections are not unique, but this is the simplest one and also the most natural one considering their physical meanings.
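A toy check of the anticonsistency relation $i_1\!\left(e_1\right)=i_2\!\left(e_2\right)$ in this example (the concrete choices here are my own assumptions: the projections zero out the energy components, $\vec\rho$ negates energy, and `beta` is a hypothetical value of the intensive quantity):

```python
beta = 2.0  # a hypothetical value of the intensive quantity i1

def i1(e1):
    # i1 acting on e1 via e1 - pi1(e1): only the energy component remains.
    U1, N1 = e1
    return beta * U1

def i2(e2):
    # Anticonsistent partner i2 = i1 composed with rho^{-1};
    # here rho negates the energy, so i2 picks up a minus sign.
    U2, N2 = e2
    return -beta * U2

# If e = pi(e1, e2) = pi(pi1(e1), pi2(e2)), the energies must cancel,
# since pi1 and pi2 project the energy components to zero.
for U1 in (-1.0, 0.5, 2.0):
    e1, e2 = (U1, 4.0), (-U1, 7.0)
    assert i1(e1) == i2(e2)
```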

We have newly defined some vector spaces. There are interesting relations between them:

*Theorem.*
\[\vec\pi\!\left(\vec W^{\parallel(1)}\right) =\vec\pi\!\left(\vec W^{\parallel(2)}\right) =\vec\pi\!\left(\vec W^{\parallel(1)}\times\vec W^{\parallel(2)}\right) =:\vec W^{\perp\parallel}.\]

*Proof.*
Obviously
$\vec\pi\!\left(\vec W^{\parallel(2)}\right)\subseteq
\vec\pi\!\left(\vec W^{\parallel(1)}\times\vec W^{\parallel(2)}\right)$,
so we just need to prove that
$\vec\pi\!\left(\vec W^{\parallel(1)}\times\vec W^{\parallel(2)}\right)
\subseteq\vec\pi\!\left(\vec W^{\parallel(2)}\right)$.
To prove this, we just need to prove that for any

\[s:=\vec\pi\!\left(s_1,s_2\right),\]where $s_1\in\vec W^{\parallel(1)}$ and $s_2\in\vec W^{\parallel(2)}$, we have $s\in\vec\pi\!\left(\vec W^{\parallel(2)}\right)$. To prove this, subtract Equation \ref{eq: pi(s1, rho(s1))=0} from the definition of $s$, and we have

\[s=\vec\pi\!\left(0,s_2-\vec\rho\!\left(s_1\right)\right)\in\vec\pi\!\left(\vec W^{\parallel(2)}\right).\]Therefore, $\vec\pi\!\left(\vec W^{\parallel(1)}\times\vec W^{\parallel(2)}\right) \subseteq\vec\pi\!\left(\vec W^{\parallel(2)}\right)$. Similarly, $\vec\pi\!\left(\vec W^{\parallel(1)}\times\vec W^{\parallel(2)}\right) \subseteq\vec\pi\!\left(\vec W^{\parallel(1)}\right)$. Therefore, we proved the theorem. $\square$

Here we defined a new vector space $\vec W^{\perp\parallel}$. Obviously it is a subspace of $\vec W^\perp$. Because $\vec\pi(s_1,\cdot)$ and $\vec\pi(\cdot,s_2)$ are injective, $\vec\pi$ is a linear isomorphism from $\vec W^{\parallel(1)}$ to $\vec W^{\perp\parallel}$ and a linear isomorphism from $\vec W^{\parallel(2)}$ to $\vec W^{\perp\parallel}$.

Here is another interesting thing about this vector space:

*Theorem.*
Suppose $e,e’\in W^\perp$.
Then $e'-e\in\vec W^{\perp\parallel}$ if and only if
$W^{\parallel(1)}_e=W^{\parallel(1)}_{e'}$ and $W^{\parallel(2)}_e=W^{\parallel(2)}_{e'}$.

*Proof.*
First, prove the “if” direction.

Because $W^{\parallel(1)}_e=W^{\parallel(1)}_{e’}$, we have $c^{(1)}\!\left(\pi^{-1}\!\left(e\right)\right)=c^{(1)}\!\left(\pi^{-1}\!\left(e’\right)\right)$. In other words,

\[\forall x\in\pi^{-1}(e):\exists s_2\in\vec W^{(2)}:x+\left(0,s_2\right)\in\pi^{-1}(e').\]Equivalently, this means

\[\pi(x)=e\Rightarrow\exists s_2\in\vec W^{(2)}:\pi\!\left(x+\left(0,s_2\right)\right)=e'.\]Note that $\pi\!\left(x+\left(0,s_2\right)\right)=\pi(x)+\vec\pi\!\left(0,s_2\right)$, which is just $e+\vec\pi\!\left(0,s_2\right)$, and we have

\[\exists s_2\in\vec W^{(2)}:e'-e=\vec\pi\!\left(0,s_2\right).\]Similarly,

\[\exists s_1\in\vec W^{(1)}:e'-e=\vec\pi\!\left(s_1,0\right).\]Subtract the two equations, and we have

\[0=\vec\pi\!\left(s_1,-s_2\right),\]which means

\[\left(s_1,-s_2\right)\in\vec\pi^{-1}(0)=\vec W^\parallel.\]Therefore,

\[s_1\in c^{(1)}\!\left(\vec W^\parallel\right)=\vec W^{\parallel(1)}.\]Therefore,

\[e'-e=\vec\pi\!\left(s_1,0\right)\in\vec\pi\!\left(\vec W^{\parallel(1)}\right) =\vec W^{\perp\parallel}.\]Now, prove the “only if” direction.

Because $e’-e\in\vec W^{\perp\parallel}=\vec\pi\!\left(\vec W^{\parallel(2)}\right)$, there exists $s_2\in\vec W^{\parallel(2)}$ such that

\[e'=e+\vec\pi\!\left(0,s_2\right).\]Therefore, obviously we have $c^{(1)}\!\left(\pi^{-1}\!\left(e\right)\right)=c^{(1)}\!\left(\pi^{-1}\!\left(e’\right)\right)$, and thus $W^{\parallel(1)}_e=W^{\parallel(1)}_{e’}$.

Similarly, we can prove that $W^{\parallel(2)}_e=W^{\parallel(2)}_{e’}$. $\square$

This means that, given both $W^{\parallel(1)}_e$ and $W^{\parallel(2)}_e$, we can determine $e$ up to a vector in $\vec W^{\perp\parallel}$.

Because we already have $\vec W^{\perp\parallel}$, we can define a new affine subspace $W^{\perp\perp}:=\pi\!\left(W^{\perp(1)}\times W^{\perp(2)}\right)$ so that $W^\perp=W^{\perp\perp}+\vec W^{\perp\parallel}$, and each point in $W^\perp$ can be uniquely decomposed as a sum of a point in $W^{\perp\perp}$ and a vector in $\vec W^{\perp\parallel}$. We can prove this easily. Such decomposition can be encoded into a projection $\pi^\perp:W^\perp\to W^{\perp\perp}$ so that for any $e\in W^\perp$, we have $e-\pi^\perp(e)\in\vec W^{\perp\parallel}$. Also, we can easily prove that $\pi$ is an affine isomorphism from $W^{\perp(1)}\times W^{\perp(2)}$ to $W^{\perp\perp}$.
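A sketch of the decomposition $W^\perp=W^{\perp\perp}+\vec W^{\perp\parallel}$ in the energy-exchange example (the coordinates are my own choice: a point of $W^\perp$ is written $(U/2,U/2,N_1,N_2)$, the energy direction $(d,d,0,0)$ spans $\vec W^{\perp\parallel}$, and the particle numbers live in $W^{\perp\perp}$):

```python
def pi_perp(e):
    # Project a point (U/2, U/2, N1, N2) of W_perp onto W_perp_perp
    # by zeroing the energy components.
    u1, u2, n1, n2 = e
    return (0.0, 0.0, n1, n2)

e = (3.0, 3.0, 4.0, 7.0)
d = tuple(a - b for a, b in zip(e, pi_perp(e)))

# e - pi_perp(e) lies in the energy direction (d, d, 0, 0), as required.
assert d == (3.0, 3.0, 0.0, 0.0)
```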

Now that we have defined many affine spaces and vector spaces, here is a diagram of the relation between (some of) them (powered by quiver):

*Example.*
In the example of two thermal systems that can exchange energy but not number of particles,
we may have

**Bath**s are a special class of thermal systems.
They are systems that have some of their intensive quantities well-defined and constant.

According to Equation \ref{eq: mce fundamental eq}, to make the intensive quantities constant, $\ln\Omega(e)$ should be linear in $e$. If we only require some of the intensive quantities to be constant, we only need it to be linear when $e$ moves along directions in a certain vector subspace.

The requirement above is all that the microcanonical ensemble needs because the microcanonical ensemble does not involve changes in extensive quantities. An additional intuitive requirement is that $\lambda$ is also translationally invariant in such directions.

Then, here comes the definition of a bath:

*Definition.*
A thermal system $(\mathcal E,\mathcal M)$ is called
a **$\left(\vec W^\parallel,i\right)$-bath**,
where $\mathcal E=(W,E,\lambda)$ and $\mathcal M=\bigsqcup_{e\in W}M_e$, if

- $\vec W^\parallel$ is a vector subspace of $\vec W$ and is a Polish reflexive space;
- For any $e\in E$ and $s\in\vec W^\parallel$, $e+s\in E$;
- $\lambda$ is invariant under translations in $\vec W^\parallel$; in other words, for any $s\in\vec W^\parallel$ and $A\in\sigma(E)$, we have $\lambda(A+s)=\lambda(A)$;
- $i\in\vec W^{\parallel\prime}$ is a continuous linear functional on $\vec W^\parallel$,
called the **constant intensive quantities** of the bath; and
- For any $e\in E$ and $s\in\vec W^\parallel$,

\[\mu_{e+s}\!\left(M_{e+s}\right)=\mathrm e^{i\left(s\right)}\mu_e\!\left(M_e\right).\]

An important note is that $\vec W^\parallel$ must be finite-dimensional because a metrizable TVS with a non-trivial σ-finite translationally quasi-invariant Borel measure must be finite-dimensional (Feldman, 1966).

We can then define the non-trivial σ-finite translationally invariant Borel measure on $\vec W^\parallel$, denoted as $\lambda^\parallel$. It is unique up to a positive constant factor.

We may construct an affine subspace $W^\perp$ for the bath so that every point in $W$ can be uniquely decomposed into the sum of a point in $W^\perp$ and a vector in $\vec W^\parallel$. Then, we have a projection map $\pi:W\to W^\perp$ so that for any $e\in W$ we have $e-\pi(e)\in\vec W^\parallel$. Then, obviously, $\mu_e\!\left(M_e\right)$ must be in the form

\[\begin{equation} \label{eq: Omega of bath} \mu_e\!\left(M_e\right)=f\!\left(\pi(e)\right)\mathrm e^{i(e-\pi(e))}, \end{equation}\]where $f:W^\perp\to\mathbb R^+$ is some function. The explicit formula of $f$ is $f(e):=\mu_e\!\left(M_e\right)$.

Further, we may require that $W^\perp$ is associated with a topological complement of $\vec W^\parallel$ (this is because $\vec W$ is locally convex and Hausdorff and $\vec W^\parallel$ is finite-dimensional). Then, by the mathematical tools that were introduced in the beginning, we can disintegrate the measure $\lambda$ w.r.t. $\lambda^\parallel$ to get a measure $\lambda^\perp$ on $W^\perp$ (it is the same for any element in $\vec W^\parallel$ because $\lambda$ is $\vec W^\parallel$-translationally invariant). Then, $\lambda$ is the product measure of $\lambda^\perp$ and $\lambda^\parallel$. In other words, for any measurable function $f:E\to\mathbb R$, we have

\[\int_Ef\,\mathrm d\lambda= \int_{e\in E^\perp}\int_{s\in\vec W^\parallel}f\!\left(e+s\right) \mathrm d\lambda^\perp\!\left(e\right)\mathrm d\lambda^\parallel\!\left(s\right).\]

Different from microcanonical ensembles,
**thermal ensemble**s are ensembles where the system we study is in thermal contact with a bath.
For example, canonical ensembles and grand canonical ensembles are thermal ensembles.
There are also non-thermal ensembles,
which will be introduced later after we introduce non-thermal contacts
(in part 2).

The thermal ensemble of a thermal system is the ensemble of the composite system of the system in question (subsystem 1) and a $\left(\vec W^{\parallel(2)},-i\circ\vec\rho^{-1}\right)$-bath (subsystem 2), where $i\in\vec W^{\parallel(1)\prime}$ is a parameter, with an extra requirement:

\[\begin{equation} \label{eq: W2 translationally invariant} \forall s_2\in\vec W^{\parallel(2)},A\in\sigma(E): \lambda^\perp\!\left(\pi\!\left(A+s_2\right)\right)=\lambda^\perp\!\left(\pi\!\left(A\right)\right). \end{equation}\]The physical meaning of $i$ is the set of intensive quantities at which the system is fixed by its contact with the bath.

This composite system is called the
**composite system for the $\vec W^{\parallel(1)}$-ensemble**.
It is called that because we will see that the only important thing
that distinguishes different thermal ensembles is the choice of $\vec W^{\parallel(1)}$,
and the choices of $\pi,\lambda^\perp,W^{\perp(1)},W^{\perp(2)}$ are not important.

*Definition.*
The **composite system for the $\vec W^{\parallel(1)}$-ensemble**
of the system $\left(\mathcal E^{(1)},\mathcal M^{(1)}\right)$ is the composite system
of $\left(\mathcal E^{(1)},\mathcal M^{(1)}\right)$ and $\left(\mathcal E^{(2)},\mathcal M^{(2)}\right)$,
where

- $\left(\mathcal E^{(2)},\mathcal M^{(2)}\right)$
is a $\left(\vec W^{\parallel(2)},-i\circ\vec\rho^{-1}\right)$-bath,
where $i\in\vec W^{\parallel(1)\prime}$ is a parameter called the
**fixed intensive quantities**; and
- Equation \ref{eq: W2 translationally invariant} holds.

From the properties of a bath, we can derive a useful property of $\lambda^{\parallel(1)}_e$.

Because $\lambda^{\parallel(1)}_e$ is the pullback of $\lambda^{\parallel(2)}_e$ under $\rho_e$, and $\lambda^{\parallel(2)}_e$ is just the same $\lambda^{\parallel(2)}$ for all $e$ (although $\lambda^{\parallel(2)}_e$ is defined on $W^{\parallel(2)}_e$ while $\lambda^{\parallel(2)}$ is defined on $\vec W^{\parallel(2)}$), $\lambda^{\parallel(1)}_e$ is the same as long as $W^{\parallel(1)}_e$ is the same. This means that we can stay consistent across different compositing slices of our subsystem.

As we have claimed before, the isolation of a contraction is the same as the full contraction of a contractive slice. Therefore, we can use the microcanonical ensemble to find the equilibrium state of any contractive slice. Then, we can use the marginal state of each contractive slice to get the equilibrium state of each compositing slice in the subsystem.

Because of the equal a priori probability postulate, the equilibrium state $p^{\parallel\circ}_e$ on the contractive slice \(\left(\mathcal E^\parallel_e,\mathcal M^\parallel_e\right)\) is

\[p^{\parallel\circ}_e\!\left(e_1,e_2,m_1,m_2\right) =\frac1{\mu^\parallel_e\!\left(\mathcal M^\parallel_e\right)}\propto1,\]where $\mu^\parallel_e$ is the measure of the number of microstates on $\mathcal M^\parallel_e$. Here $\propto$ means that the factor is only related to $e$. We just need “$\propto$” instead of “$=$” because we can always normalize a probability density function.

Substitute this into Equation \ref{eq: slice marginal state}, and we get that the equilibrium state $p^{\parallel\circ(1)}_e$ on the compositing slice \(\left(\mathcal E^{\parallel(1)}_e,\mathcal M^{\parallel(1)}_e\right)\) is

\[\begin{align} p^{\parallel\circ(1)}_e\!\left(e_1,m_1\right) &\propto\mu^{(2)}_{\rho_e(e_1)}\!\left(M^{(2)}_{\rho_e(e_1)}\right) \nonumber\\ &=f\!\left(\pi^{(2)}\!\left(\rho_e\!\left(e_1\right)\right)\right) \mathrm e^{\left(-i\circ\vec\rho^{-1}\right)\left(\rho_e(e_1)-\pi^{(2)}(\rho_e(e_1))\right)} \nonumber\\ &\propto\mathrm e^{-i(e_1)}. \label{eq: p^(1) propto e^-i(e1)} \end{align}\]Here we utilized Equation \ref{eq: Omega of bath} and the fact that for any $e_1\in W^{\parallel(1)}_e$, $\pi^{(2)}\!\left(\rho_e(e_1)\right)=\pi^{(2)}\!\left(W^{\parallel(2)}_e\right)$ is the same and is only related to $e$. Note that we have already illustrated that $\lambda^{\parallel(1)}_e$ is the same as long as $W^{\parallel(1)}_e$ is the same, so we can normalize $p^{\parallel\circ(1)}_e$ to get the same state as long as $W^{\parallel(1)}_e$ is the same, avoiding any inconsistency.
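A toy numeric check of this step (a minimal sketch with my own hypothetical numbers): a bath whose microstate count grows exponentially in energy, with slope `beta`, weighs the subsystem's energies by exactly the factor $\mathrm e^{-i(e_1)}$.

```python
import math

beta = 1.3   # the bath's constant intensive quantity (hypothetical value)
U = 10.0     # fixed total energy of the contractive slice

def Omega_bath(U2):
    # Bath property: ln(microstate count) is linear in energy, slope beta.
    return math.exp(beta * U2)

def weight(U1):
    # Unnormalized equilibrium weight of subsystem energy U1 on the slice:
    # proportional to the bath's microstate count at U - U1.
    return Omega_bath(U - U1)

# Ratios of weights reproduce the exponential factor e^{-beta * dU}.
assert abs(weight(5.0) / weight(2.0) - math.exp(-beta * 3.0)) < 1e-9
```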

Before we proceed to normalize $p^{\parallel\circ(1)}_e$, I would like to talk about what is just enough information to determine $\lambda^{\parallel(1)}_e$. First, we need to know how different $e$ can still make $W^{\parallel(1)}_e$ the same. We already know that $W^\perp$ is just $W^{\perp\perp}+\vec W^{\perp\parallel}$, and the component in $\vec W^{\perp\parallel}$ does not affect $W^{\parallel(1)}_e$ and $W^{\parallel(2)}_e$, so we only need to know no more than $\pi^\perp(e)$. Then, because $W^{\perp\perp}$ is isomorphic to $W^{\perp(1)}\times W^{\perp(2)}$ but the corresponding change in $W^{\perp(2)}$ does not affect $W^{\parallel(1)}_e$, we only need to know the component $\pi^{(1)}\!\left(e_1\right)=\pi^{(1)}\!\left(\pi^{-1}(e)\right)$, where $e_1$ is just the $e_1$ in Equation \ref{eq: p^(1) propto e^-i(e1)}. The space $W^{\parallel(1)}_e$ is just $\pi^{(1)-1}\!\left(e_1\right)$.

Besides these useless components of $e$, there is other irrelevant information. I have previously mentioned that the choices of $\lambda^\perp$, $\lambda^{\perp(2)}$, etc. are also irrelevant. We can see this by noting that $\lambda^{\parallel(1)}$ is always the non-trivial translationally invariant σ-finite Borel measure on $W^{\parallel(1)}_e$, which is unique up to a constant positive factor (and exists because the space is finite-dimensional). This is not related to the choices of $\lambda^\perp$, $\lambda^{\perp(2)}$, etc. By this, we have reduced what we need to care about to three measures: $\lambda^{(1)}$, $\lambda^{\perp(1)}$, and $\lambda^{\parallel(1)}$, whose relation is given by the following:

\[\int_{E^{(1)}}f\,\mathrm d\lambda^{(1)}= \int_{e_1\in E^{\perp(1)}}\mathrm d\lambda^{\perp(1)}\!\left(e_1\right) \int_{s_1\in\vec E^{\parallel(1)}_{e_1}} f\!\left(e_1+s_1\right)\mathrm d\lambda^{\parallel(1)}\!\left(s_1\right),\]where $E^{\perp(1)}:=\pi^{(1)}\!\left(E^{(1)}\right)$ and $\vec E^{\parallel(1)}_{e_1}:=\left(E^{(1)}-e_1\right)\cap\vec W^{\parallel(1)}$ is the region of $s_1\in\vec W^{\parallel(1)}$ in which $e_1+s_1$ is in $E^{(1)}$.
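This decomposition is just an iterated-integral (Fubini-type) identity. It can be sketched numerically in a hypothetical two-dimensional stand-in (none of the names below come from the text): a triangle sliced vertically, with the $x$-axis playing the role of the projection and the vertical slices playing the role of the fibers.

```python
# Hypothetical 2D stand-in for the slicing formula: E is the triangle
# {(x, y) : 0 <= x <= 1, 0 <= y <= x}, the x-axis plays the role of the
# projection, and the vertical slice over each x plays the role of the
# fiber.  Both sides are approximated with midpoint Riemann sums.
n = 400
h = 1.0 / n

def f(x, y):
    return x + y  # an arbitrary test integrand

# Left-hand side: plain double integral over E.
lhs = 0.0
for i in range(n):
    for j in range(n):
        x, y = (i + 0.5) * h, (j + 0.5) * h
        if y <= x:
            lhs += f(x, y) * h * h

# Right-hand side: integrate over each slice first, then over the projection.
def slice_integral(x):
    m = 400
    dy = x / m
    return sum(f(x, (k + 0.5) * dy) for k in range(m)) * dy

rhs = sum(slice_integral((i + 0.5) * h) for i in range(n)) * h
```

Both sums approximate the same integral (here $1/2$), which is exactly the content of the slicing formula.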

Next, what we need to do is to normalize Equation \ref{eq: p^(1) propto e^-i(e1)}.
The denominator in the normalization factor, which we could call the **partition function**
$Z:\bigsqcup_{e_1\in E^{\perp(1)}}I^{(1)}_{e_1}\to\mathbb R$, is

\[Z\!\left(e_1,i\right)=\int_{s_1\in\vec E^{\parallel(1)}_{e_1}} \Omega^{(1)}\!\left(e_1+s_1\right) \mathrm e^{-i\left(s_1\right)}\,\mathrm d\lambda^{\parallel(1)}\!\left(s_1\right),\]where $I_{e_1}\subseteq\vec W^{\parallel(1)\prime}$ is the region of $i$ in which the integral converges. It is possible that $I_{e_1}=\varnothing$ for all $e_1\in E^{\perp(1)}$, and in this case the thermal ensemble is not defined.

Because we have got rid of arguments about the bath and the composite system, we can now define the partition function without the “$(1)$” superscript:

\[Z\!\left(e,i\right)=\int_{s\in\vec E^{\parallel}_e} \Omega\!\left(e+s\right) \mathrm e^{-i\left(s\right)}\,\mathrm d\lambda^{\parallel}\!\left(s\right),\quad e\in E^\perp,\quad i\in I_e\subseteq\vec W^{\parallel\prime}.\]By looking at the definition, we may see that the partition function is just the partial Laplace transform of $\Omega$.
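In one dimension this can be sketched numerically; the density of states $\Omega(s)=s^3$ below is a hypothetical choice, for which the Laplace transform is $3!/\beta^4$.

```python
import math

def partition_function(beta, omega=lambda s: s**3, s_max=200.0, n=200_000):
    # Z(beta) = ∫_0^∞ Ω(s) e^{-β s} ds, approximated by a midpoint
    # Riemann sum; assumes the integrand is negligible beyond s_max.
    ds = s_max / n
    total = 0.0
    for k in range(n):
        s = (k + 0.5) * ds
        total += omega(s) * math.exp(-beta * s)
    return total * ds

# For the hypothetical Ω(s) = s³, the Laplace transform is Z(β) = 3!/β⁴,
# so Z(2) = 6/16 = 0.375.
Z = partition_function(2.0)
```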

Note that the partition function is unique only up to a positive constant factor because we can choose another $\lambda^\parallel$ by multiplying a positive constant factor.

The partition function has very good properties.

*Theorem.*
For any $e\in E^\perp$, $I_e$ is convex.

*Proof.*
Suppose $i,i'\in I_e$.
The functional $i'-i$ defines a hyperplane $H:=\operatorname{Ker}\!\left(i'-i\right)$.
The hyperplane separates $\vec W^\parallel$ into two half-spaces $H^+$ and $H^-$ defined as

\[H^\pm:=\left\{s\in\vec W^\parallel\,\middle|\,\pm\left(i'-i\right)\!\left(s\right)\ge0\right\}.\]By definition, $Z\!\left(e,i\right)$ and $Z\!\left(e,i'\right)$ both converge. Let $t\in\left[0,1\right]$, and we have

\[\begin{align*} Z\!\left(e,i+t\left(i'-i\right)\right) &=\left(\int_{s\in\vec E^{\parallel}_e\cap H^+}+\int_{s\in\vec E^{\parallel}_e\cap H^-}\right) \Omega\!\left(e+s\right) \mathrm e^{-i(s)-t(i'(s)-i(s))}\,\mathrm d\lambda^{\parallel}\!\left(s\right)\\ &\le\int_{s\in\vec E^{\parallel}_e\cap H^+}\Omega\!\left(e+s\right) \mathrm e^{-i(s)}\,\mathrm d\lambda^{\parallel}\!\left(s\right) +\int_{s\in\vec E^{\parallel}_e\cap H^-}\Omega\!\left(e+s\right) \mathrm e^{-i'(s)}\,\mathrm d\lambda^{\parallel}\!\left(s\right)\\ &<\infty. \end{align*}\]Therefore, $Z\!\left(e,i+t\left(i'-i\right)\right)$ converges. $\square$

Being convex is good because it means that $I_e$ is not fragmented. It is connected, and its interior $\operatorname{Int}I_e$ and closure $\operatorname{Cl}I_e$ look very much like $I_e$ itself. Also, every point in $I_e$ is a limit point of $I_e$ (as long as $I_e$ contains more than one point). This makes it possible to talk about the limits and derivatives of $Z\!\left(e,i\right)$ w.r.t. $i$.

Since $I_e$ is a region in a finite-dimensional space $\vec W^{\parallel\prime}$, we may define the derivatives w.r.t. $i$ in terms of partial derivatives w.r.t. components of $i$. To define the components of $i$, we first need a basis on $\vec W^\parallel$, which sets a coordinate system, although we should ultimately derive coordinate-independent conclusions.

Suppose we have a basis on $\vec W^\parallel$. Then, for any $s\in\vec W^\parallel$, we can write its components as $s_\bullet$, and for any $i\in\vec W^{\parallel\prime}$, we can write its components as $i_\bullet$. The subscript “$\bullet$” here can act as dummy indices (for multi-index notation). For example, we can write $i(s)=i_\bullet s_\bullet$. I do not use superscript and subscript to distinguish vectors and linear functionals because it is just for multi-index notation and because I am going to use them to label multi-index objects that are neither vectors nor linear functionals.

*Theorem.*
For any $e\in E^\perp$, $Z\!\left(e,i\right)$ is $C^\infty$ w.r.t. $i$ on $\operatorname{Int}I_e$.

*Proof.*
By the definition of the interior of a region,
for any $i\in\operatorname{Int}I_e$ and any $p\in\vec W^{\parallel\prime}$,
there exists $\delta_{i,p}>0$ such that $i+\delta_{i,p}p\in I_e$.

By Leibniz’s integral rule, the partial derivatives of $Z\!\left(e,i\right)$ w.r.t. $i$ (if existing) are given by

\[\begin{align*} \frac{\partial^{\Sigma\alpha_\bullet}Z\!\left(e,i\right)}{\partial^{\alpha_\bullet}i_\bullet} &=\int_{s\in\vec E^{\parallel}_e} \Omega\!\left(e+s\right)\left(-s_\bullet\right)^{\alpha_\bullet} \mathrm e^{-i\left(s\right)}\,\mathrm d\lambda^{\parallel}\!\left(s\right)\\ &\le\int_{s\in\vec E^{\parallel}_e} \Omega\!\left(e+s\right)\left|s_\bullet\right|^{\alpha_\bullet} \mathrm e^{-i\left(s\right)}\,\mathrm d\lambda^{\parallel}\!\left(s\right) \end{align*}\]where $\alpha_\bullet$ are natural numbers indexed by $\bullet$. Now we just need to prove that the latter integral converges for any $i\in\operatorname{Int}I_e$.

Because of the inequality

\[a\ln x-bx\le a\left(\ln\frac ab-1\right),\quad a,b,x>0,\]where the equality holds when $x=a/b$, we have

\[\left|s_\bullet\right|^{\alpha_\bullet} \le\left(\frac{\alpha_\bullet}{\mathrm eb}\right)^{\alpha_\bullet}\mathrm e^{b\Sigma\left|s_\bullet\right|}, \quad b>0.\]There are $2^{\dim\vec W^\parallel}$ orthants in $\vec W^\parallel$. We can label each of them by a string $\sigma_\bullet$ of $\pm1$ of length $\dim\vec W^\parallel$, so that each orthant can be denoted as $O_\sigma$. Then, we have

\[\forall s\in O_\sigma:\sigma_\bullet s_\bullet=\Sigma\left|s_\bullet\right|.\]Therefore,

\[\forall s\in O_\sigma:\left|s_\bullet\right|^{\alpha_\bullet} \le\left(\frac{\alpha_\bullet}{\mathrm eb}\right)^{\alpha_\bullet}\mathrm e^{b\sigma_\bullet s_\bullet}, \quad b>0.\]Let $b:=\delta_{i,-\sigma}$, where $\sigma:s\mapsto\sigma_\bullet s_\bullet$ is a linear functional. Then,

\[\forall s\in O_\sigma:\left|s_\bullet\right|^{\alpha_\bullet}\mathrm e^{-i(s)} \le\left(\frac{\alpha_\bullet}{\mathrm e\delta_{i,-\sigma}}\right)^{\alpha_\bullet} \mathrm e^{-\left(i-\delta_{i,-\sigma}\sigma\right)(s)}.\]Because $i-\delta_{i,-\sigma}\sigma\in I_e$, we have

\[\frac{\partial^{\Sigma\alpha_\bullet}Z\!\left(e,i\right)}{\partial^{\alpha_\bullet}i_\bullet} \le\sum_\sigma\left(\frac{\alpha_\bullet}{\mathrm e\delta_{i,-\sigma}}\right)^{\alpha_\bullet} \int_{s\in\vec E^{\parallel}_e\cap O_\sigma}\Omega\!\left(e+s\right) \mathrm e^{-\left(i-\delta_{i,-\sigma}\sigma\right)(s)}\, \mathrm d\lambda^{\parallel}\!\left(s\right)<\infty.\]Therefore, the partial derivatives exist. $\square$
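The differentiation under the integral sign can also be checked numerically in a one-dimensional sketch (with a hypothetical $\Omega(s)=s^3$ as a stand-in): a finite difference of $Z$ in $\beta$ should match the integral carrying the extra $\left(-s\right)$ factor.

```python
import math

def weighted_integral(beta, power, s_max=200.0, n=200_000):
    # ∫_0^∞ (-s)^power · Ω(s) e^{-β s} ds with the hypothetical Ω(s) = s³,
    # approximated by a midpoint Riemann sum.
    ds = s_max / n
    total = 0.0
    for k in range(n):
        s = (k + 0.5) * ds
        total += (-s) ** power * s**3 * math.exp(-beta * s)
    return total * ds

beta, h = 2.0, 1e-5
Z = lambda b: weighted_integral(b, 0)

finite_difference = (Z(beta + h) - Z(beta - h)) / (2 * h)  # ≈ ∂Z/∂β
leibniz = weighted_integral(beta, 1)                       # ∫ (-s) Ω e^{-βs} ds
```

For this choice, both quantities equal $\mathrm d\!\left(3!/\beta^4\right)/\mathrm d\beta=-24/\beta^5$.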

The next step is to find the macroscopic quantities. The equilibrium states are

\[p_e^{\parallel\circ}\!\left(e,m\right) =\frac{\mathrm e^{-i\left(e\right)}}{Z\!\left(\pi(e),i\right)},\]where $Z$ is the partition function. Here the role of $e$ becomes the label parameter in Equation \ref{eq: fundamental equation before}. The measured value of extensive quantities under equilibrium is then

\[\begin{align*} \varepsilon^\circ &=\frac1{Z\!\left(e,i\right)}\int_{s\in\vec E^{\parallel}_e} \left(e+s\right)\mathrm e^{-i\left(s\right)} \Omega\!\left(e+s\right)\mathrm d\lambda^{\parallel}\!\left(s\right)\\ &=e+\frac1{Z\!\left(e,i\right)}\int_{s\in\vec E^{\parallel}_e} s\mathrm e^{-i\left(s\right)} \Omega\!\left(e+s\right)\mathrm d\lambda^{\parallel}\!\left(s\right)\\ &=e-\frac{\partial\ln Z\!\left(e,i\right)}{\partial i}. \end{align*}\]The entropy under equilibrium is then

\[\begin{align*} S^\circ &=-\int_{s\in\vec E^{\parallel}_e} \frac{\mathrm e^{-i(s)}}{Z\!\left(e,i\right)}\ln\frac{\mathrm e^{-i(s)}}{Z\!\left(e,i\right)} \Omega\!\left(e+s\right)\mathrm d\lambda^{\parallel}\!\left(s\right)\\ &=\frac1{Z\!\left(e,i\right)}\int_{s\in\vec E^{\parallel}_e} i\!\left(s\right)\mathrm e^{-i\left(s\right)} \Omega\!\left(e+s\right)\mathrm d\lambda^{\parallel}\!\left(s\right) +\ln Z\!\left(e,i\right)\\ &=-i\!\left(\frac{\partial\ln Z\!\left(e,i\right)}{\partial i}\right)+\ln Z\!\left(e,i\right). \end{align*}\]By these two equations, we can eliminate the parameter $e$ and get the fundamental equation in the form of Equation \ref{eq: fundamental equation}:

\[S^\circ=i\!\left(\varepsilon^\circ\right)+\ln Z\!\left(\pi\!\left(\varepsilon^\circ\right),i\right).\]We can see that $S^\circ$ decouples into two terms, one of which is only related to the $\vec W^\parallel$ component of $\varepsilon^\circ$, and the other of which is only related to the $W^\perp$ component of $\varepsilon^\circ$. What is good is that we have a well-defined notion of derivative w.r.t. the first component, and it is $i$. Therefore, the intensive quantities corresponding to changes of extensive quantities within the subspace $\vec W^\parallel$ are well defined and equal to the constant $i$, which is just what we have been calling the fixed intensive quantities. The other components of the intensive quantities are not guaranteed to be well defined because $Z\!\left(\cdot,i\right)$ is not guaranteed to have good enough properties.
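As a one-dimensional numerical sketch (hypothetical $\Omega(s)=s^3$ and $i(s)=\beta s$, neither taken from the text), the fluctuation part of the measured value can indeed be recovered from the partition function via the standard relation $\langle s\rangle=-\partial\ln Z/\partial\beta$:

```python
import math

def ensemble_integral(g, beta, s_max=200.0, n=200_000):
    # ∫_0^∞ g(s) Ω(s) e^{-β s} ds with the hypothetical Ω(s) = s³,
    # approximated by a midpoint Riemann sum.
    ds = s_max / n
    total = 0.0
    for k in range(n):
        s = (k + 0.5) * ds
        total += g(s) * s**3 * math.exp(-beta * s)
    return total * ds

beta, h = 2.0, 1e-5
Z = lambda b: ensemble_integral(lambda s: 1.0, b)

mean_s = ensemble_integral(lambda s: s, beta) / Z(beta)  # ⟨s⟩ = 4/β = 2
dlnZ = (math.log(Z(beta + h)) - math.log(Z(beta - h))) / (2 * h)
```

Here $\langle s\rangle=4/\beta$ and $-\partial\ln Z/\partial\beta=4/\beta$ agree, as they must.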

*This article is continued in part 2.*

A **(binary) voting system** is a tuple $(P,V,q)$, where
$P$ is any set, called the set of **proposals**,
and $V$ is a finite set of preference relations on $P$, called the set of **voters**,
and $q$ is an integer between $0$ and $\left|V\right|$ (inclusive),
called the **quota**.

For each voter $v\in V$ and two proposals $x,y\in P$, we denote “$v$ prefers $x$ to $y$” by

\[x\succeq_vy.\]A proposal $x\in P$ is a **defeat** of $y\in P$ if

\[\left|\left\{v\in V\,\middle|\,x\succeq_vy\right\}\right|\ge q,\]denoted as $x\succsim_{V,q}y$
(despite this notation, $\succsim_{V,q}$ is *not* necessarily a preference relation on $P$
because it is not transitive generally,
which is actually a well-known example of irrationality).

The **core** $\mathcal C(P,V,q)$ of the voting system is the set of all elements $x\in P$
such that $x$ does not have any defeat other than $x$ itself (a non-trivial defeat).

Pareto sets are common concepts in economics. To clarify, I also give the mathematical definition of them here.

Let $P$ be a set and $Q$ be a family of preference relations on $P$.
Then, $x\in P$ is called a (weak) **$Q$-Pareto improvement** of $y\in P$ if $\forall v\in Q:x\succeq_vy$,
denoted as $x\succsim_Qy$
(despite the notation, $\succsim_Q$ is *not* necessarily a preference relation on $P$).

The **Pareto set** $\mathcal P(P,Q)$ is the set of all elements $x\in P$
such that $x$ does not have any $Q$-Pareto improvement other than $x$ itself
(a non-trivial $Q$-Pareto improvement).

Here is the main result. For a voting system $(P,V,q)$,

\[\mathcal C(P,V,q)=\bigcap_{Q\subseteq V,\left|Q\right|=q}\mathcal P(P,Q).\]*Proof.*
To prove this, we need to show that
$x\in P$ does not have any non-trivial Pareto improvement for any $q$ voters iff
$x$ does not have any non-trivial defeat.

To prove the forward direction, suppose that $x\in P$ does not have any non-trivial Pareto improvement for any $q$ voters. Let $y\in P$ such that $y\ne x$, and the goal is to prove that $y$ is not a defeat of $x$.

Let

\[Y:=\left\{v\in V\,\middle|\,y\succeq_vx\right\}.\]Then, $y$ is a $Y$-Pareto improvement of $x$, so we have $\left|Y\right|<q$ (because otherwise there is a subset of $Y$ with $q$ voters for which $y$ is a Pareto improvement of $x$). Therefore, $y$ is not a defeat of $x$.

To prove the backward direction, suppose that $x\in P$ has a non-trivial $Q$-Pareto improvement, where $Q\subseteq V$ and $\left|Q\right|=q$. Denote the improvement as $y$. Let

\[Y:=\left\{v\in V\,\middle|\,y\succeq_vx\right\}.\]Because $y$ is a $Q$-Pareto improvement of $x$, we have $Q\subseteq Y$. Therefore, $\left|Y\right|\geq\left|Q\right|=q$. Therefore, $y$ is a defeat of $x$. $\square$

In particular, we have

\[\mathcal C\!\left(P,V,\left|V\right|\right)=\mathcal P(P,V).\]Here is an example. Suppose we have 5 voters, and the set of proposals is $\mathbb R^2$. Each voter has an ideal point and prefers points nearer to the ideal point. The 5 ideal points form a convex pentagon. Then we can find the core easily by the conclusion above.
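The theorem can also be verified by brute force on a small discrete example (a hypothetical one-dimensional setup, not the pentagon): 5 voters with single-peaked Euclidean preferences over integer points, and majority quota $q=3$.

```python
from itertools import combinations

P = range(11)                # proposals: integer points on a line
ideals = [1, 3, 5, 8, 9]     # hypothetical ideal points of the 5 voters
q = 3                        # majority quota

def prefers(v, x, y):
    # the voter with ideal point v weakly prefers x to y
    return abs(x - v) <= abs(y - v)

def is_defeat(x, y):
    # x is a defeat of y iff at least q voters weakly prefer x to y
    return sum(prefers(v, x, y) for v in ideals) >= q

# Core: proposals with no non-trivial defeat.
core = {x for x in P if not any(is_defeat(y, x) for y in P if y != x)}

def pareto(Q):
    # Pareto set of the voters in Q: no non-trivial Q-Pareto improvement.
    return {x for x in P
            if not any(all(prefers(v, y, x) for v in Q)
                       for y in P if y != x)}

intersection = set(P)
for Q in combinations(ideals, q):
    intersection &= pareto(Q)
```

Both sides come out as the singleton containing the median ideal point, consistent with Black's median voter theorem for single-peaked preferences.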

First, define the Lorenz curve: it is the curve that consists of all points $(u,v)$ such that the poorest $u$ portion of population in the country owns $v$ portion of the total wealth.

The Gini coefficient $G/\mu$ is defined as the area between the Lorenz curve and the line $u=v$ divided by the area enclosed by the three lines $u=v$, $v=0$, and $u=1$.

Now, suppose the wealth distribution in the country is $p(X)$, where $p\!\left(x\right)\mathrm dx$ is the portion of population that has wealth in the range $[x,x+\mathrm dx]$.

Then, the Lorenz curve is the graph of the function $g$ defined as

\[g(F(x))=\frac1\mu\int_{-\infty}^xtp\!\left(t\right)\mathrm dt,\]where

\[F\!\left(x\right):=\int_{-\infty}^xp\!\left(t\right)\mathrm dt\]is the cumulative distribution function of $p(X)$, and

\[\begin{equation} \label{eq: def mu} \mu:=\int_{-\infty}^{+\infty}tp\!\left(t\right)\mathrm dt \end{equation}\]is the average wealth of the population, which is just $\mathrm E\!\left[X\right]$ ($X$ is a random variable such that $X\sim p(X)$).

Then, the Lorenz curve is

\[v=g(u):=\frac1\mu\int_{-\infty}^{F^{-1}(u)}tp\!\left(t\right)\mathrm dt.\]According to the definition of the Gini coefficient,

\[\begin{align*} G&:=2\mu\int_0^1\left(u-g(u)\right)\mathrm du\\ &=\mu-2\mu\int_0^1g\!\left(u\right)\mathrm du\\ &=\mu-2\int_{u=0}^1\int_{t=-\infty}^{F^{-1}(u)}tp\!\left(t\right)\mathrm dt\,\mathrm du. \end{align*}\]Interchange the order of integration, and we have

\[\begin{align*} G&=\mu-2\int_{t=-\infty}^{+\infty}\int_{u=F(t)}^1tp\!\left(t\right)\mathrm du\,\mathrm dt\\ &=\mu-2\int_{-\infty}^{+\infty}\left(1-F(t)\right)tp\!\left(t\right)\mathrm dt. \end{align*}\]Substitute Equation \ref{eq: def mu} into the above equation, and we have

\[\begin{align*} G&=\int_{-\infty}^{+\infty}2tF\!\left(t\right)p\!\left(t\right)\mathrm dt-\mu\\ &=\int_{-\infty}^{+\infty}\left(2F\!\left(t\right)-1\right)tp\!\left(t\right)\mathrm dt\\ &=\int_0^1\left(2u-1\right)F^{-1}\!\left(u\right)\mathrm du. \end{align*}\]Now here is the neat part. Separate it into two parts, and write them in double integrals:

\[\begin{align*} G&=\int_0^1uF^{-1}\!\left(u\right)\mathrm du-\int_0^1\left(1-u\right)F^{-1}\!\left(u\right)\mathrm du\\ &=\int_{u_2=0}^1\int_{u_1=0}^{u_2}F^{-1}\!\left(u_2\right)\mathrm du_1\,\mathrm du_2 -\int_{u_1=0}^1\int_{u_2=u_1}^1F^{-1}\!\left(u_1\right)\mathrm du_2\,\mathrm du_1. \end{align*}\]Interchange the order of integration of the second term, and we have

\[\begin{align*} G&=\int_{u_2=0}^1\int_{u_1=0}^{u_2}\left(F^{-1}\!\left(u_2\right)-F^{-1}\!\left(u_1\right)\right)\mathrm du_1\,\mathrm du_2\\ &=\frac12\int_{u_2=0}^1\int_{u_1=0}^1\left|F^{-1}\!\left(u_2\right)-F^{-1}\!\left(u_1\right)\right|\mathrm du_1\,\mathrm du_2\\ &=\frac12\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}\left|x_2-x_1\right|p\!\left(x_1\right)p\!\left(x_2\right)\mathrm dx_1\,\mathrm dx_2\\ &=\frac12\mathrm E\!\left[\left|X_2-X_1\right|\right], \end{align*}\]where $X_1$ and $X_2$ are two independent random variables with $p$ being their respective distribution functions: $\left(X_1,X_2\right)\sim p\!\left(X_1\right)p\!\left(X_2\right)$.

By this result, we can easily see how the Gini coefficient represents the statistical dispersion.
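As a numerical check of $G=\frac12\mathrm E\!\left[\left|X_2-X_1\right|\right]$, take a hypothetical choice of distribution: the exponential one, $p(x)=\mathrm e^{-x}$ for $x\ge0$, for which $F^{-1}(u)=-\ln\left(1-u\right)$ and both sides equal $1/2$.

```python
import math
import random

# Left side: G = ∫₀¹ (2u − 1) F⁻¹(u) du with F⁻¹(u) = -ln(1 − u),
# approximated by a midpoint Riemann sum.
n = 100_000
du = 1.0 / n
G = 0.0
for k in range(n):
    u = (k + 0.5) * du
    G += (2 * u - 1) * -math.log(1 - u)
G *= du

# Right side: ½ E|X₂ − X₁| by Monte Carlo over i.i.d. exponential pairs.
random.seed(0)
pairs = ((random.expovariate(1.0), random.expovariate(1.0)) for _ in range(n))
half_mean_abs_diff = sum(abs(x2 - x1) for x1, x2 in pairs) / (2 * n)
```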

We can apply similar tricks to the variance $\sigma_X^2$.

\[\begin{align*} \sigma_X^2&=\mathrm E\!\left[X^2\right]-\mathrm E\!\left[X\right]^2\\ &=\int_{-\infty}^{+\infty}t^2p\!\left(t\right)\mathrm dt -\left(\int_{-\infty}^{+\infty}tp\!\left(t\right)\mathrm dt\right)^2\\ &=\int_0^1F^{-1}\!\left(u\right)^2\,\mathrm du -\left(\int_0^1F^{-1}\!\left(u\right)\mathrm du\right)^2. \end{align*}\]Separate the first term into two halves, and write the resulting three terms as double integrals:

\[\begin{align*} \sigma_X^2&=\frac12\int_0^1F^{-1}\!\left(u_2\right)^2\,\mathrm du_2\int_0^1\mathrm du_1\\ &\phantom{=~}{}-\int_0^1F^{-1}\!\left(u_1\right)\mathrm du_1\int_0^1F^{-1}\!\left(u_2\right)\mathrm du_2\\ &\phantom{=~}{}+\frac12\int_0^1F^{-1}\!\left(u_1\right)^2\,\mathrm du_1\int_0^1\mathrm du_2\\ &=\frac12\int_0^1\int_0^1 \left(F^{-1}\!\left(u_2\right)^2-2F^{-1}\!\left(u_1\right)F^{-1}\!\left(u_2\right)+F^{-1}\!\left(u_1\right)^2\right) \mathrm du_1\,\mathrm du_2\\ &=\frac12\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} \left(x_2-x_1\right)^2p\!\left(x_1\right)p\!\left(x_2\right)\mathrm dx_1\,\mathrm dx_2\\ &=\frac12\mathrm E\!\left[\left(X_2-X_1\right)^2\right]. \end{align*}\]Then we can derive the relationship between the Gini coefficient and the variance:

\[2\sigma_X^2-4G^2=\sigma_{\left|X_2-X_1\right|}^2.\]Personally, I often need to handwrite math/physics notes, but an annoying fact about this is that I usually cannot distinguish well enough all the letters that may possibly be used.

In this article, I will try to address this problem.

**This article does not involve calligraphy,
and I have never formally learned calligraphy.**

This article is also written to give some warning and advice to those who
**read letters arbitrarily without actually recognizing them,
and thus (unintentionally) mislead others**.

Here is a full list of different styles **except for their bold counterparts**:

Style name | $\LaTeX$ command | Example |
---|---|---|
Roman | `\mathrm` | $\mathrm{ABC}$ |
Italic | `\mathit` | $\mathit{ABC}$ |
Blackboard | `\mathbb` | $\mathbb{ABC}$ |
Calligraphic | `\mathcal` | $\mathcal{ABC}$ |
Script | `\mathscr` | $\mathscr{ABC}$ |
Fraktur | `\mathfrak` | $\mathfrak{ABC}$ |
Sans-serif | `\mathsf` | $\mathsf{ABC}$ |
Typewriter | `\mathtt` | $\mathtt{ABC}$ |

We are **not** going to distinguish all the letters and all the styles.

I will try to find a handwriting style that satisfies the following conditions (in descending order of importance):

- I am able to write them fast and simply.
- I am able to recognize each character at a glance.
- The style is consistent for all letters.
- The shape is similar to the default mathematical font of $\LaTeX$ (Computer Modern).
- If the last condition cannot be satisfied, the shape is similar to some style that ever existed.

The reason for the 2nd condition being ranked below the 1st is that the efficiency of taking notes should not be too low, and that one may often distinguish letters and styles from context.

If a style fails to satisfy the 5th or the 4th condition
(i.e. the style is invented by me),
I will add an exclamation mark (**!**) to inform you of this.

The following lists all the letters and the styles that I want to distinguish:

- Digits: *0*, *1*, *2*, *3*, *4*, *5*, *6*, *7*, *8*, *9* (they are not letters, but they deserve distinguishing).
- Roman style of uppercase English letters: *A*, *B*, *C*, *D*, *E*, *F*, *G*, *H*, *I*, *J*, *K*, *L*, *M*, *N*, *O*, *P*, *Q*, *R*, *S*, *T*, *U*, *V*, *W*, *X*, *Y*, *Z*.
- Italic style of uppercase English letters: *A*, *B*, *C*, *D*, *E*, *F*, *G*, *H*, *I*, *J*, *K*, *L*, *M*, *N*, *P*, *Q*, *R*, *S*, *T*, *U*, *V*, *W*, *X*, *Y*, *Z* (not including *O*).
- Roman style of lowercase English letters: *a*, *b*, *c*, *d*, *e*, *f*, *g*, *h*, *i*, *j*, *k*, *l*, *m*, *n*, *o*, *p*, *q*, *r*, *s*, *t*, *u*, *v*, *w*, *x*, *y*, *z*.
- Italic style of lowercase English letters: *a*, *b*, *c*, *d*, *e*, *f*, *g*, *h*, *i*, *j*, *k*, *l*, *m*, *n*, *p*, *q*, *r*, *s*, *t*, *u*, *v*, *w*, *x*, *y*, *z* (not including *o*).
- Roman style of uppercase Greek letters: *Gamma*, *Delta*, *Theta*, *Lambda*, *Xi*, *Pi*, *Sigma*, *Upsilon*, *Phi*, *Psi*, *Omega* (not including any letters that cannot be distinguished from uppercase English letters).
- Italic style of lowercase Greek letters: *alpha*, *beta*, *gamma*, *delta*, *epsilon*, *zeta*, *eta*, *theta*, *iota*, *kappa*, *lambda*, *mu*, *nu*, *xi*, *pi*, *rho*, *sigma*, *tau*, *upsilon*, *phi*, *chi*, *psi*, *omega* (not including *omicron*).
- Blackboard bold style of uppercase English letters: *A*, *B*, *C*, *D*, *E*, *F*, *G*, *H*, *I*, *J*, *K*, *L*, *M*, *N*, *O*, *P*, *Q*, *R*, *S*, *T*, *U*, *V*, *W*, *X*, *Y*, *Z*.
- Calligraphic style of uppercase English letters: *A*, *B*, *C*, *D*, *E*, *F*, *G*, *H*, *I*, *J*, *K*, *L*, *M*, *N*, *O*, *P*, *Q*, *R*, *S*, *T*, *U*, *V*, *W*, *X*, *Y*, *Z*.
- Script style of uppercase English letters: *A*, *B*, *C*, *D*, *E*, *F*, *G*, *H*, *I*, *J*, *K*, *L*, *M*, *N*, *O*, *P*, *Q*, *R*, *S*, *T*, *U*, *V*, *W*, *X*, *Y*, *Z*.
- Fraktur style of uppercase English letters: *A*, *B*, *C*, *D*, *E*, *F*, *G*, *H*, *I*, *J*, *K*, *L*, *M*, *N*, *O*, *P*, *Q*, *R*, *S*, *T*, *U*, *V*, *W*, *X*, *Y*, *Z*.
- Fraktur style of lowercase English letters: *a*, *b*, *c*, *d*, *e*, *f*, *g*, *h*, *i*, *j*, *k*, *l*, *m*, *n*, *o*, *p*, *q*, *r*, *s*, *t*, *u*, *v*, *w*, *x*, *y*, *z*.

In terms of linguistic terminology, each entry in the above list is a grapheme in my handwritten notes. However, in extreme cases, even though I actively avoid it, it is still possible that two graphemes are indistinguishable. For those graphemes, I will design allographs to provide extra distinguishability in extreme cases.

Here are some of the general rules that I set up:

- We do not write any serif unless it is a must for distinguishing letters. (This is also why I did not plan to distinguish sans-serif styles.)
- The roman style of all English letters does not have tails (either ornamental or used for ligatures in connected writing).
- For both roman and italic styles, all uppercase letters (both English and Greek) have the same position of bottom and top.

For other details, look at this image:

In italic style, the slanted line in the right side of *A* is nearly vertical.
Actually, in the italic style of uppercase letters,
almost all top-left-to-bottom-right slanted lines are nearly vertical.

To write conveniently, use the single-story glyph of *a* even for its roman style.

The difference of the glyph of *alpha* and that of *a* should be noticeable.

*C* and *c* are tricky because it is very hard to distinguish roman and italic styles for them,
but we have to because they are very commonly used.
We need to be careful when writing and recognizing them.

Roman style of *C* is largely vertically symmetrical,
while the italic style of *C* is not.
In the italic style of *C*, the top endpoint of the stroke is to the right of the bottom endpoint,
and the left-most position on the stroke is below the center
instead of being at the same level as the center.

The opening direction of the roman style of *c* is to the right,
while that of the italic style is to the top-right.

(I once tried using ornamental tails to distinguish the italic style of *c* from the roman style,
but it would make them look strange and possibly get confused with other letters.)

At first, I did not want to distinguish the roman and italic styles of *c*,
but I found that it is useful to distinguish them.
For example, sometimes we use $a,b,c$ for indices, so the italic style of *c* may be used as an index;
meanwhile, we may use roman style of *c* to represent “center”
so that we can express the position of the center as $\mathbf r_\mathrm c$.
In both cases, the letter *c* appears in the position of a subscript,
but they need to be distinguished from each other.

I want to talk about *sigma* here because in Greek, its final form $\varsigma$ looks very similar to *c*.
Just do not use that glyph for *sigma*.

It is important to distinguish the roman and italic styles of *e*
because we may use $\mathrm e$ for the base of natural logarithm
and use $e$ for the electric charge of a proton.

At the turning point of the stroke at the center-right of the glyph,
the roman style of *e* is sharp while the italic style is round.
This detail is enough to distinguish them.

The roman style of *f* is not a descender
while the italic style is a descender.
Also, the italic style of *f* has a left-tail in the bottom.

To make writing convenient, the roman style of *g* uses the single-story glyph.
This makes it hard to distinguish it from the italic style,
but we may write the descender of the italic style of *g* in an exaggerated way
to distinguish them.

Here we are at the only extreme case where multiple graphemes share the same glyph:
*1*, roman style of *I*, and roman style of *l*.
They are all simply a vertical line.

Normally we should be able to distinguish them by their context,
but in some cases we need to distinguish them clearly.
We may add some small turnings at the top and bottom of *I* to distinguish it from *l*.
It is like we are trying to write the serifs of *I* but we write so fast that they are connected
and look like small turnings.

A small sharp turning may be added at the top of *1* to distinguish it from *l*.

The italic style of *i* has two tails (one left-tail in the middle and one right-tail in the bottom).
It looks exactly the same as *iota* except for the dot at the top.

In both the roman and italic styles of *K*,
the endpoint of the stroke branch of *K* at the top-right is approximately at the same level
as the top endpoint of the vertical line at the left.

The slant of the left vertical line should be enough to distinguish the italic style of *K*
from the roman style,
but we may also add a small tail at the bottom-right to distinguish them further.
Do not worry about confusing with *kappa* because we have other ways to distinguish it.

In the italic style of *k*, the top-right stroke branch is written as a closed circle.
This makes it easier to distinguish from *K* and *kappa*.

*kappa* is shorter than *K* and *k*.
The bottom-right stroke branch is written in shape of an inclined mirrored S-curve
to distinguish from *K* and *k*.
The endpoint of the stroke branch of *kappa* at the top-right is approximately at the same level
as the top endpoint of the vertical line at the left.

In the italic style of *M*, the bottom is wider than the top,
while in the roman style, the top is as wide as the bottom.
Write *M* in four strokes to distinguish it from *mu*.

As for *mu*, note that the bottom-left corner is a descender,
while other parts are not.

These are the most cursed characters, even more so than *1*, *I*, and *l*.
They are so cursed that I refuse to distinguish the roman style of *O* and *o* from the italic style,
and I refuse to use the italic style of *O* and *o* in my handwritten notes.

The digit *0* is narrower than *O* and *o*.

Just avoid using *omicron* because it is indistinguishable from *o*.

Write the italic style of *p* in two strokes,
and it has two left-tails, one at the top-left and one at the bottom-left.

Write *rho* in one stroke.
Starting the stroke from below the baseline (at the bottom of the descender) is recommended.

In the italic style of *Q*, the last stroke looks like a tilde.
It is straight for the roman style.

The italic style of *q* has a sharp right-tail in the bottom.

OK, this is important.

Every physicist must have met at least one person who mistakenly recognized *nu* as *v*.

The roman style of *r* does not have tails
(the arc at the top-right does not count as a tail).
The italic style of *r* has a left-tail at the top-left and a right-tail at the top-right.
The downward part and the upward part of the stroke overlap at the bottom
to distinguish it from *v*.

The italic style of *u* has a left-tail at the top-left and a right-tail at the bottom-right.
The tail at bottom-right distinguishes it from *v*.

The italic style of *v* has a left-tail at the top-left and a left-tail at the top-right.
The tail at the top-right may be omitted because it is not very noticeable.
The bottom of both the roman style and italic style of *v* is a sharp turning.

The top-left of *gamma* is curvy while the top-right is straight.
The letter is also a descender, so make its bottom lower than the baseline.

The left of *nu* is a vertical line.
The right of *nu* is like a broken line (**!**).
The left and right parts are tangent to each other at the bottom but separates quickly (**!**).

Both the top-left and top-right of *Upsilon* are curvy.
It is thus different from *r* or *gamma*.

The letter *upsilon* is not commonly used.
If it is used, its bottom is round instead of being sharp,
to distinguish it from the italic style of *v*.

They are cursed, but not as cursed as *O* and *o*.

In the italic style of *S* and *s*, the bottom-left is to the left of the top-left.
In the roman style, the bottom-left and the top-left are aligned instead.

The roman style of *t* is a straight cross (no curvy strokes)
to distinguish it from the italic style.

The horizontal stroke of multiple *f*’s and *t*’s may be connected (ligature).
Note that they may only be connected if they are intended to form a word.
If they are written together just for mathematical multiplication,
there should not be a ligature.

The bottom of *tau* may either turn to the right or stop straight.
I prefer it turning to the right.

To distinguish from cup (the symbol for set union), add a vertical line at the right of the glyph (for both the roman and italic styles), but the italic style of it does not have a tail.

Just like how many people mistakenly recognize *nu* as *v*,
many people also mistakenly recognize *omega* as *w*.

The top-left and top-right of *w* are the same as those of *v*
for both roman and italic styles.

The letter *W* is not the same as an upside-down *M*.
For both roman and italic styles, the top of *W* is wider than the bottom.

There is a right-tail at the top-left of *omega*.
The bottom of *omega* is round instead of being sharp.

There are not as many people who mistakenly recognize *chi* as *x* as there are for *nu* and *omega*,
but there are still many.

It is a little hard to distinguish the roman and italic styles of *X*.
First, the top-right-to-bottom-left stroke of *X* is longer in the italic style
to convey the feeling of a slant.
Also, in the italic style, the top-left of *X* is to the left of the bottom-left.
These should be enough to distinguish it from the roman style.
Note that the italic style of *X* is a little different from the italic styles of other letters
in that the top-left-to-bottom-right stroke is not nearly vertical
(because otherwise it would look strange).

The italic style of *x* has a left-tail at the top-left and a right-tail at the bottom-right.
The bottom-left and the top-right do not have tails
(for convenience).
Write *x* as a cross instead of two C-curves tangent to each other
(I know some people write it like that).

The top-left of *chi* has a left-tail, and the bottom-right has a right-tail.
The bottom-left of *chi* has a right-tail (**!**), which is the main feature to distinguish it from *x*.
Also, note that *chi* is a descender, and the intersection of the two strokes is at the baseline.

Write *Y* in three strokes.

Write the roman style of *y* in two strokes,
both of which are straight.
The italic style of *y* is the same as the italic style of *u*
but the tail at the bottom-right is changed into a descender like that of *g*.

Some people add a short stroke in the middle of *z* (I used to do that)
or add a descender at the bottom like that of *g*
to distinguish it from *2*.
I use neither of them because the sharp turning corner at the top-right of *z*
is enough to distinguish it from *2*.

The top and bottom of *Z* are aligned in the roman style,
but the top is a little bit offset to the left of the bottom in the italic style.

The bottom of the italic style of *z* is written like a tilde.

In Greek, there are two glyphs for *epsilon*,
one of which is called the lunate epsilon or the uncial epsilon $\epsilon$,
and the other $\varepsilon$ does not have a name, but I like to call it varepsilon
(because the command for the glyph in $\LaTeX$ is `\varepsilon`).

Use varepsilon. Never use the lunate epsilon because it can be confused with the set membership symbol.

Write *Theta* as wide as *O*, and do not make the stroke in the middle touch either side.
Tilt *theta* a bit.
Because we do not use italic uppercase Greek letters and roman lowercase Greek letters,
*Theta* and *theta* should be distinguishable enough.

I have never imagined someone would write *Omega* that looks very similar to *Lambda*,
but there are people like that.
They are very different! OK?

In Greek, there are two glyphs for *phi*,
the loopy / open one $\varphi$ and the stroked / closed one $\phi$.
Just stick to the loopy one and forget about the stroked one so that we can distinguish it from *Phi*.

Some sources say that we should use the stroked one for the golden ratio. Just forget about that. I never use the letter to represent the golden ratio.
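The distinction in a minimal snippet (standard commands `\varphi`, `\phi`, and `\Phi`):

```latex
% Using only the loopy form keeps lowercase phi distinct from uppercase Phi.
$\varphi$  % loopy / open phi (preferred)
$\phi$     % stroked / closed phi (avoid)
$\Phi$     % uppercase Phi
```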

The tops of the two strokes of *Psi* are at the same level.

The top of the middle stroke of *psi* is a little bit higher than the top of the other stroke.
There is a left-tail at the top-left of *psi*.
There is a left-tail at the bottom (descender) of *psi* (**!**).

We only need to write blackboard style for uppercase English letters. Generally, we just add one or two strokes to the roman style of the letters to make them blackboard style. The general rules are as follows:

- If there are multiple vertical strokes, add a vertical stroke next to each of them, and we are done.
- Otherwise, if there is a non-horizontal stroke that starts from the top-left, add a stroke next to it.
- Otherwise, if the leftmost stroke is a curve that spans from top to bottom, add a vertical stroke on the inside, next to the leftmost part of the curve.
- Otherwise, this is a special letter!

There are some special letters as well as some exceptions to the general rules listed below.

Add a stroke next to the leftmost stroke.

It does not contain a vertical stroke, but we regard the right part of the stroke as one vertical stroke.

Add two short vertical strokes to the inside of the leftmost curve and the rightmost curve.

It would be strange if we only added one additional stroke.
I want to add two to make it look like a double *V* (which, in fact, it should be).

Add a stroke next to the top-left stroke and a stroke next to the bottom stroke.

Add a stroke next to the middle part of the stroke.

Different from the roman style,
some uppercase letters in calligraphic and script styles are descenders.
The descenders are: *G*, *J*, *Q*, *Y*.
Some people possibly write *F*, *H* (less likely), *P*, and *Z* as descenders as well,
but I do not.

As for the details, I am tired of explaining each letter. Just look at the image above.

This is the trickiest style. You may think it is hard to write in the Fraktur style when you look at how $\LaTeX$'s default typeface renders it. Actually, it indeed is, but that typeface is not intended for handwriting. I recommend writing the letters as shown here (ignore the final line because we do not need it):
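If you want to see the rendering being referred to, this snippet prints the Fraktur alphabet with the standard `\mathfrak` command (it requires the `amssymb` package):

```latex
% Fraktur capitals and lowercase as LaTeX's default math fonts render them.
\usepackage{amssymb}
% ...
$\mathfrak{A}\,\mathfrak{B}\,\mathfrak{C}\,\cdots\,\mathfrak{Z}$
$\mathfrak{a}\,\mathfrak{b}\,\mathfrak{c}\,\cdots\,\mathfrak{z}$
```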

They look very distinguishable.

Xiaoming was never fond of using a scratch notebook. He had always felt that the notebook recorded his failures: the writing in it was the trace left behind after he surrendered to his "inferior" mental arithmetic, and every obvious derivation written there was an insult to his ability. Yet although he considered doing scratch work "the behavior of the weak" (his exact words), he did not begrudge himself the chance to be weak: he still frequently had no choice but to do scratch work.

With the notebook open, he always liked to do his scratch work on the right-hand page. After he had written on every sheet of the notebook, he would turn it upside down and use it through again.

Occasionally I would see Xiaoming hurriedly tear two sheets out of his scratch notebook and leave. One day, after he did this again and came back, I asked him where he had gone. He told me he had gone to the toilet. I asked him why he had torn pages from his notebook. He told me that one day, when he urgently needed to relieve himself, he suddenly found he had run out of tissues, and in desperation he used paper from his scratch notebook instead. Later, he figured the used notebook would not sell for much anyway, and the restroom wastebasket was as good a destination for it as any. From then on, whenever he went, he would use two sheets of notebook paper written on both sides, even though they were a bit too smooth.

I found this both embarrassing and amusing. Later, when Xiaoming came back from the restroom, I would sometimes ask him what inspirations had been recorded on the paper he had just used. He would then tell me about a physics problem and the "stupid" approach he had taken in that scratch work. Sometimes I could get some insight from this, but he always said that what was written in his scratch work was uninspired and worthless.

There are two more amusing stories about Xiaoming's scratch notebooks. In our math class, the teacher would from time to time give us some "thinking problems" to work on after class. As everyone knows, a "thinking problem" is short for "a problem ordinary people cannot think their way through". Many students were intimidated by them and simply gave up. But to Xiaoming these problems were not hard; after class he would always spend a few minutes briefly jotting down his ideas in his scratch notebook.

One day in math class, the teacher had us work in groups of five on several problems on the blackboard. Xiaoming and I happened to be in the same group. The problems were difficult, and the rest of us all said we had no idea how to start. Feeling everyone's eyes converge on him, Xiaoming glanced at the blackboard, said these were all thinking problems we had done before, and went back to studying on his own in the corner. So the four of us besides Xiaoming took his two scratch notebooks, two people per notebook, one reading right side up and the other upside down, and went through both of them cover to cover. We eventually found most of the problems and, following Xiaoming's ideas, worked them out, leaving only one problem on the blackboard unsolved.

That class, our group had a glorious record. Of the other groups, some solved only one problem, and some solved none at all.

Afterwards, out of curiosity, I asked Xiaoming where he had worked out the remaining problem. He said it should originally have been in the notebook too, but had probably ended up in the restroom wastebasket, because he remembered once using paper with a math thinking problem on it. I then asked why he did the thinking problems on scratch paper. He shot back: "Surely you don't expect me to start a dedicated notebook just for the thinking problems?"

On another day, I needed the toilet during PE class. As soon as class ended, I was the first to run back to the classroom to grab tissues before heading to the toilet. But I suddenly found that my tissues were used up, there was no one in the classroom to lend me any, and none were in sight that I could scrounge. At that moment, Xiaoming's scratch notebook lying on his desk caught my eye. I rushed over, tore out two sheets, and dashed to the toilet blushing.

Squatting down, I airdropped supplies into the pit while looking at the scratch paper in my hand. Partly because I had run too hard, partly because bringing a boy's belongings in here gave me an inexplicable shame, and partly because I was squatting in a small, quiet space, I could hear my heartbeat clearly.

Before my embarrassment could fade, I noticed that some of the content on the scratch paper stirred up a memory. I had once wanted to know the proof of a mathematical proposition, and Xiaoming had written it out in his scratch notebook to show me. Months had passed, and I had long since forgotten the proof. The paper in my hand seemed to reproach me for forgetting. I felt that this scratch paper had its value; but sadly, within a few minutes it would go into the wastebasket behind me. The contradiction forced me into a decision: to memorize the proof within those few minutes.

A few minutes later, I had finally memorized the proof. Even so, when I used the paper to wipe away what remained on me, I still felt I was desecrating something sacred. This inclined me to finish quickly, with no time to worry whether its smooth surface was enough to carry everything away. When I was done, I hurried out of the toilet and back to the classroom. I noticed my heartbeat still had not settled.

When I got back, Xiaoming was already sitting in the seat next to mine. Before I could say anything, he asked me why my face was so red.

"The girls' toilet has never radiated such a glow of wisdom as it does now."

"Did you leave your clever little brain in there?"

I ignored him. I had to write down the proof I had just memorized, and quickly. But evidently my silence piqued his curiosity about my urgent task. He watched me; I panicked; halfway through the proof I forgot the rest, which made me panic more, which made it even harder to remember. My brow knitted in frustration.

"Silly, this part is wrong. I remember writing it out for you..."

His ill-timed coaching made me feel belittled, though I should not have read that meaning into it.

"I clearly remember writing it somewhere. Now where did it go..." came the sound of Xiaoming flipping through his scratch notebook.

Suddenly, he fell silent. He looked at me. I felt unbearably awkward, and, out of some emotion I could not name, I let out a whimper. Feeling I was about to lose my composure, I hurried out of the classroom.

I looked at the scenery outside the window. The weather was fine that day.

After a while, I returned to the classroom. On my desk lay an open scratch notebook, on which, in Xiaoming's fresh handwriting, was the proof I had been trying to write down.
