<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://thvinhtruong.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://thvinhtruong.github.io/" rel="alternate" type="text/html" /><updated>2026-05-21T16:14:11+00:00</updated><id>https://thvinhtruong.github.io/feed.xml</id><title type="html">My TU Brain</title><subtitle>A place for my notes and thoughts.</subtitle><author><name>Vinh Truong</name></author><entry><title type="html">Neural Networks from First Principles</title><link href="https://thvinhtruong.github.io/2026/05/21/neural-networks-from-first-principles/" rel="alternate" type="text/html" title="Neural Networks from First Principles" /><published>2026-05-21T03:00:00+00:00</published><updated>2026-05-21T03:00:00+00:00</updated><id>https://thvinhtruong.github.io/2026/05/21/neural-networks-from-first-principles</id><content type="html" xml:base="https://thvinhtruong.github.io/2026/05/21/neural-networks-from-first-principles/"><![CDATA[<p>A condensed walkthrough of the core math behind a feed-forward neural network, written for someone learning for the first time. Companion notes to the from-scratch Go implementation in this repo.</p>

<p>The whole network reduces to <strong>four ideas</strong>:</p>

<ol>
  <li>A neuron is a weighted sum + bias + non-linearity.</li>
  <li>A layer is the same thing in matrix form.</li>
  <li>A loss function turns a prediction into a single “how wrong” number.</li>
  <li>Backpropagation computes how to nudge every parameter to make that number smaller.</li>
</ol>

<p>That’s it. Everything else is engineering.</p>

<hr />

<h2 id="step-1--a-single-neuron">Step 1 — A single neuron</h2>

<p>A neuron takes inputs <code class="language-plaintext highlighter-rouge">x₁, …, xₙ</code>, multiplies each by a learned <strong>weight</strong> <code class="language-plaintext highlighter-rouge">wᵢ</code>, sums them, adds a <strong>bias</strong> <code class="language-plaintext highlighter-rouge">b</code>, then passes the result through a non-linear <strong>activation function</strong> <code class="language-plaintext highlighter-rouge">σ</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>z = w₁·x₁ + w₂·x₂ + … + wₙ·xₙ + b
a = σ(z)
</code></pre></div></div>

<ul>
  <li><code class="language-plaintext highlighter-rouge">z</code> is the <strong>pre-activation</strong>.</li>
  <li><code class="language-plaintext highlighter-rouge">a</code> is the <strong>activation</strong> — the neuron’s output.</li>
</ul>

<h3 id="why-the-activation-function">Why the activation function?</h3>

<p>Without <code class="language-plaintext highlighter-rouge">σ</code>, stacking layers would just be one big linear function — no matter how deep, the network could only learn straight-line relationships. The non-linearity is what lets the network learn curves, edges, digits, anything interesting.</p>

<p>Common choices:</p>

<ul>
  <li><strong>Sigmoid</strong>: <code class="language-plaintext highlighter-rouge">σ(z) = 1 / (1 + e⁻ᶻ)</code>; smooth, squashes to <code class="language-plaintext highlighter-rouge">(0, 1)</code>. The classic choice, but it saturates for large <code class="language-plaintext highlighter-rouge">|z|</code>, where its gradient approaches 0 and learning slows.</li>
  <li><strong>ReLU</strong>: <code class="language-plaintext highlighter-rouge">σ(z) = max(0, z)</code>; cheap to compute and works well in practice. The usual default for hidden layers.</li>
</ul>

<h3 id="tiny-example">Tiny example</h3>

<p>One neuron, 2 inputs:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">x = [1.0, 2.0]</code></li>
  <li><code class="language-plaintext highlighter-rouge">w = [0.5, -0.3]</code></li>
  <li><code class="language-plaintext highlighter-rouge">b = 0.1</code></li>
</ul>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>z = 0.5·1.0 + (-0.3)·2.0 + 0.1 = 0.0
a = ReLU(0.0) = 0.0
</code></pre></div></div>

<p>A <strong>layer</strong> is just many neurons in parallel, each with its own weights and bias, all reading the same input.</p>

<hr />

<h2 id="step-2--forward-pass-for-a-layer">Step 2 — Forward pass for a layer</h2>

<p>Writing neuron-by-neuron gets messy. Pack everything into matrices.</p>

<h3 id="setup">Setup</h3>

<p>3 neurons, 2 inputs. Each neuron has 2 weights and 1 bias.</p>

<p>Neuron-by-neuron:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>z₁ = w₁₁·x₁ + w₁₂·x₂ + b₁
z₂ = w₂₁·x₁ + w₂₂·x₂ + b₂
z₃ = w₃₁·x₁ + w₃₂·x₂ + b₃
</code></pre></div></div>

<p>Notation: <code class="language-plaintext highlighter-rouge">wᵢⱼ</code> = weight from input <code class="language-plaintext highlighter-rouge">j</code> into neuron <code class="language-plaintext highlighter-rouge">i</code>.</p>

<h3 id="same-thing-as-matrix-math">Same thing as matrix math</h3>

<p>Stack the weights into a matrix <code class="language-plaintext highlighter-rouge">W</code> (one row per neuron):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>W = [ w₁₁  w₁₂ ]     b = [ b₁ ]     x = [ x₁ ]
    [ w₂₁  w₂₂ ]         [ b₂ ]         [ x₂ ]
    [ w₃₁  w₃₂ ]         [ b₃ ]
</code></pre></div></div>

<p>The whole layer’s computation is:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>z = W·x + b           (shape: 3×1)
a = σ(z)              (element-wise)
</code></pre></div></div>

<p><strong>One matrix multiply + one vector add + one element-wise function = a full layer.</strong></p>

<h3 id="numerical-example">Numerical example</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>W = [ 0.5  -0.3 ]    b = [ 0.1 ]    x = [ 1.0 ]
    [ 0.2   0.8 ]        [ 0.0 ]        [ 2.0 ]
    [-0.1   0.4 ]        [ 0.2 ]
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>z₁ = 0.5·1.0 + (-0.3)·2.0 + 0.1 =  0.0
z₂ = 0.2·1.0 +   0.8·2.0  + 0.0 =  1.8
z₃ = -0.1·1.0 +  0.4·2.0  + 0.2 =  0.9

a = ReLU(z) = [0.0, 1.8, 0.9]
</code></pre></div></div>

<h3 id="chaining-layers">Chaining layers</h3>

<p>A network is just this, repeated:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>a⁽¹⁾ = σ(W⁽¹⁾·x    + b⁽¹⁾)
a⁽²⁾ = σ(W⁽²⁾·a⁽¹⁾ + b⁽²⁾)
a⁽³⁾ = σ(W⁽³⁾·a⁽²⁾ + b⁽³⁾)   ← final output
</code></pre></div></div>

<h3 id="shape-cheat-sheet">Shape cheat-sheet</h3>

<p>For layer <code class="language-plaintext highlighter-rouge">ℓ</code> with <code class="language-plaintext highlighter-rouge">nₗ</code> neurons reading <code class="language-plaintext highlighter-rouge">nₗ₋₁</code> inputs:</p>

<table>
  <thead>
    <tr>
      <th>object</th>
      <th>shape</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">W⁽ˡ⁾</code></td>
      <td><code class="language-plaintext highlighter-rouge">nₗ × nₗ₋₁</code></td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">b⁽ˡ⁾</code></td>
      <td><code class="language-plaintext highlighter-rouge">nₗ × 1</code></td>
    </tr>
    <tr>
      <td>input to layer</td>
      <td><code class="language-plaintext highlighter-rouge">nₗ₋₁ × 1</code></td>
    </tr>
    <tr>
      <td>output</td>
      <td><code class="language-plaintext highlighter-rouge">nₗ × 1</code></td>
    </tr>
  </tbody>
</table>

<p>For MNIST (28 × 28 = 784 input pixels, 10 digit classes) with one hidden layer of 128 neurons:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">W⁽¹⁾: 128 × 784</code>, <code class="language-plaintext highlighter-rouge">b⁽¹⁾: 128 × 1</code></li>
  <li><code class="language-plaintext highlighter-rouge">W⁽²⁾: 10 × 128</code>, <code class="language-plaintext highlighter-rouge">b⁽²⁾: 10 × 1</code></li>
</ul>
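<p>The shape rules also give the number of learnable parameters. The sketch below adds up <code class="language-plaintext highlighter-rouge">nₗ × nₗ₋₁</code> weights plus <code class="language-plaintext highlighter-rouge">nₗ</code> biases per layer; <code class="language-plaintext highlighter-rouge">paramCount</code> is a hypothetical helper, not part of the repo.</p>

```go
package main

import "fmt"

// paramCount returns the number of weights plus biases in a fully connected
// network whose layer sizes are listed from input to output.
func paramCount(sizes []int) int {
	total := 0
	for l := 1; l < len(sizes); l++ {
		// W has sizes[l] rows and sizes[l-1] columns; b has sizes[l] entries.
		total += sizes[l]*sizes[l-1] + sizes[l]
	}
	return total
}

func main() {
	// 784 inputs, 128 hidden neurons, 10 outputs.
	fmt.Println(paramCount([]int{784, 128, 10})) // prints 101770
}
```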

<hr />

<h2 id="step-3--loss-function">Step 3 — Loss function</h2>

<p>The forward pass gives us a prediction. The <strong>loss</strong> is a single number that says how wrong it was. Training = make this number small.</p>

<h3 id="part-a--softmax-raw-scores--probabilities">Part A — Softmax (raw scores → probabilities)</h3>

<p>For MNIST, the final layer outputs 10 raw scores, one per digit class; they can be positive, negative, or large. <strong>Softmax</strong> turns them into a probability distribution:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>softmax(zᵢ) = exp(zᵢ) / Σⱼ exp(zⱼ)
</code></pre></div></div>

<p>It exponentiates (making everything positive) and normalizes.</p>

<p>Example with 3 classes, <code class="language-plaintext highlighter-rouge">z = [2.0, 1.0, 0.1]</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>exp(2.0) = 7.389
exp(1.0) = 2.718
exp(0.1) = 1.105
sum      = 11.212

p = [7.389/11.212, 2.718/11.212, 1.105/11.212]
  = [0.659, 0.242, 0.099]
</code></pre></div></div>

<p>They sum to 1.0 — a real probability distribution.</p>

<blockquote>
  <p><strong>Implementation note</strong>: subtract <code class="language-plaintext highlighter-rouge">max(z)</code> before <code class="language-plaintext highlighter-rouge">exp</code> to avoid overflow. Same result mathematically (the constant cancels), but <code class="language-plaintext highlighter-rouge">exp(1000)</code> is <code class="language-plaintext highlighter-rouge">+Inf</code> otherwise.</p>
</blockquote>

<h3 id="part-b--one-hot-encoding-the-label">Part B — One-hot encoding the label</h3>

<p>The true class index becomes a vector with a 1 at the right slot, 0s elsewhere:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>true class = 0  →  y = [1, 0, 0]
true class = 2  →  y = [0, 0, 1]
</code></pre></div></div>

<p>Same shape as the prediction so we can compare directly.</p>
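<p>One-hot encoding is a one-liner in practice. The helper below is a hypothetical sketch that assumes the class index is within range.</p>

```go
package main

import "fmt"

// oneHot returns a vector of length numClasses with a 1 at index class
// and 0 everywhere else. It assumes 0 <= class < numClasses.
func oneHot(class, numClasses int) []float64 {
	y := make([]float64, numClasses)
	y[class] = 1
	return y
}

func main() {
	fmt.Println(oneHot(0, 3))
	fmt.Println(oneHot(2, 3))
}
```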

<h3 id="part-c--cross-entropy-loss">Part C — Cross-entropy loss</h3>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>L = -Σᵢ yᵢ · log(pᵢ)
</code></pre></div></div>

<p>Because <code class="language-plaintext highlighter-rouge">y</code> is one-hot, this collapses to:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>L = -log(p_correct)
</code></pre></div></div>

<p>where <code class="language-plaintext highlighter-rouge">p_correct</code> is the probability the network assigned to the true class.</p>

<p>Example, using the natural logarithm: the true class is 0 and <code class="language-plaintext highlighter-rouge">p_correct = 0.659</code> (from the softmax example above):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>L = -log(0.659) = 0.417
</code></pre></div></div>

<ul>
  <li>Confident-right (<code class="language-plaintext highlighter-rouge">p_correct = 0.95</code>): <code class="language-plaintext highlighter-rouge">L = 0.051</code> ← small, good</li>
  <li>Confident-wrong (<code class="language-plaintext highlighter-rouge">p_correct = 0.05</code>): <code class="language-plaintext highlighter-rouge">L = 2.996</code> ← huge, very bad</li>
</ul>

<h3 id="why-this-loss">Why this loss?</h3>

<ol>
  <li><strong>It punishes confident-wrong answers extremely hard.</strong> <code class="language-plaintext highlighter-rouge">-log(p)</code> shoots to infinity as <code class="language-plaintext highlighter-rouge">p → 0</code>. A network that’s 99% sure of the wrong answer pays a huge price.</li>
  <li><strong>The gradient is incredibly clean.</strong> Combine softmax + cross-entropy and the gradient w.r.t. the pre-softmax scores <code class="language-plaintext highlighter-rouge">z</code> is just:
    <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>∂L/∂z = p - y
</code></pre></div>    </div>
    <p><em>Predicted minus true.</em> This clean gradient is why softmax and cross-entropy are so often paired.</p>
  </li>
</ol>

<hr />

<h2 id="step-4--backpropagation">Step 4 — Backpropagation</h2>

<p>Goal: for every weight <code class="language-plaintext highlighter-rouge">W</code> and bias <code class="language-plaintext highlighter-rouge">b</code>, compute <code class="language-plaintext highlighter-rouge">∂L/∂W</code> and <code class="language-plaintext highlighter-rouge">∂L/∂b</code> — how the loss changes if we nudge that parameter — then step against the gradient:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>W ← W - η·∂L/∂W
b ← b - η·∂L/∂b
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">η</code> (eta) is the <strong>learning rate</strong>, typically a small number such as <code class="language-plaintext highlighter-rouge">0.01</code>.</p>
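<p>The update rule itself is only a few lines. This is a sketch under the assumption that the gradients <code class="language-plaintext highlighter-rouge">dW</code> and <code class="language-plaintext highlighter-rouge">db</code> have already been computed; <code class="language-plaintext highlighter-rouge">sgdStep</code> is a hypothetical name and the gradient values in <code class="language-plaintext highlighter-rouge">main</code> are made up.</p>

```go
package main

import "fmt"

// sgdStep applies W ← W − η·dW and b ← b − η·db in place.
// It assumes dW has the same shape as W, db the same length as b,
// and one bias per row of W.
func sgdStep(W, dW [][]float64, b, db []float64, eta float64) {
	for i := range W {
		for j := range W[i] {
			W[i][j] -= eta * dW[i][j]
		}
		b[i] -= eta * db[i]
	}
}

func main() {
	W := [][]float64{{0.5, -0.3}}
	b := []float64{0.1}
	dW := [][]float64{{1.0, -2.0}} // made-up gradient values
	db := []float64{0.5}
	sgdStep(W, dW, b, db, 0.01)
	fmt.Println(W, b)
}
```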

<h3 id="the-chain-rule-applied-backwards">The chain rule, applied backwards</h3>

<p>Loss depends on the output, which depends on the previous layer, which depends on the one before — chain rule lets us compute gradients piece by piece, walking <em>backward</em>.</p>

<p>The efficient algorithm for this is <strong>backpropagation</strong>.</p>

<h3 id="what-to-save-during-the-forward-pass">What to save during the forward pass</h3>

<p>For each layer <code class="language-plaintext highlighter-rouge">ℓ</code>:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">a⁽ˡ⁻¹⁾</code> — the input to that layer</li>
  <li><code class="language-plaintext highlighter-rouge">z⁽ˡ⁾</code>   — the pre-activation</li>
  <li><code class="language-plaintext highlighter-rouge">a⁽ˡ⁾</code>   — the activation</li>
</ul>

<p>These are the ingredients for the backward pass.</p>

<h3 id="the-recipe-3-lines-repeated-per-layer">The recipe (3 lines, repeated per layer)</h3>

<p>For each layer <code class="language-plaintext highlighter-rouge">ℓ</code>, going from output toward input:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>δ⁽ˡ⁾    = (W⁽ˡ⁺¹⁾)ᵀ · δ⁽ˡ⁺¹⁾  ⊙  σ'(z⁽ˡ⁾)        ← chain through the activation
∂L/∂W⁽ˡ⁾ = δ⁽ˡ⁾ · (a⁽ˡ⁻¹⁾)ᵀ
∂L/∂b⁽ˡ⁾ = δ⁽ˡ⁾
</code></pre></div></div>

<p>Symbols:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">δ⁽ˡ⁾</code> = <code class="language-plaintext highlighter-rouge">∂L/∂z⁽ˡ⁾</code>, the “error” at layer <code class="language-plaintext highlighter-rouge">ℓ</code>.</li>
  <li><code class="language-plaintext highlighter-rouge">⊙</code> = element-wise (Hadamard) product.</li>
  <li><code class="language-plaintext highlighter-rouge">σ'(z)</code> = derivative of the activation, e.g. <code class="language-plaintext highlighter-rouge">ReLU'(z) = 1 if z &gt; 0 else 0</code>.</li>
</ul>
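<p>The three-line recipe can be sketched in Go for a hidden layer with ReLU. The function names are hypothetical, the numbers in <code class="language-plaintext highlighter-rouge">main</code> are made up, and <code class="language-plaintext highlighter-rouge">ReLU'(0)</code> is taken to be 0, matching the definition above.</p>

```go
package main

import "fmt"

// reluPrime is the derivative of ReLU; the value at z == 0 is taken to be 0.
func reluPrime(z float64) float64 {
	if z > 0 {
		return 1
	}
	return 0
}

// hiddenDelta computes δ = (Wnextᵀ · δnext) ⊙ ReLU'(z) for one hidden layer.
// Wnext has one row per neuron of the next layer and one column per neuron
// of this layer.
func hiddenDelta(Wnext [][]float64, deltaNext, z []float64) []float64 {
	delta := make([]float64, len(z))
	for j := range z {
		sum := 0.0
		for i := range Wnext {
			sum += Wnext[i][j] * deltaNext[i] // column j of Wnext is row j of Wnextᵀ
		}
		delta[j] = sum * reluPrime(z[j])
	}
	return delta
}

// weightGrad computes ∂L/∂W = δ · inputᵀ, an outer product.
// ∂L/∂b needs no helper: it is δ itself.
func weightGrad(delta, input []float64) [][]float64 {
	g := make([][]float64, len(delta))
	for i := range delta {
		g[i] = make([]float64, len(input))
		for j := range input {
			g[i][j] = delta[i] * input[j]
		}
	}
	return g
}

func main() {
	Wnext := [][]float64{{1, 2}, {3, 4}}
	deltaNext := []float64{0.5, -1}
	z := []float64{0.7, -0.2} // the second neuron is inactive, so its δ is 0
	delta := hiddenDelta(Wnext, deltaNext, z)
	fmt.Println(delta)
	fmt.Println(weightGrad(delta, []float64{2, 3}))
}
```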

<h3 id="the-boundary-condition-where-backprop-starts">The boundary condition (where backprop starts)</h3>

<p>For the output layer, thanks to the softmax + cross-entropy identity from Step 3:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>δ⁽ᴸ⁾ = p − y
</code></pre></div></div>

<p>That’s the whole starting point. <em>Predicted minus true.</em> If the output layer uses a Linear (identity) activation (which is what we do, since softmax is applied by the loss), the formula above is consistent: <code class="language-plaintext highlighter-rouge">δ = ∂L/∂a ⊙ σ'(z) = (p − y) ⊙ 1</code>.</p>

<h3 id="full-backprop-for-a-2-layer-network">Full backprop for a 2-layer network</h3>

<p>Architecture: <code class="language-plaintext highlighter-rouge">x → [W⁽¹⁾, b⁽¹⁾, ReLU] → a⁽¹⁾ → [W⁽²⁾, b⁽²⁾, Linear] → z⁽²⁾ → softmax → p → L</code></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>δ⁽²⁾     = p - y
∂L/∂W⁽²⁾ = δ⁽²⁾ · (a⁽¹⁾)ᵀ
∂L/∂b⁽²⁾ = δ⁽²⁾

δ⁽¹⁾     = (W⁽²⁾)ᵀ · δ⁽²⁾  ⊙  ReLU'(z⁽¹⁾)
∂L/∂W⁽¹⁾ = δ⁽¹⁾ · xᵀ
∂L/∂b⁽¹⁾ = δ⁽¹⁾
</code></pre></div></div>

<p>Then update everything:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>W⁽ˡ⁾ ← W⁽ˡ⁾ - η·∂L/∂W⁽ˡ⁾
b⁽ˡ⁾ ← b⁽ˡ⁾ - η·∂L/∂b⁽ˡ⁾
</code></pre></div></div>

<p><strong>Important</strong>: compute all gradients first, <em>then</em> apply the updates. If you mutate <code class="language-plaintext highlighter-rouge">W⁽²⁾</code> before computing <code class="language-plaintext highlighter-rouge">δ⁽¹⁾</code>, the backward pass multiplies by the updated weights instead of the ones used in the forward pass, and the gradient is wrong.</p>

<h3 id="why-is-it-called-back-propagation">Why is it called “back” propagation?</h3>

<p>You <em>could</em> derive each gradient from scratch using the full chain rule, but you’d recompute the same intermediate products over and over. Walking backward from the output lets each layer reuse <code class="language-plaintext highlighter-rouge">δ⁽ˡ⁺¹⁾</code> to compute <code class="language-plaintext highlighter-rouge">δ⁽ˡ⁾</code>, so the whole backward pass costs only a small constant factor more than a single forward pass.</p>

<p>That efficiency is what made deep learning practical.</p>

<hr />

<h2 id="the-whole-training-loop-in-one-paragraph">The whole training loop in one paragraph</h2>

<p>For each training example <code class="language-plaintext highlighter-rouge">(x, y)</code>:</p>
<ol>
  <li><strong>Forward pass</strong>: walk the layers in order, compute <code class="language-plaintext highlighter-rouge">a⁽ˡ⁾</code>, and cache each layer’s input <code class="language-plaintext highlighter-rouge">xIn</code> (that is, <code class="language-plaintext highlighter-rouge">a⁽ˡ⁻¹⁾</code>) and pre-activation <code class="language-plaintext highlighter-rouge">z⁽ˡ⁾</code>.</li>
  <li><strong>Loss</strong>: <code class="language-plaintext highlighter-rouge">p = softmax(z_last)</code>, <code class="language-plaintext highlighter-rouge">L = -log(p_correct)</code>.</li>
  <li><strong>Backward pass</strong>: start with <code class="language-plaintext highlighter-rouge">δ = p − y</code>. Walk layers in reverse:
    <ul>
      <li>record <code class="language-plaintext highlighter-rouge">dW⁽ˡ⁾ = δ · xInᵀ</code>, <code class="language-plaintext highlighter-rouge">db⁽ˡ⁾ = δ</code></li>
      <li>propagate: <code class="language-plaintext highlighter-rouge">δ ← (W⁽ˡ⁾)ᵀ · δ ⊙ σ'(z⁽ˡ⁻¹⁾)</code></li>
    </ul>
  </li>
  <li><strong>Update</strong>: <code class="language-plaintext highlighter-rouge">W -= η·dW</code>, <code class="language-plaintext highlighter-rouge">b -= η·db</code> for every layer.</li>
</ol>
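<p>The four steps above can be sketched as one small Go program. Everything here is an assumption made for illustration: the <code class="language-plaintext highlighter-rouge">layer</code> type, its method names, the fixed starting weights, and the toy dataset (four 2-D points whose class equals the first input). It is not the repo’s implementation.</p>

```go
package main

import (
	"fmt"
	"math"
)

// layer is a hypothetical dense layer; relu=false means a linear output.
type layer struct {
	W                [][]float64
	b, xIn, z, delta []float64 // xIn, z and delta are caches
	relu             bool
}

// forward computes a = σ(W·x + b) and caches xIn and z.
func (l *layer) forward(x []float64) []float64 {
	l.xIn, l.z = x, make([]float64, len(l.W))
	a := make([]float64, len(l.W))
	for i, row := range l.W {
		l.z[i] = l.b[i]
		for j, w := range row {
			l.z[i] += w * x[j]
		}
		if a[i] = l.z[i]; l.relu && a[i] < 0 {
			a[i] = 0
		}
	}
	return a
}

// backward turns ∂L/∂a into δ = ∂L/∂a ⊙ σ'(z), caches δ, and returns Wᵀ·δ.
func (l *layer) backward(grad []float64) []float64 {
	l.delta = grad
	back := make([]float64, len(l.xIn))
	for i, row := range l.W {
		if l.relu && l.z[i] <= 0 {
			l.delta[i] = 0 // ReLU'(z) is 0 for inactive neurons
		}
		for j, w := range row {
			back[j] += w * l.delta[i]
		}
	}
	return back
}

// step applies W -= η·dW and b -= η·db, with dW = δ·xInᵀ and db = δ.
func (l *layer) step(eta float64) {
	for i, row := range l.W {
		for j := range row {
			row[j] -= eta * l.delta[i] * l.xIn[j]
		}
		l.b[i] -= eta * l.delta[i]
	}
}

// trainStep runs forward, loss, backward, update; returns the pre-update loss.
func trainStep(net []*layer, a []float64, class int, eta float64) float64 {
	for _, l := range net { // a starts as the input x and ends as z_last
		a = l.forward(a)
	}
	maxZ, sum := a[0], 0.0
	for _, v := range a {
		maxZ = math.Max(maxZ, v)
	}
	grad := make([]float64, len(a))
	for i, v := range a {
		grad[i] = math.Exp(v - maxZ)
		sum += grad[i]
	}
	for i := range grad {
		grad[i] /= sum // grad now holds p = softmax(z_last)
	}
	loss := -math.Log(grad[class])
	grad[class] -= 1 // δ at the linear output layer is p − y
	for k := len(net) - 1; k >= 0; k-- {
		grad = net[k].backward(grad)
	}
	for _, l := range net { // update only after every δ is known
		l.step(eta)
	}
	return loss
}

func main() {
	hidden := &layer{W: [][]float64{{0.5, -0.3}, {0.2, 0.8}, {-0.1, 0.4}}, b: []float64{0.1, 0, 0.2}, relu: true}
	output := &layer{W: [][]float64{{0.3, -0.2, 0.5}, {-0.4, 0.1, 0.2}}, b: []float64{0, 0.1}}
	net := []*layer{hidden, output}
	xs, ys := [][]float64{{0, 0}, {0, 1}, {1, 0}, {1, 1}}, []int{0, 0, 1, 1}
	for epoch := 0; epoch < 300; epoch++ {
		total := 0.0
		for i, x := range xs {
			total += trainStep(net, x, ys[i], 0.1)
		}
		if epoch%100 == 0 {
			fmt.Printf("epoch %3d  mean loss %.4f\n", epoch, total/float64(len(xs)))
		}
	}
}
```

<p>In this sketch each layer applies its own <code class="language-plaintext highlighter-rouge">σ'(z)</code> inside <code class="language-plaintext highlighter-rouge">backward</code>, which is the same computation as the propagate step in the list, just grouped per layer. The printed mean loss should fall as the epochs go by.</p>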

<p>Repeat thousands of times. The network learns.</p>

<hr />

<h2 id="mental-picture">Mental picture</h2>

<ul>
  <li><strong>Forward pass</strong>: data flows in → predictions flow out.</li>
  <li><strong>Loss</strong>: measures the error.</li>
  <li><strong>Backward pass</strong>: error flows from output back through the network — at each layer we ask “how much did <em>you</em> contribute to this error?” That’s the gradient.</li>
  <li><strong>Update</strong>: every weight steps a tiny bit in the direction that reduces error.</li>
</ul>

<p>Everything else in modern deep learning (convolutions, attention, batch norm, optimizers like Adam) is variation and refinement on top of these four ideas.</p>

<hr />

<h2 id="glossary">Glossary</h2>

<table>
  <thead>
    <tr>
      <th>term</th>
      <th>meaning</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>weight <code class="language-plaintext highlighter-rouge">W</code></td>
      <td>learned parameter; multiplies an input</td>
    </tr>
    <tr>
      <td>bias <code class="language-plaintext highlighter-rouge">b</code></td>
      <td>learned offset added after the weighted sum</td>
    </tr>
    <tr>
      <td>pre-activation <code class="language-plaintext highlighter-rouge">z</code></td>
      <td>the linear output <code class="language-plaintext highlighter-rouge">W·x + b</code> before the non-linearity</td>
    </tr>
    <tr>
      <td>activation <code class="language-plaintext highlighter-rouge">a</code></td>
      <td>the output of a neuron, <code class="language-plaintext highlighter-rouge">σ(z)</code></td>
    </tr>
    <tr>
      <td>σ (sigma)</td>
      <td>activation function, e.g. ReLU or Sigmoid</td>
    </tr>
    <tr>
      <td>σ’</td>
      <td>derivative of the activation</td>
    </tr>
    <tr>
      <td>forward pass</td>
      <td>computing predictions layer by layer</td>
    </tr>
    <tr>
      <td>loss <code class="language-plaintext highlighter-rouge">L</code></td>
      <td>scalar measure of how wrong the prediction is</td>
    </tr>
    <tr>
      <td>softmax</td>
      <td>turns raw scores into a probability distribution</td>
    </tr>
    <tr>
      <td>one-hot</td>
      <td>label encoded as a vector with a single 1</td>
    </tr>
    <tr>
      <td>cross-entropy</td>
      <td>classification loss: <code class="language-plaintext highlighter-rouge">-log(p_correct)</code></td>
    </tr>
    <tr>
      <td>backprop</td>
      <td>algorithm for computing gradients by walking the chain rule backward</td>
    </tr>
    <tr>
      <td>δ (delta)</td>
      <td><code class="language-plaintext highlighter-rouge">∂L/∂z</code>, the error signal at a layer</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">∂L/∂W</code></td>
      <td>gradient of loss w.r.t. weights — tells us how to nudge W</td>
    </tr>
    <tr>
      <td>learning rate η</td>
      <td>step size for the gradient update</td>
    </tr>
    <tr>
      <td>SGD</td>
      <td>stochastic gradient descent: update the parameters after each example (or small mini-batch) rather than after a full pass over the data</td>
    </tr>
    <tr>
      <td>epoch</td>
      <td>one full pass through the training data</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="how-this-maps-to-the-code-in-this-repo">How this maps to the code in this repo</h2>

<table>
  <thead>
    <tr>
      <th>concept</th>
      <th>file</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>matrix math (Dot, Add, Hadamard)</td>
      <td><code class="language-plaintext highlighter-rouge">matrix/matrix.go</code></td>
    </tr>
    <tr>
      <td>activation + its derivative</td>
      <td><code class="language-plaintext highlighter-rouge">nn/activations.go</code></td>
    </tr>
    <tr>
      <td>forward pass, backward, SGD step</td>
      <td><code class="language-plaintext highlighter-rouge">nn/layer.go</code></td>
    </tr>
    <tr>
      <td>softmax + cross-entropy</td>
      <td><code class="language-plaintext highlighter-rouge">nn/loss.go</code></td>
    </tr>
    <tr>
      <td>MNIST loader (IDX format)</td>
      <td><code class="language-plaintext highlighter-rouge">mnist/mnist.go</code></td>
    </tr>
    <tr>
      <td>training loop</td>
      <td><code class="language-plaintext highlighter-rouge">cmd/train/main.go</code></td>
    </tr>
  </tbody>
</table>

<p>Run <code class="language-plaintext highlighter-rouge">go test ./...</code> — the tests verify the exact numerical examples above (Step 2’s <code class="language-plaintext highlighter-rouge">z = [0, 1.8, 0.9]</code>, Step 3’s <code class="language-plaintext highlighter-rouge">p ≈ [0.659, 0.242, 0.099]</code>, etc.). The XOR test in <code class="language-plaintext highlighter-rouge">nn/train_test.go</code> is a tiny end-to-end proof that forward + backward + update are all correct.</p>

<p>Run <code class="language-plaintext highlighter-rouge">go run ./cmd/train</code> to actually train on MNIST. Expect ~97% test accuracy after 3 epochs.</p>]]></content><author><name>Vinh Truong</name></author><category term="ml" /><category term="neural-networks" /><category term="math" /><category term="backprop" /><summary type="html"><![CDATA[A condensed walkthrough of the core math behind a feed-forward neural network, written for someone learning for the first time. Companion notes to the from-scratch Go implementation in this repo.]]></summary></entry><entry><title type="html">Welcome to my notes</title><link href="https://thvinhtruong.github.io/2026/05/21/welcome/" rel="alternate" type="text/html" title="Welcome to my notes" /><published>2026-05-21T02:00:00+00:00</published><updated>2026-05-21T02:00:00+00:00</updated><id>https://thvinhtruong.github.io/2026/05/21/welcome</id><content type="html" xml:base="https://thvinhtruong.github.io/2026/05/21/welcome/"><![CDATA[<p>This is my first note. Posts live in <code class="language-plaintext highlighter-rouge">_posts/</code> as markdown files named
<code class="language-plaintext highlighter-rouge">YYYY-MM-DD-title.md</code>.</p>

<h2 id="writing-a-new-note">Writing a new note</h2>

<ol>
  <li>Create <code class="language-plaintext highlighter-rouge">_posts/2026-05-22-my-topic.md</code></li>
  <li>Add frontmatter (title, date, tags)</li>
  <li>Write markdown</li>
  <li><code class="language-plaintext highlighter-rouge">git push</code> — GitHub Pages builds and deploys automatically</li>
</ol>

<h2 id="markdown-works-as-expected">Markdown works as expected</h2>

<ul>
  <li><strong>bold</strong>, <em>italic</em>, <code class="language-plaintext highlighter-rouge">inline code</code></li>
  <li><a href="https://example.com">links</a></li>
  <li>lists, tables, images</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">hello</span><span class="p">():</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"notes!"</span><span class="p">)</span>
</code></pre></div></div>

<blockquote>
  <p>Block quotes too.</p>
</blockquote>]]></content><author><name>Vinh Truong</name></author><category term="meta" /><category term="intro" /><summary type="html"><![CDATA[This is my first note. Posts live in _posts/ as markdown files named YYYY-MM-DD-title.md.]]></summary></entry></feed>