The twist with a sparse autoencoder: we make the hidden layer much wider than the input (e.g., 768 -> 6144), but force most hidden units to be zero. This means the network has more features than ...
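A minimal sketch of this idea, using NumPy with random weights standing in for trained parameters. The widths (768 -> 6144) come from the text; the choice of top-k as the sparsity mechanism and the value `k = 32` are assumptions for illustration (other sparsity penalties, such as an L1 term on activations, are also common):

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_hidden, k = 768, 6144, 32  # widths from the text; k active units is an assumption

# Hypothetical random weights standing in for trained parameters.
W_enc = rng.normal(0, 0.02, (d_in, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(0, 0.02, (d_hidden, d_in))
b_dec = np.zeros(d_in)

def encode(x):
    """Project up to the wide hidden layer, then keep only the top-k activations."""
    h = np.maximum(x @ W_enc + b_enc, 0.0)      # ReLU pre-activations
    thresh = np.partition(h, -k)[-k]            # k-th largest activation
    return np.where(h >= thresh, h, 0.0)        # zero out everything below it

def decode(h):
    """Reconstruct the input from the sparse hidden code."""
    return h @ W_dec + b_dec

x = rng.normal(size=d_in)
h = encode(x)           # wide but mostly-zero hidden code
x_hat = decode(h)       # reconstruction of x
```

With random weights the reconstruction is of course meaningless; in training, the reconstruction error (plus the sparsity constraint) is what forces the wide hidden layer to learn an overcomplete set of features of which only a few fire at a time.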