A baby can inherently recognize a new visual concept (e.g., chair, dog) after being shown only a few positive instances by parents or others, and this recognition capability then gradually improves as the baby explores and interacts with real instances in the physical world. In this work, we aim to build a computational model that interprets and mimics this baby learning process, based on prior knowledge modelling, exemplar learning, and learning with video contexts. The prior knowledge of a baby, inherited through genes or accumulated from life experience, is modelled with a pre-trained Convolutional Neural Network (CNN), whose convolutional layers form the knowledge base of the baby brain. When a few instances of a new concept are taught, an initial concept detector is built by exemplar learning over the deep features from the pre-trained CNN. Furthermore, when the baby explores the physical world, once a positive instance is detected with a high score, the baby further observes and tracks the varying instance, possibly from different view angles and/or distances, and thus accumulates more instances. We mimic this process with massive collections of unlabeled online videos and a well-designed tracking solution. The concept detector is then fine-tuned on these new instances. This process can be repeated until the baby has a mature concept detector in the brain. Extensive experiments on the Pascal VOC-07/10/12 object detection datasets demonstrate the effectiveness of the proposed computational baby learning framework. It outperforms state-of-the-art results obtained with full training while learning from only two positive instances per object category, along with ~20,000 videos that mimic the baby's exploration of the physical world.
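The iterative loop described above (build an initial detector from a few seed instances, accept high-scoring detections in unlabeled videos, gather the tracked views as new positives, refit, repeat) can be illustrated with a minimal sketch. This is not the method used in this work: plain lists stand in for CNN features, a mean template with cosine similarity stands in for exemplar learning, and each "track" is assumed to be already produced by a tracker. All names (`fit_detector`, `baby_learning`, `threshold`, `rounds`) and the toy data are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def fit_detector(positives):
    """Toy stand-in for exemplar learning: average the positive features."""
    dim = len(positives[0])
    return [sum(p[i] for p in positives) / len(positives) for i in range(dim)]


def baby_learning(seed_positives, video_tracks, threshold=0.75, rounds=3):
    """Grow the positive set from unlabeled tracks, refitting after each round.

    video_tracks: list of tracks; each track is a list of feature vectors
    for the same object seen across frames (assumed given by a tracker).
    """
    positives = list(seed_positives)
    used = set()  # indices of tracks already absorbed
    detector = fit_detector(positives)
    for _ in range(rounds):
        added = False
        for idx, track in enumerate(video_tracks):
            if idx in used:
                continue
            # A high-scoring detection on any frame triggers "tracking":
            # every view in the track is collected as a new positive.
            if max(cosine(detector, frame) for frame in track) >= threshold:
                positives.extend(track)
                used.add(idx)
                added = True
        if not added:
            break  # nothing new was found, so the detector has converged
        detector = fit_detector(positives)  # "fine-tune" on the larger set
    return detector, positives, used


# Two seed instances, mirroring the two-positives-per-category setting.
seeds = [[1.0, 0.0], [0.9, 0.1]]
tracks = [
    [[1.0, 0.05], [0.7, 0.7]],    # starts near the seeds, then changes view
    [[0.6, 0.8], [0.5, 0.9]],     # only resembles the later, rotated views
    [[-1.0, 0.2], [-0.9, 0.1]],   # unrelated object
]
detector, positives, used = baby_learning(seeds, tracks)
print(sorted(used), len(positives))
```

In this toy run the second track falls below the threshold for the seed-only detector and is accepted only after the first refit, which is the effect the video context is meant to provide; the unrelated third track is never absorbed.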