Author: Changan | Biteye Content Team
Can someone who has never edited videos before create an AI-generated short video with a storyline, dialogue, and camera transitions?
Yes, and the whole process takes no more than half a day.
This article teaches you how to: think of a story → break it down into storyboards → generate video → edit it into a film.
No prior knowledge is required. Just follow along and you'll get a complete AI short video.
I. From Idea to Story: AI Videos Are Not Generated from a Single Prompt
Many people's first step in creating AI videos is to open Jimeng, stare blankly at the input box, and not know what to write. After typing a few words, the generated result is nothing like what they imagined, and they begin to doubt whether the tool is faulty or whether they simply don't know how to write prompts.
For example, "I want to be a Biteye junior sister who is reborn as a big shot in the crypto world" is an idea, not a story.
An idea is a direction; it tells you roughly what you want to do. A story is a structure; it tells you what to shoot in each frame. From idea to story, there's a process in between: script planning.
The simplest way is to open any LLM chat tool, tell it your vague idea, and let it help you build up the story. You don't need to figure out all the details yourself; just provide a direction and work with it to flesh out the rest.
Once the storyline is established, don't immediately break it down into segments. Instead, divide it into several large sections according to the narrative rhythm, clearly defining the core theme of each segment. This step is to control the overall pace and prevent any segment from dragging on or rushing.
The maximum length of a single video in Jimeng is 15 seconds, but in practice, clips under 12 seconds are the most stable and have the lowest probability of visual glitches. A 1-minute video, at roughly 12 seconds per clip, therefore needs about 5 clips.
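If you want to sanity-check that arithmetic for a different target length, it is just a ceiling division. A minimal Python sketch (the numbers come from the paragraph above; they are planning figures, not hard limits):

```python
import math

TOTAL_SECONDS = 60      # target length of the finished video
SECONDS_PER_CLIP = 12   # Jimeng is most stable at or under ~12 s per clip

# Round up: a partial clip still has to be generated.
clips_needed = math.ceil(TOTAL_SECONDS / SECONDS_PER_CLIP)
print(clips_needed)     # -> 5
```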
We divided the story into five parts:
Part 1: The opening; its core task is to establish the scene and the characters.
Part 2: The time travel; its core task is to establish the timeline.
Part 3: The turn; it shows the character's transformation from confusion to clarity.
Part 4: Counting the wealth; it pushes the emotion to a climax.
Part 5: The reversal; it closes the loop with the opening.
Once the parts are finalized, break each one down into specific shot descriptions. Each shot should cover four elements: the main subject, where it is, what is happening, and the camera angle. Do not include movement in a storyboard description; describe only the still moment. For example: "Biteye junior sister (subject), at an office desk (location), staring in shock at a candlestick chart on the monitor (event), medium shot (angle)."
Copy the script for part one into the AI chat box and type: "Give me a storyboard description for scene one based on the script." The result is shown below 👇
II. From Story to Visuals: First, identify the characters, scenes, and storyboards.
This chapter is the most crucial one in the entire process. The quality of the images you generate here directly determines the upper limit of the final video quality.
First, create a three-view drawing to lock in your main subject.
Before generating any storyboards, the first thing to do is to create the three-view drawing of the main character.
A three-view drawing is three images of the same character: front, side, and back. Its purpose is to fix the character's appearance, so that no matter what scene you generate later, these three images keep the character consistent.
If you skip this step and directly generate the storyboard, you'll find that the characters generated each time look different—the hairstyle changes, the face shape changes—and you simply can't make the video.
Open ChatGPT/Seedream and enter the following in the dialog box:
"Generate a three-view image of Biteye's junior sister."
The AI will generate an image containing the same person from three different angles. If the generated image differs significantly from what you expect, you can upload a reference image.
Once you are satisfied with the three-view drawing, download it. You will need to upload it back as a reference every time you generate a video.
Create another scene reference image to define your background.
Once the characters are determined, use the same logic to generate a separate reference image for your scene. In the dialog box, type "Generate an image of an office for me".
Before we begin generating storyboards, we need to understand a basic concept: a shot is the smallest unit of expression in a video.
The camera can also speak; different shot sizes convey different information. Common shot sizes include the following:
Wide shot (panoramic view): conveys information, letting the audience know where the scene is set and who the characters are.
Medium shot: advances the plot; actions and expressions are clearly visible. It is the most frequently used shot type in narrative.
Close-up: creates emotion; the camera locks onto a face, hands, or a key prop, magnifying details for a strong emotional impact.
Once you understand a single shot, step up one level: a video is not one shot, but the result of multiple shots combined in a rhythmic sequence.
In actual production, we usually use a four-panel or nine-panel grid to organize a video's shot structure — that is, arranging 4 or 9 shots within one video to complete a full expression.
The choice between a four-panel and a nine-panel grid is essentially about controlling rhythm (the quick arithmetic after this list makes the difference concrete):
Slow-paced segments: for example the opening that sets the scene and the ending that settles the emotion. A four-panel grid is enough; four shots give every frame room to breathe.
Fast-paced segments: for example the climax of a fight, where the camera needs to cut rapidly to build tension. A nine-panel grid, with nine shots compressed into one video, produces a completely different feel.
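The rhythm difference is easy to quantify: divide the clip length by the number of shots. A quick sketch, assuming the ~12-second stable clip length mentioned earlier:

```python
CLIP_SECONDS = 12  # a typical stable Jimeng clip length (see section I)

for shots in (4, 9):
    print(f"{shots}-panel grid: ~{CLIP_SECONDS / shots:.1f} s per shot")

# 4-panel grid: ~3.0 s per shot -> room to breathe, slow rhythm
# 9-panel grid: ~1.3 s per shot -> rapid cuts, tense rhythm
```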
Once you understand camera angles and pacing, you can begin the actual production: turning an abstract story into concrete visuals.
Once the character's three-view drawing and scene reference images are prepared, the next step is to turn the storyboard descriptions you wrote earlier into still images, one by one. The reason is simple: AI is better at handling deterministic single frames than continuously changing processes, and working frame by frame also greatly reduces the number of rerolls — the "gacha" cycle of regenerating until you get a usable result.
The specific steps are as follows:
Each time you generate a shot, first upload the character's three-view drawing and the corresponding scene reference image to the ChatGPT dialog, then enter the generation prompt for the storyboard image:
"Please generate a four-panel storyboard based on the story synopsis and storyboard descriptions (including the previous and AI-generated storyboard phrases), along with scene and character images."
The model will break down this shot into four frames based on the storyboard information you provide, ensuring consistency between the characters and the scene, as shown below:
💡Quick Tips: There are a few common pitfalls in text-to-image generation. Knowing them in advance can save you a lot of trouble:
If you ask for a shot of a person playing a game on their phone, the generated phone screen will automatically face the viewer: the AI's instinct is to make content "readable," so the game interface ends up polluting the image. The correct approach is to specify the phone held horizontally in both hands, screen facing the person's face, and the back of the phone facing the camera.
Occupational terms trigger the AI to conjure the whole associated scene: write "nurse" and it adds a hospital; write "chef" and it adds a kitchen. The correct approach is to describe only the clothing you actually want, without naming the occupation.
Image generation produces stills only; an instruction like "turning the head" has no corresponding visual state in a single frame. The correct approach is to describe only what exists in that one frame.
III. From Visuals to Video: Prompts Should Describe the Actions, Not Just the Visuals
The storyboards are all ready; now we're going to turn them into a moving video.
🌟Register for Jimeng
Open your browser and search for "Jimeng AI" to enter the official website. Click "Login" in the upper right corner. You can register with your Douyin account or mobile phone number. It can be accessed directly within China.
New users can generate a 15-second video for free. If you need a membership, Biteye has compared the prices of Seedance 2.0 across multiple platforms; for details, see "The Lowest Cost Subscription Guide for Seedance 2.0 is Here!"
🌟How to write video prompts?
This is the most crucial part of this step, and also the part that beginners are most likely to make mistakes on.
First, put all the reference images into the chat box. Jimeng supports uploading multiple reference images at once: drag in everything you prepared in the previous chapter — the character's three-view drawing, the scene reference images, and the four- or nine-panel storyboards. Jimeng will combine the information from these images to generate the video.
Many beginners make a mistake here: they try to describe what's in the picture again. The app can already see the picture you uploaded; you don't need to tell it what's in it.
The prompts should include: what is moving in the scene, how it is moving, whether the camera itself is moving, and what is happening at each interval.
Follow this template, with each line corresponding to a specific time segment in the video:
"Please refer to the storyboard above and generate a video for me."
[Start time to end time], [Shot type], [Camera movement], [Character or main subject] + [Specific action], Sound effects: [Sound description].
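If you are producing many clips, it helps to keep the shot list as structured data and assemble the prompt lines mechanically, so no field gets dropped. A minimal Python sketch; the shot contents are illustrative, and the bracketed layout simply follows the template above (it is not a syntax Jimeng requires):

```python
# Each shot: (start_s, end_s, shot type, camera movement, subject + action, sound)
shots = [
    (0, 4, "medium shot", "slow push-in",
     "the junior sister looks up from her phone", "keyboard clicks"),
    (4, 9, "close-up", "camera static",
     "her eyes widen as she reads the screen", "a sharp inhale"),
]

lines = ["Please refer to the storyboard above and generate a video for me."]
for start, end, size, move, action, sound in shots:
    lines.append(
        f"[{start}s to {end}s], [{size}], [{move}], {action}, "
        f"Sound effects: {sound}."
    )

print("\n".join(lines))
```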
🌟Voice description is the part beginners most easily overlook. If there is dialogue in the video, writing just "speaking voice" is not enough; the model will pick a random voice every time. To keep the character's voice consistent across multiple video clips, there are two methods:
1️⃣ Use the audio from the first segment as a reference.
First, generate the first video segment. Once satisfied with the result, export the audio separately. For each subsequent segment, upload this audio as a sound reference; the system will use this timbre to generate the vocals for later segments, ensuring audio consistency.
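The audio export in this first method can be done with ffmpeg. A minimal sketch, assuming ffmpeg is installed and the clip's audio track is AAC (the file names are illustrative):

```python
import subprocess

# Extract the dialogue track from the first approved clip so it can be
# re-uploaded as the voice reference for every later segment.
subprocess.run([
    "ffmpeg", "-i", "segment_01.mp4",  # source clip exported from Jimeng
    "-vn",                             # drop the video stream
    "-acodec", "copy",                 # keep the original audio codec (AAC -> .m4a)
    "segment_01_voice.m4a",            # the reference audio file
], check=True)
```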
2️⃣ Use Fish Audio to find reference sounds
Open Fish Audio, search for a voice that matches the character's temperament, listen to it, and download a sample as a reference audio. Use this reference audio consistently in every video segment you generate, ensuring consistent sound throughout the entire film.
🌟Use punctuation to control the tone of AI voiceover
Writing dialogue for an AI voice-over model isn't as simple as just typing in the text. The same sentence can be delivered in completely different tones depending on the punctuation.
The core logic is: punctuation marks control pauses, and pauses determine the mood.
... An ellipsis breaks the sound while the breath continues; suitable for thinking, hesitating, or trailing off mid-sentence.
...! Used in combination, it signals a sudden outburst after suppression.
(content) Anything in parentheses automatically drops in volume to a breathy near-whisper; suitable for inner monologue and talking to oneself.
*content* Words enclosed in asterisks become lower, slower, and heavier; used to emphasize key information.
[instruction] Square brackets hold performance instructions rather than dialogue, such as "[take a deep breath]" or "[pause for 1 second]"; the model performs the action instead of reciting it.
A single line can combine several of these. For example: "I... I was reborn?! (This can't be real.) [take a deep breath] *This time, I won't miss it.*"
💡Quick Tips:
AI lacks spatial awareness and often confuses left and right, so you may need a separate "positional reference diagram" to show the AI how the character moves, as shown in Figure 1. A simpler method is to draw arrows on the image to describe the character's movement trajectory and add "remove the arrows" at the end of the prompt.
Write slow, not fast. Models handle slow motion far more stably than fast motion. For fast-paced segments, prefer speeding the footage up in the edit rather than having the model generate fast motion.
Upload the reference images with every video generation, not just the first one. The model has no cross-segment memory; in any segment generated without the references, the character's appearance will drift.
IV. From Clips to Finished Product: Editing Determines the Final Quality of a Video
Editing and post-production are the finishing touches in the whole process. Each clip generated earlier is independent, with mismatched color tones, uneven pacing, and scattered audio. The role of editing is to combine these fragments into a complete story.
Adding music engages the audience's emotions; adding subtitles makes the dialogue clearer. The same material, edited well or edited poorly, can differ by an order of magnitude in the final result.
The process involves four steps: arranging materials → unifying the color tone → adding sound → adding subtitles, and finally exporting.
Step 1: Arrange the materials
Open CapCut and drag all the clips into the timeline in scene order. Ignore the color tone and sound for now, confirm the order, and check the overall pacing. Cut off any excessively long clips at this step.
Step 2: Unify the color scheme
Clips generated at different times may differ slightly in color temperature and brightness, which makes them feel disjointed when placed together. The solution: select all clips and add an overall filter under "Adjustments" — a cool blue tone for scene one, switching to a warm yellow tone from scene two onward, keeping the color consistent within each scene.
Step 3: Add background music and sound effects
The dialogue audio has already been processed when the video is generated. This step mainly adds two types of audio: background music and ambient sound effects.
Background music sets the overall mood; keep the volume below 30% of the dialogue volume so it doesn't drown out the vocals.
Step 4: Add subtitles
Use CapCut's "Smart Subtitles" to automatically recognize dialogue. After recognition, check for typos and standardize font and placement. For narration or monologues, it's recommended to use different styles to distinguish them from regular dialogue, such as italics or different colors.
V. From Tools to Expression: What Has AI Video Really Changed?
In the previous article, "GPT Image 2.0 empowers Seedance 2.0: Everyone can shoot Hollywood blockbusters," we argued that the AI era has lowered the threshold for shooting video, and that everyone will eventually be able to make Hollywood-level films.
But a low barrier to entry doesn't mean you can do it.
The tools are all publicly available, and tutorials are everywhere, but most people get stuck in the same place: they never get through the entire process.
This article from Biteye has guided you step by step from a vague idea to a complete video.
In the past, this process required a complete division of labor: screenwriting, storyboarding, art direction, cinematography, and editing, each of which was a hurdle to overcome.
Now, these steps haven't disappeared; they've simply been compressed into a single process.
This signifies a more fundamental change: video is no longer a product of "production capacity," but rather a product of "expressive capacity."

