This Xiaohongshu image-and-text layout AI Skill has found a way to sidestep AI labeling for generated posts.

In February 2026, Xiaohongshu announced that AI-generated composite content must be proactively labeled and that unlabeled content would have its distribution restricted. A little over three months later, an open-source project named guizang-social-card-skill appeared on GitHub. It generates 3:4 image-and-text posts for Xiaohongshu and cover images for WeChat official accounts. Its technical approach is unusual: no AI model generates any image pixels. Every card is rendered from HTML+CSS, with photos sourced from stock libraries such as Unsplash. The output is not an “AI-generated image” but a screenshot of a webpage rasterized by a browser engine.

This choice responds to a specific change. Since the start of 2026, Xiaohongshu has deployed audio-visual recognition models that identify AIGC content by analyzing pixel distribution patterns in images and features in audio. Over the same period, the platform acted against more than 800,000 AI-managed accounts and nearly 150,000 AI-fabricated notes. For creators who produce image-and-text posts at high frequency, the probability that images generated with Midjourney or Canva AI are detected and flagged keeps rising. Master Zang's Skill takes another path: the AI makes the layout decisions, while the final pixels come from the rendering engine and stock photo libraries.

This is a deliberate technical workaround. How far it can go, however, depends on how flexibly the platform defines “AI-generated composite content.”

28 Layout Skeletons, AI is Responsible for Layout Logic Rather than Drawing

Master Zang's real name is Guizang, and he previously released guizang-ppt-skill, another AI tool for image-and-text layout. This time, social-card-skill is more focused: it targets Xiaohongshu 3:4 posts and WeChat official account covers in 1:1 and 21:9, with output resolutions of 1080×1440, 1080×1080, and 2100×900, respectively.
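The three output sizes can be checked against their stated aspect ratios with a few lines of arithmetic. The sketch below is only an illustration: the preset key names are hypothetical labels, not identifiers taken from the project.

```javascript
// Canvas presets using the sizes quoted in the article.
// The key names ("xiaohongshu-3x4", ...) are hypothetical labels.
const CANVAS_PRESETS = {
  "xiaohongshu-3x4": { width: 1080, height: 1440, ratio: [3, 4] },
  "wechat-1x1": { width: 1080, height: 1080, ratio: [1, 1] },
  "wechat-21x9": { width: 2100, height: 900, ratio: [21, 9] },
};

// Greatest common divisor, used to reduce width:height to lowest terms.
function gcd(a, b) {
  return b === 0 ? a : gcd(b, a % b);
}

// Reduce a pixel size to its simplest aspect ratio, e.g. 1080x1440 -> [3, 4].
function reducedRatio(width, height) {
  const d = gcd(width, height);
  return [width / d, height / d];
}

// Check that a preset's pixel size matches its declared ratio.
// 21:9 reduces to 7:3, so both sides are compared in lowest terms.
function presetIsConsistent(preset) {
  const [w, h] = reducedRatio(preset.width, preset.height);
  const [rw, rh] = reducedRatio(preset.ratio[0], preset.ratio[1]);
  return w === rw && h === rh;
}
```

The same helper shows why a 9:16 vertical cover would need its own preset: reducedRatio(1080, 1920) reduces to 9:16, which matches none of the three sizes above.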

Architecturally, the Skill ships with 28 built-in layout skeletons split across two visual systems, Editorial (magazine style, 16 layouts) and Swiss (Swiss international style, 12 layouts), plus 10 preset color themes. After the user supplies a destination, itinerary, or note topic, the AI selects a suitable skeleton, decides where the text goes, and sets the map annotation parameters, then writes all of these design decisions into HTML+CSS. Playwright then drives a browser to render the result and outputs PNG screenshots page by page.
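A minimal sketch of the "decisions written into HTML+CSS" step might look like the following. The shape of the decision object, its field names, and the CSS values are all assumptions made for illustration; the project's real skeletons are not reproduced here.

```javascript
// Hypothetical shape of a layout decision the model might emit.
// Field names (system, skeleton, theme, title, body) are illustrative only.
const decision = {
  system: "editorial",
  skeleton: "cover-01",
  theme: { background: "#f4efe6", ink: "#1a1a1a" },
  title: "Kyoto in 3 Days",
  body: "Day 1: Fushimi Inari & Gion",
};

// Escape text so user input cannot break out of the markup.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Turn a decision into a fixed-size HTML page that a browser can rasterize.
// Theme colors are inserted as-is, so they are assumed to be trusted values.
function buildCardHtml(d, width = 1080, height = 1440) {
  return [
    "<!doctype html>",
    `<html><head><meta charset="utf-8"><style>`,
    `html, body { margin: 0; }`,
    `.card { width: ${width}px; height: ${height}px; box-sizing: border-box;`,
    `  padding: 96px; background: ${d.theme.background}; color: ${d.theme.ink}; }`,
    `.card h1 { font-size: 96px; line-height: 1.1; margin: 0 0 48px; }`,
    `.card p { font-size: 40px; line-height: 1.5; margin: 0; }`,
    `</style></head><body>`,
    `<section class="card ${escapeHtml(d.system)} ${escapeHtml(d.skeleton)}">`,
    `<h1>${escapeHtml(d.title)}</h1>`,
    `<p>${escapeHtml(d.body)}</p>`,
    `</section></body></html>`,
  ].join("\n");
}
```

In a Playwright script, a string like this could be loaded with page.setContent(html) and captured with page.screenshot({ path }), after setting the viewport to the card size. The point of the design is that every pixel in the PNG is then produced by the browser's rasterizer, not by an image model.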

One component that is especially useful for travel bloggers is the map module. It uses MapLibre to load real OpenStreetMap tiles and supports multiple location markers with connecting lines between them. Users only need to provide city or attraction names; the AI generates an annotated base map and places it in the layout. Image sourcing follows a clear priority: real photos supplied by the user come first, and if none are available, images are retrieved automatically in the order Unsplash → Pexels → Flickr CC → Wallhaven.
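The sourcing priority described above is a plain ordered fallback. The sketch below assumes a hypothetical interface in which each provider is a function that returns a URL or null; it does not call any real stock photo API.

```javascript
// Provider order quoted in the article; user photos are checked before any of these.
const PROVIDER_ORDER = ["unsplash", "pexels", "flickr-cc", "wallhaven"];

// Pick an image for a query.
// `userPhotos` is a list of local file paths; `searchers` maps a provider name
// to a function (query) => url | null. Both shapes are assumptions for this sketch.
function pickImage(query, userPhotos, searchers) {
  if (userPhotos.length > 0) {
    return { source: "user", url: userPhotos[0] };
  }
  for (const provider of PROVIDER_ORDER) {
    const search = searchers[provider];
    if (!search) continue; // provider not configured
    const url = search(query);
    if (url) return { source: provider, url };
  }
  return null; // nothing found anywhere
}
```

Recording the source alongside the URL keeps the provenance of every photo visible, which matters if a platform later asks where the pixels in a post came from.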

The process runs in seven steps: Intake (receive input) → Style & Theme (determine style and theme) → Layout Selection (select layout) → Asset Prep (prepare assets) → Compose & Render (lay out and render) → Deliver & Review (output and review) → Iterate (revise). Each step is recorded in the .poster file in the task directory. For batch output, node render.mjs is run and Playwright renders the pages one by one. A separate verification script, validate-social-deck.mjs, measures DOM elements in a real browser and checks for text overflow, oversized fonts, colliding footer elements, and other layout defects.
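The three kinds of checks named above reduce to simple geometry once element boxes have been measured. The following is a sketch of that logic only, not the contents of validate-social-deck.mjs; the page shape and the maxFontSize threshold are assumptions.

```javascript
// A box is { left, top, right, bottom } in CSS pixels, the same fields
// that getBoundingClientRect() returns in a browser.
function overflows(child, container) {
  return (
    child.left < container.left ||
    child.top < container.top ||
    child.right > container.right ||
    child.bottom > container.bottom
  );
}

// Two boxes collide when they overlap on both axes.
function collides(a, b) {
  return a.left < b.right && b.left < a.right && a.top < b.bottom && b.top < a.bottom;
}

// Run the checks over the measurements of one rendered page.
// `maxFontSize` and the shape of `page` are assumptions for this sketch.
function validatePage(page, maxFontSize = 160) {
  const issues = [];
  for (const el of page.textBlocks) {
    if (overflows(el.box, page.canvas)) issues.push(`overflow: ${el.id}`);
    if (el.fontSize > maxFontSize) issues.push(`font too large: ${el.id}`);
  }
  const footer = page.footerItems;
  for (let i = 0; i < footer.length; i++) {
    for (let j = i + 1; j < footer.length; j++) {
      if (collides(footer[i].box, footer[j].box)) {
        issues.push(`footer collision: ${footer[i].id} / ${footer[j].id}`);
      }
    }
  }
  return issues;
}
```

In a real run the boxes would be collected inside the browser, for example through Playwright's page.evaluate, and a non-empty issue list would block the export of that page.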

The design goal of this mechanism is very clear: to be as precise and controllable as print layout software, rather than as free yet unpredictable as diffusion models. The cost is that creative freedom is constrained within 28 boxes. For creators who rely on personal photography styles, hand-drawn elements, or irregular collages, the layout skeletons provide not an efficiency boost but a design constraint.

In terms of accessibility, the CLI version requires Playwright, a Node environment, and access to Claude Code or Codex. There is also a web entry point at xiaohongshu.guizang.ai aimed at non-developers, but whether it matches the CLI version in functionality is unclear, and no public comparison is available. Several posts on X and a repeatedly updated README from the developer indicate that the project is still iterating rapidly.

Pixels Do Not Come from Generation Models, But Compliance Does Not Equal Long-term Safety

Judging from public information and technical analysis, Xiaohongshu's AI content detection relies primarily on audio-visual recognition models. These models judge whether content comes from an AI generation model by analyzing the pixel distribution patterns of images. Diffusion models and GANs leave specific statistical signatures at the pixel level, which differ from the natural light and shadow, lens distortion, and noise patterns captured by camera sensors. The recognition models are trained precisely to capture these statistical inconsistencies.

The evasion logic of Master Zang's Skill rests on one key distinction: the pixels it outputs do not come from any generation model. The HTML rendering engine rasterizes CSS styles, producing pixel distributions closer to screenshots of browser interfaces or output from desktop layout software. The photo portions come from real images in libraries such as Unsplash; these were taken with cameras and edited by people, and they carry no traces of diffusion models.
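To make "pixel distribution" concrete, here is a toy statistic: the mean absolute difference between neighboring pixels. This is not how Xiaohongshu's detector works, since real detectors are learned classifiers whose features are not public. It only illustrates that a flat CSS fill and a noisy captured surface look different numerically even when their average brightness is the same. The sample data is synthetic.

```javascript
// Mean absolute difference between horizontally adjacent pixels of a
// grayscale image given as an array of rows (values 0-255).
function meanNeighborDiff(rows) {
  let total = 0;
  let count = 0;
  for (const row of rows) {
    for (let x = 1; x < row.length; x++) {
      total += Math.abs(row[x] - row[x - 1]);
      count++;
    }
  }
  return count === 0 ? 0 : total / count;
}

// A flat CSS fill: every pixel has exactly the same value.
const flatFill = Array.from({ length: 4 }, () => new Array(8).fill(200));

// A synthetic stand-in for a photographed surface: the same average
// brightness, with a small deterministic +/-2 ripple in place of sensor noise.
const noisySurface = Array.from({ length: 4 }, (_, y) =>
  Array.from({ length: 8 }, (_, x) => 200 + ((x + y) % 2 === 0 ? 2 : -2))
);
```

The flat fill scores exactly zero while the rippled surface does not. A classifier trained on many such statistics could, in principle, learn to tell rendered regions from camera output, which is why the advantage described here depends on what the platform chooses to train for.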

However, this distinction only holds if the platform draws the boundary of “AI-generated composite content” exactly at “pixels generated by an AI model.” The wording in Xiaohongshu's official announcement is broad. If the platform extends the definition to “program-rendered output assisted by AI,” or adds the browser rendering characteristics of rasterized HTML to the recognition model's training set, the current technical advantage of this approach disappears.

The platform has both the technical basis and the governance motivation to expand the definition. The audio-visual recognition model is itself iterating continuously. If a large number of paired samples of HTML-rendered images and AI-generated images were added to the training data, the model could learn to distinguish “subpixel anti-aliasing from browser font rendering” from “irregular pixel blocks produced when a GAN generates text.” No public information currently indicates that Xiaohongshu has started training in this direction, but given what such models can do, the extension is technically feasible.

Another point to watch is compliance, particularly around mini-program hosting. So far, no official documentation indicates that this Skill has obtained a model registration number or completed any related compliance filing. If the platform adds traceability requirements for the output toolchain to its content review process, the missing registration could become a new point of interception.

API Template Engine, Platform Customized Tools, and HTML Rendering are Creating Three Diverging Paths

The tools on the market for generating social media images are diverging into three technical routes, and each carries a differently structured review risk.

AI Models Generating Images Directly. This route is represented by the Magic Design feature that Canva AI released in April 2026, which generates designs containing AI visual elements directly from text prompts. Images generated by models such as Midjourney and DALL·E also fall into this category. The problem is clear: these images are the main detection targets of the audio-visual recognition model. Canva's response is to encourage transparent labeling rather than to evade detection. Whether labeling an AI-generated post reduces its recommendation weight on Xiaohongshu has not been confirmed by public data, but “restricting distribution of unlabeled AI content” is established platform policy. Each time a diffusion model is updated, the statistical features of its pixels may change and the detection model iterates in response, leaving creators with a continuously moving target.

API Template Engine Rendering. Bannerbear is a typical example of this route. Users create templates in a designer, modify layer variables by passing JSON data through a REST API, and the server renders and outputs a PNG or JPG. Its core is also “program rendering” rather than “model-generated pixels,” so the output carries no traces of diffusion models. The difference from Master Zang's Skill is that Bannerbear's templates depend on manual design and AI takes no part in layout decisions, whereas Master Zang's Skill lets Claude read and write the HTML directly and hands layout selection to the AI. The risk of the Bannerbear approach lies in another dimension: when many accounts use the same templates, colors, and fonts, the platform may recognize a pattern of “programmatic mass production” even though no individual image is AI-generated. The triggers for anti-spam rules are not the same as those for AI detection, but for creators operating multiple accounts the outcome is the same: restricted distribution.
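The template-engine idea, a fixed design whose layer variables are overridden by JSON, can be sketched generically as below. This is not Bannerbear's actual schema or API; the template structure, layer fields, and function name are hypothetical and chosen only to show the mechanism.

```javascript
// A template as a list of named layers with default values.
// This structure is a generic illustration, not Bannerbear's actual schema.
const template = {
  name: "itinerary-card",
  layers: [
    { name: "title", type: "text", text: "Title" },
    { name: "subtitle", type: "text", text: "Subtitle" },
    { name: "photo", type: "image", imageUrl: null },
  ],
};

// Apply a list of modifications (one per layer name) without mutating the template.
// Unknown layer names are reported rather than silently ignored.
function applyModifications(tpl, modifications) {
  const byName = new Map(modifications.map((m) => [m.name, m]));
  const known = new Set(tpl.layers.map((l) => l.name));
  const unknown = modifications.filter((m) => !known.has(m.name)).map((m) => m.name);
  const layers = tpl.layers.map((layer) => ({ ...layer, ...(byName.get(layer.name) || {}) }));
  return { name: tpl.name, layers, unknown };
}
```

Because only the variable values change between renders, every output shares the same geometry, colors, and fonts. That sameness is what makes the route predictable, and it is also what a "programmatic mass production" rule could key on.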

Platform Customized Generation. Pin Generator is designed specifically for Pinterest and automatically generates Pins that match the platform's algorithmic preferences. The core of this route is not evasion but full adaptation: size, visual style, and publishing rhythm are all aligned with platform norms. The advantage is the lowest review risk. The drawback is that the tool's capabilities are tightly bound to platform rules, so when Pinterest adjusts its algorithm or restricts third-party API calls, the tool stops working. Compared with Master Zang's Skill, Pin Generator is a platform-exclusive tool, while the Skill is a cross-platform general solution. Platform-exclusive tools are safer but more fragile; cross-platform tools are more flexible but more complex. This trade-off recurs throughout the AI tools field.

The three paths have different risk structures. AI-generated images offer the most freedom but must face a new detection model with every update. Template engines are the most stable but may be caught by anti-spam rules. HTML rendering sits between the two: the AI controls layout flexibly while browsers and real photos supply the pixels, which avoids detection at the pixel level but offers no defense if the platform broadens its rules semantically.

The Upper Limit of the Layout System Lies not in the Code but in Content Types

The 28 layout skeletons cover two mainstream visual systems: magazine style and Swiss style. For travel bloggers needing to display mapped routes, timelines, and multi-day itineraries, this system is highly compatible. Map annotations and itinerary connections are core information for these notes, and layout skeletons structure the information while maintaining a professional feel in the layout.

However, the content ecosystem on Xiaohongshu is far richer than just travel guides. Outfit notes rely on personal photography styles and color tones, beauty reviews need high-definition macro photos and product comparison images, and lifestyle content frequently uses multi-image collages and handwritten annotations. The “layout” of these content types is not about structuring information but about expressing personal aesthetics and emotions. The 28 layout skeletons become constraints rather than tools in such scenarios.

The technical limitations are also real. The currently supported sizes are 1080×1440 (Xiaohongshu 3:4), 2100×900 (WeChat official account 21:9), and 1080×1080 (WeChat official account 1:1). Douyin's 9:16 vertical cover and Bilibili's 16:9 horizontal cover are not supported. The image sourcing leans on Unsplash and Pexels, which favor high-quality photography and suit travel, scenery, and urban architecture. Frequently needed material for vertical niches, such as food close-ups, cosmetics flat lays, and individual outfit items, has limited coverage in these libraries. Prioritizing user images partly alleviates this, provided the creator has built up enough real photographs.

The verification mechanism is a double-edged sword. validate-social-deck.mjs can catch layout defects before images are delivered, so that a batch of 100 renders goes out without them. This is a real efficiency guarantee in operations where dozens of images are updated daily. But it also means that any design that breaks the preset layout rules is rejected by the script. Creators who want to add a slanted text decoration or customize margins within a standard layout cannot drag and adjust freely as in Canva; they have to edit the HTML and CSS source directly.

Local deployment is another barrier. Creators who can run Playwright and Node scripts can dig into the layout skeletons and rendering scripts and customize them. Most Xiaohongshu bloggers, however, only reach a subset of the functions through the web interface. The practical value of the Skill differs greatly between these two groups. The core users of the open-source project are creators and developers with technical backgrounds who are willing to experiment, not ordinary content producers looking for “one-click image generation.”

There Are No Universal Answers, But the Differentiation of Technical Routes Already Indicates Issues

A Xiaohongshu travel blogger faces three choices: generate illustrated itinerary maps with Midjourney and accept the risk of being flagged and downranked; set up templates in Bannerbear, feed in data in batches every day, and accept the anti-spam risk of template homogenization; or use Master Zang's Skill, let the AI select layouts and render the images from HTML, and accept the risk that the platform expands its definition of “composite content.” There is no safe option, only a choice among different risk structures.

This pattern sends a message: the arms race between platforms and AI tools has already begun. Each time the platform updates its detection model, a batch of tools loses its technical advantage. Each time a new tool finds a way around, the platform adjusts its strategy again. The process will not converge to a stable state. How long the HTML rendering approach stays effective depends on whether the training of Xiaohongshu's audio-visual recognition model continues to focus on “diffusion model pixel features” or expands to “all pixels that are not native photography.”

For content creators, the distinction between “AI-assisted” and “AI-replaced” has practical significance. The platform's attitude has become clear: it encourages AI as a creative amplifier but opposes using AI to replace human-led work with low-quality mass production. In Master Zang's Skill, the AI makes layout decisions and does not generate the content; the photos are real, and the layouts are skeletons pre-designed by human designers. This falls squarely within the “AI-assisted” range. Output generated entirely by models, from copy to images, is what the platform has made clear it wants to combat.

Whether this distinction will become an operational standard for platform review is still uncertain. Tool developers, however, are already responding to it through their technical choices.
