Do More Newsletter
This issue features the article "Computer Vision Explained: How Modern AI 'Sees' the World", plus product news about Nori ("Family AI" for real-life coordination), Kling AI 3.0 ("Everyone Can Be a Director"), HIX AI (an all-in-one AI agent workspace for creators and knowledge workers), Simpplr Comms AI (an AI teammate for internal communications), and Expedient's AI CTRL Agentic Workflow Engine (turning AI pilots into real processes).
Keep up to date on the latest products, workflows, apps and models so that you can excel at your work. Curated by Duet.

Nori is a new “Family AI” platform designed to run your household like a shared command center instead of a dozen scattered apps and group chats. It connects family calendars, tasks, meals, and routines into one system-level assistant that every family member can access, reducing the mental load of remembering who needs to be where and when. Parents can use Nori to coordinate pickups, recurring chores, and activities, while kids see age-appropriate views of their own schedules and responsibilities. The app’s focus is not on generic chat, but on helping real families get through busy weeks with fewer dropped balls and fewer “did anyone remember…?” moments.
Kling AI 3.0 is a major new release of Kuaishou’s video and image generation models aimed at creators who want cinematic control without a studio budget. The new Video 3.0 and Video 3.0 Omni models support text, image, audio, and video as input, letting you storyboard multi-shot sequences, control camera movement, and edit existing clips in one AI-native workflow. For designers and marketers, Image 3.0 adds 2K and 4K ultra-high-definition stills with strong realism, maintaining textures, lighting, and materials for production-ready visuals. Together, Kling 3.0 turns AI from a “one-off clip generator” into a creative partner that understands your narrative and helps you turn ideas into polished shorts, ads, or explainer videos.
HIX AI has evolved into an all‑in‑one AI agent platform that bundles writing, research, slide creation, video, and image generation into a single interface. Instead of juggling separate tools, you pick a specialized agent—like an AI Deep Research Agent, Slides Agent, Writer Agent, or video/image agents—and work through an entire workflow in one place, from first idea to finished asset. For solo creators, marketers, and students, this means you can research a topic, draft an article, generate supporting visuals, and build a slide deck without exporting data between apps or re‑prompting different models. The focus is on approachable, purpose‑built agents that handle the “glue work” around content production so you can spend more time on what to say, not how to assemble it.
Simpplr’s new Comms AI is pitched as an “AI teammate” for internal communications teams that are currently stuck in spreadsheets, docs, and email threads. Instead of planning campaigns in one tool, drafting in another, and chasing approvals in your inbox, Comms AI creates a single AI‑native workspace where planning, writing, approvals, and publishing all happen in one connected flow. You can start with a goal or a few rough notes, and the system will help structure a comms plan, map audiences and channels, suggest timelines and risks, and generate on‑brand copy tailored to different formats, from intranet posts to short Slack updates. For communicators, the payoff is less “invisible work” and more visible impact—the platform tracks what’s planned, what’s approved, and how campaigns ladder up to business priorities.
Expedient has expanded its AI CTRL platform with an Agentic Workflow Engine designed to move AI projects from “cool pilot” to production workflows. Instead of isolated proof‑of‑concept bots, the engine weaves agentic AI into end‑to‑end processes that normally require human handoffs, running on a foundation of secure infrastructure, governance, and AI‑ready storage. Paired with an “AI Outcomes Team” and an ROI dashboard, AI CTRL helps organizations design and deploy workflows that show concrete time savings and cost reductions rather than abstract AI promises. For operations, IT, and business leaders, this means you can automate multi‑step tasks with agents while still having the oversight, security, and measurable results needed to win internal support.

Kling AI 3.0 is the latest generation of Kuaishou’s AI video and image creation system, built for anyone who wants to turn a rough idea into a high‑quality visual story. Instead of generating a single short clip from a prompt, the new release adds a family of models—Video 3.0, Video 3.0 Omni, Image 3.0, and Image 3.0 Omni—that understand text, images, audio, and video together. This multimodal foundation lets creators sketch a narrative with references, still frames, or voice, then have the model handle motion, transitions, and style in a cohesive way.
For productivity, the key benefit is control. Kling 3.0 is built around a unified product framework that supports text‑to‑video, image‑to‑video, reference‑to‑video, and in‑video editing in one pipeline, so you can refine a project instead of restarting from scratch with every change. Shot‑level control and better adherence to prompts mean that when you specify camera angles, pacing, or scene changes, the system is far more likely to follow your direction, which cuts down on iteration time and wasted renders.
On the creator side, the new Image 3.0 and Image 3.0 Omni models give you 2K and 4K ultra‑high‑definition output suitable for professional use cases like product renders, virtual set design, and marketing campaigns. The models focus on preserving textures, lighting, and materials, which makes it easier to match AI‑generated shots with real footage or existing brand assets. For small teams that can’t justify traditional 3D or VFX pipelines, this offers a practical way to produce “studio‑grade” visuals while staying inside a single tool.
Perhaps the most useful shift with Kling 3.0 is philosophical: it treats AI as a creative partner rather than a novelty button. By integrating multiple video and image tasks into one architecture and giving users more narrative control, Kling 3.0 makes it plausible for non‑experts to plan, iterate, and ship full video projects in days instead of weeks. That opens the door for more experimentation—social series, product explainers, music visuals, and learning content—because the cost of trying something new is dramatically lower than traditional production.
Computer Vision Explained: How Modern AI "Sees" the World

When you look at a photograph of a Golden Retriever sitting in a park, you don't just see colored dots. You instantly recognize a dog, grass, trees, and perhaps a tennis ball. You understand the spatial relationships—the dog is in front of the tree. You might even infer context, like "this dog is happy."
This instantaneous process of perception and understanding is incredibly complex biological work that your brain does effortlessly.
For decades, computer scientists have tried to replicate this ability in machines. It is called Computer Vision—the field of Artificial Intelligence focused on enabling computers to "see" and interpret the visual world.
Today, computer vision is no longer science fiction. It unlocks your smartphone with your face, helps self-driving cars spot pedestrians, identifies tumors in medical scans with accuracy that rivals human radiologists on some tasks, and automatically tags your friends in social media photos.
But how does a machine, which only understands cold, hard numbers, manage to comprehend something as abstract and varied as a photograph?
The Core Challenge: The Semantic Gap
To understand the miracle of modern computer vision, you must first understand the massive obstacle it had to overcome: the "semantic gap."
This is the disconnect between the low-level data a computer receives and the high-level interpretation humans apply.
When a computer "looks" at an image, it does not see shapes or objects. It sees a massive spreadsheet of numbers. A standard digital image is a grid of pixels. If an image is 1000 pixels wide and 1000 pixels high, that’s one million individual points.
Furthermore, most color images are made up of three layers (channels) of color: Red, Green, and Blue (RGB). So a single image is actually three separate grids stacked on top of each other. Each cell in these grids contains a numerical value between 0 (no intensity) and 255 (full intensity).
To a computer, a picture of a Golden Retriever isn't a dog; it is a three-dimensional matrix of roughly three million numbers varying from 0 to 255.
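To make this concrete, here is a minimal sketch of that representation in plain Python, using a tiny made-up 2x2 image (the pixel values are hypothetical example data):

```python
# A color image is three stacked grids (R, G, B); each cell holds an
# intensity from 0 to 255. These 2x2 values are made-up example data.
red   = [[255, 200], [180, 160]]
green = [[210, 150], [140, 120]]
blue  = [[120,  90], [ 80,  70]]

image = [red, green, blue]  # shape: 3 channels x height x width

height, width = len(red), len(red[0])
total_values = 3 * height * width
print(total_values)  # 12 numbers for this tiny 2x2 image

# The same arithmetic for a 1000x1000 photo gives three million numbers:
print(3 * 1000 * 1000)  # 3000000
```

Everything the network will ever "know" about the photo has to be derived from that one big block of numbers.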
The challenge of computer vision is building a mathematical translation layer that can take those millions of meaningless numbers as input and output a single, meaningful concept: "Dog."
The Old Way vs. The AI Revolution
Before the current AI boom, computer scientists tried to bridge the semantic gap manually. This was called "feature engineering."
Researchers would sit down and attempt to mathematically define what makes a cat look like a cat. They might write code that said: "Look for sharp changes in numerical values that indicate an edge. If you find two triangular shapes on top of a circular shape, it might be a cat."
This approach was brittle. It failed as soon as the lighting changed, or the cat was turned sideways, or it was partially hidden behind a sofa. You couldn't possibly write enough manual rules to account for every variation in the real world.
The revolution arrived around 2012 with the resurgence of Deep Learning.
Instead of telling the computer how to recognize a dog, we build an architecture that allows the computer to learn for itself what a dog looks like. We do this by feeding it millions of examples and using a specialized brain-inspired structure called a Convolutional Neural Network (CNN).
The Engine of Sight: The Convolutional Neural Network (CNN)
For nearly a decade, the CNN has been the workhorse of modern computer vision. While newer architectures (like Vision Transformers) are emerging, CNNs remain the best way to understand the fundamental concepts of how AI processes images.
A CNN is designed to mimic, very loosely, how the human visual cortex processes information in stages. It doesn't look at the whole image at once and try to guess. It breaks it down through a series of layers.
Here is the step-by-step process of how a CNN "sees."
1. The Convolution Operation (The "Filter")
Imagine you are in a pitch-black room looking at a large wall painting with a small flashlight. You can only see the small area your light illuminates at any given moment. To see the whole picture, you have to slide the flashlight across the wall, patch by patch, from top-left to bottom-right.
This is exactly what a CNN does to an image. This process is called convolution.
The "flashlight" is a small grid of numbers (usually 3x3) called a kernel or a filter. The AI slides this filter over the original grid of image pixels. At every stop, it performs a mathematical calculation (multiplying the pixel values by the filter values) to summarize that tiny patch of the image into a single number.
This creates a new, smaller grid called a "feature map."
Why do this? Because different filters look for different things.
One filter might be mathematically tuned to activate highly only when it slides over a vertical line.
Another filter might only activate when it sees a horizontal line.
Another might react to color transitions.
In the very first layers of the network, the AI is not seeing dogs or cars. It is only seeing basic geometry: lines, curves, edges, and blobs of color.
The crucial magic of modern AI is that humans do not design these filters. The network starts with random filters and, during the training process, it learns which filter shapes are necessary to distinguish between different objects.
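The slide-multiply-sum operation above can be written in a few lines of plain Python. The vertical-edge filter here is a classic hand-written example; as the article notes, a real CNN learns its filter values during training:

```python
# Slide a 3x3 filter over a grid of pixels: at each position, multiply
# the overlapping numbers elementwise and sum them into one output cell.
def convolve(image, kernel):
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            out[i][j] = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(k) for dj in range(k)
            )
    return out

# A tiny 5x5 image: dark on the left (0), bright on the right (255).
image = [[0, 0, 255, 255, 255]] * 5

vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

feature_map = convolve(image, vertical_edge)
print(feature_map[0])  # [765, 765, 0]: large values where dark meets bright
```

The filter stays silent over flat regions and "fires" exactly where the vertical boundary sits, which is the sense in which a filter "looks for" one specific pattern.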
2. Pooling (Downsampling)
Images contain a lot of redundant information. A patch of blue sky 10 pixels wide isn't much different from a patch 2 pixels wide.
After convolution, the network usually performs "pooling." This is a way of shrinking the image data down, making the calculations manageable and forcing the AI to focus only on the most important features. The most common method is "Max Pooling," where the network looks at a small patch of the feature map and keeps only the highest number, discarding the rest.
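Max pooling is simple enough to sketch directly; the feature-map values below are made up for illustration:

```python
# Max pooling: split the grid into 2x2 patches and keep only the
# largest value in each patch, halving the grid in each dimension.
def max_pool(fmap, size=2):
    out = []
    for i in range(0, len(fmap) - size + 1, size):
        row = []
        for j in range(0, len(fmap[0]) - size + 1, size):
            row.append(max(
                fmap[i + di][j + dj]
                for di in range(size) for dj in range(size)
            ))
        out.append(row)
    return out

fmap = [[1, 3, 2, 0],
        [5, 2, 1, 4],
        [0, 1, 9, 2],
        [3, 4, 6, 1]]

print(max_pool(fmap))  # [[5, 4], [4, 9]]
```

A 4x4 grid becomes 2x2: the strongest activation in each region survives, and the fine detail around it is discarded.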
3. The Hierarchy of Features (Putting it Together)
A deep learning network consists of many of these convolution and pooling layers stacked like pancakes.
As the data moves deeper into the network, something amazing happens: the features become more complex.
Early Layers: The filters detect simple edges and colors.
Middle Layers: The network combines those edges to detect simple shapes—circles, squares, textures like fur or brick.
Deep Layers: The network combines those shapes to detect complex object parts. It might recognize an eye, a tire, a beak, or a door handle.
By the time the data reaches the final layers of the network, the AI has built a rich, high-level representation of what is in the image based on a hierarchy of simpler parts.
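One consequence of stacking conv and pool layers is that the grid shrinks fast, so each deep-layer cell summarizes a large region of the original photo. A rough sketch of the arithmetic (224x224 is a common input size, and a 3x3 unpadded convolution plus 2x2 pooling per layer is an assumption for illustration):

```python
# How the grid shrinks through stacked layers.
def conv_out(n, k=3):   # 3x3 convolution without padding trims the edges
    return n - k + 1

def pool_out(n, p=2):   # 2x2 max pooling halves each dimension
    return n // p

size = 224              # assumed input resolution
for layer in range(1, 4):
    size = pool_out(conv_out(size))
    print(f"after layer {layer}: {size}x{size}")
# after layer 1: 111x111
# after layer 2: 54x54
# after layer 3: 26x26
```

By layer three, each cell already stands in for roughly an eighth of the image's width, which is why deep layers can respond to whole object parts rather than individual edges.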
4. The Final Classification
The very last layer of the network is usually a standard "fully connected" neural network. It takes the high-level features discovered by the final CNN layers (e.g., "has fur," "has snout," "four legs") and acts as a classifier.
It outputs a list of probabilities that sum up to 100%. For our dog photo, the output might look like this:
Golden Retriever: 92%
Tennis Ball: 5%
Cat: 2%
Toaster: 1%
The AI picks the highest probability and presents its answer: "Golden Retriever."
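The standard way the final layer produces those percentages is the softmax function, which converts raw scores into probabilities that sum to 100%. The scores below are hypothetical values chosen to reproduce the example output:

```python
import math

# Softmax: exponentiate each raw score ("logit"), then normalize so
# the results sum to 1 and can be read as probabilities.
def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["Golden Retriever", "Tennis Ball", "Cat", "Toaster"]
scores = [5.0, 2.1, 1.2, 0.5]   # hypothetical logits

probs = softmax(scores)
for label, p in zip(labels, probs):
    print(f"{label}: {p:.0%}")

best = labels[probs.index(max(probs))]
print("Answer:", best)  # Golden Retriever
```

Because softmax exaggerates differences between scores, a modestly higher raw score for "Golden Retriever" turns into a dominant 92% probability.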
Beyond Simple Labels: Types of Vision Tasks
While identifying a single object in an image is impressive, modern AI goes much further.
1. Image Classification
This is what we just described. The AI looks at the image and asks: "What is the main subject of this picture?" It offers a single label.
2. Object Detection
This is harder. The AI asks: "What objects are in this image, and where are they?" The output is not just labels, but bounding boxes (rectangles) drawn around every distinct object it recognizes, identifying multiple items in a single messy scene.
3. Semantic Segmentation
This is the most precise task. Instead of drawing a crude box around an object, the AI tries to classify every single pixel in the image.
It will color every pixel belonging to a "car" blue, every pixel belonging to the "road" gray, and every pixel belonging to a "pedestrian" red. This creates a pixel-perfect map of the scene, which is vital for technologies like self-driving cars that need to know exactly where the drivable road ends and the sidewalk begins.
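A segmentation result is just a grid of class labels the same size as the image. A minimal sketch with a made-up 4x4 prediction and a hypothetical three-class legend:

```python
from collections import Counter

# Each cell holds a class id for that pixel (all values are made up).
CLASSES = {0: "road", 1: "car", 2: "pedestrian"}

label_map = [
    [1, 1, 0, 0],
    [1, 1, 0, 2],
    [0, 0, 0, 2],
    [0, 0, 0, 0],
]

# Count how much of the scene each class covers.
counts = Counter(c for row in label_map for c in row)
for cls_id, n in sorted(counts.items()):
    print(f"{CLASSES[cls_id]}: {n}/16 pixels")
# road: 10/16 pixels
# car: 4/16 pixels
# pedestrian: 2/16 pixels
```

A self-driving system reads exactly this kind of per-pixel map to decide where the drivable surface ends and a person begins.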
The Future of Sight
We are currently living through a golden age of computer vision. The combination of massive datasets (like ImageNet, containing millions of labeled photos), powerful GPU hardware, and clever architectures like CNNs has allowed machines to surpass human performance in specific visual tasks.
The field is moving fast. We are already seeing a shift away from pure CNNs toward architectures called "Vision Transformers," adapted from the technology behind language models like ChatGPT. We are also seeing the rise of generative vision AI, like Midjourney and DALL-E, which reverse the process—taking text concepts and turning them back into pixels.
However, the core principle remains: turning the chaos of raw visual data into structured mathematical understanding. By teaching machines to see, we aren't just building better cameras; we are giving AI the ability to navigate and understand our physical reality.

Partner Spotlight: Duet Display
Duet Display turns your iPad, Mac, Windows PC, or Android devices into a fast, high-quality second display so you can spread out your work without buying new hardware. It's designed for creators and professionals who want low-latency extra screen real estate for timelines, canvases, dashboards, or research. Duet Display also supports features such as touch and Apple Pencil input on supported devices, making it useful for sketching, annotating documents, or controlling creative apps directly from a tablet.
Learn more and download at Duet Display.
Trust-First AI, Built Into Your Browser
Agentic workflows are everywhere. Real trust is still rare.
Norton Neo is the world’s first AI-native browser designed from the ground up for safety, speed, and clarity. It brings AI directly into how you browse, search, and work without forcing you to prompt, manage, or babysit it.
Key Features:
Privacy and security are built into its DNA.
Tabs organize themselves intelligently.
A personal memory adapts to how you work over time.
This is zero-prompt productivity. AI that anticipates what you need next, so you can stay focused on doing real work instead of managing tools.
If agentic AI is the trend, Neo is the browser that makes it trustworthy.
Try Norton Neo and experience the future of browsing.
Stay productive, stay curious—see you next week with more AI breakthroughs!

