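Parsing LLM actions into Lybic actions
Lybic can parse LLM output text in specific formats, so that your LLM agent can convert its output directly into Lybic actions during the grounding stage and hand them to Lybic for execution.
Overview
With the POST /api/computer-use/parse/{type} method, you can parse the output of different grounding models (such as ui-tars, seed, GLM-4.1v, GLM-4.5v, and qwen-2.5-vl) into computer or mobile actions.
This document describes the prompt setup and an input example for each model.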
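As a rough illustration, the request described above can be assembled in Python with only the standard library. The helper name `build_parse_request` is hypothetical; the base URL `https://api.lybic.cn` and the `textContent` field are taken from the cURL examples in this document. The sketch only builds the request and leaves sending to the caller, and it does not cover any authentication your deployment may need.

```python
import json
import urllib.request

BASE_URL = "https://api.lybic.cn"  # taken from the cURL examples in this document


def build_parse_request(model_type: str, text_content: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/computer-use/parse/{type}.

    `build_parse_request` is a hypothetical helper, not part of any Lybic SDK.
    """
    url = f"{BASE_URL}/api/computer-use/parse/{model_type}"
    # json.dumps takes care of escaping quotes and newlines in the LLM output.
    body = json.dumps({"textContent": text_content}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


llm_output = "Thought: open the folder.\nAction: left_double(point='<point>213 257</point>')"
request = build_parse_request("ui-tars", llm_output)
print(request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it. The response schema is not described in this document, so the sketch stops short of decoding it.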
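Computer action parsing
Supported models
- ui-tars: relative coordinate system, coordinate range 0-1000
- seed: absolute coordinate system, used for models such as doubao-1.6-seed and openCUA
- glm-4.1v: absolute coordinate system, coordinate range 0-1000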
- glm-4.5v: absolute coordinate system, coordinate range 0-999
- qwen-2.5-vl: absolute coordinate system, raw pixel coordinates
- pyautogui: actions in Python pyautogui format
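To make the 0-1000 coordinate range concrete, here is a small sketch that maps a point on a 0-1000 grid onto a screen of a given size. It only illustrates the arithmetic: `normalized_to_pixels` is a hypothetical helper, the 1280x720 resolution is borrowed from the Qwen prompt later in this document, and the exact rounding Lybic applies internally is not specified here.

```python
def normalized_to_pixels(
    x: int, y: int, width: int, height: int, scale: int = 1000
) -> tuple[int, int]:
    """Map a point on a 0..scale grid onto a width x height pixel screen."""
    if not (0 <= x <= scale and 0 <= y <= scale):
        raise ValueError(f"point ({x}, {y}) is outside the 0-{scale} range")
    # Scale each axis independently and round to the nearest pixel.
    return round(x * width / scale), round(y * height / scale)


print(normalized_to_pixels(213, 257, 1280, 720))  # → (273, 185)
```

The parse endpoint is selected by model type, so a conversion like this is presumably done for you; the sketch is useful mainly for sanity-checking model output against a screenshot.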
1. ui-tars model
Prompt template
You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
## Output Format
Thought: ...
Action: ...
## Action Space
click(point='<point>x1 y1</point>')
left_double(point='<point>x1 y1</point>')
right_single(point='<point>x1 y1</point>')
drag(start_point='<point>x1 y1</point>', end_point='<point>x2 y2</point>')
hotkey(key='ctrl c') # Split keys with a space and use lowercase. Also, do not use more than 3 keys in one hotkey action.
type(content='xxx') # Use escape characters \', \", and \n in content part to ensure we can parse the content in normal python string format. If you want to submit your input, use \n at the end of content.
scroll(point='<point>x1 y1</point>', direction='down or up or right or left') # Show more information on the `direction` side.
wait() #Sleep for 5s and take a screenshot to check for any changes.
finished(content='xxx') # Use escape characters \', \", and \n in content part to ensure we can parse the content in normal python string format.
## Note
- Use {language} in `Thought` part.
- Write a small plan and finally summarize your next action (with its target element) in one sentence in `Thought` part.
## User Instruction
{instruction}
Input example
LLM output:
Thought: The task requires double-left-clicking the "images" folder. In the File Explorer window, the "images" folder is visible under the Desktop directory. The target element is the folder named "images" with a yellow folder icon. Double-left-clicking this folder will open it.
Next action: Left - double - click on the "images" folder icon located in the File Explorer window, under the Desktop directory, with the name "images" and yellow folder icon.
Action: left_double(point='<point>213 257</point>')
cURL call:
curl -X POST "https://api.lybic.cn/api/computer-use/parse/ui-tars" \
-H "Content-Type: application/json" \
-d '{
"textContent": "Thought: The task requires double-left-clicking the \"images\" folder. In the File Explorer window, the \"images\" folder is visible under the Desktop directory. The target element is the folder named \"images\" with a yellow folder icon. Double-left-clicking this folder will open it.\n\nNext action: Left - double - click on the \"images\" folder icon located in the File Explorer window, under the Desktop directory, with the name \"images\" and yellow folder icon.\nAction: left_double(point='\''<point>213 257</point>'\'')"
}'
2. seed model (Doubao and other absolute-coordinate models)
Prompt template
Same as ui-tars, but coordinates are expressed as absolute coordinates:
You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
## Output Format
Thought: ...
Action: ...
## Action Space
click(point='<point>x1 y1</point>')
left_double(point='<point>x1 y1</point>')
right_single(point='<point>x1 y1</point>')
drag(start_point='<point>x1 y1</point>', end_point='<point>x2 y2</point>')
hotkey(key='ctrl c')
type(content='xxx')
scroll(point='<point>x1 y1</point>', direction='down or up or right or left')
wait()
finished(content='xxx')
## Note
- Use {language} in `Thought` part.
- Coordinates should be in 0-1000 range (absolute coordinates).
## User Instruction
{instruction}
Input example
LLM output:
Thought: I need to double-click on the images folder at position [213, 257].
Action: left_double(point='<point>213 257</point>')
cURL call:
curl -X POST "https://api.lybic.cn/api/computer-use/parse/seed" \
-H "Content-Type: application/json" \
-d '{
"textContent": "Thought: I need to double-click on the images folder at position [213, 257].\nAction: left_double(point='\''<point>213 257</point>'\'')"
}'
3. GLM-4.1v model
Prompt template
You are a GUI operation agent. You will be given a task and your action history, with recent screenshots. You should help me control the computer, output the best action step by step to accomplish the task.
The actions you output must be in the following action space:
left_click(start_box='[x,y]', element_info='')
# left single click at [x,y]
right_click(start_box='[x,y]', element_info='')
# right single click at [x,y]
middle_click(start_box='[x,y]', element_info='')
# middle single click at [x,y]
hover(start_box='[x,y]', element_info='')
# hover the mouse at [x,y]
left_double_click(start_box='[x,y]', element_info='')
# left double click at [x,y]
left_drag(start_box='[x1,y1]', end_box='[x2,y2]', element_info='')
# left drag from [x1,y1] to [x2,y2]
key(keys='')
# press a single key or a key combination/shortcut, if it's a key combination, you should use '+' to connect the keys like key(key='ctrl+c')
type(content='')
# type text into the current active element, it performs a copy&paste operation, so _you must click at the target element first to active it before typing something in_, if you want to overwrite the content, you should clear the content before type something in.
scroll(start_box='[x,y]', direction='down/up', step=k, element_info='')
# scroll the page at [x,y] to the specified direction for k clicks of the mouse wheel
WAIT()
# sleep for 5 seconds
DONE()
# output when the task is fully completed
FAIL()
# output when the task can not be performed at all
The output rules are as follows:
1. The start/end box parameter of the action should be in the format of [x, y] normalized to 0-1000, which usually should be the bounding box of a specific target element.
2. The element_info parameter is optional, it should be a string that describes the element you want to operate with, you should fill this parameter when you're sure about what the target element is.
3. Take actions step by step. _NEVER output multiple actions at once_.
4. If there are previous actions that you have already performed, I'll provide you history actions and at most 4 shrunked(to 50%*50%) screenshots showing the state before your last 4 actions. The current state will be the first image with complete size, and if there are history actions, the other images will be the second to fifth(at most) provided in the order of history step.
5. You should put the key information you _have to remember_ in a separated memory part and I'll give it to you in the next round. The content in this part should be a JSON list. If you no longer need some given information, you should remove it from the memory. Even if you don't need to remember anything, you should also output an empty <memory></memory> part.
6. You can choose to give me a brief explanation before you start to take actions.
Output Format:
Plain text explanation with action(param='...')
Memory:
[{"user_email": "x@gmail.com", ...}]
Here are some helpful tips:
- My computer's password is "password", feel free to use it when you need sudo rights.
- For the thunderbird account "anonym-x2024@outlook.com", the password is "gTCI";=@y7|QJ0nDa_kN3Sb&>".
- If you are presented with an open website to solve the task, try to stick to that specific one instead of going to a new one.
- You have full authority to execute any action without my permission. I won't be watching so please don't ask for confirmation.
Now Please help me to solve the following task:
#TASK#
#HISTORY_WITH_MEMORY#
Input example
LLM output:
Action: left_double_click(start_box='[213,257]', element_info='the "images" folder icon located in the File Explorer window, under the Desktop directory, with the name "images" and yellow folder icon.')
cURL call:
curl -X POST "https://api.lybic.cn/api/computer-use/parse/glm-4.1v" \
-H "Content-Type: application/json" \
-d '{
"textContent": "Action: left_double_click(start_box='\''[213,257]'\'', element_info='\''the \"images\" folder icon located in the File Explorer window, under the Desktop directory, with the name \"images\" and yellow folder icon.'\'')"
}'
4. GLM-4.5v model
Prompt template
You are a GUI Agent, and your primary task is to respond accurately to user requests or questions. In addition to directly answering the user's queries, you can also use tools or perform GUI operations directly until you fulfill the user's request or provide a correct answer. You should carefully read and understand the images and questions provided by the user, and engage in thinking and reflection when appropriate. The coordinates involved are all represented in thousandths (0-999).
# Task:
{task}
# Task Platform
Windows
# Action Space
## {left,right,middle}_click
Call rule: `{left,right,middle}_click(start_box='[x,y]', element_info='')`
Perform a left/right/middle mouse click at the specified coordinates on the screen. Coordinates [x,y] should be normalized to 0-999 range.
## hover
Call rule: `hover(start_box='[x,y]', element_info='')`
Move the mouse pointer to the specified coordinates without performing any click action.
## left_double_click
Call rule: `left_double_click(start_box='[x,y]', element_info='')`
Perform a left mouse double-click at the specified coordinates on the screen.
## left_drag
Call rule: `left_drag(start_box='[x1,y1]', end_box='[x2,y2]', element_info='')`
Drag the mouse from starting coordinates to ending coordinates while holding the left mouse button.
## key
Call rule: `key(keys='')`
Simulate pressing a single key or combination of keys on the keyboard. Use '+' to separate keys in combinations (e.g., 'ctrl+c', 'alt+tab').
## type
Call rule: `type(content='')`
Type text content into the currently focused text input field.
## scroll
Call rule: `scroll(start_box='[x,y]', direction='', step=5, element_info='')`
Scroll an element at the specified coordinates in the specified direction by a given number of wheel steps. Direction can be 'down' or 'up'.
## WAIT
Call rule: `WAIT()`
Wait for 5 seconds before proceeding to the next action.
## DONE
Call rule: `DONE()`
Indicate that the current task has been completed successfully.
## FAIL
Call rule: `FAIL()`
Indicate that the current task cannot be completed.
# Output Format
Plain text explanation with action(param='...')
Memory:
[{"key": "value"}, ...]
# Additional Notes
- I'll give you the most recent 4 history screenshots(shrunked to 50%*50%) along with the historical action steps.
- You should put the key information you _have to remember_ in a separated memory part.
- My computer's password is "password", feel free to use it when you need sudo rights.
Current Screenshot:
Input example
LLM output:
Action: left_double_click(start_box='[213,257]', element_info='the "images" folder icon located in the File Explorer window, under the Desktop directory, with the name "images" and yellow folder icon.')
cURL call:
curl -X POST "https://api.lybic.cn/api/computer-use/parse/glm-4.5v" \
-H "Content-Type: application/json" \
-d '{
"textContent": "Action: left_double_click(start_box='\''[213,257]'\'', element_info='\''the \"images\" folder icon located in the File Explorer window, under the Desktop directory, with the name \"images\" and yellow folder icon.'\'')"
}'
5. Qwen-2.5-vl model
Prompt template
# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name_for_human": "computer_use", "name": "computer_use", "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\n* The screen's resolution is 1280x720.\n* Whenever you intend to move the cursor to click on an element like an icon, you should consult a screenshot to determine the coordinates of the element before moving the cursor.\n* If you tried clicking on a program or link but it failed to load, even after waiting, try adjusting your cursor position so that the tip of the cursor visually falls on the element that you want to click.\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. 
The available actions are:\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\n* `type`: Type a string of text on the keyboard.\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\n* `left_click`: Click the left mouse button.\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\n* `right_click`: Click the right mouse button.\n* `middle_click`: Click the middle mouse button.\n* `double_click`: Double-click the left mouse button.\n* `scroll`: Performs a scroll of the mouse scroll wheel.\n* `wait`: Wait specified seconds for the change to happen.\n* `terminate`: Terminate the current task and report its completion status.", "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "scroll", "wait", "terminate"], "type": "string"}, "keys": {"description": "Required only by `action=key`.", "type": "array"}, "text": {"description": "Required only by `action=type`.", "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.", "type": "array"}, "pixels": {"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll`.", "type": "number"}, "time": {"description": "The seconds to wait. Required only by `action=wait`.", "type": "number"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}, "args_format": "Format the arguments as a JSON object."}}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
Input example
LLM output:
<tool_call>
{"name": "computer_use", "arguments": {"action": "double_click", "coordinate": [213, 257]}}
</tool_call>
cURL call:
curl -X POST "https://api.lybic.cn/api/computer-use/parse/qwen-2.5-vl" \
-H "Content-Type: application/json" \
-d '{
"textContent": "<tool_call>{\"name\": \"computer_use\", \"arguments\": {\"action\": \"double_click\", \"coordinate\": [213, 257]}}</tool_call>"
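}'
6. Pyautogui format
Description
If you use another grounding model, a model not listed above, or a model whose output already follows the rules of the built-in pyautogui action engine, you can set model_type (the {type} segment of the endpoint path) to pyautogui.
Pyautogui action parsing supports standard pyautogui code blocks: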
pyautogui.click(x=200, y=200)
pyautogui.moveTo(100, 150)
pyautogui.write("Hello from Lybic!")
pyautogui.press("enter")
pyautogui.doubleClick(x=213, y=257)
pyautogui.drag(100, 100, duration=0.5)
pyautogui.scroll(-5)
Input example
LLM output:
Text
```python
pyautogui.doubleClick(213, 257)
```
cURL call:
curl -X POST "https://api.lybic.cn/api/computer-use/parse/pyautogui" \
-H "Content-Type: application/json" \
-d '{
"textContent": "Text\n```python\npyautogui.doubleClick(213, 257)\n```"
}'
Mobile action parsing
Supported models
Mobile action parsing supports the same models as computer action parsing (ui-tars, seed, glm-4.1v, glm-4.5v, qwen-2.5-vl, and so on), but we recommend ui-tars and its prompt.
ui-tars
The UI-TARS mobile prompt template is similar to the computer one; the main difference is the action space:
You are a Android GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
you should follow the following format to output your thought and action:
## Output Format:
Summary: ...
Action: ...
Here is the action space:
## Action Space
click(start_box='<bbox>x1 y1 x2 y2</bbox>') # x1, y1, x2, y2 equals left, top, right, bottom
drag(start_box='<bbox>x1 y1 x2 y2</bbox>', end_box='<bbox>x3 y3 x4 y4</bbox>') # x1, y1, x2, y2 equals left, top, right, bottom, x3, y3, x4, y4 equals left, top, right, bottom. If you want to slide down, y2 < y1, because the phone is operated in the opposite direction, slide up, and vice versa
type(content='') # If you want to type, must check keyboard is open, or you must click the input field to open keyboard.
press_home() # Go to the home page
press_back() # Go back to the previous page
list_apps() # List all installed apps, return a list of app_name and package_name.
wait(t='t') # Sleep for t seconds number, wait for change, t is lower than 10, higher than 0.
finished(content='') # If the task is completed, call this action. You must summary the task result in content.
call_user(content='') # When the task is unsolvable or you need the user's help like login, input verification code or need more information, call the user. You must exactly describe the user request in content.
## Note
- Use {language} in 'Summary' part.
- Google Chrome is not available due to technical limits. If you need a browser, use "Via".
- Allow 应用商店 to install applications. 允许来自此来源的应用, no need to call_user.
- Write a small plan and finally summarize your next action (with its target element) in one sentence in 'Summary' part.
- If the task need user to log in ,input verification code, or need more information, use call_user Action to guide them to do so.
- Don't enter the verification code for users on your own. Use "call_user" to notify users instead.
- If user asks you to install or download a certain app, you should understand it as both downloading and installing.
- If the user wants to use some app, please use list_apps to check whether the current app exists. If it does not exist, use the 应用商店 to download it first.
- If you click the screen but nothing happens, you can try swiping the screen to see if the screen changes. Some pages have list views, and you need to scroll the list view to see if the screen has changed.
- If you want to drag(swipe) the screen, but the screen hasn't changed after drag (swipe) operations, you may need to try swiping at different positions or directions.
- If the screen hasn't changed after drag (swipe) operations:
- Check if you've already swiped in the opposite direction
- If you have swiped in both directions and the screen still hasn't changed, it means you've reached both ends
- In this case, you need to:
- Either try a different approach (e.g., using buttons or menus)
- Or call_user to ask for user guidance
- If the screen has changed, continue with the next action
- If the same operation is performed on the same GUI interface more than three times, check whether the system is stuck in an infinite loop. If so, stop the current execution and attempt to generate a new plan to achieve the goal. This may involve adjusting the operation sequence, switching navigation paths, skipping the current step, or triggering an exception handling mechanism. And triger call_user tool if necessary.
- Maintain a history of recent actions and screen states. Use this context to detect loops and improve future decision-making. If a sequence of operations and corresponding GUI states repeats over a period (e.g., the same 3-step interface and action pattern reoccurs), check whether the system is stuck in an infinite loop due to misaligned interaction logic (e.g., "long press" interpreted as "click").
When such a loop is detected:
- Stop the current execution.
- Analyze the repeating operation and screen sequence to identify potential mismatches between expected and actual behavior.
- Generate a new plan to achieve the original goal, which may include:
- Changing the interaction method (e.g., try swipe/delete from settings instead of long press).
- Navigating through alternative UI paths.
- Triggering system-level menus or shortcuts.
- Skipping the current step and retrying later.
- trigger call_user tool if necessary.
- Don't output your <bbox> xml in the Summary part, only in action part. Summary part is for user to understand your action.
- For sensitive operations like logout, deletion, or payment, always ask for user confirmation first.
You can use our playground and mini-agent (which has an architecture similar to the playground) to integrate this quickly into your own product.
Usage
API call:
curl -X POST "https://api.lybic.cn/api/mobile-use/parse/pyautogui" \
-H "Content-Type: application/json" \
-d '{
"textContent": "Text\n```python\npyautogui.click(213, 257)\n```"
}'
Supported mobile action space
from typing import Union
MobileUseAction = Union[
ScreenshotAction, # Take a screenshot
WaitAction, # Wait
FinishedAction, # Task finished
FailedAction, # Task failed
ClientUserTakeoverAction, # Hand over to the user
KeyboardTypeAction, # Keyboard input
KeyboardHotkeyAction, # Hotkey
TouchTapAction, # Tap
TouchDragAction, # Drag
TouchSwipeAction, # Swipe
TouchLongPressAction, # Long press
AndroidBackAction, # Android Back key
AndroidHomeAction, # Android Home key
OsStartAppAction, # Start an app
OsStartAppByNameAction, # Start an app by name
OsCloseAppAction, # Close an app
OsCloseAppByNameAction, # Close an app by name
OsListAppsAction, # List apps
]
Usage tips
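- Choose the right model: pick the model_type that matches the grounding model you use.
- Error handling: parsing can fail, so add error-handling logic around the call.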
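The error-handling tip above might look like the following sketch. Everything here is an assumption made for illustration: `parse_llm_output` and `ParseError` are hypothetical names, the transport is injected as a plain callable so the logic can be exercised without a network, and the shape of a successful response is not documented here, so the function simply returns the decoded JSON.

```python
import json
from typing import Any, Callable


class ParseError(Exception):
    """Raised when the parse call fails or returns unusable data (hypothetical)."""


def parse_llm_output(
    send: Callable[[str, bytes], bytes],
    model_type: str,
    text_content: str,
) -> Any:
    """Call the parse endpoint through `send(url, body)` and decode the reply.

    `send` is any function that POSTs `body` to `url` and returns the raw
    response bytes. It may raise OSError (urllib.error.URLError is a subclass).
    """
    url = f"https://api.lybic.cn/api/computer-use/parse/{model_type}"
    body = json.dumps({"textContent": text_content}).encode("utf-8")
    try:
        raw = send(url, body)
    except OSError as exc:  # network failures and HTTP errors raised by urllib
        raise ParseError(f"request to {url} failed: {exc}") from exc
    try:
        return json.loads(raw)
    except ValueError as exc:  # json.JSONDecodeError and decoding errors
        raise ParseError("response was not valid JSON") from exc


def failing_send(url: str, body: bytes) -> bytes:
    """Stand-in transport that simulates a network failure."""
    raise ConnectionError("simulated network failure")


try:
    parse_llm_output(failing_send, "ui-tars", "Action: wait()")
except ParseError as err:
    print(f"parse failed: {err}")
```

In real use, `send` could wrap `urllib.request.urlopen`. On `ParseError`, an agent might retry, re-prompt the model, or hand control back to the user.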